# DS-GA-1007 Programming for Data Science

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Any *textual answers* that need to be provided will be marked with "YOUR ANSWER HERE". Replace this text with your answer to the question.

Any *code answers* that need to be provided will be marked with:

```
# YOUR CODE HERE
raise NotImplementedError()
```

Replace all this code with your answer to the question. If you do not answer the question, the `NotImplementedError` exception will be raised, which will indicate to the grader that no answer has been supplied.

In many cases, code answers will also have some associated test code. You should execute the tests after you have entered your code in order to ensure that your answer is correct. You should not proceed to the next question until your answer is correct.

Finally, insert your Net ID and the Net IDs of any collaborators in the cell below.

In [1]:
NET_ID = "lh1036"
COLLABORATORS = ""

---

## Assignment 3: Pandas I
### Munging GoT data using Pandas
This assignment will give you some experience munging data related to Game of Thrones. The dataset used in this example, along with other datasets related to the *A Song of Ice and Fire* series by George R. R. Martin, can be found [here](https://www.kaggle.com/mylesoneill/game-of-thrones). Download the `character-predictions.csv` file and place it in the same directory as the notebook.

Define a function called `read_characters` that will read the character data from a file into a DataFrame. The first argument of the function should be a string containing the name of the file.

Make sure that the first column is specified to be the index of the DataFrame.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.util import testing

%matplotlib inline

def read_characters(filename):
    '''
    Read a CSV file whose name is supplied as a string argument. Returns None if a non-string is passed.
    Sources consulted: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
    '''
    if not isinstance(filename, str):
        return None
    # Use the first column (S.No) as the DataFrame index
    return pd.read_csv(filename, index_col=0)
    
characters = read_characters('character-predictions.csv')
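
The effect of `index_col=0` can be checked without the real file. The sketch below is a minimal illustration that reads a hypothetical three-row CSV from an in-memory string; the `csv_text` contents and the `toy` name are made up for this example and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for character-predictions.csv
csv_text = "S.No,name,age\n1,Alpha,30\n2,Beta,\n3,Gamma,47\n"

# index_col=0 turns the first column (S.No) into the row index
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.index.name)  # S.No
print(toy.shape)       # (3, 2): the index column is not counted as data
```

Because the first column becomes the index, it no longer appears among the data columns, which is why the shape reports two columns rather than three.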

What is the size of the DataFrame?

In [2]:
characters.shape

(1946, 32)

The first element indicates the number of rows and the second the number of columns.

In [35]:
characters.index.shape

(1946,)

In [4]:
characters.columns.shape

(32,)
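
As a small illustration of how these three attributes relate, the sketch below uses a hypothetical four-row frame (not the assignment data). It assumes nothing beyond standard pandas behaviour: `DataFrame.shape` is a `(rows, columns)` tuple, while the shape of an Index is a one-element tuple.

```python
import pandas as pd

# Hypothetical frame with 4 rows and 2 columns
toy = pd.DataFrame({"actual": [0, 1, 1, 0], "pred": [0, 0, 1, 1]})

n_rows, n_cols = toy.shape         # (rows, columns)
index_shape = toy.index.shape      # one-element tuple: (rows,)
column_shape = toy.columns.shape   # one-element tuple: (columns,)

print(n_rows, n_cols)   # 4 2
print(index_shape)      # (4,)
print(column_shape)     # (2,)
```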

In [5]:
testing.assert_equal(characters.shape, (1946, 32))
testing.assert_equal(characters.index.name, 'S.No')

## What kind of information do we have for each character?
Check out the column names and data types. Notice that columns can have different data types.

For a detailed data dictionary, please check the **Feature selection** section of the dataset website [here](https://got.show/machine-learning-algorithm-predicts-death-game-of-thrones).

In [6]:
characters.ftypes

actual                 int64:dense
pred                   int64:dense
alive                float64:dense
plod                 float64:dense
name                  object:dense
title                 object:dense
male                   int64:dense
culture               object:dense
dateOfBirth          float64:dense
DateoFdeath          float64:dense
mother                object:dense
father                object:dense
heir                  object:dense
house                 object:dense
spouse                object:dense
book1                  int64:dense
book2                  int64:dense
book3                  int64:dense
book4                  int64:dense
book5                  int64:dense
isAliveMother        float64:dense
isAliveFather        float64:dense
isAliveHeir          float64:dense
isAliveSpouse        float64:dense
isMarried              int64:dense
isNoble                int64:dense
age                  float64:dense
numDeadRelations       int64:dense
boolDeadRelations   

You can also get additional information about the DataFrame, such as the number of non-null values in each column, with `info()`:

In [8]:
characters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1946 entries, 1 to 1946
Data columns (total 32 columns):
actual               1946 non-null int64
pred                 1946 non-null int64
alive                1946 non-null float64
plod                 1946 non-null float64
name                 1946 non-null object
title                938 non-null object
male                 1946 non-null int64
culture              677 non-null object
dateOfBirth          433 non-null float64
DateoFdeath          444 non-null float64
mother               21 non-null object
father               26 non-null object
heir                 23 non-null object
house                1519 non-null object
spouse               276 non-null object
book1                1946 non-null int64
book2                1946 non-null int64
book3                1946 non-null int64
book4                1946 non-null int64
book5                1946 non-null int64
isAliveMother        21 non-null float64
isAliveFather        26 non
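
The non-null counts that `info()` prints can also be obtained as ordinary pandas objects. The following is a minimal sketch on a hypothetical three-row frame (the values are invented for illustration), using `count()` and `isna()`.

```python
import pandas as pd

# Hypothetical frame with some missing values
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "title": ["Ser", None, "Lord"],
    "age": [30.0, None, 47.0],
})

non_null = toy.count()      # non-null values per column, as reported by info()
missing = toy.isna().sum()  # missing values per column

print(non_null["title"])  # 2
print(missing["age"])     # 1
```

Having the counts as a Series makes it easy to filter or sort columns by how complete they are.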

Define a function called `generate_subset` that takes two arguments, `characters` and `n`, where `characters` is a DataFrame and `n` is the number of rows to extract. The function should generate a new DataFrame containing only the columns `actual` and `pred` for the first `n` rows, then return this DataFrame transposed.

In [39]:
def generate_subset(characters, n=3):
    '''
    Return the transposed DataFrame of "actual" and "pred" for the first n rows.
    Sources consulted: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html
    '''
    # Take the first n rows by position, keep the two columns, then transpose
    return characters.iloc[:n][["actual", "pred"]].T

In [37]:
# This assertion should hold
data = {
    1 : {'actual': 0, 'pred': 0},
    2 : {'actual': 1, 'pred': 0},
    3 : {'actual': 1, 'pred': 0}
}

expected = pd.DataFrame(data)
expected.columns.name = 'S.No'
subset = generate_subset(characters, 3)

testing.assert_frame_equal(expected, subset)
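
To see what the transpose does to labels, here is a minimal sketch on a hypothetical four-row frame (invented values, with an index named `S.No` to mirror the dataset). After transposing, the former column names become the row index and the former row labels become the columns.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the assignment
toy = pd.DataFrame(
    {"actual": [0, 1, 1, 0], "pred": [0, 0, 1, 1], "age": [30.0, 41.0, 12.0, 47.0]},
    index=pd.Index([1, 2, 3, 4], name="S.No"),
)

# First two rows by position, two columns by name, then swap rows and columns
subset = toy.iloc[:2][["actual", "pred"]].T

print(list(subset.index))    # ['actual', 'pred']
print(list(subset.columns))  # [1, 2]
print(subset.columns.name)   # S.No
```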

Define a function called `get_ages` that takes two arguments, `characters` and `n`, where `characters` is a DataFrame and `n` is the number of rows to extract. The function should generate a new Series containing only the column `age` for the **last** `n` rows, then return this Series.

In [17]:
def get_ages(characters, n=3):
    # Select the column first, then take the last n rows by position
    return characters["age"].iloc[-n:]

In [18]:
# This assertion should hold
series = { 1944: None, 1945: None, 1946: 47.0}

expected = pd.Series(series)
expected.name = 'age'
expected.index.name = 'S.No'
column = get_ages(characters, 3)

testing.assert_series_equal(expected, column)
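
Selecting the last rows by position can be sketched as follows on a hypothetical Series (invented values). The example also shows that missing ages are stored as `NaN`, which is why the expected Series above can be built with `None`.

```python
import pandas as pd

# Hypothetical ages, one of them missing
ages = pd.Series(
    [30.0, float("nan"), 47.0],
    index=pd.Index([1, 2, 3], name="S.No"),
    name="age",
)

last_two = ages.iloc[-2:]   # positional slice from the end
same_as_tail = ages.tail(2) # equivalent convenience method for n > 0

print(list(last_two.index))        # [2, 3]
print(int(last_two.isna().sum()))  # 1
```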

Define a function called `negative_ages` that takes one argument, `characters`, that is a DataFrame. The function should return a Series containing only the column `age` for all rows that have characters with negative ages.

In [20]:
def negative_ages(characters):
    ser = characters["age"]
    return ser[ser < 0]

In [21]:
# This assertion should hold
series = { 1685: -277980.0, 1869: -298001.0}

expected = pd.Series(series)
expected.name = 'age'
expected.index.name = 'S.No'
negative = negative_ages(characters)

testing.assert_series_equal(expected, negative)
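
The filter relies on boolean masking: comparing a Series with a scalar yields a boolean Series, and indexing with it keeps only the rows where the mask is `True`. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical ages, including a missing value and two negatives
ages = pd.Series(
    [30.0, -5.0, float("nan"), -1.0],
    index=pd.Index([1, 2, 3, 4], name="S.No"),
    name="age",
)

mask = ages < 0        # NaN compares as False, so missing ages are excluded
negative = ages[mask]  # keep only the rows where the mask is True

print(list(mask))            # [False, True, False, True]
print(list(negative.index))  # [2, 4]
```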

Using your knowledge of how to manipulate DataFrames and Series, replace all the negative ages in the original dataset with the value zero.

In [26]:
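def replace_neg(characters):
    # Assign through .loc on the DataFrame itself so the update does not depend
    # on whether characters["age"] is a view or a copy
    characters.loc[characters["age"] < 0, "age"] = 0
    return characters

characters = replace_neg(characters)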

In [30]:
# This assertion should hold
testing.assert_equal(len(negative_ages(characters)), 0)

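Notice that the Series *characters_ages* may be a reference to the DataFrame column rather than a copy; in that case, changing the DataFrame column also changes the Series. Whether this happens depends on the pandas version and settings (with copy-on-write enabled, the Series is independent of the DataFrame), so do not rely on it.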
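
When an independent Series is wanted regardless of version, an explicit `.copy()` makes the intent unambiguous. The sketch below uses a hypothetical frame (invented values) and updates the DataFrame through `.loc`; the copied Series keeps the old values.

```python
import pandas as pd

# Hypothetical frame with two negative ages
toy = pd.DataFrame(
    {"age": [30.0, -5.0, float("nan"), -1.0]},
    index=pd.Index([1, 2, 3, 4], name="S.No"),
)

snapshot = toy["age"].copy()        # explicit copy, independent of toy
toy.loc[toy["age"] < 0, "age"] = 0  # update made on the DataFrame itself

remaining_negative = int((toy["age"] < 0).sum())
snapshot_negative = int((snapshot < 0).sum())

print(remaining_negative)  # 0
print(snapshot_negative)   # 2
```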

## Focus on a specific type of data
Define a function called `categorical_features` that takes one argument `characters` that is a DataFrame. The function should return an **Index** containing the names of the categorical features in the data.

In [31]:
def categorical_features(characters):
    '''
    Sources consulted: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html
    http://pandas.pydata.org/pandas-docs/stable/indexing.html
    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html?highlight=index
    '''
    
    # Columns stored with the object dtype (strings) are treated as categorical;
    # the column labels of the filtered frame are already an Index
    return characters.select_dtypes(include=["object"]).columns

In [32]:
categorical = pd.Index([u'name', u'title', u'culture', u'mother', u'father', u'heir', u'house', u'spouse'])
testing.assert_index_equal(categorical, categorical_features(characters))
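
Selecting columns by dtype can be sketched on a hypothetical mixed frame as follows. The string columns are created with an explicit `dtype=object` here, which is an assumption made so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical frame mixing string and numeric columns
toy = pd.DataFrame({
    "name": pd.Series(["Alpha", "Beta"], dtype=object),
    "male": [1, 0],
    "age": [30.0, 47.0],
    "house": pd.Series(["House A", None], dtype=object),
})

object_columns = toy.select_dtypes(include=["object"]).columns
numeric_columns = toy.select_dtypes(include=["number"]).columns

print(list(object_columns))   # ['name', 'house']
print(list(numeric_columns))  # ['male', 'age']
```

The original column order is preserved in the result, which is why the expected Index in the test above lists the features in the same order as the file.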

Define a function called `categorical_counts` that takes one argument `characters` that is a DataFrame. Using the function you defined previously, this function should return a Series containing the number of unique values in each categorical feature, sorted in descending order by values (features with more distinct elements on top).

In [33]:
def categorical_counts(characters):
    '''
    Return a Series with the number of unique values in each categorical column of the given DataFrame.
    Calls categorical_features() function.
    Sources consulted:
    http://pandas.pydata.org/pandas-docs/stable/basics.html
    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
    '''
    categoricals = characters[categorical_features(characters)]

    # nunique() counts distinct non-null values, the same result as len(value_counts())
    unique_ser = categoricals.apply(lambda column: column.nunique())

    # return Series sorted and (per test code) converted to float64
    return unique_ser.sort_values(ascending=False).astype(float)

In [34]:
# This assertion should hold
series = {'name': 1946.0, 'house': 347.0, 'title': 262.0, 'spouse': 254.0,
          'culture': 64.0, 'heir': 22.0, 'father': 20.0, 'mother': 17.0}

expected = pd.Series(series).sort_values(ascending=False)
testing.assert_series_equal(expected, categorical_counts(characters))
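
Counting distinct values per column can be sketched on a hypothetical two-column frame (invented values). The example uses `DataFrame.nunique`, which ignores missing values by default; passing `dropna=False` counts them as one extra value.

```python
import pandas as pd

# Hypothetical categorical data with one missing house
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "house": ["House A", "House B", "House A", None],
})

# Distinct non-null values per column, largest first
counts = toy.nunique().sort_values(ascending=False)

# Treat the missing value as its own category
counts_with_missing = toy.nunique(dropna=False)

print(list(counts.index))            # ['name', 'house']
print(int(counts["house"]))          # 2
print(int(counts_with_missing["house"]))  # 3
```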

We will conclude this exercise next week by analyzing the data to determine some interesting facts.