# Homework 5

### Use this notebook for the question statements and to check your solutions.

For questions 1-3, you will use the APD dataset:
```
df = pd.read_csv('COBRA-YTD2017.csv.gz')
``` 

For questions 4-5, you will use the cereal dataset:

```
cer = pd.read_csv('cereal.csv')
```


## Load *COBRA-YTD2017.csv.gz* into a pandas DataFrame to solve Questions 1, 2 & 3.


## Question 1

Write a function called "variable_helper" which takes one argument:

* df, which is a pandas data frame

and returns:

* d, a _dictionary_ where keys are the column names of df and values are one of "numeric", "categorical", "ordinal", "date/time", or "text", corresponding to the feature type of each column. 

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('COBRA-YTD2017.csv.gz')
cer = pd.read_csv('cereal.csv')


In [105]:
def variable_helper(input_df):
    
    out = dict()
    # for num_cols, check if numerical or ordinal
    # for others, first check if date/time then check if categorical
    
    num_cols = set(input_df.describe().columns)
    other_cols = set(input_df.columns) - num_cols    
    
    dtypes = input_df.dtypes
    n_row = input_df.shape[0]
    
    # most elements in an ordinal column should be integer and incremental
    # however, integer columns could also be categorical
    if num_cols:
        for num_col in num_cols:
            if 'int' in str(dtypes[num_col]):
                increments = set(np.diff(input_df[num_col]))
                n_ele = len(input_df[num_col])
                if len(increments) / n_ele < 0.05 and len(input_df[num_col].unique()) / n_ele >= 0.95:
                    out[num_col] = 'ordinal'
                else:
                    if len(input_df[num_col].unique()) / len(input_df[num_col]) <= 0.2:
                        out[num_col] = 'categorical'
                    else:
                        out[num_col] = 'numeric'
            else:
                out[num_col] = 'numeric'
    
    if other_cols:
        for other_col in other_cols:
            # try parsing a small random sample of rows as datetimes;
            # checking every value would be more robust against malformed
            # entries, but it is slower
            # .sample() picks rows by position, so this also works when the
            # index is not 0..n_row-1 (e.g. for a filtered frame)
            try:
                tmp = pd.to_datetime(input_df[other_col].sample(min(10, n_row)))
            except Exception:
                if len(input_df[other_col].unique()) / len(input_df[other_col]) <= 0.2:
                    out[other_col] = 'categorical'
                else:
                    out[other_col] = 'text'
            else:
                out[other_col] = 'date/time'
    
    return out

In [106]:
variable_helper(df[['offense_id','beat','x','y']])

{'beat': 'categorical',
 'x': 'numeric',
 'offense_id': 'numeric',
 'y': 'numeric'}

### Sample output:

```
In [1]: variable_helper(df[['offense_id','beat','x','y']])
Out[1]: {'beat': 'categorical',
         'offense_id': 'ordinal',
         'x': 'numeric',
         'y': 'numeric'}
```

Short explanation: _offense_\__id_ is a number assigned to each offense. There is a natural ordering implied in the id number (based on order of occurrence). Because of this, _offense_\__id_ is an ordinal feature. The _beat_ uses a numeric label, but refers to a geographic location. There is no natural ordering, so _beat_ is a categorical feature. The location variables (_x_ and _y_) are numeric position coordinates. 
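As a rough illustration of the distinction described above, the sketch below compares the fraction of distinct values in two columns of a small made-up frame. The helper name `unique_ratio` and the toy data are assumptions for illustration only. A ratio near 1 is merely a hint that a column behaves like an identifier, and a low ratio is a hint that it behaves like a label.

```python
import pandas as pd

# Made-up data: a sequential id column and a label column with few values.
toy = pd.DataFrame({
    "offense_id": range(1000, 1020),   # 20 distinct, increasing ids
    "beat": [101, 102] * 10,           # only 2 distinct labels
})

def unique_ratio(series):
    """Hypothetical helper: share of distinct values in a column."""
    return series.nunique() / len(series)

ratios = {col: unique_ratio(toy[col]) for col in toy.columns}
print(ratios)
```

Both columns are integers, so the dtype alone cannot separate them; the ratio of distinct values is one simple signal that can.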

## Question 2

Write a function called "get_categories" which takes one argument:

* df, which is a pandas data frame

and returns:

* cat, a dictionary where keys are names of columns of df corresponding to categorical features, and values are arrays of all the unique values that the feature can take.

In [110]:
def get_categories(input_df):
    
    result = dict()
    
    helper = variable_helper(input_df)
    cat_cols = [k for k in helper if helper[k] == 'categorical']
    if cat_cols:
        result = {cat_col: np.sort(input_df[cat_col].unique()) for cat_col in cat_cols}

    return result

In [111]:
get_categories(df[['offense_id','beat','UC2 Literal']])

{'beat': array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
        114, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212,
        213, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
        313, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412,
        413, 414, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511,
        512, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612,
        701, 702, 703, 704, 705, 706, 707, 708, 710]),
 'UC2 Literal': array(['AGG ASSAULT', 'AUTO THEFT', 'BURGLARY-NONRES',
        'BURGLARY-RESIDENCE', 'HOMICIDE', 'LARCENY-FROM VEHICLE',
        'LARCENY-NON VEHICLE', 'RAPE', 'ROBBERY-COMMERCIAL',
        'ROBBERY-PEDESTRIAN', 'ROBBERY-RESIDENCE'], dtype=object)}

### Sample output:

```
In [1]: get_categories(df[['offense_id','beat','UC2 Literal']])
Out[1]: {'UC2 Literal': array(['AGG ASSAULT', 'AUTO THEFT', 'BURGLARY-NONRES',
                'BURGLARY-RESIDENCE', 'HOMICIDE', 'LARCENY-FROM VEHICLE',
                'LARCENY-NON VEHICLE', 'RAPE', 'ROBBERY-COMMERCIAL',
                'ROBBERY-PEDESTRIAN', 'ROBBERY-RESIDENCE'], dtype=object),
         'beat': array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
                114, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212,
                213, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
                313, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412,
                413, 414, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511,
                512, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612,
                701, 702, 703, 704, 705, 706, 707, 708, 710])}
```

Short explanation: UC2 Literal and beat are the only categorical variables in the data frame `df[['offense_id','beat','UC2 Literal']]`. 

## Question 3

Write a function called "code_shift" which takes one argument:
    
* df, which is a pandas data frame

and returns:

* a pandas data frame with columns "offense_id", "Shift", "ShiftID", where ShiftID is 0 if "Shift" is "Unk", 1 if "Morn", 2 if "Day", and 3 if "Eve".

In [152]:
def code_shift(input_df):
    code = ["Unk", "Morn","Day","Eve"]
    shift_dict = dict(zip(code, range(len(code))))
    
    # it is recommended by pandas to create a new DF when changes need to be made to the input DF
    new = input_df[['offense_id', 'Shift']].copy()
    new['ShiftID'] = list(new["Shift"].map(shift_dict))
    return new

In [153]:
code_shift(df[:5])

|   | offense_id | Shift | ShiftID |
|---|---|---|---|
| 0 | 172490115 | Morn | 1 |
| 1 | 172490265 | Eve | 3 |
| 2 | 172490322 | Morn | 1 |
| 3 | 172490390 | Morn | 1 |
| 4 | 172490401 | Morn | 1 |


### Sample output:

```
In [1]: code_shift(df[:5])
Out[1]:  	offense_id 	Shift 	ShiftID
        0 	172490115 	Morn 	1
        1 	172490265 	Eve  	3
        2 	172490322 	Morn 	1
        3 	172490390 	Morn 	1
        4 	172490401 	Morn 	1
```
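One detail worth keeping in mind when coding labels with a dictionary: `Series.map` yields `NaN` for any value that has no key in the dictionary. The sketch below shows this on made-up rows; the label "Night" and the frame `toy` are assumptions for illustration and do not come from the APD data.

```python
import pandas as pd

# Made-up rows; "Night" is deliberately not one of the four known labels.
toy = pd.DataFrame({"offense_id": [1, 2, 3], "Shift": ["Morn", "Eve", "Night"]})

shift_codes = {"Unk": 0, "Morn": 1, "Day": 2, "Eve": 3}

# Series.map leaves NaN wherever the value has no key in the dictionary,
# which also turns the resulting column into floats.
toy["ShiftID"] = toy["Shift"].map(shift_codes)

# Count the rows whose label could not be coded.
n_unmapped = int(toy["ShiftID"].isna().sum())
print(toy)
print("unmapped rows:", n_unmapped)
```

Checking the count of unmapped rows is a cheap way to confirm that the four labels really cover every value in the column.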

## Load *cereal.csv* into a pandas DataFrame to solve Questions 4 & 5.


## Question 4

Write a function called "rating_confusion" which takes one argument:

* cer, which is a pandas data frame

and returns:

* cf, a confusion matrix where the rows correspond to predicted_ratingID and the columns correspond to ratingID.

In [154]:
def rating_confusion(input_df):
    # Step 1: Making confusion_matrix
    confusion_matrix = [[((input_df["predicted_ratingID"] == i) & (input_df["ratingID"] == j)).sum() \
     for j in range(2)] for i in range(2) ]

    # Step 2: Set index & columns name
    cf = pd.DataFrame(confusion_matrix)
    cf.index.name = "predicted_ratingID"
    cf.columns.name = "ratingID"

    return cf

In [155]:
rating_confusion(cer[:20])

| predicted_ratingID \ ratingID | 0 | 1 |
|---|---|---|
| 0 | 15 | 0 |
| 1 | 3 | 2 |


### Sample output:


```
In [1]: rating_confusion(cer[:20])
Out[1]: ratingID 	 0 	1
predicted_ratingID 		
                0 	15	0
                1 	3 	2
```
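A confusion matrix with this layout can also be sketched with `pd.crosstab`, which counts co-occurrences of two columns. The example below uses synthetic labels chosen to reproduce the counts in the sample output; the frame `toy` is an assumption for illustration and is not the cereal data.

```python
import pandas as pd

# Synthetic labels chosen to reproduce the counts in the sample output;
# this is not the cereal data.
toy = pd.DataFrame({
    "predicted_ratingID": [0] * 15 + [1] * 3 + [1] * 2,
    "ratingID":           [0] * 15 + [0] * 3 + [1] * 2,
})

# Rows come from the first argument, columns from the second.
cf = pd.crosstab(toy["predicted_ratingID"], toy["ratingID"])
print(cf)
```

Note that `pd.crosstab` only creates rows and columns for labels that actually occur, so a class that never appears in the data would be missing from the result unless the table is reindexed.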

## Question 5

Write a function called "prediction_metrics" which takes one argument:

* cer, which is a pandas data frame

and returns:

* metrics_dict, a python dictionary object where the keys are 'precision', 'recall', 'F1' and the values are the numeric values for precision, recall, and F1 score, where ratingID is the prediction target and predicted_ratingID is a model output.

In [162]:
def prediction_metrics(input_df):
    conf_matrix = rating_confusion(input_df)
    # rows are predicted_ratingID, columns are ratingID
    tp = conf_matrix.loc[1, 1]
    fp = conf_matrix.loc[1, 0]
    fn = conf_matrix.loc[0, 1]
    precision = float(tp / (tp + fp))
    recall = float(tp / (tp + fn))
    f1 = 2 * (precision * recall) / (precision + recall)
    # keep all three metrics numeric, as the question requires
    result_dict = {"F1": f1, "precision": precision, "recall": recall}

    return result_dict

In [163]:
prediction_metrics(cer[:20])

{'F1': 0.5714285714285715, 'precision': 0.4, 'recall': 1.0}

### Sample output:


```
In [1]: prediction_metrics(cer[:20])
Out[1]: {'F1': 0.5714285714285715, 'precision': 0.4, 'recall': 1}
```
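The numbers in the sample output can be checked by hand from the confusion matrix of Question 4, reading class 1 as the positive class: 2 true positives, 3 false positives, and 0 false negatives. The sketch below simply repeats that arithmetic; the variable names are chosen for illustration.

```python
# Counts read off the Question 4 sample confusion matrix
# (rows = predicted_ratingID, columns = ratingID, positive class = 1).
tp = 2  # predicted 1, actual 1
fp = 3  # predicted 1, actual 0
fn = 0  # predicted 0, actual 1

precision = tp / (tp + fp)  # 2 / 5
recall = tp / (tp + fn)     # 2 / 2
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))
```

Precision is 2/5 and recall is 2/2, so F1 works out to 0.8/1.4, roughly 0.571. This sketch does not guard against a zero denominator, which would occur if the model never predicted the positive class.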