# Skills challenge \#9
Below is a series of questions. Use the loaded data to answer them. You will almost certainly need to import more packages (`pandas`, `numpy`, etc.) to complete these. You are welcome to use any source except for your classmates. So Google away!

You will be graded on both the **correctness** and **cleanliness** of your work, so poorly written code will lower your grade. Use Markdown cells to describe what you have done. If you get stuck, move on to another part. Most questions don't rely on the answers to earlier questions.

### Imports

In [67]:
import pandas as pd

### Data loading

In [39]:
df = pd.read_csv('../../data/2016_austin_crime.csv')

In [40]:
df.head()

| Unnamed: 0 | GO Primary Key | Council District | GO Highest Offense Desc | Highest NIBRS/UCR Offense Description | GO Report Date | GO Location | Clearance Status | Clearance Date | GO District | GO Location Zip | GO Census Tract | GO X Coordinate | GO Y Coordinate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201610188.0 | 8.0 | AGG ASLT ENHANC STRANGL/SUFFOC | Agg Assault | 1-Jan-16 | 8600 W SH 71 ... | C | 12-Jan-16 | D | 78735.0 | 19.08 | 3067322.0 | 10062796.0 |
| 1 | 201610643.0 | 9.0 | THEFT | Theft | 1-Jan-16 | 219 E 6TH ST ... | C | 4-Jan-16 | G | 78701.0 | 11.0 | 3114957.0 | 10070462.0 |
| 2 | 201610892.0 | 4.0 | AGG ROBBERY/DEADLY WEAPON | Robbery | 1-Jan-16 | 701 W LONGSPUR BLVD ... | N | 3-May-16 | E | 78753.0 | 18.23 | 3129181.0 | 10106923.0 |
| 3 | 201610893.0 | 9.0 | THEFT | Theft | 1-Jan-16 | 404 COLORADO ST ... | N | 22-Jan-16 | G | 78701.0 | 11.0 | 3113643.0 | 10070357.0 |
| 4 | 201611018.0 | 4.0 | SEXUAL ASSAULT W/ OBJECT | Rape | 1-Jan-16 |  | C | 10-Mar-16 | E | 78753.0 | 18.33 |  |  |


### Data description

This dataset contains all the crimes recorded by the Austin PD in 2016, which you used previously in skills challenge 1. The columns that we are interested in are:
- **Council District**: The district in which the crime was committed ([map of districts](https://www.austinchronicle.com/binary/35e1/pols_feature51.jpg))
- **GO Highest Offense Desc**: A text description of the offense using the APD description
- **Highest NIBRS/UCR Offense Description**: A text description using the FBI description
- **GO Report Date**: The date on which the crime was reported
- **Clearance Status**: Whether or not the crime was "cleared" (i.e. the case was closed due to an arrest)
- **Clearance Date**: When the crime was cleared
- **GO Location Zip**: The zip code where the crime occurred

## Tasks

### Data cleaning
**DC1:** Drop all columns except those listed above. Also drop any rows with any missing values. Save the result back to `df`.
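As a hint for DC1, here is one possible pattern sketched on a tiny made-up frame (the `toy` data and the shortened `keep_cols` list are illustrative assumptions, not the real dataset). It selects the wanted columns first and only then drops incomplete rows, so that gaps in discarded columns do not remove rows unnecessarily.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime data; the values are invented for illustration.
toy = pd.DataFrame({
    "GO Primary Key": [1.0, 2.0, 3.0],
    "Council District": [8.0, 9.0, np.nan],
    "Clearance Status": ["C", "N", "C"],
    "GO X Coordinate": [3067322.0, np.nan, 3129181.0],
})

# Hypothetical, shortened list; the real task keeps seven columns.
keep_cols = ["Council District", "Clearance Status"]

# Keep only the wanted columns, then drop rows with any missing value.
toy_clean = toy[keep_cols].dropna()
print(toy_clean)
```

Note that the second row survives even though its `GO X Coordinate` is missing, because that column was discarded before `dropna()` ran.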

**DC2:** Rename the columns to be all lowercase, replace spaces with underscores ("_"), and remove "GO" from all column names. Finally, make sure there are no spaces at the start or finish of a column name. For example, ``'  my_col '`` should be renamed to `'my_col'` (notice that the spaces are gone), and "GO Report Date" should become "report_date". Rename "Highest NIBRS/UCR Offense Description" to "fbi_desc".
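A minimal sketch of the renaming rules in DC2, assuming a toy frame with three made-up column names. It handles the one special-case name explicitly and then chains the vectorized string methods on `df.columns`; other approaches (a list comprehension, for example) would work just as well.

```python
import pandas as pd

# Empty toy frame; only the column names matter for this sketch.
toy = pd.DataFrame(columns=[
    "GO Report Date",
    " Council District ",
    "Highest NIBRS/UCR Offense Description",
])

# Special case first, then the general rules.
toy = toy.rename(columns={"Highest NIBRS/UCR Offense Description": "fbi_desc"})
toy.columns = (
    toy.columns
    .str.replace("GO ", "", regex=False)  # remove the "GO " prefix text
    .str.strip()                          # trim leading/trailing spaces
    .str.lower()                          # lowercase everything
    .str.replace(" ", "_", regex=False)   # spaces become underscores
)
print(list(toy.columns))
```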

**DC3:** Create four new columns: `report_month`, `report_day`, `clearance_month` and `clearance_day` (we don't need the year, because all records are from 2016). Derive the `report_*` columns from the report date column and the `clearance_*` columns from the clearance date column. Make sure the values in `*_day` are integers, not strings (you can have strings for `*_month`). Once you are done, drop the report date and clearance date columns.
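One way to approach DC3, sketched for the report date only and assuming the dates look like `1-Jan-16` as in the `df.head()` output. The toy values are made up; the clearance date would follow the same pattern.

```python
import pandas as pd

# Toy dates in the same day-month-year text format as the real data.
toy = pd.DataFrame({"report_date": ["1-Jan-16", "22-Feb-16"]})

# Split "1-Jan-16" into three text columns: day, month, year.
parts = toy["report_date"].str.split("-", expand=True)

toy["report_day"] = parts[0].astype(int)  # integer day
toy["report_month"] = parts[1]            # month kept as a string
toy = toy.drop(columns=["report_date"])
print(toy)
```

Parsing with `pd.to_datetime(..., format="%d-%b-%y")` and reading `.dt.day` would be an equally reasonable route, particularly if you prefer numeric months.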

**DC4:** Label encode all of the categorical columns if they are not already integers. Save the label encoders into a dictionary where the key is the (cleaned) column name. So you should have a dictionary `label_encoders` where `label_encoders['council_district']` is the label encoder for the `council_district` column, and so forth.
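The dictionary-of-encoders idea in DC4 might look roughly like the sketch below. The toy frame is an assumption made for illustration; `council_district` is stored as a float here (as in the `df.head()` output), so it gets encoded, while the integer `report_day` column is left alone.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with invented values.
toy = pd.DataFrame({
    "council_district": [8.0, 9.0, 4.0],
    "clearance_status": ["C", "N", "C"],
    "report_month": ["Jan", "Feb", "Jan"],
    "report_day": [1, 2, 3],
})

label_encoders = {}
for col in toy.columns:
    # Skip columns that already hold integers.
    if pd.api.types.is_integer_dtype(toy[col]):
        continue
    encoder = LabelEncoder()
    toy[col] = encoder.fit_transform(toy[col])
    label_encoders[col] = encoder  # keep it so labels can be decoded later

print(label_encoders["clearance_status"].classes_)
```

Keeping each fitted encoder means `label_encoders[col].inverse_transform(...)` can recover the original labels when you interpret the model's predictions.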

### Model building

**MB1:** Split the data into training and testing sets. Your test set should have 20% of the data, and the training set should have the remaining 80%. Split the training and testing sets into X and y, where y is the clearance status.
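For MB1, a rough sketch of an 80/20 split on a ten-row toy frame (invented values). Separating X and y before or after the split gives the same result; this version separates first. The `random_state` value is an arbitrary choice made here so the split is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with invented values; clearance_status plays the role of the target.
toy = pd.DataFrame({
    "council_district": range(10),
    "report_day": range(10, 20),
    "clearance_status": [0, 1] * 5,
})

X = toy.drop(columns=["clearance_status"])  # features
y = toy["clearance_status"]                 # target

# test_size=0.2 puts 20% of the rows in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```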

**MB2:** Train a decision tree classifier to predict the clearance status given all other columns. Set the `max_depth` to 3 and the `min_samples_split` to 5. Be sure to train on the training set only. 
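The training step in MB2 can be sketched as follows. Synthetic data from `make_classification` stands in for the cleaned crime data, and the `random_state` values are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned crime data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters as specified in the task.
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
tree.fit(X_train, y_train)  # fit on the training set only

print(tree.get_depth())
```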

### Model evaluation

**ME1:** Print a classification report to show the results. Be sure to make your predictions/do your evaluations on the test set only. Create a Markdown cell discussing what you found.
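A minimal sketch of ME1, again on synthetic data rather than the crime dataset (the data and `random_state` values are assumptions). The key point is that predictions come from `X_test` only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned crime data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
tree.fit(X_train, y_train)

# Predict on the held-out test set and summarize per-class metrics.
y_pred = tree.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
```

The report lists precision, recall, and F1 for each class, which is a useful starting point for the Markdown discussion.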

**ME2:** Print a confusion matrix by following [the sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html). Set `values_format='d'` to show numbers and not scientific notation. Do this for the test set. Create a Markdown cell discussing what you see.
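Depending on your scikit-learn version, the linked `plot_confusion_matrix` helper may not be available (it was removed in release 1.2); `ConfusionMatrixDisplay.from_estimator(tree, X_test, y_test, values_format='d')` is the replacement in newer releases. The sketch below computes the underlying counts numerically on synthetic data, which is an assumption standing in for the crime dataset, so that it runs without a plotting backend.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned crime data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
tree.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, tree.predict(X_test), labels=tree.classes_)
print(cm)
```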

### Model tuning

**MT1:** Build decision tree classifiers with all possible combinations of the following values: `max_depth = [2, 3, 4, 5, 6, 7, 8, 9, 10]`, `min_samples_split = [2, 5, 10, 15, 20]`. Do this using `for` loops. So you should have a model with (for example) `max_depth=2` and `min_samples_split=2`, another with `max_depth=2` and `min_samples_split=5`, and so forth. This will result in 9 * 5 = 45 different decision tree models. Find which model has the best accuracy (hint: using the `.score()` method for a classification model will return the accuracy). Use the `.get_params()` method for the best decision tree to see what its depth and min samples split were.
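The nested-loop search in MT1 could be structured roughly as below. Synthetic data stands in for the crime dataset, and scoring on the test set is an assumption, since the task does not say which split to score on; a separate validation split would be a cleaner choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned crime data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

depths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
splits = [2, 5, 10, 15, 20]

best_score, best_tree, n_models = -1.0, None, 0
for max_depth in depths:
    for min_samples_split in splits:
        tree = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=0,
        )
        tree.fit(X_train, y_train)
        score = tree.score(X_test, y_test)  # accuracy
        n_models += 1
        if score > best_score:  # keep the first model with the top score
            best_score, best_tree = score, tree

best_params = best_tree.get_params()
print(n_models, best_score, best_params["max_depth"], best_params["min_samples_split"])
```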

**MT2:** Use grid search cross validation to search the same hyperparameters as you did above. Show the best hyperparameters it finds.
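A sketch of the same search with `GridSearchCV`, assuming synthetic data and five-fold cross validation (the task does not specify the number of folds, so `cv=5` is a choice made here).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned crime data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

param_grid = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "min_samples_split": [2, 5, 10, 15, 20],
}

# Cross-validated search over all 45 combinations, using the training set only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_)
```

Because the grid search scores each combination by cross validation on the training data, its winner may differ from the one found by a loop that scores on a single held-out set.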

### Bonus

**B1:** Try both under- and oversampling the data to see if you can get a model which will actually predict any clearance status of 2. Compare its accuracy to what you got from the other decision tree models.
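One possible way to resample for B1 uses `sklearn.utils.resample` on each class separately, sketched here on a toy frame with invented class counts. Dedicated libraries such as `imbalanced-learn` offer similar functionality; this version is only an illustration of the idea.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: six rows of class 0, four of class 1, two of class 2.
toy = pd.DataFrame({
    "feature": range(12),
    "clearance_status": [0] * 6 + [1] * 4 + [2] * 2,
})

counts = toy["clearance_status"].value_counts()
largest, smallest = int(counts.max()), int(counts.min())
groups = [group for _, group in toy.groupby("clearance_status")]

# Oversampling: draw with replacement until every class matches the largest.
oversampled = pd.concat(
    [resample(g, replace=True, n_samples=largest, random_state=0) for g in groups]
)

# Undersampling: draw without replacement down to the smallest class size.
undersampled = pd.concat(
    [resample(g, replace=False, n_samples=smallest, random_state=0) for g in groups]
)

print(oversampled["clearance_status"].value_counts())
print(undersampled["clearance_status"].value_counts())
```

Resampling is usually applied to the training set only, so that the test set still reflects the original class balance.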