# Skills challenge \#12
Below is a series of questions. Use the loaded data to answer them. You will almost certainly need to import more packages (`pandas`, `numpy`, etc.) to complete these. You are welcome to use any source except for your classmates. So Google away!

You will be graded on both the **correctness** and **cleanliness** of your work, so poorly written code will lower your grade. Use Markdown to describe what you have done. If you get stuck, move on to another part. Most questions don't rely on the answers to earlier questions.

### Imports

In [1]:
import pandas as pd

import numpy as np

### Data loading

In [2]:
df = pd.read_csv('../../data/2016_austin_crime.csv')

In [3]:
df.head()

| | GO Primary Key | Council District | GO Highest Offense Desc | Highest NIBRS/UCR Offense Description | GO Report Date | GO Location | Clearance Status | Clearance Date | GO District | GO Location Zip | GO Census Tract | GO X Coordinate | GO Y Coordinate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201610188.0 | 8.0 | AGG ASLT ENHANC STRANGL/SUFFOC | Agg Assault | 1-Jan-16 | 8600 W SH 71 ... | C | 12-Jan-16 | D | 78735.0 | 19.08 | 3067322.0 | 10062796.0 |
| 1 | 201610643.0 | 9.0 | THEFT | Theft | 1-Jan-16 | 219 E 6TH ST ... | C | 4-Jan-16 | G | 78701.0 | 11.0 | 3114957.0 | 10070462.0 |
| 2 | 201610892.0 | 4.0 | AGG ROBBERY/DEADLY WEAPON | Robbery | 1-Jan-16 | 701 W LONGSPUR BLVD ... | N | 3-May-16 | E | 78753.0 | 18.23 | 3129181.0 | 10106923.0 |
| 3 | 201610893.0 | 9.0 | THEFT | Theft | 1-Jan-16 | 404 COLORADO ST ... | N | 22-Jan-16 | G | 78701.0 | 11.0 | 3113643.0 | 10070357.0 |
| 4 | 201611018.0 | 4.0 | SEXUAL ASSAULT W/ OBJECT | Rape | 1-Jan-16 | | C | 10-Mar-16 | E | 78753.0 | 18.33 | | |


### Data description

This dataset contains all the crimes recorded by the Austin PD in 2016. The columns that we are interested in are:
- **Council District**: The district in which the crime was committed ([map of districts](https://www.austinchronicle.com/binary/35e1/pols_feature51.jpg))
- **GO Highest Offense Desc**: A text description of the offense using the APD description
- **Highest NIBRS/UCR Offense Description**: A text description using the FBI description
- **GO Report Date**: The date on which the crime was reported
- **Clearance Status**: Whether or not the crime was "cleared" (i.e. the case was closed due to an arrest)
- **Clearance Date**: When the crime was cleared
- **GO Location Zip**: The zip code where the crime occurred


## Tasks

### Data cleaning
The following is taken from the week 1 skills challenge. You do not need to do anything here. This is just to show what data cleaning was done.

**DC1:** Drop all columns that are not in the list above. Save this back as the variable `df`.

In [4]:
df = df[['Council District', 'GO Highest Offense Desc', 'Highest NIBRS/UCR Offense Description', 'GO Report Date', 'Clearance Status', 'Clearance Date', 'GO Location Zip']]

**DC2:** Rename the columns to be all lowercase, replace spaces with underscores ("_"), and remove "GO" from all column names. Finally, make sure there are no spaces at the start or end of a column name. For example, ``'  my_col '`` should be renamed to `'my_col'` (notice that the spaces are gone), and "GO Report Date" should become "report_date". Rename "Highest NIBRS/UCR Offense Description" to "fbi_desc", and "GO Highest Offense Desc" to "apd_desc".

In [5]:
clean_cols = [c if c != 'Highest NIBRS/UCR Offense Description' else 'fbi_desc' for c in df.columns]
clean_cols = [c if c != 'GO Highest Offense Desc' else 'apd_desc' for c in clean_cols]
clean_cols = [c.replace('GO', '').replace(' ', '_').lower().strip().strip('_') for c in clean_cols]
df.columns = clean_cols
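
The renaming rules can also be written as a small function, which makes them easy to try on sample names before touching `df`. This is only a sketch, not the challenge's reference solution: `clean_column_name`, `special_names`, and `sample_cols` are hypothetical names introduced here, and the function assumes "GO" only needs to be removed when it appears as a prefix.

```python
# Hypothetical helper (not part of the original solution): applies the DC2 rules to one name.
special_names = {
    'Highest NIBRS/UCR Offense Description': 'fbi_desc',
    'GO Highest Offense Desc': 'apd_desc',
}


def clean_column_name(name):
    """Return a lowercase, underscore-separated name without a leading "GO"."""
    name = name.strip()
    if name in special_names:
        return special_names[name]
    # Assumption: "GO" is only ever a prefix, so names that merely contain "GO" are left alone.
    if name.startswith('GO '):
        name = name[len('GO '):]
    return name.strip().replace(' ', '_').lower()


sample_cols = ['GO Report Date', '  my_col ', 'Council District', 'GO Highest Offense Desc']
print([clean_column_name(c) for c in sample_cols])
# ['report_date', 'my_col', 'council_district', 'apd_desc']
```

With a helper like this, the whole frame could be renamed in one step with `df.columns = [clean_column_name(c) for c in df.columns]`.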

**DC3:** For each column, print how many `None` or `NaN` values are in the column, along with what percentage of the rows are missing. Round the percentage to two decimal places. Your output should look like:

```
col1_name: 20 (0.05%) missing values 
col2_name: 150 (1.56%) missing values 
```
Then, drop any rows which have missing values.

In [10]:
for c in df.columns:
    missing_values = df[c].isna().sum()
    # Multiply by 100 so the value printed next to "%" is a percentage, not a fraction.
    pct_missing = 100 * missing_values / df.shape[0]
    col_summary = f'{c}: {missing_values} ({pct_missing:.2f}%) missing values'
    print(col_summary)

council_district: 0 (0.00%) missing values
apd_desc: 0 (0.00%) missing values
fbi_desc: 0 (0.00%) missing values
report_date: 0 (0.00%) missing values
clearance_status: 0 (0.00%) missing values
clearance_date: 0 (0.00%) missing values
location_zip: 0 (0.00%) missing values
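
The per-column counts and percentages can also be computed for the whole frame at once, since `isna()` returns a boolean frame whose column sums are counts and whose column means are fractions. The sketch below uses a made-up `toy` frame (hypothetical values, not the crime data) so that the numbers are non-zero.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the crime DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    'council_district': [8.0, 9.0, np.nan, 4.0],
    'clearance_status': ['C', None, 'N', None],
})

n_missing = toy.isna().sum()            # number of missing values per column
pct_missing = toy.isna().mean() * 100   # share of rows missing, as a percentage

for col in toy.columns:
    print(f'{col}: {n_missing[col]} ({pct_missing[col]:.2f}%) missing values')
# council_district: 1 (25.00%) missing values
# clearance_status: 2 (50.00%) missing values
```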


**DC4:** Drop any rows which have any missing values. Save the result back to `df`.

In [7]:
df = df.dropna(how='any')

**DC5:** For any column which is a `float`, check if the numbers really are floats (i.e. is there a reason they're a decimal?). If they're not really decimals (for instance, if all of them have .0 at the end), then convert the column to integers.

In [8]:
df['council_district'] = df['council_district'].astype(int)
df['location_zip'] = df['location_zip'].astype(int)
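
DC5 asks to check whether a float column really holds decimals before converting it. One possible way to make that check explicit is to test whether every value has a zero fractional part. This is a sketch on a made-up `toy` frame (hypothetical data), and it assumes missing values have already been dropped, as in DC4, because `NaN` cannot be cast to `int`.

```python
import pandas as pd

# Toy data (made-up values): one integer-valued float column and one true decimal column.
toy = pd.DataFrame({
    'location_zip': [78735.0, 78701.0, 78753.0],
    'census_tract': [19.08, 11.0, 18.23],
})

for col in toy.select_dtypes(include='float').columns:
    # A float column is "really" an integer column if no value has a fractional part.
    if (toy[col] % 1 == 0).all():
        toy[col] = toy[col].astype(int)

print(toy.dtypes)
```

Here `location_zip` is converted to an integer dtype, while `census_tract` stays a float because values such as 19.08 carry real decimals.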

In [9]:
df.head()

| | council_district | apd_desc | fbi_desc | report_date | clearance_status | clearance_date | location_zip |
|---|---|---|---|---|---|---|---|
| 0 | 8 | AGG ASLT ENHANC STRANGL/SUFFOC | Agg Assault | 1-Jan-16 | C | 12-Jan-16 | 78735 |
| 1 | 9 | THEFT | Theft | 1-Jan-16 | C | 4-Jan-16 | 78701 |
| 2 | 4 | AGG ROBBERY/DEADLY WEAPON | Robbery | 1-Jan-16 | N | 3-May-16 | 78753 |
| 3 | 9 | THEFT | Theft | 1-Jan-16 | N | 22-Jan-16 | 78701 |
| 4 | 4 | SEXUAL ASSAULT W/ OBJECT | Rape | 1-Jan-16 | C | 10-Mar-16 | 78753 |


### Interactivity

**I1:** 