# Skills challenge \#1
Below is a series of questions. Use the loaded data to answer them. You will almost certainly need to import more packages (`pandas`, `numpy`, etc.) to complete these. You are welcome to use any source except for your classmates. So Google away!

You will be graded on both the **correctness** and **cleanliness** of your work. So don't submit poorly written code or your grade will reflect that. **Do not leave any scratch work.** The only code in the cell should be your polished solution. If you get stuck, move on to another part. Most questions don't rely on the answer to earlier questions.

### Imports

In [3]:
import pandas as pd
import numpy as np

### Data loading

In [4]:
df = pd.read_csv('../data/2016_austin_crime.csv')

In [5]:
df.head()

Unnamed: 0,GO Primary Key,Council District,GO Highest Offense Desc,Highest NIBRS/UCR Offense Description,GO Report Date,GO Location,Clearance Status,Clearance Date,GO District,GO Location Zip,GO Census Tract,GO X Coordinate,GO Y Coordinate
0,201610188.0,8.0,AGG ASLT ENHANC STRANGL/SUFFOC,Agg Assault,1-Jan-16,8600 W SH 71 ...,C,12-Jan-16,D,78735.0,19.08,3067322.0,10062796.0
1,201610643.0,9.0,THEFT,Theft,1-Jan-16,219 E 6TH ST ...,C,4-Jan-16,G,78701.0,11.0,3114957.0,10070462.0
2,201610892.0,4.0,AGG ROBBERY/DEADLY WEAPON,Robbery,1-Jan-16,701 W LONGSPUR BLVD ...,N,3-May-16,E,78753.0,18.23,3129181.0,10106923.0
3,201610893.0,9.0,THEFT,Theft,1-Jan-16,404 COLORADO ST ...,N,22-Jan-16,G,78701.0,11.0,3113643.0,10070357.0
4,201611018.0,4.0,SEXUAL ASSAULT W/ OBJECT,Rape,1-Jan-16,,C,10-Mar-16,E,78753.0,18.33,,


### Data description

This data covers all the crimes recorded by the Austin PD in 2016. The columns that we are interested in are:
- **Council District**: The district in which the crime was committed ([map of districts](https://www.austinchronicle.com/binary/35e1/pols_feature51.jpg))
- **GO Highest Offense Desc**: A text description of the offense using the APD description
- **Highest NIBRS/UCR Offense Description**: A text description using the FBI description
- **GO Report Date**: The date on which the crime was reported
- **Clearance Status**: Whether or not the crime was "cleared" (i.e. the case was closed due to an arrest)
- **Clearance Date**: When the crime was cleared
- **GO Location Zip**: The zip code where the crime occurred

## Tasks

### Data cleaning
**DC1:** Drop all columns that are not in the list above. Save this back as the variable `df`.

In [6]:
df = df[['Council District', 'GO Highest Offense Desc', 'Highest NIBRS/UCR Offense Description', 'GO Report Date', 'Clearance Status', 'Clearance Date', 'GO Location Zip']]

**DC2:** Rename the columns to be all lowercase, replace spaces with underscores ("_"), and remove "GO" from all column names. Finally, make sure there are no spaces at the start or finish of a column name. For example, ``'  my_col '`` should be renamed to `'my_col'` (notice that the spaces are gone), and "GO Report Date" should become "report_date". Rename "Highest NIBRS/UCR Offense Description" to "fbi_desc", and "GO Highest Offense Desc" to "apd_desc".

In [8]:
clean_cols = [c if c != 'Highest NIBRS/UCR Offense Description' else 'fbi_desc' for c in df.columns]
clean_cols = [c if c != 'GO Highest Offense Desc' else 'apd_desc' for c in clean_cols]
clean_cols = [c.replace('GO', '').replace(' ', '_').lower().strip().strip('_') for c in clean_cols]
df.columns = clean_cols
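
The same cleanup can also be sketched with the `.str` accessor on the column index. This is a minimal sketch on a hypothetical toy frame (`toy` and its column names are made up for illustration), not the required solution. Note that replacing `'GO'` this way would also strip those letters from the middle of any other uppercase word, which happens not to matter for these column names.

```python
import pandas as pd

# Hypothetical toy frame; the column names only mimic the style of the crime data
toy = pd.DataFrame(columns=['GO Report Date', '  Council District ', 'GO Highest Offense Desc'])

# Handle the special-case name first, then apply the generic rules
toy = toy.rename(columns={'GO Highest Offense Desc': 'apd_desc'})
toy.columns = (
    toy.columns
    .str.replace('GO', '', regex=False)  # drop the "GO" prefix
    .str.strip()                         # trim leading/trailing spaces
    .str.lower()
    .str.replace(' ', '_', regex=False)  # remaining spaces become underscores
)
print(list(toy.columns))
```

Stripping before replacing spaces avoids creating leading or trailing underscores in the first place.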

**DC3:** For each column, print how many `None` or `NaN` values are in the column, along with what percentage of the rows are missing. Round the percentage to two decimal places. Your output should look like:

```
col1_name: 20 (0.05%) missing values 
col2_name: 150 (1.56%) missing values 
```
Then, drop any rows which have missing values.

In [9]:
for c in df.columns:
    missing_values = df[c].isna().sum()
    # Convert the fraction of missing rows to a percentage
    pct_missing = 100 * missing_values / df.shape[0]
    print(f'{c}: {missing_values} ({pct_missing:.2f}%) missing values')
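
A vectorized variant is also possible, since `isna()` works on a whole DataFrame at once. The sketch below uses a hypothetical toy frame (`toy`, with made-up columns `a` and `b`) rather than the crime data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': ['x', None, None, 'y']})

n_missing = toy.isna().sum()            # count of missing values per column
pct_missing = 100 * toy.isna().mean()   # share of rows missing, as a percentage

lines = [f'{col}: {n_missing[col]} ({pct_missing[col]:.2f}%) missing values'
         for col in toy.columns]
print('\n'.join(lines))
```

Taking the mean of a boolean mask gives the fraction of `True` values, so multiplying by 100 yields the percentage directly.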

**DC4:** Drop any rows which have any missing values. Save the result back to `df`.

In [10]:
df = df.dropna(how='any')

**DC5:** For any column which is a `float`, check if the numbers really are floats (i.e. is there a reason they're a decimal?). If they're not really decimals (for instance, if all of them have .0 at the end), then convert the column to integers.

In [11]:
df['council_district'] = df['council_district'].astype(int)
df['location_zip'] = df['location_zip'].astype(int)
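
The "are these really floats?" check can also be done programmatically instead of by eye. A minimal sketch, assuming missing values have already been dropped and using a hypothetical toy frame (`toy`, `district`, and `tract` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame: one column of whole numbers stored as floats, one of true decimals
toy = pd.DataFrame({'district': [8.0, 9.0, 4.0], 'tract': [19.08, 11.0, 18.23]})

for col in toy.select_dtypes(include='float').columns:
    # A column is "really" an integer if every value has no fractional part
    if (toy[col] % 1 == 0).all():
        toy[col] = toy[col].astype(int)

print(toy.dtypes)
```

Here only `district` is converted; `tract` keeps its float dtype because some values have a fractional part.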

### Data exploration
**DE1:** Print out each district, along with what percentage of the crimes occurred in it.

In [17]:
# Function which will do this for any column

# Takes in the dataframe to work with, along with the name of the column you want to summarize
def col_pct_summary(df, col):
    print(f'Summary for column {col}')
    
    # Go through all unique values in that column
    for x in df[col].unique():
        # Find all rows which have that value
        x_df = df[df[col] == x]
        # Calculate the percentage (# of rows with that value / total # of rows)
        x_pct = x_df.shape[0] / df.shape[0]
        # Print it using f-strings. Multiply by 100 to make the percent look nicer (not required)
        print(f'{x}: {100*x_pct:.2f}%')

col_pct_summary(df, 'council_district')

Summary for column council_district
8: 5.51%
9: 16.18%
4: 14.12%
1: 10.10%
3: 15.84%
5: 8.43%
2: 8.89%
7: 11.35%
6: 5.42%
10: 4.16%
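
For comparison, pandas can compute these shares in one step with `value_counts(normalize=True)`. This is a sketch on a hypothetical toy frame (`toy`), and it assumes missing values were already dropped, since `value_counts` excludes them by default.

```python
import pandas as pd

# Hypothetical toy frame standing in for the crime data
toy = pd.DataFrame({'council_district': [9, 9, 4, 1]})

# normalize=True returns fractions; sort_index orders the output by district number
pcts = toy['council_district'].value_counts(normalize=True).sort_index() * 100
for district, pct in pcts.items():
    print(f'{district}: {pct:.2f}%')
```

Sorting by index makes the output easier to scan than the order of first appearance.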


**DE2:** Do the same for each zip code.

In [14]:
col_pct_summary(df, 'location_zip')

Summary for column location_zip
78735: 0.77%
78701: 5.65%
78753: 8.31%
78724: 1.45%
78741: 8.97%
78704: 6.81%
78748: 3.65%
78758: 6.59%
78744: 5.25%
78747: 0.61%
78756: 0.91%
78751: 2.52%
78759: 3.18%
78723: 5.62%
78752: 3.54%
78745: 6.04%
78749: 2.10%
78731: 1.41%
78702: 4.27%
78722: 0.71%
78705: 3.15%
78757: 2.98%
78721: 1.37%
78739: 0.41%
78729: 1.48%
78613: 0.89%
78617: 0.75%
78746: 2.09%
78750: 0.90%
78719: 0.44%
78703: 1.84%
78736: 0.32%
78653: 0.11%
78727: 1.43%
78652: 0.04%
78754: 1.11%
78726: 0.58%
78717: 0.80%
78660: 0.44%
78725: 0.19%
78712: 0.02%
78730: 0.13%
78742: 0.15%
78728: 0.01%
78732: 0.00%
78737: 0.00%


**DE3:** Print what percentage of crimes were cleared and what percentage were not.

In [15]:
col_pct_summary(df, 'clearance_status')

Summary for column clearance_status
C: 14.04%
N: 83.14%
O: 2.82%


**DE4:** Do the same for crimes by the FBI description (so percentage of each type of crime).

In [16]:
col_pct_summary(df, 'fbi_desc')

Summary for column fbi_desc
Agg Assault: 5.87%
Theft: 69.87%
Robbery: 2.56%
Rape: 1.89%
Burglary: 14.13%
Auto Theft: 5.59%
Murder: 0.08%


### Bonus questions
**B1:** Create a dictionary (Python `dict`) that has the FBI description as the key and a list of all APD descriptions that map to it as the values. So for example, it may look like `{'Theft': ['THEFT FROM BUILDING', 'THEFT', ...], 'Robbery': ['AGG ROBBERY/DEADLY WEAPON', 'PURSE SNATCHING', ...]}`. 

In [21]:
soln_dict = {d: list(df[df['fbi_desc'] == d]['apd_desc'].unique()) for d in df['fbi_desc'].unique()}
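
An equivalent mapping can be built with `groupby`, which avoids filtering the frame once per description. A minimal sketch on a hypothetical toy frame (`toy` and `desc_map` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame with the two description columns
toy = pd.DataFrame({
    'fbi_desc': ['Theft', 'Theft', 'Robbery', 'Theft'],
    'apd_desc': ['THEFT', 'THEFT FROM BUILDING', 'PURSE SNATCHING', 'THEFT'],
})

# Collect the unique APD descriptions within each FBI description, then convert to plain lists
desc_map = toy.groupby('fbi_desc')['apd_desc'].unique().apply(list).to_dict()
print(desc_map)
```

The `.apply(list)` step turns each NumPy array returned by `unique()` into a regular Python list.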

**B2:** Write a function which allows a person to type in an FBI description, and the function returns a dictionary with the following summary:
- Number of crimes committed with that description.
- Percentage of crimes committed with that description. Leave it as a float between 0 and 1.
- The percentage of crimes with that description which were "cleared" (clearance status of "C").
- The zip in which the crime occurred most often.
- The district in which the crime occurred most often.

The function should still work even if the person types in the FBI description with incorrect capitalization or spacing. So for instance, if the FBI description is "Theft", then any of the following should still work:
- 'Theft'
- 'THEFT'
- 'theft'
- 'thEFt'
- '    theft'
- '    THeft   '

In [18]:
def crimes_summary(df, fbi_desc):
    cleaned_desc = fbi_desc.lower().strip()
    summary_df = df[df['fbi_desc'].str.lower() == cleaned_desc]
    n_crimes = summary_df.shape[0]
    pct_crimes = n_crimes / df.shape[0]
    pct_cleared = summary_df[summary_df['clearance_status'] == 'C'].shape[0] / n_crimes
    top_zip = summary_df['location_zip'].value_counts().index[0]
    top_district = summary_df['council_district'].value_counts().index[0]

    summary_dict = {'n_crimes': n_crimes,
                   'pct_crimes': pct_crimes,
                   'pct_cleared': pct_cleared,
                   'top_zip': top_zip,
                   'top_district': top_district}
    return summary_dict

In [24]:
# Showing off the function working
for desc in df['fbi_desc'].unique():
    print(f'Summary for {desc}')
    summary_dict = crimes_summary(df, desc)
    print(summary_dict)
    print()

Summary for Agg Assault
{'n_crimes': 2086, 'pct_crimes': 0.05866471680071995, 'pct_cleared': 0.44534995206136146, 'top_zip': 78753, 'top_district': 4}

Summary for Theft
{'n_crimes': 24845, 'pct_crimes': 0.6987175881658136, 'pct_cleared': 0.11318172670557457, 'top_zip': 78753, 'top_district': 9}

Summary for Robbery
{'n_crimes': 911, 'pct_crimes': 0.025620113617188817, 'pct_cleared': 0.30954994511525796, 'top_zip': 78741, 'top_district': 4}

Summary for Rape
{'n_crimes': 673, 'pct_crimes': 0.018926823780865066, 'pct_cleared': 0.15453194650817237, 'top_zip': 78741, 'top_district': 4}

Summary for Burglary
{'n_crimes': 5025, 'pct_crimes': 0.14131840935935655, 'pct_cleared': 0.09611940298507463, 'top_zip': 78741, 'top_district': 3}

Summary for Auto Theft
{'n_crimes': 1988, 'pct_crimes': 0.05590865627988076, 'pct_cleared': 0.17806841046277666, 'top_zip': 78741, 'top_district': 3}

Summary for Murder
{'n_crimes': 30, 'pct_crimes': 0.000843691996175263, 'pct_cleared': 0.9333333333333333, 't