# Exploratory Data Analysis for Crimes in Chicago in 2018



## Copy from S3

In [None]:
import pandas as pd
import boto3
import botocore

bucket = "sagemaker-chicago-data"
key = "Crimes_-_2018.csv"

s3 = boto3.resource('s3')
# Download into ../Data so the path matches the read_csv call in the next cell
s3.Bucket(bucket).download_file(key, "../Data/crimes_2018.csv")

In [None]:
df = pd.read_csv("../Data/crimes_2018.csv", index_col = "ID")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df["Primary Type"].value_counts()

## Analysis

Based on a few simple lines of code, we can conclude the following. From January through September of 2018, there were over 180,000 criminal events in Chicago. Among others, these include:
- 122 kidnappings
- 1,054 sexual assaults
- 396 homicides
- 6,823 motor vehicle thefts
- 3,815 weapons violations
- 34,903 cases of battery
- 44,086 cases of theft
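
As a rough illustration of how a summary like this can be pulled from `value_counts()`, here is a minimal sketch. It runs on a tiny made-up frame rather than the real data, and the category labels in `selected` are assumptions about how the values are spelled in the `Primary Type` column.

```python
import pandas as pd

# Toy stand-in for the crimes table; these rows are invented for illustration.
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "HOMICIDE", "NARCOTICS", "THEFT"]
})

# Categories we want to report on (assumed spellings).
selected = ["THEFT", "BATTERY", "HOMICIDE", "KIDNAPPING"]

counts = toy["Primary Type"].value_counts()

# reindex keeps the requested order and fills categories with no rows with 0
summary = counts.reindex(selected, fill_value=0).to_frame("count")
summary["share"] = summary["count"] / len(toy)
print(summary)
```

Using `reindex` with `fill_value=0` keeps a category in the summary even if it never occurs, which makes gaps visible instead of silently dropping them.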
    
Each crime is recorded with 22 columns, listed below:

In [None]:
def print_row_headers(df):
    # Print one column name per line
    for h in df.columns:
        print(h)

print_row_headers(df)

## For a crime prediction project, we will treat this data set only as the "Y", or the target variable.

That means we only need to keep the fields indicating that a crime occurred. Let's drop everything else.

In [None]:
keep_list = ["Case Number", "Date", "Block", "Primary Type", "Description", "Location Description", "Arrest", "Year", "Location"]

In [None]:
reduced_df = df[keep_list]

## Now let's remove the rows with missing values in those reduced columns

In [None]:
# Count the missing values in each of the columns we kept
for h in list(reduced_df):
    print(h)
    print(reduced_df[h].isna().sum())

It appears that we have 426 rows missing a location description and 935 rows missing a location. Before we drop them, we need to make sure the missing values are not correlated with our outcome variable, i.e., the type of crime.

In [None]:
missing_location = df.loc[df["Location Description"].isna()]

In [None]:
missing_location

Most of the rows missing the location description appear to be financial crimes, e.g., financial identity theft. This indicates that if we wanted to build a prediction model for financial crimes, we would not be able to rely on the location description, because its absence is closely correlated with the outcome variable. Dropping those rows would introduce sample bias into our model.
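
Rather than reading the rows by eye, the same observation can be quantified by computing the share of missing values per crime type. The snippet below is a minimal sketch on a made-up frame; the rows and the `DECEPTIVE PRACTICE` label are illustrative assumptions, not values taken from the real file.

```python
import pandas as pd

# Toy stand-in for the crimes table; these rows are invented for illustration.
toy = pd.DataFrame({
    "Primary Type": ["DECEPTIVE PRACTICE", "DECEPTIVE PRACTICE", "THEFT", "THEFT", "BATTERY"],
    "Location Description": [None, None, "STREET", "APARTMENT", "SIDEWALK"],
})

# True where the location description is missing
is_missing = toy["Location Description"].isna()

# The mean of a boolean flag within each group is the share of missing rows
missing_rate = (
    is_missing.groupby(toy["Primary Type"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_rate)
```

A crime type with a rate far above the others is one whose rows should not be dropped without thought.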

For this demonstration we are only going to model the following criminal activities:
- Kidnapping
- Sexual Assault
- Homicide
- Motor Vehicle Theft
- Weapons Violations
- Battery
- Theft

Because a missing Location Description is not correlated with any of these categories, we are good to drop the rows that are missing Location Description. This will allow us to utilize the rest of the information contained in the Location Description column, without introducing bias into our model.
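
One way to sanity-check this claim is to count how many of the rows we plan to model would actually be lost. This is only a sketch under assumptions: the frame is a toy example, and the labels in `target_types` are guesses at how the seven categories are spelled in `Primary Type`.

```python
import pandas as pd

# Toy stand-in for the crimes table; these rows are invented for illustration.
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", "DECEPTIVE PRACTICE", "HOMICIDE", "THEFT"],
    "Location Description": ["STREET", None, None, "ALLEY", "STORE"],
})

# Assumed spellings of the categories we intend to model
target_types = [
    "KIDNAPPING", "CRIM SEXUAL ASSAULT", "HOMICIDE", "MOTOR VEHICLE THEFT",
    "WEAPONS VIOLATION", "BATTERY", "THEFT",
]

in_target = toy["Primary Type"].isin(target_types)
missing = toy["Location Description"].isna()

# Rows that are both in a modeled category and missing the description
lost_target_rows = int((in_target & missing).sum())
total_target_rows = int(in_target.sum())
print(f"{lost_target_rows} of {total_target_rows} modeled rows would be dropped")
```

If the lost share is small and spread evenly across the modeled categories, dropping the rows is unlikely to distort the target.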

In [None]:
# Keep only the rows that have a Location Description
df = df[df["Location Description"].notna()]

In [None]:
# If this returns a 0, then our row removal step was successful
df["Location Description"].isna().sum()

Moving on to the Location column. We have 935 rows that are missing locations, and we need to decide whether to simply drop them. To make that decision, we need to know if they are correlated with the outcome variable, crime.

In [None]:
missing_geo = df.loc[df["Location"].isna()]

In [None]:
missing_geo["Primary Type"].value_counts()

In [None]:
missing_geo["Description"].value_counts()

It appears that most of our 900+ rows with missing values for location are thefts under $500. That also happens to be our largest prediction category, with almost 17,000 records. Given this magnitude, I am not concerned about introducing downward bias into the model against petty theft. We'll drop those rows as well.
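
To put a number on "this magnitude", one option is to compare category counts before and after the drop. The sketch below uses a made-up frame; the `Description` labels and coordinates are illustrative only.

```python
import pandas as pd

# Toy stand-in for the crimes table; these rows are invented for illustration.
toy = pd.DataFrame({
    "Description": ["$500 AND UNDER"] * 4 + ["SIMPLE"] * 2,
    "Location": [
        "(41.88, -87.63)", None, "(41.90, -87.65)", "(41.70, -87.60)",
        "(41.80, -87.70)", "(41.75, -87.62)",
    ],
})

before = toy["Description"].value_counts()
after = toy.dropna(subset=["Location"])["Description"].value_counts()

# Share of each category that disappears when rows without a Location are dropped
share_dropped = 1 - after.reindex(before.index, fill_value=0) / before
print(share_dropped)
```

A small share for the affected category supports dropping the rows; a large one would call for a different strategy, such as keeping the rows and treating the location as unknown.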

In [None]:
df["Description"].value_counts()

In [None]:
df = df[df["Location"].notna()]

In [None]:
# If this returns a 0, our removal was successful
df["Location"].isna().sum()

## Great! We've reduced our data set and removed the empty rows, let's write that to a csv.

In [None]:
# Write only the columns in keep_list, so the file really is the reduced data set
df[keep_list].to_csv("../Data/crimes_2018_reduced.csv")

Another very helpful step is wrapping all of these steps as a single Python function, so we can more easily use it later.

In [None]:
def main(f_name, out_name="../Data/crimes_2018_reduced.csv"):

    df = pd.read_csv(f_name, index_col="ID")

    # keep a subset of columns
    keep_list = ["Case Number", "Date", "Block", "Primary Type", "Description", "Location Description", "Arrest", "Year", "Location"]
    df = df[keep_list]

    # drop rows that are missing Location Description
    df = df[df["Location Description"].notna()]

    # drop rows that are missing Location (geo coordinates)
    df = df[df["Location"].notna()]

    # write to disk
    df.to_csv(out_name)

main("../Data/crimes_2018.csv")
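
For reuse and testing, it can help to separate the cleaning logic from file I/O. The following is a minimal sketch of that idea, not a replacement for `main`: the helper name `clean_crimes` and the toy rows are hypothetical, and `dropna(subset=...)` is swapped in as a compact equivalent of the two filtering steps.

```python
import pandas as pd

def clean_crimes(df, keep_cols, required_cols):
    """Keep `keep_cols`, then drop rows missing a value in any of `required_cols`."""
    return df[keep_cols].dropna(subset=required_cols)

# Toy stand-in for the crimes table; these rows are invented for illustration.
toy = pd.DataFrame(
    {
        "Primary Type": ["THEFT", "BATTERY", "THEFT"],
        "Location Description": ["STREET", None, "STORE"],
        "Location": ["(41.88, -87.63)", "(41.90, -87.65)", None],
        "Ward": [1, 2, 3],
    },
    index=pd.Index([101, 102, 103], name="ID"),
)

cleaned = clean_crimes(
    toy,
    keep_cols=["Primary Type", "Location Description", "Location"],
    required_cols=["Location Description", "Location"],
)
print(cleaned)
```

Because the helper takes and returns a DataFrame, it can be checked on a small frame like this one before being pointed at the full file.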