In [28]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

The first step is to read in all of the CSV files and combine them into one dataset.

In [29]:
df_2001_to_2004 = pd.read_csv('crimes-in-chicago/Chicago_Crimes_2001_to_2004.csv', error_bad_lines=False)

df_2005_to_2007 = pd.read_csv('crimes-in-chicago/Chicago_Crimes_2005_to_2007.csv', error_bad_lines=False)

df_2008_to_2011 = pd.read_csv('crimes-in-chicago/Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False)

df_2012_to_2017 = pd.read_csv('crimes-in-chicago/Chicago_Crimes_2012_to_2017.csv', error_bad_lines=False)

df_all_crimes = pd.concat([df_2001_to_2004, df_2005_to_2007, df_2008_to_2011, df_2012_to_2017], ignore_index=False, 
                          axis=0)


del(df_2001_to_2004,df_2005_to_2007,df_2008_to_2011,df_2012_to_2017)

b'Skipping line 1513591: expected 23 fields, saw 24\n'
b'Skipping line 533719: expected 23 fields, saw 24\n'
b'Skipping line 1149094: expected 23 fields, saw 41\n'
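
The `error_bad_lines` argument used above was deprecated in pandas 1.3 and removed in pandas 2.0, where `on_bad_lines` takes its place. A minimal sketch of the same load-and-concatenate step, assuming pandas 1.3 or later; the helper name `load_crime_csvs` and the in-memory sample rows are hypothetical stand-ins for the real files:

```python
import io
import pandas as pd

def load_crime_csvs(sources):
    """Read each CSV source, skipping malformed rows, and stack the results."""
    frames = [pd.read_csv(src, on_bad_lines='skip') for src in sources]
    return pd.concat(frames, axis=0, ignore_index=True)

# Tiny in-memory stand-ins for the yearly files (made-up rows, not real data).
# The second row of part_a has an extra field, so the parser skips it.
part_a = io.StringIO("ID,Primary Type\n1,THEFT\n2,BATTERY,EXTRA\n")
part_b = io.StringIO("ID,Primary Type\n3,ASSAULT\n")

sample = load_crime_csvs([part_a, part_b])
print(sample)
```

Note that `ignore_index=True` gives the combined frame a fresh 0..n-1 index, whereas the cell above keeps each file's own row numbers.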


Now that we have read in all of the data from 2001 to 2017, let's take a look at it and see what it consists of.

In [30]:
df_all_crimes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7941282 entries, 0 to 1456713
Data columns (total 23 columns):
Unnamed: 0              int64
ID                      int64
Case Number             object
Date                    object
Block                   object
IUCR                    object
Primary Type            object
Description             object
Location Description    object
Arrest                  bool
Domestic                bool
Beat                    int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                object
X Coordinate            float64
Y Coordinate            object
Year                    float64
Updated On              object
Latitude                object
Longitude               float64
Location                object
dtypes: bool(2), float64(6), int64(3), object(12)
memory usage: 1.3+ GB


Now let's clean up the data with a few quick steps:
1. Remove duplicate instances
2. Remove all unwanted features
3. Clean up data types of wanted fields


The first step is to remove any duplicate entries in the dataset. The unique identifier for this dataset is 'ID', so
we will drop any instances that share an 'ID', keeping the first occurrence of each.

In [31]:
# 1. Remove duplicate instances
print("Instances before removing duplicates: " + str(len(df_all_crimes)))
df_all_crimes.drop_duplicates(subset=['ID'], inplace=True)
print("Instances after removing duplicates: " + str(len(df_all_crimes)))

Instances before removing duplicates: 7941282
Instances after removing duplicates: 6170812
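
Before dropping, it can help to look at which rows actually share an 'ID'. A small sketch on a made-up frame (the values are hypothetical, not taken from the dataset) using `duplicated` alongside `drop_duplicates`:

```python
import pandas as pd

# Hypothetical toy data with one repeated ID.
toy = pd.DataFrame({
    'ID': [10, 11, 10, 12],
    'Primary Type': ['THEFT', 'BATTERY', 'THEFT', 'ASSAULT'],
})

# keep=False marks every row whose ID occurs more than once.
repeated = toy[toy.duplicated(subset=['ID'], keep=False)]
print(repeated)

# drop_duplicates keeps the first occurrence of each ID by default.
deduped = toy.drop_duplicates(subset=['ID'])
print(len(toy), '->', len(deduped))
```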


As you can see, this removed 1,770,470 duplicate instances. Now that the dataset contains only the instances that we
need, let's clean up the features.

To clean up the features we will start by removing all of the features that are not needed in the dataset. These
include duplicate features and features that don't add value to the dataset.

In [32]:
print("Number of features before:" + str(len(df_all_crimes.keys())))
# 'ID' is not dropped here: it stays in the dataset as the record identifier
unwantedFeatures = ['Unnamed: 0', 'Case Number', 'IUCR', 'Beat', 'Ward', 'Community Area', 'FBI Code',
                    'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude']
df_all_crimes.drop(unwantedFeatures, axis=1, inplace=True)

print("Number of features after:" + str(len(df_all_crimes.keys())))


Number of features before:23
Number of features after:10
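
An alternative to listing the columns to drop is to name the ones to keep, which makes the resulting schema explicit. A minimal sketch on a toy frame; the 'Beat' and 'Ward' values are made up for illustration:

```python
import pandas as pd

# Toy frame with a few of the dataset's column names.
toy = pd.DataFrame({
    'ID': [4786321, 4676906],
    'Date': ['01/01/2004 12:01:00 AM', '03/01/2003 12:00:00 AM'],
    'Beat': [1, 2],   # hypothetical values
    'Ward': [1.0, 2.0],  # hypothetical values
})

wanted = ['ID', 'Date']
# .copy() avoids later edits touching a view of the original frame.
trimmed = toy[wanted].copy()
print(list(trimmed.columns))
```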


Now that we have removed the unwanted features, let's take a look at the dataset and see if the remaining features
need any more work.

In [33]:
df_all_crimes.info()
df_all_crimes.head(3)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6170812 entries, 0 to 1456713
Data columns (total 10 columns):
ID                      int64
Date                    object
Block                   object
Primary Type            object
Description             object
Location Description    object
Arrest                  bool
Domestic                bool
District                float64
Location                object
dtypes: bool(2), float64(1), int64(1), object(6)
memory usage: 435.5+ MB


,ID,Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District,Location
0,4786321,01/01/2004 12:01:00 AM,082XX S COLES AVE,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,4.0,
1,4676906,03/01/2003 12:00:00 AM,004XX W 42ND PL,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,9.0,"(41.817229156, -87.637328162)"
2,4789749,06/20/2004 11:00:00 AM,025XX N KIMBALL AVE,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,14.0,


Our dataset is now much cleaner, but we are not quite done with it yet. Before we can continue to evaluate it we
first have to convert the 'Date' feature to a datetime type. We also want to remove any null (NaN) values in our
dataset. I am going to completely remove a record if it has a null value in any field. This is an extremely large
dataset, and dropping these records (85,748 of 6,170,812, about 1.4%) should have little effect on any model built from it.

In [34]:
df_all_crimes['Date'] = pd.to_datetime(df_all_crimes['Date'], format='%m/%d/%Y %I:%M:%S %p')
# Need to do this for plotting over years
df_all_crimes.set_index('Date', inplace=True)

df_all_crimes = df_all_crimes.dropna()
df_all_crimes.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6085064 entries, 2003-03-01 00:00:00 to 2016-05-03 23:38:00
Data columns (total 9 columns):
ID                      int64
Block                   object
Primary Type            object
Description             object
Location Description    object
Arrest                  bool
Domestic                bool
District                float64
Location                object
dtypes: bool(2), float64(1), int64(1), object(5)
memory usage: 383.0+ MB
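
To see where the nulls come from before dropping them, the missing values can be counted per column. A small sketch using the three rows shown in the preview above (only a few columns are reproduced, and the frame name `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': ['01/01/2004 12:01:00 AM', '03/01/2003 12:00:00 AM', '06/20/2004 11:00:00 AM'],
    'District': [4.0, 9.0, 14.0],
    'Location': [None, '(41.817229156, -87.637328162)', None],
})

# Same 12-hour format string as in the cell above.
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y %I:%M:%S %p')

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), '->', len(cleaned))
```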


Okay, there we go: we have cleaned up our dataset and can now start looking at the features and analysing how they
interact with each other.

But first let's export our cleaned-up dataset so that we can use it in other notebooks.

In [35]:
df_all_crimes.to_csv('cleaned_chicago_crime.csv')
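
When the file is loaded in another notebook, the 'Date' index comes back as plain text unless it is parsed again. A minimal round-trip sketch, using an in-memory buffer in place of the real file and a one-row frame built from the preview above:

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {'ID': [4676906], 'Primary Type': ['OTHER OFFENSE']},
    index=pd.DatetimeIndex(['2003-03-01 00:00:00'], name='Date'),
)

# Write to a buffer instead of 'cleaned_chicago_crime.csv' for this sketch.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores 'Date' as the index; parse_dates=True parses that index.
restored = pd.read_csv(buffer, index_col='Date', parse_dates=True)
print(restored.index.dtype)
```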
