# STA141C Final Project: Crime in Chicago
## A clustering analysis and visualization for crime in Chicago from 2005-2017
### By: Joe Akanesuvan, Navid Al Nadvi, Sailesh Patnala

This analysis aims to identify the components that can be used to determine the crime type in Chicago. We hope that, through a thorough analysis of the dataset, we will be able to accurately predict the type of a crime given a set of features. By doing so, we hope to identify the type of crime most common in a given area and, ultimately, to lessen the impact of crime in Chicago.

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
import plotly.plotly as py
import plotly.graph_objs as go
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline

In [14]:
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot

In [15]:
init_notebook_mode(connected=True)

In [16]:
cf.go_offline()
cf.set_config_file(theme='ggplot')


In [17]:
crimes1 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv',error_bad_lines=False)
crimes2 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv',error_bad_lines=False)
crimes3 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv',error_bad_lines=False)
crimes = pd.concat([crimes1, crimes2, crimes3], ignore_index=False, axis=0)
crimes.drop_duplicates(subset=['ID', 'Case Number'], inplace=True)

b'Skipping line 533719: expected 23 fields, saw 24\n'
b'Skipping line 1149094: expected 23 fields, saw 41\n'


## Understanding the Dataset

In [18]:
crimes.head()

Unnamed: 0.1,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,0,4673626,HM274058,04/02/2006 01:00:00 PM,055XX N MANGO AVE,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,...,45.0,11.0,26,1136872.0,1936499.0,2006,04/15/2016 08:55:02 AM,41.981913,-87.771996,"(41.981912692, -87.771996382)"
1,1,4673627,HM202199,02/26/2006 01:40:48 PM,065XX S RHODES AVE,2017,NARCOTICS,MANU/DELIVER:CRACK,SIDEWALK,True,...,20.0,42.0,18,1181027.0,1861693.0,2006,04/15/2016 08:55:02 AM,41.775733,-87.61192,"(41.775732538, -87.611919814)"
2,2,4673628,HM113861,01/08/2006 11:16:00 PM,013XX E 69TH ST,051A,ASSAULT,AGGRAVATED: HANDGUN,OTHER,False,...,5.0,69.0,04A,1186023.0,1859609.0,2006,04/15/2016 08:55:02 AM,41.769897,-87.593671,"(41.769897392, -87.593670899)"
3,4,4673629,HM274049,04/05/2006 06:45:00 PM,061XX W NEWPORT AVE,0460,BATTERY,SIMPLE,RESIDENCE,False,...,38.0,17.0,08B,1134772.0,1922299.0,2006,04/15/2016 08:55:02 AM,41.942984,-87.780057,"(41.942984005, -87.780056951)"
4,5,4673630,HM187120,02/17/2006 09:03:14 PM,037XX W 60TH ST,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,True,...,13.0,65.0,18,1152412.0,1864560.0,2006,04/15/2016 08:55:02 AM,41.784211,-87.716745,"(41.784210853, -87.71674491)"


In [19]:
crimes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4336556 entries, 0 to 1456713
Data columns (total 23 columns):
Unnamed: 0              int64
ID                      int64
Case Number             object
Date                    object
Block                   object
IUCR                    object
Primary Type            object
Description             object
Location Description    object
Arrest                  bool
Domestic                bool
Beat                    int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                object
X Coordinate            float64
Y Coordinate            float64
Year                    int64
Updated On              object
Latitude                float64
Longitude               float64
Location                object
dtypes: bool(2), float64(7), int64(4), object(10)
memory usage: 736.1+ MB


Some columns are not useful for any relevant visualization or clustering algorithm. We remove the columns that are redundant or too specific.

In [20]:
crimes.drop(['Unnamed: 0', 'Case Number', 'IUCR', 'X Coordinate', 'Y Coordinate',
             'Updated On','FBI Code' ,'Ward', 'Location', 'District', 'Block', 'Year', 'Beat'], inplace=True, axis=1)

Since there is a Date column, we convert it to the corresponding pandas datetime format and use it as the index.

In [21]:
crimes.Date = pd.to_datetime(crimes.Date, format='%m/%d/%Y %I:%M:%S %p')
crimes.index = pd.DatetimeIndex(crimes.Date)
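
As a small illustration of the conversion above, the sketch below parses two timestamps in the dataset's `MM/DD/YYYY hh:mm:ss AM/PM` format on a toy frame and reads the year, month, and hour back from the resulting index. The name `toy` is illustrative and not part of the project code.

```python
import pandas as pd

# Toy frame with timestamps in the same string format as the Chicago data.
toy = pd.DataFrame({"Date": ["04/02/2006 01:00:00 PM", "01/08/2006 11:16:00 PM"]})

# %I is the 12-hour clock and %p the AM/PM marker; an explicit format
# avoids per-row format inference.
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y %I:%M:%S %p")
toy.index = pd.DatetimeIndex(toy["Date"])

# Calendar components are then available directly on the index.
years = list(toy.index.year)
months = list(toy.index.month)
hours = list(toy.index.hour)
print(years, months, hours)
```

These index attributes are what the later `groupby` calls on `crimes.index.year`, `crimes.index.month`, and `crimes.index.dayofweek` rely on.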


In [22]:
crimes.head()

Date,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,Community Area,Latitude,Longitude
2006-04-02 13:00:00,4673626,2006-04-02 13:00:00,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,False,11.0,41.981913,-87.771996
2006-02-26 13:40:48,4673627,2006-02-26 13:40:48,NARCOTICS,MANU/DELIVER:CRACK,SIDEWALK,True,False,42.0,41.775733,-87.61192
2006-01-08 23:16:00,4673628,2006-01-08 23:16:00,ASSAULT,AGGRAVATED: HANDGUN,OTHER,False,False,69.0,41.769897,-87.593671
2006-04-05 18:45:00,4673629,2006-04-05 18:45:00,BATTERY,SIMPLE,RESIDENCE,False,False,17.0,41.942984,-87.780057
2006-02-17 21:03:14,4673630,2006-02-17 21:03:14,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,True,False,65.0,41.784211,-87.716745


## Visualizing the Data

In [23]:
crime_years = pd.DataFrame(crimes.groupby([crimes.index.year]).size().reset_index(name="count"))

In [24]:
crime_years

Unnamed: 0,Date,count
0,2005,453666
1,2006,448037
2,2007,436924
3,2008,426964
4,2009,392556
5,2010,370140
6,2011,351555
7,2012,335670
8,2013,306703
9,2014,274527


In [25]:
crime_years.iplot(kind="line", x='Date', y='count',
                 xTitle='Year', yTitle='Total Crimes',title='Total of Crimes per Year')



In [26]:
crime_types_count = pd.DataFrame(crimes.groupby(["Primary Type"]).size().reset_index(name="count"))
crime_types_count.sort_values('count', ascending=True, inplace=True)

In [27]:
crime_types_count.tail()

Unnamed: 0,Primary Type,count
24,OTHER OFFENSE,264200
17,NARCOTICS,473790
6,CRIMINAL DAMAGE,499426
2,BATTERY,778164
32,THEFT,907831


In [28]:
data = [
    go.Bar(
        x=crime_types_count['count'],
        y=crime_types_count['Primary Type'],
        orientation='h',
    )
]

layout = go.Layout(
    title='Total Crime Types',
    yaxis=dict(
        title='Crime Types',
        tickfont=dict(
        size=8,
        color='black',
        ),
    ),
    xaxis=dict(
        title='Count',
    ),
    margin=go.Margin(
        l=200,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='total-crime-types')

In [29]:
crime_location_count = pd.DataFrame(crimes.groupby(["Location Description"]).size().reset_index(name="count"))
crime_location_count.sort_values('count', ascending=True, inplace=True)
crime_location_count.shape

(161, 2)

In [30]:
crime_location_count.tail()

Unnamed: 0,Location Description,count
110,OTHER,158276
139,SIDEWALK,482153
17,APARTMENT,489249
123,RESIDENCE,713880
143,STREET,1079497


In [31]:
crime_location_count.head()

Unnamed: 0,Location Description,count
160,YMCA,1
44,CLEANERS/LAUNDROMAT,1
56,"CTA ""L"" TRAIN",1
42,CHURCH PROPERTY,1
40,CHA PLAY LOT,1


Since many location descriptions have very low counts, we will only take those with a high number of occurrences into account.

In [32]:
crime_location_count.drop(crime_location_count.index[0:136], inplace=True)
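
Dropping by positional slice works only while the frame has exactly 161 rows sorted in ascending order. A sketch of an equivalent selection that states the cutoff directly is shown below on a toy frame; `counts`, `top_n`, and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for crime_location_count (values are made up for illustration).
counts = pd.DataFrame({
    "Location Description": ["YMCA", "STREET", "ALLEY", "RESIDENCE", "SIDEWALK"],
    "count": [1, 50, 7, 30, 20],
})

top_n = 3  # hypothetical cutoff; the notebook keeps 25 of 161 locations

# nlargest returns rows in descending order of count; re-sorting ascending
# matches the order used for the horizontal bar chart.
top = counts.nlargest(top_n, "count").sort_values("count", ascending=True)
top_names = top["Location Description"].tolist()
print(top_names)
```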

In [33]:
data = [
    go.Bar(
        x=crime_location_count['count'],
        y=crime_location_count['Location Description'],
        orientation='h',
    )
]

layout = go.Layout(
    title='Total Crimes by Location',
    yaxis=dict(
        title='Location Description',
        tickfont=dict(
        size=8,
        color='black',
        ),
    ),
    xaxis=dict(
        title='Count',
    ),
    margin=go.Margin(
        l=200,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='total-crime-location')

Since only the 25 most frequent location descriptions will be taken into account during the analysis, the crimes dataset needs to be updated to reflect that change.
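
One possible way to apply such a restriction to the full dataset is to keep only rows whose location is among the most frequent ones. The sketch below shows the idea on a toy frame; `toy_crimes`, `top_n`, and the values are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

# Toy stand-in for the crimes frame (values are made up for illustration).
toy_crimes = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "ASSAULT", "THEFT", "NARCOTICS"],
    "Location Description": ["STREET", "RESIDENCE", "STREET", "YMCA", "RESIDENCE", "ALLEY"],
})

top_n = 2  # hypothetical cutoff; the text above uses 25

# value_counts sorts by frequency, so the first top_n labels are the most common.
top_locations = toy_crimes["Location Description"].value_counts().head(top_n).index

# Keep only the rows whose location is among the most frequent ones.
filtered = toy_crimes[toy_crimes["Location Description"].isin(top_locations)]
print(filtered)
```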

In [34]:
crime_day_count = pd.DataFrame(crimes.groupby([crimes.index.dayofweek]).size().reset_index(name="count"))
days = ["Mon", "Tue", "Wed", "Thurs", "Fri", "Sat", "Sun"]
crime_day_count['index'] = days
crime_day_count

Unnamed: 0,Date,count,index
0,0,609221,Mon
1,1,621967,Tue
2,2,626060,Wed
3,3,618657,Thurs
4,4,654870,Fri
5,5,619533,Sat
6,6,586248,Sun


In [35]:
crime_day_count.iplot(kind="bar", x='index', y='count',
                 xTitle='Day of the Week', yTitle='Total Crimes',title='Total of Crimes per Day')

In [36]:
crime_month_count = pd.DataFrame(crimes.groupby([crimes.index.month]).size().reset_index(name="count"))
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
crime_month_count['index'] = months
crime_month_count

Unnamed: 0,Date,count,index
0,1,348282,Jan
1,2,292889,Feb
2,3,357875,Mar
3,4,357918,Apr
4,5,388137,May
5,6,385351,Jun
6,7,402970,Jul
7,8,398654,Aug
8,9,372780,Sep
9,10,375799,Oct


In [37]:
crime_month_count.iplot(kind="bar", x='index', y='count',
                 xTitle='Month', yTitle='Total Crimes',title='Total of Crimes per Month')

In [38]:
crime_count = pd.DataFrame(crimes.groupby(["Primary Type", crimes.index]).size().reset_index(name="count"))
crime_count.sort_values('Primary Type', ascending=True, inplace=True)
crime_count.reset_index(drop=True, inplace=True)
type_count = crime_count.pivot_table(index='Date' ,columns='Primary Type', values='count').reset_index()
type_count.fillna(0,inplace=True)
type_count.head()

Primary Type,Date,ARSON,ASSAULT,BATTERY,BURGLARY,CONCEALED CARRY LICENSE VIOLATION,CRIM SEXUAL ASSAULT,CRIMINAL DAMAGE,CRIMINAL TRESPASS,DECEPTIVE PRACTICE,...,OTHER OFFENSE,PROSTITUTION,PUBLIC INDECENCY,PUBLIC PEACE VIOLATION,RITUALISM,ROBBERY,SEX OFFENSE,STALKING,THEFT,WEAPONS VIOLATION
0,2005-01-01 00:00:00,0.0,1.0,5.0,3.0,0.0,8.0,7.0,0.0,11.0,...,15.0,0.0,0.0,0.0,0.0,0.0,16.0,0.0,71.0,3.0
1,2005-01-01 00:01:00,0.0,1.0,3.0,2.0,0.0,13.0,8.0,0.0,14.0,...,6.0,0.0,0.0,0.0,0.0,0.0,11.0,1.0,84.0,1.0
2,2005-01-01 00:01:42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,2005-01-01 00:02:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,2005-01-01 00:03:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0


### Preprocessing: We will encode the labels so they can be normalized


In [43]:
le = preprocessing.LabelEncoder()
le.fit(crimes['Primary Type'].unique())
crimes['Primary Type'] = le.transform(crimes['Primary Type'])
locationEncoder = preprocessing.LabelEncoder()
locationEncoder.fit(crimes['Location Description'].fillna('0'))
crimes['Location Description'] = locationEncoder.transform(crimes['Location Description'].fillna('0'))
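
To make the encoding step concrete, here is a minimal sketch of `LabelEncoder` on a toy list of labels (the list is illustrative). The encoder assigns integer codes by the sorted order of the unique labels, so the codes are identifiers rather than meaningful magnitudes.

```python
from sklearn import preprocessing

# Toy labels; the notebook fits on crimes['Primary Type'] instead.
labels = ["THEFT", "BATTERY", "ASSAULT", "THEFT"]

le = preprocessing.LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(le.classes_))
print(list(codes))

# inverse_transform recovers the original strings from the codes.
decoded = list(le.inverse_transform(codes))
```

Keeping the fitted encoder around makes it possible to map numeric codes in later outputs back to crime types with `inverse_transform`.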

In [54]:
crimes['Arrest'] = crimes.Arrest.fillna(False).astype(int)
crimes['Domestic'] = crimes.Domestic.astype(int)
crimes['Month'] = crimes.index.month
crimes['Day'] = crimes.index.day
crimes['Hour'] = crimes.index.hour
crimes.reset_index(drop=True, inplace=True)

In [56]:
crimes.head()

Unnamed: 0,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,Community Area,Latitude,Longitude,Month,Day,Hour
0,4673626,2006-04-02 13:00:00,24,HARASSMENT BY TELEPHONE,124,0,0,11.0,41.981913,-87.771996,4,2,13
1,4673627,2006-02-26 13:40:48,17,MANU/DELIVER:CRACK,140,1,0,42.0,41.775733,-87.61192,2,26,13
2,4673628,2006-01-08 23:16:00,1,AGGRAVATED: HANDGUN,111,0,0,69.0,41.769897,-87.593671,1,8,23
3,4673629,2006-04-05 18:45:00,2,SIMPLE,124,0,0,17.0,41.942984,-87.780057,4,5,18
4,4673630,2006-02-17 21:03:14,17,POSS: CANNABIS 30GMS OR LESS,16,1,0,65.0,41.784211,-87.716745,2,17,21


In [66]:
X = crimes.filter(['Month', 'Day', 'Hour', 'Location Description', 'Community Area', 'Arrest', 'Domestic']).fillna(0)

In [72]:
X.head()

Unnamed: 0,Month,Day,Hour,Location Description,Community Area,Arrest,Domestic
0,4,2,13,124,11.0,0,0
1,2,26,13,140,42.0,1,0
2,1,8,23,111,69.0,0,0
3,4,5,18,124,17.0,0,0
4,2,17,21,16,65.0,1,0


In [None]:
#num_clusters = len(crimes['Primary Type'].unique())
#kmeans = KMeans(n_clusters=num_clusters).fit(X.values, crimes['Primary Type'].values)
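
The commented-out cell above refers to `KMeans`, which is not among the imports at the top of the notebook. A minimal sketch of how such a fit could look is given below, using synthetic data in place of `X`. The names `X_toy` and `X_scaled`, the scaling step, and the choice of two clusters are assumptions for illustration, not the authors' method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X: two well-separated groups of points.
rng = np.random.RandomState(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 3)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 3)),
])

# Scale features so no single column dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X_toy)

# KMeans is unsupervised: fit uses only the feature matrix
# (a y argument is accepted for API consistency but ignored).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
cluster_labels = kmeans.labels_
print(np.bincount(cluster_labels))
```

Because the `y` argument is ignored, passing `crimes['Primary Type'].values` to `fit` would not make the clusters line up with crime types; any such correspondence would have to be checked after the fact.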