# Assessment - Arrests in Chicago

## Purpose

This is a quick exploratory data analysis (EDA) of crime statistics in Chicago.

My goal is to ascertain (relatively quickly) whether there is a useful or interesting "signal" or pattern when treating "Arrest" as the target. That is, can I use the remaining columns as features to provide meaningful predictions or insight into when an incident leads to an arrest?

## Data Import

In [1]:
import pandas as pd
import numpy as np

Read in the data, either from the original CSV or from a previously pickled copy.

In [2]:
filename = '/mnt/Data/Projects/McNulty/Crimes_-_2001_to_present.csv'
df_pickle = '/mnt/Data/Projects/McNulty/Crimes_00.pkl'
#df = pd.read_csv(filename)
# Immediately pickle
#df.to_pickle(df_pickle, compression='gzip')

df = pd.read_pickle(df_pickle, compression='gzip')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6531511 entries, 0 to 6531510
Data columns (total 22 columns):
ID                      int64
Case Number             object
Date                    object
Block                   object
IUCR                    object
Primary Type            object
Description             object
Location Description    object
Arrest                  bool
Domestic                bool
Beat                    int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                object
X Coordinate            float64
Y Coordinate            float64
Year                    int64
Updated On              object
Latitude                float64
Longitude               float64
Location                object
dtypes: bool(2), float64(7), int64(3), object(10)
memory usage: 1009.1+ MB


## Exploratory Data Analysis

In [4]:
df['Arrest'].value_counts() / len(df)

False    0.720074
True     0.279926
Name: Arrest, dtype: float64

OK. This doesn't appear too bad. The classes are tilted somewhat (about 28% arrests versus 72% non-arrests), but not terribly so.
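
To put a number on that tilt: a model that always predicts the majority class is already right as often as that class occurs, so that share is the accuracy any real model has to beat. A minimal sketch on a toy stand-in for `df` (the ten rows are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the real frame; values are made up for illustration.
toy = pd.DataFrame({'Arrest': [True, False, False, True, False,
                               False, False, True, False, False]})

# Share of each class; normalize=True is equivalent to dividing by len(toy).
class_share = toy['Arrest'].value_counts(normalize=True)

# Accuracy of a model that always predicts the most common class.
majority_baseline = class_share.max()
print(class_share)
print(majority_baseline)
```

On the real data the same two lines would give a baseline of roughly 0.72.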

In [6]:
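# numeric_only=True is needed on pandas >= 2.0, which no longer drops object columns silently
df.corr(numeric_only=True)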

,ID,Arrest,Domestic,Beat,District,Ward,Community Area,X Coordinate,Y Coordinate,Year,Latitude,Longitude
ID,1.0,-0.049223,0.040213,-0.033163,-0.003145,0.016103,-0.003419,-0.002427,-0.007603,0.98907,-0.007589,-0.002491
Arrest,-0.049223,1.0,-0.070116,-0.015722,-0.016064,-0.015956,-0.006409,-0.029899,0.004693,-0.048823,0.004806,-0.030257
Domestic,0.040213,-0.070116,1.0,-0.040943,-0.037776,-0.050487,0.07343,0.004795,-0.073722,0.042117,-0.073532,0.003708
Beat,-0.033163,-0.015722,-0.040943,1.0,0.936411,0.641422,-0.504622,-0.469228,0.608752,-0.035029,0.608905,-0.467222
District,-0.003145,-0.016064,-0.037776,0.936411,1.0,0.692727,-0.497407,-0.525445,0.616528,-0.004237,0.616887,-0.524162
Ward,0.016103,-0.015956,-0.050487,0.641422,0.692727,1.0,-0.531834,-0.43076,0.618278,0.015458,0.618212,-0.428223
Community Area,-0.003419,-0.006409,0.07343,-0.504622,-0.497407,-0.531834,1.0,0.248688,-0.740303,-0.003856,-0.739158,0.241486
X Coordinate,-0.002427,-0.029899,0.004795,-0.469228,-0.525445,-0.43076,0.248688,1.0,-0.386029,-0.001732,-0.387373,0.99977
Y Coordinate,-0.007603,0.004693,-0.073722,0.608752,0.616528,0.618278,-0.740303,-0.386029,1.0,-0.007426,0.999993,-0.385046
Year,0.98907,-0.048823,0.042117,-0.035029,-0.004237,0.015458,-0.003856,-0.001732,-0.007426,1.0,-0.007415,-0.001786


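Alright... THIS is concerning. No numeric column has a correlation with "Arrest" larger in magnitude than about 0.07 (Domestic, at -0.070). Keep in mind, though, that Beat, District, Ward and Community Area are numeric codes for categories, so a linear correlation says little about them. There may be more we can explore.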

There's only so much I can do with the entire data set before I address some memory management issues.
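
One common way to ease the memory pressure, assuming the low-cardinality `object` columns (for example `Primary Type` or `FBI Code`) are a large part of the cost, is to store them with the pandas `category` dtype. This is only a sketch on a toy frame: the helper name `shrink_objects` and the 50% cardinality threshold are my own choices for illustration, not something this notebook has run.

```python
import pandas as pd

def shrink_objects(frame, max_unique_ratio=0.5):
    """Return a copy with low-cardinality object columns stored as category."""
    out = frame.copy()
    for col in out.select_dtypes(include='object').columns:
        # Only convert columns where values repeat a lot.
        if out[col].nunique() / max(len(out), 1) <= max_unique_ratio:
            out[col] = out[col].astype('category')
    return out

# Toy frame (made-up values); dtype=object mimics what read_csv gave above.
toy = pd.DataFrame({
    'Primary Type': pd.Series(['THEFT', 'BATTERY', 'THEFT', 'ASSAULT'] * 2500,
                              dtype=object),
    'Arrest': [True, False, False, True] * 2500,
})
small = shrink_objects(toy)

# deep=True counts the Python string objects, not just the pointers.
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(before, after)
```

Columns that are nearly unique per row (such as `Case Number`) would not benefit and are left alone by the threshold.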

But, let's be smart about this. Since the target is already a boolean, I don't need a full-blown explosion of dummy variables: the mean of "Arrest" within each category is simply that category's arrest rate.

In [12]:
# Record count and arrest rate for each primary type
pd.concat([
    df[['Primary Type','Arrest']].groupby('Primary Type').count(),
    df[['Primary Type','Arrest']].groupby('Primary Type').mean()
],axis=1
)

Primary Type,Arrest (count),Arrest (mean)
ARSON,10856,0.131264
ASSAULT,401935,0.234065
BATTERY,1191848,0.228596
BURGLARY,378644,0.057244
CONCEALED CARRY LICENSE VIOLATION,158,0.905063
CRIM SEXUAL ASSAULT,25632,0.159488
CRIMINAL DAMAGE,748936,0.070857
CRIMINAL TRESPASS,187855,0.737904
DECEPTIVE PRACTICE,246147,0.175968
DOMESTIC VIOLENCE,1,1.0


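Now... THIS is just fascinating. Almost nothing here sits near the overall average arrest rate of about 28%. Most types are far below it (BURGLARY at 5.7%, CRIMINAL DAMAGE at 7.1%) or far above it (CRIMINAL TRESPASS at 73.8%, CONCEALED CARRY LICENSE VIOLATION at 90.5%, though on only 158 records).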
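
The same count-and-rate table can be built in one pass with `agg`, and dividing each rate by the overall arrest rate shows how far a category sits from average. This is a sketch on toy data; the helper `arrest_rate_by` and the `lift` column are names I'm introducing for illustration:

```python
import pandas as pd

def arrest_rate_by(frame, column, target='Arrest'):
    """Count and arrest rate per category, plus lift over the overall rate."""
    overall = frame[target].mean()
    table = (frame.groupby(column)[target]
                  .agg(['count', 'mean'])
                  .rename(columns={'count': 'Arrest Count',
                                   'mean': 'Arrest Mean'}))
    # lift > 1: arrests more likely than overall; lift < 1: less likely.
    table['lift'] = table['Arrest Mean'] / overall
    return table.sort_values('Arrest Count', ascending=False)

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    'Primary Type': ['BURGLARY'] * 4 + ['CRIMINAL TRESPASS'] * 4,
    'Arrest': [False, False, False, True, True, True, True, False],
})
rates = arrest_rate_by(toy, 'Primary Type')
print(rates)
```

Against the real frame the call would be `arrest_rate_by(df, 'Primary Type')`, and the same helper covers `FBI Code`, `Location Description` and `Location`.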

I have a feeling that "FBI Code" and "Primary Type" are very strongly correlated.  Let's see...

In [14]:
# Record count and arrest rate for each FBI code
pd.concat([
    df[['FBI Code','Arrest']].groupby('FBI Code').count(),
    df[['FBI Code','Arrest']].groupby('FBI Code').mean()
],axis=1
)

FBI Code,Arrest (count),Arrest (mean)
01A,8914,0.47173
01B,29,0.827586
02,29650,0.171838
03,248124,0.097137
04A,102519,0.324145
04B,172517,0.2102
05,378644,0.057244
06,1366057,0.12027
07,306626,0.092014
08A,302492,0.202752


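Yeah... You can sort of see it. FBI code 05, for instance, has exactly the same count (378,644) and arrest rate (0.057244) as BURGLARY.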

In [20]:
df[['FBI Code','Primary Type','Arrest']].groupby(['Primary Type','FBI Code']).count()

Primary Type,FBI Code,Arrest
ARSON,09,10743
ARSON,26,113
ASSAULT,04A,102519
ASSAULT,08A,299416
BATTERY,04B,172497
BATTERY,08B,1019351
BURGLARY,05,378644
CONCEALED CARRY LICENSE VIOLATION,15,158
CRIM SEXUAL ASSAULT,02,25632
CRIMINAL DAMAGE,14,748936


In [29]:
pd.concat([
    (df[['Location Description','Arrest']]
         .groupby('Location Description')
         .count()
         .rename(columns={'Arrest':'Arrest Count'})
    ),
    (df[['Location Description','Arrest']]
         .groupby('Location Description')
         .mean()
         .rename(columns={'Arrest':'Arrest Mean'})
    )
],axis=1
).sort_values('Arrest Count',ascending=False)



Location Description,Arrest Count,Arrest Mean
STREET,1723371,0.288449
RESIDENCE,1106037,0.141585
APARTMENT,669930,0.175321
SIDEWALK,647849,0.517631
OTHER,247593,0.191261
PARKING LOT/GARAGE(NON.RESID.),187559,0.197847
ALLEY,146737,0.464975
"SCHOOL, PUBLIC, BUILDING",139514,0.308191
RESIDENCE-GARAGE,128357,0.057605
RESIDENCE PORCH/HALLWAY,114293,0.319162


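Whoa. This is also fascinating. Arrest rates on a SIDEWALK (51.8%) or in an ALLEY (46.5%) are several times those for a RESIDENCE (14.2%) or a RESIDENCE-GARAGE (5.8%).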

What is "Location" vs. "Location Description"?

In [30]:
df['Location'].value_counts()

(41.976290414, -87.905227221)    12759
(41.754592961, -87.741528537)     9239
(41.883500187, -87.627876698)     6589
(41.897895128, -87.624096605)     4056
(41.896888586, -87.628203192)     3028
(41.909664252, -87.742728815)     2811
(41.885487535, -87.726422045)     2600
(41.904192368, -87.647000785)     2545
(41.721627204, -87.624485177)     2349
(41.88233367, -87.627841791)      2329
(41.788987036, -87.74147999)      2308
(41.736259984, -87.628068782)     2292
(41.68995741, -87.637460623)      2239
(41.737094305, -87.572998178)     2219
(41.739265865, -87.604893749)     2106
(41.891990384, -87.611461502)     2093
(41.979006297, -87.906463155)     1989
(41.736148121, -87.629070243)     1987
(41.706070186, -87.653645803)     1968
(41.814007401, -87.628331665)     1957
(41.766102387, -87.573539169)     1863
(41.976200173, -87.905312411)     1862
(41.78210152, -87.586502002)      1830
(41.864493678, -87.639158)        1801
(41.750940757, -87.625185222)     1798
(41.874363279, -87.643013

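Oh... just a convenience field pairing latitude and longitude. Between Block, Beat, District, Ward, Community Area, X/Y Coordinate, Latitude, Longitude and Location, just how many geo-based features does this dataset need?!?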
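
A quick way to confirm that "Location" carries nothing beyond "Latitude" and "Longitude" is to parse the string and compare. This sketch assumes the `(lat, lon)` string format shown in the output above, and that missing locations appear as NaN (an assumption, since the notebook never checks for nulls). The toy rows reuse two coordinates from that output.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Latitude': [41.976290414, 41.754592961, np.nan],
    'Longitude': [-87.905227221, -87.741528537, np.nan],
    'Location': ['(41.976290414, -87.905227221)',
                 '(41.754592961, -87.741528537)',
                 np.nan],
})

# Pull the two numbers out of the "(lat, lon)" string; missing stays NaN.
parsed = (toy['Location']
          .str.extract(r'\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)')
          .astype(float))
parsed.columns = ['Latitude', 'Longitude']

# Compare only the rows where a location is present.
has_loc = toy['Location'].notna()
redundant = np.allclose(
    parsed[has_loc].to_numpy(),
    toy.loc[has_loc, ['Latitude', 'Longitude']].to_numpy(),
)
print(redundant)
```

If the check holds on the full frame, "Location" is safe to treat purely as a grouping key.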

Well... It does make it easy to check based on UNIQUE location...

In [33]:
pd.concat([
    (df[['Location','Arrest']]
         .groupby('Location')
         .count()
         .rename(columns={'Arrest':'Arrest Count'})
    ),
    (df[['Location','Arrest']]
         .groupby('Location')
         .mean()
         .rename(columns={'Arrest':'Arrest Mean'})
    )
],axis=1
).sort_values('Arrest Count',ascending=False)


Location,Arrest Count,Arrest Mean
"(41.976290414, -87.905227221)",12759,0.261619
"(41.754592961, -87.741528537)",9239,0.484252
"(41.883500187, -87.627876698)",6589,0.700258
"(41.897895128, -87.624096605)",4056,0.494083
"(41.896888586, -87.628203192)",3028,0.646631
"(41.909664252, -87.742728815)",2811,0.700818
"(41.885487535, -87.726422045)",2600,0.813462
"(41.904192368, -87.647000785)",2545,0.749312
"(41.721627204, -87.624485177)",2349,0.607067
"(41.88233367, -87.627841791)",2329,0.559468


Hmm... that's odd. Some places sit close to the overall arrest rate (the busiest location is at 26.2%). Others are well above it (up to 81.3% within this top ten).

But what in the world is up with ONE location generating 12k records and roughly 3.3k arrests over the years?

Ahhh... That's O'Hare.

## Summary

OK. There is more that can be done. I bet there are things we can see based on time. The initial correlation check gives me concern about geography, since none of the numeric location fields showed a linear relationship with "Arrest". But we'll see.
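
As a starting point for the time-based look, the arrest rate can be grouped by hour of day once "Date" is parsed. The format string below is an assumption (this notebook never shows a raw "Date" value), so adjust it to whatever the column actually contains; the toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': ['01/05/2015 02:30:00 AM', '01/05/2015 02:45:00 AM',
             '03/17/2016 09:15:00 PM', '07/04/2017 09:50:00 PM'],
    'Arrest': [True, False, True, True],
})

# ASSUMED format: month/day/year with a 12-hour clock and AM/PM marker.
when = pd.to_datetime(toy['Date'], format='%m/%d/%Y %I:%M:%S %p')

# Record count and arrest rate per hour of day (0-23).
by_hour = (toy.assign(hour=when.dt.hour)
              .groupby('hour')['Arrest']
              .agg(['count', 'mean']))
print(by_hour)
```

Swapping `when.dt.hour` for `when.dt.dayofweek` or `when.dt.month` gives the weekday and seasonal views with no other changes.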

All in all, categoricals rush forward for the win here. There are a number of things we can say just from those per-category arrest rates alone.

This option for scope choice has been deemed "not yet ruled out".