# Crime over Time in Chicago

In this notebook we analyze how crime in Chicago has changed over time.

Overall, the yearly heatmaps suggest a significant decrease in crime in Chicago.

## Load the data

First we need to load the data from HDFS into a Spark DataFrame.

In [13]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [14]:
%%time

# hdfs_port = "hdfs://orion11:26990"
# hdfs_path = "/FL_insurance_sample.csv"

hdfs_port = "hdfs://orion11:13030"
hdfs_path = "/crime-since-2001-chicago.csv"
df = spark.read.format('csv').option("header", "true").load(hdfs_port + hdfs_path)

CPU times: user 1.34 ms, sys: 978 µs, total: 2.32 ms
Wall time: 1.5 s


In [15]:
df.columns

['ID',
 'Case Number',
 'Date',
 'Block',
 'IUCR',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'Beat',
 'District',
 'Ward',
 'Community Area',
 'FBI Code',
 'X Coordinate',
 'Y Coordinate',
 'Year',
 'Updated On',
 'Latitude',
 'Longitude',
 'Location']

In [16]:
# # 02/10/2018 03:50:01 PM
# from datetime import datetime
# import time

# time_format = '%m/%d/%Y %I:%M:%S %p'
# time_str = "02/10/2018 03:50:01 PM"
# time_object = time.strptime(time_str, time_format)
# time_object

## Binning

Here we use a SQL query to bin the data and select the features we want. We bin by latitude and longitude, each rounded to four decimal places, and count the number of crimes in each bin per year.

In [17]:
dg = df

dg.createOrReplaceTempView("crime_data")

query_str = '''
SELECT ROUND(Latitude, 4) AS lat,
    ROUND(Longitude, 4) AS lon,
    COUNT(ID) AS count,
    CAST(Year AS INT) AS year
FROM crime_data
WHERE Latitude is NOT NULL AND Longitude is NOT NULL
GROUP BY lat, lon, year
ORDER BY count DESC
'''

print(query_str)


SELECT ROUND(Latitude, 4) AS lat,
    ROUND(Longitude, 4) AS lon,
    COUNT(ID) AS count,
    CAST(Year AS INT) AS year
FROM crime_data
WHERE Latitude is NOT NULL AND Longitude is NOT NULL
GROUP BY lat, lon, year
ORDER BY count DESC



## Load into Pandas

Here we run the SQL query and load the results into a pandas DataFrame.

In [18]:
%%time

dh = spark.sql(query_str)
p_df = dh.toPandas()

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/home4/mcdomingo/.conda/envs/py3/lib/python3.7/site-packages/IPython/core/magics/execution.py", line 1246, in time
    exec(code, glob, local_ns)
  File "<timed exec>", line 3, in <module>
  File "/usr/local/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 1968, in toPandas
    pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
  File "/home4/mcdomingo/.conda/envs/py3/lib/python3.7/site-packages/pandas/core/frame.py", line 1269, in from_records
    coerce_float=coerce_float)
  File "/home4/mcdomingo/.conda/envs/py3/lib/python3.7/site-packages/pandas/core/frame.py", line 7475, in _to_arrays
    dtype=dtype)
  File "/home4/mcdomingo/.conda/envs/py3/lib/python3.7/site-packages/pandas/core/frame.py", line 7554, in _list_to_arrays
    coerce_float=coerce_float)
  File "/home4/mcdomingo/.conda/envs/py3/lib/python3.7/site-packages/pandas/core/frame.py", line 7621, in _convert_object_array
    arrays = [convert(arr) 

KeyboardInterrupt: 

In [19]:
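# Min-max scale counts to [0, 1] across all years so the yearly maps are comparable
count_min, maximum_count = p_df['count'].min(), p_df['count'].max()
p_df['normcount'] = (p_df['count'] - count_min) / (maximum_count - count_min)

# Log-scaled counts as an alternative weight (bin counts are expected to be >= 1)
p_df['logcount'] = np.log(p_df['count'])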
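
As a rough illustration of what these two columns contain, the sketch below applies the same min-max and log scaling to a tiny made-up frame. The helper name `add_scaled_counts` and the toy values are assumptions for this example only; the guard against a zero span is an addition that the cell above does not have.

```python
import numpy as np
import pandas as pd


def add_scaled_counts(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with 'normcount' and 'logcount' columns added."""
    out = frame.copy()
    lo, hi = out["count"].min(), out["count"].max()
    span = hi - lo
    # Min-max scaling maps the smallest count to 0 and the largest to 1.
    # If every count is identical, fall back to 0.0 to avoid dividing by zero.
    out["normcount"] = (out["count"] - lo) / span if span else 0.0
    # Natural log compresses the gap between very busy and quiet bins.
    out["logcount"] = np.log(out["count"])
    return out


# Hypothetical counts spanning two orders of magnitude.
toy = pd.DataFrame({"count": [1, 10, 100]})
scaled = add_scaled_counts(toy)
print(scaled)
```

With min-max scaling the middle bin ends up close to 0, because a single very large bin dominates the range; the log scale spreads the three values evenly.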

## Setting some constants

Chicago latitude and longitude: 41.8781° N, 87.6298° W

In [20]:
chicago_location = (41.8781, -87.6298)

## 2011 Data

Crime appears to be concentrated in four major areas.

In [20]:
import folium
import folium.plugins as plugins

m = folium.Map(location=chicago_location, zoom_start=10)

data_list = p_df.loc[p_df['year'] == 2011][['lat', 'lon', 'normcount']].values.tolist()

hm = plugins.HeatMap(data_list, min_opacity=0.2, radius=7, max_zoom=1)

m.add_child(hm)


KeyboardInterrupt
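
Each yearly map in this notebook selects its points the same way, so the selection could be factored into a small helper. The sketch below is one possible version; `heatmap_points` and the toy frame are hypothetical names and values, and the frame only assumes the `lat`, `lon`, `year`, and `normcount` columns built earlier.

```python
import pandas as pd


def heatmap_points(frame, year, weight="normcount"):
    """Return [[lat, lon, weight], ...] for one year as plain Python lists."""
    rows = frame.loc[frame["year"] == year, ["lat", "lon", weight]]
    # tolist() converts NumPy scalars to built-in floats.
    return rows.values.tolist()


# Hypothetical binned data with the same columns as p_df.
toy = pd.DataFrame({
    "lat": [41.8781, 41.9000, 41.8781],
    "lon": [-87.6298, -87.7000, -87.6298],
    "year": [2011, 2011, 2013],
    "normcount": [1.0, 0.25, 0.5],
})

points_2011 = heatmap_points(toy, 2011)
print(points_2011)
```

The returned list has the same shape as `data_list` in the map cells, so it could be passed to `plugins.HeatMap` in its place.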



## 2013 Data

Not much has changed since 2011.

In [None]:
m = folium.Map(location=chicago_location, zoom_start=10)

data_list = p_df.loc[p_df['year'] == 2013][['lat', 'lon', 'normcount']].values.tolist()

hm = plugins.HeatMap(data_list, min_opacity=0.2, radius=7, max_zoom=1)

m.add_child(hm)

## 2015 Data

There has been a significant decrease in crime in South Chicago, but Central Chicago still shows several dense hot spots.

In [None]:
m = folium.Map(location=chicago_location, zoom_start=10)

data_list = p_df.loc[p_df['year'] == 2015][['lat', 'lon', 'normcount']].values.tolist()

hm = plugins.HeatMap(data_list, min_opacity=0.2, radius=7, max_zoom=1)

m.add_child(hm)

## 2018 Data

This data shows that crime has decreased dramatically in South Chicago, so much so that it is almost indistinguishable from the background.

While considerably better, Central Chicago still holds some very dense pockets of crime.

In [None]:
m = folium.Map(location=chicago_location, zoom_start=10)

data_list = p_df.loc[p_df['year'] == 2018][['lat', 'lon', 'normcount']].values.tolist()

hm = plugins.HeatMap(data_list, min_opacity=0.2, radius=7, max_zoom=1)

m.add_child(hm)

## Takeaway

After studying this data, it appears that the police in Chicago are doing a great job of reducing crime. The ongoing gentrification of South Chicago may also be contributing to the decline.
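
The heatmaps give a visual impression; one way to check the overall trend numerically would be to sum the binned counts per year. The sketch below does this on made-up numbers, so the percentages it prints say nothing about the real data. The toy frame is hypothetical and only assumes the `year` and `count` columns of `p_df`.

```python
import pandas as pd

# Hypothetical binned counts (made-up values, same columns as p_df).
toy = pd.DataFrame({
    "year": [2011, 2011, 2015, 2018],
    "count": [40, 60, 70, 50],
})

# Total crimes per year, summed over all lat/lon bins.
yearly = toy.groupby("year")["count"].sum().sort_index()

# Percentage change of each year relative to the first year in the data.
pct_change = (yearly / yearly.iloc[0] - 1.0) * 100.0

print(yearly)
print(pct_change.round(1))
```

Because the query keeps only rows with non-null coordinates, totals computed this way would exclude any records that lack a location.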