In [57]:
import pandas as pd
import numpy as np
import folium

from folium import plugins
from folium.plugins import HeatMap


## 1. Data Cleaning

Here, we will be using data on crime provided by the police department of New York City, linked [here](https://www1.nyc.gov/site/nypd/stats/crime-statistics/citywide-crime-stats.page). At this link, you can find additional information regarding not only arrests, but also data regarding court summons, shootings, and complaints.

At the link above, you can find crime data dating back to 2013, yet I chose to focus on this most recent year for analysis. This is for a few reasons, mainly that it greatly minimized the number of datapoints and allowed for easier data analysis. I also noticed that upon preliminary inspection, older data seemed to be patchier and just generally worse to work with. Furthermore, I wasn't planning on analyzing the time element to these datasets, so I wouldn't need a wide range of dates for effective analysis.

This is an example of the data from the start. It contains ~100,000 data points regarding specific arrests made in New York City over the past year. These data points include information about the crime itself, the perpetrator's age, race, sex, and the location of the crime. 

In [58]:
df = pd.read_csv("NYPD_Arrest_Data__Year_to_Date_.csv")

df.head()

Unnamed: 0,ARREST_KEY,ARREST_DATE,PD_CD,PD_DESC,KY_CD,OFNS_DESC,LAW_CODE,LAW_CAT_CD,ARREST_BORO,ARREST_PRECINCT,JURISDICTION_CODE,AGE_GROUP,PERP_SEX,PERP_RACE,X_COORD_CD,Y_COORD_CD,Latitude,Longitude,New Georeferenced Column
0,234233843,09/29/2021,105.0,STRANGULATION 1ST,106.0,FELONY ASSAULT,PL 1211200,F,B,42,0,25-44,M,BLACK,1009231,240290,40.826189,-73.909738,POINT (-73.90973778899996 40.82618898100003)
1,234129823,09/27/2021,157.0,RAPE 1,104.0,RAPE,PL 1303501,F,K,77,0,25-44,M,BLACK,1003606,185050,40.674583,-73.930222,POINT (-73.93022154099998 40.67458330800008)
2,234040747,09/25/2021,109.0,"ASSAULT 2,1,UNCLASSIFIED",106.0,FELONY ASSAULT,PL 1200501,F,Q,101,0,25-44,M,BLACK,1049232,159210,40.603441,-73.765986,POINT (-73.76598558899997 40.60344094100003)
3,234047720,09/25/2021,101.0,ASSAULT 3,344.0,ASSAULT 3 & RELATED OFFENSES,PL 1200001,M,B,44,0,25-44,M,BLACK,1006537,244511,40.837782,-73.919458,POINT (-73.91945797099999 40.83778161800007)
4,234042526,09/25/2021,101.0,ASSAULT 3,344.0,ASSAULT 3 & RELATED OFFENSES,PL 1200001,M,B,44,0,25-44,M,BLACK,1007418,243859,40.83599,-73.916276,POINT (-73.91627635999998 40.83598980000005)


As you can see, not much cleaning should be needed here. I mostly would like to first drop some columns I don't plan on using going forward

In [59]:
df = df.drop(columns=["ARREST_KEY", "PD_CD", "PD_DESC", "LAW_CODE", "ARREST_PRECINCT", 
                      "X_COORD_CD", "Y_COORD_CD", "New Georeferenced Column"])

df.head()

Unnamed: 0,ARREST_DATE,KY_CD,OFNS_DESC,LAW_CAT_CD,ARREST_BORO,JURISDICTION_CODE,AGE_GROUP,PERP_SEX,PERP_RACE,Latitude,Longitude
0,09/29/2021,106.0,FELONY ASSAULT,F,B,0,25-44,M,BLACK,40.826189,-73.909738
1,09/27/2021,104.0,RAPE,F,K,0,25-44,M,BLACK,40.674583,-73.930222
2,09/25/2021,106.0,FELONY ASSAULT,F,Q,0,25-44,M,BLACK,40.603441,-73.765986
3,09/25/2021,344.0,ASSAULT 3 & RELATED OFFENSES,M,B,0,25-44,M,BLACK,40.837782,-73.919458
4,09/25/2021,344.0,ASSAULT 3 & RELATED OFFENSES,M,B,0,25-44,M,BLACK,40.83599,-73.916276


Here, I drop the null entries of the database. I don't think I have any shortage of data, so I don't see the need to try and salvage these rows going forward.

In [60]:
df = df.dropna()

df.head()

Unnamed: 0,ARREST_DATE,KY_CD,OFNS_DESC,LAW_CAT_CD,ARREST_BORO,JURISDICTION_CODE,AGE_GROUP,PERP_SEX,PERP_RACE,Latitude,Longitude
0,09/29/2021,106.0,FELONY ASSAULT,F,B,0,25-44,M,BLACK,40.826189,-73.909738
1,09/27/2021,104.0,RAPE,F,K,0,25-44,M,BLACK,40.674583,-73.930222
2,09/25/2021,106.0,FELONY ASSAULT,F,Q,0,25-44,M,BLACK,40.603441,-73.765986
3,09/25/2021,344.0,ASSAULT 3 & RELATED OFFENSES,M,B,0,25-44,M,BLACK,40.837782,-73.919458
4,09/25/2021,344.0,ASSAULT 3 & RELATED OFFENSES,M,B,0,25-44,M,BLACK,40.83599,-73.916276


Here, we just make sure we're working with floats for later.

In [61]:
df.loc['Latitude'] = df['Latitude'].astype(float)
df.loc['Longitude'] = df['Longitude'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


## 2. Data Analysis and Visualization

First, I just want to get an idea of what offenses are listed within this dataset so I know what to look for in the future. I did this by taking all unique values of offenses listed and printing them.

In [62]:
offenses = set(df['OFNS_DESC'])

print(offenses)

{'GRAND LARCENY OF MOTOR VEHICLE', 'OFFENSES AGAINST PUBLIC SAFETY', 'HARRASSMENT 2', nan, "BURGLAR'S TOOLS", 'OFFENSES INVOLVING FRAUD', 'OFFENSES AGAINST THE PERSON', 'VEHICLE AND TRAFFIC LAWS', 'GRAND LARCENY', 'INTOXICATED & IMPAIRED DRIVING', 'ASSAULT 3 & RELATED OFFENSES', 'NYS LAWS-UNCLASSIFIED FELONY', 'FELONY SEX CRIMES', 'NEW YORK CITY HEALTH CODE', 'ESCAPE 3', 'POSSESSION OF STOLEN PROPERTY', 'KIDNAPPING', 'FELONY ASSAULT', 'JOSTLING', 'BURGLARY', 'ARSON', 'FRAUDS', 'PROSTITUTION & RELATED OFFENSES', 'HOMICIDE-NEGLIGENT,UNCLASSIFIE', 'FRAUDULENT ACCOSTING', 'OTHER OFFENSES RELATED TO THEF', 'DANGEROUS WEAPONS', 'MURDER & NON-NEGL. MANSLAUGHTE', 'ADMINISTRATIVE CODE', 'CHILD ABANDONMENT/NON SUPPORT', 'ALCOHOLIC BEVERAGE CONTROL LAW', 'PARKING OFFENSES', 'INTOXICATED/IMPAIRED DRIVING', 'THEFT-FRAUD', 'ROBBERY', 'OTHER STATE LAWS (NON PENAL LAW)', 'AGRICULTURE & MRKTS LAW-UNCLASSIFIED', 'ANTICIPATORY OFFENSES', 'THEFT OF SERVICES', 'LOITERING/GAMBLING (CARDS, DIC', 'KIDNAPPING 

In [63]:
heatmap = folium.Map(location=[40.73, -74], zoom_start=11)

heat_df = df.sample(frac=.01)
heat_df = heat_df[['Latitude', 'Longitude']]
heat_data = [[row['Latitude'],row['Longitude']] for index, row in heat_df.iterrows()]

title_html = '''
             <h3 align="center" style="font-size:20px"><b>Crime in NYC 2021</b></h3>
             '''
heatmap.get_root().html.add_child(folium.Element(title_html))

HeatMap(heat_data).add_to(heatmap)

heatmap