In [1]:
import pandas as pd

DATA = '/kaggle/input/nypd-complaint-data-historic/NYPD_Complaint_Data_Historic.csv'
df = pd.read_csv(filepath_or_buffer=DATA, parse_dates=['RPT_DT'], low_memory=False)
df['year'] = df['RPT_DT'].dt.year
df.shape

(8914838, 36)
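
Loading all 36 columns for nearly nine million rows takes a lot of memory. If that becomes a problem, one option is to read only the columns we actually use. Here is a minimal sketch on an in-memory stand-in for the CSV; the three rows, the EXTRA_COL column, and the MM/DD/YYYY date format are assumptions for illustration, not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real CSV (values are made up).
SAMPLE_CSV = io.StringIO(
    "CMPLNT_NUM,RPT_DT,BORO_NM,PD_CD,EXTRA_COL\n"
    "1,01/15/2023,BRONX,441,x\n"
    "2,02/20/2023,QUEENS,638,y\n"
    "3,03/05/2022,BROOKLYN,333,z\n"
)

# usecols skips columns we never touch; a categorical dtype stores the
# handful of distinct borough names compactly.
sample_df = pd.read_csv(
    SAMPLE_CSV,
    usecols=['RPT_DT', 'BORO_NM', 'PD_CD'],
    parse_dates=['RPT_DT'],
    dtype={'BORO_NM': 'category', 'PD_CD': 'float64'},
)
sample_df['year'] = sample_df['RPT_DT'].dt.year
print(sample_df.dtypes)
```

The same keyword arguments could be passed alongside DATA in the cell above, with the column list extended to whatever the later cells need (Latitude, Longitude, KY_CD, PD_DESC).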

Let's break the complaints out by report year.

In [2]:
df['year'].value_counts().sort_index().to_frame().T

year,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
count,530891,536330,529973,512985,509731,498589,504351,497264,491332,478583,478934,469017,464138,461926,413639,450042,531996,555117


And let's see what we get if we break them out by borough.

In [3]:
df['BORO_NM'].value_counts(dropna=False).sort_index().to_frame().T

BORO_NM,(null),BRONX,BROOKLYN,MANHATTAN,QUEENS,STATEN ISLAND
count,7884,1928510,2618580,2149468,1800709,409687


Our data is definitely not evenly distributed across the boroughs: Brooklyn has the most complaints and Staten Island by far the fewest. We also have 7,884 records where the borough is not known or not applicable.
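
To put numbers on that, we can turn the borough counts into percentage shares. This sketch re-enters the counts printed above by hand so it runs without the full dataset; on the real frame, something like df['BORO_NM'].value_counts(normalize=True) should give the same proportions.

```python
import pandas as pd

# Counts copied from the In [3] output above.
boro_counts = pd.Series(
    {
        '(null)': 7884,
        'BRONX': 1928510,
        'BROOKLYN': 2618580,
        'MANHATTAN': 2149468,
        'QUEENS': 1800709,
        'STATEN ISLAND': 409687,
    },
    name='count',
)

# Each borough's share of all complaints, as a percentage.
boro_share = (boro_counts / boro_counts.sum() * 100).round(1)
print(boro_share.sort_values(ascending=False))
```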

We can plot the totals by year and borough. Is a heatmap a good way to do that?

In [4]:
from plotly import express
from plotly import io

io.renderers.default = 'iframe'

express.density_heatmap(
    data_frame=df[df['BORO_NM'] != '(null)'][['year', 'BORO_NM']].value_counts().to_frame().reset_index(),
    x='year', y='BORO_NM', z='count', nbinsx=18
)
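
The long-format counts table passed to the plot can also be built with groupby, which names the count column explicitly instead of relying on the name value_counts() assigns (that default differs between pandas versions). A small sketch on made-up rows; the toy frame is hypothetical, only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for df[['year', 'BORO_NM']].
toy = pd.DataFrame({
    'year': [2022, 2022, 2022, 2023, 2023, 2023],
    'BORO_NM': ['BRONX', 'BRONX', 'QUEENS', 'BRONX', '(null)', 'QUEENS'],
})

# Drop the placeholder borough, then count rows per (year, borough).
known = toy[toy['BORO_NM'] != '(null)']
counts = (
    known.groupby(['year', 'BORO_NM'])
    .size()
    .reset_index(name='count')
)

# The wide form is the grid a heatmap displays: one row per borough,
# one column per year.
wide = counts.pivot(index='BORO_NM', columns='year', values='count')
print(wide)
```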

Maybe a line plot would do a better job.

In [5]:
express.line(
    data_frame=df[df['BORO_NM'] != '(null)'][['year', 'BORO_NM']].value_counts().to_frame().reset_index().sort_values(by=['year', 'BORO_NM']),
    x='year', color='BORO_NM', y='count', 
)

A line plot does do a better job. Our data looks more like a time series this way, and the drop in 2020, the first year of the COVID pandemic, really stands out.
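
We can also quantify that change from the yearly totals. This sketch re-enters the counts from the In [2] output by hand and computes the year-over-year percentage change; it covers citywide totals only, not the per-borough lines in the plot.

```python
import pandas as pd

# Yearly complaint counts copied from the In [2] output above.
yearly = pd.Series(
    [530891, 536330, 529973, 512985, 509731, 498589, 504351, 497264, 491332,
     478583, 478934, 469017, 464138, 461926, 413639, 450042, 531996, 555117],
    index=range(2006, 2024),
    name='count',
)

# Percentage change relative to the previous year (NaN for 2006).
yoy = yearly.pct_change().mul(100).round(1)
print(yoy.loc[2019:2023])
```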

We have so many records that it makes sense to look at just one year's worth. Let's take 2023.

In [6]:
year_df = df[df['year'] == 2023]

Let's make a map of our one year of data. A half-million points is too many to plot on an interactive map, so we have to take a sample.

In [7]:
express.scatter_mapbox(mapbox_style='open-street-map', lat='Latitude', lon='Longitude', color='BORO_NM', data_frame=year_df.sample(n=5000), height=800, zoom=10)
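
Because sample() draws a different 5,000 rows every time it runs, each map will show different points. If we want the maps to be comparable, one option is to fix random_state, and to drop any rows without coordinates before sampling. A sketch on a made-up frame; the coordinates below are hypothetical, and whether the real data has missing coordinates is an assumption worth checking.

```python
import pandas as pd

# Hypothetical rows; two of them lack a coordinate.
toy = pd.DataFrame({
    'Latitude': [40.70, None, 40.80, 40.60, 40.75, 40.65],
    'Longitude': [-73.90, -73.95, None, -74.10, -73.98, -73.92],
    'BORO_NM': ['BROOKLYN', 'BROOKLYN', 'BRONX',
                'STATEN ISLAND', 'MANHATTAN', 'QUEENS'],
})

# Keep only rows that can actually be placed on a map.
plottable = toy.dropna(subset=['Latitude', 'Longitude'])

# A fixed seed makes the sample repeatable across cells and reruns.
sample_a = plottable.sample(n=3, random_state=0)
sample_b = plottable.sample(n=3, random_state=0)
print(sample_a.equals(sample_b))
```

On the notebook's data this would look like year_df.dropna(subset=['Latitude', 'Longitude']).sample(n=5000, random_state=0), stored once and reused for every map.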

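We can see the boroughs pretty clearly. Let's color the points by offense classification code (KY_CD), and then by the more granular internal classification code (PD_CD).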

In [8]:
express.scatter_mapbox(mapbox_style='open-street-map', lat='Latitude', lon='Longitude', color='KY_CD', data_frame=year_df.sample(n=5000), height=800, zoom=10)

In [9]:
express.scatter_mapbox(mapbox_style='open-street-map', lat='Latitude', lon='Longitude', color='PD_CD', data_frame=year_df.sample(n=5000), height=800, zoom=10)

What are our top crimes by code?

In [10]:
year_df['PD_CD'].value_counts().head(n=10).to_frame().T

PD_CD,638.0,333.0,101.0,109.0,637.0,639.0,922.0,259.0,441.0,198.0
count,64807,51234,44137,20770,19147,16548,13683,12389,11429,10892


What are they by name?

In [11]:
year_df['PD_DESC'].value_counts().head(n=10).to_frame().T

PD_DESC,"HARASSMENT,SUBD 3,4,5","LARCENY,PETIT FROM STORE-SHOPL",ASSAULT 3,"ASSAULT 2,1,UNCLASSIFIED","HARASSMENT,SUBD 1,CIVILIAN",AGGRAVATED HARASSMENT 2,"TRAFFIC,UNCLASSIFIED MISDEMEAN","CRIMINAL MISCHIEF,UNCLASSIFIED 4","LARCENY,GRAND OF AUTO",CRIMINAL CONTEMPT 1
count,64807,51234,44137,20770,19147,16548,13683,12389,11429,10892
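
The counts in the two tables match exactly, which suggests each PD_CD maps to a single PD_DESC. We can check that directly rather than eyeballing it. A sketch on a few made-up rows; the code-to-description pairs below are inferred from the matching counts above, not from a data dictionary.

```python
import pandas as pd

# Hypothetical rows; pairings inferred from the matching counts above.
toy = pd.DataFrame({
    'PD_CD': [638.0, 638.0, 333.0, 441.0, 441.0],
    'PD_DESC': [
        'HARASSMENT,SUBD 3,4,5',
        'HARASSMENT,SUBD 3,4,5',
        'LARCENY,PETIT FROM STORE-SHOPL',
        'LARCENY,GRAND OF AUTO',
        'LARCENY,GRAND OF AUTO',
    ],
})

# Number of distinct descriptions seen for each code; all ones means
# every code has exactly one description.
descs_per_code = toy.groupby('PD_CD')['PD_DESC'].nunique()
print((descs_per_code == 1).all())

# A code -> description lookup for labelling tables and plots.
lookup = toy.drop_duplicates('PD_CD').set_index('PD_CD')['PD_DESC']
print(lookup)
```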


Let's try plotting our GTAs (grand theft auto, recorded here as LARCENY,GRAND OF AUTO).

In [12]:
express.scatter_mapbox(mapbox_style='open-street-map', lat='Latitude', lon='Longitude', color='PD_CD', data_frame=year_df[year_df['PD_DESC'] == 'LARCENY,GRAND OF AUTO'], height=800, zoom=10)

Wow. They are not evenly distributed geographically. Let's try counting them by borough.

In [13]:
year_df[year_df['PD_DESC'] == 'LARCENY,GRAND OF AUTO']['BORO_NM'].value_counts().to_frame().T

BORO_NM,BRONX,QUEENS,BROOKLYN,MANHATTAN,STATEN ISLAND
count,3991,3276,2599,1135,428
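
As a quick sanity check, the borough counts should add up to the 11,429 reported for LARCENY,GRAND OF AUTO in the top-ten table, and we can express them as shares. The numbers below are re-entered by hand from the outputs above. These are raw counts, so they do not account for differences in borough population or vehicle ownership.

```python
import pandas as pd

# GTA counts by borough, copied from the In [13] output above.
gta_by_boro = pd.Series(
    {
        'BRONX': 3991,
        'QUEENS': 3276,
        'BROOKLYN': 2599,
        'MANHATTAN': 1135,
        'STATEN ISLAND': 428,
    },
    name='count',
)

# Total for LARCENY,GRAND OF AUTO from the In [11] output.
total_gta = 11429
print(gta_by_boro.sum() == total_gta)

# Each borough's share of the 2023 GTAs, as a percentage.
gta_share = (gta_by_boro / gta_by_boro.sum() * 100).round(1)
print(gta_share)
```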
