# Arrest data in NYC, an exploration and regression analysis
## Author: Jack Robbins

In [28]:
# Important imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from sklearn import preprocessing
from matplotlib.gridspec import GridSpec

In [29]:
# Read in our dataframe
arrests = pd.read_csv("data/NYPD_Arrests_Data__Historic__20241018.csv")

In [30]:
# Let's see what we're working with
arrests.head()

Unnamed: 0,ARREST_KEY,ARREST_DATE,PD_CD,PD_DESC,KY_CD,OFNS_DESC,LAW_CODE,LAW_CAT_CD,ARREST_BORO,ARREST_PRECINCT,JURISDICTION_CODE,AGE_GROUP,PERP_SEX,PERP_RACE,X_COORD_CD,Y_COORD_CD,Latitude,Longitude,Lon_Lat
0,186134240,08/07/2018,184.0,,,,PL 12070E1,F,K,73,0.0,45-64,M,BLACK,1007585.0,183788.0,40.67111,-73.915881,POINT (-73.91588130999997 40.67110980800004)
1,220476154,11/13/2020,397.0,"ROBBERY,OPEN AREA UNCLASSIFIED",105.0,ROBBERY,PL 1600500,F,B,40,0.0,25-44,M,BLACK,1005041.0,234533.0,40.810398,-73.924895,POINT (-73.92489531099994 40.810398494000026)
2,199148493,07/01/2019,440.0,,,,PL 1553502,F,M,23,1.0,25-44,M,BLACK HISPANIC,998829.0,226859.0,40.789348,-73.947352,POINT (-73.94735241299998 40.78934789300007)
3,209928408,02/22/2020,569.0,"MARIJUANA, SALE 4 & 5",235.0,DANGEROUS DRUGS,PL 2214000,M,M,32,0.0,25-44,M,BLACK,1001610.0,241367.0,40.829163,-73.937272,POINT (-73.93727189399993 40.829163304000076)
4,220330574,11/10/2020,101.0,ASSAULT 3,344.0,ASSAULT 3 & RELATED OFFENSES,PL 1200001,M,B,49,0.0,25-44,M,WHITE,1024396.0,250744.0,40.854826,-73.85488,POINT (-73.85487970999998 40.85482622300003)


## Data Preprocessing
As we can see from above, there are a lot of NULLs, along with columns that we may or may not want. We'll clean this data up before doing any analysis.

In [31]:
null_values=arrests.isnull().sum()
print("Detecting missing values:\n", null_values)

Detecting missing values:
 ARREST_KEY               0
ARREST_DATE              0
PD_CD                  876
PD_DESC               9169
KY_CD                 9756
OFNS_DESC             9169
LAW_CODE               196
LAW_CAT_CD           23600
ARREST_BORO              8
ARREST_PRECINCT          0
JURISDICTION_CODE       10
AGE_GROUP               17
PERP_SEX                 0
PERP_RACE                0
X_COORD_CD               1
Y_COORD_CD               1
Latitude                 1
Longitude                1
Lon_Lat                  1
dtype: int64
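
Raw counts are hard to judge against millions of rows, so it can help to view them as percentages. The sketch below is only an illustration: `toy` is a made-up stand-in for `arrests`, and `null_pct` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `arrests`; the values are made up for illustration.
toy = pd.DataFrame({
    "PD_CD": [184.0, np.nan, 440.0, 569.0],
    "LAW_CAT_CD": ["F", None, None, "M"],
    "PERP_SEX": ["M", "M", "F", "M"],
})

# isnull().mean() is the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
null_pct = toy.isnull().mean() * 100
print(null_pct.sort_values(ascending=False))
```

Running the same two lines against `arrests` would presumably show each column's missing share relative to the full dataset.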


### Unneeded columns
Looking here, we have a column titled "ARREST_KEY", which is likely the primary key for the database where this is stored. We don't need this column, so we'll drop it. We also don't need "Lon_Lat", as it is just a combination of the Latitude and Longitude columns. The same can be said for X_COORD_CD and Y_COORD_CD, because these are just proxies for longitude and latitude. Offense description (OFNS_DESC) is a free-entry text field that is likely to contain large amounts of junk with no consistent pattern, so we'll get rid of that as well.
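
Before dropping "Lon_Lat", the redundancy claim can be spot-checked. This is a minimal sketch that uses two rows copied from the `head()` output above; the regular expression and the `1e-5` tolerance are assumptions chosen to match the rounding seen in that output, and `sample`, `parsed`, and `redundant` are names introduced here.

```python
import numpy as np
import pandas as pd

# Two rows copied from the head() output above.
sample = pd.DataFrame({
    "Latitude": [40.67111, 40.810398],
    "Longitude": [-73.915881, -73.924895],
    "Lon_Lat": [
        "POINT (-73.91588130999997 40.67110980800004)",
        "POINT (-73.92489531099994 40.810398494000026)",
    ],
})

# Pull the two numbers out of the "POINT (lon lat)" text.
parsed = sample["Lon_Lat"].str.extract(
    r"POINT \((?P<lon>-?\d+\.\d+) (?P<lat>-?\d+\.\d+)\)"
).astype(float)

# Compare against the separate columns, allowing for display rounding.
redundant = (
    np.isclose(parsed["lon"], sample["Longitude"], atol=1e-5)
    & np.isclose(parsed["lat"], sample["Latitude"], atol=1e-5)
).all()
print(redundant)
```

If the same check held across the full dataframe, that would support dropping the column without losing information.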

In [32]:
# Dropping columns
arrests.drop(['ARREST_KEY', 'X_COORD_CD', 'Y_COORD_CD',
              'OFNS_DESC', 'Lon_Lat'], axis=1, inplace=True)

In [33]:
# Let's see how we're looking now
null_values=arrests.isnull().sum()
print("Detecting missing values:\n", null_values)

Detecting missing values:
 ARREST_DATE              0
PD_CD                  876
PD_DESC               9169
KY_CD                 9756
LAW_CODE               196
LAW_CAT_CD           23600
ARREST_BORO              8
ARREST_PRECINCT          0
JURISDICTION_CODE       10
AGE_GROUP               17
PERP_SEX                 0
PERP_RACE                0
Latitude                 1
Longitude                1
dtype: int64


In [34]:
arrests.shape

(5725522, 14)

### Removing NA's
As we can see, we have about 5.7 million rows of data to work with here. The per-column null counts above sum to 43,634, so at most around 44,000 rows contain an NA. In my opinion, dropping these is an acceptable loss. A lot of these NA's are also in categorical columns, so there's no sensible way to fill them in.

In [35]:
# Drop any rows that have at least one null column
arrests.dropna(how='any', inplace=True)

In [36]:
# Let's see how we're looking now
null_values=arrests.isnull().sum()
print("Detecting missing values:\n", null_values)

Detecting missing values:
 ARREST_DATE          0
PD_CD                0
PD_DESC              0
KY_CD                0
LAW_CODE             0
LAW_CAT_CD           0
ARREST_BORO          0
ARREST_PRECINCT      0
JURISDICTION_CODE    0
AGE_GROUP            0
PERP_SEX             0
PERP_RACE            0
Latitude             0
Longitude            0
dtype: int64


In [37]:
arrests.shape

(5692330, 14)

As we can see, we still have well over 5 million records to analyze after the NA removal, and there are no more NA's in our dataset.
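
To put a number on that loss, the two `arrests.shape` outputs above can be compared directly. This is a small arithmetic sketch; the variable names are introduced here, and the row counts are copied from the outputs.

```python
rows_before = 5725522  # arrests.shape[0] before dropna, from the output above
rows_after = 5692330   # arrests.shape[0] after dropna, from the output above

rows_dropped = rows_before - rows_after
pct_dropped = rows_dropped / rows_before * 100

print(f"Dropped {rows_dropped:,} rows ({pct_dropped:.2f}% of the original)")
```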

### Cleaning up type mismatches and other miscellaneous preprocessing tasks

In [38]:
arrests.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5692330 entries, 1 to 5725521
Data columns (total 14 columns):
 #   Column             Dtype  
---  ------             -----  
 0   ARREST_DATE        object 
 1   PD_CD              float64
 2   PD_DESC            object 
 3   KY_CD              float64
 4   LAW_CODE           object 
 5   LAW_CAT_CD         object 
 6   ARREST_BORO        object 
 7   ARREST_PRECINCT    int64  
 8   JURISDICTION_CODE  float64
 9   AGE_GROUP          object 
 10  PERP_SEX           object 
 11  PERP_RACE          object 
 12  Latitude           float64
 13  Longitude          float64
dtypes: float64(5), int64(1), object(8)
memory usage: 651.4+ MB
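
One mismatch visible here is that code columns such as PD_CD and KY_CD are stored as `float64`, probably because they contained NaN when the file was read. With the NA rows gone, they could be cast back to integers. The following is a sketch on a toy frame, not something the notebook itself does; `toy` and `code_cols` are names introduced here.

```python
import pandas as pd

# Toy frame with values taken from the head() output above.
toy = pd.DataFrame({
    "PD_CD": [397.0, 569.0, 101.0],
    "KY_CD": [105.0, 235.0, 344.0],
})
code_cols = ["PD_CD", "KY_CD"]

# Only cast when every value is a whole number, so nothing is silently truncated.
if (toy[code_cols] % 1 == 0).all().all():
    toy = toy.astype({col: "int64" for col in code_cols})

print(toy.dtypes)
```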


Arrest date is one that specifically interests me, but I'm fairly certain that date is too granular. However **month** may not be, so I'm going to convert all of these dates to months.

In [40]:
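# Convert an MM/DD/YYYY date string to its month component
def date_to_months(date):
    s = date.split("/")
    # Anything that doesn't split into month/day/year is treated as missing
    if len(s) != 3:
        return np.nan
    else:
        return s[0]

arrests['ARREST_MONTH'] = arrests['ARREST_DATE'].apply(date_to_months)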

In [41]:
# Let's see how we're looking now
null_values=arrests.isnull().sum()
print("Detecting missing values:\n", null_values)

Detecting missing values:
 ARREST_DATE          0
PD_CD                0
PD_DESC              0
KY_CD                0
LAW_CODE             0
LAW_CAT_CD           0
ARREST_BORO          0
ARREST_PRECINCT      0
JURISDICTION_CODE    0
AGE_GROUP            0
PERP_SEX             0
PERP_RACE            0
Latitude             0
Longitude            0
ARREST_MONTH         0
dtype: int64
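
As an alternative to the row-by-row `apply`, the month can also be extracted with pandas' datetime parsing. This is a hedged sketch, assuming every date follows the MM/DD/YYYY pattern seen in the `head()` output. Note that it yields numeric months rather than the zero-padded strings (such as "08") that the function above returns. `dates`, `parsed`, and `months` are names introduced here.

```python
import pandas as pd

# Three dates from the head() output, plus one malformed entry for illustration.
dates = pd.Series(["08/07/2018", "11/13/2020", "07/01/2019", "not a date"])

# errors="coerce" turns anything that doesn't match the format into NaT.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# .dt.month gives the month number; unparseable rows come out as NaN.
months = parsed.dt.month
print(months)
```

On the real data this would be `pd.to_datetime(arrests['ARREST_DATE'], format="%m/%d/%Y", errors="coerce").dt.month`, which is typically much faster than `apply` on millions of rows.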
