<a href="https://colab.research.google.com/github/ipeirotis-org/datasets/blob/main/NYPD_Complaint/NYPD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NYPD Dataset

Dataset description at
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i



| Column | Description |
|--------|-------------------|
| CMPLNT_NUM |  Randomly generated persistent ID for each complaint  |  
| ADDR_PCT_CD |  The precinct in which the incident occurred |  
| BORO |  The name of the borough in which the incident occurred |  
| CMPLNT_FR_DT |  Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |  
| CMPLNT_FR_TM |  Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |  
| CMPLNT_TO_DT |  Ending date of occurrence for the reported event, if exact time of occurrence is unknown |  
| CMPLNT_TO_TM |  Ending time of occurrence for the reported event, if exact time of occurrence is unknown |  
| CRM_ATPT_CPTD_CD |  Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |  
| HADEVELOPT |  Name of NYCHA housing development of occurrence, if applicable |  
| HOUSING_PSA |  Development Level Code |  
| JURISDICTION_CODE |  Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |  
| JURIS_DESC |  Description of the jurisdiction code |  
| KY_CD |  Three digit offense classification code |  
| LAW_CAT_CD |  Level of offense: felony, misdemeanor, violation  |  
| LOC_OF_OCCUR_DESC |  Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |  
| OFNS_DESC |  Description of offense corresponding with key code |  
| PARKS_NM |  Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |  
| PATROL_BORO |  The name of the patrol borough in which the incident occurred |  
| PD_CD |  Three digit internal classification code (more granular than Key Code) |  
| PD_DESC |  Description of internal classification corresponding with PD code (more granular than Offense Description) |  
| PREM_TYP_DESC |  Specific description of premises; grocery store, residence, street, etc. |  
| RPT_DT |  Date event was reported to police  |  
| STATION_NAME |  Transit station name |  
| SUSP_AGE_GROUP |  Suspect’s Age Group |  
| SUSP_RACE |  Suspect’s Race Description |  
| SUSP_SEX |  Suspect’s Sex Description |  
| TRANSIT_DISTRICT |  Transit district in which the offense occurred. |  
| VIC_AGE_GROUP |  Victim’s Age Group |  
| VIC_RACE |  Victim’s Race Description |  
| VIC_SEX |  Victim’s Sex Description |  
| X_COORD_CD |  X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Y_COORD_CD |  Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Latitude |  Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)  |  
| Longitude |  Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |


In [1]:
!pip install -q google-cloud-secret-manager

from google.colab import auth
auth.authenticate_user()

from google.cloud import secretmanager

def access_secret_version(project_id, secret_id, version_id):
    """
    Access the payload of the given secret version and return it.

    Args:
        project_id (str): Google Cloud project ID.
        secret_id (str): ID of the secret to access.
        version_id (str): ID of the version to access.
    Returns:
        str: The secret version's payload, or None if
        the version does not exist.
    """
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")


mysql_pass = access_secret_version("nyu-datasets", "MYSQL_PASSWORD", "latest")

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/218.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m112.6/218.1 kB[0m [31m3.8 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m215.0/218.1 kB[0m [31m5.3 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m218.1/218.1 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import pandas as pd
import numpy as np

In [3]:
# We load everything as an object/string, because some data types (e.g., some IDs)
# are recognized as decimals, and it is a mess to restore them back
# So we will do all the conversions ourselves later on

# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3079M    0 3079M    0     0  4794k      0 --:--:--  0:10:57 --:--:-- 4871k


In [4]:
%%time
df = pd.read_csv('nypd.csv', low_memory = True, dtype='object')

CPU times: user 47.3 s, sys: 6.69 s, total: 54 s
Wall time: 53.8 s


In [5]:
len(df)

9491946

In [6]:
df.query("KY_CD == '101'")

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,CMPLNT_TO_DT,CMPLNT_TO_TM,ADDR_PCT_CD,RPT_DT,KY_CD,OFNS_DESC,PD_CD,...,SUSP_SEX,TRANSIT_DISTRICT,Latitude,Longitude,Lat_Lon,PATROL_BORO,STATION_NAME,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
302687,63975410H16535,07/22/2009,21:24:00,,(null),83,07/22/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,(null),,40.6984738177025,-73.917768981221,"(40.6984738177025, -73.917768981221)",PATROL BORO BKLYN NORTH,(null),18-24,BLACK,M
302704,63256138H16352,06/30/2009,02:41:00,,(null),103,06/30/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,(null),,40.7072398161698,-73.7927267255908,"(40.7072398161698, -73.7927267255908)",PATROL BORO QUEENS SOUTH,(null),25-44,BLACK,M
302716,61400094H15961,05/03/2009,01:05:00,,(null),75,05/03/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,(null),,40.6713598203364,-73.8818110231735,"(40.6713598203364, -73.8818110231735)",PATROL BORO BKLYN NORTH,(null),25-44,WHITE HISPANIC,M
302722,69373208H17529,12/29/2009,01:00:00,,(null),32,12/29/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,(null),,40.81591307653,-73.9451493066481,"(40.81591307653, -73.9451493066481)",PATROL BORO MAN NORTH,(null),25-44,BLACK HISPANIC,M
302853,62074233H16082,05/24/2009,22:32:00,,(null),43,05/24/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,(null),,40.8229123084767,-73.8700413043181,"(40.8229123084767, -73.8700413043181)",PATROL BORO BRONX,(null),18-24,WHITE HISPANIC,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8926707,227344444H1,04/25/2021,00:55:00,,(null),71,04/09/2024,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,M,,40.664599,-73.952395,"(40.664599, -73.952395)",PATROL BORO BKLYN SOUTH,(null),18-24,BLACK,M
8926765,296704248H2,11/18/2024,10:25:00,,(null),17,11/18/2024,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,M,,40.74081,-73.972518,"(40.74081, -73.972518)",PATROL BORO MAN SOUTH,(null),65+,ASIAN / PACIFIC ISLANDER,M
8926787,288867605H1,06/20/2024,22:00:00,,(null),77,06/20/2024,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,(null),,40.673237,-73.964989,"(40.673237, -73.964989)",PATROL BORO BKLYN NORTH,(null),45-64,ASIAN / PACIFIC ISLANDER,M
8926809,283885168H1,03/17/2024,02:17:00,,(null),78,03/17/2024,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,M,,40.682173,-73.979826,"(40.682173, -73.979826)",PATROL BORO BKLYN SOUTH,(null),18-24,BLACK HISPANIC,F


In [7]:
df.replace(to_replace = '(null)', value=np.nan, inplace = True)

In [8]:
len(df)

9491946

In [9]:
df.query("KY_CD == '101'")

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,CMPLNT_TO_DT,CMPLNT_TO_TM,ADDR_PCT_CD,RPT_DT,KY_CD,OFNS_DESC,PD_CD,...,SUSP_SEX,TRANSIT_DISTRICT,Latitude,Longitude,Lat_Lon,PATROL_BORO,STATION_NAME,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
302687,63975410H16535,07/22/2009,21:24:00,,,83,07/22/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,,,40.6984738177025,-73.917768981221,"(40.6984738177025, -73.917768981221)",PATROL BORO BKLYN NORTH,,18-24,BLACK,M
302704,63256138H16352,06/30/2009,02:41:00,,,103,06/30/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,,,40.7072398161698,-73.7927267255908,"(40.7072398161698, -73.7927267255908)",PATROL BORO QUEENS SOUTH,,25-44,BLACK,M
302716,61400094H15961,05/03/2009,01:05:00,,,75,05/03/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,,,40.6713598203364,-73.8818110231735,"(40.6713598203364, -73.8818110231735)",PATROL BORO BKLYN NORTH,,25-44,WHITE HISPANIC,M
302722,69373208H17529,12/29/2009,01:00:00,,,32,12/29/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,,,40.81591307653,-73.9451493066481,"(40.81591307653, -73.9451493066481)",PATROL BORO MAN NORTH,,25-44,BLACK HISPANIC,M
302853,62074233H16082,05/24/2009,22:32:00,,,43,05/24/2009,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,,,40.8229123084767,-73.8700413043181,"(40.8229123084767, -73.8700413043181)",PATROL BORO BRONX,,18-24,WHITE HISPANIC,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8926707,227344444H1,04/25/2021,00:55:00,,,71,04/09/2024,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,M,,40.664599,-73.952395,"(40.664599, -73.952395)",PATROL BORO BKLYN SOUTH,,18-24,BLACK,M
8926765,296704248H2,11/18/2024,10:25:00,,,17,11/18/2024,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,M,,40.74081,-73.972518,"(40.74081, -73.972518)",PATROL BORO MAN SOUTH,,65+,ASIAN / PACIFIC ISLANDER,M
8926787,288867605H1,06/20/2024,22:00:00,,,77,06/20/2024,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,,,40.673237,-73.964989,"(40.673237, -73.964989)",PATROL BORO BKLYN NORTH,,45-64,ASIAN / PACIFIC ISLANDER,M
8926809,283885168H1,03/17/2024,02:17:00,,,78,03/17/2024,101,MURDER & NON-NEGL. MANSLAUGHTER,,...,M,,40.682173,-73.979826,"(40.682173, -73.979826)",PATROL BORO BKLYN SOUTH,,18-24,BLACK HISPANIC,F


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9491946 entries, 0 to 9491945
Data columns (total 35 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   CMPLNT_NUM         object
 1   CMPLNT_FR_DT       object
 2   CMPLNT_FR_TM       object
 3   CMPLNT_TO_DT       object
 4   CMPLNT_TO_TM       object
 5   ADDR_PCT_CD        object
 6   RPT_DT             object
 7   KY_CD              object
 8   OFNS_DESC          object
 9   PD_CD              object
 10  PD_DESC            object
 11  CRM_ATPT_CPTD_CD   object
 12  LAW_CAT_CD         object
 13  BORO_NM            object
 14  LOC_OF_OCCUR_DESC  object
 15  PREM_TYP_DESC      object
 16  JURIS_DESC         object
 17  JURISDICTION_CODE  object
 18  PARKS_NM           object
 19  HADEVELOPT         object
 20  HOUSING_PSA        object
 21  X_COORD_CD         object
 22  Y_COORD_CD         object
 23  SUSP_AGE_GROUP     object
 24  SUSP_RACE          object
 25  SUSP_SEX           object
 26  TRANSIT_DISTRI

## Data Cleaning

In [11]:
# These columns are redundant
to_drop = ['Lat_Lon','X_COORD_CD','Y_COORD_CD']
# We created the CMPLNT_FR and CMPLNT_TO columns
# We have the longitude and latitude so the other coordinates are not needed
df = df.drop(to_drop, axis='columns')

###  CMPLNT_NUM         object   

In [12]:
# Remove any non-numeric characters from the CMPLNT_NUM attribute
df['CMPLNT_NUM'] = df['CMPLNT_NUM'].str.replace(r'\D', '', regex=True)

In [13]:
df['CMPLNT_NUM'] = pd.to_numeric(df['CMPLNT_NUM'], errors="coerce")
df['CMPLNT_NUM'] = np.abs(df['CMPLNT_NUM'].astype('int32'))
df = df[~df['CMPLNT_NUM'].isna()]
# Drop cases with duplicated complaint numbers
key_cnt = df['CMPLNT_NUM'].value_counts()
key_cnt [ key_cnt>1 ]
df = df[ ~df['CMPLNT_NUM'].isin( key_cnt [ key_cnt>1 ].index.values ) ]

In [14]:
len(df)

9489700

### CMPLNT_FR_DT       object
### CMPLNT_FR_TM       object
### CMPLNT_TO_DT       object
### CMPLNT_TO_TM       object

In [15]:
# There are a few rows that contain year 1015, 1016, ... that trigger an error during date conversion
# We replace all years written as 10XX with 20XX
# Note the usage of regular expressions
df.CMPLNT_FR_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', inplace = True, regex=True )
df.CMPLNT_TO_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', inplace = True, regex=True )

# Similarly, a few hours are written as 24:00:00, which also triggers errors.
# We fix these hours
df.CMPLNT_FR_TM.replace(to_replace = '24:00:00', value='00:00:00', inplace = True)
df.CMPLNT_TO_TM.replace(to_replace = '24:00:00', value='00:00:00', inplace = True)

# Convert the two separate date and time columns into single datetime columns
df['CMPLNT_FR'] = pd.to_datetime(df.CMPLNT_FR_DT + ' ' + df.CMPLNT_FR_TM, format='%m/%d/%Y %H:%M:%S', cache=True, errors="coerce")
df['CMPLNT_TO'] = pd.to_datetime(df.CMPLNT_TO_DT + ' ' + df.CMPLNT_TO_TM, format='%m/%d/%Y %H:%M:%S', cache=True, errors="coerce")

# These columns are redundant
to_drop = ['CMPLNT_FR_DT','CMPLNT_TO_DT','CMPLNT_FR_TM','CMPLNT_TO_TM']
# We created the CMPLNT_FR and CMPLNT_TO columns
# We have the longitude and latitude so the other coordinates are not needed
df = df.drop(to_drop, axis='columns')

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.CMPLNT_FR_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', inplace = True, regex=True )
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.CMPLNT_TO_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', inplace = True, regex=True )
The be

In [16]:
df.CMPLNT_FR.isnull().sum()

np.int64(702)

In [17]:
df.CMPLNT_TO.isnull().sum()

np.int64(1849305)

In [18]:
df = df [ ~df.CMPLNT_FR.isnull() ]

###  ADDR_PCT_CD        object

In [19]:

df.ADDR_PCT_CD.replace(to_replace = '-99', value='99', inplace = True)
# df = df [ ~df.ADDR_PCT_CD.isnull() ]
df.ADDR_PCT_CD = pd.Categorical(df.ADDR_PCT_CD)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.ADDR_PCT_CD.replace(to_replace = '-99', value='99', inplace = True)


###  RPT_DT             object

In [20]:


# Convert RPT_DT to date
df.RPT_DT = pd.to_datetime(df.RPT_DT, format="%m/%d/%Y", cache=True)

### 7   KY_CD              object
### 8   OFNS_DESC          object

In [21]:
df.KY_CD.value_counts()

Unnamed: 0_level_0,count
KY_CD,Unnamed: 1_level_1
341,1666438
578,1272889
344,998208
109,831495
351,732692
...,...
460,16
357,15
123,7
362,5


In [22]:
df.OFNS_DESC.replace(to_replace = 'KIDNAPPING', value='KIDNAPPING & RELATED OFFENSES', inplace = True)
df.OFNS_DESC.replace(to_replace = 'KIDNAPPING AND RELATED OFFENSES', value='KIDNAPPING & RELATED OFFENSES', inplace = True)

df.OFNS_DESC.replace(to_replace = 'AGRICULTURE & MRKTS LAW-UNCLASSIFIED', value='OTHER STATE LAWS (NON PENAL LAW)', inplace = True)
df.OFNS_DESC.replace(to_replace = 'OTHER STATE LAWS (NON PENAL LA', value='OTHER STATE LAWS (NON PENAL LAW)', inplace = True)

df.OFNS_DESC.replace(to_replace = 'ENDAN WELFARE INCOMP', value='OFFENSES RELATED TO CHILDREN', inplace = True)

df.OFNS_DESC.replace(to_replace = 'THEFT OF SERVICES', value='OTHER OFFENSES RELATED TO THEF', inplace = True)

df.OFNS_DESC.replace(to_replace = 'NYS LAWS-UNCLASSIFIED VIOLATION', value='OTHER STATE LAWS', inplace = True)

df.OFNS_DESC.replace(to_replace = 'FELONY SEX CRIMES', value='SEX CRIMES', inplace = True)

df.loc[df.KY_CD=='120','OFNS_DESC'] ='CHILD ABANDONMENT/NON SUPPORT'

df.loc[df.KY_CD=='125','OFNS_DESC'] ='NYS LAWS-UNCLASSIFIED FELONY'

offenses = df[ ["KY_CD", "OFNS_DESC"] ].drop_duplicates().dropna()
offenses['KY_CD'] = pd.Categorical(pd.to_numeric(offenses['KY_CD'] ).astype(int))
offenses = offenses.set_index("KY_CD")
offenses = offenses.sort_index()
offenses = offenses.reset_index()
offenses


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.OFNS_DESC.replace(to_replace = 'KIDNAPPING', value='KIDNAPPING & RELATED OFFENSES', inplace = True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.OFNS_DESC.replace(to_replace = 'KIDNAPPING AND RELATED OFFENSES', value='KIDNAPPING & RELATED OFFENSES', inplace = True)
The b

Unnamed: 0,KY_CD,OFNS_DESC
0,101,MURDER & NON-NEGL. MANSLAUGHTER
1,102,HOMICIDE-NEGLIGENT-VEHICLE
2,103,"HOMICIDE-NEGLIGENT,UNCLASSIFIE"
3,104,RAPE
4,105,ROBBERY
...,...,...
71,676,NEW YORK CITY HEALTH CODE
72,677,OTHER STATE LAWS
73,678,MISCELLANEOUS PENAL LAW
74,685,ADMINISTRATIVE CODES


In [23]:
df.KY_CD = pd.Categorical(df.KY_CD)

In [24]:
df = df.drop('OFNS_DESC', axis='columns')

### 9   PD_CD              object
### 10  PD_DESC            object

In [25]:



df.loc[df.PD_CD=='694','PD_DESC'] ='INCEST'

df.loc[df.PD_CD=='234','PD_DESC'] ='BURGLARY,UNKNOWN TIME'

internal = df[ ["PD_CD", "PD_DESC"] ].drop_duplicates().dropna()
internal['PD_CD'] = pd.Categorical(pd.to_numeric(internal['PD_CD'] ).astype(int))
internal = internal.set_index("PD_CD")
internal = internal.sort_index()
internal = internal.reset_index()
internal

Unnamed: 0,PD_CD,PD_DESC
0,100,STALKING COMMIT SEX OFFENSE
1,101,ASSAULT 3
2,102,ASSAULT SCHOOL SAFETY AGENT
3,103,ASSAULT TRAFFIC AGENT
4,104,VEHICULAR ASSAULT (INTOX DRIVE
...,...,...
438,917,LEAVING THE SCENE OF AN ACCIDENT (SPI)
439,918,RECKLESS DRIVING
440,922,"TRAFFIC,UNCLASSIFIED MISDEMEAN"
441,969,"TRAFFIC,UNCLASSIFIED INFRACTIO"


In [26]:
df.PD_CD.isnull().sum()

np.int64(7957)

In [27]:
# df = df[~df.PD_CD.isnull()]

In [28]:
df.PD_CD = pd.Categorical(df.PD_CD)

In [29]:
df = df.drop('PD_DESC', axis='columns')

### 11  CRM_ATPT_CPTD_CD   object

In [30]:
df.CRM_ATPT_CPTD_CD.value_counts()

Unnamed: 0_level_0,count
CRM_ATPT_CPTD_CD,Unnamed: 1_level_1
COMPLETED,9332827
ATTEMPTED,156003


In [31]:
df.CRM_ATPT_CPTD_CD = pd.Categorical(df.CRM_ATPT_CPTD_CD)

In [32]:
df.CRM_ATPT_CPTD_CD.isnull().sum()

np.int64(168)

In [33]:
df = df [ ~df.CRM_ATPT_CPTD_CD.isnull() ]


### 12  LAW_CAT_CD         object

In [34]:
df.LAW_CAT_CD.value_counts()

Unnamed: 0_level_0,count
LAW_CAT_CD,Unnamed: 1_level_1
MISDEMEANOR,5215026
FELONY,2979507
VIOLATION,1294297


In [35]:
df.LAW_CAT_CD = pd.Categorical(df.LAW_CAT_CD)

### 16  JURIS_DESC         object
### 17  JURISDICTION_CODE  object

In [36]:
df.JURISDICTION_CODE.isnull().sum()

np.int64(0)

In [37]:
# df = df[ ~df.JURISDICTION_CODE.isnull() ]

jusridiction = df[ ["JURISDICTION_CODE", "JURIS_DESC", ] ].drop_duplicates().dropna()
jusridiction['JURISDICTION_CODE'] = pd.to_numeric(jusridiction['JURISDICTION_CODE'] )
jusridiction['JURISDICTION_CODE'] = jusridiction['JURISDICTION_CODE'].astype(int)
jusridiction = jusridiction.set_index("JURISDICTION_CODE")
jusridiction = jusridiction.sort_index()
jusridiction = jusridiction.reset_index()
jusridiction

Unnamed: 0,JURISDICTION_CODE,JURIS_DESC
0,0,N.Y. POLICE DEPT
1,1,N.Y. TRANSIT POLICE
2,2,N.Y. HOUSING POLICE
3,3,PORT AUTHORITY
4,4,TRI-BORO BRDG TUNNL
5,6,LONG ISLAND RAILRD
6,7,AMTRACK
7,8,CONRAIL
8,9,STATN IS RAPID TRANS
9,11,N.Y. STATE POLICE


In [38]:
df.JURISDICTION_CODE = pd.Categorical(df.JURISDICTION_CODE)


In [39]:
df = df.drop('JURIS_DESC', axis='columns')

###  13  BORO_NM            object

In [40]:
df.BORO_NM.value_counts()

Unnamed: 0_level_0,count
BORO_NM,Unnamed: 1_level_1
BROOKLYN,2776696
MANHATTAN,2287458
BRONX,2053705
QUEENS,1928168
STATEN ISLAND,434143


In [41]:
# df.BORO_NM.replace(to_replace = '(null)', value=np.nan, inplace = True)

In [42]:
df.BORO_NM.isnull().sum()

np.int64(8660)

In [43]:
df = df[~df.BORO_NM.isnull()]

In [44]:
df.BORO_NM = pd.Categorical(df.BORO_NM)

### 23  SUSP_AGE_GROUP     object
### 32  VIC_AGE_GROUP      object

In [45]:
# Both columns have a lot of noisy entries. We keep only the dominant groups, and also define an order
df.SUSP_AGE_GROUP = pd.Categorical(df.SUSP_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
df.VIC_AGE_GROUP = pd.Categorical(df.VIC_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])


### 24  SUSP_RACE          object
### 25  SUSP_SEX           object

### 33  VIC_RACE           object
### 34  VIC_SEX            object

In [46]:
df.VIC_SEX.isnull().sum()

np.int64(305)

In [47]:
df.VIC_SEX.value_counts()

Unnamed: 0_level_0,count
VIC_SEX,Unnamed: 1_level_1
F,3685949
M,3154831
E,1378381
D,1250660
L,10040
U,4


In [48]:
df = df[~df.VIC_SEX.isnull()]

df.VIC_SEX.replace(to_replace = 'U', value=np.nan, inplace = True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.VIC_SEX.replace(to_replace = 'U', value=np.nan, inplace = True)


In [49]:
df.VIC_RACE.isnull().sum()

np.int64(325)

In [50]:
df.VIC_RACE.value_counts()

Unnamed: 0_level_0,count
VIC_RACE,Unnamed: 1_level_1
UNKNOWN,3092102
BLACK,2282912
WHITE HISPANIC,1563915
WHITE,1562440
ASIAN / PACIFIC ISLANDER,593950
BLACK HISPANIC,342196
AMERICAN INDIAN/ALASKAN NATIVE,41994
OTHER,31


In [51]:
df.VIC_RACE.replace(to_replace = 'OTHER', value='UNKNOWN', inplace = True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.VIC_RACE.replace(to_replace = 'OTHER', value='UNKNOWN', inplace = True)


In [52]:
df.VIC_RACE.replace(to_replace = np.nan, value='UNKNOWN', inplace = True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.VIC_RACE.replace(to_replace = np.nan, value='UNKNOWN', inplace = True)


In [53]:
# df = df[~df.VIC_RACE.isnull()]

In [54]:
df.SUSP_SEX.value_counts()

Unnamed: 0_level_0,count
SUSP_SEX,Unnamed: 1_level_1
M,3447548
U,1110625
F,1044163


In [133]:
# U is unknown, same is NULL.
df.SUSP_SEX = df.SUSP_SEX.replace(to_replace = 'U', value=np.nan)
df.SUSP_SEX = df.SUSP_SEX.replace(to_replace = '(null)', value=np.nan)
df.SUSP_SEX = df.SUSP_SEX.replace(to_replace = 'UNKNOWN', value=np.nan)
# Very small amount of OTHER values
df.SUSP_RACE = df.SUSP_RACE.replace(to_replace = 'OTHER', value='UNKNOWN')
df.SUSP_RACE = df.SUSP_RACE.replace(to_replace = np.nan, value='UNKNOWN')


  df.SUSP_SEX = df.SUSP_SEX.replace(to_replace = 'UNKNOWN', value=np.nan)


In [56]:
df.SUSP_RACE.value_counts()

Unnamed: 0_level_0,count
SUSP_RACE,Unnamed: 1_level_1
UNKNOWN,5316341
BLACK,2102782
WHITE HISPANIC,965713
WHITE,584859
BLACK HISPANIC,301749
ASIAN / PACIFIC ISLANDER,192358
AMERICAN INDIAN/ALASKAN NATIVE,16063


In [57]:
df.SUSP_RACE = pd.Categorical(df.SUSP_RACE)
df.SUSP_SEX = pd.Categorical(df.SUSP_SEX)
df.VIC_RACE = pd.Categorical(df.VIC_RACE)
df.VIC_SEX = pd.Categorical(df.VIC_SEX)

In [58]:
df.dtypes

Unnamed: 0,0
CMPLNT_NUM,int32
ADDR_PCT_CD,category
RPT_DT,datetime64[ns]
KY_CD,category
PD_CD,category
CRM_ATPT_CPTD_CD,category
LAW_CAT_CD,category
BORO_NM,category
LOC_OF_OCCUR_DESC,object
PREM_TYP_DESC,object


###  14  LOC_OF_OCCUR_DESC  object

In [59]:
df.LOC_OF_OCCUR_DESC.value_counts()

Unnamed: 0_level_0,count
LOC_OF_OCCUR_DESC,Unnamed: 1_level_1
INSIDE,4854593
FRONT OF,2250728
OPPOSITE OF,236124
REAR OF,189572
OUTSIDE,4895


In [60]:
df.LOC_OF_OCCUR_DESC.replace(to_replace = '(null)', value=np.nan, inplace = True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.LOC_OF_OCCUR_DESC.replace(to_replace = '(null)', value=np.nan, inplace = True)


In [61]:
df.LOC_OF_OCCUR_DESC.isnull().sum()

np.int64(1943953)

In [62]:
df.LOC_OF_OCCUR_DESC = pd.Categorical(df.LOC_OF_OCCUR_DESC)

### Latitude                     object
### Longitude                    object

In [63]:
# !sudo apt-get update
!sudo apt-get install python3-rtree
!sudo pip3 install geopandas descartes shapely ngram # matplotlib==3.1.3

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libspatialindex-c6 libspatialindex-dev libspatialindex6
The following NEW packages will be installed:
  libspatialindex-c6 libspatialindex-dev libspatialindex6 python3-rtree
0 upgraded, 4 newly installed, 0 to remove and 35 not upgraded.
Need to get 365 kB of archives.
After this operation, 1,799 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libspatialindex6 amd64 1.9.3-2 [247 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libspatialindex-c6 amd64 1.9.3-2 [55.8 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libspatialindex-dev amd64 1.9.3-2 [16.0 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 python3-rtree all 0.9.7-1 [46.4 kB]
Fetched 365 kB in 2s (195 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog

In [64]:
import geopandas as gpd

In [65]:
df.Latitude = pd.to_numeric(df.Latitude, downcast='float')
df.Longitude  = pd.to_numeric(df.Longitude, downcast='float')

In [66]:
%%time
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))

CPU times: user 8.68 s, sys: 1.42 s, total: 10.1 s
Wall time: 9.98 s


In [67]:
shapefile_url = 'https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/NYC_Neighborhood_Tabulation_Areas_2020/FeatureServer/0/query?where=1=1&outFields=*&outSR=4326&f=pgeojson'
df_nyc = gpd.GeoDataFrame.from_file(shapefile_url)

In [68]:
df_nyc = df_nyc.to_crs(4326)

In [69]:
# !mkdir -p maps
# !curl -s https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/nynta2010_23a.zip  -o maps/nynta2010_23a.zip
# !cd maps && unzip -o nynta2010_23a.zip
# shapefile = f"maps/nynta2010_23a/nynta2010.shp"
# df_nyc = gpd.GeoDataFrame.from_file(shapefile)
# df_nyc = df_nyc.to_crs(4326)

In [70]:
df_nyc

Unnamed: 0,OBJECTID,BoroCode,BoroName,CountyFIPS,NTA2020,NTAName,NTAAbbrev,NTAType,CDTA2020,CDTAName,Shape__Area,Shape__Length,geometry
0,1,3,Brooklyn,047,BK0101,Greenpoint,Grnpt,0,BK01,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),3.532181e+07,28919.560811,"POLYGON ((-73.93214 40.72817, -73.93238 40.728..."
1,2,3,Brooklyn,047,BK0102,Williamsburg,Wllmsbrg,0,BK01,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),2.885285e+07,28134.082324,"POLYGON ((-73.95814 40.72441, -73.95772 40.724..."
2,3,3,Brooklyn,047,BK0103,South Williamsburg,SWllmsbrg,0,BK01,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),1.520896e+07,18250.280543,"POLYGON ((-73.95024 40.70548, -73.94984 40.705..."
3,4,3,Brooklyn,047,BK0104,East Williamsburg,EWllmsbrg,0,BK01,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),5.226741e+07,43184.798988,"POLYGON ((-73.92406 40.71412, -73.92404 40.714..."
4,5,3,Brooklyn,047,BK0201,Brooklyn Heights,BkHts,0,BK02,BK02 Downtown Brooklyn-Fort Greene (CD 2 Appro...,9.982322e+06,14312.504975,"POLYGON ((-73.99237 40.6897, -73.99436 40.6902..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
257,258,5,Staten Island,085,SI0391,Freshkills Park (South),FrshklPK_S,9,SI03,SI03 South Shore (CD 3 Approximation),4.775877e+07,33945.420421,"POLYGON ((-74.20059 40.57952, -74.19888 40.579..."
258,259,5,Staten Island,085,SI9561,Fort Wadsworth,FtWdswrth,6,SI95,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,9.867249e+06,14814.414741,"POLYGON ((-74.05975 40.59386, -74.06014 40.594..."
259,260,5,Staten Island,085,SI9591,Hoffman & Swinburne Islands,HffmnIsl,9,SI95,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,6.357020e+05,4743.128127,"MULTIPOLYGON (((-74.05314 40.57771, -74.05406 ..."
260,261,5,Staten Island,085,SI9592,Miller Field,MllrFld,9,SI95,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,1.086680e+07,19197.200973,"POLYGON ((-74.08469 40.57149, -74.08595 40.570..."


In [71]:
%%time
# Match each accident with a neighborhood.
# Will take ~30 mins to run
# This is done with left join,
# so we preserve all the data points
# but we know which ones are not matching with the shapefile
gdf.crs = df_nyc.crs
gdf = gpd.sjoin(gdf, df_nyc, how='left')


CPU times: user 59.1 s, sys: 2.65 s, total: 1min 1s
Wall time: 1min 1s


In [72]:
gdf.dtypes

Unnamed: 0,0
CMPLNT_NUM,int32
ADDR_PCT_CD,category
RPT_DT,datetime64[ns]
KY_CD,category
PD_CD,category
CRM_ATPT_CPTD_CD,category
LAW_CAT_CD,category
BORO_NM,category
LOC_OF_OCCUR_DESC,category
PREM_TYP_DESC,object


In [73]:
# We keep only boro_name and ntaname
todrop = [
    'index_right', 'BoroCode', 'CountyFIPS', 'OBJECTID', 'NTAAbbrev', 'NTAType', 'CDTA2020', 'CDTAName',
    'Shape__Area', 'Shape__Length'
]

gdf = gdf.drop(todrop, axis='columns')

# Rename the columns
gdf = gdf.rename({
    'BoroName': 'BOROUGH',
    'NTAName': 'NEIGHBORHOOD',
    'NTA2020': 'NEIGHBORHOOD_CODE',
}, axis='columns')

In [74]:
gdf['BOROUGH'] = gdf['BOROUGH'].str.upper()

In [75]:
# Mark as NULL all the lon/lat entries outside the NYC area
gdf.loc[gdf.BOROUGH.isnull(), 'Latitude'] = None
gdf.loc[gdf.BOROUGH.isnull(), 'Longitude'] = None

In [76]:
# Mark as NULL all the lon/lat entries that generate inconsistencies
mask = gdf.query('BOROUGH != BORO_NM and Latitude==Latitude and Longitude==Longitude').CMPLNT_NUM.values
condition = gdf.CMPLNT_NUM.isin(mask)

gdf.loc[condition, 'Latitude'] = None
gdf.loc[condition, 'Longitude'] = None

In [77]:
# Delete the cases where the reported and detected boroughs are different
# to_delete = sorted(set(to_delete))
# gdf = gdf [ ~gdf.CMPLNT_NUM.isin(to_delete) ]

In [78]:
gdf = gdf.drop('geometry', axis='columns')

In [79]:
df = pd.DataFrame(gdf)

In [80]:
df.BORO_NM.value_counts()

Unnamed: 0_level_0,count
BORO_NM,Unnamed: 1_level_1
BROOKLYN,2776589
MANHATTAN,2287398
BRONX,2053630
QUEENS,1928116
STATEN ISLAND,434132


In [81]:
#Temporarily, we drop these. We will add them back in the future
# df.drop(['BOROUGH','NEIGHBORHOOD','NEIGHBORHOOD_CODE'], axis='columns', inplace=True)

In [82]:
df = df[df.BOROUGH == df.BORO_NM]

In [83]:
df.drop(['BOROUGH'], axis='columns', inplace=True)
df.NEIGHBORHOOD_CODE = pd.Categorical(df.NEIGHBORHOOD_CODE)
df.NEIGHBORHOOD = pd.Categorical(df.NEIGHBORHOOD)

### TRANSIT_DISTRICT

In [84]:
df.TRANSIT_DISTRICT.value_counts()


Unnamed: 0_level_0,count
TRANSIT_DISTRICT,Unnamed: 1_level_1
4,34002
2,26477
1,22332
33,21389
3,20893
20,20890
12,18107
11,16680
32,15676
30,14102


In [85]:
len(df) - df.TRANSIT_DISTRICT.isnull().sum()

np.int64(224019)

In [86]:
df.drop('TRANSIT_DISTRICT', axis='columns', inplace=True)


### PREM_TYP_DESC

In [87]:
df.PREM_TYP_DESC.value_counts()

Unnamed: 0_level_0,count
PREM_TYP_DESC,Unnamed: 1_level_1
STREET,2959717
RESIDENCE - APT. HOUSE,2023141
RESIDENCE-HOUSE,917889
RESIDENCE - PUBLIC HOUSING,689667
CHAIN STORE,276022
...,...
CLOTHING BOUTIQUE,4
DOCTOR/DENTIST,2
PHOTO/COPY STORE,2
CHECK CASH,1


In [88]:
df.PREM_TYP_DESC.isnull().sum()

np.int64(51778)

In [89]:
df = df [~df.PREM_TYP_DESC.isnull()]

In [90]:
df.PREM_TYP_DESC = pd.Categorical(df.PREM_TYP_DESC)

In [91]:
df.PARKS_NM.value_counts()

Unnamed: 0_level_0,count
PARKS_NM,Unnamed: 1_level_1
CENTRAL PARK,2539
FLUSHING MEADOWS CORONA PARK,2187
WASHINGTON SQUARE PARK,1824
CONEY ISLAND BEACH & BOARDWALK,1581
RIVERSIDE PARK,805
...,...
CHILDREN'S MAGICAL GARDEN,1
ST. MARY'S PARK PLAYGROUND BROOKLYN,1
VALENTINO PIER,1
AESOP PARK,1


In [92]:
df.PARKS_NM.value_counts().sum()

np.int64(40392)

In [93]:
df.drop('PARKS_NM', axis='columns', inplace=True)



 19  HADEVELOPT         object


In [94]:
df.HADEVELOPT.value_counts()

Unnamed: 0_level_0,count
HADEVELOPT,Unnamed: 1_level_1
INGERSOLL,4839
WALD,2817
NOSTRAND,2567
WILLIAMSBURG,2557
RIIS,2102
MARLBORO,2055
MANHATTANVILLE,2010
GRANT,1993
SHEEPSHEAD BAY,1841
MARBLE HILL,1733


In [95]:
df.drop('HADEVELOPT', axis='columns', inplace=True)


 20  HOUSING_PSA        object



In [96]:
df.HOUSING_PSA.value_counts()

Unnamed: 0_level_0,count
HOUSING_PSA,Unnamed: 1_level_1
670,9724
887,9690
720,8371
845,8288
632,7905
...,...
73638,1
73474,1
60863,1
64967,1


In [97]:
df.HOUSING_PSA.value_counts().sum()

np.int64(703514)

In [98]:
df.drop('HOUSING_PSA', axis='columns', inplace=True)

 30  PATROL_BORO        object


In [99]:
df.PATROL_BORO.value_counts()

Unnamed: 0_level_0,count
PATROL_BORO,Unnamed: 1_level_1
PATROL BORO BRONX,2038256
PATROL BORO BKLYN SOUTH,1385590
PATROL BORO BKLYN NORTH,1375151
PATROL BORO MAN SOUTH,1156163
PATROL BORO MAN NORTH,1117858
PATROL BORO QUEENS NORTH,996362
PATROL BORO QUEENS SOUTH,915980
PATROL BORO STATEN ISLAND,431864


In [100]:
df.PATROL_BORO = pd.Categorical(df.PATROL_BORO)

In [101]:
df.PATROL_BORO.isnull().sum()

np.int64(159)

In [102]:
df = df[~df.PATROL_BORO.isnull()]

 31  STATION_NAME       object

In [103]:
df.STATION_NAME.value_counts()

Unnamed: 0_level_0,count
STATION_NAME,Unnamed: 1_level_1
125 STREET,9798
14 STREET,5524
42 ST.-PORT AUTHORITY BUS TERM,5373
34 ST.-PENN STATION,4779
42 ST.-TIMES SQUARE,4245
...,...
MYRTLE/WYCKOFF AVENUES,22
DISTRICT 12 OFFICE,21
DISTRICT 34 OFFICE,18
OFF-SYSTEM,8


In [104]:
df.STATION_NAME.replace(to_replace = '(null)', value=np.nan, inplace = True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.STATION_NAME.replace(to_replace = '(null)', value=np.nan, inplace = True)


In [105]:
df.STATION_NAME.isnull().sum()



np.int64(9195615)

In [106]:
df.drop('STATION_NAME', axis='columns', inplace=True)

In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9417224 entries, 0 to 9491945
Data columns (total 24 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datetime64

## Data exploration

In this part we check the different values that appear in the columns. When we detect noisy results, we delete the corresponding values. In fact, many of the operations that are performed above, in the 'data cleaning' section, are the result of observations that we make here.

In [108]:
# Find the unique values in each column
#
# df.describe(include = [np.object, 'category']).T['unique']
unique = df.describe(include = 'all').T['unique'].sort_values()

In [109]:
unique

Unnamed: 0,unique
CRM_ATPT_CPTD_CD,2.0
LAW_CAT_CD,3.0
SUSP_SEX,3.0
VIC_SEX,5.0
BORO_NM,5.0
LOC_OF_OCCUR_DESC,5.0
VIC_AGE_GROUP,5.0
SUSP_AGE_GROUP,5.0
VIC_RACE,7.0
SUSP_RACE,7.0


In [110]:
for column in unique.index:
    if unique[column] < 200:
        print(df[column].value_counts())
        print("=====")

CRM_ATPT_CPTD_CD
COMPLETED    9262109
ATTEMPTED     155115
Name: count, dtype: int64
=====
LAW_CAT_CD
MISDEMEANOR    5174298
FELONY         2955968
VIOLATION      1286958
Name: count, dtype: int64
=====
SUSP_SEX
M          3420884
UNKNOWN    1104017
F          1037611
Name: count, dtype: int64
=====
VIC_SEX
F    3668878
M    3135422
E    1359667
D    1243863
L       9393
Name: count, dtype: int64
=====
BORO_NM
BROOKLYN         2760745
MANHATTAN        2275381
BRONX            2038242
QUEENS           1910991
STATEN ISLAND     431865
Name: count, dtype: int64
=====
LOC_OF_OCCUR_DESC
INSIDE         4829011
FRONT OF       2237252
OPPOSITE OF     234986
REAR OF         188581
OUTSIDE           1459
Name: count, dtype: int64
=====
VIC_AGE_GROUP
25-44    3144851
45-64    1630122
18-24     941846
<18       436475
65+       354238
Name: count, dtype: int64
=====
SUSP_AGE_GROUP
25-44    1784140
18-24     643998
45-64     621211
<18       215004
65+        58326
Name: count, dtype: int64
=====
V

In [111]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9417224 entries, 0 to 9491945
Data columns (total 24 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datetime64

In [112]:
df['PREM_TYP_DESC'].value_counts().count()

np.int64(94)

In [113]:
# All columns, except for the dates and spatial coordinates, are categorical
# Columns with less than a few thousand unique values are good candidates
# for ENUMs in the database given that the dataset is static.
# Also, in Pandas the internal representation becomes much more efficient
# as the Categoricals are stored as integers and not as strings
for column in unique.index:
    if column == 'RPT_DT':
        continue
    if df[column].value_counts().count() < 1000:
      df[column] = pd.Categorical(df[column])

In [114]:
# With all the proper data typing the dataset went down in size from 1.9Gb+ to 425Mb.
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9417224 entries, 0 to 9491945
Data columns (total 24 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datetime64

In [115]:
df.memory_usage()

Unnamed: 0,0
Index,75337792
CMPLNT_NUM,37668896
ADDR_PCT_CD,9419952
RPT_DT,75337792
KY_CD,9419928
PD_CD,18854544
CRM_ATPT_CPTD_CD,9417348
LAW_CAT_CD,9417356
BORO_NM,9417436
LOC_OF_OCCUR_DESC,9417436


In [116]:
df.dtypes

Unnamed: 0,0
CMPLNT_NUM,int32
ADDR_PCT_CD,category
RPT_DT,datetime64[ns]
KY_CD,category
PD_CD,category
CRM_ATPT_CPTD_CD,category
LAW_CAT_CD,category
BORO_NM,category
LOC_OF_OCCUR_DESC,category
PREM_TYP_DESC,category


## Storing in a MySQL database

In [117]:
!sudo pip3 install -U -q PyMySQL sqlalchemy

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/45.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.0/45.0 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [118]:
import os
from sqlalchemy import create_engine
from sqlalchemy import text

conn_string = 'mysql+pymysql://{user}:{password}@{host}/?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org',
    user = 'root',
    password = mysql_pass,
    encoding = 'utf8mb4')

engine = create_engine(conn_string)


In [119]:
# Query to create a database
db_name = 'nypd'

sql = f"DROP DATABASE IF EXISTS {db_name}"
with engine.connect() as connection:
  connection.execute(text(sql))

# Create a database
sql = f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8mb4'"
with engine.connect() as connection:
  connection.execute(text(sql))


In [120]:

# And lets switch to the database
sql = f"USE {db_name}"
with engine.connect() as connection:
  connection.execute(text(sql))


In [121]:
df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].str.replace('\'', '’', regex=False)

In [122]:
NEIGHBORHOOD_enum = "ENUM('" + ("','".join(sorted(df.NEIGHBORHOOD.astype(str).unique()))) + "')"


In [123]:
print(NEIGHBORHOOD_enum)

ENUM('Allerton','Alley Pond Park','Annadale-Huguenot-Prince’s Bay-Woodrow','Arden Heights-Rossville','Astoria (Central)','Astoria (East)-Woodside (North)','Astoria (North)-Ditmars-Steinway','Astoria Park','Auburndale','Baisley Park','Barren Island-Floyd Bennett Field','Bath Beach','Bay Ridge','Bay Terrace-Clearview','Bayside','Bedford Park','Bedford-Stuyvesant (East)','Bedford-Stuyvesant (West)','Bellerose','Belmont','Bensonhurst','Borough Park','Breezy Point-Belle Harbor-Rockaway Park-Broad Channel','Brighton Beach','Bronx Park','Brooklyn Heights','Brooklyn Navy Yard','Brownsville','Bushwick (East)','Bushwick (West)','Calvary & Mount Zion Cemeteries','Calvert Vaux Park','Cambria Heights','Canarsie','Canarsie Park & Pier','Carroll Gardens-Cobble Hill-Gowanus-Red Hook','Castle Hill-Unionport','Central Park','Chelsea-Hudson Yards','Chinatown-Two Bridges','Claremont Park','Claremont Village-Claremont (East)','Clinton Hill','Co-op City','College Point','Concourse-Concourse Village','Coney 

In [124]:
NCODE_enum = "ENUM('" + ("','".join(sorted(df.NEIGHBORHOOD_CODE.astype(str).unique()))) + "')"

In [125]:
# In principle, we can let Pandas create the table, but we want to be a bit more predise
# with the data types, and we want to add documentation for each column
# from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i


sql = f'''
CREATE TABLE nypd (
  CMPLNT_NUM int,
  CMPLNT_FR datetime,
  CMPLNT_TO datetime,
  RPT_DT date,
  KY_CD SMALLINT,
  PD_CD SMALLINT,
  CRM_ATPT_CPTD_CD enum('COMPLETED','ATTEMPTED'),
  LAW_CAT_CD enum('FELONY','MISDEMEANOR','VIOLATION'),
  JURISDICTION_CODE SMALLINT,
  BORO_NM enum('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN ISLAND'),
  NEIGHBORHOOD {NEIGHBORHOOD_enum},
  NEIGHBORHOOD_CODE {NCODE_enum},
  ADDR_PCT_CD SMALLINT,
  LOC_OF_OCCUR_DESC enum('FRONT OF','INSIDE','OPPOSITE OF','OUTSIDE','REAR OF'),
  PATROL_BORO enum('PATROL BORO BRONX', 'PATROL BORO BKLYN SOUTH','PATROL BORO BKLYN NORTH','PATROL BORO MAN SOUTH','PATROL BORO MAN NORTH','PATROL BORO QUEENS NORTH','PATROL BORO QUEENS SOUTH','PATROL BORO STATEN ISLAND'),
  PREM_TYP_DESC varchar(30),
  SUSP_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  VIC_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  SUSP_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  VIC_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  SUSP_SEX enum('M', 'F'),
  VIC_SEX enum('M', 'F', 'E', 'D', 'L'),
  Latitude double,
  Longitude double,
  PRIMARY KEY (CMPLNT_NUM)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))

In [127]:
df.SUSP_SEX.value_counts()

Unnamed: 0_level_0,count
SUSP_SEX,Unnamed: 1_level_1
M,3420884
UNKNOWN,1104017
F,1037611


In [134]:
# Create a table
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    # print("Batch:",i)
    # continue # Cannot execute this on Travis
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False,
        chunksize = 1000)

100%|██████████| 189/189 [41:43<00:00, 13.25s/it]


In [132]:
df.iloc[217:220]

Unnamed: 0,CMPLNT_NUM,ADDR_PCT_CD,RPT_DT,KY_CD,PD_CD,CRM_ATPT_CPTD_CD,LAW_CAT_CD,BORO_NM,LOC_OF_OCCUR_DESC,PREM_TYP_DESC,...,Latitude,Longitude,PATROL_BORO,VIC_AGE_GROUP,VIC_RACE,VIC_SEX,CMPLNT_FR,CMPLNT_TO,NEIGHBORHOOD_CODE,NEIGHBORHOOD
217,50825934,17,2008-09-03,341,338,COMPLETED,MISDEMEANOR,MANHATTAN,INSIDE,RESIDENCE - APT. HOUSE,...,40.755352,-73.969643,PATROL BORO MAN SOUTH,45-64,WHITE,M,2007-12-07 09:00:00,2008-09-02 23:00:00,MN0604,East Midtown-Turtle Bay
218,47797253,81,2008-06-27,106,109,COMPLETED,FELONY,BROOKLYN,FRONT OF,STREET,...,40.685268,-73.92952,PATROL BORO BKLYN NORTH,25-44,BLACK,M,2008-06-27 08:59:00,NaT,BK0302,Bedford-Stuyvesant (East)
219,54699419,23,2008-12-06,235,567,COMPLETED,MISDEMEANOR,MANHATTAN,,STREET,...,40.796074,-73.943481,PATROL BORO MAN NORTH,,UNKNOWN,E,2008-12-06 21:00:00,NaT,MN1101,East Harlem (South)




In [135]:
sql = "CREATE INDEX ix_lat ON nypd.nypd(Latitude)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [136]:
sql = "CREATE INDEX ix_lon ON nypd.nypd(Longitude)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [137]:
sql = "CREATE INDEX ix_LAW_CAT_CD ON nypd.nypd(LAW_CAT_CD)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [138]:
sql = "CREATE INDEX ix_BORO_NM ON nypd.nypd(BORO_NM)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [139]:
sql = "CREATE INDEX ix_KY_CD ON nypd.nypd(KY_CD)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [140]:
sql = "CREATE INDEX ix_RPT_DT ON nypd.nypd(RPT_DT)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [141]:
sql = "CREATE INDEX ix_CMPLNT_FR ON nypd.nypd(CMPLNT_FR)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [142]:
offenses = offenses[offenses.OFNS_DESC != "(null)"]

In [152]:
offenses = offenses.groupby('KY_CD', observed=False).first()['OFNS_DESC']

In [156]:
offenses = offenses.reset_index()

In [153]:
# offenses.drop(39,inplace=True)

In [157]:
sql = "DROP TABLE IF EXISTS offense_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE offense_codes (
  KY_CD smallint,
  OFNS_DESC varchar(32),
  PRIMARY KEY (KY_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))

offenses.to_sql(
        name = 'offense_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

75

In [158]:
sql = "DROP TABLE IF EXISTS jurisdiction_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE jurisdiction_codes (
  JURISDICTION_CODE smallint,
  JURIS_DESC varchar(40),
  PRIMARY KEY (JURISDICTION_CODE)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))


jusridiction.to_sql(
        name = 'jurisdiction_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

28

In [None]:
internal.PD_DESC.str.len().max()


In [None]:
internal = internal.query("PD_DESC !=	'CRIMINAL DISPOSAL FIREARM 1 &' ")
internal = internal.query("PD_DESC !=	'UNFINSH FRAME 2' ")
internal = internal.query("PD_DESC !=	'WEAPONS POSSESSION 1 & 2' ")
internal = internal.query("PD_DESC !=	'CRIM POS WEAP 4' ")


In [159]:
internal

Unnamed: 0,PD_CD,PD_DESC
0,100,STALKING COMMIT SEX OFFENSE
1,101,ASSAULT 3
2,102,ASSAULT SCHOOL SAFETY AGENT
3,103,ASSAULT TRAFFIC AGENT
4,104,VEHICULAR ASSAULT (INTOX DRIVE
...,...,...
438,917,LEAVING THE SCENE OF AN ACCIDENT (SPI)
439,918,RECKLESS DRIVING
440,922,"TRAFFIC,UNCLASSIFIED MISDEMEAN"
441,969,"TRAFFIC,UNCLASSIFIED INFRACTIO"


In [160]:
sql = "DROP TABLE IF EXISTS penal_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE penal_codes (
  PD_CD smallint,
  PD_DESC varchar(80),
  PRIMARY KEY (PD_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))


internal.to_sql(
        name = 'penal_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

443

In [164]:
internal

Unnamed: 0,PD_CD,PD_DESC
0,100,STALKING COMMIT SEX OFFENSE
1,101,ASSAULT 3
2,102,ASSAULT SCHOOL SAFETY AGENT
3,103,ASSAULT TRAFFIC AGENT
4,104,VEHICULAR ASSAULT (INTOX DRIVE
...,...,...
438,917,LEAVING THE SCENE OF AN ACCIDENT (SPI)
439,918,RECKLESS DRIVING
440,922,"TRAFFIC,UNCLASSIFIED MISDEMEAN"
441,969,"TRAFFIC,UNCLASSIFIED INFRACTIO"


In [161]:
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/files/65f25845-1551-4d21-91dc-869c977cd93d?download=true&filename=PDCode_PenalLaw.xlsx' -o PDCode_PenalLaw.xlsx

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  233k    0  233k    0     0   105k      0 --:--:--  0:00:02 --:--:--  105k


In [162]:
penal_code_df = pd.read_excel('PDCode_PenalLaw.xlsx')

In [165]:
penal_code_df.to_sql(
        name = 'pd_code_penal_law',
        schema = db_name,
        con = engine,
        if_exists = 'replace',
        index = False)

4671

## TODO

The fields

  
PREM_TYP_DESC    
HADEVELOPT    
PARKS_NM                     

would be better off as foreign keys or enums. They take too much space as strings.