<a href="https://colab.research.google.com/github/ipeirotis-org/datasets/blob/main/NYPD_Complaint/NYPD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NYPD Dataset

Dataset description at
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i



| Column | Description |
|--------|-------------------|
| CMPLNT_NUM |  Randomly generated persistent ID for each complaint  |  
| ADDR_PCT_CD |  The precinct in which the incident occurred |  
| BORO |  The name of the borough in which the incident occurred |  
| CMPLNT_FR_DT |  Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |  
| CMPLNT_FR_TM |  Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |  
| CMPLNT_TO_DT |  Ending date of occurrence for the reported event, if exact time of occurrence is unknown |  
| CMPLNT_TO_TM |  Ending time of occurrence for the reported event, if exact time of occurrence is unknown |  
| CRM_ATPT_CPTD_CD |  Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |  
| HADEVELOPT |  Name of NYCHA housing development of occurrence, if applicable |  
| HOUSING_PSA |  Development Level Code |  
| JURISDICTION_CODE |  Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |  
| JURIS_DESC |  Description of the jurisdiction code |  
| KY_CD |  Three digit offense classification code |  
| LAW_CAT_CD |  Level of offense: felony, misdemeanor, violation  |  
| LOC_OF_OCCUR_DESC |  Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |  
| OFNS_DESC |  Description of offense corresponding with key code |  
| PARKS_NM |  Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |  
| PATROL_BORO |  The name of the patrol borough in which the incident occurred |  
| PD_CD |  Three digit internal classification code (more granular than Key Code) |  
| PD_DESC |  Description of internal classification corresponding with PD code (more granular than Offense Description) |  
| PREM_TYP_DESC |  Specific description of premises; grocery store, residence, street, etc. |  
| RPT_DT |  Date event was reported to police  |  
| STATION_NAME |  Transit station name |  
| SUSP_AGE_GROUP |  Suspect’s Age Group |  
| SUSP_RACE |  Suspect’s Race Description |  
| SUSP_SEX |  Suspect’s Sex Description |  
| TRANSIT_DISTRICT |  Transit district in which the offense occurred. |  
| VIC_AGE_GROUP |  Victim’s Age Group |  
| VIC_RACE |  Victim’s Race Description |  
| VIC_SEX |  Victim’s Sex Description |  
| X_COORD_CD |  X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Y_COORD_CD |  Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Latitude |  Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)  |  
| Longitude |  Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |


In [1]:
!pip install -q google-cloud-secret-manager

from google.colab import auth
auth.authenticate_user()

from google.cloud import secretmanager

def access_secret_version(project_id, secret_id, version_id):
    """
    Access the payload of the given secret version and return it.

    Args:
        project_id (str): Google Cloud project ID.
        secret_id (str): ID of the secret to access.
        version_id (str): ID of the version to access.
    Returns:
        str: The secret version's payload, or None if
        the version does not exist.
    """
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")


mysql_pass = access_secret_version("nyu-datasets", "MYSQL_PASSWORD", "latest")

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/218.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m112.6/218.1 kB[0m [31m3.8 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m215.0/218.1 kB[0m [31m5.3 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m218.1/218.1 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import pandas as pd
import numpy as np

In [3]:
# We load everything as an object/string, because some data types (e.g., some IDs)
# are recognized as decimals, and it is a mess to restore them back
# So we will do all the conversions ourselves later on

# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3079M    0 3079M    0     0  4794k      0 --:--:--  0:10:57 --:--:-- 4871k


In [201]:
%%time
df = pd.read_csv('nypd.csv', low_memory = True, dtype='object')

CPU times: user 1min 31s, sys: 27.4 s, total: 1min 59s
Wall time: 1min 49s


In [202]:
len(df)

9491946

In [203]:
df = df.replace(to_replace = '(null)', value=np.nan)

In [204]:
df = df.replace(to_replace = 'UNKNOWN', value=np.nan)

In [205]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9491946 entries, 0 to 9491945
Data columns (total 35 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   CMPLNT_NUM         object
 1   CMPLNT_FR_DT       object
 2   CMPLNT_FR_TM       object
 3   CMPLNT_TO_DT       object
 4   CMPLNT_TO_TM       object
 5   ADDR_PCT_CD        object
 6   RPT_DT             object
 7   KY_CD              object
 8   OFNS_DESC          object
 9   PD_CD              object
 10  PD_DESC            object
 11  CRM_ATPT_CPTD_CD   object
 12  LAW_CAT_CD         object
 13  BORO_NM            object
 14  LOC_OF_OCCUR_DESC  object
 15  PREM_TYP_DESC      object
 16  JURIS_DESC         object
 17  JURISDICTION_CODE  object
 18  PARKS_NM           object
 19  HADEVELOPT         object
 20  HOUSING_PSA        object
 21  X_COORD_CD         object
 22  Y_COORD_CD         object
 23  SUSP_AGE_GROUP     object
 24  SUSP_RACE          object
 25  SUSP_SEX           object
 26  TRANSIT_DISTRI

## Data Cleaning

In [206]:
# These columns are redundant
to_drop = ['Lat_Lon','X_COORD_CD','Y_COORD_CD']

# We have the longitude and latitude so the other coordinates are not needed
df = df.drop(to_drop, axis='columns')

###  CMPLNT_NUM         object   

In [207]:
before = len(df)

# Remove any non-numeric characters from the CMPLNT_NUM attribute
df['CMPLNT_NUM'] = df['CMPLNT_NUM'].str.replace(r'\D', '', regex=True)

df['CMPLNT_NUM'] = pd.to_numeric(df['CMPLNT_NUM'], errors="coerce")
df['CMPLNT_NUM'] = np.abs(df['CMPLNT_NUM'].astype('int32'))

df = df[~df['CMPLNT_NUM'].isna()]
# Drop cases with duplicated complaint numbers
key_cnt = df['CMPLNT_NUM'].value_counts()
key_cnt [ key_cnt>1 ]
df = df[ ~df['CMPLNT_NUM'].isin( key_cnt [ key_cnt>1 ].index.values ) ]

after = len(df)
print(f'Removed {before - after} rows')

Removed 2246 rows


### CMPLNT_FR_DT       object
### CMPLNT_FR_TM       object
### CMPLNT_TO_DT       object
### CMPLNT_TO_TM       object

In [186]:
# CMPLNT_FR_DT_mask = df.CMPLNT_FR_DT.str.match(r'(\d\d)/(\d\d)/10(\d\d)', na=False)

# CMPLNT_TO_DT_mask = df.CMPLNT_TO_DT.str.match(r'(\d\d)/(\d\d)/10(\d\d)', na=False)

# df[CMPLNT_TO_DT_mask]

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,CMPLNT_TO_DT,CMPLNT_TO_TM,ADDR_PCT_CD,RPT_DT,KY_CD,OFNS_DESC,PD_CD,...,SUSP_RACE,SUSP_SEX,TRANSIT_DISTRICT,Latitude,Longitude,PATROL_BORO,STATION_NAME,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
4291950,170300670,10/10/2017,15:30:00,09/28/1017,,1,10/11/2017,126,MISCELLANEOUS PENAL LAW,204,...,,,,40.71448787,-74.01358485,PATROL BORO MAN SOUTH,,,,D
4662866,172812947,12/08/1017,12:00:00,12/08/1017,12:30:00,81,12/20/2017,578,HARRASSMENT 2,638,...,BLACK,M,,40.686534,-73.928285,PATROL BORO BKLYN NORTH,,45-64,BLACK,F
4789336,188736598,09/28/1018,12:46:00,09/28/1018,12:56:00,1,10/13/2018,126,MISCELLANEOUS PENAL LAW,644,...,,,,40.703815,-74.013151,PATROL BORO MAN SOUTH,,,,E
5888082,219765984,10/27/2020,09:40:00,10/15/1010,,26,10/27/2020,578,HARRASSMENT 2,638,...,BLACK HISPANIC,M,,40.814367,-73.957138,PATROL BORO MAN NORTH,,45-64,WHITE HISPANIC,F
7093917,271099041,06/28/1023,19:04:00,06/28/1023,19:15:00,72,07/10/2023,578,HARRASSMENT 2,637,...,WHITE HISPANIC,M,,40.643706,-74.011775,PATROL BORO BKLYN SOUTH,,,,F
7115480,274775227,08/29/1023,12:15:00,08/29/1023,12:19:00,72,09/21/2023,121,CRIMINAL MISCHIEF & RELATED OF,269,...,BLACK,F,,40.655237,-74.006726,PATROL BORO BKLYN SOUTH,,25-44,BLACK,F
9438812,295964434,11/05/2024,13:00:00,10/24/1024,,48,11/05/2024,109,GRAND LARCENY,414,...,,U,,40.8466,-73.887884,PATROL BORO BRONX,,65+,BLACK,M


In [208]:
# There are a few rows that contain year 1015, 1016, ... that trigger an error during date conversion
# We replace all years written as 10XX with 20XX
# Note the usage of regular expressions
df.CMPLNT_FR_DT = df.CMPLNT_FR_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True )
df.CMPLNT_TO_DT = df.CMPLNT_TO_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True )

# Similarly, a few hours are written as 24:00:00, which also triggers errors.
# We fix these hours
df.CMPLNT_FR_TM = df.CMPLNT_FR_TM.replace(to_replace = '24:00:00', value='00:00:00')
df.CMPLNT_FR_TM = df.CMPLNT_TO_TM.replace(to_replace = '24:00:00', value='00:00:00')

# Convert the two separate date and time columns into single datetime columns
df['CMPLNT_FR'] = pd.to_datetime(df.CMPLNT_FR_DT + ' ' + df.CMPLNT_FR_TM, format='%m/%d/%Y %H:%M:%S', cache=True, errors="coerce")
df['CMPLNT_TO'] = pd.to_datetime(df.CMPLNT_TO_DT + ' ' + df.CMPLNT_TO_TM, format='%m/%d/%Y %H:%M:%S', cache=True, errors="coerce")

# We created the CMPLNT_FR and CMPLNT_TO columns, these columns are redundant
to_drop = ['CMPLNT_FR_DT','CMPLNT_TO_DT','CMPLNT_FR_TM','CMPLNT_TO_TM']
df = df.drop(to_drop, axis='columns')

In [209]:
len(df)

9489700

In [210]:
df.CMPLNT_FR.isnull().sum()

np.int64(1842340)

In [211]:
df.CMPLNT_TO.isnull().sum()

np.int64(1849305)

In [212]:
before = len(df)
# df = df [ ~df.CMPLNT_FR.isnull() ]
after = len(df)
print(f'Removed {before - after} rows')

Removed 0 rows


In [213]:
len(df)

9489700

###  ADDR_PCT_CD        object

In [214]:
df.ADDR_PCT_CD = df.ADDR_PCT_CD.replace(to_replace = '-99', value='99')
# df = df [ ~df.ADDR_PCT_CD.isnull() ]
df.ADDR_PCT_CD = pd.Categorical(df.ADDR_PCT_CD)

###  RPT_DT             object

In [215]:
# Convert RPT_DT to date
df.RPT_DT = pd.to_datetime(df.RPT_DT, format="%m/%d/%Y", cache=True)

###   KY_CD  &  OFNS_DESC

In [216]:
df.KY_CD.value_counts()

Unnamed: 0_level_0,count
KY_CD,Unnamed: 1_level_1
341,1666565
578,1272983
344,998253
109,831594
351,732786
...,...
460,16
357,15
123,7
362,5


In [217]:
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'KIDNAPPING', value='KIDNAPPING & RELATED OFFENSES')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'KIDNAPPING AND RELATED OFFENSES', value='KIDNAPPING & RELATED OFFENSES')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'AGRICULTURE & MRKTS LAW-UNCLASSIFIED', value='OTHER STATE LAWS (NON PENAL LAW)')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'OTHER STATE LAWS (NON PENAL LA', value='OTHER STATE LAWS (NON PENAL LAW)')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'ENDAN WELFARE INCOMP', value='OFFENSES RELATED TO CHILDREN')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'THEFT OF SERVICES', value='OTHER OFFENSES RELATED TO THEF')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'NYS LAWS-UNCLASSIFIED VIOLATION', value='OTHER STATE LAWS')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'FELONY SEX CRIMES', value='SEX CRIMES')

df.loc[df.KY_CD=='120','OFNS_DESC'] ='CHILD ABANDONMENT/NON SUPPORT'
df.loc[df.KY_CD=='125','OFNS_DESC'] ='NYS LAWS-UNCLASSIFIED FELONY'

offenses = df[ ["KY_CD", "OFNS_DESC"] ].drop_duplicates().dropna()
offenses['KY_CD'] = pd.Categorical(pd.to_numeric(offenses['KY_CD'] ).astype(int))
offenses = offenses.set_index("KY_CD")
offenses = offenses.sort_index()
offenses = offenses.reset_index()
display(offenses)


Unnamed: 0,KY_CD,OFNS_DESC
0,101,MURDER & NON-NEGL. MANSLAUGHTER
1,102,HOMICIDE-NEGLIGENT-VEHICLE
2,103,"HOMICIDE-NEGLIGENT,UNCLASSIFIE"
3,104,RAPE
4,105,ROBBERY
...,...,...
71,676,NEW YORK CITY HEALTH CODE
72,677,OTHER STATE LAWS
73,678,MISCELLANEOUS PENAL LAW
74,685,ADMINISTRATIVE CODES


In [218]:
df.KY_CD = pd.Categorical(df.KY_CD)

In [219]:
df = df.drop('OFNS_DESC', axis='columns')

### 9   PD_CD   &  PD_DESC           

In [220]:
df.loc[df.PD_CD=='694','PD_DESC'] ='INCEST'
df.loc[df.PD_CD=='234','PD_DESC'] ='BURGLARY,UNKNOWN TIME'

internal = df[ ["PD_CD", "PD_DESC"] ].drop_duplicates().dropna()
internal['PD_CD'] = pd.Categorical(pd.to_numeric(internal['PD_CD'] ).astype(int))
internal = internal.set_index("PD_CD")
internal = internal.sort_index()
internal = internal.reset_index()
display(internal)

Unnamed: 0,PD_CD,PD_DESC
0,100,STALKING COMMIT SEX OFFENSE
1,101,ASSAULT 3
2,102,ASSAULT SCHOOL SAFETY AGENT
3,103,ASSAULT TRAFFIC AGENT
4,104,VEHICULAR ASSAULT (INTOX DRIVE
...,...,...
438,917,LEAVING THE SCENE OF AN ACCIDENT (SPI)
439,918,RECKLESS DRIVING
440,922,"TRAFFIC,UNCLASSIFIED MISDEMEAN"
441,969,"TRAFFIC,UNCLASSIFIED INFRACTIO"


In [221]:
df.PD_CD.isnull().sum()

np.int64(7957)

In [27]:
# df = df[~df.PD_CD.isnull()]

In [222]:
df.PD_CD = pd.Categorical(df.PD_CD)

In [223]:
df = df.drop('PD_DESC', axis='columns')

### 11  CRM_ATPT_CPTD_CD   object

In [224]:
df.CRM_ATPT_CPTD_CD.value_counts(dropna=False)

Unnamed: 0_level_0,count
CRM_ATPT_CPTD_CD,Unnamed: 1_level_1
COMPLETED,9333518
ATTEMPTED,156014


In [225]:
df.CRM_ATPT_CPTD_CD = pd.Categorical(df.CRM_ATPT_CPTD_CD)

In [226]:
df.CRM_ATPT_CPTD_CD.isnull().sum()

np.int64(168)

In [227]:
df = df [ ~df.CRM_ATPT_CPTD_CD.isnull() ]


### 12  LAW_CAT_CD         object

In [229]:
df.LAW_CAT_CD.isnull().sum()

np.int64(0)

In [228]:
df.LAW_CAT_CD.value_counts(dropna=False)

Unnamed: 0_level_0,count
LAW_CAT_CD,Unnamed: 1_level_1
MISDEMEANOR,5215402
FELONY,2979737
VIOLATION,1294393


In [230]:
df.LAW_CAT_CD = pd.Categorical(df.LAW_CAT_CD)

### 16  JURIS_DESC         object
### 17  JURISDICTION_CODE  object

In [231]:
df.JURISDICTION_CODE.isnull().sum()

np.int64(0)

In [232]:
# df = df[ ~df.JURISDICTION_CODE.isnull() ]

jusridiction = df[ ["JURISDICTION_CODE", "JURIS_DESC", ] ].drop_duplicates().dropna()
jusridiction['JURISDICTION_CODE'] = pd.to_numeric(jusridiction['JURISDICTION_CODE'] )
jusridiction['JURISDICTION_CODE'] = jusridiction['JURISDICTION_CODE'].astype(int)
jusridiction = jusridiction.set_index("JURISDICTION_CODE")
jusridiction = jusridiction.sort_index()
jusridiction = jusridiction.reset_index()
display(jusridiction)

Unnamed: 0,JURISDICTION_CODE,JURIS_DESC
0,0,N.Y. POLICE DEPT
1,1,N.Y. TRANSIT POLICE
2,2,N.Y. HOUSING POLICE
3,3,PORT AUTHORITY
4,4,TRI-BORO BRDG TUNNL
5,6,LONG ISLAND RAILRD
6,7,AMTRACK
7,8,CONRAIL
8,9,STATN IS RAPID TRANS
9,11,N.Y. STATE POLICE


In [233]:
df.JURISDICTION_CODE = pd.Categorical(df.JURISDICTION_CODE)


In [234]:
df = df.drop('JURIS_DESC', axis='columns')

###  13  BORO_NM            object

In [235]:
df.BORO_NM.value_counts(dropna=False)

Unnamed: 0_level_0,count
BORO_NM,Unnamed: 1_level_1
BROOKLYN,2776925
MANHATTAN,2287612
BRONX,2053821
QUEENS,1928347
STATEN ISLAND,434166
,8661


In [41]:
# df.BORO_NM.replace(to_replace = '(null)', value=np.nan, inplace = True)

In [236]:
df.BORO_NM.isnull().sum()

np.int64(8661)

In [237]:
df = df[~df.BORO_NM.isnull()]

In [238]:
df.BORO_NM = pd.Categorical(df.BORO_NM)

### 23  SUSP_AGE_GROUP     object
### 32  VIC_AGE_GROUP      object

In [240]:
df.SUSP_AGE_GROUP.value_counts(dropna=False).head(10)

Unnamed: 0_level_0,count
SUSP_AGE_GROUP,Unnamed: 1_level_1
,6128929
25-44,1800287
18-24,649134
45-64,626721
<18,216145
65+,58930
1022,25
1023,20
2021,19
2014,17


In [241]:
df.VIC_AGE_GROUP.value_counts(dropna=False).head(10)

Unnamed: 0_level_0,count
VIC_AGE_GROUP,Unnamed: 1_level_1
25-44,3161892
,2937747
45-64,1638257
18-24,946969
<18,439154
65+,356107
930,18
936,17
940,15
935,14


In [242]:
# Both columns have a lot of noisy entries. We keep only the dominant groups, and also define an order
df.SUSP_AGE_GROUP = pd.Categorical(df.SUSP_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
df.VIC_AGE_GROUP = pd.Categorical(df.VIC_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])


### 24  SUSP_RACE          object
### 25  SUSP_SEX           object

### 33  VIC_RACE           object
### 34  VIC_SEX            object

In [243]:
df.VIC_SEX.value_counts(dropna=False)

Unnamed: 0_level_0,count
VIC_SEX,Unnamed: 1_level_1
F,3686282
M,3155112
E,1378403
D,1250725
L,10040
,305
U,4


In [244]:
df.VIC_SEX = df.VIC_SEX.replace(to_replace = 'U', value=np.nan)
df = df[~df.VIC_SEX.isnull()]

In [250]:
df.VIC_RACE.value_counts(dropna=False)

Unnamed: 0_level_0,count
VIC_RACE,Unnamed: 1_level_1
,3092582
BLACK,2283097
WHITE HISPANIC,1564041
WHITE,1562616
ASIAN / PACIFIC ISLANDER,594007
BLACK HISPANIC,342221
AMERICAN INDIAN/ALASKAN NATIVE,41998


In [249]:
df.VIC_RACE = df.VIC_RACE.replace(to_replace = 'OTHER', value=np.nan)

In [253]:
df.SUSP_SEX.value_counts(dropna=False)

Unnamed: 0_level_0,count
SUSP_SEX,Unnamed: 1_level_1
,3877927
M,3447738
U,1110664
F,1044233


In [254]:
df.SUSP_RACE.value_counts(dropna=False)

Unnamed: 0_level_0,count
SUSP_RACE,Unnamed: 1_level_1
,5316792
BLACK,2102908
WHITE HISPANIC,965757
WHITE,584904
BLACK HISPANIC,301757
ASIAN / PACIFIC ISLANDER,192368
AMERICAN INDIAN/ALASKAN NATIVE,16065
OTHER,11


In [255]:
# U is unknown, same is NULL.
df.SUSP_SEX = df.SUSP_SEX.replace(to_replace = 'U', value=np.nan)
# df.SUSP_SEX = df.SUSP_SEX.replace(to_replace = '(null)', value=np.nan)
# df.SUSP_SEX = df.SUSP_SEX.replace(to_replace = 'UNKNOWN', value=np.nan)

# Very small amount of OTHER values
df.SUSP_RACE = df.SUSP_RACE.replace(to_replace = 'OTHER', value=np.nan)



In [257]:
df.SUSP_RACE = pd.Categorical(df.SUSP_RACE)
df.SUSP_SEX = pd.Categorical(df.SUSP_SEX)
df.VIC_RACE = pd.Categorical(df.VIC_RACE)
df.VIC_SEX = pd.Categorical(df.VIC_SEX)

###  14  LOC_OF_OCCUR_DESC  object

In [259]:
df.LOC_OF_OCCUR_DESC.value_counts(dropna=False)

Unnamed: 0_level_0,count
LOC_OF_OCCUR_DESC,Unnamed: 1_level_1
INSIDE,4854935
FRONT OF,2250910
,1944082
OPPOSITE OF,236153
REAR OF,189590
OUTSIDE,4892


In [260]:
df.LOC_OF_OCCUR_DESC = pd.Categorical(df.LOC_OF_OCCUR_DESC)

### Latitude  & Longitude

In [166]:
# !sudo apt-get update
# !sudo apt-get install python3-rtree
# !sudo pip3 install geopandas descartes shapely ngram # matplotlib==3.1.3

In [261]:
import geopandas as gpd

In [263]:
df.Latitude = pd.to_numeric(df.Latitude)
df.Longitude  = pd.to_numeric(df.Longitude)

In [264]:
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))

In [265]:
# https://data.cityofnewyork.us/City-Government/2020-Neighborhood-Tabulation-Areas-NTAs-/9nt8-h7nd/about_data
shapefile_url = 'https://data.cityofnewyork.us/resource/9nt8-h7nd.geojson'
df_nyc = gpd.GeoDataFrame.from_file(shapefile_url)
df_nyc = df_nyc.to_crs(4326)

In [70]:
df_nyc

Unnamed: 0,OBJECTID,BoroCode,BoroName,CountyFIPS,NTA2020,NTAName,NTAAbbrev,NTAType,CDTA2020,CDTAName,Shape__Area,Shape__Length,geometry
0,1,3,Brooklyn,047,BK0101,Greenpoint,Grnpt,0,BK01,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),3.532181e+07,28919.560811,"POLYGON ((-73.93214 40.72817, -73.93238 40.728..."
1,2,3,Brooklyn,047,BK0102,Williamsburg,Wllmsbrg,0,BK01,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),2.885285e+07,28134.082324,"POLYGON ((-73.95814 40.72441, -73.95772 40.724..."
2,3,3,Brooklyn,047,BK0103,South Williamsburg,SWllmsbrg,0,BK01,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),1.520896e+07,18250.280543,"POLYGON ((-73.95024 40.70548, -73.94984 40.705..."
3,4,3,Brooklyn,047,BK0104,East Williamsburg,EWllmsbrg,0,BK01,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),5.226741e+07,43184.798988,"POLYGON ((-73.92406 40.71412, -73.92404 40.714..."
4,5,3,Brooklyn,047,BK0201,Brooklyn Heights,BkHts,0,BK02,BK02 Downtown Brooklyn-Fort Greene (CD 2 Appro...,9.982322e+06,14312.504975,"POLYGON ((-73.99237 40.6897, -73.99436 40.6902..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
257,258,5,Staten Island,085,SI0391,Freshkills Park (South),FrshklPK_S,9,SI03,SI03 South Shore (CD 3 Approximation),4.775877e+07,33945.420421,"POLYGON ((-74.20059 40.57952, -74.19888 40.579..."
258,259,5,Staten Island,085,SI9561,Fort Wadsworth,FtWdswrth,6,SI95,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,9.867249e+06,14814.414741,"POLYGON ((-74.05975 40.59386, -74.06014 40.594..."
259,260,5,Staten Island,085,SI9591,Hoffman & Swinburne Islands,HffmnIsl,9,SI95,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,6.357020e+05,4743.128127,"MULTIPOLYGON (((-74.05314 40.57771, -74.05406 ..."
260,261,5,Staten Island,085,SI9592,Miller Field,MllrFld,9,SI95,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,1.086680e+07,19197.200973,"POLYGON ((-74.08469 40.57149, -74.08595 40.570..."


In [266]:
%%time
# Match each accident with a neighborhood.
# Will take ~1 min to run
# This is done with left join,
# so we preserve all the data points
# but we know which ones are not matching with the shapefile
gdf.crs = df_nyc.crs
gdf = gpd.sjoin(gdf, df_nyc, how='left')


CPU times: user 1min 29s, sys: 20.5 s, total: 1min 50s
Wall time: 1min 9s


In [267]:
gdf.dtypes

Unnamed: 0,0
CMPLNT_NUM,int32
ADDR_PCT_CD,category
RPT_DT,datetime64[ns]
KY_CD,category
PD_CD,category
CRM_ATPT_CPTD_CD,category
LAW_CAT_CD,category
BORO_NM,category
LOC_OF_OCCUR_DESC,category
PREM_TYP_DESC,object


In [270]:
# We keep only boro_name and ntaname
todrop = [
    'index_right', 'shape_area', 'cdtaname', 'borocode', 'countyfips',
    'ntaabbrev', 'ntatype', 'cdta2020', 'shape_leng'
]

gdf = gdf.drop(todrop, axis='columns')

# Rename the columns
gdf = gdf.rename({
    'boroname': 'BOROUGH',
    'ntaname': 'NEIGHBORHOOD',
    'nta2020': 'NEIGHBORHOOD_CODE',
}, axis='columns')

In [271]:
gdf['BOROUGH'] = gdf['BOROUGH'].str.upper()

In [286]:
print("Entries without a detected BOROUGH:", gdf[gdf.BOROUGH.isnull()].shape[0])
# Mark as NULL all the lon/lat entries outside the NYC area
gdf.loc[gdf.BOROUGH.isnull(), 'Latitude'] = None
gdf.loc[gdf.BOROUGH.isnull(), 'Longitude'] = None

Entries without a detected BOROUGH: 1367


In [277]:
mask = gdf.query('BOROUGH != BORO_NM and Latitude==Latitude and Longitude==Longitude').CMPLNT_NUM.values

In [287]:
inconsistent = gdf.query('BOROUGH != BORO_NM and Latitude==Latitude and Longitude==Longitude').shape[0]
print("Entries where reported lon/lat is inconsistent with the reported borough:", inconsistent)

Entries where reported lon/lat is inconsistent with the reported borough: 9098


In [288]:
# Mark as NULL all the lon/lat entries that generate inconsistencies
mask = gdf.query('BOROUGH != BORO_NM and Latitude==Latitude and Longitude==Longitude').CMPLNT_NUM.values
condition = gdf.CMPLNT_NUM.isin(mask)

gdf.loc[condition, 'Latitude'] = None
gdf.loc[condition, 'Longitude'] = None

In [290]:
# We do not need the geometry anymore
gdf = gdf.drop('geometry', axis='columns')

In [291]:
df = pd.DataFrame(gdf)

In [293]:
df.BORO_NM.value_counts(dropna=False)

Unnamed: 0_level_0,count
BORO_NM,Unnamed: 1_level_1
BROOKLYN,2776816
MANHATTAN,2287552
BRONX,2053745
QUEENS,1928294
STATEN ISLAND,434155


In [294]:
# Drop the cases where the reported borough
# is different than the one detected through lon/lat
df = df[df.BOROUGH == df.BORO_NM]

In [None]:
df.drop(['BOROUGH'], axis='columns', inplace=True)

In [324]:
# We do this to allow for easier insertion to a database later on
df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].str.replace('\'', '’', regex=False)

In [325]:
df.NEIGHBORHOOD_CODE = pd.Categorical(df.NEIGHBORHOOD_CODE)
df.NEIGHBORHOOD = pd.Categorical(df.NEIGHBORHOOD)

### TRANSIT_DISTRICT

In [296]:
df.TRANSIT_DISTRICT.value_counts(dropna=False)


Unnamed: 0_level_0,count
TRANSIT_DISTRICT,Unnamed: 1_level_1
,9246073
4.0,34003
2.0,26478
1.0,22332
33.0,21389
3.0,20894
20.0,20890
12.0,18107
11.0,16681
32.0,15677


In [297]:
df.drop('TRANSIT_DISTRICT', axis='columns', inplace=True)


### PREM_TYP_DESC

In [299]:
df.PREM_TYP_DESC.value_counts(dropna=False)

Unnamed: 0_level_0,count
PREM_TYP_DESC,Unnamed: 1_level_1
STREET,2959961
RESIDENCE - APT. HOUSE,2023347
RESIDENCE-HOUSE,918023
RESIDENCE - PUBLIC HOUSING,689694
CHAIN STORE,276106
...,...
CLOTHING BOUTIQUE,4
DOCTOR/DENTIST,2
PHOTO/COPY STORE,2
CHECK CASH,1


In [300]:
df.PREM_TYP_DESC.isnull().sum()

np.int64(51791)

In [301]:
df = df [~df.PREM_TYP_DESC.isnull()]

In [302]:
df.PREM_TYP_DESC = pd.Categorical(df.PREM_TYP_DESC)

In [304]:
df.PARKS_NM.value_counts(dropna=False)

Unnamed: 0_level_0,count
PARKS_NM,Unnamed: 1_level_1
,9377915
CENTRAL PARK,2539
FLUSHING MEADOWS CORONA PARK,2187
WASHINGTON SQUARE PARK,1824
CONEY ISLAND BEACH & BOARDWALK,1581
...,...
COURT SQUARE PARK,1
ST. MARY'S PARK PLAYGROUND BROOKLYN,1
VALENTINO PIER,1
AESOP PARK,1


In [305]:
df.PARKS_NM.value_counts().sum()

np.int64(40391)

In [306]:
df.drop('PARKS_NM', axis='columns', inplace=True)



 19  HADEVELOPT         object


In [308]:
df.HADEVELOPT.value_counts(dropna=False)

Unnamed: 0_level_0,count
HADEVELOPT,Unnamed: 1_level_1
,9384972
INGERSOLL,4839
WALD,2818
NOSTRAND,2567
WILLIAMSBURG,2557
RIIS,2102
MARLBORO,2055
MANHATTANVILLE,2010
GRANT,1993
SHEEPSHEAD BAY,1842


In [309]:
df.drop('HADEVELOPT', axis='columns', inplace=True)


 20  HOUSING_PSA        object



In [310]:
df.HOUSING_PSA.value_counts(dropna=False)

Unnamed: 0_level_0,count
HOUSING_PSA,Unnamed: 1_level_1
,8714765
670,9725
887,9690
720,8371
845,8288
...,...
73629,1
64975,1
55645,1
63783,1


In [311]:
df.HOUSING_PSA.value_counts().sum()

np.int64(703541)

In [312]:
df.drop('HOUSING_PSA', axis='columns', inplace=True)

 30  PATROL_BORO        object


In [314]:
df.PATROL_BORO.value_counts(dropna=False)

Unnamed: 0_level_0,count
PATROL_BORO,Unnamed: 1_level_1
PATROL BORO BRONX,2038116
PATROL BORO BKLYN SOUTH,1385720
PATROL BORO BKLYN NORTH,1374503
PATROL BORO MAN SOUTH,1156248
PATROL BORO MAN NORTH,1117927
PATROL BORO QUEENS NORTH,996785
PATROL BORO QUEENS SOUTH,916961
PATROL BORO STATEN ISLAND,431887
,159


In [315]:
df = df[~df.PATROL_BORO.isnull()]

In [316]:
df.PATROL_BORO = pd.Categorical(df.PATROL_BORO)

 31  STATION_NAME       object

In [318]:
df.STATION_NAME.value_counts(dropna=False)

Unnamed: 0_level_0,count
STATION_NAME,Unnamed: 1_level_1
,9196533
125 STREET,9799
14 STREET,5524
42 ST.-PORT AUTHORITY BUS TERM,5373
34 ST.-PENN STATION,4780
...,...
DISTRICT 30 OFFICE,22
DISTRICT 12 OFFICE,21
DISTRICT 34 OFFICE,18
OFF-SYSTEM,8


In [319]:
df.drop('STATION_NAME', axis='columns', inplace=True)

In [320]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9418147 entries, 0 to 9491945
Data columns (total 24 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float64       
 15  Longitude          float64       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datetime64

## Data exploration

In this part we check the different values that appear in the columns. When we detect noisy results, we delete the corresponding values. In fact, many of the operations that are performed above, in the 'data cleaning' section, are the result of observations that we make here.

In [321]:
# Find the unique values in each column
#
# df.describe(include = [np.object, 'category']).T['unique']
unique = df.describe(include = 'all').T['unique'].sort_values()

display(unique)

Unnamed: 0,unique
CRM_ATPT_CPTD_CD,2.0
SUSP_SEX,2.0
LAW_CAT_CD,3.0
VIC_SEX,5.0
BORO_NM,5.0
LOC_OF_OCCUR_DESC,5.0
VIC_AGE_GROUP,5.0
SUSP_AGE_GROUP,5.0
VIC_RACE,6.0
SUSP_RACE,6.0


In [322]:
for column in unique.index:
    if unique[column] < 200:
        print(df[column].value_counts())
        print("=====")

CRM_ATPT_CPTD_CD
COMPLETED    9263026
ATTEMPTED     155121
Name: count, dtype: int64
=====
SUSP_SEX
M    3421192
F    1037663
Name: count, dtype: int64
=====
LAW_CAT_CD
MISDEMEANOR    5174794
FELONY         2956237
VIOLATION      1287116
Name: count, dtype: int64
=====
VIC_SEX
F    3669216
M    3135787
E    1359745
D    1244005
L       9394
Name: count, dtype: int64
=====
BORO_NM
BROOKLYN         2760214
MANHATTAN        2275535
BRONX            2038102
QUEENS           1912408
STATEN ISLAND     431888
Name: count, dtype: int64
=====
LOC_OF_OCCUR_DESC
INSIDE         4829483
FRONT OF       2237476
OPPOSITE OF     235013
REAR OF         188600
OUTSIDE           1459
Name: count, dtype: int64
=====
VIC_AGE_GROUP
25-44    3145184
45-64    1630270
18-24     941933
<18       436530
65+       354272
Name: count, dtype: int64
=====
SUSP_AGE_GROUP
25-44    1784297
18-24     644021
45-64     621236
<18       215022
65+        58333
Name: count, dtype: int64
=====
VIC_RACE
BLACK                  

In [114]:
# With all the proper data typing the dataset went down in size from 1.9Gb+ to 425Mb.
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9417224 entries, 0 to 9491945
Data columns (total 24 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datetime64

In [116]:
df.dtypes

Unnamed: 0,0
CMPLNT_NUM,int32
ADDR_PCT_CD,category
RPT_DT,datetime64[ns]
KY_CD,category
PD_CD,category
CRM_ATPT_CPTD_CD,category
LAW_CAT_CD,category
BORO_NM,category
LOC_OF_OCCUR_DESC,category
PREM_TYP_DESC,category


## Storing in a MySQL database

In [323]:
!sudo pip3 install -U -q PyMySQL sqlalchemy

In [326]:
import os
from sqlalchemy import create_engine
from sqlalchemy import text

conn_string = 'mysql+pymysql://{user}:{password}@{host}/?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org',
    user = 'root',
    password = mysql_pass,
    encoding = 'utf8mb4')

engine = create_engine(conn_string)


In [327]:
# Query to create a database
db_name = 'nypd'

sql = f"DROP DATABASE IF EXISTS {db_name}"
with engine.connect() as connection:
  connection.execute(text(sql))

# Create a database
sql = f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8mb4'"
with engine.connect() as connection:
  connection.execute(text(sql))


In [332]:
# And lets switch to the database
sql = f"USE {db_name}"
with engine.connect() as connection:
  connection.execute(text(sql))


In [328]:
NEIGHBORHOOD_enum = "ENUM('" + ("','".join(sorted(df.NEIGHBORHOOD.astype(str).unique()))) + "')"


In [329]:
print(NEIGHBORHOOD_enum)

ENUM('Allerton','Alley Pond Park','Annadale-Huguenot-Prince’s Bay-Woodrow','Arden Heights-Rossville','Astoria (Central)','Astoria (East)-Woodside (North)','Astoria (North)-Ditmars-Steinway','Astoria Park','Auburndale','Baisley Park','Barren Island-Floyd Bennett Field','Bath Beach','Bay Ridge','Bay Terrace-Clearview','Bayside','Bedford Park','Bedford-Stuyvesant (East)','Bedford-Stuyvesant (West)','Bellerose','Belmont','Bensonhurst','Borough Park','Breezy Point-Belle Harbor-Rockaway Park-Broad Channel','Brighton Beach','Bronx Park','Brooklyn Heights','Brooklyn Navy Yard','Brownsville','Bushwick (East)','Bushwick (West)','Calvary & Mount Zion Cemeteries','Calvert Vaux Park','Cambria Heights','Canarsie','Canarsie Park & Pier','Carroll Gardens-Cobble Hill-Gowanus-Red Hook','Castle Hill-Unionport','Central Park','Chelsea-Hudson Yards','Chinatown-Two Bridges','Claremont Park','Claremont Village-Claremont (East)','Clinton Hill','Co-op City','College Point','Concourse-Concourse Village','Coney 

In [330]:
NCODE_enum = "ENUM('" + ("','".join(sorted(df.NEIGHBORHOOD_CODE.astype(str).unique()))) + "')"

In [333]:
# In principle, we can let Pandas create the table, but we want to be a bit more predise
# with the data types, and we want to add documentation for each column
# from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i


sql = f'''
CREATE TABLE {db_name}.nypd (
  CMPLNT_NUM int,
  CMPLNT_FR datetime,
  CMPLNT_TO datetime,
  RPT_DT date,
  KY_CD SMALLINT,
  PD_CD SMALLINT,
  CRM_ATPT_CPTD_CD enum('COMPLETED','ATTEMPTED'),
  LAW_CAT_CD enum('FELONY','MISDEMEANOR','VIOLATION'),
  JURISDICTION_CODE SMALLINT,
  BORO_NM enum('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN ISLAND'),
  NEIGHBORHOOD {NEIGHBORHOOD_enum},
  NEIGHBORHOOD_CODE {NCODE_enum},
  ADDR_PCT_CD SMALLINT,
  LOC_OF_OCCUR_DESC enum('FRONT OF','INSIDE','OPPOSITE OF','OUTSIDE','REAR OF'),
  PATROL_BORO enum('PATROL BORO BRONX', 'PATROL BORO BKLYN SOUTH','PATROL BORO BKLYN NORTH','PATROL BORO MAN SOUTH','PATROL BORO MAN NORTH','PATROL BORO QUEENS NORTH','PATROL BORO QUEENS SOUTH','PATROL BORO STATEN ISLAND'),
  PREM_TYP_DESC varchar(30),
  SUSP_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  VIC_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  SUSP_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  VIC_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  SUSP_SEX enum('M', 'F'),
  VIC_SEX enum('M', 'F', 'E', 'D', 'L'),
  Latitude double,
  Longitude double,
  PRIMARY KEY (CMPLNT_NUM)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))

In [None]:
# Create a table
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    # print("Batch:",i)
    # continue # Cannot execute this on Travis
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False,
        chunksize = 1000)

  3%|▎         | 5/189 [01:25<52:05, 16.99s/it]

In [None]:
sql = "CREATE INDEX ix_lat ON nypd.nypd(Latitude)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [None]:
sql = "CREATE INDEX ix_lon ON nypd.nypd(Longitude)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [None]:
sql = "CREATE INDEX ix_LAW_CAT_CD ON nypd.nypd(LAW_CAT_CD)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [None]:
sql = "CREATE INDEX ix_BORO_NM ON nypd.nypd(BORO_NM)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [None]:
sql = "CREATE INDEX ix_KY_CD ON nypd.nypd(KY_CD)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [None]:
sql = "CREATE INDEX ix_RPT_DT ON nypd.nypd(RPT_DT)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [None]:
sql = "CREATE INDEX ix_CMPLNT_FR ON nypd.nypd(CMPLNT_FR)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [None]:
offenses = offenses[offenses.OFNS_DESC != "(null)"]

In [None]:
offenses = offenses.groupby('KY_CD', observed=False).first()['OFNS_DESC']

In [None]:
offenses = offenses.reset_index()

In [153]:
# offenses.drop(39,inplace=True)

In [None]:
sql = "DROP TABLE IF EXISTS offense_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE offense_codes (
  KY_CD smallint,
  OFNS_DESC varchar(32),
  PRIMARY KEY (KY_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))

offenses.to_sql(
        name = 'offense_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

In [None]:
sql = "DROP TABLE IF EXISTS jurisdiction_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE jurisdiction_codes (
  JURISDICTION_CODE smallint,
  JURIS_DESC varchar(40),
  PRIMARY KEY (JURISDICTION_CODE)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))


jusridiction.to_sql(
        name = 'jurisdiction_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

In [None]:
internal.PD_DESC.str.len().max()


In [None]:
internal = internal.query("PD_DESC !=	'CRIMINAL DISPOSAL FIREARM 1 &' ")
internal = internal.query("PD_DESC !=	'UNFINSH FRAME 2' ")
internal = internal.query("PD_DESC !=	'WEAPONS POSSESSION 1 & 2' ")
internal = internal.query("PD_DESC !=	'CRIM POS WEAP 4' ")


In [None]:
internal

In [None]:
sql = "DROP TABLE IF EXISTS penal_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE penal_codes (
  PD_CD smallint,
  PD_DESC varchar(80),
  PRIMARY KEY (PD_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))


internal.to_sql(
        name = 'penal_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

In [None]:
internal

In [None]:
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/files/65f25845-1551-4d21-91dc-869c977cd93d?download=true&filename=PDCode_PenalLaw.xlsx' -o PDCode_PenalLaw.xlsx

In [None]:
penal_code_df = pd.read_excel('PDCode_PenalLaw.xlsx')

In [None]:
penal_code_df.to_sql(
        name = 'pd_code_penal_law',
        schema = db_name,
        con = engine,
        if_exists = 'replace',
        index = False)