<a href="https://colab.research.google.com/github/ipeirotis-org/datasets/blob/main/NYPD_Complaint/NYPD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NYPD Dataset

Dataset description at
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i



| Column | Description |
|--------|-------------------|
| CMPLNT_NUM |  Randomly generated persistent ID for each complaint  |  
| ADDR_PCT_CD |  The precinct in which the incident occurred |  
| BORO |  The name of the borough in which the incident occurred |  
| CMPLNT_FR_DT |  Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |  
| CMPLNT_FR_TM |  Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |  
| CMPLNT_TO_DT |  Ending date of occurrence for the reported event, if exact time of occurrence is unknown |  
| CMPLNT_TO_TM |  Ending time of occurrence for the reported event, if exact time of occurrence is unknown |  
| CRM_ATPT_CPTD_CD |  Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |  
| HADEVELOPT |  Name of NYCHA housing development of occurrence, if applicable |  
| HOUSING_PSA |  Development Level Code |  
| JURISDICTION_CODE |  Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |  
| JURIS_DESC |  Description of the jurisdiction code |  
| KY_CD |  Three digit offense classification code |  
| LAW_CAT_CD |  Level of offense: felony, misdemeanor, violation  |  
| LOC_OF_OCCUR_DESC |  Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |  
| OFNS_DESC |  Description of offense corresponding with key code |  
| PARKS_NM |  Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |  
| PATROL_BORO |  The name of the patrol borough in which the incident occurred |  
| PD_CD |  Three digit internal classification code (more granular than Key Code) |  
| PD_DESC |  Description of internal classification corresponding with PD code (more granular than Offense Description) |  
| PREM_TYP_DESC |  Specific description of premises; grocery store, residence, street, etc. |  
| RPT_DT |  Date event was reported to police  |  
| STATION_NAME |  Transit station name |  
| SUSP_AGE_GROUP |  Suspect’s Age Group |  
| SUSP_RACE |  Suspect’s Race Description |  
| SUSP_SEX |  Suspect’s Sex Description |  
| TRANSIT_DISTRICT |  Transit district in which the offense occurred. |  
| VIC_AGE_GROUP |  Victim’s Age Group |  
| VIC_RACE |  Victim’s Race Description |  
| VIC_SEX |  Victim’s Sex Description |  
| X_COORD_CD |  X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Y_COORD_CD |  Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Latitude |  Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)  |  
| Longitude |  Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |


In [1]:
!pip install -q google-cloud-secret-manager

from google.colab import auth
auth.authenticate_user()

from google.cloud import secretmanager

def access_secret_version(project_id, secret_id, version_id):
    """
    Access the payload of the given secret version and return it.

    Args:
        project_id (str): Google Cloud project ID.
        secret_id (str): ID of the secret to access.
        version_id (str): ID of the version to access.
    Returns:
        str: The secret version's payload, or None if
        the version does not exist.
    """
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")


mysql_pass = access_secret_version("nyu-datasets", "MYSQL_PASSWORD", "latest")

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/218.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m143.4/218.1 kB[0m [31m4.6 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m218.1/218.1 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import pandas as pd
import numpy as np

In [3]:
# We load everything as an object/string, because some data types (e.g., some IDs)
# are recognized as decimals, and it is a mess to restore them back
# So we will do all the conversions ourselves later on

# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3079M    0 3079M    0     0  5655k      0 --:--:--  0:09:17 --:--:-- 5367k


In [4]:
%%time
df = pd.read_csv('nypd.csv', low_memory = True, dtype='object')

CPU times: user 30.2 s, sys: 3.15 s, total: 33.3 s
Wall time: 33.3 s


In [5]:
len(df)

9491946

In [6]:
df = df.replace(to_replace = '(null)', value=None)

In [7]:
df = df.replace(to_replace = 'UNKNOWN', value=None)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9491946 entries, 0 to 9491945
Data columns (total 35 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   CMPLNT_NUM         object
 1   CMPLNT_FR_DT       object
 2   CMPLNT_FR_TM       object
 3   CMPLNT_TO_DT       object
 4   CMPLNT_TO_TM       object
 5   ADDR_PCT_CD        object
 6   RPT_DT             object
 7   KY_CD              object
 8   OFNS_DESC          object
 9   PD_CD              object
 10  PD_DESC            object
 11  CRM_ATPT_CPTD_CD   object
 12  LAW_CAT_CD         object
 13  BORO_NM            object
 14  LOC_OF_OCCUR_DESC  object
 15  PREM_TYP_DESC      object
 16  JURIS_DESC         object
 17  JURISDICTION_CODE  object
 18  PARKS_NM           object
 19  HADEVELOPT         object
 20  HOUSING_PSA        object
 21  X_COORD_CD         object
 22  Y_COORD_CD         object
 23  SUSP_AGE_GROUP     object
 24  SUSP_RACE          object
 25  SUSP_SEX           object
 26  TRANSIT_DISTRI

## Data Cleaning

In [9]:
# These columns are redundant
to_drop = ['Lat_Lon','X_COORD_CD','Y_COORD_CD']

# We have the longitude and latitude so the other coordinates are not needed
df = df.drop(to_drop, axis='columns')

###  CMPLNT_NUM         object   

In [10]:
before = len(df)

# Remove any non-numeric characters from the CMPLNT_NUM attribute
df['CMPLNT_NUM'] = df['CMPLNT_NUM'].str.replace(r'\D', '', regex=True)

df['CMPLNT_NUM'] = pd.to_numeric(df['CMPLNT_NUM'], errors="coerce")
df['CMPLNT_NUM'] = np.abs(df['CMPLNT_NUM'].astype('Int64'))

df = df[~df['CMPLNT_NUM'].isna()]
# Drop cases with duplicated complaint numbers
key_cnt = df['CMPLNT_NUM'].value_counts()
key_cnt [ key_cnt>1 ]
df = df[ ~df['CMPLNT_NUM'].isin( key_cnt [ key_cnt>1 ].index.values ) ]

after = len(df)
print(f'Removed {before - after} rows')

Removed 2210 rows


### CMPLNT_FR_DT       object
### CMPLNT_FR_TM       object
### CMPLNT_TO_DT       object
### CMPLNT_TO_TM       object

In [11]:
# CMPLNT_FR_DT_mask = df.CMPLNT_FR_DT.str.match(r'(\d\d)/(\d\d)/10(\d\d)', na=False)

# CMPLNT_TO_DT_mask = df.CMPLNT_TO_DT.str.match(r'(\d\d)/(\d\d)/10(\d\d)', na=False)

# df[CMPLNT_TO_DT_mask]

In [12]:
# There are a few rows that contain year 1015, 1016, ... that trigger an error during date conversion
# We replace all years written as 10XX with 20XX
# Note the usage of regular expressions
df.CMPLNT_FR_DT = df.CMPLNT_FR_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True )
df.CMPLNT_TO_DT = df.CMPLNT_TO_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True )

# Similarly, a few hours are written as 24:00:00, which also triggers errors.
# We fix these hours
df.CMPLNT_FR_TM = df.CMPLNT_FR_TM.replace(to_replace = '24:00:00', value='00:00:00')
df.CMPLNT_TO_TM = df.CMPLNT_TO_TM.replace(to_replace = '24:00:00', value='00:00:00')

# Convert the two separate date and time columns into single datetime columns
df['CMPLNT_FR'] = pd.to_datetime(df.CMPLNT_FR_DT + ' ' + df.CMPLNT_FR_TM, format='%m/%d/%Y %H:%M:%S', cache=True, errors="coerce")
df['CMPLNT_TO'] = pd.to_datetime(df.CMPLNT_TO_DT + ' ' + df.CMPLNT_TO_TM, format='%m/%d/%Y %H:%M:%S', cache=True, errors="coerce")

# We created the CMPLNT_FR and CMPLNT_TO columns, these columns are redundant
to_drop = ['CMPLNT_FR_DT','CMPLNT_TO_DT','CMPLNT_FR_TM','CMPLNT_TO_TM']
df = df.drop(to_drop, axis='columns')

In [13]:
len(df)

9489736

In [14]:
df.CMPLNT_FR.isnull().sum()

np.int64(702)

In [15]:
df.CMPLNT_TO.isnull().sum()

np.int64(1849326)

In [16]:
before = len(df)
# df = df [ ~df.CMPLNT_FR.isnull() ]
after = len(df)
print(f'Removed {before - after} rows')

Removed 0 rows


In [17]:
len(df)

9489736

###  ADDR_PCT_CD        object

In [18]:
df.ADDR_PCT_CD = df.ADDR_PCT_CD.replace(to_replace = '-99', value='99')
# df = df [ ~df.ADDR_PCT_CD.isnull() ]
# df.ADDR_PCT_CD = pd.Categorical(df.ADDR_PCT_CD)

###  RPT_DT             object

In [19]:
# Convert RPT_DT to date
df.RPT_DT = pd.to_datetime(df.RPT_DT, format="%m/%d/%Y", cache=True)

###   KY_CD  &  OFNS_DESC

In [20]:
df.KY_CD.value_counts(dropna=False)

Unnamed: 0_level_0,count
KY_CD,Unnamed: 1_level_1
341,1666568
578,1272986
344,998256
109,831595
351,732787
...,...
460,16
357,15
123,7
362,5


In [21]:
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'KIDNAPPING', value='KIDNAPPING & RELATED OFFENSES')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'KIDNAPPING AND RELATED OFFENSES', value='KIDNAPPING & RELATED OFFENSES')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'AGRICULTURE & MRKTS LAW-UNCLASSIFIED', value='OTHER STATE LAWS (NON PENAL LAW)')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'OTHER STATE LAWS (NON PENAL LA', value='OTHER STATE LAWS (NON PENAL LAW)')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'ENDAN WELFARE INCOMP', value='OFFENSES RELATED TO CHILDREN')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'THEFT OF SERVICES', value='OTHER OFFENSES RELATED TO THEF')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'NYS LAWS-UNCLASSIFIED VIOLATION', value='OTHER STATE LAWS')
df.OFNS_DESC = df.OFNS_DESC.replace(to_replace = 'FELONY SEX CRIMES', value='SEX CRIMES')

df.loc[df.KY_CD=='120','OFNS_DESC'] ='CHILD ABANDONMENT/NON SUPPORT'
df.loc[df.KY_CD=='125','OFNS_DESC'] ='NYS LAWS-UNCLASSIFIED FELONY'

offenses = df[ ["KY_CD", "OFNS_DESC"] ].drop_duplicates().dropna()
# offenses['KY_CD'] = pd.Categorical(pd.to_numeric(offenses['KY_CD'] ).astype(int))
offenses = offenses.set_index("KY_CD")
offenses = offenses.sort_index()
offenses = offenses.reset_index()

offenses = offenses[offenses.OFNS_DESC != "(null)"]
offenses = offenses.groupby('KY_CD', observed=False).first()['OFNS_DESC']
offenses = offenses.reset_index()

display(offenses)


Unnamed: 0,KY_CD,OFNS_DESC
0,101,MURDER & NON-NEGL. MANSLAUGHTER
1,102,HOMICIDE-NEGLIGENT-VEHICLE
2,103,"HOMICIDE-NEGLIGENT,UNCLASSIFIE"
3,104,RAPE
4,105,ROBBERY
...,...,...
70,676,NEW YORK CITY HEALTH CODE
71,677,OTHER STATE LAWS
72,678,MISCELLANEOUS PENAL LAW
73,685,ADMINISTRATIVE CODES


In [22]:
# df.KY_CD = pd.Categorical(df.KY_CD)

In [23]:
df = df.drop('OFNS_DESC', axis='columns')

### 9   PD_CD   &  PD_DESC           

In [24]:
df.loc[df.PD_CD=='694','PD_DESC'] ='INCEST'
df.loc[df.PD_CD=='234','PD_DESC'] ='BURGLARY,UNKNOWN TIME'

internal = df[ ["PD_CD", "PD_DESC"] ].drop_duplicates().dropna()
# internal['PD_CD'] = pd.Categorical(pd.to_numeric(internal['PD_CD'] ).astype(int))
internal = internal.set_index("PD_CD")
internal = internal.sort_index()
internal = internal.reset_index()

internal = internal.query("PD_DESC !=	'CRIMINAL DISPOSAL FIREARM 1 &' ")
internal = internal.query("PD_DESC !=	'UNFINSH FRAME 2' ")
internal = internal.query("PD_DESC !=	'WEAPONS POSSESSION 1 & 2' ")
internal = internal.query("PD_DESC !=	'CRIM POS WEAP 4' ")

display(internal)

Unnamed: 0,PD_CD,PD_DESC
0,100,STALKING COMMIT SEX OFFENSE
1,101,ASSAULT 3
2,102,ASSAULT SCHOOL SAFETY AGENT
3,103,ASSAULT TRAFFIC AGENT
4,104,VEHICULAR ASSAULT (INTOX DRIVE
...,...,...
438,917,LEAVING THE SCENE OF AN ACCIDENT (SPI)
439,918,RECKLESS DRIVING
440,922,"TRAFFIC,UNCLASSIFIED MISDEMEAN"
441,969,"TRAFFIC,UNCLASSIFIED INFRACTIO"


In [25]:
df.PD_CD.isnull().sum()

np.int64(7975)

In [26]:
df = df[~df.PD_CD.isnull()]

In [27]:
# df.PD_CD = pd.Categorical(df.PD_CD)

In [28]:
df = df.drop('PD_DESC', axis='columns')

### 11  CRM_ATPT_CPTD_CD   object

In [29]:
df.CRM_ATPT_CPTD_CD.value_counts(dropna=False)

Unnamed: 0_level_0,count
CRM_ATPT_CPTD_CD,Unnamed: 1_level_1
COMPLETED,9325579
ATTEMPTED,156014
,168


In [30]:
# df.CRM_ATPT_CPTD_CD = pd.Categorical(df.CRM_ATPT_CPTD_CD)

In [31]:
df.CRM_ATPT_CPTD_CD.isnull().sum()

np.int64(168)

In [32]:
df = df [ ~df.CRM_ATPT_CPTD_CD.isnull() ]


### 12  LAW_CAT_CD         object

In [33]:
df.LAW_CAT_CD.isnull().sum()

np.int64(0)

In [34]:
df.LAW_CAT_CD.value_counts(dropna=False)

Unnamed: 0_level_0,count
LAW_CAT_CD,Unnamed: 1_level_1
MISDEMEANOR,5215413
FELONY,2971784
VIOLATION,1294396


In [35]:
# df.LAW_CAT_CD = pd.Categorical(df.LAW_CAT_CD)

### 16  JURIS_DESC         object
### 17  JURISDICTION_CODE  object

In [36]:
df.JURISDICTION_CODE.isnull().sum()

np.int64(0)

In [37]:
# df = df[ ~df.JURISDICTION_CODE.isnull() ]

jusridiction = df[ ["JURISDICTION_CODE", "JURIS_DESC", ] ].drop_duplicates().dropna()
jusridiction['JURISDICTION_CODE'] = pd.to_numeric(jusridiction['JURISDICTION_CODE'] )
jusridiction['JURISDICTION_CODE'] = jusridiction['JURISDICTION_CODE'].astype(int)
jusridiction = jusridiction.set_index("JURISDICTION_CODE")
jusridiction = jusridiction.sort_index()
jusridiction = jusridiction.reset_index()
display(jusridiction)

Unnamed: 0,JURISDICTION_CODE,JURIS_DESC
0,0,N.Y. POLICE DEPT
1,1,N.Y. TRANSIT POLICE
2,2,N.Y. HOUSING POLICE
3,3,PORT AUTHORITY
4,4,TRI-BORO BRDG TUNNL
5,6,LONG ISLAND RAILRD
6,7,AMTRACK
7,8,CONRAIL
8,9,STATN IS RAPID TRANS
9,11,N.Y. STATE POLICE


In [38]:
# df.JURISDICTION_CODE = pd.Categorical(df.JURISDICTION_CODE)


In [39]:
df = df.drop('JURIS_DESC', axis='columns')

###  13  BORO_NM            object

In [40]:
df.BORO_NM.value_counts(dropna=False)

Unnamed: 0_level_0,count
BORO_NM,Unnamed: 1_level_1
BROOKLYN,2773953
MANHATTAN,2286428
BRONX,2051661
QUEENS,1927001
STATEN ISLAND,433889
,8661


In [41]:
# df.BORO_NM.replace(to_replace = '(null)', value=None, inplace = True)

In [42]:
df.BORO_NM.isnull().sum()

np.int64(8661)

In [43]:
df = df[~df.BORO_NM.isnull()]

In [44]:
# df.BORO_NM = pd.Categorical(df.BORO_NM)

### 23  SUSP_AGE_GROUP     object
### 32  VIC_AGE_GROUP      object

In [45]:
df.SUSP_AGE_GROUP.value_counts(dropna=False).head(10)

Unnamed: 0_level_0,count
SUSP_AGE_GROUP,Unnamed: 1_level_1
,6122945
25-44,1799242
18-24,648687
45-64,626401
<18,216034
65+,58900
1022,25
1023,20
2021,19
2014,17


In [46]:
df.VIC_AGE_GROUP.value_counts(dropna=False).head(10)

Unnamed: 0_level_0,count
VIC_AGE_GROUP,Unnamed: 1_level_1
25-44,3158192
,2937628
45-64,1636934
18-24,945130
<18,438571
65+,355732
930,18
936,17
940,15
935,14


In [47]:
# Both columns have a lot of noisy entries. We keep only the dominant groups,
# and also define an order

# Define the list of valid, ordered age groups
valid_age_groups = ['<18', '18-24', '25-44', '45-64', '65+']

# Iterate over the columns to apply the cleaning logic
for col in ['SUSP_AGE_GROUP', 'VIC_AGE_GROUP']:
  # The 'where' method keeps values that are in the valid_age_groups list.
  # All other values are replaced with None.
  df[col] = df[col].where(df[col].isin(valid_age_groups), None)

In [157]:
df.VIC_AGE_GROUP.value_counts(dropna=False).head(10)

Unnamed: 0_level_0,count
VIC_AGE_GROUP,Unnamed: 1_level_1
25-44,3143326
,2909866
45-64,1629476
18-24,941080
<18,436178
65+,354023


In [158]:
df.SUSP_AGE_GROUP.value_counts(dropna=False).head(10)

Unnamed: 0_level_0,count
SUSP_AGE_GROUP,Unnamed: 1_level_1
,6091955
25-44,1783817
18-24,643829
45-64,621052
<18,214983
65+,58313



### 24  SUSP_RACE          object
### 25  SUSP_SEX           object

### 33  VIC_RACE           object
### 34  VIC_SEX            object

In [48]:
df.VIC_SEX.value_counts(dropna=False)

Unnamed: 0_level_0,count
VIC_SEX,Unnamed: 1_level_1
F,3684981
M,3148473
E,1378406
D,1250727
L,10040
,305


In [49]:
df.VIC_SEX = df.VIC_SEX.replace(to_replace = 'U', value=None)
df = df[~df.VIC_SEX.isnull()]

In [50]:
df.VIC_RACE.value_counts(dropna=False)

Unnamed: 0_level_0,count
VIC_RACE,Unnamed: 1_level_1
,3092536
BLACK,2278257
WHITE HISPANIC,1562445
WHITE,1562024
ASIAN / PACIFIC ISLANDER,593690
BLACK HISPANIC,341646
AMERICAN INDIAN/ALASKAN NATIVE,41998
OTHER,31


In [51]:
df.VIC_RACE = df.VIC_RACE.replace(to_replace = 'OTHER', value=None)

In [52]:
df.SUSP_SEX.value_counts(dropna=False)

Unnamed: 0_level_0,count
SUSP_SEX,Unnamed: 1_level_1
,3871948
M,3445915
U,1110664
F,1044100


In [53]:
# U is unknown, same is NULL.
df.SUSP_SEX = df.SUSP_SEX.replace(to_replace = 'U', value=None)

In [54]:
df.SUSP_RACE.value_counts(dropna=False)

Unnamed: 0_level_0,count
SUSP_RACE,Unnamed: 1_level_1
,5310806
BLACK,2101671
WHITE HISPANIC,965386
WHITE,584800
BLACK HISPANIC,301590
ASIAN / PACIFIC ISLANDER,192298
AMERICAN INDIAN/ALASKAN NATIVE,16065
OTHER,11


In [55]:
# Very small amount of OTHER values
df.SUSP_RACE = df.SUSP_RACE.replace(to_replace = 'OTHER', value=None)

In [56]:
# df.SUSP_RACE = pd.Categorical(df.SUSP_RACE)
# df.SUSP_SEX = pd.Categorical(df.SUSP_SEX)
# df.VIC_RACE = pd.Categorical(df.VIC_RACE)
# df.VIC_SEX = pd.Categorical(df.VIC_SEX)

###  14  LOC_OF_OCCUR_DESC  object

In [57]:
# df.LOC_OF_OCCUR_DESC = df.LOC_OF_OCCUR_DESC.astype(str)

In [139]:
df['LOC_OF_OCCUR_DESC'] = df['LOC_OF_OCCUR_DESC'].replace({np.nan: None})

In [141]:
df.LOC_OF_OCCUR_DESC.value_counts(dropna=False)

Unnamed: 0_level_0,count
LOC_OF_OCCUR_DESC,Unnamed: 1_level_1
INSIDE,4826737
FRONT OF,2237479
,1926120
OPPOSITE OF,235013
REAR OF,188600


In [60]:
# df.LOC_OF_OCCUR_DESC = pd.Categorical(df.LOC_OF_OCCUR_DESC)

### Latitude  & Longitude

In [61]:
import geopandas as gpd

In [62]:
df.Latitude = pd.to_numeric(df.Latitude)
df.Longitude  = pd.to_numeric(df.Longitude)

In [63]:
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))

In [64]:
# https://data.cityofnewyork.us/City-Government/2020-Neighborhood-Tabulation-Areas-NTAs-/9nt8-h7nd/about_data
shapefile_url = 'https://data.cityofnewyork.us/resource/9nt8-h7nd.geojson'
df_nyc = gpd.GeoDataFrame.from_file(shapefile_url)
df_nyc = df_nyc.to_crs(4326)

In [65]:
df_nyc

Unnamed: 0,shape_area,ntaname,cdtaname,shape_leng,boroname,ntatype,nta2020,borocode,countyfips,ntaabbrev,cdta2020,geometry
0,35321809.1041,Greenpoint,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),28919.5608108,Brooklyn,0,BK0101,3,047,Grnpt,BK01,"MULTIPOLYGON (((-73.93213 40.72816, -73.93238 ..."
1,28852852.7038,Williamsburg,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),28134.0823238,Brooklyn,0,BK0102,3,047,Wllmsbrg,BK01,"MULTIPOLYGON (((-73.95814 40.7244, -73.95772 4..."
2,15208960.7339,South Williamsburg,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),18250.2805432,Brooklyn,0,BK0103,3,047,SWllmsbrg,BK01,"MULTIPOLYGON (((-73.95024 40.70547, -73.94984 ..."
3,52267407.9898,East Williamsburg,BK01 Williamsburg-Greenpoint (CD 1 Equivalent),43184.7989883,Brooklyn,0,BK0104,3,047,EWllmsbrg,BK01,"MULTIPOLYGON (((-73.92406 40.71411, -73.92404 ..."
4,9982321.59069,Brooklyn Heights,BK02 Downtown Brooklyn-Fort Greene (CD 2 Appro...,14312.5049745,Brooklyn,0,BK0201,3,047,BkHts,BK02,"MULTIPOLYGON (((-73.99236 40.68969, -73.99436 ..."
...,...,...,...,...,...,...,...,...,...,...,...,...
257,47758768.0799,Freshkills Park (South),SI03 South Shore (CD 3 Approximation),33945.4204211,Staten Island,9,SI0391,5,085,FrshklPK_S,SI03,"MULTIPOLYGON (((-74.20058 40.57951, -74.19888 ..."
258,9867248.986,Fort Wadsworth,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,14814.4147411,Staten Island,6,SI9561,5,085,FtWdswrth,SI95,"MULTIPOLYGON (((-74.05975 40.59385, -74.06013 ..."
259,635701.967583,Hoffman & Swinburne Islands,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,4743.12812675,Staten Island,9,SI9591,5,085,HffmnIsl,SI95,"MULTIPOLYGON (((-74.05051 40.56642, -74.05047 ..."
260,10866804.1436,Miller Field,SI95 Great Kills Park-Fort Wadsworth (JIA 95 A...,19197.2009732,Staten Island,9,SI9592,5,085,MllrFld,SI95,"MULTIPOLYGON (((-74.08469 40.57148, -74.08595 ..."


In [66]:
%%time
# Match each accident with a neighborhood.
# Will take ~1 min to run
# This is done with left join,
# so we preserve all the data points
# but we know which ones are not matching with the shapefile
gdf.crs = df_nyc.crs
gdf = gpd.sjoin(gdf, df_nyc, how='left')


CPU times: user 53.5 s, sys: 2.42 s, total: 55.9 s
Wall time: 55.7 s


In [67]:
gdf.dtypes

Unnamed: 0,0
CMPLNT_NUM,Int64
ADDR_PCT_CD,object
RPT_DT,datetime64[ns]
KY_CD,object
PD_CD,object
CRM_ATPT_CPTD_CD,object
LAW_CAT_CD,object
BORO_NM,object
LOC_OF_OCCUR_DESC,object
PREM_TYP_DESC,object


In [68]:
# We keep only boro_name and ntaname
todrop = [
    'index_right', 'shape_area', 'cdtaname', 'borocode', 'countyfips',
    'ntaabbrev', 'ntatype', 'cdta2020', 'shape_leng'
]

gdf = gdf.drop(todrop, axis='columns')

# Rename the columns
gdf = gdf.rename({
    'boroname': 'BOROUGH',
    'ntaname': 'NEIGHBORHOOD',
    'nta2020': 'NEIGHBORHOOD_CODE',
}, axis='columns')

In [69]:
gdf['BOROUGH'] = gdf['BOROUGH'].str.upper()

In [70]:
print("Entries without a detected BOROUGH:", gdf[gdf.BOROUGH.isnull()].shape[0])
# Mark as NULL all the lon/lat entries outside the NYC area
gdf.loc[gdf.BOROUGH.isnull(), 'Latitude'] = None
gdf.loc[gdf.BOROUGH.isnull(), 'Longitude'] = None

Entries without a detected BOROUGH: 1275


In [71]:
mask = gdf.query('BOROUGH != BORO_NM and Latitude==Latitude and Longitude==Longitude').CMPLNT_NUM.values

In [72]:
inconsistent = gdf.query('BOROUGH != BORO_NM and Latitude==Latitude and Longitude==Longitude').shape[0]
print("Entries where reported lon/lat is inconsistent with the reported borough:", inconsistent)

Entries where reported lon/lat is inconsistent with the reported borough: 9098


In [73]:
# Mark as NULL all the lon/lat entries that generate inconsistencies
mask = gdf.query('BOROUGH != BORO_NM and Latitude==Latitude and Longitude==Longitude').CMPLNT_NUM.values
condition = gdf.CMPLNT_NUM.isin(mask)

gdf.loc[condition, 'Latitude'] = None
gdf.loc[condition, 'Longitude'] = None

In [74]:
# We do not need the geometry anymore
gdf = gdf.drop('geometry', axis='columns')

In [75]:
df = pd.DataFrame(gdf)

In [76]:
df.BORO_NM.value_counts(dropna=False)

Unnamed: 0_level_0,count
BORO_NM,Unnamed: 1_level_1
BROOKLYN,2773846
MANHATTAN,2286368
BRONX,2051586
QUEENS,1926949
STATEN ISLAND,433878


In [77]:
# Drop the cases where the reported borough
# is different than the one detected through lon/lat
df = df[df.BOROUGH == df.BORO_NM]

In [78]:
df.drop(['BOROUGH'], axis='columns', inplace=True)

In [79]:
# We do this to allow for easier insertion to a database later on
df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].str.replace('\'', '’', regex=False)

In [80]:
# df.NEIGHBORHOOD_CODE = pd.Categorical(df.NEIGHBORHOOD_CODE)
# df.NEIGHBORHOOD = pd.Categorical(df.NEIGHBORHOOD)

### TRANSIT_DISTRICT

In [81]:
df.TRANSIT_DISTRICT.value_counts(dropna=False)


Unnamed: 0_level_0,count
TRANSIT_DISTRICT,Unnamed: 1_level_1
,9238230
4.0,34003
2.0,26478
1.0,22332
33.0,21389
3.0,20894
20.0,20890
12.0,18107
11.0,16681
32.0,15677


In [82]:
df.drop('TRANSIT_DISTRICT', axis='columns', inplace=True)


### PREM_TYP_DESC

In [83]:
df.PREM_TYP_DESC.value_counts(dropna=False)

Unnamed: 0_level_0,count
PREM_TYP_DESC,Unnamed: 1_level_1
STREET,2959965
RESIDENCE - APT. HOUSE,2023353
RESIDENCE-HOUSE,918025
RESIDENCE - PUBLIC HOUSING,689696
CHAIN STORE,276101
...,...
SMOKE SHOP,1037
CEMETERY,1005
DAYCARE FACILITY,582
LOAN COMPANY,569


In [84]:
df.PREM_TYP_DESC.isnull().sum()

np.int64(48229)

In [85]:
df = df [~df.PREM_TYP_DESC.isnull()]

In [86]:
# df.PREM_TYP_DESC = pd.Categorical(df.PREM_TYP_DESC)

In [87]:
df.PARKS_NM.value_counts(dropna=False)

Unnamed: 0_level_0,count
PARKS_NM,Unnamed: 1_level_1
,9373634
CENTRAL PARK,2539
FLUSHING MEADOWS CORONA PARK,2187
WASHINGTON SQUARE PARK,1824
CONEY ISLAND BEACH & BOARDWALK,1581
...,...
COURT SQUARE PARK,1
ST. MARY'S PARK PLAYGROUND BROOKLYN,1
VALENTINO PIER,1
AESOP PARK,1


In [88]:
df.PARKS_NM.value_counts().sum()

np.int64(40391)

In [89]:
df.drop('PARKS_NM', axis='columns', inplace=True)



 19  HADEVELOPT         object


In [90]:
df.HADEVELOPT.value_counts(dropna=False)

Unnamed: 0_level_0,count
HADEVELOPT,Unnamed: 1_level_1
,9380739
INGERSOLL,4822
WALD,2814
NOSTRAND,2565
WILLIAMSBURG,2552
RIIS,2100
MARLBORO,2053
MANHATTANVILLE,2009
GRANT,1992
SHEEPSHEAD BAY,1839


In [91]:
df.drop('HADEVELOPT', axis='columns', inplace=True)


 20  HOUSING_PSA        object



In [92]:
df.HOUSING_PSA.value_counts(dropna=False)

Unnamed: 0_level_0,count
HOUSING_PSA,Unnamed: 1_level_1
,8184982
,526585
670,9725
887,9690
720,8371
...,...
73315,1
73635,1
50063,1
73473,1


In [93]:
df.HOUSING_PSA.value_counts().sum()

np.int64(702458)

In [94]:
df.drop('HOUSING_PSA', axis='columns', inplace=True)

 30  PATROL_BORO        object


In [95]:
df.PATROL_BORO.value_counts(dropna=False)

Unnamed: 0_level_0,count
PATROL_BORO,Unnamed: 1_level_1
PATROL BORO BRONX,2036975
PATROL BORO BKLYN SOUTH,1385061
PATROL BORO BKLYN NORTH,1373567
PATROL BORO MAN SOUTH,1156083
PATROL BORO MAN NORTH,1117474
PATROL BORO QUEENS NORTH,996540
PATROL BORO QUEENS SOUTH,916530
PATROL BORO STATEN ISLAND,431719
,76


In [96]:
df = df[~df.PATROL_BORO.isnull()]

In [97]:
# df.PATROL_BORO = pd.Categorical(df.PATROL_BORO)

 31  STATION_NAME       object

In [98]:
df.STATION_NAME.value_counts(dropna=False)

Unnamed: 0_level_0,count
STATION_NAME,Unnamed: 1_level_1
,9192335
125 STREET,9799
14 STREET,5524
42 ST.-PORT AUTHORITY BUS TERM,5373
34 ST.-PENN STATION,4780
...,...
DISTRICT 30 OFFICE,22
DISTRICT 12 OFFICE,21
DISTRICT 34 OFFICE,18
OFF-SYSTEM,8


In [99]:
df.drop('STATION_NAME', axis='columns', inplace=True)

In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9413949 entries, 0 to 9491945
Data columns (total 24 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         Int64         
 1   ADDR_PCT_CD        object        
 2   RPT_DT             datetime64[ns]
 3   KY_CD              object        
 4   PD_CD              object        
 5   CRM_ATPT_CPTD_CD   object        
 6   LAW_CAT_CD         object        
 7   BORO_NM            object        
 8   LOC_OF_OCCUR_DESC  object        
 9   PREM_TYP_DESC      object        
 10  JURISDICTION_CODE  object        
 11  SUSP_AGE_GROUP     object        
 12  SUSP_RACE          object        
 13  SUSP_SEX           object        
 14  Latitude           float64       
 15  Longitude          float64       
 16  PATROL_BORO        object        
 17  VIC_AGE_GROUP      object        
 18  VIC_RACE           object        
 19  VIC_SEX            object        
 20  CMPLNT_FR          datetime64

## Data exploration

In this part we check the different values that appear in the columns. When we detect noisy results, we delete the corresponding values. In fact, many of the operations that are performed above, in the 'data cleaning' section, are the result of observations that we make here.

In [101]:
# Find the unique values in each column
#
# df.describe(include = [np.object, 'category']).T['unique']
unique = df.describe(include = 'all').T['unique'].sort_values()

display(unique)

Unnamed: 0,unique
CRM_ATPT_CPTD_CD,2.0
SUSP_SEX,2.0
LAW_CAT_CD,3.0
VIC_SEX,5.0
BORO_NM,5.0
VIC_AGE_GROUP,5.0
SUSP_AGE_GROUP,5.0
LOC_OF_OCCUR_DESC,6.0
VIC_RACE,6.0
SUSP_RACE,6.0


In [102]:
#for column in unique.index:
#    if unique[column] < 200:
#        print(df[column].value_counts())
#        print("=====")

CRM_ATPT_CPTD_CD
COMPLETED    9258828
ATTEMPTED     155121
Name: count, dtype: int64
=====
SUSP_SEX
M    3420363
F    1037576
Name: count, dtype: int64
=====
LAW_CAT_CD
MISDEMEANOR    5174805
FELONY         2952025
VIOLATION      1287119
Name: count, dtype: int64
=====
VIC_SEX
F    3668248
M    3132552
E    1359748
D    1244007
L       9394
Name: count, dtype: int64
=====
BORO_NM
BROOKLYN         2758617
MANHATTAN        2274917
BRONX            2036963
QUEENS           1911732
STATEN ISLAND     431720
Name: count, dtype: int64
=====
VIC_AGE_GROUP
25-44    3143326
45-64    1629476
18-24     941080
<18       436178
65+       354023
Name: count, dtype: int64
=====
SUSP_AGE_GROUP
25-44    1783817
18-24     643829
45-64     621052
<18       214983
65+        58313
Name: count, dtype: int64
=====
LOC_OF_OCCUR_DESC
INSIDE         4826737
FRONT OF       2237479
None           1925887
OPPOSITE OF     235013
REAR OF         188600
nan                233
Name: count, dtype: int64
=====
VIC_RACE


TypeError: boolean value of NA is ambiguous

In [105]:
# With all the proper data typing the dataset went down in size from 1.9Gb+ to 425Mb.
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9413949 entries, 0 to 9491945
Data columns (total 24 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         Int64         
 1   ADDR_PCT_CD        Int64         
 2   RPT_DT             datetime64[ns]
 3   KY_CD              Int64         
 4   PD_CD              Int64         
 5   CRM_ATPT_CPTD_CD   object        
 6   LAW_CAT_CD         object        
 7   BORO_NM            object        
 8   LOC_OF_OCCUR_DESC  object        
 9   PREM_TYP_DESC      object        
 10  JURISDICTION_CODE  Int64         
 11  SUSP_AGE_GROUP     object        
 12  SUSP_RACE          object        
 13  SUSP_SEX           object        
 14  Latitude           float64       
 15  Longitude          float64       
 16  PATROL_BORO        object        
 17  VIC_AGE_GROUP      object        
 18  VIC_RACE           object        
 19  VIC_SEX            object        
 20  CMPLNT_FR          datetime64

In [None]:
df.dtypes

In [104]:
# prompt: Convert all the category data types in the dataframe df into string

#for col in df.select_dtypes(include='category').columns:
#    df[col] = df[col].astype(str)

# df = df.replace(to_replace = 'nan', value=None)

df.KY_CD = pd.to_numeric(df.KY_CD, errors='coerce').astype('Int64')
df.PD_CD = pd.to_numeric(df.PD_CD, errors='coerce').astype('Int64')
df.ADDR_PCT_CD = pd.to_numeric(df.ADDR_PCT_CD, errors='coerce').astype('Int64')
df.JURISDICTION_CODE = pd.to_numeric(df.JURISDICTION_CODE, errors='coerce').astype('Int64')

## Storing in a MySQL database

In [106]:
!sudo pip3 install -U -q PyMySQL sqlalchemy

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/45.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.0/45.0 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [107]:
import os
from sqlalchemy import create_engine
from sqlalchemy import text

conn_string = 'mysql+pymysql://{user}:{password}@{host}/?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org',
    user = 'root',
    password = mysql_pass,
    encoding = 'utf8mb4')

engine = create_engine(conn_string)


In [125]:
# Query to create a database
db_name = 'nypd'

sql = f"DROP DATABASE IF EXISTS {db_name}"
with engine.connect() as connection:
  connection.execute(text(sql))

# Create a database
sql = f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8mb4'"
with engine.connect() as connection:
  connection.execute(text(sql))


In [127]:
# And lets switch to the database
sql = f"USE {db_name}"
with engine.connect() as connection:
  connection.execute(text(sql))


In [128]:
NEIGHBORHOOD_enum = "ENUM('" + ("','".join(sorted(df.NEIGHBORHOOD.astype(str).unique()))) + "')"


In [129]:
print(NEIGHBORHOOD_enum)

ENUM('Allerton','Alley Pond Park','Annadale-Huguenot-Prince’s Bay-Woodrow','Arden Heights-Rossville','Astoria (Central)','Astoria (East)-Woodside (North)','Astoria (North)-Ditmars-Steinway','Astoria Park','Auburndale','Baisley Park','Barren Island-Floyd Bennett Field','Bath Beach','Bay Ridge','Bay Terrace-Clearview','Bayside','Bedford Park','Bedford-Stuyvesant (East)','Bedford-Stuyvesant (West)','Bellerose','Belmont','Bensonhurst','Borough Park','Breezy Point-Belle Harbor-Rockaway Park-Broad Channel','Brighton Beach','Bronx Park','Brooklyn Heights','Brooklyn Navy Yard','Brownsville','Bushwick (East)','Bushwick (West)','Calvary & Mount Zion Cemeteries','Calvert Vaux Park','Cambria Heights','Canarsie','Canarsie Park & Pier','Carroll Gardens-Cobble Hill-Gowanus-Red Hook','Castle Hill-Unionport','Central Park','Chelsea-Hudson Yards','Chinatown-Two Bridges','Claremont Park','Claremont Village-Claremont (East)','Clinton Hill','Co-op City','College Point','Concourse-Concourse Village','Coney 

In [130]:
NCODE_enum = "ENUM('" + ("','".join(sorted(df.NEIGHBORHOOD_CODE.astype(str).unique()))) + "')"

In [131]:
# In principle, we can let Pandas create the table, but we want to be a bit more predise
# with the data types, and we want to add documentation for each column
# from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i


sql = f'''
CREATE TABLE {db_name}.nypd (
  CMPLNT_NUM int,
  CMPLNT_FR datetime,
  CMPLNT_TO datetime,
  RPT_DT date,
  KY_CD SMALLINT,
  PD_CD SMALLINT,
  CRM_ATPT_CPTD_CD enum('COMPLETED','ATTEMPTED'),
  LAW_CAT_CD enum('FELONY','MISDEMEANOR','VIOLATION'),
  JURISDICTION_CODE SMALLINT,
  BORO_NM enum('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN ISLAND'),
  NEIGHBORHOOD {NEIGHBORHOOD_enum},
  NEIGHBORHOOD_CODE {NCODE_enum},
  ADDR_PCT_CD SMALLINT,
  LOC_OF_OCCUR_DESC enum('FRONT OF','INSIDE','OPPOSITE OF','OUTSIDE','REAR OF'),
  PATROL_BORO enum('PATROL BORO BRONX', 'PATROL BORO BKLYN SOUTH','PATROL BORO BKLYN NORTH','PATROL BORO MAN SOUTH','PATROL BORO MAN NORTH','PATROL BORO QUEENS NORTH','PATROL BORO QUEENS SOUTH','PATROL BORO STATEN ISLAND'),
  PREM_TYP_DESC varchar(30),
  SUSP_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  VIC_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  SUSP_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  VIC_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  SUSP_SEX enum('M', 'F'),
  VIC_SEX enum('M', 'F', 'E', 'D', 'L'),
  Latitude double,
  Longitude double,
  PRIMARY KEY (CMPLNT_NUM)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))

In [142]:
# Create a table
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    # print("Batch:",i)
    # continue # Cannot execute this on Travis
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False,
        chunksize = 1000)

100%|██████████| 189/189 [13:13<00:00,  4.20s/it]


In [143]:
sql = "CREATE INDEX ix_lat ON nypd.nypd(Latitude)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [144]:
sql = "CREATE INDEX ix_lon ON nypd.nypd(Longitude)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [145]:
sql = "CREATE INDEX ix_LAW_CAT_CD ON nypd.nypd(LAW_CAT_CD)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [146]:
sql = "CREATE INDEX ix_BORO_NM ON nypd.nypd(BORO_NM)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [147]:
sql = "CREATE INDEX ix_KY_CD ON nypd.nypd(KY_CD)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [148]:
sql = "CREATE INDEX ix_RPT_DT ON nypd.nypd(RPT_DT)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [149]:
sql = "CREATE INDEX ix_CMPLNT_FR ON nypd.nypd(CMPLNT_FR)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [150]:
sql = "DROP TABLE IF EXISTS offense_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE offense_codes (
  KY_CD smallint,
  OFNS_DESC varchar(32),
  PRIMARY KEY (KY_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))

offenses.to_sql(
        name = 'offense_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

75

In [151]:
sql = "DROP TABLE IF EXISTS jurisdiction_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE jurisdiction_codes (
  JURISDICTION_CODE smallint,
  JURIS_DESC varchar(40),
  PRIMARY KEY (JURISDICTION_CODE)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))


jusridiction.to_sql(
        name = 'jurisdiction_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

28

In [153]:
sql = "DROP TABLE IF EXISTS penal_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE penal_codes (
  PD_CD smallint,
  PD_DESC varchar(80),
  PRIMARY KEY (PD_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))


internal.to_sql(
        name = 'penal_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

441

In [154]:
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/files/65f25845-1551-4d21-91dc-869c977cd93d?download=true&filename=PDCode_PenalLaw.xlsx' -o PDCode_PenalLaw.xlsx

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  233k    0  233k    0     0   541k      0 --:--:-- --:--:-- --:--:--  541k


In [155]:
penal_code_df = pd.read_excel('PDCode_PenalLaw.xlsx')

In [156]:
penal_code_df.to_sql(
        name = 'pd_code_penal_law',
        schema = db_name,
        con = engine,
        if_exists = 'replace',
        index = False)

4671

# BigQuery

In [110]:
!pip install -q google-cloud-bigquery pandas-gbq

from google.colab import auth
auth.authenticate_user()

In [111]:
from google.cloud import bigquery
import pandas_gbq

In [112]:
# Add this section to write to BigQuery
project_id = "nyu-datasets"
dataset_id = "nypd_complaints"

# Initialize BigQuery client
client = bigquery.Client(project=project_id)

# Create the dataset if it doesn't exist
try:
    client.get_dataset(dataset_id)
    print(f"Dataset {dataset_id} already exists.")
except:
    dataset = bigquery.Dataset(f"{project_id}.{dataset_id}")
    dataset.location = "US" # Or your preferred location
    dataset = client.create_dataset(dataset, exists_ok=True)
    print(f"Dataset {dataset_id} created.")


Dataset nypd_complaints already exists.


In [114]:
# prompt: I want to store to BigQuery (to the dataset above) the tables that were written in MySQL. I want to define first the scema for each table, with descriptions for each column, and then use pandas_gbq to store the data in BigQuery.
# Then use SQL code that ALTERs the tables to assign descriptions to them and add (non enforced) PRIMARY and FOREIGN KEY designations in the tables.

# Define schema for nypd table
# The schema is defined as a list of bigquery.SchemaField objects.
nypd_schema_fields = [
    bigquery.SchemaField("CMPLNT_NUM", "INT64", mode="NULLABLE", description="Randomly generated persistent ID for each complaint"),
    bigquery.SchemaField("CMPLNT_FR", "TIMESTAMP", mode="NULLABLE", description="Exact date and time of occurrence for the reported event (or starting date and time of occurrence)"),
    bigquery.SchemaField("CMPLNT_TO", "TIMESTAMP", mode="NULLABLE", description="Ending date and time of occurrence for the reported event, if exact time of occurrence is unknown"),
    bigquery.SchemaField("RPT_DT", "DATE", mode="NULLABLE", description="Date event was reported to police"),
    bigquery.SchemaField("KY_CD", "INT64", mode="NULLABLE", description="Three digit offense classification code"),
    bigquery.SchemaField("PD_CD", "INT64", mode="NULLABLE", description="Three digit internal classification code (more granular than Key Code)"),
    bigquery.SchemaField("CRM_ATPT_CPTD_CD", "STRING", mode="NULLABLE", description="Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely"),
    bigquery.SchemaField("LAW_CAT_CD", "STRING", mode="NULLABLE", description="Level of offense: felony, misdemeanor, violation"),
    bigquery.SchemaField("JURISDICTION_CODE", "INT64", mode="NULLABLE", description="Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc."),
    bigquery.SchemaField("BORO_NM", "STRING", mode="NULLABLE", description="The name of the borough in which the incident occurred"),
    bigquery.SchemaField("NEIGHBORHOOD", "STRING", mode="NULLABLE", description="Name of the Neighborhood Tabulation Area (NTA)"),
    bigquery.SchemaField("NEIGHBORHOOD_CODE", "STRING", mode="NULLABLE", description="Code for the Neighborhood Tabulation Area (NTA)"),
    bigquery.SchemaField("ADDR_PCT_CD", "INT64", mode="NULLABLE", description="The precinct in which the incident occurred"),
    bigquery.SchemaField("LOC_OF_OCCUR_DESC", "STRING", mode="NULLABLE", description="Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of"),
    bigquery.SchemaField("PATROL_BORO", "STRING", mode="NULLABLE", description="The name of the patrol borough in which the incident occurred"),
    bigquery.SchemaField("PREM_TYP_DESC", "STRING", mode="NULLABLE", description="Specific description of premises; grocery store, residence, street, etc."),
    bigquery.SchemaField("SUSP_RACE", "STRING", mode="NULLABLE", description="Suspect’s Race Description"),
    bigquery.SchemaField("VIC_RACE", "STRING", mode="NULLABLE", description="Victim’s Race Description"),
    bigquery.SchemaField("SUSP_AGE_GROUP", "STRING", mode="NULLABLE", description="Suspect’s Age Group"),
    bigquery.SchemaField("VIC_AGE_GROUP", "STRING", mode="NULLABLE", description="Victim’s Age Group"),
    bigquery.SchemaField("SUSP_SEX", "STRING", mode="NULLABLE", description="Suspect’s Sex Description"),
    bigquery.SchemaField("VIC_SEX", "STRING", mode="NULLABLE", description="Victim’s Sex Description"),
    bigquery.SchemaField("Latitude", "FLOAT64", mode="NULLABLE", description="Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)"),
    bigquery.SchemaField("Longitude", "FLOAT64", mode="NULLABLE", description="Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)"),
]

# Convert SchemaField objects to dictionaries for pandas_gbq
nypd_schema = [field.to_api_repr() for field in nypd_schema_fields]


# Define schema for offense_codes table
offense_codes_schema_fields = [
    bigquery.SchemaField("KY_CD", "INT64", mode="NULLABLE", description="Three digit offense classification code"),
    bigquery.SchemaField("OFNS_DESC", "STRING", mode="NULLABLE", description="Description of offense corresponding with key code"),
]
offense_codes_schema = [field.to_api_repr() for field in offense_codes_schema_fields]


# Define schema for jurisdiction_codes table
jurisdiction_codes_schema_fields = [
    bigquery.SchemaField("JURISDICTION_CODE", "INT64", mode="NULLABLE", description="Jurisdiction responsible for incident code"),
    bigquery.SchemaField("JURIS_DESC", "STRING", mode="NULLABLE", description="Description of the jurisdiction code"),
]
jurisdiction_codes_schema = [field.to_api_repr() for field in jurisdiction_codes_schema_fields]


# Define schema for penal_codes table
penal_codes_schema_fields = [
    bigquery.SchemaField("PD_CD", "INT64", mode="NULLABLE", description="Three digit internal classification code"),
    bigquery.SchemaField("PD_DESC", "STRING", mode="NULLABLE", description="Description of internal classification corresponding with PD code"),
]
penal_codes_schema = [field.to_api_repr() for field in penal_codes_schema_fields]


# Define schema for pd_code_penal_law table
pd_code_penal_law_schema_fields = [
    bigquery.SchemaField("PD_CD", "INT64", mode="NULLABLE"),
    bigquery.SchemaField("PD_DESC", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("PENAL_LAW", "STRING", mode="NULLABLE"), # Assuming this column name from the excel file
    bigquery.SchemaField("SECTION", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("SUBDIVISION", "STRING", mode="NULLABLE"),
]
pd_code_penal_law_schema = [field.to_api_repr() for field in pd_code_penal_law_schema_fields]



In [117]:
offenses.KY_CD = pd.to_numeric(offenses.KY_CD).astype('Int64')
jusridiction.JURISDICTION_CODE = pd.to_numeric(jusridiction.JURISDICTION_CODE).astype('Int64')
internal.PD_CD = pd.to_numeric(internal.PD_CD).astype('Int64')


In [118]:

# Store the dataframes to BigQuery
# Ensure that the dataframes (df, offenses, jusridiction, internal, penal_code_df) are defined before running this cell.
pandas_gbq.to_gbq(offenses, f"{dataset_id}.offense_codes", project_id=project_id, if_exists='replace', table_schema=offense_codes_schema)
pandas_gbq.to_gbq(jusridiction, f"{dataset_id}.jurisdiction_codes", project_id=project_id, if_exists='replace', table_schema=jurisdiction_codes_schema)
pandas_gbq.to_gbq(internal, f"{dataset_id}.penal_codes", project_id=project_id, if_exists='replace', table_schema=penal_codes_schema)
pandas_gbq.to_gbq(penal_code_df, f"{dataset_id}.pd_code_penal_law", project_id=project_id, if_exists='replace', table_schema=pd_code_penal_law_schema)



100%|██████████| 1/1 [00:00<00:00, 15592.21it/s]
100%|██████████| 1/1 [00:00<00:00, 18724.57it/s]
100%|██████████| 1/1 [00:00<00:00, 16256.99it/s]
100%|██████████| 1/1 [00:00<00:00, 18001.30it/s]


In [119]:
%%time
pandas_gbq.to_gbq(df, f"{dataset_id}.nypd", project_id=project_id, if_exists='replace', table_schema=nypd_schema)


100%|██████████| 1/1 [00:00<00:00, 19418.07it/s]


In [120]:

# Use SQL to ALTER tables and add descriptions and key designations

# Add descriptions to tables
client.query(f"""
ALTER TABLE `{project_id}.{dataset_id}.nypd`
SET OPTIONS(description='NYPD Complaint Data Historic');

ALTER TABLE `{project_id}.{dataset_id}.offense_codes`
SET OPTIONS(description='Mapping from Offense Code (KY_CD) to Offense Description');

ALTER TABLE `{project_id}.{dataset_id}.jurisdiction_codes`
SET OPTIONS(description='Mapping from Jurisdiction Code to Jurisdiction Description');

ALTER TABLE `{project_id}.{dataset_id}.penal_codes`
SET OPTIONS(description='Mapping from Penal Code (PD_CD) to Penal Description');

ALTER TABLE `{project_id}.{dataset_id}.pd_code_penal_law`
SET OPTIONS(description='Mapping from PD Code to Penal Law sections');
""").result()

print("Table descriptions added.")


Table descriptions added.


In [124]:
# Add PRIMARY KEY designation (BigQuery does not enforce PRIMARY/FOREIGN KEY constraints, but you can add them for documentation/metadata)
client.query(f"""
ALTER TABLE `{project_id}.{dataset_id}.nypd`
  ADD PRIMARY KEY (CMPLNT_NUM) NOT ENFORCED;

ALTER TABLE `{project_id}.{dataset_id}.offense_codes`
  ADD PRIMARY KEY (KY_CD) NOT ENFORCED;

ALTER TABLE `{project_id}.{dataset_id}.jurisdiction_codes`
  ADD PRIMARY KEY (JURISDICTION_CODE) NOT ENFORCED;

ALTER TABLE `{project_id}.{dataset_id}.penal_codes`
  ADD PRIMARY KEY (PD_CD) NOT ENFORCED;
""").result()

print("Primary Key added to tables.")

BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/nyu-datasets/queries/3937f31b-e31d-469f-90e5-901259c1a266?maxResults=0&location=US&prettyPrint=false: Already Exists: Constraint primary key at [2:1]

Location: US
Job ID: 3937f31b-e31d-469f-90e5-901259c1a266


In [123]:
# prompt: Add (non enforced) FOREIGN KEYS in the nypd table

# Add (non enforced) FOREIGN KEY designations in the nypd table
client.query(f"""
ALTER TABLE `{project_id}.{dataset_id}.nypd`
  ADD CONSTRAINT fk_nypd_offense_codes
  FOREIGN KEY (KY_CD) REFERENCES `{project_id}.{dataset_id}.offense_codes` (KY_CD) NOT ENFORCED;

ALTER TABLE `{project_id}.{dataset_id}.nypd`
  ADD CONSTRAINT fk_nypd_penal_codes
  FOREIGN KEY (PD_CD) REFERENCES `{project_id}.{dataset_id}.penal_codes` (PD_CD) NOT ENFORCED;

ALTER TABLE `{project_id}.{dataset_id}.nypd`
  ADD CONSTRAINT fk_nypd_jurisdiction_codes
  FOREIGN KEY (JURISDICTION_CODE) REFERENCES `{project_id}.{dataset_id}.jurisdiction_codes` (JURISDICTION_CODE) NOT ENFORCED;
""").result()

print("Foreign Keys added to nypd table.")


Foreign Keys added to nypd table.


In [122]:
client.query(f"""
-- Add column descriptions to the pd_code_penal_law table
ALTER TABLE `{project_id}.{dataset_id}.pd_code_penal_law`
  ALTER COLUMN PDCODE_VALUE SET OPTIONS (description="Three digit internal classification code"),
  ALTER COLUMN CATEGORY SET OPTIONS (description="Description of internal classification corresponding with PD code"),
  ALTER COLUMN LAW_NYS SET OPTIONS (description="The specific section of NYS Penal Law that the code maps to"),
  ALTER COLUMN LIT_LONG SET OPTIONS (description="The section number within the Penal Law"),
  ALTER COLUMN LIT_SHORT SET OPTIONS (description="The subdivision within the Penal Law section");
""").result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x78465874c910>