<a href="https://colab.research.google.com/github/ipeirotis/datasets/blob/master/notebooks/NYPD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NYPD Dataset

Dataset description at
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i



| Column | Description |
|--------|-------------------|
| CMPLNT_NUM |  Randomly generated persistent ID for each complaint  |  
| ADDR_PCT_CD |  The precinct in which the incident occurred |  
| BORO |  The name of the borough in which the incident occurred |  
| CMPLNT_FR_DT |  Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |  
| CMPLNT_FR_TM |  Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |  
| CMPLNT_TO_DT |  Ending date of occurrence for the reported event, if exact time of occurrence is unknown |  
| CMPLNT_TO_TM |  Ending time of occurrence for the reported event, if exact time of occurrence is unknown |  
| CRM_ATPT_CPTD_CD |  Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |  
| HADEVELOPT |  Name of NYCHA housing development of occurrence, if applicable |  
| HOUSING_PSA |  Development Level Code |  
| JURISDICTION_CODE |  Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |  
| JURIS_DESC |  Description of the jurisdiction code |  
| KY_CD |  Three digit offense classification code |  
| LAW_CAT_CD |  Level of offense: felony, misdemeanor, violation  |  
| LOC_OF_OCCUR_DESC |  Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |  
| OFNS_DESC |  Description of offense corresponding with key code |  
| PARKS_NM |  Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |  
| PATROL_BORO |  The name of the patrol borough in which the incident occurred |  
| PD_CD |  Three digit internal classification code (more granular than Key Code) |  
| PD_DESC |  Description of internal classification corresponding with PD code (more granular than Offense Description) |  
| PREM_TYP_DESC |  Specific description of premises; grocery store, residence, street, etc. |  
| RPT_DT |  Date event was reported to police  |  
| STATION_NAME |  Transit station name |  
| SUSP_AGE_GROUP |  Suspect’s Age Group |  
| SUSP_RACE |  Suspect’s Race Description |  
| SUSP_SEX |  Suspect’s Sex Description |  
| TRANSIT_DISTRICT |  Transit district in which the offense occurred. |  
| VIC_AGE_GROUP |  Victim’s Age Group |  
| VIC_RACE |  Victim’s Race Description |  
| VIC_SEX |  Victim’s Sex Description |  
| X_COORD_CD |  X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Y_COORD_CD |  Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Latitude |  Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)  |  
| Longitude |  Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |


In [2]:
!pip install -q google-cloud-secret-manager

from google.colab import auth
auth.authenticate_user()

from google.cloud import secretmanager

def access_secret_version(project_id, secret_id, version_id):
    """
    Access the payload of the given secret version and return it.

    Args:
        project_id (str): Google Cloud project ID.
        secret_id (str): ID of the secret to access.
        version_id (str): ID of the version to access.
    Returns:
        str: The secret version's payload, or None if
        the version does not exist.
    """
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")


mysql_pass = access_secret_version("nyu-datasets", "MYSQL_PASSWORD", "latest")

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━[0m [32m92.2/116.6 kB[0m [31m2.8 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.6/116.6 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
import pandas as pd
import numpy as np

In [4]:
# We load everything as an object/string, because some data types (e.g., some IDs)
# are recognized as decimals, and it is a mess to restore them back
# So we will do all the conversions ourselves later on

# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2889M    0 2889M    0     0  5355k      0 --:--:--  0:09:12 --:--:-- 5162k


In [125]:
%%time
df = pd.read_csv('nypd.csv', low_memory = True, dtype='object')

CPU times: user 45.6 s, sys: 4.69 s, total: 50.3 s
Wall time: 50.1 s


In [6]:
# We load directly from the URL
# url = 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD'
# df = pd.read_csv(url, low_memory = True, dtype='object')


In [126]:
df.replace(to_replace = '(null)', value=np.nan, inplace = True)

In [127]:
len(df)

8353049

In [128]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8353049 entries, 0 to 8353048
Data columns (total 35 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   CMPLNT_NUM         object
 1   CMPLNT_FR_DT       object
 2   CMPLNT_FR_TM       object
 3   CMPLNT_TO_DT       object
 4   CMPLNT_TO_TM       object
 5   ADDR_PCT_CD        object
 6   RPT_DT             object
 7   KY_CD              object
 8   OFNS_DESC          object
 9   PD_CD              object
 10  PD_DESC            object
 11  CRM_ATPT_CPTD_CD   object
 12  LAW_CAT_CD         object
 13  BORO_NM            object
 14  LOC_OF_OCCUR_DESC  object
 15  PREM_TYP_DESC      object
 16  JURIS_DESC         object
 17  JURISDICTION_CODE  object
 18  PARKS_NM           object
 19  HADEVELOPT         object
 20  HOUSING_PSA        object
 21  X_COORD_CD         object
 22  Y_COORD_CD         object
 23  SUSP_AGE_GROUP     object
 24  SUSP_RACE          object
 25  SUSP_SEX           object
 26  TRANSIT_DISTRI

## Data Cleaning

In [129]:
# These columns are redundant
to_drop = ['Lat_Lon','X_COORD_CD','Y_COORD_CD']
# We created the CMPLNT_FR and CMPLNT_TO columns
# We have the longitude and latitude so the other coordinates are not needed
df = df.drop(to_drop, axis='columns')

###  CMPLNT_NUM         object   

In [130]:
df.CMPLNT_NUM = pd.to_numeric(df.CMPLNT_NUM, errors="coerce")
df = df[~df.CMPLNT_NUM.isna()]
# Drop cases with duplicated complaint numbers
key_cnt = df.CMPLNT_NUM.value_counts()
key_cnt [ key_cnt>1 ]
df = df[ ~df.CMPLNT_NUM.isin( key_cnt [ key_cnt>1 ].index.values ) ]

In [131]:
len(df)

8351263

In [132]:
df.CMPLNT_NUM = df.CMPLNT_NUM.astype('int32')

In [13]:
df.CMPLNT_NUM

0           10600119
1           11052575
2           10832306
3           10107192
4           23893731
             ...    
8353044    261171983
8353045    261175492
8353046    261147482
8353047    261179651
8353048    261157928
Name: CMPLNT_NUM, Length: 8351263, dtype: int32

### CMPLNT_FR_DT       object
### CMPLNT_FR_TM       object
### CMPLNT_TO_DT       object
### CMPLNT_TO_TM       object

In [133]:
# There are a few rows that contain year 1015, 1016, ... that trigger an error during date conversion
# We replace all years written as 10XX with 20XX
# Note the usage of regular expressions
df.CMPLNT_FR_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', inplace = True, regex=True )
df.CMPLNT_TO_DT.replace(to_replace = '(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', inplace = True, regex=True )

# Similarly, a few hours are written as 24:00:00, which also triggers errors.
# We fix these hours
df.CMPLNT_FR_TM.replace(to_replace = '24:00:00', value='00:00:00', inplace = True)
df.CMPLNT_TO_TM.replace(to_replace = '24:00:00', value='00:00:00', inplace = True)

# Convert the two separate date and time columns into single datetime columns
df['CMPLNT_FR'] = pd.to_datetime(df.CMPLNT_FR_DT + ' ' + df.CMPLNT_FR_TM, format='%m/%d/%Y %H:%M:%S', cache=True, errors="coerce")
df['CMPLNT_TO'] = pd.to_datetime(df.CMPLNT_TO_DT + ' ' + df.CMPLNT_TO_TM, format='%m/%d/%Y %H:%M:%S', cache=True, errors="coerce")

# These columns are redundant
to_drop = ['CMPLNT_FR_DT','CMPLNT_TO_DT','CMPLNT_FR_TM','CMPLNT_TO_TM']
# We created the CMPLNT_FR and CMPLNT_TO columns
# We have the longitude and latitude so the other coordinates are not needed
df = df.drop(to_drop, axis='columns')

In [134]:
df.CMPLNT_FR.isnull().sum()

702

In [135]:
df.CMPLNT_TO.isnull().sum()

1777806

In [136]:
df = df [ ~df.CMPLNT_FR.isnull() ]

###  ADDR_PCT_CD        object

In [137]:

df.ADDR_PCT_CD.replace(to_replace = '-99', value='99', inplace = True)
# df = df [ ~df.ADDR_PCT_CD.isnull() ]
df.ADDR_PCT_CD = pd.Categorical(df.ADDR_PCT_CD)

###  RPT_DT             object

In [138]:


# Convert RPT_DT to date
df.RPT_DT = pd.to_datetime(df.RPT_DT, format="%m/%d/%Y", cache=True)

### 7   KY_CD              object
### 8   OFNS_DESC          object

In [142]:
df.KY_CD.value_counts()

341    1446598
578    1103329
344     876981
109     732617
351     682285
        ...   
881         16
357         12
123          7
362          5
577          3
Name: KY_CD, Length: 74, dtype: int64

In [139]:
df.OFNS_DESC.replace(to_replace = 'KIDNAPPING', value='KIDNAPPING & RELATED OFFENSES', inplace = True)
df.OFNS_DESC.replace(to_replace = 'KIDNAPPING AND RELATED OFFENSES', value='KIDNAPPING & RELATED OFFENSES', inplace = True)

df.OFNS_DESC.replace(to_replace = 'AGRICULTURE & MRKTS LAW-UNCLASSIFIED', value='OTHER STATE LAWS (NON PENAL LAW)', inplace = True)
df.OFNS_DESC.replace(to_replace = 'OTHER STATE LAWS (NON PENAL LA', value='OTHER STATE LAWS (NON PENAL LAW)', inplace = True)

df.OFNS_DESC.replace(to_replace = 'ENDAN WELFARE INCOMP', value='OFFENSES RELATED TO CHILDREN', inplace = True)

df.OFNS_DESC.replace(to_replace = 'THEFT OF SERVICES', value='OTHER OFFENSES RELATED TO THEF', inplace = True)

df.OFNS_DESC.replace(to_replace = 'NYS LAWS-UNCLASSIFIED VIOLATION', value='OTHER STATE LAWS', inplace = True)

df.OFNS_DESC.replace(to_replace = 'FELONY SEX CRIMES', value='SEX CRIMES', inplace = True)

df.loc[df.KY_CD=='120','OFNS_DESC'] ='CHILD ABANDONMENT/NON SUPPORT'

df.loc[df.KY_CD=='125','OFNS_DESC'] ='NYS LAWS-UNCLASSIFIED FELONY'

offenses = df[ ["KY_CD", "OFNS_DESC"] ].drop_duplicates().dropna()
offenses['KY_CD'] = pd.Categorical(pd.to_numeric(offenses['KY_CD'] ).astype(int))
offenses = offenses.set_index("KY_CD")
offenses = offenses.sort_index()
offenses = offenses.reset_index()
offenses


Unnamed: 0,KY_CD,OFNS_DESC
0,102,HOMICIDE-NEGLIGENT-VEHICLE
1,103,"HOMICIDE-NEGLIGENT,UNCLASSIFIE"
2,104,RAPE
3,105,ROBBERY
4,106,FELONY ASSAULT
...,...,...
69,676,NEW YORK CITY HEALTH CODE
70,677,OTHER STATE LAWS
71,678,MISCELLANEOUS PENAL LAW
72,685,ADMINISTRATIVE CODES


In [143]:
df.KY_CD = pd.Categorical(df.KY_CD)

In [144]:
df = df.drop('OFNS_DESC', axis='columns')

KeyError: ignored

### 9   PD_CD              object
### 10  PD_DESC            object

In [145]:



df.loc[df.PD_CD=='694','PD_DESC'] ='INCEST'

df.loc[df.PD_CD=='234','PD_DESC'] ='BURGLARY,UNKNOWN TIME'

internal = df[ ["PD_CD", "PD_DESC"] ].drop_duplicates().dropna()
internal['PD_CD'] = pd.Categorical(pd.to_numeric(internal['PD_CD'] ).astype(int))
internal = internal.set_index("PD_CD")
internal = internal.sort_index()
internal = internal.reset_index()
internal

Unnamed: 0,PD_CD,PD_DESC
0,100,STALKING COMMIT SEX OFFENSE
1,101,ASSAULT 3
2,102,ASSAULT SCHOOL SAFETY AGENT
3,103,ASSAULT TRAFFIC AGENT
4,104,VEHICULAR ASSAULT (INTOX DRIVE
...,...,...
434,916,LEAVING SCENE-ACCIDENT-PERSONA
435,918,RECKLESS DRIVING
436,922,"TRAFFIC,UNCLASSIFIED MISDEMEAN"
437,969,"TRAFFIC,UNCLASSIFIED INFRACTIO"


In [146]:
df.PD_CD.isnull().sum()

0

In [25]:
# df = df[~df.PD_CD.isnull()]

In [147]:
df.PD_CD = pd.Categorical(df.PD_CD)

In [148]:
df = df.drop('PD_DESC', axis='columns')

### 11  CRM_ATPT_CPTD_CD   object

In [149]:
df.CRM_ATPT_CPTD_CD.value_counts()

COMPLETED    8209913
ATTEMPTED     140480
Name: CRM_ATPT_CPTD_CD, dtype: int64

In [152]:
df.CRM_ATPT_CPTD_CD = pd.Categorical(df.CRM_ATPT_CPTD_CD)

In [150]:
df.CRM_ATPT_CPTD_CD.isnull().sum()

168

In [151]:
df = df [ ~df.CRM_ATPT_CPTD_CD.isnull() ]


### 12  LAW_CAT_CD         object

In [153]:
df.LAW_CAT_CD.value_counts()

MISDEMEANOR    4633011
FELONY         2597775
VIOLATION      1119607
Name: LAW_CAT_CD, dtype: int64

In [154]:
df.LAW_CAT_CD = pd.Categorical(df.LAW_CAT_CD)

### 16  JURIS_DESC         object
### 17  JURISDICTION_CODE  object

In [155]:
df.JURISDICTION_CODE.isnull().sum()

0

In [156]:
# df = df[ ~df.JURISDICTION_CODE.isnull() ]

jusridiction = df[ ["JURISDICTION_CODE", "JURIS_DESC", ] ].drop_duplicates().dropna()
jusridiction['JURISDICTION_CODE'] = pd.to_numeric(jusridiction['JURISDICTION_CODE'] )
jusridiction['JURISDICTION_CODE'] = jusridiction['JURISDICTION_CODE'].astype(int)
jusridiction = jusridiction.set_index("JURISDICTION_CODE")
jusridiction = jusridiction.sort_index()
jusridiction = jusridiction.reset_index()
jusridiction

Unnamed: 0,JURISDICTION_CODE,JURIS_DESC
0,0,N.Y. POLICE DEPT
1,1,N.Y. TRANSIT POLICE
2,2,N.Y. HOUSING POLICE
3,3,PORT AUTHORITY
4,4,TRI-BORO BRDG TUNNL
5,6,LONG ISLAND RAILRD
6,7,AMTRACK
7,8,CONRAIL
8,9,STATN IS RAPID TRANS
9,11,N.Y. STATE POLICE


In [157]:
df.JURISDICTION_CODE = pd.Categorical(df.JURISDICTION_CODE)


In [158]:
df = df.drop('JURIS_DESC', axis='columns')

###  13  BORO_NM            object

In [159]:
df.BORO_NM.value_counts()

BROOKLYN         2460497
MANHATTAN        2015694
BRONX            1806241
QUEENS           1676682
STATEN ISLAND     384548
Name: BORO_NM, dtype: int64

In [39]:
# df.BORO_NM.replace(to_replace = '(null)', value=np.nan, inplace = True)

In [160]:
df.BORO_NM.isnull().sum()

6731

In [161]:
df = df[~df.BORO_NM.isnull()]

In [162]:
df.BORO_NM = pd.Categorical(df.BORO_NM)

### 23  SUSP_AGE_GROUP     object
### 32  VIC_AGE_GROUP      object

In [163]:
# Both columns have a lot of noisy entries. We keep only the dominant groups, and also define an order
df.SUSP_AGE_GROUP = pd.Categorical(df.SUSP_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
df.VIC_AGE_GROUP = pd.Categorical(df.VIC_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])


### 24  SUSP_RACE          object
### 25  SUSP_SEX           object

### 33  VIC_RACE           object
### 34  VIC_SEX            object

In [164]:
df.VIC_SEX.isnull().sum()

305

In [165]:
df.VIC_SEX.value_counts()

F    3269360
M    2769955
E    1231498
D    1070730
L       1814
Name: VIC_SEX, dtype: int64

In [166]:
df = df[~df.VIC_SEX.isnull()]

df.VIC_SEX.replace(to_replace = 'U', value=np.nan, inplace = True)

In [171]:
df.VIC_RACE.isnull().sum()

0

In [174]:
df.VIC_RACE.value_counts()

UNKNOWN                           2718383
BLACK                             2013798
WHITE                             1411743
WHITE HISPANIC                    1364029
ASIAN / PACIFIC ISLANDER           506809
BLACK HISPANIC                     291597
AMERICAN INDIAN/ALASKAN NATIVE      36998
Name: VIC_RACE, dtype: int64

In [169]:
df.VIC_RACE.replace(to_replace = 'OTHER', value='UNKNOWN', inplace = True)

In [170]:
df.VIC_RACE.replace(to_replace = np.nan, value='UNKNOWN', inplace = True)

In [51]:
# df = df[~df.VIC_RACE.isnull()]

In [172]:
df.SUSP_SEX.value_counts()

M    2853749
F     881057
U     872451
Name: SUSP_SEX, dtype: int64

In [173]:
# U is unknown, same is NULL.
df.SUSP_SEX.replace(to_replace = 'U', value=np.nan, inplace = True)
df.SUSP_SEX.replace(to_replace = '(null)', value=np.nan, inplace = True)

In [175]:
df.SUSP_RACE.value_counts()

BLACK                             1754564
UNKNOWN                           1278917
WHITE HISPANIC                     793876
WHITE                              501223
BLACK HISPANIC                     243343
ASIAN / PACIFIC ISLANDER           154782
AMERICAN INDIAN/ALASKAN NATIVE      13596
OTHER                                  11
Name: SUSP_RACE, dtype: int64

In [177]:
# Very small amount of OTHER values
df.SUSP_RACE.replace(to_replace = 'OTHER', value='UNKNOWN', inplace = True)
df.SUSP_RACE.replace(to_replace = np.nan, value='UNKNOWN', inplace = True)


In [178]:
df.SUSP_RACE = pd.Categorical(df.SUSP_RACE)
df.SUSP_SEX = pd.Categorical(df.SUSP_SEX)
df.VIC_RACE = pd.Categorical(df.VIC_RACE)
df.VIC_SEX = pd.Categorical(df.VIC_SEX)

In [179]:
df.dtypes

CMPLNT_NUM                    int32
ADDR_PCT_CD                category
RPT_DT               datetime64[ns]
KY_CD                      category
PD_CD                      category
CRM_ATPT_CPTD_CD           category
LAW_CAT_CD                 category
BORO_NM                    category
LOC_OF_OCCUR_DESC            object
PREM_TYP_DESC                object
JURISDICTION_CODE          category
PARKS_NM                     object
HADEVELOPT                   object
HOUSING_PSA                  object
SUSP_AGE_GROUP             category
SUSP_RACE                  category
SUSP_SEX                   category
TRANSIT_DISTRICT             object
Latitude                     object
Longitude                    object
PATROL_BORO                  object
STATION_NAME                 object
VIC_AGE_GROUP              category
VIC_RACE                   category
VIC_SEX                    category
CMPLNT_FR            datetime64[ns]
CMPLNT_TO            datetime64[ns]
dtype: object

###  14  LOC_OF_OCCUR_DESC  object

In [180]:
df.LOC_OF_OCCUR_DESC.value_counts()

INSIDE         4271230
FRONT OF       1976777
OPPOSITE OF     215205
REAR OF         173470
Name: LOC_OF_OCCUR_DESC, dtype: int64

In [181]:
df.LOC_OF_OCCUR_DESC.replace(to_replace = '(null)', value=np.nan, inplace = True)

In [182]:
df.LOC_OF_OCCUR_DESC.isnull().sum()

1706675

In [183]:
df.LOC_OF_OCCUR_DESC = pd.Categorical(df.LOC_OF_OCCUR_DESC)

### Latitude                     object
### Longitude                    object

In [184]:
# !sudo apt-get update
!sudo apt-get install python3-rtree
!sudo pip3 install geopandas descartes shapely ngram # matplotlib==3.1.3

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
python3-rtree is already the newest version (0.9.7-1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.


In [185]:
import geopandas as gpd

In [186]:
df.Latitude = pd.to_numeric(df.Latitude, downcast='float')
df.Longitude  = pd.to_numeric(df.Longitude, downcast='float')

In [187]:
%%time
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))

CPU times: user 6.41 s, sys: 571 ms, total: 6.98 s
Wall time: 6.92 s


In [188]:
shapefile_url = 'https://data.cityofnewyork.us/api/geospatial/cpf4-rkhq?method=export&format=Shapefile'
df_nyc = gpd.GeoDataFrame.from_file(shapefile_url)

In [189]:
!mkdir -p maps
!curl -s https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/nynta2010_23a.zip  -o maps/nynta2010_23a.zip
!cd maps && unzip -o nynta2010_23a.zip
shapefile = f"maps/nynta2010_23a/nynta2010.shp"
df_nyc = gpd.GeoDataFrame.from_file(shapefile)
df_nyc = df_nyc.to_crs(4326)

Archive:  nynta2010_23a.zip
  inflating: nynta2010_23a/nynta2010.shp  
  inflating: nynta2010_23a/nynta2010.dbf  
  inflating: nynta2010_23a/nynta2010.shx  
  inflating: nynta2010_23a/nynta2010.prj  
  inflating: nynta2010_23a/nynta2010.shp.xml  


In [190]:
df_nyc

Unnamed: 0,BoroCode,BoroName,CountyFIPS,NTACode,NTAName,Shape_Leng,Shape_Area,geometry
0,4,Queens,081,QN08,St. Albans,45401.316803,7.741275e+07,"POLYGON ((-73.75205 40.70523, -73.75174 40.704..."
1,2,Bronx,005,BX28,Van Cortlandt Village,21945.719299,2.566612e+07,"POLYGON ((-73.88705 40.88435, -73.88705 40.884..."
2,4,Queens,081,QN55,South Ozone Park,36708.169305,8.246139e+07,"POLYGON ((-73.80577 40.68293, -73.80552 40.682..."
3,3,Brooklyn,047,BK40,Windsor Terrace,19033.672066,1.404167e+07,"POLYGON ((-73.98017 40.66115, -73.98021 40.661..."
4,3,Brooklyn,047,BK50,Canarsie,43703.609666,8.208968e+07,"MULTIPOLYGON (((-73.88834 40.64671, -73.88835 ..."
...,...,...,...,...,...,...,...,...
190,2,Bronx,005,BX34,Melrose South-Mott Haven North,19589.196431,1.727176e+07,"POLYGON ((-73.90129 40.82048, -73.90160 40.819..."
191,2,Bronx,005,BX39,Mott Haven-Port Morris,35604.790810,4.189861e+07,"MULTIPOLYGON (((-73.89681 40.79581, -73.89694 ..."
192,2,Bronx,005,BX63,West Concourse,28571.879354,1.936642e+07,"POLYGON ((-73.91192 40.84326, -73.91195 40.843..."
193,5,Staten Island,085,SI22,West New Brighton-New Brighton-St. George,66052.593065,5.602857e+07,"POLYGON ((-74.07258 40.63794, -74.07330 40.637..."


In [191]:
%%time
# Match each accident with a neighborhood.
# Will take ~30 mins to run
# This is done with left join,
# so we preserve all the data points
# but we know which ones are not matching with the shapefile
gdf.crs = df_nyc.crs
gdf = gpd.sjoin(gdf, df_nyc, how='left')


CPU times: user 1min 49s, sys: 6.94 s, total: 1min 56s
Wall time: 1min 55s


In [192]:
gdf.dtypes

CMPLNT_NUM                    int32
ADDR_PCT_CD                category
RPT_DT               datetime64[ns]
KY_CD                      category
PD_CD                      category
CRM_ATPT_CPTD_CD           category
LAW_CAT_CD                 category
BORO_NM                    category
LOC_OF_OCCUR_DESC          category
PREM_TYP_DESC                object
JURISDICTION_CODE          category
PARKS_NM                     object
HADEVELOPT                   object
HOUSING_PSA                  object
SUSP_AGE_GROUP             category
SUSP_RACE                  category
SUSP_SEX                   category
TRANSIT_DISTRICT             object
Latitude                    float32
Longitude                   float32
PATROL_BORO                  object
STATION_NAME                 object
VIC_AGE_GROUP              category
VIC_RACE                   category
VIC_SEX                    category
CMPLNT_FR            datetime64[ns]
CMPLNT_TO            datetime64[ns]
geometry                   g

In [193]:
# We keep only boro_name and ntaname
todrop = [
    'index_right', 'BoroCode', 'CountyFIPS',
    'Shape_Area', 'Shape_Leng'
]

gdf = gdf.drop(todrop, axis='columns')

# Rename the columns
gdf = gdf.rename({
    'BoroName': 'BOROUGH',
    'NTAName': 'NEIGHBORHOOD',
    'NTACode': 'NEIGHBORHOOD_CODE',
}, axis='columns')

In [194]:
gdf['BOROUGH'] = gdf['BOROUGH'].str.upper()

In [195]:
# Mark as NULL all the lon/lat entries outside the NYC area
gdf.loc[gdf.BOROUGH.isnull(), 'Latitude'] = None
gdf.loc[gdf.BOROUGH.isnull(), 'Longitude'] = None

In [196]:
# Mark as NULL all the lon/lat entries that generate inconsistencies
mask = gdf.query('BOROUGH != BORO_NM and Latitude==Latitude and Longitude==Longitude').CMPLNT_NUM.values
condition = gdf.CMPLNT_NUM.isin(mask)

gdf.loc[condition, 'Latitude'] = None
gdf.loc[condition, 'Longitude'] = None

In [75]:
# Delete the cases where the reported and detected boroughs are different
# to_delete = sorted(set(to_delete))
# gdf = gdf [ ~gdf.CMPLNT_NUM.isin(to_delete) ]

In [197]:
gdf = gdf.drop('geometry', axis='columns')

In [198]:
df = pd.DataFrame(gdf)

In [199]:
df.BORO_NM.value_counts()

BROOKLYN         2460390
MANHATTAN        2015634
BRONX            1806166
QUEENS           1676630
STATEN ISLAND     384537
Name: BORO_NM, dtype: int64

In [79]:
#Temporarily, we drop these. We will add them back in the future
# df.drop(['BOROUGH','NEIGHBORHOOD','NEIGHBORHOOD_CODE'], axis='columns', inplace=True)

In [265]:
df = df[df.BOROUGH == df.BORO_NM]

In [266]:
df.drop(['BOROUGH'], axis='columns', inplace=True)
df.NEIGHBORHOOD_CODE = pd.Categorical(df.NEIGHBORHOOD_CODE)
df.NEIGHBORHOOD = pd.Categorical(df.NEIGHBORHOOD)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(['BOROUGH'], axis='columns', inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.NEIGHBORHOOD_CODE = pd.Categorical(df.NEIGHBORHOOD_CODE)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.NEIGHBORHOOD = pd.Categorical(df.NEIGHBORHOOD)


### TRANSIT_DISTRICT

In [200]:
df.TRANSIT_DISTRICT.value_counts()


4     30049
2     22599
1     18048
3     17824
20    16505
33    16387
12    13675
11    12803
32    12394
30    11731
34     8017
23     2978
Name: TRANSIT_DISTRICT, dtype: int64

In [201]:
len(df) - df.TRANSIT_DISTRICT.isnull().sum()

183010

In [202]:
df.drop('TRANSIT_DISTRICT', axis='columns', inplace=True)


### PREM_TYP_DESC

In [203]:
df.PREM_TYP_DESC.value_counts()

STREET                        2630392
RESIDENCE - APT. HOUSE        1789963
RESIDENCE-HOUSE                822698
RESIDENCE - PUBLIC HOUSING     618939
OTHER                          225999
                               ...   
MOBILE FOOD                      1055
CEMETERY                          922
LOAN COMPANY                      532
DAYCARE FACILITY                  170
TRAMWAY                           152
Name: PREM_TYP_DESC, Length: 77, dtype: int64

In [204]:
df.PREM_TYP_DESC.isnull().sum()

13500

In [205]:
df = df [~df.PREM_TYP_DESC.isnull()]

In [206]:
df.PREM_TYP_DESC = pd.Categorical(df.PREM_TYP_DESC)

In [207]:
df.PARKS_NM.value_counts()

CENTRAL PARK                             2072
FLUSHING MEADOWS CORONA PARK             1693
WASHINGTON SQUARE PARK                   1327
CONEY ISLAND BEACH & BOARDWALK           1292
RIVERSIDE PARK                            712
                                         ... 
TOAD HALL PLAYGROUND                        1
BARNHILL SQUARE                             1
LOTT PARK                                   1
PATRICK O'ROURKE PARK                       1
NEW 123RD ST BLOCK ASSOCIATION GARDEN       1
Name: PARKS_NM, Length: 1260, dtype: int64

In [209]:
df.PARKS_NM.value_counts().sum()

34146

In [210]:
df.drop('PARKS_NM', axis='columns', inplace=True)



 19  HADEVELOPT         object


In [211]:
df.HADEVELOPT.value_counts()

INGERSOLL                                4242
WALD                                     2508
WILLIAMSBURG                             2322
NOSTRAND                                 2199
RIIS                                     1939
GRANT                                    1832
MANHATTANVILLE                           1807
MARLBORO                                 1805
SHEEPSHEAD BAY                           1652
MARBLE HILL                              1540
WHITMAN                                  1359
WOODSIDE                                 1148
RIIS II                                  1072
SMITH                                    1016
RED HOOK EAST                             853
MARKHAM GARDENS                           587
WEST BRIGHTON I                           452
GLENWOOD                                  384
CAMPOS PLAZA I                            381
RED HOOK WEST                             212
LOWER EAST SIDE REHAB (GROUP 5)           186
FIRST HOUSES                      

In [212]:
df.drop('HADEVELOPT', axis='columns', inplace=True)


 20  HOUSING_PSA        object



In [213]:
df.HOUSING_PSA.value_counts()

670      8737
887      8630
720      7605
845      7599
632      7301
         ... 
34134       1
33521       1
34432       1
34888       1
7721        1
Name: HOUSING_PSA, Length: 4540, dtype: int64

In [214]:
df.HOUSING_PSA.value_counts().sum()

631131

In [215]:
df.drop('HOUSING_PSA', axis='columns', inplace=True)

 30  PATROL_BORO        object


In [216]:
df.PATROL_BORO.value_counts()

PATROL BORO BRONX            1801124
PATROL BORO BKLYN SOUTH      1232112
PATROL BORO BKLYN NORTH      1225389
PATROL BORO MAN SOUTH        1015481
PATROL BORO MAN NORTH         995110
PATROL BORO QUEENS NORTH      870965
PATROL BORO QUEENS SOUTH      805624
PATROL BORO STATEN ISLAND     383976
Name: PATROL_BORO, dtype: int64

In [217]:
df.PATROL_BORO = pd.Categorical(df.PATROL_BORO)

In [218]:
df.PATROL_BORO.isnull().sum()

76

In [219]:
df = df[~df.PATROL_BORO.isnull()]

 31  STATION_NAME       object

In [220]:
df.STATION_NAME.value_counts()

125 STREET                        8848
14 STREET                         4869
42 ST.-PORT AUTHORITY BUS TERM    4363
34 ST.-PENN STATION               4183
42 ST.-TIMES SQUARE               3574
                                  ... 
DISTRICT 30 OFFICE                  19
DISTRICT 34 OFFICE                  13
DISTRICT 12 OFFICE                  13
OFF-SYSTEM                           8
DISTRICT 23 OFFICE                   6
Name: STATION_NAME, Length: 372, dtype: int64

In [221]:
df.STATION_NAME.replace(to_replace = '(null)', value=np.nan, inplace = True)

In [222]:
df.STATION_NAME.isnull().sum()



8148061

In [223]:
df.drop('STATION_NAME', axis='columns', inplace=True)

In [226]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8329781 entries, 17 to 8353048
Data columns (total 25 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          date

## Data exploration

In this part we check the different values that appear in the columns. When we detect noisy results, we delete the corresponding values. In fact, many of the operations that are performed above, in the 'data cleaning' section, are the result of observations that we make here.

In [227]:
# Find the unique values in each column
#
# df.describe(include = [np.object, 'category']).T['unique']
unique = df.describe(include = 'all').T['unique'].sort_values()

  unique = df.describe(include = 'all').T['unique'].sort_values()
  unique = df.describe(include = 'all').T['unique'].sort_values()
  unique = df.describe(include = 'all').T['unique'].sort_values()


In [228]:
unique

CRM_ATPT_CPTD_CD           2
SUSP_SEX                   2
LAW_CAT_CD                 3
LOC_OF_OCCUR_DESC          4
SUSP_AGE_GROUP             5
BOROUGH                    5
VIC_SEX                    5
BORO_NM                    5
VIC_AGE_GROUP              5
VIC_RACE                   7
SUSP_RACE                  7
PATROL_BORO                8
JURISDICTION_CODE         25
KY_CD                     74
ADDR_PCT_CD               77
PREM_TYP_DESC             77
NEIGHBORHOOD_CODE        195
NEIGHBORHOOD             195
PD_CD                    439
RPT_DT                  6209
CMPLNT_FR            2384241
CMPLNT_TO            2650854
CMPLNT_NUM               NaN
Latitude                 NaN
Longitude                NaN
Name: unique, dtype: object

In [229]:
for column in unique.index:
    if unique[column] < 200:
        print(df[column].value_counts())
        print("=====")

COMPLETED    8189629
ATTEMPTED     140152
Name: CRM_ATPT_CPTD_CD, dtype: int64
=====
M    2852522
F     880704
Name: SUSP_SEX, dtype: int64
=====
MISDEMEANOR    4619475
FELONY         2591777
VIOLATION      1118529
Name: LAW_CAT_CD, dtype: int64
=====
INSIDE         4266431
FRONT OF       1973233
OPPOSITE OF     214816
REAR OF         173142
Name: LOC_OF_OCCUR_DESC, dtype: int64
=====
25-44    1468719
18-24     564130
45-64     512910
<18       191863
65+        46001
Name: SUSP_AGE_GROUP, dtype: int64
=====
BROOKLYN         2457712
MANHATTAN        2013138
BRONX            1799246
QUEENS           1675632
STATEN ISLAND     383977
Name: BOROUGH, dtype: int64
=====
F    3266776
M    2767344
E    1224478
D    1069398
L       1785
Name: VIC_SEX, dtype: int64
=====
BROOKLYN         2457460
MANHATTAN        2012783
BRONX            1801277
QUEENS           1674286
STATEN ISLAND     383975
Name: BORO_NM, dtype: int64
=====
25-44    2767881
45-64    1434966
18-24     854312
<18       395082
6

In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8350561 entries, 0 to 8353048
Data columns (total 22 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datet

In [230]:
df['PREM_TYP_DESC'].value_counts().count()

77

In [231]:
# All columns, except for the dates and spatial coordinates, are categorical
# Columns with less than a few thousand unique values are good candidates
# for ENUMs in the database given that the dataset is static.
# Also, in Pandas the internal representation becomes much more efficient
# as the Categoricals are stored as integers and not as strings
for column in unique.index:
    if column == 'RPT_DT':
        continue
    if df[column].value_counts().count() < 1000:
      df[column] = pd.Categorical(df[column])

In [232]:
# With all the proper data typing the dataset went down in size from 1.9Gb+ to 425Mb.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8329781 entries, 17 to 8353048
Data columns (total 25 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          date

In [111]:
df.memory_usage()

Index                66804488
CMPLNT_NUM           33402244
ADDR_PCT_CD           8353281
RPT_DT               66804488
KY_CD                 8353257
PD_CD                16721186
CRM_ATPT_CPTD_CD      8350693
LAW_CAT_CD            8350693
BORO_NM               8350773
LOC_OF_OCCUR_DESC     8350765
PREM_TYP_DESC         8353289
JURISDICTION_CODE     8351841
SUSP_AGE_GROUP        8350773
SUSP_RACE             8350917
SUSP_SEX              8350685
Latitude             33402244
Longitude            33402244
PATROL_BORO           8350933
VIC_AGE_GROUP         8350773
VIC_RACE              8350917
VIC_SEX               8350781
CMPLNT_FR            66804488
CMPLNT_TO            66804488
dtype: int64

In [233]:
df.dtypes

CMPLNT_NUM                    int32
ADDR_PCT_CD                category
RPT_DT               datetime64[ns]
KY_CD                      category
PD_CD                      category
CRM_ATPT_CPTD_CD           category
LAW_CAT_CD                 category
BORO_NM                    category
LOC_OF_OCCUR_DESC          category
PREM_TYP_DESC              category
JURISDICTION_CODE          category
SUSP_AGE_GROUP             category
SUSP_RACE                  category
SUSP_SEX                   category
Latitude                    float32
Longitude                   float32
PATROL_BORO                category
VIC_AGE_GROUP              category
VIC_RACE                   category
VIC_SEX                    category
CMPLNT_FR            datetime64[ns]
CMPLNT_TO            datetime64[ns]
BOROUGH                    category
NEIGHBORHOOD_CODE          category
NEIGHBORHOOD               category
dtype: object

## Storing in a MySQL database

In [234]:
!sudo pip3 install -U -q PyMySQL sqlalchemy

In [235]:
import os
from sqlalchemy import create_engine
from sqlalchemy import text

conn_string = 'mysql+pymysql://{user}:{password}@{host}/?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org',
    user = 'root',
    password = mysql_pass,
    encoding = 'utf8mb4')

engine = create_engine(conn_string)


In [236]:
# Query to create a database
db_name = 'nypd'

sql = f"DROP DATABASE IF EXISTS {db_name}"
with engine.connect() as connection:
  connection.execute(text(sql))

# Create a database
sql = f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8mb4'"
with engine.connect() as connection:
  connection.execute(text(sql))


In [237]:

# And lets switch to the database
sql = f"USE {db_name}"
with engine.connect() as connection:
  connection.execute(text(sql))


In [271]:
df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].str.replace('\'', '’', regex=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].str.replace('\'', '’', regex=False)


In [272]:
NEIGHBORHOOD_enum = "ENUM('" + ("','".join(sorted(df.NEIGHBORHOOD.astype(str).unique()))) + "')"


In [273]:
print(NEIGHBORHOOD_enum)

ENUM('Airport','Allerton-Pelham Gardens','Annadale-Huguenot-Prince’s Bay-Eltingville','Arden Heights','Astoria','Auburndale','Baisley Park','Bath Beach','Battery Park City-Lower Manhattan','Bay Ridge','Bayside-Bayside Hills','Bedford','Bedford Park-Fordham North','Bellerose','Belmont','Bensonhurst East','Bensonhurst West','Borough Park','Breezy Point-Belle Harbor-Rockaway Park-Broad Channel','Briarwood-Jamaica Hills','Brighton Beach','Bronxdale','Brooklyn Heights-Cobble Hill','Brownsville','Bushwick North','Bushwick South','Cambria Heights','Canarsie','Carroll Gardens-Columbia Street-Red Hook','Central Harlem North-Polo Grounds','Central Harlem South','Charleston-Richmond Valley-Tottenville','Chinatown','Claremont-Bathgate','Clinton','Clinton Hill','Co-op City','College Point','Corona','Crotona Park East','Crown Heights North','Crown Heights South','Cypress Hills-City Line','DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill','Douglas Manor-Douglaston-Little Neck','Dyker Heights','East C

In [274]:
NCODE_enum = "ENUM('" + ("','".join(sorted(df.NEIGHBORHOOD_CODE.astype(str).unique()))) + "')"

In [275]:
# In principle, we can let Pandas create the table, but we want to be a bit more predise
# with the data types, and we want to add documentation for each column
# from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i


sql = f'''
CREATE TABLE nypd (
  CMPLNT_NUM int,
  CMPLNT_FR datetime,
  CMPLNT_TO datetime,
  RPT_DT date,
  KY_CD SMALLINT,
  PD_CD SMALLINT,
  CRM_ATPT_CPTD_CD enum('COMPLETED','ATTEMPTED'),
  LAW_CAT_CD enum('FELONY','MISDEMEANOR','VIOLATION'),
  JURISDICTION_CODE SMALLINT,
  BORO_NM enum('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN ISLAND'),
  NEIGHBORHOOD {NEIGHBORHOOD_enum},
  NEIGHBORHOOD_CODE {NCODE_enum},
  ADDR_PCT_CD SMALLINT,
  LOC_OF_OCCUR_DESC enum('FRONT OF','INSIDE','OPPOSITE OF','OUTSIDE','REAR OF'),
  PATROL_BORO enum('PATROL BORO BRONX', 'PATROL BORO BKLYN SOUTH','PATROL BORO BKLYN NORTH','PATROL BORO MAN SOUTH','PATROL BORO MAN NORTH','PATROL BORO QUEENS NORTH','PATROL BORO QUEENS SOUTH','PATROL BORO STATEN ISLAND'),
  PREM_TYP_DESC varchar(30),
  SUSP_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  VIC_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  SUSP_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  VIC_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  SUSP_SEX enum('M', 'F'),
  VIC_SEX enum('M', 'F', 'E', 'D', 'L'),
  Latitude double,
  Longitude double,
  PRIMARY KEY (CMPLNT_NUM)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))

In [276]:
# Create a table
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    # print("Batch:",i)
    # continue # Cannot execute this on Travis
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False,
        chunksize = 1000)

100%|██████████| 167/167 [16:42<00:00,  6.00s/it]


In [277]:
sql = "CREATE INDEX ix_lat ON nypd.nypd(Latitude)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [278]:
sql = "CREATE INDEX ix_lon ON nypd.nypd(Longitude)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [279]:
sql = "CREATE INDEX ix_LAW_CAT_CD ON nypd.nypd(LAW_CAT_CD)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [280]:
sql = "CREATE INDEX ix_BORO_NM ON nypd.nypd(BORO_NM)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [281]:
sql = "CREATE INDEX ix_KY_CD ON nypd.nypd(KY_CD)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [282]:
sql = "CREATE INDEX ix_RPT_DT ON nypd.nypd(RPT_DT)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [283]:
sql = "CREATE INDEX ix_CMPLNT_FR ON nypd.nypd(CMPLNT_FR)"
with engine.connect() as connection:
  connection.execute(text(sql))

In [284]:
offenses = offenses[offenses.OFNS_DESC != "(null)"]

In [285]:
offenses

Unnamed: 0,KY_CD,OFNS_DESC
0,102,HOMICIDE-NEGLIGENT-VEHICLE
1,103,"HOMICIDE-NEGLIGENT,UNCLASSIFIE"
2,104,RAPE
3,105,ROBBERY
4,106,FELONY ASSAULT
...,...,...
69,676,NEW YORK CITY HEALTH CODE
70,677,OTHER STATE LAWS
71,678,MISCELLANEOUS PENAL LAW
72,685,ADMINISTRATIVE CODES


In [286]:
sql = "DROP TABLE IF EXISTS offense_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE offense_codes (
  KY_CD smallint,
  OFNS_DESC varchar(32),
  PRIMARY KEY (KY_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))

offenses.to_sql(
        name = 'offense_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

74

In [287]:
sql = "DROP TABLE IF EXISTS jurisdiction_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE jurisdiction_codes (
  JURISDICTION_CODE smallint,
  JURIS_DESC varchar(40),
  PRIMARY KEY (JURISDICTION_CODE)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))


jusridiction.to_sql(
        name = 'jurisdiction_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

26

In [288]:
internal.PD_DESC.str.len().max()


71

In [289]:
internal = internal.query("PD_DESC !=	'CRIMINAL DISPOSAL FIREARM 1 &' ")
internal = internal.query("PD_DESC !=	'UNFINSH FRAME 2' ")
internal = internal.query("PD_DESC !=	'WEAPONS POSSESSION 1 & 2' ")
internal = internal.query("PD_DESC !=	'CRIM POS WEAP 4' ")


In [290]:
internal

Unnamed: 0,PD_CD,PD_DESC
0,100,STALKING COMMIT SEX OFFENSE
1,101,ASSAULT 3
2,102,ASSAULT SCHOOL SAFETY AGENT
3,103,ASSAULT TRAFFIC AGENT
4,104,VEHICULAR ASSAULT (INTOX DRIVE
...,...,...
434,916,LEAVING SCENE-ACCIDENT-PERSONA
435,918,RECKLESS DRIVING
436,922,"TRAFFIC,UNCLASSIFIED MISDEMEAN"
437,969,"TRAFFIC,UNCLASSIFIED INFRACTIO"


In [291]:
sql = "DROP TABLE IF EXISTS penal_codes;"
with engine.connect() as connection:
  connection.execute(text(sql))

sql = '''
CREATE TABLE penal_codes (
  PD_CD smallint,
  PD_DESC varchar(80),
  PRIMARY KEY (PD_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
with engine.connect() as connection:
  connection.execute(text(sql))


internal.to_sql(
        name = 'penal_codes',
        schema = db_name,
        con = engine,
        if_exists = 'append',
        index = False)

437

## TODO

### Add the penal code data as a separate table

`!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/files/65f25845-1551-4d21-91dc-869c977cd93d?download=true&filename=PDCode_PenalLaw.xlsx' -o PDCode_PenalLaw.xlsx`

### Examine whether to normalize

The fields

  
PREM_TYP_DESC    
HADEVELOPT    
PARKS_NM                     

would be better off as foreign keys or enums. They take too much space as strings.