## NYPD Dataset

Dataset description: https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i



| Column | Description |
|--------|-------------------|
| CMPLNT_NUM |  Randomly generated persistent ID for each complaint  |  
| ADDR_PCT_CD |  The precinct in which the incident occurred |  
| BORO_NM |  The name of the borough in which the incident occurred |  
| CMPLNT_FR_DT |  Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |  
| CMPLNT_FR_TM |  Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |  
| CMPLNT_TO_DT |  Ending date of occurrence for the reported event, if exact time of occurrence is unknown |  
| CMPLNT_TO_TM |  Ending time of occurrence for the reported event, if exact time of occurrence is unknown |  
| CRM_ATPT_CPTD_CD |  Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |  
| HADEVELOPT |  Name of NYCHA housing development of occurrence, if applicable |  
| HOUSING_PSA |  Development Level Code |  
| JURISDICTION_CODE |  Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |  
| JURIS_DESC |  Description of the jurisdiction code |  
| KY_CD |  Three digit offense classification code |  
| LAW_CAT_CD |  Level of offense: felony, misdemeanor, violation  |  
| LOC_OF_OCCUR_DESC |  Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |  
| OFNS_DESC |  Description of offense corresponding with key code |  
| PARKS_NM |  Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |  
| PATROL_BORO |  The name of the patrol borough in which the incident occurred |  
| PD_CD |  Three digit internal classification code (more granular than Key Code) |  
| PD_DESC |  Description of internal classification corresponding with PD code (more granular than Offense Description) |  
| PREM_TYP_DESC |  Specific description of premises; grocery store, residence, street, etc. |  
| RPT_DT |  Date event was reported to police  |  
| STATION_NAME |  Transit station name |  
| SUSP_AGE_GROUP |  Suspect’s Age Group |  
| SUSP_RACE |  Suspect’s Race Description |  
| SUSP_SEX |  Suspect’s Sex Description |  
| TRANSIT_DISTRICT |  Transit district in which the offense occurred. |  
| VIC_AGE_GROUP |  Victim’s Age Group |  
| VIC_RACE |  Victim’s Race Description |  
| VIC_SEX |  Victim’s Sex Description |  
| X_COORD_CD |  X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Y_COORD_CD |  Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Latitude |  Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)  |  
| Longitude |  Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |


In [None]:
import pandas as pd
import numpy as np

In [3]:
# We load everything as an object/string, because some data types (e.g., some IDs)
# are recognized as decimals, and it is a mess to restore them back
# So we will do all the conversions ourselves later on

# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv


100 2251M    0 2251M    0     0  4808k      0 --:--:--  0:07:59 --:--:-- 4862k


In [100]:
%%time
df = pd.read_csv('nypd.csv', low_memory = True, dtype='object')

CPU times: user 43.4 s, sys: 7.66 s, total: 51.1 s
Wall time: 50.9 s


In [None]:
# Alternatively, we can load directly from the URL instead of downloading the file first
# url = 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD'
# df = pd.read_csv(url, low_memory = True, dtype='object')


In [103]:
len(df)

7396619

In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7396619 entries, 0 to 7396618
Data columns (total 35 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   CMPLNT_NUM         object
 1   CMPLNT_FR_DT       object
 2   CMPLNT_FR_TM       object
 3   CMPLNT_TO_DT       object
 4   CMPLNT_TO_TM       object
 5   ADDR_PCT_CD        object
 6   RPT_DT             object
 7   KY_CD              object
 8   OFNS_DESC          object
 9   PD_CD              object
 10  PD_DESC            object
 11  CRM_ATPT_CPTD_CD   object
 12  LAW_CAT_CD         object
 13  BORO_NM            object
 14  LOC_OF_OCCUR_DESC  object
 15  PREM_TYP_DESC      object
 16  JURIS_DESC         object
 17  JURISDICTION_CODE  object
 18  PARKS_NM           object
 19  HADEVELOPT         object
 20  HOUSING_PSA        object
 21  X_COORD_CD         object
 22  Y_COORD_CD         object
 23  SUSP_AGE_GROUP     object
 24  SUSP_RACE          object
 25  SUSP_SEX           object
 26  TRANSIT_DISTRI

## Data Cleaning

In [105]:
# These columns are redundant: we keep Latitude and Longitude,
# so the combined Lat_Lon field and the state-plane coordinates are not needed
to_drop = ['Lat_Lon','X_COORD_CD','Y_COORD_CD']
df = df.drop(to_drop, axis='columns')

###  CMPLNT_NUM         object   

In [106]:
     
# Drop cases with duplicated complaint numbers
# keep=False marks every occurrence of a duplicated ID, so all copies are removed
df = df[ ~df.CMPLNT_NUM.duplicated(keep=False) ]

In [107]:
df.CMPLNT_NUM = df.CMPLNT_NUM.astype('int32')

### CMPLNT_FR_DT       object
### CMPLNT_FR_TM       object
### CMPLNT_TO_DT       object
### CMPLNT_TO_TM       object

In [108]:
# A few rows contain years such as 1015, 1016, ... that trigger an error during date conversion
# We replace all years written as 10XX with 20XX
# Note the use of regular expressions (raw strings avoid invalid-escape warnings)
df['CMPLNT_FR_DT'] = df.CMPLNT_FR_DT.replace(to_replace = r'(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True)
df['CMPLNT_TO_DT'] = df.CMPLNT_TO_DT.replace(to_replace = r'(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True)

# Similarly, a few times are written as 24:00:00, which also triggers errors.
# We map these times to 00:00:00
df['CMPLNT_FR_TM'] = df.CMPLNT_FR_TM.replace(to_replace = '24:00:00', value='00:00:00')
df['CMPLNT_TO_TM'] = df.CMPLNT_TO_TM.replace(to_replace = '24:00:00', value='00:00:00')

# Convert the two separate date and time columns into single datetime columns
df['CMPLNT_FR'] = pd.to_datetime(df.CMPLNT_FR_DT + ' ' + df.CMPLNT_FR_TM, format='%m/%d/%Y %H:%M:%S', cache=True)
df['CMPLNT_TO'] = pd.to_datetime(df.CMPLNT_TO_DT + ' ' + df.CMPLNT_TO_TM, format='%m/%d/%Y %H:%M:%S', cache=True)

# The separate date and time columns are now redundant,
# because we created the CMPLNT_FR and CMPLNT_TO columns
to_drop = ['CMPLNT_FR_DT','CMPLNT_TO_DT','CMPLNT_FR_TM','CMPLNT_TO_TM']
df = df.drop(to_drop, axis='columns')
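
As a small illustration of the date cleanup above, the following sketch applies the same regular expression and time fix to a toy frame. The sample values and the name `toy` are made up for demonstration and do not come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) mimicking the raw date and time strings
toy = pd.DataFrame({
    "CMPLNT_FR_DT": ["03/15/1015", "12/31/2014", None],
    "CMPLNT_FR_TM": ["24:00:00", "23:59:00", "10:00:00"],
})

# Rewrite years of the form 10XX as 20XX
toy["CMPLNT_FR_DT"] = toy["CMPLNT_FR_DT"].replace(
    to_replace=r"(\d\d)/(\d\d)/10(\d\d)", value=r"\1/\2/20\3", regex=True
)
# Map the invalid time 24:00:00 to midnight
toy["CMPLNT_FR_TM"] = toy["CMPLNT_FR_TM"].replace("24:00:00", "00:00:00")

# A missing date makes the concatenated string missing, which to_datetime turns into NaT
toy["CMPLNT_FR"] = pd.to_datetime(
    toy["CMPLNT_FR_DT"] + " " + toy["CMPLNT_FR_TM"], format="%m/%d/%Y %H:%M:%S"
)
print(toy["CMPLNT_FR"])
```

Note that this maps 24:00:00 to the start of the same day rather than to the following day, which mirrors the approach taken in the cell above.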

In [109]:
df.CMPLNT_FR.isnull().sum()

702

In [110]:
df.CMPLNT_TO.isnull().sum()

1702645

In [111]:
df = df [ ~df.CMPLNT_FR.isnull() ]

###  ADDR_PCT_CD        object

In [112]:

# Map the invalid precinct code -99 to 99
df['ADDR_PCT_CD'] = df.ADDR_PCT_CD.replace(to_replace = '-99', value='99')
df = df [ ~df.ADDR_PCT_CD.isnull() ]
df.ADDR_PCT_CD = pd.Categorical(df.ADDR_PCT_CD)

###  RPT_DT             object

In [113]:


# Convert RPT_DT to date
df.RPT_DT = pd.to_datetime(df.RPT_DT, format="%m/%d/%Y", cache=True)

### 7   KY_CD              object
### 8   OFNS_DESC          object

In [114]:



# Consolidate variant spellings of the offense descriptions
ofns_fixes = {
    'KIDNAPPING': 'KIDNAPPING & RELATED OFFENSES',
    'KIDNAPPING AND RELATED OFFENSES': 'KIDNAPPING & RELATED OFFENSES',
    'AGRICULTURE & MRKTS LAW-UNCLASSIFIED': 'OTHER STATE LAWS (NON PENAL LAW)',
    'OTHER STATE LAWS (NON PENAL LA': 'OTHER STATE LAWS (NON PENAL LAW)',
    'ENDAN WELFARE INCOMP': 'OFFENSES RELATED TO CHILDREN',
    'THEFT OF SERVICES': 'OTHER OFFENSES RELATED TO THEF',
    'NYS LAWS-UNCLASSIFIED VIOLATION': 'OTHER STATE LAWS',
    'FELONY SEX CRIMES': 'SEX CRIMES',
}
df['OFNS_DESC'] = df.OFNS_DESC.replace(ofns_fixes)

df.loc[df.KY_CD=='120','OFNS_DESC'] ='CHILD ABANDONMENT/NON SUPPORT'

df.loc[df.KY_CD=='125','OFNS_DESC'] ='NYS LAWS-UNCLASSIFIED FELONY'

offences = df[ ["KY_CD", "OFNS_DESC"] ].drop_duplicates().dropna()
offences['KY_CD'] = pd.Categorical(pd.to_numeric(offences['KY_CD'] ).astype(int))
offences = offences.set_index("KY_CD")
offences = offences.sort_index()
offences = offences.reset_index()
offences


|  | KY_CD | OFNS_DESC |
|-----|-------|-----------|
| 0 | 101 | MURDER & NON-NEGL. MANSLAUGHTER |
| 1 | 102 | HOMICIDE-NEGLIGENT-VEHICLE |
| 2 | 103 | HOMICIDE-NEGLIGENT,UNCLASSIFIE |
| 3 | 104 | RAPE |
| 4 | 105 | ROBBERY |
| ... | ... | ... |
| 69 | 676 | NEW YORK CITY HEALTH CODE |
| 70 | 677 | OTHER STATE LAWS |
| 71 | 678 | MISCELLANEOUS PENAL LAW |
| 72 | 685 | ADMINISTRATIVE CODES |


In [115]:
df.KY_CD = pd.Categorical(df.KY_CD)

In [116]:
df = df.drop('OFNS_DESC', axis='columns')

### 9   PD_CD              object
### 10  PD_DESC            object

In [117]:



df.loc[df.PD_CD=='694','PD_DESC'] ='INCEST'

df.loc[df.PD_CD=='234','PD_DESC'] ='BURGLARY,UNKNOWN TIME'

internal = df[ ["PD_CD", "PD_DESC"] ].drop_duplicates().dropna()
internal['PD_CD'] = pd.Categorical(pd.to_numeric(internal['PD_CD'] ).astype(int))
internal = internal.set_index("PD_CD")
internal = internal.sort_index()
internal = internal.reset_index()
internal

|  | PD_CD | PD_DESC |
|-----|-------|---------|
| 0 | 100 | STALKING COMMIT SEX OFFENSE |
| 1 | 101 | ASSAULT 3 |
| 2 | 102 | ASSAULT SCHOOL SAFETY AGENT |
| 3 | 103 | ASSAULT TRAFFIC AGENT |
| 4 | 104 | VEHICULAR ASSAULT (INTOX DRIVE |
| ... | ... | ... |
| 427 | 916 | LEAVING SCENE-ACCIDENT-PERSONA |
| 428 | 918 | RECKLESS DRIVING |
| 429 | 922 | TRAFFIC,UNCLASSIFIED MISDEMEAN |
| 430 | 969 | TRAFFIC,UNCLASSIFIED INFRACTIO |


In [118]:
df.PD_CD.isnull().sum()

4462

In [119]:
df = df[~df.PD_CD.isnull()]

In [120]:
df.PD_CD = pd.Categorical(df.PD_CD)

In [121]:
df = df.drop('PD_DESC', axis='columns')



### 11  CRM_ATPT_CPTD_CD   object

In [122]:


df.CRM_ATPT_CPTD_CD.value_counts()

COMPLETED    7226842
ATTEMPTED     125419
Name: CRM_ATPT_CPTD_CD, dtype: int64

In [123]:
df.CRM_ATPT_CPTD_CD = pd.Categorical(df.CRM_ATPT_CPTD_CD)

In [124]:
df.CRM_ATPT_CPTD_CD.isnull().sum()

7

In [125]:
df = df [ ~df.CRM_ATPT_CPTD_CD.isnull() ]

### 12  LAW_CAT_CD         object

In [126]:
df.LAW_CAT_CD.value_counts()

MISDEMEANOR    4132690
FELONY         2261367
VIOLATION       958204
Name: LAW_CAT_CD, dtype: int64

In [127]:
df.LAW_CAT_CD = pd.Categorical(df.LAW_CAT_CD)

### 16  JURIS_DESC         object
### 17  JURISDICTION_CODE  object

In [128]:
df.JURISDICTION_CODE.isnull().sum()

0

In [129]:
df = df[ ~df.JURISDICTION_CODE.isnull() ]

jusridiction = df[ ["JURISDICTION_CODE", "JURIS_DESC", ] ].drop_duplicates().dropna()
jusridiction['JURISDICTION_CODE'] = pd.to_numeric(jusridiction['JURISDICTION_CODE'] )
jusridiction['JURISDICTION_CODE'] = jusridiction['JURISDICTION_CODE'].astype(int)
jusridiction = jusridiction.set_index("JURISDICTION_CODE")
jusridiction = jusridiction.sort_index()
jusridiction = jusridiction.reset_index()
jusridiction

|  | JURISDICTION_CODE | JURIS_DESC |
|-----|-------------------|------------|
| 0 | 0 | N.Y. POLICE DEPT |
| 1 | 1 | N.Y. TRANSIT POLICE |
| 2 | 2 | N.Y. HOUSING POLICE |
| 3 | 3 | PORT AUTHORITY |
| 4 | 4 | TRI-BORO BRDG TUNNL |
| 5 | 6 | LONG ISLAND RAILRD |
| 6 | 7 | AMTRACK |
| 7 | 8 | CONRAIL |
| 8 | 9 | STATN IS RAPID TRANS |
| 9 | 11 | N.Y. STATE POLICE |


In [130]:
df.JURISDICTION_CODE = pd.Categorical(df.JURISDICTION_CODE)


In [131]:
df = df.drop('JURIS_DESC', axis='columns')

###  13  BORO_NM            object

In [132]:
df.BORO_NM.value_counts()

BROOKLYN         2181583
MANHATTAN        1767367
BRONX            1596348
QUEENS           1459678
STATEN ISLAND     342235
Name: BORO_NM, dtype: int64

In [133]:
df.BORO_NM.isnull().sum()

5050

In [134]:
df = df[~df.BORO_NM.isnull()]

In [135]:
df.BORO_NM = pd.Categorical(df.BORO_NM)

### 23  SUSP_AGE_GROUP     object
### 32  VIC_AGE_GROUP      object

In [136]:
# Both columns have a lot of noisy entries. We keep only the dominant groups and define an order;
# values outside the listed categories become NaN
df.SUSP_AGE_GROUP = pd.Categorical(df.SUSP_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
df.VIC_AGE_GROUP = pd.Categorical(df.VIC_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
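
To make the effect of the fixed category list concrete, here is a minimal sketch on made-up values (the noisy entries `'-962'` and `'UNKNOWN'` are hypothetical examples, not taken from the dataset). It shows that anything outside the listed groups becomes missing, and that the ordering allows comparisons.

```python
import pandas as pd

# Hypothetical sample containing two noisy entries
ages = pd.Series(["25-44", "<18", "-962", "65+", "UNKNOWN"])
age_groups = ["<18", "18-24", "25-44", "45-64", "65+"]

ages_cat = pd.Series(pd.Categorical(ages, ordered=True, categories=age_groups))

# Values outside the category list are converted to NaN
print(ages_cat.isnull().sum())

# Because the categories are ordered, comparisons against a group are allowed
print((ages_cat >= "25-44").sum())
```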


### 24  SUSP_RACE          object
### 25  SUSP_SEX           object

### 33  VIC_RACE           object
### 34  VIC_SEX            object

In [137]:
df.VIC_SEX.isnull().sum()

305

In [138]:
df.VIC_SEX.value_counts()

F    2876754
M    2422388
E    1148296
D     899468
Name: VIC_SEX, dtype: int64

In [139]:
df = df[~df.VIC_SEX.isnull()]

In [140]:
df.VIC_RACE.isnull().sum()

1

In [141]:
df.VIC_RACE.value_counts()

UNKNOWN                           2426327
BLACK                             1761427
WHITE                             1263619
WHITE HISPANIC                    1186974
ASIAN / PACIFIC ISLANDER           425349
BLACK HISPANIC                     249841
AMERICAN INDIAN/ALASKAN NATIVE      33337
OTHER                                  31
Name: VIC_RACE, dtype: int64

In [142]:
# Very small number of OTHER values, so we fold them into UNKNOWN
df['VIC_RACE'] = df.VIC_RACE.replace(to_replace = 'OTHER', value='UNKNOWN')

In [143]:
df = df[~df.VIC_RACE.isnull()]

In [144]:
df.SUSP_SEX.value_counts()

M    2383624
F     753795
U     661168
Name: SUSP_SEX, dtype: int64

In [145]:
# U means unknown, which is the same as NULL.
df['SUSP_SEX'] = df.SUSP_SEX.replace(to_replace = 'U', value=np.nan)

In [146]:
df.SUSP_RACE.value_counts()

BLACK                             1466527
UNKNOWN                           1028922
WHITE HISPANIC                     666833
WHITE                              428768
BLACK HISPANIC                     202788
ASIAN / PACIFIC ISLANDER           125918
AMERICAN INDIAN/ALASKAN NATIVE      11870
OTHER                                  11
Name: SUSP_RACE, dtype: int64

In [147]:
# Very small number of OTHER values, so we fold them into UNKNOWN
df['SUSP_RACE'] = df.SUSP_RACE.replace(to_replace = 'OTHER', value='UNKNOWN')



In [148]:
df.SUSP_RACE = pd.Categorical(df.SUSP_RACE)
df.SUSP_SEX = pd.Categorical(df.SUSP_SEX)
df.VIC_RACE = pd.Categorical(df.VIC_RACE)
df.VIC_SEX = pd.Categorical(df.VIC_SEX)

In [149]:
df.dtypes

CMPLNT_NUM                    int32
ADDR_PCT_CD                category
RPT_DT               datetime64[ns]
KY_CD                      category
PD_CD                      category
CRM_ATPT_CPTD_CD           category
LAW_CAT_CD                 category
BORO_NM                    category
LOC_OF_OCCUR_DESC            object
PREM_TYP_DESC                object
JURISDICTION_CODE          category
PARKS_NM                     object
HADEVELOPT                   object
HOUSING_PSA                  object
SUSP_AGE_GROUP             category
SUSP_RACE                  category
SUSP_SEX                   category
TRANSIT_DISTRICT             object
Latitude                     object
Longitude                    object
PATROL_BORO                  object
STATION_NAME                 object
VIC_AGE_GROUP              category
VIC_RACE                   category
VIC_SEX                    category
CMPLNT_FR            datetime64[ns]
CMPLNT_TO            datetime64[ns]
dtype: object

###  14  LOC_OF_OCCUR_DESC  object

In [150]:
df.LOC_OF_OCCUR_DESC.value_counts()

INSIDE         3731560
FRONT OF       1723348
OPPOSITE OF     195415
REAR OF         156959
Name: LOC_OF_OCCUR_DESC, dtype: int64

In [151]:
df.LOC_OF_OCCUR_DESC.isnull().sum()

1539623

In [152]:
df.LOC_OF_OCCUR_DESC = pd.Categorical(df.LOC_OF_OCCUR_DESC)

### Latitude                     object
### Longitude                    object

In [153]:
df.Latitude = pd.to_numeric(df.Latitude, downcast='float')
df.Longitude  = pd.to_numeric(df.Longitude, downcast='float')

### TRANSIT_DISTRICT

In [154]:
df.TRANSIT_DISTRICT.value_counts()


4     27672
2     20128
1     15756
3     15755
20    14400
33    14328
12    12193
11    11124
32    11050
30    10462
34     7188
23     2687
Name: TRANSIT_DISTRICT, dtype: int64

In [155]:
len(df) - df.TRANSIT_DISTRICT.isnull().sum()

162743

In [156]:
df.drop('TRANSIT_DISTRICT', axis='columns', inplace=True)


### PREM_TYP_DESC

In [157]:
df.PREM_TYP_DESC.value_counts()

STREET                        2347358
RESIDENCE - APT. HOUSE        1558412
RESIDENCE-HOUSE                723117
RESIDENCE - PUBLIC HOUSING     551829
OTHER                          199061
                               ...   
CEMETERY                          854
MAILBOX INSIDE                    689
LOAN COMPANY                      492
TRAMWAY                           138
DAYCARE FACILITY                   53
Name: PREM_TYP_DESC, Length: 74, dtype: int64

In [158]:
df.PREM_TYP_DESC.isnull().sum()

33965

In [159]:
df = df [~df.PREM_TYP_DESC.isnull()]

In [160]:
df.PREM_TYP_DESC = pd.Categorical(df.PREM_TYP_DESC)

In [161]:
df.PARKS_NM.value_counts()

CENTRAL PARK                      1635
FLUSHING MEADOWS CORONA PARK      1307
CONEY ISLAND BEACH & BOARDWALK    1059
WASHINGTON SQUARE PARK             775
RIVERSIDE PARK                     614
                                  ... 
WORTH SQUARE                         1
FLOOD TRIANGLE                       1
SUMPTER COMMUNITY GARDEN             1
KIMLAU SQUARE                        1
BARCLAY TRIANGLE                     1
Name: PARKS_NM, Length: 1205, dtype: int64

In [162]:
df.PARKS_NM.value_counts().sum()

27556

In [163]:
df.drop('PARKS_NM', axis='columns', inplace=True)



### 19  HADEVELOPT         object


In [164]:
df.HADEVELOPT.value_counts()

CASTLE HILL                                    7510
VAN DYKE I                                     6072
MARCY                                          5566
GRANT                                          5182
BUTLER                                         5146
                                               ... 
1010 EAST 178TH STREET                            1
344 EAST 28TH STREET                              1
FULTON                                            1
FRANKLIN AVENUE III MHOP                          1
FOREST HILLS COOP (108TH STREET-62ND DRIVE)       1
Name: HADEVELOPT, Length: 278, dtype: int64

In [165]:
df.drop('HADEVELOPT', axis='columns', inplace=True)


### 20  HOUSING_PSA        object



In [166]:
df.HOUSING_PSA.value_counts()

670      7316
887      7236
845      6659
720      6458
632      6316
         ... 
45544       1
26567       1
56822       1
628         1
5093        1
Name: HOUSING_PSA, Length: 5003, dtype: int64

In [167]:
df.HOUSING_PSA.value_counts().sum()

563532

In [168]:
df.drop('HOUSING_PSA', axis='columns', inplace=True)

### 30  PATROL_BORO        object


In [169]:
df.PATROL_BORO.value_counts()

PATROL BORO BRONX            1588470
PATROL BORO BKLYN SOUTH      1089337
PATROL BORO BKLYN NORTH      1085227
PATROL BORO MAN SOUTH         882224
PATROL BORO MAN NORTH         870375
PATROL BORO QUEENS NORTH      751939
PATROL BORO QUEENS SOUTH      704341
PATROL BORO STATEN ISLAND     340951
Name: PATROL_BORO, dtype: int64

In [170]:
df.PATROL_BORO = pd.Categorical(df.PATROL_BORO)

In [171]:
df.PATROL_BORO.isnull().sum()

76

In [172]:
df = df[~df.PATROL_BORO.isnull()] 

### 31  STATION_NAME       object

In [173]:
df.STATION_NAME.value_counts()

125 STREET                        8152
14 STREET                         4498
34 ST.-PENN STATION               3778
42 ST.-PORT AUTHORITY BUS TERM    3728
116 STREET                        3302
                                  ... 
DISTRICT 30 OFFICE                  17
DISTRICT 12 OFFICE                  17
DISTRICT 34 OFFICE                  13
DISTRICT 23 OFFICE                   6
OFF-SYSTEM                           4
Name: STATION_NAME, Length: 372, dtype: int64

In [174]:
df.STATION_NAME.isnull().sum()

7151424

In [175]:
df.drop('STATION_NAME', axis='columns', inplace=True)

In [176]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7312864 entries, 0 to 7396618
Data columns (total 22 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datet

## Data exploration

In this part, we check the different values that appear in each column. When we detect noisy values, we delete them. In fact, many of the operations performed above, in the 'Data Cleaning' section, are the result of observations made here.

In [177]:
# Find the unique values in each column
# 
# df.describe(include = [np.object, 'category']).T['unique']
unique = df.describe(include = 'all').T['unique'].sort_values()



In [178]:
unique

CRM_ATPT_CPTD_CD           2
SUSP_SEX                   2
LAW_CAT_CD                 3
VIC_SEX                    4
LOC_OF_OCCUR_DESC          4
BORO_NM                    5
VIC_AGE_GROUP              5
SUSP_AGE_GROUP             5
VIC_RACE                   7
SUSP_RACE                  7
PATROL_BORO                8
JURISDICTION_CODE         25
KY_CD                     73
PREM_TYP_DESC             74
ADDR_PCT_CD               78
PD_CD                    432
RPT_DT                  5479
CMPLNT_FR            2041884
CMPLNT_TO            2254234
CMPLNT_NUM               NaN
Latitude                 NaN
Longitude                NaN
Name: unique, dtype: object

In [179]:
for column in unique.index:
    if unique[column] < 200:
        print(df[column].value_counts())
        print("=====")

COMPLETED    7188018
ATTEMPTED     124846
Name: CRM_ATPT_CPTD_CD, dtype: int64
=====
M    2377336
F     751425
Name: SUSP_SEX, dtype: int64
=====
MISDEMEANOR    4109437
FELONY         2249282
VIOLATION       954145
Name: LAW_CAT_CD, dtype: int64
=====
F    2864721
M    2413807
E    1140660
D     893676
Name: VIC_SEX, dtype: int64
=====
INSIDE         3710230
FRONT OF       1717378
OPPOSITE OF     194893
REAR OF         156245
Name: LOC_OF_OCCUR_DESC, dtype: int64
=====
BROOKLYN         2174499
MANHATTAN        1754597
BRONX            1588597
QUEENS           1454218
STATEN ISLAND     340953
Name: BORO_NM, dtype: int64
=====
25-44    2416343
45-64    1260018
18-24     739767
<18       335238
65+       257644
Name: VIC_AGE_GROUP, dtype: int64
=====
25-44    1008874
18-24     384135
45-64     359064
<18       107679
65+        29939
Name: SUSP_AGE_GROUP, dtype: int64
=====
UNKNOWN                           2411042
BLACK                             1755296
WHITE                           

In [180]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7312864 entries, 0 to 7396618
Data columns (total 22 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datet

In [183]:
df['PREM_TYP_DESC'].value_counts().count()

74

In [184]:
# All columns, except for the dates and spatial coordinates, are categorical
# Columns with fewer than a thousand unique values (the threshold used below)
# are good candidates for ENUMs in the database, given that the dataset is static.
# Also, in Pandas the internal representation becomes much more efficient,
# as Categoricals are stored as integers and not as strings
for column in unique.index:
    if column == 'RPT_DT':
        continue
    if df[column].value_counts().count() < 1000:
        df[column] = pd.Categorical(df[column])
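
The memory argument can be checked on a small synthetic column. The sketch below is only an illustration under assumed data (one million made-up borough names); the exact numbers depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical column: one million borough names stored as Python strings
names = ["BROOKLYN", "MANHATTAN", "BRONX", "QUEENS", "STATEN ISLAND"]
boros = pd.Series(names * 200_000, dtype="object")

# deep=True counts the string payloads, not just the pointers
as_object = boros.memory_usage(deep=True)
as_category = boros.astype("category").memory_usage(deep=True)

print(f"object:   {as_object / 1e6:.1f} MB")
print(f"category: {as_category / 1e6:.1f} MB")
```

With only five distinct values, each row needs a single small integer code, which is why the categorical version is much smaller.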

In [185]:
# With proper data types, the dataset went down in size from 1.9 GB+ to 425 MB.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7312864 entries, 0 to 7396618
Data columns (total 22 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int32         
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   PD_CD              category      
 5   CRM_ATPT_CPTD_CD   category      
 6   LAW_CAT_CD         category      
 7   BORO_NM            category      
 8   LOC_OF_OCCUR_DESC  category      
 9   PREM_TYP_DESC      category      
 10  JURISDICTION_CODE  category      
 11  SUSP_AGE_GROUP     category      
 12  SUSP_RACE          category      
 13  SUSP_SEX           category      
 14  Latitude           float32       
 15  Longitude          float32       
 16  PATROL_BORO        category      
 17  VIC_AGE_GROUP      category      
 18  VIC_RACE           category      
 19  VIC_SEX            category      
 20  CMPLNT_FR          datet

In [186]:
df.memory_usage()

Index                58502912
CMPLNT_NUM           29251456
ADDR_PCT_CD           7316048
RPT_DT               58502912
KY_CD                 7316016
PD_CD                14649664
CRM_ATPT_CPTD_CD      7312960
LAW_CAT_CD            7312968
BORO_NM               7313064
LOC_OF_OCCUR_DESC     7313056
PREM_TYP_DESC         7316016
JURISDICTION_CODE     7313704
SUSP_AGE_GROUP        7313064
SUSP_RACE             7313240
SUSP_SEX              7312960
Latitude             29251456
Longitude            29251456
PATROL_BORO           7313248
VIC_AGE_GROUP         7313064
VIC_RACE              7313240
VIC_SEX               7313056
CMPLNT_FR            58502912
CMPLNT_TO            58502912
dtype: int64

In [None]:
df.dtypes

In [187]:
# Find unique values and maximum length of various columns
# We mainly use this to specify the max length of a varchar 
# data type in MySQL
for column in df.columns.values:
    datatype = df[column].dtype.name
    unique_values = len(df[column].value_counts())
    print(column, '\t', datatype, '\t', unique_values)
    if datatype == 'object' or datatype =='category':
        m = df[column].str.len().max()
        print("Max length:", m)


CMPLNT_NUM 	 int32 	 7312864
ADDR_PCT_CD 	 category 	 78
Max length: 3
RPT_DT 	 datetime64[ns] 	 5479
KY_CD 	 category 	 74
Max length: 3
PD_CD 	 category 	 432
Max length: 3
CRM_ATPT_CPTD_CD 	 category 	 2
Max length: 9
LAW_CAT_CD 	 category 	 3
Max length: 11
BORO_NM 	 category 	 5
Max length: 13
LOC_OF_OCCUR_DESC 	 category 	 4
Max length: 11.0
PREM_TYP_DESC 	 category 	 74
Max length: 28
JURISDICTION_CODE 	 category 	 25
Max length: 2
SUSP_AGE_GROUP 	 category 	 5
Max length: 5.0
SUSP_RACE 	 category 	 7
Max length: 30.0
SUSP_SEX 	 category 	 2
Max length: 1.0
Latitude 	 float32 	 63980
Longitude 	 float32 	 47947
PATROL_BORO 	 category 	 8
Max length: 25
VIC_AGE_GROUP 	 category 	 5
Max length: 5.0
VIC_RACE 	 category 	 7
Max length: 30
VIC_SEX 	 category 	 4
Max length: 1
CMPLNT_FR 	 datetime64[ns] 	 2041884
CMPLNT_TO 	 datetime64[ns] 	 2254234
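
The counts and lengths above feed directly into the column definitions of the table. As a rough sketch of how that decision could be automated, the hypothetical helper `suggest_mysql_type` below (not part of the notebook; the `enum_limit` threshold is an arbitrary assumption) proposes an ENUM for columns with few distinct values and a VARCHAR sized by the longest value otherwise. It assumes the column has at least one non-null value.

```python
import pandas as pd

def suggest_mysql_type(column: pd.Series, enum_limit: int = 10) -> str:
    """Suggest an ENUM or VARCHAR definition for a column of strings (hypothetical helper)."""
    values = sorted(column.dropna().astype(str).unique())
    if len(values) <= enum_limit:
        # Double any single quotes so the values are valid SQL string literals
        quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
        return f"enum({quoted})"
    # Otherwise size the VARCHAR by the longest observed value
    return f"varchar({max(len(v) for v in values)})"

# Made-up sample values for illustration
law_cat = pd.Series(["FELONY", "MISDEMEANOR", "VIOLATION", "FELONY"])
print(suggest_mysql_type(law_cat))
print(suggest_mysql_type(law_cat, enum_limit=2))
```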


In [None]:
df.dtypes

## Storing in a MySQL database

In [188]:
!sudo pip3 install -U -q PyMySQL sqlalchemy sql_magic


In [189]:
import os
from sqlalchemy import create_engine

# The charset is already part of the connection URL, so no extra argument is needed
conn_string = 'mysql+pymysql://{user}:{password}@{host}/?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org', 
    user = 'root',
    password = os.environ['MYSQL_PASSWORD'])

engine = create_engine(conn_string)
con = engine.connect()

In [190]:
# Query to create a database
db_name = 'nypd'

sql = f"DROP DATABASE IF EXISTS {db_name}"
engine.execute(sql)

# Create a database
sql = f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8mb4'"
engine.execute(sql)


<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa1619370d0>

In [191]:

# And let's switch to the database
sql = f"USE {db_name}"
engine.execute(sql)


<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa139743850>

In [192]:
# In principle, we can let Pandas create the table, but we want to be a bit more precise
# with the data types, and we want to add documentation for each column
# from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i


create_table_sql = '''
CREATE TABLE nypd (
  CMPLNT_NUM int,
  CMPLNT_FR datetime,
  CMPLNT_TO datetime,
  RPT_DT date,
  KY_CD SMALLINT,
  PD_CD SMALLINT,
  CRM_ATPT_CPTD_CD enum('COMPLETED','ATTEMPTED'),
  LAW_CAT_CD enum('FELONY','MISDEMEANOR','VIOLATION'),
  JURISDICTION_CODE SMALLINT,
  BORO_NM enum('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN ISLAND'),
  ADDR_PCT_CD SMALLINT,
  LOC_OF_OCCUR_DESC enum('FRONT OF','INSIDE','OPPOSITE OF','OUTSIDE','REAR OF'),
  PATROL_BORO enum('PATROL BORO BRONX', 'PATROL BORO BKLYN SOUTH','PATROL BORO BKLYN NORTH','PATROL BORO MAN SOUTH','PATROL BORO MAN NORTH','PATROL BORO QUEENS NORTH','PATROL BORO QUEENS SOUTH','PATROL BORO STATEN ISLAND'),
  PREM_TYP_DESC varchar(30),
  SUSP_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  VIC_RACE enum('UNKNOWN', 'BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  SUSP_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  VIC_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  SUSP_SEX enum('M', 'F'),
  VIC_SEX enum('M', 'F', 'E', 'D'),
  Latitude double,
  Longitude double,
  PRIMARY KEY (CMPLNT_NUM)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
engine.execute(create_table_sql)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa1619b61d0>

In [193]:
# Insert the data into the table in batches, so that we can track progress
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    # print("Batch:",i)
    # continue # Cannot execute this on Travis
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd', 
        schema = db_name, 
        con = engine,
        if_exists = 'append',
        index = False, 
        chunksize = 1000)

100%|██████████| 147/147 [25:34<00:00, 10.44s/it]
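
The batching pattern can be tried without a MySQL server. The following sketch is an assumption-laden stand-in: it uses an in-memory SQLite database and a tiny made-up frame instead of the real connection and data, and it steps through the frame with `range(0, len(toy), batchsize)` so that no empty final batch is produced.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
toy = pd.DataFrame({"CMPLNT_NUM": range(10), "BORO_NM": ["BRONX", "QUEENS"] * 5})

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of MySQL
batchsize = 4

# Step through the frame in fixed-size slices; the last slice may be shorter
for start in range(0, len(toy), batchsize):
    toy.iloc[start:start + batchsize].to_sql(
        name="nypd", con=conn, if_exists="append", index=False
    )

rows = conn.execute("SELECT COUNT(*) FROM nypd").fetchone()[0]
print(rows)
```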


In [197]:
engine.execute("CREATE INDEX ix_lat ON nypd.nypd(Latitude)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa174a30110>

In [198]:
engine.execute("CREATE INDEX ix_lon ON nypd.nypd(Longitude)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa173f14ed0>

In [201]:
engine.execute("CREATE INDEX ix_LAW_CAT_CD ON nypd.nypd(LAW_CAT_CD)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa17702ac50>

In [203]:
engine.execute("CREATE INDEX ix_BORO_NM ON nypd.nypd(BORO_NM)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa170877210>

In [204]:
engine.execute("CREATE INDEX ix_RPT_DT ON nypd.nypd(RPT_DT)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa139743510>

In [205]:
engine.execute("CREATE INDEX ix_CMPLNT_FR ON nypd.nypd(CMPLNT_FR)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fa1619b4790>

In [209]:
engine.execute("DROP TABLE IF EXISTS offense_codes;")

create_table_sql = '''
CREATE TABLE offense_codes (
  KY_CD smallint,
  OFNS_DESC varchar(32),
  PRIMARY KEY (KY_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
engine.execute(create_table_sql)

offences.to_sql(
        name = 'offense_codes', 
        schema = db_name, 
        con = engine,
        if_exists = 'append',
        index = False)

35

In [214]:
engine.execute("DROP TABLE IF EXISTS jurisdiction_codes;")

create_table_sql = '''
CREATE TABLE jurisdiction_codes (
  JURISDICTION_CODE smallint,
  JURIS_DESC varchar(40),
  PRIMARY KEY (JURISDICTION_CODE)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
engine.execute(create_table_sql)


jusridiction.to_sql(
        name = 'jurisdiction_codes', 
        schema = db_name, 
        con = engine,
        if_exists = 'append',
        index = False)

In [217]:
internal.PD_DESC.str.len().max()


71

In [218]:
engine.execute("DROP TABLE IF EXISTS penal_codes;")

create_table_sql = '''
CREATE TABLE penal_codes (
  PD_CD smallint,
  PD_DESC varchar(80),
  PRIMARY KEY (PD_CD)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
engine.execute(create_table_sql)


internal.to_sql(
        name = 'penal_codes', 
        schema = db_name, 
        con = engine,
        if_exists = 'append',
        index = False)

## TODO

### Add the penal code data as a separate table

`!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/files/65f25845-1551-4d21-91dc-869c977cd93d?download=true&filename=PDCode_PenalLaw.xlsx' -o PDCode_PenalLaw.xlsx`

### Examine whether to normalize 

The fields `PREM_TYP_DESC`, `HADEVELOPT`, and `PARKS_NM` would be better off as foreign keys or enums. They take too much space as strings.
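
One possible way to approach this normalization, sketched on made-up values: `pd.factorize` assigns an integer code to each distinct description, which can serve as a foreign key into a small lookup table. The key name `PREM_TYP_ID` and the frame names are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical sample of premises descriptions
toy = pd.DataFrame({"PREM_TYP_DESC": ["STREET", "RESIDENCE-HOUSE", "STREET", "OTHER"]})

# factorize assigns one integer code per distinct value, in order of first appearance
codes, labels = pd.factorize(toy["PREM_TYP_DESC"])

# Lookup table: one row per distinct description
premises = pd.DataFrame({"PREM_TYP_ID": range(len(labels)), "PREM_TYP_DESC": labels})

# The main table keeps only the compact integer key
toy["PREM_TYP_ID"] = codes
toy = toy.drop(columns="PREM_TYP_DESC")

print(premises)
print(toy)
```

The lookup frame could then be written to its own table, in the same way as the offense, jurisdiction, and penal code tables above.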