## NYPD Dataset

Dataset description: https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i



| Column | Description |
|--------|-------------------|
| CMPLNT_NUM |  Randomly generated persistent ID for each complaint  |  
| ADDR_PCT_CD |  The precinct in which the incident occurred |  
| BORO_NM |  The name of the borough in which the incident occurred |  
| CMPLNT_FR_DT |  Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |  
| CMPLNT_FR_TM |  Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |  
| CMPLNT_TO_DT |  Ending date of occurrence for the reported event, if exact time of occurrence is unknown |  
| CMPLNT_TO_TM |  Ending time of occurrence for the reported event, if exact time of occurrence is unknown |  
| CRM_ATPT_CPTD_CD |  Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |  
| HADEVELOPT |  Name of NYCHA housing development of occurrence, if applicable |  
| HOUSING_PSA |  Development Level Code |  
| JURISDICTION_CODE |  Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |  
| JURIS_DESC |  Description of the jurisdiction code |  
| KY_CD |  Three digit offense classification code |  
| LAW_CAT_CD |  Level of offense: felony, misdemeanor, violation  |  
| LOC_OF_OCCUR_DESC |  Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |  
| OFNS_DESC |  Description of offense corresponding with key code |  
| PARKS_NM |  Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |  
| PATROL_BORO |  The name of the patrol borough in which the incident occurred |  
| PD_CD |  Three digit internal classification code (more granular than Key Code) |  
| PD_DESC |  Description of internal classification corresponding with PD code (more granular than Offense Description) |  
| PREM_TYP_DESC |  Specific description of premises; grocery store, residence, street, etc. |  
| RPT_DT |  Date event was reported to police  |  
| STATION_NAME |  Transit station name |  
| SUSP_AGE_GROUP |  Suspect’s Age Group |  
| SUSP_RACE |  Suspect’s Race Description |  
| SUSP_SEX |  Suspect’s Sex Description |  
| TRANSIT_DISTRICT |  Transit district in which the offense occurred. |  
| VIC_AGE_GROUP |  Victim’s Age Group |  
| VIC_RACE |  Victim’s Race Description |  
| VIC_SEX |  Victim’s Sex Description |  
| X_COORD_CD |  X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Y_COORD_CD |  Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Latitude |  Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)  |  
| Longitude |  Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |


In [1]:
import pandas as pd
import numpy as np

In [3]:
# We load everything as an object/string, because some data types (e.g., some IDs)
# are recognized as decimals, and it is a mess to restore them back
# So we will do all the conversions ourselves later on

# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl -v 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv
df = pd.read_csv('nypd.csv', low_memory = True, dtype='object')


In [2]:
# We load directly from the URL
url = 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD'
df = pd.read_csv(url, low_memory = True, dtype='object')


KeyboardInterrupt: ignored

In [4]:
len(df)

7396619

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7396619 entries, 0 to 7396618
Data columns (total 35 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   CMPLNT_NUM         object
 1   CMPLNT_FR_DT       object
 2   CMPLNT_FR_TM       object
 3   CMPLNT_TO_DT       object
 4   CMPLNT_TO_TM       object
 5   ADDR_PCT_CD        object
 6   RPT_DT             object
 7   KY_CD              object
 8   OFNS_DESC          object
 9   PD_CD              object
 10  PD_DESC            object
 11  CRM_ATPT_CPTD_CD   object
 12  LAW_CAT_CD         object
 13  BORO_NM            object
 14  LOC_OF_OCCUR_DESC  object
 15  PREM_TYP_DESC      object
 16  JURIS_DESC         object
 17  JURISDICTION_CODE  object
 18  PARKS_NM           object
 19  HADEVELOPT         object
 20  HOUSING_PSA        object
 21  X_COORD_CD         object
 22  Y_COORD_CD         object
 23  SUSP_AGE_GROUP     object
 24  SUSP_RACE          object
 25  SUSP_SEX           object
 26  TRANSIT_DISTRI

## Data Cleaning

In [None]:
# Drop every row whose complaint number appears more than once
key_cnt = df.CMPLNT_NUM.value_counts()
duplicated_keys = key_cnt[key_cnt > 1].index
df = df[~df.CMPLNT_NUM.isin(duplicated_keys)]

In [6]:
# There are a few rows that contain year 1015, 1016, ... that trigger an error during date conversion
# We replace all years written as 10XX with 20XX
# Note the usage of regular expressions (raw strings keep the backslashes intact)
# We assign the result back instead of using inplace=True on a column accessed from df
df['CMPLNT_FR_DT'] = df.CMPLNT_FR_DT.replace(to_replace=r'(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True)
df['CMPLNT_TO_DT'] = df.CMPLNT_TO_DT.replace(to_replace=r'(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True)
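
To see what the regular expression above does, here is a small sketch using Python's `re` module on made-up strings. The sample values are illustrative and not taken from the dataset; the pattern and replacement are the same ones used in the cell.

```python
import re

# Same pattern as in the cell above: MM/DD/10YY, with three capture groups
pattern = re.compile(r'(\d\d)/(\d\d)/10(\d\d)')

# Illustrative inputs (hypothetical, not actual rows from the dataset)
samples = ['03/15/1015', '12/31/1016', '07/04/2010']

# \1, \2 and \3 refer back to the month, the day and the last two digits of the year
fixed = [pattern.sub(r'\1/\2/20\3', s) for s in samples]
print(fixed)
```

Only years written as 10XX are rewritten; a date such as `07/04/2010` does not match the pattern and passes through unchanged.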

In [7]:
# Similarly, a few hours are written as 24:00:00, which also triggers errors.
# We fix these hours by rewriting them as 00:00:00 (the date itself is left unchanged)
df['CMPLNT_FR_TM'] = df.CMPLNT_FR_TM.replace(to_replace='24:00:00', value='00:00:00')
df['CMPLNT_TO_TM'] = df.CMPLNT_TO_TM.replace(to_replace='24:00:00', value='00:00:00')
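
Replacing `24:00:00` with `00:00:00` keeps the same calendar date, so the timestamp moves to the start of that day. If `24:00:00` is instead meant as the end of the day (an assumption about the data, not something the dataset description states), the date should roll over to the next day. A minimal sketch of that alternative on a hypothetical two-row sample:

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the dataset
sample = pd.DataFrame({
    "CMPLNT_FR_DT": ["12/31/2015", "01/05/2016"],
    "CMPLNT_FR_TM": ["24:00:00", "13:30:00"],
})

# Remember which rows used 24:00:00 before rewriting them
is_24 = sample["CMPLNT_FR_TM"] == "24:00:00"
tm = sample["CMPLNT_FR_TM"].where(~is_24, "00:00:00")

# Parse as usual, then push the flagged rows forward by one day
ts = pd.to_datetime(sample["CMPLNT_FR_DT"] + " " + tm, format="%m/%d/%Y %H:%M:%S")
ts = ts + pd.to_timedelta(is_24.astype(int), unit="D")
print(ts)
```

Which interpretation is right depends on how the times were recorded; the notebook uses the simpler same-day replacement.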

In [8]:
# Convert the two separate date and time columns into single datetime columns
df['CMPLNT_FR'] = pd.to_datetime(df.CMPLNT_FR_DT + ' ' + df.CMPLNT_FR_TM, format='%m/%d/%Y %H:%M:%S', cache=True)
df['CMPLNT_TO'] = pd.to_datetime(df.CMPLNT_TO_DT + ' ' + df.CMPLNT_TO_TM, format='%m/%d/%Y %H:%M:%S', cache=True)

# Convert RPT_DT to date
df.RPT_DT = pd.to_datetime(df.RPT_DT, format="%m/%d/%Y", cache=True)
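
One way to discover the kind of bad values fixed in the cells above is to parse with `errors="coerce"` and inspect the rows that fail. This is a sketch on a hypothetical sample, not the procedure the notebook actually followed:

```python
import pandas as pd

# Hypothetical raw values: one valid date, two impossible dates, one missing value
raw = pd.Series(["01/15/2015", "02/30/2015", "13/01/2015", None])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Rows that had a value but could not be parsed are the ones to investigate
bad = raw[parsed.isna() & raw.notna()]
print(bad.tolist())
```

Running `value_counts()` on such a `bad` series is a quick way to see which malformed patterns are frequent enough to deserve a targeted fix.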

In [9]:
# These columns are redundant
to_drop = ['CMPLNT_FR_DT','CMPLNT_TO_DT','CMPLNT_FR_TM','CMPLNT_TO_TM','Lat_Lon','X_COORD_CD','Y_COORD_CD']
# We created the CMPLNT_FR and CMPLNT_TO columns
# We have the longitude and latitude so the other coordinates are not needed
df = df.drop(to_drop, axis='columns')

In [10]:
# MISD means MISDEMEANOR
df['LAW_CAT_CD'] = df.LAW_CAT_CD.replace(to_replace='MISD', value='MISDEMEANOR')

In [11]:
# In some columns we have values that indicate unknown values.
# The columns also have NULL values, which indicate the same.
# We replace the values with NULL, for consistency

# Replace ' ' with NULL
df['LOC_OF_OCCUR_DESC'] = df.LOC_OF_OCCUR_DESC.replace(to_replace=' ', value=np.nan)
# U is unknown, same as NULL.
df['VIC_SEX'] = df.VIC_SEX.replace(to_replace='U', value=np.nan)
df['SUSP_SEX'] = df.SUSP_SEX.replace(to_replace='U', value=np.nan)
# Very small amount of OTHER values
df['SUSP_RACE'] = df.SUSP_RACE.replace(to_replace='OTHER', value=np.nan)
df['VIC_RACE'] = df.VIC_RACE.replace(to_replace='OTHER', value=np.nan)
# UNKNOWN is the same as NULL
df['SUSP_RACE'] = df.SUSP_RACE.replace(to_replace='UNKNOWN', value=np.nan)
df['VIC_RACE'] = df.VIC_RACE.replace(to_replace='UNKNOWN', value=np.nan)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7396619 entries, 0 to 7396618
Data columns (total 30 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         object        
 1   ADDR_PCT_CD        object        
 2   RPT_DT             datetime64[ns]
 3   KY_CD              object        
 4   OFNS_DESC          object        
 5   PD_CD              object        
 6   PD_DESC            object        
 7   CRM_ATPT_CPTD_CD   object        
 8   LAW_CAT_CD         object        
 9   BORO_NM            object        
 10  LOC_OF_OCCUR_DESC  object        
 11  PREM_TYP_DESC      object        
 12  JURIS_DESC         object        
 13  JURISDICTION_CODE  object        
 14  PARKS_NM           object        
 15  HADEVELOPT         object        
 16  HOUSING_PSA        object        
 17  SUSP_AGE_GROUP     object        
 18  SUSP_RACE          object        
 19  SUSP_SEX           object        
 20  TRANSIT_DISTRICT   objec

In [13]:
# Both columns have a lot of noisy entries. We keep only the dominant groups, and also define an order
df.SUSP_AGE_GROUP = pd.Categorical(df.SUSP_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
df.VIC_AGE_GROUP = pd.Categorical(df.VIC_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
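
The cell above relies on a property of `pd.Categorical`: any value that is not in the `categories` list becomes missing, which is how the noisy entries are discarded. A small sketch with hypothetical noisy values (the strings `'1017'` and `'-2'` are made up for illustration):

```python
import pandas as pd

age_order = ['<18', '18-24', '25-44', '45-64', '65+']

# Hypothetical raw values, including two noisy entries
raw_ages = pd.Series(['25-44', '<18', '1017', '-2', '65+'])

# Values outside the declared categories are converted to NaN
ages = pd.Series(pd.Categorical(raw_ages, ordered=True, categories=age_order))
n_dropped = int(ages.isna().sum())

# Because the categorical is ordered, comparisons follow the declared order
adults_45_plus = ages[ages >= '45-64']
print(n_dropped, adults_45_plus.tolist())
```

The ordering also means that sorting and `min()`/`max()` follow the age sequence rather than alphabetical order.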

In [14]:
df.Latitude = pd.to_numeric(df.Latitude, downcast='float')
df.Longitude  = pd.to_numeric(df.Longitude, downcast='float')

## Data exploration

In this part we check the different values that appear in the columns. When we detect noisy results, we delete the corresponding values. In fact, many of the operations that are performed above, in the 'data cleaning' section, are the result of observations that we make here. 

In [15]:
# Find the unique values in each column
# 
# df.describe(include = [np.object, 'category']).T['unique']
unique = df.describe(include = 'all').T['unique'].sort_values()



In [16]:
unique

SUSP_SEX                   2
CRM_ATPT_CPTD_CD           2
LAW_CAT_CD                 3
VIC_SEX                    4
VIC_AGE_GROUP              5
SUSP_AGE_GROUP             5
BORO_NM                    5
LOC_OF_OCCUR_DESC          5
SUSP_RACE                  6
VIC_RACE                   6
PATROL_BORO                8
TRANSIT_DISTRICT          12
JURISDICTION_CODE         25
JURIS_DESC                25
OFNS_DESC                 71
KY_CD                     74
PREM_TYP_DESC             74
ADDR_PCT_CD               78
HADEVELOPT               279
STATION_NAME             372
PD_DESC                  422
PD_CD                    432
PARKS_NM                1205
HOUSING_PSA             5102
RPT_DT                  5479
CMPLNT_FR            2053256
CMPLNT_TO            2262990
CMPLNT_NUM           7373156
Latitude                 NaN
Longitude                NaN
Name: unique, dtype: object

In [17]:
for column in unique.index:
    if unique[column] < 200:
        print(df[column].value_counts())
        print("=====")

M    2402648
F     758515
Name: SUSP_SEX, dtype: int64
=====
COMPLETED    7270662
ATTEMPTED     125950
Name: CRM_ATPT_CPTD_CD, dtype: int64
=====
MISDEMEANOR    4149416
FELONY         2284995
VIOLATION       962208
Name: LAW_CAT_CD, dtype: int64
=====
F    2898168
M    2441040
E    1152793
D     904306
Name: VIC_SEX, dtype: int64
=====
25-44    2440544
45-64    1272469
18-24     751022
<18       341321
65+       261327
Name: VIC_AGE_GROUP, dtype: int64
=====
25-44    1020765
18-24     389283
45-64     363435
<18       109314
65+        30510
Name: SUSP_AGE_GROUP, dtype: int64
=====
BROOKLYN         2192618
MANHATTAN        1776835
BRONX            1603632
QUEENS           1468180
STATEN ISLAND     343975
Name: BORO_NM, dtype: int64
=====
INSIDE         3763423
FRONT OF       1729865
OPPOSITE OF     195921
REAR OF         157496
OUTSIDE           3836
Name: LOC_OF_OCCUR_DESC, dtype: int64
=====
BLACK                             1476712
WHITE HISPANIC                     672125
WHITE    

In [18]:
# All columns, except for the dates and spatial coordinates, are categorical
# Columns with less than a few thousand unique values are good candidates 
# for ENUMs in the database given that the dataset is static.
# Also, in Pandas the internal representation becomes much more efficient
# as the Categoricals are stored as integers and not as strings
for column in unique.index:
    if column == 'RPT_DT':
        continue
    if unique[column] < 10000:
        df[column] = pd.Categorical(df[column])

In [20]:
# With proper data types, the dataset shrank from over 1.6 GB to about 400 MB.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7396619 entries, 0 to 7396618
Data columns (total 30 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         object        
 1   ADDR_PCT_CD        category      
 2   RPT_DT             datetime64[ns]
 3   KY_CD              category      
 4   OFNS_DESC          category      
 5   PD_CD              category      
 6   PD_DESC            category      
 7   CRM_ATPT_CPTD_CD   category      
 8   LAW_CAT_CD         category      
 9   BORO_NM            category      
 10  LOC_OF_OCCUR_DESC  category      
 11  PREM_TYP_DESC      category      
 12  JURIS_DESC         category      
 13  JURISDICTION_CODE  category      
 14  PARKS_NM           category      
 15  HADEVELOPT         category      
 16  HOUSING_PSA        category      
 17  SUSP_AGE_GROUP     category      
 18  SUSP_RACE          category      
 19  SUSP_SEX           category      
 20  TRANSIT_DISTRICT   categ

In [None]:
df.memory_usage()

Index                      80
CMPLNT_NUM           48294440
ADDR_PCT_CD           6039981
RPT_DT               48294440
KY_CD                 6039957
OFNS_DESC             6039925
PD_CD                12097482
PD_DESC              12097386
CRM_ATPT_CPTD_CD      6036901
LAW_CAT_CD            6036909
BORO_NM               6037005
LOC_OF_OCCUR_DESC     6037005
PREM_TYP_DESC         6039941
JURIS_DESC            6037645
JURISDICTION_CODE     6037645
PARKS_NM             12123162
HADEVELOPT           12086082
HOUSING_PSA          12274434
SUSP_AGE_GROUP        6037005
SUSP_RACE             6037013
SUSP_SEX              6036901
TRANSIT_DISTRICT      6037221
Latitude             24147220
Longitude            24147220
PATROL_BORO           6037189
STATION_NAME         12086818
VIC_AGE_GROUP         6037005
VIC_RACE              6037013
VIC_SEX               6036997
CMPLNT_FR            48294440
CMPLNT_TO            48294440
dtype: int64
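
Note that `memory_usage()` without arguments is shallow: for `object` columns it counts only the 8-byte pointers, not the Python strings they point to. Passing `deep=True` includes the strings, which makes the saving from categoricals visible. A sketch on a synthetic column (the borough names are repeated artificially; the sizes are not those of the real dataset):

```python
import pandas as pd

# Synthetic column: 100,000 strings drawn from four distinct values
s_obj = pd.Series(["BROOKLYN", "MANHATTAN", "BRONX", "QUEENS"] * 25000, dtype="object")
s_cat = s_obj.astype("category")

# deep=True accounts for the string payloads, not just the pointers
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```

The categorical stores each distinct string once plus a small integer code per row, so its footprint grows with the number of rows far more slowly than the object column's.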

In [21]:
df.dtypes

CMPLNT_NUM                   object
ADDR_PCT_CD                category
RPT_DT               datetime64[ns]
KY_CD                      category
OFNS_DESC                  category
PD_CD                      category
PD_DESC                    category
CRM_ATPT_CPTD_CD           category
LAW_CAT_CD                 category
BORO_NM                    category
LOC_OF_OCCUR_DESC          category
PREM_TYP_DESC              category
JURIS_DESC                 category
JURISDICTION_CODE          category
PARKS_NM                   category
HADEVELOPT                 category
HOUSING_PSA                category
SUSP_AGE_GROUP             category
SUSP_RACE                  category
SUSP_SEX                   category
TRANSIT_DISTRICT           category
Latitude                    float32
Longitude                   float32
PATROL_BORO                category
STATION_NAME               category
VIC_AGE_GROUP              category
VIC_RACE                   category
VIC_SEX                    c

In [48]:
# Find unique values and maximum length of various columns
# We mainly use this to specify the max length of a varchar 
# data type in MySQL
for column in df.columns.values:
    datatype = df[column].dtype.name
    unique_values = len(df[column].value_counts())
    print(column, '\t', datatype, '\t', unique_values)
    if datatype == 'object' or datatype =='category':
        m = df[column].str.len().max()
        print("Max length:", m)


CMPLNT_NUM 	 object 	 7373156
Max length: 9
ADDR_PCT_CD 	 category 	 78
Max length: 3.0
RPT_DT 	 datetime64[ns] 	 5479
KY_CD 	 category 	 74
Max length: 3
OFNS_DESC 	 category 	 71
Max length: 36.0
PD_CD 	 category 	 432
Max length: 3.0
PD_DESC 	 category 	 422
Max length: 71.0
CRM_ATPT_CPTD_CD 	 category 	 2
Max length: 9.0
LAW_CAT_CD 	 category 	 3
Max length: 11
BORO_NM 	 category 	 5
Max length: 13.0
LOC_OF_OCCUR_DESC 	 category 	 5
Max length: 11.0
PREM_TYP_DESC 	 category 	 74
Max length: 28.0
JURIS_DESC 	 category 	 25
Max length: 35
JURISDICTION_CODE 	 category 	 25
Max length: 2.0
PARKS_NM 	 category 	 1205
Max length: 83.0
HADEVELOPT 	 category 	 279
Max length: 43.0
HOUSING_PSA 	 category 	 5102
Max length: 21.0
SUSP_AGE_GROUP 	 category 	 5
Max length: 5.0
SUSP_RACE 	 category 	 6
Max length: 30.0
SUSP_SEX 	 category 	 2
Max length: 1.0
TRANSIT_DISTRICT 	 category 	 12
Max length: 2.0
Latitude 	 float32 	 64015
Longitude 	 float32 	 47959
PATROL_BORO 	 category 	 8
Max leng

In [98]:
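# KY_CD holds strings (everything was loaded as text), so compare with '105', not 105.
# Assign through .loc: df[mask].OFNS_DESC = ... writes to a temporary copy and leaves df unchanged.
# OFNS_DESC is categorical here, so each description must already be one of its categories.
df.loc[df.KY_CD == '105', 'OFNS_DESC'] = 'ROBBERY'
df.loc[df.KY_CD == '106', 'OFNS_DESC'] = 'FELONY ASSAULT'
df.loc[df.KY_CD == '107', 'OFNS_DESC'] = 'BURGLARY'
df.loc[df.KY_CD == '360', 'OFNS_DESC'] = 'LOITERING FOR DRUG PURPOSES'
df.loc[df.KY_CD == '361', 'OFNS_DESC'] = 'OFF. AGNST PUB ORD SENSBLTY &'
df.loc[df.KY_CD == '364', 'OFNS_DESC'] = 'AGRICULTURE & MRKTS LAW-UNCLASSIFIED'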



## Storing in a SQLite Database
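
This section has no code in the notebook. As a minimal sketch of what storing the data in SQLite could look like, the example below writes a tiny hypothetical sample to an in-memory database with `DataFrame.to_sql` and reads an aggregate back. The sample rows, the table name `nypd`, and the in-memory connection are all assumptions made for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical sample with a few of the dataset's columns
sample = pd.DataFrame({
    "CMPLNT_NUM": ["100000001", "100000002", "100000003"],
    "BORO_NM": ["BROOKLYN", "QUEENS", "BROOKLYN"],
    "Latitude": [40.65, 40.73, 40.68],
})

# ":memory:" keeps the database in RAM; use a file path to persist it
con_sqlite = sqlite3.connect(":memory:")
sample.to_sql("nypd", con_sqlite, if_exists="replace", index=False)

# Read an aggregate back to confirm the rows were stored
counts = pd.read_sql(
    "SELECT BORO_NM, COUNT(*) AS n FROM nypd GROUP BY BORO_NM ORDER BY BORO_NM",
    con_sqlite,
)
con_sqlite.close()
print(counts)
```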

## Storing in a MySQL database

In [None]:
!sudo pip3 install -U -q PyMySQL sqlalchemy sql_magic

In [84]:
import os
from sqlalchemy import create_engine

# Read the password from the environment instead of hardcoding it in the notebook
conn_string = 'mysql+pymysql://{user}:{password}@{host}/?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org', 
    user = 'root',
    password = os.environ['MYSQL_PASSWORD'])

engine = create_engine(conn_string)
con = engine.connect()
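
A connection string built with `format` breaks if the password contains characters such as `@` or `/`, because they have a meaning inside a URL. A hedged sketch of a safer construction follows; `build_conn_string` is a hypothetical helper, and the host `db.example.org` and fallback password are placeholders.

```python
import os
from urllib.parse import quote_plus

def build_conn_string(user, password, host):
    """Hypothetical helper: URL-encode the credentials before building the URL."""
    return 'mysql+pymysql://{user}:{password}@{host}/?charset=utf8mb4'.format(
        user=quote_plus(user),
        password=quote_plus(password),
        host=host,
    )

# Prefer an environment variable; the fallback value is a placeholder for the demo only
password = os.environ.get('MYSQL_PASSWORD', 'p@ss/word')
conn_string_example = build_conn_string('root', password, 'db.example.org')
print(conn_string_example.split('@')[-1])  # print only the host part, not the secret
```

The resulting string can be passed to `create_engine` in the same way as in the cell above.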

In [63]:
# Query to create a database
db_name = 'nypd'

sql = f"DROP DATABASE IF EXISTS {db_name}"
engine.execute(sql)

# Create a database
sql = f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8mb4'"
engine.execute(sql)


<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b47575690>

In [86]:

# And let's switch to the database
sql = f"USE {db_name}"
engine.execute(sql)


<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1ae7c8f950>

In [50]:
# In principle, we can let Pandas create the table, but we want to be a bit more precise
# with the data types, and we want to add documentation for each column
# from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i

create_table_sql = '''
CREATE TABLE nypd (
  CMPLNT_NUM bigint(20),
  CMPLNT_FR datetime,
  CMPLNT_TO datetime,
  RPT_DT date,
  KY_CD char(3),
  OFNS_DESC varchar(60),
  PD_CD char(3),
  PD_DESC varchar(75),
  CRM_ATPT_CPTD_CD enum('COMPLETED','ATTEMPTED'),
  LAW_CAT_CD enum('FELONY','MISDEMEANOR','VIOLATION'),
  JURISDICTION_CODE char(2),
  JURIS_DESC varchar(60),
  BORO_NM enum('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN ISLAND'),
  ADDR_PCT_CD char(3),
  STATION_NAME varchar(60),
  LOC_OF_OCCUR_DESC enum('FRONT OF','INSIDE','OPPOSITE OF','OUTSIDE','REAR OF'),
  PATROL_BORO enum('PATROL BORO BRONX', 'PATROL BORO BKLYN SOUTH','PATROL BORO BKLYN NORTH','PATROL BORO MAN SOUTH','PATROL BORO MAN NORTH','PATROL BORO QUEENS NORTH','PATROL BORO QUEENS SOUTH','PATROL BORO STATEN ISLAND'),
  PREM_TYP_DESC varchar(60),
  PARKS_NM varchar(255),
  HADEVELOPT varchar(60),
  TRANSIT_DISTRICT char(2),
  HOUSING_PSA varchar(60),
  SUSP_RACE enum('BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  VIC_RACE enum('BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  SUSP_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  VIC_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  SUSP_SEX enum('M', 'F'),
  VIC_SEX enum('M', 'F', 'E', 'D'),
  Latitude double,
  Longitude double,
  PRIMARY KEY (CMPLNT_NUM)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
engine.execute(create_table_sql)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b6738ff50>

In [None]:
# Insert the data into the table in batches
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    # print("Batch:",i)
    # continue # Cannot execute this on Travis
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd', 
        schema = db_name, 
        con = engine,
        if_exists = 'append',
        index = False, 
        chunksize = 1000)
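
The batching arithmetic in the loop above (`len(df) // batchsize + 1`) produces one trailing empty batch whenever the number of rows is an exact multiple of the batch size. That is harmless, but the slice bounds can also be generated directly. `batch_bounds` below is a hypothetical helper written for illustration:

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) pairs that cover range(n_rows) without empty batches."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# Small numbers to make the behavior easy to check by eye
print(list(batch_bounds(10, 4)))
print(list(batch_bounds(8, 4)))
```

Each pair can be used as `df[start:end]` in place of the manual `start`/`end` computation.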

In [65]:
engine.execute("CREATE INDEX ix_lat ON nypd.nypd(Latitude)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b52931ed0>

In [66]:
engine.execute("CREATE INDEX ix_lon ON nypd.nypd(Longitude)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b64899fd0>

In [72]:
engine.execute("CREATE INDEX ix_LAW_CAT_CD ON nypd.nypd(LAW_CAT_CD(11))")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b0ddd6310>

In [73]:
engine.execute("CREATE INDEX ix_BORO_NM ON nypd.nypd(BORO_NM(13))")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b6aa033d0>

In [74]:
engine.execute("CREATE INDEX ix_OFNS_DESC ON nypd.nypd(OFNS_DESC(36))")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b5853eed0>

In [70]:
engine.execute("CREATE INDEX ix_RPT_DT ON nypd.nypd(RPT_DT)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b662e5990>

In [71]:
engine.execute("CREATE INDEX ix_CMPLNT_FR ON nypd.nypd(CMPLNT_FR)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b3c215d50>

In [76]:
engine.execute("CREATE INDEX ix_CMPLNT_NUM ON nypd.nypd(CMPLNT_NUM(20))")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f1b2446fc10>

## TODO

### Add the penal code data as a separate table

`!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/files/65f25845-1551-4d21-91dc-869c977cd93d?download=true&filename=PDCode_PenalLaw.xlsx' -o PDCode_PenalLaw.xlsx`

### Examine whether to normalize 

The fields 

PD_CD, PD_DESC    
KY_CD, OFNS_DESC     
PREM_TYP_DESC    
HADEVELOPT    
PARKS_NM                     

would be better off as foreign keys or enums. They take too much space as strings.
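
As a sketch of what normalizing one of these pairs could look like in Pandas, the example below splits a code/description pair into a lookup table and a slimmer main table, then joins them back. The codes and descriptions are placeholders, not real PD codes, and the sketch assumes each code maps to exactly one description (which the data quality queries in this notebook suggest is not always true).

```python
import pandas as pd

# Placeholder rows: the codes and descriptions are made up for illustration
complaints = pd.DataFrame({
    "CMPLNT_NUM": ["1", "2", "3", "4"],
    "PD_CD": ["101", "101", "109", "109"],
    "PD_DESC": ["DESC A", "DESC A", "DESC B", "DESC B"],
})

# Lookup table: one row per distinct (code, description) pair
lookup = (complaints[["PD_CD", "PD_DESC"]]
          .drop_duplicates()
          .sort_values("PD_CD")
          .reset_index(drop=True))

# Main table keeps only the code, which acts as the foreign key
facts = complaints.drop(columns=["PD_DESC"])

# Joining on the code restores the original description column
restored = facts.merge(lookup, on="PD_CD", how="left")
print(lookup)
```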

In [None]:
# Data quality issues to fix: KY_CD, OFNS_DESC

query = '''
SELECT KY_CD, OFNS_DESC, COUNT(*)
FROM nypd WHERE KY_CD IN (
SELECT KY_CD
FROM nypd
WHERE OFNS_DESC IS NOT NULL
GROUP BY KY_CD
HAVING COUNT(DISTINCT OFNS_DESC)>1)
GROUP BY KY_CD, OFNS_DESC
'''

# Use a separate name so that the main DataFrame df is not overwritten
ky_cd_issues = pd.read_sql(query, con=engine)
ky_cd_issues
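
The same check can be done in Pandas without going through the database. The sketch below mirrors the SQL query above on a hypothetical sample: it finds codes with more than one distinct description, then counts rows per (code, description) pair. The value `'ROBBERY (VARIANT)'` is invented to create a conflict.

```python
import pandas as pd

# Hypothetical sample; 'ROBBERY (VARIANT)' is a made-up conflicting description
sample = pd.DataFrame({
    "KY_CD": ["105", "105", "105", "106", "107"],
    "OFNS_DESC": ["ROBBERY", "ROBBERY", "ROBBERY (VARIANT)", "FELONY ASSAULT", "BURGLARY"],
})

# nunique() ignores missing values, like COUNT(DISTINCT ...) ignores NULLs
n_desc = sample.groupby("KY_CD")["OFNS_DESC"].nunique()
ambiguous = n_desc[n_desc > 1].index

# Count the rows for each (code, description) pair among the ambiguous codes
report = (sample[sample["KY_CD"].isin(ambiguous)]
          .groupby(["KY_CD", "OFNS_DESC"])
          .size()
          .reset_index(name="n"))
print(report)
```

One difference from the SQL version: `groupby` drops rows whose description is missing by default, while the SQL `GROUP BY` would report them as a separate `NULL` group.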

In [None]:
# Data quality issues to fix: PD_CD, PD_DESC

query = '''
SELECT PD_CD, PD_DESC, COUNT(*)
FROM nypd WHERE PD_DESC IN (
SELECT PD_DESC
FROM nypd
WHERE PD_DESC IS NOT NULL
GROUP BY PD_DESC
HAVING COUNT(DISTINCT PD_CD)>1)
GROUP BY PD_CD, PD_DESC
'''

#df = pd.read_sql(query, con=engine)
#df