## NYPD Dataset

Dataset description at 
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i



| Column | Description |
|--------|-------------------|
| CMPLNT_NUM |  Randomly generated persistent ID for each complaint  |  
| ADDR_PCT_CD |  The precinct in which the incident occurred |  
| BORO_NM |  The name of the borough in which the incident occurred |  
| CMPLNT_FR_DT |  Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |  
| CMPLNT_FR_TM |  Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |  
| CMPLNT_TO_DT |  Ending date of occurrence for the reported event, if exact time of occurrence is unknown |  
| CMPLNT_TO_TM |  Ending time of occurrence for the reported event, if exact time of occurrence is unknown |  
| CRM_ATPT_CPTD_CD |  Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |  
| HADEVELOPT |  Name of NYCHA housing development of occurrence, if applicable |  
| HOUSING_PSA |  Development Level Code |  
| JURISDICTION_CODE |  Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |  
| JURIS_DESC |  Description of the jurisdiction code |  
| KY_CD |  Three digit offense classification code |  
| LAW_CAT_CD |  Level of offense: felony, misdemeanor, violation  |  
| LOC_OF_OCCUR_DESC |  Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |  
| OFNS_DESC |  Description of offense corresponding with key code |  
| PARKS_NM |  Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |  
| PATROL_BORO |  The name of the patrol borough in which the incident occurred |  
| PD_CD |  Three digit internal classification code (more granular than Key Code) |  
| PD_DESC |  Description of internal classification corresponding with PD code (more granular than Offense Description) |  
| PREM_TYP_DESC |  Specific description of premises; grocery store, residence, street, etc. |  
| RPT_DT |  Date event was reported to police  |  
| STATION_NAME |  Transit station name |  
| SUSP_AGE_GROUP |  Suspect’s Age Group |  
| SUSP_RACE |  Suspect’s Race Description |  
| SUSP_SEX |  Suspect’s Sex Description |  
| TRANSIT_DISTRICT |  Transit district in which the offense occurred. |  
| VIC_AGE_GROUP |  Victim’s Age Group |  
| VIC_RACE |  Victim’s Race Description |  
| VIC_SEX |  Victim’s Sex Description |  
| X_COORD_CD |  X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Y_COORD_CD |  Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Latitude |  Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)  |  
| Longitude |  Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |


In [2]:
import pandas as pd
import numpy as np

In [1]:
# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl -v 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv


In [4]:
# We load the local copy downloaded above. To load directly from the URL instead,
# pass the following url to pd.read_csv:
# url = 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD'

# We load everything as object/string, because some columns (e.g., some IDs)
# are otherwise inferred as decimals, and restoring the original values is messy.
# We do all the type conversions ourselves later on.
df = pd.read_csv('nypd.csv', low_memory = True, dtype='object')
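
To illustrate why everything is read as strings, here is a small sketch on a made-up two-column CSV (the column names and values are illustrative assumptions, not taken from the dataset). When a numeric-looking column contains a missing value, pandas infers `float64` by default, so a code such as `101` turns into `101.0`; with `dtype='object'` the original text is preserved.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the second row has a missing code
csv_text = "ID,CODE\n1,101\n2,\n3,344\n"

# Default type inference: the missing value forces CODE to float64
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred['CODE'].tolist())

# Reading as object keeps the codes as the original strings
raw = pd.read_csv(io.StringIO(csv_text), dtype='object')
print(raw['CODE'].tolist())
```

The trade-off is that every later conversion (dates, coordinates, categories) has to be done explicitly, which is what the cleaning steps do.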

In [5]:
len(df)

6036805

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6036805 entries, 0 to 6036804
Data columns (total 35 columns):
CMPLNT_NUM           object
CMPLNT_FR_DT         object
CMPLNT_FR_TM         object
CMPLNT_TO_DT         object
CMPLNT_TO_TM         object
ADDR_PCT_CD          object
RPT_DT               object
KY_CD                object
OFNS_DESC            object
PD_CD                object
PD_DESC              object
CRM_ATPT_CPTD_CD     object
LAW_CAT_CD           object
BORO_NM              object
LOC_OF_OCCUR_DESC    object
PREM_TYP_DESC        object
JURIS_DESC           object
JURISDICTION_CODE    object
PARKS_NM             object
HADEVELOPT           object
HOUSING_PSA          object
X_COORD_CD           object
Y_COORD_CD           object
SUSP_AGE_GROUP       object
SUSP_RACE            object
SUSP_SEX             object
TRANSIT_DISTRICT     object
Latitude             object
Longitude            object
Lat_Lon              object
PATROL_BORO          object
STATION_NAME       

## Data Cleaning

In [7]:
# A few rows contain years such as 1015, 1016, ... that trigger an error during date conversion.
# We replace all years written as 10XX with 20XX.
# Note the use of regular expressions (raw strings keep the \d escapes intact).
# We assign the result back instead of using inplace=True on df.COLUMN,
# which may not modify df in recent versions of Pandas.
df['CMPLNT_FR_DT'] = df['CMPLNT_FR_DT'].replace(to_replace=r'(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True)
df['CMPLNT_TO_DT'] = df['CMPLNT_TO_DT'].replace(to_replace=r'(\d\d)/(\d\d)/10(\d\d)', value=r'\1/\2/20\3', regex=True)
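
As a quick check of what the pattern does and does not touch, the following sketch applies the same regular expression with Python's `re` module to a few made-up date strings (the sample values are assumptions for illustration; the real replacement runs through Pandas).

```python
import re

# Pattern and replacement copied from the cell above
pattern = r'(\d\d)/(\d\d)/10(\d\d)'
replacement = r'\1/\2/20\3'

# Made-up samples: one broken year and two valid ones
samples = ['03/15/1015', '12/31/2010', '07/04/1998']
fixed = [re.sub(pattern, replacement, s) for s in samples]
print(fixed)
```

Only the date whose year starts with `10` is rewritten; years such as 2010 or 1998 do not match, because the pattern requires `10` immediately after the second slash.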

In [8]:
# Similarly, a few times are written as 24:00:00, which also triggers errors.
# We map them to 00:00:00 (note that this keeps the same calendar date).
df['CMPLNT_FR_TM'] = df['CMPLNT_FR_TM'].replace(to_replace='24:00:00', value='00:00:00')
df['CMPLNT_TO_TM'] = df['CMPLNT_TO_TM'].replace(to_replace='24:00:00', value='00:00:00')
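
The sketch below shows, with the standard library only, why `24:00:00` fails to parse and what the replacement implies. The helper `parse_complaint_time` and its `roll_over` flag are hypothetical names introduced here; whether `24:00:00` in this dataset means the start of the same day or midnight of the next day is an assumption the notebook does not settle.

```python
from datetime import datetime, timedelta

FORMAT = '%m/%d/%Y %H:%M:%S'

def parse_complaint_time(date_text, time_text, roll_over=False):
    """Parse date and time strings, handling the non-standard 24:00:00.

    roll_over=False mimics the notebook (24:00:00 -> 00:00:00, same date);
    roll_over=True is an alternative that moves to midnight of the next day.
    """
    is_24 = time_text == '24:00:00'
    if is_24:
        time_text = '00:00:00'
    parsed = datetime.strptime(date_text + ' ' + time_text, FORMAT)
    if is_24 and roll_over:
        parsed += timedelta(days=1)
    return parsed

# The raw value cannot be parsed because %H only accepts 00-23
try:
    datetime.strptime('12/31/2015 24:00:00', FORMAT)
    raw_parse_failed = False
except ValueError:
    raw_parse_failed = True

same_day = parse_complaint_time('12/31/2015', '24:00:00')
next_day = parse_complaint_time('12/31/2015', '24:00:00', roll_over=True)
print(raw_parse_failed, same_day, next_day)
```

The two interpretations differ by a full day, so the choice matters for any analysis that relies on the exact date of the few affected rows.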

In [9]:
# Convert the two separate date and time columns into single datetime columns
df['CMPLNT_FR'] = pd.to_datetime(df.CMPLNT_FR_DT + ' ' + df.CMPLNT_FR_TM, format='%m/%d/%Y %H:%M:%S', cache=True)
df['CMPLNT_TO'] = pd.to_datetime(df.CMPLNT_TO_DT + ' ' + df.CMPLNT_TO_TM, format='%m/%d/%Y %H:%M:%S', cache=True)

# Convert RPT_DT to date
df.RPT_DT = pd.to_datetime(df.RPT_DT, format="%m/%d/%Y", cache=True)

In [10]:
# These columns are redundant
to_drop = ['CMPLNT_FR_DT','CMPLNT_TO_DT','CMPLNT_FR_TM','CMPLNT_TO_TM','Lat_Lon','X_COORD_CD','Y_COORD_CD']
# We created the CMPLNT_FR and CMPLNT_TO columns
# We have the longitude and latitude so the other coordinates are not needed
df.drop(to_drop, axis='columns', inplace=True)

In [11]:
# MISD means MISDEMEANOR
df['LAW_CAT_CD'] = df['LAW_CAT_CD'].replace(to_replace='MISD', value='MISDEMEANOR')

In [12]:
# In some columns we have values that indicate unknown values.
# The columns also have NULL values, which indicate the same.
# We replace these values with NULL, for consistency

# Replace ' ' with NULL
df['LOC_OF_OCCUR_DESC'] = df['LOC_OF_OCCUR_DESC'].replace(to_replace=' ', value=np.nan)
# U is unknown, same as NULL.
df['VIC_SEX'] = df['VIC_SEX'].replace(to_replace='U', value=np.nan)
df['SUSP_SEX'] = df['SUSP_SEX'].replace(to_replace='U', value=np.nan)
# Very small number of OTHER values
df['SUSP_RACE'] = df['SUSP_RACE'].replace(to_replace='OTHER', value=np.nan)
df['VIC_RACE'] = df['VIC_RACE'].replace(to_replace='OTHER', value=np.nan)
# UNKNOWN is the same as NULL
df['SUSP_RACE'] = df['SUSP_RACE'].replace(to_replace='UNKNOWN', value=np.nan)
df['VIC_RACE'] = df['VIC_RACE'].replace(to_replace='UNKNOWN', value=np.nan)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6036805 entries, 0 to 6036804
Data columns (total 30 columns):
CMPLNT_NUM           object
ADDR_PCT_CD          object
RPT_DT               datetime64[ns]
KY_CD                object
OFNS_DESC            object
PD_CD                object
PD_DESC              object
CRM_ATPT_CPTD_CD     object
LAW_CAT_CD           object
BORO_NM              object
LOC_OF_OCCUR_DESC    object
PREM_TYP_DESC        object
JURIS_DESC           object
JURISDICTION_CODE    object
PARKS_NM             object
HADEVELOPT           object
HOUSING_PSA          object
SUSP_AGE_GROUP       object
SUSP_RACE            object
SUSP_SEX             object
TRANSIT_DISTRICT     object
Latitude             object
Longitude            object
PATROL_BORO          object
STATION_NAME         object
VIC_AGE_GROUP        object
VIC_RACE             object
VIC_SEX              object
CMPLNT_FR            datetime64[ns]
CMPLNT_TO            datetime64[ns]
dtypes: datetime64[ns](

In [14]:
# Both columns have a lot of noisy entries. We keep only the dominant groups, and also define an order
df.SUSP_AGE_GROUP = pd.Categorical(df.SUSP_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
df.VIC_AGE_GROUP = pd.Categorical(df.VIC_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
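
A minimal sketch of what this conversion does, using a made-up sample (the noisy values `'-974'` and `'UNKNOWN'` are invented for illustration): any value that is not in the `categories` list becomes missing, and the declared order makes comparisons between groups possible.

```python
import pandas as pd

AGE_GROUPS = ['<18', '18-24', '25-44', '45-64', '65+']

# Made-up sample with two noisy entries ('-974' and 'UNKNOWN')
raw_ages = pd.Series(['25-44', '<18', '-974', 'UNKNOWN', '65+'])
ages = pd.Series(pd.Categorical(raw_ages, ordered=True, categories=AGE_GROUPS))

# Values outside the category list are now missing
print(ages.isna().sum())

# The order allows comparisons and min/max on the groups
younger_than_25 = ages < '25-44'
print(younger_than_25.tolist())
print(ages.min(), ages.max())
```

Because the noisy entries are dropped silently, it is worth checking `value_counts()` on the raw column first to confirm that only the dominant groups are kept on purpose.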

In [15]:
df.Latitude = pd.to_numeric(df.Latitude, downcast='float')
df.Longitude  = pd.to_numeric(df.Longitude, downcast='float')

## Data exploration

In this part we inspect the distinct values that appear in each column. When we detect noisy values, we remove them. Many of the operations in the 'Data Cleaning' section above are the result of observations made here.

In [16]:
# Find the number of unique values in each column
# Variant restricted to object and categorical columns:
# df.describe(include = [object, 'category']).T['unique']
unique = df.describe(include = 'all').T['unique'].sort_values()

In [17]:
unique

SUSP_SEX                   2
CRM_ATPT_CPTD_CD           2
LAW_CAT_CD                 3
VIC_SEX                    4
VIC_AGE_GROUP              5
SUSP_AGE_GROUP             5
BORO_NM                    5
LOC_OF_OCCUR_DESC          5
SUSP_RACE                  6
VIC_RACE                   6
PATROL_BORO                8
TRANSIT_DISTRICT          12
JURISDICTION_CODE         25
JURIS_DESC                25
OFNS_DESC                 70
PREM_TYP_DESC             72
KY_CD                     74
ADDR_PCT_CD               77
HADEVELOPT               279
STATION_NAME             371
PD_DESC                  412
PD_CD                    424
PARKS_NM                1074
RPT_DT                  4383
HOUSING_PSA             4623
CMPLNT_FR            1616072
CMPLNT_TO            1768998
CMPLNT_NUM           6036805
Latitude                 NaN
Longitude                NaN
Name: unique, dtype: object

In [18]:
for column in unique.index:
    if unique[column] < 200:
        print(df[column].value_counts())
        print("=====")

M    1784627
F     576490
Name: SUSP_SEX, dtype: int64
=====
COMPLETED    5932953
ATTEMPTED     103845
Name: CRM_ATPT_CPTD_CD, dtype: int64
=====
MISDEMEANOR    3431733
FELONY         1855915
VIOLATION       749157
Name: LAW_CAT_CD, dtype: int64
=====
F    2354734
M    1961519
E    1012663
D     707582
Name: VIC_SEX, dtype: int64
=====
25-44    1949740
45-64    1010708
18-24     622503
<18       286969
65+       200949
Name: VIC_AGE_GROUP, dtype: int64
=====
25-44    681843
18-24    277584
45-64    240263
<18       77784
65+       18845
Name: SUSP_AGE_GROUP, dtype: int64
=====
BROOKLYN         1798175
MANHATTAN        1443245
BRONX            1307343
QUEENS           1191688
STATEN ISLAND     286135
Name: BORO_NM, dtype: int64
=====
INSIDE         3023004
FRONT OF       1402959
OPPOSITE OF     165379
REAR OF         132249
OUTSIDE           3136
Name: LOC_OF_OCCUR_DESC, dtype: int64
=====
BLACK                             1093935
WHITE HISPANIC                     496597
WHITE         

341    986336
578    736317
344    624556
109    516712
351    510927
361    327564
235    319020
105    227065
106    224684
107    216357
126    137600
359    116847
110    113943
121     92337
347     84654
236     81985
352     73685
348     72003
117     71196
112     65931
113     60594
118     60390
233     54373
340     38200
232     22155
358     20031
353     17286
104     16683
343     16567
360     15978
        ...  
675      1558
342      1002
346       991
572       932
363       910
356       762
120       458
230       372
677       285
571       231
237       186
115       158
103       137
122       125
234       112
354       102
102        99
455        63
119        52
366        51
349        51
676        35
672        23
685        19
460        16
881        14
357        11
123         5
577         3
362         3
Name: KY_CD, Length: 74, dtype: int64
=====
75     195188
43     160373
44     150806
40     143090
14     139740
46     128289
52     126613
73  

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6036805 entries, 0 to 6036804
Data columns (total 30 columns):
CMPLNT_NUM           object
ADDR_PCT_CD          object
RPT_DT               datetime64[ns]
KY_CD                object
OFNS_DESC            object
PD_CD                object
PD_DESC              object
CRM_ATPT_CPTD_CD     object
LAW_CAT_CD           object
BORO_NM              object
LOC_OF_OCCUR_DESC    object
PREM_TYP_DESC        object
JURIS_DESC           object
JURISDICTION_CODE    object
PARKS_NM             object
HADEVELOPT           object
HOUSING_PSA          object
SUSP_AGE_GROUP       category
SUSP_RACE            object
SUSP_SEX             object
TRANSIT_DISTRICT     object
Latitude             float32
Longitude            float32
PATROL_BORO          object
STATION_NAME         object
VIC_AGE_GROUP        category
VIC_RACE             object
VIC_SEX              object
CMPLNT_FR            datetime64[ns]
CMPLNT_TO            datetime64[ns]
dtypes: category(

In [20]:
# All columns, except for the dates and spatial coordinates, are categorical.
# Columns with fewer than a few thousand unique values are good candidates
# for ENUMs in the database, given that the dataset is static.
# Also, in Pandas the internal representation becomes much more efficient,
# as Categoricals are stored as integer codes and not as strings
for column in unique.index:
    if column == 'RPT_DT':
        continue
    if unique[column] < 10000:
        df[column] = pd.Categorical(df[column])

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6036805 entries, 0 to 6036804
Data columns (total 30 columns):
CMPLNT_NUM           object
ADDR_PCT_CD          category
RPT_DT               datetime64[ns]
KY_CD                category
OFNS_DESC            category
PD_CD                category
PD_DESC              category
CRM_ATPT_CPTD_CD     category
LAW_CAT_CD           category
BORO_NM              category
LOC_OF_OCCUR_DESC    category
PREM_TYP_DESC        category
JURIS_DESC           category
JURISDICTION_CODE    category
PARKS_NM             category
HADEVELOPT           category
HOUSING_PSA          category
SUSP_AGE_GROUP       category
SUSP_RACE            category
SUSP_SEX             category
TRANSIT_DISTRICT     category
Latitude             float32
Longitude            float32
PATROL_BORO          category
STATION_NAME         category
VIC_AGE_GROUP        category
VIC_RACE             category
VIC_SEX              category
CMPLNT_FR            datetime64[ns]
CMPLNT_TO

In [22]:
df.memory_usage()

Index                      80
CMPLNT_NUM           48294440
ADDR_PCT_CD           6039981
RPT_DT               48294440
KY_CD                 6039957
OFNS_DESC             6039925
PD_CD                12097482
PD_DESC              12097386
CRM_ATPT_CPTD_CD      6036901
LAW_CAT_CD            6036909
BORO_NM               6037005
LOC_OF_OCCUR_DESC     6037005
PREM_TYP_DESC         6039941
JURIS_DESC            6037645
JURISDICTION_CODE     6037645
PARKS_NM             12123162
HADEVELOPT           12086082
HOUSING_PSA          12274434
SUSP_AGE_GROUP        6037005
SUSP_RACE             6037013
SUSP_SEX              6036901
TRANSIT_DISTRICT      6037221
Latitude             24147220
Longitude            24147220
PATROL_BORO           6037189
STATION_NAME         12086818
VIC_AGE_GROUP         6037005
VIC_RACE              6037013
VIC_SEX               6036997
CMPLNT_FR            48294440
CMPLNT_TO            48294440
dtype: int64
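
Note that `memory_usage()` without arguments counts only the 8-byte pointer per row for object columns, which is why CMPLNT_NUM shows 48294440 bytes (6036805 rows × 8). The sketch below, on a made-up column of repeated labels, compares the shallow figure, the `deep=True` figure that also counts the string contents, and the categorical representation.

```python
import pandas as pd

# Made-up column: 300,000 rows drawn from three repeated labels
labels = pd.Series(['FELONY', 'MISDEMEANOR', 'VIOLATION'] * 100_000, dtype='object')

shallow = labels.memory_usage(index=False)            # pointers only
deep = labels.memory_usage(index=False, deep=True)    # pointers + string contents
as_category = labels.astype('category').memory_usage(index=False, deep=True)

print(shallow, deep, as_category)
```

The savings from converting to categoricals are therefore larger than a shallow before/after comparison suggests.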

In [23]:
df.dtypes

CMPLNT_NUM                   object
ADDR_PCT_CD                category
RPT_DT               datetime64[ns]
KY_CD                      category
OFNS_DESC                  category
PD_CD                      category
PD_DESC                    category
CRM_ATPT_CPTD_CD           category
LAW_CAT_CD                 category
BORO_NM                    category
LOC_OF_OCCUR_DESC          category
PREM_TYP_DESC              category
JURIS_DESC                 category
JURISDICTION_CODE          category
PARKS_NM                   category
HADEVELOPT                 category
HOUSING_PSA                category
SUSP_AGE_GROUP             category
SUSP_RACE                  category
SUSP_SEX                   category
TRANSIT_DISTRICT           category
Latitude                    float32
Longitude                   float32
PATROL_BORO                category
STATION_NAME               category
VIC_AGE_GROUP              category
VIC_RACE                   category
VIC_SEX                    c

In [24]:
# Find unique values and maximum length of various columns
# We mainly use this to specify the max length of a varchar 
# data type in MySQL
for column in df.columns.values:
    datatype = df[column].dtype.name
    unique_values = len(df[column].value_counts())
    print(column, '\t', datatype, '\t', unique_values)
    if datatype == 'object' or datatype =='category':
        m = max([len(str(x)) for x in df[column].value_counts().index.values])
        print("Max length:", m)


CMPLNT_NUM 	 object 	 6036805
Max length: 9
ADDR_PCT_CD 	 category 	 77
Max length: 3
RPT_DT 	 datetime64[ns] 	 4383
KY_CD 	 category 	 74
Max length: 3
OFNS_DESC 	 category 	 70
Max length: 36
PD_CD 	 category 	 424
Max length: 3
PD_DESC 	 category 	 412
Max length: 60
CRM_ATPT_CPTD_CD 	 category 	 2
Max length: 9
LAW_CAT_CD 	 category 	 3
Max length: 11
BORO_NM 	 category 	 5
Max length: 13
LOC_OF_OCCUR_DESC 	 category 	 5
Max length: 11
PREM_TYP_DESC 	 category 	 72
Max length: 28
JURIS_DESC 	 category 	 25
Max length: 35
JURISDICTION_CODE 	 category 	 25
Max length: 2
PARKS_NM 	 category 	 1074
Max length: 83
HADEVELOPT 	 category 	 279
Max length: 43
HOUSING_PSA 	 category 	 4623
Max length: 5
SUSP_AGE_GROUP 	 category 	 5
Max length: 5
SUSP_RACE 	 category 	 6
Max length: 30
SUSP_SEX 	 category 	 2
Max length: 1
TRANSIT_DISTRICT 	 category 	 12
Max length: 2
Latitude 	 float32 	 63462
Longitude 	 float32 	 47588
PATROL_BORO 	 category 	 8
Max length: 25
STATION_NAME 	 category 	 

In [None]:
df.dtypes

## Storing in a SQLite Database

In [1]:
!rm -f nypd.db
!rm -f nypd.csv

In [None]:
import sqlite3

# We start by creating our database
con = sqlite3.connect('nypd.db')

In [None]:
# Create a table
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
#
# We perform the operation in batches, so that we can track its progress
#
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    print("Batch:",i)
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd', 
        con = con,
        if_exists = 'append',
        index = False, 
        chunksize = 1000)
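
One detail of the batching arithmetic: `len(df) // batchsize + 1` yields one extra, empty batch whenever the number of rows is an exact multiple of the batch size. This is harmless, but a ceiling division avoids it. The sketch below uses only the standard library, with a hypothetical `batch_bounds` helper and a throwaway `demo` table (both are assumptions for illustration, not part of the notebook).

```python
import math
import sqlite3

def batch_bounds(n_rows, batch_size):
    """Return (start, end) slices that cover n_rows without an empty batch."""
    n_batches = math.ceil(n_rows / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_rows))
            for i in range(n_batches)]

# Made-up rows standing in for the DataFrame
rows = [(i, f"row-{i}") for i in range(25)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, label TEXT)")

for start, end in batch_bounds(len(rows), 10):
    con.executemany("INSERT INTO demo VALUES (?, ?)", rows[start:end])
    con.commit()  # committing per batch keeps the progress durable

inserted = con.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(batch_bounds(len(rows), 10), inserted)
```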

## Storing in a MySQL database

In [None]:
import os
from sqlalchemy import create_engine

conn_string = 'mysql://{user}:{password}@{host}/?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org', 
    user = 'root',
    password = os.environ['MYSQL_PASSWORD'])

engine = create_engine(conn_string)
con = engine.connect()

In [None]:
# Query to create a database
db_name = 'nypd'

sql = f"DROP DATABASE IF EXISTS {db_name}"
engine.execute(sql)

# Create a database
sql = f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8mb4'"
engine.execute(sql)

# And let's switch to the database
sql = f"USE {db_name}"
engine.execute(sql)


In [None]:
# In principle, we can let Pandas create the table, but we want to be a bit more precise
# with the data types, and we want to add documentation for each column
# from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i

create_table_sql = '''
CREATE TABLE nypd (
  CMPLNT_NUM bigint(20),
  CMPLNT_FR datetime,
  CMPLNT_TO datetime,
  RPT_DT date,
  KY_CD char(3),
  OFNS_DESC varchar(60),
  PD_CD char(3),
  PD_DESC varchar(60),
  CRM_ATPT_CPTD_CD enum('COMPLETED','ATTEMPTED'),
  LAW_CAT_CD enum('FELONY','MISDEMEANOR','VIOLATION'),
  JURISDICTION_CODE char(2),
  JURIS_DESC enum('AMTRACK', 'CONRAIL', 'DEPT OF CORRECTIONS', 'DISTRICT ATTORNEY OFFICE',
             'FIRE DEPT (FIRE MARSHAL)', 'HEALTH & HOSP CORP',  'LONG ISLAND RAILRD', 'METRO NORTH',
             'N.Y. HOUSING POLICE', 'N.Y. POLICE DEPT', 'N.Y. STATE PARKS', 'N.Y. STATE POLICE',
             'N.Y. TRANSIT POLICE', 'NEW YORK CITY SHERIFF OFFICE', 'NYC DEPT ENVIRONMENTAL PROTECTION',
             'NYC PARKS', 'NYS DEPT ENVIRONMENTAL CONSERVATION', 'NYS DEPT TAX AND FINANCE',  'OTHER',
             'POLICE DEPT NYC', 'PORT AUTHORITY', 'SEA GATE POLICE DEPT', 'STATN IS RAPID TRANS', 'TRI-BORO BRDG TUNNL', 'U.S. PARK POLICE'),
  BORO_NM enum('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN ISLAND'),
  ADDR_PCT_CD char(3),
  STATION_NAME varchar(60),
  LOC_OF_OCCUR_DESC enum('FRONT OF','INSIDE','OPPOSITE OF','OUTSIDE','REAR OF'),
  PATROL_BORO enum('PATROL BORO BRONX', 'PATROL BORO BKLYN SOUTH','PATROL BORO BKLYN NORTH','PATROL BORO MAN SOUTH','PATROL BORO MAN NORTH','PATROL BORO QUEENS NORTH','PATROL BORO QUEENS SOUTH','PATROL BORO STATEN ISLAND'),
  PREM_TYP_DESC varchar(60),
  PARKS_NM varchar(255),
  HADEVELOPT varchar(60),
  TRANSIT_DISTRICT char(2),
  HOUSING_PSA char(5),
  SUSP_RACE enum('BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  VIC_RACE enum('BLACK', 'WHITE', 'WHITE HISPANIC', 'ASIAN / PACIFIC ISLANDER', 'BLACK HISPANIC', 'AMERICAN INDIAN/ALASKAN NATIVE'),
  SUSP_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  VIC_AGE_GROUP enum('<18', '18-24',  '25-44', '45-64', '65+'),
  SUSP_SEX enum('M', 'F'),
  VIC_SEX enum('M', 'F', 'E', 'D'),
  Latitude double,
  Longitude double,
  PRIMARY KEY (CMPLNT_NUM)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
'''
engine.execute(create_table_sql)

In [None]:
# Insert the data into the table that we created above
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    print("Batch:",i)
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd', 
        schema = db_name, 
        con = engine,
        if_exists = 'append',
        index = False, 
        chunksize = 1000)

## TODO

### Add indexes in both MySQL and SQLite
    
* CMPLNT_NUM
* Latitude
* Longitude
* LAW_CAT_CD
* BORO_NM
* OFNS_DESC
* RPT_DT
* CMPLNT_FR
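
A possible way to add these indexes in SQLite is sketched below. It runs against an empty in-memory stand-in for the `nypd` table; the column types and the `idx_nypd_<column>` naming scheme are assumptions made for this example.

```python
import sqlite3

# Columns from the TODO list above
INDEXED_COLUMNS = ['CMPLNT_NUM', 'Latitude', 'Longitude', 'LAW_CAT_CD',
                   'BORO_NM', 'OFNS_DESC', 'RPT_DT', 'CMPLNT_FR']

con = sqlite3.connect(':memory:')  # stand-in for nypd.db
# Simplified, empty stand-in for the real table (column types are assumptions)
con.execute('''
    CREATE TABLE nypd (
        CMPLNT_NUM TEXT, Latitude REAL, Longitude REAL, LAW_CAT_CD TEXT,
        BORO_NM TEXT, OFNS_DESC TEXT, RPT_DT TEXT, CMPLNT_FR TEXT
    )
''')

for column in INDEXED_COLUMNS:
    # Hypothetical naming scheme: idx_nypd_<column>
    con.execute(f'CREATE INDEX IF NOT EXISTS idx_nypd_{column.lower()} ON nypd ({column})')

index_names = sorted(row[1] for row in con.execute('PRAGMA index_list(nypd)'))
print(index_names)

# Ask SQLite how it would run a filter on an indexed column
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM nypd WHERE BORO_NM = 'BRONX'").fetchall()
plan_text = ' '.join(str(row[-1]) for row in plan)
print(plan_text)
```

In the MySQL table, CMPLNT_NUM is already indexed through the `PRIMARY KEY` declaration, so only the remaining columns would need explicit `CREATE INDEX` statements there.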

### Add the penal code data as a separate table

`!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/files/65f25845-1551-4d21-91dc-869c977cd93d?download=true&filename=PDCode_PenalLaw.xlsx' -o PDCode_PenalLaw.xlsx`

### Examine whether to normalize 

The fields

* PD_CD, PD_DESC
* KY_CD, OFNS_DESC
* PREM_TYP_DESC
* HADEVELOPT
* PARKS_NM

would be better off as foreign keys or enums, because they take too much space as strings.
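
As a rough sketch of the foreign-key option for the KY_CD, OFNS_DESC pair, the example below builds a lookup table in an in-memory SQLite database. The rows are made up, the `offense` table name is hypothetical, and `MIN()` is only a placeholder rule for codes that carry more than one description; that ambiguity would need to be resolved properly first.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE nypd (CMPLNT_NUM TEXT, KY_CD TEXT, OFNS_DESC TEXT)')
# Made-up rows for illustration
con.executemany('INSERT INTO nypd VALUES (?, ?, ?)', [
    ('1', '105', 'ROBBERY'),
    ('2', '105', 'ROBBERY'),
    ('3', '341', 'PETIT LARCENY'),
    ('4', '341', None),
])

# Lookup table: one row per code; MIN() picks one description per code
con.execute('''
    CREATE TABLE offense AS
    SELECT KY_CD, MIN(OFNS_DESC) AS OFNS_DESC
    FROM nypd
    WHERE OFNS_DESC IS NOT NULL
    GROUP BY KY_CD
''')

# The main table then only needs KY_CD; the description comes from a join
rows = con.execute('''
    SELECT n.CMPLNT_NUM, n.KY_CD, o.OFNS_DESC
    FROM nypd n LEFT JOIN offense o ON n.KY_CD = o.KY_CD
    ORDER BY n.CMPLNT_NUM
''').fetchall()
print(rows)
```

A side effect of the join is that rows with a missing description (complaint 4 here) pick up the description of their code.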

In [None]:
# Data quality issues to fix: KY_CD, OFNS_DESC

query = '''
SELECT KY_CD, OFNS_DESC, COUNT(*)
FROM nypd WHERE KY_CD IN (
SELECT KY_CD
FROM nypd
WHERE OFNS_DESC IS NOT NULL
GROUP BY KY_CD
HAVING COUNT(DISTINCT OFNS_DESC)>1)
GROUP BY KY_CD, OFNS_DESC
'''

#df = pd.read_sql(query, con=engine)
#df
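
To see what this query returns, the sketch below runs it against an in-memory SQLite table with a handful of made-up rows (the codes and descriptions are invented for illustration). Only codes that map to more than one description appear in the result, with a count per spelling.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE nypd (KY_CD TEXT, OFNS_DESC TEXT)')
# Made-up rows: code 'A1' has two spellings, code 'B2' is consistent
con.executemany('INSERT INTO nypd VALUES (?, ?)', [
    ('A1', 'THEFT'), ('A1', 'THEFT'), ('A1', 'THEFT-FRAUD'),
    ('B2', 'ARSON'), ('B2', 'ARSON'),
])

# Same query as in the cell above
query = '''
SELECT KY_CD, OFNS_DESC, COUNT(*)
FROM nypd WHERE KY_CD IN (
SELECT KY_CD
FROM nypd
WHERE OFNS_DESC IS NOT NULL
GROUP BY KY_CD
HAVING COUNT(DISTINCT OFNS_DESC)>1)
GROUP BY KY_CD, OFNS_DESC
'''

# The query has no ORDER BY, so we collect the rows into a set
result = set(con.execute(query).fetchall())
print(result)
```

The counts indicate which spelling dominates for each ambiguous code, which is a reasonable starting point for choosing a canonical description.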

In [None]:
# Data quality issues to fix: PD_CD, PD_DESC

query = '''
SELECT PD_CD, PD_DESC, COUNT(*)
FROM nypd WHERE PD_DESC IN (
SELECT PD_DESC
FROM nypd
WHERE PD_DESC IS NOT NULL
GROUP BY PD_DESC
HAVING COUNT(DISTINCT PD_CD)>1)
GROUP BY PD_CD, PD_DESC
'''

#df = pd.read_sql(query, con=engine)
#df