## NYPD Dataset

Dataset description at 
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i



| Column | Description |
|--------|-------------------|
| CMPLNT_NUM |  Randomly generated persistent ID for each complaint  |  
| ADDR_PCT_CD |  The precinct in which the incident occurred |  
| BORO_NM |  The name of the borough in which the incident occurred |  
| CMPLNT_FR_DT |  Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |  
| CMPLNT_FR_TM |  Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |  
| CMPLNT_TO_DT |  Ending date of occurrence for the reported event, if exact time of occurrence is unknown |  
| CMPLNT_TO_TM |  Ending time of occurrence for the reported event, if exact time of occurrence is unknown |  
| CRM_ATPT_CPTD_CD |  Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |  
| HADEVELOPT |  Name of NYCHA housing development of occurrence, if applicable |  
| HOUSING_PSA |  Development Level Code |  
| JURISDICTION_CODE |  Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |  
| JURIS_DESC |  Description of the jurisdiction code |  
| KY_CD |  Three digit offense classification code |  
| LAW_CAT_CD |  Level of offense: felony, misdemeanor, violation  |  
| LOC_OF_OCCUR_DESC |  Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |  
| OFNS_DESC |  Description of offense corresponding with key code |  
| PARKS_NM |  Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |  
| PATROL_BORO |  The name of the patrol borough in which the incident occurred |  
| PD_CD |  Three digit internal classification code (more granular than Key Code) |  
| PD_DESC |  Description of internal classification corresponding with PD code (more granular than Offense Description) |  
| PREM_TYP_DESC |  Specific description of premises; grocery store, residence, street, etc. |  
| RPT_DT |  Date event was reported to police  |  
| STATION_NAME |  Transit station name |  
| SUSP_AGE_GROUP |  Suspect’s Age Group |  
| SUSP_RACE |  Suspect’s Race Description |  
| SUSP_SEX |  Suspect’s Sex Description |  
| TRANSIT_DISTRICT |  Transit district in which the offense occurred. |  
| VIC_AGE_GROUP |  Victim’s Age Group |  
| VIC_RACE |  Victim’s Race Description |  
| VIC_SEX |  Victim’s Sex Description |  
| X_COORD_CD |  X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Y_COORD_CD |  Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |  
| Latitude |  Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)  |  
| Longitude |  Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |


In [None]:
# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("nypd.csv", low_memory = False, dtype='object')

In [3]:
len(df)

6036805

In [4]:
# df [ df.CMPLNT_FR_DT.str.contains('1015') == True ]

In [5]:
# A few rows contain years such as 1015 or 1016, which trigger an error during date conversion.
# We replace every year written as 10XX with 20XX.
# Note the use of a regular expression with capture groups (raw strings keep the backslashes intact).
year_pattern = r'(\d\d)/(\d\d)/10(\d\d)'
df['CMPLNT_FR_DT'] = df['CMPLNT_FR_DT'].replace(to_replace=year_pattern, value=r'\1/\2/20\3', regex=True)
df['CMPLNT_TO_DT'] = df['CMPLNT_TO_DT'].replace(to_replace=year_pattern, value=r'\1/\2/20\3', regex=True)
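
To see what the regular expression does in isolation, here is a minimal sketch on a toy Series. The date values are made up for illustration and stand in for `CMPLNT_FR_DT`.

```python
import pandas as pd

# Toy stand-in for CMPLNT_FR_DT; these values are invented for the example.
dates = pd.Series(["08/31/1015", "01/02/2013", "12/25/1016"], dtype="object")

# Capture month, day, and the last two digits of the year,
# then rewrite the century from 10 to 20 using backreferences.
fixed = dates.replace(
    to_replace=r"(\d\d)/(\d\d)/10(\d\d)",
    value=r"\1/\2/20\3",
    regex=True,
)

print(fixed.tolist())
```

Dates whose year does not start with 10 are left untouched, because the pattern only matches a year of the form 10XX directly after the second slash.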

In [6]:
# Similarly, a few times are written as 24:00:00, which also triggers errors.
# We rewrite them as 00:00:00 (the date column is left unchanged).
df['CMPLNT_FR_TM'] = df['CMPLNT_FR_TM'].replace(to_replace='24:00:00', value='00:00:00')
df['CMPLNT_TO_TM'] = df['CMPLNT_TO_TM'].replace(to_replace='24:00:00', value='00:00:00')

In [7]:
# Convert the two separate date and time columns into single datetime columns
df['CMPLNT_FR'] = pd.to_datetime(df.CMPLNT_FR_DT + ' ' + df.CMPLNT_FR_TM, format='%m/%d/%Y %H:%M:%S')
df['CMPLNT_TO'] = pd.to_datetime(df.CMPLNT_TO_DT + ' ' + df.CMPLNT_TO_TM, format='%m/%d/%Y %H:%M:%S')

# Convert RPT_DT to date
df.RPT_DT = pd.to_datetime(df.RPT_DT, format="%m/%d/%Y")
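
One way to locate rows that would break the conversion, such as the 10XX years or the 24:00:00 times, is to parse with `errors="coerce"` and look at what turned into `NaT`. This is a sketch on made-up values, not the check that was run on the full dataset.

```python
import pandas as pd

# Made-up examples: one valid timestamp, one with hour 24, one with an invalid date.
raw = pd.Series(
    ["08/31/2013 20:00:00", "09/01/2013 24:00:00", "13/45/2013 10:00:00"],
    dtype="object",
)

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S", errors="coerce")

# Rows that had a value but failed to parse are the ones to inspect.
bad = raw[parsed.isna() & raw.notna()]
print(bad.tolist())
```

Inspecting `bad` before cleaning shows which patterns need a fix, and running it again afterwards confirms that nothing was missed.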

In [8]:
# These columns are redundant: Lat_Lon repeats Latitude and Longitude,
# and the separate date and time columns are now merged into CMPLNT_FR and CMPLNT_TO
df = df.drop(columns=['Lat_Lon', 'CMPLNT_FR_DT', 'CMPLNT_TO_DT', 'CMPLNT_FR_TM', 'CMPLNT_TO_TM'])

In [9]:
# MISD means MISDEMEANOR
df['LAW_CAT_CD'] = df['LAW_CAT_CD'].replace(to_replace='MISD', value='MISDEMEANOR')

# Replace ' ' with NaN (missing value)
df['LOC_OF_OCCUR_DESC'] = df['LOC_OF_OCCUR_DESC'].replace(to_replace=' ', value=np.nan)

##### We should have a discussion about data exploration/cleaning here

In [10]:
# Count the unique values in each column
# (alternative, restricted to string and categorical columns):
# df.describe(include=[object, 'category']).T['unique']
unique = df.describe(include = 'all').T['unique'].sort_values()

In [11]:
unique

CRM_ATPT_CPTD_CD           2
LAW_CAT_CD                 3
SUSP_SEX                   3
LOC_OF_OCCUR_DESC          5
BORO_NM                    5
VIC_SEX                    5
SUSP_RACE                  8
VIC_RACE                   8
PATROL_BORO                8
TRANSIT_DISTRICT          12
JURIS_DESC                25
JURISDICTION_CODE         25
OFNS_DESC                 70
PREM_TYP_DESC             72
KY_CD                     74
ADDR_PCT_CD               77
SUSP_AGE_GROUP            81
VIC_AGE_GROUP            181
HADEVELOPT               279
STATION_NAME             371
PD_DESC                  412
PD_CD                    424
PARKS_NM                1074
RPT_DT                  4383
HOUSING_PSA             4623
X_COORD_CD             70569
Y_COORD_CD             73188
Latitude              114932
Longitude             114938
CMPLNT_FR            1616072
CMPLNT_TO            1768998
CMPLNT_NUM           6036805
Name: unique, dtype: object

In [12]:
# Treat every column with fewer than 2000 distinct values as categorical
for column in unique.index:
    if unique[column] < 2000:
        df[column] = pd.Categorical(df[column])
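
The main benefit of the conversion is memory: a categorical column stores each distinct string once and keeps small integer codes per row. A minimal sketch of the effect, using a synthetic column built from the five borough names:

```python
import pandas as pd

# Synthetic column: the five borough names repeated many times.
boro = pd.Series(
    ["BROOKLYN", "MANHATTAN", "BRONX", "QUEENS", "STATEN ISLAND"] * 20000,
    dtype="object",
)
as_cat = boro.astype("category")

# deep=True counts the Python string objects, not just the pointers.
obj_bytes = boro.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```

The exact byte counts depend on the pandas and Python versions, so the sketch prints them rather than assuming particular numbers.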

In [13]:
for column in unique.index:
    if unique[column] < 200:
        print(df[column].value_counts())
        print("=====")

COMPLETED    5932953
ATTEMPTED     103845
Name: CRM_ATPT_CPTD_CD, dtype: int64
=====
MISDEMEANOR    3431733
FELONY         1855915
VIOLATION       749157
Name: LAW_CAT_CD, dtype: int64
=====
M    1784627
F     576490
U     438633
Name: SUSP_SEX, dtype: int64
=====
INSIDE         3023004
FRONT OF       1402959
OPPOSITE OF     165379
REAR OF         132249
OUTSIDE           3136
Name: LOC_OF_OCCUR_DESC, dtype: int64
=====
BROOKLYN         1798175
MANHATTAN        1443245
BRONX            1307343
QUEENS           1191688
STATEN ISLAND     286135
Name: BORO_NM, dtype: int64
=====
F    2354734
M    1961519
E    1012663
D     707582
U          3
Name: VIC_SEX, dtype: int64
=====
BLACK                             1093935
UNKNOWN                            764617
WHITE HISPANIC                     496597
WHITE                              330730
BLACK HISPANIC                     149002
ASIAN / PACIFIC ISLANDER            89136
AMERICAN INDIAN/ALASKAN NATIVE       9036
OTHER                   

In [18]:
df.SUSP_AGE_GROUP.value_counts().head(10)

25-44    681843
18-24    277584
45-64    240263
<18       77784
65+       18845
Name: SUSP_AGE_GROUP, dtype: int64

In [21]:
df.VIC_AGE_GROUP.value_counts().head(10)            

25-44    1949740
45-64    1010708
18-24     622503
<18       286969
65+       200949
Name: VIC_AGE_GROUP, dtype: int64

In [17]:
# Both columns have a lot of noisy entries. We keep only the dominant groups, and also define an order
df.SUSP_AGE_GROUP = pd.Categorical(df.SUSP_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
df.VIC_AGE_GROUP = pd.Categorical(df.VIC_AGE_GROUP, ordered=True, categories=['<18', '18-24',  '25-44', '45-64', '65+'])
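
Passing an explicit `categories` list has two effects: values outside the list become missing, and `ordered=True` makes comparisons between groups meaningful. A small sketch with invented noisy entries:

```python
import pandas as pd

groups = ["<18", "18-24", "25-44", "45-64", "65+"]

# Invented raw values; "-971" and "2017" play the role of noisy entries.
raw = pd.Series(["25-44", "-971", "<18", "65+", "2017"], dtype="object")
age = pd.Series(pd.Categorical(raw, categories=groups, ordered=True))

# Entries that are not in the category list are now missing.
print(age.isna().tolist())

# With an ordered categorical, range-style filters work on the labels.
at_least_25 = age >= "25-44"
print(at_least_25.tolist())
```

Missing entries compare as `False`, so they drop out of such filters automatically.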

In [25]:
df.Latitude = pd.to_numeric(df.Latitude)
df.Longitude  = pd.to_numeric(df.Longitude)

In [26]:
df.dtypes

CMPLNT_NUM                   object
ADDR_PCT_CD                category
RPT_DT               datetime64[ns]
KY_CD                      category
OFNS_DESC                  category
PD_CD                      category
PD_DESC                    category
CRM_ATPT_CPTD_CD           category
LAW_CAT_CD                 category
BORO_NM                    category
LOC_OF_OCCUR_DESC          category
PREM_TYP_DESC              category
JURIS_DESC                 category
JURISDICTION_CODE          category
PARKS_NM                   category
HADEVELOPT                 category
HOUSING_PSA                  object
X_COORD_CD                   object
Y_COORD_CD                   object
SUSP_AGE_GROUP             category
SUSP_RACE                  category
SUSP_SEX                   category
TRANSIT_DISTRICT           category
Latitude                    float64
Longitude                   float64
PATROL_BORO                category
STATION_NAME               category
VIC_AGE_GROUP              c

In [None]:
# Find unique values and maximum length of various columns
for column in df.columns.values:
    datatype = df[column].dtype.name
    unique_values = len(df[column].value_counts())
    print(column, '\t', datatype, '\t', unique_values)
    if datatype == 'object' or datatype =='category':
        m = max([len(str(x)) for x in df[column].value_counts().index.values])
        print("Max length:", m)


In [None]:
df.dtypes

The fields 

PD_CD, PD_DESC    
KY_CD, OFNS_DESC  
JURIS_DESC    
PREM_TYP_DESC    
HADEVELOPT    
PARKS_NM                     

would be better off as foreign keys or enums. They take too much space as strings.
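
As an illustration of the foreign-key idea, the sketch below splits a code/description pair into a lookup table and keeps only the code in the main table. The frame is a toy example (the `CMPLNT_NUM` values are invented), and the approach assumes each code has exactly one description, which is what the data-quality queries at the end of this notebook are meant to check.

```python
import pandas as pd

# Toy complaints table; CMPLNT_NUM values are invented for the example.
complaints = pd.DataFrame({
    "CMPLNT_NUM": ["1", "2", "3", "4"],
    "PD_CD": ["638", "234", "638", "441"],
    "PD_DESC": [
        "HARASSMENT,SUBD 3,4,5",
        "BURGLARY,UNKNOWN TIME",
        "HARASSMENT,SUBD 3,4,5",
        "LARCENY,GRAND OF AUTO",
    ],
})

# Lookup table: one row per code, holding the long description once.
pd_lookup = complaints[["PD_CD", "PD_DESC"]].drop_duplicates().reset_index(drop=True)

# Main table keeps only the code, which acts as the foreign key.
slim = complaints.drop(columns=["PD_DESC"])

# A join restores the description whenever it is needed.
restored = slim.merge(pd_lookup, on="PD_CD", how="left")
print(pd_lookup)
```

The same split applies to `KY_CD`/`OFNS_DESC`; for columns without a code, such as `PARKS_NM`, a surrogate integer key would have to be generated first.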

### Writing a Pandas Dataframe in a SQLite Database

In [28]:
import sqlite3

# We start by creating our database
con = sqlite3.connect('nypd.db')

In [30]:
!sudo pip3 install tqdm

The directory '/home/ubuntu/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/ubuntu/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting tqdm
  Downloading https://files.pythonhosted.org/packages/91/55/8cb23a97301b177e9c8e3226dba45bb454411de2cbd25746763267f226c2/tqdm-4.28.1-py2.py3-none-any.whl (45kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.28.1


Once we have connected successfully, we create a table and load the data into it in batches:

In [31]:
# Create a table
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation
from tqdm import tqdm
batchsize = 50000
batches = len(df) // batchsize + 1

t = tqdm(range(batches))

for i in t:
    start = batchsize * i
    end = batchsize * (i+1)
    df[start:end].to_sql(
        name = 'nypd', 
        con = con,
        if_exists = 'append',
        index = False, 
        chunksize = 1000)

100%|██████████| 121/121 [02:13<00:00,  1.03s/it]
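
The batch count above is computed as `len(df) // batchsize + 1`, which produces one extra, empty batch whenever the row count is an exact multiple of the batch size. A possible alternative is to derive the slice bounds directly from `range`. The helper name `batch_bounds` is hypothetical and not part of the notebook.

```python
def batch_bounds(n_rows, batchsize):
    """Return (start, end) pairs that cover n_rows without an empty trailing batch."""
    return [
        (start, min(start + batchsize, n_rows))
        for start in range(0, n_rows, batchsize)
    ]


# Row count taken from len(df) earlier in the notebook.
bounds = batch_bounds(6036805, 50000)
print(len(bounds), bounds[-1])
```

Each pair can then be used as `df.iloc[start:end]` inside the same `to_sql` loop.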


In [None]:
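# Once the data is in the table, we can also specify a primary key.
# Foreign keys could be added in the same way.
# Note: `engine` is not defined in this notebook (presumably a SQLAlchemy engine for a
# server database such as MySQL). SQLite's ALTER TABLE cannot add a primary key to an
# existing table, so this statement does not apply to the nypd.db file created above.
# add_key_query = 'ALTER TABLE nypd ADD PRIMARY KEY(CMPLNT_NUM)'
# engine.execute(add_key_query)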
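
For SQLite, one option is to declare the primary key when the table is created and only then append rows, for example with `to_sql(..., if_exists='append')`. The sketch below uses an in-memory database and a hypothetical two-column table `nypd_demo` to show that the key is enforced.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
con = sqlite3.connect(":memory:")

# Hypothetical, cut-down table: the primary key is declared up front.
con.execute("CREATE TABLE nypd_demo (CMPLNT_NUM TEXT PRIMARY KEY, BORO_NM TEXT)")
con.execute("INSERT INTO nypd_demo VALUES ('491097831', 'BROOKLYN')")

# A second row with the same key violates the constraint.
try:
    con.execute("INSERT INTO nypd_demo VALUES ('491097831', 'BRONX')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

With this approach the full column list has to be written out in the `CREATE TABLE` statement, which is also the natural place to declare foreign keys.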

In [35]:
query = "SELECT * FROM nypd LIMIT 100"
r = pd.read_sql(query, con=con)
r.head(5)

Unnamed: 0,CMPLNT_NUM,ADDR_PCT_CD,RPT_DT,KY_CD,OFNS_DESC,PD_CD,PD_DESC,CRM_ATPT_CPTD_CD,LAW_CAT_CD,BORO_NM,...,TRANSIT_DISTRICT,Latitude,Longitude,PATROL_BORO,STATION_NAME,VIC_AGE_GROUP,VIC_RACE,VIC_SEX,CMPLNT_FR,CMPLNT_TO
0,491097831,76,2013-09-03 00:00:00,578,HARRASSMENT 2,638,"HARASSMENT,SUBD 3,4,5",COMPLETED,VIOLATION,BROOKLYN,...,,40.684084,-73.98678,PATROL BORO BKLYN SOUTH,,,UNKNOWN,F,2013-08-31 20:00:00,2013-09-02 13:10:00
1,827796420,40,2013-09-03 00:00:00,359,OFFENSES AGAINST PUBLIC ADMINI,759,"PUBLIC ADMINISTATION,UNCLASS M",COMPLETED,MISDEMEANOR,BRONX,...,,40.815606,-73.914579,PATROL BORO BRONX,,,UNKNOWN,E,2013-08-31 19:45:00,2013-08-31 20:00:00
2,823404713,10,2013-09-03 00:00:00,361,OFF. AGNST PUB ORD SENSBLTY &,639,AGGRAVATED HARASSMENT 2,COMPLETED,MISDEMEANOR,MANHATTAN,...,,40.74781,-73.998518,PATROL BORO MAN SOUTH,,18-24,WHITE HISPANIC,F,2013-08-31 19:30:00,2013-08-31 20:00:00
3,950495742,106,2013-09-03 00:00:00,110,GRAND LARCENY OF MOTOR VEHICLE,441,"LARCENY,GRAND OF AUTO",COMPLETED,FELONY,QUEENS,...,,40.662512,-73.856311,PATROL BORO QUEENS SOUTH,,45-64,WHITE HISPANIC,F,2013-08-31 19:00:00,2013-09-02 09:00:00
4,655454255,71,2013-09-03 00:00:00,107,BURGLARY,234,"BURGLARY,UNKNOWN TIME",COMPLETED,FELONY,BROOKLYN,...,,40.666588,-73.939678,PATROL BORO BKLYN SOUTH,,25-44,BLACK,M,2013-08-31 19:00:00,2013-09-03 11:00:00


And remember that Pandas can also export to other formats, such as Excel or CSV.
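
As a quick illustration of the CSV case, the sketch below writes a two-row frame to an in-memory buffer. The rows reuse values from the query result above; a file path could be passed to `to_csv` instead of the buffer.

```python
import io

import pandas as pd

# Two rows with values taken from the query result shown earlier.
r = pd.DataFrame({
    "CMPLNT_NUM": ["491097831", "827796420"],
    "BORO_NM": ["BROOKLYN", "BRONX"],
})

# index=False leaves out the row labels, so only the data columns are written.
buffer = io.StringIO()
r.to_csv(buffer, index=False)

lines = buffer.getvalue().splitlines()
print(lines)
```

`DataFrame.to_excel` works along the same lines but needs an Excel writer library to be installed.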

## TODO

The script should also write to a MySQL Database

Add indexes:
    
CMPLNT_NUM
Latitude
Longitude
LAW_CAT_CD
BORO_NM
OFNS_DESC
RPT_DT
CMPLNT_FR
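
A possible way to add these indexes in SQLite is a `CREATE INDEX` statement per column. The sketch below runs against an in-memory database with a hypothetical cut-down table `nypd_demo`; the index naming scheme is also just an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical cut-down table with three of the columns listed above.
con.execute("CREATE TABLE nypd_demo (CMPLNT_NUM TEXT, BORO_NM TEXT, RPT_DT TEXT)")

# One single-column index per field; IF NOT EXISTS makes the cell safe to re-run.
for column in ["CMPLNT_NUM", "BORO_NM", "RPT_DT"]:
    con.execute(
        f"CREATE INDEX IF NOT EXISTS idx_nypd_demo_{column} ON nypd_demo ({column})"
    )

# PRAGMA index_list reports the indexes on a table; the name is the second field.
index_names = [row[1] for row in con.execute("PRAGMA index_list('nypd_demo')")]
print(sorted(index_names))
```

On the real table it is usually faster to create the indexes after the bulk load than before it.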

In [None]:
# The necessary library to write in Excel
# !sudo pip3 install -U xlwt

In [None]:
# Data quality issues to fix: KY_CD values that map to more than one OFNS_DESC

query = '''
SELECT KY_CD, OFNS_DESC, COUNT(*)
FROM nypd WHERE KY_CD IN (
SELECT KY_CD
FROM nypd
WHERE OFNS_DESC IS NOT NULL
GROUP BY KY_CD
HAVING COUNT(DISTINCT OFNS_DESC)>1)
GROUP BY KY_CD, OFNS_DESC
'''

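# Store the result under a new name so that the main DataFrame `df` is not overwritten
ky_conflicts = pd.read_sql(query, con=con)
ky_conflicts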
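
The same kind of check can be sketched directly in Pandas with `groupby` and `nunique`, which, like `COUNT(DISTINCT ...)`, ignores missing values. The frame below is a toy example: code `999` and its two labels are invented to trigger a conflict.

```python
import pandas as pd

# Toy data: code 999 and its labels are invented; 578 has no description at all.
demo = pd.DataFrame({
    "KY_CD": ["999", "999", "999", "107", "107", "578"],
    "OFNS_DESC": ["LABEL A", "LABEL B", "LABEL A", "BURGLARY", "BURGLARY", None],
})

# Number of distinct non-null descriptions per code.
n_desc = demo.groupby("KY_CD")["OFNS_DESC"].nunique()
conflicting = n_desc[n_desc > 1].index

# For the conflicting codes, count how often each description occurs.
report = (
    demo[demo["KY_CD"].isin(conflicting)]
    .groupby(["KY_CD", "OFNS_DESC"])
    .size()
    .reset_index(name="n")
)
print(report)
```

The counts per description help decide which label to keep when the codes are later moved into a lookup table.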

In [None]:
# Data quality issues to fix: PD_DESC values that are shared by more than one PD_CD

query = '''
SELECT PD_CD, PD_DESC, COUNT(*)
FROM nypd WHERE PD_DESC IN (
SELECT PD_DESC
FROM nypd
WHERE PD_DESC IS NOT NULL
GROUP BY PD_DESC
HAVING COUNT(DISTINCT PD_CD)>1)
GROUP BY PD_CD, PD_DESC
'''

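# Store the result under a new name so that the main DataFrame `df` is not overwritten
pd_conflicts = pd.read_sql(query, con=con)
pd_conflicts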