# Crime Prediction

Analysis
- crime prediction
- find features that affect the seriousness of crimes

- feature preprocessing 
  - data preprocessing
    - fix cleaning code errors
  - dimension reduction 
  - feature selection 
  - feature transformation
- modeling 
  - try different models (3-4)
  - experiments: tune parameters
  - compare results
- feature monitoring: feature importance evaluation, find common features that have high feature importance



## Load Dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

In [None]:
from google.colab import drive
drive.mount('/content/drive/')
import os
# TODO: Put your dataset in your google drive!
os.chdir('/content/drive/MyDrive/Crimes/')
!ls

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
'Crime Complaints In Different Precincts Per area.jpg'
 crime_features.csv
 crime_preprocessed.csv
'Crimes in different Boroughs.jpg'
'Crime types.jpg'
'Crime Type Specific.jpg'
'Jurisdiction responsible for incident.jpg'
'Level of offense.jpg'
'Name of NYCHA housing development of occurrence.jpg'
 NYPD_Complaint_Data_Current__Year_To_Date_.csv
'patrol borough in which the incident occurred.jpg'
'Police Precincts.geojson'
'Specific location of occurrence in or around the premises.jpg'
'whether crime was successfully completed or attempted.jpg'


In [None]:
data_path = "/content/drive/MyDrive/Crimes/crime_preprocessed.csv"
df = pd.read_csv(data_path)
field_names = list(df.columns)

In [None]:
# Drop rows with no Level of offense 
df = df[df['LAW_CAT_CD'].notna()]

In [None]:
print(df.columns)
print(df.shape)

Index(['Unnamed: 0', 'ADDR_PCT_CD', 'BORO_NM', 'CRM_ATPT_CPTD_CD',
       'HADEVELOPT', 'HOUSING_PSA', 'JURIS_DESC', 'LAW_CAT_CD',
       'LOC_OF_OCCUR_DESC', 'OFNS_DESC', 'PARKS_NM', 'PATROL_BORO', 'PD_CD',
       'PD_DESC', 'PREM_TYP_DESC', 'RPT_DT', 'STATION_NAME', 'SUSP_AGE_GROUP',
       'SUSP_RACE', 'SUSP_SEX', 'TRANSIT_DISTRICT', 'VIC_AGE_GROUP',
       'VIC_RACE', 'VIC_SEX', 'X_COORD_CD', 'Y_COORD_CD', 'Latitude',
       'Longitude', 'Lat_Lon', 'New Georeferenced Column', 'FR_TIME',
       'TO_TIME', 'duration'],
      dtype='object')
(449506, 33)


## Change categorical features into numerical

### Drop unrelated columns
Keep only the features we consider relevant, then continue processing them.

In [None]:
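# Take an explicit copy so that later column assignments modify df_final itself
# rather than a slice of df (this avoids pandas' SettingWithCopyWarning)
df_final = df[['ADDR_PCT_CD', 'JURIS_DESC', 'LOC_OF_OCCUR_DESC', 
               'OFNS_DESC', 'PREM_TYP_DESC', 'SUSP_AGE_GROUP','SUSP_RACE',
               'SUSP_SEX', 'VIC_AGE_GROUP','VIC_RACE','VIC_SEX', 'LAW_CAT_CD',
               'FR_TIME','duration']].copy()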
df_final.dtypes

ADDR_PCT_CD           int64
JURIS_DESC           object
LOC_OF_OCCUR_DESC    object
OFNS_DESC            object
PREM_TYP_DESC        object
SUSP_AGE_GROUP       object
SUSP_RACE            object
SUSP_SEX             object
VIC_AGE_GROUP        object
VIC_RACE             object
VIC_SEX              object
LAW_CAT_CD           object
FR_TIME              object
duration             object
dtype: object

### Label-encode jurisdiction description ('JURIS_DESC')

In [None]:
df_final['JURIS_DESC'].value_counts()

N.Y. POLICE DEPT                405043
N.Y. HOUSING POLICE              32258
N.Y. TRANSIT POLICE               8966
PORT AUTHORITY                    1473
OTHER                              881
DEPT OF CORRECTIONS                270
NYC PARKS                          198
HEALTH & HOSP CORP                 192
TRI-BORO BRDG TUNNL                101
N.Y. STATE POLICE                   44
METRO NORTH                         20
LONG ISLAND RAILRD                  12
NEW YORK CITY SHERIFF OFFICE        11
U.S. PARK POLICE                    10
AMTRACK                              8
N.Y. STATE PARKS                     8
STATN IS RAPID TRANS                 6
NYS DEPT TAX AND FINANCE             5
Name: JURIS_DESC, dtype: int64

Almost all complaints fall under the five most frequent jurisdictions, so instead of one-hot encoding this column we simply label-encode it as integers.

In [None]:
# import labelencoder
from sklearn.preprocessing import LabelEncoder

# instantiate labelencoder object
le = LabelEncoder()

# apply le on categorical feature columns
df_final['JURIS_DESC'] = le.fit_transform(df_final['JURIS_DESC'])
df_final['JURIS_DESC'].head(10)


0    6
1    6
2    6
3    6
4    6
5    6
6    6
7    6
8    6
9    6
Name: JURIS_DESC, dtype: int64
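`LabelEncoder` assigns integers in sorted order of the class labels, which is consistent with `N.Y. POLICE DEPT` being encoded as 6 above. The sketch below, on a small made-up sample, shows one way to recover the label-to-integer mapping from `classes_`. The name `le_juris` is hypothetical: the notebook reuses a single `le` object, so its `classes_` are overwritten by every later `fit_transform` call, and keeping one encoder per column is an assumption made here so that the mapping stays available.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy sample of jurisdiction labels (illustrative values only)
juris = pd.Series(["N.Y. POLICE DEPT", "N.Y. HOUSING POLICE",
                   "AMTRACK", "N.Y. POLICE DEPT"])

# Hypothetical per-column encoder, kept so the mapping can be inspected later
le_juris = LabelEncoder()
codes = le_juris.fit_transform(juris)

# classes_ holds the sorted unique labels; the position is the integer code
mapping = {label: code for code, label in enumerate(le_juris.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = le_juris.inverse_transform(codes)
```

Because these integers only reflect alphabetical order, any model that treats them as ordered quantities is reading structure that is not in the data.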

### Label-encode offense description ('OFNS_DESC')

In [None]:
df_final["OFNS_DESC"].value_counts()[:30]

PETIT LARCENY                      87091
HARRASSMENT 2                      74733
ASSAULT 3 & RELATED OFFENSES       48437
CRIMINAL MISCHIEF & RELATED OF     44739
GRAND LARCENY                      40874
FELONY ASSAULT                     22809
OFF. AGNST PUB ORD SENSBLTY &      17318
MISCELLANEOUS PENAL LAW            14570
ROBBERY                            13834
BURGLARY                           12791
GRAND LARCENY OF MOTOR VEHICLE     10417
VEHICLE AND TRAFFIC LAWS            8382
DANGEROUS DRUGS                     7744
SEX CRIMES                          7053
OFFENSES AGAINST PUBLIC ADMINI      5742
DANGEROUS WEAPONS                   5436
FORGERY                             4503
THEFT-FRAUD                         3882
FRAUDS                              2611
OFFENSES INVOLVING FRAUD            2589
INTOXICATED & IMPAIRED DRIVING      2578
CRIMINAL TRESPASS                   1930
RAPE                                1487
UNAUTHORIZED USE OF A VEHICLE       1290
POSSESSION OF ST

In [None]:
# Keep only offense classes with more than 1000 incidents
counts_ofns = df_final.groupby("OFNS_DESC")["OFNS_DESC"].transform("count")
# Filter with the per-row class count directly; copy so later assignments are safe
df_final = df_final.loc[counts_ofns > 1000].copy()


In [None]:
# apply le on categorical feature columns
df_final['OFNS_DESC'] = le.fit_transform(df_final['OFNS_DESC'])
df_final.head()

Unnamed: 0,ADDR_PCT_CD,JURIS_DESC,LOC_OF_OCCUR_DESC,OFNS_DESC,PREM_TYP_DESC,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX,LAW_CAT_CD,FR_TIME,duration
3,52,6,,5,STREET,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-12-07 22:49:00,
6,47,6,,5,STREET,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-12-01 00:01:00,
19,9,6,,7,STREET,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-09-12 16:55:00,
37,43,6,,21,STREET,UNKNOWN,UNKNOWN,U,25-44,BLACK,M,FELONY,2021-07-06 16:15:00,
59,46,6,,4,STREET,,,,UNKNOWN,UNKNOWN,E,MISDEMEANOR,2021-02-13 15:15:00,
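The rare-class filter applied to `OFNS_DESC` before encoding can also be written with `value_counts` and `isin`. This is a minimal sketch on toy data; the threshold `min_count = 2` is a hypothetical stand-in for the 1000-incident cutoff used in the notebook.

```python
import pandas as pd

# Toy offense column: two frequent classes and one rare class
toy = pd.DataFrame(
    {"OFNS_DESC": ["PETIT LARCENY"] * 4 + ["ROBBERY"] * 3 + ["ARSON"]}
)

min_count = 2  # hypothetical threshold; the notebook uses 1000

# Count incidents per class, then keep only classes above the threshold
counts = toy["OFNS_DESC"].value_counts()
frequent = counts[counts > min_count].index

# .copy() makes the filtered frame independent of the original
filtered = toy[toy["OFNS_DESC"].isin(frequent)].copy()
print(filtered["OFNS_DESC"].value_counts())
```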


### Location of occurrence - One Hot Encoding 

In [None]:
df_final['LOC_OF_OCCUR_DESC'].value_counts()

INSIDE         239224
FRONT OF       115500
OPPOSITE OF      9086
REAR OF          7766
Name: LOC_OF_OCCUR_DESC, dtype: int64

In [None]:
# Use dummy variables to represent the location of occurrence
dummy2 = pd.get_dummies(df_final['LOC_OF_OCCUR_DESC'])
# Take a look
dummy2.head()

Unnamed: 0,FRONT OF,INSIDE,OPPOSITE OF,REAR OF
3,0,0,0,0
6,0,0,0,0
19,0,0,0,0
37,0,0,0,0
59,0,0,0,0
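The all-zero rows in this preview correspond to complaints whose `LOC_OF_OCCUR_DESC` is missing: by default `pd.get_dummies` produces no indicator for NaN. If the missingness should be kept as its own feature, one option is `dummy_na=True`. The sketch below uses toy values, and whether an explicit missing-value column helps the downstream models is an assumption that would need to be checked.

```python
import numpy as np
import pandas as pd

# Toy location column with two missing entries
loc = pd.Series(["INSIDE", np.nan, "FRONT OF", np.nan],
                name="LOC_OF_OCCUR_DESC")

# Default behaviour: rows with NaN become all zeros
plain = pd.get_dummies(loc, dtype=int)

# dummy_na=True adds one extra indicator column for missing values
with_na = pd.get_dummies(loc, dummy_na=True, dtype=int)

print(plain)
print(with_na)
```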


In [None]:
df_final = pd.concat([df_final, dummy2], axis=1).drop('LOC_OF_OCCUR_DESC', axis=1)
df_final.head()

Unnamed: 0,ADDR_PCT_CD,JURIS_DESC,OFNS_DESC,PREM_TYP_DESC,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX,LAW_CAT_CD,FR_TIME,duration,FRONT OF,INSIDE,OPPOSITE OF,REAR OF
3,52,6,5,STREET,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-12-07 22:49:00,,0,0,0,0
6,47,6,5,STREET,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-12-01 00:01:00,,0,0,0,0
19,9,6,7,STREET,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-09-12 16:55:00,,0,0,0,0
37,43,6,21,STREET,UNKNOWN,UNKNOWN,U,25-44,BLACK,M,FELONY,2021-07-06 16:15:00,,0,0,0,0
59,46,6,4,STREET,,,,UNKNOWN,UNKNOWN,E,MISDEMEANOR,2021-02-13 15:15:00,,0,0,0,0


### Property type ('PREM_TYP_DESC')

In [None]:
df_final["PREM_TYP_DESC"].value_counts()

STREET                        126659
RESIDENCE - APT. HOUSE        107658
RESIDENCE-HOUSE                46435
RESIDENCE - PUBLIC HOUSING     31996
CHAIN STORE                    20318
                               ...  
CEMETERY                          30
PHOTO/COPY                        28
LOAN COMPANY                      25
DAYCARE FACILITY                  18
TRAMWAY                            6
Name: PREM_TYP_DESC, Length: 74, dtype: int64

In [None]:
# There are many distinct property types (74), so use integer label encoding instead of one-hot encoding
df_final['PREM_TYP_DESC'] = le.fit_transform(df_final['PREM_TYP_DESC'])
df_final.head()

Unnamed: 0,ADDR_PCT_CD,JURIS_DESC,OFNS_DESC,PREM_TYP_DESC,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX,LAW_CAT_CD,FR_TIME,duration,FRONT OF,INSIDE,OPPOSITE OF,REAR OF
3,52,6,5,62,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-12-07 22:49:00,,0,0,0,0
6,47,6,5,62,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-12-01 00:01:00,,0,0,0,0
19,9,6,7,62,,,,UNKNOWN,UNKNOWN,E,FELONY,2021-09-12 16:55:00,,0,0,0,0
37,43,6,21,62,UNKNOWN,UNKNOWN,U,25-44,BLACK,M,FELONY,2021-07-06 16:15:00,,0,0,0,0
59,46,6,4,62,,,,UNKNOWN,UNKNOWN,E,MISDEMEANOR,2021-02-13 15:15:00,,0,0,0,0


### Suspect and Victim demographic info

In [None]:
# Collect the suspect and victim demographic columns in a separate frame (explicit copy)
df_pca = df_final[['SUSP_AGE_GROUP','SUSP_RACE','SUSP_SEX', 'VIC_AGE_GROUP','VIC_RACE','VIC_SEX']].copy()
df_pca.head()

Unnamed: 0,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
3,,,,UNKNOWN,UNKNOWN,E
6,,,,UNKNOWN,UNKNOWN,E
19,,,,UNKNOWN,UNKNOWN,E
37,UNKNOWN,UNKNOWN,U,25-44,BLACK,M
59,,,,UNKNOWN,UNKNOWN,E


In [None]:
# Replace the placeholder codes ('U', 'UNKNOWN', 'E', 'D') with np.nan.
# .loc is used instead of chained indexing so the assignment reliably hits df_pca.
print(df_pca['SUSP_SEX'].unique())
df_pca.loc[df_pca['SUSP_SEX'] == 'U', 'SUSP_SEX'] = np.nan

print(df_pca['SUSP_AGE_GROUP'].unique())
df_pca.loc[df_pca['SUSP_AGE_GROUP'] == 'UNKNOWN', 'SUSP_AGE_GROUP'] = np.nan

print(df_pca['SUSP_RACE'].unique())
df_pca.loc[df_pca['SUSP_RACE'] == 'UNKNOWN', 'SUSP_RACE'] = np.nan

print(df_pca['VIC_AGE_GROUP'].unique())
df_pca.loc[df_pca['VIC_AGE_GROUP'] == 'UNKNOWN', 'VIC_AGE_GROUP'] = np.nan

print(df_pca['VIC_RACE'].unique())
df_pca.loc[df_pca['VIC_RACE'] == 'UNKNOWN', 'VIC_RACE'] = np.nan

print(df_pca['VIC_SEX'].unique())
df_pca.loc[df_pca['VIC_SEX'].isin(['E', 'D']), 'VIC_SEX'] = np.nan

[nan 'U' 'M' 'F']
[nan 'UNKNOWN' '25-44' '18-24' '45-64' '65+' '<18' '2021' '-969' '-33'
 '-60' '953' '-971' '942' '-946' '1032' '938' '-955' '940' '-69' '-941'
 '-975' '-947' '-973' '1017' '-966']
[nan 'UNKNOWN' 'BLACK' 'BLACK HISPANIC' 'WHITE' 'WHITE HISPANIC'
 'ASIAN / PACIFIC ISLANDER' 'AMERICAN INDIAN/ALASKAN NATIVE']
['UNKNOWN' '25-44' '45-64' '18-24' '65+' '<18' '-4' '-943' '-62' '-3' '-1'
 '936' '-960' '-921' '-61' '-48' '970' '963' '945' '-935' '-51']



['UNKNOWN' 'BLACK' 'BLACK HISPANIC' 'WHITE' 'WHITE HISPANIC'
 'ASIAN / PACIFIC ISLANDER' 'AMERICAN INDIAN/ALASKAN NATIVE' nan]
['E' 'M' 'F' 'D']


In [None]:
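# Use LabelEncoder to convert labels to numbers.
# astype(str) turns NaN into the string 'nan', which receives its own integer label.
original = df_pca
mask = df_pca.isnull()
df_pca = df_pca.astype(str).apply(LabelEncoder().fit_transform)
# Display the encoded frame with the original NaNs restored
# (the result is only shown here, it is not assigned back to df_pca)
df_pca.where(~mask, original)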

Unnamed: 0,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
3,,,,,,
6,,,,,,
19,,,,,,
37,,,,12,2,1
59,,,,,,
...,...,...,...,...,...,...
449501,,,,13,4,0
449502,,,,,,
449503,16,4,1,,,
449504,16,5,1,13,5,0


In [None]:
# Fill NaN with the column median
df_pca = df_pca.fillna(df_pca.median())
df_pca.head()

Unnamed: 0,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
3,24,6,2,20,6,2
6,24,6,2,20,6,2
19,24,6,2,20,6,2
37,24,6,2,12,2,1
59,24,6,2,20,6,2
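In the cells above, the result of `df_pca.where(~mask, original)` is displayed but not assigned, so the frame passed to `fillna` contains no NaN: missing values were already encoded as the label for the string `'nan'` (the 24, 6, 2, 20 values in the preview). If the intent is to impute the median of the observed codes, one possible approach is to encode with `pd.factorize`, which marks missing values with -1, and convert those back to NaN before filling. The helper `encode_keep_nan` and the toy data are hypothetical, and this is a sketch of one alternative, not the notebook's method.

```python
import numpy as np
import pandas as pd

def encode_keep_nan(col: pd.Series) -> pd.Series:
    """Integer-encode a categorical column, leaving missing values as NaN."""
    # factorize assigns -1 to missing values; sort=True orders labels alphabetically
    codes, _ = pd.factorize(col, sort=True)
    return pd.Series(codes, index=col.index, dtype="float64").replace(-1, np.nan)

# Toy demographic columns with missing entries
toy = pd.DataFrame({
    "VIC_SEX": ["M", "F", np.nan, "F", "M"],
    "VIC_AGE_GROUP": ["25-44", np.nan, "18-24", "25-44", "45-64"],
})

encoded = toy.apply(encode_keep_nan)           # NaN survives the encoding
imputed = encoded.fillna(encoded.median())     # median of observed codes only
print(imputed)
```

Since these codes are nominal, the median is not always meaningful; filling with the most frequent code (the mode) is another option worth comparing.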


In [None]:
df_final = df_final.drop(columns = ['SUSP_AGE_GROUP','SUSP_RACE','SUSP_SEX', 'VIC_AGE_GROUP','VIC_RACE','VIC_SEX'])
df_final = pd.concat([df_final, df_pca], axis = 1)
df_final.head()

Unnamed: 0,ADDR_PCT_CD,JURIS_DESC,OFNS_DESC,PREM_TYP_DESC,LAW_CAT_CD,FR_TIME,duration,FRONT OF,INSIDE,OPPOSITE OF,REAR OF,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
3,52,6,5,62,FELONY,2021-12-07 22:49:00,,0,0,0,0,24,6,2,20,6,2
6,47,6,5,62,FELONY,2021-12-01 00:01:00,,0,0,0,0,24,6,2,20,6,2
19,9,6,7,62,FELONY,2021-09-12 16:55:00,,0,0,0,0,24,6,2,20,6,2
37,43,6,21,62,FELONY,2021-07-06 16:15:00,,0,0,0,0,24,6,2,12,2,1
59,46,6,4,62,MISDEMEANOR,2021-02-13 15:15:00,,0,0,0,0,24,6,2,20,6,2


### Complaint time ('FR_TIME')

In [None]:
df_final['FR_TIME'] = (pd.to_datetime(df_final['FR_TIME']).dt.hour)

In [None]:
print(df_final['FR_TIME'].unique())

[22  0 16 15  3 19  4 20 10 13 17  2  9  7 12  1  8 14 18 23 11 21  6  5]


In [None]:
df_final['FR_TIME'] = df_final['FR_TIME'].fillna(df_final['FR_TIME'].median())
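The hour extraction and median fill for `FR_TIME` can be tried on a few toy timestamps. Passing `errors="coerce"` is an assumption added here so that unparseable strings become `NaT` instead of raising; the notebook itself calls `pd.to_datetime` with default settings.

```python
import pandas as pd

# Toy timestamps, including a missing value and an unparseable string
fr_time = pd.Series(["2021-12-07 22:49:00", "2021-12-01 00:01:00",
                     None, "not a date"])

# Unparseable or missing entries become NaT, so .dt.hour yields NaN for them
hours = pd.to_datetime(fr_time, errors="coerce").dt.hour

# Fill the gaps with the median of the observed hours
hours_filled = hours.fillna(hours.median())
print(hours_filled.tolist())
```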

### Duration ('duration')

In [None]:
# Convert into hours 
df_final['duration']= pd.to_timedelta(df_final['duration'])/ np.timedelta64(1, 'h')

In [None]:
df_final['duration'] = df_final['duration'].fillna(df_final['duration'].median())
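A small check of the duration conversion, assuming the `duration` column stores strings in pandas' timedelta format such as `"0 days 01:30:00"` (the actual format in the CSV is not shown in this notebook). Dividing by `np.timedelta64(1, 'h')` and using `.dt.total_seconds() / 3600` give the same number of hours.

```python
import numpy as np
import pandas as pd

# Toy durations in an assumed "N days HH:MM:SS" string format
duration = pd.Series(["0 days 01:30:00", "2 days 00:00:00", None])

# Same conversion as in the notebook: timedelta divided by one hour
hours = pd.to_timedelta(duration) / np.timedelta64(1, "h")

# Equivalent formulation via total seconds
hours_alt = pd.to_timedelta(duration).dt.total_seconds() / 3600

# Missing durations are filled with the median of the observed ones
hours_filled = hours.fillna(hours.median())
print(hours_filled.tolist())
```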

### Convert outcome to numbers 

In [None]:
df_final['LAW_CAT_CD'] = le.fit_transform(df_final['LAW_CAT_CD'])
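`LabelEncoder` sorts labels alphabetically, so `FELONY` receives a smaller code than `MISDEMEANOR`. When the sign of a correlation with the outcome is meant to be read as "more serious" or "less serious", an explicit mapping makes the direction unambiguous. The sketch below assumes three offense levels, with `VIOLATION` as the least serious; only `FELONY` and `MISDEMEANOR` appear in the previews above, so the third level and the chosen ordering in `severity_order` are assumptions.

```python
import pandas as pd

# Hypothetical ordinal mapping: larger number means more serious offense
severity_order = {"VIOLATION": 0, "MISDEMEANOR": 1, "FELONY": 2}

# Toy outcome column
law_cat = pd.Series(["FELONY", "MISDEMEANOR", "VIOLATION", "FELONY"])

# Series.map applies the dictionary; labels missing from it would become NaN
encoded = law_cat.map(severity_order)
print(encoded.tolist())
```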

### Standardize Data


In [None]:
df_final['ADDR_PCT_CD'] =( df_final['ADDR_PCT_CD'] - df_final['ADDR_PCT_CD'].mean() ) / df_final['ADDR_PCT_CD'].std()
df_final['PREM_TYP_DESC'] =( df_final['PREM_TYP_DESC'] - df_final['PREM_TYP_DESC'].mean() ) / df_final['PREM_TYP_DESC'].std()
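The manual z-score above uses pandas' `.std()`, which defaults to the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The difference is negligible at this dataset's size, but it explains small mismatches if the two are mixed. A toy comparison, with made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"ADDR_PCT_CD": [1.0, 2.0, 3.0, 4.0]})
x = toy["ADDR_PCT_CD"]

# Manual z-score as in the notebook (pandas std uses ddof=1 by default)
z_pandas = (x - x.mean()) / x.std()

# Same formula with the population standard deviation
z_pandas_ddof0 = (x - x.mean()) / x.std(ddof=0)

# StandardScaler expects a 2-D input and uses the population standard deviation
z_sklearn = StandardScaler().fit_transform(toy[["ADDR_PCT_CD"]]).ravel()

print(np.allclose(z_pandas_ddof0, z_sklearn))
```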

## Draw feature correlation

In [None]:
print(df_final.corr()["LAW_CAT_CD"][:])

ADDR_PCT_CD       0.028336
JURIS_DESC       -0.032275
OFNS_DESC         0.105847
PREM_TYP_DESC    -0.027790
LAW_CAT_CD        1.000000
FR_TIME           0.037094
duration         -0.012858
FRONT OF         -0.031059
INSIDE            0.084814
OPPOSITE OF      -0.050234
REAR OF          -0.024590
SUSP_AGE_GROUP   -0.108653
SUSP_RACE        -0.145324
SUSP_SEX         -0.231874
VIC_AGE_GROUP    -0.126339
VIC_RACE         -0.107298
VIC_SEX          -0.185111
Name: LAW_CAT_CD, dtype: float64
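To see at a glance which features are most strongly associated with the outcome, the correlation column can be sorted by absolute value. This sketch uses a toy frame with hypothetical feature names `feat_a`, `feat_b`, and `feat_c`. Note that Pearson correlation on label-encoded nominal columns depends on the arbitrary integer assignment, so such a ranking is at best a rough screen.

```python
import pandas as pd

# Toy data: feat_a tracks the outcome exactly, feat_b strongly and inversely,
# feat_c only weakly
toy = pd.DataFrame({
    "LAW_CAT_CD": [0, 1, 2, 1, 0, 2],
    "feat_a":     [0, 1, 2, 1, 0, 2],
    "feat_b":     [5, 3, 1, 2, 5, 1],
    "feat_c":     [1, 0, 1, 0, 1, 0],
})

# Correlation of every feature with the outcome, excluding the outcome itself
corr = toy.corr()["LAW_CAT_CD"].drop("LAW_CAT_CD")

# Rank by strength of association regardless of sign
ranked = corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```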


In [None]:
# Fill any remaining NaN values with 0
df_final = df_final.fillna(0)

In [None]:
# The working directory is already /content/drive/MyDrive/Crimes/ (set with os.chdir above),
# so this writes the file directly to Drive and no separate copy step is needed
df_final.to_csv('crime_features.csv')
