Minneapolis Crime Project

* Author: Jay Bennett [@k-chuang](https://www.github.com/k-chuang)
* Created on: August 25, 2018
* Description: Data analysis, exploration, visualization, and data mining on crime in SF
* Original dataset: [SF Gov Crime dataset](https://data.sfgov.org/Public-Safety/-Change-Notice-Police-Department-Incidents/tmnf-yvry/about)
* Kaggle dataset: [Kaggle SF Crime](https://www.kaggle.com/c/sf-crime/data)

---------------

# Table of Contents

- Introduction
    - SF Crime Dataset
- Basic Preparation
    - Import libraries
    - Load data
- Data Exploration/Analysis Extension
- Data Preprocessing
    - Data Imputation/Removal
    - Feature Engineering
    - Feature Encoding
- Build Machine Learning Models
    - Train different baseline models
    - Analyze results
- Model Selection
- Hyperparameter tuning
- Train Model with optimal hyperparameters
- Feature Selection
    - Feature Importance
    - Feature Removal
- Train Final Model
- Model Evaluation
- Summary
- Kaggle Submission
- Conclusion

# Introduction


## SF Crime Dataset

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week: weeks 1, 3, 5, 7, ... belong to the test set, and weeks 2, 4, 6, 8, ... belong to the training set. The goal is to predict the category of crime that occurred in the city of San Francisco.

### Data Fields
- **Dates** - timestamp of the crime incident
- **Category** - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- **Descript** - detailed description of the crime incident (only in train.csv)
- **DayOfWeek** - the day of the week
- **Precinct** - name of the Police Department District
- **Resolution** - how the crime incident was resolved (only in train.csv)
- **Address** - the approximate street address of the crime incident 
- **X** - Longitude
- **Y** - Latitude


In this Jupyter notebook, I will go through the whole end-to-end process of creating a machine learning model on the open source San Francisco Crime dataset. This includes data exploration and analysis, data preprocessing (a large part of this project, including feature engineering), trying out different ML algorithms to determine the best model, tuning the hyperparameters of that model, and finally evaluating the chosen model in terms of multiclass log loss.

Since this is an old Kaggle competition, I will refrain from looking online for resources or old Kaggle kernels. The plan is to get better at coding an end-to-end data science project and to familiarize myself with the Python data science libraries. I also hope to learn some interesting things and discover some cool patterns or ideas using this dataset. Well, here goes nothing!



## Import libraries

In [1]:
# # Count number of observations for each day of week
# train_df['DayOfWeek'].value_counts()
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import xgboost as xgb

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

# Metrics 
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

# Model Selection & Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from skopt import BayesSearchCV
from skopt.space  import Real, Categorical, Integer


# Clustering
from sklearn.cluster import KMeans

# Mathematical Functions
import math

import warnings
warnings.filterwarnings('ignore')

jobs=1000

## Load data

In [2]:
#train_df.columns.values

df_train = pd.read_csv('incidents.csv', index_col=0)

df_train.drop(columns=['Time', 'Date', 'Month_Name'], inplace=True)
#df_train.rename(columns={'Lat':'Y', 'Long':'X'}, inplace=True)

df_train['X'] = df_train['Long']
df_train['Y'] = df_train['Lat']

start_date = '2018-01-01'
end_date = '2020-12-31'
mask = (df_train['Dates'] > start_date) & (df_train['Dates'] <= end_date)
df_train = df_train.loc[mask]
df_train

df_train.head()


# Hold out 25% of the rows as a test set (fixed seed for reproducibility)
train_df, test_df = train_test_split(df_train, test_size=0.25, random_state=0)
# train,test=train_test_split(df_train,test_size=0.3,random_state=0,stratify=df_train['Survived'])
# train_X=train[train.columns[1:]]
# train_y=train[train.columns[:1]]
# test_X=test[test.columns[1:]]
# test_y=test[test.columns[:1]]
# X=df_train[df_train.columns[1:]]
# y=df_train['Survived']

#train_df = train #pd.read_csv('train.csv')
#test_df = test#pd.read_csv('test.csv')
#train_df = pd.read_csv('train.csv')
#test_df = pd.read_csv('test.csv')
#df.rename(columns={'ReportedDateTime':'Dates', 'Offense':'Category', 'DoW':'DayOfWeek'}, inplace=True)


FileNotFoundError: [Errno 2] No such file or directory: 'incidents.csv'

In [None]:
# Count number of observations for each day of week
train_df['DayOfWeek'].value_counts()


Friday       6620
Saturday     6290
Monday       6231
Thursday     6205
Wednesday    6101
Tuesday      6035
Sunday       6022
Name: DayOfWeek, dtype: int64

# Data Exploration & Analysis Extension

- Complete data exploration & visualizations are located in jupyter notebook: [kaggle-sf-crime-data-exploration.ipynb](kaggle-sf-crime-data-exploration.ipynb)
- This dataset suffers from **imbalanced classes** (in the training set, **MURDER** has 86 occurrences while **LARCENY** has 23,537 occurrences)
    - There are a couple ways to deal with imbalanced classes, such as:
        - Changing performance metric (Do not use accuracy, use a confusion matrix, precision, recall, F1 score, ROC curves)
        - Resample dataset (Oversample under-represented classes, and undersample over-represented classes)
        - Try different ML algorithms that can handle imbalanced classes
            - Decision Trees (Random Forests/XGBoost) often perform well on imbalanced classes (due to splitting rules)
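
As a rough illustration of one more option, here is a minimal sketch that derives per-class weights inversely proportional to class frequency, using scikit-learn's `compute_class_weight`. The helper name `balanced_class_weights` and the toy labels are assumptions made for this example; they are not part of the original notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def balanced_class_weights(labels):
    """Return {class: weight}, where weight = n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels)
    return dict(zip(classes, weights))

# Toy labels (hypothetical): the rare class receives the larger weight
toy_weights = balanced_class_weights(['LARCENY'] * 3 + ['MURDER'])
print(toy_weights)
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter, which is one way to try the "different algorithms" option without resampling the data.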

In [None]:
train_df.head(8)

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Offense,Description,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,Category,X,Y
11380,2020-06-20 14:02:00,0001XX GRANT ST W,44.969863,-93.28008,LORING PARK,DOWNTOWN,THEFT,OTHER THEFT,7,2020,6,5,Saturday,14,LARCENY,-93.28008,44.969863
8731,2018-11-12 14:08:00,0018XX CHICAGO AVE,44.964567,-93.262554,VENTURA VILLAGE,SOUTHEAST,TBLDG,THEFT FROM BUILDING,7,2018,11,0,Monday,14,LARCENY,-93.262554,44.964567
3283,2018-07-18 11:45:00,0014XX NICOLLET AVE,44.968424,-93.277817,LORING PARK,DOWNTOWN,PETIT,OBS - PETTY THEFT,7,2018,7,2,Wednesday,11,LARCENY,-93.277817,44.968424
16375,2020-07-26 06:00:00,0009XX 28TH AVE NE,45.018586,-93.24642,AUDUBON PARK,NORTHEAST,TFMV,THEFT-MOTR VEH PARTS,7,2020,7,6,Sunday,6,LARCENY,-93.24642,45.018586
19965,2020-09-19 23:00:00,0026XX LYNDALE AVE S,44.954638,-93.288048,LOWRY HILL EAST,SOUTHWEST,BIKETF,BIKE THEFT,7,2020,9,5,Saturday,23,LARCENY,-93.288048,44.954638
14634,2020-06-04 11:15:00,0028XX STEVENS AVE,44.95015,-93.275367,WHITTIER,SOUTHWEST,BURGB,BURGLARY OF BUSINESS,6,2020,6,3,Thursday,11,BURGLARY,-93.275367,44.95015
7862,2019-05-09 22:00:00,0027XX IRVING AVE S,44.953341,-93.300904,EAST ISLES,SOUTHWEST,GTA,AUTOMOBILE THEFT,8,2019,5,3,Thursday,22,AUTO THEFT,-93.300904,44.953341
1809,2018-06-30 12:00:00,0054XX 34TH AVE S,44.90492,-93.222883,WENONAH,SOUTHEAST,BURGD,BURGLARY OF DWELLING,6,2018,6,5,Saturday,12,BURGLARY,-93.222883,44.90492


In [None]:
# len(train_df[train_df['Holiday'] == True])
train_df.columns.values

array(['Dates', 'Address', 'Lat', 'Long', 'Neighborhood', 'Precinct',
       'Offense', 'Description', 'UCRCode', 'Year', 'Month',
       'DayOfWeek_Num', 'DayOfWeek', 'Hour', 'Category', 'X', 'Y'],
      dtype=object)

In [None]:
# show non-null counts for every column
# (show_counts replaces the deprecated null_counts argument in newer pandas)
train_df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43504 entries, 11380 to 2916
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Dates          43504 non-null  object 
 1   Address        43504 non-null  object 
 2   Lat            43504 non-null  float64
 3   Long           43504 non-null  float64
 4   Neighborhood   43504 non-null  object 
 5   Precinct       43504 non-null  object 
 6   Offense        43504 non-null  object 
 7   Description    43504 non-null  object 
 8   UCRCode        43504 non-null  int64  
 9   Year           43504 non-null  int64  
 10  Month          43504 non-null  int64  
 11  DayOfWeek_Num  43504 non-null  int64  
 12  DayOfWeek      43504 non-null  object 
 13  Hour           43504 non-null  int64  
 14  Category       43504 non-null  object 
 15  X              43504 non-null  float64
 16  Y              43504 non-null  float64
dtypes: float64(4), int64(5), object(8)
memory usage

------------
### Things we learned thus far:

- 43,504 instances in the training set (recorded crime incidents in Minneapolis)
- 17 columns (16 potential features + 1 label (Category))
- Data types:
    - 4 columns with float values
    - 5 columns with integer values
    - 8 objects
- There are no null (NaN) values (Yay!)

In [None]:
## Count number of observations for each crime 
train_df['Category'].value_counts()

LARCENY       23537
BURGLARY       7621
AUTO THEFT     5827
ASSAULT        3560
ROBBERY        1987
RAPE            715
ARSON           171
MURDER           86
Name: Category, dtype: int64

In [None]:
# Count number of observations of crime for each PD District
train_df['Precinct'].value_counts()

SOUTHEAST    12138
SOUTHWEST     9599
DOWNTOWN      7708
NORTH         7539
NORTHEAST     6520
Name: Precinct, dtype: int64

In [None]:
# len(train_df[train_df['Holiday'] == True])
# train_df['DayOfWeek'].value_counts()

In [None]:
# ## Count number of observations for Resolution feature
# train_df['Resolution'].value_counts()

In [None]:
train_df[['X','Y']].describe()

Unnamed: 0,X,Y
count,43504.0,43504.0
mean,-93.263689,44.965419
std,0.632978,0.306617
min,-93.329109,0.0
25%,-93.288874,44.948347
50%,-93.271699,44.965746
75%,-93.247352,44.987328
max,0.0,45.051227


**There seem to be invalid coordinates: the minimum latitude (Y) and the maximum longitude (X) are both 0.0, which is not a valid coordinate in Minneapolis. We must fix these values for this feature.**

# Data Preprocessing

- Data cleaning
    - imputation or removal of outlier values
- Feature Engineering (Feature Creation)
- Feature Encoding
    - **Integer encode** or **label encode** ordinal categorical features that maintain order (Year, Business Quarter, Block/Street Number)
    - Usually: 
        - **One hot encode** nominal categorical features (DayOfWeek, Precinct, StreetType, Category)
            - mainly for logistic regression
        - However, Random Forests & Boosting algorithms can handle nominal categorical features directly, so we just **integer encode** these features.
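
To make the two encodings concrete, here is a small sketch on a toy column. The `demo` frame is made up for illustration; only the precinct names are taken from the data above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical) with one nominal categorical feature
demo = pd.DataFrame({'Precinct': ['DOWNTOWN', 'NORTH', 'DOWNTOWN', 'SOUTHWEST']})

# Integer (label) encoding: one integer per category, assigned in sorted order
encoder = LabelEncoder()
demo['Precinct_int'] = encoder.fit_transform(demo['Precinct'])

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(demo['Precinct'], prefix='Precinct')

print(demo)
print(one_hot)
```

Integer encoding keeps the feature in a single column, which suits tree-based models; one-hot encoding avoids implying an order between categories, which matters more for linear models such as logistic regression.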

In [None]:
train_df['UCRCode'] = train_df['UCRCode'].astype(int)
#test_df['BusinessHour'] = test_df['Dates'].map(map_business_hours).astype('uint8')

## Data Cleaning

- Data removal
- Data imputation

In [None]:
train_df[train_df['Y'] == train_df['Y'].max()]

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Offense,Description,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,Category,X,Y
7253,2018-10-29 23:00:00,0023XX 53RD AVE N,45.051227,-93.311238,SHINGLE CREEK,NORTH,TFMV,THEFT FROM MOTR VEHC,7,2018,10,0,Monday,23,LARCENY,-93.311238,45.051227


The summary statistics above show rows with incorrect coordinates, and they appear to share the same placeholder value (0.0, 0.0). There are several ways to handle this. One is data imputation, which can itself be done several ways. For now, I will randomly sample from a normal distribution within one standard deviation of the mean. Alternatively, I could use a linear regression model to predict the latitude and longitude values (based on other variables such as the precinct) and use that to impute the bad or inconsistent data points.

Another method is to completely remove this data. Since I already have a lot of data, and I do not want this incorrect data to affect my results, I could remove them. However, I will stick with data imputation.
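
A minimal sketch of the imputation idea, assuming the invalid rows are those whose X or Y equals 0.0 (as the `describe()` output suggests). The helper name `impute_zero_coordinates`, the clipping to one standard deviation, and the toy frame are my own assumptions for illustration, not the code this notebook actually ran.

```python
import numpy as np
import pandas as pd

def impute_zero_coordinates(df, seed=0):
    """Replace 0.0 placeholder coordinates with normal draws fitted to the
    valid rows, clipped to one standard deviation around the mean."""
    out = df.copy()
    bad = (out['X'] == 0) | (out['Y'] == 0)
    rng = np.random.default_rng(seed)
    for col in ['X', 'Y']:
        valid = out.loc[~bad, col]
        mean, std = valid.mean(), valid.std()
        draws = rng.normal(mean, std, size=int(bad.sum()))
        # Keep imputed values within [mean - std, mean + std]
        out.loc[bad, col] = np.clip(draws, mean - std, mean + std)
    return out

# Toy frame (hypothetical values) with one placeholder row
demo_xy = pd.DataFrame({'X': [-93.28, -93.26, -93.27, -93.25, 0.0],
                        'Y': [44.97, 44.96, 44.95, 44.98, 0.0]})
fixed_xy = impute_zero_coordinates(demo_xy)
print(fixed_xy)
```

Computing the mean and standard deviation from the valid rows only matters here: including the 0.0 placeholders would inflate the spread and pull the imputed points away from the city.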

In [None]:
len(train_df)
# train_df['X'].replace(to_replace= train_df['X'].max() ,value=np.nan, inplace=True)
# test_df['Y'].replace(to_replace= test_df['Y'].max() ,value=np.nan, inplace=True)
# test_df['X'].replace(to_replace= test_df['X'].max() ,value=np.nan, inplace=True)
train_df.dropna(inplace=True)

In [None]:
train_df.isnull().sum()

Dates            0
Address          0
Lat              0
Long             0
Neighborhood     0
Precinct         0
Offense          0
Description      0
UCRCode          0
Year             0
Month            0
DayOfWeek_Num    0
DayOfWeek        0
Hour             0
Category         0
X                0
Y                0
dtype: int64

In [None]:
test_df.isnull().sum()

Dates            0
Address          0
Lat              0
Long             0
Neighborhood     0
Precinct         0
Offense          0
Description      0
UCRCode          0
Year             0
Month            0
DayOfWeek_Num    0
DayOfWeek        0
Hour             0
Category         0
X                0
Y                0
dtype: int64

In [None]:
train_df.dropna(inplace=True)
test_df.dropna(inplace=False)

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Offense,Description,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,Category,X,Y
22106,2019-01-02 16:00:00,0009XX 4TH ST N,44.987210,-93.280944,NORTH LOOP,DOWNTOWN,CSCR,CSC - RAPE,3,2019,1,2,Wednesday,16,RAPE,-93.280944,44.987210
3396,2018-07-31 00:00:00,0011XX HENNEPIN AVE,44.975034,-93.279723,DOWNTOWN WEST,DOWNTOWN,ROBPAG,ROBBERY PER AGG,4,2018,7,1,Tuesday,0,ROBBERY,-93.279723,44.975034
16558,2019-05-28 19:15:00,0013XX 4TH ST SE,44.980767,-93.236473,MARCY HOLMES,NORTHEAST,THEFT,OTHER THEFT,7,2019,5,1,Tuesday,19,LARCENY,-93.236473,44.980767
14724,2020-07-03 21:00:00,0008XX 5TH AVE N,44.982787,-93.290185,SUMNER - GLENWOOD,NORTH,TFMV,THEFT FROM MOTR VEHC,7,2020,7,4,Friday,21,LARCENY,-93.290185,44.982787
21138,2019-02-27 06:50:00,0029XX 46TH AVE S,44.950453,-93.207583,COOPER,SOUTHEAST,BURGD,BURGLARY OF DWELLING,6,2019,2,2,Wednesday,6,BURGLARY,-93.207583,44.950453
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5009,2018-05-13 21:30:00,0006XX 16 ST E,44.967168,-93.266383,ELLIOT PARK,DOWNTOWN,TFMV,Theft From Motr Vehc,7,2018,5,6,Sunday,21,LARCENY,-93.266383,44.967168
5469,2018-09-30 19:49:00,0029XX PORTLAND AVE,44.949214,-93.267751,PHILLIPS WEST,SOUTHEAST,ASLT1,ASLT-GREAT BODILY HM,5,2018,9,6,Sunday,19,ASSAULT,-93.267751,44.949214
1922,2018-03-11 23:58:59,0007XX Penn AV N,44.985310,-93.308292,NEAR - NORTH,NORTH,GTA,Motor Vehicle Theft,8,2018,3,6,Sunday,23,AUTO THEFT,-93.308292,44.985310
12388,2020-10-16 03:55:00,0033XX HIAWATHA AVE,44.941249,-93.233428,LONGFELLOW,SOUTHEAST,GTA,AUTOMOBILE THEFT,8,2020,10,4,Friday,3,AUTO THEFT,-93.233428,44.941249


In [None]:
# data = [train_df, test_df]

# for dataset in data:
#     mean_X = dataset["X"].mean()
#     std_X = dataset["X"].std()
#     mean_Y = dataset["Y"].mean()
#     std_Y = dataset["Y"].std()
#     max_X = mean_X + std_X
#     min_X = mean_X - std_X
#     max_Y = mean_Y + std_Y
#     min_Y = mean_Y - std_Y

#     # Both X and Y will have the same null so just use Y
#     is_null = dataset['Y'].isnull().sum()
#     # randomly sample float numbers within a range from a uniform distribution
# #     random_X = (max_X - min_X) * np.random.random_sample(size = is_null) + min_X
# #     random_Y = (max_Y - min_Y) * np.random.random_sample(size = is_null) + min_Y
#     # randomly sample float numbers within a range from a normal distribution
#     random_X = (max_X - min_X) * np.random.randn(is_null) + min_X
#     random_Y = (max_Y - min_Y) * np.random.randn(is_null) + min_Y

#     X_slice = dataset['X'].copy()
#     Y_slice = dataset['Y'].copy()
#     X_slice[np.isnan(X_slice)] = random_X
#     Y_slice[np.isnan(Y_slice)] = random_Y
#     dataset['X'] = X_slice
#     dataset['Y'] = Y_slice


In [None]:
train_df[['X', 'Y']].describe()

Unnamed: 0,X,Y
count,43504.0,43504.0
mean,-93.263689,44.965419
std,0.632978,0.306617
min,-93.329109,0.0
25%,-93.288874,44.948347
50%,-93.271699,44.965746
75%,-93.247352,44.987328
max,0.0,45.051227


In [None]:
len(train_df)

43504

In [None]:
test_df[['X', 'Y']].describe()

Unnamed: 0,X,Y
count,14502.0,14502.0
mean,-93.267373,44.967446
std,0.027491,0.032305
min,-93.329109,44.890627
25%,-93.288308,44.948347
50%,-93.270298,44.965938
75%,-93.247305,44.98721
max,-93.19915,45.051227


In [None]:
len(test_df)

14502

# Feature Engineering

- Let's create some new features from the data that exists in the current feature space
- There are a couple categories of features:
    - Temporal features
    - Spatial features

## Temporal Features
We want to have a column for Time, so we must parse through the 'Dates' feature to create the 'Time' feature


In [None]:
# Transform the Date into a python datetime object.
train_df["Dates"] = pd.to_datetime(train_df["Dates"], format="%Y-%m-%d %H:%M:%S")
test_df["Dates"] = pd.to_datetime(test_df["Dates"], format="%Y-%m-%d %H:%M:%S")

In [None]:
# Minute
train_df["Minute"] = train_df["Dates"].map(lambda x: x.minute)
test_df["Minute"] = test_df["Dates"].map(lambda x: x.minute)

In [None]:
# Hour
train_df["Hour"] = train_df["Dates"].map(lambda x: x.hour)
test_df["Hour"] = test_df["Dates"].map(lambda x: x.hour)

In [None]:
# Day
train_df["Day"] = train_df["Dates"].map(lambda x: x.day)
test_df["Day"] = test_df["Dates"].map(lambda x: x.day)

In [None]:
# Month
train_df["Month"] = train_df["Dates"].map(lambda x: x.month)
test_df["Month"] = test_df["Dates"].map(lambda x: x.month)

In [None]:
# Year
train_df["Year"] = train_df["Dates"].map(lambda x: x.year)
test_df["Year"] = test_df["Dates"].map(lambda x: x.year)

In [None]:
# Hour Zone 0 - Past midnight, 1 - morning, 2 - afternoon, 3 - dinner / sunset, 4 - night
def get_hour_zone(hour):
    if hour >= 2 and hour < 8: 
        return 0
    elif hour >= 8 and hour < 12: 
        return 1
    elif hour >= 12 and hour < 18: 
        return 2
    elif hour >= 18 and hour < 22: 
        return 3
    elif hour < 2 or hour >= 22: 
        return 4
    
train_df["Hour_Zone"] = train_df["Hour"].map(get_hour_zone)
test_df["Hour_Zone"] = test_df["Hour"].map(get_hour_zone)

In [None]:
# Add Week of Year
train_df["WeekOfYear"] = train_df["Dates"].map(lambda x: int(x.weekofyear / 2) - 1)
test_df["WeekOfYear"] = test_df["Dates"].map(lambda x: int(x.weekofyear / 2))

print(sorted(train_df['WeekOfYear'].unique()))
print(sorted(test_df['WeekOfYear'].unique()))

[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
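
Since `train_df` and `test_df` here come from a random `train_test_split`, one option is to bin the ISO week the same way for both sets so the feature means the same thing in each. The sketch below is an assumption on my part (the helper name `biweekly_bin` is hypothetical), not the mapping used in the cells above.

```python
from datetime import datetime

def biweekly_bin(timestamp):
    """Map a datetime (or pandas Timestamp) to a two-week bin of its ISO week."""
    iso_week = timestamp.isocalendar()[1]
    return iso_week // 2

# ISO week 25 of 2020 falls into bin 12
example_bin = biweekly_bin(datetime(2020, 6, 20, 14, 2))
print(example_bin)
```

It could be applied with `train_df["Dates"].map(biweekly_bin)` and the identical call on `test_df`.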


In [None]:
train_df.head()

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Offense,Description,UCRCode,Year,...,DayOfWeek_Num,DayOfWeek,Hour,Category,X,Y,Minute,Day,Hour_Zone,WeekOfYear
11380,2020-06-20 14:02:00,0001XX GRANT ST W,44.969863,-93.28008,LORING PARK,DOWNTOWN,THEFT,OTHER THEFT,7,2020,...,5,Saturday,14,LARCENY,-93.28008,44.969863,2,20,2,11
8731,2018-11-12 14:08:00,0018XX CHICAGO AVE,44.964567,-93.262554,VENTURA VILLAGE,SOUTHEAST,TBLDG,THEFT FROM BUILDING,7,2018,...,0,Monday,14,LARCENY,-93.262554,44.964567,8,12,2,22
3283,2018-07-18 11:45:00,0014XX NICOLLET AVE,44.968424,-93.277817,LORING PARK,DOWNTOWN,PETIT,OBS - PETTY THEFT,7,2018,...,2,Wednesday,11,LARCENY,-93.277817,44.968424,45,18,1,13
16375,2020-07-26 06:00:00,0009XX 28TH AVE NE,45.018586,-93.24642,AUDUBON PARK,NORTHEAST,TFMV,THEFT-MOTR VEH PARTS,7,2020,...,6,Sunday,6,LARCENY,-93.24642,45.018586,0,26,0,14
19965,2020-09-19 23:00:00,0026XX LYNDALE AVE S,44.954638,-93.288048,LOWRY HILL EAST,SOUTHWEST,BIKETF,BIKE THEFT,7,2020,...,5,Saturday,23,LARCENY,-93.288048,44.954638,0,19,4,18


### Holiday Feature

- Certain crimes may be more common on holidays

In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

# Training set
cal = calendar()
holidays = cal.holidays(start=train_df['Dates'].min(), end=train_df['Dates'].max())
# Normalize timestamps to midnight so they can be compared with the holiday dates
train_df['Holiday'] = train_df['Dates'].dt.normalize().isin(holidays)

In [None]:
# Test set
cal = calendar()
holidays = cal.holidays(start=test_df['Dates'].min(), end=test_df['Dates'].max())
# Normalize timestamps to midnight so they can be compared with the holiday dates
test_df['Holiday'] = test_df['Dates'].dt.normalize().isin(holidays)

In [None]:
# train_df.head(10)


In [None]:
# train_df.head(10)


### Business Hours Feature

- There should be an effect of business hours on the type of crime committed
- Let's create a binary feature where:
    - 1 is typical business hours [8:00AM - 6:00PM]
    - 0 is not business hours [6:01PM - 7:59 AM]

In [None]:
from datetime import datetime, time

def time_in_range(start, end, x):
    """Return true if x is in the inclusive range [start, end]"""
    if start <= end:
        return start <= x <= end
    else:
        return start <= x or x <= end

def map_business_hours(date):
    
    # Extract the time of day from the timestamp
    time_parsed = date.time()
    business_start = time(8, 0, 0)
    business_end = time(18, 0, 0)
    
    if time_in_range(business_start, business_end, time_parsed):
        return 1
    else:
        return 0
    
train_df['BusinessHour'] = train_df['Dates'].map(map_business_hours).astype('uint8')
test_df['BusinessHour'] = test_df['Dates'].map(map_business_hours).astype('uint8')

In [None]:
train_df['BusinessHour'].value_counts()

0    24056
1    19448
Name: BusinessHour, dtype: int64

In [None]:
# train_df.head(10)

### Business Quarter (Removed)

- Business Quarter might have an effect on what types of crimes are committed
- Q1 = 1 (Jan. - March) Q2 = 2 (April - June), Q3 = 3 (July - Sept.), Q4 = 4 (Oct. - Dec.)

In [None]:
# def map_business_quarter(month):
    
#     if month in [1, 2, 3]:
# #         print(time_parsed)
#         return 1
#     elif month in [4, 5, 6]:
#         return 2
#     elif month in [7, 8, 9]:
#         return 3
#     elif month in [10, 11, 12]:
#         return 4
    
# train_df['Quarter'] = train_df['Month'].map(map_business_quarter)
# test_df['Quarter'] = test_df['Month'].map(map_business_quarter)

In [None]:
# train_df.head(10)

In [None]:
train_df.head(8)

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Offense,Description,UCRCode,Year,...,Hour,Category,X,Y,Minute,Day,Hour_Zone,WeekOfYear,Holiday,BusinessHour
11380,2020-06-20 14:02:00,0001XX GRANT ST W,44.969863,-93.28008,LORING PARK,DOWNTOWN,THEFT,OTHER THEFT,7,2020,...,14,LARCENY,-93.28008,44.969863,2,20,2,11,False,1
8731,2018-11-12 14:08:00,0018XX CHICAGO AVE,44.964567,-93.262554,VENTURA VILLAGE,SOUTHEAST,TBLDG,THEFT FROM BUILDING,7,2018,...,14,LARCENY,-93.262554,44.964567,8,12,2,22,True,1
3283,2018-07-18 11:45:00,0014XX NICOLLET AVE,44.968424,-93.277817,LORING PARK,DOWNTOWN,PETIT,OBS - PETTY THEFT,7,2018,...,11,LARCENY,-93.277817,44.968424,45,18,1,13,False,1
16375,2020-07-26 06:00:00,0009XX 28TH AVE NE,45.018586,-93.24642,AUDUBON PARK,NORTHEAST,TFMV,THEFT-MOTR VEH PARTS,7,2020,...,6,LARCENY,-93.24642,45.018586,0,26,0,14,False,0
19965,2020-09-19 23:00:00,0026XX LYNDALE AVE S,44.954638,-93.288048,LOWRY HILL EAST,SOUTHWEST,BIKETF,BIKE THEFT,7,2020,...,23,LARCENY,-93.288048,44.954638,0,19,4,18,False,0
14634,2020-06-04 11:15:00,0028XX STEVENS AVE,44.95015,-93.275367,WHITTIER,SOUTHWEST,BURGB,BURGLARY OF BUSINESS,6,2020,...,11,BURGLARY,-93.275367,44.95015,15,4,1,10,False,1
7862,2019-05-09 22:00:00,0027XX IRVING AVE S,44.953341,-93.300904,EAST ISLES,SOUTHWEST,GTA,AUTOMOBILE THEFT,8,2019,...,22,AUTO THEFT,-93.300904,44.953341,0,9,4,8,False,0
1809,2018-06-30 12:00:00,0054XX 34TH AVE S,44.90492,-93.222883,WENONAH,SOUTHEAST,BURGD,BURGLARY OF DWELLING,6,2018,...,12,BURGLARY,-93.222883,44.90492,0,30,2,12,False,1


### Season

The season may affect what types of crimes are committed.
- 1 = Winter, 2 = Spring, 3 = Summer, 4 = Fall

In [None]:
train_df['Season']=(train_df['Month']%12 + 3)//3
test_df['Season']=(test_df['Month']%12 + 3)//3
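
To see what the arithmetic above does, here is a small worked example that applies the same formula to every month. The function name `month_to_season` is just for illustration.

```python
def month_to_season(month):
    """Same formula as above: 1 = Winter, 2 = Spring, 3 = Summer, 4 = Fall."""
    # month % 12 turns December into 0, so Dec, Jan, Feb land in the first bucket of three
    return (month % 12 + 3) // 3

season_by_month = {month: month_to_season(month) for month in range(1, 13)}
print(season_by_month)
```

December, January and February map to 1 (Winter), March through May to 2, June through August to 3, and September through November to 4.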

In [None]:
train_df.head()

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Offense,Description,UCRCode,Year,...,Category,X,Y,Minute,Day,Hour_Zone,WeekOfYear,Holiday,BusinessHour,Season
11380,2020-06-20 14:02:00,0001XX GRANT ST W,44.969863,-93.28008,LORING PARK,DOWNTOWN,THEFT,OTHER THEFT,7,2020,...,LARCENY,-93.28008,44.969863,2,20,2,11,False,1,3
8731,2018-11-12 14:08:00,0018XX CHICAGO AVE,44.964567,-93.262554,VENTURA VILLAGE,SOUTHEAST,TBLDG,THEFT FROM BUILDING,7,2018,...,LARCENY,-93.262554,44.964567,8,12,2,22,True,1,4
3283,2018-07-18 11:45:00,0014XX NICOLLET AVE,44.968424,-93.277817,LORING PARK,DOWNTOWN,PETIT,OBS - PETTY THEFT,7,2018,...,LARCENY,-93.277817,44.968424,45,18,1,13,False,1,3
16375,2020-07-26 06:00:00,0009XX 28TH AVE NE,45.018586,-93.24642,AUDUBON PARK,NORTHEAST,TFMV,THEFT-MOTR VEH PARTS,7,2020,...,LARCENY,-93.24642,45.018586,0,26,0,14,False,0,3
19965,2020-09-19 23:00:00,0026XX LYNDALE AVE S,44.954638,-93.288048,LOWRY HILL EAST,SOUTHWEST,BIKETF,BIKE THEFT,7,2020,...,LARCENY,-93.288048,44.954638,0,19,4,18,False,0,4


### Weekend

- Weekends may have an effect on what types of crimes are committed
- Weekday = 0, Weekend = 1

In [None]:
# Weekend Feature

# Weekday = 0, Weekend = 1
days = {'Monday':0 ,'Tuesday':0 ,'Wednesday':0 ,'Thursday':0 ,'Friday':0, 'Saturday':1 ,'Sunday':1}

train_df['Weekend'] = train_df['DayOfWeek'].replace(days).astype(int)
test_df['Weekend'] = test_df['DayOfWeek'].replace(days).astype(int)

## Spatial Features

### Street Type

The street type can have an effect on what type of crime is committed, so we want to extract the street type from the 'Address' feature.

We have avenues, streets, ways, boulevards, highways, courts, walks, plazas, and intersections of roads/streets (addresses containing `/`).

In [None]:
train_df['Address'].value_counts().index

Index(['0025XX LAKE ST E', '0030XX HENNEPIN AVE ', '0009XX NICOLLET MALL  ',
       '0028XX 26TH AVE S', '0004XX HENNEPIN AVE ', '00001X LAKE ST W',
       '0015XX NEW BRIGHTON BLVD ', '0006XX NICOLLET MALL  ',
       '0007XX BROADWAY AVE W', '0004XX 1ST AVE N',
       ...
       '00001X 4 ST S', '0023XX HAYES ST NE', '0002XX 19 ST E',
       'Lake ST E / Park AV S', '0041XX 45TH AVE S', '0001XX 32ND ST W',
       '00005X 5TH ST S', '0023XX 5TH AVE S', '0042XX ABBOTT AVE S',
       '0027XX Arthur ST NE'],
      dtype='object', length=9519)

In [None]:
import re
    
def find_streets(address):
    street_types = ['AV', 'ST', 'CT', 'PZ', 'LN', 'DR', 'PL', 'HY', 
                    'FY', 'WY', 'TR', 'RD', 'BL', 'WAY', 'CR', 'AL', 'I-80',  
                    'RW', 'WK','EL CAMINO DEL MAR']
    street_pattern = '|'.join(street_types)
    streets = re.findall(street_pattern, address)
    if len(streets) == 0:
        # Debug
#         print(address)
        return 'OTHER'
    elif len(streets) == 1:
        return streets[0]
    else:
#         print(address)
        return 'INT'

train_df['StreetType'] = train_df['Address'].map(find_streets)
test_df['StreetType'] = test_df['Address'].map(find_streets)
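
One thing to keep in mind: `re.findall` with a bare alternation matches substrings, so `ST` also matches inside a name such as `STEVENS`, and such an address is then counted as an intersection. Below is a hedged sketch of a word-boundary variant. The suffix list `STREET_TYPES` and the name `find_street_type` are assumptions inferred from the sample addresses shown above; this is not the function used to produce the counts in this notebook.

```python
import re

# Hypothetical suffix list, inferred from the sample addresses above
STREET_TYPES = ['AVE', 'AV', 'ST', 'BLVD', 'MALL', 'DR', 'PL',
                'RD', 'LN', 'CT', 'WAY', 'PKWY']

# \b anchors each suffix to a whole word, so 'ST' no longer matches inside 'STEVENS'
STREET_RE = re.compile(r'\b(' + '|'.join(STREET_TYPES) + r')\b', flags=re.IGNORECASE)

def find_street_type(address):
    """Return 'INT' for intersections, a street suffix, or 'OTHER'."""
    if '/' in address:
        return 'INT'
    matches = STREET_RE.findall(address)
    if not matches:
        return 'OTHER'
    # Use the last whole-word suffix and fold 'AVE' into 'AV'
    suffix = matches[-1].upper()
    return 'AV' if suffix == 'AVE' else suffix

for sample in ['0028XX STEVENS AVE', '0025XX LAKE ST E', 'Lake ST E / Park AV S']:
    print(sample, '->', find_street_type(sample))
```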


In [None]:
train_df['StreetType'].value_counts()

AV       24646
ST        9598
INT       7394
AL         626
BL         462
PL         211
OTHER      167
WY         134
DR         110
RD          73
LN          39
CT          32
WAY          6
HY           6
Name: StreetType, dtype: int64

In [None]:
train_df.isnull().sum()
train_df['StreetType'].isnull().sum()

0

In [None]:
train_df.isnull().sum()


Dates            0
Address          0
Lat              0
Long             0
Neighborhood     0
Precinct         0
Offense          0
Description      0
UCRCode          0
Year             0
Month            0
DayOfWeek_Num    0
DayOfWeek        0
Hour             0
Category         0
X                0
Y                0
Minute           0
Day              0
Hour_Zone        0
WeekOfYear       0
Holiday          0
BusinessHour     0
Season           0
Weekend          0
StreetType       0
dtype: int64

## Block Features (Removed)

- Let's explore and create the block feature, since we saw it a lot in the address features
- Binary feature
    - Categorize address that contains 'Block', as having a block, and if no block exists, we will assign to 0.
- 617231 addresses with blocks
- 260818 addresses with no blocks

In [None]:
# def find_block(address):
test_df.dropna(inplace=False)
#     blocks = re.search(block_pattern, address)
#     if blocks:
# #         print(address)
#         return 1
#     else:
# #         print(address)
#         return 0


# train_df['Block'] = train_df['Address'].map(find_block)
# test_df['Block'] = test_df['Address'].map(find_block)

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Offense,Description,UCRCode,Year,...,Y,Minute,Day,Hour_Zone,WeekOfYear,Holiday,BusinessHour,Season,Weekend,StreetType
22106,2019-01-02 16:00:00,0009XX 4TH ST N,44.987210,-93.280944,NORTH LOOP,DOWNTOWN,CSCR,CSC - RAPE,3,2019,...,44.987210,0,2,2,0,False,1,1,0,ST
3396,2018-07-31 00:00:00,0011XX HENNEPIN AVE,44.975034,-93.279723,DOWNTOWN WEST,DOWNTOWN,ROBPAG,ROBBERY PER AGG,4,2018,...,44.975034,0,31,4,15,False,0,3,0,AV
16558,2019-05-28 19:15:00,0013XX 4TH ST SE,44.980767,-93.236473,MARCY HOLMES,NORTHEAST,THEFT,OTHER THEFT,7,2019,...,44.980767,15,28,3,11,False,0,2,0,ST
14724,2020-07-03 21:00:00,0008XX 5TH AVE N,44.982787,-93.290185,SUMNER - GLENWOOD,NORTH,TFMV,THEFT FROM MOTR VEHC,7,2020,...,44.982787,0,3,3,13,True,0,3,0,AV
21138,2019-02-27 06:50:00,0029XX 46TH AVE S,44.950453,-93.207583,COOPER,SOUTHEAST,BURGD,BURGLARY OF DWELLING,6,2019,...,44.950453,50,27,0,4,False,0,1,0,AV
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5009,2018-05-13 21:30:00,0006XX 16 ST E,44.967168,-93.266383,ELLIOT PARK,DOWNTOWN,TFMV,Theft From Motr Vehc,7,2018,...,44.967168,30,13,3,9,False,0,2,1,ST
5469,2018-09-30 19:49:00,0029XX PORTLAND AVE,44.949214,-93.267751,PHILLIPS WEST,SOUTHEAST,ASLT1,ASLT-GREAT BODILY HM,5,2018,...,44.949214,49,30,3,19,False,0,4,1,AV
1922,2018-03-11 23:58:59,0007XX Penn AV N,44.985310,-93.308292,NEAR - NORTH,NORTH,GTA,Motor Vehicle Theft,8,2018,...,44.985310,58,11,4,5,False,0,2,1,AV
12388,2020-10-16 03:55:00,0033XX HIAWATHA AVE,44.941249,-93.233428,LONGFELLOW,SOUTHEAST,GTA,AUTOMOBILE THEFT,8,2020,...,44.941249,55,16,0,21,False,0,4,0,AV


In [None]:
# train_df['Block'].value_counts()

## Block Number Feature

- Let's explore the block number from address
- Block number has ordinal data type (order matters), and has spatial significance
- It seems all the block numbers are in intervals of 100
- How to categorize
    - Addresses that do not have a block number will be categorized as 0
    - Addresses with a block number will be divided by 100 and incremented by 1 for mapping (0 is reserved for addresses with no block number)
- 85 unique block numbers (including 1 where there is no block number)

In [None]:
def find_block_number(address):
    # Raw string so that \s is passed to the regex engine unchanged
    block_num_pattern = r'[0-9]+\s[Block]'
    block_num = re.search(block_num_pattern, address)
    if block_num:
#         print(address)
        num_pattern = '[0-9]+'
        block_no_pos = re.search(num_pattern, address)
        # Get integer of found regular expression
        block_no = int(block_no_pos.group())
        # Convert block number by dividing by 100 and adding 1 (0 = addresses with no block)
        block_map = (block_no // 100) + 1
#         print(block_map)
        return block_map
    else:
#         print(address)
        # 
        return 0


train_df['BlockNo'] = train_df['Address'].map(find_block_number)
test_df['BlockNo'] = test_df['Address'].map(find_block_number)

In [None]:
train_df['BlockNo'].value_counts()

0    43504
Name: BlockNo, dtype: int64
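
Since every address above maps to 0, a pattern tailored to this data's address format may be more useful. The sketch below assumes addresses begin with a masked hundred-block such as `0011XX` (as in `0011XX HENNEPIN AVE`). The function name `find_masked_block_number` and the regular expression are illustrative assumptions, not part of the original notebook.

```python
import re

# Assumption: the address starts with digits followed by the literal "XX",
# which masks the last two digits of the street number (e.g. "0011XX").
MASKED_BLOCK = re.compile(r'^\s*(\d+)XX\b')


def find_masked_block_number(address):
    """Return (hundred-block index + 1), or 0 if no masked block is found."""
    match = MASKED_BLOCK.match(address)
    if match is None:
        # 0 is reserved for addresses with no block number
        return 0
    # "0011XX" means the 1100 block, so the leading digits are already
    # the block number divided by 100
    return int(match.group(1)) + 1


print(find_masked_block_number("0011XX HENNEPIN AVE"))
print(find_masked_block_number("HENNEPIN AVE / 5TH ST S"))
```

If this were adopted, it could be applied with `train_df['Address'].map(find_masked_block_number)` in the same way as the original function.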

## X, Y Coordinates

- Normalize and scale the X and Y coordinates
- I use **K-Means clustering** to create a new feature for the longitude and latitude by grouping clusters of points based on Euclidean distances.
- X = longitude, Y = latitude
- I also extract more spatial features from the X, Y coordinates by rotating them and by transforming them from Cartesian to polar space ([Reference](https://www.kaggle.com/c/sf-crime/discussion/18853))
    1. Three variants of rotated Cartesian coordinates (rotated by 30, 45, and 60 degrees)
    2. Polar coordinates (the radius `r` and the angle `theta`)
- The intuition is that these features should help a model extract more spatial information than the raw X, Y values alone

In [None]:
# Normalize X and Y
print('There are %d unique longitude values, %d unique latitude values' % (train_df['X'].nunique(), 
                                                                           train_df['Y'].nunique()))

xy_scaler = StandardScaler().fit(train_df[['X', 'Y']])
train_df[['X', 'Y']] = xy_scaler.transform(train_df[['X', 'Y']])
test_df[['X', 'Y']] = xy_scaler.transform(test_df[['X', 'Y']])

There are 7058 unique longitude values, 7022 unique latitude values


In [None]:
# X-Y plane rotation and space transformation to extract more spatial information
# 2-dimensional rotation by an angle t:
#   rotated x = x*cos(t) - y*sin(t)
#   rotated y = x*sin(t) + y*cos(t)
# Convert from Cartesian space -> polar space (Radius, Angle)
# Note: Angle is computed as arctan2(X, Y), so it is measured from the +Y axis

cos_30 = math.cos(math.radians(30))
sin_30 = math.sin(math.radians(30))
cos_45 = math.cos(math.radians(45))
sin_45 = math.sin(math.radians(45))
cos_60 = math.cos(math.radians(60))
sin_60 = math.sin(math.radians(60))


train_df["Rot30_X"] = train_df['X'] * cos_30 - train_df['Y'] * sin_30 
train_df["Rot30_Y"] = train_df['X'] * sin_30 + train_df['Y'] * cos_30
train_df["Rot45_X"] = train_df['X'] * cos_45 - train_df['Y'] * sin_45  
train_df["Rot45_Y"] = train_df['X'] * sin_45 + train_df['Y'] * cos_45
train_df["Rot60_X"] = train_df['X'] * cos_60 - train_df['Y'] * sin_60  
train_df["Rot60_Y"] = train_df['X'] * sin_60 + train_df['Y'] * cos_60
train_df["Radius"] = np.sqrt(train_df['X'] ** 2 + train_df['Y'] ** 2)
train_df["Angle"] = np.arctan2(train_df['X'], train_df['Y'])

test_df["Rot30_X"] = test_df['X'] * cos_30 - test_df['Y'] * sin_30  
test_df["Rot30_Y"] = test_df['X'] * sin_30 + test_df['Y'] * cos_30
test_df["Rot45_X"] = test_df['X'] * cos_45 - test_df['Y'] * sin_45  
test_df["Rot45_Y"] = test_df['X'] * sin_45 + test_df['Y'] * cos_45
test_df["Rot60_X"] = test_df['X'] * cos_60 - test_df['Y'] * sin_60  
test_df["Rot60_Y"] = test_df['X'] * sin_60 + test_df['Y'] * cos_60
test_df["Radius"] = np.sqrt(test_df['X'] ** 2 + test_df['Y'] ** 2)
test_df["Angle"] = np.arctan2(test_df['X'], test_df['Y'])
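
The train and test assignments above repeat the same formulas, so they could be folded into one helper. This is a minimal sketch, assuming a DataFrame with standardized `X` and `Y` columns. The helper name `add_spatial_features` is hypothetical. It keeps the notebook's `arctan2(X, Y)` argument order so the `Angle` values stay comparable.

```python
import numpy as np
import pandas as pd


def add_spatial_features(df, angles=(30, 45, 60)):
    """Return a copy of df with rotated and polar coordinate columns added."""
    out = df.copy()
    x, y = out['X'], out['Y']
    for deg in angles:
        theta = np.radians(deg)
        # Standard 2-D rotation of the point (x, y) by theta
        out[f'Rot{deg}_X'] = x * np.cos(theta) - y * np.sin(theta)
        out[f'Rot{deg}_Y'] = x * np.sin(theta) + y * np.cos(theta)
    # Polar coordinates; same argument order as the notebook (angle from +Y axis)
    out['Radius'] = np.sqrt(x ** 2 + y ** 2)
    out['Angle'] = np.arctan2(x, y)
    return out


demo = add_spatial_features(pd.DataFrame({'X': [1.0, 0.0], 'Y': [0.0, 1.0]}))
print(demo[['Rot45_X', 'Rot45_Y', 'Radius', 'Angle']])
```

With such a helper, both sets would be transformed by the same code path, for example `train_df = add_spatial_features(train_df)`.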

In [None]:
# View the description of the numerical features again to ensure everything is right
train_df.describe()

Unnamed: 0,Lat,Long,UCRCode,Year,Month,DayOfWeek_Num,Hour,X,Y,Minute,...,Weekend,BlockNo,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle
count,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,...,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0,43504.0
mean,44.965419,-93.263689,6.592267,2019.059971,6.697361,3.009241,12.920881,-1.802171e-14,7.950569e-15,17.540985,...,0.283008,0.0,-1.9585e-14,-2.125226e-15,-1.836813e-14,-7.121816e-15,-1.589893e-14,-1.163305e-14,0.106756,-0.0904
std,0.306617,0.632978,1.118393,0.80485,3.339592,1.990104,7.13905,1.000011,1.000011,18.968796,...,0.450466,0.0,1.364292,0.3724971,1.412279,0.07426551,1.364292,0.3724971,1.410195,1.885773
min,0.0,-93.329109,1.0,2018.0,1.0,0.0,0.0,-0.1033534,-146.652,0.0,...,0.0,0.0,-0.2147909,-53.33297,-0.2581458,-0.2297988,-0.2839085,-0.193642,0.000979,-3.140637
25%,44.948347,-93.288874,6.0,2018.0,4.0,1.0,8.0,-0.03978854,-0.05567822,0.0,...,0.0,0.0,-0.04355181,-0.06016118,-0.05232896,-0.05117113,-0.06422611,-0.03894064,0.049578,-1.781244
50%,44.965746,-93.271699,7.0,2019.0,7.0,3.0,14.0,-0.01265438,0.001068412,11.0,...,0.0,0.0,-0.009405452,0.003060358,-0.005666315,0.00237033,-0.001311247,0.002489141,0.07895,-0.251025
75%,44.987328,-93.247352,7.0,2020.0,10.0,5.0,19.0,0.02581098,0.07145617,30.0,...,1.0,0.0,0.03717576,0.06437552,0.04433565,0.05200307,0.0530929,0.03139684,0.136436,1.342673
max,45.051227,0.0,10.0,2020.0,12.0,6.0,23.0,147.3427,0.2798597,59.0,...,1.0,0.0,200.9285,0.2244126,207.8857,0.4884357,200.6757,54.27656,207.886227,3.137892
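
The summary above shows a minimum `Lat` of 0.0 and a maximum `Long` of 0.0, which is why the standardized `X` and `Y` reach values near ±147. One possible way to handle this, sketched below on toy data, is to treat zero coordinates as missing and fit the scaler only on the remaining rows. The helper name `drop_zero_coordinates` and the choice to drop (rather than impute) are assumptions, not steps taken in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def drop_zero_coordinates(df, lat_col='Lat', lon_col='Long'):
    """Keep only rows whose latitude and longitude are both non-zero."""
    valid = (df[lat_col] != 0) & (df[lon_col] != 0)
    return df.loc[valid].copy()


# Toy data: the middle row has placeholder (0, 0) coordinates
demo = pd.DataFrame({'Lat': [44.97, 0.0, 44.95],
                     'Long': [-93.27, 0.0, -93.20]})
clean = drop_zero_coordinates(demo)

# Fit the scaler on the cleaned rows only, then store the scaled values
scaled = StandardScaler().fit_transform(clean[['Long', 'Lat']])
clean['X'] = scaled[:, 0]
clean['Y'] = scaled[:, 1]
print(clean)
```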


In [None]:
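# Run KMeans separately on the training set and the test set.
# Caveat: two independent fits give cluster IDs that are not guaranteed to
# refer to the same regions in train and test.
# Note: the n_jobs argument was removed from KMeans in scikit-learn 1.0.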
data = [train_df, test_df]
num_clusters = 40
for dataset in data:
    coordinates = dataset.loc[:,['Y','X']]
    kmeans = KMeans(n_clusters=num_clusters, n_jobs=jobs, random_state=1).fit(coordinates)
    id_labels=kmeans.labels_
#     print(kmeans.cluster_centers_)
    dataset['Cluster'] = id_labels
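
An alternative that keeps cluster IDs consistent between the two sets is to fit K-Means once on the training coordinates and then assign test points to the nearest learned center. The sketch below uses synthetic coordinates. The helper name `add_cluster_feature` and the `n_init=10` setting are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def add_cluster_feature(train_xy, test_xy, n_clusters=40, seed=1):
    """Fit K-Means on train coordinates and label both sets with the same model."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    kmeans.fit(train_xy)
    # predict() assigns each point to its nearest learned cluster center
    return kmeans.predict(train_xy), kmeans.predict(test_xy), kmeans


# Synthetic stand-ins for the standardized (Y, X) coordinate columns
rng = np.random.default_rng(0)
train_xy = rng.normal(size=(200, 2))
test_xy = rng.normal(size=(50, 2))

train_labels, test_labels, model = add_cluster_feature(train_xy, test_xy, n_clusters=5)
print(np.bincount(test_labels, minlength=5))
```

With this approach a given cluster ID refers to the same center in both sets, which matters when `Cluster` is used as a model feature.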

In [None]:
train_df.head()

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Offense,Description,UCRCode,Year,...,BlockNo,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
11380,2020-06-20 14:02:00,0001XX GRANT ST W,44.969863,-93.28008,LORING PARK,DOWNTOWN,THEFT,OTHER THEFT,7,2020,...,0,-0.029674,-0.000395,-0.02856,-0.008061,-0.025501,-0.015179,0.029676,-1.060496,25
8731,2018-11-12 14:08:00,0018XX CHICAGO AVE,44.964567,-93.262554,VENTURA VILLAGE,SOUTHEAST,TBLDG,THEFT FROM BUILDING,7,2018,...,0,0.002942,-0.001507,0.003232,-0.000694,0.003302,0.000166,0.003306,2.567789,18
3283,2018-07-18 11:45:00,0014XX NICOLLET AVE,44.968424,-93.277817,LORING PARK,DOWNTOWN,PETIT,OBS - PETTY THEFT,7,2018,...,0,-0.02423,-0.002672,-0.022713,-0.008852,-0.019648,-0.014429,0.024377,-1.157034,25
16375,2020-07-26 06:00:00,0009XX 28TH AVE NE,45.018586,-93.24642,AUDUBON PARK,NORTHEAST,TFMV,THEFT-MOTR VEH PARTS,7,2020,...,0,-0.063072,0.163811,-0.103321,0.141904,-0.136528,0.110328,0.175534,0.156061,27
19965,2020-09-19 23:00:00,0026XX LYNDALE AVE S,44.954638,-93.288048,LOWRY HILL EAST,SOUTHWEST,BIKETF,BIKE THEFT,7,2020,...,0,-0.015747,-0.049691,-0.002349,-0.052073,0.011208,-0.050907,0.052126,-2.311112,4


## Drop Features

- We have already extracted all the features we need from the `Address` attribute, so drop it
- `Dates`, `Lat`, and `Long` are dropped as well, now that the derived time and coordinate features exist
- We don't use `Description` as a feature, so drop it from the training set

In [None]:
# Drop Address feature from both train and test set
train_df.drop(['Address'], axis=1, inplace=True)
test_df.drop(['Address'], axis=1, inplace=True)

In [None]:
# Drop the raw Dates timestamp from both train and test set
train_df.drop(['Dates'], axis=1, inplace=True)
test_df.drop(['Dates'], axis=1, inplace=True)

In [None]:
# Drop columns that are no longer needed
train_df.drop(['Lat', 'Long'], axis=1, inplace=True)
test_df.drop(['Lat', 'Long'], axis=1, inplace=True)

In [None]:
# Drop the Description column from the training set
train_df.drop(['Description'], axis=1, inplace=True)

In [None]:
# # Let's quickly view the data
# dict(zip(offense.classes_, offense.transform(offense.classes_)))


# Feature Encoding 

- Convert categorical data to numeric data

### Precincts

- convert Precinct categorical feature to numeric

In [None]:
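# Map each precinct name to an integer code
precincts = {'DOWNTOWN':1, 'NORTHEAST':2, 'SOUTHEAST':3, 'NORTH':4, 'SOUTHWEST':5}
# Assign the result back instead of calling replace(inplace=True) on a column selection
train_df['Precinct'] = train_df['Precinct'].replace(precincts)
test_df['Precinct'] = test_df['Precinct'].replace(precincts)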

In [None]:
train_df.head()

Unnamed: 0,Neighborhood,Precinct,Offense,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,Category,...,BlockNo,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
11380,LORING PARK,1,THEFT,7,2020,6,5,Saturday,14,LARCENY,...,0,-0.029674,-0.000395,-0.02856,-0.008061,-0.025501,-0.015179,0.029676,-1.060496,25
8731,VENTURA VILLAGE,3,TBLDG,7,2018,11,0,Monday,14,LARCENY,...,0,0.002942,-0.001507,0.003232,-0.000694,0.003302,0.000166,0.003306,2.567789,18
3283,LORING PARK,1,PETIT,7,2018,7,2,Wednesday,11,LARCENY,...,0,-0.02423,-0.002672,-0.022713,-0.008852,-0.019648,-0.014429,0.024377,-1.157034,25
16375,AUDUBON PARK,2,TFMV,7,2020,7,6,Sunday,6,LARCENY,...,0,-0.063072,0.163811,-0.103321,0.141904,-0.136528,0.110328,0.175534,0.156061,27
19965,LOWRY HILL EAST,5,BIKETF,7,2020,9,5,Saturday,23,LARCENY,...,0,-0.015747,-0.049691,-0.002349,-0.052073,0.011208,-0.050907,0.052126,-2.311112,4


### Neighborhoods

- convert Neighborhoods categorical feature to numeric

In [None]:
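data = [train_df, test_df]
for dataset in data:
    # Note: a fresh encoder is fitted on each dataset, so train and test codes
    # agree only if both contain exactly the same set of neighborhoods
    neighborhood_le = LabelEncoder()
    neighborhood_le.fit(dataset['Neighborhood'].unique())
    # print(list(neighborhood_le.classes_))

    dataset['Neighborhood'] = neighborhood_le.transform(dataset['Neighborhood'])

train_df.head()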

Unnamed: 0,Neighborhood,Precinct,Offense,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,Category,...,BlockNo,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
11380,45,1,THEFT,7,2020,6,5,Saturday,14,LARCENY,...,0,-0.029674,-0.000395,-0.02856,-0.008061,-0.025501,-0.015179,0.029676,-1.060496,25
8731,77,3,TBLDG,7,2018,11,0,Monday,14,LARCENY,...,0,0.002942,-0.001507,0.003232,-0.000694,0.003302,0.000166,0.003306,2.567789,18
3283,45,1,PETIT,7,2018,7,2,Wednesday,11,LARCENY,...,0,-0.02423,-0.002672,-0.022713,-0.008852,-0.019648,-0.014429,0.024377,-1.157034,25
16375,1,2,TFMV,7,2020,7,6,Sunday,6,LARCENY,...,0,-0.063072,0.163811,-0.103321,0.141904,-0.136528,0.110328,0.175534,0.156061,27
19965,47,5,BIKETF,7,2020,9,5,Saturday,23,LARCENY,...,0,-0.015747,-0.049691,-0.002349,-0.052073,0.011208,-0.050907,0.052126,-2.311112,4


In [None]:
# So we know the mapping (important)
dict(zip(neighborhood_le.classes_, neighborhood_le.transform(neighborhood_le.classes_)))

{'ARMATAGE': 0,
 'AUDUBON PARK': 1,
 'BANCROFT': 2,
 'BELTRAMI': 3,
 'BOTTINEAU': 4,
 'BRYANT': 5,
 'BRYN - MAWR': 6,
 'CAMDEN INDUSTRIAL': 7,
 'CARAG': 8,
 'CEDAR - ISLES - DEAN': 9,
 'CEDAR RIVERSIDE': 10,
 'CENTRAL': 11,
 'CLEVELAND': 12,
 'COLUMBIA PARK': 13,
 'COMO': 14,
 'COOPER': 15,
 'CORCORAN': 16,
 'DIAMOND LAKE': 17,
 'DOWNTOWN EAST': 18,
 'DOWNTOWN WEST': 19,
 'EAST HARRIET': 20,
 'EAST ISLES': 21,
 'EAST PHILLIPS': 22,
 'ECCO': 23,
 'ELLIOT PARK': 24,
 'ERICSSON': 25,
 'FIELD': 26,
 'FOLWELL': 27,
 'FULTON': 28,
 'HALE': 29,
 'HARRISON': 30,
 'HAWTHORNE': 31,
 'HIAWATHA': 32,
 'HOLLAND': 33,
 'HOWE': 34,
 'HUMBOLDT INDUSTRIAL AREA': 35,
 'JORDAN': 36,
 'KEEWAYDIN': 37,
 'KENNY': 38,
 'KENWOOD': 39,
 'KING FIELD': 40,
 'LIND - BOHANON': 41,
 'LINDEN HILLS': 42,
 'LOGAN PARK': 43,
 'LONGFELLOW': 44,
 'LORING PARK': 45,
 'LOWRY HILL': 46,
 'LOWRY HILL EAST': 47,
 'LYNDALE': 48,
 'LYNNHURST': 49,
 'MARCY HOLMES': 50,
 'MARSHALL TERRACE': 51,
 'MCKINLEY': 52,
 'MID - CITY INDUS

### Year

- Year is an **ordinal** variable, so let's keep that ordering and mapping
- convert Year categorical feature to numeric

In [None]:
data = [train_df, test_df]

for dataset in data:
    year_le = LabelEncoder()
    year_le.fit(dataset['Year'].unique())
    print(list(year_le.classes_))

    dataset['Year']=year_le.transform(dataset['Year']) 

[2018, 2019, 2020]
[2018, 2019, 2020]


In [None]:
dict(zip(year_le.classes_, year_le.transform(year_le.classes_)))

{2018: 0, 2019: 1, 2020: 2}

### Offense

- convert Offense categorical feature to numeric

In [None]:
data = [train_df, test_df]

for dataset in data:
    offense_le = LabelEncoder()
    offense_le.fit(dataset['Offense'].unique())
    print(list(offense_le.classes_))

    dataset['Offense']=offense_le.transform(dataset['Offense']) 
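
# So we know the mapping (important)
# Note: offense_le is the encoder fitted last (on test_df). The printed class
# lists show 33 offense codes in train but 30 in test, so this mapping does
# not match the codes assigned in train_df.
dict(zip(offense_le.classes_, offense_le.transform(offense_le.classes_)))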

['ARSON', 'ASLT1', 'ASLT2', 'ASLT3', 'ASLT4', 'BIKETF', 'BURGB', 'BURGD', 'COINOP', 'COMPUT', 'CSCR', 'DASLT1', 'DASLT2', 'DASLT3', 'DASTR', 'DISARM', 'GTA', 'LOOT', 'MURDR', 'NOPAY', 'ONLTHT', 'PETIT', 'POCKET', 'ROBBIZ', 'ROBPAG', 'ROBPER', 'SCRAP', 'SHOPLF', 'TBLDG', 'TFMV', 'TFPER', 'THEFT', 'THFTSW']
['ARSON', 'ASLT1', 'ASLT2', 'ASLT3', 'ASLT4', 'BIKETF', 'BURGB', 'BURGD', 'COINOP', 'COMPUT', 'CSCR', 'DASLT1', 'DASLT2', 'DASLT3', 'DASTR', 'GTA', 'MURDR', 'ONLTHT', 'PETIT', 'POCKET', 'ROBBIZ', 'ROBPAG', 'ROBPER', 'SCRAP', 'SHOPLF', 'TBLDG', 'TFMV', 'TFPER', 'THEFT', 'THFTSW']


{'ARSON': 0,
 'ASLT1': 1,
 'ASLT2': 2,
 'ASLT3': 3,
 'ASLT4': 4,
 'BIKETF': 5,
 'BURGB': 6,
 'BURGD': 7,
 'COINOP': 8,
 'COMPUT': 9,
 'CSCR': 10,
 'DASLT1': 11,
 'DASLT2': 12,
 'DASLT3': 13,
 'DASTR': 14,
 'GTA': 15,
 'MURDR': 16,
 'ONLTHT': 17,
 'PETIT': 18,
 'POCKET': 19,
 'ROBBIZ': 20,
 'ROBPAG': 21,
 'ROBPER': 22,
 'SCRAP': 23,
 'SHOPLF': 24,
 'TBLDG': 25,
 'TFMV': 26,
 'TFPER': 27,
 'THEFT': 28,
 'THFTSW': 29}
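
Because the train and test class lists above differ, the same offense code ends up with different integers in the two sets. One way to avoid this, sketched below on toy data, is to fit a single encoder on the union of both columns. The helper name `encode_shared` is hypothetical and this is not the approach the notebook takes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_shared(train, test, column):
    """Encode `column` in both frames with one encoder fitted on their union."""
    encoder = LabelEncoder()
    encoder.fit(pd.concat([train[column], test[column]]))
    train, test = train.copy(), test.copy()
    train[column] = encoder.transform(train[column])
    test[column] = encoder.transform(test[column])
    return train, test, encoder


# Toy frames: 'LOOT' appears only in the training data
toy_train = pd.DataFrame({'Offense': ['THEFT', 'LOOT', 'GTA']})
toy_test = pd.DataFrame({'Offense': ['GTA', 'THEFT']})

enc_train, enc_test, offense_encoder = encode_shared(toy_train, toy_test, 'Offense')
mapping = {label: code for code, label in enumerate(offense_encoder.classes_)}
print(mapping)
```

The same helper could be reused for `Neighborhood`, `DayOfWeek`, and `StreetType`, so that each label maps to one integer regardless of which set it appears in.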

In [None]:
train_df['Year'].unique()


array([2, 0, 1])

In [None]:
train_df.head()

Unnamed: 0,Neighborhood,Precinct,Offense,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,Category,...,BlockNo,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
11380,45,1,31,7,2,6,5,Saturday,14,LARCENY,...,0,-0.029674,-0.000395,-0.02856,-0.008061,-0.025501,-0.015179,0.029676,-1.060496,25
8731,77,3,28,7,0,11,0,Monday,14,LARCENY,...,0,0.002942,-0.001507,0.003232,-0.000694,0.003302,0.000166,0.003306,2.567789,18
3283,45,1,21,7,0,7,2,Wednesday,11,LARCENY,...,0,-0.02423,-0.002672,-0.022713,-0.008852,-0.019648,-0.014429,0.024377,-1.157034,25
16375,1,2,29,7,2,7,6,Sunday,6,LARCENY,...,0,-0.063072,0.163811,-0.103321,0.141904,-0.136528,0.110328,0.175534,0.156061,27
19965,47,5,5,7,2,9,5,Saturday,23,LARCENY,...,0,-0.015747,-0.049691,-0.002349,-0.052073,0.011208,-0.050907,0.052126,-2.311112,4


In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43504 entries, 11380 to 2916
Data columns (total 31 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Neighborhood   43504 non-null  int64  
 1   Precinct       43504 non-null  int64  
 2   Offense        43504 non-null  int64  
 3   UCRCode        43504 non-null  int64  
 4   Year           43504 non-null  int64  
 5   Month          43504 non-null  int64  
 6   DayOfWeek_Num  43504 non-null  int64  
 7   DayOfWeek      43504 non-null  object 
 8   Hour           43504 non-null  int64  
 9   Category       43504 non-null  object 
 10  X              43504 non-null  float64
 11  Y              43504 non-null  float64
 12  Minute         43504 non-null  int64  
 13  Day            43504 non-null  int64  
 14  Hour_Zone      43504 non-null  int64  
 15  WeekOfYear     43504 non-null  int64  
 16  Holiday        43504 non-null  bool   
 17  BusinessHour   43504 non-null  uint8  
 18  Sea

### DayOfWeek

- We use sklearn's `LabelEncoder` to encode the categorical data as numeric
- Day of week is treated as a categorical, nominal variable; note that the integer codes assigned by `LabelEncoder` follow alphabetical order, not calendar order

In [None]:
train_df.head()

for dataset in data:
    dow_le = LabelEncoder()
    dow_le.fit(dataset['DayOfWeek'].unique())
    print(list(dow_le.classes_))
    dataset['DayOfWeek']=dow_le.transform(dataset['DayOfWeek'])

['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']
['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']


In [None]:
train_df['DayOfWeek'].unique()

array([2, 1, 6, 3, 4, 5, 0])

In [None]:
dict(zip(dow_le.classes_, dow_le.transform(dow_le.classes_)))

{'Friday': 0,
 'Monday': 1,
 'Saturday': 2,
 'Sunday': 3,
 'Thursday': 4,
 'Tuesday': 5,
 'Wednesday': 6}

### Street Type

- we are going to use sklearn's LabelEncoder to encode the categorical data to numeric

In [None]:
train_df.info()

for dataset in data:
    st_le = LabelEncoder()
    st_le.fit(dataset['StreetType'].unique())
    print(list(st_le.classes_))
    dataset['StreetType']=st_le.transform(dataset['StreetType'])

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43504 entries, 11380 to 2916
Data columns (total 31 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Neighborhood   43504 non-null  int64  
 1   Precinct       43504 non-null  int64  
 2   Offense        43504 non-null  int64  
 3   UCRCode        43504 non-null  int64  
 4   Year           43504 non-null  int64  
 5   Month          43504 non-null  int64  
 6   DayOfWeek_Num  43504 non-null  int64  
 7   DayOfWeek      43504 non-null  int64  
 8   Hour           43504 non-null  int64  
 9   Category       43504 non-null  object 
 10  X              43504 non-null  float64
 11  Y              43504 non-null  float64
 12  Minute         43504 non-null  int64  
 13  Day            43504 non-null  int64  
 14  Hour_Zone      43504 non-null  int64  
 15  WeekOfYear     43504 non-null  int64  
 16  Holiday        43504 non-null  bool   
 17  BusinessHour   43504 non-null  uint8  
 18  Sea

In [None]:
train_df['StreetType'].unique()

array([11,  1,  6,  0,  9, 13,  8,  2,  4, 10,  7,  3, 12,  5])

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43504 entries, 11380 to 2916
Data columns (total 31 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Neighborhood   43504 non-null  int64  
 1   Precinct       43504 non-null  int64  
 2   Offense        43504 non-null  int64  
 3   UCRCode        43504 non-null  int64  
 4   Year           43504 non-null  int64  
 5   Month          43504 non-null  int64  
 6   DayOfWeek_Num  43504 non-null  int64  
 7   DayOfWeek      43504 non-null  int64  
 8   Hour           43504 non-null  int64  
 9   Category       43504 non-null  object 
 10  X              43504 non-null  float64
 11  Y              43504 non-null  float64
 12  Minute         43504 non-null  int64  
 13  Day            43504 non-null  int64  
 14  Hour_Zone      43504 non-null  int64  
 15  WeekOfYear     43504 non-null  int64  
 16  Holiday        43504 non-null  bool   
 17  BusinessHour   43504 non-null  uint8  
 18  Sea

### Holiday

- Encode the binary feature

In [None]:
# Encode the boolean Holiday flag as 0/1 in both the train and test set
train_df['Holiday'] = train_df['Holiday'].astype(int)
test_df['Holiday'] = test_df['Holiday'].astype(int)

In [None]:
train_df[train_df['Holiday'] == 1].head()

Unnamed: 0,Neighborhood,Precinct,Offense,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,Category,...,BlockNo,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8731,77,3,28,7,0,11,0,1,14,LARCENY,...,0,0.002942,-0.001507,0.003232,-0.000694,0.003302,0.000166,0.003306,2.567789,18
1964,84,5,27,7,0,1,0,1,18,LARCENY,...,0,0.007154,-0.060161,0.022481,-0.05626,0.036276,-0.048524,0.060585,-2.736355,33
10460,40,5,6,6,0,12,1,5,2,BURGLARY,...,0,0.024898,-0.120792,0.055312,-0.110232,0.081958,-0.09216,0.123331,-2.821267,13
3789,49,5,29,7,2,7,4,0,0,LARCENY,...,0,0.061507,-0.188294,0.108146,-0.165959,0.147414,-0.132314,0.198086,-2.933721,14
22690,22,3,16,8,1,2,0,1,5,AUTO THEFT,...,0,0.042301,-0.029828,0.04858,-0.017863,0.051548,-0.004681,0.05176,2.708548,10


In [None]:
test_df[test_df['Holiday'] == 1].head()

Unnamed: 0,Neighborhood,Precinct,Offense,Description,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,...,BlockNo,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
14724,74,4,26,THEFT FROM MOTR VEHC,7,2,7,4,0,21,...,0,-0.064574,0.028127,-0.069654,0.010456,-0.069987,-0.007928,0.070434,-0.636402,39
4891,57,4,28,OTHER THEFT,7,1,1,0,1,17,...,0,-0.087323,0.074491,-0.103627,0.049352,-0.112869,0.020849,0.114779,-0.340936,8
3781,50,2,28,OTHER THEFT,7,2,7,4,0,20,...,0,0.011669,0.068245,-0.006392,0.068939,-0.024017,0.064936,0.069235,0.692946,28
8037,49,5,26,THEFT FROM MOTR VEHC,7,0,10,0,1,14,...,0,0.063566,-0.187833,0.110015,-0.164981,0.148966,-0.130885,0.198297,-2.944311,9
4931,42,5,15,AUTOMOBILE THEFT,8,1,2,0,1,7,...,0,-0.003595,-0.139263,0.032572,-0.135448,0.066518,-0.122403,0.13931,-2.592187,15


### Category

- we are going to use sklearn's LabelEncoder to encode the categorical data to numeric

In [None]:
data = [train_df]

for dataset in data:
    cat_le = LabelEncoder()
    cat_le.fit(dataset['Category'].unique())
    print(list(cat_le.classes_))
    dataset['Category']=cat_le.transform(dataset['Category'])

['ARSON', 'ASSAULT', 'AUTO THEFT', 'BURGLARY', 'LARCENY', 'MURDER', 'RAPE', 'ROBBERY']


In [None]:
len(train_df['Category'].unique())

8

In [None]:
# So we know the mapping (important)
dict(zip(cat_le.classes_, cat_le.transform(cat_le.classes_)))

{'ARSON': 0,
 'ASSAULT': 1,
 'AUTO THEFT': 2,
 'BURGLARY': 3,
 'LARCENY': 4,
 'MURDER': 5,
 'RAPE': 6,
 'ROBBERY': 7}

In [None]:
train_df.head()

Unnamed: 0,Neighborhood,Precinct,Offense,UCRCode,Year,Month,DayOfWeek_Num,DayOfWeek,Hour,Category,...,BlockNo,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
11380,45,1,31,7,2,6,5,2,14,4,...,0,-0.029674,-0.000395,-0.02856,-0.008061,-0.025501,-0.015179,0.029676,-1.060496,25
8731,77,3,28,7,0,11,0,1,14,4,...,0,0.002942,-0.001507,0.003232,-0.000694,0.003302,0.000166,0.003306,2.567789,18
3283,45,1,21,7,0,7,2,6,11,4,...,0,-0.02423,-0.002672,-0.022713,-0.008852,-0.019648,-0.014429,0.024377,-1.157034,25
16375,1,2,29,7,2,7,6,3,6,4,...,0,-0.063072,0.163811,-0.103321,0.141904,-0.136528,0.110328,0.175534,0.156061,27
19965,47,5,5,7,2,9,5,2,23,4,...,0,-0.015747,-0.049691,-0.002349,-0.052073,0.011208,-0.050907,0.052126,-2.311112,4


In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43504 entries, 11380 to 2916
Data columns (total 31 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Neighborhood   43504 non-null  int64  
 1   Precinct       43504 non-null  int64  
 2   Offense        43504 non-null  int64  
 3   UCRCode        43504 non-null  int64  
 4   Year           43504 non-null  int64  
 5   Month          43504 non-null  int64  
 6   DayOfWeek_Num  43504 non-null  int64  
 7   DayOfWeek      43504 non-null  int64  
 8   Hour           43504 non-null  int64  
 9   Category       43504 non-null  int64  
 10  X              43504 non-null  float64
 11  Y              43504 non-null  float64
 12  Minute         43504 non-null  int64  
 13  Day            43504 non-null  int64  
 14  Hour_Zone      43504 non-null  int64  
 15  WeekOfYear     43504 non-null  int64  
 16  Holiday        43504 non-null  int64  
 17  BusinessHour   43504 non-null  uint8  
 18  Sea

## View Information about Data

- One last check before training

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43504 entries, 11380 to 2916
Data columns (total 31 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Neighborhood   43504 non-null  int64  
 1   Precinct       43504 non-null  int64  
 2   Offense        43504 non-null  int64  
 3   UCRCode        43504 non-null  int64  
 4   Year           43504 non-null  int64  
 5   Month          43504 non-null  int64  
 6   DayOfWeek_Num  43504 non-null  int64  
 7   DayOfWeek      43504 non-null  int64  
 8   Hour           43504 non-null  int64  
 9   Category       43504 non-null  int64  
 10  X              43504 non-null  float64
 11  Y              43504 non-null  float64
 12  Minute         43504 non-null  int64  
 13  Day            43504 non-null  int64  
 14  Hour_Zone      43504 non-null  int64  
 15  WeekOfYear     43504 non-null  int64  
 16  Holiday        43504 non-null  int64  
 17  BusinessHour   43504 non-null  uint8  
 18  Sea

In [None]:
# Convert to 16-bit integers to use less memory and train faster
# (no loss of data, since these values stay well within the int16 range)
columns_to_convert = ['DayOfWeek', 'Precinct', 'Minute', 'Hour', 'Day', 'Month', 'Year', 
                      'Hour_Zone', 'WeekOfYear', 'Season', 'StreetType', 'BlockNo', 'Cluster']
                      #'Neighborhood']
train_df[columns_to_convert] = train_df[columns_to_convert].astype('int16')
test_df[columns_to_convert] = test_df[columns_to_convert].astype('int16')

train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43504 entries, 11380 to 2916
Data columns (total 31 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Neighborhood   43504 non-null  int64  
 1   Precinct       43504 non-null  int16  
 2   Offense        43504 non-null  int64  
 3   UCRCode        43504 non-null  int64  
 4   Year           43504 non-null  int16  
 5   Month          43504 non-null  int16  
 6   DayOfWeek_Num  43504 non-null  int64  
 7   DayOfWeek      43504 non-null  int16  
 8   Hour           43504 non-null  int16  
 9   Category       43504 non-null  int64  
 10  X              43504 non-null  float64
 11  Y              43504 non-null  float64
 12  Minute         43504 non-null  int16  
 13  Day            43504 non-null  int16  
 14  Hour_Zone      43504 non-null  int16  
 15  WeekOfYear     43504 non-null  int16  
 16  Holiday        43504 non-null  int64  
 17  BusinessHour   43504 non-null  uint8  
 18  Sea
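
The downcast above relies on every value fitting in `int16`. A small guard can make that assumption explicit before casting. The sketch below is illustrative only. The helper name `downcast_if_safe` and the toy columns are assumptions.

```python
import numpy as np
import pandas as pd


def downcast_if_safe(df, columns, dtype='int16'):
    """Cast the given columns to `dtype`, raising if any value would overflow."""
    limits = np.iinfo(dtype)
    out = df.copy()
    for col in columns:
        lowest, highest = out[col].min(), out[col].max()
        if lowest < limits.min or highest > limits.max:
            raise ValueError(f"{col} has values outside the {dtype} range")
        out[col] = out[col].astype(dtype)
    return out


demo = pd.DataFrame({'Hour': [0, 12, 23], 'Cluster': [4, 25, 39]})
safe = downcast_if_safe(demo, ['Hour', 'Cluster'])
print(safe.dtypes)
```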

# Building Machine Learning Models

- Baseline Models
    - Let's train a couple models on a stratified sample of the training data
    - Evaluate on a hold out set to get baseline results for each model to determine what model to use
    - Models:
        - Stochastic Gradient Descent (with elastic net regularization)
        - Gaussian Naive Bayes
        - K Nearest Neighbors
        - Logistic Regression (with L1 regularization)
        - Random Forest
        - XGBoost
    - Most of the scikit-learn default hyperparameters performed poorly on this task
        - I researched online and read the literature to find better starting hyperparameters
            - [Reference](https://arxiv.org/abs/1708.05070)
- A couple of things to note:
    - **Decision tree models**, including ensemble methods (Random Forest & XGBoost), can work with integer-encoded categorical variables without one-hot encoding them.
    - **Linear models** (SGD & Logistic Regression) cannot handle integer-encoded categorical features directly, so those features need to be one-hot encoded (OHE) before training
    - Always one-hot encode before splitting the data into training/dev/test sets so that every category is represented as a column in each split
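
To illustrate the last point, the toy example below one-hot encodes a hypothetical `Precinct` column before splitting, so both halves share the same columns. Encoding only a subset of rows can silently drop categories. The data and variable names here are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

frame = pd.DataFrame({'Precinct': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                      'Hour': [0, 3, 6, 9, 12, 15, 18, 21, 23, 1]})
labels = pd.Series([0, 1] * 5)

# Encode first: every precinct gets a column, whichever split its rows land in
encoded = pd.get_dummies(frame, columns=['Precinct'])
train_part, dev_part, train_y, dev_y = train_test_split(
    encoded, labels, stratify=labels, test_size=0.5, random_state=1)
print(list(train_part.columns) == list(dev_part.columns))

# Encoding a subset instead: the first three rows only contain precincts 1-3
subset_encoded = pd.get_dummies(frame.iloc[:3], columns=['Precinct'])
print(list(subset_encoded.columns))
```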

In [None]:
# result = opt.fit(X_train.values, Y_train.values, callback=status_print)

X_train = train_df.drop("Category", axis=1).copy()
Y_train = train_df["Category"].copy()

# Set testing data (drop Id)
#X_test = test_df.drop("Id", axis=1).copy()

In [None]:
def one_hot_encode(train_data):
    '''One Hot Encode the categorical features'''
    # 'Neighborhood' is intentionally left out of the encoded columns
    categorical_columns = ['Precinct', 'DayOfWeek', 'StreetType', 'Season', 'Hour_Zone', 'Cluster']
    # get_dummies appends one indicator column per category (prefixed with the
    # column name) and drops the original columns
    encoded_train_data = pd.get_dummies(train_data, columns=categorical_columns)

    return encoded_train_data

In [None]:
X_encoded_train = one_hot_encode(X_train)

In [None]:
# Use these for ML algorithms that can't handle categorical data (Logistic Regression, Linear Models)
mini_encoded_train_data, mini_encoded_dev_data, mini_train_labels, mini_dev_labels = train_test_split(X_encoded_train, 
                                                                                      Y_train,
                                                                                      stratify=Y_train,
                                                                                      test_size=0.5,
                                                                                      random_state=1)

In [None]:
# Use these for ML algorithms that can handle categorical data without OHE
mini_train_data, mini_dev_data, mini_train_labels, mini_dev_labels = train_test_split(X_train,
                                                                                      Y_train,
                                                                                      stratify=Y_train,
                                                                                      test_size=0.5,
                                                                                      random_state=1)


In [None]:
# K Neighbors
knn = KNeighborsClassifier()
knn.fit(mini_train_data, mini_train_labels)
pred_probs = knn.predict_proba(mini_dev_data)
knn_loss = log_loss(mini_dev_labels, pred_probs)


print('KNN Validation Log Loss: ', knn_loss)

KNN Validation Log Loss:  2.5943647189960943
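
For reference, the multiclass log loss reported for each model is the average negative log of the probability assigned to the true class. The toy example below computes it by hand and compares the result with `sklearn.metrics.log_loss`. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Three samples, three classes; each row of probs sums to 1
y_true = np.array([0, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

# Pick the predicted probability of the true class for each sample
true_class_probs = probs[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(true_class_probs))

library = log_loss(y_true, probs, labels=[0, 1, 2])
print(manual, library)
```

Lower is better, and a confident wrong prediction is penalized heavily because the log of a probability near 0 is a large negative number.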


In [None]:
# Naive Bayes
gaussian = GaussianNB()
gaussian.fit(mini_train_data, mini_train_labels)
pred_probs = gaussian.predict_proba(mini_dev_data)
nb_loss = log_loss(mini_dev_labels, pred_probs)


print('Gaussian Naive Bayes Validation Log Loss: ', nb_loss)

Gaussian Naive Bayes Validation Log Loss:  7.462901613755392e-15


In [None]:
# Stochastic gradient descent (SGD) learning
# Note: loss='log' was renamed to 'log_loss' in scikit-learn 1.1
sgd = linear_model.SGDClassifier(penalty='elasticnet', loss='log', 
                                  tol=0.0001, max_iter=1000, n_jobs=3, random_state=1)
sgd.fit(mini_encoded_train_data, mini_train_labels)
pred_probs = sgd.predict_proba(mini_encoded_dev_data)
sgd_loss = log_loss(mini_dev_labels, pred_probs)

print('Linear Model SGD Validation Log Loss: ', sgd_loss)

Linear Model SGD Validation Log Loss:  2.524191938930568


In [None]:
# Logistic Regression
logreg = LogisticRegression(penalty='l1', C=1.5, solver='saga', multi_class='multinomial', 
                            tol=0.0001, max_iter=1000, verbose=3, n_jobs=jobs, random_state=1)

logreg.fit(mini_encoded_train_data, mini_train_labels)
pred_probs = logreg.predict_proba(mini_encoded_dev_data)

logreg_loss = log_loss(mini_dev_labels, pred_probs)


print('Logistic Regression Validation Log Loss: ', logreg_loss)

Epoch 1, change: 1.00000000


[Parallel(n_jobs=1000)]: Using backend ThreadingBackend with 1000 concurrent workers.


Epoch 2, change: 0.22027997
Epoch 3, change: 0.17042231
Epoch 4, change: 0.14449822
Epoch 5, change: 0.12793020
Epoch 6, change: 0.11018445
Epoch 7, change: 0.09353372
Epoch 8, change: 0.08034122
Epoch 9, change: 0.07011421
Epoch 10, change: 0.06284175
Epoch 11, change: 0.05736645
Epoch 12, change: 0.05281880
Epoch 13, change: 0.04879994
Epoch 14, change: 0.04537899
Epoch 15, change: 0.04249344
Epoch 16, change: 0.03981332
Epoch 17, change: 0.03747177
Epoch 18, change: 0.03538643
Epoch 19, change: 0.03346808
Epoch 20, change: 0.03178975
Epoch 21, change: 0.03027352
Epoch 22, change: 0.02880122
Epoch 23, change: 0.02753095
Epoch 24, change: 0.02638704
Epoch 25, change: 0.02527710
Epoch 26, change: 0.02428557
Epoch 27, change: 0.02334832
Epoch 28, change: 0.02247828
Epoch 29, change: 0.02167946
Epoch 30, change: 0.02093988
Epoch 31, change: 0.02023165
Epoch 32, change: 0.01957633
Epoch 33, change: 0.01889548
Epoch 34, change: 0.01826197
Epoch 35, change: 0.01761706
Epoch 36, change: 0.01

[Parallel(n_jobs=1000)]: Done   1 out of   1 | elapsed:  1.4min finished


In [None]:
# Random Forest Ensemble
random_forest = RandomForestClassifier(n_estimators=500, max_depth=15, max_features='sqrt',
                                       min_samples_leaf=5, min_samples_split=25, 
                                       random_state=1, verbose=1, n_jobs=jobs)


random_forest.fit(mini_train_data, mini_train_labels)
pred_probs = random_forest.predict_proba(mini_dev_data)

rf_loss = log_loss(mini_dev_labels, pred_probs)

print('Random Forest Validation Log Loss: ', rf_loss)

[Parallel(n_jobs=1000)]: Using backend ThreadingBackend with 1000 concurrent workers.
[Parallel(n_jobs=1000)]: Done   4 out of 500 | elapsed:    0.2s remaining:   26.6s
[Parallel(n_jobs=1000)]: Done 500 out of 500 | elapsed:    2.5s finished
[Parallel(n_jobs=500)]: Using backend ThreadingBackend with 500 concurrent workers.
[Parallel(n_jobs=500)]: Done   2 out of 500 | elapsed:    0.1s remaining:   18.3s


Random Forest Validation Log Loss:  0.12481225127427174


[Parallel(n_jobs=500)]: Done 500 out of 500 | elapsed:    0.3s finished


In [None]:
# XGBoost Ensemble 
# xgb = XGBClassifier(n_estimators=100, verbose=3, n_jobs=jobs, random_state=1)
xgb = XGBClassifier(n_estimators=500, objective="multi:softprob", 
                    verbose=3, n_jobs=jobs, random_state=1)

xgb.fit(mini_encoded_train_data, mini_train_labels)
pred_probs = xgb.predict_proba(mini_encoded_dev_data)

xgb_loss = log_loss(mini_dev_labels, pred_probs)

print('XGBoost Validation Log Loss: ', xgb_loss)

In [None]:
# Display the rank of the models
models = pd.DataFrame({
    'Model': ['SGD (Elastic net)', 'Logistic Regression (l1)', 'Random Forest', 
              'Gaussian Naive Bayes', 'XGBoost', 'K Neighbors'],
    'Log_Loss': [sgd_loss, logreg_loss, rf_loss, nb_loss, xgb_loss, knn_loss]})
print(models.sort_values(by='Log_Loss', ascending=True).reset_index(drop=True))
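
Every baseline above is ranked by multiclass log loss, so it helps to see what that number measures. The following is a minimal, library-free sketch; the helper name `multiclass_log_loss` and the toy probabilities are illustrative assumptions, not values from this notebook. To my knowledge `sklearn.metrics.log_loss` also clips probabilities away from 0 and 1, though its exact clipping constant may differ.

```python
import math

def multiclass_log_loss(true_labels, pred_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for label, probs in zip(true_labels, pred_probs):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(true_labels)

# Toy example with 3 classes (indices 0..2); the probabilities are made up.
labels = [0, 2, 1]
confident = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

confident_loss = multiclass_log_loss(labels, confident)  # equals -ln(0.8)
uniform_loss = multiclass_log_loss(labels, uniform)      # equals ln(3)
print(confident_loss, uniform_loss)
```

A uniform guess over K classes always scores ln(K), which gives a useful sanity baseline when reading the table above.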

# Model Selection

- Although Logistic Regression with L1 regularization seems promising, our dataset mixes categorical and numerical features with very different statistics (mean, variance), so the relationship between the features and the target is unlikely to be linear. In addition, any linear model requires **one hot encoding** of the categorical features, which greatly increases the feature space (some categorical features such as `BlockNumber` have many levels/values). 
    - Logistic Regression is a generalized linear model: its decision boundaries are linear in the input features, so it works best when the classes are (close to) linearly separable.
    - In practice, more feature engineering (e.g. transforming features so that their effect is closer to linear) could improve the performance of LR.
- Ensemble methods have been historically and theoretically powerful in handling datasets with very different features (numerical & categorical features). In addition, ensemble methods are effective in solving non-linear problems. So, I will select between Random Forest & XGBoost as the final model. 
    - The caveat is that the default hyperparameters for RF & XGB are generally not optimal for the problem at hand, so hyperparameter tuning is necessary, which can take a while since there are so many hyperparameters to tune (at least in XGB).
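
To make the point about one hot encoding concrete, here is a small sketch that counts how many columns a table has after encoding. The helper `one_hot_width`, the `Hour` column, and all values are hypothetical; only `DayOfWeek` and `BlockNumber` are names from this notebook.

```python
def one_hot_width(rows, categorical_columns):
    """Number of columns after one-hot encoding the given categorical columns."""
    if not rows:
        return 0
    width = 0
    for column in rows[0]:
        if column in categorical_columns:
            # One indicator column per distinct level
            width += len({row[column] for row in rows})
        else:
            # Numeric columns pass through unchanged
            width += 1
    return width

# Toy rows; the values are illustrative only.
rows = [
    {"Hour": 23, "DayOfWeek": "Wed", "BlockNumber": "1500"},
    {"Hour": 9, "DayOfWeek": "Fri", "BlockNumber": "800"},
    {"Hour": 14, "DayOfWeek": "Wed", "BlockNumber": "2000"},
    {"Hour": 2, "DayOfWeek": "Sun", "BlockNumber": "100"},
]
width = one_hot_width(rows, {"DayOfWeek", "BlockNumber"})
print(len(rows[0]), "columns ->", width, "columns")
```

Even in this tiny example the high-cardinality column contributes one new column per distinct value, which is why tree ensembles that can work with integer-encoded categories are attractive here.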

# Hyperparameter Tuning

- Hyperparameter tuning involves defining an objective function (log loss), and using cross-validation to measure the hyperparameter quality. 
    - We want the hyperparameters that give the highest generalization performance.
- Three approaches: Grid Search (`GridSearchCV`), Random Search (`RandomizedSearchCV`), and Bayesian Optimization (`BayesSearchCV` from scikit-optimize)
- I found that `GridSearchCV` took far too long to be practical, and `RandomizedSearchCV` was too hit-or-miss.
    - Grid and random search are completely uninformed by past evaluations, and as a result, often spend a significant amount of time evaluating “bad” hyperparameters.
- Then, I did more research on more efficient & smarter hyperparameter tuning techniques and found Bayesian Optimization (`BayesSearchCV`)
- **Bayesian Optimization Overview**
    - Build a probabilistic model of the objective function & use it to select promising hyperparameters to evaluate in the true objective function
        - The model used for approximating the objective function is called *surrogate model*. 
            - E.g. Gaussian Processes 
    - Keeps track of past evaluation results, which is used to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function
    - Instead of optimizing an expensive objective function, we optimize on a cheap proxy function instead.
        - *Acquisition function* that directs sampling to areas where an improvement over the current best observation is likely.
            - E.g. maximum probability of improvement (MPI), expected improvement (EI) and upper confidence bound (UCB)
- **K-Folds Cross Validation**
    - Use cross validation to measure the true generalization performance of a model 
    - This is integrated with the hyperparameter tuning techniques (`GridSearchCV`, `RandomizedSearchCV`, `BayesSearchCV`)
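
The acquisition-function idea can be sketched in a few lines. This is a minimal illustration of expected improvement (EI) for a loss we want to minimize, assuming the surrogate model gives a Normal prediction (mean and standard deviation) for each candidate. The candidate values and surrogate predictions below are hypothetical, and `BayesSearchCV` handles all of this internally.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, best_loss):
    """EI when minimizing: E[max(best_loss - f, 0)] for f ~ Normal(mu, sigma^2)."""
    if sigma == 0:
        return max(best_loss - mu, 0.0)
    z = (best_loss - mu) / sigma
    return (best_loss - mu) * normal_cdf(z) + sigma * normal_pdf(z)

# Hypothetical surrogate predictions (mean, std of log loss) per candidate max_depth
best_loss_so_far = 2.50
candidates = {
    5: (2.55, 0.01),   # well explored, slightly worse than the best
    15: (2.48, 0.02),  # slightly better, low uncertainty
    40: (2.60, 0.30),  # worse on average, but very uncertain
}
scores = {depth: expected_improvement(mu, sigma, best_loss_so_far)
          for depth, (mu, sigma) in candidates.items()}
next_depth = max(scores, key=scores.get)
print(scores, next_depth)
```

Note how the most uncertain candidate wins even though its predicted mean is the worst: EI trades off exploiting good predictions against exploring regions the surrogate knows little about.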

--------
## Random Forest (Bagging)

- Basic Overview
    - An ensemble method that utilizes Bagging (Bootstrap Aggregation, i.e. training each learner on a sample drawn with replacement)
    - Bagging helps reduce **variance** in any single learner (Decision Trees)
- Basic Steps:
    1. Several decision trees, which can be built in parallel, form the base learners of the bagging technique.
    2. Data sampled with replacement is fed to these learners for training.
    3. The final prediction is the averaged output from all the learners.
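
The three steps above can be sketched without any library. This toy version bags one-split trees (stumps) on 1-D binary data; it is an illustration under those simplifying assumptions, not how `RandomForestClassifier` is implemented (which also subsamples features and grows deeper trees). All function names and data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split tree on 1-D data; returns (threshold, p_left, p_right)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        # Misclassifications if each side predicts its majority class
        errors = (min(sum(left), len(left) - sum(left))
                  + min(sum(right), len(right) - sum(right)))
        if best is None or errors < best[0]:
            best = (errors, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # all x identical: fall back to the overall class frequency
        p = sum(ys) / len(ys)
        return (float("inf"), p, p)
    return best[1:]

def stump_proba(stump, x):
    t, p_left, p_right = stump
    return p_left if x <= t else p_right

def bagged_stumps(xs, ys, n_estimators, seed=1):
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        # Step 2: bootstrap sample (drawn with replacement) for this learner
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def bagged_proba(stumps, x):
    # Step 3: average the outputs of all learners
    return sum(stump_proba(s, x) for s in stumps) / len(stumps)

# Toy data: class 1 whenever x >= 0.5
xs = [i / 10 for i in range(10)]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
stumps = bagged_stumps(xs, ys, n_estimators=25)
print(bagged_proba(stumps, 0.0), bagged_proba(stumps, 0.45), bagged_proba(stumps, 0.9))
```

Individual stumps disagree about where exactly the split lies because each one sees a different bootstrap sample; averaging them smooths that disagreement out, which is the variance reduction mentioned above.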
   

**Things I learned**:
- Since the random forest model is overfitting, we want to increase the **min** parameters of random forest and decrease the **max** parameters of random forest
- increasing `n_estimators` reduces **variance**: averaging more trees does not by itself make the forest overfit, although the gains level off and training gets slower
    - a very low `n_estimators` behaves much like a single decision tree (very prone to overfitting)
- increasing max depth will increase **variance** (overfitting, sensitivity to training set) and decrease **bias**
- increasing min samples leaf will decrease **variance** and increase **bias**.
- decreasing any of the **max*** parameters and increasing any of the **min*** parameters will increase **regularization**.

In [None]:
n_features = X_train.shape[1]

opt = BayesSearchCV(
    estimator=RandomForestClassifier(oob_score=True, random_state=1, n_jobs=jobs),
    search_spaces= 
    {
        'n_estimators': (100, 600),
        'max_depth': (1, 50),  
        'max_features': (1, n_features),
        'min_samples_leaf': (1, 50),  # integer valued parameter
        'min_samples_split': (2, 50),
    },
    n_iter=20,
    optimizer_kwargs= {'base_estimator': 'RF'},
    scoring='neg_log_loss',
    n_jobs=jobs,
    verbose=0,
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=1
    ),
    random_state=1
    
)


def status_print(optim_result):
    """Status callback during Bayesian hyperparameter search"""
    
    # Get all the models tested so far in DataFrame format
    all_models = pd.DataFrame(opt.cv_results_)    
    
    # Report the best score and parameters found so far
    # (with scoring='neg_log_loss', best_score_ is the negated log loss, so values closer to 0 are better)
    print('Model #{}\nBest LogLoss: {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(opt.best_score_, 6),
        opt.best_params_
    ))
    
    # Save all model results
    clf_name = opt.estimator.__class__.__name__
    all_models.to_csv(clf_name + "_cv_results.csv")


In [None]:
result = opt.fit(X_train.values, Y_train.values, callback=status_print)

In [None]:
result.best_params_

## XGBoost (Boosting)

- Basic Overview:
    - Another ensemble method that uses Boosting instead of Bagging (Random Forests)
    - In **Boosting**, the trees are built sequentially such that each subsequent tree aims to reduce the errors of the previous tree.
    - Each tree learns from its predecessors and updates the residual errors. 
    - Each base learner is weak (high bias) and contributes some vital information for prediction, enabling the boosting technique to produce a strong learner by effectively combining these weak learners.
    - The final strong learner brings down both the **bias** and the **variance**.
    - In contrast to bagging techniques like Random Forest, in which trees are grown to their maximum extent, boosting makes use of trees with fewer splits
        -  Such small trees, which are not very deep, are **highly interpretable**. 
- Basic Steps:
    1. Initial model `F0` to predict target variable `y`. Used to also calculate residual (`y - F0`)
    2. A new model `h1` is used to fit to the residuals from the previous step
    3. Now, `F0` and `h1` are combined to give `F1`, which is the boosted version of `F0`. 
        - The training error of `F1` (MSE, or whatever cost function you use, e.g. log loss or MAE) will be lower than that of `F0`.
    4. Iterate the above steps to create new models based off the previous models.
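
Here is a minimal sketch of those steps, under loud simplifying assumptions: it uses squared error on a toy 1-D regression target (not the multiclass log loss optimized in this notebook), one-split trees as the weak learners, and none of XGBoost's regularization or second-order terms. All names and data are illustrative.

```python
def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def fit_regression_stump(xs, residuals):
    """Best single split on 1-D data; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # only one distinct x: predict the mean residual everywhere
        m = sum(residuals) / len(residuals)
        return (float("inf"), m, m)
    return best[1:]

def boost(xs, ys, n_rounds, learning_rate):
    f0 = sum(ys) / len(ys)              # Step 1: initial model F0
    preds = [f0] * len(ys)
    stumps, history = [], [mse(ys, preds)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_regression_stump(xs, residuals)   # Step 2: fit h to residuals
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm)  # Step 3: F_new = F_old + lr * h
                 for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return f0, stumps, history

def predict(f0, stumps, learning_rate, x):
    return f0 + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)

# Toy data with a jump between x=4 and x=5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
f0, stumps, history = boost(xs, ys, n_rounds=5, learning_rate=0.5)
print(history)
```

The printed training MSE shrinks with every round, which is exactly why the number of rounds has to be controlled: the training error alone will never tell you to stop.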
    
### Prevent Overfitting:
- A large number of trees can eventually cause overfitting (unlike Random Forests), since each new tree keeps fitting the remaining training error
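
One common guard is early stopping: track the loss on a held-out set after every boosting round and keep the round where it was lowest. XGBoost has built-in early stopping, but since its exact parameter placement varies by version, this sketch stays library-free. The helper `best_round` and the loss values are made up for illustration and are not results from this notebook.

```python
def best_round(validation_losses, patience):
    """Return (1-based round, loss) of the best validation loss, stopping once
    `patience` consecutive rounds fail to improve on it."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break
    return best_idx + 1, best_loss

# Made-up validation log losses per boosting round
val_losses = [2.60, 2.45, 2.38, 2.33, 2.31, 2.32, 2.34, 2.37, 2.30]
round_kept, loss_kept = best_round(val_losses, patience=3)
print(round_kept, loss_kept)
```

With `patience=3` the search stops after three non-improving rounds and never sees the final value, which is the usual trade-off between training time and the chance of a late improvement.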


In [None]:
# log-uniform: understand as search over p = exp(x) by varying x
bayes_cv_tuner = BayesSearchCV(
    estimator = XGBClassifier(
        #n_jobs = 3,
        n_jobs = 20,
        objective = 'multi:softprob',
        eval_metric = 'mlogloss',
        silent=1,
        random_state=1
    ),
    search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_depth': (1, 100),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'min_child_weight': (0, 5),
        'n_estimators': (50, 300),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },    
    scoring = 'neg_log_loss',
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=1
    ),
    #n_jobs = 6,
    n_jobs = 10,
    n_iter = 20,   
    verbose = 0,
    refit = True,
    random_state = 1
)

def status_print(optim_result):
    """Status callback during Bayesian hyperparameter search"""
    
    # Get all the models tested so far in DataFrame format
    all_models = pd.DataFrame(bayes_cv_tuner.cv_results_)    
    
    # Report the best score and parameters found so far
    # (with scoring='neg_log_loss', best_score_ is the negated log loss, so values closer to 0 are better)
    print('Model #{}\nBest Log Loss: {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(bayes_cv_tuner.best_score_, 8),
        bayes_cv_tuner.best_params_
    ))
    
    # Save all model results
    clf_name = bayes_cv_tuner.estimator.__class__.__name__
    all_models.to_csv(clf_name + "_cv_results.csv")

In [None]:
# Fit the model
result = bayes_cv_tuner.fit(X_train.values, Y_train.values, callback=status_print)

In [None]:
X_train.head()
result.best_params_

XGBoost Best params:

{'colsample_bylevel': 1.0, 'colsample_bytree': 1.0, 'gamma': 0.49999999999999994, 'learning_rate': 0.1858621466840661, 
'max_delta_step': 0, 'max_depth': 50, 'min_child_weight': 5, 'n_estimators': 86, 'reg_alpha': 1.0, 'reg_lambda': 60.121460571845695, 'scale_pos_weight': 1e-06, 'subsample': 1.0}

# Train model with optimal hyperparameters & all features

- Initially, I started with a Random Forest, but decided to use XGBoost
- We first train the model (with all the features) using the optimal hyperparameters that were found through `BayesSearchCV`
- Then, I use the model to predict the probabilities of test set with all the features
    - I'll save these predictions later to compare them with another model I will train with certain features removed

In [None]:
# It seems running time scales quadratically with the number of classes
xgb = XGBClassifier(
    n_estimators=86, 
    objective="multi:softprob", 
    learning_rate=0.1858621466840661,
    colsample_bylevel=1.0,
    colsample_bytree=1.0,
    gamma=0.49999999999999994,
    max_delta_step=0,
    max_depth=50,
    min_child_weight=5,
    reg_alpha=1.0,
    reg_lambda=60.121460571845695,
    scale_pos_weight=1e-06,
    subsample=1.0,
    random_state=1, 
    n_jobs=jobs,
    silent=False)


xgb.fit(X_train, Y_train)

Y_test_pred = xgb.predict_proba(X_test)

In [None]:
# random_forest = RandomForestClassifier(n_estimators=600, max_depth=21, max_features=6,
#                                        min_samples_leaf=43, min_samples_split=40, 
#                                        random_state=1, verbose=3, n_jobs=jobs)
# random_forest.fit(X_train, Y_train)

# Y_test_pred = random_forest.predict_proba(X_test)

# Feature Importance

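- For Random Forest, importance is measured by the mean decrease in Gini impurity; for XGBoost, `feature_importances_` reports the booster's own importance measure (e.g. gain or split count, depending on the version)
- This is a form of feature selection that ensemble methods (Random Forest, XGBoost) can use to reduce overfitting
    - I drop the features that seem unimportant, i.e. those with less than a 1% contribution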

In [None]:
importances = pd.DataFrame({'feature': X_train.columns,
                            'importance': np.round(xgb.feature_importances_, 5)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')

In [None]:
importances

# Feature Removal

- Remove features to simplify model and prevent overfitting
- Drop anything that contributes under 1% to prevent overfitting
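
The 1% rule can be applied programmatically instead of by reading the table. A minimal sketch, assuming the importances are available as a name-to-value mapping; the numbers below are placeholders, NOT the notebook's actual importances, and `Hour` is a hypothetical feature name.

```python
def low_importance_features(importances, threshold=0.01):
    """Names of features whose share of the total importance is below `threshold`."""
    total = sum(importances.values())
    return sorted(name for name, value in importances.items()
                  if value / total < threshold)

# Placeholder importances for illustration only
importances = {"X": 0.30, "Y": 0.28, "Hour": 0.20, "BlockNumber": 0.20,
               "Holiday": 0.008, "Weekend": 0.007, "BusinessHour": 0.005}
to_drop = low_importance_features(importances)
print(to_drop)
```

With pandas, the resulting list could then be passed to `X_train.drop(columns=to_drop)` and `X_test.drop(columns=to_drop)` so that both frames stay in sync.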

In [None]:
X_train = X_train.drop("BusinessHour", axis=1)
X_test  = X_test.drop("BusinessHour", axis=1)

In [None]:
X_train = X_train.drop("Precinct", axis=1)
X_test  = X_test.drop("Precinct", axis=1)

In [None]:
X_train = X_train.drop("Holiday", axis=1)
X_test  = X_test.drop("Holiday", axis=1)

In [None]:
X_train = X_train.drop("Weekend", axis=1)
X_test  = X_test.drop("Weekend", axis=1)

In [None]:
X_train.head()

In [None]:
X_test.head()

# Train final model with optimal hyperparameters & features

In [None]:
# It seems running time scales quadratically with the number of classes
xgb = XGBClassifier(
    n_estimators=86, 
    objective="multi:softprob", 
    learning_rate=0.1858621466840661,
    colsample_bylevel=1.0,
    colsample_bytree=1.0,
    gamma=0.49999999999999994,
    max_delta_step=0,
    max_depth=50,
    min_child_weight=5,
    reg_alpha=1.0,
    reg_lambda=60.121460571845695,
    scale_pos_weight=1e-06,
    subsample=1.0,
    random_state=1, 
    n_jobs=jobs,
    silent=False)


xgb.fit(X_train, Y_train)

Y_test_pred = xgb.predict_proba(X_test)

In [None]:
sample_submission = pd.read_csv('data/sampleSubmission.csv')

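# Random Forest alternative, kept commented out so that Y_test_pred holds the
# XGBoost predictions from the cell above
# random_forest = RandomForestClassifier(n_estimators=600, max_depth=21, max_features=6,
#                                        min_samples_leaf=43, min_samples_split=40, 
#                                        random_state=1, verbose=3, n_jobs=jobs)
# random_forest.fit(X_train, Y_train)

# Y_test_pred = random_forest.predict_proba(X_test)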

# Model Evaluation

- Evaluate the final model with stratified K-Fold cross validation
- Average the scores of all K folds to get a more reliable estimate of the final model's performance than a single train/dev split
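
To show what "stratified" and "average over folds" mean, here is a library-free sketch. It deals each class out round-robin (no shuffling, unlike the `StratifiedKFold(shuffle=True)` used in this notebook) and scores a trivial baseline that predicts the class frequencies of the training folds. The labels and helper names are toy assumptions.

```python
import math
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

def cross_val_mean(labels, n_splits, score_fold):
    """Score each fold with the remaining folds as training data, then average."""
    folds = stratified_folds(labels, n_splits)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(score_fold(train_idx, test_idx))
    return sum(scores) / len(scores), scores

labels = ["THEFT"] * 6 + ["ASSAULT"] * 3  # toy labels

def prior_log_loss(train_idx, test_idx):
    # Baseline: predict each class with its frequency in the training folds
    counts = Counter(labels[i] for i in train_idx)
    n = len(train_idx)
    return sum(-math.log(counts[labels[i]] / n) for i in test_idx) / len(test_idx)

mean_loss, fold_losses = cross_val_mean(labels, 3, prior_log_loss)
print(mean_loss, fold_losses)
```

Because every fold keeps the 2:1 class ratio, all three fold scores agree here; on real data they differ, and their spread (the standard deviation printed below) indicates how stable the estimate is.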

In [None]:
scores = cross_val_score(xgb, X_train, Y_train, 
                         cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1), 
                         scoring = "neg_log_loss", n_jobs=jobs)

In [None]:
# scoring="neg_log_loss" returns negated log losses (higher is better),
# so the mean printed here is negative
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

# Kaggle Submission

- Reformat and turn in predictions and results from our model
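
Before uploading, it is worth checking that the file has the expected shape: one id column followed by one probability column per class (which is what the `iloc[:, 1:]` assignment below assumes), with each row summing to 1. A minimal sketch using only the standard library; `write_submission`, the three class names, and the probabilities are illustrative stand-ins for the real submission.

```python
import csv
import io

def write_submission(class_names, pred_probs, out):
    """Write an Id column plus one probability column per class, validating rows."""
    writer = csv.writer(out)
    writer.writerow(["Id"] + list(class_names))
    for row_id, probs in enumerate(pred_probs):
        if len(probs) != len(class_names):
            raise ValueError("one probability per class is required")
        if abs(sum(probs) - 1.0) > 1e-6:
            raise ValueError(f"row {row_id} does not sum to 1")
        writer.writerow([row_id] + list(probs))

# Illustrative subset of classes and made-up probabilities
class_names = ["ARSON", "ASSAULT", "BURGLARY"]
pred_probs = [[0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
buffer = io.StringIO()
write_submission(class_names, pred_probs, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Writing to an in-memory buffer keeps the check cheap; swapping the buffer for an open file handle would produce the actual CSV.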

In [None]:
Y_test_pred.shape

In [None]:
sample_submission = pd.read_csv('data/sampleSubmission.csv')

In [None]:
sample_submission.shape

In [None]:
sample_submission.iloc[:, 1:] = pd.DataFrame(Y_test_pred, columns=sample_submission.columns[1:])

In [None]:
sample_submission.head(10)

In [None]:
sample_submission.to_csv('submissions/submission_xgb_with_Season.csv', index=False)

# Summary

- After lots of tuning, I finally achieved a Kaggle evaluation score (multiclass log loss) of **2.25674**, which would ideally rank at **#136** (out of 2,335 teams), i.e. in the **top 6%** or **94th percentile** on the public leaderboard
    - Since this is an old Kaggle competition, the actual rank would most likely be lower, but I still felt proud to achieve this score
    - It is possible that I could run more experiments and tune the hyperparameters to achieve an even better score & ranking 
    - This was more of a learning experience for me & to get my feet wet with Data Science projects & Kaggle competitions
    - In an effort to learn, I refrained from looking up old Kaggle kernels & other resources that completed this specific Kaggle competition.
    - I coded most of this myself to learn the data science libraries, but did use resources such as other Kaggle competition kernels and research papers to get a better idea of how to think about the data. Google is awesome.
- Below, I show images of my two highest scoring submissions on Kaggle

In [None]:
from IPython.display import Image
Image(filename='images/best_kaggle_submission.png') 

In [None]:
Image(filename='images/2nd_best_kaggle_submission.png') 

# Conclusion

This project has taught me a lot about data science and has given me hands-on experience with working with data and completing an end-to-end data science project. I've had a lot of fun visualizing, analyzing, and experimenting with the data to gain more insight. This is just the beginning of my journey into data science, and I am very excited to see what the future holds in terms of new and interesting data science problems and datasets.

- **What I learned**:
    - There are more efficient ways to label or integer encode features
        - Will use sklearn's LabelEncoder, OneHotEncoder, & MultiLabelBinarizer next time
    - Instead of just blindly training models, research more about ways to optimize the hyperparameters efficiently
        - Spent too many AWS EC2 hours with `GridSearchCV`, when I should have used *Bayesian Optimization* for efficient hyperparameter tuning
        - Do more research on the domain of the problem, certain core ML algorithms, and data processing techniques
- **What's next?**
    - AutoML with `tpot` or `auto-sklearn`
        - automate the hyperparameter tuning and model selection with AutoML packages
    - Problem Redirection (Classification ---> Regression)
        - Instead of predicting category of crime, predict X & Y coordinates (longitude & latitude) continuous values given same spatial and temporal features as well as category of crime
        - **Use case:** Dynamically concentrate police on certain serious categories of crime to prevent crimes from happening beforehand
    - Rewrite all code in the jupyter notebook to .py files
        - Modularize each of the steps with functions and/or classes
        - Useful because I can run the .py file on AWS EC2 without having to host it on jupyter notebook locally
            - Meaning I can peacefully shut down my laptop and let script run in the cloud overnight
