# Minneapolis Crime Data Exploration & Analysis
* Created on: September 5, 2021
* Description: Data analysis, exploration & visualization on crime incidents in Minneapolis

## Minneapolis Incidents Datasets

This dataset contains incidents derived from the Minneapolis Crime Incident Reporting system. The data ranges from 1/1/2010 to 09/04/2021. However, there are two different structures: from 2010 through May of 2018 the
Computer Assisted Police Records System (CAPRS) was used, and in June of 2018 it was replaced by the Police Incident Management System (PIMS).

### Hypothesis 
My goal was to build a predictive model for two main reasons: 
    1)	to help Minneapolis citizens make better decisions regarding their personal safety 
    2)	to help law enforcement more effectively allocate resources. 

My model will predict the crime category (`Category`) given a set of independent variables. 

We brainstormed a list of questions to drive our hypothesis and body of work. 

1.	Can a model be developed to accurately predict the category of a crime given where and when it occurred (neighborhood, precinct, coordinates, date, and time)?
If so, what are the implications for residents and law enforcement?

How can this model be utilized? 

2.	What can be gleaned about crime trends through visualization of this data? Which precincts, neighborhoods, and times of day see the most incidents? How are these patterns changing over the years? 

3.	Can we determine if there is bias in how incidents are recorded in the data? 




## Import libraries

In [6]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import xgboost as xgb

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

# Metrics 
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

# Model Selection & Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from skopt import BayesSearchCV
from skopt.space  import Real, Categorical, Integer

# Clustering
from sklearn.cluster import KMeans

# Mathematical Functions
import math

import warnings
warnings.filterwarnings('ignore')

## Load data

In [7]:
from sklearn.model_selection import train_test_split #training and testing data split

df_train = pd.read_csv('incidents.csv', index_col=0)
# df_train = (df_train['Dates'] > '2020-01-01') & (df_train['Dates'] <= '2020-12-31')
df_train.drop(columns=['Time', 'Date', 'Month_Name', 'DayOfWeek_Num', 'Offense', 'Hour', 'Year', 'UCRCode'], inplace=True)
#df_train.rename(columns={'Lat':'Y', 'Long':'X'}, inplace=True)
# df_train['X'] = df_train['Long']
# df_train['Y'] = df_train['Lat']
# 
train_df, test_df = train_test_split(df_train, test_size=0.25, random_state=0)

In [8]:
# train_df.shape
# test_df.shape

# Data Exploration & Analysis Extension

- This dataset suffers from **imbalanced classes**: LARCENY has 97,791 occurrences in the training set while MURDER has only 280.
    - There are a couple ways to deal with imbalanced classes, such as:
        - Changing performance metric (Do not use accuracy, use a confusion matrix, precision, recall, F1 score, ROC curves)
        - Resample dataset (Oversample under-represented classes, and undersample over-represented classes)
        - Try different ML algorithms that can handle imbalanced classes
            - Decision Trees (Random Forests/XGBoost) often perform well on imbalanced classes (due to splitting rules)
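
One way to act on the imbalance is to weight each class inversely to its frequency. The sketch below is not part of the original notebook: it applies the common "balanced" heuristic, `n_samples / (n_classes * count)`, to the `Category` counts that `value_counts()` reports for the training set in this notebook. The variable names are illustrative.

```python
# Category counts copied from the value_counts() output in this notebook
category_counts = {
    "LARCENY": 97791, "BURGLARY": 33876, "AUTO THEFT": 17887, "ASSAULT": 13454,
    "ROBBERY": 12492, "RAPE": 2464, "ARSON": 868, "MURDER": 280,
}

n_samples = sum(category_counts.values())
n_classes = len(category_counts)

# "Balanced" heuristic: rare classes receive proportionally larger weights
class_weights = {
    name: n_samples / (n_classes * count)
    for name, count in category_counts.items()
}

for name, weight in class_weights.items():
    print(f"{name:<12}{weight:8.3f}")
```

A dictionary like this can usually be passed to scikit-learn classifiers that accept a `class_weight` argument, such as `RandomForestClassifier`, provided the keys match the label values used for training.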

In [9]:
train_df.head(8)

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Description,Month,DayOfWeek,Category
8827,2019-12-05 19:25:00,0025XX BROADWAY AVE W,45.006191,-93.312062,WILLARD - HAY,NORTH,AUTOMOBILE THEFT,12,Thursday,AUTO THEFT
15702,2019-07-17 16:50:00,0014XX LAKE ST W,44.948364,-93.298919,EAST ISLES,SOUTHWEST,OTHER THEFT,7,Wednesday,LARCENY
15643,2016-10-07 15:59:59,0045XX 18 AV S,44.920612,-93.248573,NORTHROP,SOUTHEAST,Burglary Of Dwelling,10,Friday,BURGLARY
8977,2014-06-19 23:29:59,0025XX Cedar AV S,44.957798,-93.247331,EAST PHILLIPS,SOUTHEAST,Asslt W/dngrs Weapon,6,Thursday,ASSAULT
5449,2016-04-30 00:30:00,00001X Barton AV SE,44.966949,-93.214913,PROSPECT PARK - EAST RIVER ROAD,NORTHEAST,Motor Vehicle Theft,4,Saturday,AUTO THEFT
7386,2019-07-27 17:53:00,0041XX CHICAGO AVE,44.927843,-93.262507,BRYANT,SOUTHEAST,THEFT FROM PERSON SNATCH/GRAB,7,Saturday,LARCENY
12024,2010-08-09 00:00:00,0055XX 36 AV S,44.90266,-93.220366,MORRIS PARK,SOUTHEAST,Theft From Motr Vehc,8,Monday,LARCENY
4659,2015-04-12 00:30:00,1 AV N / 5 ST N,44.98037,-93.27374,DOWNTOWN WEST,DOWNTOWN,Other Theft,4,Sunday,LARCENY


In [10]:
train_df.columns.values

array(['Dates', 'Address', 'Lat', 'Long', 'Neighborhood', 'Precinct',
       'Description', 'Month', 'DayOfWeek', 'Category'], dtype=object)

In [11]:
# show the non-null count for every column (show_counts replaces the deprecated null_counts)
train_df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179112 entries, 8827 to 15367
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Dates         179112 non-null  object 
 1   Address       179112 non-null  object 
 2   Lat           179112 non-null  float64
 3   Long          179112 non-null  float64
 4   Neighborhood  179112 non-null  object 
 5   Precinct      179112 non-null  object 
 6   Description   179112 non-null  object 
 7   Month         179112 non-null  int64  
 8   DayOfWeek     179112 non-null  object 
 9   Category      179112 non-null  object 
dtypes: float64(2), int64(1), object(7)
memory usage: 15.0+ MB


------------
### Things we learned thus far:

- 179,112 instances in training set (recorded crime incidents in Minneapolis)
- 10 columns (9 potential features + 1 label (Category))
- Data types:
    - 2 columns with float values
    - 1 column with integer values
    - 7 objects
- There are no null (NaN) values (Yay!)

In [12]:
## Count number of observations for each crime 
train_df['Category'].value_counts()

LARCENY       97791
BURGLARY      33876
AUTO THEFT    17887
ASSAULT       13454
ROBBERY       12492
RAPE           2464
ARSON           868
MURDER          280
Name: Category, dtype: int64

In [13]:
## Count number of observations of crime for each Precinct
train_df['Precinct'].value_counts()

SOUTHEAST    48320
SOUTHWEST    36622
NORTH        34752
DOWNTOWN     33706
NORTHEAST    25712
Name: Precinct, dtype: int64

In [14]:
## Count number of observations for each day of week
train_df['DayOfWeek'].value_counts()

Friday       27611
Saturday     26810
Monday       25351
Tuesday      24935
Wednesday    24922
Thursday     24829
Sunday       24654
Name: DayOfWeek, dtype: int64

In [15]:
# ## Count number of observations for Resolution feature
# train_df['Resolution'].value_counts()

In [16]:
train_df[['Long','Lat']].describe()

Unnamed: 0,Long,Lat
count,179112.0,179112.0
mean,-93.268126,44.968414
std,0.312855,0.153771
min,-93.329109,0.0
25%,-93.290013,44.948355
50%,-93.27243,44.968855
75%,-93.248573,44.989595
max,0.0,45.05124


**There seem to be invalid coordinates: a minimum latitude of 0.0 and a maximum longitude of 0.0 are not valid coordinates in Minneapolis. We must fix these values for this feature.**
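
A simple plausibility check can flag such rows before any cleaning. The sketch below is illustrative only: the bounding box is an assumed, generously padded rectangle around the coordinate ranges seen in this notebook, not an official city boundary, and `is_plausible` is a hypothetical helper name.

```python
# Rough bounding box around Minneapolis (assumed values; adjust as needed)
LAT_RANGE = (44.85, 45.10)
LONG_RANGE = (-93.40, -93.15)

def is_plausible(lat, long):
    """Return True when the point falls inside the assumed bounding box."""
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LONG_RANGE[0] <= long <= LONG_RANGE[1])

points = [
    (45.006191, -93.312062),  # first row shown by train_df.head()
    (0.0, 0.0),               # placeholder values visible in describe()
]
flags = [is_plausible(lat, long) for lat, long in points]
print(flags)
```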

# Data Preprocessing

- Data cleaning
    - imputation or removal of outlier values
- Feature Engineering (Feature Creation)
- Feature Encoding
    - **Integer encode** or **label encode** ordinal categorical features that maintain order (Year, Business Quarter, Block/Street Number)
    - Usually: 
        - **One hot encode** nominal categorical features (DayOfWeek, Precinct, StreetType, Category)
            - mainly for logistic regression
        - However, Random Forests & Boosting algorithms can handle nominal categorical features directly, so we just **integer encode** these features.
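
To make the difference between the two encodings concrete, here is a small pure-Python sketch using `DayOfWeek`. It is an illustration rather than the notebook's own code; `day_to_int` and `one_hot` are hypothetical names.

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Integer encoding: a single column, which implies an order Monday < ... < Sunday
day_to_int = {day: i for i, day in enumerate(days)}

# One-hot encoding: one indicator column per day, with no implied order
def one_hot(day):
    return [1 if day == d else 0 for d in days]

sample = ["Thursday", "Wednesday", "Friday"]
print([day_to_int[d] for d in sample])
print([one_hot(d) for d in sample])
```

Integer encoding keeps the feature space small, which tends to suit tree-based models, while one-hot encoding avoids suggesting to a linear model that Sunday is "greater than" Monday.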

In [17]:
#train_df['UCRCode'] = train_df['UCRCode'].astype(int)
#test_df['BusinessHour'] = test_df['Dates'].map(map_business_hours).astype('uint8')

## Data Cleaning

- Data removal
- Data imputation

In [18]:
train_df[train_df['Lat'] == train_df['Lat'].max()]

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Description,Month,DayOfWeek,Category
8572,2012-06-08 22:15:00,53 AV N / Vincent AV N,45.05124,-93.31697,SHINGLE CREEK,NORTH,Robbery Per Agg,6,Friday,ROBBERY


I notice that a few rows have incorrect coordinates, and they seem to carry the same placeholder value (0.0, 0.0). There are many ways to handle this. One is data imputation, which can be done several ways: for example, randomly sampling from a normal distribution within one standard deviation of the mean, or using a linear regression model to predict the latitude and longitude values (based on other variables such as precinct) and using that to impute the bad / inconsistent data points.

Another method is to completely remove this data. Since I already have a lot of data, and I do not want this incorrect data to affect my results, removing it is reasonable. The next cell marks the suspect values as NaN and drops those rows, so the imputation loop further down finds nothing left to fill.
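
The idea of imputing coordinates from another variable can be sketched with a simpler stand-in for the regression model mentioned above: filling each missing value with the median of its precinct. This is an assumption-laden illustration on a toy frame (the coordinates are copied from rows displayed in this notebook, the NaN rows are made up), not the method the notebook applies.

```python
import numpy as np
import pandas as pd

# Toy frame: known coordinates come from rows shown earlier, NaNs are invented
toy = pd.DataFrame({
    "Precinct": ["NORTH", "NORTH", "SOUTHEAST", "SOUTHEAST", "SOUTHEAST"],
    "Lat": [45.006191, np.nan, 44.920612, 44.957798, np.nan],
    "Long": [-93.312062, np.nan, -93.248573, -93.247331, np.nan],
})

# Fill each missing coordinate with the median of its own precinct
for col in ["Lat", "Long"]:
    precinct_median = toy.groupby("Precinct")[col].transform("median")
    toy[col] = toy[col].fillna(precinct_median)

print(toy)
```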

In [19]:
# Mark suspect coordinates as NaN, then drop those rows.
# Caveat: Lat.max() is a valid latitude (45.05124, see the row above); the 0.0
# placeholder is Lat.min(). The (0.0, 0.0) rows are still removed from train_df
# because Long.max() is 0.0 there.
train_df['Lat'] = train_df['Lat'].replace(train_df['Lat'].max(), np.nan)
train_df['Long'] = train_df['Long'].replace(train_df['Long'].max(), np.nan)
test_df['Lat'] = test_df['Lat'].replace(test_df['Lat'].max(), np.nan)
test_df['Long'] = test_df['Long'].replace(test_df['Long'].max(), np.nan)
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

In [20]:
train_df.isnull().sum()

Dates           0
Address         0
Lat             0
Long            0
Neighborhood    0
Precinct        0
Description     0
Month           0
DayOfWeek       0
Category        0
dtype: int64

In [21]:
test_df.isnull().sum()

Dates           0
Address         0
Lat             0
Long            0
Neighborhood    0
Precinct        0
Description     0
Month           0
DayOfWeek       0
Category        0
dtype: int64

In [22]:
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

In [23]:
data = [train_df, test_df]

for dataset in data:
    mean_X = dataset["Long"].mean()
    std_X = dataset["Long"].std()
    mean_Y = dataset["Lat"].mean()
    std_Y = dataset["Lat"].std()
    max_X = mean_X + std_X
    min_X = mean_X - std_X
    max_Y = mean_Y + std_Y
    min_Y = mean_Y - std_Y

    # Long and Lat are assumed to be missing on the same rows, so counting Lat is enough
    is_null = dataset['Lat'].isnull().sum()
    # randomly sample float numbers within a range from a uniform distribution
#     random_X = (max_X - min_X) * np.random.random_sample(size = is_null) + min_X
#     random_Y = (max_Y - min_Y) * np.random.random_sample(size = is_null) + min_Y
    # randomly sample from a normal distribution centred on the column mean
    random_X = np.random.normal(loc=mean_X, scale=std_X, size=is_null)
    random_Y = np.random.normal(loc=mean_Y, scale=std_Y, size=is_null)

    X_slice = dataset['Long'].copy()
    Y_slice = dataset['Lat'].copy()
    X_slice[np.isnan(X_slice)] = random_X
    Y_slice[np.isnan(Y_slice)] = random_Y
    dataset['Long'] = X_slice
    dataset['Lat'] = Y_slice


In [24]:
train_df[['Long', 'Lat']].describe()

Unnamed: 0,Long,Lat
count,179109.0,179109.0
mean,-93.269167,44.968915
std,0.027248,0.032641
min,-93.329109,44.89061
25%,-93.290013,44.948355
50%,-93.27243,44.968855
75%,-93.248578,44.989595
max,-93.19858,45.051227


In [25]:
len(train_df)

179109

In [26]:
test_df[['Long', 'Lat']].describe()

Unnamed: 0,Long,Lat
count,59702.0,59702.0
mean,-93.268953,44.968657
std,0.027328,0.032691
min,-93.329109,44.890627
25%,-93.289832,44.948351
50%,-93.27206,44.968457
75%,-93.248045,44.989254
max,-93.19858,45.05119


In [27]:
len(test_df)

59702

# Feature Engineering

- Let's create some new features from the data that exists in the current feature space
- There are a couple categories of features:
    - Temporal features
    - Spatial features

## Temporal Features
We want separate columns for the time components, so we parse the 'Dates' feature to create the Minute, Hour, Day, Month, and Year features


In [28]:
# Transform the Date into a python datetime object.
train_df["Dates"] = pd.to_datetime(train_df["Dates"], format="%Y-%m-%d %H:%M:%S")
test_df["Dates"] = pd.to_datetime(test_df["Dates"], format="%Y-%m-%d %H:%M:%S")

In [29]:
# Minute
train_df["Minute"] = train_df["Dates"].map(lambda x: x.minute)
test_df["Minute"] = test_df["Dates"].map(lambda x: x.minute)

In [30]:
# Hour
train_df["Hour"] = train_df["Dates"].map(lambda x: x.hour)
test_df["Hour"] = test_df["Dates"].map(lambda x: x.hour)

In [31]:
# Day
train_df["Day"] = train_df["Dates"].map(lambda x: x.day)
test_df["Day"] = test_df["Dates"].map(lambda x: x.day)

In [32]:
# Month
train_df["Month"] = train_df["Dates"].map(lambda x: x.month)
test_df["Month"] = test_df["Dates"].map(lambda x: x.month)

In [33]:
# Year
train_df["Year"] = train_df["Dates"].map(lambda x: x.year)
test_df["Year"] = test_df["Dates"].map(lambda x: x.year)
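
The same components can also be pulled out with the vectorized `.dt` accessor instead of a per-row `lambda`. The sketch below runs on a two-row toy frame (timestamps copied from rows shown in this notebook) and is offered as an equivalent alternative, not as the notebook's own code.

```python
import pandas as pd

toy = pd.DataFrame({"Dates": ["2019-12-05 19:25:00", "2016-04-30 00:30:00"]})
toy["Dates"] = pd.to_datetime(toy["Dates"], format="%Y-%m-%d %H:%M:%S")

# The .dt accessor extracts each component for the whole column at once
toy["Minute"] = toy["Dates"].dt.minute
toy["Hour"] = toy["Dates"].dt.hour
toy["Day"] = toy["Dates"].dt.day
toy["Month"] = toy["Dates"].dt.month
toy["Year"] = toy["Dates"].dt.year

print(toy)
```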

In [34]:
# Hour Zone 0 - Past midnight, 1 - morning, 2 - afternoon, 3 - dinner / sunset, 4 - night
def get_hour_zone(hour):
    if hour >= 2 and hour < 8: 
        return 0
    elif hour >= 8 and hour < 12: 
        return 1
    elif hour >= 12 and hour < 18: 
        return 2
    elif hour >= 18 and hour < 22: 
        return 3
    elif hour < 2 or hour >= 22: 
        return 4
    
train_df["Hour_Zone"] = train_df["Hour"].map(get_hour_zone)
test_df["Hour_Zone"] = test_df["Hour"].map(get_hour_zone)
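
As a quick sanity check on the zone boundaries, the function can be tabulated for every hour of the day. The sketch below restates the function with the same boundaries so that it runs on its own; `hour_to_zone` is an illustrative name.

```python
def get_hour_zone(hour):
    """Same boundaries as the function above."""
    if 2 <= hour < 8:
        return 0   # past midnight
    elif 8 <= hour < 12:
        return 1   # morning
    elif 12 <= hour < 18:
        return 2   # afternoon
    elif 18 <= hour < 22:
        return 3   # dinner / sunset
    return 4       # night (22:00 to 01:59)

# Tabulate the zone for each hour 0..23
hour_to_zone = {hour: get_hour_zone(hour) for hour in range(24)}
print(hour_to_zone)
```

Because `Series.map` also accepts a dictionary, a lookup table like this could be used in place of the function when building `Hour_Zone`.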

In [35]:
train_df.head()

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Description,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone
8827,2019-12-05 19:25:00,0025XX BROADWAY AVE W,45.006191,-93.312062,WILLARD - HAY,NORTH,AUTOMOBILE THEFT,12,Thursday,AUTO THEFT,25,19,5,2019,3
15702,2019-07-17 16:50:00,0014XX LAKE ST W,44.948364,-93.298919,EAST ISLES,SOUTHWEST,OTHER THEFT,7,Wednesday,LARCENY,50,16,17,2019,2
15643,2016-10-07 15:59:59,0045XX 18 AV S,44.920612,-93.248573,NORTHROP,SOUTHEAST,Burglary Of Dwelling,10,Friday,BURGLARY,59,15,7,2016,2
8977,2014-06-19 23:29:59,0025XX Cedar AV S,44.957798,-93.247331,EAST PHILLIPS,SOUTHEAST,Asslt W/dngrs Weapon,6,Thursday,ASSAULT,29,23,19,2014,4
5449,2016-04-30 00:30:00,00001X Barton AV SE,44.966949,-93.214913,PROSPECT PARK - EAST RIVER ROAD,NORTHEAST,Motor Vehicle Theft,4,Saturday,AUTO THEFT,30,0,30,2016,4


In [36]:
train_df.head(10)

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Description,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone
8827,2019-12-05 19:25:00,0025XX BROADWAY AVE W,45.006191,-93.312062,WILLARD - HAY,NORTH,AUTOMOBILE THEFT,12,Thursday,AUTO THEFT,25,19,5,2019,3
15702,2019-07-17 16:50:00,0014XX LAKE ST W,44.948364,-93.298919,EAST ISLES,SOUTHWEST,OTHER THEFT,7,Wednesday,LARCENY,50,16,17,2019,2
15643,2016-10-07 15:59:59,0045XX 18 AV S,44.920612,-93.248573,NORTHROP,SOUTHEAST,Burglary Of Dwelling,10,Friday,BURGLARY,59,15,7,2016,2
8977,2014-06-19 23:29:59,0025XX Cedar AV S,44.957798,-93.247331,EAST PHILLIPS,SOUTHEAST,Asslt W/dngrs Weapon,6,Thursday,ASSAULT,29,23,19,2014,4
5449,2016-04-30 00:30:00,00001X Barton AV SE,44.966949,-93.214913,PROSPECT PARK - EAST RIVER ROAD,NORTHEAST,Motor Vehicle Theft,4,Saturday,AUTO THEFT,30,0,30,2016,4
7386,2019-07-27 17:53:00,0041XX CHICAGO AVE,44.927843,-93.262507,BRYANT,SOUTHEAST,THEFT FROM PERSON SNATCH/GRAB,7,Saturday,LARCENY,53,17,27,2019,2
12024,2010-08-09 00:00:00,0055XX 36 AV S,44.90266,-93.220366,MORRIS PARK,SOUTHEAST,Theft From Motr Vehc,8,Monday,LARCENY,0,0,9,2010,4
4659,2015-04-12 00:30:00,1 AV N / 5 ST N,44.98037,-93.27374,DOWNTOWN WEST,DOWNTOWN,Other Theft,4,Sunday,LARCENY,30,0,12,2015,4
18649,2013-11-02 12:59:59,0027XX Dupont AV S,44.952846,-93.293163,LOWRY HILL EAST,SOUTHWEST,Burglary Of Dwelling,11,Saturday,BURGLARY,59,12,2,2013,2
15129,2016-10-03 21:30:00,0004XX Chicago AV S,44.974663,-93.259387,DOWNTOWN EAST,DOWNTOWN,Other Theft,10,Monday,LARCENY,30,21,3,2016,3


### Season

The season may affect what types of crimes are committed. 
- 1 = Winter, 2 = Spring, 3 = Summer, 4 = Fall

In [37]:
train_df['Season']=(train_df['Month']%12 + 3)//3
test_df['Season']=(test_df['Month']%12 + 3)//3
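
To see how the expression `(Month % 12 + 3) // 3` maps months to seasons, it helps to evaluate it for all twelve months. The `% 12` step sends December to 0 so that it lands in the same bucket as January and February. This is a small illustrative check; `month_to_season` is a hypothetical name.

```python
# Apply the same expression to every month number
month_to_season = {month: (month % 12 + 3) // 3 for month in range(1, 13)}
print(month_to_season)
```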

In [38]:
train_df.head()

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Description,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone,Season
8827,2019-12-05 19:25:00,0025XX BROADWAY AVE W,45.006191,-93.312062,WILLARD - HAY,NORTH,AUTOMOBILE THEFT,12,Thursday,AUTO THEFT,25,19,5,2019,3,1
15702,2019-07-17 16:50:00,0014XX LAKE ST W,44.948364,-93.298919,EAST ISLES,SOUTHWEST,OTHER THEFT,7,Wednesday,LARCENY,50,16,17,2019,2,3
15643,2016-10-07 15:59:59,0045XX 18 AV S,44.920612,-93.248573,NORTHROP,SOUTHEAST,Burglary Of Dwelling,10,Friday,BURGLARY,59,15,7,2016,2,4
8977,2014-06-19 23:29:59,0025XX Cedar AV S,44.957798,-93.247331,EAST PHILLIPS,SOUTHEAST,Asslt W/dngrs Weapon,6,Thursday,ASSAULT,29,23,19,2014,4,3
5449,2016-04-30 00:30:00,00001X Barton AV SE,44.966949,-93.214913,PROSPECT PARK - EAST RIVER ROAD,NORTHEAST,Motor Vehicle Theft,4,Saturday,AUTO THEFT,30,0,30,2016,4,2


<!-- ### Weekend

- Weekends may have an effect on what types of crimes are committed
- Weekday = 0
- Weekend =1 -->

In [39]:
# # Weekend Feature

# # Weekday = 0, Weekend = 1
# days = {'Monday':0 ,'Tuesday':0 ,'Wednesday':0 ,'Thursday':0 ,'Friday':0, 'Saturday':1 ,'Sunday':1}

# train_df['Weekend'] = train_df['DayOfWeek'].replace(days).astype(int)
# test_df['Weekend'] = test_df['DayOfWeek'].replace(days).astype(int)

## Spatial Features

In [40]:
train_df.head(8)

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Description,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone,Season
8827,2019-12-05 19:25:00,0025XX BROADWAY AVE W,45.006191,-93.312062,WILLARD - HAY,NORTH,AUTOMOBILE THEFT,12,Thursday,AUTO THEFT,25,19,5,2019,3,1
15702,2019-07-17 16:50:00,0014XX LAKE ST W,44.948364,-93.298919,EAST ISLES,SOUTHWEST,OTHER THEFT,7,Wednesday,LARCENY,50,16,17,2019,2,3
15643,2016-10-07 15:59:59,0045XX 18 AV S,44.920612,-93.248573,NORTHROP,SOUTHEAST,Burglary Of Dwelling,10,Friday,BURGLARY,59,15,7,2016,2,4
8977,2014-06-19 23:29:59,0025XX Cedar AV S,44.957798,-93.247331,EAST PHILLIPS,SOUTHEAST,Asslt W/dngrs Weapon,6,Thursday,ASSAULT,29,23,19,2014,4,3
5449,2016-04-30 00:30:00,00001X Barton AV SE,44.966949,-93.214913,PROSPECT PARK - EAST RIVER ROAD,NORTHEAST,Motor Vehicle Theft,4,Saturday,AUTO THEFT,30,0,30,2016,4,2
7386,2019-07-27 17:53:00,0041XX CHICAGO AVE,44.927843,-93.262507,BRYANT,SOUTHEAST,THEFT FROM PERSON SNATCH/GRAB,7,Saturday,LARCENY,53,17,27,2019,2,3
12024,2010-08-09 00:00:00,0055XX 36 AV S,44.90266,-93.220366,MORRIS PARK,SOUTHEAST,Theft From Motr Vehc,8,Monday,LARCENY,0,0,9,2010,4,3
4659,2015-04-12 00:30:00,1 AV N / 5 ST N,44.98037,-93.27374,DOWNTOWN WEST,DOWNTOWN,Other Theft,4,Sunday,LARCENY,30,0,12,2015,4,2


## X, Y Coordinates

- Normalize and scale the X and Y coordinates
- I use **K-Means clustering** to create a new feature for the longitude and latitude by grouping clusters of points based on Euclidean distances.
- X = longitude, Y = latitude
- I also extract more spatial features from the X, Y coordinates by rotating them and by transforming them from Cartesian space to polar space ([Reference](https://www.kaggle.com/c/sf-crime/discussion/18853))
    1. Three variants of rotated Cartesian coordinates (rotated by 30, 45, and 60 degrees) 
    2. Polar coordinates (i.e. the radius 'r' and the angle 'theta')
    3. The approach makes some intuitive sense, i.e. having such features should help in extracting more spatial information than relying on the current x-y alone
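
The rotation and radius formulas can be checked by hand on a single point. The sketch below uses the scaled coordinates of the first training row as they are displayed later in this notebook; `rotate` is a hypothetical helper that applies the standard 2-D rotation `x' = x cos(t) - y sin(t)`, `y' = x sin(t) + y cos(t)`.

```python
import math

def rotate(x, y, degrees):
    """Rotate the point (x, y) by the given angle using the standard 2-D formulas."""
    theta = math.radians(degrees)
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Scaled coordinates of the first training row (x = Long, y = Lat)
x, y = -1.574275, 1.142002

rot30_x, rot30_y = rotate(x, y, 30)
radius = math.hypot(x, y)

print(round(rot30_x, 4), round(rot30_y, 4), round(radius, 4))
# A rotation changes the axes but not the distance from the origin
print(math.isclose(math.hypot(rot30_x, rot30_y), radius))
```

The results agree, up to rounding, with the `Rot30_X`, `Rot30_Y`, and `Radius` values shown for that row in the notebook output.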

In [41]:
# Normalize X and Y
print('There are %d unique longitude values, %d unique latitude values' % (train_df['Long'].nunique(), 
                                                                           train_df['Lat'].nunique()))

xy_scaler = StandardScaler().fit(train_df[['Long', 'Lat']])
train_df[['Long', 'Lat']] = xy_scaler.transform(train_df[['Long', 'Lat']])
test_df[['Long', 'Lat']] = xy_scaler.transform(test_df[['Long', 'Lat']])

There are 11114 unique longitude values, 10700 unique latitude values


In [42]:
# X-Y plane rotation and space transformation to extract more spatial information
# 2-dimensional rotation based on below functions:
# rotated x = xcos - ysin
# rotated y = xsin + ycos
# Convert from Cartesian space -> polar space

cos_30 = math.cos(math.radians(30))
sin_30 = math.sin(math.radians(30))
cos_45 = math.cos(math.radians(45))
sin_45 = math.sin(math.radians(45))
cos_60 = math.cos(math.radians(60))
sin_60 = math.sin(math.radians(60))


train_df["Rot30_X"] = train_df['Long'] * cos_30 - train_df['Lat'] * sin_30 
train_df["Rot30_Y"] = train_df['Long'] * sin_30 + train_df['Lat'] * cos_30
train_df["Rot45_X"] = train_df['Long'] * cos_45 - train_df['Lat'] * sin_45  
train_df["Rot45_Y"] = train_df['Long'] * sin_45 + train_df['Lat'] * cos_45
train_df["Rot60_X"] = train_df['Long'] * cos_60 - train_df['Lat'] * sin_60  
train_df["Rot60_Y"] = train_df['Long'] * sin_60 + train_df['Lat'] * cos_60
train_df["Radius"] = np.sqrt(train_df['Long'] ** 2 + train_df['Lat'] ** 2)
# np.arctan2 takes (y, x); passing (Long, Lat) measures the angle from the Lat axis
train_df["Angle"] = np.arctan2(train_df['Long'], train_df['Lat'])

test_df["Rot30_X"] = test_df['Long'] * cos_30 - test_df['Lat'] * sin_30  
test_df["Rot30_Y"] = test_df['Long'] * sin_30 + test_df['Lat'] * cos_30
test_df["Rot45_X"] = test_df['Long'] * cos_45 - test_df['Lat'] * sin_45  
test_df["Rot45_Y"] = test_df['Long'] * sin_45 + test_df['Lat'] * cos_45
test_df["Rot60_X"] = test_df['Long'] * cos_60 - test_df['Lat'] * sin_60  
test_df["Rot60_Y"] = test_df['Long'] * sin_60 + test_df['Lat'] * cos_60
test_df["Radius"] = np.sqrt(test_df['Long'] ** 2 + test_df['Lat'] ** 2)
test_df["Angle"] = np.arctan2(test_df['Long'], test_df['Lat'])

In [43]:
# View the description of the numerical features again to ensure everything is right
train_df.describe()

Unnamed: 0,Lat,Long,Month,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle
count,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0,179109.0
mean,2.575227e-13,3.612329e-13,6.661999,22.13427,12.731649,15.766963,2015.208504,2.279305,2.599451,1.840734e-13,4.036354e-13,7.333686e-14,4.375278e-13,-4.240573e-14,4.415973e-13,1.239928,0.020799
std,1.000003,1.000003,3.27082,21.252736,7.240048,8.830191,3.352325,1.276477,1.079458,1.107644,0.8792819,1.123375,0.8590925,1.107644,0.8792819,0.680132,1.7772
min,-2.399019,-2.199904,1.0,0.0,0.0,1.0,2010.0,0.0,1.0,-2.845945,-2.955079,-3.069195,-2.957462,-3.083285,-2.862578,0.009058,-3.141579
25%,-0.6299164,-0.765064,4.0,0.0,8.0,8.0,2012.0,1.0,2.0,-0.6293764,-0.5508591,-0.5272519,-0.5030372,-0.6047482,-0.577518,0.682714,-1.297353
50%,-0.001865436,-0.1197611,7.0,20.0,14.0,16.0,2015.0,2.0,3.0,-0.07038863,0.0751288,-0.04647813,0.06586983,0.02559916,-0.00766925,1.208405,-0.362006
75%,0.6335539,0.7556101,9.0,39.0,19.0,23.0,2018.0,3.0,4.0,0.7500017,0.51498,0.7272684,0.5074724,0.7142628,0.603406,1.737424,1.748211
max,2.521766,2.590559,12.0,59.0,23.0,31.0,2021.0,4.0,4.0,3.216748,2.534508,3.208196,2.525678,3.052529,2.380505,3.240358,3.140712


In [44]:
# run KMeans separately on both the training set and test set
data = [train_df, test_df]
num_clusters = 40
for dataset in data:
    coordinates = dataset.loc[:,['Lat','Long']]
    kmeans = KMeans(n_clusters=num_clusters, random_state=1).fit(coordinates)
    #kmeans = KMeans(n_clusters=num_clusters, n_jobs=3, random_state=1).fit(coordinates)
    id_labels=kmeans.labels_
#     print(kmeans.cluster_centers_)
    dataset['Cluster'] = id_labels
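
One caveat with fitting K-Means separately on each set is that cluster ids are arbitrary per fit, so cluster 15 in `train_df` need not describe the same area as cluster 15 in `test_df`. A possible alternative, sketched below on synthetic points rather than the real coordinates, is to fit on the training data only and use `predict` for the test data. The array names and the small `n_clusters` are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-ins for the scaled (Lat, Long) columns
train_xy = rng.normal(size=(200, 2))
test_xy = rng.normal(size=(50, 2))

# Fit the centroids on the training points only
kmeans = KMeans(n_clusters=5, n_init=10, random_state=1).fit(train_xy)

train_cluster = kmeans.labels_
# predict() assigns each test point to the nearest training centroid,
# so cluster id k refers to the same region in both sets
test_cluster = kmeans.predict(test_xy)

print(train_cluster[:10], test_cluster[:10])
```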

In [45]:
train_df.head()

Unnamed: 0,Dates,Address,Lat,Long,Neighborhood,Precinct,Description,Month,DayOfWeek,Category,...,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8827,2019-12-05 19:25:00,0025XX BROADWAY AVE W,1.142002,-1.574275,WILLARD - HAY,NORTH,AUTOMOBILE THEFT,12,Thursday,AUTO THEFT,...,1,-1.934364,0.201865,-1.920698,-0.305663,-1.776141,-0.792361,1.944868,-0.943216,15
15702,2019-07-17 16:50:00,0014XX LAKE ST W,-0.629635,-1.09192,EAST ISLES,SOUTHWEST,OTHER THEFT,7,Wednesday,LARCENY,...,3,-0.630813,-1.091239,-0.326885,-1.217323,-0.00068,-1.260448,1.260448,-2.093855,7
15643,2016-10-07 15:59:59,0045XX 18 AV S,-1.47986,0.755813,NORTHROP,SOUTHEAST,Burglary Of Dwelling,10,Friday,BURGLARY,...,4,1.394484,-0.90369,1.58086,-0.511978,1.659503,-0.085376,1.661698,2.669396,36
8977,2014-06-19 23:29:59,0025XX Cedar AV S,-0.340616,0.801366,EAST PHILLIPS,SOUTHEAST,Asslt W/dngrs Weapon,6,Thursday,ASSAULT,...,3,0.864311,0.105701,0.807503,0.325799,0.695665,0.523695,0.870751,1.972704,16
5449,2016-04-30 00:30:00,00001X Barton AV SE,-0.060233,1.991128,PROSPECT PARK - EAST RIVER ROAD,NORTHEAST,Motor Vehicle Theft,4,Saturday,AUTO THEFT,...,2,1.754484,0.943401,1.450531,1.365349,1.047727,1.694251,1.992039,1.601038,0


## Drop Features

- We do not use the `Address` attribute any further, so drop it
- We drop `Dates`, `Lat`, and `Long` because the engineered temporal and spatial features replace them
- We don't need the `Description` feature, so drop it from both sets

In [46]:
# Drop Address feature from both train and test set
train_df.drop(['Address'], axis=1, inplace=True)
test_df.drop(['Address'], axis=1, inplace=True)

In [47]:
# We don't need Dates column anymore
train_df.drop(['Dates'], axis=1, inplace=True)
test_df.drop(['Dates'], axis=1, inplace=True)

In [48]:
# Drop columns that are no longer needed
train_df.drop(['Lat', 'Long'], axis=1, inplace=True)
test_df.drop(['Lat', 'Long'], axis=1, inplace=True)

In [49]:
# Drop Description column from the training set (the test set is handled in the next cell)
train_df.drop(['Description'], axis=1, inplace=True)

In [50]:
test_df.drop(['Description'], axis=1, inplace=True)

# test_df.drop(['UCRCode'], axis=1, inplace=True)
# train_df.drop(['UCRCode'], axis=1, inplace=True)

In [51]:
# Let's quickly view the data
train_df.head()

Unnamed: 0,Neighborhood,Precinct,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8827,WILLARD - HAY,NORTH,12,Thursday,AUTO THEFT,25,19,5,2019,3,1,-1.934364,0.201865,-1.920698,-0.305663,-1.776141,-0.792361,1.944868,-0.943216,15
15702,EAST ISLES,SOUTHWEST,7,Wednesday,LARCENY,50,16,17,2019,2,3,-0.630813,-1.091239,-0.326885,-1.217323,-0.00068,-1.260448,1.260448,-2.093855,7
15643,NORTHROP,SOUTHEAST,10,Friday,BURGLARY,59,15,7,2016,2,4,1.394484,-0.90369,1.58086,-0.511978,1.659503,-0.085376,1.661698,2.669396,36
8977,EAST PHILLIPS,SOUTHEAST,6,Thursday,ASSAULT,29,23,19,2014,4,3,0.864311,0.105701,0.807503,0.325799,0.695665,0.523695,0.870751,1.972704,16
5449,PROSPECT PARK - EAST RIVER ROAD,NORTHEAST,4,Saturday,AUTO THEFT,30,0,30,2016,4,2,1.754484,0.943401,1.450531,1.365349,1.047727,1.694251,1.992039,1.601038,0


# Feature Encoding 

- Convert categorical data to numeric data

### Precincts

- convert Precinct categorical feature to numeric

In [52]:
precincts = {'DOWNTOWN':1, 'NORTHEAST':2, 'SOUTHEAST':3, 'NORTH':4, 'SOUTHWEST':5}

train_df['Precinct'].replace(precincts, inplace=True)
test_df['Precinct'].replace(precincts, inplace=True)

In [53]:
train_df.head()

Unnamed: 0,Neighborhood,Precinct,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8827,WILLARD - HAY,4,12,Thursday,AUTO THEFT,25,19,5,2019,3,1,-1.934364,0.201865,-1.920698,-0.305663,-1.776141,-0.792361,1.944868,-0.943216,15
15702,EAST ISLES,5,7,Wednesday,LARCENY,50,16,17,2019,2,3,-0.630813,-1.091239,-0.326885,-1.217323,-0.00068,-1.260448,1.260448,-2.093855,7
15643,NORTHROP,3,10,Friday,BURGLARY,59,15,7,2016,2,4,1.394484,-0.90369,1.58086,-0.511978,1.659503,-0.085376,1.661698,2.669396,36
8977,EAST PHILLIPS,3,6,Thursday,ASSAULT,29,23,19,2014,4,3,0.864311,0.105701,0.807503,0.325799,0.695665,0.523695,0.870751,1.972704,16
5449,PROSPECT PARK - EAST RIVER ROAD,2,4,Saturday,AUTO THEFT,30,0,30,2016,4,2,1.754484,0.943401,1.450531,1.365349,1.047727,1.694251,1.992039,1.601038,0


### Neighborhoods

- convert Neighborhoods categorical feature to numeric

In [54]:
data = [train_df, test_df]

for dataset in data:
    neighborhood = LabelEncoder()
    neighborhood.fit(dataset['Neighborhood'].unique())
    print(list(neighborhood.classes_))

    dataset['Neighborhood']=neighborhood.transform(dataset['Neighborhood']) 

['ARMATAGE', 'AUDUBON PARK', 'BANCROFT', 'BELTRAMI', 'BOTTINEAU', 'BRYANT', 'BRYN - MAWR', 'CAMDEN INDUSTRIAL', 'CARAG', 'CEDAR - ISLES - DEAN', 'CEDAR RIVERSIDE', 'CENTRAL', 'CLEVELAND', 'COLUMBIA PARK', 'COMO', 'COOPER', 'CORCORAN', 'DIAMOND LAKE', 'DOWNTOWN EAST', 'DOWNTOWN WEST', 'EAST BDE MAKA SKA', 'EAST HARRIET', 'EAST ISLES', 'EAST PHILLIPS', 'ECCO', 'ELLIOT PARK', 'ERICSSON', 'FIELD', 'FOLWELL', 'FULTON', 'HALE', 'HARRISON', 'HAWTHORNE', 'HIAWATHA', 'HOLLAND', 'HOWE', 'HUMBOLDT INDUSTRIAL AREA', 'JORDAN', 'KEEWAYDIN', 'KENNY', 'KENWOOD', 'KING FIELD', 'LIND - BOHANON', 'LINDEN HILLS', 'LOGAN PARK', 'LONGFELLOW', 'LORING PARK', 'LOWRY HILL', 'LOWRY HILL EAST', 'LYNDALE', 'LYNNHURST', 'MARCY HOLMES', 'MARSHALL TERRACE', 'MCKINLEY', 'MID - CITY INDUSTRIAL', 'MIDTOWN PHILLIPS', 'MINNEHAHA', 'MORRIS PARK', 'NEAR - NORTH', 'NICOLLET ISLAND - EAST BANK', 'NORTH LOOP', 'NORTHEAST PARK', 'NORTHROP', 'PAGE', 'PHILLIPS WEST', 'POWDERHORN PARK', 'PROSPECT PARK - EAST RIVER ROAD', 'REGINA'
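
Because the loop fits a fresh `LabelEncoder` on each dataset, the same neighborhood could receive different integers in `train_df` and `test_df` if one of the sets happens to be missing a label. A minimal sketch of a shared mapping follows, written in plain Python on a handful of names taken from this notebook; the lists and variable names are illustrative, and the sorted ordering mirrors how `LabelEncoder` orders its `classes_`.

```python
train_names = ["WILLARD - HAY", "EAST ISLES", "NORTHROP", "EAST PHILLIPS"]
test_names = ["EAST ISLES", "BRYANT", "NORTHROP"]

# Build one mapping from the sorted union of labels seen in either set
all_names = sorted(set(train_names) | set(test_names))
name_to_id = {name: i for i, name in enumerate(all_names)}

# Both sets are encoded with the same dictionary, so ids stay consistent
train_ids = [name_to_id[n] for n in train_names]
test_ids = [name_to_id[n] for n in test_names]

print(name_to_id)
print(train_ids, test_ids)
```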

In [55]:
train_df['Neighborhood'].unique()

array([86, 22, 62, 23, 66,  5, 57, 19, 48, 18, 58, 68, 51, 87, 46, 59, 29,
       75, 37,  8, 61, 10, 28, 65,  4, 81, 85, 11, 32,  0, 45, 82, 73, 78,
       47, 80, 35, 49, 25, 41, 55, 31, 60, 26, 76, 42, 70, 74, 53, 12,  1,
        2, 21, 88, 15, 30, 64, 63, 52, 77, 56, 54, 33, 43, 67, 24, 34, 83,
       27, 13,  6, 38, 79, 72,  3, 14, 69,  7, 39, 71, 44, 50, 16,  9, 17,
       40, 84, 36, 20])

In [56]:
# So we know the mapping (important)
dict(zip(neighborhood.classes_, neighborhood.transform(neighborhood.classes_)))

{'ARMATAGE': 0,
 'AUDUBON PARK': 1,
 'BANCROFT': 2,
 'BELTRAMI': 3,
 'BOTTINEAU': 4,
 'BRYANT': 5,
 'BRYN - MAWR': 6,
 'CAMDEN INDUSTRIAL': 7,
 'CARAG': 8,
 'CEDAR - ISLES - DEAN': 9,
 'CEDAR RIVERSIDE': 10,
 'CENTRAL': 11,
 'CLEVELAND': 12,
 'COLUMBIA PARK': 13,
 'COMO': 14,
 'COOPER': 15,
 'CORCORAN': 16,
 'DIAMOND LAKE': 17,
 'DOWNTOWN EAST': 18,
 'DOWNTOWN WEST': 19,
 'EAST BDE MAKA SKA': 20,
 'EAST HARRIET': 21,
 'EAST ISLES': 22,
 'EAST PHILLIPS': 23,
 'ECCO': 24,
 'ELLIOT PARK': 25,
 'ERICSSON': 26,
 'FIELD': 27,
 'FOLWELL': 28,
 'FULTON': 29,
 'HALE': 30,
 'HARRISON': 31,
 'HAWTHORNE': 32,
 'HIAWATHA': 33,
 'HOLLAND': 34,
 'HOWE': 35,
 'HUMBOLDT INDUSTRIAL AREA': 36,
 'JORDAN': 37,
 'KEEWAYDIN': 38,
 'KENNY': 39,
 'KENWOOD': 40,
 'KING FIELD': 41,
 'LIND - BOHANON': 42,
 'LINDEN HILLS': 43,
 'LOGAN PARK': 44,
 'LONGFELLOW': 45,
 'LORING PARK': 46,
 'LOWRY HILL': 47,
 'LOWRY HILL EAST': 48,
 'LYNDALE': 49,
 'LYNNHURST': 50,
 'MARCY HOLMES': 51,
 'MARSHALL TERRACE': 52,
 'MCKINLE

## Offense
Convert to numeric

In [57]:
# data = [train_df, test_df]

# for dataset in data:
#     offense = LabelEncoder()
#     offense.fit(dataset['Offense'].unique())
#     print(list(offense.classes_))

#     dataset['Offense']=offense.transform(dataset['Offense']) 

In [58]:
# train_df['Offense'].unique()

In [59]:
# # So we know the mapping (important)
# dict(zip(offense.classes_, offense.transform(offense.classes_)))

In [60]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179109 entries, 8827 to 15367
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Neighborhood  179109 non-null  int64  
 1   Precinct      179109 non-null  int64  
 2   Month         179109 non-null  int64  
 3   DayOfWeek     179109 non-null  object 
 4   Category      179109 non-null  object 
 5   Minute        179109 non-null  int64  
 6   Hour          179109 non-null  int64  
 7   Day           179109 non-null  int64  
 8   Year          179109 non-null  int64  
 9   Hour_Zone     179109 non-null  int64  
 10  Season        179109 non-null  int64  
 11  Rot30_X       179109 non-null  float64
 12  Rot30_Y       179109 non-null  float64
 13  Rot45_X       179109 non-null  float64
 14  Rot45_Y       179109 non-null  float64
 15  Rot60_X       179109 non-null  float64
 16  Rot60_Y       179109 non-null  float64
 17  Radius        179109 non-null  float64
 18  An

### Year

- Year is an **ordinal** variable, so we keep its ordering in the integer mapping (`LabelEncoder` sorts its classes, so the earliest year gets the lowest code)
- Convert the `Year` feature to consecutive integer codes

In [61]:
data = [train_df, test_df]

for dataset in data:
    year_le = LabelEncoder()
    year_le.fit(dataset['Year'].unique())
    print(list(year_le.classes_))

    dataset['Year']=year_le.transform(dataset['Year']) 

[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]


In [62]:
train_df['Year'].unique()

array([ 9,  6,  4,  0,  5,  3, 10,  8,  7,  2,  1, 11])

In [63]:
# So we know the mapping (important)
dict(zip(year_le.classes_, year_le.transform(year_le.classes_)))

{2010: 0,
 2011: 1,
 2012: 2,
 2013: 3,
 2014: 4,
 2015: 5,
 2016: 6,
 2017: 7,
 2018: 8,
 2019: 9,
 2020: 10,
 2021: 11}
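
The loop above fits a fresh encoder on each DataFrame, which gives matching codes only because both frames contain the same years. A minimal sketch of fitting one encoder and reusing it, assuming small stand-in frames (`toy_train` and `toy_test` are hypothetical, not the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for train_df / test_df; the test frame lacks some years.
toy_train = pd.DataFrame({"Year": [2019, 2016, 2014, 2010, 2021]})
toy_test = pd.DataFrame({"Year": [2021, 2014]})

# Fit once on the training years; classes_ is sorted, so the ordinal order is kept.
year_encoder = LabelEncoder().fit(toy_train["Year"])

# Reuse the same fitted encoder for both frames.
toy_train["Year_code"] = year_encoder.transform(toy_train["Year"])
toy_test["Year_code"] = year_encoder.transform(toy_test["Year"])

# The same year now maps to the same code in both frames.
year_mapping = dict(zip(year_encoder.classes_.tolist(),
                        range(len(year_encoder.classes_))))
print(year_mapping)
```

With a shared encoder, a year that is missing from the test frame cannot shift the codes of the remaining years.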

In [64]:
train_df.head()

Unnamed: 0,Neighborhood,Precinct,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8827,86,4,12,Thursday,AUTO THEFT,25,19,5,9,3,1,-1.934364,0.201865,-1.920698,-0.305663,-1.776141,-0.792361,1.944868,-0.943216,15
15702,22,5,7,Wednesday,LARCENY,50,16,17,9,2,3,-0.630813,-1.091239,-0.326885,-1.217323,-0.00068,-1.260448,1.260448,-2.093855,7
15643,62,3,10,Friday,BURGLARY,59,15,7,6,2,4,1.394484,-0.90369,1.58086,-0.511978,1.659503,-0.085376,1.661698,2.669396,36
8977,23,3,6,Thursday,ASSAULT,29,23,19,4,4,3,0.864311,0.105701,0.807503,0.325799,0.695665,0.523695,0.870751,1.972704,16
5449,66,2,4,Saturday,AUTO THEFT,30,0,30,6,4,2,1.754484,0.943401,1.450531,1.365349,1.047727,1.694251,1.992039,1.601038,0


In [65]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179109 entries, 8827 to 15367
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Neighborhood  179109 non-null  int64  
 1   Precinct      179109 non-null  int64  
 2   Month         179109 non-null  int64  
 3   DayOfWeek     179109 non-null  object 
 4   Category      179109 non-null  object 
 5   Minute        179109 non-null  int64  
 6   Hour          179109 non-null  int64  
 7   Day           179109 non-null  int64  
 8   Year          179109 non-null  int64  
 9   Hour_Zone     179109 non-null  int64  
 10  Season        179109 non-null  int64  
 11  Rot30_X       179109 non-null  float64
 12  Rot30_Y       179109 non-null  float64
 13  Rot45_X       179109 non-null  float64
 14  Rot45_Y       179109 non-null  float64
 15  Rot60_X       179109 non-null  float64
 16  Rot60_Y       179109 non-null  float64
 17  Radius        179109 non-null  float64
 18  An

### DayOfWeek

- We use sklearn's `LabelEncoder` to encode the categorical data as integers
- Day of week is a categorical, **nominal** variable, so the integer codes carry no meaningful order (they follow the alphabetical order of the day names)

In [66]:
data = [train_df, test_df]

for dataset in data:
    dow_le = LabelEncoder()
    dow_le.fit(dataset['DayOfWeek'].unique())
    print(list(dow_le.classes_))
    dataset['DayOfWeek']=dow_le.transform(dataset['DayOfWeek'])

['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']
['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']


In [67]:
train_df['DayOfWeek'].unique()

array([4, 6, 0, 2, 1, 3, 5])

In [68]:
# So we know the mapping (important)
dict(zip(dow_le.classes_, dow_le.transform(dow_le.classes_)))

{'Friday': 0,
 'Monday': 1,
 'Saturday': 2,
 'Sunday': 3,
 'Thursday': 4,
 'Tuesday': 5,
 'Wednesday': 6}
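
The mapping matters because it is the only way to get from the integer codes back to readable labels. A small sketch in plain Python, using the mapping printed above and the day names of the first five training rows shown earlier:

```python
# Mapping printed above (label -> integer code).
dow_mapping = {'Friday': 0, 'Monday': 1, 'Saturday': 2, 'Sunday': 3,
               'Thursday': 4, 'Tuesday': 5, 'Wednesday': 6}

# Invert it to go from code back to label.
dow_inverse = {code: label for label, code in dow_mapping.items()}

# Codes for the first five training rows
# (Thursday, Wednesday, Friday, Thursday, Saturday before encoding).
head_codes = [4, 6, 0, 4, 2]
head_labels = [dow_inverse[code] for code in head_codes]
print(head_labels)
```

On a fitted encoder, `dow_le.inverse_transform(...)` does the same job.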

In [69]:
train_df.head()

Unnamed: 0,Neighborhood,Precinct,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8827,86,4,12,4,AUTO THEFT,25,19,5,9,3,1,-1.934364,0.201865,-1.920698,-0.305663,-1.776141,-0.792361,1.944868,-0.943216,15
15702,22,5,7,6,LARCENY,50,16,17,9,2,3,-0.630813,-1.091239,-0.326885,-1.217323,-0.00068,-1.260448,1.260448,-2.093855,7
15643,62,3,10,0,BURGLARY,59,15,7,6,2,4,1.394484,-0.90369,1.58086,-0.511978,1.659503,-0.085376,1.661698,2.669396,36
8977,23,3,6,4,ASSAULT,29,23,19,4,4,3,0.864311,0.105701,0.807503,0.325799,0.695665,0.523695,0.870751,1.972704,16
5449,66,2,4,2,AUTO THEFT,30,0,30,6,4,2,1.754484,0.943401,1.450531,1.365349,1.047727,1.694251,1.992039,1.601038,0


In [70]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179109 entries, 8827 to 15367
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Neighborhood  179109 non-null  int64  
 1   Precinct      179109 non-null  int64  
 2   Month         179109 non-null  int64  
 3   DayOfWeek     179109 non-null  int64  
 4   Category      179109 non-null  object 
 5   Minute        179109 non-null  int64  
 6   Hour          179109 non-null  int64  
 7   Day           179109 non-null  int64  
 8   Year          179109 non-null  int64  
 9   Hour_Zone     179109 non-null  int64  
 10  Season        179109 non-null  int64  
 11  Rot30_X       179109 non-null  float64
 12  Rot30_Y       179109 non-null  float64
 13  Rot45_X       179109 non-null  float64
 14  Rot45_Y       179109 non-null  float64
 15  Rot60_X       179109 non-null  float64
 16  Rot60_Y       179109 non-null  float64
 17  Radius        179109 non-null  float64
 18  An

### Category

- We use sklearn's `LabelEncoder` to encode the target labels (`Category`) as integers

In [73]:
data = [train_df]

for dataset in data:
    cat_le = LabelEncoder()
    cat_le.fit(dataset['Category'].unique())
    print(list(cat_le.classes_))
    dataset['Category']=cat_le.transform(dataset['Category'])

['ARSON', 'ASSAULT', 'AUTO THEFT', 'BURGLARY', 'LARCENY', 'MURDER', 'RAPE', 'ROBBERY']


In [74]:
len(train_df['Category'].unique())

8

In [75]:
# So we know the mapping (important)
dict(zip(cat_le.classes_, cat_le.transform(cat_le.classes_)))

{'ARSON': 0,
 'ASSAULT': 1,
 'AUTO THEFT': 2,
 'BURGLARY': 3,
 'LARCENY': 4,
 'MURDER': 5,
 'RAPE': 6,
 'ROBBERY': 7}

In [76]:
train_df.head()

Unnamed: 0,Neighborhood,Precinct,Month,DayOfWeek,Category,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8827,86,4,12,4,2,25,19,5,9,3,1,-1.934364,0.201865,-1.920698,-0.305663,-1.776141,-0.792361,1.944868,-0.943216,15
15702,22,5,7,6,4,50,16,17,9,2,3,-0.630813,-1.091239,-0.326885,-1.217323,-0.00068,-1.260448,1.260448,-2.093855,7
15643,62,3,10,0,3,59,15,7,6,2,4,1.394484,-0.90369,1.58086,-0.511978,1.659503,-0.085376,1.661698,2.669396,36
8977,23,3,6,4,1,29,23,19,4,4,3,0.864311,0.105701,0.807503,0.325799,0.695665,0.523695,0.870751,1.972704,16
5449,66,2,4,2,2,30,0,30,6,4,2,1.754484,0.943401,1.450531,1.365349,1.047727,1.694251,1.992039,1.601038,0


In [77]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179109 entries, 8827 to 15367
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Neighborhood  179109 non-null  int64  
 1   Precinct      179109 non-null  int64  
 2   Month         179109 non-null  int64  
 3   DayOfWeek     179109 non-null  int64  
 4   Category      179109 non-null  int64  
 5   Minute        179109 non-null  int64  
 6   Hour          179109 non-null  int64  
 7   Day           179109 non-null  int64  
 8   Year          179109 non-null  int64  
 9   Hour_Zone     179109 non-null  int64  
 10  Season        179109 non-null  int64  
 11  Rot30_X       179109 non-null  float64
 12  Rot30_Y       179109 non-null  float64
 13  Rot45_X       179109 non-null  float64
 14  Rot45_Y       179109 non-null  float64
 15  Rot60_X       179109 non-null  float64
 16  Rot60_Y       179109 non-null  float64
 17  Radius        179109 non-null  float64
 18  An

## View Information about Data

- One last check before training

In [78]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179109 entries, 8827 to 15367
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Neighborhood  179109 non-null  int64  
 1   Precinct      179109 non-null  int64  
 2   Month         179109 non-null  int64  
 3   DayOfWeek     179109 non-null  int64  
 4   Category      179109 non-null  int64  
 5   Minute        179109 non-null  int64  
 6   Hour          179109 non-null  int64  
 7   Day           179109 non-null  int64  
 8   Year          179109 non-null  int64  
 9   Hour_Zone     179109 non-null  int64  
 10  Season        179109 non-null  int64  
 11  Rot30_X       179109 non-null  float64
 12  Rot30_Y       179109 non-null  float64
 13  Rot45_X       179109 non-null  float64
 14  Rot45_Y       179109 non-null  float64
 15  Rot60_X       179109 non-null  float64
 16  Rot60_Y       179109 non-null  float64
 17  Radius        179109 non-null  float64
 18  An

In [79]:
# Convert to 16-bit integers to use less memory and train faster (no loss of data, since these values fit well within the int16 range)
columns_to_convert = ['DayOfWeek', 'Precinct', 'Minute', 'Hour', 'Day', 'Month', 'Year', 
                      'Hour_Zone', 'Season', 'Cluster']
                      #'Neighborhood']
train_df[columns_to_convert] = train_df[columns_to_convert].astype('int16')
test_df[columns_to_convert] = test_df[columns_to_convert].astype('int16')

train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179109 entries, 8827 to 15367
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Neighborhood  179109 non-null  int64  
 1   Precinct      179109 non-null  int16  
 2   Month         179109 non-null  int16  
 3   DayOfWeek     179109 non-null  int16  
 4   Category      179109 non-null  int64  
 5   Minute        179109 non-null  int16  
 6   Hour          179109 non-null  int16  
 7   Day           179109 non-null  int16  
 8   Year          179109 non-null  int16  
 9   Hour_Zone     179109 non-null  int16  
 10  Season        179109 non-null  int16  
 11  Rot30_X       179109 non-null  float64
 12  Rot30_Y       179109 non-null  float64
 13  Rot45_X       179109 non-null  float64
 14  Rot45_Y       179109 non-null  float64
 15  Rot60_X       179109 non-null  float64
 16  Rot60_Y       179109 non-null  float64
 17  Radius        179109 non-null  float64
 18  An
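
The downcast above is only lossless if every value fits in the int16 range. A minimal sketch of checking that before casting; `safe_downcast` and the `toy` frame are hypothetical helpers for illustration, not part of the notebook:

```python
import numpy as np
import pandas as pd

def safe_downcast(frame, columns, dtype="int16"):
    """Cast the given columns to dtype only if every value fits in its range."""
    limits = np.iinfo(dtype)
    out = frame.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        if lo < limits.min or hi > limits.max:
            raise ValueError(f"{col} has values outside the {dtype} range")
        out[col] = out[col].astype(dtype)
    return out

# Toy values taken from the first rows of train_df.head().
toy = pd.DataFrame({"Minute": [25, 50, 59], "Cluster": [15, 7, 36]})
small = safe_downcast(toy, ["Minute", "Cluster"])
print(small.dtypes)
```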

# Building Machine Learning Models

- Baseline Models
    - Train several models on a stratified sample of the training data
    - Evaluate each on a hold-out set to get baseline results and decide which model to use
    - Models:
        - Stochastic Gradient Descent (with elastic net regularization)
        - Gaussian Naive Bayes
        - K Nearest Neighbors
        - Logistic Regression (with L1 regularization)
        - Random Forest
        - XGBoost
    - Almost all of these algorithms perform poorly with scikit-learn's default hyperparameters
- A couple of things to note:
    - Decision tree models, including ensemble methods (Random Forest & XGBoost), can work with integer-coded categorical variables without one-hot encoding them.
    - Linear models (SGD & Logistic Regression) cannot use integer-coded categorical features directly; these features need to be one-hot encoded (OHE) before training
    - Always one-hot encode before splitting the data into training/dev/test sets so that every split has the same columns and all features & classes are represented
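
When the data cannot be encoded before the split (for example, a separate test file), one option is to align the test columns to the training columns afterwards. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

toy_train = pd.DataFrame({"Precinct": [1, 2, 3], "Hour": [19, 16, 15]})
toy_test = pd.DataFrame({"Precinct": [2, 2], "Hour": [23, 0]})  # precincts 1 and 3 absent

train_encoded = pd.get_dummies(toy_train, columns=["Precinct"])
test_encoded = pd.get_dummies(toy_test, columns=["Precinct"])

# Align the test columns to the training columns; missing indicators become 0.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_encoded.columns))
```

Without the `reindex` step, the test frame would have only one `Precinct_*` column and a model trained on three would not accept it.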

In [80]:
# Set training data (drop labels) and training labels
X_train = train_df.drop("Category", axis=1).copy()
Y_train = train_df["Category"].copy()

# Set testing data (drop Id)
# X_test = test_df.drop("Id", axis=1).copy()
X_test = test_df#.drop("Id", axis=1).copy()

In [81]:
def one_hot_encode(train_data):
    '''One Hot Encode the categorical features'''
    # 'Neighborhood' and 'StreetType' are currently not encoded
    categorical_columns = ['Precinct', 'DayOfWeek', 'Season', 'Hour_Zone', 'Cluster']

    # get_dummies appends one indicator column per level (prefixed with the
    # column name) and drops the original columns
    encoded_train_data = pd.get_dummies(train_data, columns=categorical_columns)

    return encoded_train_data

In [82]:
X_encoded_train = one_hot_encode(X_train)

In [83]:
# Use these for ML algorithms that can't handle categorical data (Logistic Regression, Linear Models)
mini_encoded_train_data, mini_encoded_dev_data, mini_train_labels, mini_dev_labels = train_test_split(X_encoded_train, 
                                                                                      Y_train,
                                                                                      stratify=Y_train,
                                                                                      test_size=0.5,
                                                                                      random_state=1)

In [84]:
# Use these for ML algorithms that can handle categorical data without OHE
# (same stratify/test_size/random_state as above, so rows and labels match the encoded split)
mini_train_data, mini_dev_data, mini_train_labels, mini_dev_labels = train_test_split(X_train, 
                                                                                      Y_train,
                                                                                      stratify=Y_train,
                                                                                      test_size=0.5,
                                                                                      random_state=1)
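
Both splits above pass `stratify=Y_train`, which keeps the class proportions the same on each side of the split. A small sketch on hypothetical imbalanced labels (not the crime data) to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 16% class 1, 4% class 2.
labels = np.array([0] * 80 + [1] * 16 + [2] * 4)
features = np.arange(len(labels)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    features, labels, stratify=labels, test_size=0.5, random_state=1
)

# Share of each class in the two halves.
share_a = np.bincount(y_a) / len(y_a)
share_b = np.bincount(y_b) / len(y_b)
print(share_a, share_b)
```

This matters here because rare categories such as MURDER or ARSON could otherwise end up almost entirely in one half.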



In [85]:
# K Neighbors
knn = KNeighborsClassifier()
knn.fit(mini_train_data, mini_train_labels)
pred_probs = knn.predict_proba(mini_dev_data)
knn_loss = log_loss(mini_dev_labels, pred_probs)


print('KNN Validation Log Loss: ', knn_loss)

KNN Validation Log Loss:  7.836518944887235


In [86]:
# Naive Bayes
gaussian = GaussianNB()
gaussian.fit(mini_train_data, mini_train_labels)
pred_probs = gaussian.predict_proba(mini_dev_data)
nb_loss = log_loss(mini_dev_labels, pred_probs)


print('Gaussian Naive Bayes Validation Log Loss: ', nb_loss)

Gaussian Naive Bayes Validation Log Loss:  1.4475058076534344


In [87]:
# stochastic gradient descent (SGD) learning
sgd = linear_model.SGDClassifier(penalty='elasticnet', loss='log', 
                                  tol=0.0001, max_iter=100, n_jobs=3, random_state=1)
sgd.fit(mini_encoded_train_data, mini_train_labels)
pred_probs = sgd.predict_proba(mini_encoded_dev_data)
# sgd.fit(one_hot_encode(mini_train_data), mini_train_labels)
# sgd = gaussian.predict_proba(one_hot_encode(mini_dev_data))
sgd_loss = log_loss(mini_dev_labels, pred_probs)

print('Linear Model SGD Validation Log Loss: ', sgd_loss)

Linear Model SGD Validation Log Loss:  1.2763528768765593


In [88]:
# Logistic Regression
logreg = LogisticRegression(penalty='l1', C=1.5, solver='saga', multi_class='multinomial', 
                            tol=0.0001, max_iter=100, verbose=3, n_jobs=3, random_state=1)

logreg.fit(mini_encoded_train_data, mini_train_labels)
pred_probs = logreg.predict_proba(mini_encoded_dev_data)

logreg_loss = log_loss(mini_dev_labels, pred_probs)


print('Logistic Regression Validation Log Loss: ', logreg_loss)

[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.18153147
Epoch 3, change: 0.10661015
Epoch 4, change: 0.06616489
Epoch 5, change: 0.05005582
Epoch 6, change: 0.04263325
Epoch 7, change: 0.03617479
Epoch 8, change: 0.03065591
Epoch 9, change: 0.02689389
Epoch 10, change: 0.02386761
Epoch 11, change: 0.02128038
Epoch 12, change: 0.01908566
Epoch 13, change: 0.01688159
Epoch 14, change: 0.01514840
Epoch 15, change: 0.01386302
Epoch 16, change: 0.01319354
Epoch 17, change: 0.01291747
Epoch 18, change: 0.01264959
Epoch 19, change: 0.01238939
Epoch 20, change: 0.01215084
Epoch 21, change: 0.01190439
Epoch 22, change: 0.01166654
Epoch 23, change: 0.01143336
Epoch 24, change: 0.01121843
Epoch 25, change: 0.01100347
Epoch 26, change: 0.01079608
Epoch 27, change: 0.01058596
Epoch 28, change: 0.01038980
Epoch 29, change: 0.01019662
Epoch 30, change: 0.01001032
Epoch 31, change: 0.00982868
Epoch 32, change: 0.00964617
Epoch 33, change: 0.00947226
Epoch 34, change: 0.00928676
Epoch 35, change: 0.009

In [None]:
# Random Forest Ensemble
random_forest = RandomForestClassifier(n_estimators=100, max_depth=15, max_features='sqrt',
                                       min_samples_leaf=5, min_samples_split=25, 
                                       random_state=1, verbose=1, n_jobs=2)


random_forest.fit(mini_train_data, mini_train_labels)
pred_probs = random_forest.predict_proba(mini_dev_data)

rf_loss = log_loss(mini_dev_labels, pred_probs)

print('Random Forest Validation Log Loss: ', rf_loss)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    2.6s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    5.9s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.2s


Random Forest Validation Log Loss:  1.1577970576641894


[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.5s finished


In [None]:
# XGBoost Ensemble 
# xgb = XGBClassifier(n_estimators=100, verbose=3, n_jobs=2, random_state=1)
xgb = XGBClassifier(n_estimators=100, objective="multi:softprob", 
                    verbose=3, random_state=1)

xgb.fit(mini_encoded_train_data, mini_train_labels)
pred_probs = xgb.predict_proba(mini_encoded_dev_data)

xgb_loss = log_loss(mini_dev_labels, pred_probs)

print('XGBoost Validation Log Loss: ', xgb_loss)

Parameters: { verbose } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


XGBoost Validation Log Loss:  1.141295922667942


In [None]:
# Display the rank of the models
models = pd.DataFrame({
    'Model': ['SGD (Elastic net)', 'Logistic Regression (l1)', 'Random Forest', 
              'Gaussian Naive Bayes', 'XGBoost', 'K Neighbors'],
    'Log_Loss': [sgd_loss, logreg_loss, rf_loss, nb_loss, xgb_loss, knn_loss]})
print(models.sort_values(by='Log_Loss', ascending=True).reset_index(drop=True))

                      Model  Log_Loss
0                   XGBoost  1.141296
1             Random Forest  1.157797
2  Logistic Regression (l1)  1.242917
3         SGD (Elastic net)  1.276353
4      Gaussian Naive Bayes  1.447506
5               K Neighbors  7.836519


# Model Selection

- Although Logistic Regression with L1 regularization seems promising, our dataset mixes categorical and numerical features with very different statistics (mean, variance), and the relationship between the features and the crime category is unlikely to be linear. In addition, any linear model requires **one hot encoding**, which greatly increases the feature space (some categorical features such as `BlockNumber` have many levels/values). 
    - Logistic Regression is a generalized linear model: its decision boundaries are linear in the features, so it works best when the classes are close to linearly separable.
    - In practice, more feature engineering (transforming features so that their effect is closer to linear) could increase the performance of LR
- Ensembles of decision trees handle datasets with mixed feature types (numerical & categorical) well and are effective on non-linear problems. They also had the lowest validation log loss above (XGBoost 1.141, Random Forest 1.158). So, I will select between Random Forest & XGBoost as the final model. 
    - The caveat is that the default hyperparameters for RF & XGB are generally not optimal for the problem at hand, so hyperparameter tuning is necessary, which can take a while since there are many hyperparameters to tune (at least in XGB).

# Hyperparameter Tuning

- Hyperparameter tuning involves defining an objective function (log loss) and using cross-validation to measure the quality of each hyperparameter setting. 
    - We want the hyperparameters that give the best generalization performance.
- Three approaches: Grid Search (GridSearchCV), Random Search (RandomizedSearchCV), and Bayesian Optimization 
- GridSearchCV took far too long to be practical, and RandomizedSearchCV was too random.
    - Grid and random search are completely uninformed by past evaluations and, as a result, often spend a significant amount of time evaluating “bad” hyperparameters.
- I then researched more efficient hyperparameter tuning techniques and found Bayesian Optimization (BayesSearchCV from scikit-optimize)
- Bayesian Optimization Overview
    - Build a probabilistic model of the objective function & use it to select promising hyperparameters to evaluate on the true objective function
        - The model used to approximate the objective function is called the surrogate model. 
            - E.g. Gaussian Processes 
    - Keeps track of past evaluation results, which are used to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function
    - Instead of optimizing an expensive objective function directly, we optimize a cheap proxy function.
        - This proxy is the acquisition function, which directs sampling to areas where an improvement over the current best observation is likely.
            - E.g. maximum probability of improvement (MPI), expected improvement (EI) and upper confidence bound (UCB)
- K-Fold Cross Validation
    - Use cross-validation to estimate the generalization performance of a model 
    - This is integrated with the hyperparameter tuning techniques (GridSearchCV, RandomizedSearchCV, BayesSearchCV)
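
To make the acquisition function idea concrete, here is a minimal sketch of expected improvement (EI) for a minimisation problem such as log loss. It assumes the surrogate model has already produced a predicted mean and standard deviation for each candidate; the candidate values are made up, and this is an illustration rather than what BayesSearchCV does internally.

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for minimisation, given the surrogate mean/std at one candidate."""
    if sigma <= 0.0:
        return max(best - mu - xi, 0.0)
    z = (best - mu - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu - xi) * cdf + sigma * pdf

best_loss = 1.15  # best log loss observed so far (hypothetical)

# Hypothetical surrogate predictions: (mean, standard deviation).
candidates = {
    "A": (1.14, 0.01),  # slightly better mean, low uncertainty
    "B": (1.20, 0.10),  # worse mean, high uncertainty
    "C": (1.30, 0.01),  # clearly worse, low uncertainty
}

scores = {name: expected_improvement(mu, sd, best_loss)
          for name, (mu, sd) in candidates.items()}
next_point = max(scores, key=scores.get)
print(scores, next_point)
```

Candidate B scores highest even though its predicted mean is worse than A's, because its large uncertainty leaves room for a big improvement. This is how the search balances exploring unknown regions against exploiting known good ones.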

--------
## Random Forest (Bagging)

- Basic Overview
    - An ensemble method that uses Bagging (Bootstrap Aggregation: each learner is trained on a sample drawn with replacement)
    - Bagging helps reduce the **variance** of any single learner (Decision Tree)
- Basic Steps:
    1. Several decision trees, generated in parallel, form the base learners of the bagging technique.
    2. Data sampled with replacement is fed to these learners for training.
    3. The final prediction is the averaged output of all the learners (for classification, the averaged class probabilities).
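
The three steps above can be sketched by hand with plain decision trees. This is a simplified illustration on hypothetical toy data; a real Random Forest additionally samples a random subset of features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Toy two-class data (hypothetical, not the crime dataset).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 1: several independent base learners.
trees = []
for _ in range(25):
    # Step 2: bootstrap sample, drawn with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: average the class probabilities of all trees.
avg_probs = np.mean([t.predict_proba(X) for t in trees], axis=0)
bagged_pred = avg_probs.argmax(axis=1)
accuracy = float((bagged_pred == y).mean())
print(accuracy)
```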
   

**Things I learned**:
- Since the random forest model is overfitting, we want to increase the **min** parameters of the random forest and decrease the **max** parameters
- Increasing n_estimators does not cause the random forest to **overfit**; averaging more trees reduces variance
    - A low n_estimators behaves more like a single decision tree (very prone to overfitting)
- Increasing max depth will increase **variance** (overfitting, sensitivity to the training set) and decrease **bias**
- Increasing min samples leaf will decrease **variance** and increase **bias**.
- Decreasing any of the **max*** parameters and increasing any of the **min*** parameters will increase **regularization**.

In [None]:
n_features = X_train.shape[1]

opt = BayesSearchCV(
    estimator=RandomForestClassifier(oob_score=True, random_state=1),
    search_spaces= 
    {
        'n_estimators': (100, 600),
        'max_depth': (1, 50),  
        'max_features': (1, n_features),
        'min_samples_leaf': (1, 50),  # integer valued parameter
        'min_samples_split': (2, 50),
    },
    n_iter=5,
    optimizer_kwargs= {'base_estimator': 'RF'},
    scoring='neg_log_loss',
    verbose=0,
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=1
    ),
    random_state=1
    
)


def status_print(optim_result):
    """Status callback during Bayesian hyperparameter search"""
    
    # Get all the models tested so far in DataFrame format
    all_models = pd.DataFrame(opt.cv_results_)    
    
    # Get current parameters and the best parameters    
    best_params = pd.Series(opt.best_params_)
    print('Model #{}\nBest LogLoss: {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(opt.best_score_, 6),
        opt.best_params_
    ))
    
    # Save all model results
    clf_name = opt.estimator.__class__.__name__
    all_models.to_csv(clf_name + "_cv_results.csv")


In [None]:
result = opt.fit(X_train.values, Y_train.values, callback=status_print)

Model #1
Best LogLoss: -1.167615
Best params: OrderedDict([('max_depth', 32), ('max_features', 2), ('min_samples_leaf', 38), ('min_samples_split', 21), ('n_estimators', 264)])

Model #2
Best LogLoss: -1.147887
Best params: OrderedDict([('max_depth', 31), ('max_features', 14), ('min_samples_leaf', 16), ('min_samples_split', 43), ('n_estimators', 550)])

Model #3
Best LogLoss: -1.147887
Best params: OrderedDict([('max_depth', 31), ('max_features', 14), ('min_samples_leaf', 16), ('min_samples_split', 43), ('n_estimators', 550)])

Model #4
Best LogLoss: -1.147887
Best params: OrderedDict([('max_depth', 31), ('max_features', 14), ('min_samples_leaf', 16), ('min_samples_split', 43), ('n_estimators', 550)])

Model #5
Best LogLoss: -1.147887
Best params: OrderedDict([('max_depth', 31), ('max_features', 14), ('min_samples_leaf', 16), ('min_samples_split', 43), ('n_estimators', 550)])



In [None]:
result.best_params_

OrderedDict([('max_depth', 31),
             ('max_features', 14),
             ('min_samples_leaf', 16),
             ('min_samples_split', 43),
             ('n_estimators', 550)])

## XGBoost (Boosting)

- Basic Overview:
    - Another ensemble method, which uses Boosting instead of Bagging (as in Random Forests)
    - In **Boosting**, the trees are built sequentially such that each subsequent tree aims to reduce the errors of the previous tree.
    - Each tree learns from its predecessors by fitting their residual errors. 
    - Each base learner is weak (high bias) and contributes some vital information for prediction, enabling the boosting technique to produce a strong learner by effectively combining these weak learners.
    - The final strong learner brings down both the **bias** and the **variance**.
    - In contrast to bagging techniques like Random Forest, in which trees are grown to their maximum extent, boosting makes use of trees with fewer splits
        -  Such small trees, which are not very deep, are **highly interpretable**. 
- Basic Steps:
    1. An initial model `F0` is fit to predict the target variable `y`; it is also used to calculate the residuals (`y - F0`)
    2. A new model `h1` is fit to the residuals from the previous step
    3. Now, `F0` and `h1` are combined to give `F1`, which is the boosted version of `F0`. 
        - The MSE (or whichever cost function is used, e.g. log loss or MAE) of `F1` on the training data will be lower than that of `F0`.
    4. Repeat the above steps, fitting each new model to the residuals of the previous combined model.
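
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, assuming squared error on a toy 1-D regression problem with single-split trees (stumps) as weak learners; XGBoost itself uses gradients of the chosen loss plus regularization, and `fit_stump` is a hypothetical helper.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regressor on a 1-D feature (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda q: np.where(q <= threshold, left_value, right_value)

# Toy data (hypothetical).
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())          # Step 1: F0 predicts the mean
mse_history = [float(np.mean((y - prediction) ** 2))]

for _ in range(20):
    residual = y - prediction                   # what the current model gets wrong
    stump = fit_stump(x, residual)              # Step 2: h is fit to the residuals
    prediction = prediction + learning_rate * stump(x)  # Step 3: F_m = F_(m-1) + lr * h
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print(mse_history[0], mse_history[-1])
```

The training MSE never increases from one round to the next, which is also why too many rounds can overfit: the model keeps chasing the residuals of the training set.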
    
### Prevent Overfitting:
- A large number of trees can cause overfitting (unlike in Random Forests), so `n_estimators` is tuned together with the learning rate


In [None]:
# log-uniform: understand as search over p = exp(x) by varying x
bayes_cv_tuner = BayesSearchCV(
    estimator = XGBClassifier(
        n_jobs = 3,
        objective = 'multi:softprob',
        eval_metric = 'mlogloss',
        silent=1,
        random_state=1
    ),
    search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_depth': (1, 100),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'min_child_weight': (0, 5),
        'n_estimators': (50, 300),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },    
    scoring = 'neg_log_loss',
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=1
    ),
    n_jobs = 6,
    n_iter = 5,   
    verbose = 0,
    refit = True,
    random_state = 1
)

def status_print(optim_result):
    """Status callback during Bayesian hyperparameter search"""
    
    # Get all the models tested so far in DataFrame format
    all_models = pd.DataFrame(bayes_cv_tuner.cv_results_)    
    
    # Get current parameters and the best parameters    
    best_params = pd.Series(bayes_cv_tuner.best_params_)
    print('Model #{}\nBest Log Loss: {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(bayes_cv_tuner.best_score_, 8),
        bayes_cv_tuner.best_params_
    ))
    
    # Save all model results
    clf_name = bayes_cv_tuner.estimator.__class__.__name__
    all_models.to_csv(clf_name + "_cv_results.csv")
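
The `'log-uniform'` prior used in the search space can be illustrated with a small sampler. This is a sketch of the idea only, under the assumption that values are drawn uniformly in log space; it is not how scikit-optimize is implemented internally, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw p = exp(x) with x uniform on [log(low), log(high)]."""
    x = rng.uniform(math.log(low), math.log(high))
    return math.exp(x)


rng = random.Random(1)
# Same bounds as the 'reg_lambda' range: 1e-9 to 1000 spans 12 orders of magnitude
samples = [sample_log_uniform(1e-9, 1000, rng) for _ in range(10000)]

# Every order of magnitude is equally likely, so about 9 of the 12 decades
# (roughly three quarters of the samples) fall below 1.0
below_one = sum(s < 1.0 for s in samples) / len(samples)
print(round(below_one, 2))
```

A plain uniform prior over the same range would almost never try values below 1, which is why log-uniform is the usual choice for parameters such as learning rates and regularization strengths.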

In [None]:
# Fit the model
result = bayes_cv_tuner.fit(X_train.values, Y_train.values, callback=status_print)



Parameters: { scale_pos_weight, silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Model #1
Best Log Loss: -1.38830727
Best params: OrderedDict([('colsample_bylevel', 0.70561854251991



Model #2
Best Log Loss: -1.19307838
Best params: OrderedDict([('colsample_bylevel', 0.6867218654755486), ('colsample_bytree', 0.3412604797490719), ('gamma', 1.8023871181327715e-09), ('learning_rate', 0.07405425417827127), ('max_delta_step', 8), ('max_depth', 75), ('min_child_weight', 0), ('n_estimators', 142), ('reg_alpha', 0.05212318854833209), ('reg_lambda', 947), ('scale_pos_weight', 432), ('su



Model #3
Best Log Loss: -1.19307838
Best params: OrderedDict([('colsample_bylevel', 0.6867218654755486), ('colsample_bytree', 0.3412604797490719), ('gamma', 1.8023871181327715e-09), ('learning_rate', 0.07405425417827127), ('max_delta_step', 8), ('max_depth', 75), ('min_child_weight', 0), ('n_estimators', 142), ('reg_alpha', 0.05212318854833209), ('reg_lambda', 947), ('scale_pos_weight', 432), ('subsample', 0.3772619608161306)])

Model #4
Best Log Loss: -1.16203301
Best params: OrderedDict([('colsample_bylevel', 0.7064056414363574), ('colsample_bytree', 0.4174572655093935), ('gamma', 1.1268483194913195e-08), ('learning_rate', 0.06366995699596882), ('max_delta_step', 8), ('max_depth', 87), ('min_child_weight', 5), ('n_estimators', 207), ('reg_alpha', 5.5514820852863855e-05), ('reg_lambda', 648), ('scale_pos_weight', 283), ('subsample', 0.5520832763658757)])

Model #5
Best Log Loss: -1.13642972
Best params: OrderedDict([('colsample_bylevel', 0.8789082806014207), ('colsample_bytree', 0.36591688835830133), ('gamma', 1.67665900058721e-09), ('learning_rate', 0.1385669035760294), ('max_delta_step', 9), ('max_depth', 46), ('min_child_weight', 4), ('n_estimators', 176), ('reg_alpha', 0.07741769580886979), ('reg_lambda', 354), ('scale_pos_weight', 278), ('subsample', 0.41884600419507806)])

In [None]:
# Print best params
result.best_params_

OrderedDict([('colsample_bylevel', 0.8789082806014207),
             ('colsample_bytree', 0.36591688835830133),
             ('gamma', 1.67665900058721e-09),
             ('learning_rate', 0.1385669035760294),
             ('max_delta_step', 9),
             ('max_depth', 46),
             ('min_child_weight', 4),
             ('n_estimators', 176),
             ('reg_alpha', 0.07741769580886979),
             ('reg_lambda', 354),
             ('scale_pos_weight', 278),
             ('subsample', 0.41884600419507806)])

In [None]:
X_train

Unnamed: 0,Neighborhood,Precinct,Month,DayOfWeek,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8827,86,4,12,4,25,19,5,9,3,1,-1.934364,0.201865,-1.920698,-0.305663,-1.776141,-0.792361,1.944868,-0.943216,15
15702,22,5,7,6,50,16,17,9,2,3,-0.630813,-1.091239,-0.326885,-1.217323,-0.000680,-1.260448,1.260448,-2.093855,7
15643,62,3,10,0,59,15,7,6,2,4,1.394484,-0.903690,1.580860,-0.511978,1.659503,-0.085376,1.661698,2.669396,36
8977,23,3,6,4,29,23,19,4,4,3,0.864311,0.105701,0.807503,0.325799,0.695665,0.523695,0.870751,1.972704,16
5449,66,2,4,2,30,0,30,6,4,2,1.754484,0.943401,1.450531,1.365349,1.047727,1.694251,1.992039,1.601038,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1170,86,4,11,2,30,19,24,8,3,4,-1.957989,0.150543,-1.930236,-0.361352,-1.770940,-0.848621,1.963768,-0.970462,15
10879,23,3,7,3,0,23,26,5,4,3,0.790329,-0.048696,0.776003,0.157515,0.708793,0.352992,0.791828,2.155933,5
4988,49,5,5,3,34,16,13,8,2,2,0.185832,-0.963929,0.428983,-0.882987,0.642900,-0.741871,0.981679,-2.808443,2
1674,65,3,1,6,0,0,25,2,4,1,0.703002,-0.321179,0.762175,-0.128285,0.769407,0.073352,0.772896,2.522946,5


In [None]:
# Add a placeholder 'Category' column and drop it again right away;
# X_test ends up with only its feature columns
X_test['Category'] = 1
X_test.drop(columns=['Category'], inplace=True)

In [None]:
Y_train

8827     2
15702    4
15643    3
8977     1
5449     2
        ..
1170     2
10879    4
4988     4
1674     4
15367    1
Name: Category, Length: 179109, dtype: int64

In [None]:

##############################################

# data = [X_train]

# for dataset in data:
#     cat_le = LabelEncoder()
#     cat_le.fit(dataset['Category'].unique())
# #     print(list(cat_le.classes_))
# X_train['Category']=cat_le.transform(dataset['Category'])

# Train model with optimal hyperparameters & all features

- I started with a Random Forest but switched to XGBoost as the main model
- First, I train the model on all the features, using hyperparameters found through BayesSearchCV
    - The hard-coded values differ from the best parameters printed by the 5-iteration search above, so they presumably come from a separate, longer search run
- Then I use the model to predict class probabilities for the test set with all the features
    - I'll save these predictions to compare them later with a second model trained with certain features removed

In [None]:
xgb = XGBClassifier(
    n_estimators=86, 
    objective="multi:softprob", 
    learning_rate=0.1858621466840661,
    colsample_bylevel=1.0,
    colsample_bytree=1.0,
    gamma=0.49999999999999994,
    max_delta_step=0,
    max_depth=50,
    min_child_weight=5,
    reg_alpha=1.0,
    reg_lambda=60.121460571845695,
    scale_pos_weight=1e-06,
    subsample=1.0,
    random_state=1, 
    n_jobs=4,
    silent=False,
    enable_categorical=True)

xgb.fit(X_train, Y_train)

Y_test_pred = xgb.predict_proba(X_test)

Parameters: { enable_categorical, scale_pos_weight, silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




In [None]:
random_forest = RandomForestClassifier(n_estimators=600, max_depth=21, max_features=6,
                                       min_samples_leaf=43, min_samples_split=40, 
                                       random_state=1, verbose=3, n_jobs=2)
random_forest.fit(X_train, Y_train)

Y_test_pred = random_forest.predict_proba(X_test)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    4.4s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:   20.1s
[Parallel(n_jobs=2)]: Done 284 tasks      | elapsed:   45.8s
[Parallel(n_jobs=2)]: Done 508 tasks      | elapsed:  1.4min
[Parallel(n_jobs=2)]: Done 600 out of 600 | elapsed:  1.6min finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:    0.4s
[Parallel(n_jobs=2)]: Done 284 tasks      | elapsed:    1.0s
[Parallel(n_jobs=2)]: Done 508 tasks      | elapsed:    1.8s
[Parallel(n_jobs=2)]: Done 600 out of 600 | elapsed:    2.1s finished


# Feature Importance

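- `xgb.feature_importances_` gives each feature's normalized share of the model's total importance; for tree boosters this is based by default on the average gain of the splits that use the feature. Mean decrease in Gini impurity is the analogous measure for a Random Forest
- Ensemble methods (Random Forest, XGBoost) can use these scores for feature selection, which helps prevent overfitting
    - I drop the features that seem unimportant, i.e. those with less than a 1% contribution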
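
The 1% rule can be written as a small filter. The sketch below is only an illustration: `low_importance_features` is a hypothetical helper, and the feature names and scores are made-up values rather than the notebook's results.

```python
def low_importance_features(importances, threshold=0.01):
    """Return the names of features whose importance is below the threshold."""
    return sorted(name for name, score in importances.items() if score < threshold)


# Hypothetical importance scores (they sum to 1.0, like normalized importances)
example_importances = {
    "feat_a": 0.62,
    "feat_b": 0.30,
    "feat_c": 0.075,
    "feat_d": 0.004,
    "feat_e": 0.001,
}

to_drop = low_importance_features(example_importances)
print(to_drop)  # ['feat_d', 'feat_e']
```

With the `importances` DataFrame built in this notebook, the same candidates could be listed with `importances[importances['importance'] < 0.01].index` and then passed to `X_train.drop(columns=...)` and `X_test.drop(columns=...)`.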

In [None]:
importances = pd.DataFrame({'feature': X_train.columns,
                            'importance': np.round(xgb.feature_importances_, 5)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')

In [None]:
importances

feature,importance
Precinct,0.35519
Hour_Zone,0.10108
Minute,0.06318
Rot45_X,0.04975
Hour,0.03895
Radius,0.03454
Year,0.03435
Cluster,0.03422
Rot60_X,0.03114
Angle,0.03066


# Feature Removal

- Remove features to simplify the model and reduce overfitting
- The rule is to drop anything that contributes under 1% of the total importance

In [None]:
# X_train = X_train.drop("BusinessHour", axis=1)
# X_test  = X_test.drop("BusinessHour", axis=1)

In [None]:
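# Drop Precinct from both sets. Note that Precinct has the highest importance
# in the table above (0.35519), so this drop is not an application of the 1% rule.
X_train = X_train.drop("Precinct", axis=1)
X_test  = X_test.drop("Precinct", axis=1)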

In [None]:
# X_train = X_train.drop("Holiday", axis=1)
# X_test  = X_test.drop("Holiday", axis=1)

In [None]:
# X_train = X_train.drop("Weekend", axis=1)
# X_test  = X_test.drop("Weekend", axis=1)

In [None]:
X_train.head()

Unnamed: 0,Neighborhood,Month,DayOfWeek,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
8827,86,12,4,25,19,5,9,3,1,-1.934364,0.201865,-1.920698,-0.305663,-1.776141,-0.792361,1.944868,-0.943216,15
15702,22,7,6,50,16,17,9,2,3,-0.630813,-1.091239,-0.326885,-1.217323,-0.00068,-1.260448,1.260448,-2.093855,7
15643,62,10,0,59,15,7,6,2,4,1.394484,-0.90369,1.58086,-0.511978,1.659503,-0.085376,1.661698,2.669396,36
8977,23,6,4,29,23,19,4,4,3,0.864311,0.105701,0.807503,0.325799,0.695665,0.523695,0.870751,1.972704,16
5449,66,4,2,30,0,30,6,4,2,1.754484,0.943401,1.450531,1.365349,1.047727,1.694251,1.992039,1.601038,0


In [None]:
X_test.head()

Unnamed: 0,Neighborhood,Month,DayOfWeek,Minute,Hour,Day,Year,Hour_Zone,Season,Rot30_X,Rot30_Y,Rot45_X,Rot45_Y,Rot60_X,Rot60_Y,Radius,Angle,Cluster
11073,55,7,5,0,0,13,0,4,3,0.494259,-0.092615,0.501388,0.038464,0.474349,0.166922,0.502862,2.279629,32
8324,51,6,1,0,23,13,1,4,3,0.896813,1.075765,0.587826,1.271222,0.23878,1.380046,1.400551,1.218525,14
6458,34,5,4,0,18,19,6,3,2,-0.22065,1.383985,-0.571334,1.279718,-0.883081,1.088241,1.401464,0.365498,31
9954,88,12,0,38,13,21,8,2,1,0.312533,1.714797,-0.141938,1.737256,-0.586737,1.641324,1.743045,0.703877,31
2483,78,5,4,15,0,28,10,4,2,0.295927,-0.027139,0.292868,0.050377,0.26985,0.124461,0.297169,2.185847,32


In [None]:
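# Show the mapping from the original Year labels to their encoded integers
dict(zip(year_le.classes_, year_le.transform(year_le.classes_)))

# Fit a fresh encoder per dataset and encode 'Category' in place.
# Applying a previously fitted `cat_le` to a column that has already been
# encoded to integers fails with:
#   ValueError: y contains previously unseen labels: [0, 1, 2, 3, 4, 5, 6, 7]
for dataset in data:
    cat_le = LabelEncoder()
    cat_le.fit(dataset['Category'].unique())
    print(list(cat_le.classes_))
    dataset['Category'] = cat_le.transform(dataset['Category'])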

# Train final model with optimal hyperparameters & features

In [None]:
# It seems running time scales quadratically with the number of classes
xgb = XGBClassifier(
    n_estimators=86, 
    objective="multi:softprob", 
    learning_rate=0.1858621466840661,
    colsample_bylevel=1.0,
    colsample_bytree=1.0,
    gamma=0.49999999999999994,
    max_delta_step=0,
    max_depth=50,
    min_child_weight=5,
    reg_alpha=1.0,
    reg_lambda=60.121460571845695,
    scale_pos_weight=1e-06,
    subsample=1.0,
    random_state=1, 
    n_jobs=4,
    silent=False)


xgb.fit(X_train, Y_train)

Y_test_pred = xgb.predict_proba(X_test)

Parameters: { scale_pos_weight, silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




In [None]:
#Final Random Forest Model

random_forest = RandomForestClassifier(n_estimators=600, max_depth=21, max_features=6,
                                       min_samples_leaf=43, min_samples_split=40, 
                                       random_state=1, verbose=3, n_jobs=2)
random_forest.fit(X_train, Y_train)

Y_test_pred = random_forest.predict_proba(X_test)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    4.8s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:   20.6s
[Parallel(n_jobs=2)]: Done 284 tasks      | elapsed:   47.8s
[Parallel(n_jobs=2)]: Done 508 tasks      | elapsed:  1.4min
[Parallel(n_jobs=2)]: Done 600 out of 600 | elapsed:  1.7min finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:    0.4s
[Parallel(n_jobs=2)]: Done 284 tasks      | elapsed:    1.0s
[Parallel(n_jobs=2)]: Done 508 tasks      | elapsed:    1.8s
[Parallel(n_jobs=2)]: Done 600 out of 600 | elapsed:    2.1s finished


# Model Evaluation

- Evaluate the final model with stratified K-fold cross-validation (K = 3)
- Average the scores over all K folds for a more reliable estimate of the final model's performance
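
To make the idea of stratified folds concrete, here is a minimal sketch that deals the samples of each class round-robin into K folds, so every fold keeps roughly the overall class proportions. It illustrates the concept only and is not scikit-learn's `StratifiedKFold` algorithm; `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=3, seed=1):
    """Split sample indices into n_splits folds with similar class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)   # deal out round-robin
    return folds


# Toy target with three imbalanced classes (6, 3 and 9 samples)
labels = [0] * 6 + [1] * 3 + [2] * 9
folds = stratified_folds(labels)

fold_class_counts = [Counter(labels[i] for i in fold) for fold in folds]
for counts in fold_class_counts:
    print(dict(sorted(counts.items())))  # {0: 2, 1: 1, 2: 3} for every fold
```

In cross-validation each fold serves once as the held-out set while the model is trained on the remaining folds, and the K held-out scores are then averaged.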

In [None]:
scores = cross_val_score(xgb, X_train, Y_train, 
                         cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1), 
                         scoring = "neg_log_loss", n_jobs=2)



Parameters: { scale_pos_weight, silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.



In [None]:
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [-1.12803999 -1.12698922 -1.12928097]
Mean: -1.128103392205977
Standard Deviation: 0.0009366772705282381
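
The scores are negative because scikit-learn's `neg_log_loss` scorer negates the log loss so that higher is always better. The sketch below computes multiclass log loss by hand and converts the three fold scores printed above into a positive mean log loss. `multiclass_log_loss` and the small probability table are illustrative assumptions, not part of the notebook.

```python
import math


def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        p = min(max(row[label], eps), 1 - eps)   # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


# Toy example: three samples, three classes
y_true = [0, 2, 1]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
toy_loss = multiclass_log_loss(y_true, probabilities)

# A model that predicts uniform probabilities over k classes scores log(k)
uniform_loss = multiclass_log_loss([0, 1, 2], [[1 / 3] * 3] * 3)

# Fold scores reported by cross_val_score above (negated log loss)
fold_scores = [-1.12803999, -1.12698922, -1.12928097]
mean_log_loss = -sum(fold_scores) / len(fold_scores)
print(round(mean_log_loss, 4))  # 1.1281
```

Comparing the mean log loss with the uniform baseline `log(k)` for the number of crime categories gives a quick sense of how much better than guessing the model is.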
