## Contents:

* [Introduction](#intro)

* [Learning, Model Selection and Evaluation](#model)

    * [Model 1: To predict all original crime categories - no collapsing](#model1)
        
        * [Feature Set 1 : District, Month, Hours](#m1.1)
        
        * [Conclusions](#m1.2)
        
        * [Possible Improvements](#m1.3)
    
    * [Model 2: To predict only the top 10 crimes](#model2)
    
        * [Feature Set 1 : District, Month, Hours](#m2.1)
    
            * [Evaluation Metrics used](#m2.1.1)
    
            * [Results](#m2.1.2)
    
        * [Feature Set 2: District, Month, Hours, Weekday/Weekend](#m2.2)
            
            * [Results](#m2.2.1)
        
        * [Feature Set 3: District, Month, Hours, Streetname](#m2.3)
    
            * [Results](#m2.3.1)
        
        * [Conclusions](#m2.4)
    
        * [Improvements](#m2.5)

<a id='intro'></a>

## Introduction

In [1]:
%matplotlib inline
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib
import math
import numpy as np
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

In [2]:
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
from sklearn import neighbors, metrics
from sklearn import linear_model
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

In [46]:
df_boston = pd.read_pickle('df_boston.pkl')

<a id='model'></a>

## Learning, Model Selection and Evaluation

* Since this is a classification problem, algorithms such as Naive Bayes, Logistic Regression and Support Vector Machines can be used.
* Tree-based and ensemble methods such as Decision Trees and Random Forests can be used.
* Neural Networks can also be used.

<a id='model1'></a>

### Model 1: To predict all original crime categories - no collapsing

* From the EDA done, the 3 most important predictors are district, month and hours.
* The baseline model was built with these 3 features to gauge their predictive value.
* The algorithms used were the Naive Bayes classifier, Decision Trees and Random Forests.

<a id='m1.1'></a>

#### Feature Set 1 : District, Month, Hours

#### Hours

The hours can be recoded into 4 categories:

1. Early morning hours (1-6)
2. Peak evening hours (16-18)
3. Peak hours (12 PM and 12 AM)
4. Rest of the hours

The hours have been split in this way to reflect how the crime distribution changes with the hour of the day.

In [47]:
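def bin_hours(df):
    # Work on a copy: the function is called on df_boston several times, and
    # binning in place would re-bin hours that have already been recoded.
    df = df.copy()
    list_hours = list()
    for hour in df.hours:
        if hour in range(1,7):        # early morning hours
            list_hours.append(0)
        elif hour in range(16,19):    # peak evening hours (16-18)
            list_hours.append(1)
        elif hour in [12,24]:         # noon and midnight (midnight only matches if stored as 24)
            list_hours.append(2)
        else:                         # rest of the hours
            list_hours.append(3)
    df.hours = list_hours
    return df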

In [9]:
df_model1_set1 = bin_hours(df_boston)

#### Encoding

In [48]:
def encode_features(df):
    df_district = pd.get_dummies(df['district'])
    df_month = pd.get_dummies(df['month'])
    df_hours = pd.get_dummies(df['hours'])
    column_dfs = [df_district,df_month,df_hours]
    df_concat = pd.concat(column_dfs,axis=1)
    df_concat.columns = ['Brighton','Charlestown',
     'Dorchester',
     'Downtown',
     'East Boston',
     'Human Traffic Unit',
     'Hyde Park',
     'Jamaica Plain',
     'Mattapan',
     'Roxbury',
     'South Boston',
     'South End',
     'West Roxbury',
     'missing_district'] + ['month' + str(i) for i in range(1,13)] + ['hour' + str(i) for i in range(0,4)]
    return df_concat

In [11]:
df_model1_set1 = encode_features(df_model1_set1)
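
The hard-coded column list in `encode_features` only lines up if every district, month and hour bin actually occurs in the rows passed in, which is worth checking once the data is filtered. The following is a minimal sketch of one way to make the column set independent of the rows present, assuming pandas is available. The helper name `one_hot_fixed` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def one_hot_fixed(series, categories, prefix):
    """One-hot encode `series` so that every entry of `categories` gets a column."""
    # Declaring the categories up front makes get_dummies emit a column for
    # each of them, even when a category is absent from this particular slice.
    as_cat = series.astype(pd.CategoricalDtype(categories=categories))
    return pd.get_dummies(as_cat, prefix=prefix).astype(int)

# Toy example: hour bins 1 and 2 never occur, but still get (all-zero) columns.
hours = pd.Series([3, 0, 3], index=[10, 11, 12])
encoded = one_hot_fixed(hours, categories=[0, 1, 2, 3], prefix="hour")
print(encoded)
```

Because the column names are derived from the category values, no manual renaming step is needed and the names cannot drift out of order.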

#### Learning

In [49]:
def perform_cv(clf, X, Y, scoring_metric):
    kf_scores = list()
    scores = cross_validation.cross_val_score(clf, X, Y, cv=5, scoring=scoring_metric)
    kf_scores.append(scores)
    kf_scores.append(scores.mean())
    return kf_scores
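
The `sklearn.cross_validation` module and the `'log_loss'` scorer name used in this notebook belong to older scikit-learn releases. In recent releases the same function lives in `sklearn.model_selection` and the scorer is called `'neg_log_loss'`. The sketch below shows an equivalent helper under that assumption. The name `perform_cv_modern` and the synthetic data are illustrative only and are not the Boston crime data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def perform_cv_modern(clf, X, y, scoring_metric, n_folds=5):
    """Return [per-fold scores, mean score], mirroring perform_cv above."""
    scores = cross_val_score(clf, X, y, cv=n_folds, scoring=scoring_metric)
    return [scores, scores.mean()]

# Synthetic stand-in data: 200 rows of binary features, 3 classes.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(200, 6))
y_demo = rng.randint(0, 3, size=200)

clf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
acc_scores, acc_mean = perform_cv_modern(clf_demo, X_demo, y_demo, "accuracy")
nll_scores, nll_mean = perform_cv_modern(clf_demo, X_demo, y_demo, "neg_log_loss")
print(acc_mean, nll_mean)
```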

In [13]:
data_x = df_model1_set1
data_y = df_boston.crime_category

#### Naive Bayes

In [153]:
nb_clf = make_pipeline(preprocessing.StandardScaler(), GaussianNB())
nb_kf_model1_set1_accuracy= perform_cv(nb_clf,data_x,data_y,'accuracy')
nb_kf_model1_set1_accuracy

[array([ 0.00067109,  0.00050347,  0.00026112,  0.00052248,  0.00087723]),
 0.00056707634074969063]

#### Decision Tree

In [155]:
tree_clf = make_pipeline(preprocessing.StandardScaler(),tree.DecisionTreeClassifier())
tree_kf_model1_set1_accuracy = perform_cv(tree_clf,data_x,data_y,'accuracy')
tree_kf_model1_set1_accuracy

[array([ 0.15105138,  0.14201537,  0.14611583,  0.13756041,  0.13546605]),
 0.14244180548002977]

#### Random Forests

In [15]:
rf_clf = make_pipeline(preprocessing.StandardScaler(),RandomForestClassifier(n_estimators=10))
rf_kf_model1_set1_accuracy = perform_cv(rf_clf,data_x,data_y,'accuracy')

In [16]:
rf_kf_model1_set1_accuracy

[array([ 0.15064126,  0.14125084,  0.14576145,  0.13505999,  0.13421554]),
 0.1413858157495162]

In [12]:
rf_clf = make_pipeline(preprocessing.StandardScaler(),RandomForestClassifier(n_estimators=150, max_depth=20))
rf_kf_model1_set1_f1 = perform_cv(rf_clf,data_x,data_y,'f1_weighted')



In [13]:
rf_kf_model1_set1_f1

[array([ 0.07234724,  0.07383009,  0.07416921,  0.0677333 ,  0.07438393]),
 0.072492754506245122]

In [15]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=150, max_depth=20,class_weight = 'balanced'))
rf_kf_model1_set1_f1 = perform_cv(rf_clf,data_x,data_y,'f1_weighted')



In [16]:
rf_kf_model1_set1_f1

[array([ 0.00730455,  0.00304428,  0.00726283,  0.00513344,  0.00974019]),
 0.0064970594956120118]

<a id='m1.2'></a>

#### Conclusions:

* **The initial model, which tries to predict all classes, has very low accuracy.**
* **The labels are too fine-grained, and their close similarity to one another seems to be the reason for the poor performance of the baseline models.**
* **Class imbalance can be addressed with oversampling methods and ensemble learners.**

<a id='m1.3'></a>

#### Possible Improvements:

* Additional models can be built by collapsing several of these classes.
* Ensemble methods and other computationally intensive methods can also be tried to improve performance as required.
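
Collapsing classes can be done with a plain lookup table applied to the label column. The sketch below is purely illustrative: the grouping in `COLLAPSE_MAP` is hypothetical and the notebook does not define one.

```python
# Hypothetical grouping, for illustration only.
COLLAPSE_MAP = {
    "OTHER LARCENY": "LARCENY",
    "LARCENY FROM MOTOR VEHICLE": "LARCENY",
}

def collapse_category(label, mapping=COLLAPSE_MAP):
    """Map a fine-grained label to its group; unmapped labels are kept as-is."""
    return mapping.get(label, label)

labels = ["OTHER LARCENY", "VANDALISM", "LARCENY FROM MOTOR VEHICLE"]
collapsed = [collapse_category(label) for label in labels]
print(collapsed)
```

With pandas, the same function could be applied to a column via `Series.map(collapse_category)`.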

<a id='model2'></a>

### Model 2: To predict only the top 10 crimes

* From the EDA done, the 3 most important predictors are district, month and hours.

* The baseline model uses only the district name, month and hours as features.
* Additional features can be added if they increase the performance of the model significantly.

<a id='m2.1'></a>

#### Feature Set 1 : District, Month, Hours

In [22]:
df_model2_set1 = bin_hours(df_boston)

In [23]:
df_model2_set1 = df_model2_set1.loc[df_model2_set1['crime_category'].isin(list(df_boston.crime_category.value_counts()[:10].index))]

In [24]:
data_y = df_model2_set1.crime_category

In [25]:
df_model2_set1 = encode_features(df_model2_set1) ### set1 features

In [26]:
data_x = df_model2_set1

#### Naive Bayes

In [127]:
nb_clf = make_pipeline(preprocessing.StandardScaler(), GaussianNB())
nb_kf_model2_set1_accuracy = perform_cv(nb_clf,data_x,data_y,'accuracy')

In [128]:
nb_kf_model2_set1_accuracy

[array([ 0.13022738,  0.12803117,  0.13267957,  0.12744526,  0.16275248]),
 0.13622717088758451]

In [129]:
nb_clf = make_pipeline(preprocessing.StandardScaler(), GaussianNB())
nb_kf_model2_set1_logloss = perform_cv(nb_clf,data_x,data_y,'log_loss')

In [130]:
nb_kf_model2_set1_logloss

[array([-8.90294717, -8.62602481, -8.6802286 , -9.81271047, -5.49586608]),
 -8.3035554273031984]

#### Logistic Regression

In [133]:
log_clf = make_pipeline(linear_model.LogisticRegression(multi_class='multinomial',solver='lbfgs',C=0.01))
log_kf_model2_set1_accuracy = perform_cv(log_clf,data_x,data_y,'accuracy')
log_kf_model2_set1_accuracy

[array([ 0.21794668,  0.20859725,  0.21452425,  0.20781424,  0.21847002]),
 0.21347048798609033]

In [134]:
log_clf = make_pipeline(linear_model.LogisticRegression(multi_class='multinomial',solver='lbfgs',C=0.01))
log_kf_model2_set1_logloss = perform_cv(log_clf,data_x,data_y,'log_loss')
log_kf_model2_set1_logloss

[array([-2.19763379, -2.23155644, -2.2208301 , -2.24391866, -2.20795114]),
 -2.2203780255259806]

#### Decision Trees

In [135]:
tree_clf = make_pipeline(tree.DecisionTreeClassifier())
tree_kf_model2_set1_accuracy = perform_cv(tree_clf,data_x,data_y,'accuracy')
tree_kf_model2_set1_accuracy

[array([ 0.21505804,  0.20149471,  0.20813676,  0.19882839,  0.20587923]),
 0.20587942811674975]

In [136]:
tree_clf = make_pipeline(tree.DecisionTreeClassifier())
tree_kf_model2_set1_logloss = perform_cv(tree_clf,data_x,data_y,'log_loss')
tree_kf_model2_set1_logloss

[array([-2.27930257, -2.30871192, -2.30774374, -2.32750826, -2.39922025]),
 -2.3244973487063048]

#### Random Forests

In [138]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=150,max_depth=20))
rf_kf_model2_set1_accuracy = perform_cv(rf_clf,data_x,data_y,'accuracy')
rf_kf_model2_set1_accuracy

[array([ 0.21405099,  0.2013092 ,  0.20702359,  0.19763558,  0.20519005]),
 0.20504188269811524]

In [141]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=150,max_depth=20))
rf_kf_model2_set1_logloss = perform_cv(rf_clf,data_x,data_y,'log_loss')
rf_kf_model2_set1_logloss

[array([-2.21165374, -2.24938181, -2.23379825, -2.25831878, -2.23257672]),
 -2.2371458602131669]

<a id='m2.1.1'></a>

#### Evaluation Metrics used:
* As this is a multi-class classification problem, log-loss is a better metric for evaluating the algorithms than plain accuracy.
* Log-loss is computed from the predicted class probabilities, so it penalizes confident wrong predictions much more heavily than uncertain ones.
* Overall accuracy can be misleading here because of the skewed class distribution and the number of classes.
* The scores reported by `cross_val_score` are negated log-losses, so values closer to 0 are better.
* The models were evaluated with 5-fold cross-validation.

* From the log-losses calculated, Random Forests and Logistic Regression seem to perform well.
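
To put the log-loss values in context, it helps to know what a model that guesses uniformly would score. The following dependency-free sketch computes multi-class log-loss by hand. The function name and the placeholder class names are hypothetical, and the two-sample input is a toy example rather than notebook data.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs):
    """Mean negative log-probability assigned to the true class of each sample.

    `predicted_probs` is a list of dicts mapping class label -> probability.
    """
    eps = 1e-15  # guard against log(0)
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs.get(label, 0.0), eps), 1.0)
        total += -math.log(p)
    return total / len(true_labels)

classes = ["class_%d" % i for i in range(10)]  # placeholder class names
uniform = {c: 1.0 / len(classes) for c in classes}

# A model that always predicts 1/10 for each of 10 classes scores ln(10).
baseline = multiclass_log_loss(["class_3", "class_7"], [uniform, uniform])
print(round(baseline, 4))  # 2.3026
```

Against that reference point, the cross-validated scores of about -2.22 for Logistic Regression and -2.24 for Random Forests (the sign is flipped by the scorer) are only slightly better than uniform guessing over 10 classes, while the Naive Bayes score of about -8.30 is far worse.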

<a id='m2.1.2'></a>

#### Results

In [36]:
df

| Algorithm | Accuracy | Negated log-loss (higher is better) |
|---|---|---|
| Naive Bayes | 0.136227 | -8.303555 |
| Logistic Regression | 0.21347 | -2.220378 |
| Decision Trees | 0.205879 | -2.324497 |
| Random Forests | 0.205042 | -2.237146 |


* Naive Bayes: the dependence between the features could be a reason for its low accuracy and poor log-loss.
* Logistic Regression, Decision Trees and Random Forests handle multi-class classification natively and all perform better than Naive Bayes, at about 0.21 accuracy.

In [26]:
data_x_train, data_x_test, data_y_train, data_y_test = cross_validation.train_test_split(data_x, data_y, test_size=0.1, random_state=0)
clf = RandomForestClassifier(n_estimators=250,max_depth=25)
clf.fit(data_x_train,data_y_train)
expected = data_y_test
predicted = clf.predict(data_x_test)

In [27]:
print("Classification report for classifier %s:\n%s\n"
      % (clf, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

Classification report for classifier RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=25, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False):
                            precision    recall  f1-score   support

           AUTO VIOLATIONS       0.22      0.75      0.34      3764
              DRUG CHARGES       0.20      0.06      0.10      1275
                    INVPER       0.00      0.00      0.00      1311
LARCENY FROM MOTOR VEHICLE       0.19      0.01      0.02      1287
        MEDICAL ASSISTANCE       0.16      0.03      0.05      1847
                     MVACC       0.05      0.00      0.00      1376
            OTHER ASSAULTS       0.21      0.05      0.09      1909
             OTHER LARCENY       0.25      0.39      0.31      2464
                  PROPERTY       0.18      0.05      0.07      2279
                 VANDALISM       0.07      0.00      0.00

In [28]:
metrics.accuracy_score(expected,predicted)

0.21966604823747682

In [29]:
metrics.f1_score(expected,predicted,average='weighted')

0.13753416152316739
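
The weighted F1 reported above is the mean of the per-class F1 scores, weighted by each class's support. The sketch below computes it by hand on toy labels (not the crime data) to show why it can sit well below accuracy when a classifier leans on the majority class. The function name `weighted_f1` is hypothetical.

```python
from collections import Counter

def weighted_f1(expected, predicted):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(expected)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for e, p in zip(expected, predicted) if e == cls and p == cls)
        n_pred = sum(1 for p in predicted if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_true
    return total / len(expected)

# Toy labels: a classifier that always predicts the majority class "A".
expected_toy = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
predicted_toy = ["A"] * 10
accuracy_toy = sum(e == p for e, p in zip(expected_toy, predicted_toy)) / len(expected_toy)
f1_toy = weighted_f1(expected_toy, predicted_toy)
print(accuracy_toy, round(f1_toy, 3))
```

In this toy case accuracy is 0.6 while the weighted F1 is 0.45, because the two minority classes contribute an F1 of zero.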

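* There is severe class imbalance.
* The classifier favours the majority class: AUTO VIOLATIONS has a recall of 0.75, while every other class except OTHER LARCENY (0.39) has a recall of 0.06 or less.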

In [22]:
data_x_train, data_x_test, data_y_train, data_y_test = cross_validation.train_test_split(data_x, data_y, test_size=0.1, random_state=0)
clf = RandomForestClassifier(n_estimators=250,max_depth=25,class_weight='balanced')
clf.fit(data_x_train,data_y_train)
expected = data_y_test
predicted = clf.predict(data_x_test)
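
The `class_weight='balanced'` option used in the cell above reweights classes inversely to their frequency. scikit-learn documents the weights as `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula on toy labels (not the crime data). The helper name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights of the form n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels: one frequent class and two rarer ones.
toy_labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare classes receive larger weights, so errors on them cost more during training, which is consistent with recall shifting away from the majority class in the report that follows.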

In [23]:
print("Classification report for classifier %s:\n%s\n"
      % (clf, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

Classification report for classifier RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=25, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False):
                            precision    recall  f1-score   support

           AUTO VIOLATIONS       0.31      0.12      0.17      3764
              DRUG CHARGES       0.11      0.31      0.16      1275
                    INVPER       0.09      0.16      0.11      1311
LARCENY FROM MOTOR VEHICLE       0.11      0.10      0.11      1287
        MEDICAL ASSISTANCE       0.14      0.16      0.15      1847
                     MVACC       0.12      0.14      0.13      1376
            OTHER ASSAULTS       0.16      0.09      0.11      1909
             OTHER LARCENY       0.25      0.36      0.30      2464
                  PROPERTY       0.18      0.06      0.09      2279
                 VANDALISM       0.09      0.09

In [24]:
metrics.accuracy_score(expected,predicted)

0.15785846806254969

In [25]:
metrics.f1_score(expected,predicted, average='weighted')

0.15213573211333378

**With balanced class weights, the weighted F1 score is higher (0.152 vs. 0.138), although accuracy drops from 0.220 to 0.158.**

In [27]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=250, max_depth=25))
rf_metric = perform_cv(rf_clf,data_x,data_y,'f1_weighted')



In [28]:
rf_metric

[array([ 0.12577343,  0.12839257,  0.12833203,  0.12305591,  0.130846  ]),
 0.12727998845657487]

In [37]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
data_x_resampled, data_y_resampled = ros.fit_sample(data_x, data_y)

In [38]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=250, max_depth=25))
rf_metric = perform_cv(rf_clf,data_x_resampled,data_y_resampled,'f1_weighted')

In [39]:
rf_metric

[array([ 0.14424732,  0.13233199,  0.14768532,  0.14506299,  0.15043987]),
 0.14395349761942172]

**With random oversampling, the weighted F1 score is 0.144, compared with 0.127 without it.**
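
One caveat when reading this comparison: the data was oversampled before cross-validation, so duplicated rows can end up in both the training and the validation part of a fold, which may make the score optimistic. The following is a dependency-free sketch of the alternative, where only the training indices of each fold are oversampled. All helper names and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def oversample_indices(indices, labels, rng):
    """Randomly duplicate minority-class indices until all classes match the largest."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(members) for members in by_class.values())
    resampled = []
    for members in by_class.values():
        resampled.extend(members)
        resampled.extend(rng.choice(members) for _ in range(target - len(members)))
    return resampled

def kfold_with_oversampling(labels, n_folds=5, seed=0):
    """Yield (train_indices, test_indices); only the training part is oversampled."""
    rng = random.Random(seed)
    order = list(range(len(labels)))
    rng.shuffle(order)
    for k in range(n_folds):
        test = order[k::n_folds]
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        yield oversample_indices(train, labels, rng), test

# Toy labels, not the crime data.
toy_labels = ["A"] * 12 + ["B"] * 5 + ["C"] * 3
folds = list(kfold_with_oversampling(toy_labels))
print([(len(train), len(test)) for train, test in folds])
```

The held-out rows keep their original class distribution, so the validation score reflects performance on unmodified data. The imbalanced-learn package also provides its own pipeline class that applies samplers only when fitting, which may be a more convenient way to get the same effect.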

<a id='m2.2'></a>

#### Feature Set 2: District, Month, Hours, Weekday/Weekend

**The weekday/weekend feature was added to check whether it improves the model significantly.**

In [35]:
df_model2_set2 = bin_hours(df_boston)

In [36]:
def encode_features_set2(df):
    df_district = pd.get_dummies(df['district'])
    df_month = pd.get_dummies(df['month'])
    df_hours = pd.get_dummies(df['hours'])
    df_weekday_weekend = pd.DataFrame(df.weekday_weekend)
    column_dfs = [df_district,df_month,df_hours,df_weekday_weekend]
    df_concat = pd.concat(column_dfs,axis=1)
    df_concat.columns = ['Brighton','Charlestown',
     'Dorchester',
     'Downtown',
     'East Boston',
     'Human Traffic Unit',
     'Hyde Park',
     'Jamaica Plain',
     'Mattapan',
     'Roxbury',
     'South Boston',
     'South End',
     'West Roxbury',
     'missing_district'] + ['month' + str(i) for i in range(1,13)] + ['hour' + str(i) for i in range(0,4)] + ['weekday/weekend']
    
    return df_concat

In [37]:
df_model2_set2 = df_model2_set2.loc[df_model2_set2['crime_category'].isin(list(df_boston.crime_category.value_counts()[:10].index))]

In [38]:
data_y = df_model2_set2.crime_category

In [39]:
df_model2_set2 = encode_features_set2(df_model2_set2)

In [40]:
data_x = df_model2_set2

#### Naive Bayes

In [138]:
nb_clf = make_pipeline(preprocessing.StandardScaler(), GaussianNB())
nb_kf_model2_set2_accuracy = perform_cv(nb_clf,data_x,data_y,'accuracy')
nb_kf_model2_set2_accuracy

[array([ 0.13028038,  0.12800466,  0.13267957,  0.12744526,  0.16171871]),
 0.13602571704404545]

In [139]:
nb_clf = make_pipeline(preprocessing.StandardScaler(), GaussianNB())
nb_kf_model2_set2_logloss = perform_cv(nb_clf,data_x,data_y,'log_loss')
nb_kf_model2_set2_logloss

[array([-8.89852909, -8.62067851, -8.67557591, -9.80573853, -5.49719724]),
 -8.2995438563379054]

#### Logistic Regression

In [142]:
log_clf = make_pipeline(preprocessing.StandardScaler(),linear_model.LogisticRegression(multi_class='multinomial',solver='lbfgs'))
log_kf_model2_set2_accuracy = perform_cv(log_clf,data_x,data_y,'accuracy')
log_kf_model2_set2_accuracy

[array([ 0.21784067,  0.20706013,  0.21470978,  0.20484546,  0.21809892]),
 0.21251099514687546]

In [143]:
log_clf = make_pipeline(preprocessing.StandardScaler(),linear_model.LogisticRegression(multi_class='multinomial',solver='lbfgs'))
log_kf_model2_set2_logloss = perform_cv(log_clf,data_x,data_y,'log_loss')
log_kf_model2_set2_logloss

[array([-2.19664151, -2.23263093, -2.22163957, -2.24550365, -2.25884372]),
 -2.2310518762545426]

#### Decision Trees

In [140]:
tree_clf = make_pipeline(preprocessing.StandardScaler(),tree.DecisionTreeClassifier())
tree_kf_model2_set2_accuracy = perform_cv(tree_clf,data_x,data_y,'accuracy')
tree_kf_model2_set2_accuracy

[array([ 0.21304394,  0.1973604 ,  0.20540684,  0.19262578,  0.20823835]),
 0.20333506069198731]

In [141]:
tree_clf = make_pipeline(preprocessing.StandardScaler(),tree.DecisionTreeClassifier())
tree_kf_model2_set2_logloss = perform_cv(tree_clf,data_x,data_y,'log_loss')
tree_kf_model2_set2_logloss

[array([-2.55140323, -2.59173581, -2.60874564, -2.56879631, -2.67876033]),
 -2.5998882641967986]

#### Random Forests

In [144]:
rf_clf = make_pipeline(preprocessing.StandardScaler(),RandomForestClassifier(n_estimators=10))
rf_kf_model2_set2_accuracy = perform_cv(rf_clf,data_x,data_y,'accuracy')
rf_kf_model2_set2_accuracy

[array([ 0.20883023,  0.19622081,  0.20291545,  0.19236071,  0.20150559]),
 0.20036655967040379]

In [145]:
rf_kf_model2_set2_logloss = perform_cv(rf_clf,data_x,data_y,'log_loss')
rf_kf_model2_set2_logloss

[array([-2.55368367, -2.59701339, -2.61055505, -2.57362798, -2.63737222]),
 -2.5944504616811095]

<a id='m2.2.1'></a>

#### Results

In [39]:
df1

| Algorithm | Accuracy | Negated log-loss (higher is better) |
|---|---|---|
| Naive Bayes | 0.136026 | -8.299544 |
| Logistic Regression | 0.212511 | -2.231052 |
| Decision Trees | 0.203335 | -2.599888 |
| Random Forests | 0.200367 | -2.59445 |


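* Adding the weekday/weekend feature makes almost no difference to the Naive Bayes classifier (accuracy 0.1360 vs. 0.1362).
* The other algorithms perform the same or slightly worse. Note that the Logistic Regression, Decision Tree and Random Forest pipelines in this section also use different settings from Feature Set 1 (for example, the Random Forest uses `n_estimators=10` with no `max_depth`), so part of the change may come from those settings rather than from the new feature.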

In [41]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=250, max_depth=25))
rf_metric = perform_cv(rf_clf,data_x,data_y,'f1_weighted')

In [42]:
rf_metric

[array([ 0.13238057,  0.13121702,  0.13404057,  0.12477081,  0.13792949]),
 0.13206769133830684]

In [43]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
data_x_resampled, data_y_resampled = ros.fit_sample(data_x, data_y)

In [44]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=250, max_depth=25))
rf_metric = perform_cv(rf_clf,data_x_resampled,data_y_resampled,'f1_weighted')

In [45]:
rf_metric

[array([ 0.15022515,  0.13832258,  0.15416032,  0.15525703,  0.16270678]),
 0.15213437142591885]

<a id='m2.3'></a>

#### Feature Set 3: District, Month, Hours, Streetname

* Street name was another feature across which the crime distribution appeared to vary.
* It was added to the feature set, keeping the 9 most frequent streets and grouping all others as 'other', to assess its effect on performance.

In [50]:
top_streets = list(df_boston.streetname.value_counts()[:9].index)

In [51]:
def recode_streetname(df):
    list_street = list()
    for street in df.streetname:
        if street not in top_streets:
            list_street.append('other')
        else:
            list_street.append(street)
    df.streetname = list_street
    return df

In [52]:
df_model2_set3 = bin_hours(df_boston)

In [53]:
df_model2_set3 = recode_streetname(df_model2_set3)

In [54]:
df_model2_set3.streetname.unique()

array(['other', 'WASHINGTON ST', 'MASSACHUSETTS AV', 'BLUE HILL AV',
       'COMMONWEALTH AV', 'CENTRE ST', 'DORCHESTER AV', 'TREMONT ST',
       'BOYLSTON ST', 'HARRISON AV'], dtype=object)

In [55]:
def encode_features_set3(df):
    df_district = pd.get_dummies(df['district'])
    df_month = pd.get_dummies(df['month'])
    df_hours = pd.get_dummies(df['hours'])
    df_streetname = pd.get_dummies(df['streetname'])
    column_dfs = [df_district,df_month,df_hours,df_streetname]
    df_concat = pd.concat(column_dfs,axis=1)
    df_concat.columns = ['Brighton','Charlestown',
     'Dorchester',
     'Downtown',
     'East Boston',
     'Human Traffic Unit',
     'Hyde Park',
     'Jamaica Plain',
     'Mattapan',
     'Roxbury',
     'South Boston',
     'South End',
     'West Roxbury',
     'missing_district'] + ['month' + str(i) for i in range(1,13)]  + ['hour' + str(i) for i in range(0,4)] + ['street' + str(i) for i in range(0,10)]
    
    return df_concat

In [56]:
df_model2_set3 = df_model2_set3.loc[df_model2_set3['crime_category'].isin(list(df_boston.crime_category.value_counts()[:10].index))]

In [57]:
data_y = df_model2_set3.crime_category

In [58]:
df_model2_set3 = encode_features_set3(df_model2_set3)

In [98]:
data_x = df_model2_set3

#### Naive Bayes

In [101]:
nb_clf = make_pipeline(preprocessing.StandardScaler(), GaussianNB())
nb_kf_model2_set3_accuracy = perform_cv(nb_clf,data_x,data_y,'accuracy')
nb_kf_model2_set3_accuracy

[array([ 0.13531563,  0.13147643,  0.13578055,  0.12619944,  0.15374013]),
 0.13650243342939641]

In [102]:
nb_clf = make_pipeline(preprocessing.StandardScaler(), GaussianNB())
nb_kf_model2_set3_logloss = perform_cv(nb_clf,data_x,data_y,'log_loss')
nb_kf_model2_set3_logloss

[array([-8.37463345, -8.34523895, -8.36222727, -9.80426983, -6.83377484]),
 -8.3440288682091648]

#### Logistic Regression

In [113]:
log_clf = make_pipeline(preprocessing.StandardScaler(),linear_model.LogisticRegression(multi_class='multinomial',solver='lbfgs',
                                                                                      C=0.01))
log_kf_model2_set3_accuracy = perform_cv(log_clf,data_x,data_y,'accuracy')
log_kf_model2_set3_accuracy

[array([ 0.22698362,  0.21381814,  0.21910946,  0.20434183,  0.21049144]),
 0.2149488997835533]

In [114]:
log_clf = make_pipeline(preprocessing.StandardScaler(),linear_model.LogisticRegression(multi_class='multinomial',solver='lbfgs',C=0.01))
log_kf_model2_set3_logloss = perform_cv(log_clf,data_x,data_y,'log_loss')
log_kf_model2_set3_logloss

[array([-2.18034157, -2.21397699, -2.20425417, -2.23154264, -2.22744962]),
 -2.2115129975593506]

#### Decision Trees

In [106]:
tree_clf = make_pipeline(preprocessing.StandardScaler(),tree.DecisionTreeClassifier())
tree_kf_model2_set3_accuracy = perform_cv(tree_clf,data_x,data_y,'accuracy')
tree_kf_model2_set3_accuracy

[array([ 0.22197488,  0.20634458,  0.20784522,  0.20397074,  0.19996819]),
 0.20802071997067645]

In [107]:
tree_clf = make_pipeline(preprocessing.StandardScaler(),tree.DecisionTreeClassifier())
tree_kf_model2_set3_logloss = perform_cv(tree_clf,data_x,data_y,'log_loss')
tree_kf_model2_set3_logloss

[array([-3.08381026, -3.08200062, -3.1006632 , -3.1554863 , -3.19063663]),
 -3.1225194021165485]

#### Random Forests

In [119]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=150))
rf_kf_model2_set3_accuracy = perform_cv(rf_clf,data_x,data_y,'accuracy')
rf_kf_model2_set3_accuracy

[array([ 0.21911274,  0.20451594,  0.20596342,  0.20078991,  0.19715846]),
 0.2055080932663087]

In [142]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=150,max_depth=20))
rf_kf_model2_set3_logloss = perform_cv(rf_clf,data_x,data_y,'log_loss')
rf_kf_model2_set3_logloss

[array([-2.21154344, -2.24972586, -2.23364411, -2.25875862, -2.23279917]),
 -2.237294239461471]

<a id='m2.3.1'></a>

#### Results:

In [42]:
df2

| Algorithm | Accuracy | Negated log-loss (higher is better) |
|---|---|---|
| Naive Bayes | 0.136502 | -8.344029 |
| Logistic Regression | 0.214949 | -2.211513 |
| Decision Trees | 0.208021 | -3.122519 |
| Random Forests | 0.205508 | -2.237294 |


In [59]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=250, max_depth=25))
rf_metric = perform_cv(rf_clf,data_x,data_y,'f1_weighted')

In [60]:
rf_metric

[array([ 0.13238849,  0.13135357,  0.13379536,  0.12576837,  0.13564412]),
 0.13178998078736587]

In [61]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
data_x_resampled, data_y_resampled = ros.fit_sample(data_x, data_y)

In [63]:
rf_clf = make_pipeline(RandomForestClassifier(n_estimators=250, max_depth=25))
rf_metric = perform_cv(rf_clf,data_x_resampled,data_y_resampled,'f1_weighted')

In [64]:
rf_metric

[array([ 0.15252381,  0.13945765,  0.15484835,  0.15829368,  0.1606082 ]),
 0.1531463386132203]

* Logistic Regression and Random Forests again do better than the other models with this feature set.
* The Decision Tree tends to overfit as the model becomes more complex, which could explain why its log-loss is worse than in the earlier results.

* Other features such as Weapon Count, Year and Latitude/Longitude were not used because they either had too many missing values or failed to improve the model significantly.
* The feature set (District, Streetname, Hours, Month) appears to be the best of the feature sets tried for this model.

<a id='m2.4'></a>

#### Conclusions:

* The NB classifier delivers poor results, likely because of the highly skewed multi-class distribution and the dependence between features.
* Decision Trees tend to do well on such categorical features, although they overfit as model complexity increases.
* Random Forests, being an ensemble method, suit the categorical nature of the dataset and cope well with increasing model complexity.
* Logistic Regression also performs well, helped by the multinomial formulation and its natural tendency to output probabilities, which suits the log-loss metric.
* Oversampling improves the F1 score of the models built.

<a id='m2.5'></a>

#### Possible Improvements:

1. The Random Forest and Logistic Regression models can be optimized further by using grid search to tune their hyperparameters.
2. Other ensemble methods, such as boosting algorithms, could improve classification performance.
3. Neural networks can also be used. However, both boosting and neural networks can be computationally expensive to train.
4. The class labels can be collapsed in various other ways to explore the behaviour of such models.
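
As an illustration of the grid-search suggestion, the sketch below tunes two Random Forest hyperparameters with `GridSearchCV` from `sklearn.model_selection`, assuming a recent scikit-learn release. The grid values and the synthetic data are placeholders chosen to keep the example fast. They are not recommendations for the crime data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the Boston crime data): 200 rows, 3 classes.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(200, 6))
y_demo = rng.randint(0, 3, size=200)

# Illustrative grid only; sensible ranges depend on the real data.
param_grid = {"n_estimators": [10, 50], "max_depth": [5, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_weighted",  # same metric name used with perform_cv in the notebook
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```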