Emily Wang and Filippos Lymperopoulos | Data Science 2016 | CYOA: sfcrime

Feb 15, 2016

### The process

* Import libraries and training data
* Feature engineering / preprocessing: 
    * Make "useful" combinations of features to give our model; 
    * Also encode categorical things in an intelligent way; 
    * Can choose to only use a subset of features if desired
* Partition your data (cross-validation kfolds, etc)
* Model fit
* Make some predictions
* Compute the logloss score
* Reflect; iterate 

Firstly, let's import some useful libraries and import the data. 

In [69]:
import pandas as pd

from IPython.display import display
# Note: sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it.
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
# from sklearn.model_selection import KFold
from sklearn.metrics import log_loss
import numpy as np
import pprint as pp

# Convert the Dates column of our provided data from string to datetime format.
train = pd.read_csv('train.csv', parse_dates = ['Dates'])
test = pd.read_csv('test.csv', parse_dates = ['Dates'])

# Print the first 3 rows of the dataframe.
display(train.head(3))

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414


### Feature Engineering, Preprocessing

Make a class that will:
* Extract the time features we want to use for the model (e.g. year, season, month, day, etc.)
* Encode categorical variables in a meaningful way: contains a preprocessor that can both transform and inverse_transform the categorical variables
* Return a transformed dataframe to be given to the model
* Maybe: allow for some flexibility with what is in the transformed dataframe (to iterate quickly) (e.g. choosing how many time features you want in this experiment)
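
A minimal sketch of the "transform and inverse_transform" idea, using scikit-learn's `LabelEncoder` on a toy stand-in for the `Category` column (the three values are taken from the rows displayed above; the variable names are just for illustration):

```python
import pandas as pd
from sklearn import preprocessing

# Toy stand-in for train.Category (assumption: only three rows).
categories = pd.Series(["WARRANTS", "OTHER OFFENSES", "OTHER OFFENSES"])

encoder = preprocessing.LabelEncoder()
encoded = encoder.fit_transform(categories)   # strings -> integer codes
decoded = encoder.inverse_transform(encoded)  # integer codes -> strings

print(list(encoded))            # one integer code per row
print(list(decoded))            # the original labels, recovered
print(list(encoder.classes_))   # sorted unique labels; position = code
```

Keeping the fitted encoder around is what lets us turn model outputs back into crime names later on.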

In [70]:
# SFP = SFCrime Preprocessor
class SFP():
    def __init__(self, data):
        self.data = data
        self.Y_encoder = preprocessing.LabelEncoder()
    
    # Prepare inputs
    def prep_district(self):
        # one hot encoding
        return pd.get_dummies(self.data.PdDistrict)
    
    def prep_hour(self):
        # an integer value from 0 to 23
        return self.data.Dates.dt.hour # Gets the hour portion from the "Dates" column
    
    def prep_day(self):
        # one hot encoding
        return pd.get_dummies(self.data.DayOfWeek)
    
    def prep_years(self):
        # beware: 2015 has significantly fewer incidents than the other years in this dataset.
        pass
    
    def concat_features(self):
        hour = self.prep_hour()
        day = self.prep_day()
        district = self.prep_district()
        return pd.concat([hour, day, district], axis=1)
    
    # Encode or decode classes
    def encode_Y(self, Y):
        return self.Y_encoder.fit_transform(Y)

    def decode_Y(self, encoded_Y):
        return self.Y_encoder.inverse_transform(encoded_Y)

In [71]:
sfp = SFP(train)
X = sfp.concat_features()
y = sfp.encode_Y(train.Category)

display(X.head())

Unnamed: 0,Dates,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday,BAYVIEW,CENTRAL,INGLESIDE,MISSION,NORTHERN,PARK,RICHMOND,SOUTHERN,TARAVAL,TENDERLOIN
0,23,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
1,23,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
2,23,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
3,23,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
4,23,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0


### Partition the data!

In [72]:
# note: this X and Y in particular are from the data in train.csv. See previous section for details.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)
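
A single 60/40 split only gives one estimate of the score. The k-fold option from the process list could look roughly like the sketch below. It runs on synthetic data (an assumption, so the block is self-contained) and uses `BernoulliNB` only as a placeholder model; swap in the real `X`, `y`, and classifier as needed.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-ins for X and y (assumption: 5 binary features, 3 classes).
rng = np.random.RandomState(42)
X_toy = rng.randint(0, 2, size=(300, 5))
y_toy = rng.randint(0, 3, size=300)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_losses = []
for train_idx, val_idx in kf.split(X_toy):
    model = BernoulliNB()
    model.fit(X_toy[train_idx], y_toy[train_idx])
    probs = model.predict_proba(X_toy[val_idx])
    # Pass labels explicitly in case a validation fold is missing a class.
    fold_losses.append(log_loss(y_toy[val_idx], probs, labels=model.classes_))

print("mean log loss: %.4f, std: %.4f" % (np.mean(fold_losses), np.std(fold_losses)))
```

Reporting the mean and standard deviation over the folds gives a sense of how much the score moves with the choice of split.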

### Fit the training data to the algorithm

Decision trees are known to be good at handling categorical data. Let's try some of the tree-based models in scikit-learn (decision tree, random forest, gradient boosting, etc.) and tweak some hyperparameters. We might even do some ensemble learning. Oooh shiny!

#### Decision Tree

In [73]:
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
_ = dtc.fit(X_train, y_train)
y_predictions = dtc.predict_proba(X_test)
dtc_log_loss = log_loss(y_test, y_predictions)

In [74]:
dtc_log_loss 

3.007115993440352

Filippos says he thinks this log loss value of 3.007115993440352 is very deece. Confirmed by looking at the kaggle leaderboards.

#### Random Forest!!

In [75]:
from sklearn.ensemble import RandomForestClassifier

# Using some hyperparameter values from DataQuest mission 75
rf = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=4, min_samples_leaf=1) 
_ = rf.fit(X_train, y_train)
y_predictions = rf.predict_proba(X_test)
rf_log_loss = log_loss(y_test, y_predictions)

In [76]:
rf_log_loss

3.0111558103212417

#### Stochastic Gradient Descent
[scikit learn cheatsheet advises us to look into SGD classifiers!](http://scikit-learn.org/stable/tutorial/machine_learning_map/)

Understanding SGD:
* yay Andrew Ng ML video
* batch gradient descent (looks at all of the training examples in every iteration)
* stochastic gradient descent (looks at only one training example in every iteration)
    * how well is my hypothesis doing on a single example? for a given theta and x,y pair
* the two differ in implementation details and in how they make progress towards the minimum
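
To make the batch vs. stochastic distinction concrete, here is a minimal NumPy sketch on a toy least-squares problem. This is an illustration only, not what `SGDClassifier` does internally; the data, learning rates, and function names are all assumptions.

```python
import numpy as np

# Toy linear data: y = 2 - 3x + noise, with a bias column prepended to X.
rng = np.random.RandomState(0)
X = np.c_[np.ones(200), rng.uniform(-1, 1, size=200)]
true_theta = np.array([2.0, -3.0])
y = X.dot(true_theta) + rng.normal(scale=0.1, size=200)

def batch_gd(X, y, lr=0.1, n_iters=500):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Gradient averaged over ALL training examples.
        grad = X.T.dot(X.dot(theta) - y) / len(y)
        theta -= lr * grad
    return theta

def stochastic_gd(X, y, lr=0.05, n_epochs=20, seed=0):
    order_rng = np.random.RandomState(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in order_rng.permutation(len(y)):
            # Gradient from ONE training example (x_i, y_i).
            grad = (X[i].dot(theta) - y[i]) * X[i]
            theta -= lr * grad
    return theta

print(batch_gd(X, y))
print(stochastic_gd(X, y))
```

Both should land near `true_theta`; the stochastic version takes many cheap, noisy steps instead of a few expensive exact ones.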

In [77]:
from sklearn.linear_model import SGDClassifier
# Note: loss="log" was renamed to "log_loss" in scikit-learn 1.1; use that name on newer versions.
sgd = SGDClassifier(loss="log", penalty="l2")
_ = sgd.fit(X_train, y_train)
y_predictions = sgd.predict_proba(X_test)
sgd_log_loss = log_loss(y_test, y_predictions)

In [78]:
sgd_log_loss

2.7817291803146502

"Ooooooooooohhhh"  -- Emily and Filippos

#### Naive Bayes


In [79]:
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()
_ = nb.fit(X_train, y_train)
y_predictions = nb.predict_proba(X_test)
nb_log_loss = log_loss(y_test, y_predictions)

In [80]:
nb_log_loss

2.609770164645326

"Ooooooooooohhhh" (again)  -- Emily and Filippos

#### A baseline

What's the log loss for a model that just predicts that all new data is the most common crime type ('LARCENY/THEFT')?
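
A rough sketch of how this baseline could be computed, on toy labels rather than the real `y_test` (an assumption). It compares two variants: putting all probability on the most common class, and predicting the class frequencies for every row. The `multiclass_log_loss` helper is hand-rolled here for illustration; `sklearn.metrics.log_loss` could be used instead.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    probs = np.clip(probs, eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

# Toy encoded labels (assumption): class 2 is the most common.
y_toy = np.array([2, 2, 2, 2, 0, 1, 2, 1, 0, 2])
n_classes = 3
counts = np.bincount(y_toy, minlength=n_classes)

# Baseline A: all probability on the most common class.
hard = np.zeros((len(y_toy), n_classes))
hard[:, counts.argmax()] = 1.0

# Baseline B: predict the observed class frequencies for every row.
priors = np.tile(counts / float(len(y_toy)), (len(y_toy), 1))

hard_loss = multiclass_log_loss(y_toy, hard)
prior_loss = multiclass_log_loss(y_toy, priors)
print(hard_loss, prior_loss)
```

Baseline A gets punished heavily for every row that isn't the most common class, since log loss blows up on confident wrong answers, so the frequency-based baseline B is probably the fairer comparison point.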

#### Next steps

ASAP:
* ~~prepare a submission to kaggle~~
* Try playing with hyperparameters; see how changes in those values impact the logloss, and plot them
* Visualization of performance of the different models (include an ensemble learning result in here too)
* *10 fold validation + mean and standard deviation for the metrics* - Emily
* log loss on each separate class (turn the multi class problem into 39 binary problems) 
    * --> discover more specifically which things the model predicts well and not so well 
    * --> more exploration and feature engineering in hopes of resolving the difference in performance between crime classes
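
A possible starting point for the per-class log loss idea: treat each column of the `predict_proba` output as its own binary (one-vs-rest) problem. The helper name and the toy inputs below are assumptions for illustration.

```python
import numpy as np

def per_class_log_loss(y_true, probs, eps=1e-15):
    """Binary (one-vs-rest) log loss for each class column of probs."""
    probs = np.clip(probs, eps, 1 - eps)
    n_classes = probs.shape[1]
    onehot = np.eye(n_classes)[y_true]  # 1 where the row belongs to the class
    losses = -(onehot * np.log(probs) + (1 - onehot) * np.log(1 - probs))
    return losses.mean(axis=0)  # one value per class

# Toy example (assumption): 4 incidents, 3 classes.
y_toy = np.array([0, 1, 2, 1])
probs_toy = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.2, 0.3, 0.5],
                      [0.3, 0.4, 0.3]])
class_losses = per_class_log_loss(y_toy, probs_toy)
print(class_losses)
```

Classes with a noticeably higher value would be the ones to look at first when cycling back to feature engineering.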


Backlog:
* ~~explanation on SGD black box~~
* ~~How to translate the 39-element outputs into more "human readable" outputs: TOP5~~
* *Creative approaches to ensemble learning!* - Filippos 
* Additional helpful visualizations for finding suggestions on how to improve?
* hyperparameter value vs. log loss 
* ~~Naive Bayes~~
* Comparison of log loss to the "predict the most common thing" baseline model
* Visualization of model performance

Process comments:
* We're happy enough with our current preprocessor to pause the feature engineering and run experiments with the predictive models; we'll cycle back to the feature engineering if there's time and interest. :)

Feb 16, 2016

### Trying it out on the test set

Preparing our inputs (the X matrix): This should all look very similar to the cells in the previous sections.

Be careful of your variable names!!

In [84]:
# sfp_submission will be used to preprocess the data from test.csv
sfp_submission = SFP(test) 
X_submission = sfp_submission.concat_features() 

# Sanity check: These should not be the same, because X is from train.csv and X_submission is from test.csv
display(X.head())
display(X_submission.head())

Unnamed: 0,Dates,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday,BAYVIEW,CENTRAL,INGLESIDE,MISSION,NORTHERN,PARK,RICHMOND,SOUTHERN,TARAVAL,TENDERLOIN
0,23,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
1,23,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
2,23,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
3,23,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
4,23,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0


Unnamed: 0,Dates,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday,BAYVIEW,CENTRAL,INGLESIDE,MISSION,NORTHERN,PARK,RICHMOND,SOUTHERN,TARAVAL,TENDERLOIN
0,23,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0
1,23,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0
2,23,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
3,23,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0
4,23,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0


In [85]:
# As a first pass, using the sgd we trained earlier in the notebook
y_predictions_submission = sgd.predict_proba(X_submission)

In [86]:
submission_header = sfp.decode_Y(sgd.classes_).tolist()
df_submission = pd.DataFrame(y_predictions_submission)
display(df_submission.head())
filename = "model5_sgd.csv"
# df_submission.to_csv(filename, index=True, index_label="Id", header=submission_header)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,29,30,31,32,33,34,35,36,37,38
0,0.000606,0.078295,0.00022,0.000229,0.027868,0.000307,0.000372,0.048493,0.00189,0.000521,...,9.1e-05,0.001023,0.000257,0.069205,3.191167e-07,0.002561,0.127994,0.019293,0.242845,0.014147
1,0.000606,0.078295,0.00022,0.000229,0.027868,0.000307,0.000372,0.048493,0.00189,0.000521,...,9.1e-05,0.001023,0.000257,0.069205,3.191167e-07,0.002561,0.127994,0.019293,0.242845,0.014147
2,0.000277,0.058662,0.000268,0.000175,0.041469,0.000463,0.00046,0.048988,0.002655,0.000677,...,9.1e-05,0.001708,0.00032,0.0539,3.663883e-07,0.002751,0.132781,0.0165,0.257545,0.006295
3,0.000384,0.081634,0.000254,0.000227,0.029832,0.000303,0.000445,0.031242,0.001942,0.00061,...,0.000106,0.001203,0.000314,0.069312,3.540282e-07,0.002144,0.158274,0.032892,0.195303,0.011131
4,0.000384,0.081634,0.000254,0.000227,0.029832,0.000303,0.000445,0.031242,0.001942,0.00061,...,0.000106,0.001203,0.000314,0.069312,3.540282e-07,0.002144,0.158274,0.032892,0.195303,0.011131


In [87]:
df_submission.shape # should have 884262 predictions

(884262, 39)

This model received a log loss score of 2.70463 (rank 857 on the kaggle leaderboards).

### Supplemental notes on how we're preparing submission for kaggle

* a header that describes the different categories... 
* id column (?)
* each row is a 39-element vector (probability for each of the 39 classes for each incident)

#### How do we make the header from our predict_proba output?

Emily: According to the scikit learn documentation, the output is: "The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_." [Also, thanks stackoverflow for an example.](http://stackoverflow.com/questions/16858652/how-to-find-the-corresponding-class-in-clf-predict-proba/16859091#16859091)
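
Here is a small self-contained sketch of that mapping, on toy data (an assumption) with `BernoulliNB` as a placeholder model: the decoded `classes_` become the column header, and the row index becomes the `Id` column.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import BernoulliNB

# Toy data (assumption): 2 binary features, 3 crime categories.
X_toy = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
labels = ["ASSAULT", "WARRANTS", "ARSON", "ASSAULT", "WARRANTS", "ARSON"]

encoder = preprocessing.LabelEncoder()
y_toy = encoder.fit_transform(labels)
model = BernoulliNB().fit(X_toy, y_toy)

# Column j of predict_proba corresponds to model.classes_[j].
probs = model.predict_proba(X_toy)
header = encoder.inverse_transform(model.classes_).tolist()

df = pd.DataFrame(probs, columns=header)
csv_text = df.to_csv(index=True, index_label="Id")  # returns a string when no path is given
print(csv_text.splitlines()[0])
```

Because no file path is passed to `to_csv`, nothing is written to disk; pass a filename once the output looks right.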

In [88]:
print(y_predictions_submission[0])
print(sgd.classes_) 
print(sfp.decode_Y(sgd.classes_)) # human readable

[  6.06436227e-04   7.82951500e-02   2.19911862e-04   2.28576367e-04
   2.78679336e-02   3.06797271e-04   3.72362065e-04   4.84930903e-02
   1.88992036e-03   5.21064663e-04   2.74871179e-04   1.58987720e-04
   1.95891841e-03   5.63223922e-03   1.46589850e-04   8.87881982e-04
   4.18571327e-02   3.20960310e-03   1.66403907e-04   8.98543469e-02
   2.30666292e-02   1.75781654e-01   2.27620794e-05   2.37614557e-04
   6.14237720e-03   8.08843792e-03   3.96767095e-03   1.77619233e-03
   5.53660075e-04   9.08012509e-05   1.02270964e-03   2.56735240e-04
   6.92046598e-02   3.19116728e-07   2.56139907e-03   1.27993697e-01
   1.92930438e-02   2.42844811e-01   1.41466090e-02]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38]
['ARSON' 'ASSAULT' 'BAD CHECKS' 'BRIBERY' 'BURGLARY' 'DISORDERLY CONDUCT'
 'DRIVING UNDER THE INFLUENCE' 'DRUG/NARCOTIC' 'DRUNKENNESS' 'EMBEZZLEMENT'
 'EXTORTION' 'FAMILY OFFENSES' 'FORGERY/COUNTERFEITING' 

Emily: I then prototyped some code to prepare the submission csv. Uncomment if you want to investigate further. (It's currently commented out to prevent the creation of a csv by accident.)

In [89]:
# Making a test_submission 
# subset = y_predictions[0:10]
# df = pd.DataFrame(subset)
# submission_header = sfp.decode_Y(sgd.classes_).tolist()
# csv = df.to_csv("test_submission.csv", index=True, index_label="Id", header=submission_header)

### Making the outputs human readable

Let's print out the top 5 probabilities for a single incident.

In [90]:
a = y_predictions[0]

# strategy from http://stackoverflow.com/questions/6910641/how-to-get-indices-of-n-maximum-values-in-a-numpy-array
ind = np.argpartition(a, -5)[-5:]
top5_words = sfp.decode_Y(ind)
top5 = {}
for i in range(len(ind)):
    top5[top5_words[i]] = a[ind[i]] 

import operator
sorted_top5 = sorted(top5.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_top5)

# Visualize for a given incident

[('DRUG/NARCOTIC', 0.22952135351915204), ('OTHER OFFENSES', 0.15045548579077567), ('LARCENY/THEFT', 0.11885207094595898), ('ASSAULT', 0.10955692433372424), ('NON-CRIMINAL', 0.094745908698342901)]
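
The same idea could be wrapped in a small reusable helper. This is a sketch; the `top_k` name and the toy inputs are assumptions, and `np.argsort` is used in place of `argpartition` because sorting 39 values is cheap.

```python
import numpy as np

def top_k(probs, class_names, k=5):
    """Return the k (class name, probability) pairs with the highest probability."""
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(class_names[i], float(probs[i])) for i in order]

# Toy example (assumption): 4 classes instead of 39.
names = ["ARSON", "ASSAULT", "BAD CHECKS", "BRIBERY"]
probs_row = np.array([0.1, 0.6, 0.05, 0.25])
result = top_k(probs_row, names, k=2)
print(result)
```

With the real model, `class_names` would be the decoded `classes_` and `probs_row` one row of the `predict_proba` output.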


### Ensembling
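
One simple approach to try first is soft voting: average the `predict_proba` outputs of several models, optionally with weights. A minimal sketch follows; the function name and the toy probability matrices are assumptions, and it relies on every model listing its classes in the same column order.

```python
import numpy as np

def average_probabilities(prob_matrices, weights=None):
    """Weighted average of several (n_rows, n_classes) probability matrices."""
    stacked = np.stack(prob_matrices)  # shape: (n_models, n_rows, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    return avg / avg.sum(axis=1, keepdims=True)  # renormalize each row to sum to 1

# Toy outputs from two hypothetical models (2 incidents, 2 classes).
p1 = np.array([[0.8, 0.2], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.2, 0.8]])
ensemble = average_probabilities([p1, p2])
print(ensemble)
```

The averaged matrix can be scored with `log_loss` exactly like the output of a single model.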

### Running several trials and summary statistics for those trials

### Comparing different implementations (change in performance vs. frequency) 

### Examining model mistakes
* Finding the differences between what it does well and what it does not as well

### Making the outputs human readable
* Top 5?