**Michael Bocamazo**

**Question 4**: How well does a 2-stage model perform with Random Forests and minimal feature engineering?

**Date**: 2016/11/28

**Methods**: Use the same feature sets, train a multistage model in the same way as the previous notebook, and analyze the random forests generated.

The first stage predicts car vs. non-car for samples in which a car is available (CAR_AV).  Samples predicted as non-car are then passed to the second stage, together with the samples for which no car is available.  The second stage can be multinomial or one-versus-all over all of the classes, not just Train vs. SwissMetro, so that car samples missed by the first stage can still be assigned to the Car class.

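**Conclusion**: The two-stage model did not help.  After a group shuffle of the data, the single-stage random forest reached about 71.0% test accuracy and the single-stage logistic regression about 68.6%, while the two-stage versions reached about 65.5% and 64.9% respectively.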

In [1]:
import os, sys
import csv
import numpy as np
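import sklearn
import sklearn.metrics  # explicit submodule import; used below for accuracy_score and confusion_matrix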
import matplotlib
import matplotlib.pyplot as plt
import time
import pandas as pd
import seaborn as sns
import copy
%matplotlib inline
sns.set_style("darkgrid", {"grid.linewidth": .5, "axes.facecolor": ".9"})

In [2]:
import ML_utils as ml_ut

From the previous question, we have a clean data frame, and a data frame with expanded features.

In [3]:
df_e = pd.read_csv('SM_expand.csv')

In [4]:
df_e.columns

Index(['SURVEY', 'ID', 'PURPOSE', 'FIRST', 'TICKET', 'WHO', 'LUGGAGE', 'AGE',
       'MALE', 'INCOME', 'GA', 'ORIGIN', 'DEST', 'CAR_AV', 'TRAIN_TT',
       'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'SM_SEATS', 'CAR_TT',
       'CAR_CO', 'CHOICE', 'min_CO', 'ratio_TRAIN_CO', 'ratio_SM_CO',
       'ratio_CAR_CO', 'min_TT', 'ratio_TRAIN_TT', 'ratio_SM_TT',
       'ratio_CAR_TT'],
      dtype='object')

### Features to include
We can make a simple list of features that would be appropriate for a logistic regression model without any transformations.  The safest choices are the numeric and binary features.  Less certain are the ordinal-encoded features.  The remaining categorical features are stored as plain integer codes and require one-hot encoding (OHE).

The **pure feats** are the numeric features that we expect to correlate directly with choice, except for cost, which needs an adjustment based on GA (the annual season ticket flag) to work well.  

The **simple feats** are binary or ordinal-encoded features.  The INCOME and AGE features both have a category for unknowns, coded as the greatest value.  If the latent classes behind the unknowns are equally distributed among the known values, this acts as a regularizer.  The unknowns could instead be mapped to the middle value for a cleaner regularization.
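A minimal sketch of the mid-value remapping suggested above.  The category codes are hypothetical (0 to 3 for known bands, 4 for unknown); the actual encoding in `SM_expand.csv` should be checked before applying anything like this.

```python
import pandas as pd

# Hypothetical ordinal codes: 0-3 are known income bands, 4 marks "unknown".
UNKNOWN_CODE = 4
income = pd.Series([0, 1, 4, 3, 2, 4, 1], name="INCOME")

# Midpoint of the known range, so unknowns sit in the middle of the scale
known = income[income != UNKNOWN_CODE]
mid_value = (known.min() + known.max()) / 2.0

# Keep known values as they are, replace the unknown code with the midpoint
income_mid = income.astype(float).where(income != UNKNOWN_CODE, mid_value)
print(income_mid.tolist())
```

With the unknowns in the middle, a linear model no longer treats them as the most extreme income or age group.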

The **ratio features** are those developed in the first pass to compare between alternatives.  These should be quite useful for random forests, because they condense the number of nodes needed to express the comparison of the cost features.  However, they might not be useful for logistic regression.  We can experiment.
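The construction of the ratio features is not shown in this notebook.  The column names suggest a per-row minimum over the alternatives and each alternative's value relative to that minimum; the sketch below assumes exactly that definition and uses made-up values.

```python
import pandas as pd

# Toy rows; the column names match the notebook, the values are made up.
toy = pd.DataFrame({
    "TRAIN_CO": [48.0, 60.0],
    "SM_CO": [52.0, 45.0],
    "CAR_CO": [65.0, 90.0],
})

cost_cols = ["TRAIN_CO", "SM_CO", "CAR_CO"]

# Assumed definition: cheapest alternative per row, then each cost relative to it
toy["min_CO"] = toy[cost_cols].min(axis=1)
for col in cost_cols:
    toy["ratio_" + col] = toy[col] / toy["min_CO"]

print(toy[["min_CO", "ratio_TRAIN_CO", "ratio_SM_CO", "ratio_CAR_CO"]])
```

Under this definition a tree can test "is this alternative the cheapest?" with a single threshold near 1, instead of several splits on the raw costs.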

Finally, there are 5 non-ordinal **encoded features** that we'll probably omit until we want much higher complexity.

In [5]:
simple_feats = ['SURVEY','LUGGAGE','INCOME','AGE','GA','SM_SEATS','CAR_AV','FIRST','MALE']
pure_feats = ['TRAIN_TT', 'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'CAR_TT','CAR_CO']
encode_feats = ['PURPOSE','TICKET','WHO','ORIGIN','DEST']
ratio_feats = ['min_CO', 'ratio_TRAIN_CO','ratio_SM_CO', 'ratio_CAR_CO', 'min_TT', 'ratio_TRAIN_TT',
               'ratio_SM_TT','ratio_CAR_TT']
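If the non-ordinal features are brought in later, one-hot encoding them could look like the sketch below.  It uses `pd.get_dummies` on a toy frame; the category codes are invented for illustration and only the column names come from the notebook.

```python
import pandas as pd

encode_feats = ['PURPOSE', 'TICKET', 'WHO', 'ORIGIN', 'DEST']

# Toy frame with made-up category codes; only the column names come from the notebook.
toy = pd.DataFrame({
    'PURPOSE': [1, 3, 1],
    'TICKET': [0, 1, 1],
    'WHO': [1, 2, 0],
    'ORIGIN': [2, 25, 2],
    'DEST': [1, 1, 22],
    'TRAIN_TT': [112, 130, 95],
})

# Each categorical column becomes one indicator column per observed level
toy_ohe = pd.get_dummies(toy, columns=encode_feats)
print(toy_ohe.columns.tolist())
```

Columns not listed in `columns=` (here `TRAIN_TT`) pass through unchanged.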

## Load and Reformat

In [6]:
from sklearn.ensemble import RandomForestClassifier

In [7]:
X = df_e.drop("CHOICE", axis = 1)
y = df_e['CHOICE']

The split was chosen based on session-split in q_00.

In [8]:
split = 7002

In [9]:
Xtrain, Xtest, ytrain, ytest = ml_ut.tt_split(X,y,split)
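`ml_ut.tt_split` is a local helper whose source is not shown here.  Since it takes a single row index, one plausible reading is a sequential split at that index; the function below is a hypothetical stand-in written under that assumption, not the actual helper.

```python
import pandas as pd

def sequential_split(X, y, split):
    """Hypothetical stand-in for ml_ut.tt_split: first `split` rows train, rest test."""
    return X.iloc[:split], X.iloc[split:], y.iloc[:split], y.iloc[split:]

# Toy data only, to show the shapes that come back
toy_X = pd.DataFrame({"a": range(10)})
toy_y = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3, 1])
Xtr, Xte, ytr, yte = sequential_split(toy_X, toy_y, 7)
print(len(Xtr), len(Xte))
```

A sequential split only gives comparable train and test sets if the row order carries no structure, which is what the class counts below put to the test.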

To start, we'll take the pure numeric features and the simple ordinal-encoded or binary features.

In [10]:
ML_feat = pure_feats + simple_feats

When examining the accuracy results, we should recall that the classifier needs to do better than about 64% accuracy to beat predicting everything in the dominant class, which is SwissMetro.
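The dominant-class baseline can also be expressed as a model, which makes it easy to score on the test set with the same code as the real classifiers.  The sketch below uses scikit-learn's `DummyClassifier` on toy labels; the arrays are made up and only the class coding (1 = Train, 2 = SwissMetro, 3 = Car) follows the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels using the notebook's coding (1 = Train, 2 = SwissMetro, 3 = Car)
y_toy_train = np.array([2, 2, 2, 1, 3, 2, 2, 3, 1, 2])
y_toy_test = np.array([2, 3, 2, 1, 2])
X_toy_train = np.zeros((len(y_toy_train), 1))  # features are ignored by this baseline
X_toy_test = np.zeros((len(y_toy_test), 1))

# Always predicts the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy_train, y_toy_train)
baseline_acc = baseline.score(X_toy_test, y_toy_test)
print(baseline_acc)
```

Note that the baseline is defined by the training distribution but scored on the test distribution, so the two numbers differ whenever the class proportions do.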

In [11]:
ytrain.value_counts()/len(ytrain)

2    0.635961
1    0.182519
3    0.181520
Name: CHOICE, dtype: float64

In [12]:
ytest.value_counts()/len(ytest)

3    0.486683
2    0.474307
1    0.039010
Name: CHOICE, dtype: float64

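The class proportions differ sharply between the two sets: SwissMetro makes up about 64% of the training set but only 47% of the test set, while Car rises from 18% to 49%.  It is unfortunate that this discrepancy was discovered only now, but it points toward using a group shuffle.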

### Group Shuffle all data

We want to shuffle the user IDs, not the samples themselves, so that no one person is included in both the train and test sets.  It isn't clear if an equivalent function exists in scikit-learn, so the shuffle is done by hand here.

There are 10,719 samples from roughly 1,200 responding users, each of whom was given several different scenarios and asked for their preference.
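For reference, scikit-learn does provide a group-aware splitter, `GroupShuffleSplit` in `sklearn.model_selection`.  The sketch below shows it on a toy frame with made-up values; it is an alternative to the manual shuffle used in this notebook, not what was run here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 6 respondents with 3 rows each (made-up values)
toy = pd.DataFrame({
    "ID": np.repeat(np.arange(1, 7), 3),
    "x": np.arange(18),
    "CHOICE": [1, 2, 3] * 6,
})

# An integer test_size is the number of groups (respondents) held out
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
train_idx, test_idx = next(splitter.split(toy[["x"]], toy["CHOICE"], groups=toy["ID"]))

train_ids = set(toy["ID"].iloc[train_idx])
test_ids = set(toy["ID"].iloc[test_idx])
print(sorted(train_ids), sorted(test_ids))
```

Every row of a respondent lands on the same side of the split, which is the property the manual shuffle is after.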

In [17]:
df_e.index

RangeIndex(start=0, stop=10719, step=1)

In [23]:
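# Permute the IDs actually present; np.random.permutation(1192) yields 0..1191 and drops any ID outside that range
group_assignments = np.random.permutation(df_e['ID'].unique())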

In [42]:
group_assignments

array([ 156,  845,  980, ...,  838, 1063,  563])

This rebuilds the dataframe.

In [47]:
df_shuffle = pd.concat(list(map(lambda z: df_e[df_e['ID']==z], group_assignments)))

In [48]:
df_shuffle.index = (range(len(df_shuffle)))

In [50]:
df_shuffle.to_csv('SM_shuffle.csv', sep=',', index=False)  # to_csv takes `sep`, not `delimiter`

In [51]:
df_e = df_shuffle

In [55]:
# examining this section shows that this split is again appropriate
# df_e.iloc[7000:7010]
split = 7002

In [56]:
X = df_e.drop("CHOICE", axis = 1)
y = df_e['CHOICE']

In [57]:
Xtrain, Xtest, ytrain, ytest = ml_ut.tt_split(X,y,split)

In [58]:
ytrain.value_counts()/len(ytrain)

2    0.584690
3    0.280777
1    0.134533
Name: CHOICE, dtype: float64

In [59]:
ytest.value_counts()/len(ytest)

2    0.571197
3    0.299083
1    0.129720
Name: CHOICE, dtype: float64

After the group shuffle, the class proportions are much closer between the two sets.  The dominant-class baseline on the test set is now about 57%.
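The split index was checked by eye above; the same check can be done programmatically.  The helper below is a hypothetical sketch that assumes the sequential split used in this notebook and reports any ID whose rows fall on both sides of it.

```python
import pandas as pd

def ids_straddling_split(frame, split, id_col="ID"):
    """Return the IDs that appear on both sides of a sequential row split."""
    train_ids = set(frame[id_col].iloc[:split])
    test_ids = set(frame[id_col].iloc[split:])
    return train_ids & test_ids

# Toy frame: respondents stored contiguously, 3 rows each
toy = pd.DataFrame({"ID": [5, 5, 5, 2, 2, 2, 9, 9, 9]})
clean = ids_straddling_split(toy, 6)    # boundary between respondents
leaky = ids_straddling_split(toy, 7)    # cuts respondent 9 in two
print(clean, leaky)
```

On the real data the call would be `ids_straddling_split(df_e, split)`, and an empty set means no respondent is shared between train and test.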

# Baseline Logistic Regression
Repeat the tests, since shuffling the data may have made a difference.

In [60]:
from sklearn.linear_model import LogisticRegression

In [63]:
ML_feat = pure_feats + simple_feats + encode_feats + ratio_feats

In [64]:
model = LogisticRegression().fit(Xtrain[ML_feat], ytrain)

In [65]:
ml_ut.print_predict(X[ML_feat], model, ytrain, ytest, split)

Training Log Loss: 0.675518645067
Training Accuracy: 0.701228220508
Testing  Log Loss: 0.717972234424
Testing  Accuracy: 0.685814455232


At 68.6% test accuracy, this is better than predicting the dominant class, which would score about 57.1% on the shuffled test set.
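`ml_ut.print_predict` is another local helper not shown here.  The numbers it prints can be reproduced with `sklearn.metrics`; the sketch below is a hypothetical equivalent (the name `report_split` and the synthetic data are assumptions) that computes log loss from predicted probabilities and accuracy from predicted labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

def report_split(model, X_part, y_part, label):
    """Hypothetical helper: print log loss and accuracy for one data split."""
    proba = model.predict_proba(X_part)
    ll = log_loss(y_part, proba, labels=model.classes_)
    acc = accuracy_score(y_part, model.predict(X_part))
    print("%s Log Loss: %.6f" % (label, ll))
    print("%s Accuracy: %.6f" % (label, acc))
    return ll, acc

# Synthetic three-class data, only to exercise the helper
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 4))
y_toy = np.argmax(X_toy[:, :3] + rng.normal(scale=0.5, size=(300, 3)), axis=1) + 1

toy_model = LogisticRegression().fit(X_toy[:200], y_toy[:200])
train_ll, train_acc = report_split(toy_model, X_toy[:200], y_toy[:200], "Training")
test_ll, test_acc = report_split(toy_model, X_toy[200:], y_toy[200:], "Testing ")
```

Log loss penalizes confident wrong probabilities, so it can move differently from accuracy when comparing models.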

In [72]:
sklearn.metrics.confusion_matrix(ytest, model.predict(Xtest[ML_feat]))

array([[ 144,  314,   23],
       [  90, 1742,  286],
       [   4,  448,  657]])
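Per-class recall and precision make the confusion matrix above easier to read.  The sketch below copies the matrix from the output and relies on scikit-learn's convention that rows are true classes and columns are predictions; the class order (Train, SwissMetro, Car) follows the label coding 1, 2, 3.

```python
import numpy as np

# Confusion matrix from the cell above: rows are true classes, columns predictions
cm = np.array([[144, 314, 23],
               [90, 1742, 286],
               [4, 448, 657]])
class_names = ["Train", "SwissMetro", "Car"]

# Recall: correct / all true members; precision: correct / all predicted members
recall = cm.diagonal() / cm.sum(axis=1).astype(float)
precision = cm.diagonal() / cm.sum(axis=0).astype(float)
for name, r, p in zip(class_names, recall, precision):
    print("%-10s recall %.3f  precision %.3f" % (name, r, p))
```

Train recall comes out at roughly 0.30: most true Train samples (314 of 481) are predicted as SwissMetro.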

In [66]:
model2_weights = pd.DataFrame(list(zip(ML_feat, model.coef_[0], model.coef_[1], model.coef_[2])))
model2_weights.columns = ['Feature','Train Choice','SM Choice','Car Choice']
model2_weights

| | Feature | Train Choice | SM Choice | Car Choice |
|---|---|---|---|---|
| 0 | TRAIN_TT | 0.001287 | -0.003544 | 0.002585 |
| 1 | TRAIN_CO | -0.002321 | 0.002365 | -0.000168 |
| 2 | TRAIN_HE | -0.008416 | 0.003967 | -0.00097 |
| 3 | SM_TT | -0.026267 | 0.012685 | -0.003787 |
| 4 | SM_CO | 0.001767 | -0.001716 | 0.000104 |
| 5 | SM_HE | 0.013411 | -0.006416 | -0.000857 |
| 6 | CAR_TT | -0.002707 | 0.005695 | -0.005434 |
| 7 | CAR_CO | -0.005873 | 0.00629 | 0.000944 |
| 8 | SURVEY | -1.241467 | -1.456547 | 2.92916 |
| 9 | LUGGAGE | -0.12496 | -0.151437 | 0.379516 |


# MultiStage Logistic Regression
Repeat the previous multistage logistic regression tests with all features.

In [68]:
model1, model2, dy_train, dy_test = ml_ut.multistage_model(Xtrain, Xtest, ytrain, ytest, ML_feat)

In [69]:
sklearn.metrics.accuracy_score(ytest, dy_test)

0.64859762675296651

In [70]:
sklearn.metrics.confusion_matrix(ytest, dy_test)

array([[  41,  419,   21],
       [ 159, 1719,  240],
       [  45,  419,  645]])

The simpler single-stage model actually performed better once the class split was more even: 68.6% test accuracy versus 64.9% for the multistage model.

# Baseline Random Forest
Without a multistage approach.  

In [85]:
from sklearn import ensemble

In [93]:
ML_feat = pure_feats + simple_feats + ratio_feats + encode_feats

In [100]:
model = ensemble.RandomForestClassifier(max_depth = 12, n_estimators=1000).fit(Xtrain[ML_feat], ytrain)

In [101]:
ml_ut.print_predict(X[ML_feat], model, ytrain, ytest, split)

Training Log Loss: 0.308707568319
Training Accuracy: 0.927306483862
Testing  Log Loss: 0.661010921949
Testing  Accuracy: 0.710355987055


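At 71.0% test accuracy, this is slightly better than the logistic regression (68.6%).  The gap between training accuracy (92.7%) and test accuracy suggests the forest is overfitting.  The max_depth was picked based on past experience with other data sets with a similar number of features; it could instead be chosen by cross-validation.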
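A sketch of how max_depth could be cross-validated while keeping each respondent inside a single fold, using `GridSearchCV` with `GroupKFold`.  The data, candidate depths, and the small `n_estimators` are illustrative assumptions chosen to keep the example fast; this was not run on the SwissMetro data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in: 60 respondents with 5 rows each
rng = np.random.RandomState(0)
groups = np.repeat(np.arange(60), 5)
X_toy = rng.normal(size=(300, 6))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int) + 1

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4, 8, 12]},  # candidate depths are illustrative
    cv=GroupKFold(n_splits=3),
    scoring="accuracy",
)
# groups keeps every respondent's rows inside a single fold
search.fit(X_toy, y_toy, groups=groups)
print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `groups` would be the training rows' `ID` column, so the validation folds respect the same constraint as the train/test split.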

In [89]:
sklearn.metrics.confusion_matrix(ytest, model.predict(Xtest[ML_feat]))

array([[ 162,  284,   35],
       [ 114, 1771,  233],
       [  18,  495,  596]])

Note that this matrix comes from an earlier cell execution (In [89]) than the fit reported above (In [100]): its diagonal gives (162 + 1771 + 596) / 3708, or about 68.2% accuracy, rather than 71.0%, so it should be regenerated from the current model.

# MultiStage Random Forest

Here we force the random forest to take car availability as its first split.  Hypothesized benefit: the model structure picks out the useful information automatically.  Possible drawback: each model sees less training data.

In [104]:
clf1 = ensemble.RandomForestClassifier(max_depth = 12, n_estimators = 1000)
clf2 = ensemble.RandomForestClassifier(max_depth = 12, n_estimators = 1000)
model1, model2, dy_train, dy_test = ml_ut.multistage_model_gen(Xtrain, Xtest, ytrain, ytest, ML_feat, clf1, clf2)
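`ml_ut.multistage_model_gen` is not shown in this notebook.  The sketch below is one possible reading of the two-stage procedure described in the Methods section, not the actual helper: the function name, the choice to train stage 2 on the training rows that stage 1 does not label Car, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

CAR = 3  # class coding used in this notebook: 1 = Train, 2 = SwissMetro, 3 = Car

def two_stage_fit_predict(X_train, y_train, X_test, feats, clf1, clf2):
    """Hypothetical sketch: stage 1 is Car vs. not-Car, stage 2 handles the rest."""
    stage1, stage2 = clone(clf1), clone(clf2)

    # Stage 1: binary Car / not-Car, trained only where a car is available
    has_car = (X_train["CAR_AV"] == 1).values
    stage1.fit(X_train.loc[has_car, feats], (y_train[has_car] == CAR).astype(int))

    def stage1_says_car(X):
        out = np.zeros(len(X), dtype=bool)
        avail = (X["CAR_AV"] == 1).values
        if avail.any():
            out[avail] = stage1.predict(X.loc[avail, feats]) == 1
        return out

    # Stage 2: all classes, trained on everything stage 1 did not label Car
    rest_train = ~stage1_says_car(X_train)
    stage2.fit(X_train.loc[rest_train, feats], y_train[rest_train])

    pred = np.full(len(X_test), CAR)
    rest_test = ~stage1_says_car(X_test)
    if rest_test.any():
        pred[rest_test] = stage2.predict(X_test.loc[rest_test, feats])
    return stage1, stage2, pred

# Synthetic data: Car is chosen only when available and f1 > 0; otherwise f2 decides
rng = np.random.RandomState(0)
n = 400
X_toy = pd.DataFrame({"CAR_AV": rng.randint(0, 2, size=n),
                      "f1": rng.normal(size=n),
                      "f2": rng.normal(size=n)})
y_toy = pd.Series(np.where((X_toy["CAR_AV"] == 1) & (X_toy["f1"] > 0), CAR,
                           np.where(X_toy["f2"] > 0, 2, 1)))

feats = ["CAR_AV", "f1", "f2"]
rf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
_, _, pred = two_stage_fit_predict(X_toy.iloc[:300], y_toy.iloc[:300],
                                   X_toy.iloc[300:], feats, rf, rf)
toy_acc = (pred == y_toy.iloc[300:].values).mean()
print(toy_acc)
```

Because stage 2 still predicts over all classes, a Car sample that stage 1 misses can be recovered, at the cost of each stage seeing fewer training rows.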

In [105]:
sklearn.metrics.accuracy_score(ytest, dy_test)

0.65507011866235165

In [106]:
sklearn.metrics.confusion_matrix(ytest, dy_test)

array([[  32,  437,   12],
       [ 140, 1783,  195],
       [  46,  449,  614]])

The multistage approach did not help in either case (64.9% vs. 68.6% test accuracy for logistic regression, 65.5% vs. 71.0% for random forests), but it was an interesting idea to try.  The weights of the logistic regression model showed that the CAR_AV feature was very important, as expected, and looking at the feature importances of both of these Random Forest models will probably show the same thing.
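Checking that expectation would use the fitted forest's `feature_importances_` attribute.  The sketch below runs on synthetic data in which CAR_AV is informative by construction, so it only illustrates the mechanics; on the real models the call would be `pd.Series(model.feature_importances_, index=ML_feat)`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; only the column names are borrowed from the notebook
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    "CAR_AV": rng.randint(0, 2, size=500),
    "TRAIN_TT": rng.normal(size=500),   # pure noise in this toy setup
    "SM_TT": rng.normal(size=500),
})
y_toy = np.where(X_toy["CAR_AV"] == 1, 3, np.where(X_toy["SM_TT"] < 0, 2, 1))

forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances, one per input column, summing to 1
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances can favor features with many distinct values, so a permutation-based check is a reasonable cross-check before drawing conclusions.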