**Michael Bocamazo**

**Question 3.1**: How well does a 2-stage model perform with logistic regression and minimal feature engineering?

**Date**: 2016/11/28

**Methods**: Select down to a few features, train a multistage model with a split on car availability, and cross-validate to find the ridge (L2) regularization term. Examine the resultant weights in both models.

New from previous notebook: Have encapsulated the multi-stage model.

**Conclusion**: A multi-stage model with logistic regression can perform better than guessing the most frequent class. However, on the test set it does not predict anything in the "TRAIN" class at all. The first stage predicts car v. non-car for samples in which CAR is available. Those predicted not car are then passed into the second stage, together with the samples where no car is available. The second stage can be multinomial or one-versus-all over all of the classes, not just TRAIN v. SwissMetro, so that CAR samples missed by the first stage can still be classified as CAR.

In [1]:
import os, sys
import csv
import copy
import time
import numpy as np
import pandas as pd
import sklearn
import sklearn.metrics  # used below as sklearn.metrics.accuracy_score, etc.
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("darkgrid", {"grid.linewidth": .5, "axes.facecolor": ".9"})

In [3]:
import ML_utils as ml_ut

From the previous question, we have a clean data frame, and a data frame with expanded features.

In [7]:
df_e = pd.read_csv('SM_expand.csv')

In [8]:
df_e.columns

Index(['SURVEY', 'ID', 'PURPOSE', 'FIRST', 'TICKET', 'WHO', 'LUGGAGE', 'AGE',
       'MALE', 'INCOME', 'GA', 'ORIGIN', 'DEST', 'CAR_AV', 'TRAIN_TT',
       'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'SM_SEATS', 'CAR_TT',
       'CAR_CO', 'CHOICE', 'min_CO', 'ratio_TRAIN_CO', 'ratio_SM_CO',
       'ratio_CAR_CO', 'min_TT', 'ratio_TRAIN_TT', 'ratio_SM_TT',
       'ratio_CAR_TT'],
      dtype='object')

### Features to include
We can make a simple list of features that would be appropriate for a logistic regression model without any transformations. The certain ones are the numeric and binary features. Less certain are the ordinal encoded features. The simply encoded (non-ordinal) features require one-hot encoding (OHE).

The **pure feats** are the numeric features that we expect to correlate directly with choice, except for cost, which needs an adjustment based on GA to work well.

The **simple feats** are binary or ordinal encoded features. The INCOME and AGE features both have a category for unknowns, placed at the endpoint: in both features it occupies the greatest value. If the latent classes behind the unknowns are equally distributed among the values, this acts as a regularizer. The unknowns could instead be made to occupy the mid value for a cleaner regularization.
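The mid-value re-coding suggested above could be sketched as follows. This is only an illustration: the helper name `recenter_unknown` is hypothetical, and the code used for "unknown" (6 in the toy column) is an assumption that should be checked against the data dictionary.

```python
import pandas as pd

def recenter_unknown(col, unknown_code):
    """Replace the 'unknown' code with the midpoint of the known codes."""
    known = col[col != unknown_code]
    midpoint = (known.min() + known.max()) / 2.0
    # Cast to float so that a fractional midpoint is representable.
    return col.astype(float).where(col != unknown_code, midpoint)

# Toy column: codes 1..5 are known bands, 6 is assumed to mean "unknown".
age = pd.Series([1, 2, 3, 4, 5, 6, 6])
age_centered = recenter_unknown(age, unknown_code=6)
```

After standard scaling, rows re-coded this way sit near zero and so contribute little to the linear term for that feature.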

The **ratio features** are those developed in the first pass to compare between alternatives.  These should be quite useful for random forests, because they condense the number of nodes needed to express the comparison of the cost features.  However, they might not be useful for logistic regression.  We can experiment.
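The construction of the ratio features is not shown in this notebook. One plausible reading of the column names, assumed here, is that `min_TT` is the row-wise minimum across the three modes and each `ratio_*_TT` divides a mode's value by that minimum. The helper `add_ratio_features` is hypothetical.

```python
import pandas as pd

def add_ratio_features(df, suffix, modes=("TRAIN", "SM", "CAR")):
    """Add min_<suffix> and ratio_<mode>_<suffix> columns (assumed definition)."""
    cols = [f"{m}_{suffix}" for m in modes]
    out = df.copy()
    out[f"min_{suffix}"] = out[cols].min(axis=1)
    for mode, col in zip(modes, cols):
        # A zero minimum (e.g. a zero cost) would give inf or NaN here.
        out[f"ratio_{mode}_{suffix}"] = out[col] / out[f"min_{suffix}"]
    return out

toy = pd.DataFrame({"TRAIN_TT": [120, 90], "SM_TT": [60, 45], "CAR_TT": [90, 180]})
toy = add_ratio_features(toy, "TT")
```

A tree can split once on such a ratio instead of approximating the comparison with several splits on the raw columns, which is the condensing effect described above.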

Finally, there are 5 non-ordinal **encoded features** that we'll probably omit until we want much higher complexity.

In [9]:
simple_feats = ['SURVEY','LUGGAGE','INCOME','AGE','GA','SM_SEATS','CAR_AV','FIRST','MALE']
pure_feats = ['TRAIN_TT', 'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'CAR_TT','CAR_CO']
encode_feats = ['PURPOSE','TICKET','WHO','ORIGIN','DEST']
ratio_feats = ['min_CO', 'ratio_TRAIN_CO','ratio_SM_CO', 'ratio_CAR_CO', 'min_TT', 'ratio_TRAIN_TT',
               'ratio_SM_TT','ratio_CAR_TT']

## Multivariate Logistic Regression
Or, "Multinomial Logit"

In [9]:
from sklearn.linear_model import LogisticRegression

from sklearn.linear_model import LogisticRegressionCV
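The Methods note mentions cross-validating the ridge term, and `LogisticRegressionCV` is one way to do that. The sketch below uses synthetic data and an arbitrary grid of `Cs` (inverse regularization strengths); both are assumptions for illustration, not the settings used in `ML_utils`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
# Binary target driven by the first feature plus noise.
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

C_grid = np.logspace(-3, 2, 6)  # smaller C means a stronger L2 penalty
cv_model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=C_grid, cv=5, max_iter=1000),
)
cv_model.fit(X_toy, y_toy)

# The selected inverse regularization strength.
best_C = float(np.ravel(cv_model[-1].C_)[0])
```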

In [10]:
X = df_e.drop("CHOICE", axis = 1)
y = df_e['CHOICE']

The split was chosen based on session-split in q_00.

In [11]:
split = 7002

In [12]:
Xtrain, Xtest, ytrain, ytest = ml_ut.tt_split(X,y,split)

To start, we'll take the pure numeric feats and the simple ordinal-encoded or binary features.

In [13]:
ML_feat = pure_feats + simple_feats

When examining the accuracy results, we should recall that the classifier needs to do better than about 64% accuracy to beat predicting everything in the dominant class, which is SwissMetro.

In [81]:
ytrain.value_counts()/len(ytrain)

2    0.635961
1    0.182519
3    0.181520
Name: CHOICE, dtype: float64
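As a small worked check of this baseline, the class counts can be recovered from the row sums of the confusion matrices printed further down (1 = TRAIN, 2 = SM, 3 = CAR). The helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(counts):
    """Return the most frequent class and the accuracy of always predicting it."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total

# Row sums of the training and testing confusion matrices in this notebook.
train_counts = {"TRAIN": 1278, "SM": 4453, "CAR": 1271}
test_counts = {"TRAIN": 145, "SM": 1763, "CAR": 1809}

train_label, train_acc = majority_baseline(train_counts)
test_label, test_acc = majority_baseline(test_counts)
```

On the training split this reproduces the roughly 64% SwissMetro share. On the test split the class mix is different: CAR is the largest class, at a little under 49%.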

# Logistic Regression No Pipeline

In [58]:
from sklearn.linear_model import LogisticRegression

In [60]:
model = LogisticRegression().fit(Xtrain[ML_feat], ytrain)

In [61]:
ml_ut.print_predict(X[ML_feat], model, ytrain, ytest, split)

Training Log Loss: 0.70249302042
Training Accuracy: 0.68694658669
Testing  Log Loss: 0.787769357819
Testing  Accuracy: 0.580844767285


# Train and Test using Pipeline model

In [20]:
import ML_utils as ml_ut

In [22]:
import imp

In [28]:
imp.reload(ml_ut)

<module 'ML_utils' from '/home/atproofer/SwissMetro/SwissMetroML/ML_utils.py'>

In [29]:
model1, model2, dy_train, dy_test = ml_ut.multistage_model(Xtrain, Xtest, ytrain, ytest, ML_feat)

5319
6456
3717
2894


In [56]:
sklearn.metrics.accuracy_score(ytrain, dy_train)

0.60554127392173662

In [57]:
sklearn.metrics.accuracy_score(ytest, dy_test)

0.60532687651331718

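Very low divergence between training and testing accuracy! This suggests the model is not fitting closely enough to the data. But the test accuracy did improve by about 2.4 percentage points over no pipeline (58.1% to 60.5%). This will help us validate the use of random forests with a hard-coded split of car v. no car. **Still not better than guessing the dominant class (about 64%).**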

We can now examine the weights in each model.  The first model is the discrimination between car and not car.  The second model could be the discrimination between Train and SwissMetro if we are completely confident in breaking all of the cars in the first model.  In this model structure, the second step actually does the 3 class discrimination. 

We applied the standard scaling - zero mean and unit variance to all features, so it is possible to compare the weights more directly.
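`ml_ut.multistage_model` is not shown in this notebook, so the following is only a minimal sketch of the routing described above, with hypothetical function names and synthetic data. Each stage is assumed to be a `StandardScaler` followed by `LogisticRegression`. The actual helper may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

TRAIN, SM, CAR = 1, 2, 3

def fit_two_stage(X, y, car_av):
    """X: 2-D array, y: labels in {1, 2, 3}, car_av: boolean array."""
    # Stage 1: car v. not car, only on rows where a car is available.
    stage1 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    stage1.fit(X[car_av], y[car_av] == CAR)
    # Stage 2 sees the no-car rows plus the rows stage 1 predicts as not car.
    to_stage2 = ~car_av
    to_stage2[car_av] = ~stage1.predict(X[car_av])
    stage2 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    stage2.fit(X[to_stage2], y[to_stage2])  # all classes present in these rows
    return stage1, stage2

def predict_two_stage(stage1, stage2, X, car_av):
    is_car = np.zeros(len(X), dtype=bool)
    is_car[car_av] = stage1.predict(X[car_av])
    pred = np.full(len(X), CAR, dtype=int)
    if (~is_car).any():
        pred[~is_car] = stage2.predict(X[~is_car])
    return pred

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 4))
car_av = rng.random(n) < 0.8
y = np.where(X[:, 0] > 0, SM, TRAIN)
y[car_av & (X[:, 1] > 0.5)] = CAR

stage1, stage2 = fit_two_stage(X, y, car_av)
pred = predict_two_stage(stage1, stage2, X, car_av)
```

Because the second stage is trained on every class present in its rows, a CAR sample that the first stage misses can still be labeled CAR there.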

Recall that the following weights express the propensity toward car choice.

In [44]:
pd.DataFrame(list(zip(model1.steps[1][1].coef_[0], ML_feat)))

           0         1
0   0.452455  TRAIN_TT
1  -0.032741  TRAIN_CO
2   0.016610  TRAIN_HE
3   0.201682     SM_TT
4   0.358400     SM_CO
5   0.026891     SM_HE
6  -1.016450    CAR_TT
7  -0.336056    CAR_CO
8   1.134848    SURVEY
9   0.260097   LUGGAGE


The most important feature is where the survey was administered: if the survey was performed on a car trip, it is much more likely to correspond to a car choice. Travel times of car and of train are both fairly important. If the user had a GA (yearly ticket), they were much less likely to choose car. Overall, these weights are sensible.
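Since the features were standardized, one quick way to read such a table is to rank the weights by absolute value. The sketch below does this for the ten first-stage weights printed above. The column names `feature` and `weight` are chosen here for illustration.

```python
import pandas as pd

# First-stage weights copied from the output above.
weights = pd.DataFrame({
    "feature": ["TRAIN_TT", "TRAIN_CO", "TRAIN_HE", "SM_TT", "SM_CO",
                "SM_HE", "CAR_TT", "CAR_CO", "SURVEY", "LUGGAGE"],
    "weight": [0.452455, -0.032741, 0.01661, 0.201682, 0.3584,
               0.026891, -1.01645, -0.336056, 1.134848, 0.260097],
})

# Order rows by the magnitude of the weight, largest first.
order = weights["weight"].abs().sort_values(ascending=False).index
ranked = weights.reindex(order)
```

For these rows the ranking starts with SURVEY, CAR_TT and TRAIN_TT, in line with the reading above.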

We can now look at the weights in the second model. It is a one-versus-all classification problem, so each choice has its own set of weights, each expressing how a feature correlates with the propensity to choose that mode.

In [55]:
model2_weights = pd.DataFrame(list(zip(ML_feat, model2.steps[1][1].coef_[0], model2.steps[1][1].coef_[1], model2.steps[1][1].coef_[2])))
model2_weights.columns = ['Feature','Train Choice','SM Choice','Car Choice']
model2_weights

    Feature  Train Choice  SM Choice  Car Choice
0  TRAIN_TT     -0.409162   0.178377    0.423155
1  TRAIN_CO     -2.007174   1.858261   -0.037233
2  TRAIN_HE     -0.290533   0.154255    0.031135
3     SM_TT      0.329314  -0.375570    0.331431
4     SM_CO      1.623101  -1.592022    0.432404
5     SM_HE      0.103237  -0.073671    0.014438
6    CAR_TT      0.066243   0.360145   -1.194915
7    CAR_CO     -0.159817   0.269827   -0.409846
8    SURVEY     -0.705220  -0.356628    0.969460
9   LUGGAGE     -0.023662  -0.043234    0.173462


Travel time was consistently important, as was cost (but not for CAR). Survey location again shows an important correlation with car. Age is surprising: older customers seem to disfavor the new mode of transport.
Luggage, income, gender, first class travel, and SM seat configuration are all fairly unimportant.

## Different feature sets

In [62]:
simple_feats = ['SURVEY','LUGGAGE','INCOME','AGE','GA','SM_SEATS','CAR_AV','FIRST','MALE']
pure_feats = ['TRAIN_TT', 'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'CAR_TT','CAR_CO']
encode_feats = ['PURPOSE','TICKET','WHO','ORIGIN','DEST']
ratio_feats = ['min_CO', 'ratio_TRAIN_CO','ratio_SM_CO', 'ratio_CAR_CO', 'min_TT', 'ratio_TRAIN_TT',
               'ratio_SM_TT','ratio_CAR_TT']

### Use the ratio features

In [71]:
ML_feat = simple_feats + pure_feats + ratio_feats

In [72]:
model = LogisticRegression().fit(Xtrain[ML_feat], ytrain)

In [73]:
ml_ut.print_predict(X[ML_feat], model, ytrain, ytest, split)

Training Log Loss: 0.676122464949
Training Accuracy: 0.706512425021
Testing  Log Loss: 0.811695151413
Testing  Accuracy: 0.61985472155


The additional features give a small increase in testing accuracy (58.1% to 62.0%), although the testing log loss gets slightly worse (0.788 to 0.812).

In [74]:
model1, model2, dy_train, dy_test = ml_ut.multistage_model(Xtrain, Xtest, ytrain, ytest, ML_feat)

5319
6283
3717
2576


In [75]:
sklearn.metrics.accuracy_score(ytrain, dy_train)

0.62253641816623817

In [77]:
sklearn.metrics.confusion_matrix(ytrain, dy_train)

array([[  67, 1200,   11],
       [ 454, 3791,  208],
       [ 112,  658,  501]])

In [78]:
sklearn.metrics.accuracy_score(ytest, dy_test)

0.63976324993274147

In [79]:
sklearn.metrics.confusion_matrix(ytest, dy_test)

array([[   0,  125,   20],
       [   0, 1507,  256],
       [   0,  938,  871]])

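The ratio features help the pipeline model too, and more so on the test set (60.5% to 64.0%) than on the training set (60.6% to 62.3%). Strangely, the training set has lower accuracy than the test set, even though the model predicts no test samples as "TRAIN". Part of the explanation may be the class mix: the rows of the test confusion matrix sum to 145, 1763 and 1809, so TRAIN makes up only about 4% of the test set, against about 18% of the training set.

However, given the distribution of predictions, we can't yet say that it performs much better than guessing.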
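To make the distribution of predictions concrete, accuracy, per-class recall and the share of predictions per class can be read off the test confusion matrix above. This sketch only re-derives numbers from that output. scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, with labels in sorted order (1 = TRAIN, 2 = SM, 3 = CAR).

```python
import numpy as np

# Test-set confusion matrix copied from the output above.
cm = np.array([[0, 125, 20],
               [0, 1507, 256],
               [0, 938, 871]])

accuracy = np.trace(cm) / cm.sum()             # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)          # per true class: TRAIN, SM, CAR
predicted_share = cm.sum(axis=0) / cm.sum()    # fraction of predictions per class
```

The recall for TRAIN is exactly zero, SwissMetro recall is about 85%, and CAR recall is about 48%, so the overall accuracy hides very uneven per-class performance.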

### Include the encoded features
It may still include some information, even without expanding into OHE.
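For comparison, the one-hot expansion that is skipped here could be sketched with `pd.get_dummies`. The toy frame and its PURPOSE codes are made up for illustration.

```python
import pandas as pd

# Toy frame: PURPOSE is a non-ordinal integer code, TRAIN_TT is numeric.
toy = pd.DataFrame({"PURPOSE": [1, 2, 1, 3], "TRAIN_TT": [120, 90, 100, 80]})

# Each distinct PURPOSE code becomes its own indicator column.
toy_ohe = pd.get_dummies(toy, columns=["PURPOSE"], prefix="PURPOSE")
```

Without this expansion, a linear model treats the integer code as if it were an ordered quantity, which is why any information it extracts from these features is limited.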

In [82]:
ML_feat = simple_feats + pure_feats + ratio_feats + encode_feats

In [83]:
model = LogisticRegression().fit(Xtrain[ML_feat], ytrain)

In [84]:
ml_ut.print_predict(X[ML_feat], model, ytrain, ytest, split)

Training Log Loss: 0.665511907776
Training Accuracy: 0.713081976578
Testing  Log Loss: 0.78561370238
Testing  Accuracy: 0.638687113263


Again, the additional features give a small increase in testing accuracy (62.0% to 63.9%).

In [85]:
model1, model2, dy_train, dy_test = ml_ut.multistage_model(Xtrain, Xtest, ytrain, ytest, ML_feat)

5319
6286
3717
2425


In [86]:
sklearn.metrics.accuracy_score(ytrain, dy_train)

0.61768066266780919

In [87]:
sklearn.metrics.confusion_matrix(ytrain, dy_train)

array([[  70, 1193,   15],
       [ 494, 3755,  204],
       [ 116,  655,  500]])

In [88]:
sklearn.metrics.accuracy_score(ytest, dy_test)

0.66020984665052462

In [89]:
sklearn.metrics.confusion_matrix(ytest, dy_test)

array([[   0,  128,   17],
       [   0, 1463,  300],
       [   0,  818,  991]])

With a pipeline model, we do actually show improvement over guessing the dominant class (66.0% test accuracy v. about 64%), even though no test samples are placed in the "TRAIN" class! This could indicate that the second model could be improved by being multinomial instead of one-versus-all.

### Try the multinomial model 
Within the pipeline.
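`ml_ut.multistage_model_multinomial` is not shown here. As a rough sketch of the difference between the two formulations in scikit-learn, on synthetic data: `OneVsRestClassifier` fits one independent binary model per class, while a plain `LogisticRegression` on three classes fits a single softmax (multinomial) model in recent scikit-learn versions with the default solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
# Three classes (1, 2, 3) determined by thresholds on the first feature.
y_toy = np.digitize(X_toy[:, 0], bins=[-0.5, 0.5]) + 1

# One-versus-all: three separate binary logistic regressions.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)

# Multinomial: one model whose class probabilities come from a softmax.
softmax = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
softmax_proba = softmax.predict_proba(X_toy)
```

In the one-versus-all case each class score is fit in isolation, whereas in the multinomial case the classes compete directly through the shared normalization.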

In [92]:
imp.reload(ml_ut)

<module 'ML_utils' from '/home/atproofer/SwissMetro/SwissMetroML/ML_utils.py'>

In [93]:
model1, model2, dy_train, dy_test = ml_ut.multistage_model_multinomial(Xtrain, Xtest, ytrain, ytest, ML_feat)

In [94]:
sklearn.metrics.accuracy_score(ytrain, dy_train)

0.61496715224221654

In [95]:
sklearn.metrics.confusion_matrix(ytrain, dy_train)

array([[  71, 1188,   19],
       [ 513, 3735,  205],
       [ 125,  646,  500]])

In [96]:
sklearn.metrics.accuracy_score(ytest, dy_test)

0.66182405165456015

In [97]:
sklearn.metrics.confusion_matrix(ytest, dy_test)

array([[   0,  128,   17],
       [   0, 1467,  296],
       [   0,  816,  993]])

This is a negligible improvement (66.2% v. 66.0% test accuracy), and the model still does not predict anything in the TRAIN class on the test set!