**Michael Bocamazo**

**Question 2**: How well does logistic regression perform as a baseline with minimal feature engineering and one-hot expansion?

**Date**: 2016/10/24

**Methods**: Select only the features usable within logistic regression.  Choose a few appropriate for a one-hot encoding (OHE).  Train logistic regression.  Cross-validate the ridge term.  Add in the OHE features.  Examine the resulting weights, and try to identify where prediction fails.

**Conclusion**: 

In [1]:
import os, sys
import csv
import copy
import time
import numpy as np
import pandas as pd
import sklearn
import sklearn.metrics  # log_loss and accuracy_score are used below
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("darkgrid", {"grid.linewidth": .5, "axes.facecolor": ".9"})

In [5]:
import ML_utils as ml_ut

From the previous question, we have a clean data frame, and a data frame with expanded features.

In [6]:
df_c = pd.read_csv('SM_clean.csv')

In [7]:
df_c.columns

Index(['SURVEY', 'ID', 'PURPOSE', 'FIRST', 'TICKET', 'WHO', 'LUGGAGE', 'AGE',
       'MALE', 'INCOME', 'GA', 'ORIGIN', 'DEST', 'CAR_AV', 'TRAIN_TT',
       'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'SM_SEATS', 'CAR_TT',
       'CAR_CO', 'CHOICE'],
      dtype='object')

In [8]:
df_e = pd.read_csv('SM_expand.csv')

In [9]:
df_e.columns

Index(['SURVEY', 'ID', 'PURPOSE', 'FIRST', 'TICKET', 'WHO', 'LUGGAGE', 'AGE',
       'MALE', 'INCOME', 'GA', 'ORIGIN', 'DEST', 'CAR_AV', 'TRAIN_TT',
       'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'SM_SEATS', 'CAR_TT',
       'CAR_CO', 'CHOICE', 'min_CO', 'ratio_TRAIN_CO', 'ratio_SM_CO',
       'ratio_CAR_CO', 'min_TT', 'ratio_TRAIN_TT', 'ratio_SM_TT',
       'ratio_CAR_TT'],
      dtype='object')

### Features to include
We can make a simple list of features that would be appropriate for a logistic regression model without any transformations.  The certain ones are the numeric and binary features.  Less certain are the ordinal-encoded features.  The nominal encoded features require an OHE.

The **pure feats** are the numeric features that we expect to correlate directly with choice, except for cost, which needs an adjustment based on GA to work well.  

The **simple feats** are binary or ordinal-encoded features.  The INCOME and AGE features each have a category for unknowns, placed at the endpoint of the scale as the greatest value.  If the true values behind the unknowns are distributed evenly across the known levels, the unknown category acts as a regularizer.  The unknowns could instead be assigned the middle value for a cleaner regularization.
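
A minimal sketch of moving the unknown category to the middle of an ordinal scale. The unknown code (5) and the toy values are assumptions for illustration; the real AGE and INCOME codes should be checked against the survey documentation, and `recode_unknown_to_mid` is a hypothetical helper, not part of `ML_utils`.

```python
import pandas as pd

def recode_unknown_to_mid(s, unknown_code):
    """Replace the 'unknown' code with the median of the known codes."""
    known = s[s != unknown_code]
    mid = known.median()
    # Keep known values as they are; substitute the median elsewhere.
    return s.where(s != unknown_code, mid)

# Toy ordinal column: codes 1-4 are known levels, 5 is assumed to mean "unknown".
age = pd.Series([1, 2, 2, 3, 4, 5, 5])
age_mid = recode_unknown_to_mid(age, unknown_code=5)
print(age_mid.tolist())
```

Using the median keeps the recoded value on the existing scale; whether this helps would still need to be tested against the original coding.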

The **ratio features** are those developed in the first pass to compare between alternatives.  These should be quite useful for random forests, because they condense the number of nodes needed to express the comparison of the cost features.  However, they might not be useful for logistic regression.  We can experiment.

Finally, there are 5 non-ordinal **encoded features** that we'll probably omit until we want much higher complexity.

In [10]:
simple_feats = ['SURVEY','LUGGAGE','INCOME','AGE','GA','SM_SEATS','CAR_AV','FIRST','MALE']
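# NOTE: SM_SEATS is binary and is also listed in simple_feats, so
# pure_feats + simple_feats selects that column twice.
pure_feats = ['TRAIN_TT', 'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'SM_SEATS', 'CAR_TT','CAR_CO']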
encode_feats = ['PURPOSE','TICKET','WHO','ORIGIN','DEST']
ratio_feats = ['min_CO', 'ratio_TRAIN_CO','ratio_SM_CO', 'ratio_CAR_CO', 'min_TT', 'ratio_TRAIN_TT',
               'ratio_SM_TT','ratio_CAR_TT']
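
For the nominal features, a one-hot expansion could look like the sketch below. It uses a toy frame rather than `df_c`, and the category codes are made up; only `pd.get_dummies` itself is a real pandas call.

```python
import pandas as pd

# Toy frame standing in for df_c; the category codes here are made up.
toy = pd.DataFrame({"WHO": [0, 1, 2, 1], "TRAIN_TT": [112, 103, 130, 95]})

# Expand only the nominal column; numeric columns pass through unchanged.
toy_ohe = pd.get_dummies(toy, columns=["WHO"], prefix="WHO")

# Names of the new indicator columns, ready to append to a feature list.
ohe_feats = [c for c in toy_ohe.columns if c.startswith("WHO_")]
print(ohe_feats)
```

Each row has exactly one indicator set per expanded feature, so a feature with k levels adds k columns. That is why the 26-level ORIGIN and DEST features are expensive to expand.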


## Multinomial Logistic Regression
Or, "Multinomial Logit"

In [29]:
from sklearn.linear_model import LogisticRegression

from sklearn.linear_model import LogisticRegressionCV

In [11]:
X = df_e.drop("CHOICE", axis = 1)
y = df_e['CHOICE']

The split was chosen based on session-split in q_00.

In [13]:
split = 7002

In [14]:
Xtrain, Xtest, ytrain, ytest = ml_ut.tt_split(X,y,split)
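
`ML_utils` is not shown here. As a rough sketch only, assuming `tt_split` slices rows at a fixed index and that the index was picked so that no session straddles the boundary, the two steps might look like this. Both function names below are hypothetical.

```python
import pandas as pd

def session_split_index(ids, target):
    """Return the first row index >= target at which the session ID changes."""
    ids = pd.Series(ids).reset_index(drop=True)
    for i in range(max(target, 1), len(ids)):
        if ids[i] != ids[i - 1]:
            return i
    return len(ids)

def tt_split_sketch(X, y, split):
    """Slice rows into train (before split) and test (from split on)."""
    return X.iloc[:split], X.iloc[split:], y.iloc[:split], y.iloc[split:]

# Toy data: three sessions of three rows each.
X_toy = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 2, 3, 3, 3], "f": range(9)})
y_toy = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3])
cut = session_split_index(X_toy["ID"], target=5)
Xtr, Xte, ytr, yte = tt_split_sketch(X_toy, y_toy, cut)
print(cut, len(Xtr), len(Xte))
```

Splitting on a session boundary keeps all rows from one respondent on the same side, so the test score reflects respondents the model has not seen.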

To start, we'll take the pure numeric features and the simple ordinal-encoded or binary features.

In [24]:
ML_feat = pure_feats + simple_feats

### Training

In [26]:
clf = LogisticRegression(multi_class='ovr').fit(Xtrain[ML_feat], ytrain)

In [32]:
clf2 = LogisticRegression(solver = "lbfgs", multi_class='multinomial').fit(Xtrain[ML_feat], ytrain)

In [47]:
clf_cv = LogisticRegressionCV(multi_class='ovr').fit(Xtrain[ML_feat], ytrain)
clf_cv2 = LogisticRegressionCV(multi_class='multinomial').fit(Xtrain[ML_feat], ytrain)

The attribute `C_` gives the best value of C, the inverse of the regularization strength, found for each class.  Because the `refit` parameter defaults to true for this model, the whole model is refit on all of the training data after finding the best hyperparameter C for each class.

In [37]:
clf_cv.C_

array([  2.15443469e+01,   1.00000000e-04,   1.00000000e-04])
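
A quick way to see whether a selected C sits on the edge of the search grid. When `Cs` is left as an integer (the default is 10), scikit-learn spaces the candidates logarithmically between 1e-4 and 1e4; `at_grid_edge` is a hypothetical helper written for this check.

```python
import numpy as np

# Candidate grid used when Cs is the default integer 10.
default_Cs = np.logspace(-4, 4, 10)

def at_grid_edge(best_C, Cs):
    """Flag each selected C that equals the smallest or largest candidate."""
    best_C = np.atleast_1d(best_C).astype(float)
    Cs = np.asarray(Cs, dtype=float)
    # Purely relative tolerance, because the candidates span many orders of magnitude.
    low = np.isclose(best_C, Cs.min(), rtol=1e-6, atol=0.0)
    high = np.isclose(best_C, Cs.max(), rtol=1e-6, atol=0.0)
    return low | high

# Values reported by clf_cv.C_ above.
edge = at_grid_edge([2.15443469e+01, 1e-4, 1e-4], default_Cs)
print(edge)
```

A value on the edge means the search may have been cut off, so the optimum could lie outside the grid.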

For the multinomial fit below, the three values are identical.  I think this is because the multinomial case learns one joint model, so a single C is selected for the whole set of classes.

In [56]:
clf_cv2.C_

array([ 0.35938137,  0.35938137,  0.35938137])

### Testing
We can predict on the whole set and then evaluate on the test and train separately.

In [40]:
py_x = clf.predict_proba(X[ML_feat])
dy_x = clf.predict(X[ML_feat])

In [46]:
print(sklearn.metrics.log_loss(ytrain, py_x[:split]))
print(sklearn.metrics.accuracy_score(ytrain, dy_x[:split]))

0.699283805597
0.69165952585


In [44]:
print(sklearn.metrics.log_loss(ytest, py_x[split:]))
print(sklearn.metrics.accuracy_score(ytest, dy_x[split:]))

0.784816724333
0.585687382298
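
The predict-then-score pattern is repeated for every model in this notebook, so it could be wrapped in a small helper. This is a sketch on synthetic data; `report` is a hypothetical name, and the numbers it yields here say nothing about the survey data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

def report(model, X, y, split):
    """Return {'train': (log loss, accuracy), 'test': (...)} for rows before/after `split`."""
    y = np.asarray(y)
    proba = model.predict_proba(X)
    pred = model.predict(X)
    scores = {}
    for name, rows in (("train", slice(None, split)), ("test", slice(split, None))):
        scores[name] = (
            # Pass the class labels so a class missing from one slice does not break scoring.
            log_loss(y[rows], proba[rows], labels=model.classes_),
            accuracy_score(y[rows], pred[rows]),
        )
    return scores

# Synthetic three-class data, only to show the call pattern.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = rng.randint(1, 4, size=300)
demo = LogisticRegression().fit(X_demo[:200], y_demo[:200])
scores = report(demo, X_demo, y_demo, split=200)
print(scores)
```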


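The accuracy is substantially higher on the training set (0.69 vs. 0.59), which is somewhat surprising: logistic regression is not a model that I think usually overfits.  Since the split is by session, the test rows come from respondents the model has never seen, which may account for part of the gap.  Let's look at the other models.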

In [49]:
py_x = clf2.predict_proba(X[ML_feat])
dy_x = clf2.predict(X[ML_feat])

In [50]:
print(sklearn.metrics.log_loss(ytrain, py_x[:split]))
print(sklearn.metrics.accuracy_score(ytrain, dy_x[:split]))

0.815291533847
0.651242502142


In [51]:
print(sklearn.metrics.log_loss(ytest, py_x[split:]))
print(sklearn.metrics.accuracy_score(ytest, dy_x[split:]))

0.962744442308
0.478611783697


The accuracy for the multinomial fit was worse than the one-versus-all method for the non-cross-validated case.

#### With Cross-Validation

In [53]:
py_x = clf_cv.predict_proba(X[ML_feat])
dy_x = clf_cv.predict(X[ML_feat])

In [54]:
print(sklearn.metrics.log_loss(ytrain, py_x[:split]))
print(sklearn.metrics.accuracy_score(ytrain, dy_x[:split]))

0.787363181432
0.657812053699


In [55]:
print(sklearn.metrics.log_loss(ytest, py_x[split:]))
print(sklearn.metrics.accuracy_score(ytest, dy_x[split:]))

0.942837909585
0.478880817864


Surprisingly, we do substantially better without the cross-validation.  Perhaps there is another setting that I am overlooking.  I did see that two of the three best C values for the one-vs-all model sat at the lower bound of the default search grid (1e-4), which could mean that the bounds need to be widened.

In [57]:
py_x = clf_cv2.predict_proba(X[ML_feat])
dy_x = clf_cv2.predict(X[ML_feat])

In [58]:
print(sklearn.metrics.log_loss(ytrain, py_x[:split]))
print(sklearn.metrics.accuracy_score(ytrain, dy_x[:split]))

print(sklearn.metrics.log_loss(ytest, py_x[split:]))
print(sklearn.metrics.accuracy_score(ytest, dy_x[split:]))

0.806953372192
0.651242502142
0.953894843485
0.477804681195


#### Increase bounds of C parameter
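The best-performing model was the dead-simplest.  Let's try widening the search range for C.  The grid below extends the lower bound to 1e-10; note that its upper bound, 1e2, is below the default grid's 1e4.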

In [67]:
np.logspace(-1,2,4)

array([   0.1,    1. ,   10. ,  100. ])

In [74]:
model = LogisticRegressionCV(Cs = np.logspace(-10,2,13)).fit(Xtrain[ML_feat], ytrain)

py_x = model.predict_proba(X[ML_feat])
dy_x = model.predict(X[ML_feat])
print(sklearn.metrics.log_loss(ytrain, py_x[:split]))
print(sklearn.metrics.accuracy_score(ytrain, dy_x[:split]))
print(sklearn.metrics.log_loss(ytest, py_x[split:]))
print(sklearn.metrics.accuracy_score(ytest, dy_x[split:]))

0.802159414805
0.659097400743
1.00768252553
0.474307237019


In [76]:
model.C_

array([  1.00000000e+01,   1.00000000e-06,   1.00000000e-10])

In [77]:
model = LogisticRegression().fit(Xtrain[ML_feat], ytrain)

py_x = model.predict_proba(X[ML_feat])
dy_x = model.predict(X[ML_feat])
print(sklearn.metrics.log_loss(ytrain, py_x[:split]))
print(sklearn.metrics.accuracy_score(ytrain, dy_x[:split]))
print(sklearn.metrics.log_loss(ytest, py_x[split:]))
print(sklearn.metrics.accuracy_score(ytest, dy_x[split:]))

0.699283805597
0.69165952585
0.784816724333
0.585687382298


In [79]:
model.C

1.0

### Issue of C parameter
The regularization parameter is the only real hyperparameter to be tuned for logistic regression, besides possibly the choice of one-vs-all against multinomial.  It is possible that the models fit inside cross-validation are **worse with less data**, because each fold trains on only part of the training set, and so they select sub-optimal values of C.  That is the only explanation I can think of.  We can move on to the inclusion or exclusion of feature sets.
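
Two other differences might be worth ruling out before settling on that explanation: the features are on very different scales, and the default CV folds ignore sessions. Depending on the scikit-learn version, `LogisticRegression` and `LogisticRegressionCV` may also default to different solvers. The sketch below, on synthetic data with `groups` standing in for the session ID, shows one way to standardize the inputs and keep each session inside a single fold. It is an experiment to try, not a confirmed fix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 sessions of 5 rows, features on very different scales.
rng = np.random.RandomState(0)
groups = np.repeat(np.arange(60), 5)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = rng.randint(1, 4, size=300)

# Folds that keep every session entirely inside one fold.
folds = list(GroupKFold(n_splits=3).split(X_demo, y_demo, groups))

Cs = np.logspace(-4, 4, 9)
model_g = make_pipeline(
    StandardScaler(),  # put all features on a comparable scale
    LogisticRegressionCV(Cs=Cs, cv=folds, max_iter=1000),
)
model_g.fit(X_demo, y_demo)

# The selected C value(s) of the final pipeline step.
best_C = np.atleast_1d(model_g.steps[-1][1].C_)
print(best_C)
```

Note that the scaler here is fit on all training rows before the folds are formed, which is a small leak; it is kept that way only to keep the sketch short.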

## Feature Sets

In [80]:
ML_feat = pure_feats

In [81]:
model = LogisticRegression().fit(Xtrain[ML_feat], ytrain)

py_x = model.predict_proba(X[ML_feat])
dy_x = model.predict(X[ML_feat])
print(sklearn.metrics.log_loss(ytrain, py_x[:split]))
print(sklearn.metrics.accuracy_score(ytrain, dy_x[:split]))
print(sklearn.metrics.log_loss(ytest, py_x[split:]))
print(sklearn.metrics.accuracy_score(ytest, dy_x[split:]))

0.798716559381
0.65381319623
0.931596683912
0.482916330374


Worse than before: the model does better with the additional simple features.

In [82]:
ML_feat = pure_feats + simple_feats + ratio_feats

In [83]:
model = LogisticRegression().fit(Xtrain[ML_feat], ytrain)

py_x = model.predict_proba(X[ML_feat])
dy_x = model.predict(X[ML_feat])
print(sklearn.metrics.log_loss(ytrain, py_x[:split]))
print(sklearn.metrics.accuracy_score(ytrain, dy_x[:split]))
print(sklearn.metrics.log_loss(ytest, py_x[split:]))
print(sklearn.metrics.accuracy_score(ytest, dy_x[split:]))

0.679408553163
0.706369608683
0.817640880318
0.611783696529


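We **do gain information** from the ratio feats, which is encouraging: test accuracy rises from 0.586 to 0.612.  Note, however, that the test log loss gets slightly worse (0.818 vs. 0.785), so the predicted probabilities are not better calibrated.  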

In [85]:
# Accuracy scores from the simple random forest models
# log loss: 0.702412421415
# discrete accuracy: 0.684422921711

The accuracy still has some way to go, but now that we know that the logistic regression at least gives some meaningful result, we can investigate the relative strength of features.  
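
One simple way to compare feature strength is to rank the absolute coefficients after standardizing the inputs. The sketch below uses synthetic data with made-up feature names, so it only illustrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first column carries signal.
rng = np.random.RandomState(0)
feats = ["f_signal", "f_noise_a", "f_noise_b"]
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=feats)
y_demo = (X_demo["f_signal"] + 0.1 * rng.normal(size=500) > 0).astype(int)

# Standardize first so coefficient magnitudes are comparable across features.
Xs = StandardScaler().fit_transform(X_demo)
m = LogisticRegression().fit(Xs, y_demo)

# One row of weights per class (a single row in this binary toy case).
weights = pd.DataFrame(m.coef_, columns=feats)
ranking = weights.abs().mean(axis=0).sort_values(ascending=False)
print(ranking)
```

Without the standardization step, a feature measured in large units (such as travel time in minutes) would get a small weight regardless of its importance.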

## Columns to drop
We'll make a list of the column headers that we want to drop.  Some of the data is uninformative or redundant.

In [5]:
drop_cols = []

In [6]:
df1.columns

Index(['GROUP', 'SURVEY', 'SP', 'ID', 'PURPOSE', 'FIRST', 'TICKET', 'WHO',
       'LUGGAGE', 'AGE', 'MALE', 'INCOME', 'GA', 'ORIGIN', 'DEST', 'TRAIN_AV',
       'CAR_AV', 'SM_AV', 'TRAIN_TT', 'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO',
       'SM_HE', 'SM_SEATS', 'CAR_TT', 'CAR_CO', 'CHOICE'],
      dtype='object')

We want to know the meaningful data breakdown in each feature.  The 'GROUP' feature is equivalent to the 'SURVEY' feature, which encodes whether the survey was conducted on a train trip (0) or on a car trip (1).  These show the equivalence:

In [7]:
df1['GROUP'].value_counts()

3    6759
2    3969
Name: GROUP, dtype: int64

In [8]:
df1[df1['GROUP']==2]['SURVEY'].value_counts()

0    3969
Name: SURVEY, dtype: int64

In [9]:
df1[df1['GROUP']==3]['SURVEY'].value_counts()

1    6759
Name: SURVEY, dtype: int64
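
The same equivalence can be checked in one step with a cross-tabulation. This sketch uses a toy frame with the coding reported above (GROUP 2 with SURVEY 0, GROUP 3 with SURVEY 1); `is_one_to_one` is a hypothetical helper.

```python
import pandas as pd

# Toy frame with the same coding as the real columns.
toy = pd.DataFrame({"GROUP": [2, 2, 3, 3, 3], "SURVEY": [0, 0, 1, 1, 1]})

# Off-diagonal zeros show that each GROUP value occurs with one SURVEY value.
table = pd.crosstab(toy["GROUP"], toy["SURVEY"])
print(table)

def is_one_to_one(df, a, b):
    """True if each value of `a` pairs with one value of `b`, and vice versa."""
    a_to_b = df.groupby(a)[b].nunique().eq(1).all()
    b_to_a = df.groupby(b)[a].nunique().eq(1).all()
    return bool(a_to_b and b_to_a)

redundant = is_one_to_one(toy, "GROUP", "SURVEY")
print(redundant)
```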

In [10]:
drop_cols += ['GROUP']

The 'TRAIN_AV' and 'SM_AV' features are not informative - always 1.

In [11]:
df1['TRAIN_AV'].value_counts()

1    10728
Name: TRAIN_AV, dtype: int64

In [12]:
df1['SM_AV'].value_counts()

1    10728
Name: SM_AV, dtype: int64

In [13]:
drop_cols += ['TRAIN_AV', 'SM_AV']
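
Constant columns like these can also be found programmatically rather than one at a time. A small sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"TRAIN_AV": [1, 1, 1], "SM_AV": [1, 1, 1], "CAR_AV": [1, 0, 1]})

# Columns with a single distinct value carry no information for a model.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)
```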

However, 'CAR_AV' is meaningful, and differs from where the survey was taken.  It encodes whether the car is a possible output, so it could be used to segment the modeling into two.  The only issue is sharing learning or weights between the two models, or tree subsets.

In [14]:
df1['CAR_AV'].value_counts()

1    9045
0    1683
Name: CAR_AV, dtype: int64

The column 'SP' simply means stated preference survey, is fixed at 1, and so can be removed.

In [15]:
df1['SP'].value_counts()

1    10728
Name: SP, dtype: int64

In [16]:
drop_cols += ['SP']

Sex of the traveller:

In [17]:
df1['MALE'].value_counts()

1    8046
0    2682
Name: MALE, dtype: int64

Class of travel, within a train:

In [18]:
df1['FIRST'].value_counts()

0    5679
1    5049
Name: FIRST, dtype: int64

Note that the class of the traveller is still encoded if the survey is given based on a car trip.

In [19]:
df1[df1['SURVEY']==1]['FIRST'].value_counts()

1    3726
0    3033
Name: FIRST, dtype: int64

### Drop the unnecessary columns

In [20]:
drop_cols

['GROUP', 'TRAIN_AV', 'SM_AV', 'SP']

In [21]:
# Drop all of the listed columns in a single call
df1.drop(drop_cols, axis = 1, inplace = True)

In [22]:
df1.columns

Index(['SURVEY', 'ID', 'PURPOSE', 'FIRST', 'TICKET', 'WHO', 'LUGGAGE', 'AGE',
       'MALE', 'INCOME', 'GA', 'ORIGIN', 'DEST', 'CAR_AV', 'TRAIN_TT',
       'TRAIN_CO', 'TRAIN_HE', 'SM_TT', 'SM_CO', 'SM_HE', 'SM_SEATS', 'CAR_TT',
       'CAR_CO', 'CHOICE'],
      dtype='object')

# Description of Remaining Features
Let's give a prose description of all the features that haven't yet been referenced.  **SURVEY** is the type of trip, train or car, for which the survey was given.  **ID** is the ID of the user: this is the session ID.  There are 1192 different sessions.  Every row with the same ID has the same **CHOICE** value.

**PURPOSE** (9 categorical) was the purpose of the trip for which the survey was taken.  **TICKET** (11 categorical) is the kind of ticket: round-trip, one-way, half-day, seasonal, etc.  **WHO** (4 categorical) is who pays: unknown, self, employer, half-half.  **LUGGAGE** (3 categorical) is coded 0 for no pieces, 1 for one piece, and 3 for several pieces.  **AGE** (6 categorical) is an ordinal encoding of age, with the greatest value meaning unknown.  **INCOME** (5 categorical) is an ordinal encoding of income level.  **GA** is a binary for owning an annual ticket.  **ORIGIN** and **DEST** are 26-level categoricals encoding the region of travel.  They are probably too large to be useful, so while informative, they are unwieldy and could be dropped.

Then we have 3 types of straight numerical features, given for each mode: **TT** = travel time, in minutes, **CO** = cost, and **HE** = headway, or the period between trains.  Cost for the car is computed as a fixed cost per unit distance.  Cost for the train is computed based on the actual fare for an individual trip, or as the cost of the whole year with a GA.  This makes the data quite hard to use, because there is a very clear bimodality in the distribution of costs.  The information it encodes isn't actually useful.  We would have to count on different cost thresholds being learnt for the GA/no GA cases.

Finally, **SM_SEATS** is a binary for the kind of seats used on the Swiss Metro: airline style, or not.

We probably want to keep the numerical features, and we can hypothesize that they will be quite useful in choice prediction.  The origin and destination features probably have to be dropped, or reduced to only the most frequent 5 or so values.  Purpose and Ticket might be useful, but are still fairly large for a one-hot encoding given how many samples there are.  The ordinal categorical features could be useful, but often have codings for 'unknown' that are at an extreme and so would hurt any linear model.  They could be treated as NAs and then be filled by sampling from the known distribution.  However, it is probably necessary to A/B test these kinds of data-filling changes.
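
Filling unknowns by sampling from the known distribution could be sketched as follows. The unknown code (4) and the toy values are assumptions for illustration, and `fill_unknown_by_sampling` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def fill_unknown_by_sampling(s, unknown_code, seed=0):
    """Replace `unknown_code` with draws from the empirical distribution of known values."""
    rng = np.random.RandomState(seed)
    out = s.copy()
    unknown = out == unknown_code
    known_values = out[~unknown].to_numpy()
    # Sampling with replacement preserves the observed frequencies in expectation.
    out[unknown] = rng.choice(known_values, size=int(unknown.sum()))
    return out

income = pd.Series([1, 2, 2, 3, 4, 4, 2, 4])  # toy codes; 4 is assumed to mean "unknown"
filled = fill_unknown_by_sampling(income, unknown_code=4)
print(filled.tolist())
```

Fixing the seed makes the fill repeatable, which matters when comparing this variant against the original coding in an A/B test.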

#### Multiple Levels of Data
We could make several data sets: one with all of the features except for the truly unnecessary or redundant, and another, reduced in advance for simplicity, without the less promising features.  

In [23]:
df1.to_csv('SM_clean.csv', index = False)

After transformations, we can make another saved set.

# The Target
The target is the CHOICE value.  This probably makes sense to keep as a multiclass (multioutput) problem.  For every row for each ID, there is a different set of travel times, costs, headways, and other parameters that determine the desirability of each mode.  

It could however be transformed into a single output problem with session-level accuracy.  So each row could be converted into a set of different alternatives, each with their own travel-related parameters, but with the same customer parameters.  The target would be a binary chosen/not chosen feature.  Then, the output of the model would be a **utility**, rather than a probability, to be **normalized** to get the fractional 'shares allocation' of this user into different alternatives.  This is somewhat complicated by the fact that the different modes aren't easily comparable.  The car mode doesn't have a headway feature, and only SM has different seat configurations.  This generates about 3x the data, but a multioutput problem naturally creates models for each output.  
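
The normalization step can be made concrete. One common choice, and the one implied by a logit model, is the softmax: exponentiate each alternative's utility and divide by the sum over alternatives. The utilities below are hypothetical numbers, not model output.

```python
import numpy as np

def shares_from_utilities(u):
    """Softmax over alternatives: each row of utilities becomes shares summing to 1."""
    u = np.asarray(u, dtype=float)
    z = u - u.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical utilities for (train, SM, car) in two choice situations.
utilities = [[0.2, 1.0, 0.5],
             [1.5, 1.5, -np.inf]]  # -inf marks an unavailable car
shares = shares_from_utilities(utilities)
print(shares)
```

Setting an unavailable alternative's utility to negative infinity gives it a share of exactly zero, which is one way to handle rows where CAR_AV is 0.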

**Different Style**
It may be useful to train two different models: one to predict car vs. not car, trained on all of the car-user data, and a second to predict train vs. SM, trained on all of the train-only data plus the car-user rows where train or SM was chosen.

We have 1192 different individuals labelling their preferences, at 10728 total combinations of alternatives presented.

In [24]:
len(set(df1['ID']))

1192

In [25]:
len(df1)

10728
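
The per-session structure, including the claim that every row with the same ID shares one CHOICE value, can be checked with a group-by. The toy frame below deliberately includes one mixed session to show what a violation looks like.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "CHOICE": [2, 2, 1, 3, 3, 3],
})

# Rows per session and number of distinct choices per session.
per_session = toy.groupby("ID")["CHOICE"].agg(["size", "nunique"])
print(per_session)

# Sessions whose rows do not all share one CHOICE value.
mixed = per_session.index[per_session["nunique"] > 1].tolist()
print(mixed)
```

An empty `mixed` list on the real data would confirm the claim; a non-empty one would mean the choice varies within a session.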

### Feature Sets

Here are some possible appropriate machine learning feature sets.

In [29]:
ml_feat = ['SURVEY','FIRST','LUGGAGE','AGE','MALE','INCOME','GA','CAR_AV','TRAIN_TT',
          'TRAIN_CO','TRAIN_HE','SM_TT','SM_CO','SM_HE','SM_SEATS','CAR_TT','CAR_CO']

In [30]:
ml_feat = ['SURVEY','AGE','INCOME','GA','CAR_AV','TRAIN_TT',
          'TRAIN_CO','TRAIN_HE','SM_TT','SM_CO','SM_HE','CAR_TT','CAR_CO']

In [31]:
ml_feat = ['CAR_AV','TRAIN_TT','TRAIN_CO','TRAIN_HE','SM_TT','SM_CO','SM_HE','CAR_TT','CAR_CO']

### Eliminate Unknown Choice values

There is little point in including a class with nine samples in the model.  We could add it back in later if strictly necessary for comparison between models.

In [32]:
df_k = copy.deepcopy(df1)

In [33]:
df_k = df_k[df_k['CHOICE']!=0]

This eliminates nine rows.

In [34]:
len(df_k)

10719

### Valid Minimum Cost
We want to be able to create ratio features for the costs, which might improve learning by making direct comparisons.  For all rows, there is a train and SM cost, but there is only a car cost if the user has a car.  So to calculate a valid minimum for ratio features, we must account for this.

In [35]:
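def calc_valid_min(df):
    # Row-wise minimum cost over all three modes (uses the df argument, not the global df1)
    x = df[['SM_CO','TRAIN_CO','CAR_CO']].min(axis=1)
    # A minimum of 0 means there is no car cost, so fall back to the SM/train minimum
    no_car = x == 0
    x[no_car] = df.loc[no_car, ['SM_CO', 'TRAIN_CO']].min(axis=1)
    return x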

In [36]:
df_k.loc[:,'min_CO'] = calc_valid_min(df_k)

In [37]:
ratio_price_feats = ['ratio_TRAIN_CO','ratio_SM_CO','ratio_CAR_CO']

In [38]:
df_k.loc[:,'ratio_TRAIN_CO'] = df_k['TRAIN_CO']/df_k['min_CO']

df_k.loc[:,'ratio_SM_CO'] = df_k['SM_CO']/df_k['min_CO']
df_k.loc[:,'ratio_CAR_CO'] = df_k['CAR_CO']/df_k['min_CO']

### Valid Minimum Travel Time
Similarly, we want to create a set of travel time ratio features so the comparison can be done directly.

In [39]:
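def calc_valid_min2(df):
    # Row-wise minimum travel time over all three modes (uses the df argument, not the global df1)
    x = df[['SM_TT','TRAIN_TT','CAR_TT']].min(axis=1)
    # A minimum of 0 means there is no car travel time, so fall back to the SM/train minimum
    no_car = x == 0
    x[no_car] = df.loc[no_car, ['SM_TT', 'TRAIN_TT']].min(axis=1)
    return x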

In [41]:
df_k.loc[:,'min_TT'] = calc_valid_min2(df_k)

In [42]:
df_k.loc[:,'ratio_TRAIN_TT'] = df_k['TRAIN_TT']/df_k['min_TT']
df_k.loc[:,'ratio_SM_TT'] = df_k['SM_TT']/df_k['min_TT']
df_k.loc[:,'ratio_CAR_TT'] = df_k['CAR_TT']/df_k['min_TT']

In [43]:
ratio_feats = ratio_price_feats + ['ratio_TRAIN_TT','ratio_SM_TT','ratio_CAR_TT']

In [44]:
ml_feat = ['SURVEY','FIRST','LUGGAGE','AGE','MALE','INCOME','GA','CAR_AV','TRAIN_TT',
          'TRAIN_CO','TRAIN_HE','SM_TT','SM_CO','SM_HE','SM_SEATS','CAR_TT','CAR_CO'] + ratio_feats
          

### Save the result
We'll save these likely transformations.

In [45]:
df_k.to_csv('SM_expand.csv', index = False)