In this notebook we will use machine learning algorithms to predict mode choice for the SwissMetro dataset, using the `scikit-learn` library. We will compare the procedure and results with discrete choice models using the `BIOGEME` library. 

# DCMs 

We can first establish a discrete choice model benchmark on the SwissMetro data using Biogeme. 

Version of Biogeme

In [1]:
import biogeme.version as ver
print(ver.getText())

biogeme 3.2.8 [2021-09-02]
Version entirely written in Python
Home page: http://biogeme.epfl.ch
Submit questions to https://groups.google.com/d/forum/biogeme
Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)



In [2]:
url_root = (
    'https://courses.edx.org/'
    'asset-v1:EPFLx+ChoiceModels2x+3T2021+type@asset+block@'
)

First we can import our libraries and setup the data:

In [3]:
import numpy as np
import pandas as pd
import biogeme.database as db
import biogeme.biogeme as bio
from biogeme import models
from biogeme.expressions import Beta

# Read the data
df = pd.read_table(f'{url_root}swissmetro.dat', sep='\t')
database = db.Database('swissmetro', df)

# The following statement allows you to use the names of the variable
# as Python variable.
globals().update(database.variables)

# Removing some observations
exclude = ((PURPOSE != 1) * (PURPOSE != 3) + (CHOICE == 0)) > 0
database.remove(exclude)

We can now specify our model. We will define a very simple model here, including only generic parameters for time and cost:

In [4]:
# Parameters to be estimated
ASC_CAR = Beta('ASC_CAR', 0, None, None, 0)
ASC_TRAIN = Beta('ASC_TRAIN', 0, None, None, 0)
ASC_SM = Beta('ASC_SM', 0, None, None, 1)
B_TIME = Beta('B_TIME', 0, None, None, 0)
B_COST = Beta('B_COST', 0, None, None, 0)


# Definition of new variables
SM_COST = SM_CO * (GA == 0)
TRAIN_COST = TRAIN_CO * (GA == 0)
CAR_AV_SP = CAR_AV * (SP != 0)
TRAIN_AV_SP = TRAIN_AV * (SP != 0)
TRAIN_TT_SCALED = TRAIN_TT / 100.0
TRAIN_COST_SCALED = TRAIN_COST / 100
SM_TT_SCALED = SM_TT / 100.0
SM_COST_SCALED = SM_COST / 100
CAR_TT_SCALED = CAR_TT / 100
CAR_CO_SCALED = CAR_CO / 100

# Definition of the utility functions
V1 = ASC_TRAIN + B_TIME * TRAIN_TT_SCALED + B_COST * TRAIN_COST_SCALED
V2 = ASC_SM + B_TIME * SM_TT_SCALED + B_COST * SM_COST_SCALED
V3 = ASC_CAR + B_TIME * CAR_TT_SCALED + B_COST * CAR_CO_SCALED

# Associate utility functions with the numbering of alternatives
V = {1: V1, 2: V2, 3: V3}

# Associate the availability conditions with the alternatives
av = {1: TRAIN_AV_SP, 2: SM_AV, 3: CAR_AV_SP}

# Definition of the model. This is the contribution of each
# observation to the log likelihood function.
logprob = models.loglogit(V, av, CHOICE)

# Create the Biogeme object
biogeme = bio.BIOGEME(database, logprob)
biogeme.modelName = 'swiss_metro_validation'

# Estimate the parameters
results = biogeme.estimate()

As we are comparing with machine learning, we need to perform out of sample validation. 

The `split` function below organizes the data into several slices (in this case 5) of about the same size, randomly defined. We make sure that groups of observations belong to the same data set.
Each slice is considered as a validation dataset. 

In [5]:
groups = 'ID'
def split(slices):
    ids = df[groups].unique()
    np.random.shuffle(ids)
    the_slices_ids = np.array_split(ids, slices)
    theSlices = [
        df[df[groups].isin(ids)]
        for ids in the_slices_ids
    ]
    estimationSets = []
    validationSets = []
    for i, v in enumerate(theSlices):
        estimationSets.append(
            pd.concat(theSlices[:i] + theSlices[i + 1:])
        )
        validationSets.append(v)
    return zip(estimationSets, validationSets)
    

In [6]:
validationData = split(slices=5)

The `validate` function then re-estimates the model using all the data except the slice, and the estimated model is applied on the validation set (i.e. the slice). 
The value of the log likelihood for each observation in the validation set is reported in a dataframe. 
As this is done for each slice, the output is a list of dataframes, each corresponding to one of these exercises.

In [7]:
validation_results = biogeme.validate(results, validationData)

We can then output the normalized log-likelihood loss by calculating the mean of each of the log-likelihood values for each observation in the data. 

In [8]:
for slide in validation_results:
    print(
        f'Normalised log likelihood for {slide.shape[0]} validation data: '
        f'{slide["Loglikelihood"].mean()}'
    )

Normalised log likelihood for 1359 validation data: -0.7209414127335014
Normalised log likelihood for 1359 validation data: -0.8602697129740046
Normalised log likelihood for 1350 validation data: -0.800106220549936
Normalised log likelihood for 1350 validation data: -0.7671216275484689
Normalised log likelihood for 1350 validation data: -0.7900951793680111


# Machine learning

We can now try to test alternative machine learning classifiers using `scikit-learn`. 

First we need to import our libraries. 
We will investigate the Logistic Regression and Random Forest classifiers.

We will also need to scale our data using the standard scaler, and evaluate the models using cross-validation.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

Now we can set up our data. 

The machine learning model will consider all features provided in the input dataset. 

We therefore need to remove the target that we wish to predict and the context columns that don't tell us anything about the choice situation. 

We store the target as `y`. 

We then scale our input data and store it as `X`.

In [10]:
target_context = ['CHOICE', 'GROUP', 'SURVEY', 'SP', 'ID']
y = df.CHOICE
X_unscaled = df[[col 
                 for col in df.columns 
                 if col not in ['CHOICE']]]
X = StandardScaler().fit_transform(X_unscaled)

Now we are ready to investigate our machine learning models. 

We can first try the logistic regression classifier. 

It uses the same logistic function as the discrete choice model, though all variables are included uniformly for all choices, and so does not output utilities. 

We first define our classifier, specifying any parameters we may need. 

Here we use a smaller `C` parameter, increasing the strength of the `L2` regularization. 

We can then perform cross-validation in the same way we did for the discrete choice model, remembering to specify the scoring to use the normalised log likelihood.

In [11]:
lr = LogisticRegression(C=0.001)
lr_scores = cross_val_score(
    lr, X, y, cv=5, groups=df.ID, scoring='neg_log_loss'
)
lr_scores

array([-0.93692112, -1.00159854, -0.81403925, -0.72420042, -0.93098119])

We could also repeat the procedure for a different model, for instance a Random Forest Classifier...

In [12]:
rf = RandomForestClassifier(
    n_estimators=100, max_depth=3
)
rf_scores = cross_val_score(
    rf, X, y, cv=5, groups=df.ID, scoring='neg_log_loss'
)
rf_scores

array([-1.03909524, -1.07049627, -0.96614977, -0.77450147, -1.24844636])

It seems that the machine learning models do not perform as well with these parameters as our very simple discrete choice model. 

This is because the SwissMetro dataset is very small, and there is not enough data for the models to generalize to. 

We can therefore see what happens when we use a larger dataset. 