We will now compare machine learning (ML) classifiers and discrete choice models (DCMs) on the much larger and more complex London Passenger Mode Choice (LPMC) dataset.

# DCMs

We can now establish a DCM benchmark using the same steps as we did for the SwissMetro case study. 

[Click here for a description of the data](http://transp-or.epfl.ch/documents/technicalReports/CS_LPMC.pdf)

In [1]:
url_root = (
    'https://courses.edx.org/'
    'asset-v1:EPFLx+ChoiceModels2x+3T2021+type@asset+block@'
)

In [2]:
import numpy as np
import pandas as pd
import biogeme.database as db
import biogeme.biogeme as bio
from biogeme import models
from biogeme.expressions import Beta

# Read the data
df = pd.read_table(f'{url_root}lpmc.dat', sep='\t')
database = db.Database('lpmc', df)

# The following statement allows you to use the names of the variables
# as Python variables.
globals().update(database.variables)

This time we will use a more complex model specification.

In [3]:
# Parameters to be estimated
ASC_WALKING = Beta('ASC_WALKING', 0, None, None, 1)
ASC_CYCLING = Beta('ASC_CYCLING', 0, None, None, 0)
ASC_PT = Beta('ASC_PT', 0, None, None, 0)
ASC_DRIVING = Beta('ASC_DRIVING', 0, None, None, 0)
B_TIME_WALKING = Beta('B_TIME_WALKING', 0, None, None, 0)
B_TIME_CYCLING = Beta('B_TIME_CYCLING', 0, None, None, 0)
B_TIME_DRIVING = Beta('B_TIME_DRIVING', 0, None, None, 0)
B_COST_DRIVING = Beta('B_COST_DRIVING', 0, None, None, 0)
B_COST_PT = Beta('B_COST_PT', 0, None, None, 0)
B_TIME_PT_BUS = Beta('B_TIME_PT_BUS', 0, None, None, 0)
B_TIME_PT_RAIL = Beta('B_TIME_PT_RAIL', 0, None, None, 0)
B_TIME_PT_ACCESS = Beta('B_TIME_PT_ACCESS', 0, None, None, 0)
B_TIME_PT_INT = Beta('B_TIME_PT_INT', 0, None, None, 0)
B_TRAFFIC_DRIVING = Beta('B_TRAFFIC_DRIVING', 0, None, None, 0)

# Utility functions

V1 = (
    ASC_WALKING + 
    B_TIME_WALKING * dur_walking
)

V2 = (
    ASC_CYCLING +
    B_TIME_CYCLING * dur_cycling
)

V3 = (
    ASC_PT +
    B_COST_PT * cost_transit + 
    B_TIME_PT_ACCESS * dur_pt_access + 
    B_TIME_PT_RAIL * dur_pt_rail + 
    B_TIME_PT_BUS * dur_pt_bus +
    B_TIME_PT_INT * dur_pt_int
)
      
V4 = (
    ASC_DRIVING +
    B_TIME_DRIVING * dur_driving +
    B_COST_DRIVING * (cost_driving_fuel + cost_driving_ccharge) +
    B_TRAFFIC_DRIVING * driving_traffic_percent
)
      
# Associate utility functions with the numbering of alternatives
V = {1: V1,
     2: V2,
     3: V3,
     4: V4}

# Associate the availability conditions with the alternatives

av = {1: 1,
      2: 1,
      3: 1,
      4: 1}

In [4]:
# Definition of the model. This is the contribution of each
# observation to the log likelihood function.
logprob = models.loglogit(V, av, travel_mode)

# Create the Biogeme object
biogeme = bio.BIOGEME(database, logprob)
biogeme.modelName = 'lpmc_validation'

# Estimate the parameters
results = biogeme.estimate()
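
As a rough illustration of what `models.loglogit` evaluates for each observation, the sketch below computes the multinomial logit log probability with NumPy. The helper name `loglogit_numpy` and the utility values are made up for illustration and are not part of Biogeme; every alternative is assumed to be available, as in the specification above.

```python
import numpy as np

def loglogit_numpy(utilities, chosen):
    """Log probability of the chosen alternative under a logit model.

    utilities: dict mapping alternative id -> utility value (hypothetical)
    chosen: id of the chosen alternative
    """
    ids = list(utilities)
    v = np.array([utilities[i] for i in ids], dtype=float)
    # Subtract the maximum for numerical stability (the result is unchanged)
    v = v - v.max()
    log_denominator = np.log(np.exp(v).sum())
    return v[ids.index(chosen)] - log_denominator

# Hypothetical utilities for walking (1), cycling (2), PT (3) and driving (4)
example_V = {1: -1.2, 2: -2.5, 3: -0.4, 4: -0.1}
log_probs = {alt: loglogit_numpy(example_V, alt) for alt in example_V}
probs = {alt: float(np.exp(lp)) for alt, lp in log_probs.items()}
print(probs)
```

The probabilities sum to one across the four alternatives, and the alternative with the highest utility receives the highest probability.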

Validation splits the data into several randomly defined slices of
about the same size. Each slice serves in turn as a validation set:
the model is re-estimated on all the data except that slice, and the
estimated model is then applied to the slice. The log likelihood of
each observation in the validation set is reported in a dataframe,
so the output is a list of dataframes, one per slice. Here the slices
are built from household IDs, so that no household appears in both
the estimation set and the validation set.

In [5]:
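groups = 'household_id'

def split(slices):
    """Return (estimation, validation) pairs, grouped by household."""
    # Shuffle the household IDs and split them into `slices` groups
    ids = df[groups].unique()
    np.random.shuffle(ids)
    the_slices_ids = np.array_split(ids, slices)
    # Build one dataframe per group of households
    theSlices = [
        df[df[groups].isin(slice_ids)]
        for slice_ids in the_slices_ids
    ]
    estimationSets = []
    validationSets = []
    for i, v in enumerate(theSlices):
        # Estimate on every slice except slice i, validate on slice i
        estimationSets.append(
            pd.concat(theSlices[:i] + theSlices[i + 1:])
        )
        validationSets.append(v)
    return zip(estimationSets, validationSets)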

In [6]:
validationData = split(slices=5)

validation_results = biogeme.validate(results, validationData)

for validation_slice in validation_results:
    print(
        f'Log likelihood for {validation_slice.shape[0]} validation data: '
        f'{validation_slice["Loglikelihood"].mean()}'
    )

Log likelihood for 16511 validation data: -0.8467425977830267
Log likelihood for 16425 validation data: -0.8330816228984251
Log likelihood for 16050 validation data: -0.8310820577984016
Log likelihood for 16072 validation data: -0.8397447330034897
Log likelihood for 16028 validation data: -0.8379201327242661
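
To get a single benchmark figure, the per-slice values can be averaged. The sketch below copies the five means and slice sizes from the output above (the variable names are illustrative) and computes both a simple and a size-weighted average using only the standard library.

```python
from statistics import mean, stdev

# Mean log likelihood per observation for each slice, copied from the output above
slice_means = [
    -0.8467425977830267,
    -0.8330816228984251,
    -0.8310820577984016,
    -0.8397447330034897,
    -0.8379201327242661,
]
# Number of validation observations in each slice, also from the output above
slice_sizes = [16511, 16425, 16050, 16072, 16028]

simple_mean = mean(slice_means)
# Weight each slice by its size to get the mean over all observations
weighted_mean = (
    sum(m * n for m, n in zip(slice_means, slice_sizes)) / sum(slice_sizes)
)
spread = stdev(slice_means)

print(f'Simple mean:   {simple_mean:.4f}')
print(f'Weighted mean: {weighted_mean:.4f}')
print(f'Std. dev. across slices: {spread:.4f}')
```

Because the slices are nearly the same size, the two averages are almost identical, at roughly -0.84 per observation.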


# Machine learning

Again we can use `scikit-learn` to investigate ML classifiers. 

The way we set up and use our models is exactly the same. 

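However, as our data is panel data, with several trips recorded per household, we need to perform grouped cross-validation: all trips from a household are kept in the same fold, so a model is never validated on a household it was trained on.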
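
The effect of grouping can be checked on a small example. The sketch below uses made-up household IDs (not the LPMC data) to confirm that `GroupKFold` never places the same household in both the training and test indices of a fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 trips made by 6 households
toy_households = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 6, 6])
toy_X = np.arange(len(toy_households)).reshape(-1, 1)

overlaps = []
splitter = GroupKFold(n_splits=3)
for train_idx, test_idx in splitter.split(toy_X, groups=toy_households):
    train_groups = set(toy_households[train_idx])
    test_groups = set(toy_households[test_idx])
    # Households present on both sides of the split (should be none)
    overlaps.append(train_groups & test_groups)
    print('Test households:', sorted(test_groups))
```

An ordinary `KFold` on the same data could split the trips of one household across training and test sets, which tends to give optimistic validation scores.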

In [7]:
from sklearn.model_selection import cross_val_score, GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

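# Columns holding the target or identifiers, which are not used as features
target_context = ['travel_mode', 'trip_id', 'household_id', 'person_n', 'trip_n']
y = df.travel_mode
# Standardise every remaining column to zero mean and unit variance
feature_cols = [col for col in df.columns if col not in target_context]
X = StandardScaler().fit_transform(df[feature_cols])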

lr = LogisticRegression(C=0.1, max_iter=1000)
lr_scores = cross_val_score(
    lr, X, y, 
    cv=GroupKFold(n_splits=5), 
    groups=df.household_id, 
    scoring='neg_log_loss')
lr_scores

array([-0.69158506, -0.68885028, -0.68355673, -0.70359612, -0.70837667])

In [8]:
rf = RandomForestClassifier(n_estimators=100, max_depth=6)
rf_scores = cross_val_score(
    rf, X, y, 
    cv=GroupKFold(n_splits=5), 
    groups=df.household_id, 
    scoring='neg_log_loss')
rf_scores

array([-0.71498655, -0.71687217, -0.71371533, -0.72852786, -0.73048802])
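
One caveat about the cells above: `StandardScaler` is fitted on the full dataset before cross-validation, so each validation fold contributes to the scaling statistics. A possible alternative, sketched below on synthetic data (the arrays `X_raw`, `y_toy` and `households` are stand-ins, not the LPMC data), is to wrap the scaler and the classifier in a pipeline so that the scaler is refitted on the training folds only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features, chosen mode and household ids
rng = np.random.default_rng(0)
n = 400
X_raw = rng.normal(size=(n, 3)) * [1.0, 10.0, 100.0]  # very different scales
y_toy = rng.integers(1, 5, size=n)                     # four modes coded 1..4
households = rng.integers(0, 80, size=n)

# The scaler is fitted inside each training fold, never on validation rows
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.1, max_iter=1000),
)
pipe_scores = cross_val_score(
    pipe, X_raw, y_toy,
    cv=GroupKFold(n_splits=5),
    groups=households,
    scoring='neg_log_loss',
)
print(pipe_scores)
```

With simple standardisation the difference in scores is usually small, but the pipeline form keeps the validation folds strictly out of sample.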

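On this larger, more complex dataset, the machine learning models appear to outperform the MNL model in out-of-sample validation: the mean log likelihood per observation is about -0.70 for logistic regression and -0.72 for the random forest, against about -0.84 for the MNL model. The comparison is not strictly like for like, since the ML models use every available column as a feature, while the MNL model uses only the variables that appear in its utility functions.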
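
The two sets of scores can be compared because `neg_log_loss` is the mean log probability assigned to the chosen class, which is the same quantity as the mean log likelihood per observation reported for the DCM. The sketch below checks this on a few made-up probabilities (the arrays are hypothetical, not model output).

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical predicted probabilities for 3 trips and 4 modes (rows sum to 1)
proba = np.array([
    [0.10, 0.05, 0.25, 0.60],
    [0.50, 0.10, 0.30, 0.10],
    [0.05, 0.05, 0.70, 0.20],
])
# Hypothetical chosen modes, coded 1..4 as in the travel_mode column
chosen = np.array([4, 1, 3])

# Mean log likelihood per observation: log probability of the chosen mode
mean_loglik = np.log(proba[np.arange(len(chosen)), chosen - 1]).mean()

# scikit-learn's log loss is the negative of the same quantity
sk_log_loss = log_loss(chosen, proba, labels=[1, 2, 3, 4])

print(mean_loglik, -sk_log_loss)
```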

However, there is a lot more we could explore, including different model specifications for the DCM, as well as different ML algorithms and hyperparameter values!