File 01-binary_netherlands.py


Michel Bierlaire

Thu Aug 8 09:55:47 2024






In [None]:

import pandas as pd
import biogeme.database as db
import biogeme.biogeme as bio
from IPython.core.display_functions import display
from biogeme.expressions import Beta, Variable, log, exp


The goal of this computer session is to

1. become familiar with the Python syntax in Biogeme and
2. estimate and interpret a binary logit model.

We are using an old dataset for a binary transportation mode choice,
collected in the Netherlands. The data set is available as
http://transp-or.epfl.ch/data/netherlands.dat, and its description is
available as
http://transp-or.epfl.ch/documents/technicalReports/CS_NetherlandsDescription.pdf.

We recommend downloading the dataset to your local directory.

# Data preparation

We first import the data into Pandas, using any interface that
Pandas allows. Here, we simply read the data from a text file, where
the data are separated by tabs.

In [None]:
df = pd.read_csv('netherlands.dat', sep='\t')
display(df)


We then import this database into Biogeme.

In [None]:
database = db.Database('netherlands', df)


We identify the columns that will be used as variables in our model.

In [None]:
sp = Variable('sp')
rail_ivtt = Variable('rail_ivtt')
rail_acc_time = Variable('rail_acc_time')
rail_egr_time = Variable('rail_egr_time')
car_ivtt = Variable('car_ivtt')
car_walk_time = Variable('car_walk_time')
car_cost = Variable('car_cost')
rail_cost = Variable('rail_cost')
choice = Variable('choice')


The data set contains both stated preferences (SP) and revealed
preferences (RP) data. We are using only RP data. Therefore, we
exclude the SP data.

In [None]:
exclude = sp != 0
database.remove(exclude)
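
To make the effect of the exclusion concrete, here is a minimal plain-Python sketch of the same filtering logic applied to a few made-up rows. The rows and the helper name `remove_rows` are illustrative assumptions; they are not part of Biogeme or of the actual data file.

```python
# Hypothetical rows: 'sp' is 0 for revealed preferences (RP)
# and nonzero for stated preferences (SP).
rows = [
    {'id': 1, 'sp': 0, 'choice': 0},
    {'id': 2, 'sp': 1, 'choice': 1},
    {'id': 3, 'sp': 0, 'choice': 1},
    {'id': 4, 'sp': 1, 'choice': 0},
]


def remove_rows(data, exclude):
    """Keep only the rows for which the exclusion condition is false."""
    return [row for row in data if not exclude(row)]


# Same condition as in the cell above: exclude every row where sp != 0.
rp_rows = remove_rows(rows, lambda row: row['sp'] != 0)
print([row['id'] for row in rp_rows])
```

Only the rows with `sp` equal to 0 survive, which is what we expect when keeping the RP observations.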


We can see that the data set has been reduced from 1739 rows to 228 rows.

In [None]:
print(f'Shape of the data: {database.data.shape}')


Here is the reduced data set.

In [None]:
display(database.data)


We can also define new variables from existing ones.

The total travel time by rail is the sum of the in-vehicle travel
time, the access time (time from the origin of the trip to the first
train station) and the egress time (time from the last train station
to the final destination).

In [None]:
rail_time = rail_ivtt + rail_acc_time + rail_egr_time


The total travel time by car is the sum of the in-vehicle travel
time and the walking time to and from the parking space.

In [None]:
car_time = car_ivtt + car_walk_time


The data set was collected before the introduction of the euro, and
the costs are coded in Dutch guilders. In order to simplify the
interpretation of the results, we convert the costs from guilders
into euros.

In [None]:
# Official fixed rate: 1 euro = 2.20371 guilders, that is, 1 guilder = 0.45378022 euro.
DUTCH_GUILDERS_TO_EUROS = 0.45378022
car_cost_euro = car_cost * DUTCH_GUILDERS_TO_EUROS
rail_cost_euro = rail_cost * DUTCH_GUILDERS_TO_EUROS


# Model specification

We are now ready to specify the choice model. We start with a simple
model that contains only one alternative specific constant:
\begin{align*}
V_\text{car} &= \text{ASC}_\text{car}, \\ V_\text{rail} &= 0.
\end{align*}

We define the unknown parameter using the Biogeme expression `Beta`,
which takes five arguments:

- the name of the parameter (it is advised to use the exact same
name for the corresponding Python variable),
- the starting value for the estimation (usually, 0),
- a lower bound on the value of the coefficient, or `None` for no
bound,
- an upper bound, or `None` for no bound,
- a status flag that is 1 if the parameter must be kept fixed
at its starting value, and 0 if it has to be estimated.

In [None]:
asc_car = Beta(name='asc_car', value=0, lowerbound=None, upperbound=None, status=0)


We now write the utility functions:

In [None]:
v_car = asc_car
v_rail = 0


And we write the choice model:

In [None]:
prob_car = 1 / (1 + exp(v_rail - v_car))
prob_rail = 1 - prob_car
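
The formula above is the binary logit probability. As a minimal numerical sketch in plain Python (no Biogeme involved), the following checks that it coincides with the more familiar form exp(v_car) / (exp(v_car) + exp(v_rail)). The function names and the utility values are assumptions chosen only for illustration.

```python
import math


def binary_logit_prob_car(v_car, v_rail):
    """Probability of choosing car: 1 / (1 + exp(v_rail - v_car))."""
    return 1.0 / (1.0 + math.exp(v_rail - v_car))


def ratio_form_prob_car(v_car, v_rail):
    """Same probability written as exp(v_car) / (exp(v_car) + exp(v_rail))."""
    return math.exp(v_car) / (math.exp(v_car) + math.exp(v_rail))


# Hypothetical utility values, chosen only for illustration.
v_car, v_rail = 0.5, 0.0
p_car = binary_logit_prob_car(v_car, v_rail)
p_rail = 1.0 - p_car
print(f'P(car) = {p_car:.4f}, P(rail) = {p_rail:.4f}')
```

When both utilities are equal, each alternative is chosen with probability 0.5; only the difference between the utilities matters.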


Biogeme needs the formula of the contribution of each observation to
the log likelihood function, which depends on the observed choice:

In [None]:
prob_observation = prob_car * (choice == 0) + prob_rail * (choice == 1)
logprob = log(prob_observation)
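
To see how the two indicator terms select the probability of the chosen alternative, here is a small plain-Python sketch. It assumes the coding implied by the formula above (0 for car, 1 for rail); the sample and the value of P(car) are made up for illustration.

```python
import math


def log_likelihood_contribution(choice, p_car):
    """Log probability of the observed choice (0 = car, 1 = rail)."""
    p_rail = 1.0 - p_car
    # Exactly one of the two comparisons is true, so only the
    # probability of the chosen alternative survives in the sum.
    p_observed = p_car * (choice == 0) + p_rail * (choice == 1)
    return math.log(p_observed)


# Hypothetical sample: three car choices and one rail choice, with P(car) = 0.6.
choices = [0, 0, 1, 0]
total = sum(log_likelihood_contribution(c, 0.6) for c in choices)
print(f'Sample log likelihood: {total:.4f}')
```

The log likelihood of the sample is the sum of these contributions over all observations, which is the quantity that the estimation maximizes.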


We initialize Biogeme with this expression, and the database.

In [None]:
biogeme = bio.BIOGEME(database, logprob)
biogeme.modelName = 'binary_netherlands'


# Estimation of the parameter(s)

We are now ready to estimate the parameter. Biogeme tries to read a
file `__binary_netherlands.iter` containing intermediary results
from a previous estimation run. If it does not find it, it triggers
a warning that can be ignored.

In [None]:
results = biogeme.estimate()
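
For this constant-only model, the maximum likelihood estimate can be checked by hand: the model must reproduce the observed shares, so that the estimate of the constant is log(n_car / n_rail), where n_car and n_rail are the numbers of observations choosing each mode. The sketch below illustrates this in plain Python. The counts are hypothetical, not the actual shares in the Netherlands data, and the function name is an assumption.

```python
import math


def log_likelihood(asc_car, n_car, n_rail):
    """Log likelihood of the constant-only model (v_car = asc_car, v_rail = 0)."""
    p_car = 1.0 / (1.0 + math.exp(-asc_car))
    return n_car * math.log(p_car) + n_rail * math.log(1.0 - p_car)


# Hypothetical counts, NOT the actual shares in the data.
n_car, n_rail = 60, 40

# Closed-form maximizer: the predicted share equals the observed share.
asc_hat = math.log(n_car / n_rail)

# Sanity check: moving away from asc_hat in either direction
# lowers the log likelihood.
best = log_likelihood(asc_hat, n_car, n_rail)
print(f'asc_car = {asc_hat:.4f}, log likelihood = {best:.4f}')
print(best > log_likelihood(asc_hat + 0.1, n_car, n_rail))
print(best > log_likelihood(asc_hat - 0.1, n_car, n_rail))
```

Comparing the value reported by Biogeme with the one obtained from the observed counts is a simple way to verify that the model has been coded as intended.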


The results are stored in an object that gives access to various
information about the estimation. Use the help function to obtain a
detailed description.

In [None]:
help(results)


We first display some summary information:

In [None]:
print(results.print_general_statistics())


Then we display the estimation results:

In [None]:
print(results.get_estimated_parameters())


The results are also available in an HTML file that can be opened in
your preferred browser.

In [None]:
print(f'HTML file: {results.data.htmlFileName}')


You now need to improve the model by including attributes: travel cost and travel time.
1. Try a specification where the coefficients of these attributes are generic.
2. Try a specification where the coefficients of these attributes are alternative specific.
3. Try a specification where the time coefficient is generic and the cost coefficient is alternative specific.
4. Try a specification where the cost coefficient is generic and the time coefficient is alternative specific.
5. Comment on the results. Identify your preferred model, and explain why.
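
One possible way to support the comparison, offered here as a suggestion rather than as the expected answer, is a likelihood ratio test. It applies only when one specification is a restricted version of the other: specification 1 is nested in specifications 2, 3 and 4, and specifications 3 and 4 are nested in specification 2, but specifications 3 and 4 are not nested in each other. The sketch below is plain Python; the log likelihood values are hypothetical and the function name is an assumption.

```python
def likelihood_ratio_test(loglike_restricted, loglike_unrestricted, critical_value):
    """Return the test statistic and whether the restricted model is rejected."""
    statistic = -2.0 * (loglike_restricted - loglike_unrestricted)
    return statistic, statistic > critical_value


# 5% critical values of the chi-square distribution,
# indexed by the number of degrees of freedom.
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991}

# Hypothetical final log likelihoods, for illustration only.
ll_generic = -120.0       # generic time and cost coefficients
ll_alt_specific = -116.5  # alternative specific time and cost coefficients

# The unrestricted model has two more parameters: 2 degrees of freedom.
statistic, reject = likelihood_ratio_test(
    ll_generic, ll_alt_specific, CHI2_CRITICAL_5PCT[2]
)
print(f'LR statistic = {statistic:.2f}, reject generic specification: {reject}')
```

The final log likelihood of each estimated model is part of the general statistics displayed earlier. The signs and magnitudes of the estimated coefficients should also be taken into account when selecting the preferred model.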