File 01-contingency_table.py


Michel Bierlaire

Wed Aug 7 18:03:31 2024






We first import various elements needed for the script.

The Pandas package, in charge of the management of the data.

In [1]:
import pandas as pd


The Biogeme interface with the database.

In [2]:
import biogeme.database as db


Biogeme itself, together with the `display` function used to show tables.

In [3]:
import biogeme.biogeme as bio
from IPython.core.display_functions import display


Finally, we import some expressions needed to write the model specification.

In [4]:
from biogeme.expressions import Beta, log, Variable



The objective of this exercise is to estimate the parameters of the
simple model introduced in the lecture, using Biogeme.

We want to build a model that predicts the market penetration of
electric vehicles (EV) as a function of the age category. We have a
sample of 2500 individuals. The data is defined using the Pandas
package.

Each row of the database corresponds to a cell of the contingency table.

In [5]:
data = pd.DataFrame(
    {
        'Age': [1, 1, 2, 2, 3, 3],
        'Electric': [1, 0, 1, 0, 1, 0],
        'Total': [65, 835, 55, 1045, 5, 495],
    }
)
display(data)


   Age  Electric  Total
0    1         1     65
1    1         0    835
2    2         1     55
3    2         0   1045
4    3         1      5
5    3         0    495
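To see the link between this long format and the usual two-way layout, the rows can be pivoted back into a contingency table. This is only an illustrative sketch using plain pandas; the names `table` and `row_totals` are introduced here and are not part of the original script.

```python
import pandas as pd

# Same long-format data as above: one row per cell of the contingency table.
data = pd.DataFrame(
    {
        'Age': [1, 1, 2, 2, 3, 3],
        'Electric': [1, 0, 1, 0, 1, 0],
        'Total': [65, 835, 55, 1045, 5, 495],
    }
)

# Rows: age category. Columns: Electric (0 = no, 1 = yes).
table = data.pivot(index='Age', columns='Electric', values='Total')

# Number of individuals observed in each age category.
row_totals = table.sum(axis=1)

print(table)
print(row_totals)
```

Each entry of `table` is exactly one row of the long-format database.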


We first import the data in the Biogeme database object.

In [6]:
database = db.Database('contingency', data)


Definition of the variables. We use the same names as the Pandas columns.

In [7]:
Age = Variable('Age')
Electric = Variable('Electric')
Total = Variable('Total')


Definition of the parameters to be estimated.

In [8]:
pi1 = Beta('pi1', 0.5, 0, 1, 0)
pi2 = Beta('pi2', 0.5, 0, 1, 0)
pi3 = Beta('pi3', 0.5, 0, 1, 0)



We associate with each observation the relevant parameter, depending
on the value of the variable `Age`. Note that an expression like
`Age == 1` returns 1 if true and 0 if false. Therefore, for each row
in the database, `pi` will be either `pi1`, `pi2` or `pi3`, depending
on the value of `Age`.

In [9]:

pi = (Age == 1) * pi1 + (Age == 2) * pi2 + (Age == 3) * pi3
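The same indicator trick can be illustrated outside Biogeme with NumPy arrays. The sketch below is not Biogeme code, and the three numerical values are arbitrary trial values chosen for illustration, not estimates.

```python
import numpy as np

# Hypothetical trial values for the three parameters (not estimates).
pi1, pi2, pi3 = 0.1, 0.2, 0.3

# The Age column of the database, one entry per row.
age = np.array([1, 1, 2, 2, 3, 3])

# Each comparison yields a 0/1 indicator, so exactly one term survives per row.
pi = (age == 1) * pi1 + (age == 2) * pi2 + (age == 3) * pi3

print(pi)
```

Rows with `age == 1` receive `pi1`, rows with `age == 2` receive `pi2`, and so on.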


The contribution of each observation to the log likelihood function
depends on the value of the variable `Electric`, and is counted
as many times as reported by `Total`, that is, the value of the
corresponding cell of the contingency table.

In [10]:

loglike = Total * log(pi) * (Electric == 1) + Total * log(1 - pi) * (
    Electric == 0
)
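As a cross-check of this formula, the log likelihood can be evaluated directly with NumPy. This is a minimal sketch that mirrors the expression above without using Biogeme; the helper `log_likelihood` is a name introduced here for illustration.

```python
import numpy as np

age = np.array([1, 1, 2, 2, 3, 3])
electric = np.array([1, 0, 1, 0, 1, 0])
total = np.array([65, 835, 55, 1045, 5, 495])


def log_likelihood(pi1, pi2, pi3):
    """Log likelihood of the contingency table for given probabilities."""
    pi = (age == 1) * pi1 + (age == 2) * pi2 + (age == 3) * pi3
    # Each cell contributes Total times the log probability of its outcome.
    contributions = total * np.where(electric == 1, np.log(pi), np.log(1 - pi))
    return contributions.sum()


# Value at the starting point used in the script (all parameters at 0.5).
at_start = log_likelihood(0.5, 0.5, 0.5)

# Value at the observed shares of electric vehicles in each age category.
at_shares = log_likelihood(65 / 900, 55 / 1100, 5 / 500)

print(at_start, at_shares)
```

The value at the observed shares is larger than at the starting point, as expected, since these shares maximize the function.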


We create an instance of Biogeme, combining the model and the data.

In [11]:
biogeme = bio.BIOGEME(database, loglike)
biogeme.modelName = 'contingency'


File biogeme.toml has been created


We estimate the parameters.

In [12]:
results = biogeme.estimate()


We obtain the results in a Pandas table.

In [13]:
pandas_results = results.get_estimated_parameters()
display(pandas_results)


       Value  Rob. Std err  Rob. t-test  Rob. p-value
pi1  0.09375      0.137074     0.683935      0.494016
pi2  0.09375      0.171811     0.545658      0.585301
pi3  0.09375      0.468423      0.20014      0.841371


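The maximum likelihood estimate of each parameter should be equal to the
observed share of electric vehicles in the corresponding category, which
we compute directly from the data.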

In [14]:
for i in (1, 2, 3):
    total_for_age = data[data['Age'] == i]['Total'].sum()
    electric_for_age = data[(data['Age'] == i) & (data['Electric'] == 1)][
        'Total'
    ].sum()
    print(f'pi_{i} = {electric_for_age / total_for_age:.3g}')



pi_1 = 0.0722
pi_2 = 0.05
pi_3 = 0.01
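The same shares can also be obtained without an explicit loop. The following sketch uses a pandas `groupby`; it is an alternative formulation offered for illustration, and the names `electric`, `everyone` and `shares` are not part of the original script.

```python
import pandas as pd

data = pd.DataFrame(
    {
        'Age': [1, 1, 2, 2, 3, 3],
        'Electric': [1, 0, 1, 0, 1, 0],
        'Total': [65, 835, 55, 1045, 5, 495],
    }
)

# Number of electric vehicle owners per age category.
electric = data[data['Electric'] == 1].groupby('Age')['Total'].sum()

# Number of individuals per age category.
everyone = data.groupby('Age')['Total'].sum()

# Share of electric vehicles in each age category.
shares = electric / everyone

for age, share in shares.items():
    print(f'pi_{age} = {share:.3g}')
```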


We can now predict future market shares for a scenario with a new
distribution of age.

In [15]:
pi1_estimate = pandas_results['Value']['pi1']
pi2_estimate = pandas_results['Value']['pi2']
pi3_estimate = pandas_results['Value']['pi3']

market_share = pi1_estimate * 0.25 + pi2_estimate * 0.50 + pi3_estimate * 0.25
print(f'Predicted market share: {100*market_share:.2g}%')


Predicted market share: 9.4%


Consider now a similar data set where we have collected data per income category, coded as follows:
1: low, 2: medium, 3: high.

In [5]:
income_data = pd.DataFrame(
    {
        'Income': [1, 1, 2, 2, 3, 3],
        'Electric': [1, 0, 1, 0, 1, 0],
        'Total': [15, 200, 50, 450, 135, 150],
    }
)
display(income_data)


   Income  Electric  Total
0       1         1     15
1       1         0    200
2       2         1     50
3       2         0    450
4       3         1    135
5       3         0    150


1. Estimate the parameters of the model predicting the choice of an
electric vehicle as a function of income.

2. Consider a scenario where the income distribution is as follows:
7.5% of the population with low income, 40% with medium income, and
52.5% with high income. Use the estimated model to forecast the market
share of electric vehicles under this scenario.

In [6]:
database = db.Database('contingency_1', income_data)

In [7]:
Income = Variable('Income')
Electric = Variable('Electric')
Total = Variable('Total')

In [8]:
pi1 = Beta('pi1', 0.5, 0, 1, 0)
pi2 = Beta('pi2', 0.5, 0, 1, 0)
pi3 = Beta('pi3', 0.5, 0, 1, 0)

In [9]:
pi = (Income == 1) * pi1 + (Income == 2) * pi2 + (Income == 3) * pi3

In [10]:
loglike = Total * log(pi) * (Electric == 1) + Total * log(1 - pi) * (
    Electric == 0
)

In [11]:
biogeme = bio.BIOGEME(database, loglike)
biogeme.modelName = 'contingency_1'

In [12]:
results = biogeme.estimate()

In [13]:
pandas_results = results.get_estimated_parameters()
display(pandas_results)

        Value  Rob. Std err  Rob. t-test  Rob. p-value
pi1  0.069767      0.091782      0.76014      0.447171
pi2       0.1      0.127279     0.785674      0.432058
pi3  0.473684      0.352574     1.343503      0.179109


In [15]:
for i in (1, 2, 3):
    total_for_income = income_data[income_data['Income'] == i]['Total'].sum()
    electric_for_income = income_data[
        (income_data['Income'] == i) & (income_data['Electric'] == 1)
    ]['Total'].sum()
    print(f'pi_{i} = {electric_for_income / total_for_income:.3g}')

pi_1 = 0.0698
pi_2 = 0.1
pi_3 = 0.474


In [16]:
pi1_estimate = pandas_results['Value']['pi1']
pi2_estimate = pandas_results['Value']['pi2']
pi3_estimate = pandas_results['Value']['pi3']

market_share = pi1_estimate * 0.075 + pi2_estimate * 0.40 + pi3_estimate * 0.525
print(f'Predicted market share: {100*market_share:.2g}%')

Predicted market share: 29%
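The forecast is a weighted sum of the per-category probabilities, with the scenario's population shares as weights. The sketch below reproduces it with NumPy, using the observed shares in place of the Biogeme estimates (the two coincide in the table above). The array names are introduced here for illustration.

```python
import numpy as np

# Observed counts per income category (low, medium, high).
electric = np.array([15, 50, 135])
totals = np.array([15 + 200, 50 + 450, 135 + 150])

# Share of electric vehicles in each income category.
shares = electric / totals

# Scenario: distribution of the population across income categories.
weights = np.array([0.075, 0.40, 0.525])
assert np.isclose(weights.sum(), 1.0)  # the weights must form a distribution

# Weighted sum of the category probabilities.
market_share = shares @ weights

print(f'Predicted market share: {100 * market_share:.2g}%')
```

Checking that the weights sum to one guards against a mistyped scenario.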
