# Module biogeme.biogeme 

## Examples of use of each function

This page is for programmers who need examples of how to use the functions of the module. The examples illustrate the syntax only; they do not correspond to any meaningful model. For examples of models, visit [biogeme.epfl.ch](http://biogeme.epfl.ch).

In [1]:
import datetime
print(datetime.datetime.now())

In [2]:
import biogeme.version as ver
print(ver.getText())

In [3]:
import biogeme.biogeme as bio
import biogeme.database as db
import pandas as pd
import numpy as np
from biogeme.expressions import Beta, Variable, exp

Define the verbosity of Biogeme

In [4]:
import biogeme.logging as blog
logger = blog.get_screen_logger(level=blog.DEBUG)
logger.info('Logger initialized')


[INFO] 2023-04-14 10:26:15,558 Logger initialized <2747012181.py:3>


## Definition of a database

In [5]:
df = pd.DataFrame({'Person':[1,1,1,2,2],
                   'Exclude':[0,0,1,0,1],
                   'Variable1':[1,2,3,4,5],
                   'Variable2':[10,20,30,40,50],
                   'Choice':[1,2,3,1,2],
                   'Av1':[0,1,1,1,1],
                   'Av2':[1,1,1,1,1],
                   'Av3':[0,1,1,1,1]})
myData = db.Database('test', df)

## Definition of various expressions

In [6]:
Variable1=Variable('Variable1')
Variable2=Variable('Variable2')
beta1 = Beta('beta1', -1.0, -3, 3, 0)
beta2 = Beta('beta2', 2.0, -3, 10, 0)
likelihood = -beta1**2 * Variable1 - exp(beta2 * beta1) \
    * Variable2 - beta2**4
simul = beta1 / Variable1 + beta2 / Variable2
dictOfExpressions = {'loglike': likelihood, 
                     'beta1': beta1,
                     'simul': simul}

## Creation of the BIOGEME object

In [7]:
myBiogeme = bio.BIOGEME(myData, dictOfExpressions)
myBiogeme.modelName = 'simpleExample'
print(myBiogeme)

[INFO] 2023-04-14 10:26:15,566 Parameters read from biogeme.toml <toml.py:58>


simpleExample: database [test]{'loglike': ((((-(beta1(init=-1.0) ** `2.0`)) * Variable1) - (exp((beta2(init=2.0) * beta1(init=-1.0))) * Variable2)) - (beta2(init=2.0) ** `4.0`)), 'beta1': beta1(init=-1.0), 'simul': ((beta1(init=-1.0) / Variable1) + (beta2(init=2.0) / Variable2))}


Note that, by default, Biogeme removes the unused variables from the database to optimize space.

In [8]:
myBiogeme.database.data.columns

Index(['Person', 'Exclude', 'Variable1', 'Variable2', 'Choice', 'Av1', 'Av2',
       'Av3'],
      dtype='object')

## calculateInitLikelihood

In [9]:
myBiogeme.calculateInitLikelihood()

[DEBUG] 2023-04-14 10:26:15,579 Log likelihood (N = 5):  -115.3003 <biogeme.py:837>


-115.30029248549191
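
The value above can be checked by hand. The following is a minimal sketch in plain Python, independent of Biogeme, assuming that the `loglike` expression is evaluated for each row of the database and summed; the helper name `loglike_row` is introduced here for illustration only.

```python
import math

# Columns of the example database that appear in the expression.
variable1 = [1, 2, 3, 4, 5]
variable2 = [10, 20, 30, 40, 50]

# Initial values declared in the Beta objects.
beta1, beta2 = -1.0, 2.0


def loglike_row(b1, b2, v1, v2):
    """Same formula as the 'loglike' expression, for one observation."""
    return -b1**2 * v1 - math.exp(b2 * b1) * v2 - b2**4


# Sum the contributions of the five observations.
total = sum(loglike_row(beta1, beta2, v1, v2)
            for v1, v2 in zip(variable1, variable2))
print(total)  # approximately -115.3003
```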

## calculateLikelihood

In [10]:
x = myBiogeme.id_manager.free_betas_values
xplus = [v + 1 for v in x]
print(xplus)

[0.0, 3.0]


In [11]:
myBiogeme.calculateLikelihood(xplus, scaled=True)

[DEBUG] 2023-04-14 10:26:15,583 Log likelihood (N = 5):       -555 <biogeme.py:837>


-111.0
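
The log line reports -555 for N = 5 while the call returns -111.0, which suggests that `scaled=True` divides the sum by the number of observations. A minimal sketch of that arithmetic in plain Python, under that assumption:

```python
import math

variable1 = [1, 2, 3, 4, 5]
variable2 = [10, 20, 30, 40, 50]

# The values stored in xplus.
beta1, beta2 = 0.0, 3.0

# Contribution of each observation to the 'loglike' expression.
rows = [-beta1**2 * v1 - math.exp(beta2 * beta1) * v2 - beta2**4
        for v1, v2 in zip(variable1, variable2)]

unscaled = sum(rows)             # sum over the observations
scaled = unscaled / len(rows)    # divided by the sample size
print(unscaled, scaled)
```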

In [12]:
myBiogeme.database.data

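|   | Person | Exclude | Variable1 | Variable2 | Choice | Av1 | Av2 | Av3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 10 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 2 | 20 | 2 | 1 | 1 | 1 |
| 2 | 1 | 1 | 3 | 30 | 3 | 1 | 1 | 1 |
| 3 | 2 | 0 | 4 | 40 | 1 | 1 | 1 | 1 |
| 4 | 2 | 1 | 5 | 50 | 2 | 1 | 1 | 1 |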


In [13]:
myBiogeme.database.data

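|   | Person | Exclude | Variable1 | Variable2 | Choice | Av1 | Av2 | Av3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 10 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 2 | 20 | 2 | 1 | 1 | 1 |
| 2 | 1 | 1 | 3 | 30 | 3 | 1 | 1 | 1 |
| 3 | 2 | 0 | 4 | 40 | 1 | 1 | 1 | 1 |
| 4 | 2 | 1 | 5 | 50 | 2 | 1 | 1 | 1 |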


By default, each observation has the same probability of being selected in the sample. The selection probability can instead be made proportional to the values of a column of the database, by setting the attribute `columnForBatchSamplingWeights` to the name of that column.

In [14]:
myBiogeme.columnForBatchSamplingWeights = 'Variable2'
myBiogeme.calculateLikelihood(xplus, scaled=True)

[DEBUG] 2023-04-14 10:26:15,595 Log likelihood (N = 5):       -555 <biogeme.py:837>


-111.0
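
The notion of a selection probability proportional to a column can be sketched without Biogeme. The snippet below is an illustration only, not Biogeme's implementation: it uses `random.choices` from the standard library, and the seed and sample size are arbitrary assumptions.

```python
import random

# Values of the column used as sampling weights.
variable2 = [10, 20, 30, 40, 50]

# Selection probability of each row: its weight divided by the total weight.
total_weight = sum(variable2)
probabilities = [w / total_weight for w in variable2]

# Draw three row indices (with replacement) according to these weights.
rng = random.Random(0)  # fixed seed, for reproducibility of the sketch
sample = rng.choices(range(len(variable2)), weights=variable2, k=3)

print(probabilities)
print(sample)
```

With these weights, the last row is five times more likely to be drawn than the first one.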

In [15]:
myBiogeme.database.data

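|   | Person | Exclude | Variable1 | Variable2 | Choice | Av1 | Av2 | Av3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 10 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 2 | 20 | 2 | 1 | 1 | 1 |
| 2 | 1 | 1 | 3 | 30 | 3 | 1 | 1 | 1 |
| 3 | 2 | 0 | 4 | 40 | 1 | 1 | 1 | 1 |
| 4 | 2 | 1 | 5 | 50 | 2 | 1 | 1 | 1 |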


## calculateLikelihoodAndDerivatives

In [16]:
f, g, h, bhhh = myBiogeme.\
    calculateLikelihoodAndDerivatives(xplus,
                                      scaled=True,
                                      hessian=True,
                                      bhhh=True)
print(f'f = {f}')
print(f'g = {g}')
print(f'h = {h}')
print(f'bhhh = {bhhh}')

[DEBUG] 2023-04-14 10:26:15,601 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03 BHHH norm:       1e+05 <biogeme.py:919>


f = -111.0
g = [ -90. -108.]
h = [[-270.  -30.]
 [ -30. -108.]]
bhhh = [[ 9900.  9720.]
 [ 9720. 11664.]]


Now the unscaled version

In [17]:
f, g, h, bhhh = myBiogeme.\
    calculateLikelihoodAndDerivatives(xplus,
                                      scaled=False,
                                      hessian=True,
                                      bhhh=True)
print(f'f = {f}')
print(f'g = {g}')
print(f'h = {h}')
print(f'bhhh = {bhhh}')

[DEBUG] 2023-04-14 10:26:15,605 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03 BHHH norm:       1e+05 <biogeme.py:919>


f = -555.0
g = [-450. -540.]
h = [[-1350.  -150.]
 [ -150.  -540.]]
bhhh = [[49500. 48600.]
 [48600. 58320.]]
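
The gradient and the BHHH matrix printed above can be reproduced by hand. The sketch below differentiates the `loglike` formula analytically for each observation, sums the per-observation gradients, and builds BHHH as the sum of their outer products. It is written in plain Python and is not Biogeme's code; the helper name `gradient_row` is introduced here for illustration.

```python
import math

variable1 = [1, 2, 3, 4, 5]
variable2 = [10, 20, 30, 40, 50]
b1, b2 = 0.0, 3.0  # the values stored in xplus


def gradient_row(b1, b2, v1, v2):
    """Derivatives of -b1**2 * v1 - exp(b2 * b1) * v2 - b2**4."""
    e = math.exp(b2 * b1)
    d_b1 = -2.0 * b1 * v1 - b2 * e * v2
    d_b2 = -b1 * e * v2 - 4.0 * b2**3
    return [d_b1, d_b2]


rows = [gradient_row(b1, b2, v1, v2)
        for v1, v2 in zip(variable1, variable2)]

# Gradient: sum of the per-observation gradients.
g = [sum(r[k] for r in rows) for k in range(2)]

# BHHH: sum of the outer products of the per-observation gradients.
bhhh = [[sum(r[i] * r[j] for r in rows) for j in range(2)]
        for i in range(2)]

# Scaled versions: divided by the number of observations.
n = len(rows)
g_scaled = [x / n for x in g]
bhhh_scaled = [[x / n for x in line] for line in bhhh]

print(g, bhhh)
print(g_scaled, bhhh_scaled)
```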


Using only a sample of the data

In [18]:
f, g, h, bhhh = myBiogeme.\
    calculateLikelihoodAndDerivatives(xplus,
                                      scaled=True,
                                      hessian=True,
                                      bhhh=True)
print(f'f = {f}')
print(f'g = {g}')
print(f'h = {h}')
print(f'bhhh = {bhhh}')

[DEBUG] 2023-04-14 10:26:15,608 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03 BHHH norm:       1e+05 <biogeme.py:919>


f = -111.0
g = [ -90. -108.]
h = [[-270.  -30.]
 [ -30. -108.]]
bhhh = [[ 9900.  9720.]
 [ 9720. 11664.]]


## likelihoodFiniteDifferenceHessian

In [19]:
myBiogeme.likelihoodFiniteDifferenceHessian(xplus)

[DEBUG] 2023-04-14 10:26:15,611 Log likelihood (N = 5):       -555 Gradient norm:      7e+02   <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,612 Log likelihood (N = 5):       -555 Gradient norm:      7e+02   <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,613 Log likelihood (N = 5):  -555.0002 Gradient norm:      7e+02   <biogeme.py:919>


array([[-1380.00020229,  -150.        ],
       [ -150.0000451 ,  -540.00005396]])
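
A finite-difference Hessian approximates each column of the Hessian by perturbing one parameter and differencing the gradient. The sketch below applies a forward difference to a hand-derived gradient of `loglike`. The forward scheme and the step size `eps` are assumptions made for illustration; Biogeme may use a different scheme or step.

```python
import math

variable1 = [1, 2, 3, 4, 5]
variable2 = [10, 20, 30, 40, 50]


def gradient(x):
    """Hand-derived gradient of the summed 'loglike' expression."""
    b1, b2 = x
    g1 = g2 = 0.0
    for v1, v2 in zip(variable1, variable2):
        e = math.exp(b2 * b1)
        g1 += -2.0 * b1 * v1 - b2 * e * v2
        g2 += -b1 * e * v2 - 4.0 * b2**3
    return [g1, g2]


def finite_difference_hessian(x, eps=1e-7):
    """Forward differences of the gradient, one parameter at a time."""
    g0 = gradient(x)
    n = len(x)
    h = [[0.0] * n for _ in range(n)]
    for j in range(n):
        shifted = list(x)
        shifted[j] += eps
        gj = gradient(shifted)
        for i in range(n):
            h[i][j] = (gj[i] - g0[i]) / eps
    return h


hessian = finite_difference_hessian([0.0, 3.0])
print(hessian)  # close to [[-1380, -150], [-150, -540]]
```

Under this hand derivation, the first diagonal entry is about -1380, in line with the finite-difference result above. It differs by 30 from the analytical entry -1350 printed in the previous section; 30 is twice the sum of `Variable1`, that is, the second derivative of the `-beta1**2 * Variable1` term.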

## checkDerivatives

In [20]:
f, g, h, gdiff, hdiff = myBiogeme.checkDerivatives(xplus, verbose=True)

[DEBUG] 2023-04-14 10:26:15,616 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,616 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,617 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,617 Log likelihood (N = 5):  -555.0002 Gradient norm:      7e+02 Hessian norm:       1e+03  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,617 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,618 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,618 Log likelihood (N = 5):  -555.0002 Gradient norm:      7e+02 Hessian norm:       1e+03  <biogeme.py:919>


In [21]:
print(f'f = {f}')
print(f'g = {g}')
print(f'h = {h}')
print(f'gdiff = {gdiff}')
print(f'hdiff = {hdiff}')
hdiff

f = -555.0
g = [-450. -540.]
h = [[-1350.  -150.]
 [ -150.  -540.]]
gdiff = [6.93496986e-05 8.08701104e-05]
hdiff = [[3.00002023e+01 2.42550868e-10]
 [4.50960215e-05 5.39642255e-05]]


array([[3.00002023e+01, 2.42550868e-10],
       [4.50960215e-05, 5.39642255e-05]])

## estimate

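During estimation, intermediate results can be saved so that an interrupted estimation does not have to restart from scratch. In the log below, the initial parameter values are read from the file `__simpleExample.iter`.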

In [22]:
results = myBiogeme.estimate(bootstrap=10)

[INFO] 2023-04-14 10:26:15,623 *** Initial values of the parameters are obtained from the file __simpleExample.iter <biogeme.py:1252>
[INFO] 2023-04-14 10:26:15,624 Parameter values restored from __simpleExample.iter <biogeme.py:1035>
[DEBUG] 2023-04-14 10:26:15,624 Log likelihood (N = 5):       -555 <biogeme.py:837>
[DEBUG] 2023-04-14 10:26:15,624 Run simple_bounds <biogeme.py:1494>
[INFO] 2023-04-14 10:26:15,624 ** Optimization: Newton with trust region for simple bounds <optimization.py:464>
[DEBUG] 2023-04-14 10:26:15,625 Log likelihood (N = 5):       -555 Gradient norm:      7e+02 Hessian norm:       1e+03  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,625 Log likelihood (N = 5):  -184.9275 <biogeme.py:837>
[DEBUG] 2023-04-14 10:26:15,626 Log likelihood (N = 5):  -184.9275 Gradient norm:      2e+02 Hessian norm:       5e+02  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,626 1 f=  36.98549 projected rel. grad.= 0.094 rel. change=  0.45 delta= 1e+03 rho=   1.2 ++ <algorithms.py:2013>


[INFO] 2023-04-14 10:26:15,666 Proportion of Hessian calculation: 100.0% <algorithms.py:2024>
100%|██████████| 10/10 [00:00<00:00, 418.44it/s]
[INFO] 2023-04-14 10:26:15,671 Results saved in file simpleExample.html <results.py:1252>
[INFO] 2023-04-14 10:26:15,672 Results saved in file simpleExample.pickle <results.py:379>


In [23]:
results.getEstimatedParameters()

|   | Value | Rob. Std err | Rob. t-test | Rob. p-value |
|---|---|---|---|---|
| beta1 | -1.273264 | 0.013724 | -92.776769 | 0.0 |
| beta2 | 1.248769 | 0.059086 | 21.134795 | 0.0 |


If the model has already been estimated, it is possible to recycle the estimation results. In that case, the other arguments are ignored, and the results are whatever is in the file.

In [24]:
recycled_results = myBiogeme.estimate(recycle=True, bootstrap=10)



In [25]:
print(recycled_results.shortSummary())

Results for model simpleExample
Nbr of parameters:		2
Sample size:			5
Excluded data:			0
Final log likelihood:		-67.06549
Akaike Information Criterion:	138.131
Bayesian Information Criterion:	137.3499
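
The two information criteria in this summary follow from the final log likelihood, the number of parameters and the sample size. A minimal sketch of the arithmetic, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L and the rounded values printed above:

```python
import math

# Values copied from the summary.
final_loglike = -67.06549
n_parameters = 2
sample_size = 5

# Standard definitions of the two criteria.
aic = 2 * n_parameters - 2 * final_loglike
bic = n_parameters * math.log(sample_size) - 2 * final_loglike

print(round(aic, 3), round(bic, 4))
```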



In [26]:
recycled_results.getEstimatedParameters()

|   | Value | Rob. Std err | Rob. t-test | Rob. p-value |
|---|---|---|---|---|
| beta1 | -1.273264 | 0.013724 | -92.776769 | 0.0 |
| beta2 | 1.248769 | 0.059086 | 21.134795 | 0.0 |


## simulate

Simulate with the default values for the parameters

In [27]:
simulationWithDefaultBetas = myBiogeme.simulate(myBiogeme.loglike.get_beta_values())
simulationWithDefaultBetas

|   | loglike | beta1 | simul |
|---|---|---|---|
| 0 | -111.0 | 0.0 | 0.1 |
| 1 | -111.0 | 0.0 | 0.1 |
| 2 | -121.0 | 0.0 | 0.075 |
| 3 | -131.0 | 0.0 | 0.06 |
| 4 | -121.0 | 0.0 | 0.075 |


Simulate with the estimated values for the parameters

In [28]:
print(results.getBetaValues())
simulationWithEstimatedBetas =\
    myBiogeme.simulate(results.getBetaValues())
simulationWithEstimatedBetas

{'beta1': -1.273263915009374, 'beta2': 1.248768825523196}


|   | loglike | beta1 | simul |
|---|---|---|---|
| 0 | -13.413098 | -1.273264 | -0.382796 |
| 1 | -13.413098 | -1.273264 | -0.382796 |
| 2 | -17.07353 | -1.273264 | -0.287097 |
| 3 | -20.733962 | -1.273264 | -0.229677 |
| 4 | -17.07353 | -1.273264 | -0.287097 |


## confidenceIntervals

In [29]:
drawsFromBetas =\
    results.getBetasForSensitivityAnalysis(myBiogeme.id_manager.free_betas.names)
left, right = myBiogeme.confidenceIntervals(drawsFromBetas)
left

|   | loglike | beta1 | simul |
|---|---|---|---|
| 0 | -13.583391 | -1.295608 | -0.393516 |
| 1 | -13.583391 | -1.295608 | -0.393516 |
| 2 | -17.477827 | -1.295608 | -0.295137 |
| 3 | -21.408836 | -1.295608 | -0.23611 |
| 4 | -17.477827 | -1.295608 | -0.295137 |


In [30]:
right

|   | loglike | beta1 | simul |
|---|---|---|---|
| 0 | -13.415398 | -1.251738 | -0.372564 |
| 1 | -13.415398 | -1.251738 | -0.372564 |
| 2 | -16.973184 | -1.251738 | -0.279423 |
| 3 | -20.409226 | -1.251738 | -0.223538 |
| 4 | -16.973184 | -1.251738 | -0.279423 |
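
The general idea behind such intervals can be sketched as follows: a quantity is simulated once per draw of the parameters, and the bounds are empirical quantiles of the simulated values. The snippet below is an illustration in plain Python, not Biogeme's implementation. The draws are hypothetical numbers generated for the example, and the 90% interval (5% and 95% quantiles) is an assumption.

```python
import random
import statistics

# Hypothetical simulated values of one quantity, one per draw of the
# parameters. In practice they would come from the simulation.
rng = random.Random(1)
draws = [rng.gauss(-0.38, 0.01) for _ in range(1000)]

# With n=20, statistics.quantiles returns the 19 cut points
# at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(draws, n=20)
left, right = cuts[0], cuts[-1]

# Share of the draws that fall inside the interval.
inside = sum(left <= d <= right for d in draws) / len(draws)
print(left, right, inside)
```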


## validate

Validation consists of organizing the data into several randomly defined slices of about the same size.
Each slice is considered as a validation dataset. The model is then re-estimated using all the data except the slice, and the estimated model is applied to the validation set (i.e. the slice). The value of the log likelihood for each observation in the validation set is reported in a dataframe. As this is done for each slice, the output is a list of dataframes, each corresponding to one of these exercises.

In [31]:
validationData = myData.split(slices=5)
validation_results = myBiogeme.validate(results, validationData)

[INFO] 2023-04-14 10:26:15,721 Parameters read from biogeme.toml <toml.py:58>
[INFO] 2023-04-14 10:26:15,760 *** Initial values of the parameters are obtained from the file __simpleExample_val_est_1.iter <biogeme.py:1252>
[INFO] 2023-04-14 10:26:15,760 Cannot read file __simpleExample_val_est_1.iter. Statement is ignored. <biogeme.py:1037>
[DEBUG] 2023-04-14 10:26:15,761 Log likelihood (N = 4):  -49.99196 <biogeme.py:837>
[DEBUG] 2023-04-14 10:26:15,761 Run simple_bounds <biogeme.py:1494>
[INFO] 2023-04-14 10:26:15,761 ** Optimization: Newton with trust region for simple bounds <optimization.py:464>
[DEBUG] 2023-04-14 10:26:15,761 Log likelihood (N = 4):  -49.99196 Gradient norm:          3 Hessian norm:       1e+02  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,762 Log likelihood (N = 4):  -49.96064 <biogeme.py:837>
[DEBUG] 2023-04-14 10:26:15,763 Log likelihood (N = 4):  -49.96064 Gradient norm:       0.02 Hessian norm:       1e+02  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,763 1 f

[DEBUG] 2023-04-14 10:26:15,836 Log likelihood (N = 4):  -46.20181 <biogeme.py:837>
[DEBUG] 2023-04-14 10:26:15,836 Log likelihood (N = 4):  -46.20181 Gradient norm:      2e-05 Hessian norm:       1e+02  <biogeme.py:919>
[DEBUG] 2023-04-14 10:26:15,837 2 f=  11.55045 projected rel. grad.= 6e-07 rel. change=0.0007 delta= 1e+04 rho=     1 ++ <algorithms.py:2013>
[INFO] 2023-04-14 10:26:15,837 Proportion of Hessian calculation: 100.0% <algorithms.py:2024>
[DEBUG] 2023-04-14 10:26:15,838 Log likelihood (N = 4):  -46.20181 Gradient norm:      2e-05 Hessian norm:       1e+02 BHHH norm:       4e+01 <biogeme.py:919>
[INFO] 2023-04-14 10:26:15,841 Results saved in file simpleExample_val_est_5.html <results.py:1252>
[INFO] 2023-04-14 10:26:15,842 Results saved in file simpleExample_val_est_5.pickle <results.py:379>
[INFO] 2023-04-14 10:26:15,842 Parameters read from biogeme.toml <toml.py:58>
[INFO] 2023-04-14 10:26:15,847 Simulation results saved in file simpleExample_validation.pickle <biogeme.

In [32]:
validation_results

[   Loglikelihood
 3     -17.145326,
    Loglikelihood
 0      -6.341109,
    Loglikelihood
 2     -13.413098,
    Loglikelihood
 1       -9.81772,
    Loglikelihood
 4     -21.037421]

In [33]:
for validation_slice in validation_results:
    print(f'Log likelihood for {validation_slice.shape[0]} '
          f'validation data: {validation_slice["Loglikelihood"].sum()}')


Log likelihood for 1 validation data: -17.145326446024075
Log likelihood for 1 validation data: -6.341108765392212
Log likelihood for 1 validation data: -13.413098095892746
Log likelihood for 1 validation data: -9.81771976465043
Log likelihood for 1 validation data: -21.03742136293277
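
The slicing procedure described at the beginning of this section can be sketched in plain Python. This is an illustration only, not Biogeme's implementation; the function names and the seed are assumptions introduced for the example.

```python
import random


def split_indices(n_rows, n_slices, seed=0):
    """Shuffle the row indices and cut them into slices of similar size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_slices] for k in range(n_slices)]


def estimation_validation_pairs(n_rows, n_slices, seed=0):
    """For each slice, pair it with the remaining rows used for estimation."""
    pairs = []
    for validation in split_indices(n_rows, n_slices, seed):
        held_out = set(validation)
        estimation = [i for i in range(n_rows) if i not in held_out]
        pairs.append((estimation, validation))
    return pairs


# Five rows and five slices, as in the example above: each model is
# estimated on four rows and validated on the remaining one.
pairs = estimation_validation_pairs(n_rows=5, n_slices=5)
for estimation, validation in pairs:
    print(f'estimate on {estimation}, validate on {validation}')
```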


## files_of_type

In [34]:
myBiogeme.files_of_type('pickle')

['simpleExample.pickle']