# Module biogeme.database

## Examples of use of each function

This webpage is for programmers who need examples of how to use the functions of the class. The examples only illustrate the syntax; they do not correspond to any meaningful model. For examples of models, visit [biogeme.epfl.ch](http://biogeme.epfl.ch).

In [1]:
import datetime
print(datetime.datetime.now())

2023-08-04 18:40:41.226577


In [2]:
import biogeme.version as ver
print(ver.getText())

biogeme 3.2.12 [2023-08-04]
Home page: http://biogeme.epfl.ch
Submit questions to https://groups.google.com/d/forum/biogeme
Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)



In [3]:
import pandas as pd
import numpy as np

In [4]:
import biogeme.database as db
from biogeme.expressions import Variable, exp, bioDraws
from biogeme.elementary_expressions import TypeOfElementaryExpression
from biogeme.segmentation import DiscreteSegmentationTuple
from biogeme.exceptions import BiogemeError

We set the seed so that the outcome of random operations is always the same.

In [5]:
np.random.seed(90267) 

## Create a database from a pandas data frame

In [6]:
df = pd.DataFrame({'Person':[1,1,1,2,2],
                   'Exclude':[0,0,1,0,1],
                   'Variable1':[1,2,3,4,5],
                   'Variable2':[10,20,30,40,50],
                   'Choice':[1,2,3,1,2],
                   'Av1':[0,1,1,1,1],
                   'Av2':[1,1,1,1,1],
                   'Av3':[0,1,1,1,1]})
myData = db.Database('test', df)
print(myData)

biogeme database test:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1         10       1    0    1    0
1       1        0          2         20       2    1    1    1
2       1        1          3         30       3    1    1    1
3       2        0          4         40       1    1    1    1
4       2        1          5         50       2    1    1    1


## valuesFromDatabase

Evaluates an expression for each entry of the database.

        Args:
           expression: object of type biogeme.expressions. 

        Returns: 
           numpy array, as long as the number of entries in the
           database, containing the calculated quantities.

In [7]:
Variable1 = Variable('Variable1')
Variable2 = Variable('Variable2')
expr = Variable1 + Variable2
result = myData.valuesFromDatabase(expr)
print(result)

[11. 22. 33. 44. 55.]


## check_segmentation

Check that the segmentation covers the complete database

        :param segmentation_tuple: object describing the segmentation
        :type segmentation_tuple: biogeme.segmentation.DiscreteSegmentationTuple

        :return: number of observations per segment.
        :rtype: dict(str: int)

A segmentation is a partition of the dataset based on the value of one of the variables. For instance, we can segment on the Choice variable.

In [8]:
correct_mapping = {1: 'Alt. 1', 2: 'Alt. 2', 3: 'Alt. 3'}
correct_segmentation = DiscreteSegmentationTuple(variable='Choice', mapping=correct_mapping)

If the segmentation is well defined, the function returns the size of each segment in the database.

In [9]:
myData.check_segmentation(correct_segmentation)

{'Alt. 1': 2, 'Alt. 2': 2, 'Alt. 3': 1}

In [10]:
incorrect_mapping = {1: 'Alt. 1', 2: 'Alt. 2'}
incorrect_segmentation = DiscreteSegmentationTuple(variable='Choice', mapping=incorrect_mapping)

If the segmentation is incorrect, an exception is raised.

In [11]:
try:
    myData.check_segmentation(incorrect_segmentation)
except BiogemeError as e:
    print(e)

Variable Choice takes the value 3 [1 times], and it does not define any segment.


In [12]:
another_incorrect_mapping = {1: 'Alt. 1', 2: 'Alt. 2', 4: 'Does not exist'}
another_incorrect_segmentation = DiscreteSegmentationTuple(variable='Choice', mapping=another_incorrect_mapping)

In [13]:
try:
    myData.check_segmentation(another_incorrect_segmentation)
except BiogemeError as e:
    print(e)

Variable Choice does not take the value 4 representing segment "Does not exist"
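
The two failure modes shown above can be reproduced in a few lines of plain Python. This is only a sketch of the kind of check involved, not Biogeme's implementation: the helper `check_mapping` is hypothetical, and it raises `ValueError` rather than `BiogemeError`.

```python
from collections import Counter


def check_mapping(values, mapping):
    """Count observations per segment; fail if data and mapping disagree.

    Hypothetical helper, not part of Biogeme.
    """
    counts = Counter(values)
    # Every value observed in the data must define a segment.
    for value, n in counts.items():
        if value not in mapping:
            raise ValueError(
                f'value {value} [{n} times] does not define any segment'
            )
    # Every segment of the mapping must be observed in the data.
    for value, name in mapping.items():
        if value not in counts:
            raise ValueError(f'value {value} of segment "{name}" never occurs')
    return {name: counts[value] for value, name in mapping.items()}


choice = [1, 2, 3, 1, 2]  # the Choice column of the example database
sizes = check_mapping(choice, {1: 'Alt. 1', 2: 'Alt. 2', 3: 'Alt. 3'})
print(sizes)  # {'Alt. 1': 2, 'Alt. 2': 2, 'Alt. 3': 1}
```

Calling `check_mapping` with either of the two incorrect mappings above raises an exception, for the same reasons as reported by Biogeme.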


## checkAvailabilityOfChosenAlt

Check if the chosen alternative is available for each entry in the database.

        Args: 
            avail: dict of biogeme.expressions, indexed by the ids of
            the alternatives, to evaluate the availability conditions
            for each alternative.
            choice: biogeme.expressions to evaluate the chosen
            alternative.
                
        Returns:
           numpy array of bool, as long as the number of entries 
           in the database, containing True if the chosen 
           alternative is available, False otherwise.


In [14]:
Av1 = Variable('Av1')
Av2 = Variable('Av2')
Av3 = Variable('Av3')
Choice = Variable('Choice')
avail = {1: Av1, 2: Av2, 3: Av3}
result = myData.checkAvailabilityOfChosenAlt(avail, Choice)
print(result)

[False  True  True  True  True]
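
The result above can be checked by hand: for each row, look up the availability column that corresponds to the chosen alternative. The sketch below does this in plain Python on the values of the example data frame; it illustrates the logic only and does not use Biogeme.

```python
# Columns of the example data frame that matter for the check.
rows = [
    {'Choice': 1, 'Av1': 0, 'Av2': 1, 'Av3': 0},
    {'Choice': 2, 'Av1': 1, 'Av2': 1, 'Av3': 1},
    {'Choice': 3, 'Av1': 1, 'Av2': 1, 'Av3': 1},
    {'Choice': 1, 'Av1': 1, 'Av2': 1, 'Av3': 1},
    {'Choice': 2, 'Av1': 1, 'Av2': 1, 'Av3': 1},
]
# Availability column associated with each alternative id.
avail_columns = {1: 'Av1', 2: 'Av2', 3: 'Av3'}

# For each row, is the chosen alternative available?
chosen_available = [bool(row[avail_columns[row['Choice']]]) for row in rows]
print(chosen_available)  # [False, True, True, True, True]
```

The first entry is `False` because alternative 1 is chosen in the first row while `Av1` is 0.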


## choiceAvailabilityStatistics

Calculates the number of times an alternative is chosen and available.

        Args:
            avail: dict of expressions, indexed by the ids of the
                      alternatives, to evaluate the availability
                      conditions for each alternative.
            choice: expression for the chosen alternative.

        Returns: 
            for each alternative, a tuple containing the number of times it is chosen, 
                 and the number of times it is available.
        

In [15]:
myData.choiceAvailabilityStatistics(avail, Choice)

{1.0: (2, 4.0), 2.0: (2, 5.0), 3.0: (1, 4.0)}
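
These counts can be verified directly from the columns of the example data frame. The following plain-Python sketch reproduces them; it returns integers, whereas the Biogeme output above contains floats.

```python
choice = [1, 2, 3, 1, 2]
# Availability columns Av1, Av2 and Av3, indexed by alternative id.
availability = {
    1: [0, 1, 1, 1, 1],
    2: [1, 1, 1, 1, 1],
    3: [0, 1, 1, 1, 1],
}

# For each alternative: (number of times chosen, number of times available).
stats = {
    alt: (choice.count(alt), sum(column))
    for alt, column in availability.items()
}
print(stats)  # {1: (2, 4), 2: (2, 5), 3: (1, 4)}
```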

## suggestScaling

Suggest a scaling of the variables in the database 

    Args:
       columns: list of columns. If None, all columns are analyzed.
         
    Returns: 

        A Pandas dataframe where each row contains the name of
        the variable, the suggested scale s and the largest value in the column. 
        Ideally, the column should be multiplied by s. 


In [16]:
myData.data.columns

Index(['Person', 'Exclude', 'Variable1', 'Variable2', 'Choice', 'Av1', 'Av2',
       'Av3'],
      dtype='object')

In [17]:
myData.suggestScaling()

      Column  Scale  Largest
3  Variable2   0.01       50


In [18]:
myData.suggestScaling(columns=['Variable1', 'Variable2'])

      Column  Scale  Largest
1  Variable2   0.01       50


## scaleColumn

Multiply an entire column by a scale value

           Args:
              column: name of the column

              scale: value of the scale. All values of the 
              column will be multiplied by that scale.


In [19]:
myData.data

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1         10       1    0    1    0
1       1        0          2         20       2    1    1    1
2       1        1          3         30       3    1    1    1
3       2        0          4         40       1    1    1    1
4       2        1          5         50       2    1    1    1


In [20]:
myData.scaleColumn('Variable2', 0.01)

In [21]:
myData.data

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1        0.1       1    0    1    0
1       1        0          2        0.2       2    1    1    1
2       1        1          3        0.3       3    1    1    1
3       2        0          4        0.4       1    1    1    1
4       2        1          5        0.5       2    1    1    1


## addColumn

Add a new column in the database, calculated from an expression.

        Args:
           expression: object of type biogeme.expressions 
           describing the expression to evaluate
           column: name of the column to add.

        Returns:
           nothing

        Raises:
              ValueError: if the column name already exists.

In [22]:
Variable1 = Variable('Variable1')
Variable2 = Variable('Variable2')
# The new column is the product of the two existing columns.
expression = Variable2 * Variable1
result = myData.addColumn(expression, 'NewVariable')
print(myData.data['NewVariable'].tolist())

[0.1, 0.4, 0.8999999999999999, 1.6, 2.5]
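
The third value printed above is `0.8999999999999999` rather than `0.9` because of binary floating-point rounding, not because of an error in the calculation. The short sketch below, which does not involve Biogeme, shows the same effect and how such values can be compared with a tolerance.

```python
import math

product = 0.3 * 3
print(product)  # 0.8999999999999999

# Exact comparison fails because of the rounding error.
exact_match = product == 0.9
print(exact_match)  # False

# Comparison with a tolerance succeeds.
close_match = math.isclose(product, 0.9)
print(close_match)  # True
```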


In [23]:
myData.data

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0          0.1
1       1        0          2        0.2       2    1    1    1          0.4
2       1        1          3        0.3       3    1    1    1          0.9
3       2        0          4        0.4       1    1    1    1          1.6
4       2        1          5        0.5       2    1    1    1          2.5


## split

Shuffle the data, and split the data into slices. For each slice, an estimation set and a validation set are generated. The validation set is the slice itself. The estimation set is the rest of the data. 

In [24]:
dataSets = myData.split(3)
for i in dataSets:
    print("==========")
    print("Estimation:")
    print(type(i[0]))
    print(i[0])
    print("Validation:")
    print(i[1])

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
Estimation:
<class 'pandas.core.frame.DataFrame'>
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0          0.1
3       2        0          4        0.4       1    1    1    1          1.6
4       2        1          5        0.5       2    1    1    1          2.5
Validation:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
1       1        0          2        0.2       2    1    1    1          0.4
2       1        1          3        0.3       3    1    1    1          0.9
Estimation:
<class 'pandas.core.frame.DataFrame'>
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
1       1        0          2        0.2    
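
The shuffle-and-slice procedure described above can be sketched on row indices in plain Python. This is a generic k-fold split written under our own assumptions: the function name `k_fold_indices`, the `seed` argument and the way rows are assigned to slices are illustrative and may differ from what Biogeme does internally.

```python
import random


def k_fold_indices(n_rows, n_slices, seed=0):
    """Return a list of (estimation, validation) index lists.

    Hypothetical helper illustrating a k-fold split.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Slice i takes every n_slices-th shuffled index, starting at position i.
    slices = [sorted(indices[i::n_slices]) for i in range(n_slices)]
    folds = []
    for validation in slices:
        # The estimation set is everything that is not in the slice.
        estimation = sorted(set(indices) - set(validation))
        folds.append((estimation, validation))
    return folds


folds = k_fold_indices(n_rows=5, n_slices=3)
for estimation, validation in folds:
    print('Estimation:', estimation, 'Validation:', validation)
```

With 5 rows and 3 slices, two slices contain two rows and one slice contains a single row, and each row appears in exactly one validation set.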

## count

Counts the number of observations that have a specific value in a given column.

        Args:
            columnName: name of the column.
            value: value that is sought.

        Returns: 
            Number of times that the value appears in the column.


Here, we count the number of entries for individual 1.

In [25]:
myData.count('Person',1)

3

## remove

Removes from the database all entries such that the value of the expression is not 0. 
        
        Args:
           expression: object of type biogeme.expressions 
           describing the expression to evaluate
        Returns:
           Nothing.

In [26]:
Exclude = Variable('Exclude')
myData.remove(Exclude)
myData.data

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0          0.1
1       1        0          2        0.2       2    1    1    1          0.4
3       2        0          4        0.4       1    1    1    1          1.6


## dumpOnFile

Dumps the database in a CSV formatted file.

        Returns:  name of the file

In [27]:
myData.dumpOnFile()

'test_dumped.dat'

In [28]:
%%bash
cat test_dumped.dat

__rowId	Person	Exclude	Variable1	Variable2	Choice	Av1	Av2	Av3	NewVariable
0	1	0	1	0.1	1	0	1	0	0.1
1	1	0	2	0.2	2	1	1	1	0.4
3	2	0	4	0.4	1	1	1	1	1.6


## generateDraws

Generate draws for each variable.
        
        Args:
             types:
                 A dict indexed by the names of the variables,
                 describing the types of draws. Each of them can be a
                 native type or any type defined by the function
                 database.setRandomNumberGenerators

             names: 
                 the list of names of the variables that require
                 draws to be generated.
             numberOfDraws: 
                 number of draws to generate.

        Returns: 
             a 3-dimensional table with draws. The 3 dimensions are
              1. number of individuals
              2. number of draws
              3. number of variables


List native types and their description

In [29]:
db.Database.descriptionOfNativeDraws()

{'UNIFORM': 'Uniform U[0, 1]',
 'UNIFORM_ANTI': 'Antithetic uniform U[0, 1]',
 'UNIFORM_HALTON2': 'Halton draws with base 2, skipping the first 10',
 'UNIFORM_HALTON3': 'Halton draws with base 3, skipping the first 10',
 'UNIFORM_HALTON5': 'Halton draws with base 5, skipping the first 10',
 'UNIFORM_MLHS': 'Modified Latin Hypercube Sampling on [0, 1]',
 'UNIFORM_MLHS_ANTI': 'Antithetic Modified Latin Hypercube Sampling on [0, 1]',
 'UNIFORMSYM': 'Uniform U[-1, 1]',
 'UNIFORMSYM_ANTI': 'Antithetic uniform U[-1, 1]',
 'UNIFORMSYM_HALTON2': 'Halton draws on [-1, 1] with base 2, skipping the first 10',
 'UNIFORMSYM_HALTON3': 'Halton draws on [-1, 1] with base 3, skipping the first 10',
 'UNIFORMSYM_HALTON5': 'Halton draws on [-1, 1] with base 5, skipping the first 10',
 'UNIFORMSYM_MLHS': 'Modified Latin Hypercube Sampling on [-1, 1]',
 'UNIFORMSYM_MLHS_ANTI': 'Antithetic Modified Latin Hypercube Sampling on [-1, 1]',
 'NORMAL': 'Normal N(0, 1) draws',
 'NORMAL_ANTI': 'Antithetic normal dr

In [30]:
randomDraws1 = bioDraws('randomDraws1', 'NORMAL_MLHS_ANTI')
randomDraws2 = bioDraws('randomDraws2', 'UNIFORM_MLHS_ANTI')
randomDraws3 = bioDraws('randomDraws3', 'UNIFORMSYM_MLHS_ANTI')

We build an expression that involves the three random variables

In [31]:
x = randomDraws1 + randomDraws2 + randomDraws3
types = x.dict_of_elementary_expression(TypeOfElementaryExpression.DRAWS)
print(types)

{'randomDraws1': 'NORMAL_MLHS_ANTI', 'randomDraws2': 'UNIFORM_MLHS_ANTI', 'randomDraws3': 'UNIFORMSYM_MLHS_ANTI'}


In [32]:
theDrawsTable = myData.generateDraws(types,                         
                                     ['randomDraws1',
                                      'randomDraws2',
                                      'randomDraws3'],
                                     10)
theDrawsTable

array([[[-0.5605896 ,  0.17260212, -0.35933972],
        [-0.13811324,  0.53162299,  0.85919231],
        [ 1.82908818,  0.04835596,  0.13484656],
        [ 0.38628367,  0.78402643, -0.65819673],
        [ 1.1505448 ,  0.6848117 ,  0.68832765],
        [ 0.5605896 ,  0.82739788,  0.35933972],
        [ 0.13811324,  0.46837701, -0.85919231],
        [-1.82908818,  0.95164404, -0.13484656],
        [-0.38628367,  0.21597357,  0.65819673],
        [-1.1505448 ,  0.3151883 , -0.68832765]],

       [[-0.64437973,  0.2080917 , -0.46801586],
        [ 0.00796208,  0.12040568, -0.10798735],
        [-1.6477843 ,  0.99704115,  0.21589817],
        [-1.19741369,  0.83479683,  0.95349066],
        [ 0.45912504,  0.6264498 , -0.29536472],
        [ 0.64437973,  0.7919083 ,  0.46801586],
        [-0.00796208,  0.87959432,  0.10798735],
        [ 1.6477843 ,  0.00295885, -0.21589817],
        [ 1.19741369,  0.16520317, -0.95349066],
        [-0.45912504,  0.3735502 ,  0.29536472]],

       [[ 0.7953
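
In the table above, the last five draws of each individual mirror the first five. For the normal draws, `0.5605896` mirrors `-0.5605896`. For the uniform draws on [0, 1], `0.17260212 + 0.82739788 = 1`. The sketch below illustrates this antithetic construction in plain Python. It uses ordinary pseudo-random numbers instead of MLHS, assumes an even number of draws, and the function names are hypothetical.

```python
import random


def antithetic_uniform(n_draws, rng):
    """Draws on [0, 1]: the second half mirrors the first half around 0.5."""
    half = [rng.random() for _ in range(n_draws // 2)]
    return half + [1.0 - u for u in half]


def antithetic_normal(n_draws, rng):
    """Draws from N(0, 1): the second half is the negative of the first."""
    half = [rng.gauss(0.0, 1.0) for _ in range(n_draws // 2)]
    return half + [-z for z in half]


rng = random.Random(90267)
u = antithetic_uniform(10, rng)
z = antithetic_normal(10, rng)

# Each pair of uniform draws sums to 1; each pair of normal draws sums to 0.
uniform_ok = all(abs(u[i] + u[i + 5] - 1.0) < 1e-12 for i in range(5))
normal_ok = all(z[i] + z[i + 5] == 0.0 for i in range(5))
print(uniform_ok, normal_ok)  # True True
```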

## setRandomNumberGenerators

Defines user-defined random number generators.
        
        Args:

           rng: a dictionary of generators. The keys of the
           dictionary are the names of the generators, and must be
           different from the pre-defined generators in Biogeme:
           NORMAL, UNIFORM and UNIFORMSYM. In the example below,
           each value of the dictionary is a tuple containing a
           function and a description of the generator. The function
           takes two arguments: the number of series to generate
           (typically, the size of the database), and the number of
           draws per series.
 
        Returns: 
             nothing.

We first define functions that return draws, given the number of observations and the number of draws.

In [33]:
def logNormalDraws(sampleSize, numberOfDraws):
    return np.exp(np.random.randn(sampleSize, numberOfDraws))

def exponentialDraws(sampleSize, numberOfDraws):
    return -1.0 * np.log(np.random.rand(sampleSize, numberOfDraws))

We associate these functions with a name

In [34]:
# Avoid the name `dict`, which would shadow the Python built-in.
generators = {'LOGNORMAL': (logNormalDraws, 
                            'Draws from lognormal distribution'), 
              'EXP': (exponentialDraws,
                      'Draws from exponential distributions')}
myData.setRandomNumberGenerators(generators)

We can now generate draws from these distributions

In [35]:
randomDraws1 = bioDraws('randomDraws1', 'LOGNORMAL')
randomDraws2 = bioDraws('randomDraws2', 'EXP')
x = randomDraws1 + randomDraws2
types = x.dict_of_elementary_expression(TypeOfElementaryExpression.DRAWS)
theDrawsTable = myData.generateDraws(types,
                                     ['randomDraws1',
                                      'randomDraws2'],
                                     10)
print(theDrawsTable)

[[[2.15336577 0.35541854]
  [0.92036707 0.38330687]
  [1.35125462 2.83842826]
  [0.27817501 0.46249413]
  [0.5007549  0.6961861 ]
  [1.11902088 1.05840875]
  [0.6539865  0.15909907]
  [0.11955894 0.38886736]
  [0.60108954 0.40525196]
  [3.93153651 0.35868107]]

 [[4.60723253 0.18021421]
  [1.27062239 2.2373742 ]
  [2.73460167 1.17203962]
  [5.61600938 1.8920716 ]
  [2.54756523 0.07930524]
  [0.77284243 2.56028383]
  [5.16153268 0.59225528]
  [0.58972275 0.67940422]
  [0.88324351 0.63497716]
  [3.67625403 3.030641  ]]

 [[2.24536739 0.70518133]
  [0.46930501 0.67990918]
  [4.86579395 0.4097506 ]
  [2.14129298 0.8086017 ]
  [0.20614091 0.06963184]
  [0.2096891  0.02382351]
  [1.70933977 0.78170648]
  [0.63660909 1.83653019]
  [1.14977308 0.75890102]
  [0.26832114 4.20117546]]]
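
The function `exponentialDraws` above relies on inverse transform sampling: if U is uniform on (0, 1), then -log(U) follows an exponential distribution with rate 1, whose mean is 1. The sketch below checks this empirically with the standard library only. The helper `exponential_draw` is hypothetical, and it uses `1 - random()` so that the logarithm is never evaluated at 0.

```python
import math
import random


def exponential_draw(rng):
    """One draw from Exp(1) by inverse transform sampling."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1],
    # which avoids log(0).
    return -math.log(1.0 - rng.random())


rng = random.Random(0)
draws = [exponential_draw(rng) for _ in range(100_000)]
mean = sum(draws) / len(draws)
# The empirical mean should be close to the theoretical mean of 1.
print(round(mean, 2))
```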


## sampleWithReplacement

Extract a random sample from the database, with replacement. Useful for bootstrapping. 
        Args:
            size: size of the sample. If None, a sample of the same size as the database will be generated.

        Returns:
            pandas dataframe with the sample.


In [36]:
myData.sampleWithReplacement()

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
3       2        0          4        0.4       1    1    1    1          1.6
3       2        0          4        0.4       1    1    1    1          1.6
1       1        0          2        0.2       2    1    1    1          0.4


In [37]:
myData.sampleWithReplacement(6)

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
3       2        0          4        0.4       1    1    1    1          1.6
0       1        0          1        0.1       1    0    1    0          0.1
0       1        0          1        0.1       1    0    1    0          0.1
0       1        0          1        0.1       1    0    1    0          0.1
0       1        0          1        0.1       1    0    1    0          0.1
1       1        0          2        0.2       2    1    1    1          0.4
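
Sampling with replacement means that the same row can be drawn several times, as in the outputs above, and that the sample can be larger than the database. A minimal sketch of the idea in plain Python follows. The function below is a hypothetical stand-in written for illustration; it is not the Biogeme method and works on a list of rows rather than a data frame.

```python
import random


def sample_with_replacement(rows, size=None, seed=None):
    """Draw `size` rows with replacement (default: as many as there are rows)."""
    rng = random.Random(seed)
    if size is None:
        size = len(rows)
    return rng.choices(rows, k=size)


# Row labels and individuals of the database after the call to `remove`.
rows = [
    {'row': 0, 'Person': 1},
    {'row': 1, 'Person': 1},
    {'row': 3, 'Person': 2},
]
same_size = sample_with_replacement(rows, seed=1)
larger = sample_with_replacement(rows, size=6, seed=1)
print(len(same_size), len(larger))  # 3 6
```

In a bootstrap procedure, the model would be re-estimated on many such samples in order to assess the variability of the estimates.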


## panel

Defines the data as panel data

        Args:
           columnName: name of the column that identifies
           individuals.


In [38]:
myPanelData = db.Database('test', df)

The data is not considered panel data yet.

In [39]:
myPanelData.isPanel()

False

In [40]:
myPanelData.panel('Person')

Now it is considered panel data.

In [41]:
print(myPanelData.isPanel())

True


In [42]:
print(myPanelData)

biogeme database test:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0          0.1
1       1        0          2        0.2       2    1    1    1          0.4
2       2        0          4        0.4       1    1    1    1          1.6
Panel data
   0  1
1  0  1
2  2  2
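
The `Panel data` table printed above appears to list, for each individual, the positions of the first and the last of its rows: individual 1 occupies rows 0 to 1, and individual 2 occupies row 2 only. The sketch below builds such a map in plain Python. The function `individual_map` is hypothetical and assumes that the observations of each individual are contiguous.

```python
def individual_map(person_ids):
    """Map each individual to the (first, last) positions of its rows.

    Assumes that the rows of each individual are contiguous.
    """
    mapping = {}
    for position, person in enumerate(person_ids):
        # Keep the first position already recorded, update the last one.
        first, _ = mapping.get(person, (position, position))
        mapping[person] = (first, position)
    return mapping


person = [1, 1, 2]  # the Person column of the panel database
the_map = individual_map(person)
print(the_map)  # {1: (0, 1), 2: (2, 2)}

# Number of observations and number of individuals.
print(len(person), len(the_map))  # 3 2
```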


When draws are generated for panel data, a set of draws is generated per person, not per observation.

In [43]:
randomDraws1 = bioDraws('randomDraws1', 'NORMAL')
randomDraws2 = bioDraws('randomDraws2', 'UNIFORM_HALTON3')

We build an expression that involves the two random variables

In [44]:
x = randomDraws1 + randomDraws2
types = x.dict_of_elementary_expression(TypeOfElementaryExpression.DRAWS)
theDrawsTable = myPanelData.generateDraws(types,
                                          ['randomDraws1',
                                           'randomDraws2'],
                                          10)
print(theDrawsTable)

[[[-1.57792232  0.7037037 ]
  [ 0.10870961  0.14814815]
  [ 0.05140378  0.48148148]
  [ 1.800922    0.81481481]
  [-1.85148982  0.25925926]
  [ 0.87938314  0.59259259]
  [ 1.353763    0.92592593]
  [-0.46741631  0.07407407]
  [-1.09546279  0.40740741]
  [-0.09265338  0.74074074]]

 [[ 1.92991243  0.18518519]
  [-0.29388122  0.51851852]
  [-0.49084943  0.85185185]
  [ 0.2439256   0.2962963 ]
  [ 0.42498657  0.62962963]
  [-2.72496968  0.96296296]
  [ 2.0755831   0.01234568]
  [ 0.44793057  0.34567901]
  [-0.13185245  0.67901235]
  [-1.04344227  0.12345679]]]
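
The second column of the table above is consistent with a base-3 Halton sequence in which, as stated in the description of `UNIFORM_HALTON3`, the first 10 elements are skipped: the first individual receives elements 11 to 20, and the second one elements 21 to 30. The sketch below computes these values with a textbook radical-inverse function. The function `halton` is our own illustration, not Biogeme's code.

```python
def halton(index, base):
    """Radical inverse of the positive integer `index` in the given base."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        # Peel off the digits of `index` in the given base, lowest first.
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


first_individual = [halton(i, 3) for i in range(11, 21)]
second_individual = [halton(i, 3) for i in range(21, 31)]
print([round(x, 8) for x in first_individual])
print([round(x, 8) for x in second_individual])
```

For instance, 11 is written 102 in base 3, and reversing the digits after the radix point gives 2/3 + 0/9 + 1/27, that is, about 0.7037037, the first value in the table.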


## getNumberOfObservations

Reports the number of observations in the database. Note that it returns the same value whether or not the database contains panel data.

        Returns:
            Number of observations.

        See:  getSampleSize()


In [45]:
myData.getNumberOfObservations()

3

In [46]:
myPanelData.getNumberOfObservations()

3

## getSampleSize

Reports the size of the sample. If the data is cross-sectional, it
        is the number of observations in the database. If the data is panel,
        it is the number of individuals.

        Returns: 
           Sample size.

        See: getNumberOfObservations()


In [47]:
myData.getSampleSize()

3

In [48]:
myPanelData.getSampleSize()

2

## sampleIndividualMapWithReplacement

Extract a random sample of the individual map from a panel data database, with replacement. Useful for bootstrapping. 

        Args:
            size: size of the sample. If None, a sample of the same
            size as the database will be generated.

        Returns:
            pandas dataframe with the sample.


In [49]:
myPanelData.sampleIndividualMapWithReplacement(10)

   0  1
2  2  2
1  0  1
1  0  1
1  0  1
1  0  1
1  0  1
1  0  1
1  0  1
2  2  2
1  0  1
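
With panel data, the bootstrap draws individuals rather than rows, so that all observations of a drawn individual stay together. The sketch below illustrates this in plain Python on the individual map shown above, where each individual is associated with the positions of its first and last rows. The function `sample_individuals` is a hypothetical illustration, not the Biogeme method.

```python
import random


def sample_individuals(individual_map, size=None, seed=None):
    """Draw individuals with replacement, keeping their row ranges."""
    rng = random.Random(seed)
    individuals = list(individual_map)
    if size is None:
        size = len(individuals)
    drawn = rng.choices(individuals, k=size)
    return [(person, individual_map[person]) for person in drawn]


# Individual 1 occupies rows 0 to 1, individual 2 occupies row 2.
the_map = {1: (0, 1), 2: (2, 2)}
sample = sample_individuals(the_map, size=10, seed=0)
print(sample)

# The number of observations in the sample depends on who was drawn.
n_observations = sum(last - first + 1 for _, (first, last) in sample)
print(n_observations)
```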
