In [1]:
import pandas as pd
import pickle
from urllib.request import urlopen
import biogeme.database as db
import biogeme.biogeme as bio
from biogeme import models
import biogeme.results as res
import biogeme.optimization as opt
import biogeme.messaging as msg
from biogeme.expressions import (
    Beta,
    bioDraws,
    PanelLikelihoodTrajectory,
    MonteCarlo,
    log,
)

The estimation of these models may take a while. If you want to run the notebook using pre-estimated results saved on file, set the number of draws below to `None`.

In [2]:
#numberOfDraws = 10000
numberOfDraws = None

In [3]:
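def get_results_from_url(file):
    """Load pre-estimated Biogeme results from a pickle file stored under url_root."""
    pickle_file = f'{url_root}{file}'

    # url_root is defined in the next cell, before this function is first called.
    with urlopen(pickle_file) as p:
        data = pickle.load(p)
    return res.bioResults(data)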

In [4]:
url_root = (
    'https://courses.edx.org/'
    'asset-v1:EPFLx+ChoiceModels2x+3T2021+type@asset+block@'
)

# Dynamic Choice Models

We use synthetic data to analyze the smoking behavior of individuals as a function of their age and the price of tobacco. We develop a model that predicts, for every year, the probability of smoking.

## Data

We use synthetic data that has been generated as follows. We postulate a true model for the data generation process. It is a mixture of logit models with two alternatives: ``smoking`` or ``not smoking``. The utility for individual $n$ associated with "not smoking" in year $t$ is
\begin{equation}
U_{0nt}= \varepsilon_{0nt}
\end{equation}
 and the utility associated with "smoking" is 
\begin{equation}
U_{1nt}= \beta_{nt} y_{n,t-1} + \beta^p_{nt} P_{t} + c_n + \varepsilon_{1nt},
\end{equation}
where 

- $\beta_{nt} = 10$,

- $y_{n,t-1}=1$ if $n$ is smoking at time $t-1$, $0$ otherwise,

- $\beta^p_{nt} = -0.1$,

- $P_t$ is the price of cigarettes at time $t$,

- $c_n$ is an individual specific constant that captures the a priori, intrinsic attraction of each individual towards smoking. It is assumed to be normally distributed in the population, with zero mean and standard deviation 50: $N(0, 50^2)$, and constant over $t$.
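
Conditional on $c_n$, the model above is a binary logit, so the probability of smoking in year $t$ is the logistic function of the deterministic part of $U_{1nt}$. The following is a minimal sketch in plain Python. The function name `prob_smoking` and its argument names are illustrative and not part of the notebook.

```python
import math


def prob_smoking(last_year, price, c_n, beta=10.0, beta_price=-0.1):
    """Logit probability of smoking in year t, conditional on c_n."""
    # Deterministic part of U_1nt; the utility of "not smoking" is normalized to 0.
    v = beta * last_year + beta_price * price + c_n
    # Numerically stable evaluation of the logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    return math.exp(v) / (1.0 + math.exp(v))


# Same price and same individual effect, different state last year.
p_non_smoker = prob_smoking(last_year=0, price=10.0, c_n=0.0)
p_smoker = prob_smoking(last_year=1, price=10.0, c_n=0.0)
print(p_non_smoker, p_smoker)
```

With these values, having smoked last year moves the utility from $-1$ to $9$, which illustrates how strong the state dependence is in the true model.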

We generate a sample of 1000 individuals, and we simulate their smoking behavior from the age of 16 until the age of 100. **But we observe them only between the age of 45 and the age of 55**.

The date of birth of each individual is uniformly distributed between 2000 and 2020.
The price of cigarettes in 2000 is assumed to be 10. The price of cigarettes in year $t$ is $$P_t = 10 \cdot 1.02^{t-2000},$$
which represents a price increase of 2% per year.
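
The data generation process described above can be sketched in a few lines of plain Python. This is not the script that produced the data file. It relies on assumptions that the text does not specify: birth years are drawn as integers, nobody smokes before the age of 16, and the seed is arbitrary.

```python
import math
import random


def price(year):
    """Price of cigarettes in a given year: 10 in 2000, growing by 2% per year."""
    return 10.0 * 1.02 ** (year - 2000)


def logistic(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    return math.exp(v) / (1.0 + math.exp(v))


def simulate_individual(rng, beta=10.0, beta_price=-0.1, sigma=50.0):
    """Simulate one smoking trajectory from age 16 to age 100."""
    birth_year = rng.randint(2000, 2020)  # assumption: integer birth years
    c_n = rng.gauss(0.0, sigma)           # individual effect, constant over time
    last_year = 0                         # assumption: non-smoker at age 15
    trajectory = []
    for age in range(16, 101):
        year = birth_year + age
        v = beta * last_year + beta_price * price(year) + c_n
        smoking = 1 if rng.random() < logistic(v) else 0
        trajectory.append(
            {'Age': age, 'Price': price(year),
             'Smoking': smoking, 'LastYear': last_year}
        )
        last_year = smoking
    return trajectory


rng = random.Random(2021)  # arbitrary seed
full = simulate_individual(rng)
# Only the ages 45 to 55 are kept in the observed sample.
observed = [row for row in full if 45 <= row['Age'] <= 55]
print(len(full), len(observed))  # 85 11
```

Each individual contributes 85 simulated years but only 11 observed ones, so the first observed value of `LastYear` depends on a history that the analyst never sees.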



In [5]:
logger = msg.bioMessage()
logger.setGeneral()

## True value of the parameters

We store the true values of the parameters for future comparison.

In [6]:
trueParameters = pd.DataFrame(
    {'Value': [
        -0.1, 
        10,
        0,
        50
    ]
    }, 
    index=[
        'coef_price', 
        'beta_last_year',
        'cte_mean',
        'cte_std'
    ]
)

## Data

The observations are available in the following data file.

In [7]:
df = pd.read_table(f'{url_root}smoking55.dat', sep=',')
df

Unnamed: 0,Age,Price,Smoking,LastYear,Id,Smoking45
0,45,25.870704,1,1,0,1
1,46,26.388118,1,1,0,1
2,47,26.915880,1,1,0,1
3,48,27.454198,1,1,0,1
4,49,28.003282,1,1,0,1
...,...,...,...,...,...,...
10995,51,29.134614,1,1,999,1
10996,52,29.717307,1,1,999,1
10997,53,30.311653,1,1,999,1
10998,54,30.917886,1,1,999,1


The data contains the following columns:

- the age of the individual,
- the price of cigarettes,
- a variable that is 1 if the individual is smoking, 0 otherwise,
- a variable that is 1 if the individual was smoking last year, 0 otherwise,
- a unique id for each individual,
- a variable that is 1 if the individual was smoking at the age of 45, in the beginning of the observation period.
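
By construction, the `LastYear` value of an observation should equal the `Smoking` value of the same individual one year earlier. The sketch below checks this on the first rows displayed above (individual 0), using plain Python. The helper `count_lag_mismatches` is hypothetical and not part of the notebook.

```python
from collections import defaultdict

# Rows copied from the first lines of the table above (individual 0).
rows = [
    {'Id': 0, 'Age': 45, 'Smoking': 1, 'LastYear': 1},
    {'Id': 0, 'Age': 46, 'Smoking': 1, 'LastYear': 1},
    {'Id': 0, 'Age': 47, 'Smoking': 1, 'LastYear': 1},
    {'Id': 0, 'Age': 48, 'Smoking': 1, 'LastYear': 1},
    {'Id': 0, 'Age': 49, 'Smoking': 1, 'LastYear': 1},
]


def count_lag_mismatches(records):
    """Count consecutive-year pairs where LastYear differs from the previous Smoking."""
    by_id = defaultdict(list)
    for record in records:
        by_id[record['Id']].append(record)
    mismatches = 0
    for trajectory in by_id.values():
        trajectory.sort(key=lambda record: record['Age'])
        for previous, current in zip(trajectory, trajectory[1:]):
            consecutive = current['Age'] == previous['Age'] + 1
            if consecutive and current['LastYear'] != previous['Smoking']:
                mismatches += 1
    return mismatches


print(count_lag_mismatches(rows))  # 0
```

The check cannot be applied to the first observation of each individual, at age 45, because the choice at age 44 is not part of the data.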

In [8]:
database = db.Database('smoking', df)
globals().update(database.variables)

## Estimation procedure

The following procedure estimates the choice model (or reads the estimation results from file), and returns the estimated parameters in a Pandas format.

In [9]:
def estimate(the_logprob, the_name, the_database):
    if numberOfDraws is None:
        pickle_file = f'{the_name}.pickle'
        results = get_results_from_url(pickle_file)
    else:
        biogeme = bio.BIOGEME(
            the_database,
            the_logprob,
            numberOfDraws=numberOfDraws,
            removeUnusedVariables=False,
        )
        biogeme.modelName = the_name
        results = biogeme.estimate(algorithm=opt.bioBfgs)
    print(results.printGeneralStatistics())
    pandas_results = results.getEstimatedParameters()
    return pandas_results


## Static model

The static model considers the data as cross-sectional. Neither state dependence nor serial correlation is captured.

In [10]:
cte_mean = Beta('cte_mean', 0, None, None, 0)
coef_price = Beta('coef_price', 0, None, None, 0)

In [11]:
V_s = coef_price * Price + cte_mean
V_ns = 0
V = {0: V_ns, 1: V_s}
logprob = models.loglogit(V, None, Smoking)

In [12]:
%%time
r_static = estimate(logprob, 
                    'static_model_truncated_t_55',
                    database)
r_static

Number of estimated parameters:	2
Sample size:	11000
Excluded observations:	0
Init log likelihood:	-7624.619
Final log likelihood:	-7620.313
Likelihood ratio test for the init. model:	8.611645
Rho-square for the init. model:	0.000565
Rho-square-bar for the init. model:	0.000302
Akaike Information Criterion:	15244.63
Bayesian Information Criterion:	15259.24
Final gradient norm:	1.7858E-01
Nbr of threads:	36

CPU times: user 19 ms, sys: 2.79 ms, total: 21.8 ms
Wall time: 521 ms


Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
coef_price,0.009229,0.004399,2.098019,0.035903,0.004401,2.097032,0.035991
cte_mean,-0.263393,0.145504,-1.810216,0.070262,0.145561,-1.809503,0.070373
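
Some of the general statistics reported above can be checked by hand. With both parameters at their starting value of zero, each alternative has probability $1/2$, so the initial log likelihood is $N \ln(1/2)$. The sketch below also recomputes the information criteria, assuming the usual definitions $\text{AIC} = 2K - 2\mathcal{L}$ and $\text{BIC} = K \ln N - 2\mathcal{L}$, which match the reported figures up to rounding.

```python
import math

n_observations = 11000       # sample size reported above
n_parameters = 2             # coef_price and cte_mean
final_loglik = -7620.313     # final log likelihood reported above

# Both utilities are zero at the starting values, so P(smoking) = 1/2.
init_loglik = n_observations * math.log(0.5)
print(round(init_loglik, 3))  # -7624.619

# Usual definitions of the information criteria (assumed here).
aic = 2 * n_parameters - 2 * final_loglik
bic = n_parameters * math.log(n_observations) - 2 * final_loglik
print(aic, bic)
```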


## Comparison of the estimates

In [13]:
summary = pd.concat(
    [trueParameters['Value'], 
     r_static['Value']], 
    axis='columns')
summary.columns = ['True', 'Static']
summary.fillna('')

Unnamed: 0,True,Static
coef_price,-0.1,0.009229
beta_last_year,10.0,
cte_mean,0.0,-0.263393
cte_std,50.0,


The estimated price coefficient is positive, whereas the true value is negative, and it is only weakly significant (p-value of about 0.036). Indeed, price is the only variable that the model considers. Ignoring state dependence generates endogeneity. The model "thinks" that individuals are insensitive to price, as they choose an alternative that is expensive.
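
The p-values in the table of estimates can be reproduced from the $t$-tests. The sketch below assumes a two-sided test based on the standard normal distribution, an assumption that is consistent with the reported figures. The helper `two_sided_p_value` is illustrative and not part of the notebook.

```python
import math


def two_sided_p_value(t_stat):
    """Two-sided p-value under a standard normal: 2 * (1 - Phi(|t|))."""
    # erfc(x / sqrt(2)) equals 2 * (1 - Phi(x)) for x >= 0.
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


# t-tests copied from the table of estimates of the static model.
p_coef_price = two_sided_p_value(2.098019)
p_cte_mean = two_sided_p_value(-1.810216)
print(p_coef_price, p_cte_mean)
```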

## Dynamic model

The dynamic model adds last year's choice as an explanatory variable.

In [14]:
beta_last_year = Beta('beta_last_year', 0, None, None, 0)

In [15]:
V_s = beta_last_year * LastYear + coef_price * Price + cte_mean
V_ns = 0
V = {0: V_ns, 1: V_s}
logprob = models.loglogit(V, None, Smoking)

In [16]:
%%time
r_dynamic = estimate(logprob, 
                     'dynamic_model_truncated_t_55',
                     database)
r_dynamic

Number of estimated parameters:	3
Sample size:	11000
Excluded observations:	0
Init log likelihood:	-7624.619
Final log likelihood:	-27.43176
Likelihood ratio test for the init. model:	15194.37
Rho-square for the init. model:	0.996
Rho-square-bar for the init. model:	0.996
Akaike Information Criterion:	60.86351
Bayesian Information Criterion:	82.78047
Final gradient norm:	1.0510E-02
Nbr of threads:	36

CPU times: user 17.8 ms, sys: 2.34 ms, total: 20.1 ms
Wall time: 528 ms


Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
beta_last_year,16.52685,1.224703,13.494582,0.0,1.209799,13.660828,0.0
coef_price,-0.005069,0.132839,-0.038159,0.969561,0.120666,-0.042008,0.966492
cte_mean,-7.728526,4.389045,-1.760867,0.078261,3.557519,-2.172448,0.029822


### Comparison of the estimates

In [17]:
summary = pd.concat(
    [
        trueParameters['Value'], 
        r_static['Value'], 
        r_dynamic['Value']
    ], 
    axis='columns'
)
summary.columns = [
    'True', 
    'Static', 
    'Dynamic'
]
summary.fillna('')

Unnamed: 0,True,Static,Dynamic
coef_price,-0.1,0.009229,-0.005069
beta_last_year,10.0,,16.52685
cte_mean,0.0,-0.263393,-7.728526
cte_std,50.0,,


The introduction of the lag variable greatly improves the final log likelihood, from `-7620.313` to `-27.43176`. Note that the error term in the model is not the same as in the true model, because serial correlation has been ignored. Therefore, the coefficients cannot be compared directly. Their ratio can be compared, however, as the scale cancels out. Contrary to what we obtained with the full set of data, we are not able to recover the true ratio of coefficients. In fact, the price coefficient is not significantly different from zero.

In [18]:
(
    summary.loc['coef_price', 'True'] / 
    summary.loc['beta_last_year', 'True'] 
)

-0.01

In [19]:
(
    summary.loc['coef_price', 'Dynamic'] / 
    summary.loc['beta_last_year', 'Dynamic'] 
)

-0.00030671186372841234
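
The improvement in fit between the static and the dynamic specifications can also be expressed as a likelihood ratio statistic, since the static model is the dynamic model with `beta_last_year` restricted to zero. This is a minimal sketch using the log likelihood values reported above. It treats the observations as independent, as both models do, and uses the tabulated 95% critical value of the chi-square distribution with one degree of freedom.

```python
# Final log likelihoods reported in the estimation outputs above.
loglik_static = -7620.313    # restricted model (beta_last_year = 0)
loglik_dynamic = -27.43176   # unrestricted model

# Likelihood ratio statistic: -2 * (L_restricted - L_unrestricted).
lr_statistic = -2.0 * (loglik_static - loglik_dynamic)

# 95% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_95_1DF = 3.841

print(lr_statistic, lr_statistic > CHI2_95_1DF)
```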

## Static model with serial correlation

We now introduce the agent effect to capture serial correlation. First, we tell Biogeme that the data is organized as a panel, meaning that there are several observations corresponding to the same individuals.
Therefore, instead of considering that there is a sample of 11000 independent observations, Biogeme knows that there is actually a sample of 1000 individuals, for which a trajectory is observed.

Sample size ignoring the panel nature of the data

In [20]:
database.getSampleSize()

11000

Declaring the panel nature of the data

In [21]:
database.panel('Id')
database.getSampleSize()

1000

In [22]:
cte_std = Beta('cte_std', 1, None, None, 0)
cte = cte_mean + cte_std * bioDraws('agent', 'NORMAL_ANTI')

In [23]:
V_s = coef_price * Price + cte
V_ns = 0
V = {0: V_ns, 1: V_s}
obsprob = models.logit(V, None, Smoking)
condprobIndiv = PanelLikelihoodTrajectory(obsprob)
logprob = log(MonteCarlo(condprobIndiv))
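
Conceptually, the contribution of one individual to this log likelihood is obtained as follows: for each draw of the agent effect, multiply the logit probabilities of all observed choices of that individual, average these products over the draws, and take the logarithm. The sketch below illustrates the calculation in plain Python. It is not Biogeme's implementation, and the function names, the number of draws and the example values are illustrative assumptions.

```python
import math
import random


def logit_prob(choice, v):
    """Probability of the observed choice (1 = smoking) given utility v."""
    if v >= 0:
        p_smoking = 1.0 / (1.0 + math.exp(-v))
    else:
        p_smoking = math.exp(v) / (1.0 + math.exp(v))
    return p_smoking if choice == 1 else 1.0 - p_smoking


def simulated_loglik(choices, prices, coef_price, cte_mean, cte_std, draws):
    """Log of the Monte Carlo average of the trajectory probability."""
    total = 0.0
    for xi in draws:
        # The same draw is used for every year of the individual.
        cte = cte_mean + cte_std * xi
        trajectory_prob = 1.0
        for choice, price in zip(choices, prices):
            trajectory_prob *= logit_prob(choice, coef_price * price + cte)
        total += trajectory_prob
    return math.log(total / len(draws))


rng = random.Random(0)  # arbitrary seed
half = [rng.gauss(0.0, 1.0) for _ in range(500)]
draws = half + [-xi for xi in half]  # antithetic pairs of normal draws

# Hypothetical three-year trajectory of one individual.
value = simulated_loglik([1, 1, 0], [26.0, 26.5, 27.0],
                         coef_price=-0.1, cte_mean=0.0, cte_std=1.0,
                         draws=draws)
print(value)
```

Because the draw is shared across years, the choices of the same individual are correlated once the agent effect is integrated out, which is how serial correlation enters the model.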

In [24]:
%%time
r_serial_static = estimate(logprob, 
                           'static_model_serial_truncated_t_55',
                           database)
r_serial_static

Number of estimated parameters:	3
Sample size:	1000
Observations:	11000
Excluded observations:	0
Init log likelihood:	-740.0878
Final log likelihood:	-723.8315
Likelihood ratio test for the init. model:	32.5127
Rho-square for the init. model:	0.022
Rho-square-bar for the init. model:	0.0179
Akaike Information Criterion:	1453.663
Bayesian Information Criterion:	1468.386
Final gradient norm:	2.6308E-03
Number of draws:	10000
Draws generation time:	0:00:12.931562
Types of draws:	['agent: NORMAL_ANTI']
Nbr of threads:	36

CPU times: user 19.8 ms, sys: 2.59 ms, total: 22.4 ms
Wall time: 526 ms


Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
coef_price,0.048515,-0.0,1.797693e+308,0.0,0.0,1.797693e+308,0.0
cte_mean,53.425753,-0.0,1.797693e+308,0.0,0.0,1.797693e+308,0.0
cte_std,705.533831,-0.0,1.797693e+308,0.0,0.0,1.797693e+308,0.0


It appears that there were some numerical issues during the estimation of the parameters. They are most likely due to the very high value of the `cte_std` parameter. The reported standard errors and $t$-tests are not valid.
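
Two observations support this diagnosis. First, the reported $t$-test value `1.797693e+308` is the largest finite double-precision number, which suggests a placeholder for a division by a zero standard error rather than a genuine statistic. Second, with such a large `cte_std`, almost every draw of the agent effect pushes the choice probability to 0 or 1. A likelihood built from such saturated probabilities is nearly flat in the parameters, which is one plausible reason (an assumption, not something verified here) why the second derivatives are unusable. The sketch below uses the estimates reported above and a representative price of 27, which is an assumption.

```python
import math
import random
import sys

print(sys.float_info.max)  # 1.7976931348623157e+308

# Estimates reported above for the static model with serial correlation.
coef_price, cte_mean, cte_std = 0.048515, 53.425753, 705.533831
price = 27.0  # assumption: a representative price in the observed range


def logistic(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    return math.exp(v) / (1.0 + math.exp(v))


rng = random.Random(0)  # arbitrary seed
n_draws = 10000
saturated = 0
for _ in range(n_draws):
    v = coef_price * price + cte_mean + cte_std * rng.gauss(0.0, 1.0)
    p = logistic(v)
    # Count draws whose probability is within 1e-6 of 0 or 1.
    if p < 1e-6 or p > 1.0 - 1e-6:
        saturated += 1

share_saturated = saturated / n_draws
print(share_saturated)
```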

## Dynamic model with serial correlation

We now introduce the state dependence in the model, to make it dynamic.

In [25]:
V_s = beta_last_year * LastYear + coef_price * Price + cte
V_ns = 0
V = {0: V_ns, 1: V_s}
obsprob = models.logit(V, None, Smoking)
condprobIndiv = PanelLikelihoodTrajectory(obsprob)
logprob = log(MonteCarlo(condprobIndiv))

In [26]:
%%time
r_serial_dynamic = estimate(logprob, 
                            'dynamic_model_serial_truncated_t_55',
                            database)
r_serial_dynamic

Number of estimated parameters:	4
Sample size:	1000
Observations:	11000
Excluded observations:	0
Init log likelihood:	-740.373
Final log likelihood:	-27.43082
Likelihood ratio test for the init. model:	1425.884
Rho-square for the init. model:	0.963
Rho-square-bar for the init. model:	0.958
Akaike Information Criterion:	62.86164
Bayesian Information Criterion:	82.49266
Final gradient norm:	3.6782E-03
Number of draws:	10000
Draws generation time:	0:00:12.691483
Types of draws:	['agent: NORMAL_ANTI']
Nbr of threads:	36

CPU times: user 23.3 ms, sys: 2.51 ms, total: 25.8 ms
Wall time: 553 ms


Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
beta_last_year,16.528013,1.224384,13.499048,0.0,1.571728,10.51582,0.0
coef_price,-0.001006,0.13284,-0.007574,0.993957,0.069741,-0.014427,0.988489
cte_mean,-7.864287,4.399341,-1.787606,0.07384,2.374923,-3.311386,0.000928
cte_std,9.2e-05,0.703775,0.000131,0.999896,9e-05,1.018902,0.30825


### Comparison of the estimates

In [27]:
summary = pd.concat(
    [
        trueParameters['Value'], 
        r_static['Value'], 
        r_dynamic['Value'], 
        r_serial_static['Value'],
        r_serial_dynamic['Value'], 

    ], 
    axis='columns'
)
summary.columns = [
    'True', 
    'Static', 
    'Dynamic', 
    'Static + serial', 
    'Dynamic + serial'
]
summary.fillna('')

Unnamed: 0,True,Static,Dynamic,Static + serial,Dynamic + serial
coef_price,-0.1,0.009229,-0.005069,0.048515,-0.001006
beta_last_year,10.0,,16.52685,,16.528013
cte_mean,0.0,-0.263393,-7.728526,53.425753,-7.864287
cte_std,50.0,,,705.533831,9.2e-05


Contrary to what we observed with the full set of data, the parameters are not recovered, although the model specification is correct. This can be seen from the $t$-tests against the true values.

In [28]:
def t_test(param, true_value):
    return (
        (r_serial_dynamic.loc[param, 'Value'] - true_value) /
        r_serial_dynamic.loc[param, 'Rob. Std err']
    ) 

In [29]:
t_test('coef_price', -0.1)

1.4194549192752515

In [30]:
t_test('beta_last_year', 10)

4.153397375340828

In [31]:
t_test('cte_mean', 0)

-3.3113863810680075

In [32]:
t_test('cte_std', 50)

-553776.7990644233

This is due to an issue called the "initial condition problem", which is discussed in the next video.