In [1]:
import pandas as pd
import pickle
from urllib.request import urlopen
import biogeme.database as db
import biogeme.biogeme as bio
from biogeme import models
import biogeme.results as res
import biogeme.optimization as opt
import biogeme.messaging as msg
from biogeme.expressions import (
    Beta,
    bioDraws,
    PanelLikelihoodTrajectory,
    MonteCarlo,
    log,
)

The estimation of these models may take a while. If you want to run the notebook using pre-estimated results saved on file, set the number of draws below to `None`.

In [2]:
#numberOfDraws = 10000
numberOfDraws = None

In [3]:
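def get_results_from_url(file):
    """Download pickled estimation results and wrap them in a bioResults object."""
    # url_root is a module-level variable, defined in the next cell.
    pickle_file = f'{url_root}{file}'

    with urlopen(pickle_file) as p:
        data = pickle.load(p)
    return res.bioResults(data)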

In [4]:
url_root = (
    'https://courses.edx.org/'
    'asset-v1:EPFLx+ChoiceModels2x+3T2021+type@asset+block@'
)

# Dynamic Choice Models with Panel Effects

We again analyze the smoking behavior of individuals as a function of their age and the price of tobacco, using synthetic data. We develop a model that predicts, for every year, the probability that an individual smokes.

## Data

We postulate a true model for the data generation process. It is a mixture of logit models where the utility associated with "not smoking" is  
\begin{equation}
U_{0nt}= \varepsilon_{0nt}
\end{equation}
 and the utility associated with "smoking" is 
\begin{equation}
U_{1nt}= \beta_{nt} y_{n,t-1} + \beta^p_{nt} P_{t} + c_n + \varepsilon_{1nt},
\end{equation}
where 

- $\beta_{nt} = 10$,

- $y_{n,t-1}=1$ if $n$ is smoking at time $t-1$, $0$ otherwise,

- $\beta^p_{nt} = -0.1$,

- $P_t$ is the price of cigarettes at time $t$,

- $c_n$ is an individual specific constant that captures the a priori, intrinsic attraction of each individual towards smoking. It is assumed to be normally distributed in the population, with zero mean and standard deviation 50: $N(0, 50^2)$.
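
The data generation process above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code that produced the data file: the function names, the price path passed by the caller, and the way the initial state `first_state` is supplied are all hypothetical. The utility of "not smoking" is normalized to zero, so the choice probability is a binary logit.

```python
import math
import random

BETA_LAST_YEAR = 10.0  # beta_{nt} in the text
COEF_PRICE = -0.1      # beta^p_{nt} in the text
CTE_STD = 50.0         # standard deviation of c_n


def logistic(v):
    """Numerically stable logistic function 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def prob_smoking(last_year, price, c_n):
    """Logit probability of smoking, given last year's choice and the price."""
    v = BETA_LAST_YEAR * last_year + COEF_PRICE * price + c_n
    return logistic(v)


def simulate_individual(prices, first_state, rng):
    """Draw c_n once, then simulate one choice per entry of `prices`.

    `first_state` (hypothetical argument) is the choice made the year
    before the first simulated year.
    """
    c_n = rng.gauss(0.0, CTE_STD)
    last_year = first_state
    choices = []
    for price in prices:
        choice = 1 if rng.random() < prob_smoking(last_year, price, c_n) else 0
        choices.append(choice)
        last_year = choice  # today's choice is next year's y_{n,t-1}
    return c_n, choices
```

Because $c_n$ has a standard deviation of 50 while the other terms are of order 10 or less, most simulated individuals either almost always smoke or almost never smoke.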





In [5]:
logger = msg.bioMessage()
logger.setGeneral()

## True value of the parameters

In [6]:
trueParameters = pd.DataFrame(
    {'Value': [
        -0.1, 
        10,
        0,
        50
    ]
    }, 
    index=[
        'coef_price', 
        'beta_last_year',
        'cte_mean',
        'cte_std'
    ]
)

## Data

We observe every individual only from the age of 45 to the age of 55.

In [7]:
df = pd.read_table(f'{url_root}smoking55.dat', sep=',')
df

Unnamed: 0,Age,Price,Smoking,LastYear,Id,Smoking45
0,45,25.870704,1,1,0,1
1,46,26.388118,1,1,0,1
2,47,26.915880,1,1,0,1
3,48,27.454198,1,1,0,1
4,49,28.003282,1,1,0,1
...,...,...,...,...,...,...
10995,51,29.134614,1,1,999,1
10996,52,29.717307,1,1,999,1
10997,53,30.311653,1,1,999,1
10998,54,30.917886,1,1,999,1


In [8]:
df['Age'].unique()

array([45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55])

The data contains the following columns:

- the age of the individual,
- the price of the cigarettes,
- a variable that is 1 if the individual is smoking, 0 otherwise,
- a variable that is 1 if the individual was smoking last year, 0 otherwise,
- a unique id for each individual,
- a variable that is 1 if the individual was smoking at the age of 45, at the beginning of the observation period.
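
The column definitions above imply two consistency conditions within each individual's trajectory: `LastYear` must equal the previous year's `Smoking`, and `Smoking45` must equal `Smoking` in the first observed year. The sketch below checks them on a few made-up rows. The function name and the toy rows are hypothetical, and the check assumes that the first row of each individual corresponds to the age of 45.

```python
def panel_is_consistent(rows):
    """Check LastYear and Smoking45 against the Smoking column.

    `rows` is a list of dicts with keys Id, Age, Smoking, LastYear, Smoking45.
    """
    trajectories = {}
    for row in rows:
        trajectories.setdefault(row['Id'], []).append(row)
    for trajectory in trajectories.values():
        ordered = sorted(trajectory, key=lambda r: r['Age'])
        # Smoking45 must repeat the first observed choice on every row.
        first_choice = ordered[0]['Smoking']
        if any(r['Smoking45'] != first_choice for r in ordered):
            return False
        # LastYear must equal the choice observed one row earlier.
        for previous, current in zip(ordered, ordered[1:]):
            if current['LastYear'] != previous['Smoking']:
                return False
    return True


# Made-up rows, for illustration only.
toy_rows = [
    {'Id': 0, 'Age': 45, 'Smoking': 1, 'LastYear': 1, 'Smoking45': 1},
    {'Id': 0, 'Age': 46, 'Smoking': 0, 'LastYear': 1, 'Smoking45': 1},
    {'Id': 0, 'Age': 47, 'Smoking': 0, 'LastYear': 0, 'Smoking45': 1},
    {'Id': 1, 'Age': 45, 'Smoking': 0, 'LastYear': 0, 'Smoking45': 0},
    {'Id': 1, 'Age': 46, 'Smoking': 1, 'LastYear': 0, 'Smoking45': 0},
]
print(panel_is_consistent(toy_rows))
```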

In [9]:
database = db.Database('smoking55', df)
globals().update(database.variables)

We declare that the data set contains panel data.

In [10]:
database.panel('Id')

## Estimation procedure

The following procedure estimates the choice model and returns the estimated parameters as a pandas DataFrame. If `numberOfDraws` is `None`, the model is not re-estimated: the pre-estimated results are read from a pickle file and reported.

In [11]:
def estimate(the_logprob, the_name, the_database):
    if numberOfDraws is None:
        pickle_file = f'{the_name}.pickle'
        results = get_results_from_url(pickle_file)
    else:
        biogeme = bio.BIOGEME(
            the_database,
            the_logprob,
            numberOfDraws=numberOfDraws,
            removeUnusedVariables=False,
        )
        biogeme.modelName = the_name
        results = biogeme.estimate(algorithm=opt.bioBfgs)
    print(results.printGeneralStatistics())
    pandas_results = results.getEstimatedParameters()
    return pandas_results


## Dynamic model with serial correlation

In the previous quiz, we estimated a dynamic model with panel effects to account for serial correlation.

In [12]:
cte_mean = Beta('cte_mean', 0, None, None, 0)
cte_std = Beta('cte_std', 1, None, None, 0)
cte = cte_mean + cte_std * bioDraws('agent', 'NORMAL_ANTI')
coef_price = Beta('coef_price', 0, None, None, 0)
beta_last_year = Beta('beta_last_year', 0, None, None, 0)

In [13]:
V_s = beta_last_year * LastYear + coef_price * Price + cte
V_ns = 0
V = {0: V_ns, 1: V_s}
obsprob = models.logit(V, None, Smoking)
condprobIndiv = PanelLikelihoodTrajectory(obsprob)
logprob = log(MonteCarlo(condprobIndiv))
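
To make the role of `PanelLikelihoodTrajectory` and `MonteCarlo` more concrete, here is a plain-Python sketch of the quantity that `logprob` represents for one individual: the log of the average, over draws of the agent effect, of the product over time of the logit probabilities of the observed choices. It is a conceptual illustration, not Biogeme's implementation. All function names are hypothetical, and pairing each draw with its negative is an assumption about what antithetic draws mean here.

```python
import math
import random


def logistic(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def antithetic_normal_draws(n_pairs, rng):
    """Standard normal draws, each followed by its negative."""
    draws = []
    for _ in range(n_pairs):
        xi = rng.gauss(0.0, 1.0)
        draws.extend([xi, -xi])
    return draws


def trajectory_prob(choices, last_year, prices, params, cte):
    """Product over time of the probabilities of the observed choices."""
    prob = 1.0
    for y, y_prev, price in zip(choices, last_year, prices):
        v_s = (params['beta_last_year'] * y_prev
               + params['coef_price'] * price + cte)
        p_smoke = logistic(v_s)
        prob *= p_smoke if y == 1 else 1.0 - p_smoke
    return prob


def simulated_loglik(choices, last_year, prices, params, draws):
    """Log of the trajectory probability averaged over the draws."""
    total = 0.0
    for xi in draws:
        cte = params['cte_mean'] + params['cte_std'] * xi
        total += trajectory_prob(choices, last_year, prices, params, cte)
    return math.log(total / len(draws))


# Usage on a made-up trajectory of three years.
example_params = {'cte_mean': 0.0, 'cte_std': 1.0,
                  'beta_last_year': 0.0, 'coef_price': 0.0}
example_draws = antithetic_normal_draws(500, random.Random(0))
print(simulated_loglik([1, 1, 0], [1, 1, 1], [25.0, 25.5, 26.0],
                       example_params, example_draws))
```

The key point is that the same draw of the agent effect is used for all years of a given individual, which is what induces the serial correlation.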

In [14]:
%%time
r_serial_dynamic = estimate(logprob, 
                            'dynamic_model_serial_truncated_t_55',
                            database)
r_serial_dynamic

Number of estimated parameters:	4
Sample size:	1000
Observations:	11000
Excluded observations:	0
Init log likelihood:	-740.373
Final log likelihood:	-27.43082
Likelihood ratio test for the init. model:	1425.884
Rho-square for the init. model:	0.963
Rho-square-bar for the init. model:	0.958
Akaike Information Criterion:	62.86164
Bayesian Information Criterion:	82.49266
Final gradient norm:	3.6782E-03
Number of draws:	10000
Draws generation time:	0:00:12.691483
Types of draws:	['agent: NORMAL_ANTI']
Nbr of threads:	36

CPU times: user 19.8 ms, sys: 2.18 ms, total: 22 ms
Wall time: 465 ms


Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
beta_last_year,16.528013,1.224384,13.499048,0.0,1.571728,10.51582,0.0
coef_price,-0.001006,0.13284,-0.007574,0.993957,0.069741,-0.014427,0.988489
cte_mean,-7.864287,4.399341,-1.787606,0.07384,2.374923,-3.311386,0.000928
cte_std,9.2e-05,0.703775,0.000131,0.999896,9e-05,1.018902,0.30825


### Comparison of the estimates

In [15]:
summary = pd.concat(
    [
        trueParameters['Value'], 
        r_serial_dynamic['Value'], 

    ], 
    axis=1
)
summary.columns = [
    'True', 
    'Dynamic + serial'
]
summary.fillna('')

Unnamed: 0,True,Dynamic + serial
coef_price,-0.1,-0.001006
beta_last_year,10.0,16.528013
cte_mean,0.0,-7.864287
cte_std,50.0,9.2e-05


We observe here the "initial condition problem". Although the model specification is correct (it is the same model as the data generation process), the values of the parameters are not correctly recovered. The reason is that the first observed choice, that is, whether an individual is smoking at the age of 45, is strongly correlated with the agent effect. This creates endogeneity. One visible consequence is that the estimated price coefficient is almost zero, and that the estimated standard deviation of the agent effect collapses to nearly zero. We now use Wooldridge's method to address it.
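
The correlation between the first observed choice and the agent effect can be illustrated with a small simulation. The sketch below is not the process that generated the data file. It assumes, hypothetically, that every individual starts as a non-smoker at age 20, faces a constant price of 25, and follows the true model until age 45. It then compares the average agent effect of those who smoke at 45 with that of those who do not.

```python
import math
import random


def logistic(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def smoking_at_45(c_n, rng, start_age=20, price=25.0):
    """Simulate the unobserved history and return the choice at age 45.

    start_age and price are hypothetical values chosen for illustration.
    """
    last_year = 0  # assumption: nobody smokes before start_age
    for _ in range(start_age, 46):
        p = logistic(10.0 * last_year - 0.1 * price + c_n)
        last_year = 1 if rng.random() < p else 0
    return last_year


rng = random.Random(0)
effects = [rng.gauss(0.0, 50.0) for _ in range(2000)]
first_choice = [smoking_at_45(c, rng) for c in effects]

smokers = [c for c, y in zip(effects, first_choice) if y == 1]
non_smokers = [c for c, y in zip(effects, first_choice) if y == 0]
mean_smokers = sum(smokers) / len(smokers)
mean_non_smokers = sum(non_smokers) / len(non_smokers)

print(f'mean c_n, smoking at 45:     {mean_smokers:.1f}')
print(f'mean c_n, not smoking at 45: {mean_non_smokers:.1f}')
```

In this setting, individuals who smoke at 45 have, on average, a clearly positive agent effect, and the others a clearly negative one. Treating the first observed state as exogenous therefore ignores information about $c_n$.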

## Dynamic model with serial correlation and Wooldridge term

We introduce in the specification of the constant a term that captures the fact that somebody is smoking during the first observation period.
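
In equation form, and using symbols that are introduced here only for illustration ($\bar{c}$ for `cte_mean`, $\gamma$ for `coef_first_year`, $\sigma$ for `cte_std`, and $y_{n,45}$ for `Smoking45`), the individual specific constant of the specification below can be written as
\begin{equation}
c_n = \bar{c} + \gamma \, y_{n,45} + \sigma \xi_n, \qquad \xi_n \sim N(0, 1),
\end{equation}
so that the utility associated with "smoking" becomes
\begin{equation}
U_{1nt} = \beta_{nt} y_{n,t-1} + \beta^p_{nt} P_t + \bar{c} + \gamma \, y_{n,45} + \sigma \xi_n + \varepsilon_{1nt}.
\end{equation}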

In [16]:
coef_first_year = Beta('coef_first_year', 0, None, None, 0)
cte = (
    cte_mean + 
    coef_first_year * Smoking45 + 
    cte_std * bioDraws('agent', 'NORMAL_ANTI')
)

In [17]:
V_s = beta_last_year * LastYear + coef_price * Price + cte
V_ns = 0
V = {0: V_ns, 1: V_s}
obsprob = models.logit(V, None, Smoking)
condprobIndiv = PanelLikelihoodTrajectory(obsprob)
logprob = log(MonteCarlo(condprobIndiv))

In [18]:
%%time
r_wooldridge = estimate(
    logprob, 
    'dynamic_model_serial_wooldridge_truncated_t_55',
    database
)
r_wooldridge

Number of estimated parameters:	5
Sample size:	1000
Observations:	11000
Excluded observations:	0
Init log likelihood:	-740.1321
Final log likelihood:	-23.06365
Likelihood ratio test for the init. model:	1434.137
Rho-square for the init. model:	0.969
Rho-square-bar for the init. model:	0.962
Akaike Information Criterion:	56.12731
Bayesian Information Criterion:	80.66608
Final gradient norm:	7.7682E-03
Number of draws:	10000
Draws generation time:	0:00:12.527811
Types of draws:	['agent: NORMAL_ANTI']
Nbr of threads:	36

CPU times: user 25.2 ms, sys: 2.38 ms, total: 27.6 ms
Wall time: 471 ms


Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
beta_last_year,5.657383,2.393676,2.36347,0.018105,2.673154,2.11637,0.034313
coef_first_year,38.661564,37.255458,1.037742,0.29939,36.733656,1.052483,0.292578
coef_price,-0.177973,0.232155,-0.766613,0.443312,0.326439,-0.545195,0.585619
cte_mean,-15.717856,16.506846,-0.952202,0.340994,8.854878,-1.775051,0.07589
cte_std,6.533397,6.26571,1.042723,0.297077,5.870499,1.11292,0.265743


### Comparison of the estimates

In [19]:
summary = pd.concat(
    [
        trueParameters['Value'], 
        r_serial_dynamic['Value'],
        r_wooldridge['Value']
    ], axis=1)
summary.columns = [
    'True', 
    'Dynamic + serial', 
    'Wooldridge',
]
summary.fillna('')

Unnamed: 0,True,Dynamic + serial,Wooldridge
coef_price,-0.1,-0.001006,-0.177973
beta_last_year,10.0,16.528013,5.657383
cte_mean,0.0,-7.864287,-15.717856
cte_std,50.0,9.2e-05,6.533397
coef_first_year,,,38.661564


The estimates of the coefficients `coef_price` and `beta_last_year` are now closer to their true values. We perform a $t$-test analysis to test the hypothesis that the value of each parameter is equal to its true value.

In [20]:
def t_test(param, true_value):
    return (
        (r_wooldridge.loc[param, 'Value'] - true_value) /
        r_wooldridge.loc[param, 'Rob. Std err']
    ) 

In [21]:
t_test('coef_price', -0.1)

-0.23885858718326936

In [22]:
t_test('beta_last_year', 10)

-1.6245292636344826

In [23]:
t_test('cte_mean', 0)

-1.775050560036097

In [24]:
t_test('cte_std', 50)

-7.404243862277832

Except for `cte_std`, the $t$-statistics are, in absolute value, below 1.96. It means that we cannot reject the null hypothesis that the value of the parameter is equal to its true value at the 5% significance level. It is important to realize from the relatively large standard errors that the precision of the estimates is not high. This is due to a lack of observations: we observe each individual for only 10 years. Nevertheless, the Wooldridge correction has made it possible to address the endogeneity issue.
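
The decisions above can be reproduced without Biogeme from the values and robust standard errors reported in the Wooldridge results table. The sketch below also converts each $t$-statistic into a two-sided $p$-value, assuming the standard normal approximation. The helper name `two_sided_p_value` is hypothetical.

```python
import math


def two_sided_p_value(t_stat):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


# (estimate, robust standard error, true value), copied from the
# Wooldridge results table and the table of true parameters.
estimates = {
    'coef_price': (-0.177973, 0.326439, -0.1),
    'beta_last_year': (5.657383, 2.673154, 10.0),
    'cte_mean': (-15.717856, 8.854878, 0.0),
    'cte_std': (6.533397, 5.870499, 50.0),
}

rejected = {}
for name, (value, rob_std_err, true_value) in estimates.items():
    t_stat = (value - true_value) / rob_std_err
    p_value = two_sided_p_value(t_stat)
    rejected[name] = p_value < 0.05  # 5% significance level
    print(f'{name}: t = {t_stat:.3f}, p = {p_value:.3g}')
```

Only `cte_std` leads to a rejection at the 5% level, in line with the comparison of the $t$-statistics with 1.96.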