Consider the estimation of the dynamic choice model with panel effects that was performed in the previous quiz and is reproduced below for reference. The objective of this exercise is to address the "initial condition problem" using Wooldridge's method.

In [1]:
import pandas as pd
import pickle
from urllib.request import urlopen
import biogeme.database as db
import biogeme.biogeme as bio
from biogeme import models
import biogeme.results as res
import biogeme.optimization as opt
import biogeme.messaging as msg
import biogeme.version as ver
from biogeme.expressions import (
    Beta,
    bioDraws,
    PanelLikelihoodTrajectory,
    MonteCarlo,
    log,
)

Version of Biogeme

In [2]:
print(ver.getText())

biogeme 3.2.8 [2021-09-02]
Version entirely written in Python
Home page: http://biogeme.epfl.ch
Submit questions to https://groups.google.com/d/forum/biogeme
Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)



The estimation of these models may take a while. If you want to run the notebook using pre-estimated results saved on file, set the number of draws below to `None`.

In [3]:
#numberOfDraws = 10000
numberOfDraws = None

In [4]:
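def get_results_from_url(file):
    """Load pre-estimated Biogeme results from a pickle file stored at `url_root`."""
    # `url_root` is defined in the next cell, before this function is first called.
    pickle_file = f'{url_root}{file}'

    with urlopen(pickle_file) as p:
        data = pickle.load(p)
    return res.bioResults(data)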

In [5]:
url_root = (
    'https://courses.edx.org/'
    'asset-v1:EPFLx+ChoiceModels2x+3T2021+type@asset+block@'
)

# Dynamic Choice Models with Panel Effects

We analyze again the smoking behavior of individuals as a function of their age and the price of tobacco, using synthetic data. We develop a model that predicts, for every year, the probability that an individual smokes.

## Data

We postulate a true model for the data generation process. It is a mixture of logit models where the utility associated with "not smoking" is  
\begin{equation}
U_{0nt}= \varepsilon_{0nt}
\end{equation}
 and the utility associated with "smoking" is 
\begin{equation}
U_{1nt}= \beta_{nt} y_{n,t-1} + \beta^p_{nt} P_{t} + c_n + \varepsilon_{1nt},
\end{equation}
where 

- $\beta_{nt} = 10$,

- $y_{n,t-1}=1$ if $n$ is smoking at time $t-1$, $0$ otherwise,

- $\beta^p_{nt} = -0.1$,

- $P_t$ is the price of cigarettes at time $t$,

- $c_n$ is an individual specific constant that captures the a priori, intrinsic attraction of each individual towards smoking. It is assumed to be normally distributed in the population, with zero mean and standard deviation 50: $N(0, 50^2)$.
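The data generation process described above can be sketched in a few lines of plain Python. This is only an illustration, not the code that produced the data set: the function name `simulate_individual`, the price path (2% growth per year from an arbitrary starting price) and the initial state (non-smoker before the first simulated year) are assumptions made for the example.

```python
import math
import random


def logistic(v):
    """Numerically stable logistic function 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def simulate_individual(rng, prices, beta=10.0, beta_price=-0.1, sigma=50.0):
    """Simulate one smoking trajectory under the postulated model.

    prices: list of prices P_t, one per year (hypothetical input).
    The state before the first year is assumed to be "not smoking".
    """
    c_n = rng.gauss(0.0, sigma)  # individual specific constant c_n
    last_year = 0
    trajectory = []
    for price in prices:
        # Deterministic part of the utility of "smoking".
        v_smoke = beta * last_year + beta_price * price + c_n
        # Binary logit: the utility of "not smoking" is normalized to 0.
        smoking = 1 if rng.random() < logistic(v_smoke) else 0
        trajectory.append((price, last_year, smoking))
        last_year = smoking
    return c_n, trajectory


rng = random.Random(42)
# Hypothetical price path: 2% growth per year over 11 years.
prices = [25.0 * 1.02 ** t for t in range(11)]
c_n, trajectory = simulate_individual(rng, prices)
```

Each tuple in `trajectory` holds the price, the lagged choice and the current choice, which mirrors the `Price`, `LastYear` and `Smoking` columns used later in the notebook.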





In [6]:
logger = msg.bioMessage()
logger.setGeneral()

## True value of the parameters

In [7]:
trueParameters = pd.DataFrame(
    {'Value': [
        -0.1, 
        10,
        0,
        50
    ]
    }, 
    index=[
        'coef_price', 
        'beta_last_year',
        'cte_mean',
        'cte_std'
    ]
)

## Data

We observe every individual only from the age of 45 to the age of 55.

In [8]:
df = pd.read_table(f'{url_root}smoking55.dat', sep=',')
df

Unnamed: 0,Age,Price,Smoking,LastYear,Id,Smoking45
0,45,25.870704,1,1,0,1
1,46,26.388118,1,1,0,1
2,47,26.915880,1,1,0,1
3,48,27.454198,1,1,0,1
4,49,28.003282,1,1,0,1
...,...,...,...,...,...,...
10995,51,29.134614,1,1,999,1
10996,52,29.717307,1,1,999,1
10997,53,30.311653,1,1,999,1
10998,54,30.917886,1,1,999,1


In [9]:
df['Age'].unique()

array([45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55])

The data contains the following columns:

- the age of the individual,
- the price of cigarettes,
- a variable that is 1 if the individual is smoking, 0 otherwise,
- a variable that is 1 if the individual was smoking last year, 0 otherwise,
- a unique id for each individual,
- a variable that is 1 if the individual was smoking at the age of 45, at the beginning of the observation period, 0 otherwise.

In [10]:
database = db.Database('smoking55', df)
globals().update(database.variables)

We declare that the data set contains panel data.

In [11]:
database.panel('Id')

## Estimation procedure

The following procedure estimates the choice model and returns the estimated parameters as a pandas data frame. If `numberOfDraws` is `None`, the model is not estimated again: the pre-estimated results are read from the pickle file and reported.

In [12]:
def estimate(the_logprob, the_name, the_database):
    if numberOfDraws is None:
        pickle_file = f'{the_name}.pickle'
        results = get_results_from_url(pickle_file)
    else:
        biogeme = bio.BIOGEME(
            the_database,
            the_logprob,
            numberOfDraws=numberOfDraws,
            removeUnusedVariables=False,
        )
        biogeme.modelName = the_name
        results = biogeme.estimate(algorithm=opt.bioBfgs)
    print(results.printGeneralStatistics())
    pandas_results = results.getEstimatedParameters()
    return pandas_results


## Dynamic model with serial correlation

In the previous quiz, we estimated a dynamic model with panel effects to account for serial correlation.

In [13]:
cte_mean = Beta('cte_mean', 0, None, None, 0)
cte_std = Beta('cte_std', 1, None, None, 0)
cte = cte_mean + cte_std * bioDraws('agent', 'NORMAL_ANTI')
coef_price = Beta('coef_price', 0, None, None, 0)
beta_last_year = Beta('beta_last_year', 0, None, None, 0)

In [14]:
V_s = beta_last_year * LastYear + coef_price * Price + cte
V_ns = 0
V = {0: V_ns, 1: V_s}
obsprob = models.logit(V, None, Smoking)
condprobIndiv = PanelLikelihoodTrajectory(obsprob)
logprob = log(MonteCarlo(condprobIndiv))

In [15]:
%%time
r_serial_dynamic = estimate(logprob, 
                            'dynamic_model_serial_truncated_t_55',
                            database)
r_serial_dynamic

Number of estimated parameters:	4
Sample size:	1000
Observations:	11000
Excluded observations:	0
Init log likelihood:	-740.373
Final log likelihood:	-27.43082
Likelihood ratio test for the init. model:	1425.884
Rho-square for the init. model:	0.963
Rho-square-bar for the init. model:	0.958
Akaike Information Criterion:	62.86164
Bayesian Information Criterion:	82.49266
Final gradient norm:	3.6782E-03
Number of draws:	10000
Draws generation time:	0:00:12.691483
Types of draws:	['agent: NORMAL_ANTI']
Nbr of threads:	36

CPU times: user 23.7 ms, sys: 3.06 ms, total: 26.7 ms
Wall time: 230 ms


Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
beta_last_year,16.528013,1.224384,13.499048,0.0,1.571728,10.51582,0.0
coef_price,-0.001006,0.13284,-0.007574,0.993957,0.069741,-0.014427,0.988489
cte_mean,-7.864287,4.399341,-1.787606,0.07384,2.374923,-3.311386,0.000928
cte_std,9.2e-05,0.703775,0.000131,0.999896,9e-05,1.018902,0.30825


### Comparison of the estimates

In [16]:
summary = pd.concat(
    [
        trueParameters['Value'], 
        r_serial_dynamic['Value'], 

    ], 
    axis=1
)
summary.columns = [
    'True', 
    'Dynamic + serial'
]
summary.fillna('')

Unnamed: 0,True,Dynamic + serial
coef_price,-0.1,-0.001006
beta_last_year,10.0,16.528013
cte_mean,0.0,-7.864287
cte_std,50.0,9.2e-05


This illustrates the "initial condition problem". Although the model specification is correct (it is the same model as the data generation process), the values of the parameters are not correctly recovered. The reason is that the first observed choice, that is, whether an individual is smoking at the age of 45, is strongly correlated with the agent effect $c_n$. This creates endogeneity. Visible consequences are a price coefficient that is estimated close to zero and is not significant (true value: $-0.1$), an overestimated coefficient of the lagged choice, and an estimated standard deviation of the agent effect that collapses to almost zero (true value: 50).
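The correlation between the first observed choice and the agent effect can be illustrated with a small simulation in plain Python. It is a sketch under stated assumptions, not the procedure used to build the data set: the length of the unobserved history, the price path and the initial state are hypothetical. Individuals are simulated for some years before observation starts, and the average agent effect is then compared between those who smoke and those who do not at the first observed age.

```python
import math
import random
import statistics


def logistic(v):
    """Numerically stable logistic function 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


rng = random.Random(0)
beta, beta_price, sigma = 10.0, -0.1, 50.0  # true values of the parameters
n_individuals = 2000
years_before_observation = 20  # hypothetical length of the unobserved history

# Agent effects grouped by the first observed choice (0: not smoking, 1: smoking).
effects = {0: [], 1: []}
for _ in range(n_individuals):
    c_n = rng.gauss(0.0, sigma)
    smoking = 0  # assumed state at the start of the unobserved history
    for t in range(years_before_observation + 1):
        price = 20.0 * 1.02 ** t  # hypothetical price path
        v = beta * smoking + beta_price * price + c_n
        smoking = 1 if rng.random() < logistic(v) else 0
    # `smoking` is now the first observed choice.
    effects[smoking].append(c_n)

mean_smokers = statistics.mean(effects[1])
mean_non_smokers = statistics.mean(effects[0])
print(f'Mean agent effect, smokers at first observation:     {mean_smokers:.1f}')
print(f'Mean agent effect, non-smokers at first observation: {mean_non_smokers:.1f}')
```

In this sketch the mean agent effect is positive among initial smokers and negative among initial non-smokers. Treating the first observed lagged choice as exogenous therefore ignores information about $c_n$.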

Estimate the parameters using Wooldridge's method to address the endogeneity issue.
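
As a hint, here is a sketch of the idea in its simplest form; the exact specification expected in the solution may differ. Wooldridge's approach models the agent effect conditionally on the first observed choice. Writing $y_{n0}$ for the choice at the age of 45 (the variable `Smoking45`), and introducing the additional parameters $\bar{c}$, $a$ and $\sigma$ (notation assumed here, not taken from the quiz), the agent effect becomes
\begin{equation}
c_n = \bar{c} + a \, y_{n0} + \sigma \xi_n, \qquad \xi_n \sim N(0, 1),
\end{equation}
so that the utility associated with "smoking" is
\begin{equation}
U_{1nt}= \beta_{nt} y_{n,t-1} + \beta^p_{nt} P_{t} + \bar{c} + a \, y_{n0} + \sigma \xi_n + \varepsilon_{1nt}.
\end{equation}
In the specification above, this amounts to adding one term, a new parameter multiplied by `Smoking45`, to the expression `cte`.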