<a href="https://colab.research.google.com/github/pmontman/tmp_choicemodels/blob/main/nb/tutorials/WK_08_tuto_panel_mixed_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 8: Continuing with the panel data and mixed logit

We continue with the tomato sauce (catsup) dataset and show how to put random effects on the coefficients, as opposed to the 'agent' effect (random alternative-specific constants).




The dataset covers two well-known brands of catsup (Heinz and Hunts) and three different package sizes (28, 32 and 41).

A description of the dataset can be found [here](https://www.tandfonline.com/doi/pdf/10.1080/07350015.1994.10524547?casa_token=r4LpjVvgDW4AAAAA:FVG8mEexsQ37tJ2bvk7oxZZ9K_jvvMJ2WxglLzBaHQD0_0REkXmKGsPPxXw_LRGwN3YHY8-L-k8U).

# Description of the dataset

* **id**: household identifier.
* **choice**: one of heinz41, heinz32, heinz28, hunts32.
* **disp_x**: is there a display for brand x?
* **feat_x**: is there a newspaper feature advertisement for brand x?
* **price_x**: price of brand x.

---

# Preparing the environment
*The preparation and dataset loading code is given to the students*

In [1]:
!pip install biogeme

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting biogeme
  Downloading biogeme-3.2.10.tar.gz (1.8 MB)
Building wheels for collected packages: biogeme
  Building wheel for biogeme (setup.py) ... done
  Created wheel for biogeme: filename=biogeme-3.2.10-cp37-cp37m-linux_x86_64.whl size=4253510 sha256=3006dc4dd8550f65447d7913d2a2d3e5679141c40c05b782b7f2ab5906c12356
  Stored in directory: /root/.cache/pip/wheels/5b/92/9b/63caa7ad9b2cd582de77d3701d10f7e8d041466f4a9d07d554
Successfully built biogeme
Installing collected packages: biogeme
Successfully installed biogeme-3.2.10


Load the packages; feel free to change the aliases.

In [2]:
import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools
import biogeme.distributions as dist

# Load the dataset

In [3]:
path = 'https://raw.githubusercontent.com/pmontman/pub-choicemodels/main/data/catsup.csv'
catsup_pd = pd.read_csv(path)


Notice the `id` variable, which identifies each household: we have data from many choice situations for each household, and the number of observations differs across households.

In [4]:
catsup_pd.head(7)

Unnamed: 0,id,disp_heinz41,disp_heinz32,disp_heinz28,disp_hunts32,feat_heinz41,feat_heinz32,feat_heinz28,feat_hunts32,price_heinz41,price_heinz32,price_heinz28,price_hunts32,choice
0,1,0,0,0,0,0,0,0,0,4.6,3.7,5.2,3.4,heinz28
1,1,0,0,0,0,0,0,0,0,4.6,4.3,5.2,4.4,heinz28
2,1,0,0,0,0,0,1,0,0,4.6,2.5,4.6,4.8,heinz28
3,1,0,0,0,0,0,0,0,0,4.6,3.7,5.2,3.4,heinz28
4,1,0,0,0,0,0,0,1,0,4.6,3.0,4.6,4.8,heinz28
5,1,0,0,0,0,0,0,0,0,5.0,3.0,4.7,3.0,heinz28
6,1,0,0,0,1,0,0,0,1,5.1,3.1,4.6,4.1,heinz28


# Auxiliary function

In [5]:
def qbus_update_globals_bgm(pd_df):
   globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)

# Data cleaning: Preparing the dataset for Biogeme

Encode the choice variable (a string) as integers with the `factorize` function.
We then look at the codetable to see how the integers map to the alternatives. The position in the codetable gives the code:
0: heinz28, 1: heinz41 and so on.

In [6]:
catsup_pd['choice'], codetable = catsup_pd['choice'].factorize()

In [7]:
codetable

Index(['heinz28', 'heinz41', 'heinz32', 'hunts32'], dtype='object')

In [8]:
catsup_pd['choice']

0       0
1       0
2       0
3       0
4       0
       ..
2793    1
2794    1
2795    3
2796    0
2797    0
Name: choice, Length: 2798, dtype: int64

# Random coefficients (the Betas)

We will showcase how to make the betas 'random', not only the ASCs.

In [9]:

# Define level of verbosity
import biogeme.messaging as msg
logger = msg.bioMessage()
logger.setSilent()


In [10]:
database = db.Database("catsup", catsup_pd)

We now declare the coefficients of our model. We keep the model simple:
just the alternative-specific constants and the variables disp, feat and price.

In [11]:
ASC_heinz41 = exp.Beta('ASC_heinz41',0,None,None,0)
ASC_heinz32 = exp.Beta('ASC_heinz32',0,None,None,0)
ASC_heinz28 = exp.Beta('ASC_heinz28',0,None,None,0)
ASC_hunts32 = exp.Beta('ASC_hunts32',0,None,None,1)

B_disp = exp.Beta('B_disp',0,None,None,0)
B_feat = exp.Beta('B_feat',0,None,None,0)
B_price = exp.Beta('B_price',0,None,None,0)

We create the parameter that will be the standard deviation, and the error component built from it.

In [12]:

SIGMA_price = exp.Beta('SIGMA_price',1,0,None,0)
EC_price = SIGMA_price * exp.bioDraws('EC_price','NORMAL')


**Pay attention to the utility specification**: We add the error component to the coefficient (the Beta) that multiplies the variable. The interpretation is then that `B_price` will be the mean of the random Beta, and `SIGMA_price` will be its standard deviation.
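
Written out, the specification above amounts to the following, where $\sigma_{price}$ stands for `SIGMA_price` and $\xi$ for the standard normal draw produced by `exp.bioDraws('EC_price','NORMAL')` (the symbols $\sigma_{price}$ and $\xi$ are notation introduced here for illustration):

$$
\beta_{price} = B_{price} + \sigma_{price}\,\xi, \qquad \xi \sim N(0, 1) \quad\Longrightarrow\quad \beta_{price} \sim N\left(B_{price},\ \sigma_{price}^2\right)
$$

This is why the estimate of `B_price` reads as the mean and the estimate of `SIGMA_price` as the standard deviation of the price coefficient.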

In [13]:
globals().update(database.variables)
# Definition of the utility functions
V_heinz41 = ASC_heinz41 + B_disp *disp_heinz41 + B_feat * feat_heinz41 + (B_price + EC_price) * price_heinz41 
V_heinz32 = ASC_heinz32 + B_disp *disp_heinz32 + B_feat * feat_heinz32 + (B_price  + EC_price) * price_heinz32 
V_heinz28 = ASC_heinz28 + B_disp *disp_heinz28 + B_feat * feat_heinz28 + (B_price + EC_price) * price_heinz28 
V_hunts32 = ASC_hunts32 + B_disp *disp_hunts32 + B_feat * feat_hunts32 + (B_price + EC_price) * price_hunts32 

The rest of the process follows as usual.

In [14]:
# Associate utility functions with the numbering of alternatives
V = {0: V_heinz28,
     1: V_heinz41,
     2: V_heinz32,
     3: V_hunts32}

av = {0: 1,
     1: 1,
     2: 1,
     3: 1}

# Conditional on the random variables, the likelihood of one observation is
# given by the logit model (called the kernel)
obsprob = models.logit(V,av, choice)


The difference from the usual declaration of the MNL comes now.

We have to do two new steps:

1. Tell biogeme to consider the panel nature of the data.
2. Tell biogeme to calculate the choice probabilities by simulation. This is how we deal with the random parameters: we draw from their distribution, and then we average the likelihood over those draws.

The code below covers Step 2 only: we take the model and wrap it in the expression `exp.MonteCarlo`. The final `exp.log` takes the logarithm, giving the log-likelihood.

In [15]:
logprob = exp.log(exp.MonteCarlo(obsprob))
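
To make the idea of simulation concrete, here is a minimal sketch of what averaging the logit kernel over draws means. It is not Biogeme's implementation: the helper names `logit_prob` and `simulated_prob` are hypothetical, the utilities contain only the price term, and the values -1.4 and 0.3 are illustrative rather than estimates.

```python
import math
import random

def logit_prob(utilities, chosen):
    """Plain logit probability of the chosen alternative."""
    m = max(utilities)  # subtract the max for numerical stability
    expu = [math.exp(u - m) for u in utilities]
    return expu[chosen] / sum(expu)

def simulated_prob(prices, chosen, b_price, sigma_price, n_draws, seed):
    """Average the logit kernel over draws of the random price coefficient."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # One draw of the random coefficient: mean plus sd times a N(0, 1) draw
        beta = b_price + sigma_price * rng.gauss(0.0, 1.0)
        # Price-only utilities (an illustrative simplification)
        utilities = [beta * p for p in prices]
        total += logit_prob(utilities, chosen)
    return total / n_draws

# Prices from the first row of the dataset: heinz41, heinz32, heinz28, hunts32
prices = [4.6, 3.7, 5.2, 3.4]
probs = [simulated_prob(prices, j, -1.4, 0.3, 500, seed=1) for j in range(4)]
print(probs, sum(probs))
```

Because the same seed is reused for every alternative, the four simulated probabilities are computed from the same draws and add up to one. With `sigma_price` set to 0 the function reduces to the plain logit.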

Because we are using simulation, we have to tell biogeme how many draws from the distribution to generate. The more draws, the more accurate the estimation, but the higher the computational cost.

We also set a seed, so we get the same results if we run the notebook again (setting a seed is a good habit in general).
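
The effect of the seed and of the number of draws can be illustrated with a small standalone sketch. It does not use Biogeme; the function `mc_estimate` and the values -1.4 and 0.3 are assumptions chosen only for illustration.

```python
import math
import random

def mc_estimate(n_draws, seed):
    """Monte Carlo estimate of E[1 / (1 + exp(-beta))], beta ~ N(-1.4, 0.3**2)."""
    rng = random.Random(seed)
    draws = (rng.gauss(-1.4, 0.3) for _ in range(n_draws))
    return sum(1.0 / (1.0 + math.exp(-b)) for b in draws) / n_draws

# Same seed, same draws, same estimate
same = mc_estimate(100, seed=1) == mc_estimate(100, seed=1)
print(same)

# Spread of the estimate across 20 different seeds, for few and many draws
spreads = {}
for n in (100, 10_000):
    estimates = [mc_estimate(n, seed=s) for s in range(20)]
    spreads[n] = max(estimates) - min(estimates)
print(spreads)
```

The spread across seeds is a rough measure of simulation noise: it shrinks as the number of draws grows, at the price of a longer run time.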

In [16]:

# Create the Biogeme object
biogeme  = bio.BIOGEME(database,logprob,numberOfDraws=100, seed=1)




Estimation and results as usual.

In [17]:

# Estimate the parameters. 
results = biogeme.estimate()

We take a look at the results.

In [18]:
results.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_heinz28,2.447676,0.100291,24.405795,0.0
ASC_heinz32,1.518097,0.073225,20.73194,0.0
ASC_heinz41,1.370638,0.120304,11.393091,0.0
B_disp,0.887885,0.106157,8.36391,0.0
B_feat,0.919989,0.125645,7.322112,2.44027e-13
B_price,-1.436144,0.092167,-15.582061,0.0
SIGMA_price,0.282898,0.310394,0.911416,0.3620763


# Analyzing the random parameters

So we have recovered that $\beta_{price}$ is a random variable with mean -1.44 and standard deviation $\sigma$ of 0.28.
We can 'ask questions' to interpret what this actually means. These questions ask us to calculate something about the distribution, such as the proportion of individuals in the population whose coefficient falls in a given range.



In [19]:
from scipy.stats import norm

mean_B_price = results.getBetaValues()['B_price']
sd_B_price = results.getBetaValues()['SIGMA_price']
mean_B_price, sd_B_price

(-1.4361441140666178, 0.2828982558168295)

# What is the proportion of people in the population that have a positive utility relationship with price?
Even though the mean of $\beta_{price}$ is negative, because it is random *some* people in the population may have a positive value for their $B_{price}$. We can calculate this easily using the normal cumulative distribution function (CDF).
Remember that CDF(x) is the probability that a value from that distribution is < x.
In our case, we want to calculate the probability that a Gaussian with mean -1.44 and sd 0.28 is smaller than 0.
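
As a hand check, the same quantity can be obtained by standardizing, using the rounded estimates $\mu = -1.436$ and $\sigma = 0.283$ from the results table, with $\Phi$ the standard normal CDF:

$$
P(\beta_{price} > 0) = 1 - \Phi\left(\frac{0 - \mu}{\sigma}\right) = 1 - \Phi\left(\frac{1.436}{0.283}\right) \approx 1 - \Phi(5.08) \approx 1.9 \times 10^{-7}
$$

Zero lies about five standard deviations above the mean, so the positive share is tiny.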

In [20]:
norm.cdf(   0, loc=mean_B_price, scale=sd_B_price)

0.9999998078143392

So a proportion of about 0.9999998 is negative; the proportion that has positive values for $B_{price}$ is

In [21]:
1- norm.cdf(   0, loc=mean_B_price, scale=sd_B_price)

1.921856608300132e-07

About 2 in 10 million in our population prefer to buy more expensive tomato sauces, all other things being equal.

# Comparing mixed specification vs the fixed effects specification

We can also test whether making $B_{price}$ random is significantly better. We compare the likelihoods of the two models with a likelihood ratio test; the random specification has one additional parameter (the sigma for the price).

In [22]:
V_heinz41 = ASC_heinz41 + B_disp *disp_heinz41 + B_feat * feat_heinz41 + B_price * price_heinz41 
V_heinz32 = ASC_heinz32 + B_disp *disp_heinz32 + B_feat * feat_heinz32 + B_price  * price_heinz32 
V_heinz28 = ASC_heinz28 + B_disp *disp_heinz28 + B_feat * feat_heinz28 + B_price * price_heinz28 
V_hunts32 = ASC_hunts32 + B_disp *disp_hunts32 + B_feat * feat_hunts32 + B_price  * price_hunts32 

In [23]:
# Associate utility functions with the numbering of alternatives
V = {0: V_heinz28,
     1: V_heinz41,
     2: V_heinz32,
     3: V_hunts32}

av = {0: 1,
     1: 1,
     2: 1,
     3: 1}

# With fixed coefficients there are no random variables to integrate out, so
# the log-likelihood of one observation is the log of the logit model directly
obsprob = models.loglogit(V,av, choice)


In [24]:
biogeme_fixed  = bio.BIOGEME(database,obsprob)
results_fixed = biogeme_fixed.estimate()
results_fixed.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_heinz28,2.425967,0.090695,26.748617,0.0
ASC_heinz32,1.50124,0.063947,23.476179,0.0
ASC_heinz41,1.353701,0.114963,11.775122,0.0
B_disp,0.875592,0.102257,8.562671,0.0
B_feat,0.908576,0.120177,7.560339,4.019007e-14
B_price,-1.40241,0.056095,-25.000724,0.0


In [25]:
def qbus_likeli_ratio_test_bgm(results_complex, results_reference, signif_level):
  return tools.likelihood_ratio_test( (results_complex.data.logLike, results_complex.data.nparam),
                                     (results_reference.data.logLike, results_reference.data.nparam), signif_level)

In [26]:
results.getGeneralStatistics()    

{'Number of estimated parameters': GeneralStatistic(value=7, format=''),
 'Sample size': GeneralStatistic(value=2798, format=''),
 'Excluded observations': GeneralStatistic(value=0, format=''),
 'Init log likelihood': GeneralStatistic(value=-3760.51597503762, format='.7g'),
 'Final log likelihood': GeneralStatistic(value=-2517.6554594847075, format='.7g'),
 'Likelihood ratio test for the init. model': GeneralStatistic(value=2485.721031105825, format='.7g'),
 'Rho-square for the init. model': GeneralStatistic(value=0.33050265543426627, format='.3g'),
 'Rho-square-bar for the init. model': GeneralStatistic(value=0.32864120874810243, format='.3g'),
 'Akaike Information Criterion': GeneralStatistic(value=5049.310918969415, format='.7g'),
 'Bayesian Information Criterion': GeneralStatistic(value=5090.867540055993, format='.7g'),
 'Final gradient norm': GeneralStatistic(value=0.00983899831276705, format='.4E'),
 'Number of draws': GeneralStatistic(value=100, format=''),
 'Draws generation ti

In [27]:
results_fixed.getGeneralStatistics()  

{'Number of estimated parameters': GeneralStatistic(value=6, format=''),
 'Sample size': GeneralStatistic(value=2798, format=''),
 'Excluded observations': GeneralStatistic(value=0, format=''),
 'Init log likelihood': GeneralStatistic(value=-3082.5573069162037, format='.7g'),
 'Final log likelihood': GeneralStatistic(value=-2517.8772496264055, format='.7g'),
 'Likelihood ratio test for the init. model': GeneralStatistic(value=1129.3601145795965, format='.7g'),
 'Rho-square for the init. model': GeneralStatistic(value=0.1831855829647836, format='.3g'),
 'Rho-square-bar for the init. model': GeneralStatistic(value=0.18123914713160771, format='.3g'),
 'Akaike Information Criterion': GeneralStatistic(value=5047.754499252811, format='.7g'),
 'Bayesian Information Criterion': GeneralStatistic(value=5083.374460184164, format='.7g'),
 'Final gradient norm': GeneralStatistic(value=0.003635318296739464, format='.4E'),
 'Nbr of threads': GeneralStatistic(value=2, format='')}

In [28]:
qbus_likeli_ratio_test_bgm(results, results_fixed, 0.05)

LRTuple(message='H0 cannot be rejected at level 5.0%', statistic=0.4435802833959315, threshold=3.841458820694124)
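
The numbers in this result can be reproduced by hand from the final log likelihoods printed above. The sketch below uses only the Python standard library; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

# Final log likelihoods and parameter counts from the general statistics above
ll_mixed, k_mixed = -2517.6554594847075, 7
ll_fixed, k_fixed = -2517.8772496264055, 6

# Likelihood ratio statistic: twice the gain in log likelihood
lr_statistic = 2 * (ll_mixed - ll_fixed)
dof = k_mixed - k_fixed  # one extra parameter: SIGMA_price

# For one degree of freedom, a chi-square variable is a squared standard
# normal, so its 95% quantile is the squared 97.5% normal quantile.
assert dof == 1
threshold = NormalDist().inv_cdf(1 - 0.05 / 2) ** 2

reject_h0 = lr_statistic > threshold
print(lr_statistic, threshold, reject_h0)
```

The statistic is far below the threshold, so the data do not support a random price coefficient over a fixed one. This agrees with the estimate of `SIGMA_price`, which is not significantly different from zero (robust p-value 0.36).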

# Panel data: Advanced stuff (not implemented)

We have seen that we can specify which variable has the 'panel' or 'grouping' information, for example, we can use the `id` variable to 

# Exercise: What is the proportion of households that react negatively to displaying the catsup product?