
# PRACTICE FOR THE FINAL

---
---


## Education choices
We have a dataset coming from a discrete choice experiment on preferences for children's education. The researchers want to establish the effects of the cost of education, the foreign language used at school and the distance to the school.

The experiment presents households with several choice situations, each one on a different card. In each situation, the household has to decide between two alternatives. 
As in many choice experiments, the alternatives are used just to compare the effects of the attributes. Unlike, for example, a transportation choice with alternatives train, car and bus, here the alternatives do not encode information; we can just consider them as 'alternative A' and 'alternative B'.
**Alternative A should be indistinguishable from Alternative B if their attributes in the choice situation are equal.**

Each household may answer several choice situations; we keep track of each household using the variable *id* in the dataset.


---
---

# Description of the dataset

Survey variables
 * **choice:** The response variable (1= alternative A, 2=alternative B).
 * **id:** Household ID.

\

Attributes

 * **cost_A, cost_B:** Yearly cost in dollars.
 * **foreign_A, foreign_B:** Whether school uses a foreign language as the default language for all units (except when teaching the local language unit).
 * **distance_A, distance_B:** Distance to the school, in meters.

\
  
Socioeconomic characteristics

 * **male:**  1 if the child in the household is male, 0 otherwise.
 * **female:**  1 if the child in the household is female, 0 otherwise.
 * **parent_educ:** Head of the family education level (0=no formal education, 1=high school, 2=undergrad, 3=postgrad)
 

---
---

# Preparing the environment
*The preparation and dataset loading code is given to the students, you might modify it.*

In [1]:
!pip install biogeme

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Load the packages.

In [2]:
import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools
import biogeme.distributions as dist


---
---



# Load the datasets

*Auxiliary code is provided to load the dataset*


In [3]:

url = 'https://drive.google.com/file/d/1flpaR4wwM9DaToAo5urt7T1Kfgb7zUN3/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
educ_pd = pd.read_csv(path)


In [4]:
educ_pd.head(7)

Unnamed: 0,id,choice,cost_A,foreign_A,distance_A,cost_B,foreign_B,distance_B,male,female,parent_educ
0,5,1,201218,0,3248,409082,1,2463,1,0,2
1,8,1,106655,1,5375,204348,0,2508,0,1,1
2,8,2,302259,0,4349,205217,1,2113,0,1,1
3,8,2,307518,0,4188,406835,1,2975,1,0,2
4,9,1,208701,1,4218,104187,0,5247,0,1,1
5,10,2,204284,0,2427,103413,1,2850,0,1,1
6,10,2,302429,0,4202,208732,1,2165,0,1,2



___
___


# 1) Fit a multinomial logit model to act as reference model, using cost and distance as attributes.
*Hint: Do not consider alternative-specific constants.*

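# !!! We need to standardize the cost and distance variables; otherwise the Biogeme estimation misbehaves (probably a numerical issue, since the raw costs are in the hundreds of thousands)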

In [6]:
educ_pd['distance_A']=(educ_pd['distance_A']-educ_pd['distance_A'].mean())/educ_pd['distance_A'].std()
educ_pd['distance_B']=(educ_pd['distance_B']-educ_pd['distance_B'].mean())/educ_pd['distance_B'].std()
educ_pd['cost_A']=(educ_pd['cost_A']-educ_pd['cost_A'].mean())/educ_pd['cost_A'].std()
educ_pd['cost_B']=(educ_pd['cost_B']-educ_pd['cost_B'].mean())/educ_pd['cost_B'].std()
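
As a small illustration of the z-score transformation applied in the cell above, here is a plain-Python sketch on the seven raw `cost_A` values shown by `educ_pd.head(7)`. The helper name `standardize` is hypothetical. Because it only uses seven rows, the resulting numbers differ from the ones obtained on the full dataset; pandas `.std()` uses the sample standard deviation, as `statistics.stdev` does.

```python
import statistics

def standardize(values):
    """Return z-scores using the sample standard deviation (ddof=1), like pandas .std()."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

# Raw cost_A values from the seven rows displayed by educ_pd.head(7)
cost_A_head = [201218, 106655, 302259, 307518, 208701, 204284, 302429]
cost_A_z = standardize(cost_A_head)

# After the transformation the values have mean 0 and sample std 1
print([round(z, 3) for z in cost_A_z])
```

After this transformation each coefficient measures the effect of a one-standard-deviation change in the attribute, not of one dollar or one meter.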

For this one we give you a little hint:

In [7]:
bgm_edu = db.Database('edu', educ_pd)
globals().update(bgm_edu.variables)

B_cost = exp.Beta( 'B_cost', 0, None, None, 0)
B_dist = exp.Beta( 'B_dist', 0, None, None, 0)
B_fore = exp.Beta( 'B_fore', 0, None, None, 0)


V_A = B_cost*cost_A + B_dist*distance_A #+ B_fore*foreign_A
V_B = B_cost*cost_B + B_dist*distance_B #+ B_fore*foreign_B

V = {1: V_A ,
  2: V_B 
  }
av = {1: 1,
  2: 1
 }

logprob = models.loglogit (V , av , choice )
bgm_model = bio.BIOGEME ( bgm_edu, logprob )
bgm_model.modelName = 'my first multinomial logit'
results = bgm_model.estimate()



results.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
B_cost,-0.424985,0.047129,-9.017579,0.0
B_dist,-0.443359,0.037668,-11.770197,0.0


---

# 2) Use parents education as a characteristic interacting with cost.  Comment on the results (signs of the variables, interpretation of the interaction).

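Both `B_cost` and `B_dist` are negative, as expected: more expensive and more distant schools are less attractive. The interaction `B_pared` is positive (0.155), so parents' education makes the effect of the cost of the school less negative: the cost coefficient of a household is `B_cost + B_pared*parent_educ`. Note that the interaction is only significant at the 10% level (robust p-value 0.074).
The code: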

In [8]:

B_pared = exp.Beta( 'B_pared', 0, None, None, 0)
V_A_pared = B_cost*cost_A + B_dist*distance_A + B_pared*parent_educ*cost_A
V_B_pared = B_cost*cost_B + B_dist*distance_B + B_pared*parent_educ*cost_B

V_pared = {1: V_A_pared ,
  2: V_B_pared 
  }
av = {1: 1,
  2: 1
 }

logprob_pared = models.loglogit (V_pared , av , choice )
bgm_model_pared = bio.BIOGEME ( bgm_edu, logprob_pared )
bgm_model_pared.modelName = 'mnl_parent_educ_interaction'
results_pared = bgm_model_pared.estimate()



results_pared.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
B_cost,-0.616324,0.117327,-5.253026,1.4962e-07
B_dist,-0.444602,0.037792,-11.764434,0.0
B_pared,0.155342,0.086823,1.789167,0.07358788


---

# 3) Imagine that we model parents education as an additive characteristic (not an interaction, just adding to the utilities). However, in the context of our discrete choice experiment, this does not make sense. Explain why.
*Hint: No programming required but you might fit some models if it helps you.*



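A characteristic such as parents' education takes the same value for both alternatives in a choice situation. If it is added with the same coefficient to both utilities, it cancels out in the utility difference and has no effect on the choice probabilities. To have any effect it would need an alternative-specific coefficient, and that would differentiate alternatives A and B. In our context, A and B should not be differentiated: they represent the same concept and are only used to compare the effects of the attributes.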
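
The cancellation can be checked numerically. In this sketch the utilities and the additive coefficient are hypothetical values chosen only for illustration; the point is that adding the same term to both utilities leaves the logit probability unchanged.

```python
import math

def logit_prob_A(v_a, v_b):
    """Binary logit probability of alternative A."""
    return 1.0 / (1.0 + math.exp(v_b - v_a))

# Hypothetical attribute-based utilities and additive coefficient
v_a, v_b = 0.4, -0.2
b_educ = 0.7
parent_educ = 2

p_without = logit_prob_A(v_a, v_b)
# Same characteristic, same coefficient, added to both alternatives
p_with = logit_prob_A(v_a + b_educ * parent_educ, v_b + b_educ * parent_educ)

print(math.isclose(p_without, p_with))
```

Only utility differences matter in a logit model, so a generic additive characteristic is not identified.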

---

# 4) Create a new multinomial logit model, including foreign language and gender variables in the model in Exercise 1. Is this a better model than the outcome of Exercise 1? 
*Hint: Pay attention to the way the variables are included (adding, interaction, per-alternative, etc.)*

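This means that the foreign-language attribute can enter with a generic coefficient, as cost and distance do, while gender, being a characteristic, has to enter via an interaction with an attribute, not as an additive effect.
Similar to Question 2.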
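
To decide whether the extended model is better, one option is to compare an information criterion such as the AIC, which penalizes the extra parameters. The sketch below is not part of the original solution: the helper names are hypothetical and the log-likelihood values are placeholders, to be replaced by the final log-likelihoods reported in the estimation results of each model.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: lower is better."""
    return 2 * n_parameters - 2 * log_likelihood

def preferred_by_aic(candidates):
    """candidates maps a model name to (final log-likelihood, number of parameters)."""
    return min(candidates, key=lambda name: aic(*candidates[name]))

# Placeholder values for illustration only, NOT results from the dataset
candidates = {
    "exercise_1": (-1000.0, 2),   # B_cost, B_dist
    "exercise_4": (-995.0, 4),    # hypothetical: two extra coefficients
}
print(preferred_by_aic(candidates))
```

Since the model of Exercise 1 is nested in the extended one, a likelihood ratio test is an equally valid alternative.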

---

# 5) Fit a mixed logit model (no panel) using the specification of Exercise 1, consider that the effect of distance is random. Comment on the results (changes on all coefficients with respect to Exercise 1, variance of the random coefficient)

**You might use a small number of draws in the Monte Carlo simulation, or consider a smaller dataset if you run into problems.**

In [9]:
SIGMA_B_dist = exp.Beta('SIGMA_B_dist',0.01,0.00001,None,0)

In [10]:
EC_B_dist = SIGMA_B_dist * exp.bioDraws('EC_B_dist','NORMAL')

In [11]:
V_A_mx = B_cost*cost_A + (B_dist+ EC_B_dist)*distance_A #+ B_fore*foreign_A
V_B_mx = B_cost*cost_B + (B_dist+EC_B_dist)*distance_B #+ B_fore*foreign_B

V_mx = {1: V_A_mx ,
  2: V_B_mx 
  }

In [12]:
m_obsprob = models.logit(V_mx, av, choice)
m_condprobIndiv = None

In [13]:
#if (do_PANEL):
# m_condprobIndiv = exp.PanelLikelihoodTrajectory(m_obsprob)
#else:
m_condprobIndiv = m_obsprob

In [14]:
#condprobIndiv = exp.PanelLikelihoodTrajectory(obsprob)

In [15]:
m_logprob = exp.log(exp.MonteCarlo(m_condprobIndiv))

In [38]:

# Create the Biogeme object
m_biogeme  = bio.BIOGEME(bgm_edu,m_logprob,numberOfDraws=50, seed=1)


In [39]:
results_mixed = m_biogeme.estimate()



In [40]:
results_mixed.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
B_cost,-0.475964,0.062169,-7.655942,1.909584e-14
B_dist,-0.608177,0.110942,-5.481952,4.206578e-08
SIGMA_B_dist,0.751121,0.27617,2.719778,0.006532577


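### So the standard deviation of the random coefficient (`SIGMA_B_dist` = 0.75) is larger in magnitude than its mean (`B_dist` = -0.61): the effect of distance varies a lot across observations. Compared with Exercise 1, both mean coefficients become more negative (`B_cost` from -0.42 to -0.48, `B_dist` from -0.44 to -0.61).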
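
One consequence of a normally distributed coefficient is that part of the distribution has the "wrong" sign. A minimal sketch, assuming the normal distribution specified in the model and the estimates in the table above, computes the share of the distribution where the distance coefficient is positive:

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Estimates reported in the table above (Exercise 5)
B_DIST = -0.608177
SIGMA_B_DIST = 0.751121

# Probability mass of the random distance coefficient above zero
share_positive = 1.0 - normal_cdf(0.0, B_DIST, SIGMA_B_DIST)
print(round(share_positive, 3))
```

A large share here is a sign that the normal assumption may be too loose for this attribute; a distribution with a fixed sign (for example a lognormal on the negative of distance) is a common alternative, although it is not explored in this notebook.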

---

# 6) Consider the mixed logit with panel information about the household. Compare the results to Exercise 5.

In [19]:
educ_norm = educ_pd.copy()

In [20]:
educ_norm['distance_A']=(educ_norm['distance_A']-educ_norm['distance_A'].mean())/educ_norm['distance_A'].std()
educ_norm['distance_B']=(educ_norm['distance_B']-educ_norm['distance_B'].mean())/educ_norm['distance_B'].std()
educ_norm['cost_A']=(educ_norm['cost_A']-educ_norm['cost_A'].mean())/educ_norm['cost_A'].std()
educ_norm['cost_B']=(educ_norm['cost_B']-educ_norm['cost_B'].mean())/educ_norm['cost_B'].std()

In [21]:
#educ_norm = educ_norm.tail(230)
#educ_norm = educ_norm.drop(1534)

In [22]:
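# Note: this overwrites the household id with a unique row number, so every
# row would count as its own household. It is not used below: the panel
# database is built from educ_pd, which keeps the original household ids.
educ_norm['id'] = range(1, len(educ_norm) + 1)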

In [23]:
educ_norm

Unnamed: 0,id,choice,cost_A,foreign_A,distance_A,cost_B,foreign_B,distance_B,male,female,parent_educ
0,1,1,-0.637955,0,-0.389403,1.618537,1,-1.021295,1,0,2
1,2,1,-1.816811,1,1.565657,-0.196371,0,-0.983065,0,1,1
2,3,2,0.621659,0,0.622596,-0.188668,1,-1.318640,0,1,1
3,4,2,0.687220,0,0.474611,1.598618,1,-0.586321,1,0,2
4,5,1,-0.544669,1,0.502186,-1.084270,0,1.343873,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...
1537,1538,2,-1.832893,1,1.677795,-0.198410,0,-1.459667,1,0,3
1538,1539,1,-0.553146,1,0.386371,-1.068677,0,1.455165,1,0,3
1539,1540,2,0.662237,1,-0.530954,-1.112194,0,1.429679,0,1,1
1540,1541,1,-1.876563,1,1.350573,-0.180663,0,-1.437578,0,1,1


In [24]:
# educ_pd is already standardized (Exercise 1) and keeps the original household ids
bgm_edu_norm = db.Database('educ_norm', educ_pd)


In [25]:
bgm_edu_norm.panel('id')
#bgm_edu.panel('id')

In [26]:
globals().update(bgm_edu_norm.variables)

In [27]:
SIGMA_B_dist_pl = exp.Beta('SIGMA_B_dist',0.5,0,200,0)

In [28]:
EC_B_dist_pl = SIGMA_B_dist_pl * exp.bioDraws('EC_B_dist_pl','NORMAL')

In [29]:
V_A_mx_pl = B_cost*cost_A + (B_dist+ EC_B_dist_pl)*distance_A #+ B_fore*foreign_A
V_B_mx_pl = B_cost*cost_B + (B_dist+EC_B_dist_pl)*distance_B #+ B_fore*foreign_B

V_mx_pl = {1: V_A_mx_pl ,
  2: V_B_mx_pl 
  }

In [30]:
m_obsprob_pl = models.logit(V_mx_pl, av, choice)

In [31]:
condprobIndiv_pl = exp.PanelLikelihoodTrajectory(m_obsprob_pl)

In [32]:
m_logprob_pl = exp.log(exp.MonteCarlo(condprobIndiv_pl))

In [33]:

# Create the Biogeme object
m_biogeme_pl  = bio.BIOGEME(bgm_edu_norm,m_logprob_pl,numberOfDraws=50, seed=1)


In [34]:
results_mixed_pl = m_biogeme_pl.estimate()



In [35]:
results_mixed_pl.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
B_cost,-0.454608,0.057337,-7.928702,2.220446e-15
B_dist,-0.519994,0.05643,-9.214878,0.0
SIGMA_B_dist,0.485818,0.075582,6.427706,1.295442e-10
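
To compare with Exercise 5, the sketch below puts the two sets of estimates side by side and computes the spread of the random coefficient relative to its mean. The numbers are copied from the two result tables; the helper `relative_spread` is a hypothetical name.

```python
# (mean of B_dist, SIGMA_B_dist, robust std err of SIGMA_B_dist) from the two tables
estimates = {
    "Exercise 5 (no panel)": (-0.608177, 0.751121, 0.27617),
    "Exercise 6 (panel)": (-0.519994, 0.485818, 0.075582),
}

def relative_spread(mean, sigma):
    """Standard deviation of the random coefficient relative to the size of its mean."""
    return sigma / abs(mean)

spreads = {name: relative_spread(mean, sigma)
           for name, (mean, sigma, _) in estimates.items()}

for name, (mean, sigma, se) in estimates.items():
    print(f"{name}: sigma/|mean| = {spreads[name]:.2f}, std err of sigma = {se:.3f}")
```

With the panel structure the estimated standard deviation is smaller than the mean in magnitude and is estimated much more precisely, while `B_cost` barely changes between the two models.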


---

# 7) Calculate the willingness to pay for reducing the distance to the school, based on the results of Exercise 6.
*Hint: consider the distribution of the WTP, since the coefficient for distance is random.*
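
A minimal sketch of this calculation, assuming the estimates of Exercise 6: the WTP for a one-unit reduction in distance is the ratio `B_dist / B_cost`. Since `B_cost` is fixed and `B_dist` is normal, the WTP is also normally distributed. The variable names are for illustration only, and the result is in standardized units (standard deviations of cost per standard deviation of distance); converting to dollars per meter would require the standard deviations of the raw columns, which are not shown here.

```python
import random
import statistics

# Estimates reported in the table of Exercise 6
B_COST = -0.454608
B_DIST_MEAN = -0.519994
B_DIST_SD = 0.485818

# Closed form: a normal variable divided by a constant is still normal
wtp_mean = B_DIST_MEAN / B_COST
wtp_sd = B_DIST_SD / abs(B_COST)

# Simulation of the same distribution, e.g. to get the share of negative WTP
rng = random.Random(1)
draws = [rng.gauss(B_DIST_MEAN, B_DIST_SD) / B_COST for _ in range(100_000)]
share_negative = sum(d < 0 for d in draws) / len(draws)

print(round(wtp_mean, 3), round(wtp_sd, 3))
print(round(statistics.mean(draws), 3), round(share_negative, 3))
```

A negative WTP corresponds to the part of the distribution where the distance coefficient is positive, that is, households that the model describes as preferring a more distant school.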

# 8) The researchers want to test if there is some form of 'undesirable' systematic effect introduced in their survey method. In particular, if there are systematic differences between alternative A and alternative B (For example, because alternative A appears on top of the survey card, and B on the bottom). How would you test that? What would be the result of the test?

*Hint: You might want to compare some models to others.*

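This is solved by actually considering differences between alternatives A and B. We could, for example, add an alternative-specific constant (ASC) to one of the alternatives (only one ASC is identified with two alternatives) and compare that model to the reference one. If the ASC is significant and the model fits better, then there is some systematic effect.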
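
One way to carry out that comparison is a likelihood ratio test between the reference model and the same model with one extra constant, which gives one degree of freedom. The sketch below is an illustration, not the original solution: the function name is hypothetical and the log-likelihoods are placeholder values, to be replaced by the final log-likelihoods of the two estimated models.

```python
import math

def lr_test_one_df(loglik_restricted, loglik_unrestricted):
    """Likelihood ratio test with 1 degree of freedom.

    For a chi-square with 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    """
    stat = 2.0 * (loglik_unrestricted - loglik_restricted)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Placeholder log-likelihoods for illustration only, NOT results from the dataset
stat, p_value = lr_test_one_df(loglik_restricted=-1000.0, loglik_unrestricted=-998.0)
print(round(stat, 3), round(p_value, 4))
```

A small p-value would indicate a systematic preference for one of the two positions on the card; a large one is consistent with A and B being interchangeable.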

# 9) Based on the experiment, the government is thinking about increasing the number of schools in the area, with the idea of reducing the distance that the children have to travel. Assume that the rows of data in the survey represent the population. Building one school in the area would reduce the distance by 10%, two schools would reduce the distance by 15%, three schools would reduce the distance by 18%. Assuming a budget of 4 million dollars to build one school, how many schools should we build if we want to generate a net benefit for the population in less than 30 years?


*Hint: It can be 0, 1, 2 or 3, the net benefit can come from WTP*

This one is explained on ED
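
As a rough sketch of the arithmetic only (not the ED solution), one possible reading is: the yearly benefit of an option is the WTP per meter and year times the total distance saved, and the net benefit over the horizon is that benefit minus the building cost. The school cost, the distance reductions and the horizon come from the question; the WTP and total distance below are hypothetical placeholders that would have to be derived from Exercise 7 and the raw data.

```python
SCHOOL_COST = 4_000_000                                   # dollars per school (from the question)
DISTANCE_REDUCTION = {0: 0.0, 1: 0.10, 2: 0.15, 3: 0.18}  # from the question
HORIZON_YEARS = 30                                        # from the question

def net_benefit(n_schools, wtp_per_meter_year, total_distance_m):
    """Net benefit over the horizon, under the assumptions stated above."""
    yearly_benefit = wtp_per_meter_year * total_distance_m * DISTANCE_REDUCTION[n_schools]
    return HORIZON_YEARS * yearly_benefit - n_schools * SCHOOL_COST

# Hypothetical inputs for illustration only, NOT estimates from the dataset
wtp = 0.5                    # dollars per meter per year
total_distance = 5_000_000   # total meters travelled by the population

results = {n: net_benefit(n, wtp, total_distance) for n in DISTANCE_REDUCTION}
best = max(results, key=results.get)
print(results, best)
```

Because each additional school reduces the distance by less than the previous one while costing the same, the net benefit eventually decreases with the number of schools.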

# A1) Use a fixed effects model to estimate whether male-children households are less affected by distance than female-children households.

This one is explained on ED

# A2) In an alternative view of the dataset, we want to study the education level as the choice variable. Using as covariates the attributes of option 'A' and the characteristics, fit an ordered logit and comment on the results.

# A3) Continuing on the alternative view: Compare two different orders to the default order given in the dataset, and decide if they are better than the default.