<a href="https://colab.research.google.com/github/pmontman/tmp_choicemodels/blob/main/nb/tutorials/solutions/WK_06_sol_tuto_ordered_logit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SOLUTION Tutorial 6: Ordered Logit

We will see an example on how to use biogeme to estimate ordered logit models.

We will use the (in)famous 'affairs' dataset, described below.

The data comes from a survey from *Psychology Today*, conducted in the year 1969. It describes 'extramarital affairs', specifically, the participants in 
the survey were asked this question:

How often engaged in extramarital sexual intercourse during the past year?

With the possible answers:

* None
* Once
* Twice
* Three times
* 4 to 10 times
* Monthly
* Weekly
* Daily

Then by measuring socioeconomic variables of the participants, we could try to model the utility that they receive from having an extramarital affair (or more than one).

We can see that this falls perfectly into the ordered logit, the alternatives have a clear notion of order, from less affairs per year to more affairs per year, yet they are not perfectly numerical (which would allow us to use standard regression techniques).


# Description of the dataset

A scientific paper making use of the dataset to develop a 'Theory of Extramarial Affairs' can be found [here](https://www.uibk.ac.at/econometrics/data/fair78.pdf).

The dataset has 601 observations of the following variables:

* **affairs:** The answer to the survey, the answers are encoded with numbers.
  0 = None, 1 = Once, 2 = Twice, 3=Three times, 7= 4 to 10 times, 12= monthly, 12 = weekly, 12 = daily.
  As we see, the information was encoded in such a way that we lose information about the more frequent answers, and the encoded numbers do not completely coincide with the frequencies. However, there is an ordinal relationship among the possible answer, from less frequent to more frequent. Several categories overlap in how the are described (from monthly to daily).

*  **gender:** Categorical variable indicating either male or female among the participants.

* **age:** Numeric variable coding age in years: 17.5 = under 20, 22 = 20–24, 27 = 25–29, 32 = 30–34, 37 = 35–39, 42 = 40–44, 47 = 45–49, 52 = 50–54, 57 = 55 or over. 

* **yearsmarried:** Numeric variable coding number of years married: 0.125 = 3 months or less, 0.417 = 4–6 months, 0.75 = 6 months–1 year, 1.5 = 1–2 years, 4 = 3–5 years, 7 = 6–8 years, 10 = 9–11 years, 15 = 12 or more years.

* **children:** Categorical variable indicating if there are children in the marriage.

* **religiousness:** Categorical variable indicating how religious in the person, encoded as numbers: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very.

* **education:**: Categorical variable indicating the level of education. Encoded as numbers: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master's degree, 20 = Ph.D., M.D., or other advanced degree.

* **occupation:** Categorical variable classifying the profession of the individual. Encoded as numbers, and the meaning of the numbers has been somewhat lost in time. But it could be something like the one in this [link.](https://dictionary.fitbir.nih.gov/portal/publicData/dataElementAction!view.action?dataElementName=HollingsheadJobClassCat&publicArea=true)

* **rating:** Categorical variable indicating how happy they are with the marriage. Encoded as numbers: 1 = very unhappy, 2 = somewhat unhappy, 3 = average, 4 = happier than average, 5 = very happy.

---
---

# Preparing the environment
*The preparation and dataset loading code is given to the students*

In [2]:
!pip install biogeme

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting biogeme
  Downloading biogeme-3.2.10.tar.gz (1.8 MB)
[K     |████████████████████████████████| 1.8 MB 4.1 MB/s 
[?25hBuilding wheels for collected packages: biogeme
  Building wheel for biogeme (setup.py) ... [?25l[?25hdone
  Created wheel for biogeme: filename=biogeme-3.2.10-cp37-cp37m-linux_x86_64.whl size=4253688 sha256=57a06600f26fe5f0967999e350c84cc6bcb3553729dbc6277ddb7d1628886d38
  Stored in directory: /root/.cache/pip/wheels/5b/92/9b/63caa7ad9b2cd582de77d3701d10f7e8d041466f4a9d07d554
Successfully built biogeme
Installing collected packages: biogeme
Successfully installed biogeme-3.2.10


Load the packages, feel free to change the names.

In [3]:
import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools
import biogeme.distributions as dist

# Load the dataset

In [4]:
path = 'https://raw.githubusercontent.com/pmontman/pub-choicemodels/main/data/affairs.csv'
affairs_pd = pd.read_csv(path)

A simple look at the dataset.

In [5]:
affairs_pd.head(5)

Unnamed: 0,affairs,gender,age,yearsmarried,children,religiousness,education,occupation,rating
0,0,male,37.0,10.0,no,3,18,7,4
1,0,female,27.0,4.0,no,4,14,6,4
2,0,female,32.0,15.0,yes,1,12,1,4
3,0,male,57.0,15.0,yes,5,18,6,5
4,0,male,22.0,0.75,no,2,17,6,3


# Auxiliary function

In [6]:
def qbus_update_globals_bgm(pd_df):
   globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)

# Data cleaning: Preparing the dataset for Biogeme

Biogeme does not accept non-numerical variables.


We need to transform categorical variables that are not encoded into numbers.
The pandas function `factorize()` encodes a variable into integers. In this case, we will get 0 for Male, 1 for Female for the gender variable. We will get 0 for no, 1 for yes in the children varible. This means that we can use the encoding already as dummy variables that can be interpreted as 'isFemale' and 'hasChildren'.

In [7]:
affairs_pd['gender'] = affairs_pd['gender'].factorize()[0]
affairs_pd['children'] = affairs_pd['children'].factorize()[0]


After this transformation, we can already pass the pandas dataframe to biogeme, though be mindful that some variables such as occupation are numeric but should be transformed to dummy. 

# Ordinal logit with biogeme

We will first estimate the ordered logit 'manually' using biogeme, to get a better view of the process and how it is connected to the theory. Later we can create some auxiliary functions to simplify the process.

We start by update the globals as usual.

In [8]:
qbus_update_globals_bgm(affairs_pd)

### Utility function

The first step is to create the utility function, an important difference with respect to what we have done in the multinomial logit(MNL) and nested logit is that there will be only one utility function, while in the MNL there is one utility function per alternative.

As for the variables, after some consideration (maybe you disagree), we can take
the variables religiousness, rating and education as numeric, even though they are categorical. This is because religiousness can be represented as 'intensity' of the belief in religion and education can be roughly interpreted as 'years spent in education'. So for the sake of simplicity

In [9]:
# Parameters to be estimated
B_gender = exp.Beta('B_gender', 0, None, None, 0)
B_age = exp.Beta('B_age', 0, None, None, 0)
B_yearsmarried = exp.Beta('B_yearsmarried', 0, None, None, 0)
B_children = exp.Beta('B_children', 0, None, None, 0)
B_religiousness = exp.Beta('B_religiousness', 0, None, None, 0)
B_education = exp.Beta('B_education', 0, None, None, 0)
B_rating = exp.Beta('B_rating', 0, None, None, 0)

Define the one utility function using our familiar syntax.
We will simplify and not apply any transformation or interaction, though it might very well improve the fitting.

In [10]:
V_one = B_gender*gender + B_age*age + B_yearsmarried*yearsmarried + B_children*children + B_religiousness*religiousness + B_education*education + B_rating*rating

### Cutoff points

An important part of the ordinal logit model are the cut-off points, the values that will split the utility into different alternatives. These are being estimated from the data. 
Remember that we have to impose some order in the cutoff points, this is, tau1
is smaller than tau2, tau2 is smaller than tau3 and so on.

The way to do this in biogeme (and other estimation software) is to create some 'auxiliary' parameters that can be interpreted as the deltas or differences between taus. These deltas are created with `exp.Beta` as the any other parameter of the model.
 We impose that the deltas can be estimated, but they have to be greater than 0. Remember that when we define the parameters in Biogeme, one of the arguments of the function `exp.Beta` are the bounds that restrict the range to which the values can be estimated.

We can 'recreate' the taus from the deltas, for example, tau2 = tau1 + delta2, tau3 = tau2 + delta3 (or alternatively, tau3 = tau1 + delta2 + delta3).

Just to clarify: what we are doing with the deltas is imposing the order in the taus, it is just a trick to impose: all taus are in increasing order, starting from tau1, 'tau1' can have any value, tau2 has to be between tau1 and tau3, tau3 has to be between tau2 and tau4, ..., tau6 has to be between tau5 and +infinity. What we are doing is just a way to pass the information to biogeme.

In the next cell we declare the taus and the deltas. Notice the third argument in the `exp.Beta` when declaring the deltas is set to 0, the third argument specifies the lower bound in the estimation range. Notice how the taus are defined as the 'taus before them' plus some delta. With the exception of tau1. 
The default value of tau1 is -1, but this is arbitrary.

There next cell will be very verbose, because we have to manually define each of the deltas and taus. This is a candidate to be done automatically in an auxiliary function...

In [11]:
tau1 = exp.Beta('tau1', -1, None, None, 0)

delta2 = exp.Beta('delta2', 1, 0, None, 0)
tau2 = tau1 + delta2

delta3 = exp.Beta('delta3', 2, 0, None, 0)
tau3 = tau2 + delta3

delta4 = exp.Beta('delta4', 3, 0, None, 0)
tau4 = tau3 + delta4

delta5 = exp.Beta('delta5', 4, 0, None, 0)
tau5 = tau4 + delta5

What is left to define is the 'model' in the biogeme, the equivalent of the 
biogeme `models.loglogit` for the multinomial logit. In this case, we have to do it manually (another candidate for auxiliary functions).

We define the 'model' in biogeme using a dictionary. The dictionary maps the **values for the choice variable** to the computed choice probabilities. The choice variable here is affairs.

In the next cell the function `dist.logisticcdf` is just the name of the logistic transform in biogeme (there is the logistic probability distribution, which the cumulative distribution function 'logisticcdf' being the equivalent logistic transform).

In [12]:
alt_probs_map = {
    0: dist.logisticcdf(tau1 - V_one),
    1: dist.logisticcdf(tau2 - V_one) - dist.logisticcdf( tau1 - V_one),
    2: dist.logisticcdf(tau3 - V_one) - dist.logisticcdf( tau2 - V_one),
    3: dist.logisticcdf(tau4 - V_one) - dist.logisticcdf( tau3 - V_one),
    7: dist.logisticcdf(tau5 - V_one) - dist.logisticcdf( tau4 - V_one),
    12: 1- dist.logisticcdf(tau5 - V_one)}

Then we take the log of the choice probabilities (for the loglikelihood) and specify which variable in the dataset contains the choice or alternatives.

In [13]:

logprob = exp.log(exp.Elem(alt_probs_map, affairs))

We declare the biogeme object as usual, and we estimate the model.

In [14]:
affairs_db = db.Database('affairs', affairs_pd)

biogeme  = bio.BIOGEME(affairs_db, logprob)

results = biogeme.estimate()



And here are the results, for your interpretation!
The new part are the taus and deltas. We have tau1, and then the deltas that will allow us to recover the remaining taus. We check tha everything looks OK,all deltas are positive.

In [15]:
results.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
B_age,-0.04974,0.018725,-2.656382,0.007898413
B_children,0.261993,0.300968,0.870504,0.384025
B_education,0.02335,0.045835,0.509433,0.6104488
B_gender,-0.304145,0.220739,-1.377848,0.1682504
B_rating,-0.511382,0.090896,-5.62598,1.844578e-08
B_religiousness,-0.35981,0.091993,-3.91128,9.180831e-05
B_yearsmarried,0.120859,0.033328,3.626311,0.0002874995
delta2,0.374723,0.060704,6.172911,6.704384e-10
delta3,0.220792,0.052935,4.170971,3.033043e-05
delta4,0.279967,0.063055,4.440032,8.994532e-06


Simulation is as usual, pass the biogeme database and the dictionary with the targets to compute. In this case, to get the choice probabilities, we can reuse the `alt_probs_map` dictionary as the targets. 


In [16]:
  bgm_pred_model = bio.BIOGEME(affairs_db, alt_probs_map)
  simulatedValues = bgm_pred_model.simulate(results.getBetaValues())
  simulatedValues

Unnamed: 0,0,1,2,3,7,12
0,0.773768,0.058869,0.028555,0.030215,0.061349,0.047244
1,0.901572,0.028613,0.013046,0.013259,0.025330,0.018179
2,0.459921,0.093391,0.053720,0.064433,0.162453,0.166081
3,0.930165,0.020754,0.009337,0.009411,0.017760,0.012573
4,0.679986,0.075563,0.038453,0.042056,0.090025,0.073916
...,...,...,...,...,...,...
596,0.765087,0.060619,0.029532,0.031340,0.063925,0.049496
597,0.660012,0.078466,0.040353,0.044464,0.096360,0.080345
598,0.709701,0.070811,0.035483,0.038390,0.080742,0.064873
599,0.392080,0.091962,0.055112,0.068367,0.184268,0.208211


To simulate the utilities we have to create a dictionary just with the utility funcion. 

In [17]:
V_pred_map = {
    'util': V_one }

We do the simulation as usual, notice how the more positive the utility, the more it tends towars more infidelities. This is because we considered arbitrarily to intepret the 'utility that is received from having an affair'. We could have also used the 'utility from being faithful'.

In [18]:
  bgm_pred_util = bio.BIOGEME(affairs_db, V_pred_map)
  simulated_util = bgm_pred_util.simulate(results.getBetaValues())
  simulated_util



Unnamed: 0,util
0,-3.336457
1,-4.321564
2,-1.946090
3,-4.695977
4,-2.860458
...,...
596,-3.287524
597,-2.770097
598,-3.000684
599,-1.668172


# Exercise 1) Create a more fine tuned ordered logit model, adding a variable transformation


We add the age and yearsmarried squared,. 

In [19]:
B_age_sq = exp.Beta('B_age_sq', 0, None, None, 0)
B_yearsmarried_sq = exp.Beta('B_yearsmarried_sq', 0, None, None, 0)

The squared are added directly in biogeme, as opposed to transforming the pandas dataframe.
At the end of this line of code.

In [20]:
V_complex = B_gender*gender + B_age*age + B_yearsmarried*yearsmarried + B_children*children + B_religiousness*religiousness + B_education*education + B_rating*rating + B_age_sq*age*age + B_yearsmarried_sq*yearsmarried*yearsmarried

The point of the exercise is to practice with the specification of the taus (and deltas).
It is kind of an 'repeat' from the tutorial, we can go through it again. The only thing that changes from dataset to dataset might be the number of alternatives, so we might need to specify more deltas and taus.

In [22]:
tau1_complex = exp.Beta('tau1_complex', -1, None, None, 0)

delta2_complex = exp.Beta('delta2_complex', 1, 0, None, 0)
tau2_complex = tau1_complex + delta2_complex

delta3_complex = exp.Beta('delta3_complex', 1.2, 0, None, 0)
tau3_complex = tau2_complex + delta3_complex

delta4_complex = exp.Beta('delta4_complex', 1.3, 0, None, 0)
tau4_complex = tau3_complex + delta4_complex

delta5_complex = exp.Beta('delta5_complex', 4, 0, None, 0)
tau5_complex = tau4_complex + delta5_complex

This is how the choice probabilities are computed following the ordered logit formula...

In [23]:
alt_probs_map_complex = {
    0: dist.logisticcdf(tau1_complex - V_complex),
    1: dist.logisticcdf(tau2_complex - V_complex) - dist.logisticcdf( tau1_complex - V_complex),
    2: dist.logisticcdf(tau3_complex - V_complex) - dist.logisticcdf( tau2_complex - V_complex),
    3: dist.logisticcdf(tau4_complex - V_complex) - dist.logisticcdf( tau3_complex - V_complex),
    7: dist.logisticcdf(tau5_complex - V_complex) - dist.logisticcdf( tau4_complex - V_complex),
    12: 1- dist.logisticcdf(tau5_complex - V_one)}

In [25]:
logprob_complex = exp.log(exp.Elem(alt_probs_map_complex, affairs))

model_complex  = bio.BIOGEME(affairs_db, logprob_complex)

results_complex = model_complex.estimate()



The coefficients for the new variables, the squared, are negative, so we get the famous inverted 'U' shape relationship with age and years married. Early in the marriage, less utility from affairs, then it increases until it reaches a peak, and the decreases for couples that have been married for several years. 

In [26]:
results_complex.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
B_age,0.381804,0.078222,4.881051,1.055219e-06
B_age_sq,-0.005498,0.001108,-4.962617,6.954976e-07
B_children,0.24869,0.335485,0.741283,0.4585216
B_education,0.011196,0.052216,0.214416,0.8302224
B_gender,-0.162013,0.249002,-0.650649,0.515273
B_rating,-0.402034,0.092942,-4.325641,1.520888e-05
B_religiousness,-0.291271,0.102351,-2.845802,0.00442997
B_yearsmarried,0.150991,0.109598,1.377679,0.1683024
B_yearsmarried_sq,-0.007752,0.005871,-1.320315,0.1867299
delta2_complex,0.446466,0.072542,6.154621,7.52572e-10


Out of curiosity,t  this model is a better fit, so it seems that the story of the inverted 'U' shape has some merit.

In [27]:
tools.likelihood_ratio_test( (results_complex.data.logLike, results_complex.data.nparam),
                                     (results.data.logLike, results.data.nparam), 0.05)

LRTuple(message='H0 can be rejected at level 5.0%', statistic=187.1303540616533, threshold=5.991464547107979)


# Exercise 2) Compare the predictions and accuracy of the model tuned in the exposition of the tutorial (before exercise 1) to an equivalent multinomial logit (with the same variables)

So we have six categories and seven variables which are socioeconomic characteristics, one per category. How many parameters should we create for biogeme?

In [20]:
ASC_None = exp.Beta ( 'ASC_None' ,0, None , None ,1)
ASC_Once = exp.Beta ( 'ASC_Once' ,0, None , None ,0)
ASC_Twice = exp.Beta ( 'ASC_Twice' ,0, None , None ,0)
ASC_Three = exp.Beta ( 'ASC_Three' ,0, None , None ,0)
ASC_F2S = exp.Beta ( 'ASC_F2S' ,0, None , None ,0)
ASC_More = exp.Beta ( 'ASC_More' ,0, None , None ,0)