<a href="https://colab.research.google.com/github/pmontman/pub-choicemodels/blob/main/nb/tuto_05_ordered_logit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial shows how to use Biogeme to estimate ordered logit models.
We will use the (in)famous 'affairs' dataset, described below.

The data comes from a survey conducted by *Psychology Today* in 1969. One of the questions was:

*How often engaged in extramarital sexual intercourse during the past year?*

The possible answers were:

* None
* Once
* Twice
* Three times
* 4 to 10 times
* Monthly
* Weekly
* Daily

# Description of the dataset

The dataset has 601 observations of the following variables:

* **affairs:** The answer to the survey question, encoded as numbers:
  0 = None, 1 = Once, 2 = Twice, 3 = Three times, 7 = 4 to 10 times, 12 = Monthly, Weekly, or Daily.
  As we can see, the encoding merges the three most frequent answers into a single code, so we lose information about them, and the encoded numbers do not exactly match the frequencies. However, there is still an ordinal relationship among the possible answers, from less frequent to more frequent.

*  **gender:** Categorical variable indicating either male or female among the participants.

* **age:** Numeric variable coding age in years: 17.5 = under 20, 22 = 20–24, 27 = 25–29, 32 = 30–34, 37 = 35–39, 42 = 40–44, 47 = 45–49, 52 = 50–54, 57 = 55 or over. 

* **yearsmarried:** Numeric variable coding number of years married: 0.125 = 3 months or less, 0.417 = 4–6 months, 0.75 = 6 months–1 year, 1.5 = 1–2 years, 4 = 3–5 years, 7 = 6–8 years, 10 = 9–11 years, 15 = 12 or more years.

* **children:** Categorical variable indicating if there are children in the marriage.

* **religiousness:** Categorical variable indicating how religious the person is, encoded as numbers: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very.

* **education:** Categorical variable indicating the level of education. Encoded as numbers: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master's degree, 20 = Ph.D., M.D., or other advanced degree.

* **occupation:** Categorical variable classifying the profession of the individual. Encoded as numbers whose meaning has been somewhat lost to time, but the classification may resemble the one in this [link](https://dictionary.fitbir.nih.gov/portal/publicData/dataElementAction!view.action?dataElementName=HollingsheadJobClassCat&publicArea=true).

* **rating:** Categorical variable indicating how happy they are with the marriage. Encoded as numbers: 1 = very unhappy, 2 = somewhat unhappy, 3 = average, 4 = happier than average, 5 = very happy.
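
Because only the order of the `affairs` codes matters for an ordered model, it can help to view them as ranks. The following is a minimal sketch in plain Python; the names `AFFAIRS_LABELS` and `ordinal_rank` are made up for illustration and are not part of the dataset or of Biogeme.

```python
# Codes used in the `affairs` column and the survey answers they stand for.
AFFAIRS_LABELS = {
    0: "None",
    1: "Once",
    2: "Twice",
    3: "Three times",
    7: "4 to 10 times",
    12: "Monthly, weekly or daily",
}

# Rank of each code among the sorted codes: 0 -> 0, 1 -> 1, ..., 12 -> 5.
ordinal_rank = {code: rank for rank, code in enumerate(sorted(AFFAIRS_LABELS))}

for code, label in AFFAIRS_LABELS.items():
    print(f"code {code:>2} -> rank {ordinal_rank[code]}: {label}")
```

Six distinct codes mean six ordered categories, so an ordered model on this variable needs five cut-off points.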

---
---

# Preparing the environment
*The preparation and dataset loading code is given to the students*

In [1]:
!pip install biogeme



Load the packages; feel free to change the aliases.

In [2]:
import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools
import biogeme.distributions as dist

# Load the dataset

In [3]:
path = 'https://raw.githubusercontent.com/pmontman/pub-choicemodels/main/data/affairs.csv'
affairs_pd = pd.read_csv(path)

A simple look at the dataset.

In [4]:
affairs_pd.head(5)

Unnamed: 0,affairs,gender,age,yearsmarried,children,religiousness,education,occupation,rating
0,0,male,37.0,10.0,no,3,18,7,4
1,0,female,27.0,4.0,no,4,14,6,4
2,0,female,32.0,15.0,yes,1,12,1,4
3,0,male,57.0,15.0,yes,5,18,6,5
4,0,male,22.0,0.75,no,2,17,6,3


# Auxiliary function

In [5]:
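def qbus_update_globals_bgm(pd_df):
    """Expose every column of pd_df as a global Biogeme variable of the same name."""
    # A throwaway Database is built only to obtain its `variables` dictionary.
    globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)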

# Data cleaning: Preparing the dataset for Biogeme

Biogeme does not accept non-numerical variables.


We need to transform the categorical variables that are stored as text into numbers.
The pandas method `factorize()` encodes the categories as integers in order of first appearance. In this dataset the first row is male with no children, so `gender` becomes 0 for male and 1 for female, and `children` becomes 0 for no children and 1 for children. The encoded columns can therefore be used directly as dummy variables, interpreted as 'isFemale' and 'hasChildren'.

In [6]:
affairs_pd['gender'] = affairs_pd['gender'].factorize()[0]
affairs_pd['children'] = affairs_pd['children'].factorize()[0]


After this transformation, we can pass the pandas dataframe to Biogeme, keeping in mind that some variables, such as occupation, are numeric but should be transformed into dummy variables before being used in a model.
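
As an illustration of both encodings, here is a small sketch on the five rows printed by `head(5)`. It assumes pandas is installed; the names `toy` and `occ_dummies` and the `occ` prefix are chosen here for the example. Dropping the first level as the reference category is one common convention, not something the dataset prescribes.

```python
import pandas as pd

# Gender and occupation of the first five rows, as printed by `head(5)`.
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "male"],
    "occupation": [7, 6, 1, 6, 6],
})

# factorize() returns (codes, uniques); codes follow order of first appearance.
codes, uniques = toy["gender"].factorize()
print(codes.tolist())    # [0, 1, 1, 0, 0]
print(uniques.tolist())  # ['male', 'female']

# One 0/1 column per occupation level; the lowest level is dropped as reference.
occ_dummies = pd.get_dummies(
    toy["occupation"], prefix="occ", drop_first=True, dtype=int
)
print(list(occ_dummies.columns))  # ['occ_6', 'occ_7']
```

Each dummy column could then get its own coefficient in the utility function, instead of a single coefficient on the raw occupation code.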

# Ordinal logit with biogeme

We will first estimate the ordered logit 'manually' using Biogeme; later we will create some auxiliary functions to simplify the process.

We update the globals as usual.

In [7]:
qbus_update_globals_bgm(affairs_pd)

### Utility function

The first step is to create the utility function. An important difference with respect to the multinomial and nested logit models is that there is only one utility function, whereas in the MNL there is one utility function per alternative.

As for the variables, after some consideration (maybe you disagree), we can treat
the variables religiousness and education as numeric, even though they are categorical. This is because religiousness can be read as the 'intensity' of religious feeling, and education can be interpreted as 'years spent in education'.

In [8]:
# Parameters to be estimated
B_gender = exp.Beta('B_gender', 0, None, None, 0)
B_age = exp.Beta('B_age', 0, None, None, 0)
B_yearsmarried = exp.Beta('B_yearsmarried', 0, None, None, 0)
B_children = exp.Beta('B_children', 0, None, None, 0)
B_religiousness = exp.Beta('B_religiousness', 0, None, None, 0)
B_education = exp.Beta('B_education', 0, None, None, 0)

Define the single utility function using our familiar syntax.
To keep things simple we do not apply any transformation or interaction, although doing so might well improve the fit.

In [9]:
V_one = B_gender*gender + B_age*age + B_yearsmarried*yearsmarried + B_children*children + B_religiousness*religiousness + B_education*education

### Cutoff points

An important part of the ordinal logit model is the set of cut-off points: the values that split the utility into the different categories. They are estimated from the data.
Remember that we have to impose an order on the cut-off points, that is, tau1
is smaller than tau2, tau2 is smaller than tau3, and so on.

The way to do this in Biogeme (and in other estimation software) is to create 'auxiliary' parameters that can be interpreted as the deltas, or differences, between consecutive taus. These deltas are created with `exp.Beta`, like any other parameter of the model.
We let the deltas be estimated, but they cannot be negative. Remember that when we define a parameter in Biogeme, two of the arguments of `exp.Beta` are the bounds that restrict the range of values it can take.

We can then 'recreate' the taus from the deltas, for example, tau2 = tau1 + delta2.

To be clear, the deltas are just a trick to impose the order on the taus: tau1 can take any value, tau2 has to lie between tau1 and tau3, tau3 between tau2 and tau4, and so on, up to the last cut-off, tau5, which has to lie between tau4 and +infinity. It is simply a way of passing this information to Biogeme.

In the next cell we declare the taus and the deltas. Notice that the third argument of `exp.Beta` is set to 0 when declaring the deltas: this is the lower bound of the estimation range. Notice also how each tau is defined as the previous tau plus a delta, with the exception of tau1.
The starting value of tau1 is 1, but this is arbitrary.

The next cell is very verbose, because we have to define each of the deltas and taus manually. This is a candidate to be done automatically in an auxiliary function...

In [28]:
# tau1 is left unbounded: the starting value 1 must lie inside the bounds
tau1 = exp.Beta('tau1', 1, None, None, 0)

delta2 = exp.Beta('delta2', 1, 0, None, 0)
tau2 = tau1 + delta2

delta3 = exp.Beta('delta3', 2, 0, None, 0)
tau3 = tau2 + delta3

delta4 = exp.Beta('delta4', 3, 0, None, 0)
tau4 = tau3 + delta4

delta5 = exp.Beta('delta5', 4, 0, None, 0)
tau5 = tau4 + delta5
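
The 'tau equals previous tau plus delta' construction is just a running sum. A minimal sketch in plain Python, independent of Biogeme, using the starting values of the cell above (tau1 = 1 and deltas 1, 2, 3, 4); the helper name `taus_from_deltas` is hypothetical.

```python
from itertools import accumulate

def taus_from_deltas(tau1, deltas):
    """Return [tau1, tau1 + d2, tau1 + d2 + d3, ...] as a running sum."""
    return list(accumulate([tau1, *deltas]))

# Starting values from the cell above: tau1 = 1 and deltas 1, 2, 3, 4.
taus = taus_from_deltas(1, [1, 2, 3, 4])
print(taus)  # [1, 2, 4, 7, 11]

# Non-negative deltas guarantee non-decreasing cut-offs.
assert all(a <= b for a, b in zip(taus, taus[1:]))
```

Whatever values the optimizer picks for the deltas within their bounds, the resulting taus stay ordered, which is exactly the constraint we wanted to impose.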


What is left is to define the 'model' in Biogeme, the equivalent of
`models.loglogit` for the multinomial logit. In this case we have to do it manually (another candidate for auxiliary functions).

We define the 'model' in Biogeme using a dictionary. The dictionary maps the **values of the choice variable** to the computed choice probabilities.

In the next cell, `dist.logisticcdf` is simply Biogeme's name for the logistic transform (the cumulative distribution function of the logistic probability distribution is the logistic transform).

In [29]:
alt_probs_map = {
    0: dist.logisticcdf(tau1 - V_one),
    1: dist.logisticcdf(tau2 - V_one) - dist.logisticcdf(tau1 - V_one),
    2: dist.logisticcdf(tau3 - V_one) - dist.logisticcdf(tau2 - V_one),
    3: dist.logisticcdf(tau4 - V_one) - dist.logisticcdf(tau3 - V_one),
    7: dist.logisticcdf(tau5 - V_one) - dist.logisticcdf(tau4 - V_one),
    12: 1 - dist.logisticcdf(tau5 - V_one)}
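
In formulas, the dictionary above encodes the following probabilities. This is only a restatement of the cell, with notation chosen here: $y$ is the `affairs` code, $V$ the single utility `V_one`, and $F$ the logistic CDF.

```latex
\begin{aligned}
P(y = 0)  &= F(\tau_1 - V), \\
P(y = 1)  &= F(\tau_2 - V) - F(\tau_1 - V), \\
P(y = 2)  &= F(\tau_3 - V) - F(\tau_2 - V), \\
P(y = 3)  &= F(\tau_4 - V) - F(\tau_3 - V), \\
P(y = 7)  &= F(\tau_5 - V) - F(\tau_4 - V), \\
P(y = 12) &= 1 - F(\tau_5 - V),
\qquad \text{where } F(x) = \frac{1}{1 + e^{-x}}.
\end{aligned}
```

The terms telescope, so the six probabilities always add up to 1, and each one is non-negative as long as the taus are ordered.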

Then we take the log of the choice probabilities (for the loglikelihood) and specify which variable in the dataset contains the choice or alternatives.

In [30]:

logprob = exp.log(exp.Elem(alt_probs_map, affairs))

We declare the Biogeme object as usual and estimate the model.

In [31]:
affairs_db = db.Database('affairs', affairs_pd)

biogeme  = bio.BIOGEME(affairs_db, logprob, removeUnusedVariables=False)

results = biogeme.estimate()

And here are the results, for your interpretation!

In [32]:
results.getEstimatedParameters()

Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
B_age,-0.04237,0.017327,-2.445382,0.01446987,0.017461,-2.426499,0.01524529
B_children,0.345982,0.282166,1.226166,0.2201364,0.307155,1.126407,0.2599932
B_education,-0.008019,0.043939,-0.182502,0.8551885,0.047161,-0.170033,0.8649841
B_gender,-0.313018,0.214352,-1.4603,0.1442075,0.218586,-1.432013,0.1521401
B_religiousness,-0.386474,0.086052,-4.491184,7.082821e-06,0.089515,-4.3174,1.578778e-05
B_yearsmarried,0.128422,0.03166,4.056271,4.986234e-05,0.034298,3.744317,0.000180885
delta2,0.35254,0.058812,5.994317,2.043426e-09,0.057617,6.118661,9.436476e-10
delta3,0.206242,0.049258,4.186961,2.827138e-05,0.049672,4.152075,3.294743e-05
delta4,0.264076,0.059582,4.432159,9.329398e-06,0.05928,4.454712,8.400605e-06
delta5,0.859037,0.130093,6.603266,4.021983e-11,0.129027,6.657829,2.779021e-11
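
One common way to read ordered logit coefficients is as odds ratios. Since P(y <= k) = F(tau_k - V), the odds of answering above any cut-off are exp(V - tau_k), so a one-unit increase in a variable multiplies those odds by exp(beta). The sketch below applies this to the point estimates printed above, using only the standard library; the names `estimates` and `odds_ratios` are chosen here for the example.

```python
import math

# Point estimates copied from the results table above.
estimates = {
    "B_age": -0.04237,
    "B_children": 0.345982,
    "B_education": -0.008019,
    "B_gender": -0.313018,
    "B_religiousness": -0.386474,
    "B_yearsmarried": 0.128422,
}

# exp(beta): multiplicative change in the odds of a higher category per unit of x.
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<16} {ratio:.3f}")
```

A ratio below 1 means the variable pushes towards the lower categories, a ratio above 1 towards the higher ones. The p-values in the table still decide which of these effects are distinguishable from no effect.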


Simulation works as usual: we pass the database and a dictionary with the targets. In this case, to get the choice probabilities, we can reuse the `alt_probs_map` dictionary as the targets.


In [33]:
bgm_pred_model = bio.BIOGEME(affairs_db, alt_probs_map)
simulatedValues = bgm_pred_model.simulate(results.getBetaValues())
simulatedValues

Unnamed: 0,0,1,2,3,7,12
0,0.767921,0.056870,0.027841,0.030194,0.063948,0.053227
1,0.901220,0.027248,0.012545,0.013061,0.025944,0.019982
2,0.374871,0.085503,0.051475,0.065398,0.185992,0.236762
3,0.861626,0.036941,0.017313,0.018237,0.036875,0.029008
4,0.794859,0.051589,0.024936,0.026811,0.055995,0.045810
...,...,...,...,...,...,...
596,0.619070,0.079004,0.041621,0.047560,0.110037,0.102708
597,0.637683,0.076923,0.040143,0.045552,0.104109,0.095590
598,0.560769,0.084160,0.045701,0.053423,0.128773,0.127174
599,0.645735,0.075960,0.039479,0.044666,0.101556,0.092605
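
A simple way to turn these probabilities into predictions is to pick the most likely category for each row. The sketch below does this in plain Python for the first five rows printed above and compares them with the observed `affairs` values from `head(5)`. The names `predict`, `simulated` and `observed` are chosen here for the example, and five rows are of course not a real evaluation.

```python
codes = [0, 1, 2, 3, 7, 12]

# First five rows of the simulated probabilities printed above.
simulated = [
    [0.767921, 0.056870, 0.027841, 0.030194, 0.063948, 0.053227],
    [0.901220, 0.027248, 0.012545, 0.013061, 0.025944, 0.019982],
    [0.374871, 0.085503, 0.051475, 0.065398, 0.185992, 0.236762],
    [0.861626, 0.036941, 0.017313, 0.018237, 0.036875, 0.029008],
    [0.794859, 0.051589, 0.024936, 0.026811, 0.055995, 0.045810],
]
# Observed `affairs` values of the same rows, from `head(5)`.
observed = [0, 0, 0, 0, 0]

def predict(prob_row):
    """Return the affairs code with the highest probability."""
    best = max(range(len(codes)), key=lambda i: prob_row[i])
    return codes[best]

predicted = [predict(row) for row in simulated]
accuracy = sum(p == o for p, o in zip(predicted, observed)) / len(observed)
print(predicted, accuracy)  # [0, 0, 0, 0, 0] 1.0
```

Note that row 2 is predicted as 0 even though its probability of 0 is only about 0.37, so accuracy alone hides how uncertain some predictions are.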


To simulate the utilities, we have to create a dictionary containing just the utility function.

In [36]:
V_pred_map = {
    'util': V_one}

We run the simulation as usual. Notice that the more positive the utility, the more the respondent tends towards more infidelities. This is because we arbitrarily chose to interpret it as the 'utility received from having an affair'. We could equally have used the 'utility from being faithful'.

In [37]:
bgm_pred_util = bio.BIOGEME(affairs_db, V_pred_map)
simulated_util = bgm_pred_util.simulate(results.getBetaValues())
simulated_util

Unnamed: 0,util
0,-1.587234
1,-2.601481
2,0.120752
3,-2.219491
4,-1.745094
...,...
596,-0.876228
597,-0.955946
598,-0.634909
599,-0.990967
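
As a sanity check, the numbers for row 0 can be reproduced by hand from the printed estimates. The sketch below uses only the standard library and values copied from the outputs in this notebook. Since tau1 does not appear in the printed results table, it is recovered here from the simulated P(affairs = 0) of row 0, which is an assumption of this example rather than a reported estimate.

```python
import math

def logistic_cdf(x):
    """Logistic transform, the function Biogeme calls dist.logisticcdf."""
    return 1.0 / (1.0 + math.exp(-x))

# Estimates copied from the results table (rounded as printed there).
beta = {"gender": -0.313018, "age": -0.04237, "yearsmarried": 0.128422,
        "children": 0.345982, "religiousness": -0.386474, "education": -0.008019}
deltas = [0.35254, 0.206242, 0.264076, 0.859037]  # delta2 .. delta5

# Row 0 of the dataset after factorize(): male -> 0, no children -> 0.
row0 = {"gender": 0, "age": 37.0, "yearsmarried": 10.0,
        "children": 0, "religiousness": 3, "education": 18}

# Single utility: linear combination of the variables.
V = sum(beta[name] * row0[name] for name in beta)

# Recover tau1 from the simulated P(affairs = 0) of row 0:
# P(0) = F(tau1 - V)  =>  tau1 = V + log(P0 / (1 - P0)).
p0_simulated = 0.767921
tau1 = V + math.log(p0_simulated / (1.0 - p0_simulated))

# Rebuild the ordered cut-offs and the six category probabilities.
taus = [tau1]
for delta in deltas:
    taus.append(taus[-1] + delta)
cdf = [logistic_cdf(tau - V) for tau in taus]
probs = [cdf[0]] + [hi - lo for lo, hi in zip(cdf, cdf[1:])] + [1.0 - cdf[-1]]

print(f"V = {V:.6f}")
for code, p in zip([0, 1, 2, 3, 7, 12], probs):
    print(f"P(affairs = {code:>2}) = {p:.6f}")
```

Up to rounding of the printed estimates, the utility should agree with the first row of `simulated_util` and the probabilities with the first row of `simulatedValues`.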


# Exercise 1) Create a more fine-tuned ordered logit model



# Exercise 2) Compare the predictions and accuracy of the fine-tuned model to a multinomial logit