<a href="https://colab.research.google.com/github/pmontman/tmp_choicemodels/blob/main/nb/tutorials/WK_03_tut_binlogit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 3: Binary logit with statsmodels

In this tutorial, we will:

* Introduce the binary logit, using the `statsmodels` library.
* Discuss one of the typical formats in which data can come: experimental data without properly identified alternatives.
* Compare the logit to another model family, the probit, which assumes Gaussian (normal) errors.

In future tutorials, we will change the main data analysis library to biogeme, which is more flexible (and a bit more cumbersome to use).

In [1]:

import numpy as np
import statsmodels.api as sm

---
---

# Dataset: Sports car choices

Sports car choices coming from survey data, sourced from [here](https://github.com/spensorflow/Marketing-Analytics---Choice-Modeling-Sports-Car-Sales/).

Let's take a look at the data.

The fields in this dataset are as follows:

<table style="width:144%;">
<colgroup>
<col width="18%" />
<col width="126%" />
</colgroup>
<thead>
<tr class="header">
<th align="left"><strong>Field</strong></th>
<th align="left"><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">resp_id</td>
<td align="left">The identifier of each individual in the dataset</td>
</tr>
<tr class="even">
<td align="left">ques</td>
<td align="left">The identifier of each specific purchase scenario</td>
</tr>
<tr class="odd">
<td align="left">alt</td>
<td align="left">The identifier of each alternative choice within a question</td>
</tr>
<tr class="even">
<td align="left">segment</td>
<td align="left">The commercial segment of a sportscar model ('basic', 'fun', 'racer')</td>
</tr>
<tr class="odd">
<td align="left">seat</td>
<td align="left">The number of seats in the vehicle (2, 4, 5)</td>
</tr>
<tr class="even">
<td align="left">trans</td>
<td align="left">The transmission type of the vehicle ('auto','manual')</td>
</tr>
<tr class="odd">
<td align="left">convert</td>
<td align="left">Whether or not the vehicle has a convertible top</td>
</tr>
<tr class="even">
<td align="left">price</td>
<td align="left">The sportscar price (in thousands of $)</td>
</tr>
<tr class="odd">
<td align="left">choice</td>
<td align="left">Dummy indicator of the decision made (1 = this car was chosen, 0 = this car was not chosen)</td>
</tr>
</tbody>
</table>

In [2]:
import pandas as pd

sportscar = pd.read_csv("https://raw.githubusercontent.com/pmontman/tmp_choicemodels/main/data/sportscar_choice_long.csv")
sportscar.head(5)


,resp_id,ques,alt,segment,seat,trans,convert,price,choice
0,1,1,1,basic,2,manual,yes,35,0
1,1,1,2,basic,5,auto,no,40,0
2,1,1,3,basic,5,auto,no,30,1
3,1,2,1,basic,5,manual,no,35,0
4,1,2,2,basic,2,manual,no,30,1


The data comes in what is called **'long format'**: each row represents one possible alternative offered to an individual, with its attributes and characteristics. We would like to reformat the data so that each row represents the complete choice situation; we will see that later.

We see a variable indicating the alternative (`alt`) with three possible options. In the following cell, we keep only two alternatives for the binary analysis, `alt=1` and `alt=2`.

In [3]:
sportscar = sportscar[sportscar['alt'] < 3]
sportscar.head()

,resp_id,ques,alt,segment,seat,trans,convert,price,choice
0,1,1,1,basic,2,manual,yes,35,0
1,1,1,2,basic,5,auto,no,40,0
3,1,2,1,basic,5,manual,no,35,0
4,1,2,2,basic,2,manual,no,30,1
6,1,3,1,basic,5,auto,yes,35,1


---
---

### Transformation to wide format
The data is in long format; we will put it in **wide format**, which is more common in some data analysis packages (both long and wide are used today; historically, wide was more popular).

**This transformation process is shown for completeness, but you will be given datasets that are already preprocessed; you will not be asked to transform to wide format in the exam.**

Transformation to wide format can be done via the `pivot` method in `pandas`. We have to choose which columns identify an individual choice situation, which would be `resp_id` and `ques` (the respondent id identifies the individual, and `ques` identifies the set of alternatives given to each individual in a bundle). The second argument (`columns`) identifies the variable that will make the new columns for each row, in our case `alt`. This creates columns holding the variables `seat, trans, price, ...` and so on, per alternative, so we will have `seat_1`, `seat_2`, etc.
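
To make the long-to-wide step concrete, here is a minimal sketch on a tiny made-up table (the values and the names `toy_long` and `toy_wide` are invented for illustration, they are not part of the sports car dataset). It only shows what `pivot` does with the `index` and `columns` arguments.

```python
import pandas as pd

# Made-up long-format data: one row per (respondent, question, alternative).
toy_long = pd.DataFrame({
    "resp_id": [1, 1, 2, 2],
    "ques":    [1, 1, 1, 1],
    "alt":     [1, 2, 1, 2],
    "seat":    [2, 5, 4, 4],
    "price":   [35, 40, 30, 35],
})

# index:   the columns that identify one choice situation (one output row)
# columns: the variable whose values become a new column level
toy_wide = toy_long.pivot(index=["ques", "resp_id"], columns="alt")

print(toy_wide)
print(toy_wide.shape)  # two choice situations, two variables x two alternatives
```

Each of the remaining variables (`seat`, `price`) ends up with one sub-column per alternative, which is the hierarchical layout we will see in the real data.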

In [4]:
sportscar = sportscar.pivot(index=['ques', 'resp_id'], columns='alt')



Now we should have the data in wide format. Notice how each row holds all the information about the choice situation for that individual. For example, `seat` now appears once per alternative: the number of seats of the car offered as alternative 1 vs. the number of seats of the car offered as alternative 2.

In [5]:
sportscar

,,segment,segment,seat,seat,trans,trans,convert,convert,price,price,choice,choice
,alt,1,2,1,2,1,2,1,2,1,2,1,2
ques,resp_id,,,,,,,,,,,,
1,1,basic,basic,2,5,manual,auto,yes,no,35,40,0,0
1,2,basic,basic,5,5,manual,manual,no,yes,35,35,0,0
1,3,basic,basic,4,4,auto,auto,yes,no,35,35,1,0
1,4,basic,basic,2,5,auto,manual,no,no,30,30,0,1
1,5,basic,basic,5,2,auto,auto,yes,yes,35,30,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10,196,basic,basic,4,2,manual,auto,yes,yes,30,40,0,0
10,197,basic,basic,2,2,auto,manual,yes,no,40,40,0,0
10,198,basic,basic,5,4,auto,auto,no,yes,30,35,1,0
10,199,basic,basic,5,5,manual,auto,yes,yes,35,35,0,1


The columns in the pandas dataframe are in hierarchical format, which I personally do not like, so we will flatten them so that the suffixes `_1` and `_2` identify the alternative.

In [6]:
sportscar.columns = ['_'.join([str(element) for element in a]) for a in sportscar.columns.to_flat_index()]

In [7]:
sportscar

ques,resp_id,segment_1,segment_2,seat_1,seat_2,trans_1,trans_2,convert_1,convert_2,price_1,price_2,choice_1,choice_2
1,1,basic,basic,2,5,manual,auto,yes,no,35,40,0,0
1,2,basic,basic,5,5,manual,manual,no,yes,35,35,0,0
1,3,basic,basic,4,4,auto,auto,yes,no,35,35,1,0
1,4,basic,basic,2,5,auto,manual,no,no,30,30,0,1
1,5,basic,basic,5,2,auto,auto,yes,yes,35,30,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10,196,basic,basic,4,2,manual,auto,yes,yes,30,40,0,0
10,197,basic,basic,2,2,auto,manual,yes,no,40,40,0,0
10,198,basic,basic,5,4,auto,auto,no,yes,30,35,1,0
10,199,basic,basic,5,5,manual,auto,yes,yes,35,35,0,1


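Finally, we drop the column `choice_2` and keep `choice_1` as the response: `choice_1 = 1` means alternative 1 was chosen, and `choice_1 = 0` means it was not. Note that the two columns are not exact complements here: in rows where the respondent originally chose alternative 3 (which we filtered out), both `choice_1` and `choice_2` are 0, as in the first row above.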

In [8]:
sportscar = sportscar.drop(columns=['choice_2'])
sportscar

ques,resp_id,segment_1,segment_2,seat_1,seat_2,trans_1,trans_2,convert_1,convert_2,price_1,price_2,choice_1
1,1,basic,basic,2,5,manual,auto,yes,no,35,40,0
1,2,basic,basic,5,5,manual,manual,no,yes,35,35,0
1,3,basic,basic,4,4,auto,auto,yes,no,35,35,1
1,4,basic,basic,2,5,auto,manual,no,no,30,30,0
1,5,basic,basic,5,2,auto,auto,yes,yes,35,30,0
...,...,...,...,...,...,...,...,...,...,...,...,...
10,196,basic,basic,4,2,manual,auto,yes,yes,30,40,0
10,197,basic,basic,2,2,auto,manual,yes,no,40,40,0
10,198,basic,basic,5,4,auto,auto,no,yes,30,35,1
10,199,basic,basic,5,5,manual,auto,yes,yes,35,35,0


---
---

## Modelling with Logit

We are ready for the analysis. One more step is encoding the categorical variables as dummies. We put this in the modelling step because some modelling packages deal with categorical variables internally, so strictly speaking it might not be needed as 'preprocessing'.

In [9]:
# dtype=int gives 0/1 dummies; recent pandas versions return booleans by default
sportscar_d = pd.get_dummies(sportscar, dtype=int)
sportscar_d.head()

ques,resp_id,seat_1,seat_2,price_1,price_2,choice_1,segment_1_basic,segment_1_fun,segment_1_racer,segment_2_basic,segment_2_fun,segment_2_racer,trans_1_auto,trans_1_manual,trans_2_auto,trans_2_manual,convert_1_no,convert_1_yes,convert_2_no,convert_2_yes
1,1,2,5,35,40,0,1,0,0,1,0,0,0,1,1,0,0,1,1,0
1,2,5,5,35,35,0,1,0,0,1,0,0,0,1,0,1,1,0,0,1
1,3,4,4,35,35,1,1,0,0,1,0,0,1,0,1,0,0,1,1,0
1,4,2,5,30,30,0,1,0,0,1,0,0,1,0,0,1,1,0,1,0
1,5,5,2,35,30,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1
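
As a small illustration of the dummy encoding on made-up data (the table `toy` below is invented, it is not the sports car dataset), the sketch shows how one categorical column becomes one 0/1 column per category. We pass `dtype=int` on the assumption that we want integer dummies, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Made-up data with one numeric and one categorical column.
toy = pd.DataFrame({
    "seat_1": [2, 5, 4],
    "trans_1": ["manual", "auto", "auto"],
})

# Numeric columns are left untouched; each category gets its own 0/1 column,
# named <original column>_<category>.
toy_d = pd.get_dummies(toy, dtype=int)
print(toy_d)

# The dummies of one variable always add up to 1 in every row.
print((toy_d["trans_1_auto"] + toy_d["trans_1_manual"]).tolist())
```

Because the dummies of one variable always sum to 1, only one of them per variable carries independent information, which is worth keeping in mind when we pick `trans_1_auto` and `convert_1_yes` for the model.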


We will model the data now.

We need to pass `statsmodels` two datasets: one with the response variable and one with the explanatory variables.

For simplicity, we will use only a few of the variables: seats, price, transmission type and convertible.

There are important decisions to make, for example:
  * We are choosing to model Alternative 1 as the response variable. **What do you think will happen if we choose Alternative 2?**
  * We are including as input variables only the attributes of alternative 1 (e.g. `seat_1` instead of both `seat_1` and `seat_2`). **What do you think are the consequences of this?**
  * Should we include an intercept?

In [10]:
logit_mod = sm.Logit(sportscar_d.choice_1, sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']] )
logit_res = logit_mod.fit(disp=0)

print(logit_res.summary())

                           Logit Regression Results                           
Dep. Variable:               choice_1   No. Observations:                 2000
Model:                          Logit   Df Residuals:                     1996
Method:                           MLE   Df Model:                            3
Date:                Wed, 16 Aug 2023   Pseudo R-squ.:                 0.08771
Time:                        00:23:09   Log-Likelihood:                -1154.5
converged:                       True   LL-Null:                       -1265.5
Covariance Type:            nonrobust   LLR p-value:                 7.394e-48
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
seat_1            0.2665      0.040      6.725      0.000       0.189       0.344
price_1          -0.0696      0.005    -13.706      0.000      -0.080      -0.060
trans_1_auto      1.1318      0.103     

Let's analyze the results above. Take a look at:
* The estimated coefficients: sign, magnitude, etc. (remember the possible pitfalls from linear regression).
* The 'fitting' indicator (`Pseudo R-squ.`).
* The 'reference' model (`LL-Null`, the log-likelihood of the constant-only model).


## Confusion matrix

For further validation, we could compute the accuracy or look at the confusion matrix.
In `statsmodels`, rows are the actual outcome and columns are the predicted outcome.

In [11]:
logit_res.pred_table()

array([[1213.,  131.],
       [ 452.,  204.]])
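
The accuracy can be read off the confusion matrix by hand. The sketch below copies the four counts from the output above into a hypothetical array `conf_mat` and compares the accuracy with a naive baseline that always predicts 'not chosen'.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual outcome, columns: predicted outcome).
conf_mat = np.array([[1213., 131.],
                     [452., 204.]])

n_obs = conf_mat.sum()

# Correct predictions sit on the diagonal.
accuracy = np.trace(conf_mat) / n_obs

# Accuracy we would get by always predicting 0 (share of actual zeros).
baseline = conf_mat[0].sum() / n_obs

print(f"accuracy: {accuracy:.4f}")   # 1417 / 2000
print(f"baseline: {baseline:.4f}")   # 1344 / 2000
```

Comparing against the baseline helps put the accuracy in context when the classes are unbalanced, as they are here.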

---
---

# Basic Predictions

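We can get the predicted probability of choosing alternative 1 for any row of explanatory variables with `predict`, here for a few rows of the dataset.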

In [12]:

logit_res.predict(sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']].iloc[1:15])

ques  resp_id
1     2          0.249368
      3          0.460610
      4          0.396057
      5          0.527127
      6          0.264410
      7          0.264410
      8          0.190037
      9          0.357900
      10         0.612160
      11         0.357900
      12         0.316542
      13         0.264410
      14         0.612160
      15         0.215908
dtype: float64
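
To see where these numbers come from: for a logit model, the predicted probability is the logistic function applied to the linear combination of the inputs and the coefficients. The sketch below uses made-up coefficients and a made-up car (the names `beta` and `x` and all the values are hypothetical, they are not the fitted model).

```python
import numpy as np

# Hypothetical coefficients, in the order: seat, price, auto, convertible.
beta = np.array([0.25, -0.07, 1.0, 0.1])

# Hypothetical car: 4 seats, price 35, automatic, not convertible.
x = np.array([4, 35, 1, 0])

# Linear index: sum of attribute * coefficient.
linear_index = x @ beta

# Logistic function maps the linear index to a probability in (0, 1).
prob = 1.0 / (1.0 + np.exp(-linear_index))

print(f"linear index: {linear_index:.2f}")
print(f"probability:  {prob:.4f}")
```

A negative linear index gives a probability below 0.5 and a positive one gives a probability above 0.5.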


# Compare to the probit

Let's fit a probit model and compare the coefficients.

In [13]:
probit_mod = sm.Probit(sportscar_d.choice_1, sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']] )
probit_res = probit_mod.fit(disp=0)

print(probit_res.summary())



                          Probit Regression Results                           
Dep. Variable:               choice_1   No. Observations:                 2000
Model:                         Probit   Df Residuals:                     1996
Method:                           MLE   Df Model:                            3
Date:                Wed, 16 Aug 2023   Pseudo R-squ.:                 0.08766
Time:                        00:23:09   Log-Likelihood:                -1154.6
converged:                       True   LL-Null:                       -1265.5
Covariance Type:            nonrobust   LLR p-value:                 7.940e-48
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
seat_1            0.1578      0.023      6.744      0.000       0.112       0.204
price_1          -0.0416      0.003    -14.272      0.000      -0.047      -0.036
trans_1_auto      0.6791      0.061     
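
The logit and probit coefficients are not on the same scale, because the two models assume errors with different variances (the standard logistic has variance π²/3, the standard normal has variance 1). A quick way to compare them is to look at their ratios. The sketch below copies the coefficients that are visible in the two summaries above into hypothetical dictionaries; the `convert_1_yes` row is left out because it is not visible in the output.

```python
# Coefficients copied from the two summary tables above.
logit_coef = {"seat_1": 0.2665, "price_1": -0.0696, "trans_1_auto": 1.1318}
probit_coef = {"seat_1": 0.1578, "price_1": -0.0416, "trans_1_auto": 0.6791}

# Ratio logit / probit for each variable.
ratios = {name: logit_coef[name] / probit_coef[name] for name in logit_coef}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The ratios are close to each other, which suggests that the two models tell roughly the same story up to a scale factor.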

### Exercise: Compare predictions of logit vs probit

---
---
## Advanced Predictions

For a 'new' individual, we will create a car with many seats and see what the model tells us.

In [14]:
weirdcar = sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']].iloc[0].copy()
weirdcar['seat_1'] = weirdcar['seat_1'] + 12
weirdcar

seat_1           14
price_1          35
trans_1_auto      0
convert_1_yes     1
Name: (1, 1), dtype: int64

It seems they will like the car

In [15]:

logit_res.predict(weirdcar)

None    0.798243
dtype: float64

## Exercise: Get the utility for that individual

*Clue: Remember something about log-odds*

In [16]:
logit_res.predict(weirdcar)

None    0.798243
dtype: float64

## Exercise: Calculate WTP per seat

## Exercise: Model the seats variable better, in a nonlinear form
*Clue: Think about the utility interpretation, to imagine what the curve of seats vs satisfaction should look like*

## Exercise: Re-analyze the dataset using alternatives 1 and 3, instead of 1 and 2, and compare the differences.