<a href="https://colab.research.google.com/github/pmontman/tmp_choicemodels/blob/main/nb/tutorials/WK_03_tut_binlogit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 3

In this tutorial, we will see:

* Introduction to the binary logit, using the statsmodels library
* How to interpret the logit

# Binary logit with statsmodels

In [1]:

import numpy as np
import statsmodels.api as sm

---
---

Dataset

Car choices from
https://github.com/spensorflow/Marketing-Analytics---Choice-Modeling-Sports-Car-Sales/

The fields in this dataset are as follows:

<table style="width:144%;">
<colgroup>
<col width="18%" />
<col width="126%" />
</colgroup>
<thead>
<tr class="header">
<th align="left"><strong>Field</strong></th>
<th align="left"><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">resp_id</td>
<td align="left">The identifier of each individual in the dataset</td>
</tr>
<tr class="even">
<td align="left">ques</td>
<td align="left">The identifier of each specific purchase scenario</td>
</tr>
<tr class="odd">
<td align="left">alt</td>
<td align="left">The identifier of each alternative choice within a question</td>
</tr>
<tr class="even">
<td align="left">segment</td>
<td align="left">The commercial segment of a sportscar model ('basic', 'fun', 'racer')</td>
</tr>
<tr class="odd">
<td align="left">seat</td>
<td align="left">The number of seats in the vehicle (2, 4, 5)</td>
</tr>
<tr class="even">
<td align="left">trans</td>
<td align="left">The transmission type of the vehicle ('auto','manual')</td>
</tr>
<tr class="odd">
<td align="left">convert</td>
<td align="left">Whether or not the vehicle has a convertible top</td>
</tr>
<tr class="even">
<td align="left">price</td>
<td align="left">The sportscar price (in thousands of dollars)</td>
</tr>
<tr class="odd">
<td align="left">choice</td>
<td align="left">Dummy indicator of the decision made (1 = this car was chosen, 0 = this car was not chosen)</td>
</tr>
</tbody>
</table>

In [2]:
import pandas as pd

sportscar = pd.read_csv("https://raw.githubusercontent.com/pmontman/tmp_choicemodels/main/data/sportscar_choice_long.csv")
sportscar.head(5)


Unnamed: 0,resp_id,ques,alt,segment,seat,trans,convert,price,choice
0,1,1,1,basic,2,manual,yes,35,0
1,1,1,2,basic,5,auto,no,40,0
2,1,1,3,basic,5,auto,no,30,1
3,1,2,1,basic,5,manual,no,35,0
4,1,2,2,basic,2,manual,no,30,1


We will keep two alternatives for the binary model: alt=1 and alt=2.

In [3]:
sportscar = sportscar[sportscar['alt'] < 3]
sportscar.head()

Unnamed: 0,resp_id,ques,alt,segment,seat,trans,convert,price,choice
0,1,1,1,basic,2,manual,yes,35,0
1,1,1,2,basic,5,auto,no,40,0
3,1,2,1,basic,5,manual,no,35,0
4,1,2,2,basic,2,manual,no,30,1
6,1,3,1,basic,5,auto,yes,35,1


The data is in long format; we will put it in wide format.

In [4]:
# Keyword arguments are required by recent pandas versions
sportscar = sportscar.pivot(index=['ques', 'resp_id'], columns='alt')



In [5]:
sportscar

Unnamed: 0_level_0,Unnamed: 1_level_0,segment,segment,seat,seat,trans,trans,convert,convert,price,price,choice,choice
Unnamed: 0_level_1,alt,1,2,1,2,1,2,1,2,1,2,1,2
ques,resp_id,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
1,1,basic,basic,2,5,manual,auto,yes,no,35,40,0,0
1,2,basic,basic,5,5,manual,manual,no,yes,35,35,0,0
1,3,basic,basic,4,4,auto,auto,yes,no,35,35,1,0
1,4,basic,basic,2,5,auto,manual,no,no,30,30,0,1
1,5,basic,basic,5,2,auto,auto,yes,yes,35,30,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10,196,basic,basic,4,2,manual,auto,yes,yes,30,40,0,0
10,197,basic,basic,2,2,auto,manual,yes,no,40,40,0,0
10,198,basic,basic,5,4,auto,auto,no,yes,30,35,1,0
10,199,basic,basic,5,5,manual,auto,yes,yes,35,35,0,1


In [6]:
#'_'.join([str(element) for element in  list(sportscar.columns)[0]])

In [7]:
sportscar.columns = ['_'.join([str(element) for element in a]) for a in sportscar.columns.to_flat_index()]
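
The long-to-wide step and the column flattening can be easier to follow on a tiny example. The sketch below uses a made-up frame (`toy`, with invented prices and choices, not rows from the sportscar data) and assumes a pandas version that accepts a list for `index` in `pivot`.

```python
import pandas as pd

# Hypothetical long-format data: two respondents, one question, two alternatives each
toy = pd.DataFrame({
    "resp_id": [1, 1, 2, 2],
    "ques":    [1, 1, 1, 1],
    "alt":     [1, 2, 1, 2],
    "price":   [35, 40, 30, 35],
    "choice":  [0, 1, 1, 0],
})

# One row per (ques, resp_id); every remaining column is split by alternative
wide = toy.pivot(index=["ques", "resp_id"], columns="alt")

# The columns are now a MultiIndex such as ('price', 1); join the levels into 'price_1'
wide.columns = ["_".join(str(level) for level in col)
                for col in wide.columns.to_flat_index()]

print(wide)
```

Each row of `wide` now describes one choice scenario, with the attributes of both alternatives side by side.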

In [9]:
sportscar

Unnamed: 0_level_0,Unnamed: 1_level_0,segment_1,segment_2,seat_1,seat_2,trans_1,trans_2,convert_1,convert_2,price_1,price_2,choice_1,choice_2
ques,resp_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,1,basic,basic,2,5,manual,auto,yes,no,35,40,0,0
1,2,basic,basic,5,5,manual,manual,no,yes,35,35,0,0
1,3,basic,basic,4,4,auto,auto,yes,no,35,35,1,0
1,4,basic,basic,2,5,auto,manual,no,no,30,30,0,1
1,5,basic,basic,5,2,auto,auto,yes,yes,35,30,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10,196,basic,basic,4,2,manual,auto,yes,yes,30,40,0,0
10,197,basic,basic,2,2,auto,manual,yes,no,40,40,0,0
10,198,basic,basic,5,4,auto,auto,no,yes,30,35,1,0
10,199,basic,basic,5,5,manual,auto,yes,yes,35,35,0,1


In [12]:
sportscar = sportscar.drop(columns=['choice_2'])
sportscar

Unnamed: 0_level_0,Unnamed: 1_level_0,segment_1,segment_2,seat_1,seat_2,trans_1,trans_2,convert_1,convert_2,price_1,price_2,choice_1
ques,resp_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,1,basic,basic,2,5,manual,auto,yes,no,35,40,0
1,2,basic,basic,5,5,manual,manual,no,yes,35,35,0
1,3,basic,basic,4,4,auto,auto,yes,no,35,35,1
1,4,basic,basic,2,5,auto,manual,no,no,30,30,0
1,5,basic,basic,5,2,auto,auto,yes,yes,35,30,0
...,...,...,...,...,...,...,...,...,...,...,...,...
10,196,basic,basic,4,2,manual,auto,yes,yes,30,40,0
10,197,basic,basic,2,2,auto,manual,yes,no,40,40,0
10,198,basic,basic,5,4,auto,auto,no,yes,30,35,1
10,199,basic,basic,5,5,manual,auto,yes,yes,35,35,0


In [23]:
# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool)
sportscar_d = pd.get_dummies(sportscar, dtype=int)

In [38]:
logit_mod = sm.Logit(sportscar_d.choice_1, sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']] )
logit_res = logit_mod.fit(disp=0)

print(logit_res.summary())

                           Logit Regression Results                           
Dep. Variable:               choice_1   No. Observations:                 2000
Model:                          Logit   Df Residuals:                     1996
Method:                           MLE   Df Model:                            3
Date:                Tue, 15 Aug 2023   Pseudo R-squ.:                 0.08771
Time:                        13:36:48   Log-Likelihood:                -1154.5
converged:                       True   LL-Null:                       -1265.5
Covariance Type:            nonrobust   LLR p-value:                 7.394e-48
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
seat_1            0.2665      0.040      6.725      0.000       0.189       0.344
price_1          -0.0696      0.005    -13.706      0.000      -0.080      -0.060
trans_1_auto      1.1318      0.103     

## Confusion matrix

Rows are actual values; columns are predicted values.

In [42]:
logit_res.pred_table()

array([[1213.,  131.],
       [ 452.,  204.]])
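
As a rough sketch of how to read this table, the snippet below copies the four counts printed above and derives a few common summary measures. It assumes the default 0.5 classification threshold of `pred_table()`; the variable names are only illustrative.

```python
# Counts copied from the pred_table() output: rows = actual, columns = predicted
table = [[1213.0, 131.0],
         [452.0, 204.0]]

tn, fp = table[0]  # actual 0: predicted 0, predicted 1
fn, tp = table[1]  # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of correct predictions
sensitivity = tp / (tp + fn)   # share of actual 1s predicted as 1
specificity = tn / (tn + fp)   # share of actual 0s predicted as 0

print(f"accuracy={accuracy:.4f}, sensitivity={sensitivity:.4f}, specificity={specificity:.4f}")
```

Note that 1344 of the 2000 observations are actual 0s, so always predicting 0 would already be right about 67% of the time; the accuracy should be judged against that baseline.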

# Predictions

In [51]:

logit_res.predict(sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']].iloc[1:15])

ques  resp_id
1     2          0.249368
      3          0.460610
      4          0.396057
      5          0.527127
      6          0.264410
      7          0.264410
      8          0.190037
      9          0.357900
      10         0.612160
      11         0.357900
      12         0.316542
      13         0.264410
      14         0.612160
      15         0.215908
dtype: float64

# Compare to the probit

Let's fit a probit model and compare the coefficients.

In [39]:
probit_mod = sm.Probit(sportscar_d.choice_1, sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']] )
probit_res = probit_mod.fit(disp=0)

print(probit_res.summary())



                          Probit Regression Results                           
Dep. Variable:               choice_1   No. Observations:                 2000
Model:                         Probit   Df Residuals:                     1996
Method:                           MLE   Df Model:                            3
Date:                Tue, 15 Aug 2023   Pseudo R-squ.:                 0.08766
Time:                        13:36:52   Log-Likelihood:                -1154.6
converged:                       True   LL-Null:                       -1265.5
Covariance Type:            nonrobust   LLR p-value:                 7.940e-48
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
seat_1            0.1578      0.023      6.744      0.000       0.112       0.204
price_1          -0.0416      0.003    -14.272      0.000      -0.047      -0.036
trans_1_auto      0.6791      0.061     
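
One way to compare the two sets of coefficients is through their ratio. Logit and probit assume error terms with different scales, so logit coefficients typically come out roughly 1.6 to 1.8 times larger than probit ones. The sketch below simply copies the three coefficients visible in the two summaries above (`convert_1_yes` is left out because it does not appear in the printed output).

```python
# Coefficients copied from the printed Logit and Probit summaries
logit_coef = {"seat_1": 0.2665, "price_1": -0.0696, "trans_1_auto": 1.1318}
probit_coef = {"seat_1": 0.1578, "price_1": -0.0416, "trans_1_auto": 0.6791}

# Ratio logit / probit for each variable
ratios = {name: logit_coef[name] / probit_coef[name] for name in logit_coef}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The ratios are nearly constant across variables, which suggests the two models tell the same story up to a change of scale.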

Now take a look at the predictions

In [52]:
probit_res.predict(sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']].iloc[1:15])

ques  resp_id
1     2          0.252089
      3          0.459172
      4          0.399684
      5          0.522047
      6          0.266389
      7          0.266389
      8          0.190495
      9          0.361361
      10         0.603892
      11         0.361361
      12         0.321930
      13         0.266389
      14         0.603892
      15         0.217230
dtype: float64

Let's highlight the differences.

In [69]:

# Same rows as in the two prediction cells above
X_sub = sportscar_d[['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']].iloc[1:15]
logit_res.predict(X_sub) - probit_res.predict(X_sub)

ques  resp_id
1     2         -0.002721
      3          0.001438
      4         -0.003627
      5          0.005080
      6         -0.001979
      7         -0.001979
      8         -0.000458
      9         -0.003461
      10         0.008267
      11        -0.003461
      12        -0.005388
      13        -0.001979
      14         0.008267
      15        -0.001321
dtype: float64

### Predictions

For a 'new' observation: the first row of the data, modified so that the sportscar has many more seats.

In [75]:
a = sportscar_d[ ['seat_1', 'price_1', 'trans_1_auto', 'convert_1_yes']].iloc[0].copy()
a['seat_1'] = a['seat_1'] + 12
a

seat_1           14
price_1          35
trans_1_auto      0
convert_1_yes     1
Name: (1, 1), dtype: int64

In [76]:

logit_res.predict(a)

None    0.798243
dtype: float64

## Exercise: Get the utility for that individual

*Clue: Remember something about log-odds*
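
As a hint, the sketch below illustrates the general relationship between a utility value and a probability in the binary logit, using a made-up utility (`v = 0.4`, not taken from the fitted model). The helper names `inv_logit` and `log_odds` are hypothetical, not part of statsmodels.

```python
import math

def inv_logit(v):
    """Map a utility (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def log_odds(p):
    """Map a probability back to its log-odds."""
    return math.log(p / (1.0 - p))

v = 0.4                 # hypothetical utility, for illustration only
p = inv_logit(v)        # probability implied by that utility
v_back = log_odds(p)    # the log-odds of p recover the utility

print(p, v_back)
```

The two functions are inverses of each other, so a predicted probability can always be mapped back to the linear index that produced it.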

In [None]:
logit_res.predict(a)

## Exercise: Calculate WTP per seat

## Exercise: Model the seats variable better, in a nonlinear form
*Clue: Think about the utility interpretation*