In [2]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very `good/bad reviews`

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [3]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

👉 Import your dataset:

In [4]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)
orders

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,8.436574,15.544063,0.0,delivered,0,0,4,1,1,29.99,8.72,18.063837
1,53cdb2fc8bc7dce0b6741e2150273451,13.782037,19.137766,0.0,delivered,0,0,4,1,1,118.70,22.76,856.292580
2,47770eb9100c2d0c44946d9cf07ec65d,9.394213,26.639711,0.0,delivered,1,0,5,1,1,159.90,19.22,514.130333
3,949d5b44dbf5de918fe9c16f97b45f8a,13.208750,26.188819,0.0,delivered,1,0,5,1,1,45.00,27.20,1822.800366
4,ad21c59c0840e6cb83a9ceb5573f8159,2.873877,12.112049,0.0,delivered,1,0,5,1,1,19.90,8.72,30.174037
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95875,9c5dedf39a927c1b2549525ed64a053c,8.218009,18.587442,0.0,delivered,1,0,5,1,1,72.00,13.08,69.481037
95876,63943bddc261676b46f01ca7ac2f7bd8,22.193727,23.459051,0.0,delivered,0,0,4,1,1,174.90,20.10,474.098245
95877,83c1379a015df1e13d02aae0204711ab,24.859421,30.384225,0.0,delivered,1,0,5,1,1,205.99,65.02,968.051192
95878,11c177c8e97725db2631073c19f07b62,17.086424,37.105243,0.0,delivered,0,0,2,2,1,359.98,81.18,370.146853


👉 List the features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

💡 To figure out the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in your list all features that may be relevant.
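As a hedged illustration of the leakage warning above, the sketch below starts from the column names shown in the `orders` output and removes the columns that are derived from the review itself. The `target_derived` and `identifiers` sets are assumptions made for this example, not part of the `olist` package; which of the remaining candidates to keep is still a judgment call.

```python
# Column names copied from the `orders` DataFrame displayed above.
all_columns = [
    "order_id", "wait_time", "expected_wait_time", "delay_vs_expected",
    "order_status", "dim_is_five_star", "dim_is_one_star", "review_score",
    "number_of_products", "number_of_sellers", "price", "freight_value",
    "distance_seller_customer",
]

# Assumption: these columns are computed from the review, so using them as
# features would leak the target into the model.
target_derived = {"dim_is_five_star", "dim_is_one_star", "review_score"}

# Assumption: identifiers and status labels are not useful numeric features here.
identifiers = {"order_id", "order_status"}

excluded = target_derived | identifiers
candidate_features = [col for col in all_columns if col not in excluded]
print(candidate_features)
```

The selection used in this notebook (`wait_time`, `delay_vs_expected`, `distance_seller_customer`, `price`) is a subset of these candidates.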

In [5]:
# YOUR CODE HERE
# Features under study (wait_time, delay_vs_expected) plus two control features
orders_selection = ["wait_time", "delay_vs_expected", "distance_seller_customer", "price"]
orders_selection

['wait_time', 'delay_vs_expected', 'distance_seller_customer', 'price']

🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing each feature against the other features...
    * So you want to `remove the effect of scale` so that your features have equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
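To make the idea of "regressing a feature vs. the other features" concrete, here is a minimal sketch that computes a VIF by hand as `1 / (1 - R²)`. It uses synthetic data and plain NumPy; the function name `vif_by_hand` and the variables `a`, `b`, `c` are hypothetical and only serve this illustration.

```python
import numpy as np

def vif_by_hand(X):
    """Return one VIF per column of X (rows are observations)."""
    # Standardize so every column has mean 0 and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    vifs = []
    for j in range(Z.shape[1]):
        target = Z[:, j]
        others = np.delete(Z, j, axis=1)
        # Least-squares fit of column j on the remaining columns.
        coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coefs
        # Columns are centered, so the total sum of squares is target @ target.
        r_squared = 1 - (residuals @ residuals) / (target @ target)
        vifs.append(1 / (1 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=5_000)
b = 0.8 * a + 0.6 * rng.normal(size=5_000)  # built to correlate with a
c = rng.normal(size=5_000)                  # independent of a and b
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two correlated columns should show a noticeably higher VIF than the independent one, which stays close to 1.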

⚖️ Standardizing:

In [6]:
# YOUR CODE HERE
orders_standardized = orders.copy()
for feature in orders_selection:
    # z-score: subtract the mean, then divide by the standard deviation
    mu = orders_standardized[feature].mean()
    sigma = orders_standardized[feature].std()
    orders_standardized[feature] = (orders_standardized[feature] - mu) / sigma
orders_standardized

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,-0.431192,15.544063,-0.161781,delivered,0,0,4,1,1,-0.513802,8.72,-0.979475
1,53cdb2fc8bc7dce0b6741e2150273451,0.134174,19.137766,-0.161781,delivered,0,0,4,1,1,-0.086640,22.76,0.429743
2,47770eb9100c2d0c44946d9cf07ec65d,-0.329907,26.639711,-0.161781,delivered,1,0,5,1,1,0.111748,19.22,-0.145495
3,949d5b44dbf5de918fe9c16f97b45f8a,0.073540,26.188819,-0.161781,delivered,1,0,5,1,1,-0.441525,27.20,2.054621
4,ad21c59c0840e6cb83a9ceb5573f8159,-1.019535,12.112049,-0.161781,delivered,1,0,5,1,1,-0.562388,8.72,-0.959115
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95875,9c5dedf39a927c1b2549525ed64a053c,-0.454309,18.587442,-0.161781,delivered,1,0,5,1,1,-0.311513,13.08,-0.893033
95876,63943bddc261676b46f01ca7ac2f7bd8,1.023841,23.459051,-0.161781,delivered,0,0,4,1,1,0.183977,20.10,-0.212797
95877,83c1379a015df1e13d02aae0204711ab,1.305780,30.384225,-0.161781,delivered,1,0,5,1,1,0.333684,65.02,0.617630
95878,11c177c8e97725db2631073c19f07b62,0.483664,37.105243,-0.161781,delivered,0,0,2,2,1,1.075186,81.18,-0.387558


👉 Run your VIF analysis to check for potential multicollinearity:

In [7]:
# YOUR CODE HERE
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Standardized feature matrix: one column per selected feature
X = orders_standardized[["wait_time", "delay_vs_expected", "distance_seller_customer", "price"]]

vif_orders = pd.DataFrame()
vif_orders["feature"] = X.columns
# One VIF per column: each feature is regressed on all the others
vif_orders["VIF"] = [variance_inflation_factor(X.values, i)
                     for i in range(X.shape[1])]

print(vif_orders)

                    feature       VIF
0                 wait_time  2.603820
1         delay_vs_expected  2.205962
2  distance_seller_customer  1.327356
3                     price  1.007191


## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [8]:
# YOUR CODE HERE
model1 = smf.logit(formula='dim_is_one_star ~ wait_time + delay_vs_expected + distance_seller_customer + price', data=orders_standardized).fit()
model1.params

Optimization terminated successfully.
         Current function value: 0.281693
         Iterations 7


Intercept                  -2.411814
wait_time                   0.664907
delay_vs_expected           0.264990
distance_seller_customer   -0.181853
price                       0.096475
dtype: float64

In [9]:
model1.summary()

0,1,2,3
Dep. Variable:,dim_is_one_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95867.0
Method:,MLE,Df Model:,4.0
Date:,"Sat, 25 Nov 2023",Pseudo R-squ.:,0.1194
Time:,11:46:30,Log-Likelihood:,-27006.0
converged:,True,LL-Null:,-30669.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4118,0.012,-193.366,0.000,-2.436,-2.387
wait_time,0.6649,0.017,40.196,0.000,0.632,0.697
delay_vs_expected,0.2650,0.018,14.487,0.000,0.229,0.301
distance_seller_customer,-0.1819,0.013,-13.826,0.000,-0.208,-0.156
price,0.0965,0.009,10.295,0.000,0.078,0.115
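Logit coefficients are expressed in log-odds, which are hard to read directly. The sketch below converts two of the values printed above into probabilities with the logistic function. It assumes, as in this notebook, that all model features were standardized, so a feature value of 0 means "average"; the helper `sigmoid` is defined here only for illustration.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1 / (1 + math.exp(-z))

# Values copied from the `model1.params` output above.
intercept = -2.411814
coef_wait_time = 0.664907

# With standardized features, the intercept alone is the log-odds of a
# 1-star review for an order whose features are all at their mean.
p_average = sigmoid(intercept)

# Same order, but with a wait_time one standard deviation above the mean.
p_slow = sigmoid(intercept + coef_wait_time)

print(f"average order:    {p_average:.3f}")
print(f"wait_time +1 std: {p_slow:.3f}")
```

Because the logistic function is non-linear, the change in probability for one extra standard deviation depends on the starting point; only the change in log-odds is constant.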


`Logit 5️⃣`

In [10]:
# YOUR CODE HERE
model5 = smf.logit(formula='dim_is_five_star ~ wait_time + delay_vs_expected + distance_seller_customer + price', data=orders_standardized).fit()
model5.params

Optimization terminated successfully.
         Current function value: 0.642015
         Iterations 7


Intercept                   0.338358
wait_time                  -0.504764
delay_vs_expected          -0.435388
distance_seller_customer    0.088825
price                      -0.005111
dtype: float64

In [11]:
model5.summary()

0,1,2,3
Dep. Variable:,dim_is_five_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95867.0
Method:,MLE,Df Model:,4.0
Date:,"Sat, 25 Nov 2023",Pseudo R-squ.:,0.05039
Time:,11:46:30,Log-Likelihood:,-61551.0
converged:,True,LL-Null:,-64817.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.3384,0.007,47.526,0.000,0.324,0.352
wait_time,-0.5048,0.012,-43.626,0.000,-0.527,-0.482
delay_vs_expected,-0.4354,0.023,-18.587,0.000,-0.481,-0.389
distance_seller_customer,0.0888,0.008,11.215,0.000,0.073,0.104
price,-0.0051,0.007,-0.752,0.452,-0.018,0.008


💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significance using the `p-values`.
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?
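One possible way to approach this comparison is to put the two sets of coefficients side by side, convert them to odds ratios with `exp(coef)`, and compare absolute magnitudes. The sketch below only reuses the numbers printed in the two summaries above; the `comparison` dictionary and its keys are hypothetical names chosen for this example.

```python
import math

# Coefficients copied from the two summaries above (standardized features).
logit_one = {
    "wait_time": 0.6649,
    "delay_vs_expected": 0.2650,
    "distance_seller_customer": -0.1819,
    "price": 0.0965,
}
logit_five = {
    "wait_time": -0.5048,
    "delay_vs_expected": -0.4354,
    "distance_seller_customer": 0.0888,
    "price": -0.0051,
}

comparison = {}
for feature in logit_one:
    one, five = logit_one[feature], logit_five[feature]
    comparison[feature] = {
        # Multiplicative change in the odds for +1 standard deviation.
        "odds_ratio_one_star": math.exp(one),
        "odds_ratio_five_star": math.exp(five),
        # Which model has the larger coefficient in absolute value.
        "stronger_on": "five_star" if abs(five) > abs(one) else "one_star",
    }

for feature, row in comparison.items():
    print(feature, {k: round(v, 3) if isinstance(v, float) else v
                    for k, v in row.items()})
```

Keep in mind that the `price` coefficient in `logit_five` has a p-value of 0.452 in the summary above, so its magnitude should not be over-interpreted.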

In [12]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more more than one_star"

your_answer = [a]

🧪 __Test your code__

In [13]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /root/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /root/code/loredana1h/04-Decision-Science/04-Logistic-Regression/data-logit/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_logit.py::TestLogit::test_question PASSED                           [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master



<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing the 5-star rating even more than it affects the chances of a 1-star review. This is probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1 star and people who gave 5 stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5 stars easily) are less sensitive than "grumpy people" (who shoot 1-stars like Lucky Luke) when it comes to "delay" or "price"...

</details>


## Logistic vs. Linear?

👉 Compare the coefficients obtained from:
- A `Logistic Regression` to explain `dim_is_five_star`
- A `Linear Regression` to explain `review_score` 

Make sure to use the same set of features for both regressions.  

⚠️ Check that both sets of coefficients tell "the same story".

> YOUR ANSWER HERE

In [18]:
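# Note: this fit only uses wait_time and delay_vs_expected, whereas model5 above
# also includes distance_seller_customer and price
linear_model = smf.ols('review_score ~ wait_time + delay_vs_expected', data=orders_standardized).fit()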
linear_model.params

Intercept            4.155509
wait_time           -0.362105
delay_vs_expected   -0.095600
dtype: float64
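A simple, hedged way to check whether the two models "tell the same story" is to compare the signs of the coefficients they share. The sketch below copies the values printed above. Note that the linear model was fitted on two features while the logistic model used four, so only the shared features are compared and the result is indicative at best. Magnitudes are not directly comparable either: logit coefficients are in log-odds, OLS coefficients are in review-score points.

```python
# Coefficients copied from the outputs above.
logit_five = {
    "wait_time": -0.504764,
    "delay_vs_expected": -0.435388,
    "distance_seller_customer": 0.088825,
    "price": -0.005111,
}
ols_review_score = {
    "wait_time": -0.362105,
    "delay_vs_expected": -0.095600,
}

# Only features present in both models can be compared.
shared = [f for f in logit_five if f in ols_review_score]

# True when both models push the outcome in the same direction.
same_sign = {
    f: (logit_five[f] > 0) == (ols_review_score[f] > 0) for f in shared
}
print(same_sign)
```

For a stricter comparison, both regressions would be refitted with the identical formula, as the instructions above ask.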

🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook!