# Logit Orders - A warm-up challenge (~1h)

Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews.

Using our `orders` training_set, we will run two multivariate logistic regressions (`logit_one` and `logit_five`) to predict `dim_is_one_star` and `dim_is_five_star` respectively.

 

In [1]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

❓ Import your dataset

In [2]:
# Import olist data
from olist.data import Olist
olist=Olist()
data=olist.get_data()  ## is dict
matching_table = olist.get_matching_table()
list(data.keys())

['sellers',
 'product_category_name_translation',
 'orders',
 'order_items',
 'customers',
 'geolocation',
 'order_payments',
 'order_reviews',
 'products']

In [3]:
matching_table.shape

(114100, 5)

In [12]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

In [50]:
# Import seller-level data
from olist.seller import Seller
seller=Seller()
sellers=seller.get_seller_delay_wait_time() ## is a DataFrame
scores=seller.get_review_score() ## is a DataFrame

In [10]:
sellers.head(3)

| | seller_id | delay_to_carrier | wait_time |
| --- | --- | --- | --- |
| 0 | 0015a82c2db000af6aaaf3ae2ecb0532 | 0.0 | 10.793885 |
| 1 | 001cca7ae9ae17fb1caed9dfb1094831 | 0.06326 | 13.096632 |
| 2 | 002100f778ceb8431b7a1020ff7ab48f | 0.342569 | 16.192371 |


In [12]:
scores.head(3)

| | seller_id | share_of_one_stars | share_of_five_stars | review_score |
| --- | --- | --- | --- | --- |
| 0 | 0015a82c2db000af6aaaf3ae2ecb0532 | 0.333333 | 0.666667 | 3.666667 |
| 1 | 001cca7ae9ae17fb1caed9dfb1094831 | 0.13 | 0.52 | 3.95 |
| 2 | 001e6ad469a905060d959994f1b41e4f | 1.0 | 0.0 | 1.0 |


In [3]:
# Import orders training_set 
from olist.order import Order
order=Order()
orders=order.get_training_data() ## is a DataFrame
orders.shape

(97007, 12)

In [4]:
orders.head(3)

| | order_id | wait_time | expected_wait_time | delay_vs_expected | order_status | dim_is_five_star | dim_is_one_star | review_score | number_of_products | number_of_sellers | price | freight_value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 8.436574 | 15.544063 | 0.0 | delivered | 0 | 0 | 4 | 1 | 1 | 29.99 | 8.72 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | 13.782037 | 19.137766 | 0.0 | delivered | 0 | 0 | 4 | 1 | 1 | 118.7 | 22.76 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 9.394213 | 26.639711 | 0.0 | delivered | 1 | 0 | 5 | 1 | 1 | 159.9 | 19.22 |


❓ Select which features you want to use (avoid data-leaks)

### features 1 

In [20]:
features = ['wait_time',
 'delay_vs_expected',
 'price',
 'freight_value']

In [21]:
orders[features].head(3)

| | wait_time | delay_vs_expected | price | freight_value |
| --- | --- | --- | --- | --- |
| 0 | 8.436574 | 0.0 | 29.99 | 8.72 |
| 1 | 13.782037 | 0.0 | 118.7 | 22.76 |
| 2 | 9.394213 | 0.0 | 159.9 | 19.22 |


❓ Check the multicollinearity of your features using the `VIF index`. It shouldn't be too high (preferably < 10) to ensure we can trust the partial regression coefficients and their associated `p-values`.

In [22]:
df = pd.DataFrame()
df["vif_index"] = [vif(orders[features].values, i) for i in range(orders[features].shape[1])]
df["features"] = orders[features].columns
df

| | vif_index | features |
| --- | --- | --- |
| 0 | 2.797923 | wait_time |
| 1 | 1.639415 | delay_vs_expected |
| 2 | 1.692361 | price |
| 3 | 2.496467 | freight_value |


### features 2

In [23]:
features = ['wait_time',
 'delay_vs_expected',
 'number_of_sellers', 'number_of_products',
 'price',
 'freight_value']

In [24]:
orders[features].head(3)

| | wait_time | delay_vs_expected | number_of_sellers | number_of_products | price | freight_value |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 8.436574 | 0.0 | 1 | 1 | 29.99 | 8.72 |
| 1 | 13.782037 | 0.0 | 1 | 1 | 118.7 | 22.76 |
| 2 | 9.394213 | 0.0 | 1 | 1 | 159.9 | 19.22 |


❓ Check the multicollinearity of your features using the `VIF index`. It shouldn't be too high (preferably < 10) to ensure we can trust the partial regression coefficients and their associated `p-values`.

In [25]:
df = pd.DataFrame()
df["vif_index"] = [vif(orders[features].values, i) for i in range(orders[features].shape[1])]
df["features"] = orders[features].columns
df

| | vif_index | features |
| --- | --- | --- |
| 0 | 5.48264 | wait_time |
| 1 | 2.05756 | delay_vs_expected |
| 2 | 8.910229 | number_of_sellers |
| 3 | 7.363534 | number_of_products |
| 4 | 1.72179 | price |
| 5 | 3.265091 | freight_value |


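 'number_of_sellers' (8.91) and 'number_of_products' (7.36) have the highest VIF values, followed by 'wait_time' (5.48),
 indicating that each of these features is partly explained by the other features.
 ---->  All values stay below 10, so multicollinearity is noticeable with this second feature set but remains within the preferred threshold.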
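One caveat worth keeping in mind: as far as I know, `variance_inflation_factor` regresses each column on the other columns exactly as they are passed, so no intercept is included unless a constant column is added to the matrix. The sketch below re-implements VIF as $1 / (1 - R^2)$ with an intercept, using NumPy on synthetic data. The helper `vif_with_intercept` and the columns `x1`, `x2`, `x3` are hypothetical and not part of `olist`; this is an illustration, not a replacement for the cell above.

```python
import numpy as np

def vif_with_intercept(X):
    """Return one VIF per column: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + other features
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ beta
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic data (assumption): x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vifs])
```

The two nearly identical columns should get a very large VIF, while the independent one stays close to 1.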

❓ Fit two LOGIT models (`logit_one` and `logit_five`) to predict `dim_is_one_star` and `dim_is_five_star`

In [26]:
formula_one = "dim_is_one_star ~ " + ' + '.join(features)
formula_one

'dim_is_one_star ~ wait_time + delay_vs_expected + number_of_sellers + number_of_products + price + freight_value'

In [27]:
formula_five = "dim_is_five_star ~ " + ' + '.join(features)
formula_five

'dim_is_five_star ~ wait_time + delay_vs_expected + number_of_sellers + number_of_products + price + freight_value'

In [28]:
logit_one = smf.logit(formula= formula_one, data=orders).fit()
logit_one.params

Optimization terminated successfully.
         Current function value: 0.280352
         Iterations 7


Intercept            -5.264097
wait_time             0.064804
delay_vs_expected     0.066703
number_of_sellers     1.411319
number_of_products    0.507396
price                 0.000300
freight_value        -0.003333
dtype: float64
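To make these coefficients more concrete, here is a minimal sketch that turns them into a predicted probability by hand. It copies the rounded `logit_one` parameters printed above and uses the first row of `orders.head(3)` as an illustrative order; the helper `predict_proba` is hypothetical (with the fitted model available, `logit_one.predict(...)` should give the same kind of result).

```python
import math

# Coefficients copied from the logit_one.params output above
params = {
    "Intercept": -5.264097,
    "wait_time": 0.064804,
    "delay_vs_expected": 0.066703,
    "number_of_sellers": 1.411319,
    "number_of_products": 0.507396,
    "price": 0.000300,
    "freight_value": -0.003333,
}

# First row of orders.head(3), used purely as an example
order_row = {
    "wait_time": 8.436574,
    "delay_vs_expected": 0.0,
    "number_of_sellers": 1,
    "number_of_products": 1,
    "price": 29.99,
    "freight_value": 8.72,
}

def predict_proba(params, row):
    """Logistic (sigmoid) function applied to the linear predictor."""
    z = params["Intercept"] + sum(params[name] * value for name, value in row.items())
    return 1 / (1 + math.exp(-z))

p_one_star = predict_proba(params, order_row)
print(f"Estimated probability of a 1-star review: {p_one_star:.3f}")
```

For this order (short wait, no delay, one seller, one product) the estimated probability of a 1-star review is low, which is consistent with the positive coefficients on `wait_time` and `delay_vs_expected`.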

In [30]:
logit_five = smf.logit(formula= formula_five, data=orders).fit()
logit_five.params

Optimization terminated successfully.
         Current function value: 0.638115
         Iterations 7


Intercept             2.459868
wait_time            -0.049446
delay_vs_expected    -0.101731
number_of_sellers    -1.141325
number_of_products   -0.283877
price                 0.000065
freight_value         0.001584
dtype: float64

In [31]:
logit_one.summary()

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | dim_is_one_star | No. Observations: | 97007 |
| Model: | Logit | Df Residuals: | 97000 |
| Method: | MLE | Df Model: | 6 |
| Date: | Thu, 22 Jul 2021 | Pseudo R-squ.: | 0.1407 |
| Time: | 11:35:35 | Log-Likelihood: | -27196.0 |
| converged: | True | LL-Null: | -31650.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -5.2641 | 0.069 | -75.777 | 0.000 | -5.400 | -5.128 |
| wait_time | 0.0648 | 0.002 | 40.920 | 0.000 | 0.062 | 0.068 |
| delay_vs_expected | 0.0667 | 0.004 | 17.660 | 0.000 | 0.059 | 0.074 |
| number_of_sellers | 1.4113 | 0.063 | 22.524 | 0.000 | 1.289 | 1.534 |
| number_of_products | 0.5074 | 0.020 | 25.910 | 0.000 | 0.469 | 0.546 |
| price | 0.0003 | 5.29e-05 | 5.684 | 0.000 | 0.000 | 0.000 |
| freight_value | -0.0033 | 0.001 | -5.308 | 0.000 | -0.005 | -0.002 |


In [32]:
logit_five.summary()

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | dim_is_five_star | No. Observations: | 97007 |
| Model: | Logit | Df Residuals: | 97000 |
| Method: | MLE | Df Model: | 6 |
| Date: | Thu, 22 Jul 2021 | Pseudo R-squ.: | 0.05771 |
| Time: | 11:35:46 | Log-Likelihood: | -61902.0 |
| converged: | True | LL-Null: | -65693.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 2.4599 | 0.064 | 38.448 | 0.000 | 2.334 | 2.585 |
| wait_time | -0.0494 | 0.001 | -45.267 | 0.000 | -0.052 | -0.047 |
| delay_vs_expected | -0.1017 | 0.005 | -20.746 | 0.000 | -0.111 | -0.092 |
| number_of_sellers | -1.1413 | 0.063 | -18.123 | 0.000 | -1.265 | -1.018 |
| number_of_products | -0.2839 | 0.015 | -18.383 | 0.000 | -0.314 | -0.254 |
| price | 6.46e-05 | 3.63e-05 | 1.778 | 0.075 | -6.61e-06 | 0.000 |
| freight_value | 0.0016 | 0.000 | 3.878 | 0.000 | 0.001 | 0.002 |


In [37]:
import math

In [40]:
coef = 6.46e-05
n = math.exp(coef); print(n, n / (1 + n))

1.0000646020866248 0.5000161499999943


❓Interpret your results:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?

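- wait_time (`logit_five`): odds ratio $e^{-0.0494} \approx 0.95$, so each additional unit of `wait_time` multiplies the odds of a 5-star review by about 0.95 (roughly a 5% decrease), all else being equal.
- Starting from even odds (P = 50%), the probability becomes P = 0.9518 / (1 + 0.9518) ≈ 48.76 %

- price (`logit_five`): odds ratio $e^{6.46 \times 10^{-5}} \approx 1.00$, so `price` has practically no effect on the odds of a 5-star review, and its p-value (0.075) is not significant at the 5% level.
- Starting from even odds (P = 50%), the probability stays at P = 1.00 / (1 + 1.00) ≈ 50 %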
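The arithmetic above can be sketched in a few lines. Note that $e^{\text{coef}}$ is an odds ratio, so turning it into a probability requires choosing a baseline; the 50% baseline used here is an assumption made for illustration, and the helper `shifted_probability` is hypothetical.

```python
import math

def shifted_probability(baseline_p, coef, delta=1.0):
    """Probability after increasing a feature by `delta`, starting from `baseline_p`."""
    baseline_odds = baseline_p / (1 - baseline_p)
    new_odds = baseline_odds * math.exp(coef * delta)  # odds are multiplied by e^(coef * delta)
    return new_odds / (1 + new_odds)

# logit_five coefficients copied from the summary above
p_wait = shifted_probability(0.5, -0.0494)     # one more unit of wait_time
p_price = shifted_probability(0.5, 6.46e-05)   # one more unit of price

print(f"wait_time: {p_wait:.4f}, price: {p_price:.4f}")
```

With a different baseline (for example 80%), the same coefficient produces a different change in probability, which is why the odds ratio is usually the quantity reported.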


#### Confidence interval

In [41]:
# 90% confidence interval of the coefficients
print(logit_one.conf_int(alpha=0.1))

                           0         1
Intercept          -5.378362 -5.149831
wait_time           0.062199  0.067408
delay_vs_expected   0.060491  0.072916
number_of_sellers   1.308257  1.514382
number_of_products  0.475185  0.539606
price               0.000214  0.000387
freight_value      -0.004366 -0.002300


In [42]:
# 90% confidence interval of the coefficients
print(logit_five.conf_int(alpha=0.1))

                           0         1
Intercept           2.354633  2.565103
wait_time          -0.051243 -0.047650
delay_vs_expected  -0.109796 -0.093665
number_of_sellers  -1.244910 -1.037740
number_of_products -0.309278 -0.258477
price               0.000005  0.000124
freight_value       0.000912  0.002255
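Since the logit coefficients live on the log-odds scale, exponentiating the bounds of an interval gives the corresponding interval for the odds ratio. A minimal sketch, using the 90% bounds of `wait_time` and `price` copied from the `logit_five` output above (the variable names are illustrative):

```python
import math

# 90% confidence bounds copied from logit_five.conf_int(alpha=0.1)
ci_log_odds = {
    "wait_time": (-0.051243, -0.047650),
    "price": (0.000005, 0.000124),
}

# exp() is increasing, so the lower/upper bounds keep their order
ci_odds_ratio = {
    name: (math.exp(low), math.exp(high))
    for name, (low, high) in ci_log_odds.items()
}

for name, (low, high) in ci_odds_ratio.items():
    print(f"{name}: odds ratio between {low:.5f} and {high:.5f}")
```

An interval entirely below 1 (as for `wait_time`) points to a decrease in the odds of a 5-star review, while an interval hugging 1 (as for `price`) points to a negligible effect.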


#### p-value

In [43]:
print(logit_one.llr_pvalue, logit_five.llr_pvalue)

0.0 0.0
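These p-values come from a likelihood-ratio test comparing each model with an intercept-only model. As a quick sanity check, the test statistic $2\,(LL - LL_{null})$ and McFadden's pseudo R-squared $1 - LL / LL_{null}$ can be recomputed from the log-likelihoods shown in the two summaries (values copied by hand; the helper names are hypothetical):

```python
# Log-likelihoods copied from the two summaries above
log_likelihoods = {
    "logit_one": {"ll": -27196.0, "ll_null": -31650.0},
    "logit_five": {"ll": -61902.0, "ll_null": -65693.0},
}

def llr_statistic(ll, ll_null):
    """Likelihood-ratio statistic: twice the gain in log-likelihood over the null model."""
    return 2 * (ll - ll_null)

def pseudo_r_squared(ll, ll_null):
    """McFadden's pseudo R-squared."""
    return 1 - ll / ll_null

for name, values in log_likelihoods.items():
    stat = llr_statistic(**values)
    r2 = pseudo_r_squared(**values)
    print(f"{name}: LLR = {stat:.0f}, pseudo R-squared = {r2:.4f}")
```

With 6 degrees of freedom, statistics in the thousands correspond to p-values that print as `0.0`, and the recomputed pseudo R-squared values should match the `Pseudo R-squ.` entries of the summaries.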


#### coeff 

In [44]:
logit_one.params

Intercept            -5.264097
wait_time             0.064804
delay_vs_expected     0.066703
number_of_sellers     1.411319
number_of_products    0.507396
price                 0.000300
freight_value        -0.003333
dtype: float64

In [45]:
logit_five.params

Intercept             2.459868
wait_time            -0.049446
delay_vs_expected    -0.101731
number_of_sellers    -1.141325
number_of_products   -0.283877
price                 0.000065
freight_value         0.001584
dtype: float64

In [46]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [47]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/kenzaelhoussaini/.pyenv/versions/3.8.6/bin/python3
cachedir: .pytest_cache
rootdir: /Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/04-Logistic-Regression/01-Logit
plugins: dash-1.20.0, anyio-3.2.1
collecting ... collected 1 item

tests/test_logit.py::TestLogit::test_question PASSED                     [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master


<details>
    <summary>Explanations</summary>


> _All other things being equal, the delay factor tends to increase the chances of losing the 5-star rating even more than it affects the chances of a 1-star review, probably because 1-star reviews really target bad products themselves, not bad deliveries._
    
</details>
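The two statements can also be checked numerically by comparing the magnitude of each coefficient across the two models. This is only a rough sketch: the coefficients are copied from the summaries above, and comparing raw magnitudes across two models with different base rates is an approximation rather than a formal test.

```python
import math

# Coefficients copied from the logit_one and logit_five summaries
coefs = {
    "delay_vs_expected": {"one_star": 0.0667, "five_star": -0.1017},
    "wait_time": {"one_star": 0.0648, "five_star": -0.0494},
}

def pct_change_in_odds(coef):
    """Percentage change in the odds for a one-unit increase of the feature."""
    return (math.exp(coef) - 1) * 100

for feature, by_model in coefs.items():
    for model, coef in by_model.items():
        print(f"{feature} / {model}: {pct_change_in_odds(coef):+.1f}% odds per unit")

# Statement a: delay_vs_expected weighs more on 5-star than on 1-star reviews
a_is_true = abs(coefs["delay_vs_expected"]["five_star"]) > abs(coefs["delay_vs_expected"]["one_star"])
# Statement b: wait_time weighs more on 5-star than on 1-star reviews
b_is_true = abs(coefs["wait_time"]["five_star"]) > abs(coefs["wait_time"]["one_star"])
print(a_is_true, b_is_true)
```

Under this comparison only statement `a` holds, which matches the answer stored in `your_answer`.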


❓ How do these regression coefficients compare with an OLS on `review_score` using the same features? Double check that both OLS and Logit analyses tell approximately "the same story".

In [48]:
formula = "review_score ~ " + ' + '.join(features)
formula

'review_score ~ wait_time + delay_vs_expected + number_of_sellers + number_of_products + price + freight_value'

In [49]:
model = smf.ols(formula = formula, data = orders).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           review_score   R-squared:                       0.141
Model:                            OLS   Adj. R-squared:                  0.141
Method:                 Least Squares   F-statistic:                     2661.
Date:                Thu, 22 Jul 2021   Prob (F-statistic):               0.00
Time:                        11:58:26   Log-Likelihood:            -1.5545e+05
No. Observations:               97007   AIC:                         3.109e+05
Df Residuals:                   97000   BIC:                         3.110e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              6.0171      0

### 🏁 Congratulations! Don't forget to commit and push your notebook