In [91]:
%load_ext autoreload
%autoreload 2


# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews.

👉 Using our `orders` training set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [33]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif




👉 Import your dataset:

In [34]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select in a list which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

💡 To figure out the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in your list all the features that may be relevant.
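As a rough illustration of the leakage warning above, the sketch below starts from the columns of `orders` and drops anything derived from the review itself, plus columns that only identify or describe the order. Which columns belong in each group is an assumption made for this example, and `candidate_features` is a name introduced here, not part of the challenge.

```python
# Columns of `orders`, as listed by `orders.columns` in this notebook
columns = [
    "order_id", "wait_time", "expected_wait_time", "delay_vs_expected",
    "order_status", "dim_is_five_star", "dim_is_one_star", "review_score",
    "number_of_products", "number_of_sellers", "price", "freight_value",
    "distance_seller_customer",
]

# Assumption: these are the targets or are derived from the review,
# so using them as features would leak the answer
targets = {"dim_is_five_star", "dim_is_one_star", "review_score"}

# Assumption: identifiers and status labels carry no usable numeric signal here
non_features = {"order_id", "order_status"}

excluded = targets | non_features
candidate_features = [c for c in columns if c not in excluded]
print(candidate_features)
```

The list selected in the notebook is a subset of these candidates; whether to add the remaining ones is a modelling choice.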

In [38]:
orders.columns

Index(['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
       'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
       'number_of_products', 'number_of_sellers', 'price', 'freight_value',
       'distance_seller_customer'],
      dtype='object')

In [46]:
features = ["wait_time","delay_vs_expected","number_of_products",'price','freight_value']

In [47]:
orders[features].head()


Unnamed: 0,wait_time,delay_vs_expected,number_of_products,price,freight_value
0,8.436574,0.0,1,29.99,8.72
1,13.782037,0.0,1,118.7,22.76
2,9.394213,0.0,1,159.9,19.22
3,13.20875,0.0,1,45.0,27.2
4,2.873877,0.0,1,19.9,8.72


🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably), so that we can trust the partial regression coefficients and their associated `p-values`.
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing each feature against the other features...
    * So you want to `remove the effect of scale`, so that your features have equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
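To make the definition concrete, here is a minimal sketch that computes the VIF by hand: each column is regressed on the others, and the VIF is `1 / (1 - R²)` of that regression. It uses only `numpy` on synthetic data; `vif_by_hand` is a hypothetical helper written for this illustration, not the `statsmodels` function used in this notebook.

```python
import numpy as np

def vif_by_hand(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of a 2-D array."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns, with an intercept
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Synthetic data: `b` is nearly a copy of `a`, `c` is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)
c = rng.normal(size=1000)

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns should show a VIF far above 10, while the independent column should stay close to 1.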

⚖️ Standardizing:

In [48]:
features = ["wait_time","delay_vs_expected","number_of_products",'price','freight_value']


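from sklearn.preprocessing import StandardScaler

# Work on a copy so the raw `orders` DataFrame stays untouched
features_transformed = orders.copy()

# Scale only the selected features to zero mean and unit variance;
# the target columns stay in the frame for the regressions below
scaler = StandardScaler()
features_transformed[features] = scaler.fit_transform(orders[features])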


👉 Run your VIF Analysis to check for potential multicollinearity:

In [90]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Standardized features only (already centered by the scaler above)
X = features_transformed[features]

# One VIF per feature: column i regressed on all the other columns
pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)

wait_time             2.100763
delay_vs_expected     2.021527
number_of_products    1.258599
price                 1.204735
freight_value         1.541195
dtype: float64

## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [88]:
from statsmodels.formula.api import logit

# `data` must contain the target column as well as the scaled features
logit_one = logit('dim_is_one_star~' + '+'.join(features), data=features_transformed).fit()

logit_one.summary()

PatsyError: Error evaluating factor: NameError: name 'dim_is_one_star' is not defined
    dim_is_one_star~wait_time+delay_vs_expected+number_of_products+price+freight_value
    ^^^^^^^^^^^^^^^

`Logit 5️⃣`

In [89]:
# `data` must contain the target column as well as the scaled features
logit_five = logit('dim_is_five_star~' + '+'.join(features), data=features_transformed).fit()

logit_five.summary()


PatsyError: Error evaluating factor: NameError: name 'dim_is_five_star' is not defined
    dim_is_five_star~wait_time+delay_vs_expected+number_of_products+price+freight_value
    ^^^^^^^^^^^^^^^^
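A `PatsyError` of this kind means the formula refers to a name that patsy could not find in `data` or in the surrounding namespace. One plausible cause here, given the out-of-order execution counts, is that `features_transformed` held only the scaled feature columns when these cells ran; that is an assumption, not something the notebook confirms. The helper below, `missing_formula_columns` (a hypothetical name, standard library only), lists the formula terms that are absent from a set of columns. It handles only simple additive formulas like the ones above.

```python
import re

def missing_formula_columns(formula, columns):
    """List the names used in a simple additive formula that are not in `columns`."""
    available = set(columns)
    names = re.findall(r"[A-Za-z_]\w*", formula)
    return [name for name in names if name not in available]

features = ["wait_time", "delay_vs_expected", "number_of_products",
            "price", "freight_value"]
formula = "dim_is_one_star~" + "+".join(features)

# Case 1: a frame holding only the scaled features (the target is missing)
missing_without_target = missing_formula_columns(formula, features)

# Case 2: a frame that also keeps the target column
missing_with_target = missing_formula_columns(formula, features + ["dim_is_one_star"])

print(missing_without_target)
print(missing_with_target)
```

Running the same check against `features_transformed.columns` before fitting would show whether the target column is present.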

💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`.
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?
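One common way to put a logit coefficient into words is to convert it to an odds ratio with `exp(coef)`. Because the features are standardized, the ratio describes the effect of a one-standard-deviation increase, all other features held constant. The coefficient values below are hypothetical placeholders chosen for illustration; they are not results from this notebook.

```python
import math

def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coef)

# Hypothetical coefficients, for illustration only
example_coefs = {"wait_time": 0.7, "price": -0.1}

ratios = {name: odds_ratio(coef) for name, coef in example_coefs.items()}
for name, ratio in ratios.items():
    change = (ratio - 1) * 100
    print(f"{name}: odds multiplied by {ratio:.2f} ({change:+.0f}%)")
```

A ratio above 1 means the feature raises the odds of the outcome, and a ratio below 1 means it lowers them.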

In [None]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more more than one_star"

your_answer = []

🧪 __Test your code__

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing the 5-star rating even more than it affects the chances of a 1-star review. This is probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1 star and people who gave 5 stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5 stars easily) are less sensitive than "grumpy people" (who shoot 1 stars like Lucky Luke) when it comes to "delay" or "price"...

</details>


## Logistic vs. Linear ?

👉 Compare:
- the regression coefficients obtained from the `Logistic Regression`
- with the regression coefficients obtained through a `Linear Regression`
- on `review_score`, using the same features.

⚠️ Check that both sets of coefficients tell "the same story".
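The comparison can be sketched on synthetic data before running it on `orders`. Everything below is an assumption made for illustration: the data are simulated so that longer waits and delays lower a latent satisfaction score, and the column names simply mirror the ones in this notebook. It is not the expected solution, only one way to line the coefficients up side by side.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the standardized training set (not the real Olist data)
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "wait_time": rng.normal(size=n),
    "delay_vs_expected": rng.normal(size=n),
})

# Assumed data-generating process: satisfaction drops with wait and delay
latent = -0.5 * df["wait_time"] - 1.0 * df["delay_vs_expected"] + rng.logistic(size=n)
df["review_score"] = np.clip(np.round(4 + latent), 1, 5)
df["dim_is_five_star"] = (df["review_score"] == 5).astype(int)
df["dim_is_one_star"] = (df["review_score"] == 1).astype(int)

rhs = "wait_time + delay_vs_expected"
linear = smf.ols(f"review_score ~ {rhs}", data=df).fit()
logit_five = smf.logit(f"dim_is_five_star ~ {rhs}", data=df).fit(disp=0)
logit_one = smf.logit(f"dim_is_one_star ~ {rhs}", data=df).fit(disp=0)

# Side-by-side coefficients: compare signs and relative sizes, not raw magnitudes
comparison = pd.DataFrame({
    "linear_review_score": linear.params,
    "logit_five_star": logit_five.params,
    "logit_one_star": logit_one.params,
})
print(comparison)
```

The coefficients live on different scales (score points versus log-odds), so "the same story" here means matching signs and a similar ranking of features: negative for the linear and five-star models, positive for the one-star model.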

> YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE

🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook !