# Simple orders analysis

We are finally ready to start analysing our order dataset!

Our objective is to get an initial understanding of:
- Order properties
- Their associated `review_scores`

In [1]:
#import modules 
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
%load_ext autoreload
%autoreload 2

In [None]:
# import your newly coded order training set
from olist.order import Order
# Change `with_distance_seller_customer` to False if you have not yet completed the optional part of challenge 1
orders = Order().get_training_data(with_distance_seller_customer=True)

## 1 - Inspect features

❓ Print summary statistics (`DataFrame.describe()`) for each column of the order dataset, paying particular attention to `wait_time`

Plot various histograms to get a sense of each variable's distribution.
In particular, create a `sns.FacetGrid()` of histograms for each `review_score`

What do you notice about the variables `price` and `freight_value`? Also analyse `distance_seller_customer` if you have created it in `order.py`

In [36]:
# Your plots here

----
❓ Inspect the correlations between features: which one seems most correlated with `review_score`?

<details>
    <summary>Hint</summary>

`DataFrame.corr()` combined with `sns.heatmap()` and `cmap='coolwarm'`
</details>
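
The hint can be sketched as follows. The data below is synthetic: `toy_orders` and the rule linking `review_score` to `wait_time` are assumptions made up for the example, so the numbers it prints say nothing about the real orders.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for `orders` (hypothetical, for illustration only)
rng = np.random.default_rng(1)
n = 1_000
wait_time = rng.gamma(shape=2.0, scale=6.0, size=n)
price = rng.lognormal(mean=4.0, sigma=1.0, size=n)
# Toy rule: scores drop as wait_time grows, plus noise; price plays no role
review_score = np.clip(np.round(5 - 0.1 * wait_time + rng.normal(0, 1, n)), 1, 5)

toy_orders = pd.DataFrame(
    {"wait_time": wait_time, "price": price, "review_score": review_score}
)

# Pairwise Pearson correlations between all numeric columns
corr = toy_orders.corr()

# Rank the features by the strength (absolute value) of their correlation
ranking = corr["review_score"].drop("review_score").abs().sort_values(ascending=False)
print(ranking)

# Heatmap of the full matrix; a diverging colormap centred on 0 makes signs visible
ax = sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, annot=True)
```

Ranking by absolute value matters here: a strongly negative correlation is just as informative as a strongly positive one.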

In [44]:
# Your plots here

## 2 - Simple regression of `review_score` against delivery duration

`review_score` appears most strongly correlated with `wait_time` (|r| ≈ 0.33) and `delay_vs_expected` (|r| ≈ 0.27).
Let's investigate these two relationships with seaborn.

### 2.1 Plots
❓ In one figure, create two subplots that regress `review_score` on `wait_time` and `delay_vs_expected` respectively

- Reduce your dataframe to a random subsample of 10,000 rows for speed (a good practice in data exploration)
- Use `DataFrame.sample()` with a fixed `random_state` so that the sample does not change at each execution
- Use `sns.regplot()` to plot the regression line
- Add some `y_jitter` parameters to better visualize the scatterplot density
- Limit `xlim` and `ylim` to hide outliers
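
The steps above could look roughly like this. The sketch runs on synthetic data: `toy_orders`, its coefficients and the axis limits are all assumptions chosen for the example, so adapt the column values and limits to what you see in the real `orders`.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for `orders` (hypothetical, for illustration only)
rng = np.random.default_rng(2)
n = 20_000
wait_time = rng.gamma(shape=2.0, scale=6.0, size=n)
delay_vs_expected = np.clip(wait_time - 20 + rng.normal(0, 3, n), 0, None)
review_score = np.clip(
    np.round(5 - 0.05 * wait_time - 0.1 * delay_vs_expected + rng.normal(0, 1, n)),
    1, 5,
)
toy_orders = pd.DataFrame({
    "wait_time": wait_time,
    "delay_vs_expected": delay_vs_expected,
    "review_score": review_score,
})

# Fixed random_state: the same 10,000 rows are drawn at every execution
sample = toy_orders.sample(10_000, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, feature in zip(axes, ["wait_time", "delay_vs_expected"]):
    # y_jitter spreads the integer scores vertically so density becomes visible
    sns.regplot(x=feature, y="review_score", data=sample, y_jitter=0.1,
                scatter_kws={"alpha": 0.1, "s": 5}, line_kws={"color": "red"}, ax=ax)
    # Arbitrary limits for the toy data, chosen to hide extreme values
    ax.set(xlim=(0, 60), ylim=(0, 6))
```

Note that `y_jitter` only affects the plotted points; the regression itself is fitted on the original values.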

In [1]:
# SUB-SAMPLE YOUR DATASET

In [None]:
# YOUR PLOT HERE

### 2.2 Interpretation

❓ Try to visually 'compute' the `slope` of each regression line. Write down, in plain English, how you would interpret these coefficients if you were to explain them to a non-data scientist

In [1]:
# Your answer here

<details>
    <summary>Answer</summary>


- Slope wait_time = -0.05 : "For each additional day an order takes to deliver, the review_score on average is reduced by 0.05"
- Slope delay = -0.1 : "For each additional day an order takes to deliver _above expected_, the review_score on average is reduced by 0.1"

Try to convince yourself intuitively why the latter is more impactful than the former!
</details>

### 2.3 Inferential analysis

Even if we used all 100,000 orders for these regplots, they only represent 16 months of data after all.

**How certain** are we that these coefficients are **statistically significant**, i.e. that they reflect a real relationship that would generalize to future orders (all else being equal), rather than random fluctuations due to the limited observation period?

We need to estimate the **confidence interval** around the estimated value of each slope:
$$\text{slope}_{wait} = -0.05 \pm \ ?? \ \text{[95\% interval]}$$
$$\text{slope}_{delay} = -0.1 \pm \ ?? \ \text{[95\% interval]}$$

Fortunately, seaborn already estimates a 95% confidence interval for the regression line (by bootstrapping) and draws it as a shaded band around the line!

❓Use seaborn `regplot` to visualize the two confidence intervals:
- Change the size of the sample by sub-sampling your dataset: Notice how the slope may change as the sample size becomes smaller. What about the confidence interval?
- Change the size of the confidence interval by playing with the `ci` parameter of `regplot` (95% by default)

In [None]:
# YOUR PLOT HERE

<details>
    <summary>🔎 Interpretation</summary>

When plotting all our datapoints:
- The 95% confidence interval for the slope does not contain the value 0.
- We are 95% confident that slower deliveries are associated with weaker reviews.
- The `p-value` associated with the null hypothesis "review_score is not related to delivery duration" is close to 0, so we can safely reject this hypothesis

$\implies$ Our findings are said to be **statistically significant**. 

However, **correlation does not imply causality**. It may well be that some products that are inherently slow to deliver on average (heavy ones, maybe?) also happen to have a consistently low review_score, however long they take to be delivered. Identifying such **confounding factors** is crucial and cannot be done with a simple univariate regression. Tomorrow we will see the power of multivariate linear regression.
</details>
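
To put numbers on the shaded band, one option is to bootstrap the slope directly. The sketch below does so on synthetic data: `make_toy_orders` and `bootstrap_slope_ci` are hypothetical helpers written for this example, and the true slope of -0.05 is an assumption baked into the toy data. It resamples in the same spirit as seaborn but does not reproduce its exact computation.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_toy_orders(n):
    """Hypothetical data: review_score falls by about 0.05 per day of wait_time."""
    wait_time = rng.gamma(shape=2.0, scale=6.0, size=n)
    review_score = np.clip(5 - 0.05 * wait_time + rng.normal(0, 1.2, n), 1, 5)
    return wait_time, review_score

def bootstrap_slope_ci(x, y, n_boot=1000, ci=95):
    """Percentile bootstrap interval for the slope of a simple linear fit."""
    n = len(x)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        slopes[i] = np.polyfit(x[idx], y[idx], deg=1)[0]  # [0] is the slope
    tail = (100 - ci) / 2
    low, high = np.percentile(slopes, [tail, 100 - tail])
    return low, high

x, y = make_toy_orders(5_000)
intervals = {}
for size in (100, 1_000, 5_000):
    intervals[size] = bootstrap_slope_ci(x[:size], y[:size])
    low, high = intervals[size]
    print(f"n={size:>5}: 95% CI for the slope = [{low:.3f}, {high:.3f}]")
```

The interval narrows as the sample grows, and once it no longer contains 0 the negative relationship is significant at the chosen level.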



🏁 Congratulations! Don't forget to commit and push your notebook