# Products

Our goal is to find product **categories** that repeatedly underperform others, and to understand why.

## 0 - `product.py` 

We have given you the solution to `product.py` in your challenge's folder.
👉 Copy-paste it to your local `olist/product.py` file.

It provides aggregates at `product_id` level of the various orders that have taken place with Olist

The `get_training_data` method in `olist/product.py` returns a DataFrame with the following columns:

  - `product_id` (_str_) _the id of the product_ **UNIQUE**
  - `category` (_str_) _the category name (in english)_
  - `height` (_float_) _height of the product (in cm)_
  - `width` (_float_) _width of the product (in cm)_
  - `length` (_float_) _length of the product (in cm)_
  - `weight` (_float_) _weight of the product (in g)_
  - `price` (_float_) _average price at which the product is sold_
  - `freight_value` (_float_) _average value of freight_
  - `product_name_length` (_float_) _character length of product name_
  - `product_description_length` (_float_) _character length of product description_
  - `n_orders` (_int_) _the number of orders in which the product appeared_
  - `quantity` (_int_) _the total number of products sold_
  - `wait_time` (_float_) _the average wait time in days for orders in which the product was sold_
  - `share_of_five_stars` (_float_) _the share of five-star reviews among orders in which the product was sold_
  - `share_of_one_stars` (_float_) _the share of one-star reviews among orders in which the product was sold_
  - `review_score` (_float_) _the average review score of the orders in which the product was sold_

## 1 - Analysis per product_id

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

❓ Inspect the new `Product().get_training_data()` DataFrame, for instance by plotting a histogram of each variable using `plt.hist()`.

In [None]:
from olist.product import Product
products = Product().get_training_data()

In [None]:
# Your code
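One possible way to draw these histograms is sketched below. It runs on a made-up toy DataFrame (`toy_products` is a hypothetical stand-in, not the real output of `Product().get_training_data()`), and it calls `ax.hist`, the axes-level counterpart of `plt.hist()`, so that each numeric column gets its own subplot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for Product().get_training_data(); all values are invented.
rng = np.random.default_rng(0)
toy_products = pd.DataFrame({
    "price": rng.lognormal(mean=4, sigma=1, size=200),
    "wait_time": rng.gamma(shape=2, scale=6, size=200),
    "review_score": rng.uniform(1, 5, size=200),
    "category": rng.choice(["books", "furniture"], size=200),
})

# Keep only numeric columns: a histogram of the category names is not meaningful here.
numeric_cols = toy_products.select_dtypes(include="number").columns

# One subplot per numeric column.
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(4 * len(numeric_cols), 3))
for ax, col in zip(axes, numeric_cols):
    ax.hist(toy_products[col].dropna(), bins=30)
    ax.set_title(col)
fig.tight_layout()
```

With the real data, replacing `toy_products` by `products` should be enough, although a grid with several rows will probably be needed to fit all the columns.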

❓ Model `review_score` with an OLS regression on the continuous feature(s) of your choice, then draw conclusions about the R-squared and the most important features.
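As a minimal sketch of such a regression, the block below fits `review_score` on a single continuous feature with `statsmodels`. The data is synthetic and the choice of `wait_time` as the regressor is only an assumption made for illustration; the same pattern extends to several features by adding terms to the formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: review_score decreases with wait_time, plus noise (invented numbers).
rng = np.random.default_rng(1)
n = 500
wait_time = rng.gamma(shape=2, scale=6, size=n)
toy_products = pd.DataFrame({
    "wait_time": wait_time,
    "review_score": np.clip(4.8 - 0.05 * wait_time + rng.normal(0, 1, n), 1, 5),
})

# Fit review_score ~ wait_time by ordinary least squares.
model = smf.ols("review_score ~ wait_time", data=toy_products).fit()

print(model.rsquared)  # share of variance explained
print(model.params)    # intercept and slope
print(model.pvalues)   # p-value of each coefficient
```

`model.summary()` prints the same information in a single table, which is usually more convenient in a notebook.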

## 2 - Aggregation per product categories

### 2.1 - Build aggregated dataframe

❓ Create a DataFrame `product_cat` aggregating, for each product category, all the product properties.  
Use sum for `quantity` and the aggregation function of your choice for all other properties. For instance:

  - `quantity` (sum)
  - `wait_time` (median)
  - `review_score` (median)
  - `price` (median)
  - ....
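The aggregation described above could be sketched with pandas named aggregation as follows. The five-row `toy_products` frame is invented for illustration, and the median is just one possible choice for the non-`quantity` columns.

```python
import pandas as pd

# Tiny invented stand-in for the product-level training data.
toy_products = pd.DataFrame({
    "category": ["books", "books", "furniture", "furniture", "furniture"],
    "quantity": [10, 5, 3, 2, 1],
    "wait_time": [5.0, 7.0, 20.0, 12.0, 16.0],
    "review_score": [4.8, 4.4, 3.0, 4.0, 3.5],
    "price": [30.0, 50.0, 400.0, 250.0, 300.0],
})

# One row per category: sum the quantities, take the median of everything else.
product_cat = toy_products.groupby("category").agg(
    quantity=("quantity", "sum"),
    wait_time=("wait_time", "median"),
    review_score=("review_score", "median"),
    price=("price", "median"),
)
print(product_cat)
```

The result is indexed by `category`; calling `.reset_index()` turns it back into a regular column if that is more convenient for plotting.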

### 2.2 - Exploration

❓ What are the best performing product categories?

In [None]:
# Your code
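"Best performing" can be read in several ways. The sketch below assumes it means the highest median `review_score`, and it drops categories with very few sales so that a handful of reviews cannot top the ranking. Both the toy data and the `min_quantity` threshold are hypothetical.

```python
import pandas as pd

# Invented category-level aggregates, indexed by category.
toy_product_cat = pd.DataFrame(
    {
        "quantity": [1500, 12, 900, 2300],
        "review_score": [4.5, 5.0, 3.9, 4.2],
    },
    index=pd.Index(["books", "rare_category", "furniture", "toys"], name="category"),
)

min_quantity = 100  # arbitrary cut-off, chosen only for this example

# Keep categories with enough sales, then rank by review score, best first.
ranked = (
    toy_product_cat[toy_product_cat["quantity"] >= min_quantity]
    .sort_values("review_score", ascending=False)
)
print(ranked)
```

Ranking by `quantity` or by `price * quantity` instead would answer a different question (sales volume or revenue rather than customer satisfaction).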

❓ Let's try to understand _why_ some categories are performing better than the others. 

Using plotly, create different scatterplots, varying `x`, `y`, `color` and `size`, to find clues about the factors impacting `review_score`.

- Do you notice underperforming product categories?
- Can you think of a strategy to improve Olist's profit margin, as per the CEO's request?

<details>
    <summary>Hints</summary>

Try plotting `product_length_cm` against `wait_time`, with color = `review_score`, and bubble size = "sales" for instance
</details>

In [112]:
# YOUR CODE BELOW
import plotly.express as px

### 2.3 - Causal inference

☝️ It seems that large products like furniture, which also happen to take longer to deliver, are performing worse than the others. Are consumers disappointed by the product itself, or by the slow delivery time?

❓ To answer that, run an OLS to model `review_score` so as to isolate the true contribution of each product category to customer satisfaction, holding `wait_time` constant.

- Which dataset should you use for this regression? `product_cat` or the entire `products` training dataset?

- Which regressors / independent variables / features should you use? 

Investigate the results: Which product categories correlate with higher review_score holding wait_time constant?

Feel free to use `return_significative_coef(model)` coded for you in `olist/utils.py`

In [None]:
# Your code
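To illustrate the idea of holding `wait_time` constant, here is a minimal sketch on synthetic product-level data. By construction, furniture ships more slowly but has no direct effect on the score, while books get a direct bonus. The reference category (`toys`), the 0.05 threshold and the hand-rolled p-value filter are assumptions made for this example; the filter is not necessarily what `return_significative_coef` does.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (invented): furniture is slow to deliver, books get a direct bonus.
rng = np.random.default_rng(42)
n = 900
category = rng.choice(["books", "furniture", "toys"], size=n)
wait_time = rng.gamma(2, 4, n) + np.where(category == "furniture", 10.0, 0.0)
review_score = (
    4.5
    - 0.08 * wait_time
    + np.where(category == "books", 0.4, 0.0)
    + rng.normal(0, 0.5, n)
)
toy_products = pd.DataFrame({
    "category": category,
    "wait_time": wait_time,
    "review_score": review_score,
})

# Category dummies (relative to "toys") plus wait_time as a control variable.
formula = "review_score ~ C(category, Treatment(reference='toys')) + wait_time"
model = smf.ols(formula, data=toy_products).fit()

# Hand-rolled filter: keep coefficients whose p-value is below 0.05.
significant = model.params[model.pvalues < 0.05]
print(significant)
```

Each category coefficient reads as the expected difference in `review_score` versus the reference category for products with the same `wait_time`.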

☝️ Furniture is no longer in the list of significant coefficients. The problem may have come from delivery rather than from the product itself! Books, on the other hand, consistently drive higher reviews, even after accounting for their generally quicker delivery time.

🏁 **Congratulations on completing this final challenge! Don't forget to commit and push your analysis.**