# Products

Our goal is to find product **categories** that repeatedly underperform compared to others, and to understand why.

## 0 - `product.py` 

We have given you the solution to `product.py` in your challenge folder.
👉 Copy-paste it into your local `olist/product.py` file.

It provides aggregates at the `product_id` level for the orders placed with Olist.

The `get_training_data` method in `olist/product.py` returns the following DataFrame:

  - `product_id` (_str_) _the id of the product_ **UNIQUE**
  - `category` (_str_) _the category name (in English)_
  - `product_name_length` (_float_) _character length of product name_
  - `product_description_length` (_float_) _character length of product description_
  - `product_photos_qty` (_int_) _the number of photos for the product_
  - `product_weight_g` (_float_) _weight of the product (in g)_
  - `product_length_cm` (_float_) _length of the product (in cm)_
  - `product_height_cm` (_float_) _height of the product (in cm)_
  - `product_width_cm` (_float_) _width of the product (in cm)_
  - `price` (_float_) _average price at which the product is sold_
  - `wait_time` (_float_) _the average wait time in days for orders in which the product was sold._
  - `share_of_five_stars` (_float_) _the share of five star orders for orders in which the product was sold_
  - `share_of_one_stars` (_float_) _the share of one star orders for orders in which the product was sold_
  - `review_score` (_float_) _the average review score of the order in which each product is sold_
  - `n_orders` (_int_) _the number of orders in which the product appeared_
  - `quantity` (_int_) _the total number of products sold_
  - `sales` (_float_) _the total value of sales in $BRL for the product_

## 1 - Analysis per product_id

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

❓ Inspect the new `Product().get_training_data()` dataframe, for instance by plotting histograms of each variable using `plt.hist()`. 

In [None]:
from olist.product import Product
products = Product().get_training_data()

In [None]:
# Your code
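One possible way to approach the inspection is sketched below. It runs on a small synthetic frame called `demo` (a hypothetical stand-in filled with random values, not Olist data), so swap in `products` once it is loaded. Only numeric columns are plotted.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for `products`: random values, NOT real Olist data
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "price": rng.lognormal(mean=4.0, sigma=1.0, size=500),
    "wait_time": rng.gamma(shape=2.0, scale=6.0, size=500),
    "review_score": rng.uniform(1, 5, size=500),
    "category": rng.choice(["books", "furniture"], size=500),
})

# Keep numeric columns only; string columns such as `category` are skipped
numeric_cols = list(demo.select_dtypes(include="number").columns)

# One histogram per numeric column, side by side
fig, axes = plt.subplots(1, len(numeric_cols),
                         figsize=(4 * len(numeric_cols), 3), squeeze=False)
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.hist(demo[col].dropna(), bins=30)
    ax.set_title(col)
fig.tight_layout()
```

With the real `products` DataFrame there are many more numeric columns, so a grid with several rows may read better than a single row.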

❓ Model `review_score` with an OLS regression on the continuous feature of your choice, then inspect the R-squared and the important features.
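As a rough illustration of the mechanics, the sketch below fits a one-feature OLS with `statsmodels.formula.api`. The `demo` frame is synthetic, and the relationship between `wait_time` and `review_score` in it is an assumption made up for the example, so the numbers say nothing about the real Olist data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): reviews drop slightly as wait time grows
rng = np.random.default_rng(0)
n = 1000
wait_time = rng.gamma(shape=2.0, scale=6.0, size=n)
review_score = 4.8 - 0.05 * wait_time + rng.normal(0, 0.8, size=n)
demo = pd.DataFrame({"wait_time": wait_time, "review_score": review_score})

# Fit review_score as a linear function of one continuous feature
model = smf.ols("review_score ~ wait_time", data=demo).fit()

print(model.rsquared)  # share of variance explained by the model
print(model.params)    # intercept and slope
print(model.pvalues)   # p-value of each coefficient
```

`model.summary()` prints the same information in a single table. On the real data, replace `demo` with `products` and pick the feature you want to test.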

## 2 - Aggregation per product categories

### 2.1 - Build aggregated dataframe

❓ Create a function `get_product_cat` which accepts an aggregating method as an argument and returns a DataFrame with one row per product `category`, where `quantity` is summed and all other non-`str` columns are aggregated with the passed method.  
For instance `get_product_cat('median')` returns:

  - `quantity` (sum)
  - `wait_time` (median)
  - `review_score` (median)
  - `price` (median)
  - ....
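The aggregation described above could be sketched as follows. The name `get_product_cat_sketch` and the toy `demo` frame are hypothetical and only illustrate the idea of a per-column aggregation dictionary. This is one possible approach, not the reference solution.

```python
import pandas as pd

def get_product_cat_sketch(products, agg="mean"):
    """Aggregate numeric columns per category; `quantity` is always summed."""
    # Numeric columns only: string columns (ids, category) are left out
    numeric_cols = products.select_dtypes(include="number").columns
    agg_params = {col: agg for col in numeric_cols}
    # Override the generic method for `quantity`
    agg_params["quantity"] = "sum"
    return products.groupby("category").agg(agg_params)

# Toy data (made up for illustration, NOT Olist data)
demo = pd.DataFrame({
    "product_id": ["a", "b", "c", "d"],
    "category": ["books", "books", "furniture", "furniture"],
    "price": [10.0, 30.0, 100.0, 300.0],
    "review_score": [5.0, 4.0, 3.0, 4.0],
    "quantity": [2, 3, 1, 1],
})

demo_cat = get_product_cat_sketch(demo, "median")
print(demo_cat)
```

The result is indexed by `category`. Call `.reset_index()` on it if you prefer `category` as a regular column, for example for plotting.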

### Test your code

In [None]:
from nbresult import ChallengeResult

product_cat = get_product_cat('mean')
result = ChallengeResult('products',
shape=product_cat.shape,
avg_review_score=int(product_cat['review_score'].mean()),
avg_price=int(product_cat['price'].mean()),
avg_quantity=int(product_cat['quantity'].mean())
)
result.write()
print(result.check())

### 2.2 - Exploration

❓ What are the best performing product categories?

In [None]:
# Your code

❓ Let's try to understand _why_ some categories are performing better than others. 

Using plotly, create different scatterplots, varying `x`, `y`, `color` and `size`, to find clues about factors impacting the `review_score`.

- Do you notice underperforming product categories?
- Can you think of a strategy to improve Olist's profit margin, as per the CEO's request?

<details>
    <summary>Hints</summary>

Try plotting `product_length_cm` against `wait_time`, with color as `review_score` and bubble size as `sales`, for instance.
</details>

In [112]:
# YOUR CODE BELOW
import plotly.express as px

### 2.3 - Causal inference

☝️ It seems that large products like furniture, which happen to take longer to deliver, are performing worse than other products. Are consumers disappointed by the product itself, or by the slow delivery time?

❓ To answer that, run an OLS to model `review_score` so as to isolate the true contribution of each product category to customer satisfaction, holding `wait_time` constant.

- Which dataset should you use for this regression? `product_cat` or the entire `products` training dataset?

- Which regressors / independent variables / features should you use? 

Investigate the results: which product categories correlate with a higher `review_score`, holding `wait_time` constant?

Feel free to use `return_significative_coef(model)` coded for you in `olist/utils.py`
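To illustrate what "holding `wait_time` constant" means in a regression, here is a minimal sketch on synthetic data. The effects baked into `demo` are assumptions chosen for the example (furniture ships slower, books get a direct bonus), and the manual p-value filter is only a stand-in, not the implementation of `return_significative_coef`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumptions): furniture has a longer wait but no direct
# effect on reviews; books have a direct positive effect on reviews
rng = np.random.default_rng(0)
n = 3000
category = rng.choice(["books", "furniture", "toys"], size=n)
wait_time = rng.gamma(2.0, 5.0, size=n) + np.where(category == "furniture", 6.0, 0.0)
review_score = (4.5 - 0.05 * wait_time
                + np.where(category == "books", 0.4, 0.0)
                + rng.normal(0, 0.7, size=n))
demo = pd.DataFrame({"category": category, "wait_time": wait_time,
                     "review_score": review_score})

# Category enters as dummy variables ("toys" is the baseline);
# wait_time is included as a control
formula = "review_score ~ C(category, Treatment(reference='toys')) + wait_time"
model = smf.ols(formula, data=demo).fit()

# Keep only coefficients with a p-value below 5%
significant = model.params[model.pvalues < 0.05]
print(significant.sort_values())
```

Each category coefficient is read relative to the baseline category, at equal `wait_time`. Dropping `wait_time` from the formula and comparing the two fits shows how much of a category's apparent effect comes from delivery time.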

In [None]:
# Your code

☝️ Furniture is no longer in the list of significant coefficients. The problem may have come from delivery rather than the product itself! In contrast, books regularly drive higher reviews, even after accounting for their generally quicker delivery time.

🏁 **Congratulations on completing this final challenge! Don't forget to commit and push your analysis.**