<a href="https://colab.research.google.com/github/gonzaloavellanal/eccd_assignments/blob/main/assignments/Pricing%20Recommendations-student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q eccd_datasets category_encoders shap pygradus


In [2]:
STUDENT_NAME = "Gonzalo Avellanal"
COURSE_NAME = "eccd-oct23"
EXERCISE_NAME = "price-recommendation"

# Objective

Explore what a pricing automation / recommendation project looks like.

In a pricing recommendation problem, the most accurate prediction is often not necessarily the most important goal.
Sometimes a range of plausible values, or an explanation of how a given variable affects the outcome, is more useful to an end user.

We will apply some very basic data cleaning to a popular dataset before proceeding.

In [3]:
import pandas as pd
import numpy as np

from eccd_datasets import load_mercari
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import shap

from pygradus import create_exercise, check_solution

In [4]:
df = load_mercari()

In [5]:
df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
1071845,1071845,PlayStation 1 games,3,Electronics/Video Games & Consoles/Games,Sony,25.0,1,No description yet
1341567,1341567,Cropped Sweater,2,Women/Tops & Blouses/Knit Top,,9.0,0,-long sleeve -cropped -tight
584822,584822,Color Speaker,3,"Electronics/TV, Audio & Surveillance/Home Spea...",,16.0,0,"Fantasy color speaker. Has aux, Bluetooth, can..."
197407,197407,⚡️vs pink nation dog⚡️,1,Vintage & Collectibles/Collectibles/Doll,Victoria's Secret,24.0,0,New
904192,904192,Lululemon Mens Shirt,2,Men/Athletic Apparel/Shirts & Tops,Lululemon,35.0,1,"Lululemon Men's Metal vent short sleeve shirt,..."


# Data Cleaning

In this exercise we are going to ignore both the `name` and `item_description` columns.

We are going to split the `category_name` feature into three levels (`cat_1`, `cat_2`, `cat_3`).

Then, we are going to use a categorical encoder to encode all string attributes as numbers.

In [6]:
def split_cat(text):
    # Missing categories are NaN (a float), which has no .split method
    try:
        return text.split("/")
    except AttributeError:
        return ("No Label", "No Label", "No Label")

In [7]:
df["cat_1"], df["cat_2"], df["cat_3"] = zip(*df["category_name"].apply(split_cat))
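
The split above can also be written with pandas string methods. The following is a minimal sketch on a toy frame, not the notebook's own approach: the `toy` data is illustrative, and `n=2` is an assumption that keeps any deeper category levels inside `cat_3` instead of dropping them.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative)
toy = pd.DataFrame({
    "category_name": [
        "Women/Tops & Blouses/Knit Top",
        None,  # missing category
        "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
    ]
})

parts = (
    toy["category_name"]
    .fillna("No Label/No Label/No Label")  # placeholder for missing values
    .str.split("/", n=2, expand=True)      # at most two splits -> three columns
)
parts.columns = ["cat_1", "cat_2", "cat_3"]
toy = toy.join(parts)
print(toy[["cat_1", "cat_2", "cat_3"]])
```

With `n=2`, a category with more than three levels keeps the remaining levels joined by `/` in `cat_3`.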

In [8]:
df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,cat_1,cat_2,cat_3
1071845,1071845,PlayStation 1 games,3,Electronics/Video Games & Consoles/Games,Sony,25.0,1,No description yet,Electronics,Video Games & Consoles,Games
1341567,1341567,Cropped Sweater,2,Women/Tops & Blouses/Knit Top,,9.0,0,-long sleeve -cropped -tight,Women,Tops & Blouses,Knit Top
584822,584822,Color Speaker,3,"Electronics/TV, Audio & Surveillance/Home Spea...",,16.0,0,"Fantasy color speaker. Has aux, Bluetooth, can...",Electronics,"TV, Audio & Surveillance",Home Speakers & Subwoofers
197407,197407,⚡️vs pink nation dog⚡️,1,Vintage & Collectibles/Collectibles/Doll,Victoria's Secret,24.0,0,New,Vintage & Collectibles,Collectibles,Doll
904192,904192,Lululemon Mens Shirt,2,Men/Athletic Apparel/Shirts & Tops,Lululemon,35.0,1,"Lululemon Men's Metal vent short sleeve shirt,...",Men,Athletic Apparel,Shirts & Tops


In [9]:
df = df[[
    "item_condition_id", "brand_name", "shipping", "cat_1", "cat_2", "cat_3", "price"
]]

In [10]:
df.head()

Unnamed: 0,item_condition_id,brand_name,shipping,cat_1,cat_2,cat_3,price
1071845,3,Sony,1,Electronics,Video Games & Consoles,Games,25.0
1341567,2,,0,Women,Tops & Blouses,Knit Top,9.0
584822,3,,0,Electronics,"TV, Audio & Surveillance",Home Speakers & Subwoofers,16.0
197407,1,Victoria's Secret,0,Vintage & Collectibles,Collectibles,Doll,24.0
904192,2,Lululemon,1,Men,Athletic Apparel,Shirts & Tops,35.0


# Data preparation

As always, we split our dataset into training and test sets.

We fix `random_state` to make sure that our experiment is reproducible!

In [11]:
y = df.pop("price")
X = df.copy()

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, shuffle=True)

In [13]:
X_train.head()

Unnamed: 0,item_condition_id,brand_name,shipping,cat_1,cat_2,cat_3
1181562,3,LuLaRoe,1,Women,Tops & Blouses,Tunic
1060152,1,,0,Women,Tops & Blouses,T-Shirts
339643,1,James Avery,1,Women,Jewelry,Bracelets
159854,3,,0,Women,Tops & Blouses,Button Down Shirt
1452651,1,PINK,1,Women,Underwear,Panties


In [27]:
y_train.head()

1181562    12.0
1060152    22.0
339643     30.0
159854     10.0
1452651    10.0
Name: price, dtype: float64

## Target Encoder

We will now build a target encoder for the columns that still contain strings.
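
As a rough intuition, target encoding replaces each category with the mean of the target observed for that category in the training data. The sketch below shows only that core idea with plain pandas on toy data. It is not the `category_encoders` implementation, which also smooths each category mean towards the global mean; the toy values and the `encode` helper are hypothetical.

```python
import pandas as pd

# Toy training data (illustrative values only)
toy_X = pd.DataFrame({"cat_1": ["Women", "Women", "Men", "Men", "Kids"]})
toy_y = pd.Series([10.0, 20.0, 30.0, 50.0, 8.0], name="price")

# Mean price per category, learned on the training data only
category_means = toy_y.groupby(toy_X["cat_1"]).mean()
global_mean = toy_y.mean()

def encode(values: pd.Series) -> pd.Series:
    # Unseen categories fall back to the global mean (an assumption of this sketch)
    return values.map(category_means).fillna(global_mean)

encoded = encode(pd.Series(["Women", "Men", "Unknown"]))
print(encoded)
```

Learning the means on the training set only, and then reusing them on the test set, is what keeps the test target from leaking into the features.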

In [28]:
def build_target_encoder(X: pd.DataFrame, y: pd.Series) -> TargetEncoder:
    """
    Train a target encoder on columns "brand_name", "cat_1", "cat_2", "cat_3"
    using the train dataset and return the "fitted" encoder.

    The encoder is fitted on these four columns only, so `transform`
    expects a frame with exactly these columns.
    """
    # Columns that are categorical and must be encoded
    columns_to_encode = ['brand_name', 'cat_1', 'cat_2', 'cat_3']

    # Initialize the target encoder for the categorical columns
    encoder = TargetEncoder(cols=columns_to_encode)

    # Fit the target encoder on the training data X and the target variable y
    encoder.fit(X[columns_to_encode], y)

    # Return the fitted encoder
    return encoder

In [29]:
te = build_target_encoder(X_train, y_train)
print(te)

TargetEncoder(cols=['brand_name', 'cat_1', 'cat_2', 'cat_3'])


In [30]:
row1 = X_train.iloc[:1]
print(row1)

         item_condition_id brand_name  shipping  cat_1           cat_2  cat_3
1181562                  3    LuLaRoe         1  Women  Tops & Blouses  Tunic


In [43]:
row1_t = te.transform(row1[['brand_name', 'cat_1', 'cat_2', 'cat_3']])

print(row1_t)

         brand_name      cat_1      cat_2      cat_3
1181562   33.832282  28.937263  18.024321  27.472431


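In [15]:
te = build_target_encoder(X_train, y_train)

row1 = X_train.iloc[:1]
# The encoder was fitted on the four categorical columns only, so pass exactly those
row1_t = te.transform(row1[["brand_name", "cat_1", "cat_2", "cat_3"]])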

In [34]:
assert np.allclose(row1_t["cat_1"], y_train.loc[X_train["cat_1"] == row1["cat_1"].iloc[0]].mean())

In [35]:
row2 = X_test.iloc[:1]
row2_t = te.transform(row2[["brand_name", "cat_1", "cat_2", "cat_3"]])
answer_target_encoder = row2_t["cat_2"].values[0]
print("cat_2 target encoder", answer_target_encoder)

cat_2 target encoder 15.01733102253033


# Training

For training we are going to use a very popular machine learning model named `LightGBM`, from Microsoft.

One of the advantages of this model is that it includes a `quantile loss` that we can use to obtain prediction intervals.

We will train the model three times, once for each quantile: `10%, 50% (the median) and 90%`.
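
To make the quantile loss more concrete, here is a minimal NumPy sketch of the pinball loss, the standard loss behind quantile regression. The `pinball_loss` helper is hypothetical and written only for illustration; it is not LightGBM's internal code.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Average pinball (quantile) loss for the quantile level `alpha`."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs alpha * diff,
    # over-prediction (diff < 0) costs (1 - alpha) * |diff|
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)))

# For the 90% quantile, predicting too low is penalized more than predicting too high
too_low = pinball_loss([10.0], [8.0], alpha=0.9)
too_high = pinball_loss([10.0], [12.0], alpha=0.9)
print(too_low, too_high)
```

This asymmetry is what pushes the fitted model towards the requested quantile rather than the mean.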


In [38]:
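# The encoder was fitted on these four columns only, so transform expects exactly them
encoded_cols = ["brand_name", "cat_1", "cat_2", "cat_3"]

# Encode once, outside the loop: the encoding does not depend on the quantile
X_train_t = te.transform(X_train[encoded_cols])
X_test_t = te.transform(X_test[encoded_cols])

params = {
    'objective': 'quantile',
    'metric': 'quantile',
    'max_depth': 4,
    'num_leaves': 15,
    'learning_rate': 0.1,
    'n_estimators': 100,
    'boosting_type': 'gbdt',
    'seed': 42,
    'num_threads': 1
}

quantiles = [.1, .5, .9]

preds = []

for q in quantiles:
    # `alpha` selects the quantile targeted by the quantile objective
    reg = lgb.LGBMRegressor(alpha=q, **params)
    model = reg.fit(X_train_t, y_train)
    preds.append(model.predict(X_test_t))

# After the loop, `model` is the last fitted model (the 90% quantile)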

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003064 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 572
[LightGBM] [Info] Number of data points in the train set: 75000, number of used features: 4
[LightGBM] [Info] Start training from score 7.000000
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001287 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 572
[LightGBM] [Info] Number of data points in the train set: 75000, number of used features: 4
[LightGBM] [Info] Start training from score 17.000000
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003119 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 572
[LightGBM] [Info] Number of data points in the train set: 75

Here we collect the three predictions, one from each model, and use them to build the corresponding intervals.

In [39]:
df_preds = pd.DataFrame(preds).T
df_preds["y_test"] = y_test.values
df_preds.columns = ["q10", "q50", "q90", "y_test"]

In [40]:
df_preds.head()

Unnamed: 0,q10,q50,q90,y_test
0,5.103917,9.948704,27.968533,5.0
1,7.646818,12.895601,20.345078,10.0
2,21.072682,57.665507,130.325124,120.0
3,8.076641,14.131647,34.417692,9.0
4,7.188538,16.581596,40.579466,59.0


In [None]:
def get_result_within_interval(df_preds: pd.DataFrame) -> int:
    r"""
    Count the rows for which the true value satisfies $y \in [q10, q90]$.

    For example, if in a row the real value of the target variable is 10,
    q10 is 5 and q90 is 15, that row counts.
    If in a different row, the target variable is 20, q10 is 5 and q90 is 15,
    that row does not count.
    """
    # Both interval bounds are inclusive
    within = (df_preds["y_test"] >= df_preds["q10"]) & (df_preds["y_test"] <= df_preds["q90"])
    return int(within.sum())


In [None]:
answer_interval =  get_result_within_interval(df_preds)
print("Results within interval", answer_interval)
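
A raw count is easier to interpret as a fraction of the test set: if the 10% and 90% quantile models were well calibrated, roughly 80% of the true prices would be expected to fall inside the interval. The sketch below computes that coverage on a toy frame; the `toy_preds` values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy predictions (illustrative values only)
toy_preds = pd.DataFrame({
    "q10": [5.0, 5.0, 7.0],
    "q90": [15.0, 15.0, 40.0],
    "y_test": [10.0, 20.0, 59.0],
})

# True where q10 <= y_test <= q90 (both bounds inclusive)
inside = toy_preds["y_test"].between(toy_preds["q10"], toy_preds["q90"])
coverage = inside.mean()
print(f"{inside.sum()} of {len(toy_preds)} rows inside the interval ({coverage:.0%})")
```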

# Shapley value

Finally, we can use the Shapley value to obtain an explanation of how each feature contributes to the final prediction.

In a price recommendation problem this information is very helpful to end users, possibly even more than the actual price.
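
To illustrate what a Shapley value measures, the sketch below computes exact Shapley values by brute force for a tiny hypothetical pricing function. This is not how `shap.TreeExplainer` works internally (it uses an algorithm specialized for tree models); the `toy_price` function and its numbers are made up for illustration.

```python
from itertools import permutations

def shapley_values(value_fn, features):
    """Average marginal contribution of each feature over all orderings."""
    contrib = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            contrib[f] += value_fn(frozenset(present)) - before
    return {f: c / len(orders) for f, c in contrib.items()}

def toy_price(present):
    # Hypothetical model: baseline price plus effects of the "known" features
    price = 20.0
    if "brand" in present:
        price += 10.0
    if "shipping" in present:
        price -= 3.0
    if "brand" in present and "condition" in present:
        price += 4.0  # interaction, shared between the two features
    return price

features = ["brand", "shipping", "condition"]
values = shapley_values(toy_price, features)
print(values)
# Additivity: baseline + sum of contributions equals the full prediction
print(toy_price(frozenset()) + sum(values.values()), toy_price(frozenset(features)))
```

The contributions always add up to the difference between the prediction and the baseline, which is what makes them convenient to present to an end user.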

In [None]:
shap.initjs()

In [None]:
# `model` is the last model fitted in the training loop (the 90% quantile)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test_t)

In [None]:
shap.summary_plot(shap_values, X_test_t)

We can also use the Shapley value to explain the prediction for a single element.

In [None]:
shap.force_plot(explainer.expected_value, shap_values[0,:], X_test_t.iloc[0,:])

In [41]:

proposed_solution = {
    "attempt": {
        "course_name": COURSE_NAME,
        "exercise_name": EXERCISE_NAME,
        "username": STUDENT_NAME,
    },
    "task_attempts": [
        {
            "name": "target-encoder",
            "answer": answer_target_encoder,
        },
        {
            "name": "results-within-interval",
            "answer": answer_interval,
        },
    ],
}
check_solution(proposed_solution)


NameError: ignored