<img src="data/pexels-sammsara-luxury-modern-home-372468-1099816.jpg" alt="E-commerce Home Decor Goods" width="600">

E-commerce businesses rely on explainable artificial intelligence (AI) models to anticipate customer needs, improve inventory planning, personalize marketing campaigns, and most importantly, explain the results to stakeholders. Understanding what factors drive purchases in a specific product category, such as home decor, can help businesses tailor their strategies to maximize sales. If a company knows that customers who historically spend more on children's accessories also tend to spend more on home decor items later on, it could target these customers with promotional ads for home decor items and offer bundle discounts to reinforce this pattern.

A major online retailer has enlisted your help for this very task. You already have two forecast models (`model.pkl`, `knn_model.pkl`) and now you need to explain the results to stakeholders so they can make key business decisions about marketing and budgets.

---
## Data

Each row in `X_train` represents a snapshot of a customer's features for a specific month, and `y_train` is the customer's sales for the next month for the `'home_decor'` product category. The data is a modified version of the original data that is publicly available on Kaggle.

[//]: # (https://www.kaggle.com/datasets/gabrielramos87/an-online-shop-business/data)

### X_train/X_test.csv

| Column | Description |
|--------|------------|
| `logsales` | Logarithm of (customer sales+1) (+1 to handle 0 sales) |
| `lag1` | The log of sales from 1 month ago |
| `lag2`| The log of sales from 2 months ago|
| `sma_2m` | Average log sales over the last 2 months (simple moving average)|
| `sma_4m` | Average log sales over the last 4 months (simple moving average)|
| `sma_6m` | Average log sales over the last 6 months (simple moving average)|
| `months_since_first` | Months since first purchase |
|<ul><li>`children_s_accessories`</li><li>`colourful_essentials`</li><li>`home_decor`</li><li>`home_storage`</li><li>`quirky_stationery`</li><li>`soft_furnishings`</li><li>`toys_games`</li>...</ul> | Category-specific logarithm of (customer sales+1) | 
| <ul><li>`sma_2m__birthday_gifts`</li><li>`sma_4m__birthday_gifts`</li><li>`sma_6m__birthday_gifts`</li></ul> | 2-, 4-, and 6-month average log sales per category (simple moving average)|
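
To make the column definitions above concrete, here is a minimal sketch of how lag and moving-average features of this kind could be derived with pandas. The sales figures are hypothetical, and whether the original pipeline includes the current month in each averaging window is an assumption, not something the dataset description confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales for one customer (not taken from the dataset)
monthly_sales = pd.Series([0.0, 20.0, 35.0, 0.0, 50.0, 10.0])

# log(sales + 1) keeps zero-sales months finite: log1p(0) == 0
logsales = np.log1p(monthly_sales)

features = pd.DataFrame(
    {
        "logsales": logsales,
        "lag1": logsales.shift(1),  # log sales from 1 month ago
        "lag2": logsales.shift(2),  # log sales from 2 months ago
        # Assumption: each window ends at (and includes) the current month
        "sma_2m": logsales.rolling(window=2).mean(),
        "sma_4m": logsales.rolling(window=4).mean(),
        "sma_6m": logsales.rolling(window=6).mean(),
    }
)
print(features.round(3))
```

Rows without enough history come out as `NaN` here; how the original data handles those months is not stated.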

### y_train/y_test.csv

- `'nextmonth__home_decor'`: Logarithm of (customer sales+1) for the `home_decor` product category in the next month. This is the prediction target.
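
Because the target is stored as log(sales+1), model predictions are in log units as well. When presenting numbers to stakeholders, it may help to convert them back to currency units. The sketch below uses hypothetical prediction values and assumes the natural logarithm was used, which the data description does not state explicitly.

```python
import numpy as np

# Hypothetical predictions in log(sales + 1) units
log_predictions = np.array([0.0, 1.5, 3.2])

# expm1 is the inverse of log1p: exp(x) - 1
# Assumption: the dataset uses the natural logarithm
sales_predictions = np.expm1(log_predictions)
print(sales_predictions.round(2))
```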

---
## Model

Both forecast models have been trained on the provided `X_train` and `y_train`.

### model.pkl

- Fitted `sklearn.ensemble.RandomForestRegressor` on `X_train`, `y_train`

### knn_model.pkl

- Fitted `sklearn.neighbors.KNeighborsRegressor` on `X_train`, `y_train`

___
### Update to Python 3.10

Due to how frequently the libraries required for this project are updated, you'll need to update your environment to Python 3.10:

1. In the workbook, click on "Environment" in the top toolbar and select "Session details".

2. In the workbook language dropdown, select "Python 3.10".

3. Click "Confirm" and hit "Done" once the session is ready.

In [15]:
# Re-run this cell
# Import required libraries
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
import joblib
import sys
assert (
    sys.version_info.major == 3 and sys.version_info.minor == 10
), "Please ensure that you are on Python 3.10."

# Load a sample of the data and the models
X_train = pd.read_csv("data/X_train.csv").sample(500, random_state=42)
X_test = pd.read_csv("data/X_test.csv").sample(500, random_state=42)
y_train = pd.read_csv("data/y_train.csv")["nextmonth__home_decor"].sample(500, random_state=42)
y_test = pd.read_csv("data/y_test.csv")["nextmonth__home_decor"].sample(500, random_state=42)
model = joblib.load("data/model.pkl")
knn_model = joblib.load("data/knn_model.pkl")

Explore `model`, `knn_model`, `X_test`, `y_test`, and use Explainable AI (XAI) to answer the following questions:

- Identify an XAI method that assigns feature attributions using game theory principles and save the library name as a `str` variable, `xai`. Import this library to complete the remaining tasks.

In [16]:
# Start coding here
# Use as many cells as you need

# Here is a view of how the RandomForestRegressor model was fitted:
# from sklearn.ensemble import RandomForestRegressor
# model = RandomForestRegressor(
#     **{
#         "max_depth": 16,
#         "min_samples_split": 12,
#         "min_samples_leaf": 7,
#         "max_features": "sqrt",
#         "bootstrap": False,
#         "random_state": 42,
#         "n_jobs": -1,
#     }
# )
# model.fit(X_train, y_train)

# Identify a game theory-based XAI method

xai = "shap"

# Import this library
import shap
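
For intuition about the game theory behind this choice, the following sketch computes exact Shapley values by brute force for a tiny toy function. It is an illustration only and is not how the `shap` library computes values internally; `toy_model`, `x`, and `baseline` are hypothetical, and "absent" features are simply replaced by a baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_players):
    """Exact Shapley values: weighted average marginal contribution per player."""
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Probability weight of a coalition of this size preceding player i
            weight = (
                factorial(size)
                * factorial(n_players - size - 1)
                / factorial(n_players)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


# Hypothetical instance and baseline (features "absent" take the baseline value)
x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]


def toy_model(v):
    # Two linear terms plus one interaction between features 0 and 2
    return 3 * v[0] + 2 * v[1] + v[0] * v[2]


def value_fn(coalition):
    v = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return toy_model(v)


phi = shapley_values(value_fn, len(x))
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The attributions sum to the difference between the prediction and the baseline prediction, and the interaction term is split evenly between the two features involved.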

- Compute feature importance based on the `model`'s predictions on `X_test`. Extract the top five features and store them as a DataFrame in `top_feats`.

In [17]:
# Compute feature importance based on the model's predictions on X_test. Extract the top five features and store them as a DataFrame in top_feats

# Use SHAP's TreeExplainer since RandomForestRegressor is a tree-based model
explainer = shap.TreeExplainer(model)

# Calculate SHAP values
shap_values = explainer.shap_values(X_test)

# Get feature importances
feature_importance = np.abs(shap_values).mean(axis=0)

# Create a DataFrame of the feature importance
feature_importance_df = pd.DataFrame(
    {"Feature": X_test.columns, "Importance": feature_importance}
).sort_values(by="Importance", ascending=False)

# Top five most impactful features based on SHAP
top_feats = feature_importance_df.head(5)
print(top_feats)

                         Feature  Importance
2                           lag2    0.117177
4                         sma_4m    0.098794
1                           lag1    0.095220
3                         sma_2m    0.059562
44  sma_2m__colourful_essentials    0.058533
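
The ranking above averages absolute SHAP values rather than signed ones. A small sketch with a hypothetical SHAP matrix shows why: a feature that pushes predictions up for some customers and down for others would look unimportant if the signs were allowed to cancel.

```python
import numpy as np

# Hypothetical SHAP matrix: rows are samples, columns are two features
shap_matrix = np.array(
    [
        [0.5, 0.1],
        [-0.5, 0.1],
        [0.5, 0.1],
        [-0.5, 0.1],
    ]
)

# Signed mean: the first feature's effects cancel out
signed_mean = shap_matrix.mean(axis=0)

# Mean absolute value: the first feature is clearly the more influential one
mean_abs = np.abs(shap_matrix).mean(axis=0)

print(signed_mean, mean_abs)
```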


- Compute and extract the top five features for the `knn_model` (use `.sample(50, random_state=42)` for faster running) and evaluate the consistency across both models (`model`, `knn_model`). Save the consistency as a variable `consistency`, rounded to two decimal places and of type `float`.

In [18]:
# Evaluate the consistency of feature importance explanations across the two models provided

# Here is a view of how the k-NN model was fitted:
# knn_model = KNeighborsRegressor(
#     n_neighbors=80,
#     weights="uniform",
#     algorithm="auto",
#     leaf_size=30,
#     p=2,
#     metric="minkowski",
#     metric_params=None,
#     n_jobs=-1,
# )

# Create a SHAP Kernel Explainer
knn_explainer = shap.KernelExplainer(knn_model.predict, shap.kmeans(X_test, 5))

# Calculate SHAP values
knn_shap_values = knn_explainer.shap_values(X_test.sample(50, random_state=42))

# Get feature importance
knn_feature_importance = np.abs(knn_shap_values).mean(axis=0)

# Create a DataFrame of the feature importance
knn_feature_importance_df = pd.DataFrame(
    {"Feature": X_test.columns, "Importance": knn_feature_importance}
).sort_values(by="Importance", ascending=False)

# Top five most impactful features based on SHAP
knn_top_feats = knn_feature_importance_df.head(5)

# Calculate cosine similarity consistency across both models
# (cast to a built-in float so the result matches the requested type)
consistency = round(float(cosine_similarity([feature_importance], [knn_feature_importance])[0][0]), 2)
print("Consistency between SHAP values:", consistency)

Consistency between SHAP values: 0.89
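
Cosine similarity compares the direction of the two importance vectors and ignores their overall scale, which is convenient when two explainers produce values of different magnitude. As a complementary check, one could also measure how many of the top-ranked features the two models share. The sketch below uses hypothetical importance values, and `top_k_overlap` is an illustrative helper, not part of `shap` or scikit-learn.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two importance vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k_overlap(names, a, b, k):
    """Fraction of the top-k features (by importance) shared by both vectors."""
    top_a = {names[i] for i in np.argsort(a)[::-1][:k]}
    top_b = {names[i] for i in np.argsort(b)[::-1][:k]}
    return len(top_a & top_b) / k


# Hypothetical importances for two models over the same features
names = ["lag1", "lag2", "sma_2m", "sma_4m", "sma_6m", "months_since_first"]
imp_a = np.array([0.10, 0.12, 0.06, 0.09, 0.02, 0.01])
imp_b = np.array([1.00, 0.90, 0.10, 0.20, 0.80, 0.05])

print(cosine_sim(imp_a, 10 * imp_a))  # rescaling does not change the similarity
print(cosine_sim(imp_a, imp_b))
print(top_k_overlap(names, imp_a, imp_b, k=3))
```

A high cosine similarity with a low top-k overlap would suggest the models agree on the broad pattern but disagree on which individual features matter most.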


- The marketing team wants to know if your model interpretations are stable and reliable. Save your response ("yes", "no") to `reliable`.

In [19]:
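# The marketing team wants to know if your model interpretations are stable and reliable. What is your response?
# A cosine similarity of 0.89 between the two models' feature importances indicates consistent explanations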
reliable = "yes"
print(reliable)

yes
