In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import sys
import os

# Resolve the src directory as an absolute path (works on both Linux and Windows).
# Workaround for relative imports in Jupyter notebooks: adding src to sys.path
# lets us import project modules without spelling out the full path.
src_path = os.path.abspath("../src")
if src_path not in sys.path:
    sys.path.insert(0, src_path)

In [None]:
# Rerun this cell after making changes to the utils module
from the_team.utils import etl, viz
import importlib
importlib.reload(etl)
importlib.reload(viz)

# Set custom plot style for consistency
viz.set_plot_style()

# Base Models Comparison

I tried out 4 baseline models that are commonly used for classification tasks.
- Logistic Regression: A simple yet highly interpretable baseline that works well for binary classification
- Random Forest: Go-to model for complex classification tasks; it captures non-linear patterns and ranks feature importance, making it useful for understanding buyer behavior.
- XGBoost: Strong performance on tabular data, especially with imbalanced classes
- LightGBM: Fast and scalable, making it efficient for large datasets with mixed features and iterative tuning.

In [None]:
rf_path = Path("../data/08_reporting/random_forest_model_metrics.json")
lr_path = Path("../data/08_reporting/logistic_model_metrics.json")
xg_path = Path("../data/08_reporting/xgboost_model_metrics.json")
lgbm_path = Path("../data/08_reporting/lightgbm_model_metrics.json")

In [None]:
rf = etl.load_model_metrics(rf_path)
lr = etl.load_model_metrics(lr_path)
xg = etl.load_model_metrics(xg_path)
lgbm = etl.load_model_metrics(lgbm_path)
models = {"Random Forest": rf, "Logistic Regression": lr, "XGBoost": xg, "LightGBM": lgbm}

In [None]:
# Compare raw accuracies across the four baseline models
raw_accuracies = pd.DataFrame(
    {name: metrics["classification_report"]["accuracy"] for name, metrics in models.items()},
    index=["Accuracy"],
)
raw_accuracies.plot(kind="bar", figsize=(5, 3), title="Raw Model Accuracies", ylabel="Accuracy", xlabel="Models")
plt.xticks(rotation=0)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()

- The accuracies can still be improved but are acceptable.
- However, the training data has a huge class imbalance (6% True, 94% False).
- Even if a model predicted False for every instance, it would still achieve 94% accuracy.
- Thus, overall accuracy (as well as ROC-AUC, which can look optimistic when negatives dominate) is not a suitable metric in our case.
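
The point about accuracy can be checked with a few lines of arithmetic. The sketch below uses synthetic labels that only mirror the 6% / 94% split (it is not the project data) and a hypothetical "always False" classifier:

```python
# Synthetic labels mirroring the 6% / 94% split (not the project data)
n_total = 1000
n_positive = 60  # 6% repeat buyers
y_true = [1] * n_positive + [0] * (n_total - n_positive)

# A hypothetical "model" that predicts False (0) for every customer
y_pred = [0] * n_total

# Accuracy: share of predictions that match the label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall on the positive class: share of repeat buyers that were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_positive

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall on repeat buyers: {recall:.2f}")
```

Accuracy comes out at 0.94 while recall on repeat buyers is 0.00, which is why the per-class precision and recall are examined next.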

In [None]:
for model_name, model_metrics in models.items():
    viz.plot_classification_report(model_metrics["classification_report"], model=model_name)

- As seen above, the models are good at identifying False (non-repeat buyer) instances but miss many True (repeat buyer) instances.
- Although class weights and scaling were already used when training these base models, there is still much room for improvement, possibly through hyperparameter tuning and probability thresholding.

# Logistic Finetuning

- Logistic regression, which has slightly better metrics and is highly explainable, was chosen to demonstrate the feasibility of model improvement.

In [None]:
lr_fine_tuned = etl.load_model_metrics(Path("../data/08_reporting/logistic_model_tuning_metrics.json"))

In [None]:
print(f"Base threshold has been changed to {lr_fine_tuned['best_threshold']:.2f} after finetuning.")
# The model is now more sensitive to positive class predictions

In [None]:
lr_vs_fine_tuned = {"Logistic Regression": lr, "Finetuned Logistic Regression": lr_fine_tuned}

viz.plot_before_after_metrics(lr_vs_fine_tuned, "Fine-tuning")

- Precision, in our case, indicates how many of our model's predicted repeat buyers are actual repeat buyers. This is useful in targeted marketing campaigns, where high precision means that the customers we target are truly likely to buy again, reducing wasted marketing effort.
- Thus, precision was used as the target metric during hyperparameter tuning.
- Precision increased by 7%, but recall was sacrificed in the process, although the overall F1 score still increased.
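
To illustrate the precision/recall trade-off behind probability thresholding, here is a minimal sketch on synthetic scores. The helper `precision_recall_at` and the toy data are hypothetical; the actual tuning procedure that produced `best_threshold` is not shown in this notebook and may differ.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting True for scores >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p and t == 0 for p, t in zip(preds, y_true))
    fn = sum((not p) and t == 1 for p, t in zip(preds, y_true))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Synthetic labels and predicted probabilities (illustration only)
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 lifts precision while recall drops, the same direction of trade-off described above.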

In [None]:
# Compare top-10 precision before and after finetuning
print(f"Before finetuning, the top-10 precision score was: {lr['top_10_precision']:.2f}")
print(f"After finetuning, the top-10 precision score is: {lr_fine_tuned['top_10_precision']:.2f}")


- There is not much difference, but the score is surprisingly high for such an imbalanced dataset.
- A top-10 precision of 20% means that, among the 10 customers our model ranks as most likely to be repeat buyers, 2 are actual repeat buyers. (There might be other POTENTIAL repeat buyers among those 10 customers as well.)
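
For reference, top-k precision can be computed as sketched below. The function `top_k_precision` is a hypothetical helper written for this illustration, and the scores and labels are synthetic; how `top_10_precision` is computed in the project pipeline is not shown here.

```python
def top_k_precision(y_true, scores, k=10):
    """Share of actual positives among the k highest-scored customers."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return sum(top_labels) / len(top_labels)


# 20 synthetic customers with strictly decreasing scores
scores = [1.0 - i * 0.04 for i in range(20)]

# Actual repeat buyers sit at rank positions 2, 7 and 15 (0-based)
y_true = [1 if i in (2, 7, 15) else 0 for i in range(20)]

result = top_k_precision(y_true, scores, k=10)
print(f"Top-10 precision: {result:.2f}")
```

Two of the three positives fall inside the top 10 here, so the toy result is 0.20, matching the interpretation given above.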

# Semi-supervised Learning

- Our main business goal was to identify POTENTIAL repeat buyers, and all of our features were engineered towards it.
- However, our is_repeat_buyer flag is defined as customers who have more than one unique purchase within the whole provided dataset, meaning they are existing repeat buyers.
- Thus, when our model predicts a customer to be a repeat buyer, that person might not yet have become a repeat buyer at that point in time but still had the potential. Since the flag was False, the model was penalized for that prediction, which accounts for the low precision.
- Therefore, we are trying out semi-supervised learning with pseudo labels (weak labels) for instances where we logically expect the customer to buy again but cannot say so for sure.

In [None]:
# mask = (
#         (df["review_score"] > 3)
#         | (df["deli_duration_exp"] <= -7)
#         | (df["voucher"] >= 0.3)
#         | (df["total_spent"] >= df["total_spent"].quantile(0.8))
#         | (df["product_category_name"].isin(top_categories))
# )

These conditions are based on domain knowledge, for example:
- if the customer is satisfied (review > 3), the person might buy again, or
- if the customer paid at least 30% of the total spent with vouchers, the person is familiar with the Olist platform (coupons, loyalty points) and might buy again.
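
A minimal sketch of how such a mask could turn into pseudo labels, on a synthetic four-row frame. The column names follow the commented-out mask above; the toy values, the `top_categories` list, the reading of `deli_duration_exp`, and the rule of combining the mask with the real flag via OR are all assumptions made for illustration, not the project's actual labelling code.

```python
import pandas as pd

# Synthetic customers (illustration only)
df = pd.DataFrame({
    "review_score": [5, 2, 3, 1],
    # Assumed meaning: actual minus expected delivery days (negative = early)
    "deli_duration_exp": [-10, 2, -1, 0],
    "voucher": [0.0, 0.5, 0.0, 0.1],
    "total_spent": [50.0, 80.0, 40.0, 30.0],
    "product_category_name": ["toys", "toys", "books", "garden"],
    "is_repeat_buyer": [False, False, True, False],
})
top_categories = ["books"]  # hypothetical list of popular categories

# Any single condition is enough to flag a customer as a potential repeat buyer
mask = (
    (df["review_score"] > 3)
    | (df["deli_duration_exp"] <= -7)
    | (df["voucher"] >= 0.3)
    | (df["total_spent"] >= df["total_spent"].quantile(0.8))
    | (df["product_category_name"].isin(top_categories))
)

# Assumption: the pseudo label is the real flag OR the heuristic mask
df["pseudo_label"] = df["is_repeat_buyer"] | mask
print(df[["is_repeat_buyer", "pseudo_label"]])
```

Because the conditions are joined with OR, the mask is permissive: three of the four toy customers end up pseudo-labelled as positive even though only one is an actual repeat buyer.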

In [None]:
ssl_lr = etl.load_model_metrics(Path("../data/08_reporting/ssl_logistic_model_metrics.json"))

In [None]:
lr_all = {"Logistic Regression": lr, "Finetuned Logistic Regression": lr_fine_tuned, "SSL Logistic Regression": ssl_lr}
viz.plot_before_after_metrics(lr_all, "Semi-supervised Learning")

- The model has improved so much that it's too good to be true now.
- But this feasibility check suggests that semi-supervised learning may work in our case of predicting POTENTIAL repeat buyers, where the target is a weak label. (There is no such thing as a potential repeat buyer in the provided dataset.)
- How well SSL works largely depends on defining the pseudo-label mask correctly without much bias, and this can be further improved through continuous retraining once a label actually becomes True (from potential to actual repeat buyer) in the future.

In [None]:
# Plot PRC curves for all lr models
plt.figure(figsize=(6, 6))
for model_name, result in lr_all.items():
    precision = result["prc_curve"]["precision"]
    recall = result["prc_curve"]["recall"]
    auc = result["prc_auc"]
    label = f"{model_name} (PRC-AUC = {auc:.3f})"
    plt.plot(recall, precision, label=label)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (All Logistic Models)")
plt.grid(True)
plt.legend(title="Model", loc="lower left", bbox_to_anchor=(0, -0.7))
plt.tight_layout()
plt.show()


(The SSL model is far too optimistic and should be adjusted following the suggestions above.)

In [None]:
ssl_bias = pd.read_csv(Path("../data/08_reporting/ssl_bias_report.csv"))
ssl_bias.head()

- Out of 93617 repeat buyers, 61154 (~65%) came from the top 10 categories (which we intentionally defined earlier for semi-supervised learning).
- This indicates that the pseudo-labeling model might be biased. It may have learned to assign "repeat buyer" labels primarily based on category frequency rather than user behavior.
- This risks overfitting to popular products and failing on underrepresented or niche categories.
- Countermeasures would be to downweight product categories during pseudo-labelling or to add more diversity.
(The same approach can be used to check model bias for each of the new conditions we defined during pseudo-labelling.)
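
The bias check above boils down to asking what share of pseudo-positive customers satisfy a given labelling condition. Below is a minimal sketch: `share_of_positives` is a hypothetical helper, the two counts are the ones quoted above, and the per-condition flags are synthetic.

```python
def share_of_positives(pseudo_labels, condition_flags):
    """Fraction of pseudo-positive rows that also satisfy one labelling condition."""
    hits = [flag for label, flag in zip(pseudo_labels, condition_flags) if label]
    return sum(hits) / len(hits) if hits else 0.0


# Counts quoted above: repeat buyers overall and those in the top 10 categories
top_category_share = 61154 / 93617
print(f"Top-category share: {top_category_share:.1%}")

# Synthetic flags showing the same check for two different conditions
pseudo_labels = [True, True, True, False, True]
in_top_category = [True, False, True, True, True]
high_review = [False, True, False, False, False]

toy_top_share = share_of_positives(pseudo_labels, in_top_category)
toy_review_share = share_of_positives(pseudo_labels, high_review)
print(f"Toy top-category share: {toy_top_share:.2f}")
print(f"Toy high-review share: {toy_review_share:.2f}")
```

A condition whose share dominates the others, as the top-category condition does in the toy flags, is a candidate for downweighting.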