This notebook is used for experimentation to find the best recommendations. Depending on the case, not all cells are run.

In [66]:
# Enable auto-reloading so you can edit .py files without restarting the kernel
%load_ext autoreload
%autoreload 2


import joblib
import sys

# Patch scikit-learn before any module imports it, otherwise estimators imported earlier stay unpatched.
# Intercepts calls to NearestNeighbors, SVD, and k-means and routes them through Intel's optimized oneDAL library.
# This can result in a 10x-100x speedup on Intel chips without changing the code logic.
from sklearnex import patch_sklearn
patch_sklearn()

# Add the project root to path so we can import 'src'
sys.path.append('../')

from src.data_loader import DataLoader
from src.models import CollaborativeRecommender, ContentRecommender, HybridRecommender
from src.utils import tune_alpha, get_tuning_sample, generate_kaggle_submission, tune_half_life
from src.evaluation import ModelEvaluator

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)


# Load Data

In [67]:
# 1. Initialize Loader
loader = DataLoader(base_path='../data')

# 2. Get Chronological Split (Train/Test)
# We apply Time Decay (Half-Life = 120 days) for the models that use it
train_df, test_df = loader.get_time_split(train_ratio=0.8, half_life_days=120)

# 3. Get Content Artifacts (Matrices & Map)
tfidf, vectors, item_map = loader.get_content_data()

# 4. Get Full Dataset
full_df = loader.get_full_data(half_life_days=120)

print("\n>>> Data Summary:")
print(f"   Train Interactions: {len(train_df)}")
print(f"   Test Interactions:  {len(test_df)}")
print(f"   Unique Items:       {len(item_map)}")

>>> Loading Interactions...
   -> Interactions loaded: 87045 rows
>>> Splitting Data (First 80% Train)...
   -> Applying Time Decay (Half-Life: 120 days)...
   -> Train: 66580 | Test: 20465
   -> Total: 66580 + 20465 = 87045
>>> Loading Content Artifacts...
   -> Loaded features for 15291 items.
>>> Preparing Full Dataset (Half-Life: 120 days)...
   -> Full Data Weighted: 87045 rows

>>> Data Summary:
   Train Interactions: 66580
   Test Interactions:  20465
   Unique Items:       15291
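
The half-life weighting mentioned above can be illustrated with a minimal sketch. The formula below is the standard exponential half-life decay; whether `DataLoader` uses exactly this form, and which reference date it measures age from, is an assumption, and `time_decay_weight` is a hypothetical helper that is not part of `src`.

```python
from datetime import datetime, timedelta

def time_decay_weight(event_time, reference_time, half_life_days):
    """Weight that halves every `half_life_days` of age (assumed decay form)."""
    age_days = (reference_time - event_time).total_seconds() / 86400.0
    return 0.5 ** (age_days / half_life_days)

# Assumption: age is measured from the most recent interaction in the data.
reference = datetime(2024, 12, 26)
ages_in_days = [0, 60, 120, 240]
weights = [
    time_decay_weight(reference - timedelta(days=age), reference, half_life_days=120)
    for age in ages_in_days
]
for age, weight in zip(ages_in_days, weights):
    print(f"age={age:>3} days -> weight={weight:.4f}")
```

Under this form, with a 120-day half-life, an interaction that is 120 days old carries half the weight of a fresh one, and one that is 240 days old carries a quarter.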


# Experimentation

## Collaborative-Based Recommender

In [36]:
print(">>> Building Baseline Models...")

# 1. Collaborative Filtering (Memory-Based)
# It automatically detects the 'weight' column in train_df
cf_model = CollaborativeRecommender(train_df)
print(">>> Collaborative Model Built.")

>>> Building Baseline Models...
   -> Applying TF-IDF to Interaction Matrix...
>>> Collaborative Model Built.


In [37]:
print("\n>>> Evaluating: Collaborative Filtering")
all_item_ids = full_df['item_id'].unique()
evaluator = ModelEvaluator(train_df, all_item_ids)
res_cf = evaluator.evaluate(cf_model, test_df, k=10, model_name="Collaborative (TF-IDF)")

print("\n>>> Results:")
print(f"   -> Hit Rate @ 10: {res_cf['Hit Rate @ 10']:.4%}")
print(f"   -> MAP @ 10: {res_cf['MAP @ 10']:.4%}")
# Novelty is not a ratio (its value exceeds 1), so print it as a plain number instead of a percentage
print(f"   -> Novelty: {res_cf['Novelty']:.4f}")
print(f"   -> Coverage: {res_cf['Coverage']:.4%}")


>>> Evaluating: Collaborative Filtering
   -> Pre-computing Item Popularity for Novelty metrics...
>>> Evaluating Collaborative (TF-IDF) on 20465 users...


>>> Results:
   -> Hit Rate @ 10: 27.0364%
   -> MAP @ 10: 13.8385%
   -> Novelty: 13.0925
   -> Coverage: 85.5384%
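
For reference, the four reported metrics can be sketched with their common textbook definitions. This is an illustration only: the exact formulas inside `ModelEvaluator` are not shown in this notebook, so the functions below (all hypothetical names) are assumptions. Novelty is taken here as mean self-information in bits, which is unbounded above and therefore not naturally a percentage.

```python
import math

def hit_rate_at_k(recommended, relevant, k=10):
    """1.0 if at least one of the top-k recommendations is relevant, else 0.0."""
    return float(any(item in relevant for item in recommended[:k]))

def average_precision_at_k(recommended, relevant, k=10):
    """Average of precision values at each rank where a relevant item appears."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def novelty(recommended, item_counts, n_interactions, k=10):
    """Mean self-information -log2(p(item)) of the top-k items, in bits."""
    infos = [
        -math.log2(item_counts[item] / n_interactions)
        for item in recommended[:k]
        if item_counts.get(item, 0) > 0
    ]
    return sum(infos) / len(infos) if infos else 0.0

def coverage(all_recommendations, catalog_size):
    """Share of the catalog that appears in at least one recommendation list."""
    seen = {item for recs in all_recommendations for item in recs}
    return len(seen) / catalog_size

# Toy data, made up for illustration
recs = ["a", "b", "c", "d"]
relevant = {"b", "d"}
counts = {"a": 8, "b": 4, "c": 2, "d": 2}

hr = hit_rate_at_k(recs, relevant, k=4)
ap = average_precision_at_k(recs, relevant, k=4)
nov = novelty(recs, counts, n_interactions=16, k=4)
cov = coverage([["a", "b"], ["b", "c"]], catalog_size=4)
print(hr, ap, nov, cov)
```

MAP @ k is then the mean of `average_precision_at_k` over all evaluated users, and Hit Rate @ k is the mean of `hit_rate_at_k`.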


## Tuning Alpha Hyperparameters

### Content-Based Recommender

In [38]:
# Content-Based Filtering
# We need to tune alpha (TF-IDF vs MiniLM)
print("\n>>> Tuning Content Alpha... (Sample of user)")
tuning_df = get_tuning_sample(test_df, n_users=1000)
best_content_alpha = tune_alpha(
    model=ContentRecommender(train_df, tfidf, vectors, item_map),
    test_df=tuning_df,
    param_name='alpha'
)


>>> Tuning Content Alpha... (Sample of user)
   -> Sampling Strategy: Selected 1000 Users
   -> Original Rows: 20465 | Sampled Rows: 2496
>>> Tuning 'alpha' on 2496 users...


alpha=0.0: 100%|██████████| 2496/2496 [02:54<00:00, 14.34it/s]


   [0.0] Hit Rate: 20.59294872%


alpha=0.2: 100%|██████████| 2496/2496 [02:56<00:00, 14.15it/s]


   [0.2] Hit Rate: 21.11378205%


alpha=0.5: 100%|██████████| 2496/2496 [03:10<00:00, 13.12it/s]


   [0.5] Hit Rate: 20.99358974%


alpha=0.8: 100%|██████████| 2496/2496 [03:08<00:00, 13.23it/s]


   [0.8] Hit Rate: 19.75160256%


alpha=1.0: 100%|██████████| 2496/2496 [03:04<00:00, 13.52it/s]

   [1.0] Hit Rate: 17.90865385%

>>> Best alpha: 0.2 (Hit Rate: 21.11378205%)
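
The sweep above is a one-dimensional grid search over a blending weight. A minimal sketch of the idea follows, assuming the content score is a convex combination of two similarity signals. The notebook does not state which end of `alpha` corresponds to TF-IDF and which to MiniLM, so the two signals are deliberately named `sim_a` and `sim_b`; `blend`, `top1_hit_rate`, and `grid_search` are hypothetical helpers, not the `tune_alpha` implementation.

```python
def blend(sim_a, sim_b, alpha):
    """Convex combination of two per-item similarity lists."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(sim_a, sim_b)]

def top1_hit_rate(users, alpha):
    """Fraction of users whose held-out item is ranked first after blending."""
    hits = 0
    for sim_a, sim_b, target in users:
        scores = blend(sim_a, sim_b, alpha)
        best_item = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best_item == target)
    return hits / len(users)

def grid_search(score_fn, values):
    """Evaluate every candidate value and return the best one with all scores."""
    results = {value: score_fn(value) for value in values}
    best = max(results, key=results.get)
    return best, results

# Toy data: (sim_a, sim_b, index of the held-out item) per user, made up for illustration
toy_users = [
    ([0.9, 0.1, 0.0], [0.0, 0.6, 0.1], 0),
    ([0.1, 0.7, 0.0], [0.8, 0.0, 0.1], 0),
]
best_alpha, hit_rates = grid_search(
    lambda a: top1_hit_rate(toy_users, a),
    values=[0.0, 0.2, 0.5, 0.8, 1.0],
)
print(best_alpha, hit_rates)
```

In the toy data each signal alone gets one user right, and only an intermediate weight gets both, which is the situation an interior optimum such as 0.2 suggests.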





In [42]:
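# Build the content model. The tuned alpha is not a constructor argument here;
# it is handed over when the hybrid is assembled (content_alpha).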
content_model = ContentRecommender(train_df, tfidf, vectors, item_map)
print(f">>> Content Model Built (Alpha: {best_content_alpha})")

>>> Content Model Built (Alpha: 0.2)


### Hybrid Content-Collaborative Recommender

In [43]:
# Instantiate the Hybrid
# We pass the best_content_alpha we just found so the Content engine inside is optimized.
hybrid_model = HybridRecommender(
    cf_model,
    content_model,
    content_alpha=best_content_alpha
)

In [44]:
# We use the same 'tuning_df' sample to keep it fast
print(">>> Tuning Hybrid Alpha (CF vs Content)...")

best_hybrid_alpha = tune_alpha(
    model=hybrid_model,
    test_df=tuning_df,  # Use the same sample for consistency/speed
    param_name='hybrid_alpha',
    values=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
)

print(f"\n>>> Optimal Hybrid Configuration:")
print(f"   -> Content Internal Alpha: {best_content_alpha}")
print(f"   -> Hybrid Balance Alpha:   {best_hybrid_alpha}")

if best_hybrid_alpha > 0.5:
    print("   -> Interpretation: The model leans towards Collaborative Filtering (Social Signals).")
elif best_hybrid_alpha < 0.5:
    print("   -> Interpretation: The model leans towards Content Matching (Metadata).")
else:
    print("   -> Interpretation: A perfect 50/50 balance.")

>>> Tuning Hybrid Alpha (CF vs Content)...
>>> Tuning 'hybrid_alpha' on 2496 users...


hybrid_alpha=0.0: 100%|██████████| 2496/2496 [03:03<00:00, 13.58it/s]


   [0.0] Hit Rate: 21.27403846%


hybrid_alpha=0.2: 100%|██████████| 2496/2496 [03:04<00:00, 13.52it/s]


   [0.2] Hit Rate: 22.87660256%


hybrid_alpha=0.4: 100%|██████████| 2496/2496 [03:08<00:00, 13.28it/s]


   [0.4] Hit Rate: 24.43910256%


hybrid_alpha=0.5: 100%|██████████| 2496/2496 [03:03<00:00, 13.61it/s]


   [0.5] Hit Rate: 24.83974359%


hybrid_alpha=0.6: 100%|██████████| 2496/2496 [03:21<00:00, 12.40it/s]


   [0.6] Hit Rate: 25.16025641%


hybrid_alpha=0.8: 100%|██████████| 2496/2496 [03:11<00:00, 13.06it/s]


   [0.8] Hit Rate: 25.28044872%


hybrid_alpha=1.0: 100%|██████████| 2496/2496 [03:01<00:00, 13.75it/s]

   [1.0] Hit Rate: 25.40064103%

>>> Best hybrid_alpha: 1.0 (Hit Rate: 25.40064103%)

>>> Optimal Hybrid Configuration:
   -> Content Internal Alpha: 0.2
   -> Hybrid Balance Alpha:   1.0
   -> Interpretation: The model leans towards Collaborative Filtering (Social Signals).
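
The interpretation printed above implies that `hybrid_alpha` is the weight on the collaborative scores. A minimal sketch of such a blend is shown below, assuming both score lists are min-max normalized before mixing. The normalization step and the helper names are assumptions; `HybridRecommender` may combine its two engines differently.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(cf_scores, content_scores, hybrid_alpha):
    """hybrid_alpha weights collaborative scores, (1 - hybrid_alpha) weights content scores."""
    cf_norm = min_max(cf_scores)
    content_norm = min_max(content_scores)
    return [
        hybrid_alpha * c + (1.0 - hybrid_alpha) * t
        for c, t in zip(cf_norm, content_norm)
    ]

def top_k(scores, k):
    """Indices of the k highest scores, ties broken by lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

# Toy scores for three items, made up for illustration
cf_scores = [5.0, 1.0, 3.0]
content_scores = [0.0, 2.0, 1.0]

pure_cf = top_k(hybrid_scores(cf_scores, content_scores, 1.0), k=2)
pure_content = top_k(hybrid_scores(cf_scores, content_scores, 0.0), k=2)
print(pure_cf, pure_content)
```

Under this form, `hybrid_alpha = 1.0` reproduces the collaborative ranking exactly, so a best value at the upper edge of the grid means the content signal added nothing on the tuning sample.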





In [45]:
# Optional manual override: uncomment to pin the alphas instead of using the tuned values
# best_content_alpha = 0.5
# best_hybrid_alpha = 0.6

final_alpha_config = {
    "best_content_alpha": best_content_alpha,
    "best_hybrid_alpha": best_hybrid_alpha,
}

joblib.dump(final_alpha_config, '../data/artifacts/best_params_hl180.pkl')
print(">>> Alphas Hyperparameters Saved:")
print(final_alpha_config)

>>> Alphas Hyperparameters Saved:
{'best_content_alpha': 0.2, 'best_hybrid_alpha': 1.0}


#### (Tuning Half-Life)

In [15]:
saved_config = joblib.load('../data/artifacts/best_params.pkl')
best_content_alpha = saved_config.get('best_content_alpha', 0.5)
best_hybrid_alpha = saved_config.get('best_hybrid_alpha', 0.6)
print(f">>> Loaded Saved Alphas:")
print(f"   -> Content Alpha: {best_content_alpha}")
print(f"   -> Hybrid Alpha:  {best_hybrid_alpha}")

>>> Loaded Saved Alphas:
   -> Content Alpha: 0.5
   -> Hybrid Alpha:  0.6


In [16]:
all_item_ids = full_df['item_id'].unique()
evaluator = ModelEvaluator(train_df, all_item_ids)

best_half_life = tune_half_life(
    loader=loader,
    test_df=test_df,
    item_tfidf=tfidf,
    item_minilm=vectors,
    item_map=item_map,
    evaluator=evaluator,
    best_c_alpha=best_content_alpha,
    best_h_alpha=best_hybrid_alpha,
    k=10
)
saved_config['best_half_life'] = best_half_life

   -> Pre-computing Item Popularity for Novelty metrics...
>>> Tuning Half-Life (Data Decay) with Checkpointing...
   -> Found existing checkpoint with 4 results.

--- [Skipping] Half-Life = 30 days (Found: 25.9956%) ---

--- [Skipping] Half-Life = 60 days (Found: 26.7579%) ---

--- [Skipping] Half-Life = 90 days (Found: 27.0462%) ---

--- [Skipping] Half-Life = 120 days (Found: 27.2221%) ---

--- Testing Half-Life = 180 days ---
>>> Splitting Data (First 80% Train)...
   -> Applying Time Decay (Half-Life: 180 days)...
   -> Train: 66580 | Test: 20465
   -> Total: 66580 + 20465 = 87045
   -> Applying TF-IDF to Interaction Matrix...
>>> Evaluating Hybrid (HL=180) on 20465 users...



>>> Hit Rate: 27.2563% at HL=180 days
   -> Saved Checkpoint: ../data/artifacts/half_life_results.pkl

--- Testing Half-Life = 365 days ---
>>> Splitting Data (First 80% Train)...
   -> Applying Time Decay (Half-Life: 365 days)...
   -> Train: 66580 | Test: 20465
   -> Total: 66580 + 20465 = 87045
   -> Applying TF-IDF to Interaction Matrix...
>>> Evaluating Hybrid (HL=365) on 20465 users...



>>> Hit Rate: 27.0266% at HL=365 days
   -> Saved Checkpoint: ../data/artifacts/half_life_results.pkl

>>> Evaluation Summary:
   -> Hit Rate: 25.9956% at HL=30 days
   -> Hit Rate: 26.7579% at HL=60 days
   -> Hit Rate: 27.0462% at HL=90 days
   -> Hit Rate: 27.2221% at HL=120 days
   -> Hit Rate: 27.2563% at HL=180 days
   -> Hit Rate: 27.0266% at HL=365 days

>>> Best Half-Life: 180 days
   -> Hit Rate: 27.2563%
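
The log above shows already-evaluated half-lives being skipped and a checkpoint being saved after each new value. A minimal sketch of that resumable search pattern follows. It is not the `tune_half_life` implementation: `tune_with_checkpoint` is a hypothetical helper, `pickle` stands in for `joblib` to keep the example dependency-free, and the evaluator is replaced by a lookup of the hit rates reported in this notebook.

```python
import os
import pickle
import tempfile

def tune_with_checkpoint(candidates, evaluate_fn, checkpoint_path):
    """Evaluate each candidate once, persisting results so a rerun can resume."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as fh:
            results = pickle.load(fh)
    for value in candidates:
        if value in results:
            continue  # already evaluated in an earlier run
        results[value] = evaluate_fn(value)
        with open(checkpoint_path, "wb") as fh:
            pickle.dump(results, fh)  # save after every candidate
    best = max(results, key=results.get)
    return best, results

# Stand-in evaluator: returns the hit rates reported in the log above
reported = {30: 0.259956, 60: 0.267579, 90: 0.270462,
            120: 0.272221, 180: 0.272563, 365: 0.270266}
calls = []

def fake_evaluate(half_life):
    calls.append(half_life)
    return reported[half_life]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "half_life_results.pkl")
    # First run covers four values; the second resumes and only evaluates the rest
    tune_with_checkpoint([30, 60, 90, 120], fake_evaluate, path)
    best_hl, all_results = tune_with_checkpoint(list(reported), fake_evaluate, path)

print(best_hl, calls)
```

Saving after every candidate matters here because each full evaluation is slow, so an interrupted run loses at most one result.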


In [23]:
joblib.dump(saved_config, '../data/artifacts/best_params.pkl')
print(">>> Half-Life + Alphas Hyperparameters Saved:")
print(saved_config)

>>> Half-Life + Alphas Hyperparameters Saved:
{'best_content_alpha': 0.5, 'best_hybrid_alpha': 0.6, 'best_half_life': 180}


We tuned the alphas with a half-life of 120 days, but the half-life search above found that 180 days performs best (27.2563% vs. 27.2221% Hit Rate @ 10). We therefore re-tune the alpha hyperparameters by re-running the previous cells with half_life_days=180.

# Final: Retraining on the Full Dataset

In [77]:
final_config = joblib.load('../data/artifacts/winning_best_params.pkl')
print(final_config)
best_content_alpha = final_config.get('best_content_alpha', 0.5)
best_hybrid_alpha = final_config.get('best_hybrid_alpha', 0.6)
print(f">>> Loaded Saved Alphas:")
print(f"   -> Content Alpha: {best_content_alpha}")
print(f"   -> Hybrid Alpha:  {best_hybrid_alpha}")

{'best_content_alpha': 0.5, 'best_hybrid_alpha': 0.6, 'best_half_life': 120}
>>> Loaded Saved Alphas:
   -> Content Alpha: 0.5
   -> Hybrid Alpha:  0.6


In [78]:
print("\n>>> BUILDING FINAL PRODUCTION MODELS...")

# Train Collaborative Model (Full Data)
cf_full = CollaborativeRecommender(full_df)
print("   -> Collaborative Model Retrained.")

# Train Content Model (Full Data)
# Reuses the artifacts (tfidf, vectors), which already cover the full catalog
content_full = ContentRecommender(
    full_df,
    tfidf,
    vectors,
    item_map
)
print("   -> Content Model Retrained.")


>>> BUILDING FINAL PRODUCTION MODELS...
   -> Applying TF-IDF to Interaction Matrix...
   -> Collaborative Model Retrained.
   -> Content Model Retrained.


In [79]:
# Hybrid Model
# We pass the tuned content alpha loaded from the saved config above
hybrid_full = HybridRecommender(
    cf_full,
    content_full,
    content_alpha=best_content_alpha
)

print(f"   -> Hybrid Model Assembled (Content Alpha: {best_content_alpha})")
print("\n>>> Final Model Ready for Submission Generation.")

   -> Hybrid Model Assembled (Content Alpha: 0.5)

>>> Final Model Ready for Submission Generation.


In [None]:
target_users = full_df['user_id'].unique()
submission = generate_kaggle_submission(
    model=hybrid_full,
    target_user_ids=target_users,
    k=10,
    hybrid_alpha=best_hybrid_alpha,
    pop_weight=0.2)

In [72]:
submission

|      | user_id | recommendation |
|------|---------|----------------|
| 0    | 0       | 23 16 13 19 24 21 22 611 13261 17 |
| 1    | 1       | 37 36 38 10715 39 8999 9926 33 611 2553 |
| 2    | 2       | 58 53 14990 92 3055 91 8999 10715 75 14991 |
| 3    | 3       | 132 169 149 14107 171 2553 167 611 165 12087 |
| 4    | 4       | 195 205 203 207 11366 248 206 200 204 202 |
| ...  | ...     | ... |
| 7833 | 7833    | 7760 7322 975 5838 611 667 7127 10967 8498 4921 |
| 7834 | 7834    | 7128 1367 8999 3055 13891 92 10715 14991 36 7121 |
| 7835 | 7835    | 3055 6791 4820 9310 10715 45 53 8999 9719 3019 |
| 7836 | 7836    | 3471 14550 14557 3816 3055 611 10715 3470 8999... |
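
Each submission row holds a user ID and a space-separated string of ten item IDs. The sketch below shows one way such rows could be produced, including a popularity term like the `pop_weight=0.2` argument passed above. How `generate_kaggle_submission` actually uses `pop_weight` is not shown in this notebook, so the linear blend and both helper names are assumptions.

```python
def rerank_with_popularity(model_scores, popularity, pop_weight=0.2, k=10):
    """Blend model scores with normalized item popularity and return the top-k item IDs."""
    max_pop = max(popularity.values(), default=0) or 1
    blended = {
        item: (1.0 - pop_weight) * score + pop_weight * popularity.get(item, 0) / max_pop
        for item, score in model_scores.items()
    }
    # Sort by blended score (descending), breaking ties by item ID
    ranked = sorted(blended, key=lambda item: (-blended[item], item))
    return ranked[:k]

def format_row(user_id, items):
    """One submission row: item IDs joined by single spaces."""
    return {"user_id": user_id, "recommendation": " ".join(str(i) for i in items)}

# Toy scores and interaction counts, made up for illustration
model_scores = {23: 0.9, 16: 0.8, 611: 0.7}
popularity = {23: 10, 16: 5, 611: 100}

with_pop = rerank_with_popularity(model_scores, popularity, pop_weight=0.2, k=2)
without_pop = rerank_with_popularity(model_scores, popularity, pop_weight=0.0, k=2)
row = format_row(0, with_pop)
print(row, without_pop)
```

In the toy data the popular item moves to the top once the popularity term is switched on, which illustrates how a few item IDs can recur across many users' lists.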


## Save

In [73]:
# save submission
submission.to_csv('../submissions/submission_final.csv', index=False)

In [74]:
# save all the models
joblib.dump(cf_full, '../models/cf_model.pkl')
joblib.dump(content_full, '../models/content_model.pkl')
joblib.dump(hybrid_full, '../models/hybrid_model.pkl')

['../models/hybrid_model.pkl']

## Final Evaluation

In [82]:
# Note: hybrid_model.pkl was fitted on full_df, which contains the test_df interactions,
# so the scores below are inflated by leakage and are not comparable to the held-out results above.
final_model = joblib.load('../models/hybrid_model.pkl')

all_item_ids = full_df['item_id'].unique()
evaluator = ModelEvaluator(train_df, all_item_ids)
res_final = evaluator.evaluate(final_model, test_df, k=10, model_name="Hybrid")

print("\n>>> Results:")
print(f"   -> Hit Rate @ 10: {res_final['Hit Rate @ 10']:.4%}")
print(f"   -> MAP @ 10: {res_final['MAP @ 10']:.4%}")
# Novelty is not a ratio (its value exceeds 1), so print it as a plain number instead of a percentage
print(f"   -> Novelty: {res_final['Novelty']:.4f}")
print(f"   -> Coverage: {res_final['Coverage']:.4%}")

   -> Pre-computing Item Popularity for Novelty metrics...
>>> Evaluating Hybrid on 20465 users...


>>> Results:
   -> Hit Rate @ 10: 75.8417%
   -> MAP @ 10: 42.3133%
   -> Novelty: 13.7505
   -> Coverage: 89.1058%
