# PantryPalML: Production Training Notebook

This notebook demonstrates how I build and train the production model used by `ProductionRecipeScorer`.

It reuses our modules via imports and shows:
- Environment setup (Colab-friendly)
- Data preparation using `TrainingDataBuilder`
- Model training, evaluation, and saving via `RecipeRanker`
- Artifacts produced for inference (model + metadata)
- Brief discussion of task, loss, metrics, and practical objective alignment


In [1]:
# Colab/local environment setup (installs and clones only when running on Colab)
import os
import pathlib
import subprocess
import sys

IN_COLAB = "google.colab" in sys.modules
repo_root = pathlib.Path.cwd()

if IN_COLAB:
    # check=False never raises on a failed install, so inspect the exit code instead
    result = subprocess.run([sys.executable, "-m", "pip", "install", "-q",
                             "lightgbm", "pandas", "numpy", "scikit-learn", "matplotlib", "seaborn"],
                            check=False)
    if result.returncode != 0:
        print(f"pip install warning: exit code {result.returncode}")

    if not (repo_root / "recipe_recommender").exists():
        subprocess.run(["git", "clone", "-q", "https://github.com/marcel-qayoom-taylor/PantryPalML.git"], check=True)
        os.chdir("PantryPalML")
        repo_root = pathlib.Path.cwd()

print(f"Environment ready. Project root: {repo_root}")


Environment ready. Project root: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/notebooks


In [2]:
# Imports from production codebase
from recipe_recommender.config import get_ml_config

# Central config object (paths, hyperparams, event weights)
config = get_ml_config()
print("Config paths:")
print(" - output_dir:", config.output_dir)
print(" - input_dir:", config.input_dir)
print(" - model_dir:", config.model_dir)


Config paths:
 - output_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output
 - input_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/input
 - model_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models


### Build Training Data
I create ML-ready datasets from our real event logs and recipe database extracts using `TrainingDataBuilder`.


In [3]:
# Build datasets: the builder orchestrates data loading and feature engineering
from recipe_recommender.models.training_data_builder import TrainingDataBuilder

builder = TrainingDataBuilder(config)

# Reads recipe input data from database
ok_recipes = builder.load_real_recipe_data()
# Reads user interaction history from event logs
ok_events = builder.extract_user_interactions_from_events()

if not (ok_recipes and ok_events):
    raise RuntimeError("Missing required data files. Ensure recipe and event outputs exist in recipe_recommender/output.")

# Aggregates per-user stats (avg/total rating, activity, device/platform, engagement)
user_profiles = builder.create_user_profiles()
# Generates positive/negative user–recipe pairs with labels
training_pairs = builder.create_user_recipe_pairs()
# Final feature matrix + train/val/test CSVs and metadata/feature list
train_df, val_df, test_df = builder.prepare_training_data()

print(train_df.shape, val_df.shape, test_df.shape)


2025-09-29 18:01:56,001 - recipe_recommender.models.training_data_builder - INFO - Initialized Training Data Builder
2025-09-29 18:01:56,002 - recipe_recommender.models.training_data_builder - INFO - Loading real recipe database
2025-09-29 18:01:56,013 - recipe_recommender.models.training_data_builder - INFO - Loaded 1967 recipes with enhanced features
2025-09-29 18:01:56,015 - recipe_recommender.models.training_data_builder - INFO - Loaded 21439 recipe-ingredient relationships
2025-09-29 18:01:56,018 - recipe_recommender.models.training_data_builder - INFO - Loaded 2092 ingredients
2025-09-29 18:01:56,018 - recipe_recommender.models.training_data_builder - INFO - Extracting user interactions from events
2025-09-29 18:01:56,018 - recipe_recommender.models.training_data_builder - INFO -    Processing v1_events_20250827.json...
2025-09-29 18:01:56,301 - recipe_recommender.models.training_data_builder - INFO -    Processing v2_events_20250920.json...
2025-09-29 18:01:56,445 - recipe_recom

(12591, 40) (4197, 40) (4197, 40)


### Train, Evaluate, Save Model
I train `RecipeRanker`, evaluate it on validation data with ranking metrics (NDCG@k, Recall@k), and save the artifacts used by inference.


In [4]:

from recipe_recommender.models.recipe_ranker import RecipeRanker

# Train and evaluate
recommender = RecipeRanker(config)  # wraps LightGBM with config-driven hyperparams and tracked features

# Reads train/val/test CSVs prepared by the data builder
recommender.load_training_data()
# Ensures recipe-level features are available (used for context/eval)
recommender.load_recipe_features()

# Fits LightGBM Lambdarank (ranking) with early stopping (NDCG on validation)
recommender.train_model()

# Reports per-user ranking metrics (NDCG@k and Recall@k)
recommender.evaluate_model()
# LightGBM feature importance by gain (sum loss reduction per feature)
importance = recommender.get_feature_importance()
print("Top 10 features:\n", importance.head(10))

# Writes booster + metadata (feature columns, config, training stats) to model_dir
recommender.save_model()

print("Artifacts saved in:", config.model_dir)


2025-09-29 18:01:59,112 - recipe_recommender.models.recipe_ranker - INFO - Initialized Recipe Ranker with lightgbm
2025-09-29 18:01:59,113 - recipe_recommender.models.recipe_ranker - INFO - Loading training data
2025-09-29 18:01:59,247 - recipe_recommender.models.recipe_ranker - INFO - Successfully loaded training data:
2025-09-29 18:01:59,247 - recipe_recommender.models.recipe_ranker - INFO -    Train: 12,591 samples
2025-09-29 18:01:59,247 - recipe_recommender.models.recipe_ranker - INFO -    Validation: 4,197 samples
2025-09-29 18:01:59,248 - recipe_recommender.models.recipe_ranker - INFO -    Test: 4,197 samples
2025-09-29 18:01:59,248 - recipe_recommender.models.recipe_ranker - INFO - Loaded 19 feature columns
2025-09-29 18:01:59,249 - recipe_recommender.models.recipe_ranker - INFO - Loaded training metadata
2025-09-29 18:01:59,258 - recipe_recommender.models.recipe_ranker - INFO - Loaded raw recipe features from enhanced_recipe_features_from_db.csv
2025-09-29 18:01:59,259 - recip

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[5]	train's ndcg@5: 0.999498	train's ndcg@10: 0.999513	train's ndcg@20: 0.999478	validation's ndcg@5: 0.999901	validation's ndcg@10: 0.999631	validation's ndcg@20: 0.999777


2025-09-29 18:01:59,781 - recipe_recommender.models.recipe_ranker - INFO - Model performance:
2025-09-29 18:01:59,781 - recipe_recommender.models.recipe_ranker - INFO -    NDCG@5: 0.6430
2025-09-29 18:01:59,782 - recipe_recommender.models.recipe_ranker - INFO -    NDCG@10: 0.6443
2025-09-29 18:01:59,782 - recipe_recommender.models.recipe_ranker - INFO -    Recall@5: 0.9464
2025-09-29 18:01:59,782 - recipe_recommender.models.recipe_ranker - INFO -    Recall@10: 0.9845
2025-09-29 18:01:59,786 - recipe_recommender.models.recipe_ranker - INFO - Saving trained model
2025-09-29 18:01:59,787 - recipe_recommender.models.recipe_ranker - INFO - Model saved to: hybrid_lightgbm_model.txt
2025-09-29 18:01:59,787 - recipe_recommender.models.recipe_ranker - INFO - Metadata saved to: hybrid_lightgbm_metadata.json


Top 10 features:
                          feature    importance
14         user_complexity_match  10387.193581
15  user_recipe_engagement_match   4338.539917
18                   is_ios_user    125.913300
17                is_mobile_user    119.825996
1                     avg_rating     57.169771
3                     rating_std     35.269730
2                   total_rating     31.214653
10              ingredient_count     11.039721
11            unique_ingredients      3.589320
0             total_interactions      0.469215
Artifacts saved in: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models


### Notes on Learning Task and Objective
- Input (train): user–recipe feature matrix with label indicating positive interaction
- Output (deploy): score per recipe for a given user
- Learning objective: LightGBM Lambdarank (pairwise ranking) optimized for NDCG@k
- Evaluation: primary ranking metrics NDCG@k (+ Recall@k); AUC/PR are reference only
- Alignment: We optimize directly for ranking quality to match top-N recommendation goals
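
To make the evaluation metrics above concrete, here is a minimal sketch of NDCG@k and Recall@k for a single user's candidate list. It assumes integer relevance labels and the `2^label - 1` gain that LightGBM uses by default. The function names are illustrative and are not taken from `recipe_ranker`, whose metric code may differ in detail.

```python
import math


def _labels_in_score_order(scores, labels):
    """Return labels sorted by descending model score."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return [label for _, label in pairs]


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(ranked_labels[:k]))


def ndcg_at_k(scores, labels, k):
    """NDCG@k for one user; 0.0 when the list has no relevant items."""
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(_labels_in_score_order(scores, labels), k) / ideal_dcg


def recall_at_k(scores, labels, k):
    """Share of a user's relevant items that appear in the top-k."""
    total_relevant = sum(1 for label in labels if label > 0)
    if total_relevant == 0:
        return 0.0
    top_k = _labels_in_score_order(scores, labels)[:k]
    return sum(1 for label in top_k if label > 0) / total_relevant


# Toy example: four candidate recipes for one user, two of them relevant.
scores = [0.9, 0.2, 0.7, 0.1]
labels = [0, 1, 1, 0]
print(round(ndcg_at_k(scores, labels, 2), 3))  # 0.387
print(recall_at_k(scores, labels, 2))          # 0.5
```

Per-user values like these would then be averaged across users to get a dataset-level figure.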


### Why Lambdarank (and how it works)

- **Why this objective**
  - **We care about ranking, not calibrated probabilities**: recommendations are evaluated by order (top‑K), so optimizing NDCG@k aligns the loss with our goal.
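  - **Direct optimization of a ranking surrogate**: Lambdarank weights each pairwise gradient by the NDCG change that swapping the pair would cause, so training pressure concentrates on mistakes near the top of the list. This typically improves NDCG/Recall over binary log loss in recommendation tasks.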
  - **Handles class imbalance and variable list sizes**: Works well with sparse positives and per‑user candidate sets of different lengths.

- **How it works (intuitively)**
  - For each user (a "group"), the model forms **pairwise preferences** between items and computes gradients ("lambdas") proportional to the **change in NDCG** if a pair were swapped.
  - LightGBM then **boosts decision trees** to reduce this surrogate loss, directly pushing relevant items upward in the list.
  - We pass per‑user group sizes, set `objective = "lambdarank"`, `metric = "ndcg"`, and choose `ndcg_eval_at = (5, 10, 20)` for validation/early stopping.

- **Loss used**
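  - Pairwise logistic loss on score differences, weighted by the NDCG change: for each pair where item `i` is more relevant than item `j`, `L_ij = |ΔNDCG_ij| * log(1 + exp(-(s_i - s_j)))`, where `s_i` and `s_j` are the model scores and `|ΔNDCG_ij|` is the change in NDCG if the two items swapped positions.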

- **Practical effects**
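  - Training stops once validation NDCG@k has not improved for 50 rounds (the early-stopping patience shown in the training log), and the best iteration is kept.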
  - At inference we get scores; **higher score ⇒ higher rank**. No thresholding is required for top‑N recommendations.
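
The mechanics described above can be sketched in plain Python. This is a simplified illustration, not LightGBM's implementation: it omits the sigmoid scaling, normalization, and truncation that LightGBM applies, and the helper names `group_sizes` and `lambdas_for_group` are hypothetical. It assumes rows are already sorted by user so that each user's candidates are contiguous.

```python
import math
from itertools import groupby


def group_sizes(user_ids):
    """Count consecutive rows per user (rows must be sorted by user)."""
    return [sum(1 for _ in rows) for _, rows in groupby(user_ids)]


def lambdas_for_group(scores, labels):
    """Per-item update directions for one user's list.

    Positive values push an item's score up, negative values push it down.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    rank = {item: pos for pos, item in enumerate(order)}
    gain = [2 ** label - 1 for label in labels]
    discount = [1.0 / math.log2(rank[i] + 2) for i in range(n)]
    ideal_dcg = sum(g / math.log2(pos + 2)
                    for pos, g in enumerate(sorted(gain, reverse=True)))
    lambdas = [0.0] * n
    if ideal_dcg == 0:
        return lambdas  # no relevant items, so no preference pairs
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # keep only pairs where i is more relevant than j
            # |ΔNDCG| if items i and j swapped positions in the ranking
            delta = abs((gain[i] - gain[j]) * (discount[i] - discount[j])) / ideal_dcg
            # Magnitude of d/ds_i of log(1 + exp(-(s_i - s_j)))
            rho = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))
            lambdas[i] += delta * rho
            lambdas[j] -= delta * rho
    return lambdas


sizes = group_sizes(["u1", "u1", "u2", "u3", "u3", "u3"])
print(sizes)  # [2, 1, 3]

# One user's list: the relevant items (label 1) are pushed up, the others down.
lams = lambdas_for_group([0.9, 0.2, 0.7, 0.1], [0, 1, 1, 0])
print([round(v, 3) for v in lams])
```

The group sizes are what a ranking objective needs in order to know where one user's candidate list ends and the next begins. The lambdas show why a relevant item ranked low receives a larger push than one already near the top.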



### Smoke Test: Saved Artifacts
Verify that the trained model and metadata were written to `config.model_dir`.


In [5]:
model_file = config.model_dir / "hybrid_lightgbm_model.txt"
meta_file = config.model_dir / "hybrid_lightgbm_metadata.json"

print("Model exists:", model_file.exists(), model_file)
print("Metadata exists:", meta_file.exists(), meta_file)


Model exists: True /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models/hybrid_lightgbm_model.txt
Metadata exists: True /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models/hybrid_lightgbm_metadata.json
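
An existence check alone would not catch an empty or truncated file. As a slightly stricter variant, the sketch below also verifies that both files are non-empty and that the metadata parses as JSON. The helper `check_artifacts` is hypothetical and makes no assumption about which keys the metadata contains.

```python
import json
from pathlib import Path


def check_artifacts(model_path, meta_path):
    """Return a list of problems; an empty list means the files look usable."""
    problems = []
    for path in (Path(model_path), Path(meta_path)):
        if not path.is_file():
            problems.append(f"missing: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty: {path}")
    if not problems:
        # Only parse the metadata once both files are present and non-empty
        try:
            json.loads(Path(meta_path).read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON in {meta_path}: {exc}")
    return problems
```

In this notebook it could be called as `check_artifacts(model_file, meta_file)`, treating any non-empty result as a failed smoke test.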
