# PantryPalML: Production Training Notebook

This notebook demonstrates how we build and train the actual production model used by `ProductionRecipeScorer`.

It reuses our production modules via imports (no code duplication) and shows:
- Environment setup (Colab-friendly)
- Data preparation using `HybridRecommendationDataBuilder`
- Model training, evaluation, and saving via `HybridGBMRecommender`
- Artifacts produced for inference (model + metadata)
- Brief discussion of task, loss, metrics, and practical objective alignment


In [1]:
# Colab/local environment setup (no installs or cloning when run locally)
import sys, subprocess, os, pathlib

IN_COLAB = "google.colab" in sys.modules
repo_root = pathlib.Path.cwd()

if IN_COLAB:
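    # With check=False a failed install does not raise, so inspect the exit code instead
    result = subprocess.run([sys.executable, "-m", "pip", "install", "-q",
                             "lightgbm", "pandas", "numpy", "scikit-learn", "matplotlib", "seaborn"],
                            check=False)
    if result.returncode != 0:
        print(f"pip install warning: exit code {result.returncode}")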

    if not (repo_root / "recipe_recommender").exists():
        subprocess.run(["git", "clone", "-q", "https://github.com/marcel-qayoom-taylor/PantryPalML.git"], check=True)
        os.chdir("PantryPalML")
        repo_root = pathlib.Path.cwd()

print(f"Environment ready. Project root: {repo_root}")


Environment ready. Project root: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/notebooks


In [None]:
# Imports from production codebase
from pathlib import Path
from recipe_recommender.config import get_ml_config
from recipe_recommender.models.hybrid_recommendation_data_builder import HybridRecommendationDataBuilder
from recipe_recommender.models.hybrid_gbm_recommender import HybridGBMRecommender

# Central config object (paths, hyperparams, event weights)
config = get_ml_config()
print("Config paths:")
print(" - output_dir:", config.output_dir)
print(" - input_dir:", config.input_dir)
print(" - model_dir:", config.model_dir)


Config paths:
 - output_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output
 - input_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/input
 - model_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models


### Build Training Data
We create ML-ready datasets from our real event logs and recipe DB extracts using `HybridRecommendationDataBuilder`.


In [None]:
# Build datasets. Orchestrates loading and feature engineering
builder = HybridRecommendationDataBuilder(config)

# Reads recipe input data
ok_recipes = builder.load_real_recipe_data()
# Reads user interaction history
ok_events = builder.extract_user_interactions_from_events()

if not (ok_recipes and ok_events):
    raise RuntimeError("Missing required data files. Ensure recipe and event outputs exist in recipe_recommender/output.")

# Aggregates per-user stats (avg/total rating, activity, device/platform, engagement)
user_profiles = builder.create_user_profiles()
# Generates positive/negative user–recipe pairs with labels
training_pairs = builder.create_user_recipe_pairs()
# Final feature matrix + train/val/test CSVs and metadata/feature list
train_df, val_df, test_df = builder.prepare_training_data()

print(train_df.shape, val_df.shape, test_df.shape)


2025-09-20 16:28:13,401 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - 🏗️ Initialized Hybrid Recommendation Data Builder
2025-09-20 16:28:13,402 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - 📊 Loading real recipe database...
2025-09-20 16:28:13,419 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - ✅ Loaded 1967 recipes with enhanced features
2025-09-20 16:28:13,422 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - ✅ Loaded 21439 recipe-ingredient relationships
2025-09-20 16:28:13,425 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - ✅ Loaded 2092 ingredients
2025-09-20 16:28:13,426 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - 📱 Extracting user interactions from events...
2025-09-20 16:28:13,426 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO -    Processing v1_events_20250827.json...
2025-09-20 16:28:13,71

(12438, 40) (4146, 40) (4146, 40)
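The pair-generation step above labels observed interactions as positives and adds negatives for each user. A minimal sketch of one common way to do this follows, assuming negatives are sampled uniformly from recipes the user never interacted with. The function name `build_labeled_pairs`, its parameters, and the toy IDs are hypothetical and are not taken from `HybridRecommendationDataBuilder`, whose actual sampling strategy may differ.

```python
import random


def build_labeled_pairs(interactions, all_recipe_ids, negatives_per_positive=1, seed=0):
    """Return (user_id, recipe_id, label) rows.

    interactions: dict mapping user_id -> set of recipe_ids the user engaged with.
    Label 1 marks an observed interaction; label 0 marks a sampled recipe
    with no recorded interaction for that user (an assumed negative).
    """
    rng = random.Random(seed)
    rows = []
    for user_id, seen in sorted(interactions.items()):
        positives = sorted(seen)
        # Candidate negatives: recipes this user has never interacted with.
        unseen = sorted(set(all_recipe_ids) - set(seen))
        rows.extend((user_id, recipe_id, 1) for recipe_id in positives)
        n_neg = min(len(unseen), negatives_per_positive * len(positives))
        rows.extend((user_id, recipe_id, 0) for recipe_id in rng.sample(unseen, n_neg))
    return rows


# Toy data (hypothetical IDs)
interactions = {"u1": {"r1", "r2"}, "u2": {"r3"}}
pairs = build_labeled_pairs(interactions, ["r1", "r2", "r3", "r4", "r5"])
for row in pairs:
    print(row)
```

The ratio of negatives to positives is a design choice: it sets the class balance the classifier sees and therefore affects the calibration of its scores.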


### Train, Evaluate, Save Model
We train `HybridGBMRecommender`, evaluate it on validation data with classification metrics (AUC, precision, recall, F1) and per-user ranking metrics (NDCG@k), and save the artifacts used by inference.


In [None]:
# Train and evaluate
recommender = HybridGBMRecommender(config)  # wraps LightGBM with config-driven hyperparams and tracked features

# Reads train/val/test CSVs prepared by the data builder
recommender.load_training_data()
# Ensures recipe-level features are available (used for context/eval)
recommender.load_recipe_features()

# Fits LightGBM (binary logloss) with early stopping on validation set
recommender.train_model()

# Reports AUC/Precision/Recall/F1 and per-user NDCG@k ranking metrics
recommender.evaluate_model()
# LightGBM feature importance by gain (total loss reduction from splits on each feature)
importance = recommender.get_feature_importance()
print("Top 10 features:\n", importance.head(10))

# Writes booster + metadata (feature columns, config, training stats) to model_dir
recommender.save_model()

print("Artifacts saved in:", config.model_dir)


2025-09-20 16:28:16,016 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 🚀 Initialized GBM Recommender with lightgbm
2025-09-20 16:28:16,017 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📊 Loading training data...
2025-09-20 16:28:16,148 - recipe_recommender.models.hybrid_gbm_recommender - INFO - ✅ Successfully loaded training data:
2025-09-20 16:28:16,149 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Train: 12,438 samples
2025-09-20 16:28:16,149 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Validation: 4,146 samples
2025-09-20 16:28:16,149 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Test: 4,146 samples
2025-09-20 16:28:16,150 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📋 Loaded 36 feature columns
2025-09-20 16:28:16,151 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📋 Loaded training metadata
2025-09-20 16:28:16,161 - recipe_recommender.models.hybrid_gbm_recommender

Training until validation scores don't improve for 50 rounds


2025-09-20 16:28:17,183 - recipe_recommender.models.hybrid_gbm_recommender - INFO - ✅ Model training completed!
2025-09-20 16:28:17,184 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📈 Evaluating model performance...


[100]	train's binary_logloss: 0.00201516	validation's binary_logloss: 0.029075
Early stopping, best iteration is:
[54]	train's binary_logloss: 0.00848836	validation's binary_logloss: 0.022781


2025-09-20 16:28:17,522 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📊 MODEL PERFORMANCE:
2025-09-20 16:28:17,522 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    AUC: 0.9992 (0.5=random, 1.0=perfect)
2025-09-20 16:28:17,522 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Precision: 0.9913
2025-09-20 16:28:17,523 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Recall: 0.9626
2025-09-20 16:28:17,523 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    F1-Score: 0.9767
2025-09-20 16:28:17,523 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    NDCG@5: 0.6193
2025-09-20 16:28:17,523 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    NDCG@10: 0.6221
2025-09-20 16:28:17,526 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 💾 Saving trained model...
2025-09-20 16:28:17,528 - recipe_recommender.models.hybrid_gbm_recommender - INFO - ✅ Model saved to: hybrid_lightgbm_model.txt
2025-09-

Top 10 features:
                          feature    importance
17         user_complexity_match  34483.877358
18  user_recipe_engagement_match  12766.716292
20                is_mobile_user    927.095745
13              ingredient_count    889.140933
11                      servings    496.852943
10                    total_time    441.104870
1                     avg_rating    349.497566
3                     rating_std    294.887384
19       user_time_compatibility    240.116808
15              complexity_score    148.442660
Artifacts saved in: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models
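The log above reports NDCG@5 and NDCG@10 alongside the classification metrics. As a rough illustration of how a per-user NDCG@k can be computed, here is a minimal sketch assuming binary relevance labels, the standard `1 / log2(rank + 1)` discount, and that users without any positive are skipped. The helper names are hypothetical, and `evaluate_model()` may use a different gain formula or different handling of edge cases.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (rank 1 has discount 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, labels, k):
    """NDCG@k for one user, or None when the user has no positive label."""
    # Order the labels by descending model score.
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal_dcg == 0:
        return None
    return dcg_at_k(ranked, k) / ideal_dcg


def mean_ndcg_per_user(rows, k):
    """Average NDCG@k over users; rows are (user_id, score, label) tuples."""
    by_user = {}
    for user_id, score, label in rows:
        scores, labels = by_user.setdefault(user_id, ([], []))
        scores.append(score)
        labels.append(label)
    values = [ndcg_at_k(s, l, k) for s, l in by_user.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else float("nan")


# Toy data: u1 is ranked perfectly, u2 has its positive in second place,
# and u3 has no positives and is skipped.
rows = [
    ("u1", 0.9, 1), ("u1", 0.1, 0),
    ("u2", 0.1, 1), ("u2", 0.9, 0),
    ("u3", 0.5, 0),
]
print(mean_ndcg_per_user(rows, k=5))
```

Because NDCG is computed within each user's candidate list, it can diverge from global metrics such as AUC, which pool all user–recipe pairs together.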


### Notes on Learning Task and Objective
- Input (train): user–recipe feature matrix with a binary label marking a positive interaction
- Output (deploy): one score per candidate recipe for a given user
- Learning objective: LightGBM binary logloss. Practical objective: top-N ranking quality (monitored with AUC, AP, NDCG)
- Alignment: logloss is optimized per user–recipe pair, while deployment ranks recipes within each user, so we tune thresholding/ranking and track ranking metrics to reflect deployment goals
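To make the gap between the learning objective and the practical objective concrete, the sketch below computes binary logloss for two hypothetical sets of predicted probabilities that order the recipes identically. The functions `binary_logloss` and `top_n` and all values are illustrative assumptions, not part of the production code.

```python
import math


def binary_logloss(labels, probs, eps=1e-15):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def top_n(recipe_ids, probs, n):
    """Return the n recipe IDs with the highest predicted probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [recipe_ids[i] for i in order[:n]]


# Hypothetical candidates for one user
recipe_ids = ["r1", "r2", "r3", "r4"]
labels = [1, 0, 1, 0]
confident = [0.90, 0.10, 0.80, 0.20]
timid = [0.60, 0.40, 0.55, 0.45]  # same ordering, less extreme probabilities

print("top-2 (confident):", top_n(recipe_ids, confident, 2))
print("top-2 (timid):    ", top_n(recipe_ids, timid, 2))
print("logloss (confident):", binary_logloss(labels, confident))
print("logloss (timid):    ", binary_logloss(labels, timid))
```

Both score sets produce the same top-N list, yet their logloss differs. This is why a ranking metric is tracked next to the training loss: a lower logloss does not by itself guarantee a better ordering.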


### Smoke Test: Saved Artifacts
Verify that the trained model and metadata were written to `config.model_dir`.


In [5]:
from pathlib import Path

model_dir = Path(config.model_dir)  # works whether the config stores a str or a Path
model_file = model_dir / "hybrid_lightgbm_model.txt"
meta_file = model_dir / "hybrid_lightgbm_metadata.json"

print("Model exists:", model_file.exists(), model_file)
print("Metadata exists:", meta_file.exists(), meta_file)


Model exists: True /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models/hybrid_lightgbm_model.txt
Metadata exists: True /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models/hybrid_lightgbm_metadata.json
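Beyond checking that the files exist, a smoke test can confirm that the feature list stored in the metadata matches the columns available at inference time. The sketch below assumes the metadata JSON keeps the feature names under a `feature_columns` key. That key name and the helper `missing_features` are hypothetical, so inspect `hybrid_lightgbm_metadata.json` for the real layout. A temporary stand-in file is used so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def missing_features(metadata_path, available_columns, key="feature_columns"):
    """Return feature names listed in the metadata but absent from available_columns.

    The default key name is an assumption about the metadata layout.
    """
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    available = set(available_columns)
    return [name for name in metadata[key] if name not in available]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in metadata file; the real one is written by recommender.save_model().
    demo_meta = Path(tmp) / "demo_metadata.json"
    demo_meta.write_text(
        json.dumps({"feature_columns": ["avg_rating", "total_time", "servings"]}),
        encoding="utf-8",
    )
    missing = missing_features(demo_meta, ["avg_rating", "servings"])

print("Missing features:", missing)
```

An empty result means every feature the model was trained on is present in the inference frame.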
