## Text Feature Encodings (optional)

When enabled, this version adds numeric encodings for selected text fields so that LightGBM can consume them as features.

- Turn on via `config.text_encoding.enable_text_features = True`
- Defaults (simple, low‑dependency):
  - `author_id`: frequency (or target) encoding
  - `tags`: top‑K multi‑hot (or hashing)
  - `recipe_name`: word n‑gram hashing
  - `description`/`instruction`: disabled by default
- Encoders are fit on the training split only, then applied to all splits and at inference.
- Persisted artifacts (when applicable) are saved to `output/hybrid_models/text_vectorizers/`.
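To make the fit-on-training-only rule concrete, here is a minimal sketch of frequency encoding and top-K multi-hot encoding in plain Python. The function names and toy data are hypothetical and are not taken from the `recipe_recommender` package; the actual encoders may handle unseen values differently.

```python
from collections import Counter

def fit_frequency_encoder(train_values):
    """Learn relative frequencies from the training split only."""
    counts = Counter(train_values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def apply_frequency_encoder(mapping, values, default=0.0):
    """Encode values; categories unseen during fitting fall back to `default`."""
    return [mapping.get(value, default) for value in values]

def fit_top_k_vocab(train_tag_lists, k):
    """Keep the k most frequent tags seen in the training split."""
    counts = Counter(tag for tags in train_tag_lists for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]

def multi_hot(vocab, tags):
    """One 0/1 column per vocabulary tag; tags outside the vocabulary are ignored."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocab]

# Toy data, made up for illustration
train_authors = ["a1", "a1", "a2", "a3"]
val_authors = ["a1", "a9"]  # "a9" never appears in the training split

author_freq = fit_frequency_encoder(train_authors)
encoded_val = apply_frequency_encoder(author_freq, val_authors)

train_tags = [["vegan", "quick"], ["vegan"], ["dessert", "quick"], ["vegan"]]
tag_vocab = fit_top_k_vocab(train_tags, k=2)
encoded_tags = multi_hot(tag_vocab, ["quick", "spicy"])
```

Because the mapping and vocabulary are learned once from the training split, validation, test, and inference rows are encoded with the same columns and no information from held-out data leaks into the features.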

Example:
```python
from recipe_recommender.config import get_ml_config

config = get_ml_config()
config.text_encoding.enable_text_features = True
config.text_encoding.tags_encoding = "topk_multi_hot"
config.text_encoding.tags_top_k = 50
config.text_encoding.author_id_encoding = "freq"  # or "target"
config.text_encoding.name_encoding = "hashing"
# description/instruction remain disabled by default
```
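For intuition about the `"hashing"` option, the sketch below shows one way word n-gram hashing can map a recipe name to a fixed-length count vector. It is an illustration under assumptions: the helper names, the bucket count, and the choice of `zlib.crc32` as the hash are mine, and the package may use a different hash function or n-gram range.

```python
import zlib

def word_ngrams(text, n_max=2):
    """Lowercased word n-grams for n = 1..n_max, in order of appearance."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def hash_ngrams(text, n_buckets=16, n_max=2):
    """Count n-grams into a fixed number of buckets chosen by a stable hash."""
    vector = [0] * n_buckets
    for gram in word_ngrams(text, n_max):
        # crc32 is stable across processes, unlike Python's built-in hash() for str
        bucket = zlib.crc32(gram.encode("utf-8")) % n_buckets
        vector[bucket] += 1
    return vector

name_vector = hash_ngrams("Spicy Chicken Curry")
```

Hashing needs no fitted vocabulary, so there is nothing to persist for it; the trade-off is that distinct n-grams can collide in the same bucket.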

Outputs:
- `output/enhanced_recipe_features_encoded.csv` (for inference loading)
- Updated `hybrid_feature_columns.txt` and `hybrid_training_metadata.json` with text feature info.



# PantryPalML: Production Training Notebook

This notebook demonstrates how I build and train the actual production model used by `ProductionRecipeScorer`.

It reuses our production modules via imports (no code duplication) and shows:
- Environment setup (Colab-friendly)
- Data preparation using `HybridRecommendationDataBuilder`
- Model training, evaluation, and saving via `HybridGBMRecommender`
- Artifacts produced for inference (model + metadata)
- Brief discussion of task, loss, metrics, and practical objective alignment


In [1]:
# Colab/local environment setup (install and clone steps are skipped when running locally)
import sys, subprocess, os, pathlib

IN_COLAB = "google.colab" in sys.modules
repo_root = pathlib.Path.cwd()

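if IN_COLAB:
    # Install dependencies; a failed install is reported but does not abort the notebook
    result = subprocess.run([sys.executable, "-m", "pip", "install", "-q",
                             "lightgbm", "pandas", "numpy", "scikit-learn", "matplotlib", "seaborn"],
                            check=False)
    if result.returncode != 0:
        print(f"pip install warning: exit code {result.returncode}")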

    if not (repo_root / "recipe_recommender").exists():
        subprocess.run(["git", "clone", "-q", "https://github.com/marcel-qayoom-taylor/PantryPalML.git"], check=True)
        os.chdir("PantryPalML")
        repo_root = pathlib.Path.cwd()

print(f"Environment ready. Project root: {repo_root}")


Environment ready. Project root: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/notebooks


In [2]:
# Imports from production codebase
from pathlib import Path
from recipe_recommender.config import get_ml_config
from recipe_recommender.models.hybrid_recommendation_data_builder import HybridRecommendationDataBuilder
from recipe_recommender.models.hybrid_gbm_recommender import HybridGBMRecommender

# Central config object (paths, hyperparams, event weights)
config = get_ml_config()
print("Config paths:")
print(" - output_dir:", config.output_dir)
print(" - input_dir:", config.input_dir)
print(" - model_dir:", config.model_dir)


Config paths:
 - output_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output
 - input_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/input
 - model_dir: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models


### Build Training Data
I create ML-ready datasets from our real event logs + recipe DB extracts using `HybridRecommendationDataBuilder`.


In [3]:
# Build datasets: the builder orchestrates loading and feature engineering
builder = HybridRecommendationDataBuilder(config)

# Reads recipe input data
ok_recipes = builder.load_real_recipe_data()
# Reads user interaction history
ok_events = builder.extract_user_interactions_from_events()

if not (ok_recipes and ok_events):
    raise RuntimeError("Missing required data files. Ensure recipe and event outputs exist in recipe_recommender/output.")

# Aggregates per-user stats (avg/total rating, activity, device/platform, engagement)
user_profiles = builder.create_user_profiles()
# Generates positive/negative user–recipe pairs with labels
training_pairs = builder.create_user_recipe_pairs()
# Final feature matrix + train/val/test CSVs and metadata/feature list
train_df, val_df, test_df = builder.prepare_training_data()

print(train_df.shape, val_df.shape, test_df.shape)


2025-09-20 20:30:36,861 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - 🏗️ Initialized Hybrid Recommendation Data Builder
2025-09-20 20:30:36,862 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - 📊 Loading real recipe database...
2025-09-20 20:30:36,878 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - ✅ Loaded 1967 recipes with enhanced features
2025-09-20 20:30:36,883 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - ✅ Loaded 21439 recipe-ingredient relationships
2025-09-20 20:30:36,885 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - ✅ Loaded 2092 ingredients
2025-09-20 20:30:36,886 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO - 📱 Extracting user interactions from events...
2025-09-20 20:30:36,886 - recipe_recommender.models.hybrid_recommendation_data_builder - INFO -    Processing v1_events_20250827.json...
2025-09-20 20:30:37,151 - recipe_recomm

(12591, 40) (4197, 40) (4197, 40)
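The pair-generation step above produces labelled positive and negative user–recipe pairs. The notebook does not show how negatives are chosen, so the following is only a sketch of one common approach, uniform negative sampling from recipes a user has not interacted with. `build_pairs` and its parameters are hypothetical and do not describe `create_user_recipe_pairs()`.

```python
import random

def build_pairs(positives, all_recipe_ids, negatives_per_positive=1, seed=42):
    """positives: dict mapping user_id -> set of recipe_ids the user interacted with.

    Returns (user_id, recipe_id, label) tuples with label 1 for observed
    interactions and 0 for sampled non-interactions.
    """
    rng = random.Random(seed)  # seeded so the sampled negatives are reproducible
    pairs = []
    for user_id, liked in sorted(positives.items()):
        for recipe_id in sorted(liked):
            pairs.append((user_id, recipe_id, 1))
        # Negatives are drawn only from recipes this user never interacted with
        candidates = sorted(set(all_recipe_ids) - liked)
        n_neg = min(len(candidates), negatives_per_positive * len(liked))
        for recipe_id in rng.sample(candidates, n_neg):
            pairs.append((user_id, recipe_id, 0))
    return pairs

# Toy data, made up for illustration
positives = {"u1": {"r1", "r2"}, "u2": {"r3"}}
pairs = build_pairs(positives, ["r1", "r2", "r3", "r4", "r5"])
```

The ratio of negatives to positives is a design choice: it changes the class balance the model sees and therefore how its scores should be interpreted.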


### Train, Evaluate, Save Model
I train `HybridGBMRecommender`, evaluate on validation data with appropriate metrics, and save artifacts used by inference.


In [4]:
# Train and evaluate
recommender = HybridGBMRecommender(config)  # wraps LightGBM with config-driven hyperparams and tracked features

# Reads train/val/test CSVs prepared by the data builder
recommender.load_training_data()
# Ensures recipe-level features are available (used for context/eval)
recommender.load_recipe_features()

# Fits LightGBM (binary logloss) with early stopping on validation set
recommender.train_model()

# Reports AUC/Precision/Recall/F1 and per-user NDCG@k ranking metrics
recommender.evaluate_model()
# LightGBM feature importance by gain (total loss reduction from all splits on each feature)
importance = recommender.get_feature_importance()
print("Top 10 features:\n", importance.head(10))

# Writes booster + metadata (feature columns, config, training stats) to model_dir
recommender.save_model()

print("Artifacts saved in:", config.model_dir)


2025-09-20 20:30:39,636 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 🚀 Initialized GBM Recommender with lightgbm
2025-09-20 20:30:39,637 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📊 Loading training data...
2025-09-20 20:30:39,755 - recipe_recommender.models.hybrid_gbm_recommender - INFO - ✅ Successfully loaded training data:
2025-09-20 20:30:39,755 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Train: 12,591 samples
2025-09-20 20:30:39,755 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Validation: 4,197 samples
2025-09-20 20:30:39,755 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Test: 4,197 samples
2025-09-20 20:30:39,756 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📋 Loaded 36 feature columns
2025-09-20 20:30:39,757 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📋 Loaded training metadata
2025-09-20 20:30:39,767 - recipe_recommender.models.hybrid_gbm_recommender - INFO - ✅ Lo

Training until validation scores don't improve for 50 rounds


2025-09-20 20:30:40,616 - recipe_recommender.models.hybrid_gbm_recommender - INFO - ✅ Model training completed!
2025-09-20 20:30:40,617 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📈 Evaluating model performance...


[100]	train's binary_logloss: 0.00384467	validation's binary_logloss: 0.0351023
Early stopping, best iteration is:
[63]	train's binary_logloss: 0.00903672	validation's binary_logloss: 0.0303626


2025-09-20 20:30:40,963 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 📊 MODEL PERFORMANCE:
2025-09-20 20:30:40,964 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    AUC: 0.9987 (0.5=random, 1.0=perfect)
2025-09-20 20:30:40,964 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Precision: 0.9926
2025-09-20 20:30:40,964 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    Recall: 0.9583
2025-09-20 20:30:40,964 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    F1-Score: 0.9751
2025-09-20 20:30:40,965 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    NDCG@5: 0.6383
2025-09-20 20:30:40,965 - recipe_recommender.models.hybrid_gbm_recommender - INFO -    NDCG@10: 0.6433
2025-09-20 20:30:40,968 - recipe_recommender.models.hybrid_gbm_recommender - INFO - 💾 Saving trained model...
2025-09-20 20:30:40,970 - recipe_recommender.models.hybrid_gbm_recommender - INFO - ✅ Model saved to: hybrid_lightgbm_model.txt
2025-09-20 20:30

Top 10 features:
                          feature    importance
17         user_complexity_match  36522.389740
18  user_recipe_engagement_match  10890.626203
20                is_mobile_user   1135.670645
13              ingredient_count    920.622285
10                    total_time    653.630570
11                      servings    443.509857
1                     avg_rating    410.598063
19       user_time_compatibility    314.900758
3                     rating_std    314.642685
21                   is_ios_user    236.938262
Artifacts saved in: /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models


### Notes on Learning Task and Objective
- Input (train): user–recipe feature matrix with label indicating positive interaction
- Output (deploy): score per recipe for a given user
- Learning objective: LightGBM binary logloss; practical objective: top-N ranking quality (monitored with AUC, AP, NDCG)
- Alignment: the model is trained on a classification loss but deployed as a ranker, so I tune thresholding/ranking and measure ranking metrics to reflect deployment goals
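Since the practical objective is per-user ranking quality, it helps to see how NDCG@k can be computed from model scores. The sketch below uses one common definition with binary relevance labels and averages over users. It is not the code behind `evaluate_model()`, which may differ, for example in how it treats users with no positives (scored as 0.0 here by assumption).

```python
import math

def dcg_at_k(labels, k):
    """Discounted cumulative gain over the first k labels in ranked order."""
    return sum(label / math.log2(rank + 2) for rank, label in enumerate(labels[:k]))

def ndcg_at_k(scores, labels, k):
    """NDCG@k for one user: DCG of the model's ranking divided by the ideal DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    # Assumption: a user with no positive labels contributes 0.0
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def mean_ndcg_at_k(per_user, k):
    """Average NDCG@k across users; per_user maps user_id -> (scores, labels)."""
    values = [ndcg_at_k(scores, labels, k) for scores, labels in per_user.values()]
    return sum(values) / len(values)

# Toy data, made up for illustration
per_user = {
    "u1": ([0.9, 0.2, 0.7], [1, 0, 0]),  # the positive recipe is ranked first
    "u2": ([0.1, 0.8, 0.3], [1, 0, 0]),  # the positive recipe is ranked last
}
mean_ndcg = mean_ndcg_at_k(per_user, k=3)
```

Unlike AUC, which pools all pairs, this metric is computed within each user's candidate list, so it responds directly to where that user's positives land near the top.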


### Smoke Test: Saved Artifacts
Verify that the trained model and metadata were written to `config.model_dir`.


In [5]:
from pathlib import Path

# Wrap in Path so the checks work whether model_dir is a str or a Path
model_dir = Path(config.model_dir)
model_file = model_dir / "hybrid_lightgbm_model.txt"
meta_file = model_dir / "hybrid_lightgbm_metadata.json"

print("Model exists:", model_file.exists(), model_file)
print("Metadata exists:", meta_file.exists(), meta_file)


Model exists: True /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models/hybrid_lightgbm_model.txt
Metadata exists: True /Users/marcelqayoomtaylor/Documents/GitHub/PantryPalML/recipe_recommender/output/hybrid_models/hybrid_lightgbm_metadata.json
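An existence check does not catch empty or corrupt files. As a slightly stricter variant, the sketch below also verifies that each artifact is non-empty and that the metadata parses as JSON. `check_artifacts` is a hypothetical helper, not part of the production modules, and the demo runs against a temporary directory rather than the real `config.model_dir`.

```python
import json
import tempfile
from pathlib import Path

ARTIFACT_NAMES = ("hybrid_lightgbm_model.txt", "hybrid_lightgbm_metadata.json")

def check_artifacts(model_dir, names=ARTIFACT_NAMES):
    """Return {filename: problem} for missing, empty, or unparseable artifacts."""
    problems = {}
    for name in names:
        path = Path(model_dir) / name
        if not path.is_file():
            problems[name] = "missing"
        elif path.stat().st_size == 0:
            problems[name] = "empty"
        elif path.suffix == ".json":
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError:
                problems[name] = "invalid JSON"
    return problems

# Demo with placeholder files: the model file is fine, the metadata is corrupt
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hybrid_lightgbm_model.txt").write_text("placeholder", encoding="utf-8")
    (Path(tmp) / "hybrid_lightgbm_metadata.json").write_text("{not json", encoding="utf-8")
    demo_problems = check_artifacts(tmp)
```

In the notebook this would be called as `check_artifacts(config.model_dir)`, with an empty dict meaning both files passed.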
