In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ast
from collections import Counter, defaultdict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import random

from tqdm import tqdm
import time

In [2]:
# Load interaction splits
train_df = pd.read_csv("dataset/interactions_train.csv")
val_df = pd.read_csv("dataset/interactions_validation.csv")
test_df = pd.read_csv("dataset/interactions_test.csv")

# Optional: load raw metadata for future use
raw_recipes = pd.read_csv("dataset/RAW_recipes.csv")
raw_interactions = pd.read_csv("dataset/RAW_interactions.csv")

# Preprocessed (tokenized) user and recipe tables (not needed yet)
pp_users = pd.read_csv("dataset/PP_users.csv")
pp_recipes = pd.read_csv("dataset/PP_recipes.csv")

# Exploratory Data Analysis

## Interaction Data

In [None]:
train_df.head()

In [None]:
train_df.info()
train_df.describe()

### Distribution of Recipe Ratings

In [None]:
plt.figure(figsize=(6,4))
# One bar per integer rating (0-5), centered on the rating value
train_df['rating'].hist(bins=np.arange(-0.5, 6, 1), edgecolor='black')
plt.title("Distribution of Recipe Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

The distribution of ratings in the dataset is heavily skewed toward the positive end of the scale. Ratings of 4 and 5 stars dominate the dataset, with 5-star ratings forming a noticeable spike. This suggests that users tend to leave high ratings for recipes they try, which is consistent with user behavior on recipe platforms—people often cook recipes they expect to enjoy and are more motivated to provide feedback when the outcome is favorable.

Interestingly, while ratings of 1 and 2 stars are rare, 0-star ratings are noticeably more frequent than either. In fact, their frequency is closer to that of 3-star ratings than to the rest of the low end of the scale. One reading is that users assign a 0 deliberately to express strong dissatisfaction instead of using 1 or 2. Another is that a 0 marks a review submitted without a star rating, in which case it is a missing value and not a score. The histogram alone cannot distinguish the two. Either way, the distribution rises fairly smoothly from 1 to 5 but has two modes: a large concentration at 5 stars and a smaller, unnatural bump at 0.

This behavior has important implications for model design:

- The dominance of high ratings means that predicting the mean rating is not a useful baseline—a trivial model could appear accurate without making meaningful recommendations.

- Because the data is not evenly distributed across the rating scale, metrics like MSE can be misleading.

- Instead, ranking-based evaluation metrics (e.g., Precision@K, Recall@K, MRR) are more appropriate for this dataset, as they focus on how well a model orders preferred recipes rather than how close it gets to numeric ratings.

Overall, this positive and somewhat polarized rating behavior provides a strong motivation for using ranking-based recommender models rather than simple rating prediction approaches.
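
To make the point about mean-rating baselines concrete, here is a minimal sketch on hypothetical toy ratings (the counts below are invented to resemble the skew in the histogram, not taken from the dataset). It compares the MSE of a constant predictor on skewed versus evenly spread ratings, and shows that the constant predictor produces a single score for every recipe.

```python
def constant_mse(values):
    """MSE of a model that always predicts the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical toy ratings, skewed toward 5 stars
skewed = [5] * 70 + [4] * 18 + [3] * 5 + [2] * 1 + [1] * 1 + [0] * 5
# Same scale, but evenly spread across 0-5
uniform = [0, 1, 2, 3, 4, 5] * 10

mean_rating = sum(skewed) / len(skewed)
mse_skewed = constant_mse(skewed)
mse_uniform = constant_mse(uniform)

# The constant model gives every recipe the same score, so it cannot
# rank one recipe above another.
distinct_scores = len({mean_rating for _ in skewed})

print(f"mean rating          : {mean_rating:.2f}")
print(f"MSE, skewed ratings  : {mse_skewed:.2f}")
print(f"MSE, uniform ratings : {mse_uniform:.2f}")
print(f"distinct scores      : {distinct_scores}")
```

The skew alone lowers the error of the trivial model, even though it carries no information about which recipe a user would prefer.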

### Distribution of Ratings per User

In [None]:
user_counts = train_df.groupby('u').size()

plt.figure(figsize=(6,4))
user_counts.hist(bins=50, log=True, edgecolor='black')
plt.title("Number of Ratings per User (log scale)")
plt.xlabel("Ratings per User")
plt.ylabel("User Count (log)")
plt.show()

user_counts.describe()

The distribution of user activity displays an extreme long-tail pattern. While a small number of users have rated hundreds or even thousands of recipes, the vast majority have contributed only a handful of ratings. This is typical of user-generated content platforms: most users interact casually, whereas a small subset of so-called power users engage extensively.

Because the y-axis is plotted on a logarithmic scale, we can clearly see the steep drop-off in participation. The dataset contains many users who have rated fewer than 10 recipes, and progressively fewer users as the number of ratings increases. Only a tiny fraction exceed 1,000 ratings, and some appear to rate over 6,000 recipes, indicating extremely active contributors.

This imbalance has meaningful implications for model training:

- Sparse user histories make it difficult for the model to learn stable preference patterns for many users.

- Power users may disproportionately influence learned embeddings if the model is not properly regularized.

- Techniques such as latent factor models, which generalize user preferences through shared structure, are well-suited to handle this kind of interaction sparsity.

Overall, the user activity distribution underscores the importance of collaborative approaches that leverage patterns across users, rather than relying solely on individual histories.
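
The two quantities discussed above (how many users are casual, and how much of the data comes from power users) can be measured directly. The sketch below uses a hypothetical toy interaction list and plain Python; the user names, the threshold of 3 ratings, and the data are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical (user, recipe) interactions: one heavy user, many light ones
toy_interactions = [("power", f"r{n}") for n in range(50)]
toy_interactions += [(f"u{n}", f"r{n % 7}") for n in range(40)]
toy_interactions += [(f"u{n}", f"r{n % 5 + 10}") for n in range(10)]

ratings_per_user = Counter(user for user, _ in toy_interactions)

# Share of users with fewer than 3 ratings (threshold chosen arbitrarily)
light_users = sum(1 for count in ratings_per_user.values() if count < 3)
light_share = light_users / len(ratings_per_user)

# Share of all interactions contributed by the single most active user
top_user, top_count = ratings_per_user.most_common(1)[0]
top_share = top_count / len(toy_interactions)

print(f"users: {len(ratings_per_user)}, light users: {light_share:.1%}")
print(f"most active user '{top_user}' supplies {top_share:.1%} of interactions")
```

On the real data, the same counts could be derived from `user_counts` computed in the cell above.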

### Distribution of Ratings per Recipe

In [None]:
item_counts = train_df.groupby('i').size()

plt.figure(figsize=(6,4))
item_counts.hist(bins=50, log=True, edgecolor='black')
plt.title("Number of Ratings per Recipe (log scale)")
plt.xlabel("Ratings per Recipe")
plt.ylabel("Recipe Count (log)")
plt.show()

item_counts.describe()

The distribution of recipe popularity shows an even stronger long-tail effect than user activity. Most recipes receive very few ratings, while a small number accumulate hundreds or even over 1,000 ratings. Once again, the use of a logarithmic scale reveals this imbalance clearly: a large proportion of recipes appear only once or twice in the dataset, whereas only a handful receive widespread attention.

This structure is typical of domains with expansive item catalogs, where new or niche items receive little engagement. For recommender systems, this creates a challenge known as the item cold-start problem: many recipes lack enough historical data for the model to learn their characteristics well.

From a modeling standpoint, this motivates the use of techniques that incorporate additional recipe features—such as ingredients, cooking time, tags, or nutritional attributes—to supplement sparse interaction histories. Later extensions to collaborative filtering or factorization models can leverage this metadata to improve recommendations for less frequently rated recipes.

In summary, the long-tail nature of recipe ratings demonstrates both the difficulty and necessity of a recommendation model capable of generalizing from limited item-specific data.

### Sparsity of the Interaction Matrix

In [None]:
n_users = train_df['u'].nunique()
n_items = train_df['i'].nunique()
n_interactions = len(train_df)

sparsity = 1 - n_interactions / (n_users * n_items)
sparsity

The combination of a large number of users, a large number of recipes, and relatively few observed interactions produces an interaction matrix that is extremely sparse. In other words, the vast majority of possible user–recipe pairs have no recorded rating. High sparsity is characteristic of recommendation datasets and directly motivates the use of models such as collaborative filtering and matrix factorization, which exploit shared patterns across users and items to predict missing entries.

## Metadata

In [None]:
raw_recipes.head()

In [None]:
raw_recipes.info()
raw_recipes.describe()

### Distribution of Ingredient Counts

In [None]:
plt.figure(figsize=(6,4))
raw_recipes['n_ingredients'].hist(bins=30, edgecolor='black')
plt.title("Distribution of Ingredient Counts")
plt.xlabel("Number of Ingredients")
plt.ylabel("Recipe Count")
plt.show()

raw_recipes['n_ingredients'].describe()

The number of ingredients per recipe follows a moderately right-skewed distribution. Most recipes require a manageable set of ingredients, with the majority falling between 6 and 11 ingredients. The median of 9 ingredients aligns with this visual impression, suggesting that typical recipes in the dataset are neither overly simple nor excessively complex.

Recipes with fewer than 5 ingredients are relatively uncommon, indicating that most dishes involve a moderate combination of components rather than minimalist, two- or three-ingredient preparations. At the other end of the spectrum, a small number of recipes contain 20 or more ingredients, with the maximum reaching 43, representing highly elaborate dishes that demand substantial preparation and planning.

This distribution provides several useful insights for recommendation strategies:

- The ingredient count reflects a dimension of recipe complexity that may influence user preference. Some users may gravitate toward simple, quick recipes, while others enjoy more involved cooking projects.

- The presence of many recipes clustered around similar ingredient counts suggests room for latent structure in modeling item similarity.

- Given the long tail of high-ingredient recipes, integrating ingredient-based metadata could be valuable for improving recommendations in cold-start contexts for less frequently rated recipes.

Overall, the ingredient counts provide a quantitative lens into recipe complexity and reinforce the potential importance of metadata in augmenting collaborative filtering models.

### Distribution of Cooking Times

In [None]:
valid_minutes = raw_recipes['minutes']
valid_minutes = valid_minutes[valid_minutes > 0]  # remove zero-minute anomalies

MAX_REASONABLE_MINUTES = 7 * 24 * 60  # 7 days
valid_minutes = valid_minutes[valid_minutes <= MAX_REASONABLE_MINUTES]

plt.figure(figsize=(6,4))
plt.hist(valid_minutes, bins=100, edgecolor='black', log=True)
plt.title("Distribution of Cooking Times (log scale, 7-day cutoff)")
plt.xlabel("Minutes")
plt.ylabel("Recipe Count (log)")
plt.show()

valid_minutes.describe()

In [None]:
valid_minutes = raw_recipes['minutes']
valid_minutes = valid_minutes[valid_minutes > 0]  # remove zero-minute anomalies

MAX_REASONABLE_MINUTES = 24 * 60  # 1 day
valid_minutes = valid_minutes[valid_minutes <= MAX_REASONABLE_MINUTES]

plt.figure(figsize=(6,4))
plt.hist(valid_minutes, bins=100, edgecolor='black', log=True)
plt.title("Distribution of Cooking Times (log scale, 1-day cutoff)")
plt.xlabel("Minutes")
plt.ylabel("Recipe Count (log)")
plt.show()

valid_minutes.describe()

The distribution of cooking times exhibits a clear long-tailed structure, with most recipes requiring under an hour to prepare. The median cooking time is 40 minutes, and 75% of recipes complete within 65 minutes, indicating that the dataset is dominated by everyday dishes rather than lengthy, multi-day preparations.

However, the histogram also reveals a striking pattern: distinct spikes at regular time intervals such as 20, 40, 60, 120, 1440, and 2880 minutes. These peaks do not reflect natural variation in cooking durations; instead, they suggest that Food.com's interface offers preset time categories rather than free-form numeric entry. Recipe creators likely selected from options such as "30 minutes or less," "1–2 hours," or "overnight," which the dataset subsequently converted into exact minute values.

This discretization is important for later modeling considerations:

- Cooking time should not be treated as a precise continuous variable

- It is best interpreted as a coarse indicator of recipe effort level

- If used as a feature, it may require binning or embedding, not raw numeric scaling

Overall, the cooking time distribution reinforces the idea that recipe metadata contains interpretable structure, but also highlights the importance of understanding the data collection process behind it.
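
One way to quantify the spikes is to measure how often values land on round numbers. The sketch below runs on a hypothetical list of cooking times (not values from `RAW_recipes.csv`). A high share of round values indicates heaping, although it cannot by itself tell preset interface options apart from authors rounding their own estimates.

```python
from collections import Counter

# Hypothetical cooking times in minutes, invented for illustration
toy_minutes = [20, 25, 30, 30, 35, 40, 40, 45, 47, 60, 60, 60, 65, 72, 90, 120, 1440]

# Fraction of recipes whose time is a multiple of 5 minutes.
# If times were spread evenly over whole minutes, roughly 1 in 5 would be.
round_share = sum(1 for m in toy_minutes if m % 5 == 0) / len(toy_minutes)

# Most frequent exact values, analogous to value_counts().head(3) in pandas
most_common_values = Counter(toy_minutes).most_common(3)

print(f"share of multiples of 5: {round_share:.1%}")
print("most common values:", most_common_values)
```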

In [None]:
# value_counts() already sorts by frequency, most common first
value_counts = valid_minutes.value_counts()
value_counts.head(10)

In [None]:
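# Rebuild the 7-day series: the previous cells left `valid_minutes` at the 1-day cutoff,
# which would leave the multi-day bins empty or nearly so
minutes_7d = raw_recipes['minutes']
minutes_7d = minutes_7d[(minutes_7d > 0) & (minutes_7d <= 7 * 24 * 60)]

# Left-closed bins, e.g. [30, 60); the last edge is 10081 so that exactly 7 days is kept
bins = [0, 15, 30, 60, 120, 240, 720, 1440, 2880, 10081]
labels = [
    "< 15 min",
    "15-30 min",
    "30-60 min",
    "1-2 hours",
    "2-4 hours",
    "4-12 hours",
    "12-24 hours",
    "1-2 days",
    "2-7 days"
]

binned = pd.cut(minutes_7d, bins=bins, labels=labels, right=False)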
binned.value_counts().sort_index().plot(kind='bar', figsize=(8,4), edgecolor='black')

plt.title("Grouped Cooking Time Categories")
plt.xlabel("Estimated Time Category")
plt.ylabel("Recipe Count")
plt.show()

Grouping the values this way makes the concentration of recipes in a few time ranges easier to see. That is consistent with the minutes field encoding rounded or preset selections from the original interface instead of precisely measured durations. The bin edges are our own choice, however, so this plot supports that reading without proving it. Cooking time should therefore be interpreted as a coarse indicator of effort level, not an exact numeric measurement, and treated accordingly in any downstream modeling.

### Most Common Recipe Tags

In [None]:
# Convert stringified lists to actual Python lists
raw_tags = raw_recipes['tags'].dropna().apply(ast.literal_eval)

# Flatten the tag lists and count frequencies
tag_counter = Counter(tag for tags in raw_tags for tag in tags)

top_50 = dict(tag_counter.most_common(50))

plt.figure(figsize=(12,6))
plt.bar(top_50.keys(), top_50.values(), edgecolor='black')
plt.xticks(rotation=90, ha='right')
plt.title("Top 50 Most Common Recipe Tags")
plt.ylabel("Count")
plt.xlabel("Tag")
plt.tight_layout()
plt.show()

The top 50 tags reveal that Food.com recipes are annotated with a structured and hierarchical tagging system rather than informal, user-generated keywords. Broad organizational tags such as *preparation*, *time-to-make*, and *course* dominate the dataset, reflecting how the platform categorizes recipes before more specific descriptors like ingredients, cuisines, or dietary restrictions are applied. Time-based tags (e.g., *30-minutes-or-less*, *60-minutes-or-less*) and complexity indicators (e.g., *easy*, *3-steps-or-less*) further reinforce the idea that effort level is central to how users search for and evaluate recipes. Together, these tags provide rich semantic information that can enhance recommendations—especially for new recipes with few ratings—by linking items through shared characteristics.
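
As a rough illustration of linking items through shared tags, the sketch below computes Jaccard overlap between tag sets. The recipe names and tag sets are hypothetical; in the notebook they would come from parsing the `tags` column with `ast.literal_eval`. This is one possible content-based similarity, not a method used elsewhere in this notebook.

```python
# Hypothetical tag sets, invented for illustration
toy_recipe_tags = {
    "weeknight chili":  {"main-dish", "60-minutes-or-less", "easy", "beans"},
    "quick bean soup":  {"main-dish", "30-minutes-or-less", "easy", "beans"},
    "layered tiramisu": {"desserts", "4-hours-or-less", "italian"},
}

def tag_jaccard(tags_a, tags_b):
    """Jaccard overlap between two tag sets (0.0 if both are empty)."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0

def most_similar(name, recipe_tags):
    """Rank the other recipes by tag overlap with `name`, best first."""
    others = [(other, tag_jaccard(recipe_tags[name], tags))
              for other, tags in recipe_tags.items() if other != name]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

neighbours = most_similar("weeknight chili", toy_recipe_tags)
for other, similarity in neighbours:
    print(f"{other}: {similarity:.2f}")
```

Because the similarity needs no ratings at all, it can be computed for a recipe the moment it is added to the catalog.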

### Ratings Over Time

In [None]:
interactions = train_df.copy()
interactions['date'] = pd.to_datetime(interactions['date'], errors='coerce')

# Aggregate counts by month
ratings_per_month = interactions.groupby(interactions['date'].dt.to_period('M')).size()

plt.figure(figsize=(10,4))
ratings_per_month.plot()
plt.title("Number of Ratings Over Time")
plt.xlabel("Date")
plt.ylabel("Ratings Count")
plt.tight_layout()
plt.show()

ratings_per_month.tail()

The number of ratings submitted to Food.com increased steadily in the early 2000s, surging sharply between 2006 and 2009 before peaking at over 10,000 ratings per month. After 2009, rating activity declined, likely due to users migrating toward newer recipe platforms and social media–driven food communities. This temporal pattern indicates that user engagement is not static—preferences, recipe trends, and platform usage all evolve over time. For a recommender system, this raises the possibility that a time-aware model could outperform a static one by accounting for changes in recipe popularity or shifts in user interests. While our initial models will treat interactions as time-independent, this trend highlights a clear path for future work: incorporating temporal dynamics to personalize recommendations based on when users interact, not just what they interact with.
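
One simple way a time-aware model could use interaction dates is to down-weight older interactions. The sketch below shows exponential decay with a hypothetical one-year half-life and a hypothetical reference date. It illustrates the idea only and is not part of the models built in this notebook.

```python
from datetime import date

def decay_weight(interaction_date, reference_date, half_life_days=365.0):
    """Exponential decay: an interaction one half-life old gets weight 0.5."""
    age_days = (reference_date - interaction_date).days
    return 0.5 ** (max(age_days, 0) / half_life_days)

reference = date(2010, 1, 1)                            # hypothetical "now"
recent = decay_weight(date(2009, 1, 1), reference)      # 365 days old
old = decay_weight(date(2006, 1, 1), reference)         # about four years old

print(f"weight of a 1-year-old rating : {recent:.3f}")
print(f"weight of a 4-year-old rating : {old:.3f}")
```

Such weights could, for example, multiply each co-like when similarities are accumulated. The half-life would need to be tuned on validation data.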

### Nutritional Distributions

In [None]:
# Parse calories from nutrition column (index 0 in the list)
calories = raw_recipes['nutrition'].dropna().apply(lambda x: ast.literal_eval(x)[0])

# Clip extreme values for readability (98th percentile)
calories_clipped = calories.clip(upper=calories.quantile(0.98))

# Plot histogram
plt.figure(figsize=(6,4))
plt.hist(calories_clipped, bins=50, edgecolor='black')
plt.title("Distribution of Recipe Calories (clipped at 98th percentile)")
plt.xlabel("Calories")
plt.ylabel("Recipe Count")
plt.tight_layout()
plt.show()

The calorie distribution is distinctly right-skewed: most recipes fall between 100 and 500 calories, with progressively fewer high-calorie dishes extending into the 1,000+ calorie range. This reflects the broad range of meals on Food.com, from lighter snacks and sides to calorie-dense entrées and desserts. The shape of this distribution suggests that caloric content is a meaningful differentiator between recipe types and could influence user behavior—some users may consistently prefer lighter dishes, while others gravitate toward more indulgent options. Although our baseline recommender will ignore this information, these nutritional features provide a clear avenue for future extensions, such as health-aware or diet-constrained recommendations that personalize suggestions based not only on past ratings but also on user dietary preferences.

## Interaction × Metadata

### Popularity vs Rating

In [None]:
# Compute recipe popularity (# of ratings) and mean rating
recipe_stats = train_df.groupby('i').agg(
    n_ratings=('rating', 'count'),
    avg_rating=('rating', 'mean')
).reset_index()

# Optional: remove recipes with extremely few ratings, to avoid noise
recipe_stats = recipe_stats[recipe_stats['n_ratings'] >= 5]

# Scatter plot
plt.figure(figsize=(7,5))
plt.scatter(recipe_stats['n_ratings'], recipe_stats['avg_rating'], alpha=0.3)
plt.xscale('log')  # popularity is long-tailed
plt.xlabel("Number of Ratings (log scale)")
plt.ylabel("Average Rating")
plt.title("Recipe Popularity vs Average Rating")
plt.tight_layout()
plt.show()

This plot illustrates how recipe popularity relates to perceived quality. Recipes with only a few ratings span the full range of possible averages—from very low to seemingly perfect scores—reflecting the high variance typical of sparse data. As the number of ratings increases, this variance collapses, and recipes converge toward a stable average of roughly 4.5 stars, which mirrors the global rating bias observed earlier. Popular recipes therefore do not become worse; rather, their ratings are more reliable because they reflect opinions from a larger and more diverse user base. In contrast, many little-reviewed recipes appear unrealistically strong due to a handful of enthusiastic reviewers. This highlights a core challenge for recommender systems: naively sorting by average rating will overemphasize niche items with inflated scores, underscoring the need for ranking-based methods or latent factor models that can account for popularity and data sparsity.
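
A common remedy for inflated averages on little-reviewed items is a damped (shrunken) mean, which pulls each recipe's average toward the global mean in proportion to how few ratings it has. The sketch below is an illustration with hypothetical numbers. The damping constant of 10 and the global mean of 4.4 are assumptions, and this technique is not used elsewhere in the notebook.

```python
def damped_mean(ratings, global_mean, damping=10):
    """Shrink a recipe's mean toward the global mean.

    Equivalent to adding `damping` pseudo-ratings equal to `global_mean`.
    """
    return (sum(ratings) + damping * global_mean) / (len(ratings) + damping)

global_mean = 4.4                              # hypothetical global average
niche = [5, 5]                                 # two enthusiastic reviews
popular = [5] * 160 + [4] * 30 + [3] * 10      # 200 reviews

niche_raw = sum(niche) / len(niche)
popular_raw = sum(popular) / len(popular)
niche_damped = damped_mean(niche, global_mean)
popular_damped = damped_mean(popular, global_mean)

print(f"raw means    : niche={niche_raw:.2f}, popular={popular_raw:.2f}")
print(f"damped means : niche={niche_damped:.2f}, popular={popular_damped:.2f}")
```

Sorting by the raw mean puts the two-review recipe first. After damping, the well-reviewed recipe ranks higher.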

### Complexity vs Rating

In [None]:
meta = raw_recipes[['id', 'minutes', 'n_steps', 'n_ingredients']].copy()
meta = meta.rename(columns={'id': 'recipe_id'})

# Merge metadata with interactions
df = train_df.merge(meta, on='recipe_id')

# Filter out extreme time anomalies (as before)
df = df[df['minutes'] > 0]
df = df[df['minutes'] <= 7*24*60]  # 7 days max

# Select complexity components
features = df[['minutes', 'n_steps', 'n_ingredients']].astype(float)

# Standardize each metric
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# Complexity = sum of normalized components
df['complexity'] = scaled.sum(axis=1)

# Compute average rating per complexity level (bin complexity for readability)
df['complexity_bin'] = pd.qcut(df['complexity'], q=30, duplicates='drop')
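# observed=True keeps only bins that contain data (and avoids a pandas FutureWarning)
comp_rating = df.groupby('complexity_bin', observed=True)['rating'].mean().reset_index()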

# Plot
plt.figure(figsize=(7,4))
plt.plot(comp_rating.index, comp_rating['rating'], marker='o')
plt.xlabel("Complexity (binned quantiles)")
plt.ylabel("Average Rating")
plt.title("Recipe Complexity vs Average Rating")
plt.tight_layout()
plt.show()

Recipe complexity shows only a weak relationship to user satisfaction. The simplest recipes receive slightly higher ratings on average, but the effect is very small—about 0.07 stars across the entire complexity range—indicating that users do not strongly reward or penalize complexity in isolation. While there is a modest initial decline as recipes become more involved, the ratings quickly stabilize, suggesting that highly complex dishes are not necessarily perceived as better or worse, just more demanding. This minimal variation implies that complexity is not a reliable global predictor of preference. Instead, complexity may matter only at the individual level—some users consistently enjoy ambitious cooking projects, while others prefer quick meals—highlighting the need for a personalized recommender rather than a system that assumes simpler or more complex recipes are universally better.

# Evaluation Framework

In recommendation problems, the goal is not to predict exact rating values but to rank items so that relevant ones appear in a user’s top suggestions. Because the Food.com dataset is very sparse and ratings are heavily skewed toward high values, traditional regression-style metrics such as RMSE or MAE are poorly suited to this task. They reward predicting accurate numeric ratings, and on this dataset a model can score well on them simply by predicting values near the skewed mean.

Instead, we evaluate models using top-K ranking metrics computed on held-out user interactions in the validation set. These metrics measure how effectively a recommender retrieves items the user actually interacted with, which directly reflects the practical goal of a recipe recommender system.

We use four standard ranking metrics:

- **Precision@K** — the fraction of recommended items in the top-K that the user actually interacted with. Measures recommendation accuracy.

- **Recall@K** — the fraction of the user’s relevant items that appear in the top-K. Measures coverage of true preferences.

- **MRR (Mean Reciprocal Rank)** — evaluates how early the first relevant item appears in the recommendation list.

- **AUC (Area Under the ROC Curve)** — measures how well the model ranks positive items ahead of unobserved ones, accounting for overall ranking quality.

Together, these metrics give a robust picture of recommendation quality: accuracy of the top-K list (Precision), ability to retrieve most of the user’s relevant items (Recall), ranking sharpness (MRR), and global discrimination ability (AUC). These metrics are standard in information retrieval and recommender systems and do not rely on rating magnitude, making them well-suited to implicit feedback data.

All models developed later in the notebook (e.g., improved similarity functions or alternative baselines) will be evaluated using this same framework for consistent comparison.
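
As a worked example of the first three metrics, the sketch below computes them by hand for a single hypothetical user. The recommended list and the relevant set are invented for illustration.

```python
recommended = ["r7", "r2", "r9", "r4", "r1"]   # hypothetical ranked list
relevant = {"r2", "r4", "r8"}                   # hypothetical held-out items

k = 5
hits = [item for item in recommended[:k] if item in relevant]

precision_at_5 = len(hits) / k                 # 2 hits out of 5 slots
recall_at_5 = len(hits) / len(relevant)        # 2 of 3 relevant items found

# Reciprocal rank: the first relevant item ("r2") sits at position 2
first_hit_rank = next(
    (rank for rank, item in enumerate(recommended, start=1) if item in relevant),
    None,
)
reciprocal_rank = 1.0 / first_hit_rank if first_hit_rank else 0.0

print(precision_at_5, round(recall_at_5, 3), reciprocal_rank)
```

Averaging each per-user value over all evaluated users gives the reported Precision@K, Recall@K, and MRR.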

In [16]:
# ==============================================================
# Ranking Metrics for Recipe Recommendation
# ==============================================================

def precision_at_k(recommended, relevant, k=10):
    """
    Precision@K = (# of recommended items in top-K that are relevant) / K
    recommended: list of item_ids (ranked)
    relevant: set of ground-truth item_ids
    """
    if k == 0:
        return 0.0
    recommended_at_k = recommended[:k]
    hits = sum(1 for item in recommended_at_k if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k=10):
    """
    Recall@K = (# of recommended items in top-K that are relevant) / (# relevant items)
    """
    if len(relevant) == 0:
        return 0.0
    recommended_at_k = recommended[:k]
    hits = sum(1 for item in recommended_at_k if item in relevant)
    return hits / len(relevant)


def mrr(recommended, relevant):
    """
    Reciprocal rank for one user = 1 / rank of the first relevant item (0.0 if none).
    Averaging this value over users gives the Mean Reciprocal Rank.
    """
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def auc_score(recommended, relevant, all_items, num_neg_samples=100):
    """
    Sampled AUC computed using sklearn's roc_auc_score.
    
    Parameters
    ----------
    recommended : list
        Ranked list of recommended item_ids (best items first).
    relevant : set or list
        Ground-truth positive items for the user (from val_df).
    all_items : set
        Universe of item_ids (train + val).
    num_neg_samples : int
        Number of negative items to sample for AUC estimation.

    Returns
    -------
    float
        Sampled AUC score.
    """
    relevant = set(relevant)

    # Universe of negative items
    non_relevant = list(all_items - relevant)
    if len(relevant) == 0 or len(non_relevant) == 0:
        return 0.0

    # Sample a subset of negatives
    sampled_neg = random.sample(non_relevant, min(num_neg_samples, len(non_relevant)))

    # Construct y_true (1 = relevant, 0 = negative)
    y_true = np.array([1] * len(relevant) + [0] * len(sampled_neg))

    # Map items to ranks (lower rank is better)
    # Items not in recommended get rank = large number
    max_rank = len(recommended)
    ranks = {item: idx for idx, item in enumerate(recommended)}

    def score(item):
        # Higher score = more relevant
        return -(ranks.get(item, max_rank + 1))

    # Score vectors
    y_scores = np.array([score(i) for i in relevant] + 
                        [score(j) for j in sampled_neg])

    # sklearn AUC
    try:
        return roc_auc_score(y_true, y_scores)
    except ValueError:
        return 0.0
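
To interpret the sampled AUC, it helps to recall that the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as half. The pure-Python sketch below computes that pairwise fraction on hypothetical scores that follow the same convention as `score(item)` above (negative rank, with all unranked items sharing one lowest score).

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores for a recommendation list of length 10:
# ranked items score -rank, unranked items all score -11.
pos_scores = [0, -3, -11]        # the third positive was never recommended
neg_scores = [-1, -2, -11, -11]  # two ranked negatives, two unranked

auc = pairwise_auc(pos_scores, neg_scores)
print(f"pairwise AUC: {auc:.4f}")
```

One consequence of the tie handling is that a positive item missing from the list ties with every unranked negative and contributes 0.5 per pair. A recommender that returns nothing useful therefore scores close to 0.5, not 0.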


# Item–Item Jaccard Recommender

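We build a simple recommender using item–item collaborative filtering on implicit feedback. Item–item CF is a natural starting point here because most users have very few ratings, which makes user–user overlaps unreliable, and because item neighborhoods are cheap to compute from set intersections. As the EDA showed, many recipes also have only one or two ratings, so item overlaps are most informative for the more frequently rated recipes.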

Since Food.com ratings are heavily skewed toward 4–5 stars, we convert explicit ratings into implicit likes (rating ≥ threshold). This removes noise from low/mid ratings and yields cleaner co-occurrence patterns.

For similarity, we use Jaccard similarity, which measures the overlap in users who liked two items. Jaccard works well under high sparsity, ignores skewed rating magnitudes, and is efficient to compute using set intersections. Other metrics (cosine, Pearson, Euclidean) are less stable here due to extremely sparse and biased rating vectors.

To keep computation efficient, we avoid a full item–item similarity matrix. Instead, for each user, we generate a small candidate set of items co-liked by users with similar interactions and compute Jaccard only within this set.

Scores are computed by summing Jaccard similarities between each candidate item and the items the user liked, then returning the top-K unseen items.

This provides a clear, interpretable, and fast baseline recommender that we can later tune or extend.
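
The scoring rule described above can be traced by hand on a small case. The sketch below uses hypothetical likes for four items and a target user who already liked items A and C. All names and data are invented for illustration.

```python
# Hypothetical likes: item -> set of users who liked it
toy_item_likes = {
    "A": {"u1", "u2", "u3"},
    "B": {"u2", "u3", "u4"},
    "C": {"u3", "u5"},
    "D": {"u6"},
}
liked_by_target = {"A", "C"}   # items the target user already liked

def jaccard(users_i, users_j):
    """|intersection| / |union| of two user sets (0.0 if both are empty)."""
    union = users_i | users_j
    return len(users_i & users_j) / len(union) if union else 0.0

# Score each unseen item by summing its similarity to every liked item
toy_scores = {
    candidate: sum(jaccard(toy_item_likes[candidate], toy_item_likes[liked])
                   for liked in liked_by_target)
    for candidate in toy_item_likes if candidate not in liked_by_target
}
ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)

print(toy_scores)
print(ranking)
```

Item B shares two of four users with A (0.5) and one of four with C (0.25), for a total of 0.75. Item D shares no users with either and scores 0.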

In [None]:
# ------------------------------------------------
# 1. Build binary interactions with adjustable threshold
# ------------------------------------------------

def make_binary_interactions(train_df, threshold=4):
    """
    Convert explicit ratings to implicit binary likes (rating >= threshold).
    Returns:
        user_likes: dict {user_id -> set of liked item_ids}
        item_likes: dict {item_id -> set of users who liked it}
    """

    # Filter to positive interactions only
    pos_df = train_df[train_df["rating"] >= threshold]

    user_likes = defaultdict(set)
    item_likes = defaultdict(set)

    for u, i in zip(pos_df["user_id"], pos_df["recipe_id"]):
        user_likes[u].add(i)
        item_likes[i].add(u)

    return user_likes, item_likes


# ------------------------------------------------
# 2. Jaccard similarity between two items (fast)
# ------------------------------------------------

def jaccard_similarity(users_i, users_j):
    """
    Jaccard similarity between sets of users who liked item i and item j.
    """
    if not users_i or not users_j:
        return 0.0
    inter = len(users_i & users_j)
    union = len(users_i | users_j)
    return inter / union if union > 0 else 0.0


# ------------------------------------------------
# 3. Get candidate items for a user
# ------------------------------------------------

def get_candidate_items(user_id, user_likes, item_likes):
    """
    All items co-liked by any user who liked at least one item in user_likes[user_id].
    This reduces computation dramatically compared to full item space.
    """
    candidates = set()
    for item in user_likes[user_id]:
        for u in item_likes[item]:                 # users who liked this item
            candidates |= user_likes[u]            # everything those users liked
    return candidates


# ------------------------------------------------
# 4. Recommend top-K items using sum of Jaccard similarities
# ------------------------------------------------

def recommend_jaccard(user_id, user_likes, item_likes, top_k=200, max_candidates=5000):
    """
    Item-item Jaccard recommender:
    - Builds the full co-like candidate set (no per-item truncation)
    - Caps only the final candidate set at max_candidates via random sampling,
      so results can vary between runs unless the random seed is fixed
    """

    if user_id not in user_likes or len(user_likes[user_id]) == 0:
        return []

    liked_items = user_likes[user_id]

    # Build candidate set: everything co-liked with the user's liked items
    candidate_items = get_candidate_items(user_id, user_likes, item_likes)

    # Remove items user already liked
    candidate_items -= liked_items

    # Truncate large candidate sets (but only after full construction)
    if len(candidate_items) > max_candidates:
        candidate_items = set(random.sample(list(candidate_items), max_candidates))

    # Score candidates
    scores = {}
    for c in candidate_items:
        score = 0.0
        users_c = item_likes[c]
        for i in liked_items:
            score += jaccard_similarity(item_likes[i], users_c)
        scores[c] = score

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [item for item, score in ranked[:top_k]]



# ------------------------------------------------
# 5. Build model from train_df
# ------------------------------------------------

THRESHOLD = 4           # adjustable threshold
user_likes, item_likes = make_binary_interactions(train_df, threshold=THRESHOLD)


# ===========================================================
# Minimal, decoupled inspection helpers
# ===========================================================

def show_user_likes(user_id, train_df, raw_recipes, threshold=4, max_items=10):
    df = (train_df[(train_df["user_id"] == user_id) & (train_df["rating"] >= threshold)]
          .sort_values("rating", ascending=False)
          .head(max_items))
    return df.merge(raw_recipes[["id", "name"]], left_on="recipe_id", right_on="id")

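def show_recommendations(recipe_ids, train_df, raw_recipes, max_items=10):
    # Rating count and mean for the first max_items recommended recipes.
    # Note: groupby sorts by recipe_id, so rows are not in recommendation order.
    top_ids = list(recipe_ids)[:max_items]
    stats = (train_df[train_df["recipe_id"].isin(top_ids)]
             .groupby("recipe_id")["rating"].agg(["count", "mean"]).reset_index())
    return stats.merge(raw_recipes[["id", "name"]], left_on="recipe_id", right_on="id")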


In [9]:
# Example Usage

sample_user = random.choice(list(user_likes.keys()))
recs = recommend_jaccard(sample_user, user_likes, item_likes, top_k=10)

display(show_user_likes(sample_user, train_df, raw_recipes))
display(show_recommendations(recs, train_df, raw_recipes))


| | user_id | recipe_id | date | rating | u | i | id | name |
|---|---|---|---|---|---|---|---|---|
| 0 | 385180 | 77585 | 2006-11-18 | 5.0 | 17498 | 166441 | 77585 | copycat olive garden minestrone soup by todd w... |
| 1 | 385180 | 114506 | 2009-01-03 | 5.0 | 17498 | 21940 | 114506 | blue baby shower punch |
| 2 | 385180 | 15688 | 2007-11-24 | 4.0 | 17498 | 81738 | 15688 | white trash candy snack |
| 3 | 385180 | 147335 | 2008-07-06 | 4.0 | 17498 | 78435 | 147335 | chipotle copycat lime rice recipe |
| 4 | 385180 | 16746 | 2008-12-24 | 4.0 | 17498 | 45632 | 16746 | mom s gingersnaps |


| | recipe_id | count | mean | id | name |
|---|---|---|---|---|---|
| 0 | 13762 | 1 | 4.0 | 13762 | cheddar crust granny smith apple pie |
| 1 | 14083 | 1 | 5.0 | 14083 | toasted swiss ham sandwich |
| 2 | 16494 | 1 | 4.0 | 16494 | lemon icebox cookies |
| 3 | 19027 | 3 | 3.333333 | 19027 | ice cream |
| 4 | 20995 | 1 | 5.0 | 20995 | bittersweet grand marnier chicken |
| 5 | 22161 | 2 | 4.0 | 22161 | marble squares |
| 6 | 64363 | 1 | 5.0 | 64363 | fettuccine with prosciutto peas and peppers |
| 7 | 99456 | 5 | 3.2 | 99456 | country scalloped potatoes ham crock pot |
| 8 | 106442 | 1 | 4.0 | 106442 | chocolate cake brownies with splenda |
| 9 | 122507 | 1 | 4.0 | 122507 | strawberry chantilly |


## Evaluation of the Jaccard Baseline

In [None]:
# ===========================================================
# Evaluate Jaccard Baseline on val_df (Precision, Recall, MRR)
# ===========================================================

# 1. Build ground-truth interactions
val_user_truth = defaultdict(set)
for u, i in zip(val_df["user_id"], val_df["recipe_id"]):
    val_user_truth[u].add(i)

# 2. Define ranking metrics
Ks = [5, 10, 20]

results = { 
    "precision": {k: [] for k in Ks},
    "recall":    {k: [] for k in Ks},
    "mrr": []
}

# 3. Sample validation users for efficiency
NUM_USERS_FOR_RANKING = 1000  # adjust: 500–1500 is typical
all_val_users = list(val_user_truth.keys())
users_to_eval = random.sample(all_val_users, min(NUM_USERS_FOR_RANKING, len(all_val_users)))

print(f"Evaluating {len(users_to_eval)} users for Precision/Recall/MRR...\n")

# 4. Evaluation loop (fast)
for user in users_to_eval:

    # User must have training data
    if user not in user_likes or len(user_likes[user]) == 0:
        continue
    
    relevant = val_user_truth[user]
    recs = recommend_jaccard(user, user_likes, item_likes, top_k=200)
    
    # Precision & Recall
    for k in Ks:
        results["precision"][k].append( precision_at_k(recs, relevant, k=k) )
        results["recall"][k].append( recall_at_k(recs, relevant, k=k) )
        
    # MRR
    results["mrr"].append( mrr(recs, relevant) )

# 5. Aggregate into summary (no AUC yet)
summary_rows = []

for k in Ks:
    summary_rows.append({
        "Metric": f"Precision@{k}",
        "Score": np.mean(results["precision"][k])
    })
    summary_rows.append({
        "Metric": f"Recall@{k}",
        "Score": np.mean(results["recall"][k])
    })

summary_rows.append({
    "Metric": "MRR",
    "Score": np.mean(results["mrr"])
})

jaccard_summary_partial = pd.DataFrame(summary_rows)
jaccard_summary_partial

Evaluating 1000 users for Precision/Recall/MRR...



Unnamed: 0,Metric,Score
0,Precision@5,0.0
1,Recall@5,0.0
2,Precision@10,0.0
3,Recall@10,0.0
4,Precision@20,0.0
5,Recall@20,0.0
6,MRR,0.0


In [12]:
# ===========================================================
# Compute Sampled AUC for Jaccard Baseline (Subset of Users)
# ===========================================================

NUM_USERS_FOR_AUC = 400  # adjust as needed

# Universe of items
all_items = set(train_df["recipe_id"].unique()) | set(val_df["recipe_id"].unique())

# Randomly sample users for AUC
users_for_auc = random.sample(
    list(val_user_truth.keys()),
    min(NUM_USERS_FOR_AUC, len(val_user_truth))
)

auc_scores = []

for user in users_for_auc:
    if user not in user_likes or len(user_likes[user]) == 0:
        continue
        
    relevant = val_user_truth[user]
    recs = recommend_jaccard(user, user_likes, item_likes, top_k=200)
    
    auc_scores.append( auc_score(recs, relevant, all_items) )

# Add to summary
auc_row = {
    "Metric": "AUC (sampled users)",
    "Score": np.mean(auc_scores) if auc_scores else 0.0
}

jaccard_summary = pd.concat(
    [jaccard_summary_partial, pd.DataFrame([auc_row])],
    ignore_index=True
)

jaccard_summary

Unnamed: 0,Metric,Score
0,Precision@5,0.0
1,Recall@5,0.0
2,Precision@10,0.0
3,Recall@10,0.0
4,Precision@20,0.0
5,Recall@20,0.0
6,MRR,0.0
7,AUC (sampled users),0.499449


In [19]:
overlap_count = sum(
    1 for items in val_user_truth.values()
    for i in items
    if i in item_likes
)

overlap_count

0

### Challenges in Applying Item–Item Jaccard Collaborative Filtering to the Food.com Dataset

When evaluating a simple item–item recommender based on **Jaccard similarity** over implicit feedback, we observed that all ranking metrics (Precision@K, Recall@K, and MRR) were exactly zero, and AUC hovered around 0.5 (random performance). After investigating the data and our preprocessing pipeline, we found that this behavior stems from fundamental characteristics of the Food.com dataset and the limitations of item–item Jaccard CF, rather than from an implementation error.

#### 1. Extreme Item Sparsity and Lack of Co-Rating Signal
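The Food.com dataset contains over 230,000 unique recipes, and most of them receive very few ratings. In the training split, 66,478 of 160,901 recipes (about 41%) were rated by a single user, and the median recipe has only 2 raters (see the per-item counts computed later in this notebook). This results in:
- Few items sharing more than one user with any other item.
- Many items sharing **no** users with any other item.
- A sparse, weakly connected item–item graph.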

Since Jaccard similarity relies on overlapping user sets to establish item similarity, the model is unable to form meaningful relationships between most recipes.
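
To make the dependence on shared users concrete, here is a minimal sketch of Jaccard similarity over user sets. The `jaccard` helper and the toy `item_users` mapping are illustrative assumptions and are not the notebook's `recommend_jaccard` implementation.

```python
def jaccard(users_a, users_b):
    """Jaccard similarity between two sets of user ids."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / len(union)

# Toy data (hypothetical): recipe -> set of users who liked it
item_users = {
    "soup": {1, 2, 3},
    "stew": {2, 3, 4},
    "cake": {9},  # single-user item with no shared raters
}

sim_soup_stew = jaccard(item_users["soup"], item_users["stew"])  # 2 shared users out of 4
sim_soup_cake = jaccard(item_users["soup"], item_users["cake"])  # no shared users
print(sim_soup_stew, sim_soup_cake)
```

An item whose only rater liked nothing else has similarity 0 to every other item, so it can never be reached through an item–item neighborhood.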

#### 2. No Overlap Between Training and Validation Items
A critical diagnostic showed that **none of the recipes appearing in the validation set also appear in the training set’s implicit “likes” graph**. Even after lowering the implicit-feedback threshold to include *all* ratings (rating ≥ 0), the overlap remained zero.  
If a validation item never appears among the items in the training graph, then **no item-based CF model can ever recommend it**, making non-zero Precision/Recall/MRR impossible.
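
The check behind this observation can be sketched as a small coverage function. The name `recommendable_fraction` and the toy ids are assumptions for illustration; on the real data the inputs would be the sets of `recipe_id` values from the training and validation frames.

```python
def recommendable_fraction(train_items, val_items):
    """Share of validation items that also occur in the training items."""
    val_items = set(val_items)
    if not val_items:
        return 0.0
    return len(val_items & set(train_items)) / len(val_items)

# Toy ids (hypothetical), mimicking a split with disjoint item sets
train_items = {101, 102, 103}
val_items = {201, 202}

frac = recommendable_fraction(train_items, val_items)
# If this is 0, a model that only returns training items cannot hit any held-out item
print(frac)
```

Running a check like this before fitting any model is cheap and would flag the problem before any ranking metric is computed.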

#### 3. Consequences of Random Train/Validation Splitting
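Because the train/validation split was randomized, many items that appear only once in the entire dataset happened to fall into the validation set, and those items are missing from the training interactions entirely. Single-occurrence items alone do not explain the result, however: a direct check shows that none of the 6,621 validation recipes appear among the 160,901 training recipes, so the split behaves as if items were held out wholesale. How the split files were generated is worth confirming before drawing further conclusions. Either way, in sparse recommendation settings, a split that separates items from their training interactions destroys the item–item co-occurrence structure that collaborative filtering requires.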
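
One possible alternative, sketched below under stated assumptions, is a per-user leave-last-out split that only holds out an interaction when its item and its user both keep at least one interaction in training. The function `overlap_preserving_split` and the toy data are hypothetical, and this is not how the provided split files were produced.

```python
from collections import Counter

def overlap_preserving_split(interactions):
    """Hold out each user's last interaction only if its item stays in train.

    `interactions` is a list of (user, item) pairs in chronological order.
    """
    item_counts = Counter(item for _, item in interactions)
    user_counts = Counter(user for user, _ in interactions)
    last_index = {}
    for idx, (user, _) in enumerate(interactions):
        last_index[user] = idx  # later rows overwrite earlier ones

    train, val = [], []
    held_out = Counter()  # how many interactions of each item went to validation
    for idx, (user, item) in enumerate(interactions):
        is_last = last_index[user] == idx
        item_survives = item_counts[item] - held_out[item] > 1
        if is_last and user_counts[user] > 1 and item_survives:
            val.append((user, item))
            held_out[item] += 1
        else:
            train.append((user, item))
    return train, val

# Toy interactions (hypothetical)
interactions = [("u1", "a"), ("u1", "b"), ("u2", "b"), ("u2", "a"), ("u3", "c")]
train, val = overlap_preserving_split(interactions)
print(train, val)
```

By construction, every validation item also appears in training, so item-based models have a chance of retrieving it. The trade-off is that single-occurrence items are never evaluated.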

#### 4. Why AUC ≈ 0.5
AUC measures the probability that a relevant item scores higher than a non-relevant one. If the model never ranks relevant items at all and assigns them only default or noise-driven scores, then the positive-vs-negative comparisons effectively become random, producing an AUC near 0.5.
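
The effect can be reproduced on toy data. The sketch below assumes a pairwise AUC in which unscored items default to 0 and ties count as half a win. `pairwise_auc` is a hypothetical helper, and the notebook's `auc_score` may handle ties differently.

```python
def pairwise_auc(scores, relevant, all_items):
    """Probability that a relevant item outscores a non-relevant one (ties count 0.5)."""
    positives = [i for i in all_items if i in relevant]
    negatives = [i for i in all_items if i not in relevant]
    if not positives or not negatives:
        return float("nan")
    wins = 0.0
    for p in positives:
        for n in negatives:
            sp, sn = scores.get(p, 0.0), scores.get(n, 0.0)
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

all_items = list(range(1000))
relevant = {999}               # held-out item that the model never scores
scores = {0: 0.9, 1: 0.8}      # only two training items receive a score

auc = pairwise_auc(scores, relevant, all_items)
print(auc)
```

Under these assumptions the unscored relevant item ties with almost every negative and loses only to the few scored ones, so the result sits just below 0.5 and moves closer to it as the item universe grows.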

#### Summary
Taken together, these challenges mean that **item–item Jaccard collaborative filtering is not viable on this dataset with the current preprocessing and split strategy**. The sparsity of interactions, the scarcity of co-liked items, and the lack of training–validation item overlap make it impossible for the model to retrieve held-out items. This motivates the use of alternative baselines (e.g., popularity) or different modeling approaches (e.g., user–user similarity or embedding-based methods) that are more robust under extreme sparsity. Note that any approach restricted to ranking items seen in training faces the same zero-overlap ceiling, so scoring held-out recipes also requires either a split with shared items or features derived from recipe metadata.


In [20]:
item_user_counts = train_df.groupby("recipe_id")["user_id"].nunique()
item_user_counts.describe()

count    160901.000000
mean          4.343671
std          13.421461
min           1.000000
25%           1.000000
50%           2.000000
75%           4.000000
max        1091.000000
Name: user_id, dtype: float64

In [21]:
num_multi_user_items = (item_user_counts > 1).sum()
num_single_user_items = (item_user_counts == 1).sum()

num_multi_user_items, num_single_user_items


(np.int64(94423), np.int64(66478))

In [22]:
val_items = set(val_df["recipe_id"].unique())
train_items = set(train_df["recipe_id"].unique())

len(val_items & train_items), len(val_items), len(train_items)

(0, 6621, 160901)