# Spicerack: A Recipe Recommendation System Based on Users' Spices

## General Info

### Background:

A common problem I run into when shopping for groceries is forgetting which spices I already have at home. I stand in the spice aisle unsure whether I need anything, which often leads me to buy duplicates or skip the spices I actually need. Other times, I end up with a single spice that doesn’t pair with anything else I own, so it sits in the pantry unused. Over time, this creates clutter, wastes money, and limits what I can cook. While this is not a life-threatening issue, the project aims to provide a quality-of-life tool that helps users understand which spices they already have, how those spices relate to one another, and how they can be combined into consistent flavor profiles and practical recipes.


### Functionality:

Users will input the spices they currently have in their pantry. The system will analyze these spices to identify flavor profiles and common pairings. Based on these profiles, the application will recommend recipes that can be made with the user’s existing spices, as well as suggest complementary spices and recipes to expand their cooking options. The focus is on simplicity and usability, allowing users to quickly see how their spices can be used without needing extensive cooking knowledge.
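One possible way to suggest complementary spices is to count which spices the user lacks but which frequently appear alongside the ones they own. The sketch below is only an illustration under that assumption: the function name `suggest_complementary` and the toy recipe sets are hypothetical and not part of the project code.

```python
from collections import Counter

def suggest_complementary(recipes, pantry, top_n=3):
    """Rank spices the user lacks by how often they co-occur with pantry spices."""
    pantry = set(pantry)
    counts = Counter()
    for spices in recipes:
        spices = set(spices)
        shared = len(spices & pantry)
        if shared == 0:
            continue  # recipe has nothing in common with the pantry
        # Credit each missing spice with the number of pantry spices it appears with
        for missing in spices - pantry:
            counts[missing] += shared
    return counts.most_common(top_n)

# Hypothetical toy data: each recipe is reduced to its set of spices
recipes = [
    {"cumin", "paprika", "oregano"},
    {"cumin", "coriander", "turmeric"},
    {"cumin", "paprika", "coriander"},
    {"cinnamon", "nutmeg"},
]
suggestions = suggest_complementary(recipes, ["cumin", "paprika"])
print(suggestions)
```

With this toy data, coriander ranks first because it appears in two recipes that overlap with the pantry. A real version would likely need to normalize for very common spices such as salt.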

### Tech Stack & Data:

For now, the project will be built in Python for data processing and modeling. pandas and NumPy will handle the data, while scikit-learn or similar tools may be used for clustering and similarity analysis of flavor profiles. Recipe and spice data will come from Kaggle or other publicly available sources such as recipe APIs and open food databases. The final product may be presented through a simple web interface or a notebook-based demo. If time allows, I hope to present the project as a fully developed app. That step will come last, since no one on the project has app development experience yet.


### Proposed Timeline (subject to change)


Week of 1/27 – Exploratory Data Analysis & Dataset Selection

 Research and select appropriate spice and recipe datasets.

Week of 2/3 – Model Building

Begin researching and implementing the initial logic for mapping spices to flavor profiles, and begin implementing similarity or clustering methods. Start building a preliminary recommendation approach based on the available data.


Week of 2/12 – Dataset & Preliminary Model Deliverable

Finalize cleaned datasets and submit an initial working model that demonstrates basic spice analysis and recipe recommendation functionality.


Week of 2/17 – Iteration & Feature Expansion

Use feedback from the preliminary model to improve accuracy and usability. Refine the flavor profile mappings and improve the recommendations.


Week of 2/24 – Midpoint Showcase

 Prepare and present a functional mid-project demo showcasing current progress, model behavior, and planned next steps.


Week of 3/3 – Polishing & User Experience Improvements

 Focus on improving usability, presentation, and overall flow of the system. Refine outputs and prepare for People’s Choice considerations if applicable.


March–May – Final Improvements & Presentation Prep


## The Project

### Imports and configurations

In [8]:
"""
Preliminary Spice → Recipe Recommender (RecipeNLG-friendly)

What this does:
1) Loads a RecipeNLG-style CSV 
2) Extracts spices from each recipe's ingredient text using a spice vocabulary
3) Builds a binary spice matrix (recipes x spices)
4) Recommends recipes for a user's spice list using Jaccard similarity

Notes:
- This is a baseline model (no ML training), intended as the "preliminary model" deliverable.
- Works best when sampled down to 50k–200k recipes for speed on a laptop.
"""

import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import pairwise_distances


In [9]:

# 1) CONFIG: spice vocabulary

# Start with 30–80 common spices, then expand later.
SPICES = [
    "salt", "black pepper", "pepper",
    "garlic", "garlic powder",
    "onion powder", "onion",
    "cumin", "paprika", "smoked paprika",
    "chili powder", "crushed red pepper", "red pepper flakes",
    "cayenne", "turmeric", "coriander",
    "ginger", "ground ginger",
    "cinnamon", "nutmeg", "cloves", "allspice",
    "oregano", "basil", "thyme", "rosemary", "sage",
    "bay leaf", "bay leaves",
    "parsley",
    "cardamom", "fennel", "mustard", "mustard powder",
    "curry powder", "garam masala",
    "star anise", "anise",
    "tarragon", "dill",
    "sumac", "za'atar",
]

# Optional: common names and generalizations. 
# Note: CHANGE THIS IF NEEDED LATER
ALIASES = {
    "bay leaves": "bay leaf",
    "red pepper flakes": "crushed red pepper",
    "ground ginger": "ginger",
    "garlic powder": "garlic",
    "smoked paprika": "paprika",
    "pepper": "black pepper",  # if you want "pepper" to map to black pepper
}

# If you want to avoid matching very generic things (salt/pepper),
# you can exclude them later or downweight them in future iterations.
# For now we will keep them (baseline).
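The comment above mentions downweighting generic entries such as salt and pepper in a later iteration. One way this could look is a weighted Jaccard score in which each spice carries a smoothed inverse-document-frequency weight, so that spices found in nearly every recipe contribute less. This is a sketch under that assumption; `idf_weights`, `weighted_jaccard`, and the toy data are hypothetical names and values, not part of the baseline.

```python
import math

def idf_weights(recipes, vocab):
    """Smoothed IDF: spices that appear in more recipes get a smaller weight."""
    n = len(recipes)
    weights = {}
    for spice in vocab:
        doc_freq = sum(1 for r in recipes if spice in r)
        weights[spice] = math.log((1 + n) / (1 + doc_freq)) + 1.0
    return weights

def weighted_jaccard(a, b, weights):
    """Sum of weights in the intersection divided by sum of weights in the union."""
    union = a | b
    if not union:
        return 0.0
    return sum(weights[s] for s in a & b) / sum(weights[s] for s in union)

# Hypothetical toy data: salt appears in every recipe
recipes = [{"salt", "cumin"}, {"salt", "paprika"}, {"salt", "cumin", "paprika"}, {"salt"}]
weights = idf_weights(recipes, ["salt", "cumin", "paprika"])

# Sharing only salt now scores below the unweighted Jaccard value of 1/3
salt_only = weighted_jaccard({"salt", "cumin"}, {"salt", "paprika"}, weights)
# Sharing cumin scores above the unweighted value of 1/2
cumin_shared = weighted_jaccard({"salt", "cumin"}, {"cumin"}, weights)
print(weights, salt_only, cumin_shared)
```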



In [10]:

# 2) Text cleaning utilities

def normalize_text(s: str) -> str:
    """Lowercase and keep only letters and single spaces (za'atar becomes za atar)."""
    if not isinstance(s, str):
        return ""
    s = s.lower()
    # Replace everything except a-z and whitespace (punctuation, digits) with spaces
    s = re.sub(r"[^a-z\s]", " ", s)
    # Collapse runs of whitespace into a single space
    s = re.sub(r"\s+", " ", s).strip()
    return s

def build_spice_patterns(spice_list):
    """
    Compile one regex per spice, matching the name as whole space-delimited words.
    Patterns are ordered longest name first. Each pattern is still searched
    independently, so a short name such as "pepper" also matches inside
    "red pepper flakes"; the ordering alone does not prevent that.
    """
    spices_sorted = sorted(spice_list, key=len, reverse=True)
    patterns = []
    for sp in spices_sorted:
        sp_norm = normalize_text(sp)
        # Text is already normalized to letters and single spaces,
        # so the name is anchored on spaces or string ends: (^| )sp( |$)
        pat = re.compile(rf"(^| ){re.escape(sp_norm)}( |$)")
        patterns.append((sp, sp_norm, pat))
    return patterns

SPICE_PATTERNS = build_spice_patterns(SPICES)

def extract_spices_from_ingredients(ingredients) -> set:
    """
    ingredients: can be a list of ingredient strings OR one big string.
    Returns: a set of normalized spice names.
    """
    # If list -> join
    if isinstance(ingredients, list):
        raw = " ".join([str(x) for x in ingredients])
    else:
        raw = str(ingredients)

    text = normalize_text(raw)

    found = set()
    for original, norm, pat in SPICE_PATTERNS:
        if pat.search(" " + text + " "):
            found.add(norm)

    # Apply alias mapping to canonical names (optional)
    canonical = set()
    for sp in found:
        canonical.add(normalize_text(ALIASES.get(sp, sp)))

    return canonical

def parse_ingredients_field(x):
    """
    RecipeNLG ingredients are often stored as a stringified Python list like:
    "['1 cup sugar', '2 eggs', ...]"
    If that's your case, try to parse safely.
    If it's already a list, we keep it.
    If it's plain text, we keep it as text.
    """
    if isinstance(x, list):
        return x

    if not isinstance(x, str):
        return ""

    s = x.strip()

    # Heuristic: looks like a python list
    if s.startswith("[") and s.endswith("]"):
        # Very lightweight parsing: pull quoted chunks
        # Works for many Kaggle-ish list strings.
        items = re.findall(r"'([^']*)'|\"([^\"]*)\"", s)
        parsed = []
        for a, b in items:
            parsed.append(a if a else b)
        # If parsing fails, fall back to raw string
        return parsed if len(parsed) > 0 else x

    return x
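In the extraction step above, every spice pattern is searched against the full text, so "pepper" is also found inside "red pepper flakes" and then aliased to black pepper. If that turns out to be undesirable, one option is to blank out each match before trying shorter names. The sketch below illustrates that idea only; `extract_longest_first` and the small vocabulary are hypothetical, and names are assumed to be already lowercase letters and spaces.

```python
import re

def extract_longest_first(text, vocab):
    """Match longer names first and blank out each hit so shorter names cannot re-match inside it."""
    # Same normalization idea as the baseline: keep only a-z and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    found = set()
    for name in sorted(vocab, key=len, reverse=True):
        # Lookarounds require that the name is not glued to other letters
        pat = re.compile(rf"(?<![a-z]){re.escape(name)}(?![a-z])")
        text, n_hits = pat.subn(" ", text)
        if n_hits:
            found.add(name)
    return found

# Hypothetical mini-vocabulary for illustration
vocab = ["pepper", "black pepper", "red pepper flakes", "garlic", "garlic powder"]
hits = extract_longest_first("1 tsp red pepper flakes, 2 cloves garlic", vocab)
print(hits)
```

Here "pepper" is no longer reported because the longer name consumed it. Masking does not address every false positive: with the full vocabulary, "cloves" would still match in "2 cloves garlic".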

In [11]:
# 3) Load the data
def load_recipes(csv_path: str, sample_n: int | None = 100_000, seed: int = 42) -> pd.DataFrame:
    """
    Expected columns:
      - title (string)
      - ingredients (list-like string OR list OR text)
    If your file uses different column names, adjust below.

    sample_n: set to None to use full dataset (not recommended at first).
    """
    df = pd.read_csv(csv_path)

    # Common RecipeNLG-ish column names:
    # Some files use "title", "ingredients", "directions"/"instructions".
    # If yours differs, rename here.
    required = ["title", "ingredients"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}. "
            f"Your columns are: {list(df.columns)}. "
            f"Rename columns to include {required}."
        )

    # Sample for speed
    if sample_n is not None and len(df) > sample_n:
        df = df.sample(n=sample_n, random_state=seed).reset_index(drop=True)

    # Parse ingredients field to list/text
    df["ingredients_parsed"] = df["ingredients"].apply(parse_ingredients_field)

    return df
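The ingredient parsing used in `load_recipes` pulls quoted chunks out with a regex, which can cut an item short when it contains an escaped quote. As an alternative sketch, the standard library's `ast.literal_eval` can parse a stringified Python list directly and fall back to the raw value when parsing fails. The function name `parse_list_string` is hypothetical; whether this handles every row of the actual dataset has not been checked here.

```python
import ast

def parse_list_string(value):
    """Parse a stringified list with ast.literal_eval; return the input unchanged on failure."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return ""
    s = value.strip()
    if s.startswith("[") and s.endswith("]"):
        try:
            parsed = ast.literal_eval(s)
        except (ValueError, SyntaxError):
            return value  # not a valid literal; keep the raw text
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
    return value

print(parse_list_string('["1 c. sugar", "2 eggs"]'))
print(parse_list_string('["1 (9\\" ) pie shell"]'))  # escaped quote inside an item
```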



In [12]:
# 4) Build spice matrix
def build_spice_matrix(df: pd.DataFrame) -> tuple[pd.DataFrame, MultiLabelBinarizer, np.ndarray]:
    """
    Returns:
      - df_with_spices: original df + "spices" column (set)
      - mlb: fitted MultiLabelBinarizer
      - X: binary matrix (n_recipes x n_spices) as a numpy array (0/1)
    """
    df = df.copy()
    df["spices"] = df["ingredients_parsed"].apply(extract_spices_from_ingredients)

    # Fit binarizer on spice vocabulary (to ensure fixed columns)
    spice_vocab = [normalize_text(s) for s in SPICES]
    spice_vocab = [normalize_text(ALIASES.get(s, s)) for s in spice_vocab]
    spice_vocab = sorted(set(spice_vocab))

    mlb = MultiLabelBinarizer(classes=spice_vocab)
    X = mlb.fit_transform(df["spices"])

    return df, mlb, X
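To make the shape of the spice matrix concrete, the following hand-rolled sketch mimics what a fixed-vocabulary binarizer produces: one row per recipe, one column per spice, with 1 where the recipe contains that spice. It is an illustration only and does not call scikit-learn; `binarize` and the toy inputs are hypothetical.

```python
def binarize(spice_sets, vocab):
    """Return one 0/1 row per recipe, with columns in the sorted order of `vocab`."""
    columns = sorted(set(vocab))
    index = {name: j for j, name in enumerate(columns)}
    rows = []
    for spices in spice_sets:
        row = [0] * len(columns)
        for name in spices:
            if name in index:  # names outside the vocabulary are ignored
                row[index[name]] = 1
        rows.append(row)
    return columns, rows

columns, rows = binarize(
    [{"cumin", "paprika"}, set(), {"garlic", "saffron"}],
    ["paprika", "garlic", "cumin"],
)
print(columns)
for row in rows:
    print(row)
```

Fixing the columns up front means a recipe with no recognized spices still gets an all-zero row, and the user's spice list can later be encoded in the same column order.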

In [13]:
# 5) Recommend with Jaccard

def recommend_recipes(
    df: pd.DataFrame,
    mlb: MultiLabelBinarizer,
    X: np.ndarray,
    user_spices: list[str],
    top_k: int = 10,
    min_match: int = 1
) -> pd.DataFrame:
    """
    user_spices: list of spices the user has (strings)
    top_k: number of recommendations
    min_match: require at least this many shared spices to show up

    Output: ranked dataframe with similarity score and matched spices
    """
    # Normalize + alias
    user_norm = [normalize_text(s) for s in user_spices]
    user_norm = [normalize_text(ALIASES.get(s, s)) for s in user_norm]
    user_set = set(user_norm)

    # Build user vector in same feature space
    user_vec = mlb.transform([user_set])  # shape (1, n_spices)

    # Jaccard distance -> similarity = 1 - distance
    # metric="jaccard" works on boolean arrays; cast explicitly so scikit-learn
    # does not have to convert the 0/1 integer matrix itself.
    # Returns (1, n_recipes)
    distances = pairwise_distances(user_vec.astype(bool), X.astype(bool), metric="jaccard")
    sims = 1.0 - distances.flatten()

    # Compute match counts for filtering/explainability
    # Intersection count: (user_vec & recipe_vec).sum()
    match_counts = (X & user_vec).sum(axis=1)

    # Filter by minimum overlap
    valid_idx = np.where(match_counts >= min_match)[0]
    if len(valid_idx) == 0:
        # No matches -> return best overall (even if 0 overlap)
        valid_idx = np.arange(len(df))

    # Rank by similarity then by match count
    rank_idx = valid_idx[np.lexsort((-match_counts[valid_idx], -sims[valid_idx]))]
    rank_idx = rank_idx[:top_k]

    # rank_idx holds row positions, so select with .iloc rather than label-based .loc
    out = df.iloc[rank_idx][["title"]].copy()
    out["similarity"] = sims[rank_idx]
    out["matched_spices"] = df["spices"].iloc[rank_idx].apply(lambda s: sorted(s & user_set))
    out["num_matched"] = match_counts[rank_idx]
    out = out.sort_values(["similarity", "num_matched"], ascending=False).reset_index(drop=True)
    return out
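The Jaccard similarity used for ranking is the size of the intersection of two spice sets divided by the size of their union. The pure-Python sketch below works through that arithmetic on hypothetical recipes and applies the same ordering (similarity first, then match count). A recipe containing all six user spices plus one other scores 6/7, and one with two others scores 6/8, which are the values that appear in the sample output further down.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

user = {"garlic", "cumin", "paprika", "chili powder", "oregano", "black pepper"}

# Hypothetical recipes, described by their spice sets
recipes = {
    "two shared": {"garlic", "cumin", "salt"},
    "six shared plus two extra": user | {"salt", "onion"},
    "six shared plus one extra": user | {"salt"},
}

# Rank by similarity, then by number of matched spices (both descending)
ranked = sorted(
    ((name, jaccard(user, spices), len(user & spices)) for name, spices in recipes.items()),
    key=lambda row: (-row[1], -row[2]),
)
for name, sim, n_matched in ranked:
    print(f"{name}: similarity={sim:.6f}, matched={n_matched}")
```

Because the union grows with every extra spice in the recipe, Jaccard penalizes recipes that need many spices the user does not own, even when all of the user's spices are matched.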

In [15]:
# 6) Example run

if __name__ == "__main__":
    # 1) Point this to your downloaded RecipeNLG CSV
    # Example: "RecipeNLG_dataset.csv"
    CSV_PATH = "/Users/daniellarson/Desktop/SpiceRack/cookingdataset/RecipeNLG_dataset.csv"  # <-- change this to the file's location on your machine

    # Load + sample
    df = load_recipes(CSV_PATH, sample_n=100_000)

    # Build matrix
    df_sp, mlb, X = build_spice_matrix(df)

    # Example user input
    user_spices = ["garlic", "cumin", "paprika", "chili powder", "oregano", "pepper"]

    # Recommend
    recs = recommend_recipes(
        df=df_sp,
        mlb=mlb,
        X=X,
        user_spices=user_spices,
        top_k=10,
        min_match=2
    )

    print("\nTop recommendations:")
    print(recs.to_string(index=False))


Top recommendations:
                                    title  similarity                                                matched_spices  num_matched
                                  Chorizo    0.857143 [black pepper, chili powder, cumin, garlic, oregano, paprika]            6
                          Beef Tamale Pie    0.857143 [black pepper, chili powder, cumin, garlic, oregano, paprika]            6
Spicy Beer Braised Beef And Buffalo Chili    0.750000 [black pepper, chili powder, cumin, garlic, oregano, paprika]            6
                          Chicken Fajitas    0.750000 [black pepper, chili powder, cumin, garlic, oregano, paprika]            6
                Ultimate Vegetarian Chili    0.750000 [black pepper, chili powder, cumin, garlic, oregano, paprika]            6
          Southwestern Pasta with Chicken    0.750000 [black pepper, chili powder, cumin, garlic, oregano, paprika]            6
                     Beef And Pork Chili     0.750000 [black pepper, chili 

