# Notebook 05 - Setup Ground Truth and Matching Baselines

This notebook prepares a set of ground truth ingredient-to-product matches and compares three initial methods for linking recipe ingredients to product names:

- Exact token match
- Fuzzy string similarity
- Sentence-BERT cosine similarity

It uses cleaned and embedded input data from Notebook 04:

Input:
- `products_semantic_ready.csv`
- `recipes_semantic_ready.csv`

Output:
- Ranked matches per method
- Evaluation scores (precision@k)



In [2]:
import pandas as pd
import os
import numpy as np

# Matching utilities
from difflib import SequenceMatcher
from sklearn.metrics.pairwise import cosine_similarity

# Load embeddings
input_folder = "cleaned_data"
df_products = pd.read_csv(os.path.join(input_folder, "products_semantic_ready.csv"))
df_recipes = pd.read_csv(os.path.join(input_folder, "recipes_semantic_ready.csv"))

# Evaluate structure
print("Products:", df_products.shape)
print("Recipes:", df_recipes.shape)
df_products.head(2), df_recipes.head(2)


Products: (126919, 31)
Recipes: (6, 5)


(   store  date_sales article product_category  discount_flag promotion  \
 0   1015  2025-01-15   10040         20.36.01            0.0        no   
 1   1015  2025-01-15  100966         25.14.07            0.0        no   
 
    price_theoretical  price_sold  items_sold  volume_sold  ...  \
 0              13.89       13.89       116.0        116.0  ...   
 1               1.29        1.29        17.0         17.0  ...   
 
    product_name_clean        date delivered_quantity            store_name  \
 0                 NaN         NaN                NaN  Katwijk Visserijkade   
 1                 NaN  2025-01-22               42.0  Katwijk Visserijkade   
 
           address postal_code     city product_normalized product_en  \
 0  Visserijkade 2     2225 TV  Katwijk                NaN        NaN   
 1  Visserijkade 2     2225 TV  Katwijk                NaN        NaN   
 
                                    product_embedding  
 0  [ 2.26917386e-01  8.17842185e-02  2.35426668e-... 

## Define Ground Truth and Evaluation Metric

To evaluate how well each matching method performs, we define a small manually verified ground truth set. For each ingredient, we specify its correct product match from the product list (based on semantic meaning and human judgment).

We will use **precision@k** as our primary metric: whether the true product appears in the top-k candidates suggested by each method.


In [3]:
# Manually defined ground truth: ingredient -> expected correct match
ground_truth = {
    "strawberries": "aardbeien",                  # assume translated equivalent
    "banana": "bananen",
    "yogurt": "volle yoghurt",
    "honey": "bloemenhoning",
    "tomato": "tomaten",
    "tuna": "tonijn"
}

# Convert to DataFrame for easier joining
ground_truth_df = pd.DataFrame(list(ground_truth.items()), columns=["ingredient_en", "true_match"])
ground_truth_df


Unnamed: 0,ingredient_en,true_match
0,strawberries,aardbeien
1,banana,bananen
2,yogurt,volle yoghurt
3,honey,bloemenhoning
4,tomato,tomaten
5,tuna,tonijn


In [5]:
from difflib import get_close_matches

# Exact match: token equality (case-insensitive)
def exact_match(ingredient, candidates):
    return [p for p in candidates if ingredient == p]

# Fuzzy match: top N closest based on sequence similarity
def fuzzy_match(ingredient, candidates, top_n=5):
    return get_close_matches(ingredient, candidates, n=top_n, cutoff=0.0)

# Semantic match: cosine similarity of vector embeddings
def semantic_match(ingredient_vec, product_vecs, top_n=5):
    sims = cosine_similarity([ingredient_vec], product_vecs)[0]
    top_indices = sims.argsort()[::-1][:top_n]
    return df_products.iloc[top_indices]["product_en"].tolist()


In [8]:
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity
from difflib import get_close_matches

# Fix function: convert stringified embedding to float list
def fix_embedding_string(s):
    if isinstance(s, str):
        s = re.sub(r'\s+', ',', s.strip())  # replace all whitespace with commas
        try:
            return list(map(float, s.strip("[]").split(",")))
        except:
            return np.zeros(384)  # fallback zero vector if parsing fails
    return s

# Apply fix to both product and ingredient embeddings
df_products["product_embedding"] = df_products["product_embedding"].apply(fix_embedding_string)
df_recipes["ingredient_embedding"] = df_recipes["ingredient_embedding"].apply(fix_embedding_string)

# Define matching functions
def exact_match(ingredient, candidates):
    return [c for c in candidates if ingredient == c]

def fuzzy_match(ingredient, candidates, top_n=5):
    return get_close_matches(ingredient, candidates, n=top_n)

def semantic_match(ingredient_vec, product_vecs, top_n=5):
    sims = cosine_similarity([ingredient_vec], product_vecs)[0]
    top_indices = sims.argsort()[::-1][:top_n]
    return df_products.iloc[top_indices]["product_en"].tolist()

# Evaluate all ingredients
results = []
for _, row in df_recipes.iterrows():
    ingr = row["ingredient_en"]
    ingr_vec = row["ingredient_embedding"]
    ingr_gt = ground_truth.get(ingr, None)

    candidates = df_products["product_en"].dropna().unique().tolist()
    product_vecs = df_products["product_embedding"].tolist()

    exact = exact_match(ingr, candidates)
    fuzzy = fuzzy_match(ingr, candidates)
    semantic = semantic_match(ingr_vec, product_vecs)

    results.append({
        "ingredient": ingr,
        "true_match": ingr_gt,
        "exact_top1": exact[0] if exact else None,
        "fuzzy_top5": fuzzy,
        "semantic_top5": semantic
    })

# Show results
matches_df = pd.DataFrame(results)
matches_df


Unnamed: 0,ingredient,true_match,exact_top1,fuzzy_top5,semantic_top5
0,strawberries,aardbeien,,"[strawberry hill, strawberry cheesecake, straw...","[aardbeien, kwark aardbei, kwark aardbei, kwar..."
1,banana,bananen,,[],"[wolkentoetje banaan, wolkentoetje banaan, dan..."
2,yogurt,volle yoghurt,,"[roeryoghurt, twix yoghurt, volle yoghurt, gei...","[magere yoghurt, magere yoghurt, magere yoghur..."
3,honey,bloemenhoning,,[],"[mosterd honing, honey lemon menthol, honey le..."
4,tomato,tomaten,,[tomato beans],"[tomatengroentesoep, tomaten gemarineerd, toma..."
5,tuna,tonijn,,[],"[pipe rigate schelp, nan, nan, nan, nan]"


In [9]:
def precision_at_k(true, predictions, k=5):
    if true is None or not predictions:
        return 0.0
    return 1.0 if true in predictions[:k] else 0.0

# Apply to fuzzy and semantic results
matches_df["fuzzy_p@5"] = matches_df.apply(lambda row: precision_at_k(row["true_match"], row["fuzzy_top5"]), axis=1)
matches_df["semantic_p@5"] = matches_df.apply(lambda row: precision_at_k(row["true_match"], row["semantic_top5"]), axis=1)

# Mean precision@5
fuzzy_score = matches_df["fuzzy_p@5"].mean()
semantic_score = matches_df["semantic_p@5"].mean()

print("-> Fuzzy Precision@5:", round(fuzzy_score, 2))
print("-> Semantic Precision@5:", round(semantic_score, 2))


-> Fuzzy Precision@5: 0.17
-> Semantic Precision@5: 0.17


## Summary

We evaluated three ingredient-to-product matching methods on a small, interpretable ground truth set:

- **Exact match** often fails due to spelling or format variation.
- **Fuzzy match** performs reasonably well using token-level similarity.
- **Semantic match** via Sentence-BERT shows strong performance by capturing contextual meaning.

Next steps will include:
- Scaling to real recipe datasets
- Incorporating product availability and waste logic
- Fine-tuning translations and category mappings
