# Notebook 04 - Prepare Semantic Matching

This notebook prepares product and recipe ingredient text for different types of matching pipelines. It performs the following steps:

1. **Normalization**: Lowercases product names and recipe ingredients, then strips punctuation and extra whitespace.
2. **Translation**: Maps normalized Dutch product names to English (currently through a small placeholder dictionary) so that products and recipe ingredients can be compared in one language.
3. **Embedding**: Encodes the translated names with a multilingual Sentence-BERT model for semantic similarity comparison.

These preprocessed features will be used for:
- Exact token match
- Fuzzy string match
- Semantic similarity (vector-based)
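
As a rough illustration of the first two matching styles, the sketch below scores a pair of normalized strings by token overlap and by character-level similarity. The helper names `token_overlap` and `fuzzy_ratio` are hypothetical, and the standard library's `difflib` is used only as a stand-in; the downstream notebooks may rely on a different scoring method.

```python
from difflib import SequenceMatcher


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the whitespace-separated tokens of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def fuzzy_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], computed with difflib."""
    return SequenceMatcher(None, a, b).ratio()


# One shared token out of two distinct tokens
print(token_overlap("lettuce mix", "lettuce"))  # 0.5

# Spelling variants score high on character similarity despite sharing no exact token
print(fuzzy_ratio("yoghurt", "yogurt"))
```

Token overlap only rewards identical tokens, so it depends heavily on the normalization step; the fuzzy score tolerates small spelling differences but carries no notion of meaning, which is where the embeddings come in.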

### Input:
- `products_full.csv`
- `mock_recipes.csv`

### Output:
- `products_semantic_ready.csv` and `recipes_semantic_ready.csv`: cleaned, translated, and embedded product and ingredient tables for downstream evaluation


In [1]:
import pandas as pd
import os

# Text preprocessing
import re
import string

# Embeddings
from sentence_transformers import SentenceTransformer

# Paths
input_folder = "cleaned_data"

# Load datasets
df_products = pd.read_csv(os.path.join(input_folder, "products_full.csv"))
df_recipes = pd.read_csv(os.path.join(input_folder, "mock_recipes.csv"))



## Normalize Text Fields

We define a shared normalization pipeline that:
- Lowercases all text
- Removes punctuation
- Strips leading/trailing spaces
- Removes extra spaces between words

This is applied to both product names and recipe ingredients to ensure uniformity across matching methods.


In [2]:
def normalize(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return text


In [3]:
# Apply normalization
df_products["product_normalized"] = df_products["product_name_clean"].apply(normalize)
df_recipes["ingredient_normalized"] = df_recipes["ingredient"].apply(normalize)

# Preview result
display(df_products[["product_name_clean", "product_normalized"]].dropna().head())
display(df_recipes[["ingredient", "ingredient_normalized"]])


| | product_name_clean | product_normalized |
|---|---|---|
| 41 | piccalilly | piccalilly |
| 58 | frikandelbroodje broodje | frikandelbroodje broodje |
| 178 | roomyoghurt straciatella | roomyoghurt straciatella |
| 219 | ham-kaas croissant croissant | hamkaas croissant croissant |
| 417 | sla melange | sla melange |


| | ingredient | ingredient_normalized |
|---|---|---|
| 0 | strawberries | strawberries |
| 1 | banana | banana |
| 2 | yogurt | yogurt |
| 3 | honey | honey |
| 4 | tomato | tomato |
| 5 | tuna | tuna |
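
The product preview shows `ham-kaas` collapsing into the single token `hamkaas`, because punctuation is deleted rather than replaced. For exact token matching it may be preferable to keep the two words apart. The sketch below is one possible variant, not part of the notebook's pipeline; the name `normalize_keep_tokens` is hypothetical, and like the original it only handles ASCII punctuation from `string.punctuation`.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_keep_tokens(text):
    """Hypothetical variant of normalize() that keeps hyphenated words as separate tokens."""
    if not isinstance(text, str):
        return ""
    # Lowercase, turn punctuation into spaces, then collapse repeated whitespace
    text = text.lower().translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_keep_tokens("Ham-Kaas  Croissant!"))  # ham kaas croissant
```

Whichever variant is used, the keys of any lookup table built on the normalized text (such as a translation dictionary) need to be written in the same form.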


## Translate Product and Ingredient Text to English

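To compare Dutch product names with the English recipe ingredients in a single language, we pass both normalized columns through a translation step. The function below is an offline placeholder that covers only a handful of product names; text without a dictionary entry is returned unchanged, so the recipe ingredients, which are already in English, pass through as-is.

Working in English also lets LLMs and embedding models trained mainly on English data be applied to the product names.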


In [4]:
# Placeholder translation (offline mock)
# In production, replace with DeepL or Google Translate API
# Keys must match the normalized product text exactly
TRANSLATION_DICT = {
    "roomyoghurt straciatella": "creamy yogurt stracciatella",
    "frikandelbroodje broodje": "frikandel roll",
    "sla melange": "lettuce mix",
    "hamkaas croissant croissant": "ham cheese croissant",
    "piccalilly": "piccalilly",  # same in both
}

def translate_to_english(text):
    # Fall back to the input text when no translation is known
    return TRANSLATION_DICT.get(text, text)

# Apply to both product and ingredient
df_products["product_en"] = df_products["product_normalized"].apply(translate_to_english)
df_recipes["ingredient_en"] = df_recipes["ingredient_normalized"].apply(translate_to_english)

# Preview translations
display(df_products[["product_normalized", "product_en"]].dropna().head())
display(df_recipes[["ingredient_normalized", "ingredient_en"]])


| | product_normalized | product_en |
|---|---|---|
| 0 | | |
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | | |


| | ingredient_normalized | ingredient_en |
|---|---|---|
| 0 | strawberries | strawberries |
| 1 | banana | banana |
| 2 | yogurt | yogurt |
| 3 | honey | honey |
| 4 | tomato | tomato |
| 5 | tuna | tuna |
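
The product preview above is blank because `normalize` returns an empty string, not `NaN`, for missing names, so `dropna()` keeps those rows. The sketch below shows the effect on a toy frame (the rows are made up for illustration, not taken from `products_full.csv`) and filters on non-empty strings instead. It also counts how many names the placeholder dictionary actually changed, as a rough coverage check.

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "product_normalized": ["", "sla melange", "", "piccalilly"],
    "product_en": ["", "lettuce mix", "", "piccalilly"],
})

# dropna() keeps all four rows because "" is not a missing value
print(len(df.dropna()))  # 4

# Filter on non-empty strings to get a meaningful preview
non_empty = df[df["product_normalized"].str.len() > 0]
print(non_empty)

# Share of non-empty names that differ after translation
changed = (non_empty["product_normalized"] != non_empty["product_en"]).mean()
print(f"changed by translation: {changed:.0%}")
```

Names that translate to themselves count as unchanged here, so the figure is a lower bound on dictionary coverage.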


## Encode Product and Ingredient Text with Sentence-BERT

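We now compute semantic embeddings for the English-translated product names and ingredients using the `paraphrase-multilingual-MiniLM-L12-v2` model, which maps each text to a 384-dimensional vector.

The model supports over 50 languages, so product names that the placeholder translation leaves in Dutch can still be embedded, and it is small enough to encode short product and ingredient strings quickly.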


In [5]:
# Load multilingual model (fast and suitable for short text)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Encode product names and ingredients (pass plain lists of strings to encode)
product_embeddings = model.encode(df_products["product_en"].fillna("").tolist(), show_progress_bar=True)
ingredient_embeddings = model.encode(df_recipes["ingredient_en"].fillna("").tolist(), show_progress_bar=True)

# Save back to DataFrames
df_products["product_embedding"] = list(product_embeddings)
df_recipes["ingredient_embedding"] = list(ingredient_embeddings)
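
Semantic similarity between an ingredient and a product is commonly measured as the cosine similarity of their embeddings. The sketch below uses random stand-in vectors rather than real model output, and the helpers `cosine_similarity_matrix` and `top_k_products` are hypothetical names, so treat it as one possible way to rank products per ingredient rather than the method used downstream.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the (len(a), len(b)) matrix of cosine similarities between rows.

    Assumes no row is an all-zero vector.
    """
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def top_k_products(ingredient_vecs, product_vecs, k=3):
    """For each ingredient, return the indices of the k most similar products."""
    sims = cosine_similarity_matrix(ingredient_vecs, product_vecs)
    # argsort is ascending, so negate to rank from most to least similar
    return np.argsort(-sims, axis=1)[:, :k]


# Stand-in vectors; in the notebook these would be
# ingredient_embeddings and product_embeddings
rng = np.random.default_rng(0)
product_vecs = rng.normal(size=(10, 384))
# Two "ingredients" built as slightly perturbed copies of products 2 and 7
ingredient_vecs = product_vecs[[2, 7]] + 0.01 * rng.normal(size=(2, 384))

print(top_k_products(ingredient_vecs, product_vecs, k=1))
```

With several thousand product batches, computing the full similarity matrix in one go may use a lot of memory; processing ingredients in chunks is a simple workaround.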



## Save Translated and Embedded Data

We now export the preprocessed product and ingredient data - including normalized text, English translations, and Sentence-BERT embeddings - to disk for reuse in evaluation and matching notebooks.


In [6]:
# Save processed product and recipe files
products_out_path = os.path.join(input_folder, "products_semantic_ready.csv")
recipes_out_path = os.path.join(input_folder, "recipes_semantic_ready.csv")

df_products.to_csv(products_out_path, index=False)
df_recipes.to_csv(recipes_out_path, index=False)

print("-> Saved products to:", products_out_path)
print("-> Saved recipes to:", recipes_out_path)


-> Saved products to: cleaned_data\products_semantic_ready.csv
-> Saved recipes to: cleaned_data\recipes_semantic_ready.csv
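
One caveat with this export: `to_csv` writes each embedding array as its string representation, which then has to be parsed back into numbers when the file is loaded. A possible alternative, sketched below on a toy frame with made-up values and hypothetical file names, is to store the vectors in a binary `.npy` file and keep only the text columns in the CSV.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame standing in for df_products; the embedding values are made up
df = pd.DataFrame({
    "product_en": ["lettuce mix", "frikandel roll"],
    "product_embedding": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "products_text.csv")        # hypothetical file name
npy_path = os.path.join(out_dir, "products_embeddings.npy")  # hypothetical file name

# Stack the per-row vectors into one (n_rows, dim) array and save it in binary form
embeddings = np.vstack(df["product_embedding"].tolist())
np.save(npy_path, embeddings)

# Save the remaining columns as CSV; row order matches the embedding array
df.drop(columns=["product_embedding"]).to_csv(csv_path, index=False)

# Reload both files and re-attach the vectors by row position
df_loaded = pd.read_csv(csv_path)
df_loaded["product_embedding"] = list(np.load(npy_path))
print(df_loaded["product_embedding"].iloc[0])
```

This relies on the CSV and the array keeping the same row order, so neither file should be filtered or sorted independently of the other.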
