# Notebook 17 - Expand Recipe Coverage with Ontology Variants

## Purpose
This notebook improves recipe matching recall by propagating variant forms of ingredient and product concepts, such as plurals and common synonyms. It expands the ontology to account for linguistic diversity in recipe and product descriptions.

## Objectives
- Generate a variant mapping (e.g., "strawberries" -> "strawberry")
- Enrich recipes and products with variant-aware concept annotations
- Save enhanced outputs for downstream fuzzy and semantic matching

## Inputs
- ontology_expanded.csv - Canonical concept mappings (from Notebook 16)
- recipes_with_ontology.csv - Recipe ingredients tagged with concepts
- products_with_priority.csv - Products tagged with concepts

## Outputs
- recipes_with_variants.csv - Recipes annotated with variant-aware concepts
- products_with_variants.csv - Products annotated with variant-aware concepts
- variant_map.csv - Saved variant->concept map with type metadata
- Console previews of enriched recipe/product rows and coverage statistics


In [1]:
import os
import pandas as pd
import numpy as np

# Define folders
input_folder = "cleaned_data"
output_folder = "variant_exports"
os.makedirs(output_folder, exist_ok=True)

# File paths
ontology_file = os.path.join("ontology_exports", "ontology_expanded.csv")
recipes_file = os.path.join(input_folder, "recipes_with_ontology.csv")
products_file = os.path.join(input_folder, "products_with_priority.csv")


In [2]:
df_ontology = pd.read_csv(ontology_file)
df_recipes = pd.read_csv(recipes_file)
df_products = pd.read_csv(products_file)

print("Loaded:")
print(f"- Ontology: {df_ontology.shape}")
print(f"- Recipes: {df_recipes.shape}")
print(f"- Products: {df_products.shape}")


Loaded:
- Ontology: (6, 2)
- Recipes: (6, 6)
- Products: (126919, 35)




We manually define a small variant dictionary covering common surface-form deviations (plural forms, compound nouns, synonyms).
These hand-written entries stand in for the kind of generalization a GPT-style model would otherwise provide.


In [4]:
manual_variants = {
    "strawberry": ["strawberries"],
    "honey": ["flower honey", "organic honey"],
    "tomato": ["tomatoes", "roma tomato"],
    "tuna": ["canned tuna", "tuna chunks"]
}
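Plural variants such as "strawberries" and "tomatoes" could also be generated by rule instead of typed by hand. The sketch below is a rough illustration only: `naive_plural` is a hypothetical helper, not part of this notebook, and its suffix rules ignore irregular nouns and over-generate for some words ending in "o".

```python
def naive_plural(noun: str) -> str:
    """Rough English pluralization (hypothetical helper; irregular nouns are not handled)."""
    vowels = "aeiou"
    # box -> boxes, peach -> peaches
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # consonant + y: strawberry -> strawberries
    if len(noun) > 1 and noun[-1] == "y" and noun[-2] not in vowels:
        return noun[:-1] + "ies"
    # consonant + o: tomato -> tomatoes (wrong for words like "avocado")
    if len(noun) > 1 and noun[-1] == "o" and noun[-2] not in vowels:
        return noun + "es"
    return noun + "s"


concepts = ["strawberry", "tomato", "tuna"]
generated_variants = {c: [naive_plural(c)] for c in concepts}
print(generated_variants)
```

Generated forms would still need a manual review before being merged into `manual_variants`.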


This step creates both:
- variant_map: a dict that maps each variant to its canonical concept
- df_variants: a tabular representation of the same pairs for inspection and logging


In [5]:
variant_pairs = []
variant_map = {}

for concept, variants in manual_variants.items():
    for alt in variants:
        variant_pairs.append((concept, alt))
        variant_map[alt] = concept

df_variants = pd.DataFrame(variant_pairs, columns=["concept", "variant"])
print("Manual variant definitions:")
display(df_variants)


Manual variant definitions:


      concept        variant
0  strawberry   strawberries
1       honey   flower honey
2       honey  organic honey
3      tomato       tomatoes
4      tomato    roma tomato
5        tuna    canned tuna
6        tuna    tuna chunks
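Because `variant_map` is a plain dict, a variant listed under two different concepts would silently keep only the last concept assigned. A minimal sketch of a check for that case is shown here; `find_conflicts` and the `example` data are illustrative assumptions, not part of this notebook.

```python
from collections import defaultdict


def find_conflicts(variants_by_concept):
    """Return {variant: [concepts]} for variants claimed by more than one concept."""
    owners = defaultdict(set)
    for concept, variants in variants_by_concept.items():
        for alt in variants:
            # Compare case-insensitively and ignore surrounding whitespace
            owners[alt.strip().lower()].add(concept)
    return {alt: sorted(cs) for alt, cs in owners.items() if len(cs) > 1}


# Hypothetical input in which "roma tomato" is listed under two concepts
example = {
    "tomato": ["tomatoes", "roma tomato"],
    "tomato sauce": ["Roma Tomato"],
}
conflicts = find_conflicts(example)
print(conflicts)
```

Running such a check before building the map makes ambiguous entries visible instead of letting dict assignment resolve them by insertion order.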


In [11]:
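# Annotate variants with type (e.g., plural vs. synonym)
# Note: the trailing-"s" check is a rough heuristic; it also labels multi-word
# variants such as "tuna chunks" as plural.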
variant_type_rows = []
for concept, variants in manual_variants.items():
    for variant in variants:
        vtype = "plural" if variant.endswith("s") else "synonym"
        variant_type_rows.append((concept, variant, vtype))

df_variants = pd.DataFrame(variant_type_rows, columns=["concept", "variant", "variant_type"])
print("Annotated variant definitions:")
display(df_variants)


Annotated variant definitions:


      concept        variant  variant_type
0  strawberry   strawberries        plural
1       honey   flower honey       synonym
2       honey  organic honey       synonym
3      tomato       tomatoes        plural
4      tomato    roma tomato       synonym
5        tuna    canned tuna       synonym
6        tuna    tuna chunks        plural
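In the table above, "tuna chunks" is labeled plural only because it ends in "s". One possible refinement, sketched here under the assumption that a plural is the concept itself plus a regular suffix, compares each variant against candidate plural forms of its concept. Both `plural_candidates` and `classify_variant` are hypothetical helpers, not part of this notebook.

```python
def plural_candidates(concept):
    """Regular plural spellings of a concept (irregular plurals are not covered)."""
    candidates = {concept + "s", concept + "es"}
    if concept.endswith("y"):
        candidates.add(concept[:-1] + "ies")
    return candidates


def classify_variant(concept, variant):
    """Label a variant 'plural' only if it is a regular plural of its own concept."""
    return "plural" if variant in plural_candidates(concept) else "synonym"


pairs = [
    ("strawberry", "strawberries"),
    ("tomato", "tomatoes"),
    ("tuna", "tuna chunks"),
    ("honey", "organic honey"),
]
for concept, variant in pairs:
    print(concept, variant, classify_variant(concept, variant))
```

With this rule, multi-word variants fall into the synonym bucket unless they are literally the pluralized concept.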


In [12]:
# Save variant map for inspection and downstream reuse
variant_table_out = os.path.join(output_folder, "variant_map.csv")
df_variants.to_csv(variant_table_out, index=False)
print(f"Variant map saved to: {variant_table_out}")


Variant map saved to: variant_exports\variant_map.csv
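For downstream reuse, the saved table can be turned back into a variant-to-concept dict. The sketch below uses only the standard library and an in-memory stand-in for the file; the column names follow `df_variants`, while `load_variant_map` and `csv_text` are illustrative assumptions.

```python
import csv
import io

# Stand-in for the contents of variant_map.csv (same columns as df_variants)
csv_text = """concept,variant,variant_type
strawberry,strawberries,plural
honey,flower honey,synonym
tomato,roma tomato,synonym
"""


def load_variant_map(handle):
    """Read concept/variant rows and return a variant -> concept dict."""
    return {row["variant"]: row["concept"] for row in csv.DictReader(handle)}


loaded_map = load_variant_map(io.StringIO(csv_text))
print(loaded_map["flower honey"])
```

With a real file, the same function could be given `open(path, newline="", encoding="utf-8")` in place of the `StringIO` object.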


This step enriches the recipe and product data by looking up, for each row's concept, the set of known variants.
The matched variants are stored in a new column to support flexible downstream matching.


In [13]:
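# Invert variant_map once: concept -> set of its variants
concept_to_variants = {}
for variant, concept in variant_map.items():
    concept_to_variants.setdefault(concept, set()).add(variant)

def variants_for(concept):
    # Fresh set per row; empty when the concept is missing or has no variants
    return set(concept_to_variants.get(concept, ())) if pd.notna(concept) else set()

# Match recipe ingredient concepts to variants
df_recipes["ingredient_variants"] = df_recipes["ingredient_concept"].map(variants_for)

# Match product concepts to variants
df_products["product_variants"] = df_products["product_concept"].map(variants_for)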


In [14]:
print("Recipes with variant matches:")
print(df_recipes[df_recipes["ingredient_variants"].apply(bool)][["ingredient_concept", "ingredient_variants"]])

print("\nProducts with variant matches:")
print(df_products[df_products["product_variants"].apply(bool)][["product_concept", "product_variants"]].head())


Recipes with variant matches:
  ingredient_concept            ingredient_variants
3              honey  {flower honey, organic honey}
4             tomato        {roma tomato, tomatoes}
5               tuna     {canned tuna, tuna chunks}

Products with variant matches:
       product_concept               product_variants
119042           honey  {flower honey, organic honey}
126530          tomato        {roma tomato, tomatoes}


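We create new columns that hold the canonical concept name: any value that is itself a known variant is mapped back to its concept, and all other values pass through unchanged. These columns feed the downstream fuzzy and exact matching pipelines.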


In [15]:
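# variant_map is already keyed by variant (variant -> concept); inverting it
# would map each concept to one of its variants instead of resolving variants.
variant_to_concept = variant_map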

def resolve_variant(concept):
    if pd.isna(concept):
        return concept
    return variant_to_concept.get(concept, concept)

df_recipes["ingredient_concept_variant"] = df_recipes["ingredient_concept"].apply(resolve_variant)
df_products["product_concept_variant"] = df_products["product_concept"].apply(resolve_variant)

print("Variants resolved into enriched concept columns.")


Variants resolved into enriched concept columns.
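Exact dictionary lookup only resolves a variant when casing and spacing match the stored key. As a hedged illustration, the lookup can be preceded by a light normalization step; `normalize` and `resolve_text` are hypothetical helpers, and `toy_map` is a small stand-in for `variant_map`.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace into single spaces."""
    return " ".join(text.lower().split())


def resolve_text(text, variant_to_concept):
    """Map a raw string to its canonical concept, or return the normalized text."""
    key = normalize(text)
    return variant_to_concept.get(key, key)


# Small stand-in for variant_map (variant -> concept)
toy_map = {
    "strawberries": "strawberry",
    "roma tomato": "tomato",
    "canned tuna": "tuna",
}

for raw in ["  Roma   Tomato ", "STRAWBERRIES", "Basil"]:
    print(repr(raw), "->", resolve_text(raw, toy_map))
```

Unknown strings come back normalized rather than unchanged, which is a design choice of this sketch and may not suit every pipeline.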


In [17]:
# Final variant coverage summary
num_recipe_variants = df_recipes["ingredient_variants"].apply(bool).sum()
num_product_variants = df_products["product_variants"].apply(bool).sum()

print("Final coverage:")
print(f"- Recipes with variants: {num_recipe_variants} / {len(df_recipes)}")
print(f"- Products with variants: {num_product_variants} / {len(df_products)}")


Final coverage:
- Recipes with variants: 3 / 6
- Products with variants: 2 / 126919
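The raw counts are easier to compare across datasets of very different sizes when expressed as percentages. A small sketch, using the counts printed above and a hypothetical `coverage_pct` helper:

```python
def coverage_pct(matched, total):
    """Share of rows with at least one variant, in percent (0.0 for an empty table)."""
    return 100.0 * matched / total if total else 0.0


recipe_pct = coverage_pct(3, 6)
product_pct = coverage_pct(2, 126919)

print(f"Recipes:  {recipe_pct:.2f}%")
print(f"Products: {product_pct:.4f}%")
```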


In [16]:
recipes_out = os.path.join(output_folder, "recipes_with_variants.csv")
products_out = os.path.join(output_folder, "products_with_variants.csv")

df_recipes.to_csv(recipes_out, index=False)
df_products.to_csv(products_out, index=False)

print("Files saved:")
print(f"- Recipes: {recipes_out}")
print(f"- Products: {products_out}")


Files saved:
- Recipes: variant_exports\recipes_with_variants.csv
- Products: variant_exports\products_with_variants.csv
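One caveat for downstream readers of these files: as far as pandas' default behavior goes, `to_csv` writes each set in the variant columns using its string form, so reading the CSV back yields text rather than a set. A possible workaround, sketched here with hypothetical `encode_variants` and `decode_variants` helpers, is to store the variants as a delimited string and parse it on load.

```python
def encode_variants(variants):
    """Join a set of variants into one stable, pipe-delimited string."""
    return "|".join(sorted(variants))


def decode_variants(cell):
    """Parse the delimited string back into a set; blanks and NaN become an empty set."""
    if not isinstance(cell, str) or not cell:
        return set()
    return set(cell.split("|"))


original = {"flower honey", "organic honey"}
encoded = encode_variants(original)
print(encoded)
print(decode_variants(encoded) == original)
```

The pipe delimiter is an assumption that holds only while no variant contains a "|" character.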
