# Notebook 16 - Expand Ontology with Semantic and LLM-based Matching

## Purpose
This notebook improves the concept ontology used for matching products to recipes by resolving inconsistencies, rare expressions, and ambiguous terms. It employs two complementary techniques:

- **Semantic Embedding Matching**: Leverages multilingual Sentence-BERT to detect concept-level similarities between terms, accounting for linguistic variation and phrasing.
- **LLM-based Expansion**: Uses GPT to suggest generalized concepts for noisy, branded, or compound entries. When no API key is available, a hand-written mapping simulates the GPT output, as in the run recorded here.

Together, these methods are intended to widen ontology coverage, make matching more robust to multilingual and brand-specific variation, and keep the matching pipeline accurate as the catalog grows.

## Inputs
- `ontology_manual.csv`: Initial raw_term -> concept mappings, manually curated.
- `recipes_with_ontology.csv`: Recipe ingredients tagged with potentially incomplete or inconsistent concepts.
- `products_with_priority.csv`: Products tagged with product concepts and priority metadata.
- *(Optional)* OpenAI API key for GPT-powered expansion.
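
Later cells index these files by a handful of column names (`raw_term`, `concept`, `ingredient_concept`, `product_concept`). A minimal sketch of a header check that could run before loading is shown below. The `REQUIRED_COLUMNS` table and the `missing_columns` helper are hypothetical names introduced for illustration, and the column sets are inferred from how the notebook uses the frames, not from a documented schema.

```python
import csv
import io

# Columns each input is assumed to need, inferred from how later cells index the frames.
REQUIRED_COLUMNS = {
    "ontology_manual.csv": {"raw_term", "concept"},
    "recipes_with_ontology.csv": {"ingredient_concept"},
    "products_with_priority.csv": {"product_concept"},
}

def missing_columns(filename, handle):
    """Return the required columns absent from the CSV header read from `handle`."""
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS[filename] - present)

# Usage with an in-memory file standing in for a real CSV on disk.
sample = io.StringIO("raw_term,concept\nstrawberries,strawberry\n")
print(missing_columns("ontology_manual.csv", sample))
```

An empty list means every expected column is present; any names returned would otherwise surface later as a `KeyError` in the middle of the run.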

## Outputs
- `ontology_expanded.csv`: Final enriched ontology with updated concept assignments.
- `gpt_candidates.csv`: Table of flagged low-frequency or misaligned terms considered for concept normalization.
- Printed suggestions from Sentence-BERT and from GPT (or its manual substitute), for audit and refinement.


In [43]:
import os
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Optional GPT Expansion
try:
    import openai
    openai.api_key = os.getenv("OPENAI_API_KEY")
except ImportError:
    openai = None

# Define folders and files
input_folder = "cleaned_data"
output_folder = "ontology_exports"
os.makedirs(output_folder, exist_ok=True)

ontology_file = os.path.join(input_folder, "ontology_manual.csv")
recipes_file = os.path.join(input_folder, "recipes_with_ontology.csv")
products_file = os.path.join(input_folder, "products_with_priority.csv")


In [44]:
df_ontology = pd.read_csv(ontology_file)
df_recipes = pd.read_csv(recipes_file)
df_products = pd.read_csv(products_file)

print("Loaded:")
print(f"  Ontology: {df_ontology.shape}")
print(f"  Recipes: {df_recipes.shape}")
print(f"  Products: {df_products.shape}")


Loaded:
  Ontology: (6, 2)
  Recipes: (6, 6)
  Products: (126919, 35)




In [45]:
# Multilingual Sentence-BERT model used to embed the raw ontology terms
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

raw_terms = df_ontology["raw_term"].astype(str).tolist()
embeddings = model.encode(raw_terms, convert_to_tensor=True, show_progress_bar=True)

# Pairwise cosine similarity; zero the diagonal so a term never matches itself
cosine_sim = util.cos_sim(embeddings, embeddings).cpu().numpy()
np.fill_diagonal(cosine_sim, 0)

# For each term, keep its nearest neighbour only if it clears the threshold
threshold = 0.85
suggestions = []
for i, term in enumerate(raw_terms):
    j = np.argmax(cosine_sim[i])
    score = cosine_sim[i][j]
    if score >= threshold:
        suggestions.append((term, raw_terms[j], score))

# Note: "suggested_concept" holds the nearest raw term, not its mapped concept
df_semantic = pd.DataFrame(suggestions, columns=["term", "suggested_concept", "similarity"])
df_semantic = df_semantic.drop_duplicates("term")
print("Semantic suggestions generated:", len(df_semantic))


Semantic suggestions generated: 0
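
The run above produced no suggestions: with only six ontology terms, no nearest-neighbour pair reached the 0.85 threshold. The sketch below reproduces the same selection rule (best other term per row, kept only at or above the threshold) in plain Python so the effect of the threshold is easy to see. The 2-D vectors are invented for illustration and are not real Sentence-BERT embeddings; `cosine` and `nearest_above` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_above(terms, vectors, threshold):
    """For each term, return (term, best_other_term, score) when score >= threshold."""
    out = []
    for i, term in enumerate(terms):
        best_j, best = None, -1.0
        for j in range(len(terms)):
            if i == j:
                continue  # same role as zeroing the diagonal
            score = cosine(vectors[i], vectors[j])
            if score > best:
                best_j, best = j, score
        if best_j is not None and best >= threshold:
            out.append((term, terms[best_j], round(best, 3)))
    return out

# Toy vectors, invented for illustration only.
terms = ["tomato", "tomatoes", "tuna"]
vectors = [(1.0, 0.1), (0.9, 0.2), (0.0, 1.0)]
print(nearest_above(terms, vectors, 0.85))   # the two tomato variants pair up
print(nearest_above(terms, vectors, 0.999))  # a stricter threshold keeps nothing
```

Because the rule keeps only each term's single best neighbour, lowering the threshold adds more pairs but never more than one suggestion per term.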


In [46]:
# Terms whose raw form differs from the assigned concept (kept for inspection;
# not used in the candidate selection below)
df_mismatch = df_ontology[df_ontology["raw_term"].str.lower() != df_ontology["concept"].str.lower()]

# Count how often each concept appears across recipes and products
term_counts = pd.concat([
    df_recipes["ingredient_concept"],
    df_products["product_concept"]
]).dropna().value_counts()

# A term is "rare" if it appears at most twice; keep only those present in the ontology
rare_terms = term_counts[term_counts <= 2].index.tolist()
candidates = set(df_ontology["raw_term"]).intersection(rare_terms)
df_candidates = df_ontology[df_ontology["raw_term"].isin(candidates)].copy()

print(f"{len(df_candidates)} rare or inconsistent terms flagged for review.")
df_candidates.to_csv(os.path.join(output_folder, "gpt_candidates.csv"), index=False)


4 rare or inconsistent terms flagged for review.
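
The flagging rule above can be sketched without pandas to make its behavior explicit. The version below mirrors the rarity test (count of at most 2 among observed concepts) and adds an optional `include_mismatch` switch that also flags terms whose raw form differs from their concept. That switch is an assumption about the intent behind the "rare or inconsistent" message, not something the cell above does; `flag_candidates` and the toy data are hypothetical.

```python
from collections import Counter

def flag_candidates(ontology, observed_concepts, max_count=2, include_mismatch=False):
    """ontology: list of (raw_term, concept) pairs.
    observed_concepts: concept strings seen in recipes and products."""
    counts = Counter(c for c in observed_concepts if c)
    rare = {term for term, n in counts.items() if n <= max_count}
    flagged = []
    for raw_term, concept in ontology:
        is_rare = raw_term in rare
        is_mismatch = raw_term.lower() != concept.lower()
        if is_rare or (include_mismatch and is_mismatch):
            flagged.append(raw_term)
    return flagged

# Invented toy data for illustration.
ontology = [("strawberries", "strawberry"), ("olive oil", "oil"), ("tuna", "tuna")]
observed = ["strawberries", "tuna", "tuna", "oil", "oil", "oil"]
print(flag_candidates(ontology, observed))
print(flag_candidates(ontology, observed, include_mismatch=True))
```

One consequence of counting only observed values is that a term never seen in recipes or products has no count at all, so it is not treated as rare under this rule.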


In [47]:
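# Simulated normalization mappings (hand-written stand-in for GPT output)
manual_concepts = {
    "strawberries": "strawberry",
    "honey": "honey",
    "tomato": "tomato",
    "tuna": "tuna"
}

# Terms missing from the mapping keep their current concept instead of becoming NaN
df_candidates["suggested_concept"] = (
    df_candidates["raw_term"].map(manual_concepts).fillna(df_candidates["concept"])
)
print("Manual LLM-style suggestions:")
print(df_candidates[["raw_term", "suggested_concept"]])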


Manual LLM-style suggestions:
       raw_term suggested_concept
0  strawberries        strawberry
3         honey             honey
4        tomato            tomato
5          tuna              tuna
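
The manual mapping above stands in for a GPT call. One way to keep both paths behind a single function is sketched below: the model is passed in as a plain callable, so the same code runs with a real client, a stub, or no model at all. This is a sketch under stated assumptions. `build_prompt`, `suggest_concept`, and the `complete` parameter are hypothetical names, the prompt wording is invented, and no real OpenAI API call is shown.

```python
def build_prompt(term):
    """Prompt text asking for one generic concept (wording is illustrative only)."""
    return (
        "Return the single generic food concept, lowercase and singular, "
        f"for this ingredient or product name: {term!r}. Answer with the concept only."
    )

def suggest_concept(term, complete=None, fallback=None):
    """Suggest a concept for `term`.

    complete: hypothetical callable taking a prompt string and returning the
              model's text reply; None means no model is available.
    fallback: manual mapping used when there is no model or the reply is empty.
    """
    fallback = fallback or {}
    if complete is not None:
        reply = complete(build_prompt(term)).strip().lower()
        if reply:
            return reply
    # No model or empty reply: use the manual mapping, else keep the term unchanged.
    return fallback.get(term, term)

manual = {"strawberries": "strawberry"}
fake_llm = lambda prompt: " Strawberry \n"  # stub standing in for a model reply

print(suggest_concept("strawberries", complete=fake_llm))
print(suggest_concept("honey", fallback=manual))
```

Injecting the callable keeps the notebook runnable without an API key and makes the suggestions easy to audit, since the stub or the mapping can be swapped in for review.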


In [48]:
# Merge updated mappings into the original ontology.
# Terms without a (non-null) suggestion keep their existing concept.
update_map = dict(zip(df_candidates["raw_term"], df_candidates["suggested_concept"]))
df_ontology["concept"] = (
    df_ontology["raw_term"].map(update_map).fillna(df_ontology["concept"])
)


In [49]:
expanded_file = os.path.join(output_folder, "ontology_expanded.csv")
df_ontology.to_csv(expanded_file, index=False)
print(f"Final ontology exported to: {expanded_file}")


Final ontology exported to: ontology_exports\ontology_expanded.csv
