# Notebook 08 - Enrich Waste and Markdown with Concepts

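This notebook prepares the markdown (discount) and waste snapshot data for alignment with the ontology used in the main recipe-product pipeline. We normalize the product names in the waste data and map them to standardized food concepts for concept-level reasoning and planning. The markdown file has no product-name column, so it is loaded and exported without a concept column.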

Inputs:
- `Dirk data/2025-01-24T07_03_00+00_00zero waste lab Mark Down 2025-01-24.xlsx`
- `Dirk data/2025-01-24T07_01_23+00_00zero waste lab Mark Down_Waste 2025-01-24.xlsx`

Outputs:
- `cleaned_data/markdown_with_concept.csv`
- `cleaned_data/waste_with_concept.csv`


In [2]:
import pandas as pd
import os

# Set folders
raw_data_folder = r"C:\Users\User\Desktop\University\BEP\Data\Dirk data"
output_folder = "cleaned_data"

# File paths
waste_file = os.path.join(raw_data_folder, "2025-01-24T07_01_23+00_00zero waste lab Mark Down_Waste 2025-01-24.xlsx")
markdown_file = os.path.join(raw_data_folder, "2025-01-24T07_03_00+00_00zero waste lab Mark Down 2025-01-24.xlsx")

# Load the Excel files; skiprows=2 skips the two rows above the header row (adjust if the layout changes)
df_waste = pd.read_excel(waste_file, skiprows=2)
df_markdown = pd.read_excel(markdown_file, skiprows=2)

# Preview dimensions
print("Waste:", df_waste.shape)
print("Markdown:", df_markdown.shape)

# Show column names to confirm correct header parsing
print("Waste columns:", df_waste.columns.tolist())
print("Markdown columns:", df_markdown.columns.tolist())


Waste: (18382, 14)
Markdown: (5605, 9)
Waste columns: ['Store', 'Date', 'Article', 'Unnamed: 3', 'Product name', 'Brand', 'Content', 'Eenheid CE', 'Supplier', 'Unnamed: 9', 'Content category', 'Waste reason', 'Items wasted', 'Value wasted']
Markdown columns: ['Filiaal', 'Date', 'Time', 'Article', 'Unnamed: 4', 'Discount percentage', 'Regular price', 'Pakking price', 'total amount discounted']


In [3]:
# Normalize product names from waste (markdown doesn't contain names).
# The "string" dtype keeps missing names as <NA> instead of the literal text "nan".
df_waste["product_name_clean"] = df_waste["Product name"].astype("string").str.strip().str.lower()

# Preview cleaned names
display(df_waste[["Product name", "product_name_clean"]].head(5))


| | Product name | product_name_clean |
|---|---|---|
| 0 | Salade Surinaamse ei | salade surinaamse ei |
| 1 | Roomyoghurt Spaanse sinaasappel | roomyoghurt spaanse sinaasappel |
| 2 | Rundvleesslaatje | rundvleesslaatje |
| 3 | Grillworst kip | grillworst kip |
| 4 | Grillworst | grillworst |
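
The `.str.strip().str.lower()` step above leaves internal double spaces and accented spellings untouched, which can cause near-identical names to miss an exact-match lookup. A minimal sketch of a stricter normalizer, assuming that collapsing whitespace and dropping diacritics is acceptable for matching; `normalize_name` is a hypothetical helper and not part of the existing pipeline:

```python
import re
import unicodedata

def normalize_name(name):
    """Hypothetical helper: strip accents, collapse whitespace, lowercase."""
    if not isinstance(name, str):
        return None  # keep missing values missing
    # Decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace, trim the ends, lowercase
    return re.sub(r"\s+", " ", without_marks).strip().lower()

print(normalize_name("  Crème   Fraîche "))  # prints "creme fraiche"
```

It could be applied with `df_waste["Product name"].map(normalize_name)`; whether accent stripping is desirable depends on how the ontology keys are spelled.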


### Align Waste Items to Canonical Food Concepts

We now map the cleaned `product_name_clean` column in the waste dataset to our food ontology. This enables concept-level reasoning across recipes, products, and waste.


In [7]:
# Canonical ontology mapping used throughout the project
ontology = {
    # Fruits
    "aardbeien": "strawberries",
    "strawberries": "strawberries",
    "bananen": "banana",
    "banana": "banana",

    # Dairy
    "volle yoghurt": "yogurt",
    "magere yoghurt": "yogurt",
    "roeryoghurt": "yogurt",
    "kwark aardbei": "yogurt",
    "yogurt": "yogurt",

    # Sweeteners
    "bloemenhoning": "honey",
    "mosterd honing": "honey",
    "honey": "honey",

    # Vegetables
    "tomaten": "tomato",
    "tomatengroentesoep": "tomato",
    "tomato": "tomato",

    # Fish
    "tonijn": "tuna",
    "tuna": "tuna",

    # Fallbacks
    "wolkentoetje banaan": "banana"
}

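# Ontology mapping function (exact match on the normalized name)
def map_to_ontology(text, mapping):
    """Return the canonical concept for `text`, or None if there is no exact match."""
    if isinstance(text, str):
        return mapping.get(text.lower().strip())
    return None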

# Apply to waste data
df_waste["product_concept"] = df_waste["product_name_clean"].apply(lambda x: map_to_ontology(x, ontology))

# Preview result
display(df_waste[["product_name_clean", "product_concept"]].dropna().drop_duplicates().head(10))


| | product_name_clean | product_concept |
|---|---|---|
| 932 | kwark aardbei | yogurt |
| 1390 | volle yoghurt | yogurt |
| 9437 | mosterd honing | honey |
| 12067 | roeryoghurt | yogurt |
| 14437 | aardbeien | strawberries |
| 16137 | magere yoghurt | yogurt |
| 17517 | wolkentoetje banaan | banana |
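
The preview shows only seven distinct names that matched the ontology exactly, so it is worth measuring how much of the waste data actually received a concept and which unmapped names occur most often. A minimal sketch on a toy frame; `concept_coverage` is a hypothetical helper and the sample rows are illustrative, not taken from the real files:

```python
import pandas as pd

def concept_coverage(df, name_col="product_name_clean",
                     concept_col="product_concept", top=5):
    """Hypothetical helper: share of mapped rows and the most frequent unmapped names."""
    mapped_share = df[concept_col].notna().mean()
    unmapped = df.loc[df[concept_col].isna(), name_col].value_counts().head(top)
    return mapped_share, unmapped

# Illustrative rows only
toy = pd.DataFrame({
    "product_name_clean": ["volle yoghurt", "grillworst", "grillworst", "aardbeien"],
    "product_concept": ["yogurt", None, None, "strawberries"],
})
share, unmapped = concept_coverage(toy)
print(f"Mapped rows: {share:.0%}")
print(unmapped)
```

Running the same helper on `df_waste` would give a ranked list of candidate names to add to the ontology.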


In [6]:
# Save processed versions (create the output folder if it does not exist yet)
os.makedirs(output_folder, exist_ok=True)
waste_out = os.path.join(output_folder, "waste_with_concept.csv")
markdown_out = os.path.join(output_folder, "markdown_with_concept.csv")

df_waste.to_csv(waste_out, index=False)
df_markdown.to_csv(markdown_out, index=False)

print("Saved waste to:", waste_out)
print("Saved markdown to:", markdown_out)


Saved waste to: cleaned_data\waste_with_concept.csv
Saved markdown to: cleaned_data\markdown_with_concept.csv
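
The markdown file has no product-name column, so `markdown_with_concept.csv` is written without a `product_concept` column. One way to carry concepts over is to look them up by article number. A minimal sketch, assuming the `Article` values in both files identify the same products (this is not verified here); `attach_concepts_by_article` is a hypothetical helper and the toy rows are illustrative:

```python
import pandas as pd

def attach_concepts_by_article(markdown, waste, key="Article",
                               concept_col="product_concept"):
    """Hypothetical helper: copy concepts from waste rows onto markdown rows via a shared key."""
    lookup = (
        waste.dropna(subset=[concept_col])
             .drop_duplicates(subset=key)   # keep the first concept seen per article
             .set_index(key)[concept_col]
    )
    out = markdown.copy()
    out[concept_col] = out[key].map(lookup)  # articles without a concept stay missing
    return out

# Illustrative rows only
toy_waste = pd.DataFrame({
    "Article": [101, 101, 202, 303],
    "product_concept": ["yogurt", "yogurt", None, "banana"],
})
toy_markdown = pd.DataFrame({
    "Article": [101, 202, 404],
    "Discount percentage": [35, 35, 50],
})
enriched = attach_concepts_by_article(toy_markdown, toy_waste)
print(enriched)
```

In this notebook the call would have to run before the export cell so that the saved markdown file includes the new column.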


### Summary

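This notebook enriched the waste snapshot by aligning product names to standardized food concepts from the central ontology used throughout the matching pipeline. The markdown snapshot was loaded and exported alongside it, but it has no product names and therefore carries no concept column yet.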

Key tasks completed:
- Cleaned raw product names from the waste dataset (trimmed and lowercased)
- Mapped exactly matching names (e.g. "kwark aardbei") to canonical concepts (e.g. "yogurt")
- Enabled concept-level linking across waste, products, and recipes; the markdown data has no product names, so it is not yet linked to concepts
- Exported enriched data to `cleaned_data/` for downstream prioritization

This prepares the foundation for Notebook 09, where we integrate concept-level waste and markdown signals into store-specific meal box planning.
