# Notebook 03 - Merge Metadata and Recipes

This notebook combines the cleaned sales, waste, delivery, and store datasets into a unified product metadata table. It also constructs a small mock recipe dataset to support testing of downstream matching pipelines.

Outputs:
- Enriched product table with sales, waste, delivery, and store info
- Mock recipe file (or real recipes if available)


In [1]:
import pandas as pd
import os

# Input location
input_folder = "cleaned_data"

# Load cleaned CSVs
df_waste = pd.read_csv(os.path.join(input_folder, "waste_snapshot_cleaned.csv"))
df_sales = pd.read_csv(os.path.join(input_folder, "sales_snapshot_cleaned.csv"))
df_delivery = pd.read_csv(os.path.join(input_folder, "delivery_snapshot_cleaned.csv"))
df_naw = pd.read_csv(os.path.join(input_folder, "store_metadata_cleaned.csv"))


## Merge Strategy

We enrich the product data in three steps:
- Merge `df_sales` and `df_waste` on (`store`, `article`) with an outer join, so articles that appear in only one of the two tables are kept
- Left-merge `df_delivery` on (`store`, `article`) to append `delivered_quantity`
- Left-merge `df_naw` on `store` to add store name and city metadata

The merged output will contain:
- Unique products per store
- Sales and waste context
- Delivery counts
- Location information
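The effect of this join order can be seen on a toy example. The sketch below uses made-up frames (`sales`, `waste`, `delivery`, `stores` and their values are illustrative assumptions, not the real snapshots) and only shows how the outer join followed by two left joins shapes the result.

```python
import pandas as pd

# Toy stand-ins for the cleaned snapshots (all values are made up)
sales = pd.DataFrame({"store": [1, 1], "article": ["a", "b"], "items_sold": [5, 2]})
waste = pd.DataFrame({"store": [1, 1], "article": ["b", "c"], "items_wasted": [1, 4]})
delivery = pd.DataFrame({"store": [1], "article": ["a"], "delivered_quantity": [6]})
stores = pd.DataFrame({"store": [1], "city": ["Leiden"]})

# Outer join keeps articles that appear in only one of sales / waste
toy = pd.merge(sales, waste, on=["store", "article"], how="outer")

# Left joins add columns without adding or dropping product rows,
# provided the right-hand keys are unique
toy = pd.merge(toy, delivery, on=["store", "article"], how="left")
toy = pd.merge(toy, stores, on="store", how="left")

print(toy)
```

Article `a` has sales but no waste, `c` has waste but no sales, and only `a` has a delivery, so the missing side of each join shows up as `NaN`.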


In [2]:
# Merge sales and waste on store + article
df_products = pd.merge(df_sales, df_waste, on=["store", "article"], how="outer", suffixes=("_sales", "_waste"))

print("Merged sales + waste:", df_products.shape)
display(df_products.head())


Merged sales + waste: (126919, 22)


Unnamed: 0,store,date_sales,article,product_category,discount_flag,promotion,price_theoretical,price_sold,items_sold,volume_sold,...,product_name,brand,content,unit,supplier,content_category,waste_reason,items_wasted,value_wasted,product_name_clean
0,1015,2025-01-15,10040,20.36.01,0.0,no,13.89,13.89,116.0,116.0,...,,,,,,,,,,
1,1015,2025-01-15,100966,25.14.07,0.0,no,1.29,1.29,17.0,17.0,...,,,,,,,,,,
2,1015,2025-01-15,101093,20.22.21,0.0,no,2.95,2.95,2.0,2.0,...,,,,,,,,,,
3,1015,2025-01-15,101109,20.37.30,0.0,no,4.99,4.99,23.0,23.0,...,,,,,,,,,,
4,1015,2025-01-15,101330,20.12.04,0.0,no,1.99,1.99,1.0,1.0,...,,,,,,,,,,
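The empty waste columns in the preview are expected after an outer join: those rows have sales but no waste record. One way to quantify this is the `indicator` option of `pd.merge`. The following is a minimal sketch on made-up data (the frames and values are assumptions for illustration), not a count taken from the real snapshots.

```python
import pandas as pd

# Made-up frames that mimic the (store, article) key structure
sales = pd.DataFrame({"store": [1, 1], "article": [10, 20], "items_sold": [5, 2]})
waste = pd.DataFrame({"store": [1, 1], "article": [20, 30], "items_wasted": [1, 4]})

# indicator=True adds a `_merge` column: left_only, right_only, or both
merged = pd.merge(sales, waste, on=["store", "article"], how="outer", indicator=True)

# Count how many rows come from sales only, waste only, or both
coverage = merged["_merge"].astype(str).value_counts().to_dict()
print(coverage)
```

Applied to `df_sales` and `df_waste`, the same call would show how many store-article pairs have sales only, waste only, or both.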


In [4]:
# Ensure both keys are of the same type before merging
df_products["article"] = df_products["article"].astype(str)
df_delivery["article"] = df_delivery["article"].astype(str)

# Perform the merge
df_products = pd.merge(df_products, df_delivery, on=["store", "article"], how="left")

print("Merged with deliveries:", df_products.shape)
display(df_products[["store", "article", "delivered_quantity"]].dropna().head())


Merged with deliveries: (126919, 24)


Unnamed: 0,store,article,delivered_quantity
1,1015,100966,42.0
3,1015,101109,6.0
4,1015,101330,6.0
11,1015,102757,6.0
12,1015,102759,10.0
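The row count is unchanged after this merge (126919), which suggests the delivery table has at most one row per (`store`, `article`). If that ever stops being true, a left merge would silently duplicate product rows. The sketch below shows one way to guard against that, using made-up data; summing quantities per key is an assumption about what a duplicate would mean, not something the source data confirms.

```python
import pandas as pd

products = pd.DataFrame({"store": [1, 1], "article": ["a", "b"]})
# Made-up delivery table with a duplicated (store, article) key
delivery = pd.DataFrame({
    "store": [1, 1, 1],
    "article": ["a", "a", "b"],
    "delivered_quantity": [6, 4, 10],
})

key = ["store", "article"]

# Count rows whose key already appeared earlier in the table
n_dupes = int(delivery.duplicated(subset=key).sum())
print("Duplicate delivery keys:", n_dupes)

# One possible resolution (assumption): total delivered quantity per key
delivery_agg = delivery.groupby(key, as_index=False)["delivered_quantity"].sum()

# validate="many_to_one" raises MergeError if the right-hand keys are not unique
merged = pd.merge(products, delivery_agg, on=key, how="left", validate="many_to_one")
print(merged)
```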


In [5]:
# Ensure 'store' is of consistent type before merge
df_products["store"] = df_products["store"].astype(int)
df_naw["store"] = df_naw["store"].astype(int)

# Merge in store location and address info
df_products = pd.merge(df_products, df_naw, on="store", how="left")

print("Merged with store metadata:", df_products.shape)
display(df_products[["store", "store_name", "city"]].drop_duplicates().head())


Merged with store metadata: (126919, 28)


Unnamed: 0,store,store_name,city
0,1015,Katwijk Visserijkade,Katwijk
5569,1024,Sassenheim Wasbeekerlaan,Sassenheim
10124,1032,Noordwijk Raadhuisstraat,Noordwijk
15457,1040,Oude Wetering Meerkreuk,Oude Wetering
19672,1058,Leiden Langegracht,Leiden


In [6]:
# Export enriched product-level dataset
output_path = os.path.join(input_folder, "products_full.csv")
df_products.to_csv(output_path, index=False)

print("-> Merged product metadata saved to:", output_path)


-> Merged product metadata saved to: cleaned_data\products_full.csv


## Create Mock Recipe Dataset

We define a small set of example recipes and ingredients to support development and testing of the product-to-recipe matching logic. These mock recipes can later be replaced by real data from a structured recipe source.


In [7]:
# Example test recipes for matching experiments
mock_recipes = [
    {"recipe": "Strawberry Smoothie", "ingredient": "strawberries"},
    {"recipe": "Banana Yogurt Bowl", "ingredient": "banana"},
    {"recipe": "Greek Yogurt & Honey", "ingredient": "yogurt"},
    {"recipe": "Honey Glazed Carrots", "ingredient": "honey"},
    {"recipe": "Pasta with Tomato Sauce", "ingredient": "tomato"},
    {"recipe": "Tuna Sandwich", "ingredient": "tuna"}
]

df_mock_recipes = pd.DataFrame(mock_recipes)

# Save for downstream use
recipe_path = os.path.join(input_folder, "mock_recipes.csv")
df_mock_recipes.to_csv(recipe_path, index=False)

print("-> Mock recipes saved to:", recipe_path)
df_mock_recipes


-> Mock recipes saved to: cleaned_data\mock_recipes.csv


Unnamed: 0,recipe,ingredient
0,Strawberry Smoothie,strawberries
1,Banana Yogurt Bowl,banana
2,Greek Yogurt & Honey,yogurt
3,Honey Glazed Carrots,honey
4,Pasta with Tomato Sauce,tomato
5,Tuna Sandwich,tuna
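To give an idea of how these mock recipes might be used downstream, here is a deliberately naive sketch that matches each ingredient against product names by case-insensitive substring search. The product names below are hypothetical, and using `product_name_clean` as the match column is an assumption; the actual matching logic belongs to later notebooks and may differ (for example, if product names are not in English).

```python
import pandas as pd

recipes = pd.DataFrame([
    {"recipe": "Banana Yogurt Bowl", "ingredient": "banana"},
    {"recipe": "Tuna Sandwich", "ingredient": "tuna"},
])

# Hypothetical product names; the real values live in products_full.csv
products = pd.DataFrame({
    "article": ["1", "2", "3"],
    "product_name_clean": ["banana fairtrade", "tuna in olive oil", "whole milk"],
})

matches = []
for row in recipes.itertuples(index=False):
    # Plain substring match (no regex), ignoring case and missing names
    hit = products["product_name_clean"].str.contains(
        row.ingredient, case=False, regex=False, na=False
    )
    for article in products.loc[hit, "article"]:
        matches.append(
            {"recipe": row.recipe, "ingredient": row.ingredient, "article": article}
        )

df_matches = pd.DataFrame(matches)
print(df_matches)
```

Substring matching is easy to reason about but will miss plurals and synonyms (`strawberries` versus `strawberry`), which is one reason a dedicated matching step is needed.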


## Summary and Outputs

This notebook created a unified dataset combining:

- Sales data
- Waste data
- Delivery quantities
- Store metadata

Sales, waste, and delivery records are linked by (`store`, `article`), and store metadata is joined on `store`, giving enriched context per item. We also created a lightweight `mock_recipes.csv` file to support development of the recipe-product matching system.

### Saved files:
- `products_full.csv`: Enriched product table
- `mock_recipes.csv`: Sample recipes for matching experiments

Downstream notebooks can use these outputs to match ingredient names against store-specific product metadata.
