# Notebook 01 - Preview Raw Data

This notebook loads and examines all raw Excel snapshot files provided for the BEP project. It provides a structural overview of each file, prints sheet names and column headers, and identifies relevant columns to inform further preprocessing steps.

The structure and cleanliness of these raw datasets will guide how we prepare them for downstream use in semantic and ontology-based matching pipelines.


In [14]:
import os

import pandas as pd
from IPython.display import display  # explicit import so the helper also runs outside a notebook

def preview_excel(file_path, sheet_index=0, skiprows=0):
    """
    Load one sheet of an Excel file, skipping any rows above the header, and print a short preview.

    Parameters:
    - file_path (str): Path to the Excel file
    - sheet_index (int): Index of the sheet to parse (default: 0)
    - skiprows (int): Number of rows to skip before header (default: 0)

    Prints:
    - File name
    - Sheet names
    - DataFrame shape and column names
    - First 3 rows of data
    """
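    try:
        # The context manager closes the workbook handle once the sheet is parsed.
        with pd.ExcelFile(file_path) as xl:
            print(f"\n{'='*60}")
            print("File:", os.path.basename(file_path))
            print("Sheet names:", xl.sheet_names)

            sheet_name = xl.sheet_names[sheet_index]
            # skiprows drops the leading rows so the real header row is used for column names.
            df = xl.parse(sheet_name, skiprows=skiprows)
        print("Shape:", df.shape)
        print("Columns:", list(df.columns))
        display(df.head(3))
    except Exception as e:
        # Report the problem and let the caller continue with the remaining files.
        print("Error previewing file:", file_path)
        print(e)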


## Snapshot Data Files

The files previewed here are all part of the "snapshot" provided in the BEP data bundle (one-day data). These include waste, sales, discount, delivery, and store metadata files. Each file will be previewed to examine its format and contents.


In [17]:
# Each tuple: (filename, skiprows)
files_to_preview = [
    ("2025-01-24T07_01_23+00_00zero waste lab Mark Down_Waste 2025-01-24.xlsx", 2),
    ("2025-01-24T07_03_00+00_00zero waste lab Mark Down 2025-01-24.xlsx", 2),
    ("2025-01-24T07_02_33+00_00zero waste lab_Salesdata I 2025-01-24.xlsx", 2),
    ("2025-01-24T06_03_32+00_00leveringen filialen c.e..xlsx", 2),
    ("NAW filialen.xlsx", 0),  # header is already in the first row, nothing to skip
]
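
The `skiprows` values above were set by hand. As a minimal sketch of how such an offset could be checked programmatically, the helper below scans a sheet that was read without a header and returns the index of the first row containing an expected column label. The function name `find_header_row` and the contents of the stand-in frame (in particular the title row) are illustrative assumptions, not something taken from the actual files.

```python
import pandas as pd


def find_header_row(raw: pd.DataFrame, expected: str) -> int:
    """Return the 0-based index of the first row that contains `expected`, or -1."""
    for idx, row in enumerate(raw.itertuples(index=False)):
        if expected in row:
            return idx
    return -1


# Made-up stand-in for a sheet read with header=None: two leading rows, then the header.
raw_demo = pd.DataFrame(
    [
        ["Report title (hypothetical)", None, None],
        [None, None, None],
        ["Store", "Date", "Article"],
        [1015, "2025-01-15", 342095],
    ]
)

header_idx = find_header_row(raw_demo, "Store")
print("Rows to skip:", header_idx)
```

On a real file the raw frame could be obtained with `pd.read_excel(path, header=None, nrows=10)`, and the returned index passed on as `skiprows`.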


In [18]:
# data_folder is not defined in the cells shown here; it is assumed to be set in an earlier cell.
for fname, skip in files_to_preview:
    path = os.path.join(data_folder, fname)
    preview_excel(path, skiprows=skip)



File: 2025-01-24T07_01_23+00_00zero waste lab Mark Down_Waste 2025-01-24.xlsx
Sheet names: ['zero waste lab Mark Down_Waste']
Shape: (18382, 14)
Columns: ['Store', 'Date', 'Article', 'Unnamed: 3', 'Product name', 'Brand', 'Content', 'Eenheid CE', 'Supplier', 'Unnamed: 9', 'Content category', 'Waste reason', 'Items wasted', 'Value wasted']


Unnamed: 0,Store,Date,Article,Unnamed: 3,Product name,Brand,Content,Eenheid CE,Supplier,Unnamed: 9,Content category,Waste reason,Items wasted,Value wasted
0,1015,2025-01-15,342095,Johma Surinaamse Ei 2 Ster Salade 175gr,Salade Surinaamse ei,JOHMA,175.0,Gram,71578,Signature Foods Nederland Bv,30.05.03,Voedsel2kans,10,28.9
1,1015,2025-01-15,36387,Almhof Roomyoghurt Sp.Sinaasappel 500gr,Roomyoghurt Spaanse sinaasappel,ALMHOF,500.0,Gram,8177,Muller Nederland,24.06.04,Voedsel2kans,1,2.19
2,1015,2025-01-15,378461,Karaat Rundvleesslaatje 150 Gram,Rundvleesslaatje,KARAAT,150.0,Gram,80418,Smilde Foods B.V.,30.05.01,Voedsel2kans,7,3.99



File: 2025-01-24T07_03_00+00_00zero waste lab Mark Down 2025-01-24.xlsx
Sheet names: ['zero waste lab Mark Down']
Shape: (5605, 9)
Columns: ['Filiaal', 'Date', 'Time', 'Article', 'Unnamed: 4', 'Discount percentage', 'Regular price', 'Pakking price', 'total amount discounted']


Unnamed: 0,Filiaal,Date,Time,Article,Unnamed: 4,Discount percentage,Regular price,Pakking price,total amount discounted
0,1015,2025-01-15,11:07,437792,1db Makreelfilet Gerookt 240 Gram,30,4.49,0.0,1
1,1015,2025-01-15,11:08,470474,Vismarine Pangasiusfilet Citr Knoflook 2st. 365gr,30,4.99,0.0,1
2,1015,2025-01-15,13:40,448179,Poesiat & Kater Little Smulling 330ml,25,2.79,0.0,19



File: 2025-01-24T07_02_33+00_00zero waste lab_Salesdata I 2025-01-24.xlsx
Sheet names: ['zero waste lab_Salesdata I']
Shape: (111350, 12)
Columns: ['Store', 'Date', 'Article', 'Unnamed: 3', 'product category', 'Discount 0/1', 'Promotion', 'Theoretische Kassaverkoopprijs', 'Selling price', 'items', 'Volume', 'Sold value']


Unnamed: 0,Store,Date,Article,Unnamed: 3,product category,Discount 0/1,Promotion,Theoretische Kassaverkoopprijs,Selling price,items,Volume,Sold value
0,1015,2025-01-15,10040,Heineken Bier 24x30cl,20.36.01,0,no,13.89,13.89,116,116.0,1611.24
1,1015,2025-01-15,100966,Bio Bieten Vacuum,25.14.07,0,no,1.29,1.29,17,17.0,21.93
2,1015,2025-01-15,101093,Maggi Doseer Jus Naturel,20.22.21,0,no,2.95,2.95,2,2.0,5.9



File: 2025-01-24T06_03_32+00_00leveringen filialen c.e..xlsx
Sheet names: ['leveringen filialen c.e.']
Shape: (185770, 6)
Columns: ['Filiaal', 'Datum', 'Subgroep', 'Artikel', 'Unnamed: 4', 'Aantal Ontvangen CE']


Unnamed: 0,Filiaal,Datum,Subgroep,Artikel,Unnamed: 4,Aantal Ontvangen CE
0,1015,2025-01-22,20.01.01,25640,D.E. Aroma Rood Snf 500gr,15
1,1015,2025-01-22,20.01.01,409548,Bio+ Filterkoffie Arabica Robusta 250 Gram,6
2,1015,2025-01-22,20.01.01,444092,1 De Beste Filterkoffie Roodmerk 500 Gram,12



File: NAW filialen.xlsx
Sheet names: ['Blad1']
Shape: (135, 5)
Columns: ['Fil. Nr.', 'Filiaal', 'Adres', 'Postcode', 'Plaatsnaam']


Unnamed: 0,Fil. Nr.,Filiaal,Adres,Postcode,Plaatsnaam
0,1015,Katwijk Visserijkade,Visserijkade 2,2225 TV,Katwijk
1,1024,Sassenheim Wasbeekerlaan,Wasbeekerlaan 63,2171 AE,Sassenheim
2,1032,Noordwijk Raadhuisstraat,Raadhuisstraat 9,2201 MA,Noordwijk
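
Four of the five previews above contain placeholder columns such as `Unnamed: 3` or `Unnamed: 4`, which pandas generates when a header cell is empty; in the rows shown they hold the article description. A minimal sketch for listing these columns before deciding how to rename them is given below. The helper name `unnamed_columns` is an assumption, and the demo frame only copies one row from the delivery preview.

```python
import pandas as pd


def unnamed_columns(df: pd.DataFrame) -> list:
    """Return the columns pandas auto-named because their header cell was empty."""
    return [col for col in df.columns if str(col).startswith("Unnamed:")]


# Stand-in frame mirroring the header of the delivery file (one row copied from the preview).
demo = pd.DataFrame(
    {
        "Filiaal": [1015],
        "Datum": ["2025-01-22"],
        "Subgroep": ["20.01.01"],
        "Artikel": [25640],
        "Unnamed: 4": ["D.E. Aroma Rood Snf 500gr"],
        "Aantal Ontvangen CE": [15],
    }
)

print(unnamed_columns(demo))  # ['Unnamed: 4']
```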


In [19]:
# Load waste data and inspect for product name columns
waste_file = os.path.join(data_folder, "2025-01-24T07_01_23+00_00zero waste lab Mark Down_Waste 2025-01-24.xlsx")
df_waste = pd.read_excel(waste_file, sheet_name=0, skiprows=2)

# Check all column names
print("Column names in waste file:")
print(df_waste.columns.tolist())

# Display example rows from product name columns
display(df_waste[["Product name", "Brand"]].head(10))


Column names in waste file:
['Store', 'Date', 'Article', 'Unnamed: 3', 'Product name', 'Brand', 'Content', 'Eenheid CE', 'Supplier', 'Unnamed: 9', 'Content category', 'Waste reason', 'Items wasted', 'Value wasted']


Unnamed: 0,Product name,Brand
0,Salade Surinaamse ei,JOHMA
1,Roomyoghurt Spaanse sinaasappel,ALMHOF
2,Rundvleesslaatje,KARAAT
3,Grillworst kip,ZANDVLIET
4,Grillworst,THUISMERK
5,Sensational burger 2 st. 2 stuks,G.GOURMET
6,Vegetarische gehackt rul 200 gram,VEGETA.SL
7,Lasagne pittig gehakt,DAILY CHEF
8,Ciabatta donker wit tarwe-gerstemoutbroodje,VERS AFBAK
9,Triangel bruin meergranenbroodje,VERS AFBAK


## Canonical Product Name Definition

To ensure consistency in semantic matching and ontology alignment, we define a canonical naming convention based on the columns available in the waste file:

- `Product name`: A short descriptive name. Some entries still carry package sizes (e.g. `Vegetarische gehackt rul 200 gram`); the longer article description with brand and size sits in the unlabeled column `Unnamed: 3`.
- `Brand`: A separate column holding the brand in upper case (e.g. `JOHMA`).
- `Product name.1`: **Not found in this version** of the waste file. Instead, we will define a canonical name ourselves using cleaned `Product name` entries.

The cleaned version of `Product name` is lowercased and stripped of leading and trailing whitespace. Removing brand names or units is an optional step that is not applied yet. This canonical version will be propagated across all downstream notebooks as `product_name_clean`.


In [21]:
def clean_product_name(name):
    """
    Standardizes product names for matching by:
    - Lowercasing
    - Stripping whitespace
    - Optionally: removing brand terms or size indicators (to be expanded later)
    """
    if isinstance(name, str):
        return name.lower().strip()
    return name

# Apply to waste dataset
df_waste["product_name_clean"] = df_waste["Product name"].apply(clean_product_name)

# Preview
df_waste[["Product name", "product_name_clean"]].head(10)


Unnamed: 0,Product name,product_name_clean
0,Salade Surinaamse ei,salade surinaamse ei
1,Roomyoghurt Spaanse sinaasappel,roomyoghurt spaanse sinaasappel
2,Rundvleesslaatje,rundvleesslaatje
3,Grillworst kip,grillworst kip
4,Grillworst,grillworst
5,Sensational burger 2 st. 2 stuks,sensational burger 2 st. 2 stuks
6,Vegetarische gehackt rul 200 gram,vegetarische gehackt rul 200 gram
7,Lasagne pittig gehakt,lasagne pittig gehakt
8,Ciabatta donker wit tarwe-gerstemoutbroodje,ciabatta donker wit tarwe-gerstemoutbroodje
9,Triangel bruin meergranenbroodje,triangel bruin meergranenbroodje
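
Rows 5 and 6 above still carry size indicators after cleaning. The optional step mentioned in the docstring could be sketched as follows. This is an illustration under stated assumptions, not the cleaning used in later notebooks: the function name `clean_product_name_extended` and the unit list in `SIZE_PATTERN` are hypothetical and only cover units seen in the preview rows, and brand removal is not attempted.

```python
import re

# Hypothetical pattern: a number followed by one of a few units seen in the previews.
SIZE_PATTERN = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*(?:gram|gr|g|kg|ml|cl|l|stuks|st)\b\.?",
    re.IGNORECASE,
)


def clean_product_name_extended(name):
    """Lowercase, trim, drop size indicators, and collapse repeated spaces."""
    if not isinstance(name, str):
        return name  # leave NaN and other non-string values untouched
    cleaned = name.lower().strip()
    cleaned = SIZE_PATTERN.sub(" ", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()


print(clean_product_name_extended("Sensational burger 2 st. 2 stuks"))
print(clean_product_name_extended("Vegetarische gehackt rul 200 gram"))
```

Names without a size indicator, such as `Grillworst kip`, come out the same as with `clean_product_name`.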


## Summary of Data Preview and Canonicalization

This notebook achieved the following:

- Loaded and previewed all major snapshot Excel files (waste, markdown, sales, deliveries, store metadata).
- Corrected header offsets by skipping the first two rows of the four dated export files; `NAW filialen.xlsx` needed no offset.
- Identified `Product name` in the waste file as the most suitable source for canonical product naming.
- Created a new column `product_name_clean` in the waste data by lowercasing `Product name` values and trimming surrounding whitespace.

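This standardized name will serve as the reference point for all downstream ingredient-to-product matching and ontology-based processing. At this stage it only guarantees consistent casing and whitespace; brand and unit removal still have to be added before matching can be considered brand-agnostic, and the other datasets keep their article descriptions in unlabeled `Unnamed: N` columns.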

We now proceed to notebook 02, where we begin general cleaning, column renaming, and alignment of structure across datasets.
