# Part 1: LLM – Exploratory Data Analysis & Missing Data Imputation

In **Part 1**, we begin our Final Project work on the SAP product dataset provided by Kärcher. Our goal is to prepare and enrich this data for downstream interactions with a large language model (LLM). The steps in this phase are:

1. **Classical Exploratory Data Analysis (EDA)**
   - Load and inspect the raw dataset extract from SAP.
   - Examine each field’s data type, distribution, and cardinality.
   - Compute basic descriptive statistics to gain a holistic view of the dataset.

2. **Data Transformation for LLM Consumption**
   - Convert original “object”-typed fields into appropriate **categorical** or **numerical** types.
   - Document all transformation rules and mappings to ensure reproducibility.

3. **Missing Data Identification & Imputation Strategy**
   - Identify records and cells with missing values in critical fields.
   - Leverage the **product description PDFs** (extracted in Part 0 using agentic embeddings) as our external knowledge source.
   - Query the IBM Granite 13B Instruct V2 model to cross-reference the markdowns from Part 0 and suggest imputed values for each missing field.

4. **Validation & Iteration**
   - Cross-check the LLM’s imputations against ground-truth data or manual inspection of the source PDFs.
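
As a minimal sketch of the per-field inspection in step 1, the helper below summarises data type, missing count, and cardinality for every column. The function name `profile_columns` and the toy rows are hypothetical and only stand in for the real SAP extract.

```python
import pandas as pd

def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing count and cardinality for every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })

# Toy rows standing in for the SAP extract (illustrative values only)
toy = pd.DataFrame({
    "Produktname": ["K 2 Battery", "K 7 Premium Power Flex", "K 5 Premium Smart Control Flex"],
    "Farbe": ["gelb", "gelb", None],
    "Anschlusskabel (m)": ["5", None, "5"],
})
profile = profile_columns(toy)
print(profile)
```

A table like this makes it easy to spot columns that are almost constant or mostly empty before deciding how to type them.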


In [None]:
import sys
print(sys.version)

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
import difflib 
import json
import re

In [None]:
project_root = Path.cwd()  
data_file    = project_root / "data" / "SAP_Produktstammdaten_vfinal.csv"

df = pd.read_csv(
    data_file,
    sep=",",                # comma-separated
    engine="python",        # tolerant parser
    quotechar='"',          # respect commas inside quotes
    skipinitialspace=True,  # trim spaces after delimiters
    dtype=str               # load all as strings initially
)

original = df.copy(deep=True)

print(df.shape)
df.head()


In [None]:
print("Shape:", df.shape)
print("\nColumn types:")
print(df.dtypes)

To manipulate the data, we first need to convert the column types: arithmetic and descriptive statistics are not possible while numbers are stored as strings.
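
One detail worth checking before converting: a plain comma-to-dot replacement turns a value with a thousands separator such as `1.299,99` into `1.299.99`, which `pd.to_numeric(..., errors="coerce")` would silently turn into NaN. Whether such values occur in this extract is not known; the following is only a sketch of a more defensive parser, and `parse_german_number` is a hypothetical helper name.

```python
import math
import re

def parse_german_number(text):
    """Convert a German-formatted number string to float, or NaN if none is found."""
    if text is None:
        return math.nan
    match = re.search(r"-?\d[\d.,]*", str(text))
    if not match:
        return math.nan
    token = match.group(0).rstrip(".,")
    if "," in token:
        # Assumption: a comma is the decimal separator, dots are thousands separators
        token = token.replace(".", "").replace(",", ".")
    try:
        return float(token)
    except ValueError:
        return math.nan

print(parse_german_number("539,99 €"))
print(parse_german_number("1.299,99 €"))
print(parse_german_number("max. 60"))
```

A token without a comma, such as `1.299`, is still read as a decimal number, so the ambiguity between "one point three" and "one thousand three hundred" remains and would need a per-column rule.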

In [None]:
numeric_cols = [
    "Preis (€ inkl. MwSt.)",
    "Flächenleistung (m²/h)",
    "Anschlussleistung (kW)",
    "Anschlusskabel (m)",
    "Gewicht ohne Zubehör (kg)",
    "Gewicht inkl. Verpackung (kg)"
]

for col in numeric_cols:
    # German‐style decimals: comma → dot
    cleaned = (
        df[col]
        .str.replace(r"[^\d,.\-]", "", regex=True)
        .str.replace(",", ".", regex=False)
    )
    df[col] = pd.to_numeric(cleaned, errors="coerce")

cat_cols = [c for c in df.columns if c not in numeric_cols]
df[cat_cols] = df[cat_cols].astype("category")

print(df.dtypes)
print(df[numeric_cols].describe())


In [None]:
print("\nMissing values per column:")
print(df.isna().sum())

As we can see, some of the data is already missing. Let's increase the amount of missing data by blanking 10% of the cells. Because we know the blanked values, this lets us check the accuracy of the LLM's imputations at the end.
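
A possible variant of this masking step, sketched under the assumption that we also want to remember exactly which cells were blanked: the helper below samples only cells that currently hold a value and returns a boolean mask alongside the masked copy. `blank_random_cells` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def blank_random_cells(frame, fraction, protected, seed=42):
    """Return (masked copy, boolean mask) with `fraction` of eligible cells set to NaN."""
    rng = np.random.default_rng(seed)
    # Only cells that currently hold a value can be blanked
    eligible = frame.notna().to_numpy(copy=True)
    for col in protected:
        eligible[:, frame.columns.get_loc(col)] = False
    rows, cols = np.nonzero(eligible)
    n_blanks = int(len(rows) * fraction)
    picked = rng.choice(len(rows), size=n_blanks, replace=False)
    mask = np.zeros(frame.shape, dtype=bool)
    mask[rows[picked], cols[picked]] = True
    mask_df = pd.DataFrame(mask, index=frame.index, columns=frame.columns)
    return frame.mask(mask_df), mask_df

toy = pd.DataFrame({
    "Produktname": ["A", "B", "C", "D"],
    "Farbe": ["gelb", "gelb", None, "gelb"],
    "Anschlusskabel (m)": ["5", "5", "5", "5"],
})
masked, blanked = blank_random_cells(toy, fraction=0.5, protected=["Produktname"])
print(masked)
```

Keeping the mask means the evaluation can later be restricted to cells that had a known value, instead of also scoring cells that were already empty in the source.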

In [None]:
df_test = df.copy(deep=True)

n_rows, n_cols = df_test.shape
total_cells     = n_rows * n_cols
n_blanks        = int(total_cells * 0.1)


product_col = df_test.columns.get_loc("Produktname")

# Builds a list of all flat indices except those in Produktname
# For each row r and column c ≠ product_col, flat_index = r*n_cols + c
valid_indices = [
    r * n_cols + c
    for r in range(n_rows)
    for c in range(n_cols)
    if c != product_col
]

# Picks 10% of those valid positions at random (reproducibly)
rng = np.random.default_rng(seed=42)
flat_indices = rng.choice(valid_indices, size=n_blanks, replace=False)

# Maps flat indices back to (row, col) and set to NaN
for idx in flat_indices:
    i = idx // n_cols      # row index
    j = idx %  n_cols      # col index
    df_test.iat[i, j] = np.nan

# Save to a new CSV (the `data/processed` folder is created if it does not exist)
out_path = project_root / "data" / "processed" / "SAP_Produktstammdaten_vfinal_test_missing.csv"
out_path.parent.mkdir(parents=True, exist_ok=True)
df_test.to_csv(out_path, index=False)

print(f"Test CSV with 10% random blanks written to: {out_path}")


In [None]:
project_root = Path.cwd()  
data_file    = project_root / "data" / "processed" / "SAP_Produktstammdaten_vfinal_test_missing.csv"
df_test = pd.read_csv(
    data_file,
    sep=",",                
    engine="python",        
    quotechar='"',          
    skipinitialspace=True,  
    dtype=str               
)


# Inspect the reloaded test frame (not the original `df`)
print(df_test.shape)
df_test.head()


In [None]:
project_root = Path.cwd()
test_csv     = project_root / "data" / "processed" / "SAP_Produktstammdaten_vfinal_test_missing.csv"
df_test      = pd.read_csv(test_csv, sep=",", engine="python", quotechar='"', skipinitialspace=True)

missing_locs = [
    (i, col)
    for i, row in df_test.iterrows()
    for col in df_test.columns
    if pd.isna(row[col])
]
print(f"Found {len(missing_locs)} missing cells. Sample:")
print(missing_locs[:10])

Great. However, this output is hard to read. We need to structure the missing data to make working with the LLM easier.
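
One way to structure this, shown here only as a sketch, is a dictionary that maps each product to the list of fields it is missing. The helper name `missing_fields_by_product` and the toy data are hypothetical; if product names were duplicated, later rows would overwrite earlier ones.

```python
import pandas as pd

def missing_fields_by_product(frame, key="Produktname"):
    """Map each product name to the list of columns that are NaN in its row."""
    result = {}
    for _, row in frame.iterrows():
        fields = [c for c in frame.columns if c != key and pd.isna(row[c])]
        if fields:
            result[row[key]] = fields
    return result

toy = pd.DataFrame({
    "Produktname": ["K 2 Battery", "K 7 Premium Power Flex"],
    "Farbe": [None, "gelb"],
    "Anschlusskabel (m)": [None, "5"],
})
todo = missing_fields_by_product(toy)
print(todo)
```

A mapping like this can be iterated product by product, so each markdown description only has to be loaded once per product.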

In [None]:
project_root = Path.cwd()
test_csv     = project_root / "data" / "processed" / "SAP_Produktstammdaten_vfinal_test_missing.csv"
df_test      = pd.read_csv(test_csv, sep=",", engine="python",
                           quotechar='"', skipinitialspace=True)

cols = [c for c in df_test.columns if c != "Produktname"]

for _, row in df_test.iterrows():
    missing = [c for c in cols if pd.isna(row[c])]
    if missing:
        prod = row["Produktname"]
        print(f"In {prod} we are missing: {', '.join(missing)}")


In [None]:
project_root = Path.cwd()
csv_path     = project_root / "data" / "processed" / "SAP_Produktstammdaten_vfinal_test_missing.csv"
df_test      = pd.read_csv(csv_path, sep=",", engine="python", quotechar='"', skipinitialspace=True)

cols = [c for c in df_test.columns if c != "Produktname"]
product = None
fields  = None

for _, row in df_test.iterrows():
    missing = [c for c in cols if pd.isna(row[c])]
    if missing:
        product = row["Produktname"]
        fields  = missing
        break

print(f"Product with missing data: {product}")
print(f"Missing fields: {fields}")

In [None]:
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters
from tqdm import tqdm
from sklearn.metrics import classification_report


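# Read credentials from environment variables or a local .env file instead of
# hardcoding them in the notebook (the variable names are our own choice)
WX_API_KEY = config("WX_API_KEY")
PROJECT_ID = config("PROJECT_ID")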

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key=WX_API_KEY
)
client = APIClient(credentials=credentials, project_id=PROJECT_ID)

In [None]:
PARAMS = TextGenParameters(
    temperature=0,
    max_new_tokens=50,
    stop_sequences=["\n"]
)
model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
    params=PARAMS
)

In [None]:
desc_dir = project_root / "parsed_markdown"
descriptions = {}
for md in desc_dir.glob("*.md"):
    product = md.stem 
    text    = (md.read_text(encoding="utf-8"))
    descriptions[product] = text

print("Loaded descriptions for", len(descriptions), "products")

In [None]:
PROMPT_TEMPLATE = """
You are an expert product-data imputer.

I'm giving you:

  1. The name of exactly *one* field that is missing: {col}
  2. The product name: {product}
  3. A list of all *other* fields and values in the row
  4. A full raw markdown dump

Your **only** job is to return *exactly* one thing: the *value* for {col}.
- **No** explanations, **no** punctuation around it, **no** units.
- If you cannot confidently infer it, return **NA**.

Here is the data:

Field: {col}  
Product: {product}  

Known fields:  
{other_fields}

Raw markdown:  
{markdown}
"""

# Copy your test DataFrame and locate all missing cells
filled = df_test.copy()
missing_locs = [
    (i, col)
    for i, row in df_test.iterrows()
    for col in df_test.columns
    if pd.isna(row[col])
]

# Helper to safely grab a number from the model’s text
def extract_first_number(s: str) -> float:
    m = re.search(r"-?\d+(?:[.,]\d+)?", s)
    return float(m.group(0).replace(",", ".")) if m else np.nan

# Loop through each missing cell and call the model
for (i, col) in missing_locs:
    product = filled.at[i, "Produktname"] or ""
    other_fields = "\n".join(
        f"- {c}: {filled.at[i, c]}"
        for c in filled.columns
        if c != col and pd.notna(filled.at[i, c])
    )
    raw_md = descriptions.get(product, "")

    prompt = PROMPT_TEMPLATE.format(
        col=col,
        product=product,
        other_fields=other_fields,
        markdown=raw_md
    )

    # Send the prompt
    resp = model.generate(prompt)
    raw_guess = resp["results"][0]["generated_text"].strip()

    # Fill in either as a number or as raw text/NA
    if col in numeric_cols:
        guess = extract_first_number(raw_guess)
    else:
        guess = raw_guess or "NA"

    filled.iat[i, filled.columns.get_loc(col)] = guess


In [None]:
records = []
for i, col in missing_locs:
    records.append({
        'Field':   col,
        'Imputed': filled.at[i, col],
        'Actual':  df.at[i, col]
    })

compare_df = pd.DataFrame(records, columns=['Field','Imputed','Actual'])
print(compare_df)


We can already see that the model is hallucinating. Let's format the output to make it easier to analyse visually.
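
When comparing imputed and actual values, a pure string comparison treats `60` and `60.0` (or `gelb` and `Gelb`) as different. The following sketch shows a more tolerant comparison; `values_match` is a hypothetical helper and the tolerance is an assumption, not a value taken from this project.

```python
import math

def values_match(imputed, actual, rel_tol=1e-6):
    """Compare numerically when both sides parse as numbers, else compare normalised text."""
    def to_float(value):
        try:
            return float(str(value).strip().replace(",", "."))
        except ValueError:
            return None

    a, b = to_float(imputed), to_float(actual)
    if a is not None and b is not None:
        # NaN never counts as a match
        if math.isnan(a) or math.isnan(b):
            return False
        return math.isclose(a, b, rel_tol=rel_tol)
    return str(imputed).strip().casefold() == str(actual).strip().casefold()

print(values_match("60", 60.0))
print(values_match("Gelb ", "gelb"))
print(values_match(float("nan"), 5))
```

Applied row by row to a comparison table, a function like this separates genuine hallucinations from mere formatting differences.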

In [None]:
from IPython.display import display

# after you’ve built `compare_df`:
display(compare_df)

In [None]:
mismatches = compare_df[
    compare_df['Imputed'].astype(str) != compare_df['Actual'].astype(str)
]
display(mismatches)


In [None]:
error_counts = mismatches.groupby('Field').size().sort_values(ascending=False)
print(error_counts)


In [None]:
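# Rows where the imputed value equals the ground truth (same string comparison as for `mismatches`)
correct   = compare_df[
    compare_df['Imputed'].astype(str) == compare_df['Actual'].astype(str)
]
n_correct = len(correct)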
n_total   = len(compare_df)
print(f"Correct imputations: {n_correct}/{n_total} ({n_correct/n_total:.1%})")


As we can see, not a single imputation was correct. We can also see that for some fields the model returned the field name itself as the value (e.g. `Abmessungen (L × B × H) (mm)` for the field `Abmessungen (L × B × H) (mm)`). Let's change our prompt to see whether we can get rid of the hallucination.
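
Independently of the prompt, answers that merely repeat the field name can be caught after generation. This is a minimal sketch of such a guard; `sanitise_guess` is a hypothetical helper and the list of "no answer" spellings is an assumption.

```python
def sanitise_guess(raw, field):
    """Return None when the model output is empty, a 'no answer' marker, or just the field name."""
    text = (raw or "").strip()
    if not text or text.upper() in {"NA", "N/A", "NAN"}:
        return None
    # The model sometimes echoes the requested field instead of a value
    if text.casefold() == field.strip().casefold():
        return None
    return text

print(sanitise_guess("Abmessungen (L × B × H) (mm)", "Abmessungen (L × B × H) (mm)"))
print(sanitise_guess(" gelb ", "Farbe"))
```

Rejected answers can then be left as missing rather than being written into the table as if they were values.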

In [None]:
BASE_SYSTEM = """
You are an expert product-data imputer.
When asked for a single missing field, return exactly one token: the missing value and nothing else (no quotes, no units, no explanation). If you cannot determine it, respond with NA. Always use dot for decimals.
"""

FEW_SHOT_TEXT = """
### EXAMPLE 1
Field: Druck (bar/MPa)  (type: numeric)
Product: K 2 Battery
Markdown:
### Description
A portable battery washer …

## Technische Daten
- Druck (bar/MPa): max. 110
- Fördermenge (l/h): 340

→ 110

### EXAMPLE 2
Field: Farbe  (type: categorical)
Product: K 7 Premium Power Flex
Markdown:
### Description
Yellow-and-black pressure washer …

## Technische Daten
- Gewicht ohne Zubehör (kg): 17.8
- Gewicht inkl. Verpackung (kg): 22.2

→ gelb
"""

# 2) Copy df_test & find missing cells
filled = df_test.copy()
missing_locs = [
    (i, col)
    for i, row in df_test.iterrows()
    for col in df_test.columns
    if pd.isna(row[col])
]

# 3) Helper to extract a number
def extract_first_number(s: str) -> float:
    m = re.search(r"-?\d+(?:[.,]\d+)?", s)
    return float(m.group(0).replace(",", ".")) if m else np.nan

# 4) Loop & impute
for (i, col) in missing_locs:
    product = filled.at[i, "Produktname"] or ""
    dtype   = "numeric" if col in numeric_cols else "categorical"
    other   = "\n".join(f"- {c}: {filled.at[i,c]}" 
                        for c in filled.columns 
                        if c != col and pd.notna(filled.at[i,c]))
    md      = descriptions.get(product, "")

    # 5) Assemble one big prompt string
    prompt_text = "\n".join([
        BASE_SYSTEM,
        FEW_SHOT_TEXT,
        f"### YOUR TURN",
        f"Field: {col}  (type: {dtype})",
        f"Product: {product}",
        "Markdown:",
        md,
        "Other fields:",
        other,
        "→"
    ])

    # 6) Call the model with a single string
    resp = model.generate(prompt=prompt_text)
    raw = resp["results"][0]["generated_text"].strip()

    # 7) Coerce numeric vs categorical
    if col in numeric_cols:
        guess = extract_first_number(raw)
    else:
        guess = raw or "NA"

    filled.iat[i, filled.columns.get_loc(col)] = guess

In [None]:
records = []
for i, col in missing_locs:
    records.append({
        'Field':   col,
        'Imputed': filled.at[i, col],
        'Actual':  df.at[i, col]
    })

compare_df = pd.DataFrame(records, columns=['Field','Imputed','Actual'])
print(compare_df)


In [None]:
display(compare_df)

In [None]:
mismatches = compare_df[
    compare_df['Imputed'].astype(str) != compare_df['Actual'].astype(str)
]
display(mismatches)

The model is obviously hallucinating. Let's narrow the problem down and try to find the value for only the first missing cell.

In [None]:
project_root = Path.cwd()
csv_path     = project_root / "data/processed/SAP_Produktstammdaten_vfinal_test_missing.csv"
clean_dir    = project_root / "parsed_markdown"

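# Use a separate name so the typed, unmasked `df` from above is not overwritten
df_masked = pd.read_csv(csv_path, sep=",", engine="python", quotechar='"', skipinitialspace=True)
cols = [c for c in df_masked.columns if c != "Produktname"]

product = field = None
for _, row in df_masked.iterrows():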
    missing = [c for c in cols if pd.isna(row[c])]
    if missing:
        product = row["Produktname"]
        field   = missing[0]
        break

assert product and field, "No missing field found!"

md_path = clean_dir / f"{product}.md"
md_text = md_path.read_text(encoding="utf-8") if md_path.exists() else ""

BASE_SYSTEM = """
You are an expert product-data imputer.
When asked for a single missing field, you will:
1. Read the cleaned Markdown file named <Produktname>.md from parsed_markdown_clean.
2. Find the requested field.
3. Return exactly one line:
In <Produktname> the correct <Field> is <Value>
If the field is absent, return:
In <Produktname> the correct <Field> is NaN
Use a dot for decimals; no quotes, units, or extra text.
"""

FEW_SHOT = """
### EXAMPLE 1
Field: Druck (bar/MPa)
Product: K 2 Battery
(Excerpt:)
- Druck (bar/MPa): max. 110
→ In K 2 Battery the correct Druck (bar/MPa) is 110

### EXAMPLE 2
Field: Farbe
Product: K 7 Premium Power Flex
(Excerpt:)
- Gewicht ohne Zubehör (kg): 17.8
→ In K 7 Premium Power Flex the correct Farbe is NaN
"""

prompt = (
    BASE_SYSTEM.strip() + "\n\n" +
    FEW_SHOT.strip() + "\n\n" +
    f"Field: {field}\n" +
    f"Product: {product}\n\n" +
    f"(Contents of {product}.md below:)\n" +
    md_text + "\n\n" +
    "→"
)

In [None]:
result_line = model.generate_text(prompt)
print(result_line.strip())

As we can see, the model is hallucinating even with a single input. The actual value is **Zulauftemperatur (°C)**: max. 60. What if we ask the model directly in a prompt? To reduce the possibility of an error, we reuse the identical parameters and code that our group used in MA3.

In [None]:
# Only the `langchain_ibm` import is needed; it provides the WatsonxLLM wrapper used below
from langchain_ibm import WatsonxLLM

llm = WatsonxLLM(
    model_id="ibm/granite-13b-instruct-v2",
    url="https://us-south.ml.cloud.ibm.com",
    apikey=WX_API_KEY,
    project_id=PROJECT_ID,
    params={
        "decoding_method": "greedy",
        "temperature": 0.0,
        "min_new_tokens": 5,
        "max_new_tokens": 1_000,
        "repetition_penalty": 1.2,
    },
)


In [None]:
llm_result = llm.invoke("Hi how are you?")

print(type(llm_result))
print(llm_result)

In [None]:
llm_result = llm.invoke(
    "You are a product-data extractor. In the text below, locate and return the value for “Zulauftemperatur (°C)”. If the field isn’t present, respond with “Not found”.\n\n"
    "Text:\n\n"
    "## Page Header\n\n"
    "The image shows the logo of Kärcher, a company known for its cleaning equipment. The logo consists of the word \"KÄRCHER\" in bold, black uppercase letters. Below the text, there is a yellow horizontal bar. The background is white, providing contrast to the black text and yellow bar. <!-- page_header, ID 77001531-ac4c-4427-8d9d-4871b3fbf695 -->\n\n"
    "# K 5 PREMIUM SMART CONTROL FLEX eco!Booster <!-- title, ID 18485f43-17b0-4038-a9bf-5ed087e286ad -->\n\n"
    "## Text\n\n"
    "Für mehr Performance: der Hochdruckreiniger K 5 Premium Smart Control Flex eco!Booster mit PremiumFlex-Schlauch, G 180 Q Smart Control-Pistole, Schlauchtrommel und eco!Booster Kit. <!-- text, ID 743bca50-0296-4a44-872f-f6134441c566 -->\n\n"
    "## Description\n\n"
    "The image displays a Kärcher pressure washer, specifically the K5 Smart Control model. The device is predominantly yellow with black accents and features a sleek, modern design. It includes a handle for easy maneuverability and wheels for transport. The pressure washer is equipped with a hose reel for convenient storage of the hose.\n\n"
    "### Components\n\n"
    "- **Pressure Washer Unit**:\n"
    "  - **Color**: Yellow and black.\n"
    "  - **Model**: K5 Smart Control.\n"
    "  - **Brand**: Kärcher.\n"
    "  - **Features**:\n"
    "    - Integrated handle.\n"
    "    - Wheels for mobility.\n"
    "    - Hose reel for storage.\n\n"
    "- **Accessories**:\n"
    "  - **Spray Gun**:\n"
    "    - Black with yellow accents.\n"
    "    - Branded with Kärcher logo.\n"
    "  - **Lance**:\n"
    "    - Black with yellow tip.\n"
    "    - Designed for various cleaning tasks.\n"
    "  - **Detergent Bottle**:\n"
    "    - Labeled \"Universal\".\n"
    "    - Features an image of a person using the pressure washer.\n\n"
    "### Additional Details\n\n"
    "- The pressure washer is designed for home use, suitable for cleaning cars, patios, and other surfaces.\n"
    "- The image suggests a focus on versatility and ease of use, with the inclusion of multiple attachments and a detergent bottle for enhanced cleaning performance. <!-- figure, ID dac59e3f-1a35-4150-a02f-2cf0e71df6cb -->\n\n"
    "## Price\n\n"
    "The image shows a price of 539,99 €. <!-- text, ID 27d5d7da-d92d-4c2d-b800-17bbfad4ad1a -->\n\n"
    "## Text Content\n\n"
    "inkl. MwSt. - kostenlose Lieferung ab 50 € <!-- text, ID 1a5b0407-8076-462f-9b79-cc46a8d4e8f2 -->\n\n"
    "## Page Footer\n\n"
    "[https://www.kaercher.com/de/home-garden/hochdruckreiniger/k-5-premium-smart-control-flex-eco-booster-13246870.html](https://www.kaercher.com/de/home-garden/hochdruckreiniger/k-5-premium-smart-control-flex-eco-booster-13246870.html) <!-- page_footer, ID 7567dd32-5f11-4cda-a5d8-5011045f6c0d -->\n\n"
    "## Page Number\n\n"
    "1/10 <!-- page_number, ID c25e9df6-69c3-4714-8eb0-9b4517817945 -->\n\n"
    "## Page Header\n\n"
    "30/04/2025, 23:44 <!-- page_header, ID 200ff78b-640d-4cb1-8ef9-6c5883e4ef6d -->\n\n"
    "## K 5 Premium Smart Control Flex eco!Booster | Kärcher\n\n"
    "This is a page header indicating the title of a product from Kärcher, specifically the \"K 5 Premium Smart Control Flex eco!Booster.\" <!-- page_header, ID e51076b1-3049-4369-894d-be3c7988960f -->\n\n"
    "## Text Information\n\n"
    "- Lieferbar in 3-4 Werktagen\n"
    "- Bestellnummer: 1.324-687.0 <!-- key_value, ID 4be821ac-2c5f-4ae2-8a20-8924b72d464c -->\n\n"
    "## Händler Suche\n\n"
    "- **Ort oder PLZ**: [Text input field]\n\n"
    "### Bewertung\n\n"
    "- **Sterne**: [ ] [ ] [ ] [ ] [ ] (0)\n"
    "- **Aktion**: Jetzt Produkt bewerten\n\n"
    "### Produkt Optionen\n\n"
    "- **Produkt vergleichen**: [Text link]\n\n"
    "### Hilfe\n\n"
    "- **Benötigen Sie Hilfe?**\n"
    "  - **Hotline**: +49 7195 903 0 <!-- form, ID 50cec018-d8bb-4273-9abc-1bad6bea78f9 -->\n\n"
    "## Text Content\n\n"
    "Dank integriertem Bluetooth lässt sich der Hochdruckreiniger K 5 Premium Smart Control Flex eco!Booster mit der Kärcher Home & Garden App verbinden. Die App bietet viele nützliche Funktionen wie den Anwendungsberater mit hilfreichen Tipps und Tricks, eine Aufbauanleitung, Wartungs- und Pflegehinweise sowie das Kärcher Serviceportal. Darüber hinaus verfügt das Gerät über einen Boost Mode für extra Power, G 180 Q Smart Control-Pistole mit LCD-Display und das 3-in-1-Multi Jet-Strahlrohr. Die Druckeinstellungen werden direkt an der Pistole vorgenommen oder mit Hilfe des Anwendungsberaters aus der App auf die Pistole übertragen. Über das LCD-Display lässt sich überprüfen, welche Druckstufe eingestellt ist. Weitere Ausstattungsdetails sind die Schlauchtrommel, das Plug 'n' Clean-Reinigungssystem, der PremiumFlex-Hochdruckschlauch, der Aluminium-Teleskopgriff sowie die Parkposition für jederzeit griffbereites Zubehör. Inklusive eco!Booster Kit mit eco!Booster und 1 Liter Universalreiniger. Der eco!Booster ist ideal für empfindliche Oberflächen und sorgt mit einer um 50 Prozent höheren Reinigungsleistung im Vergleich zum Flachstrahl für eine Wasser-, Energie- und Zeitersparnis. <!-- text, ID e6a43fc4-0308-4ec7-8aa5-064878973f75 -->\n\n"
    "## Merkmale und Vorteile <!-- text, ID 8f4cbbb5-c478-495f-a400-8e93008927ce -->\n\n"
    "## Page Number\n\n"
    "5/10 <!-- page_number, ID 8a12ffa8-af64-4455-a64a-da610b8122d2 -->\n\n"
    "### Page Header\n\n"
    "- **Date and Time**: 30/04/2025, 23:44 <!-- page_header, ID 2988337e-2b21-44bd-9e82-ce3230efd6c9 -->\n\n"
    "## K 5 Premium Smart Control Flex eco!Booster | Kärcher <!-- page_header, ID f648a450-9186-4323-b735-28539ff3487d -->\n\n"
    "## Schlauchtrommel für komfortable Handhabung\n\n"
    "- Der Hochdruckschlauch ist optimal geschützt und platzsparend verstaut.\n"
    "- Bequemes Arbeiten: Jederzeit griffbereiter Schlauch durch leichtes Auf- und Abrollen.\n"
    "- Tiefer Schwerpunkt für einen sicheren Stand auch auf schrägen Oberflächen. <!-- text, ID 7796f8d0-5d1d-4fb7-b8f3-7f4049ce99ed -->\n\n"
    "## Wahrgenommene Lautstärkenreduktion\n\n"
    "- Angenehmer Klang der Anwendung**\n\n"
    "** Vergleich zur wahrgenommenen Lautstärke bei der Anwendung des Kärcher Standard-Flachstrahls. <!-- text, ID 16b63c93-743d-4bb8-a0a2-500169bdbed3 -->\n\n"
    "## SPEZIFIKATIONEN\n\n"
    "## Technische Daten\n\n"
    "- Stromart (V/Hz): 230 / 50\n"
    "- Druck (bar/MPa): 20 - max. 145 / 2 - max. 14,5\n"
    "- Fördermenge (l/h): max. 500\n"
    "- Flächenleistung (m²/h): 40\n"
    "- Zulauftemperatur (°C): max. 60\n"
    "- Anschlussleistung (kW): 2,1\n"
    "- Anschlusskabel (m): 5\n"
    "- Farbe: gelb\n"
    "- Gewicht ohne Zubehör (kg): 13,5\n"
    "- Gewicht inkl. Verpackung (kg): 18,5\n"
    "- Abmessungen (L × B × H) (mm): 417 × 306 × 584 <!-- key_value, ID ff4355bf-a76b-40ac-a30b-21aa7848d77b -->\n\n"
    "## Lieferumfang\n\n"
    "- Hochdruckpistole: G 180 Q Smart Control\n"
    "- Multi Jet 3-in-1\n"
    "- eco!Booster <!-- text, ID 2050473b-7124-4cbb-8fda-9aaaf6e6f335 -->\n\n"
    "## Page Footer\n\n"
    "[https://www.kaercher.com/de/home-garden/hochdruckreiniger/k-5-premium-smart-control-flex-eco-booster-13246870.html](https://www.kaercher.com/de/home-garden/hochdruckreiniger/k-5-premium-smart-control-flex-eco-booster-13246870.html) <!-- page_footer, ID 29726f76-d2b3-4d03-b2b5-b77372b17c05 -->\n\n"
    "### Page Number\n\n"
    "6/10 <!-- page_number, ID 0b62e4cc-caff-40a0-b2b7-4b438d3c250e -->"
)
print(llm_result)

As we can see, the model is hallucinating even when the data is supplied directly in the prompt. The actual value is **Zulauftemperatur (°C)**: max. 60. What if we clean the markdown files and keep only the technical data, such as the temperature?
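
Because the technical data in these markdown files are plain `- key: value` bullets, they can also be read without a model. The sketch below parses such bullets into a dictionary as a possible baseline for checking the LLM; `parse_spec_lines` is a hypothetical helper, and it captures any bullet of that shape, not only those under "Technische Daten".

```python
import re

SPEC_LINE = re.compile(r"^\s*[-*]\s*(?P<key>[^:]+?)\s*:\s*(?P<value>.+?)\s*$")
HTML_COMMENT = re.compile(r"<!--.*?-->")

def parse_spec_lines(markdown):
    """Collect '- key: value' bullets into a dict; the first occurrence of a key wins."""
    specs = {}
    for line in markdown.splitlines():
        line = HTML_COMMENT.sub("", line)
        match = SPEC_LINE.match(line)
        if match:
            key = match.group("key").strip("*_ ")
            specs.setdefault(key, match.group("value"))
    return specs

sample = (
    "## Technische Daten\n\n"
    "- Zulauftemperatur (°C): max. 60\n"
    "- Anschlussleistung (kW): 2,1\n"
    "- Abmessungen (L × B × H) (mm): 417 × 306 × 584 <!-- key_value, ID example -->\n"
)
specs = parse_spec_lines(sample)
print(specs["Zulauftemperatur (°C)"])
```

Values parsed this way still carry their original formatting (for example `max. 60` or `2,1`), so they would need the same numeric cleaning as the CSV columns before a comparison.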

In [None]:
project_root = Path.cwd()
src_dir      = project_root / "parsed_markdown"
clean_dir    = project_root / "parsed_markdown_clean"

clean_dir.mkdir(exist_ok=True)

In [None]:
comment_re          = re.compile(r"<!--.*?-->", flags=re.DOTALL)
blank_re            = re.compile(r"(?:\r?\n){3,}")
remove_sections_re  = re.compile(
    r"^##\s+(?:Page Header|Page Footer|Page Number)[\s\S]*?(?=^##\s|\Z)",
    flags=re.MULTILINE,
)

KEYS = [
    "Bestellnummer",
    "Preis",                 
    "Lieferzeit",
    "Stromart",
    "Druck",
    "Fördermenge",
    "Flächenleistung",
    "Zulauftemperatur",
    "Anschlussleistung",
    "Anschlusskabel",
    "Farbe",
    "Gewicht ohne Zubehör",
    "Gewicht inkl. Verpackung",
    "Abmessungen",
]


key_line_re = re.compile(
    rf"^\s*(?:[-*]\s*)?(?:\*\*|__)?(?:{'|'.join(map(re.escape, KEYS))})(?:\b|:)",
    flags=re.IGNORECASE,
)


SECTION_TITLES = {"Lieferumfang", "Ausstattung"}

for md_path in src_dir.glob("*.md"):
    raw = md_path.read_text(encoding="utf-8")


    txt = remove_sections_re.sub("", comment_re.sub("", raw))

    out_lines        = []
    keep_section     = False      
    product_written  = False      # first non-page heading becomes Produktname

    for line in txt.splitlines():

        if line.startswith("## "):
            hdr = line.lstrip("#").strip()


            if not product_written:
                prod_name = hdr.split("|")[0].strip()
                out_lines.append("### Produktname")
                out_lines.append(prod_name)
                product_written = True
                keep_section = False
                continue


            if hdr in SECTION_TITLES:
                out_lines.append(f"### {hdr}")
                keep_section = True
            else:
                keep_section = False

            continue  

        if keep_section or key_line_re.match(line):
            out_lines.append(line)

    cleaned = blank_re.sub("\n\n", "\n".join(out_lines).strip()) + "\n"
    (clean_dir / md_path.name).write_text(cleaned, encoding="utf-8")

print(
    f"Cleaned markdown saved for {len(list(src_dir.glob('*.md')))} files into {clean_dir}"
)

In [None]:
desc_dir = project_root / "parsed_markdown_clean"
descriptions = {}
for md in desc_dir.glob("*.md"):
    product = md.stem 
    text    = (md.read_text(encoding="utf-8"))
    descriptions[product] = text

print("Loaded descriptions for", len(descriptions), "products")

In [None]:
PROMPT_TEMPLATE = """
You are an expert product-data imputer.

I'm giving you:

  1. The name of exactly *one* field that is missing: {col}
  2. The product name: {product}
  3. A list of all *other* fields and values in the row
  4. A full raw markdown dump

Your **only** job is to return *exactly* one thing: the *value* for {col}.
- **No** explanations, **no** punctuation around it, **no** units.
- If you cannot confidently infer it, return **NA**.

Here is the data:

Field: {col}  
Product: {product}  

Known fields:  
{other_fields}

Raw markdown:  
{markdown}
"""

# Copy your test DataFrame and locate all missing cells
filled = df_test.copy()
missing_locs = [
    (i, col)
    for i, row in df_test.iterrows()
    for col in df_test.columns
    if pd.isna(row[col])
]

# Helper to safely grab a number from the model’s text
def extract_first_number(s: str) -> float:
    m = re.search(r"-?\d+(?:[.,]\d+)?", s)
    return float(m.group(0).replace(",", ".")) if m else np.nan

# Loop through each missing cell and call the model
for (i, col) in missing_locs:
    product = filled.at[i, "Produktname"] or ""
    other_fields = "\n".join(
        f"- {c}: {filled.at[i, c]}"
        for c in filled.columns
        if c != col and pd.notna(filled.at[i, c])
    )
    raw_md = descriptions.get(product, "")

    prompt = PROMPT_TEMPLATE.format(
        col=col,
        product=product,
        other_fields=other_fields,
        markdown=raw_md
    )

    # Send the prompt
    resp = model.generate(prompt)
    raw_guess = resp["results"][0]["generated_text"].strip()

    # Fill in either as a number or as raw text/NA
    if col in numeric_cols:
        guess = extract_first_number(raw_guess)
    else:
        guess = raw_guess or "NA"

    filled.iat[i, filled.columns.get_loc(col)] = guess


In [None]:
records = []
for i, col in missing_locs:
    records.append({
        'Field':   col,
        'Imputed': filled.at[i, col],
        'Actual':  original.at[i, col]  # taken from the unmasked copy
    })


compare_df = pd.DataFrame(records, columns=['Field','Imputed','Actual'])
print(compare_df)


In [None]:
display(compare_df)

In [None]:
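# Recompute the mismatches for the current `compare_df`; the earlier `mismatches` is stale
mismatches = compare_df[
    compare_df['Imputed'].astype(str) != compare_df['Actual'].astype(str)
]
error_counts = mismatches.groupby('Field').size().sort_values(ascending=False)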
print(error_counts)

No luck so far. But can the model actually extract values from a "curated" input?
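
A curated input of this kind does not have to be typed by hand. As a rough sketch, the helper below assembles a short extraction prompt from a list of specification lines; `build_extraction_prompt` is a hypothetical name and the wording simply mirrors the extraction instruction used in this notebook.

```python
def build_extraction_prompt(field, spec_lines):
    """Assemble a short extraction prompt that contains only the given spec lines."""
    header = (
        f'You are a product-data extractor. In the text below, locate and return '
        f'the value for "{field}". If the field is not present, respond with "Not found".'
    )
    # Normalise every non-empty line to a single "- key: value" bullet
    body = "\n".join(
        f"- {line.lstrip('- ').strip()}" for line in spec_lines if line.strip()
    )
    return f"{header}\n\nText:\n### Technische Daten\n{body}"

lines = ["- Zulauftemperatur (°C): max. 60", "- Farbe: gelb"]
prompt = build_extraction_prompt("Zulauftemperatur (°C)", lines)
print(prompt)
```

The resulting string can be passed to the model in the same way as a hand-written prompt, which makes the experiment repeatable for every product and field.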

In [None]:
llm_result = llm.invoke(
    "You are a product-data extractor. In the text below, locate and return the value for “Zulauftemperatur (°C)”. " 
    "If the field isn’t present, respond with “Not found”.\n\n"
    "Text:\n"
    "### Technische Daten\n"
    "- Stromart (Ph/V/Hz): 1 / 230 / 50\n"
    "- Druck (bar/MPa): 20 - max. 180 / 2 - max. 18\n"
    "- Fördermenge (l/h): max. 600\n"
    "- Flächenleistung (m²/h): 60\n"
    "- Zulauftemperatur (°C): max. 60\n"
    "- Anschlussleistung (kW): 3\n"
    "- Anschlusskabel (m): 5\n"
    "- Farbe: gelb\n"
    "- Gewicht ohne Zubehör (kg): 17,3\n"
    "- Gewicht inkl. Verpackung (kg): 24\n"
    "- Abmessungen (L × B × H) (mm): 458 × 330 × 669"
)
print(llm_result)


Great, so our 13-billion-parameter model is working, but it is too small to perform the tasks expected of a quality-check tool.