## LLM-Based Product Match Verification - Project Walkthrough

This notebook walks through the LLM-based product match verification system end to end. It covers:
- How the project is structured
- Which environment variables are required
- How datasets are loaded and filtered
- How the LLM components are used
- How label verification is run
- A look at the final results
- Conclusion

### Project Structure

```
src/
    data_loading/
        data_loader.py
    llm/
        client.py
        prompt.py
        label_verifier.py
        llm_output_model.py
    settings/
        config.py
project_walkthrough.ipynb
README.MD
requirements.txt
```

### Architecture Overview

```mermaid
flowchart TD
    A[DataLoader] --> B[Filtered DataFrame]
    B --> C[LabelVerifier]
    C --> D[LLMClient]
    D --> E[OpenAI Responses API]
    C --> F[LLMOutput]
    F --> G[Final Results DataFrame]
```

The pipeline is run end to end in this notebook.

### Importing all Useful Classes and Configuration Strings 

In [None]:
import pandas as pd

# Classes for the end-to-end pipeline
from data_loading.data_loader import DataLoader
from llm.label_verifier import LabelVerifier
from llm.client import LLMClient

# Not needed to directly run end-to-end pipeline, but used in an example demonstration
from llm.llm_output_model import LLMOutput

# Configuration strings (used in the notebook for demonstration purposes)
from llm.prompt import PROMPT
from settings.config import DATA_PATH, OPENAI_API_KEY

### Needed Environment Variables

To run this project, create a local `.env` file that defines the following environment variables:
- OPENAI_API_KEY: the API key used to configure the LLM client
- DATA_PATH: the path to the downloaded datasets `shopping_queries_dataset_examples.parquet` and `shopping_queries_dataset_products.parquet` from https://github.com/amazon-science/esci-data
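
As an illustration, the sketch below shows one way these two variables could be read from a `.env` file using only the standard library. The helper `load_env_file` and its parsing rules are assumptions made for this example, not the project's actual `settings/config.py`, and the values written to the temporary file are placeholders.

```python
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo with a throwaway file; the values are placeholders, not real credentials.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".env"
    env_path.write_text(
        "# local settings\n"
        "OPENAI_API_KEY=sk-placeholder\n"
        "DATA_PATH=/path/to/esci-data\n",
        encoding="utf-8",
    )
    env = load_env_file(env_path)

OPENAI_API_KEY = env.get("OPENAI_API_KEY")
DATA_PATH = env.get("DATA_PATH")
print("DATA_PATH loaded:", bool(DATA_PATH))
print("OPENAI_API_KEY loaded:", bool(OPENAI_API_KEY))
```

Printing only `bool(...)` of each value, as the next cell does, confirms the variables are set without echoing a secret into the notebook output.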

In [None]:
print("DATA_PATH loaded:", bool(DATA_PATH))
print("OPENAI_API_KEY loaded:", bool(OPENAI_API_KEY))

### Loading and Filtering the Datasets



We will load the two necessary datasets using the data loader. The class was designed for our use case but is flexible enough to be reused elsewhere. It can:
- load
- merge
- filter

#### Initializing the class

In [None]:
loader = DataLoader(
    dataset_a_name="shopping_queries_dataset_examples.parquet",
    dataset_b_name="shopping_queries_dataset_products.parquet"
)

#### Merging & Filtering

Our filtering method handles the merge internally and lets us filter on an "include" condition and an "equals" condition. In our case, we keep only rows whose query is in the list of queries of interest and whose ESCI label is equal to "E".
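
To make the two conditions concrete, here is a dependency-free sketch of the same semantics on plain lists of dicts. The real `DataLoader` works on DataFrames; the function names, the inner-join behavior, and the toy rows below are assumptions for illustration only.

```python
def merge_records(left, right, on):
    """Inner-join two lists of dicts on the given key columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[col] for col in on), []).append(row)
    merged = []
    for row in left:
        for match in index.get(tuple(row[col] for col in on), []):
            merged.append({**row, **match})
    return merged


def filter_records(rows, include_col, include_values, equal_col, equal_value):
    """Keep rows whose include_col is in include_values AND whose equal_col equals equal_value."""
    allowed = set(include_values)
    return [
        row for row in rows
        if row[include_col] in allowed and row[equal_col] == equal_value
    ]


# Toy rows that mimic the column names used in this notebook.
examples = [
    {"query": "aa batteries 100 pack", "product_id": "P1", "product_locale": "us", "esci_label": "E"},
    {"query": "aa batteries 100 pack", "product_id": "P2", "product_locale": "us", "esci_label": "S"},
    {"query": "usb cable", "product_id": "P3", "product_locale": "us", "esci_label": "E"},
]
products = [
    {"product_id": "P1", "product_locale": "us", "product_title": "AA batteries, 100 count"},
    {"product_id": "P2", "product_locale": "us", "product_title": "AAA batteries, 24 count"},
    {"product_id": "P3", "product_locale": "us", "product_title": "USB-C cable"},
]

merged = merge_records(examples, products, on=["product_id", "product_locale"])
filtered = filter_records(
    merged,
    include_col="query",
    include_values=["aa batteries 100 pack"],
    equal_col="esci_label",
    equal_value="E",
)
print([row["product_id"] for row in filtered])  # ['P1']
```

Only `P1` survives: `P2` has the right query but the wrong label, and `P3` has the right label but a query outside the include list.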

In [None]:
# Choosing the columns to merge the two datasets on
merge_columns = ["product_id", "product_locale"]

# First filtering condition: Query
include_col = "query"
include_values = [
    "aa batteries 100 pack",
    "kodak photo paper 8.5 x 11 glossy",
    "dewalt 8v max cordless screwdriver kit, gyroscopic"
]

# Second filtering condition: Label
equal_col = 'esci_label'
equal_values = 'E'

# Use filtering method
df_filtered = loader.create_filtered_dataset(
    merge_on=merge_columns,
    include_col=include_col, 
    include_values=include_values, 
    equal_col=equal_col, equal_value=equal_values
)

In [None]:
# Demonstrating what the filtered data looks like

df_filtered.head(3)

### LLM Components

The LLM client is a thin wrapper around the OpenAI API call and returns a structured response.

For structured outputs, we use Pydantic models to force the LLM to respond in a consistent schema, which simplifies downstream data processing.
The model response has the following fields:
- is_match_correct: a boolean that says whether the query matches the product
- corrected_query: a string that contains the reformulated query if the match was incorrect
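
The sketch below illustrates the shape of such a schema and why enforcing it helps. It uses a standard-library dataclass as a stand-in so it runs without extra dependencies; the project itself uses a Pydantic model (`LLMOutput`). The class name `MatchVerdict`, the validation rules, and the assumption that `corrected_query` may be null when the match is correct are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchVerdict:
    """Dependency-free stand-in for the Pydantic LLMOutput model (hypothetical)."""

    is_match_correct: bool
    corrected_query: Optional[str] = None  # assumed to be null when the match is correct

    @classmethod
    def from_json(cls, raw):
        """Parse a JSON string and reject anything that does not fit the schema."""
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        if not isinstance(data.get("is_match_correct"), bool):
            raise ValueError("is_match_correct must be a boolean")
        corrected = data.get("corrected_query")
        if corrected is not None and not isinstance(corrected, str):
            raise ValueError("corrected_query must be a string or null")
        return cls(is_match_correct=data["is_match_correct"], corrected_query=corrected)


# A well-formed response parses into a typed object.
verdict = MatchVerdict.from_json(
    '{"is_match_correct": false, "corrected_query": "aaa batteries 24 pack"}'
)
print(verdict)
```

Because malformed responses fail at parse time, downstream code can rely on the two fields having the expected types.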

In [None]:
client = LLMClient()

The prompt needs to validate whether the query and product match. To do so, it is filled in with fields from each query-product pair, in particular:
- The query
- Product information: Title, brand, description, and bullet points

The prompt is built from a template and paired with a structured output model, which makes iteration easy.
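
As a rough picture of what such a template can look like, here is a minimal sketch. The placeholder names match the keyword arguments used later in this notebook, but the wording of `EXAMPLE_PROMPT` and the toy row are invented for illustration; the real text lives in `llm/prompt.py`.

```python
# Hypothetical template; only the placeholder names are taken from this notebook.
EXAMPLE_PROMPT = """You are verifying product search labels.

Query ID: {query_id}
Query: {query}

Product ID: {product_id}
Title: {title}
Brand: {brand}
Description: {description}
Bullet points: {bullets}

Decide whether the product is an exact match for the query.
If it is not, propose a corrected query that the product would match.
"""

# Toy row using the same column names as the filtered DataFrame.
row = {
    "query": "aa batteries 100 pack",
    "query_id": 1,
    "product_id": "P1",
    "product_title": "AA batteries, 100 count",
    "product_brand": "ExampleBrand",
    "product_description": "Alkaline AA batteries.",
    "product_bullet_point": "100 count; long shelf life",
}

prompt_text = EXAMPLE_PROMPT.format(
    query=row["query"],
    query_id=row["query_id"],
    product_id=row["product_id"],
    title=row["product_title"],
    brand=row["product_brand"],
    description=row["product_description"],
    bullets=row["product_bullet_point"],
)
print(prompt_text)
```

Keeping the template as a plain string with named placeholders means the wording can change without touching the code that fills it in.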

In [None]:
print(PROMPT)

### Example LLM Call

In [None]:
example_row = df_filtered.iloc[0]

example_prompt = PROMPT.format(
    query=example_row["query"],
    query_id=example_row["query_id"],
    product_id=example_row["product_id"],
    title=example_row["product_title"],
    brand=example_row["product_brand"],
    description=example_row["product_description"],
    bullets=example_row["product_bullet_point"],
)

single_output = client.generate_structured_response(
    prompt=example_prompt,
    output_model=LLMOutput
)

print("Parsed Structured LLM Output (LLMOutput Model):")
single_output

### Label Verification

This is the final, crucial step of the pipeline. It runs the prompt on our filtered dataset and returns a DataFrame of the correct and incorrect matches.
This class keeps all LLM interaction logic in one place.
It relies on the independent LLM components and handles the interaction between them.
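
The orchestration can be sketched roughly as follows. This is not the project's `LabelVerifier`: `FakeClient`, `Verdict`, `TEMPLATE`, and `verify_rows` are hypothetical stand-ins that run offline on plain dicts, and only the `generate_structured_response(prompt=..., output_model=...)` call shape is taken from this notebook.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Verdict:
    """Stand-in for the structured output model."""

    is_match_correct: bool
    corrected_query: Optional[str] = None


class FakeClient:
    """Offline stand-in for LLMClient; makes no network call."""

    def generate_structured_response(self, prompt, output_model):
        # Toy rule so the demo is deterministic: flag prompts that mention "AAA".
        if "AAA" in prompt:
            return output_model(is_match_correct=False, corrected_query="aaa batteries")
        return output_model(is_match_correct=True)


TEMPLATE = "Query: {query}\nTitle: {title}"


def verify_rows(rows, client):
    """Format one prompt per row, call the client, and collect flat result records."""
    results = []
    for row in rows:
        prompt = TEMPLATE.format(query=row["query"], title=row["product_title"])
        verdict = client.generate_structured_response(prompt=prompt, output_model=Verdict)
        results.append(
            {"query": row["query"], "product_id": row["product_id"], **asdict(verdict)}
        )
    return results


rows = [
    {"query": "aa batteries 100 pack", "product_id": "P1", "product_title": "AA batteries, 100 count"},
    {"query": "aa batteries 100 pack", "product_id": "P2", "product_title": "AAA batteries, 24 count"},
]
for record in verify_rows(rows, FakeClient()):
    print(record)
```

Passing the client in as an argument keeps the loop testable: a fake client can replace the real one without changing the verification logic.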

In [None]:
verifier = LabelVerifier(llm_client=client)

df_results = verifier.run_dataframe(df_filtered)

Let's take a look at the final table with our results. Our end-to-end pipeline has now run to completion!

In [None]:
df_results

The results show that most entries have a correct match between the query and the product. In some cases, however, the query and product differ on specific details, and these mismatches are flagged. This gives us a verification tool that could be generalized to other queries.

### Conclusion

This project follows a clean, modular architecture designed for clarity, maintainability, and correctness.

#### Design Decisions
- Modular Components:
  - `DataLoader` handles loading and filtering only
  - `LLMClient` abstracts all OpenAI interactions
  - `LabelVerifier` orchestrates row-level processing and prompt formatting 
  - `LLMOutput` enforces a strict schema using Pydantic for reliable parsing

- Structured LLM Output:
  Using `responses.parse()` ensures consistent return types

- Prompt Isolation:
  The full prompt lives in `prompt.py`, making it easy to update and iterate independently from code

- Declarative Filtering:
  The DataLoader accepts flexible filter parameters (`include_col`, `equal_col`) instead of hard-coding logic

### Summary
This notebook shows the full end-to-end pipeline: loading data, preparing prompts, calling the LLM through a clean wrapper, parsing structured outputs, and producing final match decisions. The project is intentionally simple, readable, and production-friendly, making it easy for reviewers to run, inspect, and extend.