# The Baseline RAG Pipeline

**RAG Prototyping Workshop**

---

## What You Will Learn

This notebook is the starting point for the workshop. It introduces the **key concepts** behind Retrieval-Augmented Generation (RAG) and walks through every step of the **baseline pipeline** that the later phases build upon.

After working through this notebook you will be able to:
- Explain why a standalone LLM is insufficient for grounded enterprise Q&A
- Describe the five stages of a RAG pipeline (chunk -> embed -> store -> retrieve -> generate)
- Run the full baseline pipeline against the PrimePack AG corpus
- Use the retrieval inspection step as the primary debugging tool
- Identify the three main failure modes this workshop addresses

**Workshop Phases at a Glance**
| Notebook | Focus |
|---|---|
| **Baseline (this notebook)** | Key concepts + end-to-end baseline |
| Feature Track 1 | Chunking strategies & document ingestion |
| Feature Track 2 | Evaluation metrics (retrieval + generation) |
| Feature Track 3 | Reliable & structured outputs |
| Feature Track 4 | Advanced retrieval |
| Feature Track 5 | Multi-step agent workflows |

---

## 1. Why RAG? The Problem with a Standalone LLM

### The Scenario
**PrimePack AG** buys packaging materials (pallets, cardboard boxes, tape) from multiple suppliers. Sustainability claims are increasingly scrutinised by customers and regulators. Employees need to answer questions like:
> *"What is the GWP of the Logypal 1 pallet, and is the figure verified?"*  
> *"Can we tell a customer that the tesa tape is PFAS-free?"*  
> *"Which of our suppliers have a certified EPD?"*

### Why Not Just Ask ChatGPT?
A general-purpose LLM has three fundamental problems for this task:

| Problem | Why It Matters |
|---|---|
| **No product knowledge** | LLMs know nothing about Logypal 1, Andrea Packaging's specific portfolio, or the individual supplier documents. |
| **Hallucination** | When asked about unknown products the LLM invents plausible-sounding but false figures. |
| **No evidence trail** | Even when correct, a raw LLM answer cannot be traced back to a source document. |

### The RAG Solution
RAG adds a **retrieval step** between the user's question and the LLM:

```
 Documents ──► Chunker ──► Embedder ──► Vector DB
                                              │
 User query ─────────────────► Embedder ─────►  Retriever ──► Top-k Chunks
                                                                      │
                                                               LLM + Prompt
                                                                      │
                                                               Answer + Sources
```

The LLM only sees documents that are **actually in the corpus**. The answer can be traced to specific source chunks. If the corpus does not contain the answer, the LLM is instructed to say so.
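The five stages in the diagram can be sketched in plain Python. This is a toy illustration, not the workshop implementation: the bag-of-words `embed` function, the in-memory `store` list, and the two sample documents are assumptions made up for this sketch, whereas the real pipeline uses a sentence-transformer model and ChromaDB.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; the text is invented for illustration only.
documents = {
    "pallet.md": "The pallet is made of recycled plastic. It weighs ten kilograms.",
    "tape.md": "The tape uses a paper backing. The adhesive is solvent free.",
}


def chunk(text: str) -> list[str]:
    """Stage 1: split a document into sentence-sized chunks."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def embed(text: str) -> Counter:
    """Stage 2: toy 'embedding' = word counts (a real model returns a dense vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Stage 3: store (source, chunk, vector) triples in memory.
store = [(name, c, embed(c)) for name, text in documents.items() for c in chunk(text)]


def retrieve(query: str, top_k: int = 2) -> list[tuple[str, str]]:
    """Stage 4: rank stored chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(name, c) for name, c, _ in ranked[:top_k]]


def build_prompt(query: str) -> str:
    """Stage 5 (input side): the prompt an LLM would receive."""
    sources = "\n".join(f"[{name}] {c}" for name, c in retrieve(query))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"


print(build_prompt("What is the pallet made of?"))
```

Note that `retrieve` always returns `top_k` chunks, whether or not any of them is actually relevant to the question.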

### What RAG Does *Not* Fix
RAG shifts the problem from hallucination to **retrieval quality**. If the right chunk is not retrieved, the answer will still be wrong (or absent). The later phases of this workshop address exactly this: better chunking, better retrieval, and better output structure.

---

## 2. Core Concepts

### Chunks

A **chunk** is a short excerpt from a source document: a section of a PDF, one sheet of a spreadsheet, or one heading-delimited paragraph of a Markdown file. Chunks are the unit of indexing and retrieval.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    id: str           # unique identifier
    title: str        # e.g. section heading
    content: str      # the text that gets embedded
    metadata: dict    # source_file, page, ...
```

### Embeddings
An **embedding** converts text to a dense numeric vector (e.g. 384 dimensions). Semantically similar texts produce similar vectors. Here we use `all-MiniLM-L6-v2`, a compact local model that runs without an API key.

### Vector Store (ChromaDB)
A **vector store** persists chunk embeddings on disk and supports approximate nearest-neighbour search. Given a query embedding, it returns the `top_k` most similar chunks in milliseconds.
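What a vector store computes can be sketched as an exact (brute-force) search in NumPy. This is an illustration under assumptions: random vectors stand in for real embeddings, and `top_k_search` is a hypothetical helper, not part of ChromaDB, which relies on an approximate index to avoid comparing the query against every stored vector.

```python
import numpy as np


def top_k_search(
    query: np.ndarray, matrix: np.ndarray, top_k: int = 5
) -> list[tuple[int, float]]:
    """Return (row index, squared L2 distance) for the top_k closest rows."""
    distances = np.sum((matrix - query) ** 2, axis=1)
    order = np.argsort(distances)[:top_k]
    return [(int(i), float(distances[i])) for i in order]


rng = np.random.default_rng(seed=0)
chunk_vectors = rng.normal(size=(100, 384))  # stand-in for 100 chunk embeddings
# A query that lies very close to chunk 42
query_vector = chunk_vectors[42] + 0.01 * rng.normal(size=384)

hits = top_k_search(query_vector, chunk_vectors, top_k=5)
print(hits[0])
```

The exact search scales linearly with the number of chunks, which is fine for a small corpus but is the reason larger stores use approximate indexes.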

### Retriever
A **retriever** wraps a vector store and exposes a single `retrieve(query)` method. The baseline uses a `VectorStoreRetriever` with `top_k=5`.

### RAG Agent
The **RAG agent** combines a retriever and an LLM. Its `answer()` method:
1. Embeds the query
2. Retrieves the top-k chunks
3. Formats chunks as XML `<source>` tags in the prompt
4. Calls the LLM and returns the answer + cited sources
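Step 3 can be sketched as follows. The exact tag attributes and prompt layout used by the toolkit are not shown in this notebook, so the `format_sources` helper and the `id` / `file` attributes below are assumptions for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def format_sources(chunks: list[dict]) -> str:
    """Wrap each retrieved chunk in a <source> tag so the LLM can cite it."""
    blocks = []
    for c in chunks:
        blocks.append(
            f"<source id={quoteattr(c['id'])} file={quoteattr(c['source_file'])}>\n"
            f"{escape(c['content'])}\n"
            f"</source>"
        )
    return "<sources>\n" + "\n".join(blocks) + "\n</sources>"


# Hypothetical retrieved chunks (content invented for the example).
retrieved = [
    {"id": "c1", "source_file": "datasheet.pdf", "content": "Load capacity < 1000 kg."},
    {"id": "c2", "source_file": "epd.pdf", "content": "Recycled content: 100%."},
]
prompt_block = format_sources(retrieved)
print(prompt_block)
```

Escaping the chunk text keeps characters such as `<` in the source documents from being mistaken for tags.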

---

## 3. Setup

**Prerequisites:**
- `conversational-toolkit` installed in editable mode (`pip install -e conversational-toolkit/`) (already done on Renku)
- `backend` installed in editable mode (`pip install -e backend/`) (already done on Renku)
- For the **Ollama** backend (default): `ollama serve` running + `ollama pull mistral-nemo:12b`
- For the **OpenAI** backend: `OPENAI_API_KEY` set in the environment (already done on Renku)

In [None]:
# Imports and configuration
from pathlib import Path


from conversational_toolkit.agents.base import QueryWithContext
from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from conversational_toolkit.retriever.vectorstore_retriever import VectorStoreRetriever

from sme_kt_zh_collaboration_rag.baseline_rag import (
    load_chunks,
    inspect_chunks,
    build_vector_store,
    inspect_retrieval,
    build_agent,
    build_llm,
    ask,
    DATA_DIR,
    VS_PATH,
    EMBEDDING_MODEL,
    RETRIEVER_TOP_K,
)

BACKEND = "ollama"  # "ollama" (local) or "openai" (requires OPENAI_API_KEY)

ROOT = Path().resolve().parents[1]  # backend/notebooks/ → project root
print(f"Project root : {ROOT}")
print(f"Data dir     : {DATA_DIR}")
print(f"Vector store : {VS_PATH}")
print(f"LLM backend  : {BACKEND}")

Consider using the pymupdf_layout package for a greatly improved page layout analysis.
Project root : /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag
Data dir     : /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/data
Vector store : /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/backend/data_vs.db
LLM backend  : ollama


---

## Step 1: Load and Chunk Documents

The `load_chunks()` function walks `data/` and dispatches each file to the right chunker:

| Extension | Chunker | Strategy |
|---|---|---|
| `.pdf` | `PDFChunker` | Convert to Markdown via `pymupdf4llm`, split on `#` headings |
| `.xlsx`, `.xls` | `ExcelChunker` | One chunk per sheet, serialised as a Markdown table |
| `.md`, `.txt` | `MarkdownChunker` | Split on `#` headings |

The result is a flat `list[Chunk]`, the same structure regardless of the original format.
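Heading-based splitting, the strategy listed for Markdown files and converted PDFs, can be sketched as below. This is a simplified stand-in and not the toolkit's `MarkdownChunker`: the `split_on_headings` function and the sample text are assumptions.

```python
import re


def split_on_headings(markdown: str) -> list[dict]:
    """Start a new chunk at every line that begins with one to six '#' characters."""
    chunks: list[dict] = []
    current = {"title": "", "lines": []}
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            # Close the running chunk unless it is still empty.
            if current["title"] or any(ln.strip() for ln in current["lines"]):
                chunks.append(current)
            current = {"title": line.strip(), "lines": [line]}
        else:
            current["lines"].append(line)
    if current["title"] or any(ln.strip() for ln in current["lines"]):
        chunks.append(current)
    return [
        {"title": c["title"], "content": "\n".join(c["lines"]).strip()} for c in chunks
    ]


sample = "# Product\nRecycled plastic pallet.\n\n## Dimensions\n1200 x 800 mm\n"
for c in split_on_headings(sample):
    print(repr(c["title"]), len(c["content"]))
```

With this strategy chunk size is dictated entirely by the document's heading structure, so very short and very long chunks can both occur.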

We use `max_files=5` here for speed. Remove the limit (or set `None`) to load the full corpus.

> **Feature Track 1** explores alternative chunking strategies in depth.

In [2]:
chunks = load_chunks(max_files=5)
inspect_chunks(chunks)

# Quick size distribution
char_lengths = [len(c.content) for c in chunks]
over_limit = sum(1 for n in char_lengths if n > 1024)
print(f"\nChunks total       : {len(chunks)}")
print(f"Mean length (chars): {sum(char_lengths) // len(char_lengths)}")
print(f"Over 1024-char limit (≈256 tok embedding limit): {over_limit} / {len(chunks)}")

2026-02-20 15:41:10.821 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:178 - Chunking 5 files from /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/data


5


2026-02-20 15:41:19.315 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   1_Product-Life-Cycle-Accounting-Reporting-Standard_041613.pdf: 32 chunks
2026-02-20 15:41:20.152 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   2_EPD_pallet_CPR.pdf: 11 chunks
2026-02-20 15:41:21.541 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   3_EPD_pallet_relicyc.pdf: 17 chunks
2026-02-20 15:41:22.869 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   4_EPD_pallet_Stabilplastik.pdf: 2 chunks
2026-02-20 15:41:22.870 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   ART_customer_inquiry_f


Chunks total       : 68
Mean length (chars): 5304
Over 1024-char limit (≈256 tok embedding limit): 34 / 68
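
Why the over-limit count matters: text beyond an embedding model's input limit is typically truncated and therefore does not influence the vector. A rough way to estimate how much text is affected is sketched below, assuming the 1024-character approximation used in the cell above; the `truncated_share` helper and the sample lengths are hypothetical.

```python
def truncated_share(char_lengths: list[int], limit: int = 1024) -> float:
    """Fraction of all characters that lie beyond `limit` within their chunk."""
    total = sum(char_lengths)
    beyond = sum(max(0, n - limit) for n in char_lengths)
    return beyond / total if total else 0.0


example_lengths = [31, 42, 2048, 4096]  # invented chunk lengths in characters
print(f"{truncated_share(example_lengths):.1%} of characters fall past the limit")
```

Passing the notebook's own `char_lengths` list to the function gives the corresponding estimate for the loaded corpus.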


### What a Chunk Looks Like

Each chunk carries a `title` (the heading), the raw text `content`, and a `metadata` dict
with the source file name. This metadata is returned alongside the answer so the user can
trace every claim back to its origin document.

In [3]:
# Print 3 representative chunks
for c in chunks[:3]:
    print(f"--- [{c.metadata.get('source_file', '?')}] ---")
    print(f"Title  : {c.title!r}")
    print(f"Length : {len(c.content)} chars")
    print(f"Preview: {c.content[:250].strip()!r}")
    print()

--- [1_Product-Life-Cycle-Accounting-Reporting-Standard_041613.pdf] ---
Title  : '###### **_01 Introduction_**'
Length : 31 chars
Preview: '###### **_01 Introduction_**'

--- [1_Product-Life-Cycle-Accounting-Reporting-Standard_041613.pdf] ---
Title  : '# **_E_**'
Length : 15995 chars
Preview: '# **_E_**\n\n\n\n_**missions of the anthropogenic greenhouse gases (GHG) that drive climate change**_\n\n_**and its impacts around the world are growing. According to climate scientists,**_\n\n_**global carbon dioxide emissions must be cut by as much as 85 p'

--- [1_Product-Life-Cycle-Accounting-Reporting-Standard_041613.pdf] ---
Title  : '###### **_02 Defining Business Goals_**'
Length : 42 chars
Preview: '###### **_02 Defining Business Goals_**'



---

## Step 2: Embed Chunks and Build the Vector Store

`SentenceTransformerEmbeddings` converts every chunk's `content` to a 384-dimensional vector using `all-MiniLM-L6-v2`. The resulting matrix (shape `[n_chunks, 384]`) is inserted into a persistent `ChromaDBVectorStore`.

**On subsequent runs**, leave `reset=False` (the default) to skip re-embedding: it takes time, and the store on disk is already up to date. Pass `reset=True` only when the corpus or chunking strategy changes.

> **Why 384 dimensions?** `all-MiniLM-L6-v2` is a distilled model: small enough to run on CPU in seconds but good enough for retrieval on short technical texts. OpenAI's `text-embedding-3-small` produces 1536-dimensional vectors with higher quality at the cost of calls to a paid API.

In [4]:
embedding_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)
print(f"Embedding model: {EMBEDDING_MODEL}")

# Set reset=True to rebuild the store from scratch
vector_store = await build_vector_store(
    chunks, embedding_model, db_path=VS_PATH, reset=False
)
print("Vector store ready.")

2026-02-20 15:41:24.884 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:__init__:57 - Sentence Transformer embeddings model loaded: sentence-transformers/all-MiniLM-L6-v2 with kwargs: {}
2026-02-20 15:41:24.972 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:build_vector_store:235 - Vector store already contains 78 chunks — skipping embedding.


Embedding model: sentence-transformers/all-MiniLM-L6-v2
Vector store ready.


### Similarity in Embedding Space

Embeddings that are close in vector space share semantic meaning. The cell below embeds several sentences and measures their cosine similarity: a value between -1 (opposite) and 1 (identical). You can change the sentences to see the impact on cosine similarity.

In [5]:
import numpy as np

sentence1 = "carbon footprint of a pallet"
sentence2 = "GWP value for the Logypal 1"
sentence3 = "PFAS-free tape declaration"
sentence4 = "the annual report of a software firm"


async def cosine_similarity(a: str, b: str) -> float:
    vecs = await embedding_model.get_embeddings([a, b])
    return float(
        np.dot(vecs[0], vecs[1]) / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
    )


pairs = [
    (sentence1, sentence2),
    (sentence1, sentence3),
    (sentence1, sentence4),
]

print("Cosine similarities:")
for a, b in pairs:
    sim = await cosine_similarity(a, b)
    print(f"{sim:.3f}  -->  {a!r}  vs  {b!r}")

2026-02-20 15:41:25.082 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (2, 384)


Cosine similarities:
0.133  -->  'carbon footprint of a pallet'  vs  'GWP value for the Logypal 1'


2026-02-20 15:41:25.169 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (2, 384)
2026-02-20 15:41:25.178 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (2, 384)


-0.064  -->  'carbon footprint of a pallet'  vs  'PFAS-free tape declaration'
0.014  -->  'carbon footprint of a pallet'  vs  'the annual report of a software firm'


---

## Step 3: Inspect Retrieval (Before the LLM Sees Anything)

This is the **most important diagnostic step** in the whole pipeline:

> If the retrieved chunks are wrong, the final answer will be wrong regardless of how good the LLM is.

`inspect_retrieval()` runs the query through the embedding model, fetches the top-k most similar chunks from ChromaDB, and prints them with scores. Use this to:
- Verify that relevant documents are in the index
- Tune `top_k`
- Compare different query phrasings
- Identify retrieval gaps before blaming the LLM

The **score** shown is an L2 distance (squared Euclidean distance in ChromaDB's default `l2` space), so lower = more similar; for unit-length vectors it lies in the range [0, 4]. ChromaDB defaults to L2 because it is well defined for any vectors, normalised or not. Cosine similarity, by contrast, compares only direction and ignores magnitude. Because `all-MiniLM-L6-v2` produces unit-length vectors, the two are tied together by `squared L2 = 2 − 2 · cosine similarity`, so both metrics rank chunks identically. The score numbers look different, but the top-5 results would be the same either way.
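
The claim that L2 distance and cosine similarity rank unit-length vectors identically can be checked numerically. For unit vectors `a` and `b`, `‖a − b‖² = 2 − 2 · cos(a, b)`, so sorting by squared L2 distance (ascending) gives the same order as sorting by cosine similarity (descending). The sketch below uses random unit vectors as stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
vectors = rng.normal(size=(50, 384))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalise rows to unit length
query = vectors[0]
candidates = vectors[1:]

cosine = candidates @ query  # cosine similarity, valid because all vectors are unit length
squared_l2 = np.sum((candidates - query) ** 2, axis=1)  # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(squared_l2, 2 - 2 * cosine)

# Same ranking: ascending distance corresponds to descending similarity
same_order = np.array_equal(np.argsort(squared_l2), np.argsort(-cosine))
print(identity_holds, same_order)
```

The identity also explains the [0, 4] range: cosine similarity of 1 maps to distance 0, and cosine similarity of -1 maps to distance 4.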

In [6]:
QUERY = "What materials is the Logypal 1 pallet made from?"

results = await inspect_retrieval(QUERY, vector_store, embedding_model)

2026-02-20 15:43:06.665 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-20 15:43:06.671 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:inspect_retrieval:274 - Retrieval for query: 'What materials is the Logypal 1 pallet made from?'



Top-5 retrieved chunks (returned=5; showing a maximum of 1000 content characters):
  [1] score=0.8810  file='3_EPD_pallet_relicyc.pdf'  title='# PRODUCT INFORMATION'
       '# PRODUCT INFORMATION\n\nThis Environmental Product Declaration concerns the environmental\n\nimpacts associated with a model of recycled polypropylene pallet:\n\n Logypal 1 [®]\n\n All these pallets are produced with secondary raw materials\n\n(a mix of polypropylene and high density polyethylene).\n\nThese new plastic pallets are the real alternative to the ISPM-15\n\ntreated wooden pallet (HT standard phytosanitary treatment that\n\ncertifies the suitability of the material to the international regulations\n\ndrawn up by the IPPC), having a comparable cost, but without the\n\nbureaucracy and mandatory certifications for purchase.\n\nThese products are also light, resistant, washable and resistant to\n\nmold and humidity.\n\nThe main characteristics of the model of pallet under study are shown\n\nin the followin

### Retrieval for a Product Outside the Portfolio

The PrimePack AG product catalog defines the portfolio boundary. The **Lara Pallet** is not in the catalog; it does not exist. Watch which chunks are returned and what scores they have. A **higher** minimum score (larger L2 distance) signals a *weaker semantic match*.

In [8]:
QUERY_OOK = "What materials is the Lara pallet made from?"

results_ook = await inspect_retrieval(QUERY_OOK, vector_store, embedding_model)

2026-02-20 15:49:19.544 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-20 15:49:19.545 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:inspect_retrieval:274 - Retrieval for query: 'What materials is the Lara pallet made from?'



Top-5 retrieved chunks (returned=5; showing a maximum of 1000 content characters):
  [1] score=1.0161  file='3_EPD_pallet_relicyc.pdf'  title='# PRODUCT INFORMATION'
       '# PRODUCT INFORMATION\n\nThis Environmental Product Declaration concerns the environmental\n\nimpacts associated with a model of recycled polypropylene pallet:\n\n Logypal 1 [®]\n\n All these pallets are produced with secondary raw materials\n\n(a mix of polypropylene and high density polyethylene).\n\nThese new plastic pallets are the real alternative to the ISPM-15\n\ntreated wooden pallet (HT standard phytosanitary treatment that\n\ncertifies the suitability of the material to the international regulations\n\ndrawn up by the IPPC), having a comparable cost, but without the\n\nbureaucracy and mandatory certifications for purchase.\n\nThese products are also light, resistant, washable and resistant to\n\nmold and humidity.\n\nThe main characteristics of the model of pallet under study are shown\n\nin the followin

> **Observation:** The retriever always returns the *closest* chunks it can find; it has no concept of "no match". For an unknown product the L2 distances are **higher** (the closest chunks are still about other pallets), but without a score-threshold guard the LLM receives those chunks anyway and may silently answer about the wrong product.
> **Phase 3** shows how to combat this issue.
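
A score-threshold guard of the kind mentioned in the observation could look like the sketch below. It is an illustration under assumptions: the `(score, title)` tuples, the `filter_by_distance` helper, and the cut-off of 0.95 are hypothetical, and the structure returned by `inspect_retrieval()` may differ.

```python
def filter_by_distance(
    results: list[tuple[float, str]], max_distance: float
) -> list[tuple[float, str]]:
    """Keep only results whose distance is at or below the cut-off (lower = more similar)."""
    return [(score, title) for score, title in results if score <= max_distance]


# Top-1 scores taken from the two retrieval runs above; the cut-off is illustrative only.
known_product = [(0.8810, "# PRODUCT INFORMATION")]
unknown_product = [(1.0161, "# PRODUCT INFORMATION")]

for name, results in [("Logypal 1", known_product), ("Lara", unknown_product)]:
    kept = filter_by_distance(results, max_distance=0.95)
    print(name, "->", "answer from sources" if kept else "no sufficiently close source")
```

The gap between 0.88 and 1.02 is small, so a fixed cut-off like this would need to be tuned on a set of labelled queries before it could be trusted.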

---

## Step 4: Build the RAG Agent

`build_agent()` assembles the three components:

```
VectorStoreRetriever
    └─ ChromaDBVectorStore (on disk, persists across runs)
    └─ SentenceTransformerEmbeddings

RAG Agent
    ├─ LLM (Ollama / OpenAI / SDSC Qwen)
    ├─ Retriever
    └─ System prompt
```

### The System Prompt

The system prompt is a very powerful lever for controlling LLM behaviour:

```
You are a helpful AI assistant specialised in sustainability and product compliance. Answer questions using the provided sources. If the information is not in the sources, say so clearly.
```

The instruction *"If the information is not in the sources, say so clearly"* is intended to prevent hallucination about missing products and unverified claims.

In [9]:
llm = build_llm(backend=BACKEND)
agent = build_agent(
    vector_store=vector_store,
    embedding_model=embedding_model,
    llm=llm,
    top_k=RETRIEVER_TOP_K,
    number_query_expansion=0,  # 0 = no expansion; covered in more detail in Feature Track 4
)
print("RAG agent assembled.")

2026-02-20 15:54:39.250 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:build_llm:129 - LLM backend: Ollama (mistral-nemo:12b)
2026-02-20 15:54:39.277 | DEBUG    | conversational_toolkit.llms.ollama:__init__:60 - Ollama LLM loaded: mistral-nemo:12b; temperature: 0.3; seed: 42; tools: None; response_format: None
2026-02-20 15:54:39.279 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:build_agent:306 - RAG agent ready (top_k=5  query_expansion=0)


RAG agent assembled.


---

## Step 5: Ask a Question

`ask()` sends the query to the agent and returns the answer string. The internal flow is:

1. Embed the query
2. Retrieve top-k chunks
3. Build the prompt: `<system>` + `<sources>` XML block + user question
4. Generate the answer with the LLM
5. Return the answer and a list of cited source chunks

In [None]:
QUERY = "What materials is the Logypal 1 pallet made from?"

answer = await ask(agent, QUERY)
print("\n--------------- Answer --------------- ")
print(answer)

2026-02-20 15:55:02.149 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:ask:323 - Query: 'What materials is the Logypal 1 pallet made from?'
2026-02-20 15:55:02.396 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-20 15:55:27.865 | DEBUG    | conversational_toolkit.llms.ollama:generate_stream:116 - model='mistral-nemo:12b' created_at='2026-02-20T14:55:27.844893Z' done=False done_reason=None total_duration=None load_duration=None prompt_eval_count=None prompt_eval_duration=None eval_count=None eval_duration=None message=Message(role='assistant', content='Based', thinking=None, images=None, tool_name=None, tool_calls=None) logprobs=None
2026-02-20 15:55:27.919 | DEBUG    |


=== Answer ===
Based on the provided sources:

1. **Main Material**: The Logypal 1 pallet is primarily made from recycled polypropylene (PP), with some high-density polyethylene (HDPE). This is confirmed by both Source 1 and Source 2.
   - Source 1: "All these pallets are produced with secondary raw materials (a mix of polypropylene and high density polyethylene)."
   - Source 2: "The product under this study has a recycled plastic content of 100% [...] mainly composed (> 99%) of polyolefins."

2. **Recycled Content**: Both sources mention that the Logypal 1 pallet is made from recycled materials.
   - Source 1: "These products are also fully recyclable packaging [...]."
   - Source 2: "The product under this study has a recycled plastic content of 100% and recycled materials are post-consumer plastic waste."

3. **Other Materials**: While not the main components, other materials may include micronized aluminum, cellulose, resin, glass fibers, additives, and pigments (Source 5). Howev

---

## 4. Probing Failure Modes

The corpus was designed with three deliberate challenges. Run the queries below and observe the answers.

### 4a: Out-of-Portfolio Query

The **Lara Pallet** does not exist. A good RAG system must say so instead of describing a different pallet.

In [11]:
answer_ook = await ask(agent, "What materials is the Lara pallet made from?")
print(answer_ook)

2026-02-20 16:02:52.260 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:ask:323 - Query: 'What materials is the Lara pallet made from?'
2026-02-20 16:02:52.411 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-20 16:03:16.585 | DEBUG    | conversational_toolkit.llms.ollama:generate_stream:116 - model='mistral-nemo:12b' created_at='2026-02-20T15:03:16.579046Z' done=False done_reason=None total_duration=None load_duration=None prompt_eval_count=None prompt_eval_duration=None eval_count=None eval_duration=None message=Message(role='assistant', content='Based', thinking=None, images=None, tool_name=None, tool_calls=None) logprobs=None
2026-02-20 16:03:16.649 | DEBUG    | conv

Based on the sources provided, here's what we know about the materials used to make the Lara pallet:

1. **Primary Material**:
   - The Lara pallet is made primarily from plastic.
   - It is composed of greater than 99% polyolefins and other trace materials (Source: 'a733b848-ef47-4690-bd0c-cf41a134abc3').

2. **Recycled Content**:
   - The pallet has a recycled plastic content of 100%.
   - It is made up of post-consumer plastic waste (Source: 'a733b848-ef47-4690-bd0c-cf41a134abc3').
   - The pallet is free from hazardous chemical substances as classified under REACH and CLP regulations (Source: '6b28b9fa-d78d-49f0-a056-6d11748a6ed0').

3. **Additional Materials**:
   - Some sources mention additional materials used in the production of similar pallets, such as:
     - Micronized aluminum and cellulose (Source: '6b28b9fa-d78d-49f0-a056-6d11748a6ed0').
     - Glass fibers and additives/pigments (Source: '6b28b9fa-d78d-49f0-a056-6d11748a6ed0').

However, it's important to note that the 

### 4b: Missing Data (LogyLight Pallet)

The LogyLight datasheet marks all LCA fields as *"not yet available"*. The correct answer is that we don't have the data, not a fabricated figure.

In [12]:
answer_gap = await ask(agent, "What is the GWP of the LogyLight pallet?")
print(answer_gap)

2026-02-20 16:04:38.952 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:ask:323 - Query: 'What is the GWP of the LogyLight pallet?'
2026-02-20 16:04:39.412 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-20 16:05:15.519 | DEBUG    | conversational_toolkit.llms.ollama:generate_stream:116 - model='mistral-nemo:12b' created_at='2026-02-20T15:05:15.513433Z' done=False done_reason=None total_duration=None load_duration=None prompt_eval_count=None prompt_eval_duration=None eval_count=None eval_duration=None message=Message(role='assistant', content='Based', thinking=None, images=None, tool_name=None, tool_calls=None) logprobs=None
2026-02-20 16:05:15.648 | DEBUG    | conversa

Based on the provided information, here's a summary of the environmental impact assessment for the plastic pallet using an alternative functional unit of transporting 1 m³ under specific conditions:

**Global Warming Potential (GWP):**
- Upstream: 5.50E-04 kg CO₂ eq.
- Core: 2.31E+00 kg CO₂ eq.
- Downstream: 2.05E+00 kg CO₂ eq.
- Total: 4.36E+00 kg CO₂ eq.

**Acidification Potential (AP):**
- Total: 1.86E-02 kg mol H⁺ eq.

**Eutrophication Potential (EP):**
- Aquatic freshwater: 4.76E-04 kg P eq.
- Aquatic marine: 4.77E-03 kg N eq.
- Aquatic terrestrial: 5.08E-02 mol N eq.

**Photochemical Oxidant Creation Potential:**
- Total: 1.33E-02 kg NMVOC eq.

**Ozone Layer Depletion:**
- Total: 8.49E-07 kg CFC 11 eq.

**Abiotic Depletion Potential (ADP):**
- Metals and minerals: 1.48E-05 kg Sb eq.
- Fossil resources: 6.70E+01 MJ

**Water Deprivation Potential (WDP):**
- Total: 1.71E+00 m³ depriv.

**Resources Use:**
- Renewable materials used as energy carrier: 4.85E-01 MJ, net calorific value


### 4c: Conflicting Evidence (Relicyc GWP Figures)

The 2021 Relicyc datasheet reports **4.1 kg CO₂e** per pallet. The 2023 EPD (third-party verified) reports a different, more recent figure. The RAG should flag the conflict and prefer the verified, more recent source.
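
One way to make "prefer the verified, more recent source" concrete is to rank candidate sources by metadata before generation. The sketch below assumes each source carries hypothetical `verified` and `year` fields (the baseline chunk metadata is not shown to contain them), and it assumes the 2021 datasheet is not third-party verified. The GWP figures are left out because the notebook does not state the 2023 value.

```python
from dataclasses import dataclass


@dataclass
class SourceClaim:
    document: str
    year: int
    verified: bool  # True if third-party verified (hypothetical metadata field)


def rank_claims(claims: list[SourceClaim]) -> list[SourceClaim]:
    """Order claims so verified sources come first, then newer before older."""
    return sorted(claims, key=lambda c: (c.verified, c.year), reverse=True)


claims = [
    SourceClaim(document="Relicyc datasheet", year=2021, verified=False),
    SourceClaim(document="Relicyc EPD", year=2023, verified=True),
]
preferred, *others = rank_claims(claims)
print(f"Preferred: {preferred.document} ({preferred.year})")
print(f"Also report as conflicting: {[c.document for c in others]}")
```

Ranking alone does not flag the conflict; the lower-ranked sources still need to be surfaced in the answer so the user sees that the figures disagree.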

In [13]:
answer_conflict = await ask(
    agent, "What is the GWP of the Logypal 1 pallet, and how reliable is the figure?"
)
print(answer_conflict)

2026-02-20 16:06:51.244 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:ask:323 - Query: 'What is the GWP of the Logypal 1 pallet, and how reliable is the figure?'
2026-02-20 16:06:51.597 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-20 16:07:27.854 | DEBUG    | conversational_toolkit.llms.ollama:generate_stream:116 - model='mistral-nemo:12b' created_at='2026-02-20T15:07:27.83971Z' done=False done_reason=None total_duration=None load_duration=None prompt_eval_count=None prompt_eval_duration=None eval_count=None eval_duration=None message=Message(role='assistant', content='Based', thinking=None, images=None, tool_name=None, tool_calls=None) logprobs=None
2026-02-20 16:07:27.926 |

Based on the provided source, here's a summary of how to calculate and report greenhouse gas (GHG) inventory results for a product:

**Steps to Calculate Total Inventory Results:**

1. **Calculate Emissions and Removals per GHG:**
   - Sum emissions and removals for each GHG (CO2, CH4, N2O).
   - Use the global warming potential (GWP) factors to convert them into CO2 equivalents.

2. **Sum Emissions and Removals on Reference Flow Basis:**
   - Ensure all results are on the same reference flow basis.
   - Sum emissions and removals per GHG to get total CO2e emissions and removals per reference flow.

3. **Include Land-Use Change Impacts:**
   - If applicable, include land-use change impacts in the total inventory results.

4. **Calculate Total Inventory Results (CO2e/Unit of Analysis):**
   - Sum emissions and removals on the reference flow basis to get the total CO2e per unit of analysis.

**Reporting Inventory Results:**

- Report total inventory results as the sum of biogenic emissio

### 4d: Unverified Supplier Claim (Tesa ECO Tape)

The tesa supplier brochure claims **68% CO₂ reduction** compared to conventional tape. This is a self-declared marketing claim — there is no independent EPD. The RAG should report the claim but flag that it is unverified.

In [19]:
answer_claim = await ask(
    agent,
    "How much lower is the carbon footprint of tesa ECO tape compared to standard tape?",
)
print(answer_claim)

2026-02-20 17:05:18.937 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:ask:323 - Query: 'How much lower is the carbon footprint of tesa ECO tape compared to standard tape?'
2026-02-20 17:05:19.307 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-20 17:05:53.184 | DEBUG    | conversational_toolkit.llms.ollama:generate_stream:116 - model='mistral-nemo:12b' created_at='2026-02-20T16:05:53.176204Z' done=False done_reason=None total_duration=None load_duration=None prompt_eval_count=None prompt_eval_duration=None eval_count=None eval_duration=None message=Message(role='assistant', content='Here', thinking=None, images=None, tool_name=None, tool_calls=None) logprobs=None
2026-02-20 17:05:53.259

Here are the environmental impact indicators from the provided sources, separated into categories:

**1. Greenhouse Gas Emissions (GWP-GHG):**
- Upstream: 2.58E+00 kg CO2 eq.
- Core: 5.09E+00 kg CO2 eq.
- Downstream: 1.33E+00 kg CO2 eq.
- Total: 9.00E+00 kg CO2 eq.

**2. Global Warming Potential (GWP):**
- Upstream:
  - GHG: 4.57E-01 kg CO2 eq.
  - Non-GHG: 3.63E-01 kg CO2 eq.
- Core:
  - GHG: 8.94E+00 kg CO2 eq.
  - Non-GHG: 2.17E+00 kg CO2 eq.
- Downstream:
  - GHG: 2.56E+00 kg CO2 eq.
  - Non-GHG: 3.83E-01 kg CO2 eq.
- Total:
  - GHG: 9.47E+00 kg CO2 eq.
  - Non-GHG: 2.75E+00 kg CO2 eq.

**3. Particulate Matter (PM):**
- Disease incidences: 1.41E-07

**4. Ionising Radiation:**
- Human health: 2.38E+00 kBq U235 eq.

**5. Ecotoxicity Fresh Water (EFW):**
- CTUe: 4.67E+01

**6. Human Toxicity:**
- Cancer (HTC): 1.17E-09 CTUh
- Non-cancer (HTNC): 5.66E-08 CTUh

**7. Land Use:**
- Pt: 3.67E+01

**8. Waste Disposal:**
- Hazardous waste disposed: 8.84E-09 kg
- Non-hazardous waste disposed:

> Can you think of and find **other failure modes**?

---

## 5. Multi-Turn Conversation

The `ask()` function accepts a `history` argument, a list of prior `LLMMessage` objects. When history is provided the agent first **rewrites the query** to be self-contained (*"it"* becomes the actual product name) before retrieval.

This prevents the retriever from embedding vague pronouns that match nothing in the corpus.
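
To make the rewriting step concrete, here is a minimal sketch of how a rewrite prompt might be assembled from the conversation history. The `Message` dataclass and `build_rewrite_prompt` are hypothetical stand-ins and are not part of `conversational_toolkit`. The prompt wording and the message types the agent actually uses may differ.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for the toolkit's LLMMessage."""

    role: str  # "user" or "assistant"
    content: str


def build_rewrite_prompt(history: list[Message], query: str, max_turns: int = 4) -> str:
    """Assemble a prompt asking an LLM to make `query` self-contained.

    Only the last `max_turns` messages are included to keep the prompt short.
    With an empty history the query is returned unchanged, because there is
    nothing to resolve pronouns against.
    """
    if not history:
        return query
    transcript = "\n".join(f"{m.role}: {m.content}" for m in history[-max_turns:])
    return (
        "Rewrite the final question so it can be understood without the "
        "conversation. Replace pronouns with the entities they refer to.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final question: {query}\n"
        "Rewritten question:"
    )


# Placeholder history; the assistant reply is deliberately left abstract.
history = [
    Message("user", "Which pallets in our portfolio have a third-party verified EPD?"),
    Message("assistant", "<assistant reply from turn 1>"),
]
prompt = build_rewrite_prompt(history, "What GWP does it report?")
print(prompt)
```

The rewritten question returned by the LLM, not the raw follow-up, would then be embedded and sent to the retriever.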

In [None]:
from conversational_toolkit.llms.base import LLMMessage, Roles

history: list[LLMMessage] = []


async def conversation_turn(query: str) -> str:
    """Send one user turn to the agent and record both sides in `history`."""
    # `history` is mutated in place, so no `global` declaration is needed.
    answer = await agent.answer(QueryWithContext(query=query, history=history))
    history.append(LLMMessage(role=Roles.USER, content=query))
    history.append(LLMMessage(role=Roles.ASSISTANT, content=answer.content))
    return answer.content


# Turn 1: ask a portfolio-level question
reply1 = await conversation_turn(
    "Which pallets in our portfolio have a third-party verified EPD?"
)
print("User: Which pallets in our portfolio have a third-party verified EPD?")
print(f"Assistant: {reply1}\n")

# Turn 2: follow-up using a pronoun — the agent should resolve "it" before retrieval
reply2 = await conversation_turn(
    "What is the GWP figure reported in it for the Logypal 1?"
)
print("User: What is the GWP figure reported in it for the Logypal 1?")
print(f"Assistant: {reply2}")

---

## 6. Running the Full Pipeline in One Call

The `run_pipeline()` convenience function executes all five steps end-to-end. It is also
what the `__main__` entry point calls.

Use it for quick one-shot queries. Use the individual step functions above when you need
to inspect intermediate results or iterate on a specific stage.

In [18]:
from sme_kt_zh_collaboration_rag.baseline_rag import run_pipeline

answer = await run_pipeline(
    backend=BACKEND,
    query="What sustainability certifications do the pallets in the portfolio have?",
    reset_vs=False,
)
print(answer)

2026-02-20 17:04:08.628 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:run_pipeline:353 - Starting Baseline RAG pipeline
2026-02-20 17:04:08.628 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:run_pipeline:354 - backend='ollama'  model=None  max_files=5  reset_vs=False  top_k=5
2026-02-20 17:04:08.631 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:178 - Chunking 5 files from /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/data


5


2026-02-20 17:04:17.095 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   1_Product-Life-Cycle-Accounting-Reporting-Standard_041613.pdf: 32 chunks
2026-02-20 17:04:17.942 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   2_EPD_pallet_CPR.pdf: 11 chunks
2026-02-20 17:04:19.326 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   3_EPD_pallet_relicyc.pdf: 17 chunks
2026-02-20 17:04:20.635 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   4_EPD_pallet_Stabilplastik.pdf: 2 chunks
2026-02-20 17:04:20.635 | DEBUG    | sme_kt_zh_collaboration_rag.baseline_rag:load_chunks:187 -   ART_customer_inquiry_f


Top-5 retrieved chunks (returned=5; showing a maximum of 1000 content characters):
  [1] score=0.8083  file='3_EPD_pallet_relicyc.pdf'  title='# 40 YEARS OF SUSTAINABLE INNOVATION'
       '# 40 YEARS OF SUSTAINABLE INNOVATION\n\nRelicyc has a long history in managing end-of-life plastic\n\nand wooden pallets: from recovery to reintroduction into the\n\nmarketplace, it gives the material a new lease on life. Over 40\n\nyears of experience has led the company to become a prominent\n\nplayer in the field and a partner that today’s environmental efficient customers can rely on.\n\nThe need for sustainability is what drives our model, whose focus\n\nis on re-using resources at the end of their life and routing them\n\nproperly for recycling so they can find new uses while bringing the\n\nbusinesses involved new value.'
  [2] score=0.8940  file='3_EPD_pallet_relicyc.pdf'  title='# CONTENT DECLARATION'
       '# CONTENT DECLARATION\n\nLogypal 1, classified as distribution packaging, is mainl

2026-02-20 17:04:47.123 | DEBUG    | conversational_toolkit.llms.ollama:generate_stream:116 - model='mistral-nemo:12b' created_at='2026-02-20T16:04:47.106784Z' done=False done_reason=None total_duration=None load_duration=None prompt_eval_count=None prompt_eval_duration=None eval_count=None eval_duration=None message=Message(role='assistant', content='Based', thinking=None, images=None, tool_name=None, tool_calls=None) logprobs=None
2026-02-20 17:04:47.173 | DEBUG    | conversational_toolkit.llms.ollama:generate_stream:116 - model='mistral-nemo:12b' created_at='2026-02-20T16:04:47.17261Z' done=False done_reason=None total_duration=None load_duration=None prompt_eval_count=None prompt_eval_duration=None eval_count=None eval_duration=None message=Message(role='assistant', content=' on', thinking=None, images=None, tool_name=None, tool_calls=None) logprobs=None

Based on the provided sources, here are the sustainability certifications and recycling information for the pallets in the portfolio:

1. **Logypal 1**:
   - The Logypal 1 pallet is made from recycled plastic content of 100%, with recycled materials being post-consumer plastic waste (Source: a733b848-ef47-4690-bd0c-cf41a134abc3).
   - It has Kiwa certification for its recycled content (Accr. N.069B) (Source: a733b848-ef47-4690-bd0c-cf41a134abc3).

2. **Noè Pallet**:
   - The Noè pallet is made from secondary materials sourced from various recycling processes, including post-consumer beverage cartons and obsolete Noè pallets (Source: 1d2a1b35-67ae-4a76-ae63-f5557611d93d).
   - It is recyclable at the end of its life, easily washable and sanitizable, resistant to rust, and stable in determining its tare weight (Source: 1d2a1b35-67ae-4a76-ae63-f5557611d93d).

For other pallets mentioned but without specific recycling or certification details:

- Relicyc manages end-of-life plastic and woo

---

## 7. Switching LLM Backends

The pipeline abstracts the LLM behind a common interface. Only `build_llm()` needs to change.

| Backend | `BACKEND=` | Prerequisite |
|---|---|---|
| Ollama (local) | `"ollama"` | `ollama serve` + `ollama pull mistral-nemo:12b` |
| OpenAI | `"openai"` | `OPENAI_API_KEY` env variable |
| SDSC Qwen | `"qwen"` | `SDSC_QWEN3_32B_AWQ` env variable |

You can also override the model within a backend:

```python
llm = build_llm(backend="openai", model_name="gpt-4o")  # stronger model
llm = build_llm(backend="ollama", model_name="llama3.2")  # smaller local model
```

The RAG pipeline is **backend-agnostic**: the retrieval step is identical regardless of which LLM is used.
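
The idea of a common interface can be sketched with a `typing.Protocol`. Everything below (`ChatLLM`, `EchoLLM`, `UpperLLM`, `make_llm`, `answer_with_context`) is hypothetical and for illustration only. It is not how `build_llm()` is implemented in `baseline_rag.py`, and the toy backends do not call any real model.

```python
from typing import Callable, Protocol


class ChatLLM(Protocol):
    """Hypothetical minimal interface every backend must satisfy."""

    def generate(self, prompt: str) -> str: ...


class EchoLLM:
    """Toy backend: returns the prompt unchanged."""

    def generate(self, prompt: str) -> str:
        return prompt


class UpperLLM:
    """Toy backend: returns the prompt upper-cased."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


# Registry mapping a backend name to a factory; the names are placeholders.
_BACKENDS: dict[str, Callable[[], ChatLLM]] = {
    "echo": EchoLLM,
    "upper": UpperLLM,
}


def make_llm(backend: str) -> ChatLLM:
    """Single place where the backend choice is resolved."""
    try:
        return _BACKENDS[backend]()
    except KeyError:
        raise ValueError(f"Unknown backend: {backend!r}") from None


def answer_with_context(llm: ChatLLM, chunks: list[str], query: str) -> str:
    """Generation step: depends only on the ChatLLM interface, not the backend."""
    context = "\n".join(chunks)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")


chunks = ["chunk A", "chunk B"]  # retrieval output is the same for every backend
for name in ("echo", "upper"):
    print(name, "->", answer_with_context(make_llm(name), chunks, "example question"))
```

Because the generation step only sees the interface, swapping the backend changes the generated text but leaves the retrieved chunks untouched.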

In [None]:
# Test the OpenAI backend

QUERY = "What materials is the Lara pallet made from?"

llm_openai = build_llm(backend="openai", model_name="gpt-4o-mini")
agent_openai = build_agent(vector_store, embedding_model, llm_openai)
answer_openai = await ask(agent_openai, QUERY)

2026-02-20 16:11:56.724 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:build_llm:110 - LLM backend: OpenAI (gpt-4o-mini)
2026-02-20 16:11:56.749 | DEBUG    | conversational_toolkit.llms.openai:__init__:63 - OpenAI LLM loaded: gpt-4o-mini; temperature: 0.3; seed: 42; tools: None; tool_choice: None; response_format: {'type': 'text'}
2026-02-20 16:11:56.750 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:build_agent:306 - RAG agent ready (top_k=5  query_expansion=0)
2026-02-20 16:11:56.750 | INFO     | sme_kt_zh_collaboration_rag.baseline_rag:ask:323 - Query: 'What materials is the Lara pallet made from?'
2026-02-20 16:11:56.859 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76

---

## 8. Tasks

1. **Test set creation**: Go through the dataset and write questions together with the correct answers the RAG should give. Also include trick questions, such as questions asking for information that is not in the data or for which contradictory information exists.

2. **Retrieval inspection**: Call `inspect_retrieval()` with different queries and check which files are returned. What do the scores tell you about how well the corpus covers each topic?

3. **Top-k sensitivity**: Change `top_k` from 5 to 1. Does the answer to your questions change? What about 10? Is more always better?

4. **System prompt ablation**: In `baseline_rag.py`, locate `SYSTEM_PROMPT`. Change it, rebuild the agent, and re-run the Lara pallet query. Does the answer change?

5. **Query phrasing**: The embedding model is sensitive to wording. Try `"CO₂ footprint Logypal 1"`, `"carbon emissions recycled pallet"`, and `"GWP A1-A3 EPD pallet"`. Do the top-1 chunk and score differ?
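
For tasks 1 to 3, one possible way to organise a test set is a list of small records that pair each question with the expected behaviour and the files that should be retrieved. The `EvalCase` class, the `Expectation` enum, and `hit_at_k` are assumptions made for illustration. Expected answers are left empty here and need to be filled in from the corpus, and the ranking at the end is made up.

```python
from dataclasses import dataclass
from enum import Enum


class Expectation(Enum):
    ANSWER = "answer"      # a grounded answer exists in the corpus
    REFUSE = "refuse"      # the information is not in the corpus
    CONFLICT = "conflict"  # sources disagree and the answer should say so


@dataclass(frozen=True)
class EvalCase:
    question: str
    expectation: Expectation
    expected_answer: str = ""           # fill in by reading the source documents
    source_files: tuple[str, ...] = ()  # files that should be retrieved


def hit_at_k(retrieved_files: list[str], case: EvalCase, k: int) -> bool:
    """Return True if any expected source file is among the first k retrieved files."""
    return any(name in case.source_files for name in retrieved_files[:k])


test_set = [
    EvalCase(
        question="What is the GWP of the Logypal 1 pallet, and is the figure verified?",
        expectation=Expectation.ANSWER,
        source_files=("3_EPD_pallet_relicyc.pdf",),
    ),
    EvalCase(
        question="<question about a product that is not in the corpus>",
        expectation=Expectation.REFUSE,
    ),
]

# Made-up retrieval ranking, only to show how the check behaves as k grows.
ranking = ["2_EPD_pallet_CPR.pdf", "3_EPD_pallet_relicyc.pdf"]
for k in (1, 2):
    print(f"hit@{k}:", hit_at_k(ranking, test_set[0], k))
```

In the notebook, the list of retrieved file names could be built from `r.metadata.get("source_file")` on each retrieval result, as the scratch cell does.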

In [17]:
# Scratch cell — run your experiments here
async def quick_retrieve(query: str, top_k: int = 5):
    retriever = VectorStoreRetriever(embedding_model, vector_store, top_k=top_k)
    results = await retriever.retrieve(query)
    print(f"Query: {query!r}  (top_k={top_k})")
    for r in results:
        src = r.metadata.get("source_file", "?")
        print(f"  score={r.score:.4f}  {src}  {r.title!r}")


await quick_retrieve("PFAS-free tape declaration")

2026-02-20 17:04:03.251 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)


Query: 'PFAS-free tape declaration'  (top_k=5)
  score=1.2978  Article-Document-CST-Synthetic-Rubber.pdf  '## **1. PRODUCT AND COMPANY IDENTIFICATION**'
  score=1.3428  2_EPD_pallet_CPR.pdf  '#### 5. Content declaration'
  score=1.3798  3_EPD_pallet_relicyc.pdf  '# CONTENT DECLARATION'
  score=1.3867  2_EPD_pallet_CPR.pdf  '# Environmental Product Declaration'
  score=1.3924  Article-Document-CST-Synthetic-Rubber.pdf  '## **15. REGULATORY INFORMATION**'
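
The results above are printed in rank order with ascending scores, which suggests that `score` is a distance (lower means closer) rather than a similarity. The notebook does not state which metric the vector store uses, so treat this as an inference. As a hedged illustration: if the score were a squared Euclidean distance between unit-length embeddings, it would relate to cosine similarity through $d^2 = 2 - 2\cos\theta$. The sketch below checks that identity on random 384-dimensional vectors using only the standard library. All function names are hypothetical.

```python
import math
import random


def normalise(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity only for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def squared_l2_to_cosine(d2: float) -> float:
    """Invert d^2 = 2 - 2*cos for unit-length vectors (assumed metric)."""
    return 1.0 - d2 / 2.0


rng = random.Random(0)
a = normalise([rng.gauss(0, 1) for _ in range(384)])
b = normalise([rng.gauss(0, 1) for _ in range(384)])

# The identity holds up to floating-point error.
assert math.isclose(squared_l2_to_cosine(squared_l2(a, b)), cosine(a, b), abs_tol=1e-9)

# Under the same assumption, convert the top score printed above.
print(round(squared_l2_to_cosine(1.2978), 4))
```

Under that assumption the best match for the PFAS query would have a cosine similarity of about 0.35, a weak match that is consistent with the corpus not covering the topic well.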


---

## Summary

| Step | Function | 
|---|---|
| 1. Load & chunk | `load_chunks(max_files)` |
| 2. Embed & index | `build_vector_store(chunks, emb, reset)` |
| 3. Inspect retrieval | `inspect_retrieval(query, vs, emb)` | 
| 4. Build agent | `build_agent(vs, emb, llm, top_k)` |
| 5. Generate answer | `ask(agent, query, history)` |

### Three Core Failure Modes (Addressed in Later Feature Tracks)
- Wrong entity
- Missing data presented as fact
- Low recall

**Next: Feature Track 1.** Explore chunking strategies and see how chunk size affects retrieval quality.
