# Tutorial 1 — Basic RAG (Dense Retrieval Baseline)

Welcome!  This is the **starting point** of a five-part series that builds a
complete Retrieval-Augmented Generation (RAG) system step by step.
Every tutorial changes exactly **one thing** so you can see the impact clearly.

## Your Learning Roadmap

```
 Tutorial 1 ──► Tutorial 2 ──► Tutorial 3 ──► Tutorial 4 ──► Tutorial 5
 Basic RAG      Better         Add            Add Keyword    Benchmark
 (you are       Chunking       Reranking      Search         All Four
  here)         (T2)           (T3)           (T4)           (T5)
```

**What you will build in this tutorial:**

```mermaid
flowchart LR
    A[Documents] --> B[Fixed Chunking]
    B --> C[OpenAI Embeddings]
    C --> D[Chroma Vector Index]
    E[User Query] --> F[Query Embedding]
    F --> D
    D --> G[Top-k Chunks]
    G --> H[LLM Answer]
```

**By the end of this notebook you will understand:**
- What RAG is and *why* it exists
- What a token is and why documents must be chunked
- What an embedding vector is and how cosine similarity works
- How nearest-neighbor search finds the most relevant chunks
- How to evaluate a RAG pipeline with Recall, MRR, Groundedness, and Latency

**Prerequisites:** Python basics, an OpenAI API key (the setup cells read `OPENAI_API_KEY`), and curiosity about how AI systems work. No ML background needed.

Continuity note:
- Tutorial 2 keeps the same pipeline but changes **chunking**.
- Tutorial 3 keeps T2 and adds a **reranker**.
- Tutorial 4 keeps T3 and adds **keyword (BM25) retrieval**.
- Tutorial 5 benchmarks all four under identical conditions.


## What is RAG and Why Does It Exist?

### The Problem: LLMs Know a Lot, But Not Everything

**Large Language Models (LLMs)** — such as GPT-4 — are AI systems trained on enormous
amounts of text (books, websites, code).  They learn patterns in language so well that
they can answer questions, write essays, and summarise documents.

But they have three hard limitations:

| Limitation | Plain-English Meaning | Example |
|------------|----------------------|--------|
| **Knowledge cutoff** | Trained up to a fixed date; knows nothing newer | GPT-4 won't know about a policy updated last month |
| **Private data blindspot** | Has never seen your internal documents | Your employee handbook, contracts, wiki |
| **Hallucination** | Can invent plausible-sounding but wrong answers | Confidently states the wrong leave entitlement |

### The Solution: Give the LLM the Right Context at Query Time

**Retrieval-Augmented Generation (RAG)** addresses all three problems by:

1. **Storing your documents** in a searchable index.
2. **Finding the relevant passages** when a user asks a question.
3. **Injecting those passages** into the LLM's prompt so it answers from *your* data.

Think of it like an open-book exam:
```
Without RAG:  Student answers from memory → may be wrong or outdated
With RAG:     Student looks up the answer in the textbook → grounded in fact
```

### The RAG Pipeline in Plain English

```
INDEXING (done once, ahead of time)
────────────────────────────────────────────────────────
  1. Take your documents (e.g. handbook_manual.txt)
  2. Split into small passages called chunks
  3. Convert each chunk to a numeric vector (embedding)
  4. Store those vectors in a vector database (Chroma)

QUERYING (done every time a user asks a question)
────────────────────────────────────────────────────────
  5. Convert the user's question to a vector
  6. Find the chunks whose vectors are closest to the question vector
  7. Build a prompt: "Answer this question using these passages: ..."
  8. Send prompt to the LLM → get a grounded answer
```
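
The eight steps above can be sketched end to end in a few lines. This is a toy illustration, not the tutorial's pipeline: `embed` is a hypothetical stand-in that counts words instead of calling an embedding model, the index is a plain Python list rather than Chroma, the three chunks are made-up sample text, and step 8 (the LLM call) is left out, so the sketch stops at the finished prompt.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    words = (w.strip(".,?!:;\"'") for w in text.lower().split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# INDEXING (steps 1-4): the chunks are given; embed each one and store it.
chunks = [
    "Employees may work remotely from home or co-working spaces.",
    "Annual leave requests need manager approval.",
    "Working from another country is capped at 14 days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question: str, top_k: int = 2) -> list[str]:
    """QUERYING (steps 5-6): embed the question, rank chunks by similarity."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Step 7: inject the retrieved passages into the prompt."""
    context = "\n".join(f"[Chunk {i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer this question using these passages:\n"
        f"{context}\n\nQuestion: {question}"
    )


question = "Can I work from another country?"
passages = retrieve(question)
prompt = build_prompt(question, passages)
print(prompt)
```

Because the stand-in matches exact words only, it would miss paraphrases. A real embedding model is what lets retrieval match on meaning.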

> **Why does this work?**  If two pieces of text mean similar things, their vectors
> point in similar directions.  So "What is the remote work policy?" and
> "Employees may work remotely if..." produce vectors that are close together,
> even though they share few exact words.  This is what makes embeddings useful for search.


## Learning checkpoint: what works vs what breaks

**What works in Tutorial 1**
- Dense retrieval can find generally related handbook content.
- End-to-end RAG flow is functional (ingest → chunk → embed → retrieve → answer).

**Challenges you should observe**
- Query intent can be too broad for nearest-neighbor retrieval.
- Exception-heavy policy questions may return partially relevant chunks.
- Exact policy identifiers (like forms/codes) are often weakly handled.

**Why move to Tutorial 2**
- The first bottleneck is chunk quality.
- We next improve *how text is split* so policy context stays intact before retrieval.

In [26]:
# 1) Set Up Environment and Dependencies

import importlib
import importlib.util  # find_spec lives in the importlib.util submodule
import os
import shutil
import subprocess
import sys
from pathlib import Path

# Ensure uv is available (installs with: pip install uv)
if shutil.which("uv") is None:
    print("uv not found. Installing with pip...")
    subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)

# Ensure notebook runs from repo root and local src/ is importable
cwd = Path.cwd().resolve()
repo_root = next(
    (path for path in [cwd, *cwd.parents] if (path / "pyproject.toml").exists() and (path / "src").exists()),
    cwd,
)
os.chdir(repo_root)
src_path = repo_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

REQUIRED_PACKAGES = [
    "openai",
    "chromadb",
    "numpy",
    "pandas",
    "rank_bm25",
    "sentence_transformers",
    "dotenv",
]

PIP_NAME_MAP = {
    "rank_bm25": "rank-bm25",
    "sentence_transformers": "sentence-transformers",
    "dotenv": "python-dotenv",
}


def find_missing(packages: list[str]) -> list[str]:
    """Return package import names not available in current kernel."""
    importlib.invalidate_caches()
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]


missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", missing)
    print("Running: uv sync")
    subprocess.run(["uv", "sync"], check=True)

missing_after_sync = find_missing(REQUIRED_PACKAGES)
if missing_after_sync:
    print("Still missing in active kernel after uv sync:", missing_after_sync)
    pip_targets = [PIP_NAME_MAP.get(pkg, pkg) for pkg in missing_after_sync]
    print("Installing into current kernel with pip:", pip_targets)
    subprocess.run([sys.executable, "-m", "pip", "install", *pip_targets], check=True)

final_missing = find_missing(REQUIRED_PACKAGES)
if final_missing:
    raise ImportError(f"Dependencies still missing in current kernel: {final_missing}")

print("All required packages are available.")
print("Python:", sys.version.split()[0])
print("Working directory:", Path.cwd())
print("Repo root:", repo_root)
print("Using src path:", src_path)

All required packages are available.
Python: 3.11.13
Working directory: /Users/avy/GitHubProjects/allagents/all-things-rag
Repo root: /Users/avy/GitHubProjects/allagents/all-things-rag
Using src path: /Users/avy/GitHubProjects/allagents/all-things-rag/src


In [27]:
# 2) Define Configuration and Paths

from dataclasses import dataclass
from dotenv import load_dotenv

load_dotenv()

@dataclass
class Config:
    embedding_model: str = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
    chat_model: str = os.getenv("OPENAI_CHAT_MODEL", "gpt-4.1-mini")
    chunk_mode: str = "fixed"
    top_k: int = 5
    sample_eval_size: int = 20
    handbook_path: str = "data/handbook_manual.txt"
    queries_path: str = "data/queries.jsonl"

cfg = Config()

if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set. Copy .env.example to .env and set your key.")

cfg

Config(embedding_model='text-embedding-3-small', chat_model='gpt-4.1-mini', chunk_mode='fixed', top_k=5, sample_eval_size=20, handbook_path='data/handbook_manual.txt', queries_path='data/queries.jsonl')

In [28]:
# 3) Load and Normalize Source Documents (shared handbook text + query set)

from rag_tutorials.io_utils import load_handbook_documents, load_queries

if not Path(cfg.handbook_path).exists() or not Path(cfg.queries_path).exists():
    raise FileNotFoundError(
        "Shared data files are missing. Run: uv run python scripts/generate_data.py"
    )

documents = load_handbook_documents(cfg.handbook_path)
queries = load_queries(cfg.queries_path)

print("Source text:", cfg.handbook_path)
print("Parsed handbook sections:", len(documents))
print("Queries:", len(queries))
print("Sample parsed document:", documents[0])

Source text: data/handbook_manual.txt
Parsed handbook sections: 5
Queries: 200
Sample parsed document: Document(doc_id='DOC-HB-REMOTEWORK', title='Z-Tech Handbook - Remote Work', section='Remote Work', text='Z-Tech encourages remote work from home, co-working spaces, or temporary domestic locations. Employees must stay reachable during assigned timezone hours and use approved managed devices. Public Wi-Fi usage is allowed only with corporate VPN enabled. Employees are responsible for confirming local workspace privacy when joining meetings that include customer data or personnel topics. Calendar availability must reflect working blocks, breaks, and approved out-of-office windows so cross-functional teams can plan handoffs. Managers may define team-specific overlap hours when projects involve coordination across offices in different time zones. Home-office expenses are reimbursable only for pre-approved categories listed in the internal procurement guide. Employees should review ergonom

### Why Do We Need to Chunk Documents?

Before we can embed and search our documents, we must split them into smaller pieces
called **chunks**.  Here is why — and what a token is.

#### What Is a Token?

A **token** is the basic unit of text that an LLM reads.  In English, one token is
*roughly* three-quarters of a word.  The exact split depends on the model's tokenizer,
so treat the examples below as illustrative.

```
"Hello, world!"      →  4 tokens   ["Hello", ",", " world", "!"]
"international"      →  3 tokens   ["intern", "ation", "al"]
"I like dogs."       →  4 tokens   ["I", " like", " dogs", "."]
```

A page of text is on the order of 500 tokens.  LLM **context windows** typically range
from about 4,000 to 128,000 tokens or more: the maximum amount of text a model can read *in one go*.
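
The rule of thumb above is enough for a quick budget check. The sketch below is a rough estimate only: `estimate_tokens` and `fits_context` are hypothetical helpers, the 0.75 words-per-token ratio is the approximation from the text, and the 500-token answer reserve is an arbitrary assumption. For exact counts you would use the model's own tokenizer (for OpenAI models, a library such as `tiktoken`).

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on the whitespace-separated word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_context(text: str, context_window: int, reserved_for_answer: int = 500) -> bool:
    """Check whether the text plus room for the answer fits in the window."""
    return estimate_tokens(text) + reserved_for_answer <= context_window


sample = "Employees must stay reachable during assigned timezone hours."
print("Estimated tokens:", estimate_tokens(sample))

long_document = "word " * 3000  # about 4,000 estimated tokens
print("Fits a 4,000-token window:  ", fits_context(long_document, 4_000))
print("Fits a 128,000-token window:", fits_context(long_document, 128_000))
```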

#### Three Reasons We Chunk

1. **Context window limits** — LLMs can only read a fixed number of tokens at once.
   Feeding an entire document would overflow the limit and increase cost significantly.

2. **Retrieval precision** — A chunk captures one *specific idea*.
   If we embedded entire sections, the resulting vector would average out many ideas,
   making it harder to find the exact passage that answers a question.

   ```
   Section vector  ≈ average of (remote work + leave + expenses + ...)
   Chunk vector    ≈ just remote work policy
   → Query "remote work" matches the chunk much better
   ```

3. **Cost** — Embedding and searching smaller units is cheaper and faster.

#### Fixed vs Semantic Chunking (preview)

| Strategy | How it splits | Effect |
|----------|--------------|--------|
| **Fixed** (this tutorial) | Every N characters, regardless of meaning | Can split a sentence, or even a word, mid-thought |
| **Semantic** (Tutorial 2) | At natural sentence/topic boundaries | Preserves meaning, which tends to improve retrieval |

```
Original text:
  "...Employees may work remotely for up to 90 days per year.
   A manager approval is required before the period starts..."

Fixed chunking (260 characters) might split here:
  Chunk A: "...Employees may work remotely for up to 90 days per year.
            A manager approval is"             ← sentence cut off!
  Chunk B: "required before the period starts..."

Semantic chunking keeps the whole rule together in one chunk — Tutorial 2 shows why this matters.
```
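
The difference between the two strategies can be sketched in plain Python. These are hypothetical helpers, not the repo's `fixed_chunk_documents` or `semantic_chunk_documents`: `fixed_chunks` cuts every `chunk_size` characters, and `sentence_chunks` is a simple sentence-packing stand-in for semantic chunking, which may differ from what Tutorial 2 implements.

```python
import re


def fixed_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut every chunk_size characters, ignoring sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of up to max_chars characters.

    A single sentence longer than max_chars is kept whole, not cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Employees may work remotely for up to 90 days per year. "
    "A manager approval is required before the period starts."
)

fixed = fixed_chunks(text, chunk_size=80)
by_sentence = sentence_chunks(text, max_chars=80)

print("Fixed chunks:")
for chunk in fixed:
    print("  ", repr(chunk))
print("Sentence chunks:")
for chunk in by_sentence:
    print("  ", repr(chunk))
```

With an 80-character budget, the fixed splitter cuts the second sentence in the middle of a word, while the sentence-based splitter keeps each rule intact.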


In [35]:
# 4) Split Documents into Chunks (fixed chunking baseline)

from dataclasses import asdict
from rag_tutorials.chunking import fixed_chunk_documents
import pandas as pd

chunks = fixed_chunk_documents(documents, chunk_size=260)

chunk_df = pd.DataFrame([asdict(c) for c in chunks])
stats = {
    "chunk_count": len(chunk_df),
    "avg_chunk_chars": chunk_df.text.map(len).mean(),
    "max_chunk_chars": chunk_df.text.map(len).max(),
}
#print(stats)
print(chunk_df)

                           chunk_id                    doc_id  \
0          DOC-HB-REMOTEWORK-FIX-00         DOC-HB-REMOTEWORK   
1          DOC-HB-REMOTEWORK-FIX-01         DOC-HB-REMOTEWORK   
2          DOC-HB-REMOTEWORK-FIX-02         DOC-HB-REMOTEWORK   
3          DOC-HB-REMOTEWORK-FIX-03         DOC-HB-REMOTEWORK   
4          DOC-HB-REMOTEWORK-FIX-04         DOC-HB-REMOTEWORK   
5   DOC-HB-INTERNATIONALWORK-FIX-00  DOC-HB-INTERNATIONALWORK   
6   DOC-HB-INTERNATIONALWORK-FIX-01  DOC-HB-INTERNATIONALWORK   
7   DOC-HB-INTERNATIONALWORK-FIX-02  DOC-HB-INTERNATIONALWORK   
8   DOC-HB-INTERNATIONALWORK-FIX-03  DOC-HB-INTERNATIONALWORK   
9   DOC-HB-INTERNATIONALWORK-FIX-04  DOC-HB-INTERNATIONALWORK   
10   DOC-HB-INTERNATIONALTAX-FIX-00   DOC-HB-INTERNATIONALTAX   
11   DOC-HB-INTERNATIONALTAX-FIX-01   DOC-HB-INTERNATIONALTAX   
12   DOC-HB-INTERNATIONALTAX-FIX-02   DOC-HB-INTERNATIONALTAX   
13   DOC-HB-INTERNATIONALTAX-FIX-03   DOC-HB-INTERNATIONALTAX   
14   DOC-HB-INTERNATIONAL

In [30]:
# Chunk boundary visualization (same source text, different split strategies)

from rag_tutorials.chunking import semantic_chunk_documents

section_doc = next(doc for doc in documents if doc.section == "International Work")
fixed_view = [c.text for c in fixed_chunk_documents([section_doc], chunk_size=120)]
semantic_view = [c.text for c in semantic_chunk_documents([section_doc])]

print("Section:", section_doc.section)
print("\nFixed chunks:")
for idx, chunk_text in enumerate(fixed_view, start=1):
    print(f"[{idx}] {chunk_text}")

print("\nSemantic chunks:")
for idx, chunk_text in enumerate(semantic_view, start=1):
    print(f"[{idx}] {chunk_text}")

Section: International Work

Fixed chunks:
[1] Working from another country is capped at 14 days in a rolling 12-month period without permit support. Beyond 14 days, e
[2] mployees must open a Global Mobility case and obtain HR, Legal, and Payroll approval. Violations can trigger immigration
[3] , payroll, and tax exposure. Employees must submit destination country, travel dates, host entity, and work purpose when
[4]  opening the Global Mobility case. Approval decisions depend on role type, customer access level, and whether on-site ac
[5] tivities include contract negotiation. Some countries require pre-travel right-to-work checks even for short stays under
[6]  the 14-day cap. International work days are counted using local calendar dates at destination, not departure timezone t
[7] imestamps. Repeated short trips to the same country can accumulate toward compliance thresholds and trigger additional r
[8] eview. Employees are responsible for carrying supporting approval documents wh

## Novice Lens: How Embeddings and Retrieval Actually Work

This section slows down and walks through every step with concrete numbers.
If you are new to machine learning, read this before running the retrieval code.

### The Big Picture (sequence diagram)

```mermaid
sequenceDiagram
    participant U as User Query
    participant E as Embedding Model
    participant V as Vector Store
    participant L as LLM
    U->>E: "working from another country"
    Note over E: Converts text to a list of 1536 numbers
    E->>V: query vector [0.12, -0.34, 0.87, ...]
    Note over V: Compares query vector to all stored chunk vectors
    V-->>U: top-k chunks + similarity scores
    U->>L: question + retrieved chunks (as context)
    Note over L: Reads context, generates grounded answer
    L-->>U: "Working from another country is capped at 14 days..."
```

We will inspect each of these steps with real numbers below.


### What Is an Embedding Vector?

An **embedding** is a list of floating-point numbers that represents the *meaning*
of a piece of text.  It is produced by a neural network (the "embedding model").

#### Analogy: colour as a 3-number vector

You already know one type of vector: an RGB colour.

```
Red:   [255,   0,   0]   ← 3 numbers
Blue:  [  0,   0, 255]
Pink:  [255, 150, 150]   ← close to Red, so a similar colour
```

A text embedding works the same way, but instead of 3 numbers describing colour,
we use **1536 numbers** describing meaning.  Words or sentences with similar meaning
produce vectors that are numerically close.

```
"remote work policy"      → [0.80, 0.20, 0.50, ...]   (1536 numbers)
"working from abroad"     → [0.85, 0.15, 0.45, ...]   ← very similar!
"annual leave rules"      → [0.10, 0.90, 0.30, ...]   ← very different
```

#### Why 1536 dimensions?

More dimensions = more capacity to encode nuance.  The size is a design choice made by
the model's developers: `text-embedding-3-small` returns 1536 numbers per text by default.
The API returns all 1536 values, but individual dimensions have no human-readable meaning;
only distances and angles between whole vectors are informative.

---

### How Do We Measure Similarity? — Cosine Similarity

Once we have vectors, we need a way to ask: *how similar are two vectors?*

We use **cosine similarity** — it measures the angle between two vectors.  Think of
each vector as an arrow pointing from the origin in high-dimensional space:

```
Two arrows pointing in the same direction → angle ≈ 0° → cosine ≈ 1.0  (very similar)
Two arrows perpendicular                  → angle = 90° → cosine = 0.0  (unrelated)
Two arrows pointing opposite              → angle = 180° → cosine = -1.0 (opposite)
```

#### The Formula (and what each part means)

$$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \cdot \|\mathbf{B}\|}$$

| Symbol | Name | Plain English |
|--------|------|---------------|
| $\mathbf{A} \cdot \mathbf{B}$ | **Dot product** | Multiply each pair of matching numbers, then add them all up |
| $\|\mathbf{A}\|$ | **Norm** (magnitude) | The "length" of vector A — square root of sum of squares |
| Dividing by norms | **Normalisation** | Removes the effect of vector length so only direction matters |
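
The normalisation row can be checked numerically. The sketch below uses hypothetical helper names (`dot`, `norm`, `cosine`) and plain Python lists: stretching a vector to ten times its length changes the dot product but leaves the cosine similarity unchanged, because only direction matters.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Multiply matching entries and add the products."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    """Length of the vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in a))


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two norms."""
    return dot(a, b) / (norm(a) * norm(b))


a = [0.80, 0.20, 0.50]
b = [0.85, 0.15, 0.45]
b_scaled = [10 * x for x in b]  # same direction as b, ten times longer

print("dot(a, b)           =", round(dot(a, b), 4))
print("dot(a, b_scaled)    =", round(dot(a, b_scaled), 4))
print("cosine(a, b)        =", round(cosine(a, b), 4))
print("cosine(a, b_scaled) =", round(cosine(a, b_scaled), 4))
```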

**Dot product example (3-dim):**
```
A = [0.80, 0.20, 0.50]
B = [0.85, 0.15, 0.45]

A·B = (0.80×0.85) + (0.20×0.15) + (0.50×0.45)
    =   0.680     +   0.030     +   0.225
    =   0.935
```

**Norm example:**
```
||A|| = sqrt(0.80² + 0.20² + 0.50²)
      = sqrt(0.640 + 0.040 + 0.250)
      = sqrt(0.930)
      = 0.964
```
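
To finish the worked example, the same steps give the norm of B and then the cosine score. Intermediate values are rounded to four decimal places, so the last digit is approximate.

$$\|\mathbf{B}\| = \sqrt{0.85^2 + 0.15^2 + 0.45^2} = \sqrt{0.7225 + 0.0225 + 0.2025} = \sqrt{0.9475} \approx 0.9734$$

$$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \cdot \|\mathbf{B}\|} \approx \frac{0.935}{0.9644 \times 0.9734} \approx 0.996$$

A score this close to 1.0 says the two toy vectors point in almost the same direction.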

| Cosine Score | Meaning |
|:---:|---|
| **1.0** | Identical direction — same meaning |
| **0.8–0.99** | Very similar |
| **0.5–0.79** | Moderately similar |
| **0.0–0.49** | Low similarity |
| **< 0** | Opposite meaning (rare for text) |

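These bands are a rough guide: real embedding models often give modest absolute scores even
for relevant text, so compare scores against each other rather than against fixed cut-offs.
The code cell below walks through every arithmetic step with toy vectors.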


In [None]:
# Vector and cosine similarity walkthrough — toy 3-dimensional example
# (Real OpenAI embeddings use 1536 dims; the math is identical)

import numpy as np

# Toy vectors representing meaning in 3-dimensional space
vec_remote_work   = np.array([0.80, 0.20, 0.50])  # 'remote work policy'
vec_leave_policy  = np.array([0.10, 0.90, 0.30])  # 'annual leave rules'
vec_international_transfer = np.array([0.75, 0.25, 0.55])  # 'international work transfer'

query_vec = np.array([0.85, 0.15, 0.45])          # query: 'working from abroad'

print("Query vector:          ", query_vec)
print("'remote work' vector:  ", vec_remote_work)
print("'leave policy' vector: ", vec_leave_policy)
print("'international transfer' vector:", vec_international_transfer)
print()

# ---- Step-by-step cosine similarity: query vs 'remote work' ----
dot_product    = np.dot(query_vec, vec_remote_work)
norm_query     = np.linalg.norm(query_vec)
norm_remote    = np.linalg.norm(vec_remote_work)
cosine_score   = dot_product / (norm_query * norm_remote)

print("=== Query vs 'remote work' ===")
print(f"  dot product          : {dot_product:.4f}")
print(f"  ||query||            : {norm_query:.4f}")
print(f"  ||remote work||      : {norm_remote:.4f}")
print(f"  cosine similarity    : {cosine_score:.4f}")
print()

# ---- Compare all three candidates at once using the shared helper ----
from rag_tutorials.embeddings import cosine_similarity

candidates = np.stack([vec_remote_work, vec_leave_policy, vec_international_transfer])
labels     = ["remote work policy", "leave policy", "international transfer"]
scores     = cosine_similarity(query_vec, candidates)

print("Cosine scores for query 'working from abroad':")
for label, score in sorted(zip(labels, scores), key=lambda x: -x[1]):
    bar = "█" * int(score * 20)
    print(f"  {label:<22} {score:.4f}  {bar}")
print()
print("Highest score → retrieved first.  Lowest score → may not make top-k.")

### What Is a Vector Store (Database) and Why Do We Need One?

#### The Naive Approach: Compare the Query Against Every Chunk

After embedding all chunks, we have a table like this:

```
chunk_id  text                              vector (1536 numbers)
────────  ────────────────────────────────  ──────────────────────
  0       "Remote work: up to 90 days..."   [0.80, 0.20, 0.50, ...]
  1       "Annual leave entitlement..."     [0.10, 0.90, 0.30, ...]
  2       "International transfer rules"    [0.75, 0.25, 0.55, ...]
  ...     ...                               ...
```

To find the most relevant chunks for a query, we *could* compute cosine similarity
between the query vector and every single row.  This is called **brute-force search**.

```
query_vec vs chunk_0  →  0.97  (retrieved)
query_vec vs chunk_1  →  0.41
query_vec vs chunk_2  →  0.89  (retrieved)
...repeat for all N chunks...
→ sort → return top-k
```

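This works fine at small scale.  The cost grows linearly with the number of chunks, so at
large scale it becomes too slow (the timings below are illustrative orders of magnitude, not measurements):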

| Number of chunks | Brute-force time (approx.) |
|-----------------|---------------------------|
| 100 | < 1 ms |
| 10,000 | ~10 ms |
| 1,000,000 | ~1 second |
| 100,000,000 | ~100 seconds (unacceptably slow) |

#### The Solution: An Index (ANN — Approximate Nearest-Neighbor)

A **vector store** (like **Chroma**, used in this tutorial) builds an *index* that
pre-organises the vectors so most comparisons can be skipped.

Chroma uses **HNSW** (Hierarchical Navigable Small World) — think of it as a map
with highways and local roads:

```
HNSW is a graph where:
  • Each chunk vector is a node
  • Nearby vectors are connected by edges
  • At query time, the search "hops" along edges toward the query vector
    instead of visiting every node

Result: finds approximate nearest neighbors in roughly O(log N) time instead of O(N)
```

#### What Chroma Stores for Each Chunk

| Stored item | Purpose |
|-------------|--------|
| Embedding vector | Used for similarity search |
| Raw text | Returned in results so the LLM can read it |
| Metadata (`doc_id`, `section`) | Allows filtering (e.g. "only search section X") |

**At query time:**
1. The query is embedded → `query_vec`
2. Chroma runs ANN search → finds `top_k` chunk vectors nearest to `query_vec`
3. Returns the corresponding chunk texts + similarity scores
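
To make the three stored items concrete, here is a minimal in-memory sketch. `ToyVectorStore` is a hypothetical class, not Chroma's API: it keeps vector, text, and metadata together, supports a simple metadata filter, and ranks by brute-force cosine similarity with no ANN index. The section names in the metadata are made up for the example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    vector: list[float]                      # used for similarity search
    text: str                                # returned so the LLM can read it
    metadata: dict[str, str] = field(default_factory=dict)  # used for filtering


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Brute-force stand-in for a vector store (no index, exact search)."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, vector, text, metadata=None) -> None:
        self.records.append(Record(list(vector), text, dict(metadata or {})))

    def query(self, query_vec, top_k=3, where=None):
        """Return up to top_k (score, text) pairs, highest score first."""
        candidates = [
            r for r in self.records
            if where is None
            or all(r.metadata.get(key) == value for key, value in where.items())
        ]
        scored = [(cosine(query_vec, r.vector), r.text) for r in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add([0.80, 0.20, 0.50], "Remote work: up to 90 days...", {"section": "Remote Work"})
store.add([0.10, 0.90, 0.30], "Annual leave entitlement...", {"section": "Leave"})
store.add([0.75, 0.25, 0.55], "International transfer rules", {"section": "International Work"})

query_vec = [0.85, 0.15, 0.45]
hits = store.query(query_vec, top_k=2)
leave_only = store.query(query_vec, top_k=2, where={"section": "Leave"})

for score, text in hits:
    print(f"{score:.4f}  {text}")
```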


### How Nearest-Neighbor Search Returns Top-k Results

**The core problem:** you have a query vector and N chunk vectors in the store.
You want the k chunks whose vectors are *closest* to the query — the **k nearest neighbors**.

#### Step-by-step: what happens at query time

```
1. Embed the query           → query_vec  (1536 numbers)
2. For each chunk vector in the store
       score[i] = cosine_similarity(query_vec, chunk_vec[i])
3. Sort all scores descending
4. Return the top-k chunk texts (highest scores first)
```

The diagram below uses a tiny 4-chunk example to make every step concrete.

```
Query: 'working from abroad'

chunk_A  'remote work policy'            score: 0.97  ◀── rank 1  in top-3
chunk_B  'annual leave entitlement'      score: 0.41  ◀── rank 4  (not retrieved)
chunk_C  'international transfer rules'  score: 0.89  ◀── rank 2  in top-3
chunk_D  'parental leave procedures'     score: 0.55  ◀── rank 3  in top-3

top_k = 3  →  returned: [chunk_A, chunk_C, chunk_D]
```

#### Exact vs Approximate Nearest-Neighbor (ANN)

| Approach | How it works | When used |
|----------|--------------|-----------|
| **Exact (brute-force)** | Compare query against every vector | Small datasets |
| **Approximate (ANN)** | Build an index (e.g., HNSW graph) that skips most comparisons | Large datasets |

Chroma uses **HNSW** (Hierarchical Navigable Small World) by default — it builds a
graph of vectors where nearby vectors are connected.  At query time it traverses
the graph greedily, visiting only a small fraction of all vectors, yet finds the
nearest neighbors with high probability.

> **Key insight:** top-k is not a threshold; it is a *count*.  No matter how
> dissimilar the best chunk is, the system still returns k results (as long as the
> store holds at least k chunks).  A high cosine score suggests "very relevant"; a low
> score means "best we could find, but probably not very relevant".
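
Since top-k is only a count, one possible guard is to add a score floor on top of it. The sketch below is an assumption for illustration, not something this tutorial's pipeline does: `top_k_with_floor` is a hypothetical helper, the scores come from the 4-chunk example above, and the 0.6 floor is arbitrary.

```python
def top_k_with_floor(scored_chunks, k, min_score=None):
    """Return the k best (label, score) pairs, optionally dropping weak ones."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:k]
    if min_score is None:
        return ranked                      # plain top-k: a count, not a threshold
    return [pair for pair in ranked if pair[1] >= min_score]


scored = [
    ("remote work policy", 0.97),
    ("annual leave entitlement", 0.41),
    ("international transfer rules", 0.89),
    ("parental leave procedures", 0.55),
]

plain = top_k_with_floor(scored, k=3)
floored = top_k_with_floor(scored, k=3, min_score=0.6)

print("top-3:           ", [label for label, _ in plain])
print("top-3, floor 0.6:", [label for label, _ in floored])
```

Plain top-3 still returns the parental leave chunk even though its score is weak; the floor drops it.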


In [None]:
# Nearest-neighbor top-k walkthrough — toy example
# Shows exactly how the vector store picks which chunks to return.

import numpy as np
from rag_tutorials.embeddings import cosine_similarity

# ── 6 toy chunk vectors (3-dim for readability; real ones are 1536-dim) ──
chunk_vectors = np.array([
    [0.80, 0.20, 0.50],   # chunk 0: 'remote work policy'
    [0.10, 0.90, 0.30],   # chunk 1: 'annual leave entitlement'
    [0.75, 0.25, 0.55],   # chunk 2: 'international transfer rules'
    [0.15, 0.70, 0.40],   # chunk 3: 'parental leave procedures'
    [0.60, 0.35, 0.65],   # chunk 4: 'home-office equipment policy'
    [0.05, 0.95, 0.20],   # chunk 5: 'sick leave documentation'
])
chunk_labels = [
    "remote work policy",
    "annual leave entitlement",
    "international transfer rules",
    "parental leave procedures",
    "home-office equipment policy",
    "sick leave documentation",
]

query_vec = np.array([0.85, 0.15, 0.45])   # query: 'working from abroad'
TOP_K = 3

# ── Step 1: compute cosine similarity to every chunk ──
scores = cosine_similarity(query_vec, chunk_vectors)

# ── Step 2: rank by descending score ──
ranked_indices = np.argsort(scores)[::-1]

print("Query: 'working from abroad'")
print(f"\nAll {len(chunk_labels)} chunks ranked by cosine similarity:")
print(f"{'Rank':<5} {'Score':>6}  {'Chunk label'}")
print("-" * 50)
for rank, idx in enumerate(ranked_indices, 1):
    selected = " ◀ top-k" if rank <= TOP_K else ""
    bar = "█" * int(scores[idx] * 20)
    print(f"  {rank:<4} {scores[idx]:.4f}  {chunk_labels[idx]:<30} {bar}{selected}")

# ── Step 3: return top-k ──
top_k_indices = ranked_indices[:TOP_K]
print(f"\n→ top_k={TOP_K} chunks returned to the LLM:")
for i, idx in enumerate(top_k_indices, 1):
    print(f"  {i}. [{scores[idx]:.4f}] {chunk_labels[idx]}")

print("\n→ chunks NOT retrieved (ranked below the top-k cutoff):")
for idx in ranked_indices[TOP_K:]:
    print(f"  [ ] [{scores[idx]:.4f}] {chunk_labels[idx]}")


In [31]:
# 5) Create Embeddings and Build Vector Index
# Each chunk text is converted into a high-dimensional vector (1536 dims for text-embedding-3-small).
# These vectors are stored in Chroma so we can search by cosine similarity at query time.

from rag_tutorials.pipeline import build_dense_retriever
from rag_tutorials.embeddings import embed_texts, cosine_similarity
import numpy as np

dense_retriever, doc_vectors = build_dense_retriever(
    chunks=chunks,
    collection_name="tutorial1_basic_dense",
    embedding_model=cfg.embedding_model,
)

# doc_vectors shape: (num_chunks, embedding_dim)
# Each row is one chunk's vector; columns are learned numeric features.
print("Embedding matrix shape (chunks × dims):", doc_vectors.shape)
print("First chunk vector — first 10 of", doc_vectors.shape[1], "dimensions:")
print(" ", np.round(doc_vectors[0][:10], 4))
print("  (every dimension encodes a subtle aspect of meaning)")
print()

# --- Real-embedding cosine similarity trace using 3 actual chunks ---
# We embed the same query and three chunks with the real model so you can
# see that the pattern from the toy demo above holds for real vectors too.
sample_texts = [chunks[i].text for i in range(3)]
sample_vectors = embed_texts(sample_texts, model=cfg.embedding_model)
sample_query = "What is the policy for working from another country?"
sample_query_vector = embed_texts([sample_query], model=cfg.embedding_model)[0]

scores = cosine_similarity(sample_query_vector, sample_vectors)
print("Real cosine similarity scores (query vs first 3 chunks):")
for idx, (score, text) in enumerate(zip(scores, sample_texts), start=1):
    bar = "█" * int(score * 20)
    print(f"  Chunk {idx} score={score:.4f}  {bar}")
    print(f"    preview: {text[:80]}...")
print()
print("The full index contains", doc_vectors.shape[0], "chunks; Chroma searches them")
print("through its index and returns the top-k most similar chunks.")

Embedding matrix shape: (24, 1536)
Example vector (first 10 dims): [ 0.0294  0.0544  0.0442  0.0226  0.002  -0.0399  0.0043  0.0554  0.0151
  0.0031]
Toy chunk 1 cosine score: 0.4026
Toy chunk 2 cosine score: 0.2809
Toy chunk 3 cosine score: 0.3770


In [33]:
# 6) Implement Retriever Logic

import pandas as pd

def retrieve_dense(question: str, top_k: int = 5):
    return dense_retriever(question, top_k=top_k)

probe_query = "What is the policy for working from another country?"
probe_results = retrieve_dense(probe_query, top_k=cfg.top_k)

pd.DataFrame([
    {
        "rank": idx + 1,
        "chunk_id": row.chunk_id,
        "score": row.score,
        "source": row.source,
        "preview": row.text,
    }
    for idx, row in enumerate(probe_results)
])

| | rank | chunk_id | score | source | preview |
|---|---|---|---|---|---|
| 0 | 1 | DOC-HB-INTERNATIONALWORK-FIX-00 | 0.168518 | dense | Working from another country is capped at 14 d... |
| 1 | 2 | DOC-HB-INTERNATIONALWORK-FIX-02 | 0.084514 | dense | ome countries require pre-travel right-to-work... |
| 2 | 3 | DOC-HB-INTERNATIONALTAX-FIX-00 | -0.031337 | dense | Employees traveling internationally may need F... |
| 3 | 4 | DOC-HB-INTERNATIONALWORK-FIX-01 | -0.09547 | dense | xposure. Employees must submit destination cou... |
| 4 | 5 | DOC-HB-REMOTEWORK-FIX-03 | -0.113855 | dense | view ergonomic setup guidance quarterly and co... |


In [36]:
# 7) Implement Prompt Template and LLM Call
# RAG injects the retrieved chunks directly into the LLM prompt.
# We print the full prompt below so you can see exactly what the model receives.

from rag_tutorials.qa import answer_with_context, build_context

def rag_answer(question: str, top_k: int = 5):
    retrieved = retrieve_dense(question, top_k=top_k)
    context = [r.text for r in retrieved]
    answer = answer_with_context(question, context, model=cfg.chat_model)
    return answer, retrieved

# --- Show the actual prompt that is sent to the LLM ---
context_chunks = [r.text for r in probe_results]
context_block = build_context(context_chunks)
full_prompt = (
    "You are a policy assistant. Answer only from the provided context. "
    "If the answer is not present, say you do not have enough context.\n\n"
    f"Question: {probe_query}\n\n"
    f"Context:\n{context_block}\n\n"
    "Provide a concise answer and include a short citation like [Chunk 1]."
)
print("=" * 60)
print("FULL PROMPT SENT TO LLM:")
print("=" * 60)
print(full_prompt)
print("=" * 60)
print()
print("The LLM only sees the text above — it cannot look outside this prompt.")
print("This is what 'grounding' means: the answer must come from these chunks.")
print()

print("Probe question:", probe_query)
answer, retrieved = rag_answer(probe_query)
print("LLM answer:")
print(answer)


Probe question: What is the policy for working from another country?
Retrieved context: ['Working from another country is capped at 14 days in a rolling 12-month period without permit support. Beyond 14 days, employees must open a Global Mobility case and obtain HR, Legal, and Payroll approval. Violations can trigger immigration, payroll, and tax e', 'ome countries require pre-travel right-to-work checks even for short stays under the 14-day cap. International work days are counted using local calendar dates at destination, not departure timezone timestamps. Repeated short trips to the same country can accu', 'Employees traveling internationally may need Form A-12 before departure when cross-border work exceeds 7 business days. The tax team uses Form A-12 to assess treaty relief, withholding obligations, and permanent establishment risk. Form A-12 submissions should', 'xposure. Employees must submit destination country, travel dates, host entity, and work purpose when opening the Globa

### How to Read the Evaluation Metrics

After running the RAG pipeline on a set of test queries, we measure how well it
performed.  Four metrics are used throughout all five tutorials.

#### Recall@k — Did we retrieve the right chunk?

**Plain English:** Out of all test queries, what fraction of the time did the
correct source document appear *anywhere* in the top-k retrieved chunks?

```
Recall@5 = (queries where correct source was in top-5) / (total queries)

Example: 8 out of 10 queries had the correct source in their top 5 → Recall@5 = 0.80
```

| Value | Meaning |
|:-----:|---------|
| **1.0** | Perfect — the right chunk was always in the top-k |
| **0.8** | 80% of queries succeeded; 20% missed the right chunk |
| **0.5** | Only half the queries retrieved the right chunk |
| **0.0** | The retriever never found the right source |

> Note: Recall@k does **not** care about *where* in the top-k the correct chunk appears —
> just whether it's present at all.  MRR (below) measures the position.
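
The Recall@k formula can be sketched in a few lines of Python. This is a minimal illustration, not the code in `rag_tutorials.evaluation`: it assumes each query has exactly one expected source ID and that the retriever returns a ranked list of IDs. The function name `recall_at_k` and the sample IDs are hypothetical.

```python
from typing import Sequence


def recall_at_k(
    retrieved_ids: Sequence[Sequence[str]],
    expected_ids: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected ID appears anywhere in the top-k results."""
    if not expected_ids:
        return 0.0
    hits = sum(
        1
        for ranked, expected in zip(retrieved_ids, expected_ids)
        if expected in ranked[:k]  # position inside the top-k is ignored
    )
    return hits / len(expected_ids)


# Hypothetical ranked results for four queries (best match first).
retrieved = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["x", "y", "z"],
]
expected = ["b", "d", "q", "z"]  # "q" is never retrieved

print(recall_at_k(retrieved, expected, k=3))  # 0.75
print(recall_at_k(retrieved, expected, k=1))  # 0.25
```

Shrinking `k` from 3 to 1 drops the score because two of the hits were not ranked first, which is the positional information Recall@k throws away.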

#### MRR — Mean Reciprocal Rank — Was the right chunk near the top?

**Plain English:** When the correct chunk *is* retrieved, how highly ranked is it?
Being ranked 1st is much more useful than being ranked 5th.

```
For each query: reciprocal_rank = 1 / (position of first correct chunk)

  Correct chunk at rank 1 → 1/1 = 1.00
  Correct chunk at rank 2 → 1/2 = 0.50
  Correct chunk at rank 3 → 1/3 = 0.33
  Not found at all        → 0

MRR = average of all reciprocal ranks across queries
```

| MRR Value | Meaning |
|:---------:|---------|
| **1.0** | Always ranked 1st |
| **0.5** | Same score as ranking the correct chunk 2nd on every query (a mix of 1st-place hits and misses can also produce it) |
| **0.33** | Same score as ranking the correct chunk 3rd on every query |
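
The reciprocal-rank arithmetic above can be sketched as follows. This is an illustrative version under the same assumptions as the formula: one expected chunk ID per query and a ranked list of retrieved IDs. The function names and sample IDs are hypothetical and are not taken from `rag_tutorials.evaluation`.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], expected_id: str) -> float:
    """Return 1 / rank of the first match, or 0.0 if the ID was not retrieved."""
    for position, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == expected_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    retrieved_ids: Sequence[Sequence[str]],
    expected_ids: Sequence[str],
) -> float:
    """Average the per-query reciprocal ranks."""
    if not expected_ids:
        return 0.0
    scores = [
        reciprocal_rank(ranked, expected)
        for ranked, expected in zip(retrieved_ids, expected_ids)
    ]
    return sum(scores) / len(scores)


# Hypothetical results: rank 1, rank 2, and a miss.
retrieved = [
    ["c1", "c2", "c3"],
    ["c4", "c5", "c6"],
    ["c7", "c8", "c9"],
]
expected = ["c1", "c5", "zz"]

print(mean_reciprocal_rank(retrieved, expected))  # (1 + 0.5 + 0) / 3 = 0.5
```

The result here is 0.5 even though no query is "typical rank 2" across the board: one hit is first, one is second, and one is missing. MRR averages reciprocals, so it is worth inspecting the per-query values when a score looks surprising.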

#### Groundedness — Did the answer come from the retrieved context?

**Plain English:** What fraction of the words in the LLM's answer also appear in the
retrieved chunks?  A high score suggests the model is *using* the retrieved context rather than inventing content.

```
Answer:  "Employees may work remotely for up to 90 days per year."
Context: "...may work remotely for up to 90 days per year with manager approval..."

Most answer words appear in context → groundedness ≈ 0.85  (good!)
```

| Value | Meaning |
|:-----:|---------|
| **0.8–1.0** | Highly grounded — answer closely follows retrieved text |
| **0.5–0.79** | Partially grounded — some invention |
| **< 0.5** | Likely hallucinating — answer not supported by context |
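
One simple way to approximate this metric is a word-overlap ratio. The sketch below is an assumption about how such a score could be computed, not the implementation in `rag_tutorials.evaluation`: it lowercases text, splits on non-alphanumeric characters, and counts how many answer tokens occur anywhere in the context. The names `tokenize` and `groundedness` are hypothetical.

```python
import re
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def groundedness(answer: str, context_chunks: Sequence[str]) -> float:
    """Share of answer tokens that also occur somewhere in the context."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set()
    for chunk in context_chunks:
        context_vocab.update(tokenize(chunk))
    supported = sum(1 for token in answer_tokens if token in context_vocab)
    return supported / len(answer_tokens)


context = [
    "Working from another country is capped at 14 days in a rolling 12-month period."
]
answer = "The cap is 14 days."

# Tokens "is", "14", "days" are supported; "the" and "cap" are not.
print(groundedness(answer, context))  # 0.6
```

The answer above is factually supported, yet it scores only 0.6 because "cap" does not literally match "capped". A lexical score like this is a cheap proxy, so a low value is a prompt to read the answer, not proof of hallucination.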

#### Latency (ms) — How fast is the pipeline?

**Plain English:** Total wall-clock time in milliseconds from question to answer,
including embedding, retrieval, and LLM generation.

```
  ~100–300 ms  →  retrieval only (no LLM)
 ~500–2000 ms  →  full RAG pipeline
>5000 ms       →  unusable for real-time chat
```
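
Wall-clock latency can be measured by wrapping the pipeline call with a timer. The sketch below uses `time.perf_counter` from the standard library. The helper `timed_ms` and the stand-in `fake_pipeline` are hypothetical; in the notebook, the wrapped call would be the real question-to-answer function.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_ms(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall-clock time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_pipeline() -> str:
    """Stand-in for embedding + retrieval + generation (hypothetical)."""
    time.sleep(0.05)  # simulate roughly 50 ms of work
    return "answer"


result, latency_ms = timed_ms(fake_pipeline)
print(f"{result!r} took {latency_ms:.1f} ms")
```

A single measurement is noisy, especially when a network call to an LLM is involved, so averaging over several queries gives a more representative number.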

| Metric | Range | Higher = better? |
|--------|-------|------------------|
| **Recall@k** | 0 – 1 | Yes |
| **MRR** | 0 – 1 | Yes |
| **Groundedness** | 0 – 1 | Yes |
| **Latency (ms)** | > 0 | No (lower is better) |


In [37]:
# 8) Assemble End-to-End RAG Pipeline + 9/10 Smoke Tests and Evaluation

from rag_tutorials.evaluation import evaluate_single, summarize

sample_queries = queries[: cfg.sample_eval_size]
rows = [
    evaluate_single(
        query=q,
        retrieval_fn=lambda question: retrieve_dense(question, top_k=cfg.top_k),
        answer_fn=lambda question, context: answer_with_context(question, context, model=cfg.chat_model),
        top_k=cfg.top_k,
    )
    for q in sample_queries
]

metrics = summarize(rows)
print("Tutorial 1 metrics:", metrics)

# Show one full trace (retrieved chunks and final answer) to help with debugging
trace = sample_queries[0]
trace_answer, trace_retrieved = rag_answer(trace.question, top_k=cfg.top_k)
print("\nQuery:", trace.question)
for idx, row in enumerate(trace_retrieved, start=1):
    print(f"{idx}. {row.chunk_id} | score={row.score:.4f} | {row.text[:100]}")
print("\nAnswer:", trace_answer)

Tutorial 1 metrics: {'recall_at_k': 1.0, 'mrr': 0.975, 'latency_ms': 1826.7711563268676, 'groundedness': 0.7853514886656859}
Retrieved context: ['Z-Tech encourages remote work from home, co-working spaces, or temporary domestic locations. Employees must stay reachable during assigned timezone hours and use approved managed devices. Public Wi-Fi usage is allowed only with corporate VPN enabled. Employees', 'view ergonomic setup guidance quarterly and complete the annual safety attestation in the HR portal. Temporary domestic work from a location outside the home office state may require payroll location review if extended beyond 30 days. Use of personal devices f', 'ffs. Managers may define team-specific overlap hours when projects involve coordination across offices in different time zones. Home-office expenses are reimbursable only for pre-approved categories listed in the internal procurement guide. Employees should re', 'Working from another country is capped at 14 days in a rolling