# Tutorial 1 — Basic RAG (Dense Retrieval Baseline)

This notebook implements a complete baseline RAG pipeline and exposes the embedding and retrieval steps so that first-time learners can inspect them directly.

```mermaid
flowchart LR
    A[Documents] --> B[Fixed Chunking]
    B --> C[OpenAI Embeddings]
    C --> D[Chroma Vector Index]
    E[User Query] --> F[Query Embedding]
    F --> D
    D --> G[Top-k Chunks]
    G --> H[LLM Answer]
```

Continuity note:
- Tutorial 2 keeps the same pipeline but changes **chunking**.
- Tutorial 3 keeps Tutorial 2's chunking and adds **reranking**.
- Tutorial 4 adds **hybrid retrieval** (keyword + dense).

## Learning checkpoint: what works vs what breaks

**What works in Tutorial 1**
- Dense retrieval can find generally related handbook content.
- End-to-end RAG flow is functional (ingest → chunk → embed → retrieve → answer).

**Challenges you should observe**
- Query intent can be too broad for nearest-neighbor retrieval.
- Exception-heavy policy questions may return partially relevant chunks.
- Exact policy identifiers (such as form names or codes) are often matched poorly, because dense embeddings capture meaning rather than exact strings.

**Why move to Tutorial 2**
- The first bottleneck is chunk quality.
- We next improve *how text is split* so policy context stays intact before retrieval.
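
To make the chunk-quality bottleneck concrete, here is a minimal sketch of fixed-size splitting. It assumes `chunk_size` counts characters; the tutorial's `fixed_chunk_documents` may measure size differently. The helper `fixed_chunks` and the `policy` text are hypothetical and exist only for this illustration.

```python
def fixed_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size, ignoring sentence boundaries.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical policy text, invented for illustration only.
policy = (
    "Employees may work abroad for up to 30 days. "
    "Longer stays require written approval from HR."
)

pieces = fixed_chunks(policy, chunk_size=40)
for idx, piece in enumerate(pieces, start=1):
    print(f"[{idx}] {piece!r}")
```

Because the split positions depend only on character counts, a rule and its exception can land in different chunks, and a retriever that returns only one of them hands the LLM an incomplete policy.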

In [None]:
# 1) Set Up Environment and Dependencies

import importlib.util  # the util submodule must be imported explicitly for find_spec
import os
import sys
from pathlib import Path

REQUIRED_PACKAGES = [
    "openai",
    "chromadb",
    "numpy",
    "pandas",
    "rank_bm25",
    "sentence_transformers",
    "dotenv",
]

missing = [pkg for pkg in REQUIRED_PACKAGES if importlib.util.find_spec(pkg) is None]
if missing:
    print("Missing packages:", missing)
    print("Run: uv sync")
else:
    print("All required packages are available.")

print("Python:", sys.version.split()[0])
print("Working directory:", Path.cwd())

In [None]:
# 2) Define Configuration and Paths

from dataclasses import dataclass
from dotenv import load_dotenv

load_dotenv()

@dataclass
class Config:
    embedding_model: str = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
    chat_model: str = os.getenv("OPENAI_CHAT_MODEL", "gpt-4.1-mini")
    chunk_mode: str = "fixed"
    top_k: int = 5
    sample_eval_size: int = 20
    handbook_path: str = "data/handbook_manual.txt"
    queries_path: str = "data/queries.jsonl"

cfg = Config()

if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set. Copy .env.example to .env and set your key.")

cfg

In [None]:
# 3) Load and Normalize Source Documents (shared handbook text + query set)

from rag_tutorials.io_utils import load_handbook_documents, load_queries

if not Path(cfg.handbook_path).exists() or not Path(cfg.queries_path).exists():
    raise FileNotFoundError(
        "Shared data files are missing. Run: uv run python scripts/generate_data.py"
    )

documents = load_handbook_documents(cfg.handbook_path)
queries = load_queries(cfg.queries_path)

print("Source text:", cfg.handbook_path)
print("Parsed handbook sections:", len(documents))
print("Queries:", len(queries))
print("Sample parsed document:", documents[0])

In [None]:
# 4) Split Documents into Chunks (fixed chunking baseline)

from dataclasses import asdict
from rag_tutorials.chunking import fixed_chunk_documents
import pandas as pd

chunks = fixed_chunk_documents(documents, chunk_size=260)

chunk_df = pd.DataFrame([asdict(c) for c in chunks])
stats = {
    "chunk_count": len(chunk_df),
    "avg_chunk_chars": chunk_df.text.map(len).mean(),
    "max_chunk_chars": chunk_df.text.map(len).max(),
}
print(stats)
chunk_df.head(3)

In [None]:
# Chunk boundary visualization (same source text, different split strategies)

from rag_tutorials.chunking import semantic_chunk_documents

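section_doc = next((doc for doc in documents if doc.section == "International Work"), None)
if section_doc is None:
    raise ValueError('No handbook section named "International Work" was found.')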
fixed_view = [c.text for c in fixed_chunk_documents([section_doc], chunk_size=120)]
semantic_view = [c.text for c in semantic_chunk_documents([section_doc])]

print("Section:", section_doc.section)
print("\nFixed chunks:")
for idx, chunk_text in enumerate(fixed_view, start=1):
    print(f"[{idx}] {chunk_text}")

print("\nSemantic chunks:")
for idx, chunk_text in enumerate(semantic_view, start=1):
    print(f"[{idx}] {chunk_text}")

## Novice Lens: How Embeddings and Retrieval Actually Work

```mermaid
sequenceDiagram
    participant U as User Query
    participant E as Embedding Model
    participant V as Vector Store
    participant L as LLM
    U->>E: "working from another country"
    E->>V: query vector
    V-->>U: top-k chunks + scores
    U->>L: question + retrieved chunks
    L-->>U: grounded answer
```

We will inspect:
1. Query and chunk vectors (dimensions and first values)
2. Cosine similarities
3. Ranked chunk IDs returned to the generator
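
Before looking at real embeddings, the scoring and ranking steps can be sketched with plain NumPy. This is an illustrative sketch, not the implementation behind `rag_tutorials.embeddings.cosine_similarity` or the Chroma index; the helpers `cosine_scores` and `top_k_indices` and the toy vectors are hypothetical.

```python
import numpy as np


def cosine_scores(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of matrix."""
    unit_query = query / np.linalg.norm(query)
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    # The dot product of unit vectors equals the cosine of the angle between them.
    return unit_rows @ unit_query


def top_k_indices(scores: np.ndarray, k: int) -> list[int]:
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k].tolist()


# Toy 3-dimensional "embeddings", invented for illustration only.
toy_chunks = np.array([
    [1.0, 0.0, 0.0],  # same direction as the query
    [0.6, 0.8, 0.0],  # partially aligned with the query
    [0.0, 0.0, 1.0],  # orthogonal to the query
])
toy_query = np.array([1.0, 0.0, 0.0])

toy_scores = cosine_scores(toy_query, toy_chunks)
ranked = top_k_indices(toy_scores, k=2)
print("scores:", np.round(toy_scores, 4))
print("top-2 chunk indices:", ranked)
```

Real embedding vectors have hundreds or thousands of dimensions, but the ranking logic is the same: score every chunk against the query, sort in descending order, and keep the first `k`.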

In [None]:
# 5) Create Embeddings and Build Vector Index

from rag_tutorials.pipeline import build_dense_retriever
from rag_tutorials.embeddings import embed_texts, cosine_similarity
import numpy as np

dense_retriever, doc_vectors = build_dense_retriever(
    chunks=chunks,
    collection_name="tutorial1_basic_dense",
    embedding_model=cfg.embedding_model,
)

print("Embedding matrix shape:", doc_vectors.shape)
print("Example vector (first 10 dims):", np.round(doc_vectors[0][:10], 4))

# Tiny toy example for intuition
sample_chunks = [c.text for c in chunks[:3]]  # slicing avoids an IndexError when fewer than 3 chunks exist
sample_vectors = embed_texts(sample_chunks, model=cfg.embedding_model)
sample_query = "What is the policy for working from another country?"
sample_query_vector = embed_texts([sample_query], model=cfg.embedding_model)[0]

scores = cosine_similarity(sample_query_vector, sample_vectors)
for idx, score in enumerate(scores, start=1):
    print(f"Toy chunk {idx} cosine score: {score:.4f}")

In [None]:
# 6) Implement Retriever Logic

import pandas as pd

def retrieve_dense(question: str, top_k: int = 5):
    return dense_retriever(question, top_k=top_k)

probe_query = "What is the policy for working from another country?"
probe_results = retrieve_dense(probe_query, top_k=cfg.top_k)

pd.DataFrame([
    {
        "rank": idx + 1,
        "chunk_id": row.chunk_id,
        "score": row.score,
        "source": row.source,
        "preview": row.text[:120],
    }
    for idx, row in enumerate(probe_results)
])

In [None]:
# 7) Implement Prompt Template and LLM Call

from rag_tutorials.qa import answer_with_context

def rag_answer(question: str, top_k: int = 5):
    retrieved = retrieve_dense(question, top_k=top_k)
    context = [r.text for r in retrieved]
    answer = answer_with_context(question, context, model=cfg.chat_model)
    return answer, retrieved

answer, retrieved = rag_answer(probe_query)
print(answer)

In [None]:
# 8) Assemble End-to-End RAG Pipeline, 9) Smoke Tests, 10) Evaluation

from rag_tutorials.evaluation import evaluate_single, summarize

sample_queries = queries[: cfg.sample_eval_size]
rows = [
    evaluate_single(
        query=q,
        retrieval_fn=lambda question: retrieve_dense(question, top_k=cfg.top_k),
        answer_fn=lambda question, context: answer_with_context(question, context, model=cfg.chat_model),
        top_k=cfg.top_k,
    )
    for q in sample_queries
]

metrics = summarize(rows)
print("Tutorial 1 metrics:", metrics)

# Show one trace row for novice debugging
trace = sample_queries[0]
trace_answer, trace_retrieved = rag_answer(trace.question, top_k=cfg.top_k)
print("\nQuery:", trace.question)
for idx, row in enumerate(trace_retrieved, start=1):
    print(f"{idx}. {row.chunk_id} | score={row.score:.4f} | {row.text[:100]}")
print("\nAnswer:", trace_answer)