# RAG Pipeline Evaluation — Tasks 5 & 6

This notebook evaluates two retrieval pipelines for **Little Red Writing Room**:

- **Option A (Baseline):** `RecursiveCharacterTextSplitter` + dense vector search + Cohere rerank
- **Option B (Advanced):** `SemanticChunker` + LLM taxonomy classification + metadata-titled chunks + dense vector search + Cohere rerank (metadata-filtered retrieval is not yet activated)

Both pipelines use in-memory Qdrant, OpenAI `text-embedding-3-small`, and the same RAG prompt.
RAGAS metrics provide the quantitative comparison at the end.

See [ARCHITECTURE.md](../docs/ARCHITECTURE.md) for full pipeline descriptions.

---

## Section 1: Setup

In [59]:
%load_ext autoreload
%autoreload 2

import os
import getpass
from dotenv import load_dotenv

load_dotenv()

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")

if not os.environ.get("COHERE_API_KEY"):
    os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key: ")

if not os.environ.get("LANGCHAIN_API_KEY"):
    os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key: ")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [60]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ.setdefault("LANGCHAIN_PROJECT", "LRWR-RAG-Evaluation")
print(f"LangSmith project: {os.environ['LANGCHAIN_PROJECT']}")

LangSmith project: LRWR-RAG-Evaluation


In [61]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chat_model = ChatOpenAI(model="gpt-4.1-nano")

---

## Section 2: Data Loading

In [62]:
from lib.data_loading import load_sample_documents

raw_docs = load_sample_documents()
print(f"Loaded {len(raw_docs)} documents")
for doc in raw_docs:
    print(f"  - {doc.metadata['source']} ({len(doc.page_content):,} chars)")

Loaded 2 documents
  - /Users/nikos/n/rvm/little-red-writing-room/notebooks/sample_data/purplefrog-finds-her-brother.md (42,963 chars)
  - /Users/nikos/n/rvm/little-red-writing-room/notebooks/sample_data/purplefrog-story-notes.md (39,817 chars)


---

## Section 3: Option A — Baseline Pipeline (Task 5)

**Chunking:** `RecursiveCharacterTextSplitter` (500 chars, 50 overlap)  
**Retrieval:** Dense vector similarity (k=10) + Cohere rerank  
**Qdrant payload:** Minimal (text, source, chunk index)

In [63]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

baseline_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " "],
)
baseline_chunks = baseline_splitter.split_documents(raw_docs)
print(f"Option A chunks: {len(baseline_chunks)}")
print(f"\nExample chunk:\n{baseline_chunks[0].page_content[:300]}...")

Option A chunks: 211

Example chunk:
–1–

"What's the function of a Worm?" asked OchraMags. She was the junior instructor of the last surviving Sparky rebel colony in the Underground. Several kids raised their hands. Behind her at a distance, hanging by two thick cables from the ceiling of the giant terraformed tunnel, the big LED fire...


In [64]:
from langchain_qdrant import QdrantVectorStore

option_a_vectorstore = QdrantVectorStore.from_documents(
    baseline_chunks,
    embeddings,
    location=":memory:",
    collection_name="option_a_baseline",
)
option_a_retriever = option_a_vectorstore.as_retriever(search_kwargs={"k": 10})

In [65]:
from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor_a = CohereRerank(model="rerank-v3.5")
option_a_rerank_retriever = ContextualCompressionRetriever(
    base_compressor=compressor_a,
    base_retriever=option_a_retriever,
)
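
The two-stage pattern configured above (dense top-k retrieval, then reranking of only those candidates) can be illustrated with a toy, dependency-free sketch. The vectors, the function names, and the `keyword_overlap` scorer are illustrative assumptions; the scorer merely stands in for the Cohere reranker and says nothing about how `rerank-v3.5` actually scores documents.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Toy stand-in for a reranker: count shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_then_rerank(query_vec, query_text, docs, k, top_n, rerank_score):
    # Stage 1: keep the k documents closest to the query in embedding space.
    candidates = sorted(
        docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True
    )[:k]
    # Stage 2: reorder only those candidates with the second scorer.
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query_text, d["text"]), reverse=True
    )
    return reranked[:top_n]


toy_docs = [
    {"text": "PurpleFrog texts her brother", "vector": [1.0, 0.0]},
    {"text": "OchraMags orders the evacuation", "vector": [0.9, 0.1]},
    {"text": "The Worms keep the Underground running", "vector": [0.0, 1.0]},
]

toy_result = retrieve_then_rerank(
    [1.0, 0.0], "evacuation orders", toy_docs, k=2, top_n=2,
    rerank_score=keyword_overlap,
)
print([d["text"] for d in toy_result])
```

In this toy run the dense stage ranks the "texts her brother" chunk first, and the second stage promotes the evacuation chunk because it shares words with the query. The third chunk never reaches the reranker because it falls outside the top `k`.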

In [66]:
from lib.chains import build_rag_chain

option_a_chain = build_rag_chain(option_a_rerank_retriever, chat_model)

### Sanity check — Option A

In [67]:
response = option_a_chain.invoke(
    {"question": "If PurpleFrog had to choose between reaching her brother and following OchraMags's evacuation order, what would she do?"}
)
print(response["response"].content)

Based on the provided context, PurpleFrog is faced with a critical choice: to reach her brother or to follow OchraMags's evacuation order. The context indicates that if PurpleFrog chooses to follow OchraMags, she risks not seeing her brother again ("Follows: she risks not seeing her brother again"). Conversely, if she chooses to escape using OchraMags's MODR (a device mentioned in the notes), she risks not making it to safety ("Escapes: she risks not making it to safety"). 

Additionally, there is a scene where OchraMags surrenders and offers to take PurpleFrog to her brother, saying, "Stop," she breathed, "Don’t do anything stupid, newb. I’ll take you to your brother. I promise." This suggests that if PurpleFrog prioritizes reaching her brother, she might accept OchraMags's offer.

Therefore, if PurpleFrog had to choose, the evidence implies she would prioritize reuniting with her brother, even if it means risking the consequences of defying evacuation orders. She seems motivated by a

In [68]:
response = option_a_chain.invoke(
    {"question": "How does PurpleFrog feel about the Underground?"}
)
print(response["response"].content)

Based on the provided context, PurpleFrog seems to have a complex attitude toward the Underground. She recognizes that the Worms keep the Underground functioning and that they were created to suppress or "zero out" people like her and her brother. This indicates a negative view of the Underground, associating it with control and oppression. 

Furthermore, her desire to hide in the Overground, especially after seeing light coming out from a big hole at the top, suggests she is eager to escape or avoid the Underground and its associated threats. She does not want to be back where the worms are, implying a negative or wary feeling about the Underground.

Overall, PurpleFrog appears to feel negatively about the Underground, viewing it as a place of control, suppression, or danger, and prefers to be elsewhere, notably in the Overground, which she seems to see as a safer or freer space.


In [69]:
response = option_a_chain.invoke(
    {"question": "What is the Ytterbium Entangler and what does it mean for the story?"}
)
print(response["response"].content)

The Ytterbium Entangler is described as a mythic artifact that the rebels have been searching for centuries. In the story, it is a crucial device used by the time travelers SnowRaven and PurpleFrog to jump to a specific space-time point—specifically, a moment before WormWood built the Worms. Its significance lies in its immense value and almost legendary status, as it enables these advanced time manipulations and time jumps. 

For the story, the Ytterbium Entangler represents a pivotal element that could grant the rebels—and by extension, the main characters—power to alter history and potentially achieve true freedom. It signifies hope, mystery, and a long-standing quest, making it a central focus of the narrative's tension and stakes. The mention of it being "mythic" also suggests that acquiring or controlling the Ytterbium Entangler could be a turning point, possibly enabling the protagonists to rewrite the past and influence their future.


---

## Section 4: Option B — Advanced Pipeline (Task 6)

**Chunking:** `SemanticChunker` (percentile threshold on embedding similarity) + content overlap (3 sentences)  
**Classification:** LLM pass per chunk producing taxonomy metadata  
**Metadata title prepend:** Compact `[content_type | narrative_function | characters: … ]` header baked into each chunk's text before embedding  
**Retrieval:** Dense vector similarity (k=10) + Cohere rerank  
**Qdrant payload:** Full taxonomy tags (content_type, narrative_function, characters_present, etc.)

### Expected performance

We expect Option B to improve over the baseline in several areas:

- **Context precision:** Semantic chunking keeps related content coherent
  rather than splitting across arbitrary boundaries. The taxonomy metadata
  also enriches the Qdrant payload, so filtered retrieval could drop
  irrelevant chunks before Cohere reranking once that filter is activated.
  This metric is not among those reported in Section 5.

- **Context recall:** Option A's 500-char chunks frequently omit character
  names because mid-scene passages use pronouns or implicit references, making
  them invisible to character-focused dense similarity queries. Option B addresses
  this in two ways: (1) larger semantic units are more likely to contain character
  names explicitly in `page_content`, and (2) the classification pass resolves
  pronouns and aliases into canonical names stored in `characters_present` in
  each chunk's Qdrant payload, enabling filtered retrieval once activated.

- **Faithfulness:** The enriched metadata gives the LLM better-grounded
  context, reducing the chance of hallucination into undefined character
  attributes.

- **Context entity recall:** The classification pass resolves pronouns and
  aliases into canonical character names, which are then prepended to each
  chunk's text as part of the metadata title before embedding. This means
  character names are present in the embedded text even for chunks that use
  pronouns in the original prose, so dense similarity search should surface
  more entity-relevant chunks than Option A.

- **Latency and cost:** We expect Option B to be slower and more expensive
  per query because semantic chunks are larger (more tokens passed to the
  LLM). The pipeline also adds an LLM classification pass at indexing time,
  which raises the one-time indexing cost but not the per-query cost.

Cohere reranking is held constant across both pipelines, so the evaluation
delta isolates the value of semantic chunking and taxonomy classification.

In [70]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
)
semantic_chunks = semantic_chunker.split_documents(raw_docs)
print(f"Option B semantic chunks: {len(semantic_chunks)}")
print(f"\nExample chunk:\n{semantic_chunks[0].page_content[:300]}...")

Option B semantic chunks: 57

Example chunk:
–1–

"What's the function of a Worm?" asked OchraMags. She was the junior instructor of the last surviving Sparky rebel colony in the Underground. Several kids raised their hands. Behind her at a distance, hanging by two thick cables from the ceiling of the giant terraformed tunnel, the big LED fire...


### Content overlap — unit test

Verify `apply_semantic_overlap` on toy fixtures before processing real data.
No API calls required.

In [71]:
from langchain_core.documents import Document
from lib.chunking import apply_semantic_overlap

# --- toy fixtures with known sentence counts ---
_A = Document(page_content="Sentence one. Sentence two. Sentence three. Sentence four.")
_B = Document(page_content="Sentence five. Sentence six. Sentence seven.")
_C = Document(page_content="Sentence eight. Sentence nine.")

_result = apply_semantic_overlap([_A, _B, _C], overlap_sentences=2)

# chunk 0 is untouched
assert _result[0].page_content == _A.page_content, "Chunk 0 should be unchanged"
assert _result[0].metadata["overlap_sentence_count"] == 0

# chunk 1 starts with the last 2 sentences of chunk 0's original text
assert "Sentence three." in _result[1].page_content
assert "Sentence four." in _result[1].page_content
assert _result[1].page_content.index("Sentence three.") < _result[1].page_content.index("Sentence five.")
assert _result[1].metadata["overlap_sentence_count"] == 2

# chunk 2 sources its overlap from chunk 1's *original* text (not the already-overlapped version)
assert "Sentence six." in _result[2].page_content
assert "Sentence seven." in _result[2].page_content
assert _result[2].page_content.index("Sentence six.") < _result[2].page_content.index("Sentence eight.")
assert _result[2].metadata["overlap_sentence_count"] == 2

print("apply_semantic_overlap unit tests passed ✓")

apply_semantic_overlap unit tests passed ✓


### Apply content overlap to semantic chunks

Each chunk (except the first) has the last `OVERLAP_SENTENCES` sentences of its
predecessor prepended. This preserves cross-boundary context for both the
classification pass (Stage 3) and for retrieval at query time.
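
As a rough illustration of the overlap step, here is a minimal sketch that works on plain strings. It is not the actual `lib.chunking.apply_semantic_overlap` (whose source is not shown here); the helper names and the naive regex sentence splitter are assumptions made for the example.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def overlap_sketch(texts, overlap_sentences):
    """Prepend the last N sentences of each predecessor to the next chunk."""
    result = []
    for i, text in enumerate(texts):
        if i == 0 or overlap_sentences <= 0:
            result.append(text)  # first chunk has no predecessor
            continue
        # Read the tail from the predecessor's ORIGINAL text so that
        # overlap does not compound from chunk to chunk.
        tail = split_sentences(texts[i - 1])[-overlap_sentences:]
        result.append(" ".join(tail + [text]))
    return result


toy_texts = [
    "Sentence one. Sentence two. Sentence three. Sentence four.",
    "Sentence five. Sentence six. Sentence seven.",
    "Sentence eight. Sentence nine.",
]
toy_overlapped = overlap_sketch(toy_texts, overlap_sentences=2)
print(toy_overlapped[2])
```

Reading from the original list rather than from the already-overlapped output is the property the unit test above checks for chunk 2.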

In [72]:
OVERLAP_SENTENCES = 3
overlapped_chunks = apply_semantic_overlap(semantic_chunks, overlap_sentences=OVERLAP_SENTENCES)
print(f"Overlapped chunks: {len(overlapped_chunks)}  (overlap window: {OVERLAP_SENTENCES} sentences)")
print(f"\nOverlap prefix example — chunk 1:\n{overlapped_chunks[1].page_content[:400]}...")

Overlapped chunks: 57  (overlap window: 3 sentences)

Overlap prefix example — chunk 1:
"TELL ME U MISS WALKIN BAREFOOT AND SILK KIMONOS OR I LL MAKE U SMELL THESE STUPID BOOTS ," she texted her brother through her standard-issued rebel "terminal". When he didn't respond she hid the terminal behind the metal crate and fired up an "anime" she'd discovered in the depths of the colony's data stash – a girl and her pet running in green fields of grass under mountains. Just like home.

If...


### Taxonomy classification pass

Each chunk gets an LLM call that produces structured metadata (content type,
narrative function, characters present, Story Grid tag, etc.).
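
One way to keep `characters_present` consistent is to map whatever names the LLM returns onto the canonical list after the call. The sketch below shows that idea only; whether `classify_chunks` performs such a step is not shown in this notebook, and both the function name and the alias table are hypothetical.

```python
def canonicalize_characters(raw_names, known_characters, aliases):
    """Map raw names to canonical ones, dropping unknowns and duplicates."""
    # Case-insensitive lookup covering canonical names and their aliases.
    lookup = {name.lower(): name for name in known_characters}
    lookup.update({alias.lower(): canon for alias, canon in aliases.items()})

    canonical = []
    for raw in raw_names:
        match = lookup.get(raw.strip().lower())
        if match and match not in canonical:
            canonical.append(match)  # preserve first-seen order
    return canonical


toy_known = ["PurpleFrog", "SnowRaven", "OchraMags"]
# Hypothetical alias table, for illustration only.
toy_aliases = {"PFrog": "PurpleFrog", "Mauve Cal": "PurpleFrog", "Chion Cas": "SnowRaven"}

toy_names = canonicalize_characters(
    ["PFrog", "purplefrog", "Chion Cas", "the instructor"], toy_known, toy_aliases
)
print(toy_names)
```

Unknown strings such as "the instructor" are dropped rather than guessed at, which keeps the payload limited to names a later filter could actually match.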

In [73]:
import importlib

import lib.classification

# Force a reload so edits to lib/classification.py are picked up in this session.
importlib.reload(lib.classification)
from lib.classification import classify_chunks

known_characters = [
    "PurpleFrog",
    "SnowRaven",
    "OchraMags",
    "OzzieHeron",
    "MyaxSerp",
    "WormWood",
]

classification_llm = ChatOpenAI(model="gpt-4.1-mini", max_completion_tokens=500)
enriched_chunks = classify_chunks(
    overlapped_chunks, known_characters, classification_llm, verbose=True
)

[1/57] dialogue / character_reveal — ['OchraMags', 'PurpleFrog']
[2/57] internal_monologue / character_reveal — ['OchraMags']
[3/57] dialogue / worldbuilding — ['PurpleFrog', 'OzzieHeron', 'WormWood']
[4/57] description / character_reveal — ['PurpleFrog', 'WormWood']
[5/57] dialogue / plot_event — ['PurpleFrog', 'OchraMags', 'OzzieHeron', 'WormWood']
[6/57] dialogue / plot_event — ['PurpleFrog', 'OchraMags']
[7/57] dialogue / plot_event — ['PurpleFrog', 'OchraMags']
[8/57] dialogue / plot_event — ['PurpleFrog', 'OchraMags', 'SnowRaven', 'WormWood']
[9/57] action_reaction / plot_event — ['SnowRaven', 'PurpleFrog', 'OchraMags']
[10/57] action_reaction / plot_event — ['OchraMags', 'PurpleFrog']
[11/57] action_reaction / plot_event — ['OchraMags', 'PurpleFrog']
[12/57] action_reaction / plot_event — ['MyaxSerp']
[13/57] action_reaction / character_reveal — ['PurpleFrog', 'MyaxSerp', 'OchraMags']
[14/57] dialogue / plot_event — ['PurpleFrog', 'MyaxSerp', 'OchraMags']
[15/57] action_reaction

In [74]:
from lib.chunking import prepend_metadata_title

titled_chunks = prepend_metadata_title(enriched_chunks)
print(f"Titled chunks: {len(titled_chunks)}")
print(f"\nExample titled chunk:\n{titled_chunks[0].page_content[:400]}...")

Titled chunks: 57

Example titled chunk:
[dialogue | character_reveal | characters: OchraMags, PurpleFrog]

–1–

"What's the function of a Worm?" asked OchraMags. She was the junior instructor of the last surviving Sparky rebel colony in the Underground. Several kids raised their hands. Behind her at a distance, hanging by two thick cables from the ceiling of the giant terraformed tunnel, the big LED firefly screen read: 

"*Days since l...


In [75]:
enriched_chunks[0].metadata

{'source': '/Users/nikos/n/rvm/little-red-writing-room/notebooks/sample_data/purplefrog-finds-her-brother.md',
 'overlap_sentence_count': 0,
 'content_type': 'dialogue',
 'narrative_function': 'character_reveal',
 'characters_present': ['OchraMags', 'PurpleFrog'],
 'story_grid_tag': 'none',
 'external_references': ['anime'],
 'implied_gaps': ['What is the role or function of a Worm in this world?',
  'What is the significance of the Sparky rebel colony and their opposition to the Worms?',
  "Why hasn't PurpleFrog's brother responded to her message?",
  'What is the setting and broader context of the terraformed tunnel and the underground colony?']}

### Build Option B vector store with metadata-titled chunks

Each `titled_chunk` has a compact taxonomy header prepended to its text before
embedding, for example:

```
[dialogue | plot_event | characters: PurpleFrog, OchraMags | turning_point]

"Tell me you miss walking barefoot..."
```

Baking the taxonomy into the `page_content` means dense similarity search
naturally rewards chunks whose `content_type`, `narrative_function`, and
`characters_present` align with the query — no Qdrant filter API required.
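
The header format can be sketched as a small pure function. This is an assumption-laden illustration, not the source of `lib.chunking.prepend_metadata_title`: the function names are hypothetical, and the rule of omitting a `story_grid_tag` of `'none'` is inferred from the example outputs in this notebook.

```python
def build_title(metadata):
    """Build a compact '[a | b | characters: x, y | tag]' header."""
    parts = [metadata["content_type"], metadata["narrative_function"]]
    characters = metadata.get("characters_present", [])
    if characters:
        parts.append("characters: " + ", ".join(characters))
    tag = metadata.get("story_grid_tag", "none")
    if tag != "none":  # assumed: 'none' tags are left out of the header
        parts.append(tag)
    return "[" + " | ".join(parts) + "]"


def prepend_title(text, metadata):
    """Return the chunk text with the header and a blank line in front."""
    return f"{build_title(metadata)}\n\n{text}"


toy_metadata = {
    "content_type": "dialogue",
    "narrative_function": "character_reveal",
    "characters_present": ["OchraMags", "PurpleFrog"],
    "story_grid_tag": "none",
}
print(prepend_title('"What\'s the function of a Worm?" asked OchraMags.', toy_metadata))
```

Because the header becomes part of `page_content`, it is embedded together with the prose and is also visible to the reranker and the generation prompt.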

In [76]:
option_b_vectorstore = QdrantVectorStore.from_documents(
    titled_chunks,
    embeddings,
    location=":memory:",
    collection_name="option_b_advanced",
)
option_b_retriever = option_b_vectorstore.as_retriever(search_kwargs={"k": 10})

In [77]:
compressor_b = CohereRerank(model="rerank-v3.5")
option_b_rerank_retriever = ContextualCompressionRetriever(
    base_compressor=compressor_b,
    base_retriever=option_b_retriever,
)

In [78]:
option_b_chain = build_rag_chain(option_b_rerank_retriever, chat_model)

### Sanity check — Option B

In [79]:
response = option_b_chain.invoke(
    {"question": "If PurpleFrog had to choose between reaching her brother and following OchraMags's evacuation order, what would she do?"}
)
print(response["response"].content)

Based on the provided context, PurpleFrog's primary motivation is to reach her brother, which she considers more important than following OchraMags's evacuation protocol. 

The context indicates that during the crisis, PurpleFrog is desperate to contact her brother and even risks her safety to do so. She actively seeks to contact him by plucking her terminal out of OchraMags's grasp and attempting to call him from the over-pipes, showing her willingness to prioritize her bond with her brother over strictly adhering to the evacuation protocol enforced by OchraMags.

Furthermore, the narrative suggests that PurpleFrog perceives her brother's safety as her "home," and her actions—stealing the MODR (a device to fly out)—demonstrate her commitment to reaching him rather than obeying strict authority or protocols. She is aware that following OchraMags's instructions might keep her safe temporarily but is willing to take significant risks (risking not making it to safety or her own injury) in

In [80]:
response = option_b_chain.invoke(
    {"question": "How does PurpleFrog feel about the Underground?"}
)
print(response["response"].content)

Based on the provided context, PurpleFrog's feelings about the Underground are complex. She perceives the Underground as a refuge or hiding place, especially after she observes the light coming from a big hole at the top, indicating a way back to the Overground or perhaps a safer area. She explicitly states her desire to "hide in the Overground especially after she sees light coming out from a big hole at the top. Not back where the worms are!!" This suggests she sees the Underground as a temporary safe space or retreat rather than a permanent home.

Additionally, her interactions with WormWood and her brother imply a certain awareness of the Underground's significance. WormWood built the Underground to feed the Overground, which indicates she recognizes it as a constructed space with strategic importance. She considers reading the server logs WormWood mentions, indicating she is curious about or values the information stored there—possibly seeing the Underground as a place that holds 

In [81]:
response = option_b_chain.invoke(
    {"question": "What is the Ytterbium Entangler and what does it mean for the story?"}
)
print(response["response"].content)

Based on the provided context, the Ytterbium Entangler is a mythic artifact that the rebels have been searching for centuries. It is the key device that enables the time travelers, SnowRaven and PurpleFrog, to jump to a space-time point before WormWood built the Worms. Its significance for the story is that capturing or controlling the Ytterbium Entangler could allow the protagonists to prevent WormWood from creating the deadly "gassing Worms," thereby potentially altering the course of history and achieving their goal of stopping WormWood's destructive plans. It represents a pivotal object of pursuit that embodies hope for change and is central to the worldbuilding and the characters' motivations in the narrative.


---

## Section 5: RAGAS Evaluation

Generate a synthetic test dataset from the raw documents, then evaluate both
pipelines against it using RAGAS metrics.
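
For orientation, two of the metrics reported below reduce to the same ratio once an LLM judge has labelled each claim. Context recall is the share of claims in the reference answer that the retrieved context supports, and faithfulness is the share of claims in the generated response that the context supports. The sketch below covers only that final ratio; the claim extraction and the judge verdicts are assumed to exist already, and the function name is illustrative.

```python
def supported_fraction(verdicts):
    """Share of claims judged supported; None when there are no claims."""
    if not verdicts:
        return None
    return sum(1 for v in verdicts if v) / len(verdicts)


# Hypothetical judge verdicts for one test question.
reference_claims_supported = [True, True, False, True]  # drives context recall
response_claims_supported = [True, False]               # drives faithfulness

toy_context_recall = supported_fraction(reference_claims_supported)
toy_faithfulness = supported_fraction(response_claims_supported)
print(toy_context_recall, toy_faithfulness)
```

The dataset-level scores are then averages of these per-question values, so a handful of questions can move them noticeably on a small test set.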

### Generate synthetic test dataset

In [82]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))  # type: ignore[arg-type]
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(raw_docs, testset_size=10)


In [83]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is the PFrog and why is it important in t...,[# Scene Planning Attack of the Worms Short St...,"In the story, PFrog refers to PurpleFrog (also...",single_hop_specifc_query_synthesizer
1,Who is the SnowRaven in the context of the sce...,[## Scene 2: Steal The Comms | 1\. What are th...,"SnowRaven, also known as Chion Cas, is a chara...",single_hop_specifc_query_synthesizer
2,What is Purplefrog?,[## Scene 3: Evacuate Or Escape | 1\. What are...,PurpleFrog (aka Mauve Cal) is the Protagonist ...,single_hop_specifc_query_synthesizer
3,In the context of the scene where PurpleFrog i...,[## Scene 4: Chased By A Worm | 1\. What are t...,"PurpleFrog is the protagonist, also known as M...",single_hop_specifc_query_synthesizer
4,Wha crisis involve a choise to loosn grip or s...,[<1-hop>\n\n## Scene 2: Steal The Comms | 1\. ...,In the scene where PurpleFrog faces the crisis...,multi_hop_abstract_query_synthesizer
5,How does the story illustrate that safety can ...,[<1-hop>\n\n# Scene Planning Attack of the Wor...,The story demonstrates that safety is not sole...,multi_hop_abstract_query_synthesizer
6,How do the themes of character struggle with s...,[<1-hop>\n\n## Scene 2: Steal The Comms | 1\. ...,"In the first scene, PurpleFrog faces the chall...",multi_hop_abstract_query_synthesizer
7,How do the choices and consequences faced by P...,[<1-hop>\n\n# Scene Planning Attack of the Wor...,"In the first scene, PurpleFrog faces the crisi...",multi_hop_abstract_query_synthesizer
8,How do the worms in both scenes demonstrate th...,[<1-hop>\n\n## Scene 4: Chased By A Worm | 1\....,"In the first scene, the worms act as environme...",multi_hop_specific_query_synthesizer
9,"How do the themes of resilience and trust, exe...",[<1-hop>\n\n## Scene 3: Evacuate Or Escape | 1...,"In the provided context, Mauve Cal, also known...",multi_hop_specific_query_synthesizer


### Evaluate Option A (Baseline)

In [84]:
from lib.evaluation import evaluate_chain

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))  # type: ignore[arg-type]

option_a_results = evaluate_chain(
    option_a_chain.with_config({"run_name": "option_a_baseline"}),
    dataset,
    evaluator_llm,
)
option_a_results

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

Exception raised in Job[0]: AttributeError('StringIO' object has no attribute 'classifications')
Exception raised in Job[7]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[21]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[52]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[41]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[47]: AttributeError('StringIO' object has no attribute 'statements')


{'context_recall': 0.3636, 'faithfulness': 0.6647, 'factual_correctness': 0.6744, 'answer_relevancy': 0.8533, 'context_entity_recall': 0.4583}

### Evaluate Option B (Advanced)

In [85]:
option_b_results = evaluate_chain(
    option_b_chain.with_config({"run_name": "option_b_advanced"}),
    dataset,
    evaluator_llm,
)
option_b_results

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

Exception raised in Job[0]: AttributeError('StringIO' object has no attribute 'classifications')
Exception raised in Job[40]: AttributeError('StringIO' object has no attribute 'classifications')
Exception raised in Job[2]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[7]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[52]: AttributeError('StringIO' object has no attribute 'statements')


{'context_recall': 0.8833, 'faithfulness': 0.9474, 'factual_correctness': 0.6867, 'answer_relevancy': 0.9277, 'context_entity_recall': 0.5514}

---

## Section 6: Comparison

Side-by-side RAGAS scores and delta analysis.

In [86]:
from lib.evaluation import compare_results

comparison_df = compare_results({
    "Option A (Baseline)": option_a_results,
    "Option B (Advanced)": option_b_results,
})
comparison_df

| Metric | Option A (Baseline) | Option B (Advanced) | delta | pct_change |
|---|---|---|---|---|
| context_recall | 0.364 | 0.883 | 0.519 | 142.6 |
| faithfulness | 0.665 | 0.947 | 0.282 | 42.4 |
| factual_correctness | 0.674 | 0.687 | 0.013 | 1.9 |
| answer_relevancy | 0.853 | 0.928 | 0.075 | 8.8 |
| context_entity_recall | 0.458 | 0.551 | 0.093 | 20.3 |
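
The `delta` and `pct_change` columns can be reproduced from the two result dictionaries printed in Section 5. The sketch below is not the source of `lib.evaluation.compare_results`. It assumes scores are rounded to three decimals before the ratio is taken, which is what matches the percentages shown in the comparison table.

```python
def compare(baseline, candidate, decimals=3):
    """Per-metric delta and percent change of candidate over baseline."""
    rows = {}
    for metric, raw_a in baseline.items():
        a = round(raw_a, decimals)
        b = round(candidate[metric], decimals)
        rows[metric] = {
            "delta": round(b - a, decimals),
            "pct_change": round((b - a) / a * 100, 1),
        }
    return rows


# Scores copied from the evaluation outputs in Section 5.
scores_a = {"context_recall": 0.3636, "faithfulness": 0.6647,
            "factual_correctness": 0.6744, "answer_relevancy": 0.8533,
            "context_entity_recall": 0.4583}
scores_b = {"context_recall": 0.8833, "faithfulness": 0.9474,
            "factual_correctness": 0.6867, "answer_relevancy": 0.9277,
            "context_entity_recall": 0.5514}

toy_comparison = compare(scores_a, scores_b)
for metric, row in toy_comparison.items():
    print(metric, row["delta"], row["pct_change"])
```

Percent change is relative to the baseline, so a low starting score such as context recall (0.364) yields a large percentage for a given absolute gain.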


In [88]:
import pandas as pd
import numpy as np

option_a_runs = pd.DataFrame({
    "latency_s": [2.16, 2.59, 5.16, 4.49, 3.66, 3.74, 2.60, 2.96, 2.92, 1.76, 1.50, 1.44],
    "tokens":    [864, 894, 954, 1087, 1073, 1006, 829, 988, 1002, 750, 728, 744],
    "cost":      [0.0001644, 0.0001563, 0.0002055, 0.0002338, 0.0002261, 0.0002155,
                  0.0001447, 0.000193, 0.0001938, 0.0001176, 0.0001085, 0.0001071],
})

option_b_runs = pd.DataFrame({
    "latency_s": [4.20, 3.81, 5.60, 4.47, 3.19, 3.53, 2.75, 3.19, 2.52, 1.70, 2.97, 1.79],
    "tokens":    [3337, 2902, 3198, 2908, 2960, 3007, 2780, 2059, 2628, 1308, 2704, 2530],
    "cost":      [0.0004894, 0.0004039, 0.0004677, 0.0004072, 0.0004181, 0.0003955,
                  0.000353, 0.0002911, 0.0003315, 0.0001548, 0.0003592, 0.000301],
})

avg_a = option_a_runs.mean()
avg_b = option_b_runs.mean()

langsmith_summary = pd.DataFrame({
    "Option A (avg)": avg_a,
    "Option B (avg)": avg_b,
    "delta": avg_b - avg_a,
    "pct_change": ((avg_b - avg_a) / avg_a * 100).round(1),
}).rename(index={"latency_s": "Latency (s)", "tokens": "Tokens", "cost": "Cost ($)"})

langsmith_summary

| Metric | Option A (avg) | Option B (avg) | delta | pct_change |
|---|---|---|---|---|
| Latency (s) | 2.915 | 3.31 | 0.395 | 13.6 |
| Tokens | 909.916667 | 2693.416667 | 1783.5 | 196.0 |
| Cost ($) | 0.000172 | 0.000364 | 0.000192 | 111.6 |


### Analysis

- **Context recall (+142.6%, 0.364 → 0.883):** The strongest result. Option
  A's fixed 500-char chunks frequently lack character names because mid-scene
  passages rely on pronouns or implicit references, so character-focused dense
  similarity queries fail to surface the relevant chunks. Option B likely helps
  in two ways: (1) larger semantic units are self-contained narrative
  passages far more likely to contain character names explicitly in
  `page_content`, and (2) the classification pass resolves pronouns and aliases
  into canonical names that are stored in `characters_present` and prepended to
  the chunk text in the metadata title, so they are part of what gets embedded.

- **Faithfulness (+42.4%, 0.665 → 0.947):** A major improvement. Enriched,
  self-contained semantic chunks give the LLM better grounding, significantly
  reducing hallucination and unsupported claims in generated responses.

- **Factual correctness (+1.9%, 0.674 → 0.687):** Essentially flat. A gain
  this small cannot be separated from noise on a test set of this size, so the
  safest reading is that the larger semantic chunks did not hurt accuracy on
  specific story details.

- **Context entity recall (+20.3%, 0.458 → 0.551):** A meaningful improvement.
  The classification pass extracts `characters_present` and other named
  entities into chunk metadata, and the metadata title writes them explicitly
  into the chunk text. Better entity recall is therefore expected, even though
  the title's contribution to the embedding may be overshadowed by the rest of
  the chunk content.

- **Answer relevancy (+8.8%, 0.853 → 0.928):** A solid improvement. Responses
  generated from richer, more coherent context are better aligned to the
  original question.

- **Latency (+13.6%), tokens (+196%), cost (+111.6%):** Option B is moderately
  slower and more expensive per query, as expected from the larger,
  metadata-titled semantic chunks passed to the LLM as context. The latency
  overhead is modest, and the token and cost premium is justified by the
  quality gains, chiefly in context recall and faithfulness.

**Conclusion:** Option B scores higher on all five RAGAS metrics.
The metadata extraction and semantic chunking strategy delivers its intended
value: context recall more than doubles (0.364 → 0.883), faithfulness
approaches 0.95, and entity recall improves meaningfully even without
activated filtered retrieval. The latency and cost overhead is acceptable
given the quality uplift. Potential next steps: activate metadata-filtered
Qdrant retrieval at query time and tune semantic chunk thresholds to reduce
the token and cost premium.
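
As a starting point for the filtered-retrieval step, the logic can be prototyped without any vector store: detect canonical names in the question, then keep only chunks whose `characters_present` overlaps them. Everything in this sketch is hypothetical (function names, the substring-based name detection, the toy chunks). In Qdrant the same condition would be expressed as a payload filter rather than a Python list comprehension.

```python
def characters_in_question(question, known_characters):
    """Return known character names that appear in the question text."""
    lowered = question.lower()
    return [name for name in known_characters if name.lower() in lowered]


def filter_chunks(chunks, characters):
    """Keep chunks that mention at least one of the requested characters."""
    if not characters:
        return list(chunks)  # no names detected: fall back to no filtering
    wanted = set(characters)
    return [
        chunk for chunk in chunks
        if wanted & set(chunk["metadata"].get("characters_present", []))
    ]


toy_known = ["PurpleFrog", "SnowRaven", "OchraMags", "MyaxSerp"]
toy_chunks = [
    {"text": "chunk 1", "metadata": {"characters_present": ["OchraMags", "PurpleFrog"]}},
    {"text": "chunk 12", "metadata": {"characters_present": ["MyaxSerp"]}},
]

toy_names = characters_in_question(
    "How does PurpleFrog feel about the Underground?", toy_known
)
toy_filtered = filter_chunks(toy_chunks, toy_names)
print(toy_names, [c["text"] for c in toy_filtered])
```

Falling back to unfiltered retrieval when no name is detected avoids returning an empty context for questions about settings or themes.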