# Module 7: Synthetic Data Generation and RAG Evaluation with Ragas

In this notebook, we'll go end-to-end from **generating synthetic evaluation data** to **systematically evaluating and improving a RAG pipeline** ‚Äî all using [Ragas](https://github.com/explodinggradients/ragas).

The flow is:
1. **Generate** synthetic test data using Ragas' knowledge graph-based approach
2. **Build** a baseline RAG application with LangChain and LangGraph
3. **Evaluate** the RAG application against our synthetic test set using Ragas metrics
4. **Iterate** on the pipeline and measure the impact

> **NOTE:** Ragas is framework-agnostic ‚Äî while this example uses LangChain/LangGraph, you can use Ragas with any framework (or none at all). Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

## Outline

**Part 1: Synthetic Data Generation**
- Task 1: Dependencies and API Keys
- Task 2: Data Preparation
- Task 3: Knowledge Graph Construction
- Task 4: Generating Synthetic Test Data
- ***‚ùì Question #1 & Question #2***
- ***üèóÔ∏è Activity #1: Custom Query Distribution***

**Part 2: RAG Evaluation with Ragas**
- Task 5: Building a Baseline RAG Application
- Task 6: Evaluating with Ragas
- Task 7: Making Adjustments and Re-Evaluating
- ***‚ùì Question #3, Question #4, Question #5, & Question #6***
- ***üèóÔ∏è Activity #2: Implement a Different Reranking Strategy***

---
# Part 1: Synthetic Data Generation with Ragas

Before we can evaluate a RAG system, we need high-quality test data. Manually creating question-answer pairs is time-consuming and often biased toward simple queries. Ragas solves this by building a **knowledge graph** from your documents and using it to generate diverse, realistic test questions automatically.

We'll use the **Stone Ridge 2025 Investor Letter** and an **Alternative Investments Handbook** as our source documents ‚Äî maintaining continuity with the investment advisory use case from previous sessions.

## Task 1: Dependencies and API Keys

If you have not already done so, install the required libraries using the uv package manager:
```bash
uv sync
```

We'll need API keys for:
- **OpenAI** ‚Äî for LLM and embedding models (used in both SDG and RAG evaluation)
- **Cohere** ‚Äî for reranking in the improved pipeline ([sign up here](https://docs.cohere.com/reference/about))

You have two options for supplying your API keys:
- Use environment variables (copy `.env.sample` to `.env` and fill in your keys)
- Provide them via the prompts below

In [3]:
import nest_asyncio
nest_asyncio.apply()

In [4]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/jaden.lee/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jaden.lee/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [6]:
import os
from getpass import getpass
from dotenv import load_dotenv

load_dotenv()

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

if not os.environ.get("COHERE_API_KEY"):
    os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

## Task 2: Data Preparation

We'll prepare our data using two complementary investment-focused sources:
- **Stone Ridge 2025 Investor Letter** ‚Äî covering Stone Ridge's investment philosophy, Bayesian approach to decision-making, energy investments, reinsurance, and risk management
- **Alternative Investments Handbook** ‚Äî covering alternative asset classes including real estate, private equity, hedge funds, reinsurance, commodities, and infrastructure

The topical overlap between these documents (particularly around reinsurance, risk premiums, diversification, and alternative investments) helps Ragas build rich cross-document relationships in the knowledge graph.

In [7]:
from langchain_community.document_loaders import PyMuPDFLoader, TextLoader

# Load the Stone Ridge 2025 Investor Letter (PDF)
pdf_loader = PyMuPDFLoader("data/Stone Ridge 2025 Investor Letter.pdf")
pdf_docs = pdf_loader.load()

# Load the Alternative Investments Handbook (text)
txt_loader = TextLoader("data/AlternativeInvestmentsHandbook.txt")
txt_docs = txt_loader.load()

# Combine into a single list
docs = pdf_docs + txt_docs
print(f"Loaded {len(docs)} documents:")
print(f"  - Stone Ridge 2025 Investor Letter: {len(pdf_docs)} pages")
print(f"  - AlternativeInvestmentsHandbook.txt: {len(txt_docs)} document(s)")

Loaded 15 documents:
  - Stone Ridge 2025 Investor Letter: 14 pages
  - AlternativeInvestmentsHandbook.txt: 1 document(s)


## Task 3: Knowledge Graph Construction

Ragas uses a **knowledge graph-based approach** to create synthetic test data. This is powerful because it allows us to create complex, multi-hop queries ‚Äî not just simple factoid questions. Systems tend to perform well on simple evaluation tasks, so this additional complexity helps us find real weaknesses.

The process works in three stages:
1. **Build the graph** ‚Äî insert documents as nodes
2. **Apply transformations** ‚Äî extract headlines, summaries, themes, entities, and embeddings
3. **Create relationships** ‚Äî use cosine similarity and overlap scores to connect related nodes

Let's start by defining our `generator_llm` and `generator_embeddings`.

In [8]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

### Step 1: Initialize the Knowledge Graph

We create an empty knowledge graph and populate it with our document nodes. Each full document becomes a node of type `DOCUMENT`.

In [10]:
import os
import certifi

# Create a combined cert bundle with Zscaler for corporate network
zscaler_cert = "/Users/jaden.lee/ZscalerRootCertificate-2048-SHA256-Feb2025.crt"
combined_cert = "/Users/jaden.lee/combined_certs.pem"

with open(combined_cert, "w") as outfile:
    with open(certifi.where(), "r") as certifi_file:
        outfile.write(certifi_file.read())
    with open(zscaler_cert, "r") as zscaler_file:
        outfile.write(zscaler_file.read())

os.environ['REQUESTS_CA_BUNDLE'] = combined_cert
os.environ['SSL_CERT_FILE'] = combined_cert
os.environ['CURL_CA_BUNDLE'] = combined_cert

In [11]:
from ragas.testset.graph import KnowledgeGraph, Node, NodeType

kg = KnowledgeGraph()

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 15, relationships: 0)

### Step 2: Apply Transformations

Now we apply the [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms) to enrich our knowledge graph. These transformations:

- **HeadlinesExtractor** ‚Äî finds the overall headlines for each document
- **SummaryExtractor** ‚Äî produces summaries of the documents
- **ThemesExtractor** ‚Äî extracts broad themes
- **EmbeddingExtractor** ‚Äî creates embeddings for similarity computation
- **NERExtractor** ‚Äî extracts named entities

These are then used to build relationships between nodes via cosine similarity and overlap scoring.

In [12]:
from ragas.testset.transforms import default_transforms, apply_transforms

transforms = default_transforms(
    documents=docs,
    llm=generator_llm,
    embedding_model=generator_embeddings
)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/14 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/15 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/25 [00:00<?, ?it/s]

Property 'summary' already exists in node '9d6b55'. Skipping!
Property 'summary' already exists in node '54f924'. Skipping!
Property 'summary' already exists in node 'eec453'. Skipping!
Property 'summary' already exists in node '72a600'. Skipping!
Property 'summary' already exists in node '40931f'. Skipping!
Property 'summary' already exists in node '470328'. Skipping!
Property 'summary' already exists in node '800089'. Skipping!
Property 'summary' already exists in node 'f3c680'. Skipping!
Property 'summary' already exists in node 'd01f69'. Skipping!
Property 'summary' already exists in node '501634'. Skipping!
Property 'summary' already exists in node '26c079'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/10 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/45 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '9d6b55'. Skipping!
Property 'summary_embedding' already exists in node '54f924'. Skipping!
Property 'summary_embedding' already exists in node '72a600'. Skipping!
Property 'summary_embedding' already exists in node '40931f'. Skipping!
Property 'summary_embedding' already exists in node 'eec453'. Skipping!
Property 'summary_embedding' already exists in node '800089'. Skipping!
Property 'summary_embedding' already exists in node 'd01f69'. Skipping!
Property 'summary_embedding' already exists in node '501634'. Skipping!
Property 'summary_embedding' already exists in node '470328'. Skipping!
Property 'summary_embedding' already exists in node '26c079'. Skipping!
Property 'summary_embedding' already exists in node 'f3c680'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 36, relationships: 304)

### Step 3: Save the Knowledge Graph

Knowledge graphs can be saved and loaded, which is useful for iterating on test generation without rebuilding the graph each time.

In [13]:
kg.save("investment_data_kg.json")

# You can reload it later:
# kg = KnowledgeGraph.load("investment_data_kg.json")

## Task 4: Generating Synthetic Test Data

With our knowledge graph built, we can now generate synthetic test data. Ragas provides several **query synthesizers**, each producing a different type of question:

- **`SingleHopSpecificQuerySynthesizer`** ‚Äî creates questions answerable from a single chunk of context (e.g., *"What is Stone Ridge's approach to reinsurance investing?"*)
- **`MultiHopAbstractQuerySynthesizer`** ‚Äî creates questions requiring synthesis across multiple chunks at an abstract level (e.g., *"How do alternative risk premiums relate to portfolio diversification?"*)
- **`MultiHopSpecificQuerySynthesizer`** ‚Äî creates questions requiring specific details from multiple chunks (e.g., *"How does Stone Ridge's Bayesian philosophy connect to their energy investment strategy?"*)

We define a **query distribution** to control the mix of question types.

In [14]:
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

In [15]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
    knowledge_graph=kg
)

testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How has Chicago influenced your career?,[and my career has essentially been a three-de...,My career has essentially been a three-decade ...,single_hop_specifc_query_synthesizer
1,How does Magnetar Capital's approach relate to...,"[TODAY‚ÄôS THE DAY Fancy math aside, the foundat...","The context mentions Alec Litowitz, founder of...",single_hop_specifc_query_synthesizer
2,Stone Ridge Energy how much return last year a...,[Standardized returns as of most recent quarte...,Standardized returns as of most recent quarter...,single_hop_specifc_query_synthesizer
3,What is Stone Rigde?,[Risk Disclosures This communication has been ...,Stone Ridge is mentioned in the context as par...,single_hop_specifc_query_synthesizer
4,What are commodities in alternative investments?,[The Alternative Investments Handbook A Practi...,Commodities are included as an asset class wit...,single_hop_specifc_query_synthesizer
5,Wht factors affect real estate values like loc...,[<1-hop>\n\nPART 2: REAL ESTATE INVESTMENTS Ch...,Factors affecting real estate values include l...,multi_hop_abstract_query_synthesizer
6,Reinsurance like hurricanes earthquakes storms...,[<1-hop>\n\nPART 5: INSURANCE-LINKED INVESTMEN...,Reinsurance is insurance for insurance compani...,multi_hop_abstract_query_synthesizer
7,Whi is the importnt of private equity concepts...,[<1-hop>\n\nThe Alternative Investments Handbo...,The importance of private equity concepts such...,multi_hop_abstract_query_synthesizer
8,Whay hedge fund strategies are used for divers...,[<1-hop>\n\nThe Alternative Investments Handbo...,Hedge fund strategies such as long/short equit...,multi_hop_specific_query_synthesizer
9,"In Chapter 2 and Chapter 4, how does the real ...",[<1-hop>\n\nPART 2: REAL ESTATE INVESTMENTS Ch...,Chapter 4 explains various types of real estat...,multi_hop_specific_query_synthesizer


### Abstracted SDG (Shortcut)

The above was the **unrolled** process showing each step. Ragas also provides a one-liner that builds the knowledge graph under the hood and generates the test set in a single call. This is convenient for quick iteration:

In [13]:
# Abstracted approach (for reference):
# generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
# dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

### ‚ùì Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### ‚úÖ Answer:

SingleHopSpecificQuerySynthesizer - creates simple, straightforward questions that can be answered using information from a single chunk of the document
MultiHopAbstractQuerySynthesizer - creates questions that require combining information from multiple chunks at a conceptual/abstract level. You need to synthesize ideas across different parts of the documents.
MultiHopSpecificQuerySynthesizer - creates questions that require combining specific details from multiple chunks. Similar to MultiHopAbstract, but focuses on specific facts rather than abstract concepts.

### ‚ùì Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### ‚úÖ Answer:

I believe the main trade offs are specification and control. With abstract, you are giving up control and customization for simplification and ease of use. I would choose unrolled when I need to make a lot of customizations and the tests need to be specific for production deployment.

### üèóÔ∏è Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

**Requirements:**
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [16]:
### YOUR CODE HERE ###

# Define a custom query distribution with different weights
# focused more on complex queries
custom_query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.2),      # 20% - reduced from 50%
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.4),       # 40% - increased from 25%
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.4),       # 40% - increased from 25%
]
# Generate a new test set and compare with the default
print("Custom Distribution:")
print("  - Single-hop (simple): 20% (was 50%)")
print("  - Multi-hop abstract: 40% (was 25%)")
print("  - Multi-hop specific: 40% (was 25%)")
print("\nGenerating test set with custom distribution...\n")

# Generate new test set with custom distribution
custom_generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
    knowledge_graph=kg  # Reuse the same knowledge graph!
)

custom_testset = custom_generator.generate(
    testset_size=10,
    query_distribution=custom_query_distribution
)

# Display the results
custom_df = custom_testset.to_pandas()
print("\n" + "="*80)
print("CUSTOM TEST SET RESULTS")
print("="*80)
display(custom_df)

# Compare distributions
print("\n" + "="*80)
print("DISTRIBUTION COMPARISON")
print("="*80)

original_df = testset.to_pandas()

print("\nOriginal Distribution (from Cell 18):")
original_counts = original_df['synthesizer_name'].value_counts()
for synth, count in original_counts.items():
    print(f"  {synth}: {count} questions ({count/len(original_df)*100:.1f}%)")

print("\nCustom Distribution (Activity #1):")
custom_counts = custom_df['synthesizer_name'].value_counts()
for synth, count in custom_counts.items():
    print(f"  {synth}: {count} questions ({count/len(custom_df)*100:.1f}%)")

# Show example questions from each type
print("\n" + "="*80)
print("SAMPLE QUESTIONS COMPARISON")
print("="*80)

print("\nüìå SINGLE-HOP QUESTIONS (Simple, one chunk):")
single_hop_custom = custom_df[custom_df['synthesizer_name'].str.contains('single_hop')]
if len(single_hop_custom) > 0:
    print(f"Custom example: {single_hop_custom.iloc[0]['user_input']}")
else:
    print("(No single-hop questions generated)")

print("\nüìå MULTI-HOP ABSTRACT QUESTIONS (Conceptual synthesis):")
multi_abstract_custom = custom_df[custom_df['synthesizer_name'].str.contains('abstract')]
if len(multi_abstract_custom) > 0:
    print(f"Custom example: {multi_abstract_custom.iloc[0]['user_input']}")
    if len(multi_abstract_custom) > 1:
        print(f"Another example: {multi_abstract_custom.iloc[1]['user_input']}")

print("\nüìå MULTI-HOP SPECIFIC QUESTIONS (Detailed cross-referencing):")
multi_specific_custom = custom_df[custom_df['synthesizer_name'].str.contains('specific')]
if len(multi_specific_custom) > 0:
    print(f"Custom example: {multi_specific_custom.iloc[0]['user_input']}")
    if len(multi_specific_custom) > 1:
        print(f"Another example: {multi_specific_custom.iloc[1]['user_input']}")

Custom Distribution:
  - Single-hop (simple): 20% (was 50%)
  - Multi-hop abstract: 40% (was 25%)
  - Multi-hop specific: 40% (was 25%)

Generating test set with custom distribution...



Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]


CUSTOM TEST SET RESULTS


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Whaat is the significance of Chicago in the co...,[and my career has essentially been a three-de...,My career has essentially been a three-decade ...,single_hop_specifc_query_synthesizer
1,How do the Florida Keys relate to the Bayesian...,"[TODAY‚ÄôS THE DAY Fancy math aside, the foundat...",The Florida Keys are referenced as the area Fi...,single_hop_specifc_query_synthesizer
2,"How do hedge fund strategies, such as hedge fu...",[<1-hop>\n\nThe Alternative Investments Handbo...,"Hedge fund strategies, including hedge fund st...",multi_hop_abstract_query_synthesizer
3,How reinsurance as an investment like ILS and ...,[<1-hop>\n\nPART 5: INSURANCE-LINKED INVESTMEN...,"Reinsurance as an investment, including Insura...",multi_hop_abstract_query_synthesizer
4,how private equity concepts like J-curve effec...,[<1-hop>\n\nThe Alternative Investments Handbo...,The context explains that alternative investme...,multi_hop_abstract_query_synthesizer
5,Wht real estate as an asset class can help div...,[<1-hop>\n\nThe Alternative Investments Handbo...,Real estate as an asset class provides potenti...,multi_hop_abstract_query_synthesizer
6,"How do hedge fund strategies, particularly hed...",[<1-hop>\n\nThe Alternative Investments Handbo...,Hedge fund strategies such as long/short equit...,multi_hop_specific_query_synthesizer
7,How does PART 2's detailed explantion of real ...,[<1-hop>\n\nPART 2: REAL ESTATE INVESTMENTS Ch...,PART 2 provides an in-depth overview of real e...,multi_hop_specific_query_synthesizer
8,How does the recent performance of Stone Ridge...,[<1-hop>\n\nStandardized returns as of most re...,"The recent performance of Stone Ridge Energy, ...",multi_hop_specific_query_synthesizer
9,How Chapter 2 and Chapter 4 relate to real est...,[<1-hop>\n\nPART 2: REAL ESTATE INVESTMENTS Ch...,Chapter 4 discusses real estate as an asset cl...,multi_hop_specific_query_synthesizer



DISTRIBUTION COMPARISON

Original Distribution (from Cell 18):
  single_hop_specifc_query_synthesizer: 5 questions (45.5%)
  multi_hop_abstract_query_synthesizer: 3 questions (27.3%)
  multi_hop_specific_query_synthesizer: 3 questions (27.3%)

Custom Distribution (Activity #1):
  multi_hop_abstract_query_synthesizer: 4 questions (40.0%)
  multi_hop_specific_query_synthesizer: 4 questions (40.0%)
  single_hop_specifc_query_synthesizer: 2 questions (20.0%)

SAMPLE QUESTIONS COMPARISON

üìå SINGLE-HOP QUESTIONS (Simple, one chunk):
Custom example: Whaat is the significance of Chicago in the context of your career and how does it relate to your current work in Bayesian treasure hunting?

üìå MULTI-HOP ABSTRACT QUESTIONS (Conceptual synthesis):
Custom example: How do hedge fund strategies, such as hedge fund strategies and managed futures, contribute to diversification and risk management in a portfolio, especially considering their unique return streams and performance metrics like alph

---
# Part 2: RAG Evaluation with Ragas

Now that we have our synthetic test data, we can use it to **systematically evaluate** a RAG pipeline. The idea is simple:
1. Build a RAG application
2. Run our synthetic queries through it
3. Score the results using Ragas metrics
4. Make changes and measure the impact

This gives us a **data-driven approach** to improving our RAG system, rather than relying on vibes.

## Task 5: Building a Baseline RAG Application

We'll build a deliberately simple (and somewhat bad) RAG pipeline as our **baseline**, so we can clearly see the impact of improvements later.

Our baseline uses:
- Tiny chunks (50 characters) with no overlap
- A small embedding model (`text-embedding-3-small`)
- Only 3 retrieved documents
- A basic prompt

> **NOTE:** We use the same data that our synthetic test set was generated from ‚Äî this is required because the test questions are specifically designed for this investment data.

### R ‚Äî Retrieval

First, we chunk our documents and build a vector store.

In [17]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=0)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

2045

### ‚ùì Question #3:

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

##### ‚úÖ Answer:
the purpose of the chuck_overlap parameter is to help maintain context between chunks, ensuring that important information is no lost if a split occurs in the middle of a sentence by duplicating characters between adjacent chunks

In [18]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [19]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="baseline_rag",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="baseline_rag",
    embedding=embeddings,
)

_ = vector_store.add_documents(documents=split_documents)

In [20]:
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

In [21]:
def retrieve(state):
    retrieved_docs = retriever.invoke(state["question"])
    return {"context": retrieved_docs}

### A ‚Äî Augmented

A simple RAG prompt:

In [22]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful investment advisory assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### G ‚Äî Generation

We use `gpt-4.1-nano` for generation to avoid using the same model as our judge model.

In [23]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

In [24]:
def generate(state):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
    response = llm.invoke(messages)
    return {"response": response.content}

### Building the RAG Graph with LangGraph

In [25]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
    question: str
    context: List[Document]
    response: str

In [26]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a quick sanity check:

In [27]:
response = graph.invoke({"question": "What is Stone Ridge's investment philosophy?"})
response["response"]

'Stone Ridge\'s investment philosophy involves a relentless focus on growing, as indicated by their statement, "At Stone Ridge, we relentlessly focus on growing."'

With tiny 50-character chunks and only 3 retrieved documents, the baseline likely struggles to provide good answers about Stone Ridge's investment philosophy. That's intentional ‚Äî it gives us room to improve!

## Task 6: Evaluating with Ragas

Now we can evaluate our baseline RAG against the synthetic test data we generated in Part 1.

First, we run all the synthetic queries through our RAG pipeline to collect responses and retrieved contexts.

In [28]:
for test_row in testset:
    response = graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

Convert to an `EvaluationDataset` for smoother evaluation:

In [29]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(testset.to_pandas())

We select a **judge model** ‚Äî a separate, capable model that scores the outputs. Using a different model than the generator avoids self-evaluation bias.

In [30]:
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

### Running the Baseline Evaluation

We evaluate across six metrics:
- **Context Recall** ‚Äî did we retrieve the relevant context?
- **Faithfulness** ‚Äî is the answer grounded in the retrieved context?
- **Factual Correctness** ‚Äî is the answer factually correct vs. the reference?
- **Answer Relevancy** ‚Äî is the answer relevant to the question?
- **Context Entity Recall** ‚Äî did we capture the key entities from the reference context?
- **Noise Sensitivity** ‚Äî is the answer affected by irrelevant retrieved content?

In [31]:
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

baseline_result = evaluate(
    dataset=evaluation_dataset,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        ResponseRelevancy(),
        ContextEntityRecall(),
        NoiseSensitivity(),
    ],
    llm=evaluator_llm,
    run_config=custom_run_config,
)
baseline_result

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

{'context_recall': 0.2485, 'faithfulness': 0.4274, 'factual_correctness': 0.4891, 'answer_relevancy': 0.6022, 'context_entity_recall': 0.2465, 'noise_sensitivity_relevant': 0.0000}

## Task 7: Making Adjustments and Re-Evaluating

Now that we have a baseline, let's improve the pipeline and measure the impact. We'll make three changes:

1. **Larger chunks** (500 characters with 30 overlap instead of 50 with 0 overlap)
2. **More documents retrieved** (k=20 instead of k=3)
3. **Reranking with Cohere** ‚Äî retrieves 20 documents, then uses Cohere's reranker to select the top 5

Reranking is a technique that uses a cross-encoder model (slower but more accurate than embedding similarity) on a smaller subset of candidates to improve retrieval precision.

In [32]:
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
split_documents = text_splitter.split_documents(docs)
print(f"Improved chunking: {len(split_documents)} chunks (vs baseline)")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="improved_rag",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="improved_rag",
    embedding=embeddings,
)

_ = vector_store.add_documents(documents=split_documents)
adjusted_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Improved chunking: 202 chunks (vs baseline)


In [33]:
from langchain_classic.retrievers.contextual_compression import (
    ContextualCompressionRetriever,
)
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
    compressor = CohereRerank(model="rerank-v3.5")
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=adjusted_retriever,
        search_kwargs={"k": 5},
    )
    retrieved_docs = compression_retriever.invoke(state["question"])
    return {"context": retrieved_docs}

In [34]:
from typing import TypedDict, List
from langchain_core.documents import Document

class AdjustedState(TypedDict):
    question: str
    context: List[Document]
    response: str

adjusted_graph_builder = StateGraph(AdjustedState).add_sequence([retrieve_adjusted, generate])
adjusted_graph_builder.add_edge(START, "retrieve_adjusted")
adjusted_graph = adjusted_graph_builder.compile()

Let's verify the improved pipeline works:

In [36]:
response = adjusted_graph.invoke({"question": "How does Stone Ridge approach risk management in their energy investments?"})
response["response"]

'Stone Ridge approaches risk management in their energy investments through a combination of proprietary securitizations and an integrated approach that leverages "Flywheel physics" with their financial strategies. They avoid reliance on legacy-only reinsurers, which are considered riskier due to confirmation bias and adverse selection concerns. Instead, they carefully select their energy assets and utilize proprietary securitizations‚Äîwithout engaging bankers or incurring fee leakage‚Äîto manage risk effectively. Their rigorous and precise financial techniques enable them to navigate the highly volatile natural gas price range, ensuring disciplined risk management across their energy portfolio.'

### Running the Improved Evaluation

Now let's run the same synthetic test set through our improved pipeline and compare.

In [38]:
import time
import copy

rerank_testset = copy.deepcopy(testset)

for test_row in rerank_testset:
    response = adjusted_graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
    time.sleep(7)  # To avoid rate limiting

In [39]:
rerank_evaluation_dataset = EvaluationDataset.from_pandas(rerank_testset.to_pandas())

rerank_result = evaluate(
    dataset=rerank_evaluation_dataset,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        ResponseRelevancy(),
        ContextEntityRecall(),
        NoiseSensitivity(),
    ],
    llm=evaluator_llm,
    run_config=custom_run_config,
)
rerank_result

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

{'context_recall': 0.5727, 'faithfulness': 0.6853, 'factual_correctness': 0.5327, 'answer_relevancy': 0.9284, 'context_entity_recall': 0.3735, 'noise_sensitivity_relevant': 0.2050}

### ‚ùì Question #4:

Which system performed better, on what metrics, and why?

##### ‚úÖ Answer:
the rerank system improved on all 5 primary metrics because:
1. larger chunks provided complete thoughts instead of sentence fragments
2. reranking quality
3. better 


### ‚ùì Question #5:

What are the benefits and limitations of using synthetic data generation for RAG evaluation? Consider both the practical advantages and potential pitfalls.

##### ‚úÖ Answer:
I think the benefits of using synthetic data generations are:
1. can generate test data fast
2. can generate diverse tests
3. unbiased by human expectations

cons:
1. not real queries
2. quality depends highly on source documents
3. over optimization risk

### ‚ùì Question #6:

If you were building a production investment advisory assistant for Stone Ridge, which Ragas metrics would be most important to optimize for and why? Consider the financial services domain specifically.

##### ‚úÖ Answer:

I think the most important Regas metrics would be faithfulness because trust and liability matters a lot when it comes to the financial services domain. Investment advice based on hallucinated information could lead to poor financial decisions. With this being said, I would say topic adherence is the most important ragas metric.

### üèóÔ∏è Activity #2: Implement a Different Reranking Strategy

Experiment with different reranking parameters or strategies to see how they affect the evaluation metrics.

**Requirements:**
1. Modify the `retrieve_adjusted` function to use different parameters (e.g., change `k` values, try different `top_n` for reranking)
2. Or implement a different retrieval enhancement strategy (e.g., hybrid search, query expansion)
3. Run the evaluation and compare results with the baseline and reranking results above
4. Document your findings in the markdown cell below

In [None]:
### YOUR CODE HERE ###

# STRATEGY 1: Aggressive Reranking
# Retrieve even more candidates (k=30), but rerank down to top 3 (instead of 5)
# Hypothesis: Higher precision with best-of-the-best selection

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langgraph.graph import START, StateGraph
from typing import TypedDict, List
from langchain_core.documents import Document
import time
import copy

print("="*80)
print("STRATEGY 1: AGGRESSIVE RERANKING (k=30 ‚Üí rerank to top 3)")
print("="*80)

# Same chunking as improved version
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
split_documents = text_splitter.split_documents(docs)
print(f"Chunks: {len(split_documents)}")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="aggressive_rerank",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="aggressive_rerank",
    embedding=embeddings,
)

_ = vector_store.add_documents(documents=split_documents)
aggressive_retriever = vector_store.as_retriever(search_kwargs={"k": 30})  # Increased from 20

def retrieve_aggressive(state):
    compressor = CohereRerank(model="rerank-v3.5")
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=aggressive_retriever,
        search_kwargs={"k": 3},  # Reduced from 5 to 3 - only best of the best
    )
    retrieved_docs = compression_retriever.invoke(state["question"])
    return {"context": retrieved_docs}

class AggressiveState(TypedDict):
    question: str
    context: List[Document]
    response: str

aggressive_graph_builder = StateGraph(AggressiveState).add_sequence([retrieve_aggressive, generate])
aggressive_graph_builder.add_edge(START, "retrieve_aggressive")
aggressive_graph = aggressive_graph_builder.compile()

# Test it
print("\nTesting aggressive reranking...")
test_response = aggressive_graph.invoke({"question": "What is Stone Ridge's investment philosophy?"})
print(f"Sample response: {test_response['response'][:150]}...")

# Run evaluation
print("\nRunning evaluation (this will take ~70 seconds with rate limiting)...")
aggressive_testset = copy.deepcopy(testset)

for i, test_row in enumerate(aggressive_testset):
    response = aggressive_graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
    time.sleep(7)  # Rate limiting
    print(f"Processed {i+1}/{len(aggressive_testset)}")

from ragas import EvaluationDataset, evaluate, RunConfig
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)

aggressive_evaluation_dataset = EvaluationDataset.from_pandas(aggressive_testset.to_pandas())

custom_run_config = RunConfig(timeout=360)

aggressive_result = evaluate(
    dataset=aggressive_evaluation_dataset,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        ResponseRelevancy(),
        ContextEntityRecall(),
        NoiseSensitivity(),
    ],
    llm=evaluator_llm,
    run_config=custom_run_config,
)

print("\n" + "="*80)
print("RESULTS COMPARISON")
print("="*80)
print("\nBaseline (k=3, no rerank):")
print("  context_recall: 0.2485")
print("  faithfulness: 0.4274")
print("  factual_correctness: 0.4891")
print("  answer_relevancy: 0.6022")
print("  context_entity_recall: 0.2465")
print("  noise_sensitivity: 0.0000")

print("\nImproved Rerank (k=20 ‚Üí top 5):")
print("  context_recall: 0.5727")
print("  faithfulness: 0.6853")
print("  factual_correctness: 0.5327")
print("  answer_relevancy: 0.9284")
print("  context_entity_recall: 0.3735")
print("  noise_sensitivity: 0.2050")

print(f"\nAggressive Rerank (k=30 ‚Üí top 3):")
for metric, value in aggressive_result.items():
    print(f"  {metric}: {value:.4f}")

STRATEGY 1: AGGRESSIVE RERANKING (k=30 ‚Üí rerank to top 3)
Chunks: 202

Testing aggressive reranking...
Sample response: Stone Ridge's investment philosophy is centered on relentlessly focusing on growing after-tax cash flow to drive durable equity value in their operati...

Running evaluation (this will take ~70 seconds with rate limiting)...
Processed 1/11
Processed 2/11
Processed 3/11
Processed 4/11
Processed 5/11
Processed 6/11
Processed 7/11
Processed 8/11
Processed 9/11
Processed 10/11
Processed 11/11


Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]


RESULTS COMPARISON

Baseline (k=3, no rerank):
  context_recall: 0.2485
  faithfulness: 0.4274
  factual_correctness: 0.4891
  answer_relevancy: 0.6022
  context_entity_recall: 0.2465
  noise_sensitivity: 0.0000

Improved Rerank (k=20 ‚Üí top 5):
  context_recall: 0.5727
  faithfulness: 0.6853
  factual_correctness: 0.5327
  answer_relevancy: 0.9284
  context_entity_recall: 0.3735
  noise_sensitivity: 0.2050

Aggressive Rerank (k=30 ‚Üí top 3):


AttributeError: 'EvaluationResult' object has no attribute 'items'

In [41]:
print("\n" + "="*80)
print("RESULTS COMPARISON")
print("="*80)
print("\nBaseline (k=3, no rerank):")
print("  context_recall: 0.2485")
print("  faithfulness: 0.4274")
print("  factual_correctness: 0.4891")
print("  answer_relevancy: 0.6022")
print("  context_entity_recall: 0.2465")
print("  noise_sensitivity: 0.0000")

print("\nImproved Rerank (k=20 ‚Üí top 5):")
print("  context_recall: 0.5727")
print("  faithfulness: 0.6853")
print("  factual_correctness: 0.5327")
print("  answer_relevancy: 0.9284")
print("  context_entity_recall: 0.3735")
print("  noise_sensitivity: 0.2050")

print(f"\nAggressive Rerank (k=30 ‚Üí top 3):")
aggressive_result


RESULTS COMPARISON

Baseline (k=3, no rerank):
  context_recall: 0.2485
  faithfulness: 0.4274
  factual_correctness: 0.4891
  answer_relevancy: 0.6022
  context_entity_recall: 0.2465
  noise_sensitivity: 0.0000

Improved Rerank (k=20 ‚Üí top 5):
  context_recall: 0.5727
  faithfulness: 0.6853
  factual_correctness: 0.5327
  answer_relevancy: 0.9284
  context_entity_recall: 0.3735
  noise_sensitivity: 0.2050

Aggressive Rerank (k=30 ‚Üí top 3):


{'context_recall': 0.6879, 'faithfulness': 0.7151, 'factual_correctness': 0.5309, 'answer_relevancy': 0.9344, 'context_entity_recall': 0.3518, 'noise_sensitivity_relevant': 0.1816}

### Activity #2 Findings:

*Document your findings here: What strategy did you try? How did it compare to the baseline and reranking results?*

#### Strategy Tested: Aggressive Reranking (k=30 ‚Üí rerank to top 3)

**Configuration:**
- **Baseline:** chunk_size=50, k=3, no rerank
- **Improved Rerank:** chunk_size=500, overlap=30, k=20 ‚Üí rerank to top 5
- **My Strategy (Aggressive Rerank):** chunk_size=500, overlap=30, k=30 ‚Üí rerank to top 3
- **Rationale:** Cast an even wider net (k=30) but be more selective in final selection (top 3 instead of 5). Hypothesis: Higher initial recall combined with more aggressive filtering will improve precision and reduce noise while maintaining good context coverage.

#### Results Comparison Table

| Metric | Baseline | Improved Rerank (k=20‚Üí5) | Aggressive Rerank (k=30‚Üí3) | Winner |
|--------|----------|--------------------------|----------------------------|--------|
| **Context Recall** | 0.2485 | 0.5727 | **0.6879** | üèÜ Aggressive (+20%) |
| **Faithfulness** | 0.4274 | 0.6853 | **0.7151** | üèÜ Aggressive (+4%) |
| **Factual Correctness** | 0.4891 | 0.5327 | 0.5309 | Improved (tie) |
| **Answer Relevancy** | 0.6022 | 0.9284 | **0.9344** | üèÜ Aggressive (+1%) |
| **Context Entity Recall** | 0.2465 | **0.3735** | 0.3518 | üèÜ Improved (-6%) |
| **Noise Sensitivity** | 0.0000 | 0.2050 | **0.1816** | üèÜ Aggressive (-11%) |

**Overall Winner: Aggressive Rerank Strategy** - Wins 4 out of 6 metrics, ties on 1, loses on 1

---
## Summary

In this notebook, we went end-to-end from data generation to evaluation:

1. **Built a knowledge graph** from our investment documents (Stone Ridge 2025 Investor Letter and Alternative Investments Handbook) and used it to understand the structure of our data
2. **Generated synthetic test data** with diverse query types (single-hop, multi-hop abstract, multi-hop specific)
3. **Built a baseline RAG pipeline** with deliberately simple parameters
4. **Evaluated with Ragas** across six metrics to establish a baseline
5. **Improved the pipeline** with larger chunks and Cohere reranking
6. **Re-evaluated** to measure the impact of our changes

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration ‚Äî it provides high-quality signal without manually creating test data
- **Ragas metrics** give you a multi-dimensional view of RAG quality (retrieval vs. generation vs. faithfulness)
- **Small changes matter** ‚Äî chunk size, retrieval strategy, and reranking can dramatically affect evaluation scores
- **Always use a different model for judging** than for generating to avoid self-evaluation bias