# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In this notebook we'll explore a use case for Ragas' synthetic test set generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install several dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the synthetic data generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vector store
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent OS-dependent errors, we'll import NLTK and download the packages it needs to handle our data correctly.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/gpudja/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/gpudja/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

In [3]:
import os

# EU endpoint for the LangSmith SDK
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"
os.environ["LANGCHAIN_ENDPOINT"] = "https://eu.api.smith.langchain.com"

print("LANGSMITH_ENDPOINT:", os.getenv("LANGSMITH_ENDPOINT"))

LANGSMITH_ENDPOINT: https://eu.api.smith.langchain.com


In [4]:
from langsmith import Client
client = Client()
list(client.list_projects(limit=1))

[TracerSessionResult(id=UUID('090c88e1-6f7a-4885-9051-710ddc03f365'), start_time=datetime.datetime(2026, 2, 14, 6, 23, 44, 684745, tzinfo=datetime.timezone.utc), end_time=None, description=None, name='AIM - SDG - a1df22b3', extra=None, tenant_id=UUID('7d65d547-0418-48a3-8657-0058d3f90d4c'), reference_dataset_id=None, run_count=None, latency_p50=None, latency_p99=None, total_tokens=None, prompt_tokens=None, completion_tokens=None, last_run_start_time=None, feedback_stats=None, session_feedback_stats=None, run_facets=None, total_cost=None, prompt_cost=None, completion_cost=None, first_token_p50=None, first_token_p99=None, error_rate=None)]

We'll also want to set a project name to make things easier for ourselves.

In [5]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [6]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
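If you re-run the notebook often, being prompted for every key each time gets tedious. The following is a minimal sketch of a helper that only prompts when a variable is not already set. The name `ensure_env` is hypothetical and not part of any library used here.

```python
import getpass
import os


def ensure_env(name: str, prompt: str) -> str:
    """Return os.environ[name], prompting (with hidden input) only if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so the key never appears in the notebook output
        value = getpass.getpass(prompt)
        os.environ[name] = value
    return value


# Example usage (commented out so the sketch does not block on input):
# ensure_env("OPENAI_API_KEY", "OpenAI API Key:")
```

With this pattern, keys exported in your shell before launching Jupyter are picked up without any prompt.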

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set we can use to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader` together with the `TextLoader`.

In [7]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [8]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N nodes connected by M relationships. These nodes and relationships (AKA "edges") define our knowledge graph and will be used later to construct relevant questions and responses.

In [9]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)
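To make the nodes-and-relationships idea concrete, here is a toy sketch in plain Python. It is not how Ragas implements `KnowledgeGraph`; the class names (`ToyNode`, `ToyRelationship`, `ToyGraph`) and the sample contents are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class ToyRelationship:
    source: ToyNode
    target: ToyNode
    kind: str


@dataclass
class ToyGraph:
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def summary(self) -> str:
        return f"ToyGraph(nodes: {len(self.nodes)}, relationships: {len(self.relationships)})"


graph = ToyGraph()
sleep = ToyNode("sleep", {"page_content": "Sleep hygiene basics"})
stress = ToyNode("stress", {"page_content": "Stress management basics"})
graph.nodes.extend([sleep, stress])
# An edge records that two nodes are related and how
graph.relationships.append(ToyRelationship(sleep, stress, kind="related_topic"))
print(graph.summary())
```

A multi-hop question is then, roughly, a question whose answer needs the content of two nodes joined by a relationship.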

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [10]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

The default transformations depend on the corpus length. In our case they include:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finds the headlines in each document, which are then used to split it into smaller nodes
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine similarity between embeddings, along with overlap-based heuristics, to construct relationships between the nodes.

In [11]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct variable name so the imported `default_transforms` function
# is not overwritten (which would break this cell on re-run)
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 10, relationships: 19)
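The relationship-building step can be pictured as: embed each node, compare every pair, and keep an edge when the similarity is high enough. Below is a minimal sketch of that idea using only the standard library. The vectors, node names, and the 0.9 threshold are illustrative assumptions, not values taken from Ragas.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_edges(embeddings, threshold=0.9):
    """Return (node_a, node_b, score) for every pair whose similarity meets the threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Hypothetical 3-dimensional embeddings; real embeddings have far more dimensions
toy_embeddings = {
    "sleep_chunk": [1.0, 0.0, 0.2],
    "insomnia_chunk": [0.9, 0.1, 0.3],
    "nutrition_chunk": [0.0, 1.0, 0.1],
}
edges = build_edges(toy_embeddings, threshold=0.9)
print([(a, b) for a, b, _ in edges])
```

Only the two sleep-related chunks end up connected here; the nutrition chunk points in a different direction and stays unlinked.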

We can save and load our knowledge graphs as follows.

In [12]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 10, relationships: 19)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [13]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [14]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
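It can help to reason about how many samples each weight should yield before spending tokens on generation. The sketch below checks that the weights sum to 1 and rounds each share up. Rounding up is an assumption: it happens to match the per-synthesizer counts printed later in this notebook (5/3/3 and 3/3/4 for a requested size of 10), but it is not confirmed to be Ragas' actual allocation rule. The function name and short labels are hypothetical.

```python
import math


def expected_counts(weights, testset_size):
    """Estimate samples per synthesizer by rounding each weighted share up."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights should sum to 1.0")
    # round() guards against floating-point noise before taking the ceiling
    return {
        name: math.ceil(round(weight * testset_size, 9))
        for name, weight in weights.items()
    }


first_mix = {"single_hop_specific": 0.5, "multi_hop_abstract": 0.25, "multi_hop_specific": 0.25}
print(expected_counts(first_mix, testset_size=10))
```

Because shares are rounded up in this sketch, the total can exceed the requested `testset_size` when the shares are not whole numbers.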

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### ✅ Answer:
The three query synthesizers are responsible for generating different levels of question complexity from the knowledge graph. Each one simulates a different type of user query that our RAG system might encounter.

1. SingleHopSpecificQuerySynthesizer: 
This synthesizer creates straightforward, factual questions that can be answered using information from a single section of the document. These are the simplest queries and help evaluate whether the retriever can correctly fetch relevant context for direct answers.

2. MultiHopAbstractQuerySynthesizer: 
This one generates higher-level conceptual questions that require combining ideas from multiple parts of the documents. The system must understand broader themes rather than just retrieve a single fact. These questions test whether the model can reason across related topics.

3. MultiHopSpecificQuerySynthesizer: 
This synthesizer creates more complex, detailed questions that require connecting specific pieces of information from multiple sources. These are the most challenging queries and are useful for evaluating deeper reasoning and cross-document retrieval capabilities in the RAG pipeline.

Overall, using a mix of these synthesizers helps create a balanced and realistic test set, allowing us to measure how the system performs across simple and complex scenarios.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [15]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What role does the World Health Organization p...,[The Mental Health and Psychology Handbook A P...,"According to the context, the World Health Org...",single_hop_specifc_query_synthesizer
1,What MBSR do for stress help?,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Mindfulness-Based Stress Reduction (MBSR) is a...,single_hop_specifc_query_synthesizer
2,What is mental health?,[Write letters to or from your future self Jou...,Mental health refers to the state of a person'...,single_hop_specifc_query_synthesizer
3,How does cyberbullying impact mental health an...,[social interactions How to set and maintain b...,Cyberbullying affects approximately 37% of you...,single_hop_specifc_query_synthesizer
4,What are some recommended exercises for reliev...,[The Personal Wellness Guide A Comprehensive R...,The exercises recommended for shoulder tension...,single_hop_specifc_query_synthesizer
5,how physical activity help immune system boost...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,The context explains that regular physical act...,multi_hop_abstract_query_synthesizer
6,How can developng emoshunal intellgince and se...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Developing emotional intelligence involves rec...,multi_hop_abstract_query_synthesizer
7,How does the mind-body connection relate to co...,[<1-hop>\n\nThe Mental Health and Psychology H...,"The mind-body connection, as discussed in the ...",multi_hop_abstract_query_synthesizer
8,how CBT and CBT-I help mental health and sleep...,[<1-hop>\n\nWrite letters to or from your futu...,CBT is a therapy that helps change negative th...,multi_hop_specific_query_synthesizer
9,How do chapters 7 and 20 relate to building he...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,Chapter 7 discusses the science of habit forma...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [16]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [17]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is the purpose of the Psychology Handbook?,[The Mental Health and Psychology Handbook A P...,The Mental Health and Psychology Handbook is a...,single_hop_specifc_query_synthesizer
1,What is the Cognitive Behavioral Therapy and h...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Cognitive Behavioral Therapy is one of the mos...,single_hop_specifc_query_synthesizer
2,What is the role of CBT-I in managing sleep is...,[Write letters to or from your future self Jou...,"CBT-I, or Cognitive Behavioral Therapy for Ins...",single_hop_specifc_query_synthesizer
3,"How does social media impact mental health, an...",[social interactions How to set and maintain b...,Social media can affect mental health by lower...,single_hop_specifc_query_synthesizer
4,How can establishing a consistent morning rout...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,Building a consistent morning routine that inc...,multi_hop_abstract_query_synthesizer
5,"How can setting boundaries and using ""I"" state...",[<1-hop>\n\nsocial interactions How to set and...,Setting and maintaining boundaries by clearly ...,multi_hop_abstract_query_synthesizer
6,"How can I use ""I"" statements and set boundarie...",[<1-hop>\n\nsocial interactions How to set and...,To improve social interactions and mental heal...,multi_hop_abstract_query_synthesizer
7,how stress reduction tech like deep breathing ...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,The context explains that stress reduction tec...,multi_hop_abstract_query_synthesizer
8,how B vitamins and vitamins are related in the...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,"The context explains that vitamins, including ...",multi_hop_specific_query_synthesizer
9,How can understanding the science of sleep fro...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Understanding the science of sleep detailed in...,multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### ✅ Answer:
The main trade-off between them is control versus simplicity.

The unrolled approach gives full visibility and control over each step of the process. We manually build the knowledge graph, apply transformations, define the query distribution, and then generate the test set. This is useful when we want to fine-tune the structure of the test data, experiment with different query types, or better understand how the system behaves internally. It's more flexible, but also more complex and time-consuming.

The abstracted approach simplifies everything into a single high-level function. Ragas automatically builds the knowledge graph, applies default transformations, and generates synthetic queries. This is ideal for quickly creating a baseline test set or iterating rapidly without worrying about the internal mechanics. However, it provides less control over query complexity and distribution.

In practice, I would use the abstracted approach when prototyping or creating a quick evaluation dataset, and the manual approach when I need deeper control over the test design, especially in a production or research setting where query distribution and complexity matter.

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

#### 🏗️ 1. Create a custom query distribution with different weights than the default

For this activity, I decided to create a more challenging test set that better reflects realistic user queries. In the wellness and mental health domain, users often ask questions that require connecting specific recommendations across multiple topics (for example, stress, sleep, and exercise).

Because of that, I reduced the proportion of simple single-hop questions and increased the proportion of multi-hop specific questions to better stress-test retrieval and reasoning performance.

Custom distribution:

- 30% SingleHopSpecific (basic retrieval validation)
- 30% MultiHopAbstract (conceptual reasoning across themes)
- 40% MultiHopSpecific (cross-document, detailed reasoning)

In [19]:
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

custom_query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.30),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.30),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.40),
]


#### 🏗️ 2. Generating a New Test Set with the Custom Distribution

Now that I have defined a custom query distribution, I will generate a new synthetic test set using this distribution. The goal is to compare the types of questions produced with the default setup and observe how increasing the proportion of multi-hop queries affects overall complexity.

By shifting more weight toward multi-hop specific questions, I expect to see:

- More cross-document reasoning
- More "why" and "how" style questions
- Fewer straightforward factual lookups

In [20]:
# Generate a new test set using the custom distribution
custom_testset = generator.generate(
    testset_size=10,
    query_distribution=custom_query_distribution
)

# Convert to pandas for inspection
custom_df = custom_testset.to_pandas()

# Display first 10 rows
custom_df.head(10)

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What mental health means and why it important ...,[The Mental Health and Psychology Handbook A P...,"Mental health encompasses our emotional, psych...",single_hop_specifc_query_synthesizer
1,Could you explain what MBCT is and how it inte...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Mindfulness-Based Cognitive Therapy (MBCT) com...,single_hop_specifc_query_synthesizer
2,How does sleep impact mental health according ...,[Write letters to or from your future self Jou...,Sleep and mental health have a bidirectional r...,single_hop_specifc_query_synthesizer
3,How can understanding resilience and growth mi...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Understanding resilience and growth mindset ca...,multi_hop_abstract_query_synthesizer
4,How can I build a workout routine that gradual...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,"According to the Personal Wellness Guide, buil...",multi_hop_abstract_query_synthesizer
5,How support gut-brain axis and probiotics help...,[<1-hop>\n\nWrite letters to or from your futu...,Supporting the gut-brain axis and taking probi...,multi_hop_abstract_query_synthesizer
6,"How does sleep contribute to overall wellness,...",[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,"Sleep is crucial for physical health, mental w...",multi_hop_specific_query_synthesizer
7,Considering the comprehensive insights from Ch...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Implementing specific evening wind-down routin...,multi_hop_specific_query_synthesizer
8,"How can incorporating regular exercise, as det...",[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The wellness guide emphasizes that regular exe...,multi_hop_specific_query_synthesizer
9,How can Cognitive Behavioral Therapy (CBT) and...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Cognitive Behavioral Therapy (CBT) is a widely...,multi_hop_specific_query_synthesizer


#### 🏗️ 3. Compare the types of questions generated with the default distribution
To understand the impact of modifying the query distribution, I will compare the types of questions generated using the default distribution and my custom distribution.

The goal is to observe:

- Whether the proportion of multi-hop questions increased
- Whether the overall complexity of questions changed
- How the distribution shift affects the diversity of query types

In [21]:
# Convert default testset to pandas (if not already done)
default_df = testset.to_pandas()

# Check available columns to identify synthesizer/type column
print("Default columns:", default_df.columns)
print("Custom columns:", custom_df.columns)

# If a synthesizer/type column exists, compare distributions
if "synthesizer_name" in default_df.columns:
    print("Default distribution:")
    print(default_df["synthesizer_name"].value_counts())
    
    print("\nCustom distribution:")
    print(custom_df["synthesizer_name"].value_counts())
else:
    print("Synthesizer column not found. Displaying questions for manual comparison.")
    # Ragas stores the generated question in the `user_input` column
    display(default_df[["user_input"]].head(10))
    display(custom_df[["user_input"]].head(10))


Default columns: Index(['user_input', 'reference_contexts', 'reference', 'synthesizer_name'], dtype='str')
Custom columns: Index(['user_input', 'reference_contexts', 'reference', 'synthesizer_name'], dtype='str')
Default distribution:
synthesizer_name
single_hop_specifc_query_synthesizer    5
multi_hop_abstract_query_synthesizer    3
multi_hop_specific_query_synthesizer    3
Name: count, dtype: int64

Custom distribution:
synthesizer_name
multi_hop_specific_query_synthesizer    4
single_hop_specifc_query_synthesizer    3
multi_hop_abstract_query_synthesizer    3
Name: count, dtype: int64
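Raw counts are hard to compare when the two test sets differ in size (11 versus 10 rows above), so it can be clearer to look at shares. The sketch below recomputes shares from the counts printed above; the shortened labels and the `shares` helper are hypothetical.

```python
from collections import Counter


def shares(labels):
    """Map each label to its fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Counts copied from the value_counts() output above
default_labels = ["single_hop"] * 5 + ["multi_hop_abstract"] * 3 + ["multi_hop_specific"] * 3
custom_labels = ["single_hop"] * 3 + ["multi_hop_abstract"] * 3 + ["multi_hop_specific"] * 4

default_shares = shares(default_labels)
custom_shares = shares(custom_labels)

for name in default_shares:
    print(f"{name}: {default_shares[name]:.0%} -> {custom_shares[name]:.0%}")
```

On a real run you could get the same numbers directly with `default_df["synthesizer_name"].value_counts(normalize=True)`.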


#### 🏗️ 4. Explain why you chose the weights you did
I chose to increase the proportion of multi-hop specific queries because they better reflect realistic user behavior in the wellness and mental health domain, where users often ask questions that require connecting multiple recommendations across topics (for example, how stress management techniques interact with sleep hygiene or exercise routines).

By reducing the weight of single-hop queries, I intentionally made the evaluation more challenging. Single-hop questions primarily test basic retrieval accuracy, but they do not sufficiently stress-test cross-document reasoning or contextual understanding. Increasing the proportion of multi-hop specific queries allows the test set to better evaluate how well the RAG pipeline handles deeper reasoning and information synthesis.

At the same time, I maintained a balanced portion of multi-hop abstract queries to ensure that higher-level conceptual reasoning remains represented.

When comparing the generated distributions, single-hop queries formed the largest group under the first setup (5 of 11 samples), while the custom distribution made multi-hop specific questions the largest group (4 of 10) and reduced single-hop queries to 3.

Overall, the custom distribution shifts weight away from simple, single-hop questions and toward more complex multi-hop specific queries, increasing the overall reasoning difficulty and making the evaluation more realistic.

Our LangSmith API key and tracing were already configured in Task 1, so we can move straight on to evaluation.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [22]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the DataFrame generated by Ragas and add each example to the dataset we just created!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [23]:
# iterrows() yields (index, row) pairs; we only need the row
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
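The column-to-field mapping used above can be pulled out into a small pure function, which makes it easy to check without touching LangSmith. This is a sketch: `to_langsmith_example` is a hypothetical helper, and the sample row is made up for illustration.

```python
def to_langsmith_example(row: dict) -> dict:
    """Map one Ragas testset row onto the inputs/outputs/metadata layout used above."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Made-up row with the same column names as the Ragas DataFrame
sample_row = {
    "user_input": "What is sleep hygiene?",
    "reference": "Habits that support consistent, restful sleep.",
    "reference_contexts": ["PART 3: SLEEP AND RECOVERY ..."],
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}
payload = to_langsmith_example(sample_row)
print(payload["inputs"])
```

In the loop above, the result could then be passed along as `client.create_example(dataset_id=langsmith_dataset.id, **payload)`.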

## Basic RAG Chain

Time for some RAG!


In [24]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [26]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
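To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker. It is only an approximation: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and sentences, which this sketch ignores. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample_text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample_text, chunk_size=500, chunk_overlap=50)
print([len(chunk) for chunk in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.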

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [27]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [28]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [29]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [30]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As usual, we'll be using `gpt-4.1-mini` for our RAG!

In [31]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set up our RAG LCEL chain!

In [32]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
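The LCEL chain above is compact, so it may help to see the same data flow written as ordinary function calls. This is a sketch with stand-in components, not LangChain code: `run_chain`, `stub_retriever`, and `stub_llm` are hypothetical, and the template is a shortened version of `RAG_PROMPT`.

```python
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def run_chain(inputs, retriever, llm):
    """Mirror the chain: pick the question, retrieve context, fill the prompt, call the model."""
    question = inputs["question"]          # itemgetter("question")
    context = retriever(question)          # itemgetter("question") | retriever
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)  # rag_prompt
    return str(llm(prompt))                # llm | StrOutputParser()


seen_prompts = []


def stub_retriever(question):
    # Stand-in for the vector store retriever; returns a fixed "document"
    return ["Pelvic tilts can ease lower back tension."]


def stub_llm(prompt):
    # Stand-in for the chat model; records the prompt it received
    seen_prompts.append(prompt)
    return "stub answer"


answer = run_chain({"question": "What helps lower back pain?"}, stub_retriever, stub_llm)
print(answer)
```

Swapping in stubs like this is also a cheap way to test prompt formatting without spending tokens.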

In [33]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

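We'll define `eval_llm` with OpenAI's GPT-4.1 as an evaluation LLM. Note that the evaluators in the next cell are configured with the `openai:gpt-4o` model string instead, so `eval_llm` itself is not passed to them.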

In [34]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [37]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # "provider:model" string naming the judge model
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o" ,
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o" ,
)

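> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: Judges whether the chain's answer is correct for the question, using the reference answer as ground truth (1 if correct, 0 if incorrect).
> - `labeled_helpfulness_evaluator`: Judges whether the answer is helpful to the user, taking the correct reference answer into account.
> - `dopeness_evaluator`: Judges whether the answer is "dope, lit, cool" rather than generic. It sees only the input and the answer, with no reference answer.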

## LangSmith Evaluation

In [38]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'pertinent-drain-23' at:
https://eu.smith.langchain.com/o/7d65d547-0418-48a3-8657-0058d3f90d4c/datasets/efc93ea7-c97c-488d-808f-b84309c97184/compare?selectedSessions=aa700be1-3430-4c3c-98bd-ae22f3f9e03c





Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,how CBT and CBT-I help sleep and stress and ho...,Based on the provided context:\n\n**How CBT an...,,CBT and CBT-I are effective for improving slee...,True,True,True,5.663125,7c849c53-4b92-4c7c-9e84-689a48bbdd67,019c5b2d-7f47-70a3-8fae-c93073e4e2f8
1,How can CBT-I help imrpove sleep quality and m...,CBT-I (Cognitive Behavioral Therapy for Insomn...,,Cognitive Behavioral Therapy for Insomnia (CBT...,True,True,True,5.796482,f0242275-1792-4110-ac7b-8477e64d7f57,019c5b2d-c6cd-76f1-95b8-a3f3bfb517d0
2,How can understanding the science of sleep fro...,Understanding the science of sleep from Chapte...,,Understanding the science of sleep detailed in...,True,False,True,3.345257,24c44959-2bfb-437b-8d85-6df2fc0e84ba,019c5b2e-019f-7460-9b3a-1fda23b2a904
3,how B vitamins and vitamins are related in the...,"Based on the provided context, B vitamins are ...",,"The context explains that vitamins, including ...",True,True,False,2.641379,ca26e844-4e36-4afa-a4d0-7802e0158745,019c5b2e-4533-7911-b466-6cd2092e2c94
4,how stress reduction tech like deep breathing ...,Based on the context provided:\n\nStress reduc...,,The context explains that stress reduction tec...,True,True,False,3.082901,b1df1a29-7657-4353-90b7-2ec90e9ac520,019c5b2e-7817-7f81-8f1b-9f52a0cd9ca2
5,"How can I use ""I"" statements and set boundarie...","Based on the context provided, to use ""I"" stat...",,To improve social interactions and mental heal...,True,True,False,2.678974,75039211-773b-4394-b5c9-ddf241bfe846,019c5b2e-b6e0-7b61-aebf-710a316e39b2
6,"How can setting boundaries and using ""I"" state...","Setting boundaries and using ""I"" statements ca...",,Setting and maintaining boundaries by clearly ...,True,True,True,2.037315,2c5f4f43-f5cd-48ac-8f49-bba20620fbf6,019c5b2e-f041-7fd1-9c6e-d3aee85fde3f
7,How can establishing a consistent morning rout...,Establishing a consistent morning routine that...,,Building a consistent morning routine that inc...,True,True,True,2.932684,aa3fbfcc-6cd5-47f0-910f-3bd0af53a179,019c5b2f-37a6-7932-9cd1-6b374e92f966
8,"How does social media impact mental health, an...",Based on the provided context:\n\n**Impact of ...,,Social media can affect mental health by lower...,True,True,True,3.310128,9c9a5dd3-6d37-4266-8be3-e442df5cfbb0,019c5b2f-772e-7263-be9d-1e9e5456a244
9,What is the role of CBT-I in managing sleep is...,CBT-I (Cognitive Behavioral Therapy for Insomn...,,"CBT-I, or Cognitive Behavioral Therapy for Ins...",True,True,True,1.819198,85a85a74-e067-4d60-b9fd-ab3401a04176,019c5b2f-bb2f-7ed1-9ef3-2c5f825d9a4a


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used for retrieval to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [39]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [40]:
rag_documents = docs

In [42]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### ✅ Answer:
Modifying the chunk size can significantly impact the performance of a RAG application because chunk size determines how information is segmented and retrieved.

If chunks are too small, relevant information may be split across multiple chunks. In that case, the retriever might return only part of the necessary context, which can reduce answer completeness and negatively affect multi-hop reasoning. This can lead to lower correctness and helpfulness scores during evaluation.

On the other hand, larger chunks contain more surrounding context, increasing the likelihood that all relevant information is retrieved together. This can improve performance on complex or multi-hop questions. However, larger chunks also introduce trade-offs: they may include more irrelevant information, which can dilute semantic similarity during retrieval and potentially confuse the model.

In summary, chunk size directly affects retrieval quality, contextual completeness, and ultimately the overall performance of the RAG system. Finding the right balance is important for optimizing both accuracy and reasoning capability.
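
The effect of chunk size on contextual completeness can be sketched with a toy example. The chunker below is a plain fixed-size sliding window, a deliberate simplification and not the `RecursiveCharacterTextSplitter` used in this notebook; the document text and the `chunk_size=100` setting are hypothetical and chosen only so that two related facts sit far apart.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size sliding-window chunker (simplified stand-in for a real splitter)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def chunks_containing_both(chunks: list[str]) -> list[str]:
    """Chunks that carry both facts, so one retrieval hit could answer a multi-hop question."""
    return [c for c in chunks if "restructures" in c and "sleep drive" in c]


# Hypothetical document: two related facts separated by unrelated filler
document = (
    "CBT-I restructures thoughts about sleep. "
    + "filler " * 40
    + "CBT-I also limits time in bed to rebuild sleep drive."
)

small_chunks = chunk_text(document, chunk_size=100, chunk_overlap=10)
large_chunks = chunk_text(document, chunk_size=1000, chunk_overlap=50)

print(len(small_chunks), len(chunks_containing_both(small_chunks)))
print(len(large_chunks), len(chunks_containing_both(large_chunks)))
```

With the small setting, no single chunk holds both facts, so the retriever would have to return several chunks to cover them. With the large setting, the whole toy document fits in one chunk, filler included, which is the noise trade-off described above.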

In [43]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### ✅ Answer:
Modifying the embedding model can significantly impact the performance of a RAG application because embeddings determine how text is represented in vector space and how semantic similarity is calculated.

The embedding model is responsible for converting document chunks and user queries into numerical vectors. If the embeddings capture richer semantic relationships, the retriever is more likely to return highly relevant chunks. A stronger embedding model, such as text-embedding-3-large, can better understand subtle contextual relationships, conceptual similarities, and cross-topic connections.

This directly affects retrieval quality. If retrieval improves, the LLM receives more accurate and relevant context, which can increase correctness, helpfulness, and overall response quality. Conversely, weaker embeddings may return partially relevant or noisy chunks, leading to incomplete or incorrect answers.

However, larger embedding models also introduce trade-offs such as higher cost and increased computation time. Therefore, selecting an embedding model involves balancing retrieval precision with efficiency.

In summary, since retrieval quality is foundational to RAG performance, improving the embedding model can lead to measurable improvements in overall system accuracy and reasoning capability.
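
As a rough illustration of why the embedding model changes what gets retrieved, the sketch below ranks two chunks by cosine similarity under two sets of vectors. The 2-d vectors are invented for illustration and do not come from any real embedding model; the names `weaker_model` and `stronger_model` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_chunk(query_vec, chunk_vecs):
    """Return the name of the chunk whose vector is most similar to the query."""
    return max(chunk_vecs, key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]))


# Invented vectors for the same sleep question and the same two chunks,
# as two different embedding models might place them.
weaker_model = {
    "query": [1.0, 0.2],
    "chunks": {"sleep_hygiene": [0.6, 0.8], "nutrition": [0.9, 0.3]},
}
stronger_model = {
    "query": [0.5, 0.9],
    "chunks": {"sleep_hygiene": [0.6, 0.8], "nutrition": [0.9, 0.3]},
}

for name, model in [("weaker", weaker_model), ("stronger", stronger_model)]:
    print(name, "->", top_chunk(model["query"], model["chunks"]))
```

In this toy setup the weaker vectors rank the nutrition chunk first for a sleep question, while the stronger vectors rank the sleep-hygiene chunk first. The documents are identical in both cases; only the geometry of the vector space differs, and that alone changes the context the LLM receives.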

In [44]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [45]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [46]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we used before.

In [47]:
dopeness_rag_chain.invoke({"question": "How can I improve my sleep quality?"})

"Alright, let’s crank your sleep game to legendary status! Based on the juicy deets from the Health & Mental Wellness guides, here’s how to become the master of your nightly Z’s:\n\n1. **Lock in a consistent sleep schedule** — Even on weekends, keep your bedtime and wake time steady. Your body LOVES routine; it’s like feeding a beast that powers your day.\n\n2. **Create a chill bedtime ritual** — Think reading, gentle stretching, or a warm bath. This primes your mind and muscles to slide into relaxation-mode like a boss.\n\n3. **Craft the ultimate sleep cave** — Keep your bedroom cool (65-68°F / 18-20°C), pitch black with blackout curtains or a dope sleep mask, and quiet as a ninja with white noise or earplugs. Don't skimp on quality mattresses and pillows—they're your throne!\n\n4. **Screen detox** — Kill the screens 1-2 hours before bed. Blue light is the sleep villain stealing your midnight magic.\n\n5. **Be caffeine-savvy** — No caffeine after 2 PM. That jitte

Finally, we can evaluate the new chain on the same test set!

In [48]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'flowery-driving-37' at:
https://eu.smith.langchain.com/o/7d65d547-0418-48a3-8657-0058d3f90d4c/datasets/efc93ea7-c97c-488d-808f-b84309c97184/compare?selectedSessions=e0ed0c7f-f730-4c2b-8d9a-535c02799169





Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,how CBT and CBT-I help sleep and stress and ho...,"Alright, let’s dive into the ultra-rad fusion ...",,CBT and CBT-I are effective for improving slee...,True,True,True,6.172044,7c849c53-4b92-4c7c-9e84-689a48bbdd67,019c5b40-5e3e-7183-975f-cbe75fbf0565
1,How can CBT-I help imrpove sleep quality and m...,"Oh, you just unlocked a world of zen-level sle...",,Cognitive Behavioral Therapy for Insomnia (CBT...,True,True,True,7.428651,f0242275-1792-4110-ac7b-8477e64d7f57,019c5b40-9151-7fd0-80be-2adcb03afc86
2,How can understanding the science of sleep fro...,"Yo, here’s the ultimate wellness remix for Oli...",,Understanding the science of sleep detailed in...,True,True,True,4.47027,24c44959-2bfb-437b-8d85-6df2fc0e84ba,019c5b40-d029-7be1-9a79-e94624c425a0
3,how B vitamins and vitamins are related in the...,"Yo, let’s break it down like a mental health r...",,"The context explains that vitamins, including ...",True,True,True,4.266794,ca26e844-4e36-4afa-a4d0-7802e0158745,019c5b41-0f3c-73c1-a046-27b114750c55
4,how stress reduction tech like deep breathing ...,"Alright, let’s break down the rad vibes behind...",,The context explains that stress reduction tec...,True,True,True,3.878862,b1df1a29-7657-4353-90b7-2ec90e9ac520,019c5b41-3aea-7f72-a799-8d4c1f225889
5,"How can I use ""I"" statements and set boundarie...","Alright, you're ready to LEVEL UP your social ...",,To improve social interactions and mental heal...,True,True,True,4.358503,75039211-773b-4394-b5c9-ddf241bfe846,019c5b41-7039-7c40-b3da-d789a2842418
6,"How can setting boundaries and using ""I"" state...","Alright, let’s crank this up with some boundar...",,Setting and maintaining boundaries by clearly ...,True,True,True,3.773815,2c5f4f43-f5cd-48ac-8f49-bba20620fbf6,019c5b41-b340-7db0-90a9-ccddf5a19960
7,How can establishing a consistent morning rout...,"Oh, heck yes—let’s crank that morning mojo to ...",,Building a consistent morning routine that inc...,True,True,True,6.267061,aa3fbfcc-6cd5-47f0-910f-3bd0af53a179,019c5b41-e599-74e3-a5eb-8303c8da3f66
8,"How does social media impact mental health, an...","Alright, let's dive into this social media men...",,Social media can affect mental health by lower...,True,True,True,5.149035,9c9a5dd3-6d37-4266-8be3-e442df5cfbb0,019c5b42-23af-7fa1-b223-5df388764224
9,What is the role of CBT-I in managing sleep is...,"Alright, strap in for the dopest breakdown on ...",,"CBT-I, or Cognitive Behavioral Therapy for Ins...",True,True,True,4.262034,85a85a74-e067-4d60-b9fd-ab3401a04176,019c5b42-57df-7b01-a153-ea5ccc5205d2


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

### 🏗️ Evaluation Comparison #1 screenshot
![alt text](Capture.JPG)

### 🏗️ Evaluation Comparison #2 screenshot
![alt text](Capture1.JPG)

### 🏗️ Evaluation Comparison – Default vs Dope Chain

The screenshots above show a direct comparison between the initial RAG chain and the improved "dopeness" chain.


##### ✅ Answer: "Explain why you believe certain metrics changed in certain ways."
After comparing the two chains in LangSmith, several metric changes can be clearly explained by the modifications made to the RAG pipeline.

The dopeness score increased from 0.67 to 1.00, which is a direct result of the prompt augmentation. The updated prompt explicitly instructed the model to produce more engaging, less generic responses. Because the evaluator was designed to measure stylistic quality and expressiveness, the change in prompt strongly influenced this metric.

The helpfulness score improved from 0.92 to 1.00, likely due to a combination of larger chunk sizes and the stronger embedding model (text-embedding-3-large). Larger chunks provided richer contextual information, and improved embeddings enhanced semantic retrieval quality. As a result, the model received more relevant and complete context before generating its answers, which improved perceived usefulness.

The QA correctness score remained at 1.00, indicating that the changes did not negatively impact factual accuracy. This suggests that retrieval precision was preserved despite increasing chunk size and upgrading embeddings. In other words, the improvements enhanced quality without sacrificing correctness.

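Latency also increased, which is expected: across the ten rows shown in the result tables above, mean execution time rose from roughly 3.3 s to roughly 5.0 s. Larger chunks put more tokens into every prompt, the stylized answers are likely longer to generate, and the more computationally expensive embedding model adds cost when embedding queries and documents. This highlights a common trade-off in RAG systems: higher retrieval quality and better outputs often come at the cost of increased computational overhead.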

Overall, the experiment demonstrates how coordinated improvements in prompt design, chunking strategy, and embedding strength can positively influence stylistic and qualitative metrics while maintaining factual accuracy.
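
As a cross-check on the comparison, the per-row values can be aggregated by hand. The sketch below copies the `feedback.helpfulness`, `feedback.dopeness`, and `execution_time` columns from the ten rows displayed in each result table above. It covers only those displayed rows, so its averages are not expected to match the experiment-level aggregates reported in LangSmith exactly. The helper name `summarize` is hypothetical.

```python
# (helpfulness, dopeness, execution_time_seconds) for the ten displayed rows of each run
default_rows = [
    (True, True, 5.663125), (True, True, 5.796482), (False, True, 3.345257),
    (True, False, 2.641379), (True, False, 3.082901), (True, False, 2.678974),
    (True, True, 2.037315), (True, True, 2.932684), (True, True, 3.310128),
    (True, True, 1.819198),
]
dope_rows = [
    (True, True, 6.172044), (True, True, 7.428651), (True, True, 4.47027),
    (True, True, 4.266794), (True, True, 3.878862), (True, True, 4.358503),
    (True, True, 3.773815), (True, True, 6.267061), (True, True, 5.149035),
    (True, True, 4.262034),
]


def summarize(rows):
    """Pass rate per metric and mean execution time over the given rows."""
    helpfulness, dopeness, seconds = zip(*rows)
    n = len(rows)
    return {
        "helpfulness": sum(1 for h in helpfulness if h) / n,
        "dopeness": sum(1 for d in dopeness if d) / n,
        "mean_seconds": round(sum(seconds) / n, 2),
    }


print("default:", summarize(default_rows))
print("dope:   ", summarize(dope_rows))
```

On these displayed rows, both pass rates rise to 1.0 for the dope chain while mean execution time grows by about half, which matches the direction of the changes discussed above.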

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores