# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [19]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/mbracic/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/mbracic/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [20]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [21]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps RAGAS build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using a `DirectoryLoader` that reads each `.txt` file with the `TextLoader`.

In [22]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/HealthWellnessGuide.txt', 'data/MentalHealthGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [23]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [24]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [25]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

Which default transformations are applied depends on the corpus length. In our case (see the progress output of the next cell), they include:

- Headline extraction and splitting -> finds the headlines in each document and splits the document into smaller nodes along them
- Summary extraction -> produces summaries of the documents
- Theme and named-entity extraction -> extracts broad themes and entities from the nodes

It then uses cosine similarity and overlap heuristics between the embeddings and properties produced by the above transformations to construct relationships between the nodes.
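To make the relationship-building step more concrete, here is a minimal sketch of linking nodes whose embeddings are similar. It is not Ragas' implementation: the toy vectors, the `build_relationships` helper, and the `SIMILARITY_THRESHOLD` value are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (hypothetical values); real embeddings
# have hundreds or thousands of dimensions.
toy_embeddings = {
    "sleep_summary": [0.9, 0.1, 0.0],
    "stress_summary": [0.8, 0.3, 0.1],
    "nutrition_summary": [0.0, 0.2, 0.9],
}

SIMILARITY_THRESHOLD = 0.9  # hypothetical cutoff, not a Ragas default


def build_relationships(embeddings, threshold):
    """Return (node_a, node_b, score) for every pair at or above the threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


relationships = build_relationships(toy_embeddings, SIMILARITY_THRESHOLD)
print(relationships)
```

Only the sleep and stress nodes end up connected here, because their vectors point in nearly the same direction. Lowering the threshold would add more (and weaker) edges to the graph.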

In [26]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

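# Use a distinct name so the imported `default_transforms` function is not shadowed
# (shadowing it would break this cell if it were run a second time).
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)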
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 9, relationships: 17)

We can save and load our knowledge graphs as follows.

In [27]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 9, relationships: 17)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [28]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [29]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
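As a rough mental model of what the weights mean, the sketch below turns a weight per synthesizer into a whole number of questions for a given test set size. The largest-remainder rounding used here is an assumption made for illustration, and `allocate_counts` is a hypothetical helper; Ragas' own allocation may differ, and the number of samples it actually generates can deviate from `testset_size`.

```python
import math


def allocate_counts(weights, testset_size):
    """Split testset_size across synthesizers in proportion to their weights.

    Uses largest-remainder rounding so the counts always sum to testset_size.
    """
    raw = {name: weight * testset_size for name, weight in weights.items()}
    counts = {name: math.floor(value) for name, value in raw.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining slots to the entries with the largest fractional part.
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_counts(weights, testset_size=10)
print(counts)
```

With 10 questions, the two 0.25 weights each ask for 2.5 questions, so one of them has to round up and the other down. This is why small test sets rarely match the requested ratios exactly.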

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
- `SingleHopSpecificQuerySynthesizer`: Generates simple, concrete questions that can be answered from a single part of the context (one document/chunk), asking for specific facts, definitions, or direct information.
- `MultiHopAbstractQuerySynthesizer`: Generates questions that require combining multiple parts of the context to produce an abstract conclusion, involving reasoning, comparisons, or general principles.
- `MultiHopSpecificQuerySynthesizer`: Also requires multi-hop reasoning (combining multiple sources), but aims for a concrete, precise answer, seeking specific facts from multiple documents.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [30]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What role do minerals play in maintaining a ba... | [PART 2: NUTRITION AND DIET Chapter 4: Fundame... | Minerals are inorganic elements like calcium, ... | single_hop_specifc_query_synthesizer |
| 1 | How does social connection influence overall h... | [13: The Science of Habit Formation Habits are... | According to the provided context, strong soci... | single_hop_specifc_query_synthesizer |
| 2 | What activities or routines are recommended fo... | [The Personal Wellness Guide A Comprehensive R... | The wellness guide suggests that Wednesdays ar... | single_hop_specifc_query_synthesizer |
| 3 | Whaat is the role of the World Health Organiza... | [The Mental Health and Psychology Handbook A P... | According to the context, the World Health Org... | single_hop_specifc_query_synthesizer |
| 4 | What is the significance of Stanford in the co... | [PART 3: BUILDING RESILIENCE Chapter 7: What I... | Stanford is referenced in the context of Chapt... | single_hop_specifc_query_synthesizer |
| 5 | How can managing digital mental health challen... | [<1-hop>\n\n- Sleep restriction: Limiting time... | Managing digital mental health challenges rela... | multi_hop_abstract_query_synthesizer |
| 6 | How can self-care practices such as mindfulnes... | [<1-hop>\n\n- Sleep restriction: Limiting time... | Self-care practices like mindfulness, regular ... | multi_hop_abstract_query_synthesizer |
| 7 | How does supporting digestive health through d... | [<1-hop>\n\nThe Mental Health and Psychology H... | Supporting digestive health through diet and l... | multi_hop_abstract_query_synthesizer |
| 8 | How do B vitamins, as discussed in the context... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | B vitamins are highlighted in the context of n... | multi_hop_specific_query_synthesizer |
| 9 | Chapter 12 and 13 how they help mental health ... | [<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte... | Chapter 12 explains how sleep affects mental h... | multi_hop_specific_query_synthesizer |


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [31]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [32]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What role do proteins play in mental health an... | [The Personal Wellness Guide A Comprehensive R... | The provided context discusses exercise, nutri... | single_hop_specifc_query_synthesizer |
| 1 | What is non-REM sleep? | [PART 3: SLEEP AND RECOVERY Chapter 7: The Sci... | During sleep, your body repairs tissues, conso... | single_hop_specifc_query_synthesizer |
| 2 | Can you explain what an APPENDIX is in the con... | [PART 5: BUILDING HEALTHY HABITS Chapter 13: T... | The provided context does not define or descri... | single_hop_specifc_query_synthesizer |
| 3 | Wht is the United States? | [The Mental Health and Psychology Handbook A P... | The context provided does not include specific... | single_hop_specifc_query_synthesizer |
| 4 | Hwo can nutriens like Omega-3 and B vitams hel... | [<1-hop>\n\nWrite letters to or from your futu... | The context explains that key nutrients such a... | multi_hop_abstract_query_synthesizer |
| 5 | How does exercize and physical activty impact ... | [<1-hop>\n\nThe Mental Health and Psychology H... | Exercise and physical activity are highly effe... | multi_hop_abstract_query_synthesizer |
| 6 | How does engaging in regular physical activity... | [<1-hop>\n\nThe Personal Wellness Guide A Comp... | Engaging in regular physical activity, as outl... | multi_hop_abstract_query_synthesizer |
| 7 | Hwo can I use stress management and relaxation... | [<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch... | The context provides strategies for managing s... | multi_hop_abstract_query_synthesizer |
| 8 | Considering the importance of managing digital... | [<1-hop>\n\nsocial interactions How to set and... | Establishing a consistent morning routine, as ... | multi_hop_specific_query_synthesizer |
| 9 | how mental health resources help with mental h... | [<1-hop>\n\nThe Mental Health and Psychology H... | The context explains that mental health encomp... | multi_hop_specific_query_synthesizer |


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
Unrolled (manual) approach offers full control over the process, customization of transformations, and the ability to save and reuse the knowledge graph. It also enables debugging and inspection of the knowledge graph structure. Disadvantages include higher complexity, more code, and a deeper understanding of the process. I would use it when I need fine-grained control, want to customize transformations, plan to reuse the knowledge graph, or need to debug the generation process.
Abstracted (automatic) approach offers simplicity, faster implementation, and less code. Disadvantages include less control over intermediate steps, limited customization, and less transparency. I would use it when I want quick results, when prototyping, when I don't need to customize the knowledge graph structure, or when I'm new to Ragas.
I would choose the unrolled approach for production systems with specific requirements, when I need reproducibility, or when I want to optimize or debug specific parts of the process. I would choose the abstracted approach for rapid iteration, experimentation, learning, or when speed of implementation is the priority.

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

I chose the 0.6/0.2/0.2 distribution to prioritize simple single-hop questions for a stable baseline while keeping a meaningful share of complex multi-hop questions. The weights define the approximate percentage of each question type: 60% single-hop, 20% multi-hop abstract, and 20% multi-hop specific.

Compared to the default distribution (0.5/0.25/0.25), this increases the target share of single-hop questions from 50% to 60% and reduces the share of multi-hop questions from 50% to 40%. This keeps the test set diverse while making most queries easier to interpret when analyzing performance.

Comparing the `synthesizer_name` counts between the default and custom test sets shows how changing the weights directly changes the structure of the generated questions. The custom distribution generated 6 single-hop questions, 2 multi-hop abstract questions, and 2 multi-hop specific questions out of 10 total, which matches the intended 60/20/20 split.

In [56]:
### YOUR CODE HERE ###

# Define a custom query distribution with different weights
# Generate a new test set and compare with the default
generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)
custom_query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.6),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.2),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.2),
]

# Use our `TestsetGenerator` to generate the custom test set
custom_testset = generator.generate(testset_size=10, query_distribution=custom_query_distribution)
custom_df = custom_testset.to_pandas()
display(custom_df)

# Count the number of questions generated by each synthesizer
testset_df = testset.to_pandas()
default_count = testset_df['synthesizer_name'].value_counts()
custom_count = custom_df['synthesizer_name'].value_counts()

# Convert the counts to percentages of each test set
default_percentages = default_count / len(testset_df) * 100
custom_percentages = custom_count / len(custom_df) * 100

print(f"Default percentages: {default_percentages}")
print(f"Custom percentages: {custom_percentages}")

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What minerals are important for maintaining ov... | [PART 2: NUTRITION AND DIET Chapter 4: Fundame... | Minerals are inorganic elements like calcium, ... | single_hop_specifc_query_synthesizer |
| 1 | What does the appendix include as quick stress... | [13: The Science of Habit Formation Habits are... | The appendix includes quick stress relief meth... | single_hop_specifc_query_synthesizer |
| 2 | Whaat is the importance of neck in holistc hea... | [The Personal Wellness Guide A Comprehensive R... | The context discusses neck tension caused by d... | single_hop_specifc_query_synthesizer |
| 3 | As a Holistic Wellness Coach, how does the con... | [The Mental Health and Psychology Handbook A P... | Mental health encompasses our emotional, psych... | single_hop_specifc_query_synthesizer |
| 4 | What is Stanford about and how does it relate ... | [PART 3: BUILDING RESILIENCE Chapter 7: What I... | The context discusses Stanford in relation to ... | single_hop_specifc_query_synthesizer |
| 5 | What is Mindfulness-Based Stress Reduction and... | [PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog... | Mindfulness-Based Stress Reduction (MBSR) is a... | single_hop_specifc_query_synthesizer |
| 6 | how build habits and improve emotional intelli... | [<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte... | building habits involves understanding the hab... | multi_hop_abstract_query_synthesizer |
| 7 | H0w can practises like MBSR and growth mindset... | [<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte... | Practices like Mindfulness-Based Stress Reduct... | multi_hop_abstract_query_synthesizer |
| 8 | How does cognitive therapy relate to cognitive... | [<1-hop>\n\n- Sleep restriction: Limiting time... | Cognitive therapy involves addressing beliefs ... | multi_hop_specific_query_synthesizer |
| 9 | How do B vitamins, as discussed in the context... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | B vitamins are essential nutrients found in fo... | multi_hop_specific_query_synthesizer |


Default percentages: synthesizer_name
single_hop_specifc_query_synthesizer    45.454545
multi_hop_abstract_query_synthesizer    27.272727
multi_hop_specific_query_synthesizer    27.272727
Name: count, dtype: float64
Custom percentages: synthesizer_name
single_hop_specifc_query_synthesizer    60.0
multi_hop_abstract_query_synthesizer    20.0
multi_hop_specific_query_synthesizer    20.0
Name: count, dtype: float64


LangSmith relies on the API key and the tracing flag (`LANGCHAIN_TRACING_V2` set to "true") that we provided in Task 1.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [57]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [58]:
# Each Ragas row becomes one LangSmith example: question in, reference answer out.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
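The reshaping from Ragas' column names to the `question`/`answer` format can also be isolated in a small function, which makes it easy to check without touching the LangSmith API. This is a sketch: `to_langsmith_example` is a hypothetical helper, and the placeholder strings in `sample_row` stand in for real testset values.

```python
def to_langsmith_example(row):
    """Hypothetical helper: reshape one Ragas testset row into LangSmith fields."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# A stand-in row using the Ragas column names; the context and reference
# values are placeholders, not real generated data.
sample_row = {
    "user_input": "What is non-REM sleep?",
    "reference_contexts": ["<reference context text>"],
    "reference": "<reference answer text>",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"])
```

The resulting dictionary has the same three keyword arguments that the loop above passes to `client.create_example`, so the two stay in sync if the column mapping ever changes.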

## Basic RAG Chain

Time for some RAG!


In [60]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [61]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [62]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [63]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [64]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
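Conceptually, a retriever with `k=10` embeds the question and returns the 10 chunks whose vectors are most similar to it. The sketch below shows that idea with toy 2-dimensional vectors and `k=2`. The vectors, chunk names, and the `top_k` helper are assumptions for illustration; Qdrant performs this search with its own indexing and distance configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical chunk embeddings (real ones come from the embedding model).
chunk_vectors = {
    "back_pain_exercises": [0.9, 0.1],
    "sleep_hygiene": [0.1, 0.9],
    "stretching": [0.7, 0.3],
}
query_vector = [1.0, 0.0]

retrieved = top_k(query_vector, chunk_vectors, k=2)
print(retrieved)
```

A larger `k` gives the LLM more context to work with, at the cost of a longer prompt and a higher chance of including loosely related chunks.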

To get the "A" in RAG, we'll provide a prompt.

In [65]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [66]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set up our RAG LCEL chain!

In [67]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
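For readers new to LCEL, the data flow of the chain can be written out as ordinary function calls: retrieve context for the question, fill in the prompt, call the model, and keep only the text of the reply. The sketch below uses stand-in functions (`fake_retriever`, `fake_llm`) and a shortened template, all of which are assumptions for illustration rather than LangChain APIs.

```python
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question):
    """Stand-in for the vector store retriever: returns a fixed list of chunks."""
    return ["Cat-Cow Stretch: alternate between arching and sagging your back."]


def fake_llm(prompt):
    """Stand-in for the chat model: returns a message-like dict without calling an API."""
    return {"content": f"(answer based on a {len(prompt)}-character prompt)"}


def run_rag(inputs):
    """Mirror the chain: retrieve -> format prompt -> call model -> parse to string."""
    question = inputs["question"]
    context = fake_retriever(question)            # itemgetter("question") | retriever
    prompt = PROMPT_TEMPLATE.format(              # rag_prompt
        context=context, question=question
    )
    message = fake_llm(prompt)                    # llm
    return message["content"]                     # StrOutputParser()


answer = run_rag({"question": "What are some recommended exercises for lower back pain?"})
print(answer)
```

The dictionary at the start of the LCEL chain plays the role of the first two lines of `run_rag`: it fans the incoming question out to both the retriever and the prompt.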

In [68]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off the floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

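We'll define a GPT-4.1 chat model as an evaluation LLM. Note that the evaluators defined below are configured with the `openai:gpt-4o` model string, so as written they do not use `eval_llm`.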

In [69]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [70]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # provider-prefixed model name; `eval_llm` above is not used here
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o" ,
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o" ,
)

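> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: whether the chain's answer is correct for the question, judged against the reference answer.
> - `labeled_helpfulness_evaluator`: whether the answer is helpful to the user, taking the reference answer into account.
> - `dopeness_evaluator`: whether the answer is "dope, lit, cool" rather than generic; it sees only the input and the answer, not the reference.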

## LangSmith Evaluation

In [71]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'complicated-product-92' at:
https://smith.langchain.com/o/5bc0e3fa-3ae1-4c4e-bc9e-5be1ef1d781c/datasets/2b8d3302-b4e3-4f76-85a6-d58dd1e6f832/compare?selectedSessions=7571b6b1-8729-4040-8ef0-97879674b607




0it [00:00, ?it/s]

| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How can CBT and CBT-I help improve mental heal... | CBT (Cognitive Behavioral Therapy) helps impro... | | CBT, or Cognitive Behavioral Therapy, focuses ... | True | True | True | 5.111873 | a04042d2-4eac-4de3-9abc-1ce11d2e03ba | 019c5018-bbd6-7bc0-94a3-5c0041e92c09 |
| 1 | How can understanding the importance of social... | Understanding the importance of social connect... | | Understanding the importance of social connect... | True | True | True | 4.935261 | 4e9f2387-7bae-4efe-83eb-17a2af0ae801 | 019c5019-0b3d-7631-840f-99edb638d49c |
| 2 | how mental health resources help with mental h... | The provided context does not explicitly expla... | | The context explains that mental health encomp... | False | False | False | 2.382326 | 1f5aa75d-e817-4378-b391-ec2a642daaea | 019c5019-51b8-7402-8baf-f508365fd7ba |
| 3 | Considering the importance of managing digital... | Establishing a consistent morning routine can ... | | Establishing a consistent morning routine, as ... | True | True | False | 5.814236 | 87028718-c38d-4504-9e93-dc58828e67d4 | 019c5019-85c3-7972-aa74-21832c2902c9 |
| 4 | Hwo can I use stress management and relaxation... | You can use the following immediate stress rel... | | The context provides strategies for managing s... | True | True | False | 3.168214 | 999c351b-61b8-4988-ac3d-74ea6aea0bfb | 019c5019-c5f5-70d1-a38e-d46e176ee847 |
| 5 | How does engaging in regular physical activity... | Engaging in regular physical activity, as desc... | | Engaging in regular physical activity, as outl... | True | True | False | 4.524002 | 16681d82-b957-44e0-b757-9e69f91b5773 | 019c5019-ef6e-7f22-81b6-e57d109a334c |
| 6 | How does exercize and physical activty impact ... | Based on the provided context:\n\n**Exercise a... | | Exercise and physical activity are highly effe... | True | True | True | 8.632282 | 1af96b74-17ae-4a2f-b5c3-05a37e87d2c5 | 019c501a-2296-7c82-b832-e1aa8f462862 |
| 7 | Hwo can nutriens like Omega-3 and B vitams hel... | Based on the provided context:\n\nOmega-3 fatt... | | The context explains that key nutrients such a... | True | True | False | 3.416877 | 8a3e6f04-e3e5-4779-88eb-6d5779f12fe4 | 019c501a-6ebd-7f32-92ec-f248c538af86 |
| 8 | Wht is the United States? | I don't know | | The context provided does not include specific... | True | False | False | 0.628536 | 376e293c-44b7-45f3-8e0c-467d82b19b7a | 019c501a-a4f7-7772-9780-fde506387218 |
| 9 | Can you explain what an APPENDIX is in the con... | I don't know. | | The provided context does not define or descri... | True | False | False | 0.800227 | 395e0e48-2ca3-4529-8b37-d43623ba5ac9 | 019c501a-dcfa-7b23-aa62-dd6e4535a3dc |
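To compare runs, it helps to reduce the per-example feedback to one pass rate per evaluator. The sketch below does this for the ten rows in the results table above, with the `True`/`False` values copied by hand from the three feedback columns. The `feedback` dictionary and its key names are just a convenient layout for this example, not a LangSmith data structure.

```python
# Feedback values copied from the results table, rows 0 through 9.
feedback = {
    "qa":          [True, True, False, True, True, True, True, True, True, True],
    "helpfulness": [True, True, False, True, True, True, True, True, False, False],
    "dopeness":    [True, True, False, False, False, False, True, False, False, False],
}

# Fraction of examples each evaluator marked as passing.
pass_rates = {key: sum(values) / len(values) for key, values in feedback.items()}

for key, rate in pass_rates.items():
    print(f"{key}: {rate:.0%}")
```

For this baseline run the pass rates are 90% for `qa`, 70% for `helpfulness`, and 30% for `dopeness`. In line with the earlier note on directional changes, these numbers are most useful as a baseline to compare against after modifying the pipeline.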


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

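- Include a "dope" prompt augmentation
- Use larger chunks (`chunk_size` of 1000 instead of 500)
- Upgrade the embedding model used for retrieval to `text-embedding-3-large`
- Retrieve with the retriever's default `k` (the first chain used `k=10`)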

Let's see how this changes our evaluation!

In [72]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [73]:
rag_documents = docs

In [74]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
Smaller chunk size (e.g., 500) produces more precise, focused chunks that are often more relevant for simple questions, but may lack context for complex questions that require more information. The retriever may return more chunks, but each contains less context, which can lead to fragmented answers if information is spread across multiple chunks. Larger chunk size (e.g., 1000) contains more context per chunk, which helps with complex and multi-hop questions because it can include more relevant information in a single chunk. On the other hand, larger chunks may include less relevant parts, reducing precision, and increase the number of tokens in the prompt, which can affect costs and speed. Chunk size affects the balance between precision and context: smaller chunks are better for simple, specific questions, while larger chunks help with more complex questions that require more information. The optimal chunk size depends on the type of questions and the complexity of the data.
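The effect of chunk size on the number and length of chunks can be seen with a simplified splitter. The sketch below uses a plain sliding window over characters, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph breaks, so its real chunk counts will differ.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# A 2,000-character stand-in document.
document = "x" * 2000

small_chunks = sliding_window_chunks(document, chunk_size=500, chunk_overlap=50)
large_chunks = sliding_window_chunks(document, chunk_size=1000, chunk_overlap=50)

print(len(small_chunks), len(large_chunks))
```

The same document yields fewer, longer chunks at the larger size. With a fixed `k`, the retriever therefore passes more text per retrieved chunk to the LLM, which is the precision-versus-context trade-off described in the answer above.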

In [75]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
The embedding model converts text into vectors that represent semantic meaning, and the retriever ranks chunks by how close their vectors are to the query vector. Different models capture similarities between words and concepts differently. A stronger model (e.g., `text-embedding-3-large` instead of `text-embedding-3-small`) typically handles synonyms, contextual differences, and ambiguity better, so the retriever surfaces more relevant chunks, and better context leads to better answers. The trade-off is that larger models can be slower and more expensive, and they often produce higher-dimensional vectors that take more storage. Because vectors from different models are not comparable, switching models also means re-embedding the documents and rebuilding the vector store. The gains tend to be largest for complex questions that depend on subtle relationships between concepts.
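The effect on retrieval can be illustrated with a toy example. The vectors below are hand-made for illustration and are not outputs of any OpenAI model; the `rank_chunks` helper and the chunk names are likewise hypothetical. The sketch only shows that the same query can rank chunks differently when the vector space separates concepts more finely.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]]) -> list[str]:
    """Return chunk names ordered from most to least similar to the query."""
    return sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )


# Toy "coarse" space with axes [health, time]: sleep and exercise look alike.
coarse_query = [1.0, 0.2]
coarse_chunks = {
    "sleep_hygiene": [0.9, 0.3],
    "exercise_plan": [1.0, 0.2],
}

# Toy "fine" space with axes [health, sleep, exercise]: the topics separate.
fine_query = [1.0, 0.9, 0.0]  # stands in for "How can I improve my sleep quality?"
fine_chunks = {
    "sleep_hygiene": [0.9, 0.8, 0.1],
    "exercise_plan": [1.0, 0.0, 0.9],
}

print("coarse ranking:", rank_chunks(coarse_query, coarse_chunks))
print("fine ranking:  ", rank_chunks(fine_query, fine_chunks))
```

In the coarse space the exercise chunk is ranked first for a sleep question, while the finer space puts the sleep chunk on top. Real embedding models differ in far subtler ways, so treat this purely as intuition.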

In [76]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [77]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [78]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)
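As a rough mental model of what this chain does at each step, here is a plain-Python sketch with no LangChain dependency. `stub_retriever`, `stub_llm`, `build_prompt`, and `run_chain` are hypothetical stand-ins for the Qdrant retriever, the chat model, and the LCEL pipeline; the prompt is a shortened version of `DOPENESS_RAG_PROMPT`.

```python
# Shortened stand-in for DOPENESS_RAG_PROMPT (same two placeholders).
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def stub_retriever(question: str) -> list[str]:
    """Hypothetical retriever: returns fixed snippets instead of querying Qdrant."""
    return ["Keep a consistent sleep schedule.", "Limit screen time before bed."]


def stub_llm(prompt: str) -> str:
    """Hypothetical model: reports the prompt size instead of calling OpenAI."""
    return f"stub answer generated from a {len(prompt)}-character prompt"


def build_prompt(inputs: dict) -> str:
    # Step 1: mirror the dict at the head of the chain. The question is sent
    # to the retriever for "context" and passed through unchanged as "question".
    question = inputs["question"]
    context = stub_retriever(question)
    # Step 2: fill the prompt template with both values.
    return PROMPT_TEMPLATE.format(context=context, question=question)


def run_chain(inputs: dict) -> str:
    # Steps 3 and 4: call the model and return its text, which is the role
    # StrOutputParser plays in the real chain.
    return stub_llm(build_prompt(inputs))


example_prompt = build_prompt({"question": "How can I improve my sleep quality?"})
answer = run_chain({"question": "How can I improve my sleep quality?"})
print(example_prompt)
print(answer)
```

The real chain composes the same steps with the `|` operator, which also gives it methods such as `invoke` for free.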

Let's test it on the same question that we used before.

In [79]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

"Yo, want to level up your sleep game like a sleep ninja? Here's the ultimate cheat code straight from the wellness vault:\n\n1. Lock in a consistent sleep schedule - yeah, even on weekends. Your body loves that rhythm like a beat drop.\n2. Dial into a chill bedtime routine - think reading, gentle stretching, or a warm bath to melt stress away.\n3. Transform your bedroom into a sleep sanctuary: keep it cool (65-68°F / 18-20°C), pitch-black with blackout curtains or a sleep mask, and ninja-quiet with white noise machines or earplugs.\n4. Screen zombies beware - cut screen time 1 to 2 hours before bed to dodge that blue light buzzkill.\n5. Caffeine? Cut it off after 2 PM unless you wanna sabotage your sleep mission.\n6. Get your sweat on regularly but keep it away from bedtime to avoid overheating your system.\n7. Skip heavy meals and booze before bed; they wreck your sleep cycle vibes.\n8. Invest in a mattress and pillows that cradle you like royalty - comfort is king.\nMaster these a

Finally, we can evaluate the new chain on the same test set!

In [80]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'loyal-toe-17' at:
https://smith.langchain.com/o/5bc0e3fa-3ae1-4c4e-bc9e-5be1ef1d781c/datasets/2b8d3302-b4e3-4f76-85a6-d58dd1e6f832/compare?selectedSessions=1ccbe18d-52d7-40c8-9089-b8739c3f98cf





| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How can CBT and CBT-I help improve mental heal... | Hell yeah, let’s break down how CBT and CBT-I ... | | CBT, or Cognitive Behavioral Therapy, focuses ... | True | True | True | 6.107222 | a04042d2-4eac-4de3-9abc-1ce11d2e03ba | 019c5030-c4f0-7012-ac4d-9b8fc193a39d |
| 1 | How can understanding the importance of social... | Alright, buckle up for some next-level mental ... | | Understanding the importance of social connect... | True | True | True | 4.084045 | 4e9f2387-7bae-4efe-83eb-17a2af0ae801 | 019c5031-3b38-7921-a5cb-1ea686b3d9cd |
| 2 | how mental health resources help with mental h... | Yo, let’s break it down with some seriously ra... | | The context explains that mental health encomp... | True | True | True | 6.937903 | 1f5aa75d-e817-4378-b391-ec2a642daaea | 019c5031-7eac-7930-b51d-d6350d24b8d2 |
| 3 | Considering the importance of managing digital... | Oh, buckle up, because this combo of a solid m... | | Establishing a consistent morning routine, as ... | True | True | True | 5.532583 | 87028718-c38d-4504-9e93-dc58828e67d4 | 019c5031-e6a4-7351-9a1b-316ebf9dc8ec |
| 4 | Hwo can I use stress management and relaxation... | Yo, when stress hits you like a lightning bolt... | | The context provides strategies for managing s... | True | True | True | 5.227073 | 999c351b-61b8-4988-ac3d-74ea6aea0bfb | 019c5032-4b9e-7df2-b544-f83cee8ce966 |
| 5 | How does engaging in regular physical activity... | Alright, let’s break down the epic vibes of re... | | Engaging in regular physical activity, as outl... | True | True | True | 6.195168 | 16681d82-b957-44e0-b757-9e69f91b5773 | 019c5032-8c32-7c43-9c97-9d2774f97f72 |
| 6 | How does exercize and physical activty impact ... | Alright, strap in for a mind-body synergy brea... | | Exercise and physical activity are highly effe... | True | True | True | 7.591024 | 1af96b74-17ae-4a2f-b5c3-05a37e87d2c5 | 019c5032-cf65-7e83-8f0d-a57468fbcd34 |
| 7 | Hwo can nutriens like Omega-3 and B vitams hel... | Alright, let’s crank this mental health game u... | | The context explains that key nutrients such a... | True | True | True | 4.259014 | 8a3e6f04-e3e5-4779-88eb-6d5779f12fe4 | 019c5033-28d9-74f2-a28a-af07a38f1a61 |
| 8 | Wht is the United States? | I don’t know. The context didn’t drop any know... | | The context provided does not include specific... | True | True | True | 1.153421 | 376e293c-44b7-45f3-8e0c-467d82b19b7a | 019c5033-7fa2-78c3-842d-e310e13a8da8 |
| 9 | Can you explain what an APPENDIX is in the con... | Yo, in the context of building healthy habits,... | | The provided context does not define or descri... | False | True | True | 1.928761 | 395e0e48-2ca3-4529-8b37-d43623ba5ac9 | 019c5033-ba7e-7e61-a6c6-d91bfc32b1c8 |
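The per-row feedback can be summarized with a few lines of standard-library Python. The values below are copied from the ten rows shown above; the `rows` layout and the variable names are illustrative assumptions. LangSmith computes its own aggregates in the UI, which may cover more runs than the rows displayed here, so the numbers need not match the dashboard exactly.

```python
from statistics import mean, median

# (feedback.qa, feedback.helpfulness, feedback.dopeness, execution_time)
# copied from the ten rows displayed above.
rows = [
    (True, True, True, 6.107222),
    (True, True, True, 4.084045),
    (True, True, True, 6.937903),
    (True, True, True, 5.532583),
    (True, True, True, 5.227073),
    (True, True, True, 6.195168),
    (True, True, True, 7.591024),
    (True, True, True, 4.259014),
    (True, True, True, 1.153421),
    (False, True, True, 1.928761),
]

# Treat True as 1.0 and False as 0.0, then average each feedback column.
qa_rate = mean(float(row[0]) for row in rows)
helpfulness_rate = mean(float(row[1]) for row in rows)
dopeness_rate = mean(float(row[2]) for row in rows)

# Median (P50) execution time in seconds for the displayed rows.
p50_seconds = median(row[3] for row in rows)

print(f"qa={qa_rate:.2f} helpfulness={helpfulness_rate:.2f} dopeness={dopeness_rate:.2f}")
print(f"P50 execution time: {p50_seconds:.3f} s")
```

The single `False` in the `feedback.qa` column comes from the "APPENDIX" question in row 9.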


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
![LangSmith Comparison Screenshot](LangSmithComparingDatasetScreenshoot.png)


As shown in the screenshot, in the notebook we first built a baseline RAG chain using the standard rag_prompt, a retriever with parameter k=10, a smaller chunk_size during document segmentation, and a smaller embedding model for building the vector store. This configuration meant that we retrieved a larger number of smaller text segments, whose semantic representation was generated with a simpler embedding model.
In the experimental version, we modified several components: we used the dopeness_rag_prompt, increased the chunk_size, applied a larger embedding model, and left the retriever at its default value (fewer retrieved documents compared to k=10). These changes affected both the generation style and the quality of retrieval, as well as the structure of the context provided to the model.

Dopeness metric:
The evaluation results (visible in the bar chart section and table in the screenshot) show that the experiment significantly improved the dopeness metric. In the baseline version, the average score was around 0.25, while in the experimental version it increased to 1.00. This improvement can be directly attributed to the prompt change, as the dopeness_rag_prompt was explicitly designed to produce an energetic, expressive, and "cool" style of responses. Since the dopeness evaluator measures tone and stylistic qualities, the prompt modification had a direct and strong impact on this metric.

Helpfulness metric:
The helpfulness metric also improved, increasing from approximately 0.67 in the baseline to 1.00 in the experimental version (visible in the table in the screenshot). Because the prompt, chunk_size, and embedding model were all changed in the same experiment, this comparison cannot isolate how much each change contributed. The plausible contributors are: a larger chunk_size lets each retrieved segment carry more context and reduces fragmentation; the larger embedding model may capture semantic relationships between queries and documents more accurately, improving retrieval quality; and the more engaging prompt yields responses that the evaluator judges as more informative and useful.

QA Correctness metric:
The QA (factual correctness) metric remained approximately the same in both versions, with an average score of around 0.91 (visible in the bar chart section in the screenshot). This indicates that the baseline system already achieved a high level of accuracy, and the changes in chunk_size and embedding model did not lead to a significant improvement in factual correctness. In other words, the experiment enhanced style and perceived quality but did not substantially change the level of answer accuracy.

Latency and token usage:
Latency analysis (shown in the latency section in the screenshot) shows that the experimental version has a higher median latency (P50). This can be explained by several factors: the larger embedding model requires more computational resources, larger chunks increase the amount of context processed, and the new prompt generates longer and more detailed responses. All of these factors contribute to increased processing and generation time. Regarding token usage (shown in the token count section), the baseline version has more input tokens because it uses k=10 and retrieves more documents. The experimental version retrieves fewer documents but uses a larger chunk_size, meaning each chunk may contain richer information. At the same time, the experiment generates more output tokens due to longer and more stylized responses. The increase in output tokens contributes to the slightly higher overall cost of the experimental configuration (visible in the cost section).
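The input-token observation can be sanity-checked with back-of-the-envelope arithmetic. This sketch makes two assumptions that this section does not confirm: the baseline `chunk_size` is taken to be 500 (the example value used in the Question #3 answer), and `k=4` is assumed to be the default that `vectorstore.as_retriever()` applies in the experimental chain. `max_context_chars` is a hypothetical helper.

```python
def max_context_chars(k: int, chunk_size: int) -> int:
    """Upper bound on retrieved context: k chunks of at most chunk_size characters."""
    return k * chunk_size


# Baseline: k=10 from the analysis above; chunk_size=500 is an ASSUMED value.
baseline_chars = max_context_chars(k=10, chunk_size=500)

# Experiment: chunk_size=1000 from the notebook; k=4 is an ASSUMED default.
experiment_chars = max_context_chars(k=4, chunk_size=1000)

print(f"baseline upper bound:   {baseline_chars} characters of context")
print(f"experiment upper bound: {experiment_chars} characters of context")
```

Under these assumptions the baseline can send more context per request even though its chunks are smaller, which is consistent with the higher input-token count reported above.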

Conclusion:
The experimental setup represents a technically more advanced RAG system in which we modified the prompt to optimize style, increased chunk_size to provide richer context, and used a larger embedding model to improve retrieval quality. As a result, the system achieved better perceived usefulness and significantly higher stylistic quality, while maintaining a similar level of factual accuracy. However, these improvements come at the cost of higher latency and slightly increased overall cost.

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores