# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, it would be much harder to get a high-quality early signal on our application's performance.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [None]:
#!pip install -qU ragas==0.2.10

In [None]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/julieberlin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/julieberlin/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us test data for measuring how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefully be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

### NOTICE: We're using a subset of the data for this example - this is to keep costs/time down.
for doc in docs[:20]:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 20, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
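To make the relationship-building step more concrete, here is a minimal sketch of the idea in plain Python. It is not Ragas' implementation: the toy embeddings, the node names, and the `threshold` value are all assumptions chosen for illustration. It only shows how a cosine-similarity cutoff can turn a set of embedded nodes into a set of edges.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(
    embeddings: Dict[str, List[float]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Link every pair of nodes whose similarity meets the (assumed) threshold."""
    node_ids = sorted(embeddings)
    edges = []
    for i, source in enumerate(node_ids):
        for target in node_ids[i + 1:]:
            score = cosine_similarity(embeddings[source], embeddings[target])
            if score >= threshold:
                edges.append((source, target, score))
    return edges


# Toy, hand-written vectors - real embeddings have hundreds or thousands of dimensions.
toy_embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
edges = build_edges(toy_embeddings, threshold=0.9)
```

With these toy vectors only `doc_a` and `doc_b` point in nearly the same direction, so they are the only pair that ends up connected.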

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

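# Use a distinct variable name so we don't shadow the imported `default_transforms` function
# (shadowing it would make this cell fail if it were run a second time).
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)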
kg

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node 'f64240'. Skipping!
Property 'summary' already exists in node 'aa11e2'. Skipping!
Property 'summary' already exists in node '1e161f'. Skipping!
Property 'summary' already exists in node 'a4003f'. Skipping!
Property 'summary' already exists in node '39161a'. Skipping!
Property 'summary' already exists in node '8a6ad7'. Skipping!
Property 'summary' already exists in node '5d8a9f'. Skipping!
Property 'summary' already exists in node '592ff0'. Skipping!
Property 'summary' already exists in node '3dda32'. Skipping!
Property 'summary' already exists in node 'd96e00'. Skipping!
Property 'summary' already exists in node '1fb89d'. Skipping!
Property 'summary' already exists in node '6d3af8'. Skipping!
Property 'summary' already exists in node '373dee'. Skipping!
Property 'summary' already exists in node 'ce7a45'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'f64240'. Skipping!
Property 'summary_embedding' already exists in node '8a6ad7'. Skipping!
Property 'summary_embedding' already exists in node 'a4003f'. Skipping!
Property 'summary_embedding' already exists in node 'aa11e2'. Skipping!
Property 'summary_embedding' already exists in node '373dee'. Skipping!
Property 'summary_embedding' already exists in node '3dda32'. Skipping!
Property 'summary_embedding' already exists in node '1e161f'. Skipping!
Property 'summary_embedding' already exists in node 'ce7a45'. Skipping!
Property 'summary_embedding' already exists in node '1fb89d'. Skipping!
Property 'summary_embedding' already exists in node '5d8a9f'. Skipping!
Property 'summary_embedding' already exists in node '6d3af8'. Skipping!
Property 'summary_embedding' already exists in node '39161a'. Skipping!
Property 'summary_embedding' already exists in node 'd96e00'. Skipping!
Property 'summary_embedding' already exists in node '592ff0'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 40, relationships: 478)

We can save and load our knowledge graphs as follows.

In [10]:
kg.save("loan_data_kg.json")
loan_data_kg = KnowledgeGraph.load("loan_data_kg.json")
loan_data_kg

KnowledgeGraph(nodes: 40, relationships: 478)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [None]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=loan_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [12]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
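As a rough way to reason about the weights above, the sketch below turns a distribution into expected counts per query type. The dictionary keys and the `expected_counts` helper are hypothetical names, and treating the weights as probabilities that must sum to 1 is an assumption of this sketch rather than a documented Ragas rule.

```python
from typing import Dict

# Same weights as the `query_distribution` above, keyed by illustrative names.
QUERY_WEIGHTS = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}


def expected_counts(weights: Dict[str, float], testset_size: int) -> Dict[str, float]:
    """Return the expected number of samples per query type for a given testset size."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights should sum to 1.0, got {total}")
    return {name: weight * testset_size for name, weight in weights.items()}


counts = expected_counts(QUERY_WEIGHTS, testset_size=10)
```

For a testset of 10 the expectations for the two multi-hop types are fractional (2.5 each), so the generator cannot hit them exactly; the number of samples per type in the generated testset may differ slightly from the nominal split.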

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### ✅ **Answer:**

1. `SingleHopSpecificQuerySynthesizer` - Generates questions that can be answered by referencing a single, specific piece of information (node) in the knowledge graph. These questions will be direct, fact-based, and focused on a single document or fact.

2. `MultiHopAbstractQuerySynthesizer` - Generates questions that require combining information from multiple nodes, but in an abstract or generalized way. These questions require reasoning across several pieces of information, but the answer is more about general concepts or relationships, not specific details.

3. `MultiHopSpecificQuerySynthesizer` - Generates questions that require combining information from multiple nodes, with the answer being a specific, concrete detail. These questions require the model to connect multiple facts/documents to arrive at a specific answer.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [13]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is Title IV in federal student aid?,"[Chapter 1 Academic Years, Academic Calendars,...",The context does not provide a specific defini...,single_hop_specifc_query_synthesizer
1,Could you please explain the significance of 3...,[Regulatory Citations Academic year minimums: ...,Regulatory citations indicate that 34 CFR 668....,single_hop_specifc_query_synthesizer
2,Wha is Volume 8?,[Inclusion of Clinical Work in a Standard Term...,Inclusion of Clinical Work in a Standard Term ...,single_hop_specifc_query_synthesizer
3,What is the significance of Title IV in relati...,[Non-Term Characteristics A program that measu...,The payment period is applicable to all Title ...,single_hop_specifc_query_synthesizer
4,How does a mispelled Pell Grant affect student...,[both the credit or clock hours and the weeks ...,The context does not provide information about...,single_hop_specifc_query_synthesizer
5,How do the disbursement timing requirements di...,[<1-hop>\n\nboth the credit or clock hours and...,In clock-hour or non-term credit-hour programs...,multi_hop_abstract_query_synthesizer
6,How do the minimum instructional weeks for cre...,"[<1-hop>\n\nChapter 1 Academic Years, Academic...",The minimum instructional weeks for credit-hou...,multi_hop_abstract_query_synthesizer
7,If student accelerate in clock hour or non-ter...,[<1-hop>\n\nboth the credit or clock hours and...,The student who accelerates in a clock-hour or...,multi_hop_abstract_query_synthesizer
8,which volume 2 or volume 8 is more relevant fo...,[<1-hop>\n\nboth the credit or clock hours and...,Volume 2 discusses the effect of accelerated p...,multi_hop_specific_query_synthesizer
9,How do Volume 8 and Volume 7 relate to disburs...,[<1-hop>\n\nboth the credit or clock hours and...,Volume 8 discusses the inclusion of clinical w...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [14]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node 'f97854'. Skipping!
Property 'summary' already exists in node 'b22719'. Skipping!
Property 'summary' already exists in node 'ae68a3'. Skipping!
Property 'summary' already exists in node '48744b'. Skipping!
Property 'summary' already exists in node 'b9772f'. Skipping!
Property 'summary' already exists in node 'd4696b'. Skipping!
Property 'summary' already exists in node '839a02'. Skipping!
Property 'summary' already exists in node '9a9cd5'. Skipping!
Property 'summary' already exists in node '2a7de9'. Skipping!
Property 'summary' already exists in node '2a2d1b'. Skipping!
Property 'summary' already exists in node '844519'. Skipping!
Property 'summary' already exists in node '1f5520'. Skipping!
Property 'summary' already exists in node '07899f'. Skipping!
Property 'summary' already exists in node '1494be'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'f97854'. Skipping!
Property 'summary_embedding' already exists in node 'b22719'. Skipping!
Property 'summary_embedding' already exists in node 'b9772f'. Skipping!
Property 'summary_embedding' already exists in node '2a7de9'. Skipping!
Property 'summary_embedding' already exists in node '1f5520'. Skipping!
Property 'summary_embedding' already exists in node '839a02'. Skipping!
Property 'summary_embedding' already exists in node 'ae68a3'. Skipping!
Property 'summary_embedding' already exists in node '48744b'. Skipping!
Property 'summary_embedding' already exists in node 'd4696b'. Skipping!
Property 'summary_embedding' already exists in node '844519'. Skipping!
Property 'summary_embedding' already exists in node '07899f'. Skipping!
Property 'summary_embedding' already exists in node '9a9cd5'. Skipping!
Property 'summary_embedding' already exists in node '2a2d1b'. Skipping!
Property 'summary_embedding' already exists in node '1494be'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How does the definition of a 'Department' infl...,"[Chapter 1 Academic Years, Academic Calendars,...",The context does not explicitly define 'Depart...,single_hop_specifc_query_synthesizer
1,Could you please explain the significance of 3...,[Regulatory Citations Academic year minimums: ...,Weeks of instructional time are specified unde...,single_hop_specifc_query_synthesizer
2,Wha is a Standard Term?,[Inclusion of Clinical Work in a Standard Term...,Inclusion of Clinical Work in a Standard Term ...,single_hop_specifc_query_synthesizer
3,Could you explain what constitutes Non-Term Ch...,[Non-Term Characteristics A program that measu...,A program that measures progress in clock hour...,single_hop_specifc_query_synthesizer
4,How do the academic year minimums and instruct...,"[<1-hop>\n\nChapter 1 Academic Years, Academic...","The academic year minimums, specified in 34 CF...",multi_hop_abstract_query_synthesizer
5,How do the payment periods and scheduled payme...,[<1-hop>\n\nboth the credit or clock hours and...,The initial disbursement requirements depend o...,multi_hop_abstract_query_synthesizer
6,Wht are the disbursement timing rules for fina...,[<1-hop>\n\nboth the credit or clock hours and...,The disbursement timing rules for financial ai...,multi_hop_abstract_query_synthesizer
7,clinical work in nonstandard terms how does it...,[<1-hop>\n\nInclusion of Clinical Work in a St...,The context explains that clinical work outsid...,multi_hop_abstract_query_synthesizer
8,"According to Volume 2 and Volume 8, how does t...",[<1-hop>\n\nInclusion of Clinical Work in a St...,Volume 2 explains that clinical work conducted...,multi_hop_specific_query_synthesizer
9,where appendix A and B tell about disbursement...,[<1-hop>\n\nDisbursement Timing in Subscriptio...,Appendix B provides detailed guidance and exam...,multi_hop_specific_query_synthesizer


LangSmith needs an API key and tracing set to "true" - we already provided both in the Dependencies and API Keys section above.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [16]:
from langsmith import Client

client = Client()

dataset_name = "Loan Synthetic Data 2"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Loan Synthetic Data"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [17]:
# Each row of the Ragas dataframe becomes one LangSmith example.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )

## Basic RAG Chain

Time for some RAG!


In [19]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [21]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [22]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Loan RAG"
)

In [23]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [24]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

For our LLM, we will be using OpenAI's `gpt-4.1-mini`.

In [25]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [26]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
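For readers new to LCEL, the data flow of the chain above can be approximated in plain Python. This is only a sketch: `fake_retriever`, `fake_llm`, and the shortened `PROMPT_TEMPLATE` are hypothetical stand-ins (no vector store or model is called), and the comments map each step to the piece of the pipe it imitates.

```python
from typing import Dict, List

# Shortened stand-in for RAG_PROMPT, keeping only the two placeholders.
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question: str) -> List[str]:
    """Stub retriever: returns a placeholder instead of real document chunks."""
    return [f"<chunk retrieved for: {question}>"]


def fake_llm(prompt: str) -> str:
    """Stub LLM: echoes the prompt so we can see exactly what it was given."""
    return "ECHO: " + prompt


def run_rag(inputs: Dict[str, str]) -> str:
    question = inputs["question"]          # itemgetter("question")
    context = fake_retriever(question)     # itemgetter("question") | retriever
    prompt = PROMPT_TEMPLATE.format(       # | rag_prompt
        context=context, question=question
    )
    completion = fake_llm(prompt)          # | llm
    return str(completion)                 # | StrOutputParser()


answer = run_rag({"question": "What kinds of loans are available?"})
```

The point of the sketch is that the question is used twice: once to fetch context and once, unchanged, to fill the prompt.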

In [27]:
rag_chain.invoke({"question" : "What kinds of loans are available?"})

'The kinds of loans available include:\n\n- Direct Subsidized Loans  \n- Direct Unsubsidized Loans  \n- Direct PLUS Loans (student Federal PLUS Loans and parent loans on behalf of dependent students)  \n- Subsidized and Unsubsidized Federal Stafford Loans (made under the FFEL Program before July 1, 2010)  \n- Federal SLS Loans (also made under the FFEL Program before July 1, 2010)  \n- Federal PLUS Loans (also under the FFEL Program before July 1, 2010)  \n\nDirect Subsidized Loans are available only to undergraduate students. Graduate or professional students are eligible for Direct Unsubsidized Loans but not for Direct Subsidized Loans. Direct Unsubsidized Loans and Direct PLUS Loans can be used to cover unmet need and replace the SAI.  \n\nNo new loans are made under the FFEL Program since June 30, 2010; current loans are Direct Loans.'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [28]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [29]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

empathy_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "empathy": "Is this response empathetic? Does it make the user feel like they are being heard?",
        },
        "llm" : eval_llm
    }
)

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

##### ✅ **Answer:**

- `qa_evaluator`: The built-in "qa" (question answering) evaluator uses the LLM referenced by `eval_llm` (OpenAI's GPT-4.1, here) to judge the answers produced by our RAG chain. It compares the answer generated by our RAG system against the reference answer (the ground truth from the synthetic dataset) and asks the LLM whether the generated answer is correct with respect to that reference. The result is a score or label indicating how well the system's answer matches the expected answer.

- `labeled_helpfulness_evaluator`: Uses the built-in LangChainStringEvaluator class with a specific configuration for the "labeled_criteria" evaluation type and a custom criterion for "helpfulness". The "labeled_criteria" type tells the evaluator to use the reference answer as a label to judge if the answer is helpful, considering the reference answer. The `prepare_data` function ensures the LLM receives the system output, reference answer, and question in the expected format.

- `empathy_evaluator`: Also uses the built-in LangChainStringEvaluator class, this time with a custom "empathy" criterion. Unlike "helpfulness" above, no labels (reference answers) are provided. The LLM is prompted to judge the system's answer for empathy, based on the criterion description we supplied. For "criteria" evaluators, the output is typically a label such as "Y" (yes), "N" (no), or sometimes "Y?" (uncertain), depending on the LLM's response and the evaluator's configuration.

NOTE: 

I did some investigation into whether we can influence the output label for custom evaluators.

- For label-based outputs: You can post-process the labels to map them to numeric scores (e.g., "Y" → 1, "N" → 0).

- For numeric scoring: You can modify the criterion prompt to explicitly ask for a score (e.g., “Rate the empathy of this response on a scale from 1 to 5”), and then parse the LLM’s output accordingly.

- Custom evaluators: You can write a fully custom evaluator function to enforce any scoring scheme you want.

Example:

```python
empathy_evaluator = LangChainStringEvaluator(
    ...
        "criteria": {
            "empathy": "On a scale from 1 (not empathetic) to 5 (very empathetic), how empathetic is this response? Respond with a single number."
        },
    ...
)
```
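The label post-processing mentioned in the first bullet of the note could look roughly like the sketch below. `LABEL_TO_SCORE` and `label_to_score` are hypothetical names, not part of LangSmith or LangChain, and returning `None` for unrecognised labels is just one possible design choice.

```python
from typing import Optional

# Hypothetical mapping from criteria labels to numeric scores.
LABEL_TO_SCORE = {"Y": 1.0, "N": 0.0}


def label_to_score(label: str) -> Optional[float]:
    """Map a "Y"/"N" label to 1.0/0.0; return None for anything else (e.g. "Y?")."""
    return LABEL_TO_SCORE.get(label.strip().upper())
```

Returning `None` rather than guessing keeps uncertain judgments out of any average computed later.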

## LangSmith Evaluation

In [30]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        empathy_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'scholarly-plant-11' at:
https://smith.langchain.com/o/b6da38b2-2b3b-4201-8414-bf6f7feb1065/datasets/de49220c-a9a5-4b4a-8d2b-abd14801ed11/compare?selectedSessions=f01fc563-8b89-4486-8c3e-98e51d323901




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.empathy,execution_time,example_id,id
0,Volume 2 and Volume 8 both talk about academic...,"Based on the provided context, Volume 2 and Vo...",,Volume 2 explains the academic year requiremen...,1,1,0,5.140653,47e642f6-71f6-46b3-b25b-61a11dc0846e,56fd67d1-5cd5-46fb-af34-6d691bdaf0a2
1,How does the inclusion of clinical work in a s...,The inclusion of clinical work in a standard t...,,"According to Volume 8, inclusion of clinical w...",1,1,0,5.029287,55c9abf5-bf18-4efc-80e9-28de5a0a5e76,c3e293ff-2073-45d5-aadb-dcfc66629eb9
2,where appendix A and B tell about disbursement...,Based on the provided context:\n\n- **Appendix...,,Appendix B provides detailed guidance and exam...,1,1,0,4.505716,031996f2-7484-4177-9a3e-f508b7809556,42d85ca9-9df5-4414-ad58-dbb1967c4dce
3,"According to Volume 2 and Volume 8, how does t...",According to the provided context from Volume ...,,Volume 2 explains that clinical work conducted...,1,1,0,2.19059,2ec192de-3d23-4dc8-9900-210b260eacfd,8794042f-7140-4bb5-83ce-cf7ce433d432
4,clinical work in nonstandard terms how does it...,"Based on the provided context, clinical work t...",,The context explains that clinical work outsid...,1,0,0,1.774134,342c2e6b-17c5-4931-9741-1fe8f8b7c33a,143fc60a-e78d-4720-a35f-6fc00fd0fa59
5,Wht are the disbursement timing rules for fina...,The disbursement timing rules for financial ai...,,The disbursement timing rules for financial ai...,1,1,0,5.922941,ee4b5eab-933f-41fc-b06b-7356be354056,a5d7dfd4-3154-4f6d-8ac4-c1bf4fcf0b49
6,How do the payment periods and scheduled payme...,"Based on the provided context, here is how pay...",,The initial disbursement requirements depend o...,1,1,0,28.214481,f0394fca-8049-4d58-8362-ee282acc2404,6ed95a09-46e9-4858-a070-748463df40e4
7,How do the academic year minimums and instruct...,The academic year minimums and instructional t...,,"The academic year minimums, specified in 34 CF...",1,1,0,4.172672,0ee21bb7-e7ae-4721-adc2-2e493cc0f89a,b268890f-2bc3-4aeb-8f8b-d9c22bd674e4
8,Could you explain what constitutes Non-Term Ch...,Non-Term Characteristics in the context of aca...,,A program that measures progress in clock hour...,1,1,0,2.215194,0265f06e-b4eb-463b-b082-a5d6adc78320,4ef9316b-cc42-4975-b58a-c8aa610aa549
9,Wha is a Standard Term?,A standard term is generally a period in which...,,Inclusion of Clinical Work in a Standard Term ...,0,1,0,2.253271,99504c50-8660-40b5-bc9a-235c96435281,4687a8a6-4eff-4942-9e7e-0614be3fd6f0
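To read the table at a glance, we can average each feedback column. The sketch below copies the three feedback values per row from the results above into a plain list (the variable names are illustrative); in the notebook itself the same numbers could be taken from the returned results instead.

```python
# (correctness, helpfulness, empathy) for rows 0-9, copied from the results table above.
feedback_rows = [
    (1, 1, 0),
    (1, 1, 0),
    (1, 1, 0),
    (1, 1, 0),
    (1, 0, 0),
    (1, 1, 0),
    (1, 1, 0),
    (1, 1, 0),
    (1, 1, 0),
    (0, 1, 0),
]
metric_names = ("correctness", "helpfulness", "empathy")

# Average each column across all evaluated examples.
averages = {
    name: sum(row[i] for row in feedback_rows) / len(feedback_rows)
    for i, name in enumerate(metric_names)
}
```

For this run the averages come out to 0.9 for correctness, 0.9 for helpfulness, and 0.0 for empathy, which makes empathy the obvious target for the changes that follow.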


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used for retrieval to: `text-embedding-3-large`

Let's see how this changes our evaluation!

In [31]:
EMPATHY_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the question using empathy and kindness, and make sure the user feels heard.

Context: {context}
Question: {question}
"""

empathy_rag_prompt = ChatPromptTemplate.from_template(EMPATHY_RAG_PROMPT)

In [32]:
rag_documents = docs

In [33]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

##### ✅ **Answer:**

Chunk size influences the amount of context available inside each chunk retrieved from the vector database. Smaller chunk sizes are suitable for single, isolated pieces of information, such as "What is the capital of France?" Larger chunks can capture more nuanced meanings. The rule of thumb is to chunk text to roughly the size of an expected answer for the use case.

Here, we are using recursive character splitting and have doubled the chunk size from 500 to 1000 characters. That is roughly the equivalent of changing from a couple of sentences to an average paragraph. We would expect responses from the new application to be slightly better in quality, but it's unlikely this change alone will dramatically alter them.

I would expect we'd need to increase the chunk size to 3000 if the intention was to truly capture meaning derived from approximately a page-worth of text, or alter chunking to use semantic structure.
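To get a feel for what the chunk-size change does, here is a deliberately naive fixed-window chunker. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which prefers to split on separators such as paragraphs and sentences); the `chunk_text` helper and the synthetic 3000-character text are assumptions for illustration only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 3000-character text, roughly a page's worth of characters.
page = "".join(str(i % 10) for i in range(3000))

small_chunks = chunk_text(page, chunk_size=500, chunk_overlap=50)
large_chunks = chunk_text(page, chunk_size=1000, chunk_overlap=50)
```

On this synthetic page the 500-character setting produces 7 chunks and the 1000-character setting produces 4, so each retrieved chunk carries about twice as much surrounding context while the retriever has fewer candidates to choose from.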

In [34]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

##### ✅ **Answer:**

Here we have switched from `text-embedding-3-small` to `text-embedding-3-large`. 

A larger model like `text-embedding-3-large` produces higher-dimensional vectors (3072 dimensions by default, versus 1536 for `text-embedding-3-small`) and is generally expected to capture deeper and more nuanced semantic relationships between pieces of text. This means that when you embed your documents and queries, the vectors produced by the large model should reflect the meaning and context of the text more accurately.

In RAG, the retriever’s job is to find the most relevant documents or passages for a given query. Higher-quality embeddings lead to better similarity matching, so the retriever is more likely to surface the most contextually relevant information. This directly improves the quality of the context provided to the LLM, resulting in more accurate and relevant answers.

A more powerful embedding model reduces the chance of retrieving irrelevant documents (false positives) or missing relevant ones (false negatives). This is especially important for complex or nuanced queries, where subtle differences in meaning matter.

Since the LLM’s answer depends on the retrieved context, better retrieval means the LLM has better material to work with, leading to higher-quality, more factual, and more helpful answers. This, in turn, improves evaluation metrics (correctness, helpfulness, empathy, etc.).

To sum up, here are the main ways the choice of embedding model can impact the application:

1. Embedding quality and semantic understanding
2. Improved retrieval accuracy
3. Reduced false positives/negatives
4. Downstream impact on generation and evaluation
5. Cost per token
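
As a rough illustration of why embedding quality drives retrieval, the sketch below ranks documents by cosine similarity to a query vector, which is the kind of comparison a vector store performs. The three-dimensional vectors, document names, and the `top_k` helper are invented for illustration; they are not outputs of any OpenAI model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors (hypothetical): the first two documents are about loans.
doc_vecs = {
    "subsidized_loans": [0.9, 0.1, 0.0],
    "plus_loans": [0.7, 0.3, 0.1],
    "academic_calendar": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for "What kinds of loans are available?"

print(top_k(query_vec, doc_vecs, k=2))
```

If an embedding model places unrelated passages close to the query vector, they displace relevant passages in this ranking, which is how embedding quality shows up as false positives and false negatives.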

In [35]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Loan Data for RAG"
)

In [36]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [37]:
empathy_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | empathy_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we asked before.

In [38]:
empathy_rag_chain.invoke({"question" : "What kinds of loans are available?"})

"Thank you for your question! Based on the information from the documents you provided, there are several types of loans available to students and their parents:\n\n1. **Direct Subsidized Loans** – These loans are need-based and can only be granted up to the amount of the student's financial need (calculated as Cost of Attendance minus Student Aid Index and other financial aid). They have maximum annual limits depending on the student's year and status.\n\n2. **Direct Unsubsidized Loans** – These loans are not based on financial need, and students, whether dependent or independent, may qualify for them to cover unmet need or to replace other aid.\n\n3. **Direct PLUS Loans** – These are loans that parents of dependent students can take out to help pay for the student’s Cost of Attendance, assuming they qualify. There’s no fixed limit, but the loan amount cannot exceed the student's Cost of Attendance minus other aid received. If a parent is ineligible, the dependent student may receive 

Finally, we can evaluate the new chain on the same test set!

In [39]:
evaluate(
    empathy_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        empathy_evaluator
    ],
    metadata={"revision_id": "empathy_rag_chain"},
)

View the evaluation results for experiment: 'puzzled-glove-89' at:
https://smith.langchain.com/o/b6da38b2-2b3b-4201-8414-bf6f7feb1065/datasets/de49220c-a9a5-4b4a-8d2b-abd14801ed11/compare?selectedSessions=ca377f63-5cfe-4773-abe0-1943248e7a83





|  | inputs.question | outputs.output | error | reference.answer | feedback.correctness | feedback.helpfulness | feedback.empathy | execution_time | example_id | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Volume 2 and Volume 8 both talk about academic... | Thank you for your thoughtful question. Based ... |  | Volume 2 explains the academic year requiremen... | 1 | 1 | 1 | 5.27885 | 47e642f6-71f6-46b3-b25b-61a11dc0846e | 9cd10e00-6cfa-474b-afee-412858663f3a |
| 1 | How does the inclusion of clinical work in a s... | Thank you for your thoughtful question. Based ... |  | According to Volume 8, inclusion of clinical w... | 1 | 1 | 1 | 6.047591 | 55c9abf5-bf18-4efc-80e9-28de5a0a5e76 | 5639c8a4-6b04-4f11-aeba-a3528a75ed92 |
| 2 | where appendix A and B tell about disbursement... | Thank you for your thoughtful question. Based ... |  | Appendix B provides detailed guidance and exam... | 1 | 1 | 1 | 5.195282 | 031996f2-7484-4177-9a3e-f508b7809556 | 0ac963d7-0dc7-4dab-89ae-0a343590b483 |
| 3 | According to Volume 2 and Volume 8, how does t... | Thank you for your thoughtful question. Based ... |  | Volume 2 explains that clinical work conducted... | 1 | 1 | 1 | 4.586608 | 2ec192de-3d23-4dc8-9900-210b260eacfd | 50b4ea15-5911-4d24-92d9-bf633ebb1066 |
| 4 | clinical work in nonstandard terms how does it... | Thank you for your thoughtful question. Based ... |  | The context explains that clinical work outsid... | 1 | 1 | 1 | 5.480681 | 342c2e6b-17c5-4931-9741-1fe8f8b7c33a | d9f5f720-a832-422a-a4f9-b414987c49e3 |
| 5 | Wht are the disbursement timing rules for fina... | Thank you for your thoughtful question about d... |  | The disbursement timing rules for financial ai... | 1 | 1 | 1 | 8.395553 | ee4b5eab-933f-41fc-b06b-7356be354056 | b14f17f6-3878-4cf6-a77e-a82614f68312 |
| 6 | How do the payment periods and scheduled payme... | Thank you for your thoughtful question. Naviga... |  | The initial disbursement requirements depend o... | 1 | 1 | 1 | 11.28493 | f0394fca-8049-4d58-8362-ee282acc2404 | 807553f3-c163-486b-874d-3721d6796ab8 |
| 7 | How do the academic year minimums and instruct... | Thank you for your thoughtful question about t... |  | The academic year minimums, specified in 34 CF... | 1 | 1 | 1 | 8.46801 | 0ee21bb7-e7ae-4721-adc2-2e493cc0f89a | 3f0bd838-4578-4765-8248-244f756770b6 |
| 8 | Could you explain what constitutes Non-Term Ch... | Thank you for your thoughtful question. Unders... |  | A program that measures progress in clock hour... | 1 | 1 | 1 | 7.772443 | 0265f06e-b4eb-463b-b082-a5d6adc78320 | 3d936536-b9d1-491b-96f5-667bf5bf7e1a |
| 9 | Wha is a Standard Term? | Thank you for your thoughtful question about w... |  | Inclusion of Clinical Work in a Standard Term ... | 0 | 0 | 1 | 2.341338 | 99504c50-8660-40b5-bc9a-235c96435281 | a3470c1e-97b6-4149-acf1-f0aeb61037ec |


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

##### ✅ **Answer:**

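Overall, the *Empathy RAG Chain* is an improvement according to our metrics: empathy rose from 0 to 1 while correctness and helpfulness were unchanged, at the cost of higher latency.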

Here's a summary of the differences as a table:

| Metric      | RAG Chain (scholarly-plant-11) | Empathy RAG Chain (puzzled-glove-89) |
| :---------- | -----------------: | ----------------: |
| correctness (avg) | 0.9167 | 0.9167 |
| helpfulness (avg) | 0.8333 | 0.8333 |
| empathy (avg)     | 0      | 1      |
| latency (P50)     | 3.21   | 5.54   |
| total tokens      | 41,106 | 24,753 |

Screen captures of the differences from LangSmith:

<img src="./images/RAG_eval_comparison.png" width="99%" />

*^Fig 1: Graphs comparing performance of RAG chains*

<img src="./images/scholarly-plant-11.png" width="99%" />

*^Fig 2: RAG Chain Evaluation*

<img src="./images/puzzled-glove-89.png" width="99%" />

*^Fig 3: Empathy RAG Chain Evaluation*


**Explanation**

A few key differences jump out when comparing the results:

- Empathy: the "Empathy RAG Chain" performed very well on the `empathy` metric, scoring `1` on all responses.
    - The dramatic change in empathy scores (from all zeroes to all ones) can be attributed solely to the change to the prompt for "Empathy RAG Chain" which added the sentence: "You must answer the question using empathy and kindness, and make sure the user feels heard." 
    - Upon closer examination, the output is flat and repetitive. A user asking more than one question will see the "empathy" as formulaic when every response starts with "Thank you for your thoughtful question..." and this will probably make humans appreciate the responses less over time.

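- Total tokens: usage decreased for the "Empathy RAG Chain" and costs were slightly lower. Some of this can probably be attributed to the chunking change, although if both retrievers return the same number of chunks per query, larger chunks would be expected to add context tokens rather than remove them, so the LangSmith traces are worth checking to confirm the cause.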

- Latency: P50 latency increased for the "Empathy RAG Chain" to levels that are not good for user experience. It's likely the larger embedding model was slightly slower. (See note for latency explanations.)

- Helpfulness: "RAG Chain" had a positive helpfulness score for an incorrect answer and two non-passing helpfulness scores.
    - Both chains failed to return a good response for the question "How does the definition of a 'Department' influence the management of academic calendars and compliance with federal regulations within a college or university?"
    - RAG Chain answered "I don't know", probably because the retrieved chunks did not capture the more nuanced meaning required to answer this.

- Correctness: Both chains scored incorrect for the question "What is a Standard Term?"
    - The responses from both chains are quite similar, verbose and utterly confusing. It is likely that this is a nuanced term that needs more sophisticated embeddings to be answered properly. The reference answer itself is poor, so we cannot expect a response to score better against it.



**Further Steps**

The process of increasing the quality of our application response has just begun. A number of changes need to be made:

Process:

- Document overall quality goals for the application, both functional and non-functional
- Find metrics that most closely align with the goals and tie each metric definition to one or more goals
- Get a baseline for all metrics
- Make one change at a time, analyze results
- Repeat
- Periodically reassess that all metrics are still valid for the application's goals.
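
The "baseline, change, compare" loop above can be sketched as a small comparison helper. The metric values are copied from the summary table earlier in this answer; the `compare_runs` function, the `HIGHER_IS_BETTER` mapping, and the tolerance are hypothetical names and choices made for this illustration.

```python
# Averages and P50 latency taken from the summary table above.
baseline = {"correctness": 0.9167, "helpfulness": 0.8333, "empathy": 0.0, "latency_p50": 3.21}
candidate = {"correctness": 0.9167, "helpfulness": 0.8333, "empathy": 1.0, "latency_p50": 5.54}

# Direction of "good" for each metric (latency should go down, scores up).
HIGHER_IS_BETTER = {"correctness": True, "helpfulness": True, "empathy": True, "latency_p50": False}


def compare_runs(base: dict[str, float], cand: dict[str, float], tolerance: float = 1e-9) -> dict[str, str]:
    """Label each metric as improved, regressed, or unchanged versus the baseline."""
    report = {}
    for metric, base_value in base.items():
        delta = cand[metric] - base_value
        if abs(delta) <= tolerance:
            report[metric] = "unchanged"
        elif (delta > 0) == HIGHER_IS_BETTER[metric]:
            report[metric] = "improved"
        else:
            report[metric] = "regressed"
    return report


for metric, verdict in compare_runs(baseline, candidate).items():
    print(f"{metric}: {verdict}")
```

Because this experiment changed the prompt, chunk size, and embedding model at once, a report like this cannot say which change caused which movement, which is the argument for making one change at a time.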

Proposed changes:

- Adjust **embedding model**:
    - Why: P50 latency is too high at `5.54`.
    - How: Switch back to `text-embedding-3-small`, which is likely capable of extracting the meaning we need and should be faster and slightly less expensive.

- Adjust **chunking strategy**:
    - Why: Some incorrect answers show that nuanced meaning is being lost. The subject matter is complex, and splitting by character count into roughly paragraph-sized chunks appears too coarse for this material.
    - How: Use a chunking strategy that better preserves meaning. We could use `RecursiveCharacterTextSplitter` with `tiktoken` to split by token, or try `NLTKTextSplitter`, which splits on sentence boundaries. Either will probably provide better retrieved context. Another option is LangChain's [SemanticChunker](https://python.langchain.com/docs/how_to/semantic-chunker/).

- Adjust **empathy** evaluation:
    - Why: The maximum score across all examples means we should increase the difficulty and/or sensitivity of this metric. The basic prompt is resulting in a nice value but it's not useful for the outcome of a good user experience.
    - How: The prompt should be rewritten to be more detailed, perhaps giving an example of the tone to strike. Also, the phrasing needs to be varied. Finally, the score needs to be more sensitive by perhaps adjusting the prompt to supply a 1-5 score. In addition, consider adding a "sycophancy" metric to counteract this one to prevent the results from becoming overly sensitive. We should ensure the responses match the desired tone for a professional interaction in this case.

- Adjust **helpfulness** evaluation:
    - Why: Most scores are `Y` which indicates the metric is not sensitive enough. We have overridden the custom prompt for this metric but our definition is so generic that it's likely not at all useful.
    - How: Assess exactly what the intention of this metric is and either use the built-in prompt or take time to carefully craft what "helpful" means in the context of our app. We could, for example, update the prompt to require citing the main document source and defining any terms.


---
NOTE: 

Latency Percentiles:
- P50 (Median Latency): 50% of requests are faster than this value. This represents the typical user experience.
- P90 (90th Percentile Latency): 90% of requests are faster than this value. The slowest 10% of users experience worse performance.
- P99 (99th Percentile Latency): Only 1% of requests take longer than this value. This represents near-worst-case performance, the tail that the slowest requests experience.

How to read:
- If your P50 is great but P90 is high, many users experience occasional slowness.
- If P99 is too high, a small percentage of users are facing serious performance issues.
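
As a worked example, the sketch below computes these percentiles with the nearest-rank method over the `execution_time` values from the ten rows displayed in the results table above. The `percentile_nearest_rank` helper is hypothetical, the values are assumed to be in seconds, and the method is an assumption: LangSmith's reported P50 of `5.54` is computed over the full experiment and may use a different interpolation, so the numbers need not match.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Return the smallest value such that at least pct% of values are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank in the sorted list
    return ordered[rank - 1]


# execution_time values from the ten displayed rows (assumed to be seconds).
execution_times = [5.27885, 6.047591, 5.195282, 4.586608, 5.480681,
                   8.395553, 11.28493, 8.46801, 7.772443, 2.341338]

for pct in (50, 90, 99):
    print(f"P{pct}: {percentile_nearest_rank(execution_times, pct):.2f}")
```

With only ten samples, P90 and P99 each rest on a single slow request, so tail percentiles are only meaningful on much larger runs.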

[CITE: Medium](https://medium.com/@jfindikli/the-ultimate-guide-to-faster-api-response-times-p50-p90-p99-latencies-0fb60f0a0198)
