# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load the synthetic data into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, it would not be nearly as easy to get a high-quality early signal on our application's performance.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [None]:
#!pip install -qU ragas==0.2.10


In [None]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mayankshah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/mayankshah/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
# import os
# import getpass

# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGCHAIN_API_KEY")
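If a key is missing from the `.env` file, `os.getenv` returns `None` and the assignment into `os.environ` fails with a `TypeError`. A minimal sketch of a pre-flight check follows; `missing_env_vars` is a hypothetical helper written for this illustration, not part of any library, and a plain dict stands in for `os.environ` so the demo does not touch real keys.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# A plain dict stands in for os.environ here; the values are made up.
example_env = {"OPENAI_API_KEY": "example-value", "LANGCHAIN_API_KEY": ""}
print(missing_env_vars(["OPENAI_API_KEY", "LANGCHAIN_API_KEY"], example_env))
```

Calling `missing_env_vars(["OPENAI_API_KEY", "LANGCHAIN_API_KEY"])` right after `load_dotenv()` would report which keys still need to be added before the rest of the notebook runs.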

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [8]:
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefully be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [4]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()
print(docs)



### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [5]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [6]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)
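To make the "nodes and relationships" idea concrete, here is a toy sketch of the same shape of data structure. The `ToyNode`, `ToyRelationship`, and `ToyGraph` classes are invented for this illustration and are not the Ragas classes; the relationship names are made up as well.

```python
from dataclasses import dataclass, field

@dataclass
class ToyNode:
    node_id: str
    properties: dict = field(default_factory=dict)

@dataclass
class ToyRelationship:
    source: str
    target: str
    kind: str

@dataclass
class ToyGraph:
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def neighbors(self, node_id):
        """Return ids of nodes connected to node_id, in either direction."""
        linked = set()
        for rel in self.relationships:
            if rel.source == node_id:
                linked.add(rel.target)
            elif rel.target == node_id:
                linked.add(rel.source)
        return sorted(linked)

graph = ToyGraph()
graph.nodes += [ToyNode("a"), ToyNode("b"), ToyNode("c")]
graph.relationships.append(ToyRelationship("a", "b", "similar_summary"))
graph.relationships.append(ToyRelationship("c", "a", "shared_entity"))
print(len(graph.nodes), len(graph.relationships), graph.neighbors("a"))
```

Multi-hop questions are built by following edges like these from one node to its neighbors, which is why the relationships matter as much as the nodes.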

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [7]:
from ragas.testset.graph import Node, NodeType

### NOTICE: We're using a subset of the data for this example - this is to keep costs/time down.
for doc in docs[:20]:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 20, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
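As a rough illustration of the cosine-similarity step, the sketch below links any two nodes whose embeddings score above a threshold. The two-dimensional vectors and the `0.9` threshold are arbitrary assumptions for the demo, and `build_edges` is a hypothetical helper; Ragas's own builders use their own settings and additional heuristics.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_edges(embeddings, threshold):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 3)))
    return edges

# Toy embeddings: node_1 and node_2 point in nearly the same direction.
toy_embeddings = {
    "node_1": [1.0, 0.0],
    "node_2": [0.9, 0.1],
    "node_3": [0.0, 1.0],
}
edges = build_edges(toy_embeddings, threshold=0.9)
print(edges)
```

Only the `node_1`/`node_2` pair clears the threshold here, so it is the only relationship created; lowering the threshold would produce a denser graph.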

In [8]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a separate name so we don't shadow the imported `default_transforms` function.
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node '4b0e80'. Skipping!
Property 'summary' already exists in node '7b9675'. Skipping!
Property 'summary' already exists in node '67216e'. Skipping!
Property 'summary' already exists in node 'd58eb5'. Skipping!
Property 'summary' already exists in node 'e3955b'. Skipping!
Property 'summary' already exists in node '149f57'. Skipping!
Property 'summary' already exists in node '08a2ac'. Skipping!
Property 'summary' already exists in node '24a240'. Skipping!
Property 'summary' already exists in node 'd8b30b'. Skipping!
Property 'summary' already exists in node '25a76d'. Skipping!
Property 'summary' already exists in node 'b99e9d'. Skipping!
Property 'summary' already exists in node '81ea8c'. Skipping!
Property 'summary' already exists in node '2887e1'. Skipping!
Property 'summary' already exists in node '13a9c7'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '4b0e80'. Skipping!
Property 'summary_embedding' already exists in node '24a240'. Skipping!
Property 'summary_embedding' already exists in node 'e3955b'. Skipping!
Property 'summary_embedding' already exists in node '08a2ac'. Skipping!
Property 'summary_embedding' already exists in node '7b9675'. Skipping!
Property 'summary_embedding' already exists in node '67216e'. Skipping!
Property 'summary_embedding' already exists in node '149f57'. Skipping!
Property 'summary_embedding' already exists in node 'd8b30b'. Skipping!
Property 'summary_embedding' already exists in node '81ea8c'. Skipping!
Property 'summary_embedding' already exists in node 'b99e9d'. Skipping!
Property 'summary_embedding' already exists in node '13a9c7'. Skipping!
Property 'summary_embedding' already exists in node 'd58eb5'. Skipping!
Property 'summary_embedding' already exists in node '25a76d'. Skipping!
Property 'summary_embedding' already exists in node '2887e1'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 40, relationships: 478)

We can save and load our knowledge graphs as follows.

In [9]:
kg.save("loan_data_kg.json")
loan_data_kg = KnowledgeGraph.load("loan_data_kg.json")
loan_data_kg

KnowledgeGraph(nodes: 40, relationships: 478)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [10]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=loan_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers is going to tackle a separate kind of query which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [11]:
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
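The weights in `query_distribution` describe what share of the testset each synthesizer should contribute. The sketch below shows one plausible way to turn such weights into whole-number sample counts (largest-remainder rounding). This is an assumption for illustration only: `allocate_counts` is a hypothetical helper, and Ragas may allocate samples differently.

```python
def allocate_counts(weights, total):
    """Split `total` samples across named weights using largest-remainder rounding."""
    raw = {name: weight * total for name, weight in weights.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = total - sum(counts.values())
    # Hand out leftover samples to the largest fractional parts first.
    by_fraction = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts

weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_counts(weights, 10)
print(counts)
```

Because 10 samples cannot be split exactly into 5 / 2.5 / 2.5, one of the two multi-hop types ends up with an extra question.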

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### ✅ Answer:

The three query synthesizers each generate a distinct style of question to test different capabilities of a Retrieval-Augmented Generation (RAG) system. Here’s a summary of each, with example questions and answers that clearly demonstrate the type of reasoning required:

---

**1. SingleHopSpecificQuerySynthesizer**  
- **Purpose:** Generates straightforward, fact-based questions that can be answered by retrieving a single piece of information from the document.  
- **Nature:** Direct and simple; requires only one “hop” to find the answer.  
- **Example:**  
  - **Question:** *Who is eligible for federal student loans?*  
  - **Answer:** *U.S. citizens or eligible non-citizens enrolled at least half-time in an eligible program.*

---

**2. MultiHopAbstractQuerySynthesizer**  
- **Purpose:** Produces conceptual or high-level reasoning questions that require synthesizing information from multiple parts of the document.  
- **Nature:** Abstract and broad; focuses on understanding relationships, causes, or the “big picture.”  
- **Example:**  
  - **Question:** *How do changes in federal policy influence both the eligibility criteria and the disbursement process for student loans?*  
  - **Answer:**  
    *Changes in federal policy can simultaneously alter who qualifies for student loans (as described in the “Eligibility Criteria” section) and modify how and when funds are distributed to students (see “Disbursement Process” section). For example, a policy might expand eligibility to more students while also introducing new disbursement schedules, requiring schools to update both their admissions and financial aid procedures. This demonstrates how policy changes can have broad, interconnected effects across multiple aspects of student aid.*

---

**3. MultiHopSpecificQuerySynthesizer**  
- **Purpose:** Creates detailed, fact-oriented questions that span multiple sections of the document, requiring the combination of several specific data points.  
- **Nature:** Precise and multi-faceted; tests the ability to accurately retrieve and connect multiple concrete facts.  
- **Example:**  
  - **Question:** *List all the forms and deadlines required for a dependent undergraduate student to complete FAFSA verification for the 2024-25 academic year.*  
  - **Answer:**  
    *A dependent undergraduate student must submit the following: (1) the FAFSA form, (2) a signed verification worksheet, and (3) copies of parent tax returns (see “Required Documents” section). The deadline for submission is 30 days after notification (see “Verification Deadlines” section).*

---

**Summary Table:**

| Synthesizer                        | Question Type         | Information Required      | Example Question                                                                 | Example Answer                                                                                                            |
|-------------------------------------|----------------------|--------------------------|----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| SingleHopSpecificQuerySynthesizer   | Simple, direct       | Single fact (“one hop”)  | Who is eligible for federal student loans?                                       | U.S. citizens or eligible non-citizens enrolled at least half-time in an eligible program.                               |
| MultiHopAbstractQuerySynthesizer    | Conceptual, broad    | Multiple, synthesized    | How do changes in federal policy influence both eligibility and disbursement?    | Explains how policy changes can affect both who qualifies and how/when funds are distributed, requiring synthesis.        |
| MultiHopSpecificQuerySynthesizer    | Detailed, multi-fact | Multiple, specific facts | List all the forms and deadlines for FAFSA verification for a dependent student. | Lists each required form and the deadline, combining details from different sections of the document.                     |

---

**Key Point:**  
- *Multi-hop* questions require the answerer to find and connect information from more than one place in the document.
- The **abstract** type focuses on explanation and synthesis (the “why” or “how” across concepts).
- The **specific** type focuses on gathering and combining concrete details (the “what,” “which,” or “when” across facts).

---


Finally, we can use our `TestsetGenerator` to generate our testset!

In [12]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How are academic years defined for different p...,"[Chapter 1 Academic Years, Academic Calendars,...","According to the context, each eligible progra...",single_hop_specifc_query_synthesizer
1,What does 34 CFR 668.3(b) specify regarding we...,[Regulatory Citations Academic year minimums: ...,34 CFR 668.3(b) pertains to weeks of instructi...,single_hop_specifc_query_synthesizer
2,What is Volume 8 in relation to clinical work ...,[Inclusion of Clinical Work in a Standard Term...,Inclusion of clinical work in a standard term ...,single_hop_specifc_query_synthesizer
3,Is the Federal Work-Study (FWS) program subjec...,[Non-Term Characteristics A program that measu...,"No, the payment period is applicable to all Ti...",single_hop_specifc_query_synthesizer
4,How does the mispelled term 'Direct Loan' impa...,[both the credit or clock hours and the weeks ...,The context explains that for the Direct Loan ...,single_hop_specifc_query_synthesizer
5,"According to the regulatory citations, what ar...","[<1-hop>\n\nChapter 1 Academic Years, Academic...",The regulatory citations specify that for cred...,multi_hop_abstract_query_synthesizer
6,How does the definition of an academic year ba...,"[<1-hop>\n\nChapter 1 Academic Years, Academic...",The definition of an academic year based on in...,multi_hop_abstract_query_synthesizer
7,How do policy requirements for term lengths an...,[<1-hop>\n\nInclusion of Clinical Work in a St...,Policy requirements for standard and nonstanda...,multi_hop_abstract_query_synthesizer
8,where appendix A and B tell about disbursement...,[<1-hop>\n\nDisbursement Timing in Subscriptio...,Disbursement timing in subscription-based prog...,multi_hop_specific_query_synthesizer
9,How do Volume 2 and Volume 8 relate to the tim...,[<1-hop>\n\nboth the credit or clock hours and...,Volume 2 discusses the requirements for the nu...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [13]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node 'cfcf4e'. Skipping!
Property 'summary' already exists in node 'f74e92'. Skipping!
Property 'summary' already exists in node '6fcb8d'. Skipping!
Property 'summary' already exists in node '452793'. Skipping!
Property 'summary' already exists in node '00a7c7'. Skipping!
Property 'summary' already exists in node 'c29d82'. Skipping!
Property 'summary' already exists in node 'c3ac27'. Skipping!
Property 'summary' already exists in node 'd5f440'. Skipping!
Property 'summary' already exists in node '612196'. Skipping!
Property 'summary' already exists in node '5148c3'. Skipping!
Property 'summary' already exists in node '8b01e9'. Skipping!
Property 'summary' already exists in node '4ce453'. Skipping!
Property 'summary' already exists in node 'dfbbc7'. Skipping!
Property 'summary' already exists in node '09eb80'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'cfcf4e'. Skipping!
Property 'summary_embedding' already exists in node '452793'. Skipping!
Property 'summary_embedding' already exists in node '6fcb8d'. Skipping!
Property 'summary_embedding' already exists in node 'f74e92'. Skipping!
Property 'summary_embedding' already exists in node '8b01e9'. Skipping!
Property 'summary_embedding' already exists in node '00a7c7'. Skipping!
Property 'summary_embedding' already exists in node 'dfbbc7'. Skipping!
Property 'summary_embedding' already exists in node '612196'. Skipping!
Property 'summary_embedding' already exists in node '5148c3'. Skipping!
Property 'summary_embedding' already exists in node '4ce453'. Skipping!
Property 'summary_embedding' already exists in node 'c29d82'. Skipping!
Property 'summary_embedding' already exists in node 'd5f440'. Skipping!
Property 'summary_embedding' already exists in node 'c3ac27'. Skipping!
Property 'summary_embedding' already exists in node '09eb80'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [14]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is Volume 2 about in the context of acade...,"[Chapter 1 Academic Years, Academic Calendars,...",The provided context does not include specific...,single_hop_specifc_query_synthesizer
1,What does 34 CFR 668.3(a) specify regarding th...,[Regulatory Citations Academic year minimums: ...,34 CFR 668.3(a) specifies the academic year mi...,single_hop_specifc_query_synthesizer
2,"What information does Volume 8, Chapter 3 prov...",[Inclusion of Clinical Work in a Standard Term...,"Volume 8, Chapter 3 offers guidance on includi...",single_hop_specifc_query_synthesizer
3,What is the role of the Federal Work-Study pro...,[Non-Term Characteristics A program that measu...,The payment period is applicable to all Title ...,single_hop_specifc_query_synthesizer
4,How does the measurement of progress in credit...,[<1-hop>\n\nInclusion of Clinical Work in a St...,The inclusion of clinical work in standard ter...,multi_hop_abstract_query_synthesizer
5,"So, like, if a program has courses that do not...",[<1-hop>\n\nInclusion of Clinical Work in a St...,Courses that do not begin and end within a ter...,multi_hop_abstract_query_synthesizer
6,so tell me how program requirements like 34 CF...,"[<1-hop>\n\nChapter 1 Academic Years, Academic...",The context explains that academic year requir...,multi_hop_abstract_query_synthesizer
7,min weeks for credit clock how many weeks for ...,"[<1-hop>\n\nChapter 1 Academic Years, Academic...",The context states that for credit hour progra...,multi_hop_abstract_query_synthesizer
8,How do Volume 2 and Volume 8 relate to the inc...,[<1-hop>\n\nInclusion of Clinical Work in a St...,Volume 2 discusses the criteria for including ...,multi_hop_specific_query_synthesizer
9,How do Appendix A and Appendix B provide guida...,[<1-hop>\n\nDisbursement Timing in Subscriptio...,Appendix A offers detailed guidance on disburs...,multi_hop_specific_query_synthesizer


Our LangSmith API key and the tracing flag were already set in the setup cells above, so we're ready to move on to LangSmith.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [16]:
from langsmith import Client

client = Client()

dataset_name = "Loan Synthetic Data v5"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Loan Synthetic Data"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [17]:
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
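The field mapping in the loop above can be checked offline before anything is uploaded. Here is a small sketch using a plain dict in place of a dataframe row; `to_langsmith_example` is a hypothetical helper and the row contents are made-up placeholders, not values from the generated testset.

```python
def to_langsmith_example(row):
    """Map one Ragas-style row onto the inputs/outputs/metadata layout used above."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }

# Placeholder row with the same column names as the Ragas dataframe.
toy_row = {
    "user_input": "placeholder question",
    "reference": "placeholder reference answer",
    "reference_contexts": ["placeholder context passage"],
    "synthesizer_name": "placeholder_synthesizer",
}
payload = to_langsmith_example(toy_row)
print(payload)
```

Keeping the `question` and `answer` key names consistent here matters later, because the evaluators read `example.inputs["question"]` and `example.outputs["answer"]`.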

## Basic RAG Chain

Time for some RAG!


In [18]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [20]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [21]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Loan RAG"
)

In [22]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [23]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
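To see roughly what the model receives once the placeholders are filled, we can format the same template string with plain Python. This is only an approximation of what `ChatPromptTemplate` does (it also wraps the text in a chat message), and the "retrieved" passages below are made-up stand-ins.

```python
RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

# Made-up stand-ins for retrieved chunks; real ones come from the retriever.
toy_chunks = ["First retrieved passage.", "Second retrieved passage."]

filled_prompt = RAG_PROMPT.format(
    context="\n\n".join(toy_chunks),
    question="What kinds of loans are available?",
)
print(filled_prompt)
```

Printing the filled prompt like this is a quick way to spot problems such as an empty context or a question that never made it into the template.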

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using `gpt-4.1-mini`. A TogetherAI alternative (Meta Llama 3.1 70B Instruct Turbo) is left commented out in the cell below in case you'd like to swap it in.

In [24]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

# from langchain_together import ChatTogether

# llm = ChatTogether(model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo")

Finally, we can set-up our RAG LCEL chain!

In [25]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
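The data flow through the chain can be pictured with ordinary functions. The sketch below is a plain-Python analogue, not LCEL itself: `toy_retriever`, `toy_prompt`, and `toy_llm` are hypothetical stand-ins that return canned text instead of querying Qdrant or calling a model.

```python
def toy_retriever(question):
    """Stand-in for `retriever`: returns a canned passage."""
    return [f"toy passage related to: {question}"]

def toy_prompt(values):
    """Stand-in for `rag_prompt`: fills a template from a dict."""
    return f"Context: {values['context']}\nQuestion: {values['question']}\n"

def toy_llm(prompt_text):
    """Stand-in for `llm`: reports the prompt size instead of calling a model."""
    return f"(toy answer based on a {len(prompt_text)}-character prompt)"

def toy_rag_chain(inputs):
    # Step 1: build the dict, mirroring the {"context": ..., "question": ...} mapping.
    values = {
        "context": toy_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Steps 2-4: prompt, model, then plain-string output.
    return str(toy_llm(toy_prompt(values)))

answer = toy_rag_chain({"question": "What kinds of loans are available?"})
print(answer)
```

The question is used twice: once to fetch context and once to fill the prompt, which is why the real chain applies `itemgetter("question")` on both branches of the dict.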

In [26]:
rag_chain.invoke({"question" : "What kinds of loans are available?"})

'The kinds of loans available mentioned in the context are:\n\n- Direct PLUS Loan (or student Federal PLUS Loan)  \n- Direct Subsidized Loan  \n- Direct Unsubsidized Loan  \n- Federal Stafford Loans (Subsidized and Unsubsidized)  \n- Federal SLS Loans  \n- Federal PLUS Loans  \n\nNote: Federal Stafford Loans, Federal SLS Loans, and Federal PLUS Loans were made under the Federal Family Education Loan (FFEL) Program before new loans under that program ended effective July 1, 2010. New loans currently come under the Direct Loan Program, which includes Direct Subsidized Loans, Direct Unsubsidized Loans, and Direct PLUS Loans.'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [27]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [28]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

empathy_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "empathy": "Is this response empathetic? Does it make the user feel like they are being heard?",
        },
        "llm" : eval_llm
    }
)
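The `prepare_data` lambda is the piece that tells the labeled evaluator which fields to compare. Here is a standalone sketch of the same mapping, using `SimpleNamespace` objects as made-up stand-ins for the LangSmith run and example (the real objects carry many more attributes).

```python
from types import SimpleNamespace

def prepare_data(run, example):
    """Same mapping as the lambda above, written as a named function."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }

# Made-up stand-ins for a LangSmith run and dataset example.
toy_run = SimpleNamespace(outputs={"output": "model answer"})
toy_example = SimpleNamespace(
    inputs={"question": "toy question"},
    outputs={"answer": "reference answer"},
)
prepared = prepare_data(toy_run, toy_example)
print(prepared)
```

If the dataset had been created with different key names (for example `query` instead of `question`), this is the place where a `KeyError` would surface.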

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`:
- `labeled_helpfulness_evaluator`:
- `empathy_evaluator`:

##### ✅ Answer:

1. qa_evaluator:
Evaluates factual correctness. It compares the generated answer with a reference (ground truth) answer to determine if the model's response is accurate and aligns with the expected content.

2. labeled_helpfulness_evaluator:
Assesses helpfulness and relevance. It checks whether the response actually helps the user by considering both the question and the correct answer. The focus is on usefulness, clarity, and whether the information satisfies the user's need.

3. empathy_evaluator:
Measures emotional intelligence. It evaluates whether the response shows empathy — i.e., does the answer acknowledge the user’s emotions, concerns, or context in a way that feels human and supportive?


## LangSmith Evaluation

In [29]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        empathy_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'crazy-eye-34' at:
https://smith.langchain.com/o/3c2c7006-57b9-4cbe-911e-6f73b4734883/datasets/1d1745e4-cabf-4b70-9eab-a082bf5e8a08/compare?selectedSessions=4de66612-127f-4885-b3e9-a4658a33c267




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.empathy,execution_time,example_id,id
0,Volume 2 and Volume 7 what about academic year...,Based on the provided context:\n\n- Volume 2 (...,,"Based on Volume 2, academic years must have at...",1,1,0,8.77266,c4738a5e-56b1-4851-86a3-5d8285a1dfb2,ee92b038-e6d9-4014-a800-1aa8b93dc965
1,Volume 8 include clinical work in standard ter...,"Based on the provided context, Volume 8 discus...",,"According to Volume 8, if clinical work is inc...",1,1,0,6.268852,7471df04-d19e-4c77-905f-4822b7faefdf,bbcfaee3-1fe3-4af3-89ff-921af5961876
2,How do Appendix A and Appendix B provide guida...,Appendix B provides guidance on coursework com...,,Appendix A offers detailed guidance on disburs...,1,1,0,8.064445,66d546ce-a88a-4ded-8fce-f8bd9ce94e1f,83c5c9e7-d5de-493f-9cad-84adaa56e10e
3,How do Volume 2 and Volume 8 relate to the inc...,"Based on the provided context:\n\nVolume 2, Ch...",,Volume 2 discusses the criteria for including ...,1,0,0,8.252661,d46e2c68-806e-46c6-805b-81c6a4d51748,021a8ad9-44fb-455b-8a4f-902bc9fe3f14
4,min weeks for credit clock how many weeks for ...,Based on the provided context:\n\n- The minimu...,,The context states that for credit hour progra...,1,1,0,7.984969,e42bb014-9941-4629-995f-8fb4c40311d9,e2b11970-07d0-4e1a-9442-f6b1c651590a
5,so tell me how program requirements like 34 CF...,Based on the provided context:\n\nProgram requ...,,The context explains that academic year requir...,1,1,0,25.157297,3ca07855-f687-41dd-b042-beedc65a4808,f52d8beb-997c-4597-beea-f917d0ff4c2d
6,"So, like, if a program has courses that do not...",Based on the context provided:\n\nCourses that...,,Courses that do not begin and end within a ter...,0,0,0,6.187766,129b015d-afde-4324-8591-3474aec96d4c,455380f1-990c-4b06-a30b-d3dfe9b3ee60
7,How does the measurement of progress in credit...,Based on the provided context:\n\n- Progress i...,,The inclusion of clinical work in standard ter...,1,1,0,6.348761,4cc0d706-6afb-40cd-b8e1-33ada2fd0856,d529055a-65ad-4ec0-b586-f655be970ed3
8,What is the role of the Federal Work-Study pro...,The Federal Work-Study (FWS) Program is an exc...,,The payment period is applicable to all Title ...,1,1,0,2.570936,e243c630-c320-4449-94b1-fe766953bfbf,343fd2f1-98f8-485a-b445-039631dbec56
9,"What information does Volume 8, Chapter 3 prov...","Volume 8, Chapter 3 provides additional guidan...",,"Volume 8, Chapter 3 offers guidance on includi...",0,0,0,2.627586,e810858a-a33e-484e-a25e-67bf46607848,ce732bc8-2af0-4c12-a0ca-72fb5bcd7fc2


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Add an empathy instruction to the prompt (our "dope" prompt augmentation)
- Use larger chunks
- Upgrade the embedding model to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [30]:
EMPATHY_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the question using empathy and kindness, and make sure the user feels heard.

Context: {context}
Question: {question}
"""

empathy_rag_prompt = ChatPromptTemplate.from_template(EMPATHY_RAG_PROMPT)

In [31]:
rag_documents = docs

In [32]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

##### ✅ Answer:

Modifying chunk size affects RAG performance in several ways:

1. Information Completeness
   - Larger chunks (1000 vs. 500 characters) capture more complete context and relationships between concepts, reducing information fragmentation.
   - Smaller chunks may split related information across multiple pieces, making it harder to retrieve comprehensive answers.

2. Retrieval Quality
   - Larger chunks give the embedding model more semantic context, potentially improving relevance matching.
   - Smaller chunks may be too granular, losing contextual relationships needed for accurate retrieval.

3. Response Coherence
   - Larger chunks let the LLM see more complete information in a single retrieved piece, leading to more coherent and comprehensive responses.
   - Smaller chunks may require the model to piece together fragmented information, potentially missing connections.

4. Retrieval Efficiency
   - Larger chunks mean fewer total chunks in the vector database, potentially reducing noise in retrieval results.
   - Smaller chunks create more granular options but may introduce more irrelevant results.

5. Context Window Utilization
   - Larger chunks provide more substantial information per retrieved piece, so the same number of retrieved chunks fills more of the LLM's context window.
   - Smaller chunks may spend context space on redundant or incomplete information.

The trade-off is that a larger chunk's embedding has to summarize more text, so very large chunks can match a specific query less precisely.

In this specific case, increasing from 500 to 1000 characters should help keep related concepts about loan types, eligibility criteria, and regulations together in single retrievable units, which gives the model a better chance of producing complete and accurate answers.
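To make the chunk-count effect concrete, here is a minimal sketch using a naive fixed-width splitter. It is not `RecursiveCharacterTextSplitter` (which also respects separators such as paragraphs and sentences); `split_fixed`, the 3,000-character document, and the overlap of 50 for the smaller setting are all assumptions made for illustration.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive splitter: each chunk starts (chunk_size - chunk_overlap) characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Hypothetical 3,000-character document (digits only, so it is easy to inspect).
document = "".join(str(i % 10) for i in range(3000))

small = split_fixed(document, chunk_size=500, chunk_overlap=50)
large = split_fixed(document, chunk_size=1000, chunk_overlap=50)

print(len(small), len(large))  # 7 4

# With a fixed number of retrieved chunks (k = 4 is assumed here),
# larger chunks put up to twice as many characters in front of the LLM.
k = 4
print(k * 500, k * 1000)  # 2000 4000
```

Fewer, larger chunks mean fewer candidates to rank and more text per retrieved hit, which is the mechanism behind points 1, 4, and 5 above.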

In [33]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

##### ✅ Answer:

Switching the embedding model from `text-embedding-3-small` to `text-embedding-3-large` can improve retrieval for the following reasons:

  1. Increased dimensional capacity (1,536 → 3,072 dimensions) allows the model to capture more nuanced relationships between complex financial concepts in federal loan documentation.

  2. Enhanced semantic understanding of regulatory terminology enables better matching between user queries and relevant document sections containing loan eligibility criteria, disbursement rules, and academic requirements.

  3. Improved benchmark performance (62.3% → 64.6% average on MTEB, per OpenAI's published figures) suggests more accurate retrieval of contextually relevant information, although a general benchmark does not guarantee the same gain on specialized financial regulation content.

  4. Better handling of multi-hop queries that require understanding connections between different loan types, eligibility requirements, and regulatory conditions, which is critical for providing accurate financial guidance.

  5. Richer semantic representations preserve the precise relationships between interconnected financial concepts, reducing retrieval errors that could lead to incorrect loan advice.

  The larger embedding model's greater capacity to represent complex financial relationships should improve the RAG system's ability to retrieve accurate, relevant information from federal student aid documentation.
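The reason the embedding model matters is that the retriever ranks chunks purely by vector similarity to the query, so different vectors produce a different ranking. A minimal sketch of that ranking step, assuming cosine similarity and using made-up 3-dimensional vectors in place of real 1,536- or 3,072-dimensional embeddings (`cosine_similarity`, `top_k`, and all vector values are hypothetical):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up vectors standing in for embedded chunks.
doc_vecs = {
    "subsidized_loans": [0.9, 0.1, 0.0],
    "plus_loans": [0.7, 0.3, 0.1],
    "academic_calendar": [0.0, 0.2, 0.9],
}
# Made-up vector standing in for an embedded loan-related question.
query_vec = [1.0, 0.2, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # ['subsidized_loans', 'plus_loans']
```

A model that places related chunks closer to the query changes which chunks land in the top `k`, and therefore what context the LLM sees.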


In [34]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Loan Data for RAG"
)

In [35]:
retriever = vectorstore.as_retriever()

Now we set up our new and improved empathy RAG chain.

In [36]:
empathy_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | empathy_rag_prompt | llm | StrOutputParser()
)
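For readers new to LCEL, the data flow of this chain can be sketched in plain Python. This is only an analogy under stated assumptions: `fake_retriever`, `fake_llm`, `run_chain`, and the shortened `PROMPT` are hypothetical stand-ins, not the LangChain objects used above.

```python
from operator import itemgetter

# Shortened stand-in for the prompt template; same two placeholders.
PROMPT = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for `retriever`: returns one placeholder chunk."""
    return [f"(chunk retrieved for: {question})"]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for `llm | StrOutputParser()`: echoes what it was given."""
    return "LLM saw:\n" + prompt


def run_chain(inputs: dict) -> str:
    # Step 1: the dict step pulls "question" out of the input twice,
    # once to feed the retriever and once to pass through unchanged.
    question = itemgetter("question")(inputs)
    prompt_vars = {"context": fake_retriever(question), "question": question}
    # Step 2: the prompt template fills its placeholders.
    prompt = PROMPT.format(**prompt_vars)
    # Step 3: the model produces a string answer.
    return fake_llm(prompt)


print(run_chain({"question": "What kinds of loans are available?"}))
```

The input must be a dict with a `"question"` key, which matches the `inputs.question` column of the test dataset.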

Let's test it on the same question that we asked before.

In [50]:
empathy_rag_chain.invoke({"question" : "What kinds of loans are available?"})

"Thank you for your question! Based on the information provided, it looks like there are several types of loans available to help students and their families with college costs:\n\n1. **Direct Subsidized Loans** – These loans are available based on the student's financial need and have a maximum annual limit depending on the student's year and dependency status.\n\n2. **Direct Unsubsidized Loans** – These loans are available to students regardless of financial need and can supplement the subsidized loan amounts. They can also be used if the parent of a dependent student is ineligible for a Direct PLUS Loan.\n\n3. **Direct PLUS Loans** – These loans are available to parents of dependent students to help cover the student's cost of attendance, assuming the parent meets eligibility requirements. There is no fixed loan limit for PLUS loans, but they cannot exceed the student's cost of attendance minus other financial aid.\n\nIt's clear that the loan program offers flexibility to accommodat

Finally, we can evaluate the new chain on the same test set!

In [51]:
evaluate(
    empathy_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        empathy_evaluator
    ],
    metadata={"revision_id": "empathy_rag_chain"},
)

View the evaluation results for experiment: 'flowery-smile-47' at:
https://smith.langchain.com/o/3c2c7006-57b9-4cbe-911e-6f73b4734883/datasets/1d1745e4-cabf-4b70-9eab-a082bf5e8a08/compare?selectedSessions=65ac8386-34ce-47eb-8122-55a8f6b67036




Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.empathy,execution_time,example_id,id
0,Volume 2 and Volume 7 what about academic year...,Thank you for your thoughtful question. Based ...,,"Based on Volume 2, academic years must have at...",0,0,1,5.376172,c4738a5e-56b1-4851-86a3-5d8285a1dfb2,b81b1dbd-66ed-4315-883b-2a140740ca60
1,Volume 8 include clinical work in standard ter...,Thank you for your thoughtful question. Based ...,,"According to Volume 8, if clinical work is inc...",1,1,1,5.525478,7471df04-d19e-4c77-905f-4822b7faefdf,763aa655-7f35-4846-ab5a-9842fcdad4a7
2,How do Appendix A and Appendix B provide guida...,Thank you for your thoughtful question. It’s c...,,Appendix A offers detailed guidance on disburs...,0,0,1,6.693762,66d546ce-a88a-4ded-8fce-f8bd9ce94e1f,9999a21c-3e96-49d5-8e8f-33422d13cae2
3,How do Volume 2 and Volume 8 relate to the inc...,I truly appreciate you reaching out with such ...,,Volume 2 discusses the criteria for including ...,0,0,1,5.105922,d46e2c68-806e-46c6-805b-81c6a4d51748,844587b8-f86f-4545-93c8-42b4c9fca27b
4,min weeks for credit clock how many weeks for ...,Thank you for your thoughtful question. Based ...,,The context states that for credit hour progra...,0,0,1,7.327528,e42bb014-9941-4629-995f-8fb4c40311d9,13cc8f11-3256-4647-8406-73aa474d26f6
5,so tell me how program requirements like 34 CF...,Thank you for your thoughtful question—it’s gr...,,The context explains that academic year requir...,1,1,1,6.721323,3ca07855-f687-41dd-b042-beedc65a4808,1af4e973-8e1c-4d96-b254-bedb629531e2
6,"So, like, if a program has courses that do not...",Thank you for your thoughtful question—it's cl...,,Courses that do not begin and end within a ter...,1,1,1,7.22244,129b015d-afde-4324-8591-3474aec96d4c,7892ce75-c7d0-49b2-b80c-839ec2b807ec
7,How does the measurement of progress in credit...,Thank you for your thoughtful question. Based ...,,The inclusion of clinical work in standard ter...,1,1,1,6.87469,4cc0d706-6afb-40cd-b8e1-33ada2fd0856,0496f579-e614-44ed-993b-4ce06d10eae2
8,What is the role of the Federal Work-Study pro...,Thank you for your question—it's important to ...,,The payment period is applicable to all Title ...,1,1,1,7.154166,e243c630-c320-4449-94b1-fe766953bfbf,a8e7dd54-9b2a-48e0-b12b-33a924a751dd
9,"What information does Volume 8, Chapter 3 prov...",Thank you for your thoughtful question. Based ...,,"Volume 8, Chapter 3 offers guidance on includi...",1,0,1,4.513546,e810858a-a33e-484e-a25e-67bf46607848,a925ec50-3873-45e1-8376-5e3c8ac0d123


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

##### ✅ Answer:

<table style="width:100%;">
  <tr>
    <td style="width:50%; vertical-align:top;">
      <img src="assets/Eval_before_param_change.png" style="width:100%;">
    </td>
    <td style="width:50%; vertical-align:top;">
      <img src="assets/Eval_after_param_change.png" style="width:100%;">
    </td>
  </tr>
</table>

<img src="assets/Langsmith_eval_comparision.png" style="width:100%; margin-top: 20px;">


The screenshots above compare the two chains **before and after the changes** (empathy prompt, 1000-character chunks, and `text-embedding-3-large`). The first two images display the raw outputs side by side (`Eval_before_param_change.png` and `Eval_after_param_change.png`), and the third image (`Langsmith_eval_comparision.png`) highlights the differences in key evaluation metrics such as **correctness**, **empathy**, and **helpfulness**.

**Correctness dropped** from **0.83 to 0.58** after the parameter change. While the newer chain sounded more natural and friendly, it often failed to provide **factually complete or accurate answers**. Many responses included empathetic phrases but didn’t fully address the question or align with the reference outputs. This suggests that the model’s focus shifted away from precision.

On the other hand, **empathy improved drastically**—from **0.0 to a full 1.0**. Every response in the updated chain began with some form of acknowledgment or supportive language, such as *“Thank you for your thoughtful question.”* This clearly shows that the chain was **optimized to sound more human and emotionally attuned** to the user.

**Helpfulness saw a slight decline**, going from **0.58 to 0.50**. While the tone became more pleasant, a few answers lacked **actionable or relevant information**. In some cases, the model was polite but didn’t actually provide a useful response. This reinforces the idea that **being friendly doesn’t always mean being helpful**.

**Latency improved slightly**, with the P50 dropping from **6.31s to 6.11s**. The difference is small enough that it may be ordinary run-to-run variation rather than an effect of the changes. It didn't have a significant impact on the overall evaluation, but it's still worth noting.

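Overall, the updated chain became much more **empathetic and user-friendly**, but it lost some ground in **factual accuracy and usefulness**. Because the prompt, chunk size, and embedding model all changed at once, this comparison cannot say which change caused the drop; isolating that would require evaluating each change separately. This trade-off may work well in **casual or conversational settings**, but for domains like **compliance or education**, **correctness should remain a top priority**.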
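As a cross-check, the per-example feedback can be compared directly. The sketch below transcribes the `feedback.*` columns from the ten rows shown in the two results tables above (rows are in the same `example_id` order in both). The means it produces are for those ten rows only and differ from the aggregates quoted from the LangSmith screenshots, which may cover a different set of runs; the helper name `mean` is an assumption for this example.

```python
# Feedback scores transcribed from the ten displayed rows of each experiment.
before = {
    "correctness": [1, 1, 1, 1, 1, 1, 0, 1, 1, 0],
    "helpfulness": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    "empathy":     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
after = {
    "correctness": [0, 1, 0, 0, 0, 1, 1, 1, 1, 1],
    "helpfulness": [0, 1, 0, 0, 0, 1, 1, 1, 1, 0],
    "empathy":     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def mean(scores: list[int]) -> float:
    return sum(scores) / len(scores)


for metric in before:
    print(f"{metric}: {mean(before[metric]):.2f} -> {mean(after[metric]):.2f}")
# correctness: 0.80 -> 0.60
# helpfulness: 0.70 -> 0.50
# empathy: 0.00 -> 1.00

# Paired view: which rows flipped on correctness between the two chains?
pairs = list(zip(before["correctness"], after["correctness"]))
regressed = [i for i, (b, a) in enumerate(pairs) if b == 1 and a == 0]
improved = [i for i, (b, a) in enumerate(pairs) if b == 0 and a == 1]
print("regressed:", regressed)  # regressed: [0, 2, 3, 4]
print("improved:", improved)    # improved: [6, 9]
```

The paired view shows that the change was not uniformly worse: four rows lost correctness while two gained it, so inspecting those specific rows is a good starting point for explaining the shift.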



