# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, getting a high-quality early signal on our application's performance would be much harder.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging several technologies in this pipeline:

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent OS-specific errors, we'll import NLTK and download the packages needed for correct data handling.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Inés\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Inés\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
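Re-running the setup cells prompts for each key again, even when it is already set. A minimal sketch of one way around that, assuming a hypothetical helper named `ensure_env` (not part of the original notebook) that only prompts when the variable is missing:

```python
import getpass
import os


def ensure_env(name: str, prompt: str) -> str:
    """Return the value of env var `name`, prompting only when it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so the key never shows up in cell output.
        value = getpass.getpass(prompt)
        os.environ[name] = value
    return value
```

Usage would look like `ensure_env("OPENAI_API_KEY", "OpenAI API Key:")` in place of the direct assignment.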

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data, which should hopefully be familiar at this point since it's our Use-Case Data!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

### NOTE: We add every loaded document here - to keep costs/time down, iterate over a slice such as `docs[:20]` instead.
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 64, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

The default transformations depend on the corpus length. In our case they include:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finds the headlines in each document
- Theme Extractor -> extracts broad themes about the documents

Ragas then uses cosine similarity and heuristics over the embeddings of the above transformations to construct relationships between the nodes.
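To make the relationship-building step more concrete, here is a minimal sketch of linking nodes by cosine similarity. It is an illustration only: the function names, the toy two-dimensional embeddings, and the `0.9` threshold are assumptions, and it does not reproduce the thresholds or extra heuristics Ragas applies internally.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Link every pair of nodes whose embeddings are at least `threshold` similar."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 3)))
    return edges


# Toy embeddings (hypothetical): node_a and node_b point in nearly the same direction.
toy_embeddings = {
    "node_a": [1.0, 0.0],
    "node_b": [0.9, 0.1],
    "node_c": [0.0, 1.0],
}
edges = build_relationships(toy_embeddings, threshold=0.9)
print(edges)
```

Only the `node_a`/`node_b` pair clears the threshold here, which is the same idea that produces the `relationships` count on the knowledge graph.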

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so we don't shadow the imported `default_transforms` function on re-runs.
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to ap

Applying SummaryExtractor:   0%|          | 0/37 [00:00<?, ?it/s]

Property 'summary' already exists in node '5e1a3d'. Skipping!
Property 'summary' already exists in node '85bb23'. Skipping!
Property 'summary' already exists in node 'fd40db'. Skipping!
Property 'summary' already exists in node '29a2a7'. Skipping!
Property 'summary' already exists in node '719a63'. Skipping!
Property 'summary' already exists in node '57c446'. Skipping!
Property 'summary' already exists in node '739bb9'. Skipping!
Property 'summary' already exists in node '4609db'. Skipping!
Property 'summary' already exists in node '469ec8'. Skipping!
Property 'summary' already exists in node '510688'. Skipping!
Property 'summary' already exists in node 'e334e0'. Skipping!
Property 'summary' already exists in node '88faa2'. Skipping!
Property 'summary' already exists in node 'dd4af3'. Skipping!
Property 'summary' already exists in node '3c6da4'. Skipping!
Property 'summary' already exists in node '05b4cd'. Skipping!
Property 'summary' already exists in node 'ae6d00'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/10 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/51 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'e334e0'. Skipping!
Property 'summary_embedding' already exists in node '85bb23'. Skipping!
Property 'summary_embedding' already exists in node '57c446'. Skipping!
Property 'summary_embedding' already exists in node 'fd40db'. Skipping!
Property 'summary_embedding' already exists in node '5e1a3d'. Skipping!
Property 'summary_embedding' already exists in node '510688'. Skipping!
Property 'summary_embedding' already exists in node '29a2a7'. Skipping!
Property 'summary_embedding' already exists in node '719a63'. Skipping!
Property 'summary_embedding' already exists in node '4609db'. Skipping!
Property 'summary_embedding' already exists in node '739bb9'. Skipping!
Property 'summary_embedding' already exists in node '469ec8'. Skipping!
Property 'summary_embedding' already exists in node '88faa2'. Skipping!
Property 'summary_embedding' already exists in node '3c6da4'. Skipping!
Property 'summary_embedding' already exists in node 'dd4af3'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 87, relationships: 678)

We can save and load our knowledge graphs as follows.

In [11]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 87, relationships: 678)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [14]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [15]:
from ragas.testset.synthesizers import (
    default_query_distribution,
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

# Each entry pairs a synthesizer with the share of the testset it should produce.
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
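To get a feel for what these weights mean for a small testset, the sketch below splits a sample budget across weighted categories using largest-remainder rounding. This is an illustration only: `allocate_counts` is a hypothetical helper, and Ragas's own allocation logic may round differently.

```python
import math


def allocate_counts(weights, total):
    """Split `total` samples across named weights using largest-remainder rounding."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    exact = {name: w * total for name, w in weights.items()}
    counts = {name: math.floor(v) for name, v in exact.items()}
    leftover = total - sum(counts.values())
    # Hand leftover samples to the largest fractional parts (ties keep insertion order).
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_counts(weights, total=10)
print(counts)
```

With a budget of 10, the two 0.25 weights cannot both receive 2.5 samples, so some rounding rule is unavoidable and the realized split will only approximate the requested one.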

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

#### ✅ Answer

From: https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/

- SingleHopSpecific: makes a straight, factual question you can answer from one doc/chunk.
E.g., “What’s the capital of **Spain**?”

- MultiHopAbstract: makes a big-picture question that needs at least 2 sources and some synthesis, without naming specific people/things.
E.g., “What common pitfalls appear when deploying multi-agent systems?”

- MultiHopSpecific: makes a concrete, multi-source question that links specific entities/details across docs—mostly lookup + stitching.
E.g., “Which **2023** papers mention **LangGraph** and evaluate on HotpotQA?”


In summary:

**Single** = one doc. **Multi-Abstract** = themes across docs. **Multi-Specific** = named facts across docs. **QuerySynthesizer** = question generator


Finally, we can use our `TestsetGenerator` to generate our testset!

In [16]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How do Wiggers relate to the societal impacts ...,[Introduction ChatGPT launched in November 202...,The provided context does not mention or defin...,single_hop_specifc_query_synthesizer
1,What are the key insights about ChatGPT usage ...,[Table 1: ChatGPT daily message counts (millio...,"Table 1 reports that in June 2024, ChatGPT dai...",single_hop_specifc_query_synthesizer
2,How does Deming's work relate to the economic ...,"[Doing, and that Asking messages are consisten...",The context suggests that ChatGPT likely impro...,single_hop_specifc_query_synthesizer
3,How ChatGPT help workers and who get most benefit,"[How does ChatGPT provide economic value, and ...",ChatGPT likely improves worker output by provi...,single_hop_specifc_query_synthesizer
4,What is Standard Occupation Classification and...,[Variation by Occupation Figure 23 presents va...,Variation by Occupation Figure 23 presents var...,single_hop_specifc_query_synthesizer
5,How does ChatGPT's role in supporting professi...,"[<1-hop>\n\nDoing, and that Asking messages ar...",ChatGPT likely enhances worker output in profe...,multi_hop_abstract_query_synthesizer
6,how chatgpt help worker productivity and impac...,"[<1-hop>\n\nDoing, and that Asking messages ar...",chatgpt likely improves worker output by provi...,multi_hop_abstract_query_synthesizer
7,How do the use of ChatGPT and its LLM technolo...,"[<1-hop>\n\nDoing, and that Asking messages ar...","ChatGPT, based on Large Language Models (LLMs)...",multi_hop_abstract_query_synthesizer
8,Based on the analysis of ChatGPT's role in pro...,[<1-hop>\n\nHow does ChatGPT provide economic ...,The context indicates that ChatGPT enhances wo...,multi_hop_specific_query_synthesizer
9,How do Talamas' findings relate to ChatGPT's r...,[<1-hop>\n\nHow does ChatGPT provide economic ...,Talamas' findings are consistent with the idea...,multi_hop_specific_query_synthesizer
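A quick way to check how closely the generated testset follows the requested distribution is to count rows per synthesizer. The sketch below does this with the standard library, using the `synthesizer_name` values copied from the table above (spelling exactly as it appears in the output):

```python
from collections import Counter

# synthesizer_name column copied from the table above, one entry per generated row.
synthesizer_names = (
    ["single_hop_specifc_query_synthesizer"] * 5
    + ["multi_hop_abstract_query_synthesizer"] * 3
    + ["multi_hop_specific_query_synthesizer"] * 2
)
counts = Counter(synthesizer_names)
shares = {name: n / len(synthesizer_names) for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

On the live object, `testset.to_pandas()["synthesizer_name"].value_counts(normalize=True)` should give the same shares.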


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [13]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to ap

Applying SummaryExtractor:   0%|          | 0/37 [00:00<?, ?it/s]

Property 'summary' already exists in node '718f36'. Skipping!
Property 'summary' already exists in node '2cf0e3'. Skipping!
Property 'summary' already exists in node '6ee68c'. Skipping!
Property 'summary' already exists in node 'cc895b'. Skipping!
Property 'summary' already exists in node 'aec680'. Skipping!
Property 'summary' already exists in node '0510e5'. Skipping!
Property 'summary' already exists in node 'e7f35b'. Skipping!
Property 'summary' already exists in node 'd4d9dc'. Skipping!
Property 'summary' already exists in node 'f63036'. Skipping!
Property 'summary' already exists in node '6f01a7'. Skipping!
Property 'summary' already exists in node 'ab85cc'. Skipping!
Property 'summary' already exists in node 'fdb03f'. Skipping!
Property 'summary' already exists in node '359244'. Skipping!
Property 'summary' already exists in node '0ef17c'. Skipping!
Property 'summary' already exists in node '828b32'. Skipping!
Property 'summary' already exists in node 'd7765c'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/10 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/51 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '6ee68c'. Skipping!
Property 'summary_embedding' already exists in node '2cf0e3'. Skipping!
Property 'summary_embedding' already exists in node 'e7f35b'. Skipping!
Property 'summary_embedding' already exists in node 'aec680'. Skipping!
Property 'summary_embedding' already exists in node '718f36'. Skipping!
Property 'summary_embedding' already exists in node '0510e5'. Skipping!
Property 'summary_embedding' already exists in node 'd4d9dc'. Skipping!
Property 'summary_embedding' already exists in node 'ab85cc'. Skipping!
Property 'summary_embedding' already exists in node 'f63036'. Skipping!
Property 'summary_embedding' already exists in node 'cc895b'. Skipping!
Property 'summary_embedding' already exists in node '6f01a7'. Skipping!
Property 'summary_embedding' already exists in node '0ef17c'. Skipping!
Property 'summary_embedding' already exists in node '359244'. Skipping!
Property 'summary_embedding' already exists in node 'fdb03f'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [11]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"Wha is the Pew Research Center, 2025 report ab...",[Introduction ChatGPT launched in November 202...,"The context mentions Pew Research Center, 2025...",single_hop_specifc_query_synthesizer
1,How does OpenAI's development and deployment o...,[Table 1: ChatGPT daily message counts (millio...,The context describes that ChatGPT's daily mes...,single_hop_specifc_query_synthesizer
2,What are SOC2 codes 11 used for?,[Variation by Occupation Figure 23 presents va...,Variation by Occupation Figure 23 presents var...,single_hop_specifc_query_synthesizer
3,ChatGPT is like what kinda AI thing and how do...,[Conclusion This paper studies the rapid growt...,The context states that ChatGPT launched in No...,single_hop_specifc_query_synthesizer
4,how message volume compare Jun 2024 and Jun 20...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,"In June 2024, total messages were 451 million ...",multi_hop_abstract_query_synthesizer
5,Based on the comparison of message volume grow...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,The data shows that non-work messages grew sig...,multi_hop_abstract_query_synthesizer
6,Based on the data showing that non-work messag...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,"The increasing volume of non-work messages, wh...",multi_hop_abstract_query_synthesizer
7,Based on the data showing that non-work messag...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,"The increase in non-work messages, which now m...",multi_hop_abstract_query_synthesizer
8,wHAT US rELATED tHEMES aRE cONNECTED tO tHE uS...,[<1-hop>\n\nConclusion This paper studies the ...,The context indicates that ChatGPT usage has g...,multi_hop_specific_query_synthesizer
9,Based on the rapid growth of ChatGPT usage in ...,[<1-hop>\n\nConclusion This paper studies the ...,"The first segment indicates that by July 2025,...",multi_hop_specific_query_synthesizer


We already provided our LangSmith API key and set tracing to "true" during setup, so we can move straight on to LangSmith.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [21]:
from langsmith import Client

client = Client()

dataset_name = "Use Case Synthetic Data - AIE8 - Inés"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [22]:
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
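The column-to-field mapping used in the loop above can be pulled into a small function, which makes it easy to check without calling LangSmith. This is a sketch: `to_langsmith_example` is a hypothetical helper, and the sample row uses placeholder values rather than real generated data.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row (dict-like) to LangSmith example fields."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row with the same column names as the Ragas dataframe.
sample_row = {
    "user_input": "What are SOC2 codes 11 used for?",
    "reference": "placeholder reference answer",
    "reference_contexts": ["placeholder context chunk"],
}
example = to_langsmith_example(sample_row)
print(example["inputs"], example["outputs"])
```

The `question` and `answer` keys matter later: the evaluators read `example.inputs["question"]` and `example.outputs["answer"]`.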

## Basic RAG Chain

Time for some RAG!


In [23]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [25]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [26]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG"
)

In [27]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [28]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [29]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [30]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
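If the pipe syntax is unfamiliar, the data flow of the chain above can be mimicked with plain functions. This is an analogy only, not how LangChain implements LCEL: every function below is a hypothetical stand-in, and no model or vector store is called.

```python
def fake_retriever(question):
    # Stand-in for the vector store retriever: returns canned context chunks.
    return ["chunk about usage", "chunk about growth"]


def build_prompt(values):
    # Stand-in for the prompt template: fills in the context and question slots.
    return f"Context: {values['context']}\nQuestion: {values['question']}\n"


def fake_llm(prompt):
    # Stand-in for the chat model: echoes the last line of the prompt it received.
    return {"content": "ANSWER TO -> " + prompt.strip().splitlines()[-1]}


def parse_output(message):
    # Stand-in for the string output parser: pulls the text out of the message.
    return message["content"]


def run_chain(inputs):
    # Step 1: the dict fans the same input out to two branches.
    values = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Steps 2-4: prompt -> model -> parser, each consuming the previous output.
    return parse_output(fake_llm(build_prompt(values)))


result = run_chain({"question": "What are people doing with AI?"})
print(result)
```

The key point is the first step: the question is used twice, once to retrieve context and once verbatim, and both results feed the prompt.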

In [31]:
rag_chain.invoke({"question" : "What are people doing with AI these days?"})

'Based on the provided context, people are using AI, particularly generative AI like ChatGPT, in a variety of ways including:\n\n- Performing workplace tasks either by augmenting or automating human labor.\n- Producing writing, software code, spreadsheets, and other digital products.\n- Seeking information and advice, similar to traditional web search engines but with added flexibility.\n- Using AI as co-workers that produce output or as co-pilots that provide advice and improve human problem-solving productivity.\n- Engaging in activities related to self-expression, relationships, personal reflection, games, and role play, though these are smaller portions of overall usage.\n- Therapy and companionship use cases have also been identified as prevalent.\n\nThus, people are leveraging AI for both professional tasks that enhance productivity and for personal or expressive purposes.'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [32]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [33]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dopeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this response dope, lit, cool, or is it just a generic response?",
        },
        "llm" : eval_llm
    }
)

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

#### ✅ Answer

As per LangSmith docs: https://docs.smith.langchain.com/reference/sdk_reference/langchain_evaluators

All three evaluators score **strings**, since each one is a **LangChainStringEvaluator**:

- `qa_evaluator`: **correctness**, i.e. whether the response matches the reference answer.
- `labeled_helpfulness_evaluator`: uses the `labeled_criteria` option. In the config we name the criterion **helpfulness** and give the model a prompt for judging it: **"Is this submission helpful to the user, taking into account the correct reference answer?"** Because it is labeled, the reference answer is passed in alongside the prediction.
- `dopeness_evaluator`: uses the `criteria` option. In the config we name the criterion **dopeness** and pass a prompt for judging it: **"Is this response dope, lit, cool, or is it just a generic response?"** This criterion is not labeled, so no ground truth is needed for comparison.
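To see what the labeled evaluator actually receives, the sketch below runs the same mapping as the `prepare_data` lambda on stand-in objects. The `SimpleNamespace` objects are hypothetical substitutes for a LangSmith run and dataset example; only the attribute shapes are meant to match.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Same mapping as the `prepare_data` lambda used for the labeled evaluator."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Hypothetical stand-ins for a LangSmith run and dataset example.
run = SimpleNamespace(outputs={"output": "model answer"})
example = SimpleNamespace(
    inputs={"question": "some question"},
    outputs={"answer": "reference answer"},
)
payload = prepare_data(run, example)
print(payload)
```

The `reference` entry is what makes the criterion "labeled": the judge sees the ground-truth answer next to the prediction.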

## LangSmith Evaluation

In [34]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'upbeat-meal-82' at:
https://smith.langchain.com/o/67d3d2c8-bb8c-4749-aa07-825356b10ae6/datasets/c502cf66-4a4d-4a16-b905-79491b63a179/compare?selectedSessions=7e855a03-4715-4e67-b115-ba6016f88d59




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,"How does Caplin's model of decision support, a...","Caplin's model, as discussed in the context, i...",,The context explains that ChatGPT provides eco...,1,1,0,2.142028,90feb03d-7229-45da-84fb-bc947f888950,fbd37d53-1bf7-4a1c-87ce-81992c6d1a8b
1,How does the rapid growth of ChatGPT to over 7...,The rapid growth of ChatGPT to over 700 millio...,,"By July 2025, ChatGPT had been used weekly by ...",1,1,0,4.11403,0e41efba-c16b-468f-94fc-638daee96b53,0e499e8a-b06f-4a8c-a706-38b18d57811f
2,How does the study by Handa et al. (2025) rela...,The study by Handa et al. (2025) is mentioned ...,,The study by Handa et al. (2025) provides deta...,1,0,0,3.046627,f1665591-fe2e-4440-ada7-f49221db474a,74dd55df-b35b-49da-8b72-eeb212e834e7
3,"How does OpenAI's development of ChatGPT, as d...","Based on the provided context, OpenAI’s develo...",,OpenAI has developed ChatGPT as a large langua...,1,1,0,6.819505,7ae6ad55-2b21-4192-822a-da2549cd7ba8,af0fc99d-abca-4d2c-9d8a-0d1750fad5bb
4,Considering the significant growth in total me...,"Based on the provided context, the significant...",,The data shows that total messages increased f...,1,1,0,8.4113,82d0e987-55da-42f5-89b3-2e5005d965b0,847b83ae-f8ec-409a-a12a-07497b741d63
5,who use chatgpt more in age and gender and how...,Based on the provided context:\n\n- **Gender u...,,The context indicates that nearly half of all ...,1,1,0,5.715379,c7df132f-174d-494e-a7a5-47fb5667d174,a6aa2500-16e4-4140-bc50-7bf363997c3b
6,so like how does ChatGPT usage vary by occupat...,Based on the provided context:\n\nChatGPT usag...,,The context explains that ChatGPT usage varies...,1,0,0,9.69734,7d062bd3-2010-480a-b059-3557891b3a2d,37268713-2998-4259-a049-971d863b5e82
7,H0w does ChatGPT impact productivity outside o...,"Based on the context provided, ChatGPT impacts...",,"Based on the context, ChatGPT's daily message ...",1,1,0,3.325161,6a5af14d-1431-4bde-ba53-1f9b6479953e,c212adcd-fc34-49e7-9336-0a5d2169e83a
8,ChatGPT how does it give value and who gets mo...,"Based on the provided context, ChatGPT provide...",,ChatGPT provides economic value by likely impr...,1,1,0,4.73812,0b03c9aa-1479-4411-a245-ac0054f5d8bd,e5a2c7d2-3770-426b-a438-77c1d80b3489
9,How does Talamas relate to ChatGPT and its eco...,Talamas (along with Ide) develops a model wher...,,The context mentions that ChatGPT provides eco...,1,1,0,2.658415,9469673f-789d-4067-bf37-3bf3c17fa3cc,854c68d3-8d9d-4e2d-825d-5ca10df1ce91
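Since Ragas-style evaluation is most useful for directional comparisons, it helps to reduce this table to one average per metric before changing the pipeline. A minimal sketch, using the feedback columns copied from the results above:

```python
from statistics import mean

# Feedback columns copied from the results table above, one entry per example.
feedback = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "helpfulness": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "dopeness": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
baseline = {name: mean(scores) for name, scores in feedback.items()}
print(baseline)
```

The baseline chain is fully correct and mostly helpful on this test set but scores zero on dopeness, so dopeness is the metric with the most room to move.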


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model behind our retriever to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [35]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [36]:
rag_documents = docs

In [37]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

#### ✅ Answer

Changing chunk size affects evaluation by shifting retrieval precision/recall and downstream faithfulness/relevance. Smaller chunks can enable precise retrieval of the exact sentences, but they may miss surrounding context (key details can be split across chunks). Larger chunks capture more complete context and maintain narrative flow, but they’re less precise and can include irrelevant text. (Note: faithfulness and relevance move with context quality—these are tendencies, not guarantees.)

In general:

- Smaller chunks → higher precision and often higher faithfulness when the needed context fits in one chunk (if it is split across chunks, faithfulness can drop); lower recall; risk of missing context.
- Larger chunks → higher recall and often higher perceived relevance from the added context (though extra noise can also dilute relevance); lower precision; risk of noise/distractors.

There is a trade-off between the two.
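The effect of chunk size on the number of retrievable units can be seen with a naive fixed-window chunker. This is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to split on separators such as paragraphs and sentences); `fixed_size_chunks` and the 5,000-character stand-in document are assumptions for illustration.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Naive character chunker: fixed windows that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


document = "x" * 5000  # stand-in for 5,000 characters of text
small = fixed_size_chunks(document, chunk_size=500, chunk_overlap=50)
large = fixed_size_chunks(document, chunk_size=1000, chunk_overlap=50)
print(len(small), len(large))
```

Doubling the chunk size roughly halves the number of chunks, so each retrieved hit carries more surrounding text, and more potential noise.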

In [38]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")  # 3072 dimensions

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

#### ✅ Answer

Different models tokenize differently, capture semantics at different granularity, support different input lengths and languages, and produce vectors with different geometry. These differences shift retrieval precision/recall, which in turn affects answer faithfulness/relevance. That said, “bigger” (more dimensions) isn’t automatically better: performance improves only if the new model retrieves more useful context (not noise) for your data. We should verify with retrieval metrics (Recall@k, nDCG@k, MRR) and end-to-end metrics (accuracy/faithfulness), while also considering cost and latency.
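Two of the retrieval metrics mentioned above are simple enough to compute by hand. The sketch below implements Recall@k and Mean Reciprocal Rank on made-up retrieval results; the document ids and relevance labels are hypothetical and only illustrate the calculation.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(set(relevant))


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


# Hypothetical results for two queries: (retrieved ids in rank order, relevant ids).
queries = [
    (["d3", "d1", "d7"], ["d1", "d2"]),
    (["d5", "d4", "d9"], ["d5"]),
]
mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel in queries) / len(queries)
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(mean_recall, mrr)
```

Running the same queries through both embedding models and comparing these numbers isolates the retriever's contribution from the rest of the chain.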


In [39]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [40]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [41]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same input that we used before.

In [42]:
dopeness_rag_chain.invoke({"question": "How are people using AI to make money?"})

"Alright, strap in because this AI usage story is straight fire! People aren't just grinding away getting AI to *do* their jobs — nah, they’re flexing ChatGPT as their ultimate sidekick: an advisor and research assistant that supercharges their decision-making game. Especially in those brain-heavy, knowledge-intensive gigs where every choice can make or break the income flow, AI is the secret sauce elevating productivity by leveling up *how* and *what* decisions get made.\n\nThat’s why experts like Collis and Brynjolfsson drop this bomb: US users would need a fat $98 payout just to skip AI for a month, underscoring an insane consumer surplus pushing at least $97 billion a year. Boom! This means folks are cashing in by using generative AI to sharpen their work output, not just automating chores but amplifying insights, advice, and research speed.\n\nIn other words, AI isn’t just a tool shaking up workflows — it’s the dopest co-pilot in the money-making hustle, helping knowledge workers 

Finally, we can evaluate the new chain on the same test set!

In [43]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'best-turn-20' at:
https://smith.langchain.com/o/67d3d2c8-bb8c-4749-aa07-825356b10ae6/datasets/c502cf66-4a4d-4a16-b905-79491b63a179/compare?selectedSessions=ed99408c-8acd-4de9-8d38-9c3b3765f471





| row | inputs.question | outputs.output | error | reference.answer | feedback.correctness | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How does Caplin's model of decision support, a... | Oh snap, let’s dive into the turbocharged vibe... | | The context explains that ChatGPT provides eco... | 1 | 1 | 1 | 4.740307 | 90feb03d-7229-45da-84fb-bc947f888950 | 0bf5fd8f-6348-452c-aff3-53f7288d5277 |
| 1 | How does the rapid growth of ChatGPT to over 7... | Alright, buckle up—the explosive rocket ride o... | | By July 2025, ChatGPT had been used weekly by ... | 1 | 1 | 1 | 5.494103 | 0e41efba-c16b-468f-94fc-638daee96b53 | c813643e-7ba4-4e0f-bb99-d8f97c2c0e01 |
| 2 | How does the study by Handa et al. (2025) rela... | Yo, here’s the lowdown straight from the data ... | | The study by Handa et al. (2025) provides deta... | 1 | 1 | 1 | 5.975538 | f1665591-fe2e-4440-ada7-f49221db474a | ffd7fac4-ac8f-4026-bf60-6238a970f9a8 |
| 3 | How does OpenAI's development of ChatGPT, as d... | Alright, buckle up for this AI hype ride becau... | | OpenAI has developed ChatGPT as a large langua... | 1 | 1 | 1 | 8.728148 | 7ae6ad55-2b21-4192-822a-da2549cd7ba8 | 1d7bee8b-afaa-44a1-9eb0-c3ee193e13ec |
| 4 | Considering the significant growth in total me... | Yo, buckle up—here’s the lowdown straight from... | | The data shows that total messages increased f... | 1 | 1 | 1 | 9.051213 | 82d0e987-55da-42f5-89b3-2e5005d965b0 | c0d7f302-739f-4a58-a159-1d36731c334b |
| 5 | who use chatgpt more in age and gender and how... | Yo, here’s the lowdown straight from the dopes... | | The context indicates that nearly half of all ... | 1 | 1 | 1 | 6.356518 | c7df132f-174d-494e-a7a5-47fb5667d174 | 6b4b1eb4-ca79-4f58-b159-568b21cea505 |
| 6 | so like how does ChatGPT usage vary by occupat... | Alright, strap in—let’s decode the mad science... | | The context explains that ChatGPT usage varies... | 1 | 1 | 1 | 14.608596 | 7d062bd3-2010-480a-b059-3557891b3a2d | 785dd072-7c2f-4e3b-b144-5a085e55816e |
| 7 | H0w does ChatGPT impact productivity outside o... | Yo, here’s the dopest breakdown fresh from the... | | Based on the context, ChatGPT's daily message ... | 1 | 1 | 1 | 4.084192 | 6a5af14d-1431-4bde-ba53-1f9b6479953e | d0296015-f8f6-48c3-8fcd-97d1301356aa |
| 8 | ChatGPT how does it give value and who gets mo... | Alright, here’s the nitty-gritty on the radnes... | | ChatGPT provides economic value by likely impr... | 1 | 1 | 1 | 3.117598 | 0b03c9aa-1479-4411-a245-ac0054f5d8bd | 03d7f244-ef0e-4bdd-acf2-3b7e784aa8f5 |
| 9 | How does Talamas relate to ChatGPT and its eco... | Oh heck yeah, let’s drop some knowledge bombs ... | | The context mentions that ChatGPT provides eco... | 1 | 1 | 1 | 3.465061 | 9469673f-789d-4067-bf37-3bf3c17fa3cc | a95df00c-36d1-499d-a9fe-cc1de92503b0 |


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

![Experiments Comparison](experiments/experiments_new.png)







#### ✅ Answer

- **Correctness**: increasing the `chunk_size` and changing the embedding model improved the score by 0.17, which suggests that the retriever returned more meaningful, relevant context.
- **Dopeness**: changing the system prompt to the dope version improved the score by 1.
- **Helpfulness**: increasing the `chunk_size` and changing the embedding model improved the score by 0.35. As with correctness, this suggests that more meaningful, relevant context was retrieved.
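As a rough sketch of how such deltas can be computed, the snippet below averages each feedback column for two experiments and subtracts the means. The `mean_feedback` and `feedback_deltas` helpers are hypothetical, and the rows are made-up placeholder scores, not the results from this notebook. Only the column names follow the results table above.

```python
from statistics import mean

FEEDBACK_KEYS = [
    "feedback.correctness",
    "feedback.helpfulness",
    "feedback.dopeness",
]


def mean_feedback(rows: list[dict], keys: list[str]) -> dict[str, float]:
    """Average each feedback column over all rows of one experiment."""
    return {key: mean(row[key] for row in rows) for key in keys}


def feedback_deltas(
    baseline: list[dict], candidate: list[dict], keys: list[str]
) -> dict[str, float]:
    """Per-metric difference in mean score: candidate minus baseline."""
    base = mean_feedback(baseline, keys)
    cand = mean_feedback(candidate, keys)
    return {key: cand[key] - base[key] for key in keys}


# Placeholder scores for illustration only (NOT the notebook's results).
baseline_rows = [
    {"feedback.correctness": 1, "feedback.helpfulness": 0, "feedback.dopeness": 0},
    {"feedback.correctness": 0, "feedback.helpfulness": 1, "feedback.dopeness": 0},
]
candidate_rows = [
    {"feedback.correctness": 1, "feedback.helpfulness": 1, "feedback.dopeness": 1},
    {"feedback.correctness": 1, "feedback.helpfulness": 1, "feedback.dopeness": 1},
]

deltas = feedback_deltas(baseline_rows, candidate_rows, FEEDBACK_KEYS)
for key, delta in deltas.items():
    print(f"{key}: {delta:+.2f}")
```

With a small test set, a single example changing its score moves the mean noticeably, so deltas like these are best read as early signal rather than a precise measurement.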
