# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, getting a high-quality early signal on our application's performance would be much harder.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's Endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mspla\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\mspla\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

LangChain API Key: ········


We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key: ········


## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefully be familiar at this point since it's our Use-Case Data!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

### NOTE: All loaded documents are added here - slice `docs` (for example `docs[:20]`) to use a subset and keep costs/time down.
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 64, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
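
As a rough illustration of that last step, the sketch below links nodes whose embeddings have a cosine similarity at or above a threshold. It uses only the standard library; the toy vectors, the `0.9` threshold, and the helper names `cosine_similarity` and `build_relationships` are assumptions made for illustration, not Ragas internals.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Return (node_a, node_b, score) for every pair at or above the threshold."""
    relationships = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            relationships.append((name_a, name_b, score))
    return relationships


# Toy "summary embeddings" (hypothetical values, not real model output).
toy_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

edges = build_relationships(toy_embeddings)
```

Here only `doc_a` and `doc_b` point in nearly the same direction, so they are the only pair that ends up connected.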

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so the imported `default_transforms` function is not shadowed.
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to ap

Applying SummaryExtractor:   0%|          | 0/38 [00:00<?, ?it/s]

Property 'summary' already exists in node '3856dd'. Skipping!
Property 'summary' already exists in node 'c0e8b5'. Skipping!
Property 'summary' already exists in node '089ba8'. Skipping!
Property 'summary' already exists in node '35ba75'. Skipping!
Property 'summary' already exists in node '92f2ed'. Skipping!
Property 'summary' already exists in node '684804'. Skipping!
Property 'summary' already exists in node '829049'. Skipping!
Property 'summary' already exists in node '28936b'. Skipping!
Property 'summary' already exists in node '4d3ecd'. Skipping!
Property 'summary' already exists in node '163ed5'. Skipping!
Property 'summary' already exists in node '97997b'. Skipping!
Property 'summary' already exists in node '20f3f2'. Skipping!
Property 'summary' already exists in node '4cc321'. Skipping!
Property 'summary' already exists in node '6e25cf'. Skipping!
Property 'summary' already exists in node '6a35a4'. Skipping!
Property 'summary' already exists in node 'aebcf8'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/48 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '684804'. Skipping!
Property 'summary_embedding' already exists in node '089ba8'. Skipping!
Property 'summary_embedding' already exists in node '35ba75'. Skipping!
Property 'summary_embedding' already exists in node '3856dd'. Skipping!
Property 'summary_embedding' already exists in node '92f2ed'. Skipping!
Property 'summary_embedding' already exists in node 'c0e8b5'. Skipping!
Property 'summary_embedding' already exists in node '97997b'. Skipping!
Property 'summary_embedding' already exists in node '829049'. Skipping!
Property 'summary_embedding' already exists in node '163ed5'. Skipping!
Property 'summary_embedding' already exists in node '4d3ecd'. Skipping!
Property 'summary_embedding' already exists in node '28936b'. Skipping!
Property 'summary_embedding' already exists in node '20f3f2'. Skipping!
Property 'summary_embedding' already exists in node '4cc321'. Skipping!
Property 'summary_embedding' already exists in node '6e25cf'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 86, relationships: 711)

We can save and load our knowledge graphs as follows.

In [10]:
import json
from ragas.testset.graph import UUIDEncoder

"""
Explicit UTF-8: Forces the file to be saved/loaded with UTF-8 encoding
"""

# Save with explicit UTF-8 encoding
data = {
    "nodes": [node.model_dump() for node in kg.nodes],
    "relationships": [rel.model_dump() for rel in kg.relationships],
}

with open("usecase_data_kg.json", "w", encoding="utf-8") as f:
    json.dump(data, f, cls=UUIDEncoder, indent=2, ensure_ascii=False)

# Load with explicit UTF-8 encoding
with open("usecase_data_kg.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Reconstruct the knowledge graph
from ragas.testset.graph import KnowledgeGraph, Node, Relationship

usecase_data_kg = KnowledgeGraph()
usecase_data_kg.nodes = [Node(**node_data) for node_data in data["nodes"]]
usecase_data_kg.relationships = [Relationship(**rel_data) for rel_data in data["relationships"]]

usecase_data_kg

KnowledgeGraph(nodes: 86, relationships: 711)
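
A custom encoder is passed to `json.dump` above presumably because node and relationship IDs are UUIDs, which the standard `json` module cannot serialize on its own. The sketch below shows the general idea with a hypothetical `SimpleUUIDEncoder`; it is an illustration under that assumption, not the actual `UUIDEncoder` implementation from Ragas.

```python
import json
import uuid


class SimpleUUIDEncoder(json.JSONEncoder):
    """Hypothetical encoder: serialize UUID objects as strings."""

    def default(self, o):
        if isinstance(o, uuid.UUID):
            return str(o)
        return super().default(o)


node_id = uuid.uuid4()
payload = {"id": node_id, "properties": {"summary": "résumé"}}

# ensure_ascii=False keeps non-ASCII characters readable in the output.
encoded = json.dumps(payload, cls=SimpleUUIDEncoder, ensure_ascii=False)
decoded = json.loads(encoded)

# The ID comes back as a plain string, so convert it if a UUID is needed.
restored_id = uuid.UUID(decoded["id"])
```

Without the `cls` argument, `json.dumps(payload)` raises a `TypeError` because `UUID` is not a JSON-serializable type.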

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [11]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a different kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [12]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

The "hop" refers to how many separate pieces of context must be connected to answer the question.

1. `SingleHopSpecificQuerySynthesizer` (50% of queries): Creates questions that can be answered with information from a single document or chunk. It does not need to connect different sources to find the answer.

2. `MultiHopAbstractQuerySynthesizer` (25% of queries): Creates questions that require connecting multiple documents, focusing on general concepts or themes ("concept questioning").

3. `MultiHopSpecificQuerySynthesizer` (25% of queries): Creates questions that require connecting multiple documents but ask for specific details or facts ("specific factual enquiry"). Simple example: "What are the specific AI applications in healthcare and how do they differ from those in finance?" (needs specific details from multiple sources).



Finally, we can use our `TestsetGenerator` to generate our testset!

In [13]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"What is the significance of Handa et al., 2025...",[Introduction ChatGPT launched in November 202...,"The paper by Handa et al., 2025 reports statis...",single_hop_specifc_query_synthesizer
1,What is OpenAI known for?,[Table 1: ChatGPT daily message counts (millio...,"OpenAI is associated with ChatGPT, which is us...",single_hop_specifc_query_synthesizer
2,What is the significance of Appendix D in unde...,[Variation by Occupation Figure 23 presents va...,Appendix D contains a full report of GWA count...,single_hop_specifc_query_synthesizer
3,What is the Practical Guidance in the context ...,[Conclusion This paper studies the rapid growt...,Practical Guidance is one of the three most co...,single_hop_specifc_query_synthesizer
4,how much ChatGPT messages are non work and how...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,"In June 2024, non-work messages made up 53% of...",multi_hop_abstract_query_synthesizer
5,Based on the data showing the growth of non-wo...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,The data indicates that non-work messages have...,multi_hop_abstract_query_synthesizer
6,how chatgpt growth and user questions relate t...,[<1-hop>\n\nConclusion This paper studies the ...,the paper shows chatgpt grew fast since nov 20...,multi_hop_abstract_query_synthesizer
7,Based on the rapid growth of ChatGPT usage in ...,[<1-hop>\n\nConclusion This paper studies the ...,"The context indicates that by July 2025, ChatG...",multi_hop_specific_query_synthesizer
8,Wht US is the? how does it relate to ChatGPT u...,[<1-hop>\n\nConclusion This paper studies the ...,The context indicates that ChatGPT usage in th...,multi_hop_specific_query_synthesizer
9,"Based on the rapid growth of ChatGPT, which ha...",[<1-hop>\n\nConclusion This paper studies the ...,"By July 2025, ChatGPT users were collectively ...",multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [14]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to ap

Applying SummaryExtractor:   0%|          | 0/38 [00:00<?, ?it/s]

Property 'summary' already exists in node 'a76c46'. Skipping!
Property 'summary' already exists in node '51afe8'. Skipping!
Property 'summary' already exists in node 'cb6a16'. Skipping!
Property 'summary' already exists in node 'da926b'. Skipping!
Property 'summary' already exists in node '321ff5'. Skipping!
Property 'summary' already exists in node '5171f1'. Skipping!
Property 'summary' already exists in node '5ff6f3'. Skipping!
Property 'summary' already exists in node '4fca9e'. Skipping!
Property 'summary' already exists in node 'db9384'. Skipping!
Property 'summary' already exists in node '7282c2'. Skipping!
Property 'summary' already exists in node 'b4f6e0'. Skipping!
Property 'summary' already exists in node 'ef8cf3'. Skipping!
Property 'summary' already exists in node '934e3b'. Skipping!
Property 'summary' already exists in node '5a1652'. Skipping!
Property 'summary' already exists in node 'ccd60a'. Skipping!
Property 'summary' already exists in node '971494'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/48 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'da926b'. Skipping!
Property 'summary_embedding' already exists in node '51afe8'. Skipping!
Property 'summary_embedding' already exists in node 'a76c46'. Skipping!
Property 'summary_embedding' already exists in node 'db9384'. Skipping!
Property 'summary_embedding' already exists in node '5171f1'. Skipping!
Property 'summary_embedding' already exists in node 'cb6a16'. Skipping!
Property 'summary_embedding' already exists in node 'b4f6e0'. Skipping!
Property 'summary_embedding' already exists in node '4fca9e'. Skipping!
Property 'summary_embedding' already exists in node '321ff5'. Skipping!
Property 'summary_embedding' already exists in node '5ff6f3'. Skipping!
Property 'summary_embedding' already exists in node '7282c2'. Skipping!
Property 'summary_embedding' already exists in node 'ef8cf3'. Skipping!
Property 'summary_embedding' already exists in node '934e3b'. Skipping!
Property 'summary_embedding' already exists in node '5a1652'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [18]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is a Roth?,[Introduction ChatGPT launched in November 202...,The provided context does not include a defini...,single_hop_specifc_query_synthesizer
1,What is the significance of June 2024 in the c...,[Table 1: ChatGPT daily message counts (millio...,The context reports on ChatGPT message counts ...,single_hop_specifc_query_synthesizer
2,ChatGPT use work?,[Variation by Occupation Figure 23 presents va...,Variation by Occupation Figure 23 shows how Ch...,single_hop_specifc_query_synthesizer
3,Whn was November 2022 in relation to ChatGPT's...,[Conclusion This paper studies the rapid growt...,Conclusion This paper studies the rapid growth...,single_hop_specifc_query_synthesizer
4,How does variation in ChatGPT usage by occupat...,[<1-hop>\n\nVariation by Occupation Figure 23 ...,The context indicates that ChatGPT usage varie...,multi_hop_abstract_query_synthesizer
5,"How does the rapid adoption of ChatGPT, as evi...",[<1-hop>\n\nIntroduction ChatGPT launched in N...,"The rapid adoption of ChatGPT, with its widesp...",multi_hop_abstract_query_synthesizer
6,"How do the differences in message types, such ...",[<1-hop>\n\nVariation by Occupation Figure 23 ...,The data indicates that users in highly paid p...,multi_hop_abstract_query_synthesizer
7,Hw does ChatGPT usage grow for work and non-wo...,[<1-hop>\n\nIntroduction ChatGPT launched in N...,The context indicates that since its launch in...,multi_hop_abstract_query_synthesizer
8,Considering the rapid growth of ChatGPT usage ...,[<1-hop>\n\nConclusion This paper studies the ...,"The context indicates that in the US, ChatGPT ...",multi_hop_specific_query_synthesizer
9,Wht US is the main focus of the ChatGPT growth...,[<1-hop>\n\nConclusion This paper studies the ...,The study highlights that ChatGPT's rapid grow...,multi_hop_specific_query_synthesizer


We already provided our LangSmith API key and set tracing to "true" at the start of the notebook, so we're ready to move on to LangSmith.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [15]:
# check if dataset has been created on langsmith account, create only if not.
from langsmith import Client

client = Client()

dataset_name = "Use Case Synthetic Data - AIE8"

# Check if dataset already exists
try:
    # Try to get the existing dataset
    langsmith_dataset = client.read_dataset(dataset_name=dataset_name)
    print(f"Using existing dataset: {dataset_name}")
except Exception:
    # If it doesn't exist, create it
    langsmith_dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Synthetic Data for Use Cases"
    )
    print(f"Created new dataset: {dataset_name}")

Using existing dataset: Use Case Synthetic Data - AIE8


We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [16]:
for _, row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": row["user_input"]
      },
      outputs={
          "answer": row["reference"]
      },
      metadata={
          "context": row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
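
The mapping performed in the loop can be sketched with plain dictionaries, which makes it easy to check before anything is uploaded. The helper name `to_langsmith_example` and the sample row values are hypothetical; only the column names (`user_input`, `reference`, `reference_contexts`) and the target keys (`question`, `answer`, `context`) come from the cell above.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row to inputs/outputs/metadata dicts."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row with the same columns as the Ragas dataframe.
sample_row = {
    "user_input": "example question",
    "reference": "example reference answer",
    "reference_contexts": ["example context passage"],
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
```

Keeping the mapping in one small function also means the evaluators can rely on the `question` and `answer` keys being present on every example.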

## Basic RAG Chain

Time for some RAG!


In [17]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [31]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [18]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [19]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG"
)

In [20]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [21]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [22]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [23]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [24]:
rag_chain.invoke({"question" : "What are people doing with AI these days?"})

'Based on the provided context from "How People Use ChatGPT," people today use AI, particularly ChatGPT, for a variety of purposes that can be broadly classified into work-related and non-work-related uses. As of 2025, about 70% of ChatGPT queries are non-work-related, but usage for work tasks is also growing steadily.\n\nThe three most common conversation topics with ChatGPT are:\n\n1. **Practical Guidance** – General how-to advice and step-by-step instructions on various topics.\n2. **Writing** – This is the dominant work-related use, accounting for 42% of work messages and includes modifying existing text and producing new written content.\n3. **Seeking Information** – Users frequently ask for information or clarification to inform decisions.\n\nAdditional uses include:\n\n- **Tutoring or Teaching** (education-related queries make up 10.2% of messages),\n- **Technical Help** (computing programming, mathematical calculations, data analysis),\n- **Creative ideation** and some self-exp

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [25]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [26]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dopeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this response dope, lit, cool, or is it just a generic response?",
        },
        "llm" : eval_llm
    }
)

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`: Judges the predicted answer against the reference answer generated from the document texts. It measures the correctness of the response.
- `labeled_helpfulness_evaluator`: An answer may be accurate, but is it useful? This evaluator measures how helpful the response is to the user, taking the reference answer into account.
- `dopeness_evaluator`: Assesses whether the response is "dope, lit, cool" rather than generic. It measures the engagement factor and, to some extent, the stylistic quality of the response, without using a reference answer.
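
To make the `prepare_data` hook on `labeled_helpfulness_evaluator` concrete, the sketch below applies the same mapping to stand-in objects. `SimpleNamespace` is used here as a hypothetical substitute for LangSmith's run and example objects, and the sample strings are invented for illustration.

```python
from types import SimpleNamespace

# Same mapping as the prepare_data lambda used for labeled_helpfulness_evaluator.
prepare_data = lambda run, example: {
    "prediction": run.outputs["output"],
    "reference": example.outputs["answer"],
    "input": example.inputs["question"],
}

# Stand-ins for a LangSmith run and dataset example (hypothetical values).
fake_run = SimpleNamespace(outputs={"output": "predicted answer"})
fake_example = SimpleNamespace(
    inputs={"question": "example question"},
    outputs={"answer": "reference answer"},
)

prepared = prepare_data(fake_run, fake_example)
```

The resulting dictionary carries the three fields a labeled criteria evaluator needs: the model's prediction, the reference answer, and the original input.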


## LangSmith Evaluation

In [27]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'elderly-grip-80' at:
https://smith.langchain.com/o/3936adcd-6eec-4723-b49d-fee2168a2d46/datasets/8349afec-538d-4daa-bd7f-041c0c3c23f1/compare?selectedSessions=14cc06d7-bff2-4026-8e71-79b6582a0f37




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,how many messages like 18 billion and 2.5 bill...,Based on the provided context:\n\n- By July 20...,,"According to the context, by July 2025, ChatGP...",1,1,0,6.58319,1b70ae01-fc3a-4d47-a8f4-6f6b251b1fb8,6e45d7ed-58a7-4ce2-b06f-f6748cefb5f1
1,what happen in july 2025 with chatgpt and how ...,"In July 2025, ChatGPT had more than 700 millio...",,"In July 2025, ChatGPT had been used weekly by ...",1,1,0,3.579342,be6eb518-9138-463c-8cc0-2b97c678439c,b8618610-d994-4485-be6b-93139fa241bb
2,How does the growth of ChatGPT usage in the US...,"Based on the context provided, the growth of C...",,"The context indicates that in the US, ChatGPT ...",1,1,0,7.273568,8cb54345-757e-454e-b2a8-180bc55bb14b,1f296f3f-81a4-4bcf-a990-db4192d12029
3,Wht US us ChatGPT for work and non-work?,"Based on the provided context, users in the US...",,"The context indicates that in the US, ChatGPT ...",1,1,0,6.138035,7a8d1800-b7ae-45cd-8900-a442c8227988,db87c9e5-d7f8-4217-956b-7b405fceb550
4,Based on the data showing the growth in non-wo...,"Based on the context, a professional tech mana...",,The data indicates that non-work messages have...,1,1,0,7.218759,cb8c9822-fdad-40ba-baab-e25845d30e7f,fec4613d-1e2a-49bf-bd59-7424e0be58af
5,How does the increase in total ChatGPT message...,"Between June 2024 and June 2025, the total dai...",,"Between June 2024 and June 2025, total ChatGPT...",1,1,0,5.115514,b05f1401-16f3-45fd-9742-32eed880fe1d,68edf107-3a02-4f26-bf9a-80a02972d7bc
6,How does the growth and adoption of ChatGPT re...,"Based on the provided context, the growth and ...",,"The rapid growth of ChatGPT, launched in Novem...",1,1,0,12.61971,9c41c3aa-120d-4846-b125-dd7b62a204d2,9334b907-b64c-4f3f-bc82-bfabe160a90f
7,ChatGPT launch impact on economy and jobs how ...,"Based on the provided context, the launch and ...",,"The context explains that ChatGPT, launched in...",1,1,0,10.624191,c51dde34-ffa9-44b7-9214-2a052274ba0a,79106af0-b53d-4466-b00b-f2c585d93e90
8,How does ChatGPT facilitate writing tasks for ...,"Based on the provided context, ChatGPT facilit...",,Writing is by far the most common work use of ...,1,1,0,4.958922,0be31397-83da-41f7-b4be-0a43d29ca3fe,c2239f1f-b580-4573-9f97-f4377d4026c6
9,How do different occupation categories utilize...,Based on the provided context from the documen...,,Variation by occupation shows that users in hi...,1,1,0,14.819876,73d107e3-390b-44f5-81f9-ea0bc887ab41,22716e57-af5d-4d50-a368-b655d1bb5e0c
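
Since these scores are most useful for directional comparisons, it helps to reduce each feedback column to a single number per experiment. The sketch below averages the three feedback columns using the values from the table above; the `mean_feedback` helper is a hypothetical name, not part of LangSmith.

```python
from statistics import mean

# Feedback scores copied from the ten rows of the results table above.
feedback = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "helpfulness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "dopeness": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}


def mean_feedback(scores_by_metric):
    """Average each metric's scores so experiments can be compared."""
    return {metric: mean(scores) for metric, scores in scores_by_metric.items()}


baseline = mean_feedback(feedback)
```

The baseline chain is rated correct and helpful on every example but never "dope", which is the gap the next section tries to close.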


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used by the retriever to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [28]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [29]:
rag_documents = docs

In [30]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

Chunk size is raised from 500 to 1000 characters. Larger chunks keep related text together, so a passage is less likely to be cut off mid-idea. Each retrieved chunk then carries a more complete line of reasoning and better semantic coherence. On the downside, each chunk contains more tokens, so processing costs more and takes longer.

 1. The questions generated by RAGAS tend to be more complex (multi-hop, abstract queries).
 2. Larger chunks provide better context for answering these sophisticated questions, which supports more helpful and nuanced answers.
 3. The "dopeness" evaluator benefits when the model has more context from which to generate engaging, comprehensive responses.
 4. All of this comes at the cost of increased computation time and resources.
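
To make the trade-off concrete, here is a minimal sketch of how chunk size changes the number and length of chunks. It uses a naive fixed-width splitter, `split_fixed`, which is a hypothetical helper written for illustration and not the `RecursiveCharacterTextSplitter` algorithm. The sample text and the overlap of 50 for the 500-character case are also assumptions.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive character splitter (illustrative only, not LangChain's implementation)."""
    step = chunk_size - chunk_overlap  # each new chunk starts this many characters later
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Made-up sample text standing in for a loaded document.
sample = "ChatGPT usage grew quickly. " * 200

small = split_fixed(sample, chunk_size=500, chunk_overlap=50)
large = split_fixed(sample, chunk_size=1000, chunk_overlap=50)

# Fewer, longer chunks: each one carries more context but costs more tokens in the prompt.
print(f"500-char chunks:  {len(small)} chunks, longest = {max(map(len, small))}")
print(f"1000-char chunks: {len(large)} chunks, longest = {max(map(len, large))}")
```

With larger chunks the retriever returns fewer, longer passages, so the prompt grows even when the number of retrieved documents stays the same.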

In [31]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

 1. `text-embedding-3-large` produces higher-dimensional vectors than `text-embedding-3-small` (3072 versus 1536 dimensions by default), which gives it more capacity to encode meaning.
 2. This lets it capture more nuanced semantic relationships between the query and the chunks.
 3. This leads to better query understanding and better retrieval of relevant information. As in Question #2, the downside is increased computation: higher cost and more time.
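
As a rough illustration of why the embedding model matters, the sketch below ranks chunks by cosine similarity to a query, which is the kind of comparison a vector store retriever performs. The three-dimensional vectors and chunk names are made up for illustration; real embeddings come from the model and have thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not produced by any real model.
query = [0.9, 0.1, 0.0]
chunks = {
    "chunk_about_work_usage": [0.8, 0.2, 0.1],
    "chunk_about_writing": [0.1, 0.9, 0.2],
}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(query, chunks[name]),
    reverse=True,
)
print(ranked)
```

A different embedding model places the query and chunks at different points in the vector space, so the ranking, and therefore the context handed to the LLM, can change.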

In [32]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [33]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [34]:
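dopeness_rag_chain = (
    # Retrieve context for the question, and pass the question through unchanged
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # Fill the prompt, call the LLM, and return the response as a plain string
    | dopeness_rag_prompt | llm | StrOutputParser()
)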

Let's test it on the same input that we saw before.

In [35]:
dopeness_rag_chain.invoke({"question" : "How are people using AI to make money?"})

"Alright, buckle up! Based on the context from this fresh 2025 brain dump, people are straight-up flexing AI like ChatGPT to turbocharge their money-making game in a few slick ways.\n\nFirstly, AI isn’t just some magic button to do your job for you—it’s more like the ultimate sidekick, giving killer advice and research support. This means folks are leveraging AI as a savvy advisor, cranking up their decision-making chops, especially in brainy, knowledge-heavy gigs. Smarter decisions = better output = more $$$. \n\nCollis and Brynjolfsson’s 2025 study drops the mic here: US users would need a whopping $98 to skip out on generative AI for a month — implying the economic boost from AI is over $97 billion annually. That’s wild surplus value flowing because workers are not just using AI as an assistant but as a productivity amplifier. Whether it's automating tasks or guiding choices (hello, decision support system vibes), AI is remixing how people do work, making the grind both smoother and

Finally, we can evaluate the new chain on the same test set!

In [36]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'definite-coast-37' at:
https://smith.langchain.com/o/3936adcd-6eec-4723-b49d-fee2168a2d46/datasets/8349afec-538d-4daa-bd7f-041c0c3c23f1/compare?selectedSessions=b2d01397-839d-49c6-a12e-5b176d6a9269




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,how many messages like 18 billion and 2.5 bill...,"Alright, strap in for some next-level AI juice...",,"According to the context, by July 2025, ChatGP...",1,1,1,7.234436,1b70ae01-fc3a-4d47-a8f4-6f6b251b1fb8,953fa4d0-031a-45cc-aa8f-a062edf98787
1,what happen in july 2025 with chatgpt and how ...,"Yo, let's drop some sick knowledge from that f...",,"In July 2025, ChatGPT had been used weekly by ...",1,0,1,7.144565,be6eb518-9138-463c-8cc0-2b97c678439c,bdd891c5-d78f-48bd-819c-8f9fdfcadf12
2,How does the growth of ChatGPT usage in the US...,"Alright, buckle up — this is some seriously sl...",,"The context indicates that in the US, ChatGPT ...",1,1,1,6.152414,8cb54345-757e-454e-b2a8-180bc55bb14b,c77b2ba5-f4df-47b4-a84e-4ddb3879dd3e
3,Wht US us ChatGPT for work and non-work?,"Alright, let's crank this up to eleven and dro...",,"The context indicates that in the US, ChatGPT ...",1,0,1,5.678231,7a8d1800-b7ae-45cd-8900-a442c8227988,5c50a528-9728-44aa-8975-274ee642fb6c
4,Based on the data showing the growth in non-wo...,"Alright, buckle up — here’s how a sharp tech m...",,The data indicates that non-work messages have...,1,1,1,11.162671,cb8c9822-fdad-40ba-baab-e25845d30e7f,2a7f6ea1-d78e-4cf1-985d-eb5f5cce7908
5,How does the increase in total ChatGPT message...,"Alright, buckle up for some sizzling AI usage ...",,"Between June 2024 and June 2025, total ChatGPT...",1,1,1,7.982502,b05f1401-16f3-45fd-9742-32eed880fe1d,3fde20c8-7f16-4434-8098-e005d8164356
6,How does the growth and adoption of ChatGPT re...,"Yo, here’s the 411 straight from the dopest AI...",,"The rapid growth of ChatGPT, launched in Novem...",1,1,1,7.67561,9c41c3aa-120d-4846-b125-dd7b62a204d2,aa686942-edbc-46a7-aca4-14cd6d2be2b8
7,ChatGPT launch impact on economy and jobs how ...,"Yo, strap in for this AI-powered economic and ...",,"The context explains that ChatGPT, launched in...",1,1,1,7.67601,c51dde34-ffa9-44b7-9214-2a052274ba0a,b7e513d9-1de7-4ab3-97b6-980d760dea29
8,How does ChatGPT facilitate writing tasks for ...,"Oh, buckle up, because ChatGPT is the ultimate...",,Writing is by far the most common work use of ...,1,1,1,4.911931,0be31397-83da-41f7-b4be-0a43d29ca3fe,35e7b076-2098-4b0b-add8-d1a27a44e00f
9,How do different occupation categories utilize...,"Alright, buckle up for the ultimate dive into ...",,Variation by occupation shows that users in hi...,1,1,1,13.380189,73d107e3-390b-44f5-81f9-ea0bc887ab41,84d29ef8-d5bb-4a2d-a907-a3fac9c7a9ab
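
To turn the per-row feedback into a single number per evaluator, a small summary helps. The sketch below averages the feedback columns from the results table above; the scores are copied by hand from rows 0-9, and the `feedback` dictionary is a hypothetical structure used only for illustration.

```python
from statistics import mean

# Feedback scores copied from rows 0-9 of the results table above.
feedback = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "helpfulness": [1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    "dopeness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}

# Average score per evaluator for this experiment.
averages = {name: mean(scores) for name, scores in feedback.items()}
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Running the same summary on the baseline experiment's rows gives a like-for-like comparison between the two chains.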


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

Dopeness improves because the prompt augmentation explicitly steers the response toward that style.

Correctness dropped, possibly because the larger chunks introduced noise in this case. The larger embedding model may also retrieve passages that are contextually similar to the question but do not contain the right facts. Helpfulness may drop for similar reasons.

#### This suggests

- Engagement vs. utility: the "dope" responses might be more entertaining, but the style can hurt accuracy. Facts may be boring.
- Information overload: larger chunks might provide too much context, making responses less focused.
- Style vs. substance: the evaluator might prefer straightforward, factual responses over engaging ones.

#### What this means

- Different use cases: you might want different chains for different scenarios.
- Recommendation: take a hybrid approach and use different chains for different types of questions.
- No one size fits all! **Always evaluate.**



#### Screenshot: Chunk size 500, embeddings "text-embedding-3-small"

![500_chunk_embeddings_small](data/chain1.png)

#### Screenshot: Chunk size 1000, embeddings "text-embedding-3-large"

![1000_chunks_embeddings_large](data/chain2.png)