# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, it would not be nearly as easy to get a high-quality early signal for our application's performance.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [None]:
#!pip install -qU ragas==0.2.10

In [None]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
"""
Was getting error with nltk, so, aided by Wiz, we don't use this code and instead
use the nltk code below (a few cells down)



import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

"""

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1018)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed: unable to get local issuer certificate
[nltk_data]     (_ssl.c:1018)>


False

In [1]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [2]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [4]:
# This is just to make the output look better

from IPython.display import HTML, display
def set_output_wrapping():
   display(HTML('''
   <style>
   pre {
       white-space: pre-wrap;
       word-wrap: break-word;
       max-width: 100%;
       overflow-x: hidden;
   }
   .output_area {
       white-space: pre-wrap;
       word-wrap: break-word;
       max-width: 100%;
       overflow-x: hidden;
   }
   .output_text {
       white-space: pre-wrap;
       word-wrap: break-word;
   }
   div.output {
       white-space: pre-wrap;
       word-wrap: break-word;
       max-width: 100%;
   }
   span {
       white-space: pre-wrap;
       word-wrap: break-word;
   }
   </style>
   '''))
set_output_wrapping()


## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data by downloading the webpages we'll be using today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [4]:
!mkdir data

In [5]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31287    0 31287    0     0  92995      0 --:--:-- --:--:-- --:--:-- 92839


In [6]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70146    0 70146    0     0   182k      0 --:--:-- --:--:-- --:--:--  183k


In [5]:
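# Was getting an SSL certificate error with nltk, so I added this code (aided by Wiz).
# It disables certificate verification for this session so the NLTK downloads can succeed;
# that is acceptable for a local notebook, but not something to copy into production code.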

import ssl
import nltk

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kamerankolahi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/kamerankolahi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [6]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

In [7]:
# Confirm that we have exactly 2 documents

len(docs)

2

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [8]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [9]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [10]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
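To make the relationship-building step more concrete, here is a minimal sketch of linking nodes whose embeddings are similar. It is not Ragas' actual implementation: the toy 3-dimensional vectors, the `0.9` threshold, and the `build_relationships` helper are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Hypothetical helper: link every pair of nodes scoring at or above the threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


# Toy embeddings (made up): node_a and node_b point in nearly the same direction.
toy_embeddings = {
    "node_a": [1.0, 0.0, 0.0],
    "node_b": [0.9, 0.1, 0.0],
    "node_c": [0.0, 1.0, 0.0],
}
edges = build_relationships(toy_embeddings)
print(edges)
```

Only the `node_a`/`node_b` pair clears the threshold here, which mirrors the idea that nodes with similar content end up connected in the graph.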

In [11]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

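# Use a new name so we don't shadow the imported `default_transforms` function
# (shadowing it would make this cell fail if it were run a second time).
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)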
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/20 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 11, relationships: 39)

We can save and load our knowledge graphs as follows.

In [12]:
kg.save("ai_across_years_kg.json")
ai_across_years_kg = KnowledgeGraph.load("ai_across_years_kg.json")
ai_across_years_kg

KnowledgeGraph(nodes: 11, relationships: 39)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [13]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=ai_across_years_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers is going to tackle a separate kind of query which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [14]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
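As a rough sanity check on the distribution above, the sketch below treats each weight as the share of the test set that a synthesizer should produce. That interpretation, the `expected_counts` helper, and the short string labels are assumptions for illustration; how Ragas rounds fractional counts is up to the library (the run shown further down came out as 5, 3, and 2 samples).

```python
def expected_counts(distribution, testset_size):
    """Hypothetical helper: check the weights sum to 1 and scale them by the test set size."""
    total = sum(weight for _, weight in distribution)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return {name: weight * testset_size for name, weight in distribution}


# Same weights as the query_distribution above, with plain labels instead of synthesizer objects.
toy_distribution = [
    ("single_hop_specific", 0.5),
    ("multi_hop_abstract", 0.25),
    ("multi_hop_specific", 0.25),
]
counts = expected_counts(toy_distribution, testset_size=10)
print(counts)
```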

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

#### Answer:

SingleHopSpecificQuerySynthesizer: generates a query based off of a single node in the knowledge graph (SingleHop) that is a factual/direct query (SpecificQuery). Essentially, you only need to retrieve the context from one node to be able to answer it, and it's not a conceptual query (it's a factual/direct query). An example is "what's the capital of France?"

MultiHopAbstractQuerySynthesizer: generates a query based off of multiple nodes in the knowledge graph (MultiHop) that is a conceptual query (AbstractQuery). Essentially, you need to retrieve the contexts from multiple nodes to be able to answer it, and it's a conceptual/broad query (abstract), not a simple fact. An example is "How did Einstein's Theory of Relativity inspire later physicists' work on quantum mechanics?"

MultiHopSpecificQuerySynthesizer: generates a query based off of multiple nodes in the knowledge graph (MultiHop) that is a factual/direct query (SpecificQuery). Essentially, you need to retrieve the contexts from multiple nodes to be able to answer it, and it's not a conceptual query (it's a factual/direct query). An example is "What is the capital city of the country with the largest human population?"



Finally, we can use our `TestsetGenerator` to generate our testset!

In [15]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Considering the significant role of Microsoft ...,[The ethics of this space remain diabolically ...,Microsoft Research is mentioned as one of the ...,single_hop_specifc_query_synthesizer
1,AI what is it,[Simon Willison’s Weblog Subscribe Stuff we fi...,Artificial Intelligence refers to the latest a...,single_hop_specifc_query_synthesizer
2,what is stanford alpaca,[the document includes some of the clearest ex...,The context does not provide a specific explan...,single_hop_specifc_query_synthesizer
3,What role does Nvidia play in the development ...,[The year of slop Synthetic training data work...,The context mentions Nvidia as one of the orga...,single_hop_specifc_query_synthesizer
4,What is Anthropic in the context of AI develop...,[also pre-announced voice mode for Amazon Nova...,Anthropic is mentioned as the creator of Claud...,single_hop_specifc_query_synthesizer
5,How does the recent advancement in platform-sp...,[<1-hop>\n\nThe year of slop Synthetic trainin...,"The context highlights that many models, inclu...",multi_hop_abstract_query_synthesizer
6,Hw can we use LLMs and AI to improve our under...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,Based on simon wilison's statement that LLMs a...,multi_hop_abstract_query_synthesizer
7,Considering the recent advancements in large l...,[<1-hop>\n\nI would be very surprised if they ...,The development of DeepSeek v3 demonstrates th...,multi_hop_abstract_query_synthesizer
8,Considering the advancements in Large Language...,[<1-hop>\n\nI would be very surprised if they ...,"The development of DeepSeek v3 in China, which...",multi_hop_specific_query_synthesizer
9,how google is involved in the development of l...,[<1-hop>\n\nthe document includes some of the ...,the context mentions google as one of the orga...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [16]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'StringIO' object has no attribute 'output'


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [17]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Meta is like a big company that makes social s...,[The ethics of this space remain diabolically ...,Meta is mentioned as one of the organizations ...,single_hop_specifc_query_synthesizer
1,What are LLMs?,"[and software engineer, LLMs are infuriating. ...",LLMs are large language models that are often ...,single_hop_specifc_query_synthesizer
2,What significant developments in AI occurred i...,[Simon Willison’s Weblog Subscribe Stuff we fi...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,is it ok to use homebrew to run llms without a...,[the document includes some of the clearest ex...,The context discusses the development and impa...,single_hop_specifc_query_synthesizer
4,How do the impact of AI models on industry and...,[<1-hop>\n\nThe year of slop Synthetic trainin...,"The context highlights that AI models, particu...",multi_hop_abstract_query_synthesizer
5,How do the themes of LLMs' smartness versus du...,"[<1-hop>\n\nand software engineer, LLMs are in...",The first segment highlights that LLMs are bot...,multi_hop_abstract_query_synthesizer
6,How do openly licensed models like DeepSeek v3...,[<1-hop>\n\nI would be very surprised if they ...,"DeepSeek v3, as one of the largest openly lice...",multi_hop_abstract_query_synthesizer
7,How do the costs and efficiency of trainng lar...,[<1-hop>\n\nI would be very surprised if they ...,Recent developments show that training large l...,multi_hop_abstract_query_synthesizer
8,how LLMs are really smart but also dumb and ho...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,The context explains that LLMs are really smar...,multi_hop_specific_query_synthesizer
9,How does googles research and development in A...,[<1-hop>\n\nThe year of slop Synthetic trainin...,The context indicates that Google has made sig...,multi_hop_specific_query_synthesizer


We already provided our LangSmith API key and set tracing to "true" at the start of the notebook, so we're ready to use LangSmith.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [18]:
from langsmith import Client

client = Client()

dataset_name = "State of AI Across the Years!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years!"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [19]:
# iterrows() yields (index, row) pairs; we only need the row
for _, row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": row["user_input"]
      },
      outputs={
          "answer": row["reference"]
      },
      metadata={
          "context": row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
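The mapping in the loop above can be sketched without touching LangSmith at all. The `row_to_example` helper and the placeholder row below are hypothetical; the sketch only shows how the Ragas column names are renamed into the `question` / `answer` / `context` keys.

```python
def row_to_example(row):
    """Hypothetical helper: reshape one Ragas testset row into inputs/outputs/metadata."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Placeholder row (not real testset data) using the same column names as the dataframe.
toy_row = {
    "user_input": "placeholder question",
    "reference_contexts": ["placeholder context chunk"],
    "reference": "placeholder reference answer",
}
example = row_to_example(toy_row)
print(example)
```

Keeping the reshaping in a small pure function like this makes it easy to check the key names before any examples are uploaded.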

## Basic RAG Chain

Time for some RAG!


In [20]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [21]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
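To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and sentences, which this version ignores. The `fixed_window_chunks` helper and the sample text are assumptions for illustration.

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Hypothetical helper: slide a fixed-size window over the text with some overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 1,200 characters of made-up text.
sample = "".join(str(i % 10) for i in range(1200))
small = fixed_window_chunks(sample, chunk_size=500, chunk_overlap=50)
large = fixed_window_chunks(sample, chunk_size=1000, chunk_overlap=50)
print(len(small), len(large))
```

Consecutive chunks share their boundary characters, so a sentence that straddles a cut still appears whole in at least one chunk when it is shorter than the overlap.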

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [22]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [23]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)



In [24]:
# Retrieving the top 10 chunks from the vector store

retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
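Conceptually, `k` just controls how many of the best-scoring chunks are handed to the prompt. The sketch below illustrates that selection step with made-up similarity scores; the `top_k` helper and the chunk labels are hypothetical, and a real vector store computes the scores from embeddings rather than taking them as input.

```python
import heapq


def top_k(scored_chunks, k):
    """Hypothetical helper: return the k (chunk, score) pairs with the highest scores."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])


# Made-up similarity scores for a single query.
toy_scores = [
    ("chunk about agents", 0.82),
    ("chunk about pricing", 0.31),
    ("chunk about evals", 0.57),
    ("chunk about local models", 0.44),
]
best = top_k(toy_scores, k=2)
print(best)
```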

To get the "A" in RAG, we'll provide a prompt.

In [25]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
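The template above is filled in by substituting `{context}` and `{question}`. As a rough stand-in for what happens at run time, the sketch below uses plain `str.format` on a shortened template; joining the chunks with blank lines and the `render_prompt` helper are assumptions, not a description of how `ChatPromptTemplate` formats retrieved documents.

```python
TOY_PROMPT = """\
Context: {context}
Question: {question}
"""


def render_prompt(template, chunks, question):
    """Hypothetical helper: join the retrieved chunks and fill in both placeholders."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


rendered = render_prompt(TOY_PROMPT, ["first chunk", "second chunk"], "What are Agents?")
print(rendered)
```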

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using OpenAI's `gpt-4.1-mini`!

In [26]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set up our RAG LCEL chain!

In [27]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
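The data flow of the chain above can be mimicked with ordinary functions: the first step builds a dict with `context` and `question`, and each later step consumes the previous step's output. Everything below (`fake_retriever`, `fake_prompt`, `fake_llm`, `toy_chain`) is a hypothetical stand-in that makes no API calls; it only illustrates the order of operations.

```python
def fake_retriever(question):
    """Stand-in for the retriever: returns a list of 'chunks' for the question."""
    return [f"(retrieved chunk for: {question})"]


def fake_prompt(values):
    """Stand-in for the prompt template: turns the dict into one prompt string."""
    return f"Context: {values['context']}\nQuestion: {values['question']}"


def fake_llm(prompt_text):
    """Stand-in for the LLM plus output parser: returns a plain string."""
    return f"ANSWER based on -> {prompt_text}"


def toy_chain(inputs):
    # Step 1: build the dict, like the {"context": ..., "question": ...} mapping.
    values = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Steps 2 and 3: prompt, then model, each fed by the previous output.
    return fake_llm(fake_prompt(values))


result = toy_chain({"question": "What are Agents?"})
print(result)
```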

In [28]:
rag_chain.invoke({"question" : "What are Agents?"})

'Based on the provided context, "agents" is an extremely vague and ill-defined term in the AI community. Generally, it refers to AI systems that can go away and act on your behalf, similar to a travel agent model, or large language models (LLMs) equipped with tools that they can use iteratively to solve problems. However, there is no single clear or widely accepted meaning of the term, and many who use it don’t acknowledge this lack of clarity. Despite much discussion and excitement about AI agents, few examples of them actually running in production exist, and the concept still feels like it is perpetually "coming soon." One major challenge is the problem of "gullibility"—current AI systems tend to believe anything they are told, which limits their ability to act reliably and autonomously. Some think that fully solving this gullibility problem, and thereby realizing true AI agents, may require achieving Artificial General Intelligence (AGI).  \n\nIn short, agents are AI systems intend

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [29]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [30]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)
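The `prepare_data` lambda above decides which fields the evaluator sees. The sketch below applies the same mapping to stand-in objects so the resulting payload is visible; the `SimpleNamespace` objects and their contents are hypothetical substitutes for LangSmith's real run and example objects.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Same mapping as the lambda above, written as a named function."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Stand-ins with only the attributes the mapping reads.
toy_run = SimpleNamespace(outputs={"output": "model answer"})
toy_example = SimpleNamespace(
    inputs={"question": "toy question"},
    outputs={"answer": "reference answer"},
)
payload = prepare_data(toy_run, toy_example)
print(payload)
```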

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`:
- `labeled_helpfulness_evaluator`:
- `dope_or_nope_evaluator`:


#### Answer:

(1) `qa_evaluator`: qa_evaluator is evaluating our RAG strictly on whether its response to the question is factually correct (based on the question, our RAG's response to the question, and the provided true answer)

I know this because, looking at the trace in LangSmith for qa_evaluator (which is really cool to see!), the qa_evaluator is given a prompt saying it is a teacher grading a quiz: it receives the question, the student answer, and the true answer, and has to grade the student answer as either CORRECT or INCORRECT (grading only on factual accuracy, not on punctuation, etc.; additional info is OK as long as it doesn't contain any conflicting statements)

(2) `labeled_helpfulness_evaluator`: labeled_helpfulness_evaluator is evaluating whether or not our RAG's response to the user's query is helpful to the user, taking into account the correct reference answer (and of course the user question, which is also provided to labeled_helpfulness_evaluator). labeled_helpfulness_evaluator is told to first write out in a step by step manner its reasoning and then on the last line say 'Y' for yes or 'N' for no

I know this because I looked at a trace in LangSmith. In particular, I know labeled_helpfulness_evaluator considers the user's question because it mentions it in its step by step reasoning before giving its final answer (at least in the trace that I looked at)

(3) `dope_or_nope_evaluator`: dope_or_nope_evaluator is evaluating whether or not our RAG's response to the user's query is dope, lit, or cool. dope_or_nope_evaluator is told to first write out in a step by step manner its reasoning and then on the last line say 'Y' for yes or 'N' for no

I know this because I looked at a trace in LangSmith. In particular, dope_or_nope_evaluator is not exposed to the correct reference answer, but is exposed to the user's question (although it didn't consider the user's question in its step by step reasoning, at least in the trace that I looked at)

## LangSmith Evaluation

In [31]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'best-smile-68' at:
https://smith.langchain.com/o/ded253b7-e6f0-49c4-9c1c-737484f7f65a/datasets/33602d1c-64ec-4d09-9226-27d3cbcab194/compare?selectedSessions=12135951-c654-499c-9887-d292262e005f




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,"Based on the insights from 2023 and 2024, how ...","Based on the provided context, recent advancem...",,"Recent advancements in LLMs, as highlighted in...",1,0,0,4.668474,832dd8db-7ff8-48d2-bb76-92f0e0ef6c3c,8ac56942-44be-4d15-abda-9c5f7045eec7
1,How does ChatGPT's abilty to generate code and...,"Based on the provided context, ChatGPT's abili...",,"ChatGPT's strong capability to generate code, ...",1,1,0,4.096225,8c1110b0-57bd-40f9-a5c4-fe8fb631c4ca,c064f83d-16c7-49d2-b752-92b613afc8ec
2,How does googles research and development in A...,"Google’s research and development in AI, parti...",,The context indicates that Google has made sig...,1,1,0,3.410534,fa8b213b-29a1-4f51-ab19-8e7ebefad19c,5dc0505d-ac16-4af2-9843-b4ef35185e96
3,how LLMs are really smart but also dumb and ho...,"LLMs are described as being ""really smart, and...",,The context explains that LLMs are really smar...,1,1,0,6.396271,53bec1a8-545f-41a3-a356-0ed5b2c1ef02,833e1da2-1076-44b4-985f-4c5b6181bc99
4,How do the costs and efficiency of trainng lar...,"Based on the provided context, the costs and e...",,Recent developments show that training large l...,1,1,0,3.48479,c10f1324-08d8-455d-9e3a-258f559f22c5,2ce528e8-6ac3-4fe9-a0dd-fe53ba5cac60
5,How do openly licensed models like DeepSeek v3...,Openly licensed models like DeepSeek v3 and Ll...,,"DeepSeek v3, as one of the largest openly lice...",1,1,0,5.512262,9735285e-414c-4beb-9d2f-51d3a7018dae,d1b9a88f-6e3c-487c-a295-ce705604239f
6,How do the themes of LLMs' smartness versus du...,The themes of LLMs being both very smart and v...,,The first segment highlights that LLMs are bot...,1,1,0,5.76729,37e8f585-502f-49d3-868c-b0796f3101bc,12a0bbf7-521c-4878-958e-678f197b6f22
7,How do the impact of AI models on industry and...,"Based on the provided context, the impact of A...",,"The context highlights that AI models, particu...",1,1,0,4.244793,2c916b07-5c70-4b9d-8eba-e757b91958e7,9d421386-3369-424d-af3e-a8337d093585
8,is it ok to use homebrew to run llms without a...,I don't know.,,The context discusses the development and impa...,1,0,0,1.534997,85b922a9-923b-43cb-a871-8e0a70050400,07a95d44-39c9-412b-a628-177b858174ba
9,What significant developments in AI occurred i...,"According to Simon Willison’s weblog, 2023 was...",,2023 was the breakthrough year for Large Langu...,1,0,0,4.112528,3e091db8-830a-4eed-abe7-0d8b105cc83e,81b6beb8-6e11-42f9-8840-0b9d78110a6d
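To summarize the table above, we can average each feedback column. The scores below are copied by hand from the `feedback.*` columns of this run; in practice you would read them from the experiment results rather than retyping them.

```python
from statistics import mean

# Feedback scores transcribed from the ten rows of the results table above.
feedback = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "helpfulness": [0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "dopeness": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
averages = {name: mean(scores) for name, scores in feedback.items()}
print(averages)
```

For this baseline run, every answer was graded correct, 7 of 10 were graded helpful, and none were graded dope, which gives us a reference point for the modified pipeline.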


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [32]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [33]:
rag_documents = docs

In [34]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

#### Answer:

A few reasons why modifying our chunk size would modify the performance of our application:

(1) smaller chunk sizing might split up answers across multiple chunks vs larger chunk sizing might contain the answer in one chunk but also contain irrelevant information that will confuse the chat model. It can even be too big to fit in its context window (although chat models do have decent size context windows)

(2) If the chunk size is too big for the embedding model, then what the embedding model does is consider the first part of the chunk up to its token limit, then after that it essentially ignores the rest of the chunk. This will clearly affect the performance of our application

(3) larger chunk sizing will require more tokens and thus cause more latency to the user while they are using it (more context = more tokens = more processing time for the chat model)

The point is that there is no general "golden" chunk size. It really depends on the specifics (e.g. content, models used, etc)


In [35]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

#### Answer:

A few reasons why modifying our embedding model would modify the performance of our application:

(1) different embedding models have different embedding vector dimensions (in other words, the number of numerical dimensions a chunk gets mapped to). The more embedding vector dimensions, the more it can capture nuanced meanings and relationships, which can lead to more relevant retrievals, which would improve the performance of our application

(2) different embedding models have different context window lengths, which affects the size of chunks it can properly handle. This is especially important if based on your data, it happens to be the case that the most logical chunks happen to be pretty big, then you really need to be careful about which embedding model you pick. If the chunk size is too big for the embedding model, then what the embedding model does is consider the first part of the chunk up to its token limit, then after that it essentially ignores the rest of the chunk. This will clearly affect the performance of our application

(3) Larger embedding models require more compute time and thus add latency for the user. First, when a user enters their prompt, we need to feed it to the embedding model before comparing it to what's in our vector store, and a larger model takes longer to produce the embedding. Second, when comparing against our vector store (e.g. with cosine similarity), more embedding vector dimensions means more computation per comparison, which also means more latency.

(4) Larger embedding models require more compute and thus cost more money. It depends on your definition of "performance", but if you consider cost to be part of it, then a larger model will generally cost more.
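Points (3) and (4) can be sketched with plain Python. The functions `cosine_similarity` and `index_size_bytes` are illustrative helpers, not part of any library used here. The dimension counts are the default output sizes of OpenAI's `text-embedding-3-small` (1536) and `text-embedding-3-large` (3072); the 4-bytes-per-value figure assumes float32 storage, and the vector count is made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; the work grows linearly with the number of dimensions."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, assuming float32 values (an assumption)."""
    return num_vectors * dims * bytes_per_value


SMALL_DIMS = 1536  # text-embedding-3-small default
LARGE_DIMS = 3072  # text-embedding-3-large default
NUM_CHUNKS = 1000  # hypothetical number of stored chunks

small_bytes = index_size_bytes(NUM_CHUNKS, SMALL_DIMS)
large_bytes = index_size_bytes(NUM_CHUNKS, LARGE_DIMS)
print(f"small index: {small_bytes} bytes, large index: {large_bytes} bytes")

# Each similarity computation touches every dimension once per sum,
# so doubling the dimensions roughly doubles the arithmetic per comparison.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

For a collection this small the difference is negligible in practice; it only starts to matter as the number of stored vectors and queries grows.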


In [36]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Across Years (Augmented)"
)

In [37]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [38]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we saw before.

In [39]:
dope_rag_chain.invoke({"question" : "what are Agents?"})

'Yo, so here’s the deal with “agents” — the term is hella confusing and people throw it around without ever locking down what it really means. At its core, folks think of agents in two main vibes: one, like a travel agent who goes out and acts on your behalf, doing your bidding; and two, LLMs (large language models) hooked up with tools they can use repeatedly to solve problems.  \n\nBut here’s the kicker — these so-called agents still feel like they’re “coming soon” forever. Why? Because they struggle with gullibility. They just believe anything you feed ’em, which messes up their ability to make solid decisions for you. It’s like trusting a buddy who falls for any prank. Without nailing down truth versus fiction, these agents can’t be fully legit.  \n\nIn other words, “agents” are hyped AI systems that can act for you, but the tech and trust game ain’t quite there yet. So yeah, cool concept but still work in progress, fam.'

Finally, we can evaluate the new chain on the same test set!

In [40]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

View the evaluation results for experiment: 'giving-circle-20' at:
https://smith.langchain.com/o/ded253b7-e6f0-49c4-9c1c-737484f7f65a/datasets/33602d1c-64ec-4d09-9226-27d3cbcab194/compare?selectedSessions=ae0f4681-0eb8-4ade-b37d-2846b292b0d5





| | inputs.question | outputs.output | error | reference.answer | feedback.correctness | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Based on the insights from 2023 and 2024, how ... | Yo, here’s the lowdown straight from the 2023-... | | Recent advancements in LLMs, as highlighted in... | 1 | 1 | 1 | 3.811722 | 832dd8db-7ff8-48d2-bb76-92f0e0ef6c3c | 17ad4b80-aa2c-4513-9e71-ceb1dc1add9e |
| 1 | How does ChatGPT's abilty to generate code and... | Alright, here’s the lowdown, straight from the... | | ChatGPT's strong capability to generate code, ... | 1 | 1 | 1 | 5.225174 | 8c1110b0-57bd-40f9-a5c4-fe8fb631c4ca | 74f51dec-7444-4c51-adf7-c9daff005a96 |
| 2 | How does googles research and development in A... | Yo, here’s the lowdown straight from the conte... | | The context indicates that Google has made sig... | 1 | 1 | 1 | 4.500851 | fa8b213b-29a1-4f51-ab19-8e7ebefad19c | 20def45a-e921-45ed-9283-56d3f24c836c |
| 3 | how LLMs are really smart but also dumb and ho... | Yo, here’s the lowdown straight from the vibe ... | | The context explains that LLMs are really smar... | 1 | 0 | 1 | 3.001078 | 53bec1a8-545f-41a3-a356-0ed5b2c1ef02 | f077e4aa-4a31-4db0-a7ec-0c8fe2698cf1 |
| 4 | How do the costs and efficiency of trainng lar... | Yo, here’s the lowdown on training big languag... | | Recent developments show that training large l... | 1 | 1 | 1 | 6.470085 | c10f1324-08d8-455d-9e3a-258f559f22c5 | 36e4d3db-f53f-4481-95bb-cfbdfa3f2ad9 |
| 5 | How do openly licensed models like DeepSeek v3... | Alright, here's the lowdown, fresh and fly:\n\... | | DeepSeek v3, as one of the largest openly lice... | 1 | 1 | 1 | 7.173961 | 9735285e-414c-4beb-9d2f-51d3a7018dae | 36e01403-0a9e-4687-a3a6-aa08e1df7206 |
| 6 | How do the themes of LLMs' smartness versus du... | Alright, listen up — here’s the lowdown on the... | | The first segment highlights that LLMs are bot... | 1 | 1 | 1 | 4.962666 | 37e8f585-502f-49d3-868c-b0796f3101bc | 99a1f738-5ff6-40a8-84db-9519709ee85a |
| 7 | How do the impact of AI models on industry and... | Alright, here’s the lowdown straight from the ... | | The context highlights that AI models, particu... | 1 | 1 | 1 | 9.906588 | 2c916b07-5c70-4b9d-8eba-e757b91958e7 | dae1b506-7f66-4164-801b-648f96596517 |
| 8 | is it ok to use homebrew to run llms without a... | Yo, based on the context you dropped, running ... | | The context discusses the development and impa... | 1 | 1 | 1 | 3.654021 | 85b922a9-923b-43cb-a871-8e0a70050400 | c1565228-b41f-429c-a66b-f53f202df646 |
| 9 | What significant developments in AI occurred i... | Ah, 2023 was the year the AI game leveled up b... | | 2023 was the breakthrough year for Large Langu... | 1 | 1 | 1 | 5.059165 | 3e091db8-830a-4eed-abe7-0d8b105cc83e | 93641013-61ce-4710-90a0-ee8eb3cc8333 |


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

#### Answer:

![Activity 3 Screenshot](screenshots/Activity_3_screenshot.png)

Additional screenshot to show latency:

![Activity 3 Screenshot](screenshots/Activity_3_screenshot_part2.png)

Let's start with the dopeness metric: it clearly improved dramatically, from an average score of 0 (0 out of 12) with default_chain_init to an average score of 1 (12 out of 12) with dope_chain, because we added "You must answer the questions in a dope way, be cool!" to the DOPE_RAG_PROMPT. Since the prompt is fed into our chat model llm, it knew to make its responses dope. Since this line wasn't in default_chain_init's prompt, the chat model llm wasn't told to make its responses dope.

Regarding the correctness metric: it didn't change (both runs had an average score of 1). Since default_chain_init already averaged 1 on correctness, its smaller chunk size was still big enough to give our chat model llm good enough context (as described in my answer to Question 2 above) to get the right answer. Similarly, even though default_chain_init used a smaller embedding model, it still captured enough nuanced meanings and relationships for the retrieved context to lead the chat model llm to the correct answer.

Regarding the helpfulness metric: it didn't change either (both runs had an average score of 0.75, i.e. 9 out of 12). So the larger chunk size didn't add enough unnecessary information to each chunk to confuse our chat model llm into being less helpful. Nor did the potential benefits of larger chunks (e.g. keeping answers intact within a chunk) improve the context enough to make the chat model llm more helpful. Similarly, the larger embedding model in dope_chain didn't capture enough additional nuance to improve the context to the point where the chat model llm gave more helpful answers.

However, as we see in the 2nd screenshot above, the dope_chain did have more latency. So if these were the only non-latency/cost metrics we cared about, we should revert to the smaller embedding model (and possibly the smaller chunks, although that may not make a huge difference in latency and is worth experimenting with) to get less latency for the same performance on these metrics. The point here is that the updated RAG prompt is most likely the real reason for the improvement in dopeness score, not the larger chunk size or the larger embedding model (we can test this to be sure), so we can keep the prompt but revert the embedding model and chunk size.
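One way to structure that check is a small ablation plan that changes one factor at a time relative to the baseline. The sketch below only enumerates the configurations; the factor names and level labels are placeholders I chose for illustration, and actually building and evaluating each chain is left out.

```python
from itertools import product

# Hypothetical labels for the three things that changed between the two chains.
factors = {
    "prompt": ("default", "dope"),
    "chunk_size": ("small", "large"),
    "embedding_model": ("small", "large"),
}

# Baseline: the first level of every factor (i.e. the original chain).
baseline = {name: levels[0] for name, levels in factors.items()}

# One-at-a-time ablation: flip a single factor and keep the rest at baseline.
one_at_a_time = [{**baseline, name: levels[1]} for name, levels in factors.items()]

# Full grid: every combination, useful if the factors might interact.
full_grid = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for config in one_at_a_time:
    print(config)
print(f"{len(one_at_a_time)} one-at-a-time runs vs {len(full_grid)} full-grid runs")
```

If only the run with the changed prompt moves the dopeness score, that supports attributing the improvement to the prompt.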

That being said, 12 examples is a very small sample statistically speaking, so some of this could just be noise. We could rerun the evaluation a few times (or increase the number of questions) to be more confident that both chains actually perform the same on the correctness and helpfulness metrics.
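As a rough illustration of how noisy 12 examples are, the sketch below computes a 95% Wilson score interval for a pass rate. The function name `wilson_interval` is my own, and treating each example as an independent pass/fail trial is an assumption (LLM-judged scores on related questions may not be independent).

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Helpfulness: 9 of 12 judged helpful in both runs.
helpfulness_low, helpfulness_high = wilson_interval(9, 12)
# Correctness: 12 of 12 judged correct in both runs.
correctness_low, correctness_high = wilson_interval(12, 12)

print(f"helpfulness 9/12:  {helpfulness_low:.2f} to {helpfulness_high:.2f}")
print(f"correctness 12/12: {correctness_low:.2f} to {correctness_high:.2f}")
```

For 9 out of 12 the interval spans roughly 0.47 to 0.91, so two chains that both score 0.75 on this test set could still differ quite a bit in their true helpfulness rate.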