# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, it would be much harder to get a high-quality early signal on our application's performance.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [16]:
#!pip install -qU ragas==0.2.10

In [17]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

In [1]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [2]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
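
If you re-run the notebook often, being prompted for every key each time gets tedious. Here's a minimal sketch of a helper that only prompts when a variable isn't already set. The name `ensure_env` and the `DEMO_API_KEY` variable are hypothetical and used purely for illustration.

```python
import getpass
import os


def ensure_env(name, prompt=None):
    """Return the value of an environment variable, prompting only if it is missing."""
    if not os.environ.get(name):
        # Only reached when the variable is unset or empty.
        os.environ[name] = getpass.getpass(prompt or f"{name}: ")
    return os.environ[name]


# Hypothetical variable, set here so the call below returns without prompting.
os.environ["DEMO_API_KEY"] = "already-set"
value = ensure_env("DEMO_API_KEY")
```

In the notebook you could call it with `"OPENAI_API_KEY"` or `"LANGCHAIN_API_KEY"` instead of the demo variable.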

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set we can use to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data by downloading the webpages we'll be using today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [20]:
!mkdir data

mkdir: data: File exists


In [21]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html



In [22]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html



Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [10]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
document = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


In [12]:
docs = document

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [5]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [13]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [14]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finds the headlines within each document, which are then used to split it into smaller nodes
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.

In [15]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so we don't shadow the imported `default_transforms` function
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 14, relationships: 71)
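
To build some intuition for how similarity between embeddings can turn into relationships, here's a minimal pure-Python sketch. It is not Ragas' implementation: the toy vectors, the node names, and the `0.9` threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" (real ones have far more dimensions).
toy_nodes = {
    "summary_2023": [0.9, 0.1, 0.0],
    "summary_2024": [0.8, 0.2, 0.1],
    "unrelated": [0.0, 0.1, 0.9],
}


def build_relationships(nodes, threshold=0.9):
    """Connect every pair of nodes whose similarity meets the threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(nodes.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


relationships = build_relationships(toy_nodes)
print(relationships)
```

Only the two summary nodes point in a similar direction, so they are the only pair that gets connected. A lower threshold would produce a denser graph.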

We can save and load our knowledge graphs as follows.

In [16]:
kg.save("ai_across_years_kg.json")
ai_across_years_kg = KnowledgeGraph.load("ai_across_years_kg.json")
ai_across_years_kg

KnowledgeGraph(nodes: 14, relationships: 71)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [17]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=ai_across_years_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers is going to tackle a separate kind of query, which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [18]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
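
The weights in a query distribution like the one above can be read as "what share of the testset should come from each synthesizer". As a rough sketch of how weights could map onto a fixed number of samples, here's a largest-remainder allocation in plain Python. This is an assumption for illustration, not a description of how Ragas allocates samples internally, and the dictionary keys are just labels.

```python
def allocate_counts(weights, testset_size):
    """Split testset_size across weights using largest-remainder rounding."""
    raw = [w * testset_size for w in weights]
    counts = [int(r) for r in raw]
    leftover = testset_size - sum(counts)
    # Give the remaining samples to the entries with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Labels mirror the three synthesizers and their weights (0.5, 0.25, 0.25).
weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = dict(zip(weights, allocate_counts(list(weights.values()), 10)))
print(counts)
```

With a testset size of 10 the two 25% buckets can't both get 2.5 samples, so one of them has to round up and the other down.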

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.


#### Answer #1:

SingleHopSpecificQuerySynthesizer:
50% of questions.
Generates specific, direct questions that can be answered by retrieving relevant information from a single chunk or document.

MultiHopAbstractQuerySynthesizer:
25% of questions.
Generates abstract questions that require reasoning over context from multiple documents.

MultiHopSpecificQuerySynthesizer:
25% of questions.
Generates specific, direct questions whose answers require combining information from multiple documents.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [19]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What TII in Abu Dhabi do?,[Code may be the best application The ethics o...,TII in Abu Dhabi has produced a better-than-GP...,single_hop_specifc_query_synthesizer
1,How do Large Language Models handle the comple...,[Based Development As a computer scientist and...,The grammar rules of programming languages lik...,single_hop_specifc_query_synthesizer
2,What LLMs do in 2023?,[Simon Willison’s Weblog Subscribe Stuff we fi...,"In 2023, Large Language Models (LLMs) were con...",single_hop_specifc_query_synthesizer
3,What are the ethical implications of using AI ...,[easy to follow. The rest of the document incl...,The ethical implications of using AI models tr...,single_hop_specifc_query_synthesizer
4,What advancements in AI model training have be...,[Prompt driven app generation is a commodity a...,"In 2024, China has been part of the global adv...",single_hop_specifc_query_synthesizer
5,How does the black box nature of Large Languag...,[<1-hop>\n\nCode may be the best application T...,The black box nature of Large Language Models ...,multi_hop_abstract_query_synthesizer
6,How has OpenAI influenced the development of L...,[<1-hop>\n\nCode may be the best application T...,OpenAI has played a significant role in the de...,multi_hop_abstract_query_synthesizer
7,"How has the efficiency of AI models, particula...",[<1-hop>\n\nCode may be the best application T...,"The efficiency of AI models, including those c...",multi_hop_abstract_query_synthesizer
8,How has Meta contributed to the advancements i...,[<1-hop>\n\nAnother common technique is to use...,Meta has significantly contributed to the adva...,multi_hop_specific_query_synthesizer
9,How do the capabilities of GPT-4o differ from ...,[<1-hop>\n\nAnother common technique is to use...,GPT-4o has the ability to run web searches and...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [20]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [21]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wht are the ethical implcations of Anthropic's...,[Code may be the best application The ethics o...,Anthropic is one of the organizations that hav...,single_hop_specifc_query_synthesizer
1,What are the ethical implications of using the...,[Based Development As a computer scientist and...,The ChatGPT Code Interpreter's ability to exec...,single_hop_specifc_query_synthesizer
2,Wht is Artificil Intelligenc?,[Simon Willison’s Weblog Subscribe Stuff we fi...,Artificial Intelligence refers to the latest a...,single_hop_specifc_query_synthesizer
3,Is it ethical to use AI models trained on peop...,[easy to follow. The rest of the document incl...,The ethical question of whether it is acceptab...,single_hop_specifc_query_synthesizer
4,How has increased competition influenced the u...,[<1-hop>\n\nPrompt driven app generation is a ...,Increased competition has significantly influe...,multi_hop_abstract_query_synthesizer
5,How has increased competition influenced the u...,[<1-hop>\n\nPrompt driven app generation is a ...,Increased competition in 2024 has significantl...,multi_hop_abstract_query_synthesizer
6,How do evaluations in AI and test-driven devel...,[<1-hop>\n\nPrompt driven app generation is a ...,Evaluations in AI and test-driven development ...,multi_hop_abstract_query_synthesizer
7,How has the development of prompt-driven app g...,[<1-hop>\n\nPrompt driven app generation is a ...,The development of prompt-driven app generatio...,multi_hop_abstract_query_synthesizer
8,How do the challenges of evaluating LLMs and t...,[<1-hop>\n\ndependent on AGI itself. A model t...,The challenges of evaluating LLMs and the conc...,multi_hop_specific_query_synthesizer
9,What are the challenges and ethical concerns a...,[<1-hop>\n\neasy to follow. The rest of the do...,The development and deployment of Large Langua...,multi_hop_specific_query_synthesizer


We've already provided our LangSmith API key and set tracing to "true" at the top of the notebook, so we're ready to move on to LangSmith.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [22]:
from langsmith import Client

client = Client()

dataset_name = "State of AI Across the Years!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years!"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [23]:
for data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
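
To make the expected shape explicit, here's a small sketch of the same row-to-example mapping as a standalone function, with no LangSmith calls involved. The helper name `to_langsmith_example` and the sample row are hypothetical. Only the column names (`user_input`, `reference`, `reference_contexts`) come from the Ragas dataframe above.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row onto the inputs/outputs/metadata we upload."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row with placeholder values, shaped like a Ragas testset row.
sample_row = {
    "user_input": "What are Agents?",
    "reference_contexts": ["<reference context text>"],
    "reference": "<reference answer text>",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"], example["outputs"])
```

Keeping the key names `question` and `answer` consistent matters later, since both the RAG chain and the evaluators look values up by those names.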

## Basic RAG Chain

Time for some RAG!


In [24]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [25]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
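
To see what `chunk_size` and `chunk_overlap` mean in practice, here's a deliberately simplified sketch that slices text into fixed-width windows. The real `RecursiveCharacterTextSplitter` is smarter than this (it tries to split on separators such as paragraph and line breaks first), so treat `chunk_text` as a hypothetical stand-in that only illustrates the two parameters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Slice text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Hypothetical 1,200-character "document" made of repeating digits.
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text, chunk_size=500, chunk_overlap=50)

print(len(chunks), [len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-50:] == chunks[1][:50])
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.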

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [26]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [27]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)

In [28]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [29]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using `gpt-4o-mini` for generation.

In [30]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Finally, we can set-up our RAG LCEL chain!

In [31]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
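
If the pipe syntax feels opaque, here's a rough plain-Python sketch of the data flow through a chain like this one: pull out the question, retrieve context, fill in the prompt, call the model, return a string. The retriever and LLM here are stubs (`fake_retriever` and `fake_llm` are hypothetical names), so nothing is sent to OpenAI or Qdrant.

```python
from operator import itemgetter

PROMPT = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question):
    """Stub retriever: returns a canned 'document' instead of a vector search."""
    return [f"(stub context for: {question})"]


def fake_llm(prompt):
    """Stub LLM: reports the prompt size instead of generating text."""
    return f"(stub answer; prompt had {len(prompt)} chars)"


def run_chain(inputs):
    # Step 1: build the {"context": ..., "question": ...} mapping.
    question = itemgetter("question")(inputs)
    context = fake_retriever(question)
    # Step 2: fill the prompt template.
    prompt = PROMPT.format(context=context, question=question)
    # Step 3: call the model and return plain text.
    return str(fake_llm(prompt))


answer = run_chain({"question": "What are Agents?"})
print(answer)
```

The real chain does the same thing, with each step expressed as a runnable that can be traced individually in LangSmith.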

In [57]:
rag_chain.invoke({"question" : "What are Agents?"})

'Agents are infuriatingly vague AI systems that are often thought of as capable of acting on your behalf, similar to a travel agent. There are differing interpretations of what "agents" are, with some viewing them as AI that can run tools in a loop to solve problems. However, the term lacks a clear and widely understood definition, and there are few examples of such systems running in production despite prototypes. The concept of agents is still seen as perpetually "coming soon."'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4o as our evaluation LLM for our base Evaluators.

In [32]:
eval_llm = ChatOpenAI(model="gpt-4o")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [33]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)
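
The `prepare_data` function above is what connects our dataset keys to what the labeled evaluator expects. Here's a small sketch of that mapping on its own, using `SimpleNamespace` objects as hypothetical stand-ins for LangSmith's run and example objects (the real objects carry many more fields).

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Pick out the prediction, reference, and input the evaluator will compare."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Stand-ins with placeholder values; only the key names mirror the notebook.
run = SimpleNamespace(outputs={"output": "I don't know."})
example = SimpleNamespace(
    inputs={"question": "What are Agents?"},
    outputs={"answer": "<reference answer text>"},
)

payload = prepare_data(run, example)
print(payload)
```

If the dataset had been uploaded with different key names, these lookups would fail, which is why we conformed the outputs to `question` and `answer` earlier.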

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`:
- `labeled_helpfulness_evaluator`:
- `dope_or_nope_evaluator`:

#### Activity #2 Answer:

- `qa_evaluator`:
QA Evaluator.
Evaluates whether the answer generated by the RAG system is correct, given the question and the reference answer. This is the simplest QA evaluator.

- `labeled_helpfulness_evaluator`:
Criteria Evaluator with Label.
Evaluates whether the RAG-generated prediction is helpful to the user when compared against the reference answer from the synthetically generated data.

- `dope_or_nope_evaluator`:
Criteria Evaluator with NO Label.
Evaluates whether the output generated by the RAG system is dope/lit/cool or not.

## LangSmith Evaluation

In [34]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'aching-toad-81' at:
https://smith.langchain.com/o/d0d88943-eeab-4eae-90cc-8ae8385f454e/datasets/c2412a63-33d3-4a33-bcf3-3b844222942a/compare?selectedSessions=a26256ef-0bec-40e9-9cf0-b9c37589792e




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How has increased competition influenced the u...,Increased competition has led to a crash in LL...,,Increased competition has significantly influe...,1,1,0,2.299439,9530be8d-3347-4f92-b3d4-0f40e65366f5,413ef1f9-3be8-41b8-9327-913880981ef3
1,Wht are the ethical implcations of Anthropic's...,I don't know.,,Anthropic is one of the organizations that hav...,0,0,0,1.220032,9b6ce400-30aa-45a6-b91c-d790d0b29cf3,1536fe3b-ab02-4124-ac61-5329c3316ed7
2,Wht is Artificil Intelligenc?,I don't know.,,Artificial Intelligence refers to the latest a...,0,0,0,1.043828,59dfc211-a889-4c0c-9b05-5d09bc206c22,32b11119-1937-4266-880c-9f457f4746dd
3,Is it ethical to use AI models trained on peop...,I don't know.,,The ethical question of whether it is acceptab...,0,0,0,1.322364,dc1a518f-12e3-4d0b-8938-8ce9f4038894,59ad6b50-d007-4e70-9169-bee067e4e2b1
4,How does Claude contribute to the development ...,I don't know.,,Claude plays a significant role in the develop...,0,0,0,1.572596,a6d5d904-8aba-4cfc-83c6-b4e340615fda,9ab53ae1-38fa-46c8-a0ea-563e7fff0ec1
5,How do evaluations in AI and test-driven devel...,Evaluations in AI and test-driven development ...,,Evaluations in AI and test-driven development ...,1,1,0,3.058161,0de28f1b-b1f4-43c7-80f9-e4a835bd7e04,0c839243-d654-41a8-aa5e-3e8183a44783
6,How has increased competition influenced the u...,Increased competition has led to a crash in LL...,,Increased competition in 2024 has significantl...,1,0,0,2.21012,86567a9c-9a0c-4bd2-a773-0e13c826b9ea,ef982cba-1a6f-4552-931a-643d6c9cdb10
7,What are the cost and efficiency benefits of u...,I don't know.,,GPT-4o is significantly more cost-effective th...,0,0,0,1.269939,1ee9ebe2-e0eb-4a5e-bc93-e8928006aaef,934a921b-db11-45b4-b7fc-a44a6958532b
8,What are the challenges and ethical concerns a...,The challenges and ethical concerns associated...,,The development and deployment of Large Langua...,1,0,0,4.983141,1db446cd-2189-4f7f-82c9-d599796d1f29,4c07946e-1995-498c-8ea9-7e0b9b578cb9
9,How has the development of prompt-driven app g...,The development of prompt-driven app generatio...,,The development of prompt-driven app generatio...,1,0,0,4.634722,011427a2-26b5-4fb5-b5e2-3987d3ee1a51,2f0d00c3-d711-490d-a2f1-729d163bb254


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model to: `text-embedding-3-large`

Let's see how this changes our evaluation!

In [35]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [36]:
rag_documents = docs

In [37]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

#### Answer #2:
Because a larger chunk size means each chunk (and its vector) covers a broader span of text, so each retrieved chunk provides more context for answering the user's question.

In [38]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")


#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

#### Answer #3:

A larger embedding model represents each chunk with more dimensions, so it can capture finer-grained meaning. Cosine similarity between the query and the chunks should then be more reliable, which helps improve the retriever.

In [39]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Across Years (Augmented)"
)

In [40]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [41]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we asked before.

In [68]:
dope_rag_chain.invoke({"question" : "what are Agents?"})

'Agents? Man, they’re this super vague concept in the AI world. Some folks think of them as digital assistants that act on your behalf—like a travel agent. Others see them as AI models with tools, looping through tasks to solve problems. But here’s the kicker: the term is so fuzzy that it leaves you scratching your head, since everyone seems to have their own take on what it really means. Plus, there’s this whole issue of gullibility—how can these agents make smart choices if they can’t tell reality from fiction? So yeah, they’re like this elusive dream still waiting for a real breakthrough. 💫'

Finally, we can evaluate the new chain on the same test set!

In [42]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

View the evaluation results for experiment: 'long-note-58' at:
https://smith.langchain.com/o/d0d88943-eeab-4eae-90cc-8ae8385f454e/datasets/c2412a63-33d3-4a33-bcf3-3b844222942a/compare?selectedSessions=ad61dafa-9576-4854-99d6-931e5dd4f201




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How has increased competition influenced the u...,"Yo, check it out! Increased competition in the...",,Increased competition has significantly influe...,1,0,1,3.240619,9530be8d-3347-4f92-b3d4-0f40e65366f5,ccc30ac1-e924-4431-914e-cfb3a18222f6
1,Wht are the ethical implcations of Anthropic's...,I don't know.,,Anthropic is one of the organizations that hav...,0,0,0,1.39076,9b6ce400-30aa-45a6-b91c-d790d0b29cf3,d0dd48ea-249e-4f31-809a-b0514162dde8
2,How has the development of prompt-driven app g...,"Yo, check it! In 2024, we saw a major glow-up ...",,The development of prompt-driven app generatio...,1,0,1,3.756153,011427a2-26b5-4fb5-b5e2-3987d3ee1a51,19647c14-3d57-41af-8914-c9acfa4399ce
3,Wht is Artificil Intelligenc?,I don't know.,,Artificial Intelligence refers to the latest a...,0,0,0,2.213942,59dfc211-a889-4c0c-9b05-5d09bc206c22,5591eca8-3e34-4b58-9da8-486cd922ef5b
4,How does Claude contribute to the development ...,"Yo, Claude is making waves in the LLM scene! T...",,Claude plays a significant role in the develop...,0,0,1,3.943505,a6d5d904-8aba-4cfc-83c6-b4e340615fda,323817c9-4da5-4013-a2dd-db1003116e68
5,Is it ethical to use AI models trained on peop...,"Yo, that's a tricky one! The context hints at ...",,The ethical question of whether it is acceptab...,0,0,1,2.636524,dc1a518f-12e3-4d0b-8938-8ce9f4038894,3e8b3940-f848-4ec0-8cf0-2782adb36574
6,How do evaluations in AI and test-driven devel...,"Yo, evaluations in AI and test-driven developm...",,Evaluations in AI and test-driven development ...,1,1,1,3.697532,0de28f1b-b1f4-43c7-80f9-e4a835bd7e04,a4c17876-298a-4216-9c03-371cb82d2318
7,What are the ethical implications of using the...,I don't know.,,The ChatGPT Code Interpreter's ability to exec...,0,0,0,1.852164,802b6c4e-9a70-4ea7-91ba-b6be3ca1e882,0d685162-124d-46b4-804b-e2cde8d6c800
8,What are the challenges and ethical concerns a...,"Yo, let’s break it down! The challenges with L...",,The development and deployment of Large Langua...,1,0,1,5.228367,1db446cd-2189-4f7f-82c9-d599796d1f29,e5dc477c-e836-4708-b0f4-85ccf230393a
9,How do the challenges of evaluating LLMs and t...,"Yo, check it out! The challenges of evaluating...",,The challenges of evaluating LLMs and the conc...,1,0,1,4.395812,1cf73fe0-b362-4df4-abe4-3999551b7c56,a6b100b8-b9ee-48e9-89dd-0deb6bb87892


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.


#### Activity #3 Answer:

The correctness improved slightly because the retrieved context contained more information, thanks to the larger chunk size and the higher-dimensional vectors from the larger embedding model. This slightly improved the retriever, which in turn slightly improved performance on the correctness metric.

The dopeness metric improved significantly (0% to 75%) through prompt engineering. Adding "You must answer the questions in a dope way, be cool!" to the prompt was enough to move it.

The helpfulness metric went down. The evaluator found the predictions generated by the modified RAG chain to be less helpful compared to the references from the synthetically generated data. The most common reason the evaluator gave for marking predictions as unhelpful was that they covered the main ideas asked about in the input but lacked the specific examples and details found in the reference answer.
