# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, inspired by the [Evol Instruct](https://arxiv.org/abs/2304.12244) paper.



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, getting a high-quality early signal on our application's performance would be much harder.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

In [2]:
!pip install -qU langsmith langchain-core langchain-community langchain-openai langchain-qdrant

In [3]:
!pip install -qU pymupdf ragas

We'll need to provide our LangSmith API key, and set tracing to "true".

In [4]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [5]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG MIKE DEAN - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [6]:
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Task 2: Loading Source Documents

In order to create a synthetic dataset, we must first load our source documents!

In [7]:
from langchain_community.document_loaders import PyMuPDFLoader

documents = PyMuPDFLoader(file_path="https://s2.q4cdn.com/470004039/files/doc_earnings/2024/q3/filing/_10-Q-Q3-2024-As-Filed.pdf").load()

## Task 3: Generate Synthetic Data

Let's first take a peek under the RAGAS hood to see what's happening when we generate a single example.

For simplicity's sake - we'll look at a flow that results in a reasoning question.

### Two LLMs To Rule Them All

- `generator_llm` - will generate our seed questions and evolutions
- `critic_llm` - will act as a critic to verify if the evolutions are as we expect them to be

### Entering the Generation

We'll enter the generation process with our `generate_with_langchain_docs()` method - let's look at how that is implemented:

```python
def generate_with_langchain_docs(
    self,
    documents: t.Sequence[LCDocument],
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
):
    distributions = distributions or {}
    # chunk documents and add to docstore
    self.docstore.add_documents(
        [Document.from_langchain_document(doc) for doc in documents]
    )

    return self.generate(
        test_size=test_size,
        distributions=distributions,
        with_debugging_logs=with_debugging_logs,
        is_async=is_async,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
    )
```

As you can see - before we do anything, our `docstore` is populated using the provided `documents`.

Then we move on to `generate()`. Let's see how that works next!

### Generating Examples!

> NOTE: You can see the full implementation [here](https://github.com/explodinggradients/ragas/blob/fe379a1c97d18ce2c203d80432a3da6622337968/src/ragas/testset/generator.py#L234), we'll work through the pseudo-code.

```python
function generate(test_size, distributions, other_params...):
    # Validate and set default values
    if distributions not provided:
        distributions = DEFAULT_DISTRIBUTION
    
    validate_distributions_sum_to_one(distributions)
    
    set_up_run_config()
    initialize_docstore()
    
    # Initialize evolutions
    for each evolution in distributions:
        initialize_evolution(evolution)
    
    set_up_debugging_logs_if_needed()
    
    # Set up execution environment
    executor = create_executor()
    
    # Get initial nodes
    current_nodes = get_random_nodes_from_docstore(test_size)
    
    total_evolutions = 0
    
    # Distribute evolutions based on probabilities
    for each evolution, probability in distributions:
        num_samples = round(probability * test_size)
        for i in random_sample(range(test_size), num_samples):
            submit_task_to_executor(evolution.evolve, current_nodes[i])
            total_evolutions += 1
    
    # Add filler evolutions if needed
    while total_evolutions < test_size:
        random_evolution = choose_random_evolution(distributions)
        submit_task_to_executor(random_evolution.evolve, current_nodes[total_evolutions])
        total_evolutions += 1
    
    # Get results
    test_data_rows = executor.get_results()
    if test_data_rows is empty:
        raise Exception("No results generated")

    return test_data_rows
```

In essence, we:

1. Do some validation of inputs, and initialize our evolutions.
2. Get some random nodes from our docstore.
3. Evolve the current nodes based on the desired distribution.
4. Fill with sampled evolutions if we're not at the desired number of rows.
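
The allocation part of these steps can be sketched in a few lines of plain Python. This is a minimal sketch, not the RAGAS implementation: the function name `allocate_evolutions` is hypothetical, evolutions are represented by plain strings, and the assignment of random docstore nodes is left out. It only mirrors the "round, then fill" logic from the pseudo-code.

```python
import random


def allocate_evolutions(distributions, test_size, seed=0):
    """Return one evolution name per requested test row (hypothetical helper)."""
    total = sum(distributions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"distribution weights must sum to 1, got {total}")

    rng = random.Random(seed)
    assigned = []

    # First pass: each evolution gets round(probability * test_size) rows.
    for name, probability in distributions.items():
        assigned.extend([name] * round(probability * test_size))

    # Rounding can leave us short; fill the gap with randomly chosen evolutions.
    names = list(distributions)
    while len(assigned) < test_size:
        assigned.append(rng.choice(names))

    return assigned


plan = allocate_evolutions(
    {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}, test_size=20
)
print({name: plan.count(name) for name in sorted(set(plan))})
```

With weights that divide the test size evenly, as here, the filler loop never runs; it only matters when rounding leaves fewer tasks than `test_size`.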

### Peeking into the Complex Evolution Implementation for Reasoning Questions.

> NOTE: You can see the full implementation [here](https://github.com/explodinggradients/ragas/blob/fe379a1c97d18ce2c203d80432a3da6622337968/src/ragas/testset/evolutions.py#L375). We'll work through the high-level implementation below.

Let's look into how the "Complex Evolution" is implemented:

1. First, we use [`_aevolve()`](https://github.com/explodinggradients/ragas/blob/fe379a1c97d18ce2c203d80432a3da6622337968/src/ragas/testset/evolutions.py#L289) to generate a "Seed Question".

```python
simple_question, current_nodes, _ = await self.se._aevolve(
            current_tries, current_nodes
        )
```

2. We use our provided `question_prompt` to generate a reasoning question.

```python
result = await self.generator_llm.generate(
            prompt=question_prompt.format(
                question=simple_question, context=merged_node.page_content
            )
        )
```

> PROMPT (implementation [here](https://github.com/explodinggradients/ragas/blob/fe379a1c97d18ce2c203d80432a3da6622337968/src/ragas/testset/prompts.py#L15)):

```python
instruction="""Complicate the given question by rewriting question into a multi-hop reasoning question based on the provided context.
    Answering the question should require the reader to make multiple logical connections or inferences using the information available in given context.
    Rules to follow when rewriting question:
    1. Ensure that the rewritten question can be answered entirely from the information present in the contexts.
    2. Do not frame questions that contains more than 15 words. Use abbreviation wherever possible.
    3. Make sure the question is clear and unambiguous.
    4. phrases like 'based on the provided context','according to the context',etc are not allowed to appear in the question."""
```

3. We verify the question is valid.

```python
is_valid_question, feedback = await self.question_filter.filter(
            reasoning_question
        )
```

> PROMPT (implementation [here](https://github.com/explodinggradients/ragas/blob/fe379a1c97d18ce2c203d80432a3da6622337968/src/ragas/testset/prompts.py#L390))

```python
instruction="""
Asses the given question for clarity and answerability given enough domain knowledge, consider the following criteria:
1.Independence: Can the question be understood and answered without needing additional context or access to external references not provided within the question itself? Questions should be self-contained, meaning they do not rely on specific documents, tables, or prior knowledge not shared within the question.
2.Clear Intent: Is it clear what type of answer or information the question seeks? The question should convey its purpose without ambiguity, allowing for a direct and relevant response.
Based on these criteria, assign a verdict of "1" if a question is specific, independent, and has a clear intent, making it understandable and answerable based on the details provided. Assign "0" if it fails to meet one or more of these criteria due to vagueness, reliance on external references, or ambiguity in intent.
Provide feedback and a verdict in JSON format, including suggestions for improvement if the question is deemed unclear. Highlight aspects of the question that contribute to its clarity or lack thereof, and offer advice on how it could be reframed or detailed for better understanding and answerability.
"""
```

4. We [handle the question](https://github.com/explodinggradients/ragas/blob/fe379a1c97d18ce2c203d80432a3da6622337968/src/ragas/testset/evolutions.py#L401) if it's not valid, otherwise we compress the question:

```python
compressed_question = await self._transform_question(
            prompt=self.compress_question_prompt, question=reasoning_question
        )
```

> PROMPT (implementation [here](https://github.com/explodinggradients/ragas/blob/fe379a1c97d18ce2c203d80432a3da6622337968/src/ragas/testset/prompts.py#L100))

```python
instruction="""Rewrite the following question to make it more indirect and shorter while retaining the essence of the original question.
    The goal is to create a question that conveys the same meaning but in a less direct manner. The rewritten question should shorter so use abbreviation wherever possible."""
```

5. Filter the newly compressed question based on a comparison to the original simple question.

```python
if await self.evolution_filter.filter(simple_question, compressed_question):
            # retry
            current_nodes = self.se._get_new_random_node()
            logger.debug(
                "evolution_filter failed, retrying with %s", len(current_nodes.nodes)
            )
            return await self.aretry_evolve(current_tries, current_nodes)
```

`filter` is implemented as follows, with our Critic LLM:

```python
    async def filter(self, simple_question: str, compressed_question: str) -> bool:
        prompt = self.evolution_elimination_prompt.format(
            question1=simple_question, question2=compressed_question
        )
        results = await self.llm.generate(prompt=prompt)
        results = results.generations[0][0].text.strip()
        results = await evolution_elimination_parser.aparse(results, prompt, self.llm)
        results = results.dict() if results is not None else {}
        logger.debug("evolution filter: %s", results)
        return results.get("verdict") == 1
```
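
To make the last line of `filter` concrete, here is a minimal standalone sketch of the verdict check. It swaps the RAGAS parser for plain `json.loads`, and the function name `verdict_is_one` and the sample strings are hypothetical, so treat it as an illustration rather than the library's implementation.

```python
import json


def verdict_is_one(raw_llm_output: str) -> bool:
    """Parse a critic response and report whether its verdict equals 1."""
    try:
        parsed = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        parsed = {}  # mirrors the `{}` fallback used when parsing yields nothing
    if not isinstance(parsed, dict):
        parsed = {}
    return parsed.get("verdict") == 1


# Hypothetical critic outputs, for illustration only.
print(verdict_is_one('{"reason": "same depth and requirements", "verdict": 1}'))
print(verdict_is_one('{"reason": "second question needs more reasoning", "verdict": 0}'))
print(verdict_is_one("not json"))
```

Note that in the calling code shown earlier, a `True` result from the filter is what sends the evolution down the retry branch.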

Let's zoom back out now!



### Generating Answers:

For answer generation, we simply ask the LLM to answer the question we evolved using the context associated with our evolution - that's it!

We will use this output format:

```python
class AnswerFormat(BaseModel):
    answer: str
    verdict: int
```

Using [this prompt](https://github.com/explodinggradients/ragas/blob/fe379a1c97d18ce2c203d80432a3da6622337968/src/ragas/testset/prompts.py#L143):

```python
instruction="""Answer the question using the information from the given context. Output verdict as '1' if answer is present '-1' if answer is not present in the context."""
```

This uses our Generator LLM.

Actually creating our Synthetic Dataset is as simple as running the following cell!

In [8]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-4o-mini")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

#### ❓ Question #1:

What do the distributions do *specifically*?

> NOTE: More information is available [here](https://docs.ragas.io/en/latest/concepts/testset_generation.html#in-depth-evolution) on the evolution distributions.

---

### ANSWER #1:
This is the relevant code snippet from the pseudocode earlier in the notebook:

```python
    # Distribute evolutions based on probabilities
    for each evolution, probability in distributions:
        num_samples = round(probability * test_size)
        for i in random_sample(range(test_size), num_samples):
            submit_task_to_executor(evolution.evolve, current_nodes[i])
            total_evolutions += 1
```
`distributions` is a dictionary whose keys (`simple`, `multi_context`, `reasoning`) are evolution types and whose values are the target share of questions of each type: here 50%, 40%, and 10%, respectively. In the code, the number of samples for each evolution is computed as `round(probability * test_size)`, where `test_size` is our desired number of questions. The inner `for` loop then picks that many random node indices and submits one evolution task per index to the executor (one of `simple.evolve`, `multi_context.evolve`, or `reasoning.evolve`). The submitted tasks therefore roughly follow the distribution: about 50% `simple.evolve`, 40% `multi_context.evolve`, and 10% `reasoning.evolve`.
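
Since evolutions can be retried or dropped along the way, it can be worth comparing the realized mix against the target once a testset exists. Below is a minimal sketch using only the standard library; the helper name `evolution_shares` and the `labels` list are hypothetical stand-ins for the `evolution_type` values of a generated testset.

```python
from collections import Counter


def evolution_shares(evolution_types):
    """Map each evolution type to its share of the generated rows."""
    counts = Counter(evolution_types)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Hypothetical labels standing in for the `evolution_type` column.
labels = ["simple"] * 10 + ["multi_context"] * 8 + ["reasoning"] * 2
target = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}

for name, share in evolution_shares(labels).items():
    print(f"{name}: realized {share:.2f}, target {target[name]:.2f}")
```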

---

Let's generate!

> NOTE: This cell will take some time, and also make a lot of calls to OpenAI's endpoints! You may run into rate-limits during this cell!

---



In [9]:
testset = generator.generate_with_langchain_docs(documents, 20, distributions, with_debugging_logs=True)

embedding nodes:   0%|          | 0/64 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/20 [00:00<?, ?it/s]

[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'structure': 2, 'relevance': 3, 'score': 2.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['European Commission State Aid Decision', 'Ireland tax aid', 'Commercial paper program', 'Share repurchase program', 'General Court of the Court of Justice']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'structure': 3, 'relevance': 3, 'score': 2.75}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Apple Inc.', 'Condensed consolidated statements of cash flows', 'Operating activities', 'Investing activities', 'Financing activities']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'structure': 2, 'relevance': 3, 'score': 2.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Digital Markets Act Investigations', 'U.S. Department of Justice lawsuit', 'Epic Games lawsuit', 'Antitrust laws', 'Compliance plan']
[ragas.testset.filters.DEB

In [13]:
for data_row in testset.test_data:
    question = data_row.question
    contexts = data_row.contexts
    ground_truth = data_row.ground_truth
    evolution_type = data_row.evolution_type
    metadata = data_row.metadata
    
    # Process each element as needed
    print(f"Question: {question}")
    print(f"Contexts: {contexts}")
    print(f"Ground Truth: {ground_truth}")
    print(f"Evolution Type: {evolution_type}")
    print(f"Metadata: {metadata}")
    print("\n")  # For better readability

Question: What was the net income for Apple Inc. for the three months ended June 29, 2024?
Contexts: ['Apple Inc.\nCONDENSED CONSOLIDATED STATEMENTS OF COMPREHENSIVE INCOME (Unaudited)\n(In millions)\nThree Months Ended\nNine Months Ended\nJune 29,\n2024\nJuly 1,\n2023\nJune 29,\n2024\nJuly 1,\n2023\nNet income\n$ \n21,448 \n$ \n19,881 \n$ \n79,000 \n$ \n74,039 \nOther comprehensive income/(loss):\nChange in foreign currency translation, net of tax\n \n(73)  \n(385)  \n(87)  \n(494) \nChange in unrealized gains/losses on derivative \ninstruments, net of tax:\nChange in fair value of derivative instruments\n \n406 \n \n509 \n \n331 \n \n(492) \nAdjustment for net (gains)/losses realized and included \nin net income\n \n(87)  \n103 \n \n(678)  \n(1,854) \nTotal change in unrealized gains/losses on \nderivative instruments\n \n319 \n \n612 \n \n(347)  \n(2,346) \nChange in unrealized gains/losses on marketable debt \nsecurities, net of tax:\nChange in fair value of marketable debt securit

#### 🏗️ Activity #1:

Using the debugging logs above - trace through a single example of an evolution.

Mark which LLM (Generator or Critic) was responsible for each step.

---
#### ANSWER ACTIVITY #1:
Here is the debug log for one question:
```
[ragas.testset.evolutions.INFO] seed question generated: "What was the increase in Selling, General and Administrative expense during the third quarter of 2024 compared to the same period in 2023?"
[ragas.testset.filters.DEBUG] filtered question: {'feedback': 'The question is specific and clear, asking for the increase in Selling, General and Administrative expense during the third quarter of 2024 compared to the same period in 2023. It does not rely on external references or unspecified contexts, making it understandable and answerable based on the details provided.', 'verdict': 1}
[ragas.testset.evolutions.DEBUG] [MultiContextEvolution] simple question generated: "What was the increase in Selling, General and Administrative expense during the third quarter of 2024 compared to the same period in 2023?"
[ragas.testset.evolutions.DEBUG] [MultiContextEvolution] multicontext question generated: "What was the dollar increase in Selling, General and Administrative expenses in Q3 2024 relative to Q3 2023, and how does this change relate to the overall operating expenses as a percentage of total net sales during the same periods?"
[ragas.testset.filters.DEBUG] filtered question: {'feedback': 'The question asks for the dollar increase in Selling, General and Administrative (SG&A) expenses from Q3 2023 to Q3 2024 and how this change relates to the overall operating expenses as a percentage of total net sales during the same periods. It is clear in specifying the time periods (Q3 2023 and Q3 2024) and the financial metrics of interest (SG&A expenses, operating expenses, and total net sales). The intent is also clear, seeking both a numerical change and a contextual analysis. However, the question assumes access to specific financial data for Q3 2023 and Q3 2024, which may not be provided within the question itself. To improve independence, the question could include the relevant financial figures or specify where this data can be found.', 'verdict': 0}
[ragas.testset.evolutions.INFO] rewritten question: "What was the dollar increase in Selling, General and Administrative expenses in Q3 2024 relative to Q3 2023, and how does this change relate to the overall operating expenses as a percentage of total net sales during the same periods?"
[ragas.testset.filters.DEBUG] filtered question: {'feedback': 'The question asks for the dollar increase in Selling, General and Administrative expenses from Q3 2023 to Q3 2024 and how this change relates to the overall operating expenses as a percentage of total net sales during the same periods. It is clear in specifying the time periods (Q3 2023 and Q3 2024) and the financial metrics of interest (Selling, General and Administrative expenses, overall operating expenses, and total net sales). However, it assumes access to specific financial data for these periods, which is not provided within the question. To improve clarity and answerability, the question could include the relevant financial figures or be framed in a way that does not require access to external documents.', 'verdict': 0}
[ragas.testset.evolutions.INFO] retrying evolution: 3 times
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'structure': 2, 'relevance': 3, 'score': 2.5}
```

The seed question was generated by the Generator LLM, and the Critic LLM gave it a verdict of 1, so the seed question became the simple question. The simple question was then passed to the multi-context evolution, where the Generator LLM extended it with "...and how does this change relate to the overall operating expenses as a percentage of total net sales during the same periods?". The Critic pointed out that this new question assumes access to financial data that is not provided in the question itself, and returned a verdict of 0. The Generator then rewrote the question but did not fix the issue, and the Critic again returned a verdict of 0, noting that the question could include the relevant financial figures. At this point the log reports the evolution being retried ("retrying evolution: 3 times"), followed by a context scoring step with a score of 2.5. In fact, this question was not used in the final set.
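
Tracing by eye gets tedious for longer logs, so here is a small sketch that labels each log line with the LLM that likely produced it. The mapping from logger module to LLM (`evolutions` to Generator, `filters` to Critic) is an assumption based on the roles described earlier in this notebook, and the `sample` lines are shortened, hypothetical versions of the real log lines.

```python
import ast
import re

LOG_LINE = re.compile(r"^\[ragas\.testset\.(\w+)\.(\w+)\]\s+(.*)$")

# Assumed mapping, following the roles described earlier in this notebook.
ROLE_BY_MODULE = {"evolutions": "Generator", "filters": "Critic"}


def trace(lines):
    """Yield (role, verdict, message) for each recognizable RAGAS log line."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        module, _level, message = match.groups()
        verdict = None
        brace = message.find("{")
        if brace != -1:
            try:
                payload = ast.literal_eval(message[brace:])
            except (ValueError, SyntaxError):
                payload = None  # truncated or malformed payload
            if isinstance(payload, dict):
                verdict = payload.get("verdict")
        yield ROLE_BY_MODULE.get(module, "unknown"), verdict, message


# Shortened, hypothetical log lines in the same shape as the output above.
sample = [
    '[ragas.testset.evolutions.INFO] seed question generated: "What was the increase in SG&A?"',
    "[ragas.testset.filters.DEBUG] filtered question: {'feedback': 'Clear and specific.', 'verdict': 1}",
    "[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'score': 2.5}",
]

for role, verdict, message in trace(sample):
    print(f"{role:9} verdict={verdict} | {message[:50]}")
```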

---

In [14]:
testset.to_pandas()

| Unnamed: 0 | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What was the net income for Apple Inc. for the... | [Apple Inc.\nCONDENSED CONSOLIDATED STATEMENTS... | $21,448 million | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 1 | What is the purpose of the commercial paper pr... | [Note 6 – Income Taxes\nEuropean Commission St... | The purpose of the commercial paper program fo... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 2 | What allegations are being made against the Co... | [PART II — OTHER INFORMATION\nItem 1. \nLega... | The allegations being made against the Company... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 3 | What were the total comprehensive income figur... | [Apple Inc.\nCONDENSED CONSOLIDATED STATEMENTS... | The total comprehensive income figures for App... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 4 | What is Timothy D. Cook's role in the certific... | [Exhibit 31.1\nCERTIFICATION\nI, Timothy D. Co... | Timothy D. Cook's role in the certification of... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 5 | What were the changes in net sales by reportab... | [Products and Services Performance\nThe follow... | The changes in net sales by reportable segment... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 6 | What were the total assets of Apple Inc. as of... | [Apple Inc.\nCONDENSED CONSOLIDATED BALANCE SH... | The total assets of Apple Inc. as of June 29, ... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 7 | What requirements of the Securities Exchange A... | [Exhibit 32.1\nCERTIFICATIONS OF CHIEF EXECUTI... | The Quarterly Report of Apple Inc. complies wi... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 8 | What factors can materially and adversely affe... | [ ended March 30, 2024 (the “second quarter 20... | The context mentions that the Company's busine... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 9 | What certification is associated with the Chie... | [Item 6. \nExhibits\nIncorporated by Reference... | The certification associated with the Chief Fi... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |


# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [16]:
from langsmith import Client

client = Client()

dataset_name = "Apple 10-Q Filing Questions - v4"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about Apple's 10-Q Filing"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [17]:
for _, row in testset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": row["question"]
      },
      outputs={
          "answer": row["ground_truth"]
      },
      metadata={
          # the contexts the question was generated from
          "context": list(row["contexts"])
      },
      dataset_id=dataset.id
  )

## Basic RAG Chain

Time for some RAG!

We'll use the Apple 10-Q filing as our data source today!


In [18]:
rag_documents = PyMuPDFLoader(file_path="https://s2.q4cdn.com/470004039/files/doc_earnings/2024/q3/filing/_10-Q-Q3-2024-As-Filed.pdf").load()

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [20]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [21]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Apple 10-Q"
)

In [22]:
retriever = vectorstore.as_retriever()

To get the "A" in RAG, we'll provide a prompt.

In [23]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using `gpt-4o-mini` - the same model we used as the generator LLM above.

In [24]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Finally, we can set-up our RAG LCEL chain!

In [25]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [26]:
rag_chain.invoke({"question" : "Does Apple seem to be in good financial health?"})

"I don't know."

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4o as our evaluation LLM for our base Evaluators.

In [27]:
eval_llm = ChatOpenAI(model="gpt-4o")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [28]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "llm": eval_llm,
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        }
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "llm": eval_llm,
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        }
    }
)

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

---
#### SHOWN BELOW

- `qa_evaluator`: Whether the chain's answer is correct with respect to the reference answer. This shows up as the correctness score in LangSmith.
- `labeled_helpfulness_evaluator`: Whether the answer is helpful to the user, taking the correct reference answer into account.
- `dope_or_nope_evaluator`: Whether the response is dope, lit, or cool. This criterion is judged without a reference answer.
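
To see what the labeled evaluator actually receives, the `prepare_data` mapping can be exercised on its own. This is a minimal sketch: the `run` and `example` objects below are hypothetical stand-ins built with `SimpleNamespace`, not real LangSmith objects, and the question text is made up for illustration.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Same mapping as the `prepare_data` lambda in the evaluator above."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Hypothetical stand-ins for a LangSmith run and dataset example.
run = SimpleNamespace(outputs={"output": "Net income was $21,448 million."})
example = SimpleNamespace(
    inputs={"question": "What was the net income for the quarter?"},
    outputs={"answer": "$21,448 million"},
)

print(prepare_data(run, example))
```

The chain's output becomes the `prediction`, the synthetic ground truth becomes the `reference`, and the original question is passed along as the `input`.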

---

## LangSmith Evaluation

In [29]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain"},
)

View the evaluation results for experiment: 'memorable-push-54' at:
https://smith.langchain.com/o/b47f3abe-d937-5f35-8caa-ad9d628ed67f/datasets/37f74f54-b12a-4f20-9020-75c834880a36/compare?selectedSessions=2a9b9eea-3e57-4471-b028-cd8f25b6fbf8




0it [00:00, ?it/s]

<ExperimentResults memorable-push-54>

## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [30]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [31]:
rag_documents = PyMuPDFLoader(file_path="https://s2.q4cdn.com/470004039/files/doc_earnings/2024/q3/filing/_10-Q-Q3-2024-As-Filed.pdf").load()

In [32]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

---

#### ANSWER #2:
In this situation we doubled the chunk size from 500 to 1000 characters, leaving the overlap unchanged. This **might** improve the retrieved context, because each chunk now carries more surrounding text. We did not set the number of chunks the retriever returns, so it uses LangChain's default (`k=4` for similarity search); with so few chunks returned, each chunk needs to be large enough to contain sufficient information to form an adequate context.
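
A rough way to see the trade-off is to count chunks and the maximum context size for both settings. The sketch below uses a simplified fixed-window splitter (the hypothetical `chunk_text`), not `RecursiveCharacterTextSplitter`, which also respects separators; the placeholder document and the value `k = 4` are assumptions for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


document = "x" * 10_000  # placeholder text, 10,000 characters
k = 4  # assumed number of chunks retrieved per query

for size in (500, 1000):
    chunks = chunk_text(document, chunk_size=size, chunk_overlap=50)
    print(f"chunk_size={size}: {len(chunks)} chunks, up to {k * size} characters of context")
```

Larger chunks mean fewer, broader chunks in the index and more text handed to the LLM per question, at the cost of each embedding representing a less focused span.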

---

In [33]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

---

#### ANSWER #3:
The larger embedding model produces vectors with more dimensions (3072) than the small model (1536), and this **might** sharpen the "semantic focus" on chunks, improving retrieval for a specific question. For English retrieval, which is our situation, the gap between the two models on the MTEB benchmark is small, so I do not expect much of an effect when we redo the evaluation.
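
The reason the embedding model matters at all is that retrieval ranks chunks by how close their vectors are to the query vector. Here is a toy sketch of that ranking, assuming cosine similarity as the distance measure; the 3-dimensional vectors and chunk names are made up for illustration, whereas real embeddings have 1536 or 3072 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k):
    """Return the names of the k chunks most similar to the query."""
    scored = sorted(
        chunks.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical 3-dimensional "embeddings", for illustration only.
query = [0.9, 0.1, 0.0]
chunks = {
    "net income table": [0.8, 0.2, 0.1],
    "legal proceedings": [0.1, 0.9, 0.2],
    "exhibit index": [0.0, 0.1, 0.9],
}

print(top_k(query, chunks, k=2))
```

A different embedding model changes the vectors, which can change the ranking, and therefore which chunks end up in the prompt.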

---

In [34]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Apple 10-Q (Augmented)"
)

In [64]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [65]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we saw before.

In [66]:
dope_rag_chain.invoke({"question" : "Does Apple seem to be in good financial health?"})

"Yo, based on the context, Apple is looking pretty solid! They've got a total of $331.6 billion in assets, and their net income for the nine months ended June 29, 2024, was $79 billion. Plus, their total net sales are up, with a gross margin of $136.8 billion. All signs point to a healthy financial vibe! 💰📈"

Finally, we can evaluate the new chain on the same test set!

In [67]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

View the evaluation results for experiment: 'brief-plot-77' at:
https://smith.langchain.com/o/b47f3abe-d937-5f35-8caa-ad9d628ed67f/datasets/37f74f54-b12a-4f20-9020-75c834880a36/compare?selectedSessions=28097d35-2595-4f1f-be73-0f5e2b4eef20




0it [00:00, ?it/s]

<ExperimentResults brief-plot-77>

#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

![Comparison of experiments](ImprovedRAGExperiment.png)

Correctness increased, most likely because the chunk size was increased, which provides more context for answering questions. However, it is not clear whether the increase (from 11/20 to 13/20 correct) is significant; the experiment should probably be repeated to look at variability. Changing the embedding model **might** also have impacted this metric. Dopeness clearly exploded, from only 1/20 initially to 18/20, due to the specific prompt change. Helpfulness did not change, and given that correctness only changed slightly, this is probably expected. It is also not clear how adding dopeness might affect the helpfulness metric.
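
As a back-of-the-envelope check on the correctness numbers, we can compare the size of the change with its standard error. This sketch treats the two runs as independent samples of 20 questions, which is a simplifying assumption: both chains were scored on the same questions, so a paired comparison would be more appropriate for a real analysis.

```python
import math


def proportion_and_se(correct, total):
    """Return the observed proportion and its binomial standard error."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)


p_base, se_base = proportion_and_se(11, 20)
p_dope, se_dope = proportion_and_se(13, 20)

diff = p_dope - p_base
se_diff = math.sqrt(se_base**2 + se_dope**2)

print(f"baseline:   {p_base:.2f} +/- {se_base:.2f}")
print(f"dope chain: {p_dope:.2f} +/- {se_dope:.2f}")
print(f"difference: {diff:.2f}, standard error of difference: {se_diff:.2f}")
```

Under these assumptions the difference is smaller than its own standard error, so a change from 11/20 to 13/20 is well within what noise alone could produce.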

One important point, I think, is that we should really only change one parameter at a time, because it is impossible to attribute a change in a metric to a specific action when three different things are changed at once.
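
One way to organize that is to generate the experiment configurations up front, each differing from the baseline in exactly one setting. This is only a sketch of the bookkeeping: the helper `one_factor_at_a_time` and the config keys are hypothetical, and each config would still need to be turned into a chain and passed to `evaluate` separately.

```python
BASELINE = {
    "prompt": "default",
    "chunk_size": 500,
    "embedding_model": "text-embedding-3-small",
}
CHANGES = {
    "prompt": "dope",
    "chunk_size": 1000,
    "embedding_model": "text-embedding-3-large",
}


def one_factor_at_a_time(baseline, changes):
    """Yield (label, config) pairs that differ from the baseline in one setting."""
    yield "baseline", dict(baseline)
    for key, value in changes.items():
        config = dict(baseline)
        config[key] = value
        yield f"{key}={value}", config


for label, config in one_factor_at_a_time(BASELINE, CHANGES):
    print(label, config)
```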

Finally, we can provide an argument to the retriever:
```
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
```
This is important because it determines how many chunks are retrieved. The following shows some runs where I varied this: in run 7 I set `k` to 10, and run 8 shows what happens if I don't set it explicitly, in which case it falls back to LangChain's default of 4. Just another example of how many knobs there are on the RAG machine.

![More experiments](moreExperiments.png)

---