# RAGAS Evaluation for LangChain Agents

In [None]:
!python --version

**R**etrieval **A**ugmented **G**eneration **As**sessment (RAGAS) is an evaluation framework for quantifying the performance of our RAG pipelines. In this example we will see how to use it with a RAG-enabled conversational agent in LangChain.

Because we need an agent and a RAG pipeline to evaluate with RAGAS, the first part of this notebook covers setting up an XML agent with RAG. Jump ahead to **Integrating RAGAS** for the RAGAS section.

To begin, let's install the prerequisites:

In [None]:
!pip install -qU \
    langchain==0.1.1 \
    langchain-community==0.0.13 \
    langchainhub==0.1.14 \
    anthropic==0.14.0 \
    cohere==4.45 \
    pinecone-client==3.1.0 \
    datasets==2.16.1 \
    ragas==0.1.0


In [None]:
import os
from getpass import getpass

# use a key already set in the environment, otherwise prompt for it
# dashboard.cohere.com
os.environ["COHERE_API_KEY"] = os.getenv("COHERE_API_KEY") or getpass("Cohere API key: ")
# app.pinecone.io
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: ")
# console.anthropic.com
os.environ["ANTHROPIC_API_KEY"] = os.getenv("ANTHROPIC_API_KEY") or getpass("Anthropic API key: ")
# platform.openai.com
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

## Finding Knowledge

The first thing we need for an agent using RAG is somewhere we want to pull knowledge from. We will use v2 of the AI ArXiv dataset, available on Hugging Face Datasets at [`jamescalam/ai-arxiv2-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-chunks).

_Note: we're using the prechunked dataset. For the raw version see [`jamescalam/ai-arxiv2`](https://huggingface.co/datasets/jamescalam/ai-arxiv2)._

In [None]:
from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv2-chunks", split="train[:20000]")
dataset

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 20000
})

In [None]:
dataset[1]

{'doi': '2401.09350',
 'chunk-id': 1,
 'chunk': 'These neural networks and their training algorithms may be complex, and the scope of their impact broad and wide, but nonetheless they are simply functions in a high-dimensional space. A trained neural network takes a vector as input, crunches and transforms it in various ways, and produces another vector, often in some other space. An image may thereby be turned into a vector, a song into a sequence of vectors, and a social network as a structured collection of vectors. It seems as though much of human knowledge, or at least what is expressed as text, audio, image, and video, has a vector representation in one form or another.\nIt should be noted that representing data as vectors is not unique to neural networks and deep learning. In fact, long before learnt vector representations of pieces of dataâ\x80\x94what is commonly known as â\x80\x9cembeddingsâ\x80\x9dâ\x80\x94came along, data was often encoded as hand-crafted feature vectors. E

## Building the Knowledge Base

To build our knowledge base we need _two things_:

1. Embeddings. For this we will use `CohereEmbeddings` with Cohere's embedding models, which require an [API key](https://dashboard.cohere.com/api-keys).
2. A vector database, where we store our embeddings and query them. We use Pinecone, which also requires a [free API key](https://app.pinecone.io).

First we initialize our connection to Cohere and create the `embed` object that we will use to generate embeddings:

In [None]:
from langchain_community.embeddings import CohereEmbeddings

embed = CohereEmbeddings(model="embed-english-v3.0")

Then we initialize our connection to Pinecone:

In [None]:
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

Now we set up our index specification. This lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [None]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

Before creating an index, we need the dimensionality of our Cohere embedding model, which we can find easily by creating an embedding and checking the length:

In [None]:
vec = embed.embed_documents(["ello"])
len(vec[0])

Now we create the index using our embedding dimensionality, and a metric also compatible with the model (this can be either cosine or dotproduct). We also pass our spec to index initialization.

In [None]:
import time

index_name = "ragas-evaluation"

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=len(vec[0]),  # dimensionality of cohere v3
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

### Populating our Index

Now our knowledge base is ready to be populated with our data. We will use `embed` to embed our documents and then add them to our index.

We will also include metadata from each record.

In [None]:
from tqdm.auto import tqdm

# easier to work with dataset as pandas dataframe
data = dataset.to_pandas()

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [x["id"] for _, x in batch.iterrows()]
    # get text to embed
    texts = [x["chunk"] for _, x in batch.iterrows()]
    # embed text
    embeds = embed.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {"text": x["chunk"],
         "source": x["source"],
         "title": x["title"]} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))

Create a tool for our agent to use when searching for ArXiv papers:

In [None]:
from langchain.agents import tool

@tool
def arxiv_search(query: str) -> str:
    """Use this tool when answering questions about AI, machine learning, data
    science, or other technical questions that may be answered using arXiv
    papers.
    """
    # create query vector
    xq = embed.embed_query(query)
    # perform search
    out = index.query(vector=xq, top_k=5, include_metadata=True)
    # reformat results into string
    results_str = "\n---\n".join(
        [x["metadata"]["text"] for x in out["matches"]]
    )
    return results_str

tools = [arxiv_search]

When our agent uses this tool, it is executed like so:

In [None]:
print(
    arxiv_search.run(tool_input={"query": "can you tell me about llama 2?"})
)

## Defining XML Agent

The XML agent is built primarily to support Anthropic models. Anthropic models have been trained to use XML tags like `<input>{some input}</input>`, or, when using a tool:

```
<tool>{tool name}</tool>
<tool_input>{tool input}</tool_input>
```

This is quite different from the format produced by typical ReAct agents, which is not as well supported by Anthropic models.

To create an XML agent we need a `prompt`, `llm`, and list of `tools`. We can download a prebuilt prompt for conversational XML agents from LangChain hub.

In [None]:
from langchain import hub

prompt = hub.pull("hwchase17/xml-agent-convo")
prompt

We can see the XML format being used throughout the prompt when explaining to the LLM how it should use tools.

In [None]:
from langchain_community.chat_models import ChatAnthropic

# chat completion llm
llm = ChatAnthropic(
    anthropic_api_key=os.environ["ANTHROPIC_API_KEY"],
    model_name='claude-2.1',
    temperature=0.0
)

When the agent is run we will provide it with a single `input`, which is the input text from a user. However, within the agent logic an *agent_scratchpad* object will be passed too, which will include tool information. To feed this information into our LLM we need to transform it into the XML format described above. We define the `convert_intermediate_steps` function to handle that.

In [None]:
def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log

We must also parse the tools into a string containing `tool_name: tool_description` — we handle that with the `convert_tools` function.

In [None]:
def convert_tools(tools):
    return "\n".join([f"{tool.name}: {tool.description}" for tool in tools])
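To see what these two helpers produce, here is a small self-contained sketch. `FakeAction` and `FakeTool` are hypothetical stand-ins for the LangChain action and tool objects (they only carry the attributes the helpers read), and the observation text is made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-ins for LangChain's action and tool objects.
FakeAction = namedtuple("FakeAction", ["tool", "tool_input"])
FakeTool = namedtuple("FakeTool", ["name", "description"])


def convert_intermediate_steps(intermediate_steps):
    # same logic as the notebook helper: one XML fragment per (action, observation)
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log


def convert_tools(tools):
    # same logic as the notebook helper: one "name: description" line per tool
    return "\n".join([f"{tool.name}: {tool.description}" for tool in tools])


steps = [(FakeAction("arxiv_search", "llama 2"), "Llama 2 is a family of LLMs.")]
scratchpad = convert_intermediate_steps(steps)
tool_list = convert_tools([FakeTool("arxiv_search", "Search arXiv papers.")])

print(scratchpad)
print(tool_list)
```

The first string is what fills the `agent_scratchpad` slot of the prompt, and the second is what fills the `tools` slot.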

With everything ready we can go ahead and initialize our agent object using [**L**ang**C**hain **E**xpression **L**anguage (LCEL)](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/). We add instructions for when the LLM should _stop_ generating with `llm.bind(stop=[...])` and finally we parse the output from the agent using an `XMLAgentOutputParser` object.

In [None]:
from langchain.agents.output_parsers import XMLAgentOutputParser

agent = (
    {
        "input": lambda x: x["input"],
        # without "chat_history", tool usage has no context of prev interactions
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: convert_intermediate_steps(
            x["intermediate_steps"]
        ),
    }
    | prompt.partial(tools=convert_tools(tools))
    | llm.bind(stop=["</tool_input>", "</final_answer>"])
    | XMLAgentOutputParser()
)

With our `agent` object initialized we pass it to an `AgentExecutor` object alongside our original `tools` list:

In [None]:
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent, tools=tools, return_intermediate_steps=True
)

Now we can use the agent via the `invoke` method, passing the user's question as `"input"`:

In [None]:
agent_executor.invoke({
    "input": "can you tell me about llama 2?",
    "chat_history": ""
})

As before, we have no `"chat_history"`, so we pass an empty string to our `invoke` method:

In [None]:
user_msg = "hello mate"

out = agent_executor.invoke({
    "input": user_msg,
    "chat_history": ""
})

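Now let's put together another helper function called `chat` that wraps the call to `invoke`. It always passes an empty `"chat_history"`, so every question is handled on its own, which keeps questions independent of each other when we evaluate later.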

In [None]:
def chat(text: str):
    out = agent_executor.invoke({
        "input": text,
        "chat_history": ""
    })
    return out
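The `chat` helper above always sends an empty `"chat_history"`. If you want follow-up questions to see earlier turns, one option is to accumulate a transcript and pass it on each call. The sketch below is an illustration under stated assumptions, not part of the pipeline: `EchoExecutor` is a hypothetical stand-in for `AgentExecutor`, `make_chat` and `chat_with_history` are hypothetical names, and the plain-text `Human:` / `AI:` transcript format is an assumption about what the prompt accepts.

```python
class EchoExecutor:
    """Hypothetical stand-in for AgentExecutor: reports how much history it saw."""

    def invoke(self, inputs):
        turns = inputs["chat_history"].count("Human:")
        return {"output": f"reply to '{inputs['input']}' after {turns} earlier turn(s)"}


def make_chat(executor):
    history = []  # alternating "Human: ..." and "AI: ..." lines (assumed format)

    def chat_with_history(text: str):
        out = executor.invoke({
            "input": text,
            "chat_history": "\n".join(history),
        })
        # record this turn so the next call can see it
        history.append(f"Human: {text}")
        history.append(f"AI: {out['output']}")
        return out

    return chat_with_history


chat_with_history = make_chat(EchoExecutor())
first = chat_with_history("can you tell me about llama 2?")
second = chat_with_history("was any red teaming done?")
print(second["output"])
```

For evaluation runs it is usually simpler to keep the stateless helper, so that one question cannot influence the answer to the next.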

Now we simply chat with our agent through this helper. Each call is independent, since `chat` passes an empty `"chat_history"`.

In [None]:
print(chat("can you tell me about llama 2?")["output"])

We can ask follow-up questions that omit key information. When the conversation so far is supplied in `"chat_history"`, the LLM can use that context to adjust the search query; the `chat` helper above passes an empty history, so here the question is searched as asked.

_Note: if the `"chat_history"` parameter is missing from the `agent` definition, you will likely notice a lack of context in the search term, and in some cases this lack of good information can trigger a `ValueError` during output parsing._

In [None]:
out = chat("was any red teaming done?")
print(out["output"])

We get a reasonable answer here. It's worth noting that in previous iterations of this test using the original `ai-arxiv` dataset, a query such as "llama 2 red teaming" rarely (if ever) returned directly relevant results.

---

## Integrating RAGAS

To integrate RAGAS evaluation into this pipeline we need two things from it: the retrieved contexts and the generated output.

We already have the generated output; it is what we printed above. The retrieved contexts are also being logged, but we haven't yet seen how to programmatically extract them. Let's take a look at what is returned in `out`:

In [None]:
out

When initializing our `AgentExecutor` object we included `return_intermediate_steps=True`. This (unsurprisingly) returns the intermediate steps that the agent took to generate the final answer. Those steps include the response from our `arxiv_search` tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.

We extract the contexts themselves like so:

In [None]:
print(out["intermediate_steps"][0][1])
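The line above reads only the first tool call. As a hedged sketch, a small helper can collect the contexts from every intermediate step and split them back into individual chunks using the `"\n---\n"` separator that `arxiv_search` joins them with. `extract_contexts` and `fake_out` are hypothetical names, and the dictionary below only mimics the shape of the executor output.

```python
def extract_contexts(out, sep="\n---\n"):
    """Collect retrieved chunks from every (action, observation) step."""
    contexts = []
    for _action, observation in out.get("intermediate_steps", []):
        # each observation is the joined string returned by the search tool
        contexts.extend(chunk for chunk in observation.split(sep) if chunk)
    return contexts


# Made-up output with two tool calls, shaped like the executor result.
fake_out = {
    "output": "...",
    "intermediate_steps": [
        ("search-1", "chunk A\n---\nchunk B"),
        ("search-2", "chunk C"),
    ],
}

print(extract_contexts(fake_out))
```

An answer produced without any tool call simply yields an empty list.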

## Evaluation

To evaluate with RAGAS we need a dataset containing questions, ideal contexts, and the _ground truth_ answers to those questions.

In [None]:
ragas_data = load_dataset("aurelio-ai/ai-arxiv2-ragas-mixtral", split="train")
ragas_data

In [None]:
ragas_data[0]

We first iterate through the questions in this evaluation dataset and pass each one to our agent.

In [None]:
import pandas as pd
from tqdm.auto import tqdm

df = pd.DataFrame({
    "question": [],
    "contexts": [],
    "answer": [],
    "ground_truth": []
})

limit = 5

for i, row in tqdm(enumerate(ragas_data), total=limit):
    if i >= limit:
        break
    question = row["question"]
    ground_truths = row["ground_truth"]
    try:
        out = chat(question)
        answer = out["output"]
        if len(out["intermediate_steps"]) != 0:
            contexts = out["intermediate_steps"][0][1].split("\n---\n")
        else:
            # this is where no intermediate steps are used
            contexts = []
    except ValueError:
        answer = "ERROR"
        contexts = []
    df = pd.concat([df, pd.DataFrame({
        "question": question,
        "answer": answer,
        "contexts": [contexts],
        "ground_truth": ground_truths
    })], ignore_index=True)

In [None]:
df

In [None]:
from datasets import Dataset
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_relevancy,
    context_recall,
    answer_similarity,
    answer_correctness,
)

eval_data = Dataset.from_dict(df)
eval_data

In [None]:
from ragas import evaluate

result = evaluate(
    dataset=eval_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_relevancy,
        context_recall,
        answer_similarity,
        answer_correctness,
    ],
)
result = result.to_pandas()

### Retrieval Metrics

Retrieval is the first step in a RAG pipeline, so we will focus on metrics that assess retrieval first. For that we primarily want to focus on `context_recall` and `context_precision` but before diving into these metrics we must understand what it is that they will be measuring.

### Actual vs. Predicted

When evaluating the performance of retrieval systems we tend to compare the _actual_ (ground truth) to _predicted_ results. We define these as:

* **Actual condition** is the true label of every context in the dataset. These are _positive_ ($p$) if the context is relevant to our query or _negative_ ($n$) if the context is _ir_relevant to our query.

* **Predicted condition** is the _predicted_ label determined by our retrieval system. If a context is returned it is a predicted _positive_, i.e. $\hat{p}$. If a context is not returned it is a predicted _negative_, i.e. $\hat{n}$.

Given these conditions, we can say the following:

* $p\hat{p}$ is a **true positive**, meaning a relevant result has been returned.
* $n\hat{n}$ is a **true negative**, meaning an irrelevant result was not returned.
* $n\hat{p}$ is a **false positive**, meaning an irrelevant result has been returned.
* $p\hat{n}$ is a **false negative**, meaning a relevant result has _not_ been returned.
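The four cases above can be counted with plain set operations. This is a minimal sketch on made-up context IDs; `confusion_counts` is a hypothetical helper, not a RAGAS function.

```python
def confusion_counts(all_ids, relevant_ids, retrieved_ids):
    """Count the four actual-vs-predicted outcomes for one query."""
    all_ids = set(all_ids)
    relevant = set(relevant_ids)
    retrieved = set(retrieved_ids)
    return {
        "true_positive": len(relevant & retrieved),    # relevant and returned
        "false_positive": len(retrieved - relevant),   # irrelevant but returned
        "false_negative": len(relevant - retrieved),   # relevant but not returned
        "true_negative": len(all_ids - relevant - retrieved),  # irrelevant, not returned
    }


counts = confusion_counts(
    all_ids=range(10),
    relevant_ids=[0, 1, 2, 3, 4],
    retrieved_ids=[0, 1, 2, 7, 8],
)
print(counts)
```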

Let's see how these apply to our metrics in RAGAS.

#### Context Recall

Context recall (or just _recall_) is a measure of how many of the relevant records in a dataset have been retrieved. It is calculated as:

$$
Recall@K = \frac{p\hat{p}}{p\hat{p} + p\hat{n}} = \frac{Relevant \: contexts \: retrieved}{Total \: number \: of \: relevant \: contexts}
$$

RAGAS calculates _Recall@K_ for recall, where the _@K_ represents the number of contexts returned. As the @K value is increased the recall scores will improve (as the capture size of the retrieval step increases). At its extreme we could set @K equal to the size of the dataset to guarantee perfect recall, although this negates the point of RAG in the first place.

In our pipeline the _@K_ value is `5`, since `arxiv_search` retrieves `top_k=5` contexts per query.
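As a worked example of the recall definition, the sketch below computes Recall@K from labelled context IDs. It illustrates the classic ID-based formula on made-up data; `recall_at_k` is a hypothetical helper and is not how RAGAS computes `context_recall` internally.

```python
def recall_at_k(relevant_ids, ranked_ids, k):
    """Fraction of relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


relevant = ["a", "b", "c", "d", "e"]
ranked = ["a", "x", "b", "y", "c", "d", "z", "e"]  # retrieval order, best first

for k in (1, 5, 8):
    print(k, recall_at_k(relevant, ranked, k))
```

Recall can only stay level or rise as `k` grows, which is why a large enough `k` always reaches perfect recall.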

In [None]:
pd.set_option("display.max_colwidth", 700)
result[["question", "contexts", "answer", "context_recall"]]

Here we can see that all but the second set of results returned all relevant contexts. The second scores `0.6`, meaning that 3/5 (60%) of the relevant contexts were returned.

All other results scored `1.0` (100%), meaning all relevant contexts were retrieved.

Recall is a useful metric but is easily fooled by simply returning more records, i.e. increasing the _@K_ value. Because of that, it is typically paired with _precision_.

#### Context Precision

Context precision (or just _precision_) is another popular retrieval metric. We typically see both recall and precision paired together when evaluating retrieval systems.

As with recall, the actual metric here is called _Precision@K_, where @K represents the number of contexts returned. Unlike recall, precision compares the number of relevant results returned to the total number of results returned, whether they are relevant or not. That total is equal to our chosen _@K_ value.

$$
Precision@K = \frac{p\hat{p}}{p\hat{p} + n\hat{p}} = \frac{Relevant \: contexts \: retrieved}{Total \: number \: of \: contexts \: retrieved}
$$
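The same kind of worked example applies to precision. This sketch uses made-up context IDs; `precision_at_k` is a hypothetical helper illustrating the ID-based formula, not the RAGAS implementation of `context_precision`.

```python
def precision_at_k(relevant_ids, ranked_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


relevant = ["a", "b", "c", "d", "e"]
ranked = ["a", "x", "b", "y", "c", "d", "z", "e"]  # retrieval order, best first

for k in (1, 5, 8):
    print(k, precision_at_k(relevant, ranked, k))
```

With five relevant contexts and `k = 5` the denominators of recall and precision coincide, so the two scores are equal; at `k = 8` recall reaches 1.0 while precision drops, which is why the metrics are read together.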

In [None]:
pd.set_option("display.max_colwidth", 700)
result[["question", "contexts", "answer", "context_precision"]]

Our precision@K scores are equal to our recall scores (this can happen when there are _5_ relevant contexts for each query and we set _@K = 5_). This result means every query produced 100% precision, with the exception of our 60% precision result, where only 3/5 returned contexts were relevant.

## Generation Metrics

### Faithfulness

The _faithfulness_ metric measures (from _0_ to _1_) the factual consistency of an answer when compared to the retrieved context. A score of _1_ means all claims in the answer can be found in the context. A score of _0_ would indicate _no_ claims in the answer are found in the context.

We calculate the faithfulness like so:

$$
Faithfulness = \frac{Number \: of \: claims \: in \: answer \: also \: found \: in \: context}{Number \: of \: claims \: in \: answer}
$$
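The arithmetic behind this ratio is simple once each claim has a verdict. In the sketch below the verdicts are hand-written booleans for illustration; in practice they come from an LLM judging each claim against the context. `faithfulness_score` is a hypothetical helper, not a RAGAS function.

```python
def faithfulness_score(claim_supported):
    """Share of answer claims that are supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# One boolean per claim extracted from an answer (made-up verdicts).
mostly_grounded = [True, True, False, True]
ungrounded = [False, False]

print(faithfulness_score(mostly_grounded))
print(faithfulness_score(ungrounded))
```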

In [None]:
pd.set_option("display.max_colwidth", 1000)
result[["question", "contexts", "answer", "faithfulness"]]

When calculating faithfulness, RAGAS uses OpenAI LLMs to decide which claims are in the answer and whether they also exist in the context. Because of the generative nature of this approach, we won't always get accurate scores.

We can see that we get perfect scores for all but our fourth result, which scores `0.0`. However, when looking at this we can see some claims that seem related. Nonetheless, the fourth answer does seem to be less grounded in the truth of our context than other responses, indicating that there is justification behind this low score.

### Answer Relevancy

Answer relevancy is our final metric. It focuses on the generation component and is similar to our "context precision" metric in that it measures how much of the returned information is relevant to our original question.

We return a low answer relevancy score when:

* Answers are incomplete.

* Answers contain redundant information.

A high answer relevancy score indicates that an answer is concise and does not contain "fluff" (i.e. irrelevant information).

The score is calculated by asking an LLM to generate multiple questions for a generated answer and then calculating the cosine similarity between the original question and the generated questions. Naturally, if we have a concise answer that answers a very specific question, we should find that the generated question will have a high cosine similarity to the original question.
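The similarity step can be sketched with toy vectors. Here the 2-dimensional "embeddings" are made up by hand, and `answer_relevancy_score` is a hypothetical helper that averages cosine similarity between the original question and the questions generated from the answer; a real run would use an embedding model and LLM-generated questions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean similarity between the original and the generated questions."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


question_vec = [1.0, 0.0]
on_topic = [[1.0, 0.0], [2.0, 0.0]]   # generated questions pointing the same way
off_topic = [[0.0, 1.0], [0.0, 3.0]]  # generated questions on an unrelated axis

print(answer_relevancy_score(question_vec, on_topic))
print(answer_relevancy_score(question_vec, off_topic))
```

Vector length does not matter here, only direction, so an answer that leads back to the same question scores high regardless of how the embedding is scaled.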

In [None]:
pd.set_option("display.max_colwidth", 700)
result[["question", "answer", "answer_relevancy"]]

Again we can see poorer performance from our fourth answer, but the remainder (particularly answers with a score greater than `0.9`) perform well.

---