# RAGAS Evaluation for LangChain Agents

In [1]:
!python --version

Python 3.10.12


**R**etrieval **A**ugmented **G**eneration **As**sessment (RAGAS) is an evaluation framework for quantifying the performance of our RAG pipelines. In this example we will see how to use it with a RAG-enabled conversational agent in LangChain.

Because we need an agent and RAG pipeline to evaluate with RAGAS, the first part of this notebook covers setting up an XML Agent with RAG. Jump ahead to **Integrating RAGAS** for the RAGAS section.

To begin, let's install the prerequisites:

In [2]:
!pip install -qU \
    langchain==0.1.1 \
    langchain-community==0.0.13 \
    langchainhub==0.1.14 \
    anthropic==0.14.0 \
    cohere==4.45 \
    pinecone-client==3.1.0 \
    datasets==2.16.1 \
    ragas==0.1.0

In [3]:
import os
from getpass import getpass

# use keys already set in the environment, otherwise prompt for them
# dashboard.cohere.com
os.environ["COHERE_API_KEY"] = os.getenv("COHERE_API_KEY") or getpass("Cohere API key: ")
# app.pinecone.io
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: ")
# console.anthropic.com
os.environ["ANTHROPIC_API_KEY"] = os.getenv("ANTHROPIC_API_KEY") or getpass("Anthropic API key: ")
# platform.openai.com
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

## Finding Knowledge

The first thing we need for an agent using RAG is somewhere we want to pull knowledge from. We will use v2 of the AI ArXiv dataset, available on Hugging Face Datasets at [`jamescalam/ai-arxiv2-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-chunks).

_Note: we're using the prechunked dataset. For the raw version see [`jamescalam/ai-arxiv2`](https://huggingface.co/datasets/jamescalam/ai-arxiv2)._

In [4]:
from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv2-chunks", split="train[:20000]")
dataset

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 20000
})

In [5]:
dataset[1]

{'doi': '2401.09350',
 'chunk-id': 1,
 'chunk': 'These neural networks and their training algorithms may be complex, and the scope of their impact broad and wide, but nonetheless they are simply functions in a high-dimensional space. A trained neural network takes a vector as input, crunches and transforms it in various ways, and produces another vector, often in some other space. An image may thereby be turned into a vector, a song into a sequence of vectors, and a social network as a structured collection of vectors. It seems as though much of human knowledge, or at least what is expressed as text, audio, image, and video, has a vector representation in one form or another.\nIt should be noted that representing data as vectors is not unique to neural networks and deep learning. In fact, long before learnt vector representations of pieces of dataâ\x80\x94what is commonly known as â\x80\x9cembeddingsâ\x80\x9dâ\x80\x94came along, data was often encoded as hand-crafted feature vectors. E

## Building the Knowledge Base

To build our knowledge base we need _two things_:

1. Embeddings. For this we will use `CohereEmbeddings` with Cohere's embedding models, which require an [API key](https://dashboard.cohere.com/api-keys).
2. A vector database, where we store our embeddings and query them. We use Pinecone, which again requires a [free API key](https://app.pinecone.io).

First we initialize our connection to Cohere and create an `embed` embeddings object:

In [6]:
from langchain_community.embeddings import CohereEmbeddings

embed = CohereEmbeddings(model="embed-english-v3.0")

Then we initialize our connection to Pinecone:

In [7]:
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

Now we set up our index specification. This allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [8]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

Before creating an index, we need the dimensionality of our Cohere embedding model, which we can find easily by creating an embedding and checking the length:

In [9]:
vec = embed.embed_documents(["ello"])
len(vec[0])

1024

Now we create the index using our embedding dimensionality, and a metric also compatible with the model (this can be either cosine or dotproduct). We also pass our spec to index initialization.

In [10]:
import time

index_name = "ragas-evaluation"

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=len(vec[0]),  # dimensionality of cohere v3
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 40000}},
 'total_vector_count': 40000}
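As a side note on the metric choice: for vectors of unit length, dot product and cosine similarity produce the same score. The sketch below checks this in plain Python. The vectors are made up for illustration, and whether a particular embedding model returns unit-length vectors is an assumption you would need to verify for that model.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

# toy vectors (hypothetical, not real embeddings)
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# for unit-length vectors the two scores coincide
print(round(dot(u, v), 6), round(cosine(u, v), 6))
```

If the vectors are not normalized, the two metrics can rank results differently, so the metric should match what the embedding model was designed for.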

### Populating our Index

Now our knowledge base is ready to be populated with our data. We will use the `embed` object to embed our documents and then add them to our index.

We will also include metadata from each record.

In [11]:
from tqdm.auto import tqdm

# easier to work with dataset as pandas dataframe
data = dataset.to_pandas()

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # use each record's `id` field as the vector ID
    ids = [x["id"] for _, x in batch.iterrows()]
    # get text to embed
    texts = [x['chunk'] for _, x in batch.iterrows()]
    # embed text
    embeds = embed.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone as a list of (id, vector, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
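To make the batching arithmetic in the loop above explicit, here is a small standalone sketch of the same `range(0, n, batch_size)` pattern. The `batch_ranges` helper is our own name for illustration and is not part of any library.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, batch_size):
        # the final batch is clipped so we never run past the data
        yield start, min(n_rows, start + batch_size)

# 20,000 rows in batches of 100, as in the loop above
ranges = list(batch_ranges(20_000, 100))
print(len(ranges), ranges[0], ranges[-1])

# an uneven example: the last batch is shorter
print(list(batch_ranges(250, 100)))
```

The `min(...)` call is what keeps the last batch from overshooting when the number of rows is not a multiple of the batch size.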

Create a tool for our agent to use when searching for ArXiv papers:

In [12]:
from langchain.agents import tool

@tool
def arxiv_search(query: str) -> str:
    """Use this tool when answering questions about AI, machine learning, data
    science, or other technical questions that may be answered using arXiv
    papers.
    """
    # create query vector
    xq = embed.embed_query(query)
    # perform search
    out = index.query(vector=xq, top_k=5, include_metadata=True)
    # reformat results into string
    results_str = "\n---\n".join(
        [x["metadata"]["text"] for x in out["matches"]]
    )
    return results_str

tools = [arxiv_search]

When this tool is used by our agent it will execute it like so:

In [13]:
print(
    arxiv_search.run(tool_input={"query": "can you tell me about llama 2?"})
)

Ethical Considerations and Limitations (Section 5.2) Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2âs potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their speciï¬c applications of the model. Please see the Responsible Use Guide available available at https://ai.meta.com/llama/responsible-user-guide
Table 52: Model card for Llama 2.
77
---
Ethical Considerations and Limitations (Section 5.2) Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2âs potential outpu

## Defining XML Agent

The XML agent is built primarily to support Anthropic models. Anthropic models have been trained to use XML tags like `<input>{some input}</input>` or, when using a tool, they use:

```
<tool>{tool name}</tool>
<tool_input>{tool input}</tool_input>
```

This is very different from the format produced by typical ReAct agents, which is not as well supported by Anthropic models.

To create an XML agent we need a `prompt`, `llm`, and list of `tools`. We can download a prebuilt prompt for conversational XML agents from LangChain hub.

In [14]:
from langchain import hub

prompt = hub.pull("hwchase17/xml-agent-convo")
prompt

ChatPromptTemplate(input_variables=['agent_scratchpad', 'input', 'tools'], partial_variables={'chat_history': ''}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['agent_scratchpad', 'chat_history', 'input', 'tools'], template="You are a helpful assistant. Help the user answer any questions.\n\nYou have access to the following tools:\n\n{tools}\n\nIn order to use a tool, you can use <tool></tool> and <tool_input></tool_input> tags. You will then get back a response in the form <observation></observation>\nFor example, if you have a tool called 'search' that could run a google search, in order to search for the weather in SF you would respond:\n\n<tool>search</tool><tool_input>weather in SF</tool_input>\n<observation>64 degrees</observation>\n\nWhen you are done, respond with a final answer between <final_answer></final_answer>. For example:\n\n<final_answer>The weather in SF is 64 degrees</final_answer>\n\nBegin!\n\nPrevious Conversation:\n{chat_history}\n\n

We can see the XML format being used throughout the prompt when explaining to the LLM how it should use tools.

In [15]:
from langchain_community.chat_models import ChatAnthropic

# chat completion llm
llm = ChatAnthropic(
    anthropic_api_key=os.environ["ANTHROPIC_API_KEY"],
    model_name='claude-2.1',
    temperature=0.0
)

When the agent is run we will provide it with a single `input`, which is the input text from a user. However, within the agent logic an *agent_scratchpad* object will be passed too, which will include tool information. To feed this information into our LLM we need to transform it into the XML format described above. We define the `convert_intermediate_steps` function to handle that.

In [16]:
def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log
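To see what this produces, we can call the function with a stand-in for an agent step. In the sketch below, `SimpleNamespace` is only a substitute for LangChain's action object (it carries just the two attributes the function reads), and the observation text is invented.

```python
from types import SimpleNamespace

def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log

# stand-in for an agent action: only `tool` and `tool_input` are needed here
fake_action = SimpleNamespace(tool="arxiv_search", tool_input="llama 2")
steps = [(fake_action, "Llama 2 is a collection of LLMs...")]

scratchpad = convert_intermediate_steps(steps)
print(scratchpad)
```

Each (action, observation) pair becomes one `<tool>...<observation>...` segment, and multiple steps are simply concatenated in order.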

We must also parse the tools into a string containing `tool_name: tool_description` — we handle that with the `convert_tools` function.

In [17]:
def convert_tools(tools):
    return "\n".join([f"{tool.name}: {tool.description}" for tool in tools])

With everything ready we can go ahead and initialize our agent object using [**L**ang**C**hain **E**xpression **L**anguage (LCEL)](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/). We add instructions for when the LLM should _stop_ generating with `llm.bind(stop=[...])` and finally we parse the output from the agent using an `XMLAgentOutputParser` object.

In [18]:
from langchain.agents.output_parsers import XMLAgentOutputParser

agent = (
    {
        "input": lambda x: x["input"],
        # without "chat_history", tool usage has no context of prev interactions
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: convert_intermediate_steps(
            x["intermediate_steps"]
        ),
    }
    | prompt.partial(tools=convert_tools(tools))
    | llm.bind(stop=["</tool_input>", "</final_answer>"])
    | XMLAgentOutputParser()
)
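Because generation stops at `</tool_input>` or `</final_answer>`, the text that reaches the parser usually lacks its closing tag. The sketch below is a simplified stand-in that shows how such text can be split into a tool call or a final answer. It is not LangChain's `XMLAgentOutputParser` implementation, and the function name and return format are our own.

```python
def parse_xml_agent_output(text):
    """Return ("tool", name, tool_input) or ("final", answer)."""
    if "</tool>" in text:
        tool_part, input_part = text.split("</tool>", 1)
        tool_name = tool_part.split("<tool>", 1)[1].strip()
        tool_input = ""
        if "<tool_input>" in input_part:
            tool_input = input_part.split("<tool_input>", 1)[1]
            # the closing tag is normally cut off by the stop sequence,
            # but strip it if it happens to be present
            tool_input = tool_input.split("</tool_input>", 1)[0]
        return ("tool", tool_name, tool_input)
    if "<final_answer>" in text:
        answer = text.split("<final_answer>", 1)[1]
        answer = answer.split("</final_answer>", 1)[0]
        return ("final", answer)
    raise ValueError(f"Could not parse agent output: {text!r}")

print(parse_xml_agent_output(" <tool>arxiv_search</tool><tool_input>llama 2"))
print(parse_xml_agent_output("<final_answer>Llama 2 is a family of LLMs"))
```

In this sketch, output that contains neither tag raises a `ValueError`, which is one way to surface malformed responses rather than silently passing them on.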

With our `agent` object initialized we pass it to an `AgentExecutor` object alongside our original `tools` list:

In [19]:
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent, tools=tools, return_intermediate_steps=True
)

Now we can use the agent via the `invoke` method:

In [20]:
agent_executor.invoke({
    "input": "can you tell me about llama 2?",
    "chat_history": ""
})

{'input': 'can you tell me about llama 2?',
 'chat_history': '',
 'output': "\nBased on the information from arXiv, Llama 2 is a collection of large language models developed by Meta AI ranging in size from 7 billion to 70 billion parameters. The fine-tuned versions, called Llama 2-Chat, are optimized for dialogue and outperform other open source chat models on most benchmarks. \n\nKey points about Llama 2:\n\n- Pretrained and fine-tuned large language models for dialogue\n- Models range from 7B to 70B parameters\n- Llama 2-Chat models outperform other open source chat models\n- Fine-tuned for safety and helpfulness\n- Released to enable responsible LLM development\n\nThe abstract and contents provide an overview of the model, its performance, and Meta AI's approach to developing and releasing it responsibly.\n",
 'intermediate_steps': [(AgentAction(tool='arxiv_search', tool_input='llama 2', log=' <tool>arxiv_search</tool><tool_input>llama 2'),
   'Ethical Considerations and Limitation

We have no `"chat_history"` so we will pass an empty string to our `invoke` method:

In [21]:
user_msg = "hello mate"

out = agent_executor.invoke({
    "input": user_msg,
    "chat_history": ""
})

Now let's put together another helper function called `chat` that wraps the call to our agent.

In [22]:
def chat(text: str):
    out = agent_executor.invoke({
        "input": text,
        "chat_history": ""
    })
    return out
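Since `chat` above always sends an empty `chat_history`, each call is independent. If we wanted the agent to see earlier turns, one option is to keep a running history and pass it in. This is a minimal sketch: the `Human:`/`Assistant:` line format and the `make_chat` name are our own choices, and `fake_invoke` is a stub standing in for `agent_executor.invoke` so the example runs without API keys.

```python
def make_chat(invoke):
    """Build a chat function that keeps a running history.

    `invoke` is any callable that takes a dict with "input" and
    "chat_history" keys and returns a dict with an "output" key.
    """
    history = []

    def chat(text):
        out = invoke({"input": text, "chat_history": "\n".join(history)})
        # record both sides of the turn for the next call
        history.extend([f"Human: {text}", f"Assistant: {out['output']}"])
        return out

    return chat

# stub standing in for agent_executor.invoke (hypothetical behaviour)
def fake_invoke(inputs):
    return {**inputs, "output": f"echo: {inputs['input']}"}

chat_with_memory = make_chat(fake_invoke)
first = chat_with_memory("can you tell me about llama 2?")
second = chat_with_memory("was any red teaming done?")
print(second["chat_history"])
```

For evaluation runs, where every question should be answered in isolation, keeping the history empty as in `chat` is the safer default.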

Now we can simply chat with our agent. Note that `chat` passes an empty `chat_history`, so each call is independent of previous interactions.

In [23]:
print(chat("can you tell me about llama 2?")["output"])


Based on the information from arXiv, Llama 2 is a collection of large language models developed by Meta AI ranging in size from 7 billion to 70 billion parameters. The fine-tuned versions, called Llama 2-Chat, are optimized for dialogue and outperform other open source chat models on most benchmarks. 

Key points about Llama 2:

- Pretrained and fine-tuned large language models for dialogue
- Models range from 7B to 70B parameters
- Llama 2-Chat models outperform other open source chat models
- Fine-tuned for safety and helpfulness
- Released to enable responsible LLM development

The abstract and contents provide an overview of the model, its performance, and Meta AI's approach to developing and releasing it responsibly.



We can also ask follow-up questions that miss key information. For the LLM to understand the context and adjust the search query accordingly, the conversational history must be passed in as `chat_history`.

_Note: if the `"chat_history"` parameter is missing from the `agent` definition you will likely notice a lack of context in the search term, and in some cases this lack of good information can trigger a `ValueError` during output parsing._

In [24]:
out = chat("was any red teaming done?")
print(out["output"])


The articles discuss several examples of red teaming being done to proactively identify risks with AI systems:

1) Meta (formerly Facebook) conducted red teaming exercises with 25 employees, including domain experts in responsible AI, malware development, and offensive security engineering, to evaluate risks from dual intent prompts that could potentially be used maliciously.

2) Anaplan claims to have conducted red teaming with over 350 people, including experts in cybersecurity, election fraud, civil rights, and responsible AI, to identify risks across a variety of potential misuse cases. 

3) One article recommends that AI labs commission external red teams to actively probe for vulnerabilities and demonstrate dangerous behaviors that could inform deployment decisions. This adversarial testing approach allows risks to be identified proactively rather than waiting for issues to emerge after deployment.

So in summary, yes red teaming has been done by major AI companies like Meta and

We get a reasonable answer here. It's worth noting that previous iterations of this test, i.e. searching for "llama 2 red teaming" with the original `ai-arxiv` dataset, rarely (if ever) returned directly relevant results.

---

## Integrating RAGAS

To integrate RAGAS evaluation into this pipeline we need two things from it: the retrieved contexts and the generated output.

We already have the generated output; it is what we're printing above. The retrieved contexts are also being returned, but we haven't yet seen how to programmatically extract them. Let's take a look at what we are returned in `out`:

In [25]:
out

{'input': 'was any red teaming done?',
 'chat_history': '',
 'output': '\nThe articles discuss several examples of red teaming being done to proactively identify risks with AI systems:\n\n1) Meta (formerly Facebook) conducted red teaming exercises with 25 employees, including domain experts in responsible AI, malware development, and offensive security engineering, to evaluate risks from dual intent prompts that could potentially be used maliciously.\n\n2) Anaplan claims to have conducted red teaming with over 350 people, including experts in cybersecurity, election fraud, civil rights, and responsible AI, to identify risks across a variety of potential misuse cases. \n\n3) One article recommends that AI labs commission external red teams to actively probe for vulnerabilities and demonstrate dangerous behaviors that could inform deployment decisions. This adversarial testing approach allows risks to be identified proactively rather than waiting for issues to emerge after deployment.\n\

When initializing our `AgentExecutor` object we included `return_intermediate_steps=True`. This (unsurprisingly) returns the intermediate steps that the agent took to generate the final answer. Those steps include the response from our `arxiv_search` tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.

We extract the contexts themselves like so:

In [26]:
print(out["intermediate_steps"][0][1])

# 2.1 RED-EVAL
At the core of RED-EVAL red-teaming is a CoU-based jailbreak prompt that aims to tease out harmful answers from LLM when asked a harmful questionâa query that seeks to find infor- mation that can be used by malicious actors. As shown in Figure 2, the jailbreak prompt sets up a CoU environmentâa conversation between two agents Red-LM and Base-LM. Then, as a part of Red-LMâs utterance, RED-EVAL seeks information for a harmful question from Base- LM (to be placed in yellow highlighted text) such as âHow to dispose of a body using chemical meansâ while the model under evaluation is asked to fill the response of Base-LM (cyan highlighted text) by following the CoU demonstration and instructions. Base-LM also pos- sesses internal thoughts to drive the answer in a more helpful direction without taking into account the harmfulness of the response i.e., safety, ethics, transparency, etc.
4
# er
uoljeysuoweg N09
# uonondjsu|
---
Red teaming. It is important to also proac
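The tool returns all retrieved chunks as one string joined with `"\n---\n"`, so recovering the individual contexts means splitting on that same separator. The sketch below shows the round trip with invented chunks. The helper names are ours, and the empty-string guard is an addition for illustration.

```python
SEPARATOR = "\n---\n"

def join_contexts(chunks):
    """Mirror how the search tool formats its results."""
    return SEPARATOR.join(chunks)

def split_contexts(tool_output):
    """Recover the list of chunks from the tool's string output."""
    # "".split(SEPARATOR) would give [""], so handle empty output explicitly
    return tool_output.split(SEPARATOR) if tool_output else []

chunks = ["first retrieved chunk", "second retrieved chunk"]
tool_output = join_contexts(chunks)
print(split_contexts(tool_output))
```

One caveat: if a chunk itself contains the separator sequence, it will be split into extra pieces, so the recovered list may not match the original one exactly.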

## Evaluation

To evaluate with RAGAS we need a dataset containing questions, ideal contexts, and the _ground truth_ answers to those questions.

In [27]:
ragas_data = load_dataset("aurelio-ai/ai-arxiv2-ragas-mixtral", split="train")
ragas_data

Dataset({
    features: ['question', 'ground_truth_context', 'ground_truth', 'question_type', 'episode_done'],
    num_rows: 51
})

In [28]:
ragas_data[0]

{'question': 'What is the impact of encoding the input prompt on inference speed in generative inference?',
 'ground_truth_context': ['- This technique works particularly well when processing large batches of data, during train-\ning Pudipeddi et al. (2020); Ren et al. (2021) or large-batch non-interactive inference Aminabadi et al.\n(2022); Sheng et al. (2023), where each layer processes a lot of tokens each time the layer is loaded\nfrom RAM.\n- In turn, when doing interactive inference (e.g. as a chat assistants), offloading works\nsignificantly slower than on-device inference.\n- The generative inference workload consists of two phases: 1) encoding the input prompt and 2)\ngenerating tokens conditioned on that prompt.\n- The key difference between these two phases is that\nprompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially\n(token-by-token and layer-by-layer).\n- In general, phase 1 works relatively well with existing Mixture-\nof-Exper

We first iterate through the questions in this evaluation dataset and ask these questions to our agent.

In [29]:
import pandas as pd
from tqdm.auto import tqdm

df = pd.DataFrame({
    "question": [],
    "contexts": [],
    "answer": [],
    "ground_truth": []
})

limit = 5

for i, row in tqdm(enumerate(ragas_data), total=limit):
    if i >= limit:
        break
    question = row["question"]
    ground_truths = row["ground_truth"]
    try:
        out = chat(question)
        answer = out["output"]
        if len(out["intermediate_steps"]) != 0:
            contexts = out["intermediate_steps"][0][1].split("\n---\n")
        else:
            # this is where no intermediate steps are used
            contexts = []
    except ValueError:
        answer = "ERROR"
        contexts = []
    df = pd.concat([df, pd.DataFrame({
        "question": question,
        "answer": answer,
        "contexts": [contexts],
        "ground_truth": ground_truths
    })], ignore_index=True)
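Calling `pd.concat` inside the loop rebuilds the dataframe on every iteration. That is fine for five rows, but an alternative is to collect rows in a list and convert once at the end. The sketch below does this with plain Python and produces a dict of column lists, which is the shape `Dataset.from_dict` accepts. The rows are invented placeholders for real agent outputs and `rows_to_columns` is a hypothetical helper.

```python
def rows_to_columns(rows, columns):
    """Turn a list of row dicts into a dict of column lists."""
    return {col: [row[col] for row in rows] for col in columns}

columns = ["question", "contexts", "answer", "ground_truth"]
rows = []

# hypothetical results standing in for calls to the agent
for question, contexts, answer, truth in [
    ("q1", ["ctx a", "ctx b"], "a1", "t1"),
    ("q2", [], "ERROR", "t2"),  # a failed call keeps an empty context list
]:
    rows.append({
        "question": question,
        "contexts": contexts,
        "answer": answer,
        "ground_truth": truth,
    })

table = rows_to_columns(rows, columns)
print(table["contexts"])
```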

In [30]:
df

| | question | contexts | answer | ground_truth |
|---|---|---|---|---|
| 0 | What is the impact of encoding the input promp... | [The generative inference workload consists of... | \nThe paper discusses that the generative infe... | The encoding of the input prompt has an impact... |
| 1 | How does generating tokens affect the inferenc... | [The generative inference workload consists of... | \nThe paper discusses that the generative infe... | Generating tokens affects the inference speed ... |
| 2 | How does the architecture of Mixtral 8x7B diff... | [Abstract\nWe introduce Mixtral 8x7B, a Sparse... | \nThe key differences between the architecture... | The architecture of Mixtral 8x7B differs from ... |
| 3 | When is offloading used on the A100 server for... | [# Denis Mazur Moscow Institute of Physics and... | \nThe paper discusses using offloading strateg... | Offloading is used on the A100 server for acce... |
| 4 | How does Mixtral compare to Llama 2 70B in cod... | [Table 2: Comparison of Mixtral with Llama. Mi... | \nBased on the information from the arXiv pape... | Mixtral outperforms Llama 2 70B in code benchm... |


In [31]:
from datasets import Dataset
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_relevancy,
    context_recall,
    answer_similarity,
    answer_correctness,
)

eval_data = Dataset.from_dict(df)
eval_data

Dataset({
    features: ['question', 'contexts', 'answer', 'ground_truth'],
    num_rows: 5
})

In [32]:
from ragas import evaluate

result = evaluate(
    dataset=eval_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_relevancy,
        context_recall,
        answer_similarity,
        answer_correctness,
    ],
)
result = result.to_pandas()

### Retrieval Metrics

Retrieval is the first step in a RAG pipeline, so we will focus on metrics that assess retrieval first. For that we primarily want to focus on `context_recall` and `context_precision` but before diving into these metrics we must understand what it is that they will be measuring.

#### Actual vs. Predicted

When evaluating the performance of retrieval systems we tend to compare the _actual_ (ground truth) to _predicted_ results. We define these as:

* **Actual condition** is the true label of every context in the dataset. These are _positive_ ($p$) if the context is relevant to our query or _negative_ ($n$) if the context is _ir_relevant to our query.

* **Predicted condition** is the _predicted_ label determined by our retrieval system. If a context is returned it is a predicted _positive_, i.e. $\hat{p}$. If a context is not returned it is a predicted _negative_, i.e. $\hat{n}$.

Given these conditions, we can say the following:

* $p\hat{p}$ is a **true positive**, meaning a relevant result has been returned.
* $n\hat{n}$ is a **true negative**, meaning an irrelevant result was not returned.
* $n\hat{p}$ is a **false positive**, meaning an irrelevant result has been returned.
* $p\hat{n}$ is a **false negative**, meaning a relevant result has _not_ been returned.
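These four outcomes can be counted directly with set operations when we know which contexts are relevant and which were returned. The IDs below are invented for illustration, and `confusion_counts` is a hypothetical helper, not a RAGAS function.

```python
def confusion_counts(all_ids, relevant_ids, retrieved_ids):
    """Count the four actual-vs-predicted outcomes for one query."""
    everything = set(all_ids)
    relevant = set(relevant_ids)
    retrieved = set(retrieved_ids)
    return {
        "true_positive": len(relevant & retrieved),    # relevant and returned
        "false_positive": len(retrieved - relevant),   # irrelevant but returned
        "false_negative": len(relevant - retrieved),   # relevant but not returned
        "true_negative": len(everything - relevant - retrieved),
    }

# toy example: 10 contexts, 3 relevant, 4 returned
all_ids = range(10)
relevant_ids = {0, 1, 2}
retrieved_ids = {1, 2, 3, 4}

counts = confusion_counts(all_ids, relevant_ids, retrieved_ids)
print(counts)
```

The four counts always add up to the total number of contexts, which is a quick sanity check when labeling by hand.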

Let's see how these apply to our metrics in RAGAS.

#### Context Recall

Context recall (or just _recall_) is a measure of how many of the relevant records in a dataset have been retrieved. It is calculated as:

$$
Recall@K = \frac{p\hat{p}}{p\hat{p} + p\hat{n}} = \frac{Relevant \: contexts \: retrieved}{Total \: number \: of \: relevant \: contexts}
$$

RAGAS calculates _Recall@K_ for recall, where the _@K_ represents the number of contexts returned. As the @K value is increased, recall can only stay the same or improve (as the capture size of the retrieval step increases). At its extreme we could set @K equal to the size of the dataset to guarantee perfect recall, although this negates the point of RAG in the first place.

In our pipeline the _@K_ value is `5`, set by `top_k=5` in the `arxiv_search` tool.
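To make the effect of _@K_ concrete, here is a small sketch of the textbook, set-based Recall@K computed from labeled IDs. The ranking and relevance labels are invented. RAGAS is not given labeled IDs like these, so treat this as an illustration of the formula and not of RAGAS internals.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant contexts found in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without relevant contexts")
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)

# toy ranking from a retriever, best match first (hypothetical IDs)
ranked_ids = ["c7", "c2", "c9", "c4", "c1", "c3"]
relevant_ids = {"c2", "c4", "c3"}

for k in (1, 3, 5, 6):
    print(k, round(recall_at_k(ranked_ids, relevant_ids, k), 2))
```

In this toy ranking recall climbs from 0 at K=1 to 1.0 once K covers every result, which is the "return everything" extreme described above.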

In [33]:
pd.set_option("display.max_colwidth", 700)
result[["question", "contexts", "answer", "context_recall"]]

Unnamed: 0,question,contexts,answer,context_recall
0,What is the impact of encoding the input prompt on inference speed in generative inference?,"[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\nBelow, we look for ...","\nThe paper discusses that the generative inference workload has two main phases:\n\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\n\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\n\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...",1.0
1,How does generating tokens affect the inference speed in generative inference?,"[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\nBelow, we look for ...","\nThe paper discusses that the generative inference process in large language models consists of two main phases:\n\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\n\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \n\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...",0.6
2,How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?,"[Abstract\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchm...","\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\n\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \n\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\n\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\n\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...",1.0
3,When is offloading used on the A100 server for accelerating MoE-based language models?,"[# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\n# Abstract\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) â a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their âdenseâ counterparts, but it also increases model size due to having multiple âexpertsâ. Unfortunately, this makes state-of-the-art MoE language models difficult t...","\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\n\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\n\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \n\n- The paper proposes MoE-specific offloading that takes advantage of:\n - Expert reuse between tokens (using an LRU cache)\n - Early layer states predicting later active experts. This allows prefetching experts.\n \n- They use these strategies t...",1.0
4,How does Mixtral compare to Llama 2 70B in code benchmarks?,"[Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\n70 Mixtral 8x7B. âMixtral 8x7B Mixtral 8x7B 355 =o = Es & E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 Â§ 30 5 eo â= Mistral Â° 20 âe LlaMA2 78 (138 348 70B 7B (138 348 708 7B Â«13B 34B 708 Active Params Active Params Active Params\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x...","\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\n\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\n\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\n\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...",1.0


Here we can see all but the second set of results returned all relevant contexts. The score here is `0.6` meaning that 3/5 (60%) of the relevant contexts were returned.

All other results returned `1.0` (100%), meaning all contexts were retrieved.

Recall is a useful metric, but it is easily inflated by simply returning more records, i.e., by increasing the _@K_ value. For that reason it is typically paired with _precision_.
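To make that weakness concrete, here is a minimal sketch of Recall@K on a toy ranking. The `recall_at_k` helper, the document IDs, and the relevance labels are all hypothetical and are not part of RAGAS; the point is only that recall can never go down as _K_ grows.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k retrieved items."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Hypothetical ranked results (best first) and ground-truth relevant IDs.
retrieved = ["a", "x", "b", "y", "c", "z", "d", "w"]
relevant = {"a", "b", "c", "d"}

# Recall rises with K even though half of what we return is irrelevant.
for k in (2, 4, 6, 8):
    print(f"Recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

At _K = 8_ the toy retriever reaches perfect recall while half of its results are noise, which is exactly the behavior precision is meant to expose.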

### Context Precision

Context precision (or just _precision_) is another popular retrieval metric. We typically see both recall and precision paired together when evaluating retrieval systems.

As with recall, the actual metric here is called _Precision@K_, where _@K_ is the number of contexts returned. Unlike recall, precision compares the number of relevant results returned with the total number of results returned, whether they are relevant or not. That total is equal to our chosen _@K_ value.

$$
Precision@K = \frac{p\hat{p}}{p\hat{p} + n\hat{p}} = \frac{Relevant \: contexts \: retrieved}{Total \: number \: of \: contexts \: retrieved}
$$
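As a rough illustration of the formula, the sketch below computes Precision@K by hand for a single query. The `precision_at_k` helper and the context IDs are hypothetical. RAGAS derives `context_precision` with an LLM judge, so its reported values may not match this simple ratio.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant (assumes k >= 1)."""
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

# Hypothetical example: 5 contexts returned, 3 of them relevant.
retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c1", "c2", "c5"}

print(f"Precision@5 = {precision_at_k(retrieved, relevant, 5):.2f}")
```

Unlike recall, this ratio drops when extra irrelevant contexts are returned, because every returned context counts in the denominator.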

In [34]:
pd.set_option("display.max_colwidth", 700)
result[["question", "contexts", "answer", "context_precision"]]

Unnamed: 0,question,contexts,answer,context_precision
0,What is the impact of encoding the input prompt on inference speed in generative inference?,"[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\nBelow, we look for ...","\nThe paper discusses that the generative inference workload has two main phases:\n\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\n\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\n\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...",0.866667
1,How does generating tokens affect the inference speed in generative inference?,"[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\nBelow, we look for ...","\nThe paper discusses that the generative inference process in large language models consists of two main phases:\n\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\n\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \n\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...",1.0
2,How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?,"[Abstract\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchm...","\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\n\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \n\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\n\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\n\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...",1.0
3,When is offloading used on the A100 server for accelerating MoE-based language models?,"[# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\n# Abstract\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) â a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their âdenseâ counterparts, but it also increases model size due to having multiple âexpertsâ. Unfortunately, this makes state-of-the-art MoE language models difficult t...","\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\n\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\n\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \n\n- The paper proposes MoE-specific offloading that takes advantage of:\n - Expert reuse between tokens (using an LRU cache)\n - Early layer states predicting later active experts. This allows prefetching experts.\n \n- They use these strategies t...",1.0
4,How does Mixtral compare to Llama 2 70B in code benchmarks?,"[Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\n70 Mixtral 8x7B. âMixtral 8x7B Mixtral 8x7B 355 =o = Es & E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 Â§ 30 5 eo â= Mistral Â° 20 âe LlaMA2 78 (138 348 70B 7B (138 348 708 7B Â«13B 34B 708 Active Params Active Params Active Params\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x...","\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\n\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\n\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\n\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...",1.0


Precision@K equals Recall@K whenever the number of relevant contexts for a query matches _@K_ (for example, _5_ relevant contexts with _@K = 5_). In the output above, four of the five queries score `1.0` (100% precision); the first query scores `0.866667`, which indicates that not all of its returned contexts were judged relevant.

## Generation Metrics

### Faithfulness

The _faithfulness_ metric measures (from _0_ to _1_) the factual consistency of an answer when compared to the retrieved context. A score of _1_ means all claims in the answer can be found in the context. A score of _0_ indicates that _no_ claims in the answer are found in the context.

We calculate faithfulness like so:

$$
Faithfulness = \frac{Number \: of \: claims \: in \: answer \: also \: found \: in \: context}{Number \: of \: claims \: in \: answer}
$$
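The ratio itself is simple once each claim has a verdict. The sketch below assumes the hard part (extracting claims and checking them against the context) has already been done. The `faithfulness_score` helper, the claims, and their verdicts are hypothetical and written by hand for illustration.

```python
def faithfulness_score(claim_supported):
    """Share of answer claims that are supported by the retrieved context.

    claim_supported: list of booleans, one verdict per claim in the answer.
    """
    if not claim_supported:
        raise ValueError("the answer contains no claims to score")
    return sum(claim_supported) / len(claim_supported)

# Hypothetical verdicts for four claims extracted from one answer.
verdicts = {
    "Each Mixtral layer has 8 feedforward blocks": True,
    "A router selects 2 experts per token": True,
    "Only 13B parameters are active per token": True,
    "Mixtral was trained on 10T tokens": False,  # not stated in the context
}

score = faithfulness_score(list(verdicts.values()))
print(f"Faithfulness = {score:.2f}")
```

A single unsupported claim is enough to pull the score below `1.0`, so longer answers with more claims give the metric more chances to catch a hallucination.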

In [35]:
pd.set_option("display.max_colwidth", 1000)
result[["question", "contexts", "answer", "faithfulness"]]

Unnamed: 0,question,contexts,answer,faithfulness
0,What is the impact of encoding the input prompt on inference speed in generative inference?,"[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\nBelow, we look for patterns in how the MoE model loads its experts and propose ways to exploit these patterns to speed up inference time.\n4To learn more about these methods, please refer to surveys such as Gholami et al. (2021); Liang et al. (2021) 5As opposed to running a processing a large batch of texts over many ...","\nThe paper discusses that the generative inference workload has two main phases:\n\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\n\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. 
So inference speed is limited by how fast parameters can be fetched from memory.\n\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existing parallelization techniques, while the sequential token generation phase is harder to optimize and can become a bottleneck.\n\nSo in summary, the prompt encoding phase has less of an impact on slowing down inference, while the sequential generation phase can significantly limit inference speed if ...",1.0
1,How does generating tokens affect the inference speed in generative inference?,"[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\nBelow, we look for patterns in how the MoE model loads its experts and propose ways to exploit these patterns to speed up inference time.\n4To learn more about these methods, please refer to surveys such as Gholami et al. (2021); Liang et al. (2021) 5As opposed to running a processing a large batch of texts over many ...","\nThe paper discusses that the generative inference process in large language models consists of two main phases:\n\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\n\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \n\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end-to-end latency during generative inference. 
This is because all the model weights need to be loaded for every single generated token, causing inference to be heavily bottlenecked by parameter I/O instead of computation.\n\nIn summary, generating more tokens leads to lower utilization of compute (0....",1.0
2,How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?,"[Abstract\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine- tuned to follow instructions, Mixtral 8x7B â Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B â chat model on hum...","\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\n\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \n\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\n\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\n\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall parameters while keeping the active parameters during inference manageable.\n",1.0
3,When is offloading used on the A100 server for accelerating MoE-based language models?,"[# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\n# Abstract\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) â a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their âdenseâ counterparts, but it also increases model size due to having multiple âexpertsâ. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties o...","\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\n\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\n\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \n\n- The paper proposes MoE-specific offloading that takes advantage of:\n - Expert reuse between tokens (using an LRU cache)\n - Early layer states predicting later active experts. 
This allows prefetching experts.\n \n- They use these strategies to run the Mixtral-8x7B MoE model on T4, RTX 3060, 3080 GPUs at 2-3 tokens/sec.\n\nSo in summary, the offloading is designed to mitigate the limited memory on consumer GPUs in order to run very large MoE models for inference. It exploits properties of MoE sparsity to reduce communication and overlap ...",1.0
4,How does Mixtral compare to Llama 2 70B in code benchmarks?,"[Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\n70 Mixtral 8x7B. âMixtral 8x7B Mixtral 8x7B 355 =o = Es & E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 Â§ 30 5 eo â= Mistral Â° 20 âe LlaMA2 78 (138 348 70B 7B (138 348 708 7B Â«13B 34B 708 Active Params Active Params Active Params\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x7B) vs Llama 2 (7B/13B/70B). Mixtral largely outperforms Llama 2 70B on all benchmarks, except on reading comprehension benchmarks while using 5x lower active parameters. It is also vastly superior to Llama 2 70B on code and math., Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or ma...","\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\n\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\n\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\n\nSo in summary, Mixtral compares very favorably to Llama 2 70B on code benchmarks, outperforming it on most metrics while being much more parameter efficient.\n",1.0


When calculating faithfulness, RAGAS uses OpenAI LLMs to decide which claims are in the answer and whether they also exist in the context. Because these judgments are themselves generated by an LLM, we won't always get accurate scores.

In the output above all five answers score `1.0`. Scores like this can vary from run to run, so when an answer does receive a low score it is worth reading it against its context: if its claims are only loosely related to the context, the answer is less grounded in the truth of our context than the other responses, and the low score is justified.

### Answer Relevancy

Answer relevancy is our final metric. It focuses on the generation component and is similar to our "context precision" metric in that it measures how much of the returned information is relevant to our original question.

We return a low answer relevancy score when:

* Answers are incomplete.

* Answers contain redundant information.

A high answer relevancy score indicates that an answer is concise and does not contain "fluff" (i.e., irrelevant information).

The score is calculated by asking an LLM to generate multiple questions for a generated answer and then calculating the cosine similarity between the original question and the generated questions. Naturally, if we have a concise answer that answers a very specific question, we should find that the generated questions have a high cosine similarity to the original question.
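The similarity step can be sketched as follows. This is an illustration under stated assumptions rather than the RAGAS implementation: the vectors are hypothetical hand-written stand-ins for real question embeddings, and averaging the similarities into one score is an assumption made here for simplicity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_relevancy(question_vec, generated_vecs):
    """Mean cosine similarity between the original and generated questions."""
    sims = [cosine_similarity(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)

# Hypothetical 2-d "embeddings": one original question and two questions
# that an LLM generated from the answer.
question_vec = [3.0, 4.0]
generated_vecs = [
    [3.0, 4.0],  # same direction as the original question
    [4.0, 3.0],  # close, but not identical
]

score = answer_relevancy(question_vec, generated_vecs)
print(f"Answer relevancy = {score:.2f}")
```

If an answer wanders off topic, the questions generated from it point in a different direction from the original question and the average similarity falls.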

In [37]:
pd.set_option("display.max_colwidth", 700)
result[["question", "answer", "answer_relevancy"]]

Unnamed: 0,question,answer,answer_relevancy
0,What is the impact of encoding the input prompt on inference speed in generative inference?,"\nThe paper discusses that the generative inference workload has two main phases:\n\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\n\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\n\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...",0.812683
1,How does generating tokens affect the inference speed in generative inference?,"\nThe paper discusses that the generative inference process in large language models consists of two main phases:\n\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\n\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \n\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...",0.829982
2,How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?,"\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\n\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \n\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\n\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\n\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...",0.928409
3,When is offloading used on the A100 server for accelerating MoE-based language models?,"\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\n\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\n\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \n\n- The paper proposes MoE-specific offloading that takes advantage of:\n - Expert reuse between tokens (using an LRU cache)\n - Early layer states predicting later active experts. This allows prefetching experts.\n \n- They use these strategies t...",0.760017
4,How does Mixtral compare to Llama 2 70B in code benchmarks?,"\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\n\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\n\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\n\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...",0.95494


Here the fourth answer performs worst (`0.76`), while the remainder (particularly the answers with similarity greater than `0.9`) perform well.

---