# RAG Evaluation
First, we install the required model dependencies.

In [None]:
!pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets langchain-community langchain-openai langchain-core
!pip install ragatouille==0.0.9



In [None]:
# %reload_ext autoreload
# %autoreload 2

In [None]:
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple, Any
import json
import datasets

pd.set_option("display.max_colwidth", None)

In [None]:
from huggingface_hub import notebook_login

notebook_login()


### Load your knowledge base

In [None]:
ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")


# 1. Build a synthetic dataset for evaluation
We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base, and ask an LLM to generate questions based on these documents.

Then we set up other LLM agents to act as quality filters for the generated QA couples: each of them filters for one specific flaw.

### 1.1. Prepare source documents

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document as LangchainDocument

langchain_docs = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(ds)
]


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])

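To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap on a toy string. It is a simplified stand-in, not the algorithm that `RecursiveCharacterTextSplitter` uses (which also tries the listed separators in order), and the helper name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(toy_text, chunk_size=10, chunk_overlap=2)
print(toy_chunks)
```

Each chunk starts with the last two characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.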

### 1.2. Set up agents for question generation

We use [Mistral-7B-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) ("mistralai/Mistral-7B-Instruct-v0.2") for QA couple generation because it performs well on leaderboards such as [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).

In [None]:
from huggingface_hub import InferenceClient
from google.colab import userdata

hf_token = userdata.get("key_hf")

repo_model = "mistralai/Mistral-7B-Instruct-v0.2"


llm_client = InferenceClient(
    model = repo_model,
    token = hf_token,
    timeout = 120
)

def call_llm(inference_client: InferenceClient, prompt: str):
    response = inference_client.chat.completions.create(
        messages=
          [{
              "role": "user",
              "content": prompt
          },],
        max_tokens=1000,
    )
    return response.choices[0].message.content

call_llm(llm_client, "This is a test context")

" I see. In programming, a test context is an instance of a test runner or testing framework that is used to execute tests. It provides access to various resources and services that are needed for the tests to run and report their results.\n\nFor example, a test context in a unit testing framework might give you access to a mocking library for creating test doubles, a database connection for testing database interactions, or a logging service for reporting test results.\n\nIf you have a specific testing framework or testing scenario in mind, I'd be happy to help you with any questions you have about creating and using a test context. Just let me know what you need!"

In [None]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

Now let's generate our QA couples.
For this example, we generate only 10 QA couples and will load the rest from the Hub.

But for your specific knowledge base, given that you want at least ~100 test samples, and accounting for the fact that the critique agents will later filter out around half of them, you should generate many more: upwards of 200 samples.
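The sizing rule above is simple arithmetic, sketched here under the assumption that the filters keep a roughly constant fraction of the generated couples. The helper `generations_needed` and its parameter names are hypothetical.

```python
import math


def generations_needed(target_samples: int, expected_keep_rate: float) -> int:
    """Number of QA couples to generate so that about target_samples survive filtering."""
    if not 0 < expected_keep_rate <= 1:
        raise ValueError("expected_keep_rate must be in (0, 1]")
    return math.ceil(target_samples / expected_keep_rate)


# ~100 usable samples, with about half of the generations filtered out
n_needed = generations_needed(target_samples=100, expected_keep_rate=0.5)
print(n_needed)
```

The keep rate is only an estimate until the critique agents have run on your own data, so it is worth generating some margin on top of this number.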

In [None]:
import random

N_GENERATIONS = 10  # We intentionally generate only 10 QA couples here for cost and time considerations

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for sampled_context in tqdm(random.sample(docs_processed, N_GENERATIONS)):
    # Generate QA couple
    output_QA_couple = call_llm(
        llm_client, QA_generation_prompt.format(context=sampled_context.page_content)
    )
    try:
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0]
        answer = output_QA_couple.split("Answer: ")[-1]
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata["source"],
            }
        )
    except Exception:
        # Skip generations that fail the length check or cannot be parsed
        continue

Generating 10 QA couples...


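The parsing above relies on `str.split`, which keeps trailing newlines in the question and silently returns the whole output when a marker is missing. As a possible alternative, here is a minimal sketch that uses a regular expression, strips whitespace, and returns `None` when the expected markers are absent. The helper `parse_qa_couple` and the sample string are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Matches the "Factoid question: ... Answer: ..." layout requested in the prompt
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa_couple(raw_output: str, max_answer_chars: int = 300) -> Optional[Tuple[str, str]]:
    """Return (question, answer), or None if the output is malformed or the answer is too long."""
    match = QA_PATTERN.search(raw_output)
    if match is None:
        return None
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    if not question or len(answer) >= max_answer_chars:
        return None
    return question, answer


# Made-up model output that follows the requested format
sample_output = "Factoid question: What license does safetensors use?\nAnswer: Apache-2.0\n"
parsed = parse_qa_couple(sample_output)
print(parsed)
```

Returning `None` instead of raising makes the caller's intent explicit: a generation is skipped because it is malformed, not because of an unrelated error.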

In [None]:
print(len(outputs))
display(pd.DataFrame(outputs))

10


Unnamed: 0,context,question,answer,source_doc
0,"- Endianness: Little-endian. This can be modified later, but it feels really unnecessary at the\nmoment.\n- Order: 'C' or row-major. This seems to have won. We can add that information later if needed.\n- Stride: No striding, all tensors need to be packed before being serialized. I have yet to see a case where it seems useful to have a strided tensor stored in serialized format.\n\n### Benefits\n\nSince we can invent a new format we can propose additional benefits:\n\n- Prevent DOS attacks: We can craft the format in such a way that it's almost\nimpossible to use malicious files to DOS attack a user. Currently, there's a limit\non the size of the header of 100MB to prevent parsing extremely large JSON.\n Also when reading the file, there's a guarantee that addresses in the file\n do not overlap in any way, meaning when you're loading a file you should never\n exceed the size of the file in memory\n\n- Faster load: PyTorch seems to be the fastest file to load out in the major\nML formats. However, it does seem to have an extra copy on CPU, which we\ncan bypass in this lib by using `torch.UntypedStorage.from_file`.\nCurrently, CPU loading times are extremely fast with this lib compared to pickle.\nGPU loading times are as fast or faster than PyTorch equivalent.\nLoading first on CPU with memmapping with torch, and then moving all tensors to GPU seems\nto be faster too somehow (similar behavior in torch pickle)\n\n- Lazy loading: in distributed (multi-node or multi-gpu) settings, it's nice to be able to\nload only part of the tensors on the various models. For\n[BLOOM](https://huggingface.co/bigscience/bloom) using this format enabled\nto load the model on 8 GPUs from 10mn with regular PyTorch weights down to 45s.\nThis really speeds up feedbacks loops when developing on the model. 
For instance\nyou don't have to have separate copies of the weights when changing the distribution\nstrategy (for instance Pipeline Parallelism vs Tensor Parallelism).\n\nLicense: Apache-2.0",What is the average loading time for CPU with this library compared to pickle?\n,The CPU loading times are faster with this library compared to pickle.,huggingface/safetensors/blob/main/README.md
1,| | |[camembert/camembert-large](https://huggingface.co/camembert/camembert-large) |3660 |6 | | |[LICENSE](https://huggingface.co/camembert/camembert-large/blob/main/LICENSE) | | |\n| | |[stabilityai/japanese-stablelm-instruct-alpha-7b](https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b) |3553 |80 | | |[LICENSE](https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b/blob/main/LICENSE) | | |\n| | |[TheBloke/llama-2-70b-Guanaco-QLoRA-fp16](https://huggingface.co/TheBloke/llama-2-70b-Guanaco-QLoRA-fp16) |3537 |52 | llama2 | |[LICENSE.txt](https://huggingface.co/TheBloke/llama-2-70b-Guanaco-QLoRA-fp16/blob/main/LICENSE.txt) | | |,"What is the size (number of parameters) of ""llama-2-70b-Guanaco-QLoRA-fp16"" model?\n",3537,huggingface/hub-docs/blob/main/hacktoberfest_challenges/model_no_license.md
2,"Deep Layer Aggregation\n\nExtending “shallow” skip connections, **Dense Layer Aggregation (DLA)** incorporates more depth and sharing. The authors introduce two structures for deep layer aggregation (DLA): iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA). These structures are expressed through an architectural framework, independent of the choice of backbone, for compatibility with current and future networks. \n\nIDA focuses on fusing resolutions and scales while HDA focuses on merging features from all modules and channels. IDA follows the base hierarchy to refine resolution and aggregate scale stage-bystage. HDA assembles its own hierarchy of tree-structured connections that cross and merge stages to aggregate different levels of representation. \n\n## How do I use this model on an image?\nTo load a pretrained model:\n\n```python\nimport timm\nmodel = timm.create_model('dla102', pretrained=True)\nmodel.eval()\n```\n\nTo load and preprocess the image:\n```python \nimport urllib\nfrom PIL import Image\nfrom timm.data import resolve_data_config\nfrom timm.data.transforms_factory import create_transform\n\nconfig = resolve_data_config({}, model=model)\ntransform = create_transform(**config)\n\nurl, filename = (""https://github.com/pytorch/hub/raw/master/images/dog.jpg"", ""dog.jpg"")\nurllib.request.urlretrieve(url, filename)\nimg = Image.open(filename).convert('RGB')\ntensor = transform(img).unsqueeze(0) # transform and add batch dimension\n```\n\nTo get the model predictions:\n```python\nimport torch\nwith torch.no_grad():\n out = model(tensor)\nprobabilities = torch.nn.functional.softmax(out[0], dim=0)\nprint(probabilities.shape)\n# prints: torch.Size([1000])\n```",What are the two structures introduced in Dense Layer Aggregation (DLA) for deep layer aggregation?\n,Iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA),huggingface/pytorch-image-models/blob/main/docs/models/dla.md
3,"Objective: Move the agent so the box is within the agents its field of view\n\nActors: An EgoCentric Camera Actor (LINK) equipped with a monocular camera\n\nObservation space: \n- An RGB camera of shape (3, 40, 40) (C, H, W) in uint8 format.\n \nAction space:\n- A discrete action space with 3 possible actions\n- Turn left 10 degrees\n- Turn right 10 degrees\n- Move forward\n\nReward function:\n- A sparse reward for moving the box within a 60 degree fov cone in front of the agent.\n- A timeout penaly of -1 if the agent does not reach the object in 100 time-steps\n\nParallel: 4 independent instances of the same environment configuration.",What is the shape of the RGB camera in the observation space?\n,"The RGB camera in the observation space has a shape of (3, 40, 40).",huggingface/simulate/blob/main/docs/source/howto/rl.mdx
4,| [How to run inference with OpenVINO](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb) | Explains how to export your model to OpenVINO and run inference with OpenVINO Runtime on various tasks| [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb)|\n| [How to quantize a question answering model with NNCF](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb) | Show how to apply post-training quantization on a question answering model using [NNCF](https://github.com/openvinotoolkit/nncf) and to accelerate inference with OpenVINO| [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb)|,Which OpenVINO notebook from Hugging Face provides instructions on exporting a model and running inference?\n,"The notebook named ""optimum_openvino_inference.ipynb"" does this.",huggingface/optimum/blob/main/notebooks/README.md
5,"Simple call on one item:\n\n```python\n>>> pipe = pipeline(""text-classification"")\n>>> pipe(""This restaurant is awesome"")\n[{'label': 'POSITIVE', 'score': 0.9998743534088135}]\n```\n\nIf you want to use a specific model from the [hub](https://huggingface.co) you can ignore the task if the model on\nthe hub already defines it:\n\n```python\n>>> pipe = pipeline(model=""roberta-large-mnli"")\n>>> pipe(""This restaurant is awesome"")\n[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]\n```\n\nTo call a pipeline on many items, you can call it with a *list*.\n\n```python\n>>> pipe = pipeline(""text-classification"")\n>>> pipe([""This restaurant is awesome"", ""This restaurant is awful""])\n[{'label': 'POSITIVE', 'score': 0.9998743534088135},\n {'label': 'NEGATIVE', 'score': 0.9996669292449951}]\n```\n\nTo iterate over full datasets it is recommended to use a `dataset` directly. This means you don't need to allocate\nthe whole dataset at once, nor do you need to do batching yourself. This should work just as fast as custom loops on\nGPU. If it doesn't don't hesitate to create an issue.\n\n```python\nimport datasets\nfrom transformers import pipeline\nfrom transformers.pipelines.pt_utils import KeyDataset\nfrom tqdm.auto import tqdm\n\npipe = pipeline(""automatic-speech-recognition"", model=""facebook/wav2vec2-base-960h"", device=0)\ndataset = datasets.load_dataset(""superb"", name=""asr"", split=""test"")\n\n# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item\n# as we're not interested in the *target* part of the dataset. 
For sentence pair use KeyPairDataset\nfor out in tqdm(pipe(KeyDataset(dataset, ""file""))):\n print(out)\n # {""text"": ""NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND""}\n # {""text"": ....}\n # ....\n```\n\nFor ease of use, a generator is also possible:\n\n\n```python\nfrom transformers import pipeline\n\npipe = pipeline(""text-classification"")","What is the label of the first text in the pipe(""This restaurant is awesome"") output?\n",The label of the first text is 'POSITIVE'.,huggingface/transformers/blob/main/docs/source/en/main_classes/pipelines.md
6,- [#5840](https://github.com/gradio-app/gradio/pull/5840) [`4e62b8493`](https://github.com/gradio-app/gradio/commit/4e62b8493dfce50bafafe49f1a5deb929d822103) - Ensure websocket polyfill doesn't load if there is already a `global.Webocket` property set. Thanks [@Jay2theWhy](https://github.com/Jay2theWhy)!\n- [#5839](https://github.com/gradio-app/gradio/pull/5839) [`b83064da0`](https://github.com/gradio-app/gradio/commit/b83064da0005ca055fc15ee478cf064bf91702a4) - Fix error when scrolling dropdown with scrollbar. Thanks [@Kit-p](https://github.com/Kit-p)!\n- [#5822](https://github.com/gradio-app/gradio/pull/5822) [`7b63db271`](https://github.com/gradio-app/gradio/commit/7b63db27161ab538f20cf8523fc04c9c3b604a98) - Convert async methods in the Examples class into normal sync methods. Thanks [@whitphx](https://github.com/whitphx)!\n- [#5904](https://github.com/gradio-app/gradio/pull/5904) [`891d42e9b`](https://github.com/gradio-app/gradio/commit/891d42e9baa7ab85ede2a5eadb56c274b0ed2785) - Define Font.__repr__() to be printed in the doc in a readable format. Thanks [@whitphx](https://github.com/whitphx)!\n- [#5811](https://github.com/gradio-app/gradio/pull/5811) [`1d5b15a2d`](https://github.com/gradio-app/gradio/commit/1d5b15a2d24387154f2cfb40a36de25b331471d3) - Assert refactor in external.py. Thanks [@harry-urek](https://github.com/harry-urek)!\n- [#5827](https://github.com/gradio-app/gradio/pull/5827) [`48e09ee88`](https://github.com/gradio-app/gradio/commit/48e09ee88799efa38a5cc9b1b61e462f72ec6093) - Quick fix: Chatbot change event. Thanks [@dawoodkhan82](https://github.com/dawoodkhan82)!\n- [#5890](https://github.com/gradio-app/gradio/pull/5890) [`c4ba832b3`](https://github.com/gradio-app/gradio/commit/c4ba832b318dad5e8bf565cfa0daf93ca188498f) - Remove deprecation warning from `gr.update` and clean up associated code. 
Thanks [@abidlabs](https://github.com/abidlabs)!,"In how many pull requests were font-related changes made between commit hashes ""4e62b8493dfce50bafafe49f1a5deb929d822103"" and ""baa7ab85ede2a5eadb56c274b0ed2785""?\n",One pull request (#5904) contained font-related changes.,gradio-app/gradio/blob/main/CHANGELOG.md
7,"!--Copyright 2023 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the ""License""); you may not use this file except in compliance with\nthe License. You may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on\nan ""AS IS"" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\nspecific language governing permissions and limitations under the License.\n\n⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be\nrendered properly in your Markdown viewer.\n\n-->\n\n# Multitask Prompt Tuning\n\n[Multitask Prompt Tuning](https://huggingface.co/papers/2303.02861) decomposes the soft prompts of each task into a single learned transferable prompt instead of a separate prompt for each task. The single learned prompt can be adapted for each task by multiplicative low rank updates.\n\nThe abstract from the paper is:",In which license is the HuggingFace Multitask Prompt Tuning paper distributed?\n,"The HuggingFace Multitask Prompt Tuning paper is distributed under the Apache License, Version 2.0.",huggingface/peft/blob/main/docs/source/package_reference/multitask_prompt_tuning.md
8,"<p align=""center"">\n <img src=""https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/distill_sd/Picture6.png"" width=500>\n</p>\n\n## Conclusion\n\nWe invite the open-source community to help us improve and achieve wider adoption of these distilled SD models. Users can join our [Discord](https://discord.gg/s6E6eHJk) server, where we will be announcing the latest updates to these models, releasing more checkpoints and some exciting new LoRAs. And if you like our work, please give us a star on our [Github](https://github.com/segmind/distill-sd).",In which Discord server can users find the latest updates about the distilled SD models?\n,Users can find the latest updates about the distilled SD models on the Discord server with the invitation link <https://discord.gg/s6E6eHJk>.,huggingface/blog/blob/main/sd_distillation.md
9,| Task | Example datasets | Trainer support | 🤗 Accelerate | 🤗 Datasets | Colab\n|---|---|:---:|:---:|:---:|:---:|\n| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling) | [WikiText-2](https://huggingface.co/datasets/wikitext) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)\n| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) | [SWAG](https://huggingface.co/datasets/swag) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)\n| [**`question-answering`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) | [SQuAD](https://huggingface.co/datasets/squad) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)\n| [**`summarization`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) | [XSum](https://huggingface.co/datasets/xsum) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)\n| [**`text-classification`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) | [GLUE](https://huggingface.co/datasets/glue) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb),In which GitHub repository are the examples for 
question-answering tasks located?\n,"The examples for question-answering tasks are located in the GitHub repository: `huggingface/transformers`. Specifically, they can be found in the subdirectory: `examples/pytorch/question-answering`.",huggingface/transformers/blob/main/examples/pytorch/README.md


### 1.3. Set up critique agents

The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):
- **Groundedness:** can the question be answered from the given context?
- **Relevance:** is the question relevant to users? For instance, `"What is the date when transformers 4.29.1 was released?"` is not relevant for ML practitioners.

One last failure case we've noticed is when a question is tailored to the particular setting where it was generated, but is undecipherable by itself, like `"What is the name of the function used in this guide?"`.
We also build a critique agent for this criterion:
- **Stand-alone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be `What is the function used in this article?` for a question generated from a specific blog article.

We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.

💡 ___When asking the agents to output a score, we first ask them to produce their rationale. This will help us verify scores, but most importantly, asking a model to first output its rationale gives it more tokens to think and elaborate an answer before summarizing it into a single score token.___

We now build and run these critique agents.

In [None]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [None]:
import re

print("Generating critique for each QA couple...")

for output in tqdm(outputs):
    critique_inputs = {
        "groundedness": question_groundedness_critique_prompt.format(
            context=output["context"], question=output["question"]
        ),
        "relevance": question_relevance_critique_prompt.format(question=output["question"]),
        "standalone": question_standalone_critique_prompt.format(question=output["question"]),
    }

    for criterion, prompt in critique_inputs.items():
        try:
            evaluation = call_llm(llm_client, prompt)

            score_match = re.search(r"Total rating:\s*(\d+)", evaluation, re.IGNORECASE)

            eval_match = re.search(r"Evaluation:\s*(.+?)(?=\nTotal rating:|$)", evaluation, re.IGNORECASE | re.DOTALL)

            if score_match:
                score = int(score_match.group(1))
            else:
                score = None

            if eval_match:
                explanation = eval_match.group(1).strip()
            else:
                explanation = evaluation

            output.update({
                f"{criterion}_score": score,
                f"{criterion}_eval": explanation,
            })

        except Exception as e:
            print(f"Error critiquing {criterion}: {e}")
            output.update({
                f"{criterion}_score": None,
                f"{criterion}_eval": "Error",
            })
if outputs:
    print("Keys in first item:", outputs[0].keys())

Generating critique for each QA couple...



Keys in first item: dict_keys(['context', 'question', 'answer', 'source_doc', 'groundedness_score', 'groundedness_eval', 'relevance_score', 'relevance_eval', 'standalone_score', 'standalone_eval'])
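Once every QA couple carries the three scores, the elimination rule described earlier (drop a question as soon as any agent scores it too low) can be sketched as below. The threshold of 4 is an assumption, not a value stated above, and the helper `passes_all_critiques` and the toy rows are hypothetical.

```python
from typing import Any, Dict

SCORE_KEYS = ["groundedness_score", "relevance_score", "standalone_score"]
MIN_SCORE = 4  # assumed threshold; tune it for your own data


def passes_all_critiques(row: Dict[str, Any], min_score: int = MIN_SCORE) -> bool:
    """True only if every critique produced a score of at least min_score."""
    return all(row.get(key) is not None and row[key] >= min_score for key in SCORE_KEYS)


# Toy rows with made-up scores, shaped like the entries of `outputs`
toy_outputs = [
    {"question": "q1", "groundedness_score": 5, "relevance_score": 4, "standalone_score": 5},
    {"question": "q2", "groundedness_score": 1, "relevance_score": 5, "standalone_score": 5},
    {"question": "q3", "groundedness_score": 5, "relevance_score": None, "standalone_score": 4},
]

kept_questions = [row["question"] for row in toy_outputs if passes_all_critiques(row)]
print(kept_questions)
```

Treating a missing score as a failure is a deliberately conservative choice: a critique that could not be parsed gives no evidence that the question is good.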


In [None]:
display(pd.DataFrame(outputs))

Unnamed: 0,context,question,answer,source_doc,groundedness_score,groundedness_eval,relevance_score,relevance_eval,standalone_score,standalone_eval
0,"- Endianness: Little-endian. This can be modified later, but it feels really unnecessary at the\nmoment.\n- Order: 'C' or row-major. This seems to have won. We can add that information later if needed.\n- Stride: No striding, all tensors need to be packed before being serialized. I have yet to see a case where it seems useful to have a strided tensor stored in serialized format.\n\n### Benefits\n\nSince we can invent a new format we can propose additional benefits:\n\n- Prevent DOS attacks: We can craft the format in such a way that it's almost\nimpossible to use malicious files to DOS attack a user. Currently, there's a limit\non the size of the header of 100MB to prevent parsing extremely large JSON.\n Also when reading the file, there's a guarantee that addresses in the file\n do not overlap in any way, meaning when you're loading a file you should never\n exceed the size of the file in memory\n\n- Faster load: PyTorch seems to be the fastest file to load out in the major\nML formats. However, it does seem to have an extra copy on CPU, which we\ncan bypass in this lib by using `torch.UntypedStorage.from_file`.\nCurrently, CPU loading times are extremely fast with this lib compared to pickle.\nGPU loading times are as fast or faster than PyTorch equivalent.\nLoading first on CPU with memmapping with torch, and then moving all tensors to GPU seems\nto be faster too somehow (similar behavior in torch pickle)\n\n- Lazy loading: in distributed (multi-node or multi-gpu) settings, it's nice to be able to\nload only part of the tensors on the various models. For\n[BLOOM](https://huggingface.co/bigscience/bloom) using this format enabled\nto load the model on 8 GPUs from 10mn with regular PyTorch weights down to 45s.\nThis really speeds up feedbacks loops when developing on the model. 
For instance\nyou don't have to have separate copies of the weights when changing the distribution\nstrategy (for instance Pipeline Parallelism vs Tensor Parallelism).\n\nLicense: Apache-2.0",What is the average loading time for CPU with this library compared to pickle?\n,The CPU loading times are faster with this library compared to pickle.,huggingface/safetensors/blob/main/README.md,1,"The context mentions the benefits of the new library in terms of preventing DOS attacks, faster load times on CPU compared to pickle, and lazy loading. However, it does not provide any specific numerical data or detailed comparison between the average loading times of the library and pickle for CPU.",1,"This question is not directly related to NLP applications or the Hugging Face ecosystem as it compares loading times of different serialization libraries (Transformers library vs pickle) for general Python data, rather than specifically focusing on NLP tasks or Hugging Face components. Furthermore, loading times for CPU and other factors like model size can highly influence the results, making it difficult to provide a definitive answer without knowing these specifics.",5,"This question can be understood without additional context, but it relies on the assumption that the reader is familiar with the concepts of ""loading time"" and ""comparison between library X and pickle"". However, these concepts are commonly used in data processing and machine learning fields, and they do not depend on a specific context or document."
1,| | |[camembert/camembert-large](https://huggingface.co/camembert/camembert-large) |3660 |6 | | |[LICENSE](https://huggingface.co/camembert/camembert-large/blob/main/LICENSE) | | |\n| | |[stabilityai/japanese-stablelm-instruct-alpha-7b](https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b) |3553 |80 | | |[LICENSE](https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b/blob/main/LICENSE) | | |\n| | |[TheBloke/llama-2-70b-Guanaco-QLoRA-fp16](https://huggingface.co/TheBloke/llama-2-70b-Guanaco-QLoRA-fp16) |3537 |52 | llama2 | |[LICENSE.txt](https://huggingface.co/TheBloke/llama-2-70b-Guanaco-QLoRA-fp16/blob/main/LICENSE.txt) | | |,"What is the size (number of parameters) of ""llama-2-70b-Guanaco-QLoRA-fp16"" model?\n",3537,huggingface/hub-docs/blob/main/hacktoberfest_challenges/model_no_license.md,5,"The context provides the number of parameters for the given model under the ""Size (MB)"" column, but no clear indication of the total number of parameters. However, we can find the number of parameters by multiplying the size (in MB) by 1024 and then dividing by 1,000,000. This calculation can be performed as follows: (3537 / 1024) * 1024 * 1024 = 3,587,251,840 parameters.",5,This question is useful for machine learning developers building NLP applications with the Hugging Face ecosystem as they often need to know the size of their chosen models to make informed decisions regarding system requirements and resource allocation.,5,"The question refers to a specific model named ""llama-2-70b-Guanaco-QLoRA-fp16"". The model name itself provides sufficient context to understand what is being asked, as long as the reader is familiar with the naming conventions used in the model community. The size or number of parameters of a model is a common piece of information that is often provided in the model documentation, making this a self-contained question."
2,"Deep Layer Aggregation\n\nExtending “shallow” skip connections, **Dense Layer Aggregation (DLA)** incorporates more depth and sharing. The authors introduce two structures for deep layer aggregation (DLA): iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA). These structures are expressed through an architectural framework, independent of the choice of backbone, for compatibility with current and future networks. \n\nIDA focuses on fusing resolutions and scales while HDA focuses on merging features from all modules and channels. IDA follows the base hierarchy to refine resolution and aggregate scale stage-bystage. HDA assembles its own hierarchy of tree-structured connections that cross and merge stages to aggregate different levels of representation. \n\n## How do I use this model on an image?\nTo load a pretrained model:\n\n```python\nimport timm\nmodel = timm.create_model('dla102', pretrained=True)\nmodel.eval()\n```\n\nTo load and preprocess the image:\n```python \nimport urllib\nfrom PIL import Image\nfrom timm.data import resolve_data_config\nfrom timm.data.transforms_factory import create_transform\n\nconfig = resolve_data_config({}, model=model)\ntransform = create_transform(**config)\n\nurl, filename = (""https://github.com/pytorch/hub/raw/master/images/dog.jpg"", ""dog.jpg"")\nurllib.request.urlretrieve(url, filename)\nimg = Image.open(filename).convert('RGB')\ntensor = transform(img).unsqueeze(0) # transform and add batch dimension\n```\n\nTo get the model predictions:\n```python\nimport torch\nwith torch.no_grad():\n out = model(tensor)\nprobabilities = torch.nn.functional.softmax(out[0], dim=0)\nprint(probabilities.shape)\n# prints: torch.Size([1000])\n```",What are the two structures introduced in Dense Layer Aggregation (DLA) for deep layer aggregation?\n,Iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA),huggingface/pytorch-image-models/blob/main/docs/models/dla.md,5,The context clearly states that the 
two structures introduced in DLA for deep layer aggregation are iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA. The context also explains the functions of each structure.,4,"DLA is a deep neural architecture proposed for object detection, which aggregates deep feature representations in a hierarchical and densely connected manner. Understanding the structures employed in DLA for deep layer aggregation can be crucial for machine learning developers working on NLP applications using Hugging Face's ecosystem, as NLP models may benefit from similar hierarchical and dense feature representation methods.",3,"The question refers to Dense Layer Aggregation (DLA), which is a specific deep learning architecture used for feature aggregation in deep neural networks. The question asks about two specific structures introduced in DLA for deep layer aggregation. To understand the question, one needs to have a basic understanding of DLA, but the explicit context of the question is sufficient to comprehend it. Therefore, the question is more context-dependent than independent, but still clear enough to be rated as a 3."
3,"Objective: Move the agent so the box is within the agents its field of view\n\nActors: An EgoCentric Camera Actor (LINK) equipped with a monocular camera\n\nObservation space: \n- An RGB camera of shape (3, 40, 40) (C, H, W) in uint8 format.\n \nAction space:\n- A discrete action space with 3 possible actions\n- Turn left 10 degrees\n- Turn right 10 degrees\n- Move forward\n\nReward function:\n- A sparse reward for moving the box within a 60 degree fov cone in front of the agent.\n- A timeout penaly of -1 if the agent does not reach the object in 100 time-steps\n\nParallel: 4 independent instances of the same environment configuration.",What is the shape of the RGB camera in the observation space?\n,"The RGB camera in the observation space has a shape of (3, 40, 40).",huggingface/simulate/blob/main/docs/source/howto/rl.mdx,1,"The context does not provide any information about the shape of the RGB camera in the observation space from the perspective of the observation space itself. It only describes the shape of the observation space as an RGB image of size (3, 40, 40), but it doesn't provide any information about the physical shape of the camera in the observation space.",1,"This question is not useful for machine learning developers building NLP applications with the Hugging Face ecosystem as it is not related to natural language processing or the Hugging Face library specifically. It appears to be asking about the shape of an RGB camera in a computer vision context, rather than anything to do with machine learning or NLP.",5,"This question refers to the shape of an RGB camera in the observation space, which is a common concept in computer vision and robotics. However, the question does not include any context or specific reference to a particular RGB camera or system."
4,| [How to run inference with OpenVINO](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb) | Explains how to export your model to OpenVINO and run inference with OpenVINO Runtime on various tasks| [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb)|\n| [How to quantize a question answering model with NNCF](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb) | Show how to apply post-training quantization on a question answering model using [NNCF](https://github.com/openvinotoolkit/nncf) and to accelerate inference with OpenVINO| [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb)|,Which OpenVINO notebook from Hugging Face provides instructions on exporting a model and running inference?\n,"The notebook named ""optimum_openvino_inference.ipynb"" does this.",huggingface/optimum/blob/main/notebooks/README.md,5,"The context provides two OpenVINO notebooks from Hugging Face. The first notebook, ""optimum\_openvino\_inference.ipynb,"" explains how to export a model to OpenVINO and run inference. 
Therefore, the question can be answered by referring to this notebook.",5,"The question is relevant to machine learning developers building NLP applications with the Hugging Face ecosystem as it asks about a specific task: exporting a model and running inference using Hugging Face's OpenVINO toolkit. OpenVINO is a popular framework for running inference on mobile and embedded devices, and exporting models is an essential step in the ML development workflow.",5,"This question refers to specific OpenVINO notebooks available on Hugging Face, asking for one that provides instructions on exporting a model and running inference. The question assumes that the reader has some familiarity with OpenVINO and Hugging Face, but it does not require any additional context beyond that."
5,"Simple call on one item:\n\n```python\n>>> pipe = pipeline(""text-classification"")\n>>> pipe(""This restaurant is awesome"")\n[{'label': 'POSITIVE', 'score': 0.9998743534088135}]\n```\n\nIf you want to use a specific model from the [hub](https://huggingface.co) you can ignore the task if the model on\nthe hub already defines it:\n\n```python\n>>> pipe = pipeline(model=""roberta-large-mnli"")\n>>> pipe(""This restaurant is awesome"")\n[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]\n```\n\nTo call a pipeline on many items, you can call it with a *list*.\n\n```python\n>>> pipe = pipeline(""text-classification"")\n>>> pipe([""This restaurant is awesome"", ""This restaurant is awful""])\n[{'label': 'POSITIVE', 'score': 0.9998743534088135},\n {'label': 'NEGATIVE', 'score': 0.9996669292449951}]\n```\n\nTo iterate over full datasets it is recommended to use a `dataset` directly. This means you don't need to allocate\nthe whole dataset at once, nor do you need to do batching yourself. This should work just as fast as custom loops on\nGPU. If it doesn't don't hesitate to create an issue.\n\n```python\nimport datasets\nfrom transformers import pipeline\nfrom transformers.pipelines.pt_utils import KeyDataset\nfrom tqdm.auto import tqdm\n\npipe = pipeline(""automatic-speech-recognition"", model=""facebook/wav2vec2-base-960h"", device=0)\ndataset = datasets.load_dataset(""superb"", name=""asr"", split=""test"")\n\n# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item\n# as we're not interested in the *target* part of the dataset. 
For sentence pair use KeyPairDataset\nfor out in tqdm(pipe(KeyDataset(dataset, ""file""))):\n print(out)\n # {""text"": ""NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND""}\n # {""text"": ....}\n # ....\n```\n\nFor ease of use, a generator is also possible:\n\n\n```python\nfrom transformers import pipeline\n\npipe = pipeline(""text-classification"")","What is the label of the first text in the pipe(""This restaurant is awesome"") output?\n",The label of the first text is 'POSITIVE'.,huggingface/transformers/blob/main/docs/source/en/main_classes/pipelines.md,5,"The context provides the output of a text classification pipeline, which includes the label 'POSITIVE' and a high score for the given input ""This restaurant is awesome"". There is no ambiguity in the context that the label for the first text in the pipeline output is 'POSITIVE'.",1,"This question assumes knowledge of a specific output format from Hugging Face models, which is not provided in the context. It also does not specify which model or dataset is being used. These details are important for determining the label of a text output.",1,"The question refers to the label of the first text in the output of the pipe function, but it does not specify which pipe function or model is being used. In order to fully understand the question and provide an accurate answer, one would need to have additional context or information about which specific pipe function or model is being referred to."
6,- [#5840](https://github.com/gradio-app/gradio/pull/5840) [`4e62b8493`](https://github.com/gradio-app/gradio/commit/4e62b8493dfce50bafafe49f1a5deb929d822103) - Ensure websocket polyfill doesn't load if there is already a `global.Webocket` property set. Thanks [@Jay2theWhy](https://github.com/Jay2theWhy)!\n- [#5839](https://github.com/gradio-app/gradio/pull/5839) [`b83064da0`](https://github.com/gradio-app/gradio/commit/b83064da0005ca055fc15ee478cf064bf91702a4) - Fix error when scrolling dropdown with scrollbar. Thanks [@Kit-p](https://github.com/Kit-p)!\n- [#5822](https://github.com/gradio-app/gradio/pull/5822) [`7b63db271`](https://github.com/gradio-app/gradio/commit/7b63db27161ab538f20cf8523fc04c9c3b604a98) - Convert async methods in the Examples class into normal sync methods. Thanks [@whitphx](https://github.com/whitphx)!\n- [#5904](https://github.com/gradio-app/gradio/pull/5904) [`891d42e9b`](https://github.com/gradio-app/gradio/commit/891d42e9baa7ab85ede2a5eadb56c274b0ed2785) - Define Font.__repr__() to be printed in the doc in a readable format. Thanks [@whitphx](https://github.com/whitphx)!\n- [#5811](https://github.com/gradio-app/gradio/pull/5811) [`1d5b15a2d`](https://github.com/gradio-app/gradio/commit/1d5b15a2d24387154f2cfb40a36de25b331471d3) - Assert refactor in external.py. Thanks [@harry-urek](https://github.com/harry-urek)!\n- [#5827](https://github.com/gradio-app/gradio/pull/5827) [`48e09ee88`](https://github.com/gradio-app/gradio/commit/48e09ee88799efa38a5cc9b1b61e462f72ec6093) - Quick fix: Chatbot change event. Thanks [@dawoodkhan82](https://github.com/dawoodkhan82)!\n- [#5890](https://github.com/gradio-app/gradio/pull/5890) [`c4ba832b3`](https://github.com/gradio-app/gradio/commit/c4ba832b318dad5e8bf565cfa0daf93ca188498f) - Remove deprecation warning from `gr.update` and clean up associated code. 
Thanks [@abidlabs](https://github.com/abidlabs)!,"In how many pull requests were font-related changes made between commit hashes ""4e62b8493dfce50bafafe49f1a5deb929d822103"" and ""baa7ab85ede2a5eadb56c274b0ed2785""?\n",One pull request (#5904) contained font-related changes.,gradio-app/gradio/blob/main/CHANGELOG.md,1,The context does not provide sufficient information to answer the question as it does not mention any specific changes related to fonts between the given commit hashes.,1,"This question is not directly related to machine learning or NLP applications using the Hugging Face ecosystem. It pertains to Git history and font changes, which is an unrelated concern for developers building NLP applications.",5,"This question refers to specific commit hashes and asks for information that can be obtained by examining the git history. However, it does not depend on any context outside of that, assuming that the operator has access to the repository in question."
7,"!--Copyright 2023 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the ""License""); you may not use this file except in compliance with\nthe License. You may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on\nan ""AS IS"" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\nspecific language governing permissions and limitations under the License.\n\n⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be\nrendered properly in your Markdown viewer.\n\n-->\n\n# Multitask Prompt Tuning\n\n[Multitask Prompt Tuning](https://huggingface.co/papers/2303.02861) decomposes the soft prompts of each task into a single learned transferable prompt instead of a separate prompt for each task. The single learned prompt can be adapted for each task by multiplicative low rank updates.\n\nThe abstract from the paper is:",In which license is the HuggingFace Multitask Prompt Tuning paper distributed?\n,"The HuggingFace Multitask Prompt Tuning paper is distributed under the Apache License, Version 2.0.",huggingface/peft/blob/main/docs/source/package_reference/multitask_prompt_tuning.md,1,"The context does not explicitly mention the license of the paper ""Multitask Prompt Tuning"" itself, only the license of the context file where the paper is hosted. Therefore, it is not possible to unambiguously determine the answer from the context alone.",5,"This question is useful for machine learning developers building NLP applications with the Hugging Face ecosystem as it provides important information about the licensing of the research paper that discusses a key feature of the Hugging Face library. 
Understanding the licensing terms can help developers determine how they can use, modify, and distribute the information and code discussed in the paper.",5,"This question refers to a specific document, namely the HuggingFace Multitask Prompt Tuning paper. However, it does not require any context-specific information beyond the title of the paper and the organization, Hugging Face, which is well-known in the machine learning community."
8,"<p align=""center"">\n <img src=""https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/distill_sd/Picture6.png"" width=500>\n</p>\n\n## Conclusion\n\nWe invite the open-source community to help us improve and achieve wider adoption of these distilled SD models. Users can join our [Discord](https://discord.gg/s6E6eHJk) server, where we will be announcing the latest updates to these models, releasing more checkpoints and some exciting new LoRAs. And if you like our work, please give us a star on our [Github](https://github.com/segmind/distill-sd).",In which Discord server can users find the latest updates about the distilled SD models?\n,Users can find the latest updates about the distilled SD models on the Discord server with the invitation link <https://discord.gg/s6E6eHJk>.,huggingface/blog/blob/main/sd_distillation.md,5,The context explicitly states that users can find the latest updates about the distilled SD models on the Discord server mentioned in the link.,1,"This question is not directly related to building NLP applications using the Hugging Face ecosystem. It asks about a specific Discord server, which does not provide any relevant information for developers working on machine learning projects.",1,"This question assumes the existence of Discord servers related to the distilled SD models. Therefore, to fully understand the question, some context, such as the name or access to these servers, is required."
9,| Task | Example datasets | Trainer support | 🤗 Accelerate | 🤗 Datasets | Colab\n|---|---|:---:|:---:|:---:|:---:|\n| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling) | [WikiText-2](https://huggingface.co/datasets/wikitext) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)\n| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) | [SWAG](https://huggingface.co/datasets/swag) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)\n| [**`question-answering`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) | [SQuAD](https://huggingface.co/datasets/squad) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)\n| [**`summarization`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) | [XSum](https://huggingface.co/datasets/xsum) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)\n| [**`text-classification`**](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) | [GLUE](https://huggingface.co/datasets/glue) | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb),In which GitHub repository are the examples for 
question-answering tasks located?\n,"The examples for question-answering tasks are located in the GitHub repository: `huggingface/transformers`. Specifically, they can be found in the subdirectory: `examples/pytorch/question-answering`.",huggingface/transformers/blob/main/examples/pytorch/README.md,5,"Given the context, the question ""In which GitHub repository are the examples for question-answering tasks located?"" can be answered unambiguously as the answer is provided in the context as ""https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering"".",4,This question is useful for machine learning developers building NLP applications with the Hugging Face ecosystem as it provides information about the location of specific examples related to question-answering tasks. Having this information can help developers quickly access and learn from the available resources.,5,"This question is not dependent on any specific context and can be understood independently. The question refers to a GitHub repository, which is a common term used in software development and version control projects."


Now let us filter out bad questions based on our critique agent scores:

In [None]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

eval_dataset = datasets.Dataset.from_pandas(
    generated_questions, split="train", preserve_index=False
)

Evaluation dataset before filtering:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,What is the average loading time for CPU with this library compared to pickle?\n,The CPU loading times are faster with this library compared to pickle.,1,1,5
1,"What is the size (number of parameters) of ""llama-2-70b-Guanaco-QLoRA-fp16"" model?\n",3537,5,5,5
2,What are the two structures introduced in Dense Layer Aggregation (DLA) for deep layer aggregation?\n,Iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA),5,4,3
3,What is the shape of the RGB camera in the observation space?\n,"The RGB camera in the observation space has a shape of (3, 40, 40).",1,1,5
4,Which OpenVINO notebook from Hugging Face provides instructions on exporting a model and running inference?\n,"The notebook named ""optimum_openvino_inference.ipynb"" does this.",5,5,5
5,"What is the label of the first text in the pipe(""This restaurant is awesome"") output?\n",The label of the first text is 'POSITIVE'.,5,1,1
6,"In how many pull requests were font-related changes made between commit hashes ""4e62b8493dfce50bafafe49f1a5deb929d822103"" and ""baa7ab85ede2a5eadb56c274b0ed2785""?\n",One pull request (#5904) contained font-related changes.,1,1,5
7,In which license is the HuggingFace Multitask Prompt Tuning paper distributed?\n,"The HuggingFace Multitask Prompt Tuning paper is distributed under the Apache License, Version 2.0.",1,5,5
8,In which Discord server can users find the latest updates about the distilled SD models?\n,Users can find the latest updates about the distilled SD models on the Discord server with the invitation link <https://discord.gg/s6E6eHJk>.,5,1,1
9,In which GitHub repository are the examples for question-answering tasks located?\n,"The examples for question-answering tasks are located in the GitHub repository: `huggingface/transformers`. Specifically, they can be found in the subdirectory: `examples/pytorch/question-answering`.",5,4,5


Final evaluation dataset:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
1,"What is the size (number of parameters) of ""llama-2-70b-Guanaco-QLoRA-fp16"" model?\n",3537,5,5,5
4,Which OpenVINO notebook from Hugging Face provides instructions on exporting a model and running inference?\n,"The notebook named ""optimum_openvino_inference.ipynb"" does this.",5,5,5
9,In which GitHub repository are the examples for question-answering tasks located?\n,"The examples for question-answering tasks are located in the GitHub repository: `huggingface/transformers`. Specifically, they can be found in the subdirectory: `examples/pytorch/question-answering`.",5,4,5
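
As a quick sanity check, the filter can be replayed on the ten score triples displayed above. This is a minimal sketch in plain Python rather than pandas; the `scores` dict and the `keep` helper are illustrative names, not part of the notebook's pipeline.

```python
# (groundedness, relevance, standalone) for rows 0-9, copied from the table above
scores = {
    0: (1, 1, 5),
    1: (5, 5, 5),
    2: (5, 4, 3),
    3: (1, 1, 5),
    4: (5, 5, 5),
    5: (5, 1, 1),
    6: (1, 1, 5),
    7: (1, 5, 5),
    8: (5, 1, 1),
    9: (5, 4, 5),
}
THRESHOLD = 4  # same cutoff as the pandas filter


def keep(triple, threshold=THRESHOLD):
    """A question survives only if every critique score reaches the threshold."""
    return all(score >= threshold for score in triple)


kept_rows = [idx for idx, triple in scores.items() if keep(triple)]
print(kept_rows)  # [1, 4, 9]
```

Row 2 shows why all three conditions matter: it is well grounded and relevant, but its standalone score of 3 is enough to drop it.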


Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.

We generated only a few QA couples here to save time and cost, so let's kickstart the next part by loading a pre-generated dataset:

In [None]:
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

README.md:   0%|          | 0.00/893 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/289k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/65 [00:00<?, ? examples/s]

# 2. Build our RAG System

### 2.1. Preprocessing documents to build our vector database


In [None]:
from langchain_core.documents import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(ds)
]

  0%|          | 0/2647 [00:00<?, ?it/s]

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: str,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of at most `chunk_size` tokens (as counted by the given
    tokenizer), drop duplicate chunks, and return the resulting list of documents.
    """
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer,
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique
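
The deduplication step at the end of `split_documents` keeps the first occurrence of each chunk text and preserves the original order. A minimal sketch of the same idea on plain strings, assuming a hypothetical helper `dedupe_keep_order` (a `set` is enough here, since only membership is tested):

```python
from typing import Iterable, List


def dedupe_keep_order(texts: Iterable[str]) -> List[str]:
    """Return the texts with later duplicates removed, first occurrences kept in order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# Made-up chunk texts, only for illustration
chunks = ["intro", "install", "intro", "usage", "install"]
deduped = dedupe_keep_order(chunks)
print(deduped)  # ['intro', 'install', 'usage']
```

Note that this only removes exact duplicates: two chunks that differ by a single character are both kept.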

### 2.2. Retriever - embeddings 🗂️

In [None]:
!pip install -U langchain-huggingface

Collecting langchain-huggingface
  Downloading langchain_huggingface-1.2.0-py3-none-any.whl.metadata (2.8 kB)
Downloading langchain_huggingface-1.2.0-py3-none-any.whl (30 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-1.2.0


In [None]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
import os

def load_embeddings(
    langchain_docs: List[LangchainDocument],
    chunk_size: int,
    embedding_model_name: Optional[str] = "thenlper/gte-small",
) -> FAISS:
    # load embedding_model
    embedding_model = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        multi_process=True,
        model_kwargs={"device": "cuda"},
        encode_kwargs={
            "normalize_embeddings": True
        },
    )

    # Check if embeddings already exist on disk
    index_name = (
        f"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}"
    )
    index_folder_path = f"./data/indexes/{index_name}/"
    if os.path.isdir(index_folder_path):
        return FAISS.load_local(
            index_folder_path,
            embedding_model,
            distance_strategy=DistanceStrategy.COSINE,
            allow_dangerous_deserialization=True
        )

    else:
        print("Index not found, generating it...")
        docs_processed = split_documents(
            chunk_size,
            langchain_docs,
            embedding_model_name,
        )
        knowledge_index = FAISS.from_documents(
            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
        )
        knowledge_index.save_local(index_folder_path)
        return knowledge_index
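
Setting `normalize_embeddings=True` rescales each embedding to unit length, in which case the dot product of two embeddings equals their cosine similarity. The sketch below checks this on two made-up 3-dimensional vectors in plain Python; it does not call the embedding model or FAISS, and the helper names are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy "embeddings" with norms 5 and 3
a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization, a plain dot product already gives the cosine similarity
similarity = dot(a_unit, b_unit)
print(round(similarity, 4))  # 0.7333
print(math.isclose(similarity, cosine_similarity(a, b)))  # True
```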

### 2.3. Reader - LLM 💬

In this part, the __LLM Reader reads the retrieved documents to formulate its answer.__

In [None]:
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""

In [None]:
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

repo_id = "HuggingFaceH4/zephyr-7b-beta"

base_llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    task="text-generation",
    huggingfacehub_api_token=hf_token,
    max_new_tokens=512,
    temperature=0.1,
    repetition_penalty=1.03,
)


READER_LLM = ChatHuggingFace(llm=base_llm)

In [None]:
# from langchain_huggingface import HuggingFaceEndpoint

# repo_id = "HuggingFaceH4/zephyr-7b-beta"
# READER_MODEL_NAME = "zephyr-7b-beta"

# READER_LLM = HuggingFaceEndpoint(
#     repo_id=repo_id,
#     task="conversational",
#     max_new_tokens=512,
#     top_k=30,
#     temperature=0.1,
#     repetition_penalty=1.03,
# )

Re-import in the 2 files that raise errors:


**from langchain_core.documents.compressor import BaseDocumentCompressor**


In [None]:
from ragatouille import RAGPretrainedModel
from langchain_core.vectorstores import VectorStore
from langchain_core.language_models.llms import LLM

def answer_with_rag(
    question: str,
    llm: LLM,
    knowledge_index: VectorStore,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 7,
) -> Tuple[str, List[LangchainDocument]]:
    """Answer a question using RAG with the given knowledge index."""
    # Gather documents with retriever
    relevant_docs = knowledge_index.similarity_search(
        query=question, k=num_retrieved_docs
    )
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join(
        [f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)]
    )

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    answer = llm.invoke(final_prompt)

    return answer, relevant_docs
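
To make the prompt-building step concrete, here is a minimal sketch that numbers a few document chunks and fills a template. The names `PROMPT_TEMPLATE`, `build_context` and the sample chunks are hypothetical stand-ins for illustration only. This sketch also puts a newline after each document so consecutive chunks do not run together.

```python
from typing import List

# Hypothetical, shortened stand-in for RAG_PROMPT_TEMPLATE (illustration only)
PROMPT_TEMPLATE = "Context:\n{context}\n---\nQuestion: {question}\n"


def build_context(docs: List[str], num_docs_final: int = 7) -> str:
    """Keep the top documents, number them, and join them into one string."""
    docs = docs[:num_docs_final]
    context = "\nExtracted documents:\n"
    context += "".join(f"Document {i}:::\n{doc}\n" for i, doc in enumerate(docs))
    return context


# Hypothetical retrieved chunks, already sorted by relevance
docs = ["Chunk about tokenizers.", "Chunk about pipelines.", "Chunk about datasets."]
prompt = PROMPT_TEMPLATE.format(
    context=build_context(docs, num_docs_final=2),
    question="What is a tokenizer?",
)
print(prompt)
```

Only the first two chunks reach the prompt here, which mirrors the role of `num_docs_final` in the function above.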

# 3. Benchmarking the RAG system

The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system's output on this evaluation dataset.

In [None]:
from langchain_core.language_models import BaseChatModel

def run_rag_tests(
    eval_dataset: datasets.Dataset,
    llm: BaseChatModel,
    knowledge_index: VectorStore,
    output_file: str,
    reranker: Optional[RAGPretrainedModel] = None,
    verbose: bool = True,
    test_settings: Optional[str] = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):  # no usable checkpoint yet
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(
            question, llm, knowledge_index, reranker=reranker
        )
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer.content}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer.content,  # llm.invoke returns a chat message; keep only its text
            "retrieved_docs": list(relevant_docs),
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)
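
The load, skip, and save-after-every-example pattern used above can be tried in isolation. The sketch below is a simplified illustration, not the notebook's function: `run_with_checkpoint` and `fake_answer` are hypothetical names, and the "model" simply upper-cases the question.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def run_with_checkpoint(
    questions: List[str],
    answer_fn: Callable[[str], str],
    output_file: str,
) -> List[Dict[str, str]]:
    """Answer each question once, saving progress after every example."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        outputs = []

    done = {output["question"] for output in outputs}
    for question in questions:
        if question in done:
            continue  # already answered in a previous run
        outputs.append({"question": question, "generated_answer": answer_fn(question)})
        done.add(question)
        with open(output_file, "w") as f:  # checkpoint after every example
            json.dump(outputs, f)
    return outputs


calls = []


def fake_answer(question: str) -> str:
    """Hypothetical stand-in for the RAG pipeline; records each call."""
    calls.append(question)
    return question.upper()


path = os.path.join(tempfile.mkdtemp(), "rag_outputs.json")
run_with_checkpoint(["q1", "q2"], fake_answer, path)  # first run
results = run_with_checkpoint(["q1", "q2", "q3"], fake_answer, path)  # resumed run
print(calls)
```

On the second call only `q3` is sent to the answer function, so an interrupted benchmark can be restarted without paying for the same generations twice.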

In [None]:
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing the evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_core.messages import SystemMessage


evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)
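
A note on the doubled braces in `EVALUATION_PROMPT`: they are escapes, so that only `{instruction}`, `{response}` and `{reference_answer}` are treated as variables. The sketch below shows the rule with plain `str.format` on a hypothetical, shortened template. We assume here that LangChain's default f-string template format follows the same escaping convention.

```python
# Hypothetical, shortened template using the same escaping as EVALUATION_PROMPT
template = (
    "Format: Feedback: {{write a feedback for criteria}} "
    "[RESULT] {{an integer number between 1 and 5}}\n"
    "Instruction: {instruction}\n"
)

# Doubled braces become literal single braces; {instruction} is substituted
rendered = template.format(instruction="What is RAG?")
print(rendered)
```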

In [None]:
import os

from langchain_openai import ChatOpenAI
from google.colab import userdata

# eval_chat_model = ChatOpenAI(
#     model="gpt-4o-mini",
#     temperature=0,
#     openai_api_key=userdata.get('key_openai'))
# evaluator_name = "GPT4"

eval_chat_model = ChatOpenAI(
    model="gpt-4.1", # Model's name
    temperature=0,
    openai_api_key=userdata.get('key_ptn'), # PTN's key
    base_url="https://llm.ptnglobalcorp.com"
)
evaluator_name = "GPT4.1"

def evaluate_answers(
    answer_path: str,
    eval_chat_model,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        with open(answer_path, "r") as f:
            answers = json.load(f)

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = eval_chat_model.invoke(eval_prompt)

        feedback, score = [
            item.strip() for item in eval_result.content.split("[RESULT]")
        ]
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)
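
Judge models do not always respect the requested output format. As a hedged sketch, the helper below (`parse_judge_output` is a hypothetical name, not part of this notebook) splits a reply on `[RESULT]` and returns `None` for the score when no integer between 1 and 5 can be found, instead of raising.

```python
import re
from typing import Optional, Tuple


def parse_judge_output(text: str) -> Tuple[str, Optional[int]]:
    """Split a judge reply into (feedback, score); score is None if unparsable."""
    feedback, marker, tail = text.partition("[RESULT]")
    feedback = feedback.strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    # Look for a standalone digit from 1 to 5 after the marker
    match = re.search(r"\b([1-5])\b", tail) if marker else None
    score = int(match.group(1)) if match else None
    return feedback, score


print(parse_judge_output("Feedback: Mostly right. [RESULT] 4"))
print(parse_judge_output("The model forgot the marker"))
```

Rows with a `None` score can then be inspected or re-judged separately rather than silently counted.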

🚀 Let's run the tests and evaluate answers!👇

In [None]:
os.makedirs("./output", exist_ok=True)

READER_MODEL_NAME = "zephyr-7b-beta"

for chunk_size in [200]:  # Add other chunk sizes (in tokens) as needed
    for embeddings in ["thenlper/gte-small"]:  # Add other embeddings as needed
        for rerank in [True, False]:
            settings_name = f"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}"
            output_file_name = f"./output/rag_{settings_name}.json"

            print(f"Running evaluation for {settings_name}:")

            print("Loading knowledge base embeddings...")
            knowledge_index = load_embeddings(
                RAW_KNOWLEDGE_BASE,
                chunk_size=chunk_size,
                embedding_model_name=embeddings,
            )

            print("Running RAG...")
            reranker = (
                RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
                if rerank
                else None
            )

            run_rag_tests(
                eval_dataset=eval_dataset,
                llm=READER_LLM,
                knowledge_index=knowledge_index,
                output_file=output_file_name,
                reranker=reranker,
                verbose=False,
                test_settings=settings_name,
            )

Running evaluation for chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta:
Loading knowledge base embeddings...
Running RAG...



Running evaluation for chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta:
Loading knowledge base embeddings...
Running RAG...
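
When more chunk sizes or embedding models are added, the nested loops grow quickly. A minimal sketch of the same grid using `itertools.product` follows; the list names `chunk_sizes`, `embedding_models` and `rerank_options` are hypothetical, and the values are the ones used in this notebook.

```python
from itertools import product

READER_MODEL_NAME = "zephyr-7b-beta"

# Hypothetical list names; values taken from the loops above
chunk_sizes = [200]
embedding_models = ["thenlper/gte-small"]
rerank_options = [True, False]

settings_names = []
for chunk_size, embeddings, rerank in product(chunk_sizes, embedding_models, rerank_options):
    settings_name = (
        f"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}"
        f"_rerank:{rerank}_reader-model:{READER_MODEL_NAME}"
    )
    settings_names.append(settings_name)
    print(f"./output/rag_{settings_name}.json")
```

Note that these names contain `:`, which is not allowed in file names on Windows; on that platform a different separator would be needed.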



In [None]:
# Run evaluate_answers in its own cell, separate from run_rag_tests, to work around a bug

for chunk_size in [200]:  # Add other chunk sizes (in tokens) as needed
    for embeddings in ["thenlper/gte-small"]:  # Add other embeddings as needed
        for rerank in [True, False]:
            settings_name = f"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}"
            output_file_name = f"./output/rag_{settings_name}.json"

            print("Running evaluation (GPT4.1 Judge)...")
            evaluate_answers(
                output_file_name,
                eval_chat_model,
                evaluator_name,
                evaluation_prompt_template,
            )

Running evaluation (GPT4.1 Judge)...


100%|██████████| 65/65 [05:30<00:00,  5.08s/it]


Running evaluation (GPT4.1 Judge)...


100%|██████████| 65/65 [06:52<00:00,  6.35s/it]


In [None]:
import glob

outputs = []
for file in glob.glob("./output/*.json"):
    with open(file, "r") as f:  # close the file handle once loaded
        output = pd.DataFrame(json.load(f))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)

In [None]:
result["eval_score_GPT4.1"] = result["eval_score_GPT4.1"].apply(
    lambda x: int(x) if isinstance(x, str) else 1
)
result["eval_score_GPT4.1"] = (result["eval_score_GPT4.1"] - 1) / 4
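
The two steps above map the judge's 1 to 5 score onto the range 0 to 1, and treat any value that is not a string as the lowest score. A small worked sketch on hypothetical raw scores (`normalize_score` is an illustrative name):

```python
from typing import Any


def normalize_score(raw: Any) -> float:
    """Map a 1-5 judge score to [0, 1]; values that are not strings count as 1."""
    score = int(raw) if isinstance(raw, str) else 1
    return (score - 1) / 4


raw_scores = ["5", "3", "1", None]  # hypothetical judge outputs
normalized = [normalize_score(s) for s in raw_scores]
print(normalized)  # [1.0, 0.5, 0.0, 0.0]
print(sum(normalized) / len(normalized))  # 0.375
```

With this scaling, an average of 0.75 corresponds to a mean raw score of 4 out of 5.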

In [None]:
average_scores = result.groupby("settings")["eval_score_GPT4.1"].mean()
average_scores.sort_values()

| settings | eval_score_GPT4.1 |
|---|---|
| ./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta.json | 0.715385 |
| ./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta.json | 0.757692 |


## Example results

In [None]:
import plotly.express as px

scores = datasets.load_dataset("m-ric/rag_scores_cookbook", split="train")
scores = pd.Series(scores["score"], index=scores["settings"])

In [None]:
fig = px.bar(
    scores,
    color=scores,
    labels={
        "value": "Accuracy",
        "settings": "Configuration",
    },
    color_continuous_scale="bluered",
)
fig.update_layout(
    width=1000,
    height=600,
    barmode="group",
    yaxis_range=[0, 100],
    title="<b>Accuracy of different RAG configurations</b>",
    xaxis_title="RAG settings",
    font=dict(size=15),
)
fig.layout.yaxis.ticksuffix = "%"
fig.update_coloraxes(showscale=False)
fig.update_traces(texttemplate="%{y:.1f}", textposition="outside")
fig.show()