# RAG Evaluation
_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_

This notebook demonstrates how you can evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using LLM-as-a-judge to compute the system's accuracy.

For an introduction to RAG, you can check [this other cookbook](rag_zephyr_langchain)!

RAG systems are complex: here is a RAG diagram, with all the possibilities for system enhancement marked in blue:

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_workflow.png" height="700">

Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system's performance!
So let's see how to evaluate our RAG system.

### Evaluating RAG performance

Since there are so many moving parts to tune with a big impact on performance, benchmarking the RAG system is crucial.

For our evaluation pipeline, we will need:
1. An evaluation dataset with question - answer couples (QA couples)
2. An evaluator to compute the accuracy of our system on the above evaluation dataset.

➡️ It turns out, we can use LLMs to help us all along the way!
1. The evaluation dataset will be synthetically generated by an LLM 🤖, and low-quality questions will be filtered out by other LLMs 🤖
2. An [LLM-as-a-judge](https://huggingface.co/papers/2306.05685) agent 🤖 will then perform the evaluation on this synthetic dataset.
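To make the target metric concrete, here is a minimal sketch of how accuracy could be computed once a judge has produced one verdict per QA couple. The `accuracy` helper and the boolean verdict format are assumptions for illustration; the judge itself is not shown.

```python
from typing import Iterable


def accuracy(verdicts: Iterable[bool]) -> float:
    """Fraction of answers that the judge marked as correct."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


# Hypothetical judge verdicts for four QA couples
print(accuracy([True, False, True, True]))  # 0.75
```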

__Let's dig into it and start building our evaluation pipeline!__ First, we install the required dependencies.

In [1]:
# !pip install -q torch transformers langchain langchain-openai sentence-transformers tqdm openpyxl openai pandas datasets langchain-community ragatouille

In [None]:
import os
import pickle

import datasets
import pandas as pd
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Configure API key
dashscope_api_key = "<paste_your_api_key_here>"
dashscope_base_url = "<paste_your_base_url_here>"


pd.set_option("display.max_colwidth", None)
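Pasting the key directly into the notebook works, but it is easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads the two settings from environment variables. The variable names `DASHSCOPE_API_KEY` and `DASHSCOPE_BASE_URL` and the `load_api_settings` helper are assumptions chosen for this example, not names required by any library.

```python
import os


def load_api_settings(env=os.environ):
    """Read the endpoint credentials from environment variables.

    The variable names are hypothetical; use whatever your setup defines.
    """
    api_key = env.get("DASHSCOPE_API_KEY")
    base_url = env.get("DASHSCOPE_BASE_URL")
    if not api_key or not base_url:
        raise RuntimeError("Set DASHSCOPE_API_KEY and DASHSCOPE_BASE_URL first")
    return api_key, base_url
```

With both variables exported in your shell, `dashscope_api_key, dashscope_base_url = load_api_settings()` would replace the two hard-coded assignments.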

### Load your knowledge base

In [2]:
with open("./Data/all_splits.pkl", "rb") as f:
    docs_processed = pickle.load(f)
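The generation loop later in this notebook reads `page_content` and `metadata["source"]` from every document, so it can help to check those fields right after loading. The sketch below is one way to do that; `check_docs` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real documents.

```python
from types import SimpleNamespace


def check_docs(docs):
    """Return the indices of documents missing the fields used during generation."""
    bad = []
    for i, doc in enumerate(docs):
        text = getattr(doc, "page_content", None) or ""
        metadata = getattr(doc, "metadata", None) or {}
        if not text.strip() or "source" not in metadata:
            bad.append(i)
    return bad


# Stand-in documents for illustration only
sample_docs = [
    SimpleNamespace(page_content="some text", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="   ", metadata={"source": "b.pdf"}),
    SimpleNamespace(page_content="more text", metadata={}),
]
print(check_docs(sample_docs))  # [1, 2]
```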

# 1. Build a synthetic dataset for evaluation
We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base, and ask an LLM to generate questions based on these documents.

Then we set up other LLM agents to act as quality filters for the generated QA couples: each of them will act as the filter for a specific flaw.
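As a rough sketch of the filtering step, assume each critique agent writes an integer score into the QA couple under a `<criterion>_score` key. The criterion names, the 1-5 scale, and the threshold of 4 below are all assumptions for illustration; they are not defined in this section.

```python
def passes_filters(qa, min_score=4, criteria=("groundedness", "relevance", "standalone")):
    """Keep a QA couple only if every critique score reaches the threshold.

    A missing score counts as a failure (it defaults to 0).
    """
    return all(qa.get(f"{name}_score", 0) >= min_score for name in criteria)


# Hypothetical scored QA couples
scored = [
    {"question": "q1", "groundedness_score": 5, "relevance_score": 4, "standalone_score": 4},
    {"question": "q2", "groundedness_score": 5, "relevance_score": 2, "standalone_score": 5},
    {"question": "q3", "groundedness_score": 5, "relevance_score": 5},
]
kept = [qa["question"] for qa in scored if passes_filters(qa)]
print(kept)  # ['q1']
```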

### 1.1. Prepare source documents

In [4]:
prompt = ChatPromptTemplate.from_messages([
    ("system", 
      """ Your task is to write a factoid question in Arabic only and an answer given a context in Arabic only.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question in Arabic)
Answer: (your answer to the factoid question in Arabic)
"""),
    ("human", """Now here is the context.

Context: {context}\n
Output:::""")
])
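The prompt above asks for a fixed `Factoid question:` / `Answer:` layout, which makes the model output easy to parse. As a sketch of one way to do it, the hypothetical `parse_qa` helper below uses a regular expression and returns `None` when the layout is missing, so malformed generations can be skipped explicitly.

```python
import re

# Matches the "Factoid question: ... Answer: ..." layout requested in the prompt
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa(raw: str):
    """Return (question, answer), or None if the expected layout is absent."""
    match = QA_PATTERN.search(raw)
    if match is None:
        return None
    return match["question"].strip(), match["answer"].strip()


sample = "Output:::\nFactoid question: What is X?\nAnswer: Y\n"
print(parse_qa(sample))  # ('What is X?', 'Y')
```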



### 1.2. Setup agents for question generation

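[Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) is a strong choice for QA couple generation because it has excellent performance in leaderboards such as [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard). The cell below is configured for `qwen-plus` through `ChatOpenAI`; any chat model served behind an OpenAI-compatible endpoint can be swapped in.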

In [5]:
llm = ChatOpenAI(
    model_name="qwen-plus",  # alternatives: qwen-turbo, qwen-max, deepseek-r1, qwen3-8b
    openai_api_key=dashscope_api_key,
    openai_api_base=dashscope_base_url,
    temperature=0.4,
    # extra_body={"enable_thinking": False} 
)




Now let's generate our QA couples.
For a quick demo, 10 QA couples are enough; the run below uses `N_GENERATIONS = 469`.

For your own knowledge base, aim for at least ~100 test samples. Since the critique agents will later filter out around half of the generated couples, you should generate many more: upwards of 200 samples.
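The arithmetic behind that recommendation can be sketched as follows. `generations_needed` is a hypothetical helper, and the keep rate of 0.5 simply restates the "around half" estimate above.

```python
import math


def generations_needed(target_samples: int, keep_rate: float) -> int:
    """Number of QA couples to generate so that about target_samples survive filtering."""
    if not 0 < keep_rate <= 1:
        raise ValueError("keep_rate must be in (0, 1]")
    return math.ceil(target_samples / keep_rate)


print(generations_needed(100, 0.5))  # 200
```

If some API calls also fail, as in the run below, multiply the keep rate by the fraction of calls you expect to succeed.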

In [40]:
import random
from tqdm import tqdm

N_GENERATIONS = 469

print(f"Generating {N_GENERATIONS} QA couples...")

# Build the chain once; it is reused for every sampled context
chain = prompt | llm | StrOutputParser()

outputs = []
for sampled_context in tqdm(random.sample(docs_processed, N_GENERATIONS)):
    try:
        output_QA_couple = chain.invoke({"context": sampled_context.page_content})

        # Parse the "Factoid question: ... Answer: ..." layout requested in the prompt
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0].strip()
        answer = output_QA_couple.split("Answer: ")[-1].strip()

        # Reject overly long answers; the except branch below skips the sample
        assert len(answer) < 300, "Answer is too long"

        outputs.append({
            "context": sampled_context.page_content,
            "question": question,
            "answer": answer,
            "source_doc": sampled_context.metadata["source"],
        })
    except Exception as e:
        # API errors and failed checks both end up here; log and move on
        print(f"Error: {e}")
        continue

Generating 469 QA couples...


  1%|          | 5/469 [00:17<26:31,  3.43s/it]

Error: Error code: 400 - {'error': {'code': 'data_inspection_failed', 'param': None, 'message': 'Input data may contain inappropriate content.', 'type': 'data_inspection_failed'}, 'id': 'chatcmpl-1cae3525-9489-9656-ae5f-f080a6d2b294', 'request_id': '1cae3525-9489-9656-ae5f-f080a6d2b294'}

[... output truncated: the same `data_inspection_failed` error is logged for many more samples, and one sample is skipped with `Error: Answer is too long` ...]


100%|█████████▉| 468/469 [26:11<00:04,  4.25s/it]

Error: Error code: 400 - {'error': {'code': 'data_inspection_failed', 'param': None, 'message': 'Input data may contain inappropriate content.', 'type': 'data_inspection_failed'}, 'id': 'chatcmpl-8666d1b8-65e0-9546-9115-89124b48f6fb', 'request_id': '8666d1b8-65e0-9546-9115-89124b48f6fb'}


100%|██████████| 469/469 [26:14<00:00,  3.36s/it]

Error: Error code: 400 - {'error': {'code': 'data_inspection_failed', 'param': None, 'message': 'Input data may contain inappropriate content.', 'type': 'data_inspection_failed'}, 'id': 'chatcmpl-b077c03a-20ed-9128-9fd2-874e67d813b2', 'request_id': 'b077c03a-20ed-9128-9fd2-874e67d813b2'}





### 1.3. Setup critique agents

The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):
- **Groundedness:** can the question be answered from the given context?
- **Relevance:** is the question relevant to users? For instance, `"What is the date when transformers 4.29.1 was released?"` is not relevant for ML practitioners.

One last failure case we've noticed is when a question is tailored to the particular setting where it was generated, but is undecipherable by itself, like `"What is the name of the function used in this guide?"`.
We also build a critique agent for this criterion:
- **Stand-alone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be `What is the function used in this article?` for a question generated from a specific blog article.

We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.

💡 ___When asking the agents to output a score, we first ask them to produce their rationale. This helps us verify the scores, but most importantly, asking for the rationale first gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.___
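
To make the rationale-then-score format concrete, here is a minimal sketch of how a critique answer could be parsed once it comes back from the model. The helper `parse_critique` and the regular expression are illustrative assumptions, not part of this notebook; they only rely on the `Evaluation:` and `Total rating:` markers that the prompts below request.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern (an assumption, not from the notebook): capture the
# rationale after "Evaluation:" and the first digit 1-5 after "Total rating:".
RATING_PATTERN = re.compile(
    r"Evaluation:\s*(?P<rationale>.*?)\s*Total rating:\s*(?P<score>[1-5])",
    re.DOTALL,
)


def parse_critique(text: str) -> Optional[Tuple[str, int]]:
    """Return (rationale, score), or None if the answer is malformed."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return match.group("rationale"), int(match.group("score"))


sample = "Evaluation: The context states the answer explicitly.\nTotal rating: 5"
print(parse_critique(sample))
```

Returning `None` for a malformed answer, rather than raising, would let the caller decide whether to retry the critique or discard the question.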

We now build and run these critique agents.

In [32]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to a Muslim person.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [41]:
from langchain_core.prompts import ChatPromptTemplate

# Create one prompt template per critique criterion
groundedness_prompt = ChatPromptTemplate.from_template(question_groundedness_critique_prompt)
relevance_prompt = ChatPromptTemplate.from_template(question_relevance_critique_prompt)
standalone_prompt = ChatPromptTemplate.from_template(question_standalone_critique_prompt)

print("Generating critique for each QA couple...")

# Chain each prompt with the LLM defined earlier and a plain-string output parser
chain1 = groundedness_prompt | llm | StrOutputParser()
chain2 = relevance_prompt | llm | StrOutputParser()
chain3 = standalone_prompt | llm | StrOutputParser()

for output in tqdm(outputs):
    try:
        evaluations = {
            "groundedness": chain1.invoke({"context": output["context"], "question": output["question"]}),
            "relevance": chain2.invoke({"question": output["question"]}),
            "standalone": chain3.invoke({"question": output["question"]}),
        }
        for criterion, evaluation in evaluations.items():
            # Expected answer format: "Evaluation: <rationale> Total rating: <1-5>"
            score = int(evaluation.split("Total rating: ")[-1].strip())
            rationale = evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1]
            output.update({
                f"{criterion}_score": score,
                f"{criterion}_eval": rationale,
            })
    except Exception:
        # Skip QA couples whose critique call failed or whose answer could not be parsed;
        # any score not yet written for them stays missing and they are dropped by the filter below
        continue

Generating critique for each QA couple...


  0%|          | 0/382 [00:00<?, ?it/s]

100%|██████████| 382/382 [1:50:22<00:00, 17.34s/it]   


Now let us filter out bad questions based on our critique agent scores:

In [44]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 3)
    & (generated_questions["relevance_score"] >= 3)
    & (generated_questions["standalone_score"] >= 2)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

eval_dataset = datasets.Dataset.from_pandas(
    generated_questions, split="train", preserve_index=False
)

Evaluation dataset before filtering:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,من خلق الملائكة حسب ما ورد في النص؟\n,الله عز وجل,,,
1,ما هو المثل الذي استخدمه الرسول لوصف كيف سيهاجم الأعداء المسلمين؟\n,مثل الأكلة إلى قصعتها,5.0,3.0,4.0
2,ما الذي لا يعقل نسبته إلى الحجارة الصماء وفقًا للنص؟\n,القدرة على الخلق والرزق والإحياء والإماتة وإيصال النفع والضر إلى من تشاء.,5.0,5.0,1.0
3,ما هي الكلمات التي يجب أن يقولها المصلي عند القعدة حسب حديث أبوموسى الأشعري؟\n,التحيات الطيبات الصلوات لله؛ السلام عليك أيها النبي ورحمة الله وبركاته؛ السلام علينا وعلى عباد الله الصالحين؛ أشهد أن لا إله إلا الله وأشهد أن محمداً عبده ورسوله,5.0,5.0,2.0
4,ما هو الحكم الشرعي في ذبائح أهل الكتاب وفقاً للنص؟\n,الحل,5.0,5.0,1.0
...,...,...,...,...,...
377,من الذي جاء إلى النبي محمد مع عظم رميم في يده؟\n,أبو بن خلف أو العاص بن واثل,3.0,4.0,4.0
378,ماذا كان رد فعل النبي عندما ذكرت بعض نسائه الكنيسة التي رأينها بأرض الحبشة؟\n,"رفع رأسه وقال: ""أولئك إذا مات منهم الرجل الصالح بنوا على قبره مسجداً؛ ثم صوروا فيه تلك الصورة أولئك شرار الخلق عند الله""",5.0,4.0,1.0
379,كم عدد المرات التي ذكر فيها النبي أن الله سيخرجه من النار من قال لا إله إلا الله؟\n,مرة واحدة,2.0,5.0,2.0
380,ما هي الكلمات التي كان الرسول يعلّمها في التشهد كما يعلّم السورة من القرآن؟\n,"""التحيات المباركات الصلوات الطيبات لله؛ السلام عليك أيها النبي ورحمة الله وبركاته.؛ السلام علينا وعلى عباد الله الصالحين؛ أشهد أن لا إله إلا الله وأشهد أن محمداً رسول الله""",5.0,5.0,5.0


Final evaluation dataset:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
1,ما هو المثل الذي استخدمه الرسول لوصف كيف سيهاجم الأعداء المسلمين؟\n,مثل الأكلة إلى قصعتها,5.0,3.0,4.0
3,ما هي الكلمات التي يجب أن يقولها المصلي عند القعدة حسب حديث أبوموسى الأشعري؟\n,التحيات الطيبات الصلوات لله؛ السلام عليك أيها النبي ورحمة الله وبركاته؛ السلام علينا وعلى عباد الله الصالحين؛ أشهد أن لا إله إلا الله وأشهد أن محمداً عبده ورسوله,5.0,5.0,2.0
5,ما هو الحكم الشرعي إذا زوج الرجل ابنته وهي كارهة؟\n,الزواج مردود.,5.0,5.0,5.0
6,كم مرة تكون صلاة الجماعة أفضل من صلاة الفذ؟\n,سبع وعشرين درجة,5.0,4.0,5.0
7,ما هي العبادة التي جاء بها جميع الأنبياء؟\n,عبادة الله وحده,5.0,5.0,5.0
...,...,...,...,...,...
372,ما هي الحالات الثلاث التي يحل فيها دم المسلم؟\n,النفس، والثيب الزاني، والتارك لدينه المفارق للجماعة,5.0,5.0,5.0
373,ما هو المقصود بذات عرق في الحديث؟\n,المقصود بذات عرق هو الجبل الصغير الموجود فيها.,5.0,5.0,2.0
375,"من الذي روى الحديث ""أنا أغنى الشركاء عن الشرك من عمل عملا أشرك فيه معي غيري تركته وشركه""؟\n",مسلم,5.0,5.0,5.0
377,من الذي جاء إلى النبي محمد مع عظم رميم في يده؟\n,أبو بن خلف أو العاص بن واثل,3.0,4.0,4.0
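
One detail worth noting in the tables above: row 0 has empty scores, presumably because its critique call failed or its answer could not be parsed. In pandas, a comparison such as `NaN >= 3` evaluates to `False`, so such rows are dropped by the filter without any warning. The sketch below reproduces the same thresholds in plain Python on a toy list and counts the unscored items explicitly; `toy_outputs`, `MIN_SCORES` and `passes_filters` are illustrative names and made-up values, not part of the notebook.

```python
# Toy stand-in for the notebook's `outputs` list (values are made up).
toy_outputs = [
    {"question": "q0"},  # critique failed: no score keys were written
    {"question": "q1", "groundedness_score": 5, "relevance_score": 3, "standalone_score": 4},
    {"question": "q2", "groundedness_score": 5, "relevance_score": 5, "standalone_score": 1},
    {"question": "q3", "groundedness_score": 2, "relevance_score": 5, "standalone_score": 2},
]

# Same minimum scores as the pandas filter in the cell above.
MIN_SCORES = {
    "groundedness_score": 3,
    "relevance_score": 3,
    "standalone_score": 2,
}


def passes_filters(item: dict) -> bool:
    """Keep an item only if every score exists and reaches its minimum."""
    return all(
        item.get(key) is not None and item[key] >= minimum
        for key, minimum in MIN_SCORES.items()
    )


unscored = [item["question"] for item in toy_outputs if any(key not in item for key in MIN_SCORES)]
kept = [item["question"] for item in toy_outputs if passes_filters(item)]

print(f"Unscored: {unscored}")
print(f"Kept: {kept}")
```

Tracking the unscored items separately makes it possible to tell questions rejected by the critique agents apart from questions that were never rated at all.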


Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.

We have generated only a few QA couples here to reduce time and cost. But let's kickstart the next part by loading a pre-generated dataset: