# Preparation of custom RAG evaluation dataset

## Import LLM client

We use an LLM hosted on Nebius AI-Studio:

In [1]:
import os
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

load_dotenv()

TEMPERATURE=0.1
TOP_P=0.95
MAX_TOKENS=2048

In [2]:
MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct"

llm_client = ChatOpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NB_AI_STUDIO_KEY"),
    model=MODEL,
    temperature=TEMPERATURE,
    top_p=TOP_P,
    max_tokens=MAX_TOKENS,
)

llm_client.invoke("What is the capital of France").content

'The capital of France is Paris.'

We will use an OSS LLM hosted on Nebius AI-Studio as a judge to power `mlflow.metrics.genai` metrics. For this we need a local proxy server that handles the API calls to the LLM. Run the following in your terminal:

```bash
mlflow deployments start-server --config-path mlflow_config/model_config.yaml
```

Import the deployment client to use as the evaluator LLM and check that it works:

In [3]:
from mlflow.deployments import get_deploy_client

client = get_deploy_client("http://localhost:5000")

data = {
    "messages": [
        {
            "role": "user",
            "content": "What would happen if an asteroid the size of a basketball encountered the Earth traveling at 0.5c? Please provide your answer in .rst format for the purposes of documentation.",
        }
    ],
    "temperature": 0.5,
    "max_tokens": 100,
    "n": 1,
    "frequency_penalty": 0.2,
    "presence_penalty": 0.2,
}
print(client.predict(endpoint="ai-studio-chat", inputs=data)["choices"][0]["message"]["content"])

.. _asteroid-impact:

Asteroid Impact at 0.5c

Introduction
------------

This document discusses the hypothetical scenario of an asteroid, approximately the size of a basketball (with a diameter of about 0.24 meters), encountering Earth while traveling at 0.5 times the speed of light (0.5c, approximately 150,000,000 meters per second).

Impact Energy
-------------

The kinetic energy of


## Prepare evaluation dataset

### Generate QA pairs

To evaluate a RAG pipeline, we need a reference dataset that includes golden QA pairs together with their reference context.


To generate QA pairs we are going to use `jamescalam/ai-arxiv-chunked`, which contains chunks of NLP-related research papers. We are not going to use the chunks themselves, only the paper summaries provided with them.
There are over 400 unique summaries, and we sample 200 of them to use as source context for QA pair generation:

In [4]:
from datasets import load_dataset
import random

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
summaries = list(set(data["summary"]))
sampled_summaries = random.sample(summaries, 200)
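The sample above changes on every run: `random.sample` is unseeded, and the order of `list(set(...))` for strings is not stable between Python processes. If a reproducible selection matters, one option is to sort the unique summaries and sample with a seeded generator. The following is a minimal sketch; the helper name `sample_reproducibly`, the seed value, and the toy data are assumptions made for illustration.

```python
import random

def sample_reproducibly(items, k, seed=42):
    """Return k unique items, chosen the same way on every run."""
    # Sorting fixes the order, which a plain set does not guarantee for strings.
    unique_items = sorted(set(items))
    # A dedicated Random instance avoids touching the global random state.
    return random.Random(seed).sample(unique_items, k)

# Toy stand-in for data["summary"], which repeats each summary once per chunk.
toy_summaries = ["summary A", "summary B", "summary A", "summary C", "summary B"]
picked = sample_reproducibly(toy_summaries, k=2)
```

With the real dataset this would be called as `sample_reproducibly(data["summary"], 200)`.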



Let's make batches of 4 to speed up the generation process:

In [5]:
BATCH_SIZE = 4
batches = [sampled_summaries[i * BATCH_SIZE:(i+1) * BATCH_SIZE] for i in range(len(sampled_summaries)//BATCH_SIZE)]
len(batches)

50
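Note that the comprehension above relies on floor division, so any items left over after the last full batch would be dropped. With 200 summaries and a batch size of 4 nothing is lost, but for other sizes a step-based slice keeps the remainder. A small sketch, where `make_batches` is a hypothetical helper name:

```python
def make_batches(items, batch_size):
    """Split items into consecutive batches; the last one may be shorter."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

full = make_batches(list(range(200)), 4)    # divides evenly into 50 batches
ragged = make_batches(list(range(10)), 4)   # final batch holds the 2 leftovers
```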

We need a helper function that processes batches of summaries and generates QA pairs. It is useful to wrap this function with `retry` in case we run into a rate limit, for example:

In [6]:
from langchain_core.prompts import ChatPromptTemplate
from tenacity import retry, wait_random_exponential, stop_after_attempt 

SYSTEM_PROMPT = """
Your task is to write a standalone factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
Your answer to the factoid question should be detailed and relying upon given context and be accessible for a wide variety of users.
This means that your standalone factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)"""

USER_PROMPT = """Now here is the context.

Context: {context}\n
Output:::"""

gen_prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", USER_PROMPT),
    ]
)

gen_prompt_template.batch(inputs=[{"context": "context"},{"context": "context"}])

@retry(wait=wait_random_exponential(multiplier=4, max=120), stop=stop_after_attempt(3))
def process_batch_messages(llm, batch):
    return llm.batch(gen_prompt_template.batch(inputs=[{"context": d} for d in batch]))
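To make the retry behaviour more concrete, here is a rough standard-library approximation of "retry up to 3 times with a randomized, exponentially growing wait". It is only a sketch of the idea, not how `tenacity` is implemented: the exact wait formula, the helper name `call_with_backoff`, and the `flaky` function are assumptions made for illustration.

```python
import random
import time

def call_with_backoff(func, attempts=3, multiplier=4.0, max_wait=120.0, sleep=time.sleep):
    """Call func(), retrying on any exception with a randomized, growing wait."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles on each attempt and is capped at max_wait.
            upper = min(max_wait, multiplier * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))

# Hypothetical flaky call: fails twice (e.g. rate limited), then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limit")
    return "ok"

waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
```

Randomizing the wait spreads out retries from concurrent callers, which is why it tends to work well against rate limits.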

Some other helper functions which may be useful later:

In [7]:
import hashlib

def hash_string(input: str) -> str:
    h = hashlib.new('sha256')
    h.update(input.encode())
    
    return h.hexdigest()

from datetime import datetime

def get_timestamp() -> str:
    timestamp = datetime.now()
    return timestamp.strftime("%Y-%m-%d_%H-%M-%S")
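The hash is used as a stable identifier: the same document text always maps to the same ID, and any change to the text produces a different one. A short illustration, using a hypothetical `content_id` helper that is assumed to be equivalent to `hash_string` above:

```python
import hashlib

def content_id(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

id_a = content_id("We present BART, a denoising autoencoder.")
id_b = content_id("We present BART, a denoising autoencoder.")  # same text
id_c = content_id("We present BART, a denoising autoencoder!")  # one character differs
```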

Here's the code which processes the batches and parses the output into QA pairs:

In [8]:
import json
import mlflow
from tqdm import tqdm


formatted_timestamp = get_timestamp()
mlflow.set_experiment(f"Data generation for RAG eval {formatted_timestamp}")
mlflow.langchain.autolog()

file_path = f'syntetic_data/generated_items_{formatted_timestamp}.jsonl'

outputs = []
for batch in tqdm(batches[:]):
    responses = process_batch_messages(llm_client, batch)
    for response, doc in zip(responses, batch):
        output_QA_couple = response.content
        try:
            question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0]
            answer = output_QA_couple.split("Answer: ")[-1]
            item =  {
                "document": {
                    "content": doc,
                    "collection_id": str(hash_string(doc)),
                },
                "question": question,
                "answer": answer,
            }
            outputs.append(item)
            json_line = json.dumps(item)
            with open(file_path, 'a') as file:
                file.write(json_line + '\n')
        except Exception as e:
            print(e.__str__())
            continue

2024/11/12 09:35:26 INFO mlflow.tracking.fluent: Experiment with name 'Data generation for RAG eval 2024-11-12_09-35-26' does not exist. Creating a new experiment.
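Two details of the parsing step are worth noting. `str.split` does not raise when a marker is missing, so a response that ignores the requested format would still be stored, with the whole text as both question and answer. The extracted question also keeps the newline that precedes `Answer:`. A stricter parser could look like the sketch below; `parse_qa` and the marker constants are hypothetical names, and returning `None` for malformed output is a design assumption.

```python
from typing import Optional, Tuple

QUESTION_MARKER = "Factoid question:"
ANSWER_MARKER = "Answer:"

def parse_qa(text: str) -> Optional[Tuple[str, str]]:
    """Return (question, answer) with whitespace stripped, or None if malformed."""
    q_pos = text.find(QUESTION_MARKER)
    if q_pos == -1:
        return None
    # Look for the answer marker only after the question marker.
    a_pos = text.find(ANSWER_MARKER, q_pos + len(QUESTION_MARKER))
    if a_pos == -1:
        return None
    question = text[q_pos + len(QUESTION_MARKER):a_pos].strip()
    answer = text[a_pos + len(ANSWER_MARKER):].strip()
    if not question or not answer:
        return None
    return question, answer

good = parse_qa("Factoid question: What is BART?\n\nAnswer: A denoising autoencoder.")
bad = parse_qa("Sorry, I cannot help with that.")
```

In the loop above, a `None` result would simply be skipped instead of appended to `outputs`.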


### (Optional) Evaluate the generated dataset with `mlflow.metrics.genai` metrics

Let's use our `mlflow` deployment as an LLM-as-a-judge to power the evaluation metrics:

In [9]:
from mlflow.deployments import set_deployments_target

set_deployments_target("http://localhost:5000")

Let's use a built-in `mlflow` metric for relevance. We only need to provide the name of the deployment endpoint specified in `model_config.yaml`:

In [10]:
relevance_metric = mlflow.metrics.genai.relevance(
    model = "endpoints:/ai-studio-chat"
)

We also need two additional metrics, one for groundedness and one for how standalone the question is. For this we can use the generic metric constructor `make_genai_metric()` provided by `mlflow`, which takes a metric definition and a corresponding grading prompt:

In [11]:
groundedness_metric = mlflow.metrics.genai.make_genai_metric(
    name="groundedness",
    definition=(
        "Groundedness refers to how well the question is formulated to stay within the provided context, ensuring clarity "
        "and relevance to the task at hand. A grounded question should directly address the instruction, avoid ambiguity, "
        "and be clearly rooted in the context provided."
    ),
    grading_prompt=(
        "Groundedness: Evaluate if the question is formulated clearly, without ambiguity, and remains grounded in the provided Context. "
        "Below are the details for different scores: "
        "- Score 1: The question is vague or unclear and does not engage with the provided Context, making it impossible to discern "
        "how it relates to the question or instruction."
        "- Score 2: The question partially engages with the Context but includes significant ambiguities or unclear portions, often "
        "straying from the context or not fully addressing the question."
        "- Score 3: The question generally addresses the question using the provided Context but has occasional ambiguities or is "
        "unclear in certain aspects, making parts of the question less grounded."
        "- Score 4: The question is mostly clear and unambiguous, providing a grounded question based on the Context. However, there "
        "are minor instances where clarity could be improved or where the grounding in the context is weaker."
        "- Score 5: The question is entirely clear, unambiguous, and fully grounded in the provided Context. It aligns with context "
        "precisely with no unnecessary or unclear content."
    ),
    model="endpoints:/ai-studio-chat",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

standalone_metric = mlflow.metrics.genai.make_genai_metric(
    name="standalone",
    definition=(
        "Standalone refers to the degree to which the question can be understood and answered independently of any external "
        "context or information. A fully standalone question should be self-contained, meaning it does not rely on additional "
        "documents, previous interactions, or external scenarios to be complete or comprehensible."
    ),
    grading_prompt=(
        "Standalone: Evaluate if the question can be understood and answered without looking at the Context in the Instruction. "
        "Below are the details for different scores: "
        "- Score 1: The question heavily depends on external context or previous information to be understood. It refers to specific "
        "content (e.g., 'in the context' or 'in the document') and is incomplete on its own."
        "- Score 2: The question is mostly dependent on external information. While parts of the question may be clear, it still "
        "requires knowledge of additional context or documents to be fully understood."
        "- Score 3: The question is partially understandable on its own but still relies on some implicit context or background knowledge "
        "to be fully clear. It is incomplete without certain pieces of information."
        "- Score 4: The question is largely standalone and makes sense without needing much additional context. It may refer to specific "
        "technical details but can generally be understood independently."
        "- Score 5: The question is entirely self-contained and makes complete sense on its own. Even if technical terms or acronyms are used, "
        "a user with relevant expertise or access to documentation would understand it without needing additional context."
    ),
    model="endpoints:/ai-studio-chat",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)
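One thing to be aware of in the grading prompts above: adjacent Python string literals are concatenated without any separator, so each `- Score N:` entry follows directly after the period of the previous one. Whether this affects the judge's scores is not tested here. If one rubric level per line is preferred, the prompt can be assembled from a mapping. This is a sketch with a hypothetical helper `build_grading_prompt` and shortened, illustrative level descriptions.

```python
def build_grading_prompt(criterion: str, instruction: str, levels: dict) -> str:
    """Join a rubric into one prompt string, one score level per line."""
    lines = [f"{criterion}: {instruction}", "Below are the details for different scores:"]
    for score in sorted(levels):
        lines.append(f"- Score {score}: {levels[score]}")
    return "\n".join(lines)

# Shortened, illustrative level descriptions (not the full rubric above).
prompt = build_grading_prompt(
    "Standalone",
    "Evaluate if the question can be understood without the Context.",
    {1: "Depends heavily on external context.", 5: "Entirely self-contained."},
)
```

The resulting string could then be passed as `grading_prompt=` to `make_genai_metric()`.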

To evaluate the generated QA dataset, we use `mlflow.evaluate()` with a static dataset:

In [12]:
import pandas as pd

# simplify the QA generation instruction prompt for evaluation
instruction_prompt = """
Your task is to write a standalone factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.

Context: {context}"""

eval_data = pd.DataFrame(
    {   
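        # The document content is already a single string, so it is inserted as-is
        # (joining a string with ' ' would put a space between every character).
        "inputs": [instruction_prompt.format(context=o["document"]["content"]) for o in outputs],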
        "predictions": [f"""Question: {o["question"]}\n Answer: {o["answer"]}""" for o in outputs],
        "question": [o["question"] for o in outputs],
        "answer": [o["answer"] for o in outputs],
        "context": [o["document"]["content"] for o in outputs],
        "document": [o["document"] for o in outputs],
    }
)

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        predictions="predictions",
        extra_metrics=[
            groundedness_metric, 
            relevance_metric, 
            standalone_metric
        ],
    )
    
    print(f"Aggregated evaluation results: \n{results.metrics}")

2024/11/12 09:40:56 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
100%|██████████| 1/1 [00:01<00:00,  1.01s/it]
100%|██████████| 1/1 [00:01<00:00,  1.13s/it]
100%|██████████| 1/1 [00:01<00:00,  1.06s/it]
100%|██████████| 200/200 [00:39<00:00,  5.10it/s]
100%|██████████| 200/200 [00:47<00:00,  4.24it/s]
100%|██████████| 200/200 [00:50<00:00,  4.00it/s]


Aggregated evaluation results: 
{'groundedness/v1/mean': 5.0, 'groundedness/v1/variance': 0.0, 'relevance/v1/mean': 4.905, 'relevance/v1/variance': 0.085975, 'relevance/v1/p90': 5.0, 'standalone/v1/mean': 4.215, 'standalone/v1/variance': 0.16877499999999998}


2024/11/12 09:43:19 INFO mlflow.tracking._tracking_service.client: 🏃 View run secretive-ray-53 at: https://tracking.mlflow-e00rhqs1bwevnqy5wj.backbone-e00ffdgj3ybad7mxrx.msp.eu-north1.nebius.cloud/#/experiments/5/runs/12a4e7b4760047f99164848e7582f64d.
2024/11/12 09:43:19 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://tracking.mlflow-e00rhqs1bwevnqy5wj.backbone-e00ffdgj3ybad7mxrx.msp.eu-north1.nebius.cloud/#/experiments/5.
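The aggregated numbers are consistent with every relevance and standalone score being either 4 or 5. As a quick sanity check, the sketch below recomputes the mean and variance for score distributions that would reproduce the reported values. Both the distributions (43 and 181 scores of 5 out of 200) and the use of population variance are assumptions inferred from the numbers, not taken from the `mlflow` documentation.

```python
def mean_and_variance(scores):
    """Population mean and variance of a list of numeric scores."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    return mean, variance

# Assumed score distributions that would reproduce the reported aggregates.
standalone_scores = [5] * 43 + [4] * 157
relevance_scores = [5] * 181 + [4] * 19

standalone_mean, standalone_var = mean_and_variance(standalone_scores)
relevance_mean, relevance_var = mean_and_variance(relevance_scores)
```

Under these assumptions the results match `standalone/v1/mean`, `standalone/v1/variance`, `relevance/v1/mean`, and `relevance/v1/variance` up to floating-point rounding.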


Let's take a look at the evaluation results table:

In [13]:
eval_results_table = results.tables["eval_results_table"]
eval_results_table

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  3.26it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  3.63it/s]


Unnamed: 0,inputs,question,answer,context,document,outputs,groundedness/v1/score,groundedness/v1/justification,relevance/v1/score,relevance/v1/justification,standalone/v1/score,standalone/v1/justification
0,\nYour task is to write a standalone factoid q...,What is the main goal of interpretable machine...,The main goal of interpretable machine learnin...,"As machine learning systems become ubiquitous,...",{'content': 'As machine learning systems becom...,Question: What is the main goal of interpretab...,5,"The question is entirely clear, unambiguous, a...",4,The output answers the question and is consist...,4,"The question ""What is the main goal of interpr..."
1,\nYour task is to write a standalone factoid q...,What F1 score was achieved on the web portion ...,The proposed solution achieved a score of 71.3...,We consider the problem of adapting neural par...,{'content': 'We consider the problem of adapti...,Question: What F1 score was achieved on the we...,5,"The question is entirely clear, unambiguous, a...",5,The output comprehensively answers the questio...,4,The question is largely standalone and makes s...
2,\nYour task is to write a standalone factoid q...,What is the time and memory complexity reducti...,The sparse factorizations of the attention mat...,"Transformers are powerful sequence models, but...",{'content': 'Transformers are powerful sequenc...,Question: What is the time and memory complexi...,5,"The question is entirely clear, unambiguous, a...",5,The output comprehensively answers the questio...,4,The question is largely standalone and makes s...
3,\nYour task is to write a standalone factoid q...,What is the human performance accuracy on the ...,The human performance accuracy on the Commonse...,"When answering a question, people often draw u...","{'content': 'When answering a question, people...",Question: What is the human performance accura...,5,"The question is entirely clear, unambiguous, a...",5,The output answers the question comprehensivel...,5,"The question ""What is the human performance ac..."
4,\nYour task is to write a standalone factoid q...,What is a universal adversarial perturbation (...,A universal adversarial perturbation (UAP) is ...,The intriguing phenomenon of adversarial examp...,{'content': 'The intriguing phenomenon of adve...,Question: What is a universal adversarial pert...,5,"The question is entirely clear, unambiguous, a...",5,The output comprehensively answers the questio...,4,The question is largely standalone and makes s...
...,...,...,...,...,...,...,...,...,...,...,...,...
195,\nYour task is to write a standalone factoid q...,At what model size does the capability for mor...,The capability for moral self-correction in la...,We test the hypothesis that language models tr...,{'content': 'We test the hypothesis that langu...,Question: At what model size does the capabili...,5,"The question is entirely clear, unambiguous, a...",5,The output comprehensively answers the questio...,4,The question is largely standalone and makes s...
196,\nYour task is to write a standalone factoid q...,What percentage of sentences in lower-resource...,A significant fraction of lower-resource corpo...,With the success of large-scale pre-training a...,{'content': 'With the success of large-scale p...,Question: What percentage of sentences in lowe...,5,"The question is entirely clear, unambiguous, a...",4,The output directly answers the question about...,4,The question is largely standalone and makes s...
197,\nYour task is to write a standalone factoid q...,How do data-driven models compare to rule-base...,Data-driven models lag behind rule-based or co...,How should conversational agents respond to ve...,{'content': 'How should conversational agents ...,Question: How do data-driven models compare to...,5,"The question is entirely clear, unambiguous, a...",5,The output comprehensively answers the questio...,4,The question is largely standalone and makes s...
198,\nYour task is to write a standalone factoid q...,What is BART in natural language processing?\n,BART is a denoising autoencoder for pretrainin...,"We present BART, a denoising autoencoder for p...","{'content': 'We present BART, a denoising auto...",Question: What is BART in natural language pro...,5,"The question is entirely clear, unambiguous, a...",5,The output comprehensively answers the questio...,5,"The question ""What is BART in natural language..."


The evaluation adds a score and a justification for each metric to every QA pair. Let's compare the dataset before and after filtering on those scores:

In [14]:
import pandas as pd
import datasets

pd.set_option("display.max_colwidth", None)

print("Evaluation dataset before filtering:")
display(
    eval_results_table[
        [
            "question",
            "answer",
            "groundedness/v1/score",
            "relevance/v1/score",
            "standalone/v1/score",
        ]
    ]
)
generated_questions = eval_results_table.loc[
    (eval_results_table["groundedness/v1/score"] >= 4)
    & (eval_results_table["relevance/v1/score"] >= 4)
    & (eval_results_table["standalone/v1/score"] >= 5)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness/v1/score",
            "relevance/v1/score",
            "standalone/v1/score",
        ]
    ]
)

eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)

Evaluation dataset before filtering:


Unnamed: 0,question,answer,groundedness/v1/score,relevance/v1/score,standalone/v1/score
0,What is the main goal of interpretable machine learning systems?\n,"The main goal of interpretable machine learning systems is to provide explanations for their outputs, which can be used to qualitatively assess other criteria such as safety or non-discrimination.",5,4,4
1,What F1 score was achieved on the web portion of TriviaQA using the proposed solution for adapting neural paragraph-level question answering models?\n\n,"The proposed solution achieved a score of 71.3 F1 on the web portion of TriviaQA, significantly improving upon the previous best system's score of 56.7 F1.",5,5,4
2,What is the time and memory complexity reduction achieved by sparse factorizations of the attention matrix in Transformers?\n,The sparse factorizations of the attention matrix reduce the time and memory complexity from quadratic growth with the sequence length to $O(n \sqrt{n})$.,5,5,4
3,What is the human performance accuracy on the CommonsenseQA dataset?\n,The human performance accuracy on the CommonsenseQA dataset is 89%.,5,5,5
4,What is a universal adversarial perturbation (UAP) in machine learning?\n,"A universal adversarial perturbation (UAP) is a single perturbation that can fool a target deep neural network (DNN) for most images, meaning it is a single alteration that can cause the DNN to misclassify a wide range of images.",5,5,4
...,...,...,...,...,...
195,At what model size does the capability for moral self-correction emerge in language models trained with reinforcement learning from human feedback?\n\n,"The capability for moral self-correction in language models trained with reinforcement learning from human feedback emerges at 22B model parameters. This suggests that at this level of scale, language models develop the necessary capabilities to follow instructions and learn complex normative concepts of harm, enabling them to avoid producing harmful outputs when instructed to do so.",5,5,4
196,What percentage of sentences in lower-resource corpora are of acceptable quality?\n,A significant fraction of lower-resource corpora contains less than 50% sentences of acceptable quality.,5,4,4
197,How do data-driven models compare to rule-based or commercial systems in responding to verbal abuse?\n,Data-driven models lag behind rule-based or commercial systems in terms of their perceived appropriateness when responding to verbal abuse.,5,5,4
198,What is BART in natural language processing?\n,"BART is a denoising autoencoder for pretraining sequence-to-sequence models, trained by corrupting text with an arbitrary noising function and learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture and is effective for both text generation and comprehension tasks.",5,5,5


Final evaluation dataset:


Unnamed: 0,question,answer,groundedness/v1/score,relevance/v1/score,standalone/v1/score
3,What is the human performance accuracy on the CommonsenseQA dataset?\n,The human performance accuracy on the CommonsenseQA dataset is 89%.,5,5,5
8,What is the name of the dataset created to assess the effectiveness of controllable text generation algorithms at preventing toxic language generation?\n\n,"The dataset is called RealToxicityPrompts, which consists of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier.",5,5,5
11,How many languages are there in the world?\n,"There are over 7000 languages in the world, but only a small number of them are represented in language technologies and applications.",5,4,5
13,How many teams submitted system description papers in OffensEval 2020?\n,"In OffensEval 2020, a total of 70 teams submitted system description papers.",5,5,5
16,What is lifelong learning in humans and animals?\n,"Lifelong learning is the ability to continually acquire, fine-tune, and transfer knowledge and skills throughout one's lifespan, mediated by a rich set of neurocognitive mechanisms that contribute to the development and specialization of sensorimotor skills, as well as long-term memory consolidation and retrieval.",5,5,5
21,What is the approximate number of question-answer-evidence triples in the TriviaQA dataset?\n,"The TriviaQA dataset contains over 650,000 question-answer-evidence triples.",5,5,5
25,What is the top-1 accuracy achieved by NASNet on ImageNet?\n,"NASNet achieves a state-of-the-art accuracy of 82.7% top-1 on ImageNet, which is 1.2% better than the best human-invented architectures.",5,5,5
26,How many tasks does the Beyond the Imitation Game benchmark (BIG-bench) currently consist of?\n\n,"The Beyond the Imitation Game benchmark (BIG-bench) currently consists of 204 tasks, which were contributed by 450 authors across 132 institutions. These tasks cover a diverse range of topics, including linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and more.",5,5,5
28,"What year was the Alexa Prize launched to tackle the problem of achieving natural, sustained, coherent and engaging open-domain dialogs?\n\n","The Alexa Prize was launched in 2016 to tackle the problem of achieving natural, sustained, coherent and engaging open-domain dialogs.",5,5,5
30,What is ExMix in natural language processing?\n,"ExMix, short for Extreme Mixture, is a massive collection of 107 supervised natural language processing (NLP) tasks across diverse domains and task-families. It was created to study the effect of scaling up the number of tasks during pre-training in NLP.",5,5,5


In [15]:
len(eval_dataset)

43

Only 43 of the 200 generated QA pairs (about one in five) pass the filter, which requires groundedness and relevance scores of at least 4 and a standalone score of 5. Proceed to save the filtered dataset:

In [None]:
eval_dataset.save_to_disk("NLP_eval_dataset_demo")

Saving the dataset (1/1 shards): 100%|██████████| 43/43 [00:00<00:00, 9176.97 examples/s] 
