https://huggingface.co/learn/cookbook/en/rag_evaluation

TODO: 
- [ ] Revise prompts for correct contextuality
- [ ] Generate full dataset and publish to LangSmith
- [ ] Implement LangSmith [LLM-as-judge](https://docs.smith.langchain.com/old/evaluation)

In [77]:
import random
from itertools import chain, islice

from tqdm import tqdm
from langchain_community.llms import Ollama
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain_community.document_loaders.directory import DirectoryLoader
from langchain_community.document_loaders.markdown import UnstructuredMarkdownLoader

In [78]:
N_GENERATIONS = 48  # Number of QA couples to generate; kept small for cost and time considerations
MAX_CONCURRENCY = 6
MAX_ANSWER_LENGTH = 300

In [79]:
def load_docs():
    loader = DirectoryLoader(
        ".content", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader
    )
    return loader.load()


def chunk_docs(raw_documents):
    text_splitter = RecursiveCharacterTextSplitter.from_language(
        Language.MARKDOWN,
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len,
    )

    return text_splitter.split_documents(raw_documents)

In [80]:
llm = Ollama(model="llama3", temperature=0)

In [81]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""
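
The prompt above asks the model to answer in a fixed `Factoid question: ... Answer: ...` layout. As a minimal sketch of how one such completion could be parsed, assuming the model respects that layout, the helper below uses a regular expression. `parse_qa_output` and `QA_PATTERN` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import re

# Hypothetical helper (not from the notebook): extracts the question and answer
# from one raw completion that follows the layout requested in the prompt.
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa_output(raw_output):
    """Return (question, answer), or None when the layout is not respected."""
    match = QA_PATTERN.search(raw_output)
    if match is None:
        return None
    return match.group("question").strip(), match.group("answer").strip()


# Made-up completion, for illustration only.
sample_output = "Factoid question: What is the chunk size?\nAnswer: 1000 characters"
parsed = parse_qa_output(sample_output)
print(parsed)
```

Returning `None` for malformed completions lets the caller skip them explicitly instead of storing a badly split question.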

In [82]:
raw_documents = load_docs()
documents = chunk_docs(raw_documents)

In [83]:
print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
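# Cap the sample size so random.sample does not raise when there are fewer chunks than N_GENERATIONS
samples = random.sample(documents, min(N_GENERATIONS, len(documents)))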
for pos in tqdm(range(0, len(samples), MAX_CONCURRENCY)):
    batch = samples[pos:pos + MAX_CONCURRENCY]
    output_QA_couples = llm.batch([QA_generation_prompt.format(context=sampled_context.page_content) for sampled_context in batch])
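    for sampled_context, output_QA_couple in zip(batch, output_QA_couples):
        # Skip generations that do not follow the requested output format
        if "Factoid question: " not in output_QA_couple or "Answer: " not in output_QA_couple:
            continue
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0].strip()
        answer = output_QA_couple.split("Answer: ")[-1].strip()
        # Skip answers that are too long
        if len(answer) >= MAX_ANSWER_LENGTH:
            continue
        outputs.append(
            {
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata["source"],
            }
        )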

Generating 48 QA couples...


100%|██████████| 8/8 [01:00<00:00,  7.50s/it]


In [84]:
# import pprint
# for output in outputs:
#     pprint.pp(output)

# 1.3. Set up critique agents
The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):

- **Groundedness:** can the question be answered from the given context?
- **Relevance:** is the question relevant to users? For instance, "What is the date when transformers 4.29.1 was released?" is not relevant for ML practitioners.

One last failure case we’ve noticed is when a question is tailored to the particular setting where it was generated, but is undecipherable by itself, like "What is the name of the function used in this guide?". We also build a critique agent for this criterion:

- **Stand-alone:** is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be "What is the function used in this article?" for a question generated from a specific blog article.

We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.
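
As a rough sketch of this elimination rule on plain dictionaries: a question is kept only if every critique score reaches a threshold. The threshold of 4 mirrors the filtering cell later in this notebook; `passes_all_critiques` and the sample scores are made up for illustration.

```python
# Threshold assumed to match the filtering cell of this notebook (scores >= 4).
MIN_SCORE = 4
CRITERIA = ("groundedness", "relevance", "standalone")


def passes_all_critiques(scored_question, min_score=MIN_SCORE):
    """Hypothetical helper: True only when every critique score reaches the threshold."""
    return all(scored_question[f"{c}_score"] >= min_score for c in CRITERIA)


# Made-up scores, for illustration only.
candidates = [
    {"question": "Q1", "groundedness_score": 5, "relevance_score": 4, "standalone_score": 5},
    {"question": "Q2", "groundedness_score": 5, "relevance_score": 2, "standalone_score": 5},
]
kept = [c["question"] for c in candidates if passes_all_critiques(c)]
print(kept)
```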

*When asking the agents to output a score, we first ask them to produce a rationale. This helps us verify the scores, but most importantly, writing the rationale first gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.*
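
A minimal sketch of how a rationale-then-score reply could be parsed, assuming the reply uses `Evaluation:` and `Total rating:` labels with an integer rating between 1 and 5. `parse_critique` and `RATING_PATTERN` are hypothetical names, and the sample reply is invented.

```python
import re

# Assumed reply layout: "Evaluation: <rationale>" followed by "Total rating: <1-5>".
RATING_PATTERN = re.compile(
    r"Evaluation:\s*(?P<rationale>.*?)\s*Total rating:\s*(?P<score>[1-5])\b",
    re.DOTALL,
)


def parse_critique(raw_output):
    """Hypothetical helper: return (rationale, score), or None if the layout is not respected."""
    match = RATING_PATTERN.search(raw_output)
    if match is None:
        return None
    return match.group("rationale").strip(), int(match.group("score"))


# Invented reply, for illustration only.
sample_reply = "Evaluation: The context states the value explicitly.\nTotal rating: 5"
print(parse_critique(sample_reply))
```

A pattern like this tolerates trailing text after the rating, which a plain `int(...)` on the remainder of the string would reject.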

We now build and run these critique agents.

In [85]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [86]:
print("Generating critique for each QA couple...")

NUM_CRITERIA = 3

def get_output_prompts(output):
    return [
        question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]),
        question_relevance_critique_prompt.format(question=output["question"]),
        question_standalone_critique_prompt.format(question=output["question"]),
    ]

def batched(iterable, n):
    # batched('ABCDEFG', 3) → ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

scored_outputs = []
max_output_concurrency = MAX_CONCURRENCY//NUM_CRITERIA
for pos in tqdm(range(0, len(outputs), max_output_concurrency)):
    output_batch = outputs[pos:pos + max_output_concurrency]
    batch_prompts = list(chain.from_iterable(map(get_output_prompts, output_batch)))
    results = llm.batch(batch_prompts)

    for i, result_set in enumerate(batched(results, NUM_CRITERIA)):
        try:
            criterion_output = {**output_batch[i]}
            for criterion, evaluation in zip(["groundedness", "relevance", "standalone"], result_set):
                criterion_output[f"{criterion}_score"] = int(evaluation.split("Total rating: ")[-1].strip())
                criterion_output[f"{criterion}_eval"] = evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1]
            scored_outputs.append(criterion_output)
        except (ValueError, IndexError):
            # Skip QA couples whose critique output does not match the expected format
            continue

Generating critique for each QA couple...


100%|██████████| 24/24 [05:39<00:00, 14.14s/it]
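
The loop above flattens the three critique prompts of each QA couple into one list, sends it as a single batch, then regroups the results three by three. The sketch below illustrates that round trip with a stand-in for `llm.batch`; `fake_llm_batch` and `prompts_for` are hypothetical, and the sketch assumes results come back in the same order as the prompts.

```python
from itertools import chain, islice

NUM_CRITERIA = 3


def batched(iterable, n):
    # batched('ABCDEFG', 3) yields ('A','B','C'), ('D','E','F'), ('G',)
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch


def fake_llm_batch(prompts):
    # Stand-in for llm.batch: one result per prompt, in input order (assumption).
    return [f"result for {p}" for p in prompts]


def prompts_for(question):
    # Hypothetical prompt builder: one prompt per criterion.
    return [f"{criterion}:{question}" for criterion in ("groundedness", "relevance", "standalone")]


questions = ["q1", "q2"]
flat_prompts = list(chain.from_iterable(map(prompts_for, questions)))
flat_results = fake_llm_batch(flat_prompts)
# Regroup the flat results so each question gets its own tuple of NUM_CRITERIA results.
regrouped = dict(zip(questions, batched(flat_results, NUM_CRITERIA)))
print(regrouped["q2"])
```

The regrouping only lines up if every QA couple contributes exactly `NUM_CRITERIA` prompts, which is why the batch size is computed as `MAX_CONCURRENCY // NUM_CRITERIA`.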


In [88]:
import pandas as pd
# import datasets

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(scored_outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

# eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)

Evaluation dataset before filtering:


|   | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What was hardcoded by Michael N to allow Terraform apply to run for tutorials on the ready-for-qa branch? | deployment_group_name in aws/codedeploy/output.tf | 5 | 2 | 1 |
| 1 | What is the expected time for the topics cache key to expire? | 10 minutes | 5 | 2 | 5 |
| 2 | What time did Tessa report datadog results showing both drafts and tutorials request processing dropped to zero? | 21:45 UTC 2019-11-12 | 4 | 1 | 1 |
| 3 | What time was the request made by partnerships to apply the permanent fix? | 5pm Pacific | 2 | 1 | 1 |
| 4 | When did Hardy start working on a fix for the parallelism tutorial issue? | July 31, 2017, 6:53 PM | 5 | 2 | 1 |
| 5 | What happens when you modify a Single-AZ deployment to a Multi-AZ deployment in Amazon RDS? | Amazon RDS takes a snapshot of the primary DB instance from your deployment and restores the snapshot into another Availability Zone. | 5 | 1 | 5 |
| 6 | What time was a frontend API change deployed on a page with autosaving? | 8:45 PM | 5 | 2 | 3 |
| 7 | What is the year of Imelda's friends' event? | 2023 | 1 | 1 | 1 |
| 8 | What requires all NewRelic checks to have no failures in order for a service to qualify as healthy? | Tracer.NewRelic.Health | 5 | 3 | 5 |
| 9 | What was adjusted on the Oauth consent screen in Google Cloud Console? | The "User Type" was adjusted from external to internal. | 5 | 2 | 5 |


Final evaluation dataset:


|   | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 17 | What service should you go to in the External Services tab to figure out which service started taking up a larger share of request time? | The DB service. | 5 | 4 | 4 |
| 30 | Can you put a CSV into Gist directly? | Yes, it turns into a table with sortable columns. | 5 | 4 | 5 |
| 34 | What are the block sizes used in the backfill process? | 2^8 - 2^11. | 5 | 4 | 5 |
