https://huggingface.co/learn/cookbook/en/rag_evaluation

TODO:
- [ ] Revise prompts for correct contextuality
- [ ] Generate the full dataset and publish it to LangSmith
- [ ] Implement LangSmith [LLM-as-judge](https://docs.smith.langchain.com/old/evaluation)

In [155]:
import random
from itertools import chain, islice

from tqdm import tqdm
from langchain_community.llms import Ollama
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain_community.document_loaders.directory import DirectoryLoader
from langchain_community.document_loaders.markdown import UnstructuredMarkdownLoader

In [156]:
N_GENERATIONS = 48  # Number of QA couples to generate; kept small for cost and time considerations
MAX_CONCURRENCY = 6
MAX_ANSWER_LENGTH = 300

In [157]:
def load_docs():
    loader = DirectoryLoader(
        ".content", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader
    )
    return loader.load()


def chunk_docs(raw_documents):
    text_splitter = RecursiveCharacterTextSplitter.from_language(
        Language.MARKDOWN,
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len,
    )

    return text_splitter.split_documents(raw_documents)

In [158]:
llm = Ollama(model="llama3", temperature=0)

In [159]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention anything like "in the context".
Your factoid question MUST NOT be answerable with just a numeric, date, time, boolean, or single word answer.
Prefer questions with longer answers.

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

In [160]:
raw_documents = load_docs()
documents = chunk_docs(raw_documents)

In [161]:
print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
samples = random.sample(documents, N_GENERATIONS)
for pos in tqdm(range(0, len(samples), MAX_CONCURRENCY)):
    batch = samples[pos:pos + MAX_CONCURRENCY]
    # Send the whole batch of prompts at once; completions come back in input order.
    prompts = [QA_generation_prompt.format(context=doc.page_content) for doc in batch]
    output_QA_couples = llm.batch(prompts)
    for doc, completion in zip(batch, output_QA_couples):
        try:
            question = completion.split("Factoid question: ")[-1].split("Answer: ")[0]
            answer = completion.split("Answer: ")[-1].strip()
            assert len(answer) < MAX_ANSWER_LENGTH, "Answer is too long"
            outputs.append(
                {
                    "context": doc.page_content,
                    "question": question.strip(),
                    "answer": answer,
                    "source_doc": doc.metadata["source"],
                }
            )
        except Exception:
            # Skip generations that are too long or lack a source.
            continue

Generating 48 QA couples...


100%|██████████| 8/8 [01:12<00:00,  9.03s/it]
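
The `split`-based parsing in the cell above still produces a question and an answer when the model ignores the requested layout, so malformed completions can slip into `outputs`. The following is a minimal sketch of a stricter parser, assuming completions follow the `Factoid question: ... Answer: ...` layout requested by the prompt. The names `QA_PATTERN` and `parse_qa` are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical helper (not in the original notebook): match the layout the
# generation prompt asks for, tolerating extra whitespace and line breaks.
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa(completion, max_answer_length=300):
    """Return (question, answer), or None if the completion is malformed
    or the answer reaches max_answer_length characters."""
    match = QA_PATTERN.search(completion)
    if match is None:
        return None
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    if not question or not answer or len(answer) >= max_answer_length:
        return None
    return question, answer


sample = (
    "Output:::\n"
    "Factoid question: What causes a timeout in find_or_create_by?\n"
    "Answer: Competing locks on the same record.\n"
)
print(parse_qa(sample))
print(parse_qa("I cannot answer that."))
```

Returning `None` instead of raising makes the skip condition explicit, so the calling loop would not need a broad `try`/`except` around the parsing step.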


In [162]:
# import pprint
# for output in outputs:
#     pprint.pp(output)

# 1.3. Set up critique agents
The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):

- **Groundedness:** can the question be answered from the given context?
- **Relevance:** is the question relevant to users? For instance, "What is the date when transformers 4.29.1 was released?" is not relevant for ML practitioners.

One last failure case we’ve noticed is when a question is tailored to the particular setting in which it was generated, but undecipherable by itself, like "What is the name of the function used in this guide?". We also build a critique agent for this criterion:

- **Stand-alone:** is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be "What is the function used in this article?" for a question generated from a specific blog article.

We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.

*When asking the agents to output a score, we first ask them to produce a rationale. This helps us verify the scores, but more importantly, asking for the rationale first gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.*
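
To illustrate the rationale-then-score layout and the elimination rule described above, here is a minimal sketch. It assumes each critique response contains an `Evaluation:` section followed by a `Total rating:` line, and that a threshold of 4 is used. The names `parse_critique` and `passes_all` are hypothetical helpers, not part of the original notebook.

```python
import re

# Hypothetical helpers (not in the original notebook).
RATING_PATTERN = re.compile(r"Total rating:\s*(\d+)")
EVAL_PATTERN = re.compile(r"Evaluation:\s*(.*?)\s*Total rating:", re.DOTALL)


def parse_critique(response):
    """Return (rationale, score), or None if the response does not follow
    the expected layout or the score is outside the 1 to 5 scale."""
    rating = RATING_PATTERN.search(response)
    rationale = EVAL_PATTERN.search(response)
    if rating is None or rationale is None:
        return None
    score = int(rating.group(1))
    if not 1 <= score <= 5:
        return None
    return rationale.group(1), score


def passes_all(scores, threshold=4):
    """Keep a question only if every critique score reaches the threshold."""
    return all(score >= threshold for score in scores.values())


response = "Evaluation: The context states the answer directly.\nTotal rating: 5/5"
print(parse_critique(response))
print(passes_all({"groundedness": 5, "relevance": 2, "standalone": 5}))
```

Matching the first integer after `Total rating:` tolerates suffixes such as `/5` or trailing commentary, which a plain `int(...)` conversion of the remaining text would reject.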

We now build and run these critique agents.

In [163]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to software developers trying to solve problems in a full-stack ed-tech web application.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Apdex, db, NoRedInk or ddos and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [164]:
print("Generating critique for each QA couple...")

NUM_CRITERIA = 3

def get_output_prompts(output):
    return [
        question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]),
        question_relevance_critique_prompt.format(question=output["question"]),
        question_standalone_critique_prompt.format(question=output["question"]),
    ]

def batched(iterable, n):
    # batched('ABCDEFG', 3) → ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

scored_outputs = []
max_output_concurrency = MAX_CONCURRENCY//NUM_CRITERIA
for pos in tqdm(range(0, len(outputs), max_output_concurrency)):
    output_batch = outputs[pos:pos + max_output_concurrency]
    batch_prompts = list(chain.from_iterable(map(get_output_prompts, output_batch)))
    results = llm.batch(batch_prompts)

    for i, result_set in enumerate(batched(results, NUM_CRITERIA)):
        try:
            criterion_output = {**output_batch[i]}
            for criterion, evaluation in zip(["groundedness", "relevance", "standalone"], result_set):
                criterion_output[f"{criterion}_score"] = int(evaluation.split("Total rating: ")[-1].strip())
                criterion_output[f"{criterion}_eval"] = evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1]
            scored_outputs.append(criterion_output)
        except Exception:
            # Skip QA couples whose critique output could not be parsed.
            continue

Generating critique for each QA couple...


100%|██████████| 24/24 [05:55<00:00, 14.81s/it]


In [165]:
import pandas as pd
# import datasets

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(scored_outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

# eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)

Evaluation dataset before filtering:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,What was the reason for the improved performance expected over the weekend?,"We were working with Telia directly on this issue, and had drained them on our side until their engineers got back to us.",5,2,1
1,"What is the purpose of labeling ""fire follow-up"" in the eng-internal pivotal tracker?","To notify at least one junior and all firefighters for review, and to link the fires doc.",5,5,1
2,"What firefighting resources were used to resolve an issue affecting quizzes, diagnostics, and practices on November 1st, 2017?","The firefighting resource used was archiving the affected assignments, which unblocked students.",2,2,2
3,Who were the on-call personnel for the period from November 15 to November 21 in 2016?,"Rao, Tessa, and Sam.",5,1,5
4,What was the outcome of the deploy attempt to put master back on staging at Jenkins job 1766?,The deploy attempt failed.,5,2,2
5,What is the first step a fire responder should take when investigating an alert?,Acknowledge that you're investigating the alert in #fires.,5,1,5
6,What was the outcome of the demo deploy at 3:07?,The demo deploy was aborted.,5,2,1
7,"What was the duration of the MySQL performance degradation outage on December 1, 2016?","The outage lasted approximately 9 minutes, from 4:13 UTC to 4:18 UTC.",5,2,5
8,What was the date of the incident response retro?,+2018-05-24,5,1,1
9,What is the new identifier for the modified AWS RDS database instance?,prod-internal-replica,5,3,5


Final evaluation dataset:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
12,What is the purpose of the SELECT statement in the given context?,"The purpose of the SELECT statement is to retrieve specific data from the grammar_quiz_questions table, filtering results based on user_id and grammar_question_id.",5,5,4
14,What are some key activities that may cause significant user pain if broken on production?,"Teacher: Assign work and view data, Student: Answer quiz engine questions and go through writing assignments, Both: Sign up and log in.",5,5,5
15,What was the likely cause of the issue that caused error rates to increase?,A single Redis entry with stale data in HQE's Redis cluster.,5,4,5
23,What was the problem that occurred in an endless quiz when students were seeing a 500 error on reanswer?,"The problem was that the student mastery object could be nil, which was not taken into account during the refactor.",5,4,5
25,How much space was freed up after dropping GQQ_old?,~600+gb of space.,5,4,5
27,What can cause a timeout when using find_or_create_by in Rails?,"Competing locks or a tight loop of trying to create a record due to constraints that would fail, potentially leading to a retry that times out.",4,5,5
30,What version of Bundler was installed when reproducing the issue locally?,1.14.5,5,4,4
38,What should you label when issuing a story in Targetprocess for SRE?,"""fire follow-up"" and add triage labels.",5,5,5


In [174]:
# from langchain.evaluation.qa import QAGenerateChain
# example_gen_chain = QAGenerateChain.from_llm(llm)
# new_examples = example_gen_chain.apply_and_parse(
#     [{"doc": t} for t in documents[10:20]]
# )
# res = (list(map(lambda result: result["qa_pairs"], new_examples)))
# pd.DataFrame.from_dict(res)



Unnamed: 0,query,answer
0,What can you try to slow down the quiz engine if it's consuming most of the database time?,You can try slowing down the quiz engine as it's usually the biggest source of DB time consumption.
1,"What is usually the cause of an unusual bump in traffic, according to the provided resource?",The bump in traffic is usually caused by simplistic transactions that hit one or very few endpoints.
2,What should you do if your website (noredink.com) is going down?,"Click Under Attack Mode in the Quick Actions pane on the right, which will cause all users to be challenged to prove they're human when trying to access the website. This might break our front-end and result in a 429 response code with HTML for the challenge page."
3,"What type of attacks did the team have to fight using rate limiting rules, according to the document?",The team had to fight 300k RPM DDoS (Distributed Denial of Service) attacks.
4,What is the primary focus of firefighting resources?,"According to the document, the primary focus of firefighting resources is ""Production & basic info"", indicating that the main emphasis is on producing and providing essential information for firefighting purposes."
5,What would happen to Account Executives if free usage reports are broken?,"They would be unable to prepare for daily Sales calls, which is considered a significant user pain point."
6,"What might happen if the ""reports-service-db-production-db"" is inaccessible?",The Clever classes and rosters may not sync for at least an hour.
7,What are some limitations that teachers will experience with the Drafts Service?,"According to the document, teachers will not be able to save writing, load writing, or refer NoRedInk to others. Additionally, they will not be able to browse content by standards or tests at /curriculum/standards."
8,"What is the reason for students being unable to complete quizzes, according to the documentation?","According to the document, students will not be able to do quizzes because of issues with QuizEngine HTTP."
9,What is one of the potential issues that may occur when loading a student's full essay view in the demo site?,"According to the document, if loading a student's full essay view is not working, it is due to an issue with the ""Dedicated service"" and requires investigation into servers and databases."
