## Use an LLM as a Judge to evaluate a **RAG application on a custom metric**

Retrieval Augmented Generation (RAG) is one of the most popular use cases for LLMs, but it is also one of the most difficult to evaluate. There are common metrics for RAG, but they might not always fit the use case or may be too generic. Here we define a new additive RAG metric on a 3-point scale.

This 3-point additive metric evaluates RAG system responses based on their adherence to the given context, completeness in addressing all key elements, and relevance combined with conciseness.

*Note: This is a completely made-up metric for demonstration purposes only. It is important that you define the metrics and criteria based on your own use case and priorities.*
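
To make the additive idea concrete, here is a minimal sketch of how the three criteria combine into a single score. The `CriteriaResult` class and the `additive_score` function are hypothetical names used only for illustration; in this notebook the LLM judge produces the total directly.

```python
from dataclasses import dataclass


@dataclass
class CriteriaResult:
    """Hypothetical container for the three pass/fail judgments."""
    context_adherence: bool  # answer uses only information from the context
    completeness: bool       # answer covers all key elements of the question
    conciseness: bool        # answer is relevant and avoids redundancy


def additive_score(result: CriteriaResult) -> int:
    """Add one point per satisfied criterion, giving a total between 0 and 3."""
    return (
        int(result.context_adherence)
        + int(result.completeness)
        + int(result.conciseness)
    )


# A grounded and complete but wordy answer earns 2 of 3 points.
print(additive_score(CriteriaResult(True, True, False)))
```

Because each criterion contributes either 0 or 1, the total is always an integer from 0 to 3.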

In [None]:
!pip install openai huggingface_hub datasets

To help improve the model's performance, we define three few-shot examples: a 0-score example, a 1-score example, and a 3-score example. You can find them in the [dataset repository](https://huggingface.co/datasets/zeitgeist-ai/financial-rag-nvidia-sec). For the evaluation data, we will use a synthetic dataset from the [**2023_10 NVIDIA SEC Filings**](https://stocklight.com/stocks/us/nasdaq-nvda/nvidia/annual-reports/nasdaq-nvda-2023-10K-23668751.pdf). This dataset includes a question, answer, and context. We are going to evaluate 50 random samples to see how well the answers score on our defined metric.

We are going to use the `AsyncOpenAI` client to score multiple examples in parallel.

In [117]:
import asyncio
from openai import AsyncOpenAI
import huggingface_hub
from tqdm.asyncio import tqdm_asyncio

# max concurrency
sem = asyncio.Semaphore(5)

# Initialize the client using the Hugging Face Inference API
client = AsyncOpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key=huggingface_hub.get_token(),
)

# Async helper that scores samples concurrently (bounded by the semaphore)
# and reports progress with a single progress bar
async def limited_get_score(dataset):
    async def gen(sample):
        async with sem:
            res = await get_eval_score(sample)
            progress_bar.update(1)
            return res

    progress_bar = tqdm_asyncio(total=len(dataset), desc="Scoring", unit="sample")
    tasks = [gen(sample) for sample in dataset]
    # plain asyncio.gather keeps the input order; tqdm_asyncio.gather would draw a second bar
    responses = await asyncio.gather(*tasks)
    progress_bar.close()
    return responses
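
The bounded-concurrency pattern used in the helper can be tried without any API access. The sketch below is an assumption-laden stand-in: `fake_score` is a hypothetical coroutine that sleeps instead of calling a model, and `score_all` also records the peak number of simultaneous calls so you can see that the semaphore holds the limit.

```python
import asyncio


async def fake_score(sample: dict) -> dict:
    """Hypothetical stand-in for a judge call; sleeps instead of calling an API."""
    await asyncio.sleep(0.01)
    return {**sample, "total_score": 3}


async def score_all(samples: list, max_concurrency: int = 5) -> tuple[list, int]:
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(max_concurrency)
    in_flight = 0  # number of calls currently holding the semaphore
    peak = 0       # highest value in_flight ever reached

    async def worker(sample: dict) -> dict:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_score(sample)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input samples
    results = await asyncio.gather(*(worker(s) for s in samples))
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(score_all([{"id": i} for i in range(20)]))
    print(f"scored {len(results)} samples, peak concurrency {peak}")
```

In a Jupyter notebook an event loop is already running, so you would call `await score_all(...)` directly instead of `asyncio.run(...)`.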


To evaluate our model, we need to define the `additive_criteria`, `evaluation_steps`, and `json_schema`.

In [109]:
def format_examples(examples):
    return "\n".join([
        f'Question: {ex["question"]}\nContext: {ex["context"]}\nAnswer: {ex["answer"]}\nEvaluation:{ex["eval"]}' 
        for ex in examples
    ])

EVALUATION_PROMPT_TEMPLATE = """You are an expert judge evaluating Retrieval Augmented Generation applications. Your task is to evaluate a given answer based on a context and question using the criteria provided below.

Evaluation Criteria (Additive Score, 0-3):
{additive_criteria}

Evaluation Steps:
{evaluation_steps}

Output format:
{json_schema}

Examples:
{examples}

Now, please evaluate the following:

Question:
{question}
Context:
{context}
Answer:
{answer}
"""

ADDITIVE_CRITERIA = """1. Context: Award 1 point if the answer uses only information provided in the context, without introducing external or fabricated details.
2. Completeness: Add 1 point if the answer addresses all key elements of the question based on the available context, without omissions.
3. Conciseness: Add a final point if the answer uses the fewest words possible to address the question and avoids redundancy."""

EVALUATION_STEPS="""1. Read the question and provided answer carefully to understand the context.
2. Go through each evaluation criterion one by one and assess whether the answer meets the criteria.
3. Compose your reasoning for each criterion, explaining why you did or did not award a point. Be specific and refer to elements of the answer that influenced your decision. 
4. For each criterion that the answer meets, add the corresponding point (up to 1 point per criterion). You can only award full points.
5. Format your evaluation response according to the specified Output format, ensuring proper JSON syntax with a "reasoning" field for your step-by-step explanation and a "total_score" field for the calculated total. Review your formatted response. It needs to be valid JSON."""

JSON_SCHEMA="""{
  "reasoning": "Your step-by-step explanation for the Evaluation Criteria, why you awarded a point or not.",
  "total_score": sum of criteria scores
}"""
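
One detail worth noting is that the JSON schema contains curly braces, yet it is passed through `str.format`. The sketch below, using a shortened hypothetical `TEMPLATE` and `SCHEMA`, illustrates why this is safe: `str.format` only parses the template string, and substituted values are inserted verbatim. Braces written directly in the template would need to be doubled.

```python
# Shortened, hypothetical stand-ins for the prompt template and schema.
TEMPLATE = """Output format:
{json_schema}

Question:
{question}
"""

SCHEMA = """{
  "reasoning": "why a point was awarded or not",
  "total_score": 0
}"""

# Braces inside the substituted value are not interpreted as placeholders.
prompt = TEMPLATE.format(json_schema=SCHEMA, question="What was the revenue?")
print(prompt)

# Literal braces in the template itself must be escaped by doubling them.
escaped = "Return {{\"total_score\": {score}}}".format(score=2)
print(escaped)
```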

Then, we define our `get_eval_score` method.

In [127]:
import json 
async def get_eval_score(sample):
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        additive_criteria=ADDITIVE_CRITERIA,
        evaluation_steps=EVALUATION_STEPS,
        json_schema=JSON_SCHEMA,
        examples=format_examples(few_shot_examples),
        question=sample["question"],
        context=sample["context"],
        answer=sample["answer"]
    )
    # Uncomment if you want to see the prompt
    # print(prompt)
    response = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=512,
    )
    results = response.choices[0].message.content
    # Add the evaluation results to the sample
    return {**sample, **json.loads(results)}
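
Calling `json.loads` directly on the model output assumes the judge returns nothing but valid JSON. As a hedged sketch of a more defensive alternative, the hypothetical helper `parse_judge_output` below extracts the outermost `{...}` span and checks that `total_score` is an integer in the expected range. It is an illustration under those assumptions, not part of the original notebook.

```python
import json
import re


def parse_judge_output(text: str, max_score: int = 3) -> dict:
    """Extract and validate the judge's JSON object from raw model text."""
    # Take the outermost {...} span in case the model adds text around the JSON.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")

    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    data = json.loads(match.group(0))

    score = data.get("total_score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= max_score:
        raise ValueError(f"total_score missing or out of range: {score!r}")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("missing or non-string 'reasoning' field")
    return data


raw = 'Here is my evaluation: {"reasoning": "Grounded and complete.", "total_score": 2} Done.'
print(parse_judge_output(raw)["total_score"])
```

With a helper like this, a malformed response raises a `ValueError` that can be caught and logged per sample instead of aborting the whole evaluation run.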


The last missing piece is the data. We use the `datasets` library to load our samples.

In [137]:
from datasets import load_dataset

eval_ds = load_dataset("zeitgeist-ai/financial-rag-nvidia-sec", split="train").shuffle(seed=42).select(range(50))
print(f"Limited evaluation of {len(eval_ds)} samples")
few_shot_examples = load_dataset("zeitgeist-ai/financial-rag-nvidia-sec", "few-shot-examples", split="train")
print(f"Limited evaluation of {len(few_shot_examples)} few-shot examples")

Limited evaluation of 50 samples
Limited evaluation of 3 few-shot examples


Let's test an example.

In [129]:
import json

sample = [sample for sample in eval_ds.select(range(1))]
print(f"Question: {sample[0]['question']}\nContext: {sample[0]['context']}\nAnswer: {sample[0]['answer']}")
print("---" * 10)
# use this instead if you are not in a Jupyter notebook
# responses = asyncio.run(limited_get_score(sample))
responses = await limited_get_score(sample)
print(f"Reasoning: {responses[0]['reasoning']}\nTotal Score: {responses[0]['total_score']}")

Question: How much were the company's debt obligations as of December 31, 2023?
Context: The company's debt obligations as of December 31, 2023, totaled $2,299,887 thousand.
Answer: $2,299,887 thousand
------------------------------


Scoring: 100%|██████████| 1/1 [00:00<00:00,  2.67sample/s]

Reasoning: 1. Context: The answer accurately uses the information provided in the context, specifically mentioning the exact amount of debt obligations as of December 31, 2023. Therefore, it earns 1 point for using the correct context.
2. Completeness: The answer addresses the key element of the question, which is the specific amount of the company's debt obligations as of December 31, 2023. It does not omit any necessary information, so it earns 1 point for completeness.
3. Conciseness: The answer is concise and directly answers the question without any unnecessary information or redundancy. Therefore, it earns 1 point for conciseness.
Total Score: 3





Awesome, it works and looks good now. Let's evaluate all 50 examples and then calculate our average score.

In [131]:
results = await limited_get_score(eval_ds)


Scoring: 100%|██████████| 50/50 [00:01<00:00, 31.06sample/s]


In [132]:
# calculate the average score
total_score = sum([r["total_score"] for r in results]) / len(results)
print(f"Average Score: {total_score}")

# extract the samples with score 0
score_0 = [r for r in results if r["total_score"] == 0]
print(f"Samples with score 0: {len(score_0)}")

Average Score: 2.78
Samples with score 0: 2
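
An average alone hides how the scores are spread. As a small sketch, the snippet below counts how many samples received each score. The `results` list here is hypothetical placeholder data with the same `total_score` key; in the notebook you would use the `results` returned by `limited_get_score`.

```python
from collections import Counter

# Hypothetical judge results; replace with the real `results` list in the notebook.
results = [{"total_score": s} for s in [3, 3, 2, 0, 3, 1]]

# Count how many samples received each score.
distribution = Counter(r["total_score"] for r in results)

for score in range(4):
    count = distribution.get(score, 0)
    print(f"score {score}: {count} sample(s), {count / len(results):.0%}")
```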


Great. We achieved an average score of 2.78! To better understand why the score is only 2.78, let's look at an example that scored poorly and check whether the judgment is correct.

In [136]:
print(f"Question: {score_0[0]['question']}\nContext: {score_0[0]['context']}\nAnswer: {score_0[0]['answer']}")
print("---" * 10)
print(f"Reasoning: {score_0[0]['reasoning']}\nTotal Score: {score_0[0]['total_score']}")

Question: What was the total dollar value of outstanding commercial real estate loans at the end of 2023?
Context: The total outstanding commercial real estate loans amounted to $72,878 million at the end of December 2022.
Answer: $72.878 billion
------------------------------
Reasoning: 1. Context: The answer does not use the correct information from the provided context. The context mentions the total outstanding commercial real estate loans at the end of December 2022, but the answer provides a value without specifying the correct year. Therefore, no points are awarded for context.
2. Completeness: The answer provides a dollar value, but it does not address the key element of the question, which is the total dollar value at the end of 2023. The context only provides information about 2022, and the answer does not clarify or provide the correct information for 2023. Thus, no points are awarded for completeness.
3. Conciseness: The answer is concise, but it does not address the correc

Our LLM judge correctly identified that the question asked for 2023, but the context only provided information about 2022. Additionally, we see that the completeness and conciseness judgments rely heavily on the context. Depending on your needs, there could be room for improvement in our prompt.