# Using LLM-as-a-Judge for an Automated and Versatile Evaluation

Evaluating LLMs is a difficult task. Given their broad capabilities, the tasks given to them often have to be judged on requirements that are very broad and loosely defined.

An assistant's answer to a question can be
* not grounded in context
* repetitive, repetitive, repetitive
* grammatically incorrect
* excessively lengthy and characterized by an overabundance of words, leading to a situation where the discourse or written content becomes overly detailed and protracted
* incoherent
* ...


Each of these criteria is hard to measure. Devising a rule-based program to assess the outputs is extremely challenging. Traditional evaluation metrics based on the similarity between outputs and reference answers (e.g., ROUGE, BLEU) are also ineffective for these issues.
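
To illustrate why similarity-based metrics struggle here, the sketch below computes a simplified token-overlap F1 score. It is only a ROUGE-1-like approximation written for this example, not the official ROUGE or BLEU implementation, and the function name `unigram_f1` and the sentences are made up.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a simplified, ROUGE-1-like score)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each shared token counts min(count_a, count_b) times
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the capital of france is paris"
paraphrase = "paris"                     # correct, but worded differently
wrong = "the capital of france is lyon"  # incorrect, but nearly identical wording

paraphrase_score = unigram_f1(paraphrase, reference)
wrong_score = unigram_f1(wrong, reference)
print(f"correct but short answer: {paraphrase_score:.2f}")
print(f"wrong answer:             {wrong_score:.2f}")
```

In this toy case the wrong answer scores higher than the correct one, which is the failure mode described above.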

A powerful solution to assess outputs in a human way, without requiring costly human time, is LLM-as-a-judge, introduced in the paper [*Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena*](https://huggingface.co/papers/2306.05685). The idea is to ask an LLM to do the grading for us.

## Setup

In [None]:
!pip install -qU huggingface_hub datasets pandas tqdm

In [None]:
import re
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_dataset
from huggingface_hub import notebook_login, InferenceClient

tqdm.pandas() # load tqdm's pandas support
pd.set_option('display.max_colwidth', None)

In [None]:
notebook_login()

In [None]:
repo_id = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

llm_client = InferenceClient(
    model=repo_id,
    timeout=120
)

# Test LLM client
llm_client.text_generation(
    prompt='How are you today?',
    max_new_tokens=20
)

## Prepare the creation and evaluation of our LLM judge

If we ask an LLM to answer open-ended questions, measuring the quality of its answers is difficult. For example, an exact string match will flag too many correct but differently worded answers as false. In this case, we can set up an LLM-as-a-judge.

To use an LLM-as-a-judge, we first need to evaluate how reliably it rates our model outputs.

The first step is to create a human evaluation dataset. It does not need to be large: 30+ examples should be enough to get a good idea of the performance. We can reuse this dataset every time we want to test our LLM-as-a-judge.

In this example, we will use the [`feedbackQA`](https://huggingface.co/datasets/McGill-NLP/feedbackQA) dataset, which contains 2 human evaluations and scores for each question/answer pair. Using a sample of 30 examples will be representative of what our small evaluation dataset could be.

In [None]:
ratings = load_dataset('McGill-NLP/feedbackQA')['train']
ratings = pd.DataFrame(ratings)

# Each row stores both human reviews in a single `feedback` dict: split them into columns
ratings['score_1'] = ratings['feedback'].apply(lambda x: x['rating'][0])
ratings['explanation_1'] = ratings['feedback'].apply(lambda x: x['explanation'][0])
ratings['score_2'] = ratings['feedback'].apply(lambda x: x['rating'][1])
ratings['explanation_2'] = ratings['feedback'].apply(lambda x: x['explanation'][1])

ratings = ratings.drop(columns=['feedback'])

# Map scores to numeric values
conversion_dict = {'Excellent': 4, 'Acceptable': 3, 'Could be Improved': 2, 'Bad': 1}
ratings['score_1'] = ratings['score_1'].map(conversion_dict)
ratings['score_2'] = ratings['score_2'].map(conversion_dict)

Check a baseline for performance. In this dataset, it can be the agreement between the two human raters, as measured by the Pearson correlation of the scores they give.

In [None]:
print('Correlation between 2 human raters:')
print(f"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}")
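
To make explicit what the `method='pearson'` call above computes, here is a minimal pure-Python sketch of the Pearson correlation. The helper name `pearson` and the two score lists are hypothetical and are not taken from feedbackQA.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores from two raters on five answers
rater_a = [4, 3, 1, 2, 4]
rater_b = [4, 2, 1, 3, 4]
r = pearson(rater_a, rater_b)
print(f"{r:.3f}")
```

Because the formula centers and normalizes both series, the result does not change when one rater's scores are shifted or multiplied by a positive constant. Only the relative ordering and spacing of the scores matter.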

If our human rating correlation is really bad, it probably means that the rating criteria are not clear enough.

This means that our "ground truth" contains noise: we cannot expect any algorithmic evaluation to come that close to it.

To reduce this noise, we can
* take the average score as our ground truth instead of any single score, which should even out some of the irregularities,
* only select the samples where the human reviewers are in agreement.

In this example, we choose the second option and **only keep examples where the 2 human reviewers are in agreement**.

In [None]:
# sample examples
ratings_where_raters_agree = ratings.loc[ratings['score_1'] == ratings['score_2']]
examples = ratings_where_raters_agree.groupby('score_1').sample(10, random_state=111)
examples['human_score'] = examples['score_1']

# visualize 1 sample for each score
display(examples.groupby('human_score').first())
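
For comparison, the two noise-reduction options listed above can be sketched side by side on a toy table. The rows below are invented for illustration, and plain Python dicts stand in for the DataFrame.

```python
from statistics import mean

# Hypothetical rows mirroring the two human score columns (not real feedbackQA data)
rows = [
    {"question": "q1", "score_1": 4, "score_2": 4},
    {"question": "q2", "score_1": 3, "score_2": 1},
    {"question": "q3", "score_1": 2, "score_2": 3},
]

# Option 1: keep every row and use the mean of the two ratings as ground truth
averaged = [
    {**row, "human_score": mean([row["score_1"], row["score_2"]])}
    for row in rows
]

# Option 2: keep only the rows where both raters gave the same score
agreed = [
    {**row, "human_score": row["score_1"]}
    for row in rows
    if row["score_1"] == row["score_2"]
]

print([row["human_score"] for row in averaged])
print([row["question"] for row in agreed])
```

Option 1 keeps all the data but produces fractional targets. Option 2 keeps integer targets at the cost of discarding rows. With the `ratings` DataFrame, option 1 would presumably amount to `ratings[['score_1', 'score_2']].mean(axis=1)`.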

## Create our LLM judge

We will build our LLM judge with a basic prompt, containing:
* task description
* scale description: `minimum`, `maximum`, value types (`float` here)
* explanation of the output format
* a beginning of an answer, to take the LLM by the hand as far as we can

In [None]:
JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.

Provide your feedback as follows:

Feedback:::
Total rating: (your rating, as a float between 0 and 10)

Now here are the question and answer.

Question: {question}
Answer: {answer}

Feedback:::
Total rating: """

In [None]:
examples['llm_judge'] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=JUDGE_PROMPT.format(question=x['question'], answer=x['answer']),
        max_new_tokens=1000
    ),
    axis=1
)

In [None]:
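from typing import Optional


def extract_judge_score(answer: str, split_str: str = 'Total rating:') -> Optional[float]:
    """Parse the first number following `split_str`; return None when parsing fails."""
    try:
        # Keep only the text after the marker when the judge repeated it
        if split_str in answer:
            rating = answer.split(split_str)[1]
        else:
            rating = answer

        # Match integers or decimals such as "7" or "7.5"
        digit_groups = re.findall(r"\d+(?:\.\d+)?", rating)
        return float(digit_groups[0])
    except Exception as e:
        # Typically an IndexError when the output contains no number
        print(e)
        return None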


examples['llm_judge_score'] = examples['llm_judge'].apply(extract_judge_score)
# Rescale the LLM score from its 0-10 range to the 1-4 range of the human score
examples['llm_judge_score'] = examples['llm_judge_score'] * 3 / 10 + 1
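
One possible hardening of the parsing step, shown here only as a sketch, is to reject numbers that fall outside the scale announced in the prompt, since a judge can occasionally output something like `12` on a 0-10 scale. The helper `parse_rating` and its `low`/`high` parameters are hypothetical and not part of the notebook.

```python
import re
from typing import Optional


def parse_rating(text: str, low: float, high: float, marker: str = "Total rating:") -> Optional[float]:
    """Return the first number after `marker`, or None if it is missing or out of range."""
    # Keep the text after the first marker; if the marker is absent, keep everything
    tail = text.split(marker, 1)[-1]
    match = re.search(r"\d+(?:\.\d+)?", tail)
    if match is None:
        return None
    value = float(match.group())
    # Reject values outside the scale announced in the prompt
    return value if low <= value <= high else None


print(parse_rating(" 7.5\nThe answer covers most points.", low=0, high=10))
print(parse_rating("Total rating: 12", low=0, high=10))
```

Returning `None` for out-of-range values lets such rows be counted or dropped explicitly instead of silently distorting the correlation.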

In [None]:
print("Correlation between LLM-as-a-judge and the human raters:")
print(f"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}")

## Improve the LLM judge

LLMs are not good at evaluating outputs in continuous ranges. The best practices to build a better prompt would be:
* **Leave more time for thought** by adding an `Evaluation` field before the final answer
* **Use a small integer scale** like 1-4 or 1-5 instead of a large float scale as we had previously
* **Provide an indicative scale for guidance**
* **Add a virtual reward to motivate the LLMs**

In [None]:
IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and answer.

Question: {question}
Answer: {answer}

Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """

In [None]:
examples['llm_judge_improved'] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=IMPROVED_JUDGE_PROMPT.format(question=x['question'], answer=x['answer']),
        max_new_tokens=500
    ),
    axis=1
)

examples['llm_judge_improved_score'] = examples['llm_judge_improved'].apply(extract_judge_score)

In [None]:
print("Correlation between LLM-as-a-judge and the human raters:")
print(f"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}")

Compare this result with the one from the previous section.

Now let's display a few errors of our LLM judge to analyze them:

In [None]:
errors = pd.concat([
    examples.loc[examples['llm_judge_improved_score'] > examples['human_score']].head(1),
    examples.loc[examples['llm_judge_improved_score'] < examples['human_score']].head(2),
])

display(
    errors[
        ['question', 'answer', 'human_score', 'explanation_1', 'llm_judge_improved_score', 'llm_judge_improved']
    ]
)

## How do we take our LLM judge even further?

Our human ground truth certainly has some noise, so agreement/correlation will never go up to 100% even with a perfect LLM judge.
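
Since agreement is mentioned alongside correlation, here is a minimal sketch of an agreement rate on integer scores, with an optional tolerance. The function `agreement_rate` and the score lists are hypothetical and do not reflect actual results from this notebook.

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int], tolerance: int = 0) -> float:
    """Share of examples where the judge is within `tolerance` points of the human score."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("Both score lists must have the same length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)


# Hypothetical scores on the 1-4 scale
judge = [4, 3, 2, 1, 3, 4]
human = [4, 4, 2, 1, 1, 4]
exact = agreement_rate(judge, human)
within_one = agreement_rate(judge, human, tolerance=1)
print(f"exact agreement: {exact:.2f}, within one point: {within_one:.2f}")
```

Unlike Pearson correlation, this measure penalizes a judge that ranks answers correctly but is systematically one point too generous, so the two numbers can usefully be read together.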

If we have access to a reference answer for each question, we should definitely give this to the LLM judge in its prompt to get better results.
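
As a rough sketch of what this could look like, the template below adds a reference field to the judge prompt. The `reference_answer` placeholder, the prompt wording, and the example values are all assumptions made for illustration. Nothing here establishes that the sample used in this notebook includes reference answers.

```python
# Hypothetical prompt variant: the `reference_answer` field is an assumption,
# it is not a column used elsewhere in this notebook.
REFERENCE_JUDGE_PROMPT = """
You will be given a user_question, a reference_answer and a system_answer.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user_question, using the reference_answer as a guide to what a good answer contains.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)

Now here are the question, the reference answer and the answer to rate.

Question: {question}
Reference answer: {reference_answer}
Answer: {answer}

Feedback:::
Evaluation: """

# Made-up values, only to show how the template is filled in
prompt = REFERENCE_JUDGE_PROMPT.format(
    question="What is the capital of France?",
    reference_answer="Paris",
    answer="It is Paris, on the Seine.",
)
print(prompt)
```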

We can add some few-shot examples of questions and ground truth evaluations in the prompt to improve the results.
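
A minimal sketch of how such few-shot examples might be assembled follows. The helper `build_few_shot_block`, the dictionary keys, and the two rated examples are hypothetical. The layout simply mirrors the `Feedback:::` format of the prompts in this notebook.

```python
def build_few_shot_block(shots: list[dict]) -> str:
    """Format rated examples so they can be placed ahead of the sample to judge."""
    parts = []
    for shot in shots:
        parts.append(
            f"Question: {shot['question']}\n"
            f"Answer: {shot['answer']}\n"
            "Feedback:::\n"
            f"Evaluation: {shot['evaluation']}\n"
            f"Total rating: {shot['rating']}"
        )
    return "\n\n".join(parts)


# Hypothetical rated examples, invented for illustration
shots = [
    {
        "question": "How do I reset my password?",
        "answer": "Click 'Forgot password' on the login page and follow the emailed link.",
        "evaluation": "Direct and complete.",
        "rating": 4,
    },
    {
        "question": "What are your opening hours?",
        "answer": "We sell many products.",
        "evaluation": "Does not address the question.",
        "rating": 1,
    },
]
few_shot_block = build_few_shot_block(shots)
print(few_shot_block)
```

The resulting block could then be inserted into the judge prompt before the question to rate. Covering both ends of the scale in the examples is a design choice intended to avoid biasing the judge toward a single score.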

When the judgement can be split into atomic criteria, we can also use an additive scale to further improve results:

In [None]:
ADDITIVE_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

- Award 1 point if the answer is related to the question.
- Give 1 additional point if the answer is clear and precise.
- Provide 1 further point if the answer is true.
- One final point should be awarded if the answer provides additional resources to support the user.

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and answer.

Question: {question}
Answer: {answer}

Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """