# LLM as a Judge

LLM as a Judge is a method for evaluating summaries by asking an LLM to assess them.

Typically this can be done in one of two ways:

1 - An LLM judges a single summary against a scoring schema, optionally comparing it to a 'ground truth' summary or to the reference material.

2 - An LLM compares two summaries and judges which one is better with respect to a chosen characteristic.

This notebook will run through two examples exploring LLM as a Judge.

The aim of this notebook is to help explain the theory behind LLM as a Judge. See `4 - Further Considerations` before considering using LLM as a Judge for your own evaluations. 

## 1 - Summary Generation

First, let us create two summaries to use for the example.

We will be creating two summaries from the `nhs_history.txt` file, based on [this](https://www.england.nhs.uk/nhsbirthday/about-the-nhs-birthday/nhs-history/) page from the NHS England website.

We shall use GPT-4 hosted on Azure to generate one summary, and we will manually write a second summary that contains a few deliberate errors.

In [None]:
# Import required modules
import os
from dotenv import load_dotenv
import requests

load_dotenv()

In [None]:
# Define our LLM endpoint url and key saved in a .env file.

ENDPOINT = os.getenv("ENDPOINT_URL")
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")

headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY,
}

Let us define our prompt to send to the LLM.

In [None]:
with open("../example_data/nhs_history.txt", "r") as file:
    data = file.read()

prompt = f"""
You are going to provide a short one paragraph summary response given some DATA and a QUESTION. 

The DATA is: {data}

The QUESTION is: What key events occurred in the NHS between 1990 and the year 2000?

Only include in your response information directly from the DATA.
Only respond with the summary, do not include anything else in your response.

RESPONSE:
"""

Next, we build a payload, which packages the prompt together with sampling parameters such as the temperature in the format the API expects.

In [None]:
# Payload for the request
payload = {
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are an AI assistant that helps people find information."
        }
      ]
    },
    {
      "role": "user",
      "content": prompt
    }
  ],
  "temperature": 0.7,
  "top_p": 0.95,
  "max_tokens": 800
}
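
The same payload shape is reused for every request in this notebook, so it could be wrapped in a small function. The sketch below is one way to do that; `build_payload` is a hypothetical helper name, not part of the notebook or of any library, and its defaults simply mirror the values used above.

```python
def build_payload(system_text, user_text, temperature=0.7, top_p=0.95, max_tokens=800):
    """Hypothetical helper: assemble a chat payload shaped like the one above."""
    return {
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_text}]},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


# Illustrative inputs only; the notebook passes its own prompt as user_text.
example_payload = build_payload(
    "You are an AI assistant that helps people find information.",
    "Summarise the DATA in one paragraph.",
)
print(example_payload["temperature"])
```

A judge request would then only need a different system message and a lower temperature, for example `build_payload(judge_text, prompt_text, temperature=0.2)`.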


We can then send the request to the API and get the response.

In [None]:
# Send request
try:
    response = requests.post(ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
except requests.RequestException as e:
    raise SystemExit(f"Failed to make the request. Error: {e}")

In [None]:
json_response = response.json()
summary_1 = json_response["choices"][0]["message"]["content"]
print(summary_1)
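
The indexing above assumes the response always contains at least one choice with message content. A slightly more defensive version might look like the sketch below; `extract_content` is a hypothetical helper, and `fake_response` is a hand-written dictionary standing in for `response.json()`, not real model output.

```python
def extract_content(json_response):
    """Hypothetical helper: return the first choice's message text, or raise a clear error."""
    choices = json_response.get("choices") or []
    if not choices:
        raise ValueError(f"No choices in response: {json_response}")
    content = choices[0].get("message", {}).get("content")
    if content is None:
        raise ValueError("First choice has no message content")
    return content


# Hand-written stand-in for response.json(); not real model output.
fake_response = {"choices": [{"message": {"content": "A short summary."}}]}
print(extract_content(fake_response))
```

Raising a `ValueError` with the offending response makes an unexpected reply easier to diagnose than a bare `KeyError` or `IndexError`.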

For our second summary, we will make some changes.

- We will add a fact which does not appear in the data - "In 1993, doctors held a massive tea party in London to celebrate a successful knee transplant."
- We will add a spelling mistake. "syndrome" -> "sindrome"

In [None]:
summary_2 = "Between 1990 and 2000, key NHS events included the introduction of a vaccine against Haemophilus influenzae type B (Hib) in 1992, which combats childhood meningitis, and the world's first laser surgery on babies for twin-to-twin transfusion sindrome. In 1993, doctors held a massive tea party in London to celebrate a successful knee transplant. The NHS Organ Donor Register was established in 1994. In 1999, the UK became the first to use a vaccine against Group C meningococcal disease. In 2000, NHS walk-in centres were introduced to provide easy access to various services."

Let us save both summaries.

In [None]:
with open("../example_data/summary_1.txt", "w") as text_file:
    text_file.write(summary_1)
with open("../example_data/summary_2.txt", "w") as text_file:
    text_file.write(summary_2)

## 2 - Example 1: LLM as a Judge for a Single Summary

Let us check for the groundedness of a response.

Groundedness measures whether each sentence in the summary is supported by the reference data.

It is a metric that can be used in [Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview) and [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/quickstart-groundedness?tabs=curl).

In this example we will craft a prompt ourselves to better understand how we can measure groundedness.

We follow good prompting guidance by:
- Giving clear scoring criteria.
- Giving examples.

In [None]:
with open("../example_data/nhs_history.txt", "r") as file:
    data = file.read()
with open("../example_data/summary_1.txt", "r") as file:
    summary_1 = file.read()
with open("../example_data/summary_2.txt", "r") as file:
    summary_2 = file.read()

groundedness_prompt = """
You are a judge evaluating the groundedness of a summary based on a set of supporting documents. For a summary to be grounded, each sentence must have clear support within the provided documents. Please assess each sentence in the summary, assign a score between 0 and 5 based on how well it is grounded in the supporting documents, and explain your reasoning.

Scoring Criteria
0: No grounding; sentence has no support in the documents.
1-2: Partially grounded; weak or incomplete support, or indirect references.
3-4: Mostly grounded; strong support but some ambiguity or indirectness.
5: Fully grounded; sentence has direct, clear support in the documents.
Instructions
Analyze each sentence in the summary and determine if it is grounded in the supporting_documents.
Score each sentence from 0 to 5, referencing the relevant supporting document(s) as evidence.
Explain your reasoning for each score, identifying specific references in the documents.
Provide an overall groundedness score for the entire summary, based on the individual scores.
Format Your Response as Follows:

"{{
  "summary": "<summary text>",
  "supporting_documents": "[<list of supporting documents>]",
  "sentence_scores": [
    {{
      "sentence": "<sentence from summary>",
      "score": "<score from 0 to 5>",
      "justification": "<justification for score>"
    }},
    ...
  ],
  "overall_groundedness_score": "<average score>"
}}"

Example
Given the following summary and documents:

Summary: "The LLM is effective at summarization, with high precision on factual data."
Supporting Documents:
Document 1: "The LLM achieves high precision in tasks involving factual summarization."
Document 2: "In multiple benchmarks, the LLM demonstrated effective summarization."
Your response might look like this:

{{
  "summary": "The LLM is effective at summarization, with high precision on factual data.",
  "supporting_documents": [
    "The LLM achieves high precision in tasks involving factual summarization.",
    "In multiple benchmarks, the LLM demonstrated effective summarization."
  ],
  "sentence_scores": [
    {{
      "sentence": "The LLM is effective at summarization.",
      "score": 5,
      "justification": "This sentence is directly supported by Document 2."
    }},
    {{
      "sentence": "It has high precision on factual data.",
      "score": 4,
      "justification": "Mostly supported by Document 1, though precision could be more explicitly mentioned."
    }}
  ],
  "overall_groundedness_score": 4.5
}}
Your Task:
Now, please apply this approach to the provided summary and documents below.

Summary: {summary}
Supporting Documents: {data}
"""

Similarly to before, we format our prompt into a payload.
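
A note on the doubled braces in the judge prompt: because the prompt is filled in with `str.format`, literal braces in the JSON template have to be written as `{{` and `}}`, while single-brace fields such as `{summary}` are substituted. The toy template below (not the notebook's prompt) illustrates the behaviour.

```python
# Toy template (not the notebook's prompt) showing how str.format treats braces.
template = 'Reply as {{"score": <0-5>}}. Summary: {summary}'
filled = template.format(summary="example text")
print(filled)  # Reply as {"score": <0-5>}. Summary: example text
```

Substituted values are inserted as-is, so any braces inside the summary or the supporting data do not need escaping.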

You may notice the temperature has been reduced in comparison to when we generated the summary. This is typical for LLM as a Judge - we want the model to be less creative with its response.

In [None]:
# Payload for the request
payload = {
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are an AI assistant that acts as a judge."
        }
      ]
    },
    {
      "role": "user",
      "content": groundedness_prompt.format(summary=summary_1, data=data)
    }
  ],
  "temperature": 0.2,
  "top_p": 0.95,
  "max_tokens": 800
}

First, we send summary 1 to the LLM.

In [None]:
# Send request
try:
    response = requests.post(ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
except requests.RequestException as e:
    raise SystemExit(f"Failed to make the request. Error: {e}")

In [None]:
json_response = response.json()
summary_1_rating = json_response["choices"][0]["message"]["content"]
print(summary_1_rating)
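
Because the prompt asks for a JSON-formatted reply, the rating can in principle be parsed rather than only read by eye. The sketch below assumes the reply contains a single JSON object, possibly wrapped in quotes or surrounding text; the model is not guaranteed to return valid JSON, in which case `json.loads` will raise. `parse_judge_reply` and `mean_sentence_score` are hypothetical helper names, and `example_reply` is hand-written, not real model output.

```python
import json


def parse_judge_reply(raw):
    """Hypothetical helper: pull the outermost {...} object out of a judge reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in judge reply")
    return json.loads(raw[start:end + 1])


def mean_sentence_score(parsed):
    """Average the per-sentence scores; scores may arrive as numbers or strings."""
    scores = [float(item["score"]) for item in parsed["sentence_scores"]]
    if not scores:
        raise ValueError("Judge reply contains no sentence scores")
    return sum(scores) / len(scores)


# Hand-written reply for illustration only; not real model output.
example_reply = '''"{
  "sentence_scores": [
    {"sentence": "First sentence.", "score": 5, "justification": "Directly supported."},
    {"sentence": "Second sentence.", "score": "4", "justification": "Mostly supported."}
  ],
  "overall_groundedness_score": 4.5
}"'''

parsed = parse_judge_reply(example_reply)
print(mean_sentence_score(parsed))  # 4.5
```

Recomputing the mean from the per-sentence scores gives a simple check on the overall score the judge reports.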

Looking at the response, we see that the LLM has correctly identified that each sentence in the summary is supported by documentation.

Next, let us check summary 2.

**Remember, this summary has a sentence added which is not from the supporting documentation.** We expect this to be detected by the LLM. 

In [None]:
# Payload for the request
payload = {
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are an AI assistant that acts as a judge."
        }
      ]
    },
    {
      "role": "user",
      "content": groundedness_prompt.format(summary=summary_2, data=data)
    }
  ],
  "temperature": 0.2,
  "top_p": 0.95,
  "max_tokens": 800
}

# Send request
try:
    response = requests.post(ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
except requests.RequestException as e:
    raise SystemExit(f"Failed to make the request. Error: {e}")

In [None]:
json_response = response.json()
# Store the rating under its own name so the rating for summary 1 is not overwritten.
summary_2_rating = json_response["choices"][0]["message"]["content"]
print(summary_2_rating)

Looking at this response, we see that the LLM has correctly identified that one of the sentences is not supported in the documentation!
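
Rather than reading the reply by eye, the per-sentence scores could be filtered to list the sentences the judge considers poorly grounded. The sketch below is illustrative: `ungrounded_sentences` is a hypothetical helper, the threshold of 3 is an assumption based on the scoring criteria in the prompt (0 to 2 meaning no or partial grounding), and `example_scores` is hand-written rather than real model output.

```python
def ungrounded_sentences(sentence_scores, threshold=3):
    """Hypothetical helper: return sentences whose score falls below the threshold."""
    return [
        item["sentence"]
        for item in sentence_scores
        if float(item["score"]) < threshold
    ]


# Hand-written scores for illustration only; not real model output.
example_scores = [
    {"sentence": "The NHS Organ Donor Register was established in 1994.", "score": 5},
    {
        "sentence": "In 1993, doctors held a massive tea party in London "
                    "to celebrate a successful knee transplant.",
        "score": 0,
    },
]
print(ungrounded_sentences(example_scores))
```

With these example scores, only the tea party sentence is returned.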

## 3 - Example 2: LLM as a Judge for Comparisons

Pairwise comparison is a method of evaluating LLM responses by comparing two responses and choosing the better one. As above, where we chose a specific metric (groundedness), we can pick a specific metric for the pairwise comparison.

In this example, we will use fluency.

In [None]:
fluency_prompt = """
You are a judge evaluating the fluency of two summaries. Fluency here refers to how smoothly and naturally each summary reads, how well-constructed its sentences are, and whether it has a clear and cohesive flow. Please compare the two summaries below and select the one you find more fluent.

Instructions:
Analyze both summaries and consider which one reads more naturally and cohesively.
Choose the more fluent summary and provide a brief explanation if necessary.
Focus on sentence structure, clarity, and flow rather than content accuracy or completeness.
Format Your Response as Follows:
{{
  "more_fluent_summary": "<Summary A or Summary B>",
  "explanation": "<Optional explanation for why the selected summary is more fluent>"
}}
Summaries:
Summary A: {summary_A}
Summary B: {summary_B}
Based on fluency, which summary do you judge to be better?
"""

In [None]:
# Payload for the request
payload = {
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are an AI assistant that acts as a judge."
        }
      ]
    },
    {
      "role": "user",
      "content": fluency_prompt.format(summary_A=summary_1, summary_B=summary_2)
    }
  ],
  "temperature": 0.2,
  "top_p": 0.95,
  "max_tokens": 800
}

# Send request
try:
    response = requests.post(ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
except requests.RequestException as e:
    raise SystemExit(f"Failed to make the request. Error: {e}")

In [None]:
json_response = response.json()
# This reply is a pairwise fluency verdict, not a rating of summary 1.
fluency_verdict = json_response["choices"][0]["message"]["content"]
print(fluency_verdict)
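
Since the judge is asked to answer with "Summary A" or "Summary B", the verdict can be mapped back to the summaries it refers to. The sketch below assumes the reply contains one JSON object with a `more_fluent_summary` field, as requested in the prompt; `winning_summary` and the `labels` mapping are hypothetical, and `example_reply` is hand-written rather than real model output.

```python
import json


def winning_summary(raw_reply, labels):
    """Hypothetical helper: map the judge's verdict to one of our own labels."""
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in judge reply")
    verdict = json.loads(raw_reply[start:end + 1])["more_fluent_summary"]
    key = verdict.strip().lower()
    if key not in labels:
        raise ValueError(f"Unexpected verdict: {verdict!r}")
    return labels[key]


# Which variable was passed as Summary A and Summary B in the prompt.
labels = {"summary a": "summary_1", "summary b": "summary_2"}

# Hand-written reply for illustration only; not real model output.
example_reply = '{"more_fluent_summary": "Summary A", "explanation": "Reads more naturally."}'
print(winning_summary(example_reply, labels))  # summary_1
```

Rejecting any verdict outside the expected labels avoids silently counting a malformed reply as a vote.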

The LLM has identified summary A as being more fluent for the following reasons:

1 - It avoids the unrelated and somewhat informal detail found in Summary B about a 'massive tea party'

2 - It does not contain the misspelling found in Summary B.

In this small example, LLM as a Judge has worked well: both of the changes we introduced into summary 2 were picked up.

## 4 - Further Considerations

This notebook aims only to explain the theory behind LLM as a Judge.

Several points should be considered before using an LLM as a Judge for evaluation.

1 - What metrics are most appropriate? We have shown examples using fluency and groundedness. Other metrics include accuracy, relevance and cohesiveness. 

2 - Are there pre-existing frameworks that can help you? [DeepEval](https://github.com/confident-ai/deepeval) is one example of an evaluation framework which has a large number of LLM evaluation metrics you can use. 

3 - Have you followed best practice for prompting? Various [articles](https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG) offer guidance.