Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Evaluation and Structured Output

## Setup

Install the Python SDK.

In [1]:
!pip install -Uq "google-genai==1.7.0"


In [7]:
import functools

from google import genai
from google.genai import types
from IPython.display import Markdown, display

genai.__version__

'1.7.0'

### Set up your API key

To run the following cell, your API key must be stored in a [Kaggle secret](https://www.kaggle.com/discussions/product-feedback/114053) named `GOOGLE_API_KEY`.

If you don't already have an API key, you can grab one from [AI Studio](https://aistudio.google.com/app/apikey). You can find [detailed instructions in the docs](https://ai.google.dev/gemini-api/docs/api-key).

To make the key available through Kaggle secrets, choose `Secrets` from the `Add-ons` menu and follow the instructions to add your key or enable it for this notebook.

In [3]:
from kaggle_secrets import UserSecretsClient

client = genai.Client(api_key=UserSecretsClient().get_secret("GOOGLE_API_KEY"))
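If you run this notebook outside Kaggle, the `kaggle_secrets` module will not be importable. The sketch below shows one way to fall back to an environment variable in that case. The helper name `load_api_key` and the environment-variable fallback are assumptions made for illustration, not part of the original notebook.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
  """Return the API key from Kaggle secrets, else from the environment."""
  try:
    # This import only succeeds inside a Kaggle notebook.
    from kaggle_secrets import UserSecretsClient
    return UserSecretsClient().get_secret(name)
  except ImportError:
    # Assumed fallback for local runs: read an environment variable.
    key = os.environ.get(name)
    if not key:
      raise RuntimeError(f"Set the {name} environment variable.")
    return key
```

The returned value could then be passed as `api_key` when creating the client.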

### Automated retry

This codelab sends a lot of requests, so set up an automatic retry
that resends a request when the per-minute quota is reached.

In [4]:
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)
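To make the behaviour of a predicate-based retry concrete, here is a simplified stand-in written in plain Python. It is not how `google.api_core.retry.Retry` is implemented. `FakeAPIError`, `retry_when`, and the delay values are hypothetical and chosen only for illustration.

```python
import functools
import time


class FakeAPIError(Exception):
  """Hypothetical stand-in for genai.errors.APIError."""

  def __init__(self, code: int):
    super().__init__(f"API error {code}")
    self.code = code


def is_retriable(e: Exception) -> bool:
  """Retry only on quota (429) and unavailable (503) errors."""
  return isinstance(e, FakeAPIError) and e.code in {429, 503}


def retry_when(predicate, max_attempts=5, initial_delay=0.01, multiplier=2.0):
  """Retry the wrapped function while predicate(error) is true."""
  def decorator(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
      delay = initial_delay
      for attempt in range(1, max_attempts + 1):
        try:
          return fn(*args, **kwargs)
        except Exception as e:
          # Give up on non-retriable errors or on the last attempt.
          if not predicate(e) or attempt == max_attempts:
            raise
          time.sleep(delay)
          delay *= multiplier  # Exponential backoff between attempts.
    return wrapper
  return decorator


calls = {"count": 0}


@retry_when(is_retriable)
def flaky_request() -> str:
  """Fail twice with a quota error, then succeed."""
  calls["count"] += 1
  if calls["count"] < 3:
    raise FakeAPIError(429)
  return "ok"


result = flaky_request()
print(result, calls["count"])
```

`functools.wraps` sets a `__wrapped__` attribute on the wrapper. The `hasattr(..., '__wrapped__')` guard in the cell above presumably relies on the same kind of attribute to avoid wrapping `generate_content` twice when the cell is re-run.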

## Evaluation

When using LLMs in real-world cases, it's important to understand how well they are performing. The open-ended generation capabilities of LLMs can make many cases difficult to measure. In this notebook you will walk through some simple techniques for evaluating LLM outputs and understanding their performance.

For this example, you'll evaluate a summarisation task using the [Gemini 1.5 Pro technical report](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf). Start by downloading the PDF to the notebook environment, and uploading that copy for use with the Gemini API.

In [5]:
!wget -nv -O gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

document_file = client.files.upload(file='gemini.pdf')

2025-12-20 11:37:29 URL:https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf [7228817/7228817] -> "gemini.pdf" [1]


### Summarise a document

The summarisation request used here is fairly basic. It targets the training content specifically but provides no guidance otherwise.

In [7]:
request = 'Tell me about the training process used here.'

def summarise_doc(request: str) -> str:
  """Execute the request on the uploaded document."""
  # Set the temperature low to stabilise the output.
  config = types.GenerateContentConfig(temperature=0.0)
  response = client.models.generate_content(
      model='gemini-2.5-flash',
      config=config,
      contents=[request, document_file],
  )

  return response.text

summary = summarise_doc(request)
Markdown(summary)

Based on the provided document, the training process for Gemini 1.5 Pro can be summarized as follows:

1.  **Model Architecture:** Gemini 1.5 Pro is a **sparse Mixture-of-Experts (MoE) Transformer-based model**. This architecture builds upon the research advances and multimodal capabilities of Gemini 1.0 models. MoE models use a learned routing function to direct inputs to a subset of the model's parameters, allowing for a large total parameter count while keeping the activated parameters constant for any given input, which contributes to its compute efficiency.

2.  **Hardware and Software Infrastructure:**
    *   It is trained on **multiple 4096-chip pods of Google's TPUv4 accelerators**, distributed across multiple datacenters.
    *   The training leverages **JAX** (a numerical computing library) and **ML Pathways** (Google's effort to build generalizable AI systems). JAX, powered by XLA, and including the GSPMD partitioner, enables automatic parallelization for efficient training on large models.

3.  **Model Initialization:** The model is trained from a **random initialization**.

4.  **Training Data and Phases:**
    *   **Pre-training:** Gemini 1.5 Pro is pre-trained on a **variety of multimodal and multilingual data**. This dataset includes content sourced across many different domains, such as web documents, code, and incorporates image, audio, and video content.
    *   **Instruction-tuning (Fine-tuning):** After pre-training, the model undergoes an instruction-tuning phase. It is fine-tuned on a collection of **multimodal data containing paired instructions and corresponding desired responses**.
    *   **Human Preference Tuning:** Further tuning is performed based on **human preference data**.
    *   **Safety Mitigations:** For responsible deployment and safety, the model incorporates mitigations through **supervised fine-tuning (SFT)** and **reinforcement learning through human feedback (RLHF)** using a reward model. This safety training specifically targets adversarial or "harm-inducing" queries. A significant update in 1.5 Pro's mitigation is the incorporation of new image-to-text SFT data.

5.  **Efficiency:** A key aspect of Gemini 1.5 Pro is its **high compute efficiency** for both training and serving, allowing it to achieve performance comparable to or surpassing Gemini 1.0 Ultra while requiring significantly less training compute.

In essence, Gemini 1.5 Pro is a large, multimodal MoE Transformer model, trained from scratch on a vast and diverse dataset using Google's advanced TPU infrastructure, and then fine-tuned with instruction data and human feedback, including specific safety mitigations.

### Define an evaluator

For a task like this, you may wish to evaluate a number of aspects, like how well the model followed the prompt ("instruction following"), whether the response is supported by the data in the prompt ("groundedness"), how easy the text is to read ("fluency"), or other factors like "verbosity" or "quality".

You can instruct an LLM to perform these tasks in a similar manner to how you would instruct a human rater: with a clear definition and [assessment rubric](https://en.wikipedia.org/wiki/Rubric_%28academic%29).

In this step, you define an evaluation agent using a pre-written "summarisation" prompt and use it to gauge the quality of the generated summary.

Note: For more pre-written evaluation prompts covering groundedness, safety, coherence and more, check out this [comprehensive list of model-based evaluation prompts](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates) from the Google Cloud docs.

In [8]:
import enum

# Define the evaluation prompt
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


def eval_summary(prompt, ai_response):
  """Evaluate the generated summary against the prompt used."""

  chat = client.chats.create(model='gemini-2.5-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=SummaryRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


text_eval, struct_eval = eval_summary(prompt=[request, document_file], ai_response=summary)
Markdown(text_eval)

STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.

*   **Instruction following**: The user asked "Tell me about the training process used here." The response clearly focuses on the training process of Gemini 1.5 Pro, as detailed in the document. It breaks down the process into logical steps like architecture, infrastructure, initialization, data and phases, and efficiency. This directly answers the prompt.
*   **Groundedness**: The response's content about model architecture, hardware, software, initialization, training data (multimodal, multilingual, web documents, code, image, audio, video), instruction-tuning, human preference tuning, and safety mitigations (SFT, RLHF, reward model, image-to-text SFT) are all directly supported by the provided PDF document (e.g., pages 3-6 discuss these details extensively). There is no outside information.
*   **Conciseness**: The response summarizes the key aspects of the training process effectively. It extracts the most relevant details without being overly verbose or omitting critical information. The bulleted format and clear headings help in presenting the information concisely.
*   **Fluency**: The response is well-organized, uses clear language, and flows logically. The headings make it easy to read and understand the different components of the training process. The introductory and concluding sentences also contribute to its fluency.

STEP 2: Score based on the rubric.
The summary follows all instructions, is completely grounded in the document, is concise, and very fluent.

**Rating:** 5

In this example, the model generated a textual justification within a chat context. This full text response is useful both for human interpretation and for giving the model a place to "collect notes" while it assesses the text and produces a final score. This "note taking" or "thinking" strategy typically works well with auto-regressive models, where the generated text is passed back into the model at each generation step. This means the working "notes" are used when generating the final result.

In the next turn, the model converts the text output into a structured response. If you want to aggregate scores or use them programmatically, you should avoid parsing the unstructured text output. Here the `SummaryRating` schema is passed, so the model converts the chat history into an instance of the `SummaryRating` enum.
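As a rough illustration of why parsing the free-text evaluation is fragile, the sketch below applies a naive regular expression to three final-score lines in the formats the evaluators in this notebook produce. The pattern and the helper `parse_rating` are hypothetical. The enum is repeated so that the block runs on its own.

```python
import enum
import re


class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


# Final lines in the three formats seen in this notebook's evaluator outputs.
final_lines = [
    "**Rating:** 5",
    "**Rating: 1 (Very bad)**",
    "**Score: 5**",
]

# A naive pattern written against the first format only.
naive_pattern = re.compile(r"\*\*Rating:\*\* (\d)")


def parse_rating(line: str):
  """Return the digit captured by the naive pattern, or None."""
  match = naive_pattern.search(line)
  return match.group(1) if match else None


parsed = [parse_rating(line) for line in final_lines]
print(parsed)

# A structured enum value, by contrast, converts directly to a number.
rating = SummaryRating('5')
print(rating.name, int(rating.value))
```

Only the first line matches the pattern, whereas the enum value can be aggregated without any text parsing.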

In [9]:
struct_eval

<SummaryRating.VERY_GOOD: '5'>

### Make the summary prompt better or worse

Gemini models tend to be quite good at tasks like direct summarisation without much prompting, so you should expect to see a result like `GOOD` or `VERY_GOOD` on the previous task, even with a rudimentary prompt. Run it a few times to get a feel for the average response.

To explore how to influence the summarisation output, consider what you might change in the summary request prompt to change the result. Take a look at the evaluation `SUMMARY_PROMPT` for some ideas.

Try the following tweaks and see how they positively or negatively change the result:
* Be specific with the size of the summary,
* Request specific information,
* Ask about information that is not in the document,
* Ask for different degrees of summarisation (such as "explain like I'm 5" or "with full technical depth")

In [10]:
new_prompt = "Explain like I'm 5 the training process"
# Try:
#  ELI5 the training process
#  Summarise the needle/haystack evaluation technique in 1 line
#  Describe the model architecture to someone with a civil engineering degree
#  What is the best LLM?

if not new_prompt:
  raise ValueError("Try setting a new summarisation prompt.")


def run_and_eval_summary(prompt):
  """Generate and evaluate the summary using the new prompt."""
  # Use the `prompt` argument rather than the global `new_prompt`.
  summary = summarise_doc(prompt)
  display(Markdown(summary + '\n-----'))

  text, struct = eval_summary([prompt, document_file], summary)
  display(Markdown(text + '\n-----'))
  print(struct)

run_and_eval_summary(new_prompt)

Okay, imagine Gemini 1.5 Pro is a super-duper smart robot brain! To make it so smart, we have to teach it, just like you go to school.

Here's how we teach our robot brain:

1.  **Show it EVERYTHING! (Pre-training):**
    *   First, we show it *all* the books in the world, *all* the pictures, *all* the songs, and *all* the movies. It's like giving it a giant library and a huge TV with endless shows!
    *   It looks at all these things and tries to understand how they work together. It learns what words mean, what sounds go with what pictures, and how stories are put together.
    *   This happens on super-duper fast computers, like thousands of them, all working together in big rooms!

2.  **Give it Special Lessons (Fine-tuning):**
    *   After it's learned a lot from everything, we give it special homework. We show it examples of questions and good answers, or tell it to translate a new language by giving it a grammar book.
    *   This is like when your teacher gives you practice problems to make sure you really understand. We even have people tell the robot brain if its answers are good or if they need to be better.

3.  **Teach it to be Kind and Safe (Safety Training):**
    *   And because we want Gemini to be a good helper, we also teach it to be *kind* and *safe*. We show it what's a good thing to say and what's not, just like your parents teach you to use your words nicely and not to say bad things.

So, after all this teaching, our robot brain, Gemini 1.5 Pro, can understand really long stories, watch whole movies, and even learn a brand new language just by reading about it, like magic!
-----

**STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and fluency.**

*   **Instruction Following**: The response successfully explains "like I'm 5," adopting a child-friendly tone and using simple analogies. This aspect of the instruction is very well-handled. However, the user also provided a PDF file, which is implied to be the source context for "the training process." The response provides a generic explanation of large language model training (pre-training, fine-tuning, safety training) which appears to be drawn from general knowledge rather than specifically from the provided document. Therefore, it fails to fully follow the instruction to use the provided context.
*   **Groundedness**: The core issue is with groundedness. The evaluation criteria state: "The response contains information included only in the context. The response does not reference any outside information." In this prompt, the provided PDF file is the context. The AI's explanation of Gemini 1.5 Pro's training process is a high-level, generic overview of LLM training, not specific details that would necessarily be unique to or extracted from the given PDF. It seems to have relied on its pre-trained knowledge rather than processing the file. Thus, it references "outside information" (its general knowledge) and is not grounded in the provided context (the file).
*   **Conciseness**: The explanation, in terms of its content, is concise and well-paced for the "explain like I'm 5" audience.
*   **Fluency**: The response is very well-written, engaging, and easy to read, with excellent fluency.

**STEP 2: Score based on the rubric.**

The primary failure is in **groundedness** because the response does not appear to use the provided PDF file as its source for explaining "the training process." It relies on general knowledge, which counts as "outside information" when a specific context is given. According to the rubric, if a summary is not grounded, it receives a very low score.

The rubric states:
*   1: (Very bad). The summary is not grounded.

This directly applies. While the "explain like I'm 5" aspect and fluency are excellent, the fundamental requirement to ground the response in the provided context (the file) has been missed.

**Rating: 1 (Very bad)**
-----

SummaryRating.VERY_BAD
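It is worth checking what the evaluator actually receives. `eval_summary` builds its message with `str.format`, which converts every argument to a string, so a list such as `[new_prompt, document_file]` is rendered as text rather than attached as a file. Here is a minimal sketch, using a hypothetical `FakeFile` class in place of the uploaded file and a shortened template:

```python
class FakeFile:
  """Hypothetical stand-in for the uploaded file handle."""

  def __init__(self, name: str):
    self.name = name

  def __repr__(self) -> str:
    return f"FakeFile(name={self.name!r})"


# Shortened version of the evaluation template, for illustration only.
TEMPLATE = "### Prompt\n{prompt}\n\n## AI-generated Response\n{response}\n"

rendered = TEMPLATE.format(
    prompt=["Explain like I'm 5 the training process", FakeFile("files/example")],
    response="(summary text)",
)
print(rendered)
```

The rendered prompt contains the printed form of the list, including the file object's `repr`, but none of the document's contents. If the evaluator only sees this description, it cannot check the summary against the PDF, which would be consistent with the groundedness complaint in the output above. One option, assuming `chat.send_message` accepts a list of parts in your SDK version, is to send the formatted prompt and `document_file` as separate list items.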


## Evaluating in practice

Evaluation has many practical uses, for example:
* You can quickly iterate on a prompt with a small set of test documents.
* You can compare different models to find what works best for your needs, such as finding the trade-off between price and performance, or finding the best performance for a specific task.
* When pushing changes to a model or prompt in a production system, you can verify that the system does not regress in quality.

In this section you will try two different evaluation approaches.

### Pointwise evaluation

The technique used above, where you evaluate a single input/output pair against some criteria, is known as pointwise evaluation. This is useful for evaluating singular outputs in an absolute sense, such as "was it good or bad?"

In this exercise, you will try different guidance prompts with a set of questions.

In [9]:
# Try these instructions, or edit and add your own.
terse_guidance = "Answer the following question in a single sentence, or as close to that as possible."
moderate_guidance = "Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question."
cited_guidance = "Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible."
guidance_options = {
    'Terse': terse_guidance,
    'Moderate': moderate_guidance,
    'Cited': cited_guidance,
}

questions = [
    # Un-comment one or more questions to try here, or add your own.
    # Evaluating more questions will take more time, but produces results
    # with higher confidence. In a production system, you may have hundreds
    # of questions to evaluate a complex system.

    # "What metric(s) are used to evaluate long context performance?",
    "How does the model perform on code tasks?",
    "How many layers does it have?",
    # "Why is it called Gemini?",
]

if not questions:
  raise NotImplementedError('Add some questions to evaluate!')


@functools.cache
def answer_question(question: str, guidance: str = '') -> str:
  """Generate an answer to the question using the uploaded document and guidance."""
  config = types.GenerateContentConfig(
      temperature=0.0,
      system_instruction=guidance,
  )
  response = client.models.generate_content(
      model='gemini-2.5-flash',
      config=config,
      contents=[question, document_file],
  )

  return response.text

In [11]:
answer = answer_question(questions[0], terse_guidance)
Markdown(answer)

Gemini 1.5 Pro is the best performing model in code within the Gemini family to date, showing an overall improvement of 0.2% over Gemini 1.0 Ultra on coding benchmarks and specifically surpassing it on Natural2Code.

Now set up a question-answering evaluator, much like before, but using the [pointwise QA evaluation prompt](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates#pointwise_question_answering_quality).

In [18]:
import enum

QA_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user prompt and an AI-generated response.
You should first read the user prompt carefully for analyzing the task, and then evaluate the quality of the responses based on the criteria and rules provided in the Evaluation section below.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

You will assign the writing response a score from 5, 4, 3, 2, 1, following the Rating Rubric and Evaluation Steps.
Give step-by-step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.

## Criteria Definition
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

class AnswerRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


@functools.cache
def eval_answer(prompt, ai_response, n=1):
  """Evaluate the generated answer against the prompt/question used."""
  chat = client.chats.create(model='gemini-2.5-flash-lite')

  # Generate the full text response.
  response = chat.send_message(
      message=QA_PROMPT.format(prompt=[prompt, document_file], response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval

In [12]:
text_eval, struct_eval = eval_answer(prompt=questions[0], ai_response=answer)
display(Markdown(text_eval))
print(struct_eval)

**STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.**

*   **Instruction following:** The user asked "How does the model perform on code tasks?". The response directly addresses this by stating that Gemini 1.5 Pro is the best-performing model in code within its family and provides specific details about its improvement on coding benchmarks. This fully follows the instruction.
*   **Groundedness:** (Assuming the AI model correctly extracted this information from the provided PDF file). The response provides specific data points like "0.2% overall improvement" and "surpassing it on Natural2Code." For the purpose of this evaluation, it is assumed this information is directly from the content of the `pgbmgrxapzqv` PDF file, making it grounded.
*   **Completeness:** The response provides a clear and concise summary of the model's performance on code tasks, including its relative standing and specific metrics. It answers the question completely and with sufficient detail.
*   **Fluency:** The response is well-organized, grammatically correct, and easy to read and understand.

**STEP 2: Score based on the rubric.**
The answer follows instructions, is complete, fluent, and assumed to be grounded in the provided document.

**Score: 5**

AnswerRating.VERY_GOOD


Now run the evaluation task in a loop. Note that the guidance instruction is hidden from the evaluation agent. If you passed the guidance prompt, the model would score based on whether the answer followed that guidance, but for this task the goal is to find the best overall result based on the user's question, not the developer's instruction.

In [10]:
import collections
import itertools

# Number of times to repeat each task in order to reduce error and calculate an average.
# Increasing it takes longer but gives better results; try 2 or 3 to start.
NUM_ITERATIONS = 1

scores = collections.defaultdict(int)
responses = collections.defaultdict(list)

for question in questions:
  display(Markdown(f'## {question}'))
  for guidance, guide_prompt in guidance_options.items():

    for n in range(NUM_ITERATIONS):
      # Generate a response.
      answer = answer_question(question, guide_prompt)

      # Evaluate the response (note that the guidance prompt is not passed).
      written_eval, struct_eval = eval_answer(question, answer, n)
      print(f'{guidance}: {struct_eval}')

      # Save the numeric score.
      scores[guidance] += int(struct_eval.value)

      # Save the responses, in case you wish to inspect them.
      responses[(guidance, question)].append((answer, written_eval))


## How does the model perform on code tasks?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.GOOD


## How many layers does it have?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD
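Both `answer_question` and `eval_answer` are wrapped in `functools.cache`, which returns a stored result when a function is called again with the same arguments. The sketch below uses hypothetical stand-ins (`fake_answer`, `fake_eval`) that make no API calls, to show the effect inside the loop: the answer is generated once per question and guidance pair, while the extra `n` argument gives each iteration a distinct cache key so the evaluation runs again.

```python
import functools

call_log = []


@functools.cache
def fake_answer(question: str, guidance: str = "") -> str:
  """Hypothetical stand-in for answer_question; records each real call."""
  call_log.append(("answer", question, guidance))
  return f"answer to {question!r}"


@functools.cache
def fake_eval(prompt: str, ai_response: str, n: int = 1) -> str:
  """Hypothetical stand-in for eval_answer; `n` only varies the cache key."""
  call_log.append(("eval", prompt, n))
  return f"evaluation #{n}"


for n in range(3):
  answer = fake_answer("How many layers does it have?", "Terse")
  fake_eval("How many layers does it have?", answer, n)

answer_calls = [c for c in call_log if c[0] == "answer"]
eval_calls = [c for c in call_log if c[0] == "eval"]
print(len(answer_calls), len(eval_calls))
```

With this setup, raising `NUM_ITERATIONS` repeats the evaluation of a single cached answer rather than generating new answers.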


Now aggregate the scores to see how each prompt performed.

In [11]:
for guidance, score in scores.items():
  avg_score = score / (NUM_ITERATIONS * len(questions))
  nearest = AnswerRating(str(round(avg_score)))
  print(f'{guidance}: {avg_score:.2f} - {nearest.name}')

Terse: 5.00 - VERY_GOOD
Moderate: 5.00 - VERY_GOOD
Cited: 4.50 - GOOD
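Note that an average of 4.50 is labelled `GOOD` rather than `VERY_GOOD`. Python's built-in `round()` rounds exact halves to the nearest even integer, so `round(4.5)` is 4. The sketch below compares this with a round-half-up helper. `round_half_up` is a hypothetical name, and the averages are copied from the output above.

```python
import math


def round_half_up(x: float) -> int:
  """Round to the nearest integer, with exact halves going up."""
  return math.floor(x + 0.5)


avg_scores = {"Terse": 5.0, "Moderate": 5.0, "Cited": 4.5}

builtin = {name: round(score) for name, score in avg_scores.items()}
half_up = {name: round_half_up(score) for name, score in avg_scores.items()}

print(builtin)
print(half_up)
```

Which behaviour you want is a design choice. The important point is that the nearest-label mapping can hide a half-point difference, so keep the numeric average alongside the label.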


### Pairwise evaluation

The pointwise evaluation prompt used in the previous step has 5 levels of grading in the output. This may be too coarse for your system, or perhaps you wish to improve on a prompt that is already "very good".

Another approach to evaluation is to compare two outputs against each other. This is pairwise evaluation. Comparing two items is the key step in ranking and sorting algorithms, so you can use pairwise evaluation to rank your prompts instead of, or in addition to, the pointwise approach.
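To illustrate how a pairwise comparison can drive a ranking, the sketch below sorts three guidance prompts with `functools.cmp_to_key`. The `judge` function and the `hidden_quality` numbers are hypothetical placeholders for an LLM-based comparison. They are not results from this notebook.

```python
import functools

# Made-up quality values that stand in for an LLM judge's preferences.
hidden_quality = {"Terse": 1, "Moderate": 3, "Cited": 2}


def judge(a: str, b: str) -> str:
  """Hypothetical pairwise judge: returns 'A', 'SAME' or 'B'."""
  if hidden_quality[a] > hidden_quality[b]:
    return "A"
  if hidden_quality[a] < hidden_quality[b]:
    return "B"
  return "SAME"


def compare(a: str, b: str) -> int:
  """Adapt the judge's verdict to the -1/0/1 convention used by sorting."""
  return {"A": 1, "SAME": 0, "B": -1}[judge(a, b)]


# Sort best-first using only pairwise comparisons.
ranked = sorted(hidden_quality, key=functools.cmp_to_key(compare), reverse=True)
print(ranked)
```

A real LLM judge may give inconsistent verdicts (for example, preferring A over B, B over C, and C over A), so a ranking produced this way should be treated as approximate.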

This step implements pairwise evaluation using the [pairwise QA quality prompt](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates#pairwise_question_answering_quality) from the Google Cloud docs.

In [17]:
QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Responses A and B answer the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

## AI-generated Responses

### Response A
{baseline_model_response}

### Response B
{response}
"""


class AnswerComparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'


@functools.cache
def eval_pairwise(prompt, response_a, response_b, n=1):
  """Determine the better of two answers to the same prompt."""

  chat = client.chats.create(model='gemini-2.5-flash-lite')

  # Generate the full text response.
  response = chat.send_message(
      message=QA_PAIRWISE_PROMPT.format(
          prompt=[prompt, document_file],
          baseline_model_response=response_a,
          response=response_b)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerComparison,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval

In [13]:
question = questions[0]
answer_a = answer_question(question, terse_guidance)
answer_b = answer_question(question, cited_guidance)

text_eval, struct_eval = eval_pairwise(
    prompt=question,
    response_a=answer_a,
    response_b=answer_b,
)

display(Markdown(text_eval))
print(struct_eval)

STEP 1: Analyze Response A based on the question answering quality criteria:
- **Instruction following:** The response directly answers the question "How does the model perform on code tasks?" It provides a summary of the model's performance in code.
- **Groundedness:** The information provided aligns with the document. For instance, "best performing model in code to date," "surpassing Gemini 1.0 Ultra on Natural2Code," and "ability to ingest and reason over large codebases" are all statements found within the provided PDF.
- **Completeness:** The response is very concise. While accurate, it provides only a high-level summary and lacks specific details, benchmarks, or examples that would give a more comprehensive understanding of the model's performance on code tasks. It doesn't go into quantitative metrics or specific use cases.
- **Fluency:** The response is well-written, clear, and easy to read.

STEP 2: Analyze Response B based on the question answering quality criteria:
- **Instruction following:** The response thoroughly answers the question, providing extensive details and breaking down the performance into several categories.
- **Groundedness:** The response is exceptionally well-grounded, citing specific page numbers, table numbers, figures, and exact percentages/metrics from the provided PDF document for almost every claim. Examples include "best performing model in code to date" (p. 22), "8.9% improvement" (Table 7, p. 20), details on JAX/Flax codebases (p. 5, p. 4), NLL details (Figure 6, p. 8), Natural2Code and HumanEval scores (Table 8, p. 21), and MATH dataset performance with code (Table 19, p. 58). It also explains the background of some metrics and benchmarks, which is helpful context.
- **Completeness:** The response is highly complete. It covers overall performance, long-context understanding, core coding benchmarks (Natural2Code, HumanEval with caveats), advanced code-related tasks (math with code), and even relevant architectural details (MoE Transformer, training data). It provides specific numbers, comparisons, and illustrative examples, offering a deep and comprehensive view of the model's performance on code tasks.
- **Fluency:** The response is very well-organized with clear headings, sub-headings, and bullet points, making a large amount of detailed information very easy to read and understand. The language is clear and professional. The added "Background on..." sections enhance readability and comprehension.

STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
Response B is significantly better than Response A. While Response A provides a correct high-level summary, Response B delivers a far more detailed, comprehensive, and well-grounded answer. Response B effectively uses the document to provide specific data points, comparisons, examples, and even contextual explanations, which fully addresses the prompt's intent to understand "how" the model performs. Its structure and depth make it a superior answer in terms of completeness and information value.

STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
B

STEP 5: Output your assessment reasoning in the explanation field.
Response B is vastly superior to Response A.
1.  **Completeness and Detail:** Response B provides a comprehensive, detailed breakdown of the model's performance on code tasks, covering overall performance, long-context understanding, specific benchmarks (Natural2Code, HumanEval), advanced tasks (Math with code), and even architectural implications. It includes specific percentages, comparisons with other models, and concrete examples from the document. Response A, in contrast, offers only a very brief, high-level summary.
2.  **Groundedness:** Response B demonstrates excellent groundedness by consistently citing specific page numbers, table numbers, and figures from the document for nearly every piece of information presented. This makes its claims highly verifiable and trustworthy. Response A's claims are grounded but lack specific references to demonstrate this.
3.  **Clarity and Structure:** Response B is exceptionally well-organized with clear headings, bullet points, and explanatory "Background on..." sections that make the complex information easy to digest and understand. Response A is fluent but lacks the structured depth of Response B.
4.  **Information Value:** Response B goes beyond just stating facts by providing context and explanations (e.g., on NLL, HumanEval leakage, SymPy/SciPy), significantly enhancing the user's understanding of the model's capabilities and the benchmarks used.

For these reasons, Response B is a far better answer to the prompt.

AnswerComparison.A


With a pairwise evaluator in place, the only thing required to rank prompts against each other is a comparator.

This example implements the minimal comparators required for total ordering (`==` and `<`) and performs each comparison using `n_comparisons` evaluations per question over the set of `questions`.
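
To see why `==` and `<` are enough, here is a minimal sketch of `functools.total_ordering` on its own. The `Score` class is a hypothetical stand-in that compares plain numbers instead of calling a model; it is not part of this codelab.

```python
import functools


@functools.total_ordering
class Score:
  """Hypothetical stand-in for a prompt, ordered by a plain number."""

  def __init__(self, value):
    self.value = value

  def __eq__(self, other):
    if not isinstance(other, Score):
      return NotImplemented
    return self.value == other.value

  def __lt__(self, other):
    if not isinstance(other, Score):
      return NotImplemented
    return self.value < other.value


low, high = Score(1), Score(2)

# `>` and `>=` are never defined above; total_ordering derives them.
derived_gt = high > low
derived_ge = high >= low

# sorted() only needs `<`, so it works on these instances directly.
ranked = [s.value for s in sorted([low, high], reverse=True)]
print(derived_gt, derived_ge, ranked)
```

The same pattern applies when the comparison is an LLM evaluation rather than a numeric check, although every comparison then costs one or more model calls.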

In [19]:
@functools.total_ordering
class QAGuidancePrompt:
  """A question-answering guidance prompt or system instruction."""

  def __init__(self, prompt, questions, n_comparisons=NUM_ITERATIONS):
    """Create the prompt. Provide questions to evaluate against, and number of evals to perform."""
    self.prompt = prompt
    self.questions = questions
    self.n = n_comparisons

  def __str__(self):
    return self.prompt

  def _compare_all(self, other):
    """Compare two prompts on all questions over n trials."""
    results = [self._compare_n(other, q) for q in self.questions]
    mean = sum(results) / len(results)
    return round(mean)

  def _compare_n(self, other, question):
    """Compare two prompts on a question over n trials."""
    results = [self._compare(other, question, n) for n in range(self.n)]
    mean = sum(results) / len(results)
    return mean

  def _compare(self, other, question, n=1):
    """Compare two prompts on a single question."""
    answer_a = answer_question(question, self.prompt)
    answer_b = answer_question(question, other.prompt)

    _, result = eval_pairwise(
        prompt=question,
        response_a=answer_a,
        response_b=answer_b,
        n=n,  # Cache buster
    )
    # print(f'q[{question}], a[{self.prompt[:20]}...], b[{other.prompt[:20]}...]: {result}')

    # Convert the enum to the standard Python numeric comparison values.
    if result is AnswerComparison.A:
      return 1
    elif result is AnswerComparison.B:
      return -1
    else:
      return 0

  def __eq__(self, other):
    """Equality check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) == 0

  def __lt__(self, other):
    """Ordering check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) < 0


Now Python's sorting functions will "just work" on any `QAGuidancePrompt` instances. The `answer_question` and `eval_pairwise` functions are [memoized](https://en.wikipedia.org/wiki/Memoization) to avoid unnecessarily regenerating the same answers or evaluations, so you should see this complete quickly unless you have changed the questions, prompts or number of iterations from the earlier steps.
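
The sketch below illustrates how memoization interacts with the extra `n` argument, assuming the same `functools.cache` decorator used above. `fake_eval` and `call_count` are hypothetical names standing in for a real model call; no request is sent.

```python
import functools

call_count = 0


@functools.cache
def fake_eval(prompt, n=1):
  """Hypothetical stand-in for an expensive evaluation call."""
  global call_count
  call_count += 1
  return f'eval of {prompt!r}'


fake_eval('q1', n=0)
fake_eval('q1', n=0)  # Same arguments: served from the cache.
fake_eval('q1', n=1)  # Different n: the cache misses and a fresh call runs.

print(call_count)
```

Because `n` is part of the cache key, each trial index produces its own evaluation, while repeating the whole run with the same inputs reuses the stored results.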

In [20]:
terse_prompt = QAGuidancePrompt(terse_guidance, questions)
moderate_prompt = QAGuidancePrompt(moderate_guidance, questions)
cited_prompt = QAGuidancePrompt(cited_guidance, questions)

# Sort in reverse order, so that best is first
sorted_results = sorted([terse_prompt, moderate_prompt, cited_prompt], reverse=True)
for i, p in enumerate(sorted_results):
  if i:
    print('---')

  print(f'#{i+1}: {p}')

#1: Answer the following question in a single sentence, or as close to that as possible.
---
#2: Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.
---
#3: Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.


## Challenges

### LLM limitations

LLMs are known to struggle with certain tasks, and these limitations persist when using LLMs as evaluators. For example, LLMs can struggle to count the number of characters in a word (this is a numerical problem, not a language problem), so an LLM evaluator will not be able to accurately evaluate this type of task. There are solutions available in some cases, such as connecting tools to handle problems unsuitable for a language model, but it's important that you understand possible limitations and include human evaluators to calibrate your evaluation system and determine a baseline.

One reason that LLM evaluators work well is that all of the information they need is available in the input context, so the model only needs to attend to that information to produce the result. When customising evaluation prompts, or building your own systems, keep this in mind and ensure that you are not relying on "internal knowledge" from the model, or on behaviour that would be better provided by a tool.
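
As a minimal sketch of the tool idea, the character-counting case can be checked deterministically in plain Python instead of asking a model. The helper names `count_char` and `check_count_claim` are hypothetical and only illustrate the approach.

```python
def count_char(word, char):
  """Count case-insensitive occurrences of a single character in a word."""
  return word.lower().count(char.lower())


def check_count_claim(word, char, claimed):
  """Verify a claimed count with code rather than with an LLM evaluator."""
  return count_char(word, char) == claimed


# A response claiming that 'strawberry' contains 3 'r' characters.
claim_is_correct = check_count_claim('strawberry', 'r', 3)
print(claim_is_correct)
```

A check like this could be exposed to the evaluator as a tool, or run alongside it, so that the model is only asked to judge what it can assess from the context.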

### Improving confidence

One way to improve the confidence of your evaluations is to include a diverse set of evaluators. That is, use the same prompts and outputs, but execute them on different models, like Gemini Flash and Pro, or even across different providers, like Gemini, Claude, ChatGPT and local models like Gemma or Qwen. This follows the same idea used earlier, where repeating trials to gather multiple "opinions" helps to [reduce error](https://en.wikipedia.org/wiki/Law_of_large_numbers), except that with different models the "opinions" will be more diverse.
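
One simple way to combine verdicts from several evaluators is a majority vote. The sketch below assumes each evaluator has already returned `'A'`, `'SAME'` or `'B'`; the evaluator names and verdicts are made-up placeholders for illustration, not results from this codelab.

```python
import collections


def majority_verdict(verdicts):
  """Return the most common verdict, or 'SAME' when the top two tie."""
  counts = collections.Counter(verdicts).most_common()
  if not counts:
    raise ValueError('At least one verdict is required.')
  if len(counts) > 1 and counts[0][1] == counts[1][1]:
    return 'SAME'
  return counts[0][0]


# Hypothetical verdicts from three different evaluator models.
verdicts = {'evaluator_1': 'B', 'evaluator_2': 'B', 'evaluator_3': 'A'}
combined = majority_verdict(verdicts.values())
print(combined)
```

Treating a tie as `'SAME'` is a design choice for this sketch; you might instead run extra trials or escalate disagreements to a human reviewer.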