In [2]:
!pip install -Uq "google-genai==1.7.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-adk 1.18.0 requires google-genai<2.0.0,>=1.45.0, but you have google-genai 1.7.0 which is incompatible.
google-cloud-aiplatform 1.125.0 requires google-genai<2.0.0,>=1.37.0, but you have google-genai 1.7.0 which is incompatible.

In [3]:
from google import genai
from google.genai import types

from IPython.display import Markdown, display

genai.__version__

'1.7.0'

In [4]:
from kaggle_secrets import UserSecretsClient

client = genai.Client(api_key=UserSecretsClient().get_secret("GOOGLE_API_KEY"))

In [5]:
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
    genai.models.Models.generate_content = retry.Retry(predicate=is_retriable)(genai.models.Models.generate_content)
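
The cell above wraps `generate_content` so that rate-limit (429) and service-unavailable (503) errors are retried automatically. As a rough, offline sketch of the idea (not the `google.api_core` implementation), the loop below retries a call with exponential backoff while a predicate accepts the error. `FakeAPIError`, `call_with_retry` and `flaky` are hypothetical names introduced only for this illustration.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable(exc):
    """Retry only on rate limiting (429) and service unavailable (503)."""
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def call_with_retry(fn, predicate, max_attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff while predicate(exc) is true."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retriable errors or once attempts are exhausted.
            if not predicate(exc) or attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2  # Double the wait before the next attempt.


# Usage: a flaky call that is rate limited twice before succeeding.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise FakeAPIError(429)
    return "ok"


# Pass a no-op sleep so the example runs instantly.
result = call_with_retry(flaky, is_retriable, sleep=lambda seconds: None)
print(result, len(calls))
```

Errors that the predicate rejects (a 400, say) propagate immediately, which is usually what you want for requests that would fail again no matter how long you wait.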

### Evaluation

We'll evaluate a summarisation task using the Gemini 1.5 Pro technical report.

In [6]:
!wget -nv -O gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

document_file = client.files.upload(file='gemini.pdf')

2025-11-14 09:21:44 URL:https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf [7228817/7228817] -> "gemini.pdf" [1]


#### Summarise a document
The summarisation request used here is fairly basic. It targets the training content specifically but provides no guidance otherwise.

In [7]:
request = 'Tell me about the training process used here.'

def summarize_doc(request):
    """Execute the request on the uploaded document."""
    config = types.GenerateContentConfig(temperature=0.0)

    response = client.models.generate_content(
        model='gemini-2.0-flash',
        config=config,
        contents=[request, document_file]
    )

    return response.text

summary = summarize_doc(request)
Markdown(summary)

Based on the document you provided, here's a breakdown of the training process used for Gemini 1.5 Pro:

**1. Data:**

*   **Multimodal and Multilingual Data:** The model is trained on a diverse dataset that includes text, images, audio, and video content. The text data is sourced from various domains, including web documents and code.
*   **Pre-training Dataset:** The pre-training dataset includes data sourced across many different domains, including web documents and code, and incorporates image, audio, and video content.
*   **Instruction-Tuning Phase:** Gemini 1.5 Pro is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses, with further tuning based on human preference data.

**2. Architecture:**

*   **Mixture-of-Experts (MoE) Transformer:** Gemini 1.5 Pro is based on a sparse MoE Transformer architecture. This allows the model to have a large number of parameters while only activating a subset for any given input.

**3. Infrastructure:**

*   **TPUv4 Accelerators:** The model is trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters.

**4. Training Process:**

*   **Pre-training:** The model is initially pre-trained on the large multimodal dataset.
*   **Instruction Tuning:** After pre-training, the model is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses.
*   **Human Preference Tuning:** Further tuning is performed based on human preference data.

**5. Key Improvements:**

*   **Architecture:** Improvements across the model stack, including architecture, data, optimization, and systems.
*   **Long-Context Understanding:** Significant architecture changes enable understanding of inputs up to 10 million tokens without performance degradation.

**In summary:** Gemini 1.5 Pro is trained using a large, diverse multimodal dataset on Google's TPUv4 infrastructure. It uses a MoE Transformer architecture and undergoes pre-training, instruction tuning, and human preference tuning. The training process incorporates improvements across the model stack to enable long-context understanding and overall performance.

#### Define an evaluator

For a task like this, we want to evaluate several aspects: how well the model followed the prompt ("instruction following"), whether the response sticks to information from the supplied context ("groundedness"), how easy the text is to read ("fluency"), and other factors such as "verbosity" or "quality".

In this step, we define an evaluation agent using a pre-written "summarisation" prompt and use it to gauge the quality of the generated summary.

In [8]:
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [9]:
import enum

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'

def eval_summary(prompt, ai_response):
    """Evaluate the generated summary against the prompt used."""

    chat = client.chats.create(model='gemini-2.0-flash')

    # Generate the full, free-text evaluation.
    response = chat.send_message(
        message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
    )

    verbose_eval = response.text

    # Coerce the final score into the SummaryRating enum.
    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=SummaryRating,
        )

    response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
      )
    structured_eval = response.parsed

    return verbose_eval, structured_eval
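
The second chat turn restricts the reply to the enum's values, so the result can be handled as a `SummaryRating` member rather than free text. As a minimal offline sketch of what that mapping amounts to, assuming the reply is just the bare value string, the hypothetical helper `parse_rating` below looks the value up on the enum. It is not part of the SDK and is only here to illustrate the conversion.

```python
import enum


class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


def parse_rating(reply):
    """Map a bare reply such as '4' to a SummaryRating, or None if it is not a valid value."""
    try:
        # Enum lookup by value: SummaryRating('4') is SummaryRating.GOOD.
        return SummaryRating(reply.strip())
    except ValueError:
        return None


rating = parse_rating(' 4\n')
print(rating, int(rating.value))
print(parse_rating('pretty good'))
```

Keeping the values as digit strings means a rating converts to a number with `int(rating.value)`, which is handy when scores are aggregated later.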

In [10]:
text_eval, struct_eval = eval_summary(prompt=[request, document_file], ai_response=summary)
Markdown(text_eval)

## Evaluation
STEP 1:
The AI-generated response summarizes the training process of Gemini 1.5 Pro based on the uploaded document, covering aspects like data, architecture, infrastructure, and training stages. The information is grounded in the document, as it accurately reflects the details provided about the multimodal dataset, MoE Transformer architecture, TPUv4 accelerators, and the pre-training, instruction tuning, and human preference tuning stages. The response could be more concise. The level of detail is good, but it could be condensed slightly to focus on the most critical aspects without losing key information. The response is generally fluent and well-organized, making it easy to understand. The use of bullet points and clear headings helps structure the information effectively.

STEP 2:
The response fulfills the instructions, is grounded, and is mostly concise and fluent.
I will give a score of 4.

## Rating:
4

In [11]:
struct_eval

<SummaryRating.GOOD: '4'>

In [12]:
new_prompt = "Explain like I'm 5 the training process"

if not new_prompt:
  raise ValueError("Try setting a new summarisation prompt.")

def run_and_eval_summary(prompt):
  """Generate and evaluate the summary using the given prompt."""
  summary = summarize_doc(prompt)
  display(Markdown(summary + '\n-----'))

  text, struct = eval_summary([prompt, document_file], summary)
  display(Markdown(text + '\n-----'))
  print(struct)

run_and_eval_summary(new_prompt)

Okay, I can explain the training process of a large language model like Gemini 1.5 Pro in a way that a 5-year-old can understand.

Imagine you have a puppy, and you want to teach it to understand and respond to your commands. That's kind of like training a big computer brain!

1.  **Lots of Examples:** First, you show the puppy lots and lots of things. You show it pictures of cats, dogs, cars, and houses. You also read it stories and tell it about all sorts of things. The computer brain also gets to see and read lots of things – millions and millions of pictures, books, and websites!

2.  **Learning Patterns:** The puppy starts to notice patterns. It learns that things with pointy ears and a tail are often dogs, and that when you say "sit," it should put its bottom on the ground. The computer brain also learns patterns. It learns that certain words often go together, and that certain pictures are related to certain words.

3.  **Making Predictions:** Now, you ask the puppy a question, like "Where's the ball?" The puppy tries to guess where the ball is based on what it has learned. The computer brain also tries to guess the answer to questions.

4.  **Getting Feedback:** If the puppy guesses right, you give it a treat and say "Good job!" If it guesses wrong, you gently correct it. The computer brain also gets feedback. If it guesses right, it gets a little reward. If it guesses wrong, it adjusts itself to try to guess better next time.

5.  **Repeating and Improving:** You keep showing the puppy things, asking questions, and giving feedback over and over again. The puppy gets better and better at understanding and responding to you. The computer brain also keeps learning and improving. It gets better at understanding and answering questions, and even at doing new things that it wasn't specifically taught!

So, training a big computer brain is like teaching a puppy, but with lots and lots of examples, and instead of treats, the computer brain gets little rewards that help it learn and improve. And just like a well-trained puppy, a well-trained computer brain can be very helpful and do amazing things!
-----

## Evaluation
STEP 1: 
The response is well-written and easy to understand, and it is written as though speaking to a child of 5 years old. The response is grounded in the document that was provided.

STEP 2:
4 - The response is good, and does follow instructions. The explanation is very good for the prompt. grounded, concise, and fluent.
-----

SummaryRating.GOOD


#### Pointwise evaluation

The technique used above, where we evaluated a single input/output pair against some criteria, is known as pointwise evaluation. It is useful for judging an individual output in an absolute sense, such as "was it good or bad?"

In this exercise, we will try different guidance prompts with a set of questions.

In [13]:
terse_guidance = "Answer the following question in a single sentence, or as close to that as possible."
moderate_guidance = "Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question."
cited_guidance = "Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible."
guidance_options = {
    'Terse': terse_guidance,
    'Moderate': moderate_guidance,
    'Cited': cited_guidance,
}

questions = [
    "What metric(s) are used to evaluate long context performance?",
    "How does the model perform on code tasks?",
    "How many layers does it have?",
    # "Why is it called Gemini?",
]

if not questions:
  raise NotImplementedError('Add some questions to evaluate!')

In [14]:
import functools

@functools.cache  # Memoise results (unbounded cache) so repeated calls with the same arguments skip the API.
def answer_question(question, guidance):
    """Generate an answer to the question using the uploaded document and guidance."""
    config = types.GenerateContentConfig(
        temperature=0.0,
        system_instruction=guidance
    )

    response = client.models.generate_content(
        model='gemini-2.0-flash',
        config=config,
        contents=[question, document_file]
    )

    return response.text

answer = answer_question(questions[0], terse_guidance)
Markdown(answer)

Long context performance is evaluated using metrics such as near-perfect recall on retrieval tasks, improvements in long-document QA, long-video QA, and long-context ASR, and by matching or surpassing the performance of other models on a broad set of benchmarks.


In [15]:
answer = answer_question(questions[0], cited_guidance)
Markdown(answer)

Based on the document "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context", here's a breakdown of the metrics used to evaluate long context performance:

**1. Diagnostic Long-Context Evaluations:**

*   **Perplexity over Long Sequences:**
    *   **Metric:** Negative Log-Likelihood (NLL) of tokens at different positions in the input sequences.
    *   **Interpretation:** A lower NLL indicates better next-token prediction and more effective use of long-context information. The trend of the NLL curve (downward or upward) reveals whether the model is effectively using the context or deteriorating in prediction quality.
*   **Needle-in-a-Haystack Retrieval:**
    *   **Metric:** Recall rate (percentage of successful retrievals).
    *   **Task:** The model is given a long context (the "haystack") and asked to find a specific piece of information (the "needle") inserted at a random location.
    *   **Modalities:** This task is performed across text, video, and audio.
    *   **Variations:** The document also explores variations of this task, such as retrieving multiple needles in a single turn and modulating retrieval difficulty by varying the similarity of the needles.

**2. Realistic Long-Context Evaluations:**

*   **In-Context Language Learning (Machine Translation from One Book - MTOB):**
    *   **Task:** Learning to translate a new language (Kalamang) from a single set of linguistic documentation (grammar, dictionary, parallel sentences).
    *   **Metrics:**
        *   Human evaluation scores (on a scale of 0 to 6, with 6 being an excellent translation).
        *   Automatic metrics: BLEURT (for Kalamang to English) and chrF (for English to Kalamang).
*   **Long-Document Question Answering:**
    *   **Task:** Answering questions about a long document (e.g., "Les Misérables").
    *   **Metric:** Human evaluation using the Attributable to Identified Sources (AIS) protocol.
    *   **Evaluation:** The model's ability to answer questions correctly when the entire document is provided as input.
*   **Long-Context Automatic Speech Recognition (ASR):**
    *   **Task:** Transcribing long audio segments (15-minute videos).
    *   **Metric:** Word Error Rate (WER).
*   **Long-Context Video Question Answering:**
    *   **Task:** Answering questions about long videos (40-105 minutes).
    *   **Metric:** Accuracy.

**Additional Context and Background:**

*   **Importance of Long Context:** The document emphasizes that expanding the context window allows models to incorporate more task-specific information not found in the training data, leading to improved performance.
*   **Comparison with Other Models:** Gemini 1.5 Pro is compared against Gemini 1.0 Pro/Ultra, Claude 2.1, and GPT-4 Turbo on various tasks to demonstrate its capabilities.
*   **Trade-offs:** The document also addresses the potential trade-offs between long-context performance and core capabilities (performance on non-long-context tasks).
*   **Responsible Deployment:** The document includes a section on responsible deployment, covering impact assessment, model mitigations, and safety evaluations.
*   **Divergence:** The document also evaluates Gemini 1.5 Pro to understand its susceptibility to divergence and in particular, emitting memorized training data via this attack.


Now let's set up a question-answering evaluator, much like before, but using the pointwise QA evaluation prompt.

In [16]:
QA_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user prompt and an AI-generated response.
You should first read the user prompt carefully for analyzing the task, and then evaluate the quality of the responses based on the criteria and rules provided in the Evaluation section below.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

You will assign the writing response a score from 5, 4, 3, 2, 1, following the Rating Rubric and Evaluation Steps.
Give step-by-step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.

## Criteria Definition
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [17]:
class AnswerRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


@functools.cache
def eval_answer(prompt, ai_response, n=1):
    """Evaluate the generated answer against the prompt/question used.

    `n` is not used in the body; it only varies the cache key so repeated trials are re-evaluated.
    """
    chat = client.chats.create(model='gemini-2.0-flash')

    response = chat.send_message(
        message=QA_PROMPT.format(prompt=prompt, response=ai_response)
    )
    verbose_eval = response.text

    # structured output
    config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=AnswerRating,
    )

    response = chat.send_message(
      message="Convert the final score.",
      config=config,
      )
    structured_eval = response.parsed

    return verbose_eval, structured_eval

In [18]:
text_eval, struct_eval = eval_answer(prompt=questions[0], ai_response=answer)
display(Markdown(text_eval))
print(struct_eval)

STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.

The response is grounded, as it states that the information comes from the document "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context". The answer completely answers the question with sufficient detail. The response is well-organized and easy to read. The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.

STEP 2: Score based on the rubric.

5


AnswerRating.VERY_GOOD


In [19]:
struct_eval

<AnswerRating.VERY_GOOD: '5'>

In [20]:
import collections
import itertools

NUM_ITERATIONS = 1

scores = collections.defaultdict(int)
responses = collections.defaultdict(list)

for question in questions:
    display(Markdown(f'## {question}'))
    for guidance, guide_prompt in guidance_options.items():
        for n in range(NUM_ITERATIONS):
            answer = answer_question(question, guide_prompt)
            written_eval, struct_eval = eval_answer(question, answer, n)
            print(f'{guidance}: {struct_eval}')
            scores[guidance] += int(struct_eval.value)
            responses[(guidance, question)].append((answer, written_eval))

## What metric(s) are used to evaluate long context performance?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD


## How does the model perform on code tasks?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.GOOD
Cited: AnswerRating.VERY_GOOD


## How many layers does it have?

Terse: AnswerRating.OK
Moderate: AnswerRating.OK
Cited: AnswerRating.VERY_GOOD
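
Both `answer_question` and `eval_answer` are memoised with `functools.cache`, and the loop passes `n` purely to change the cache key between trials. The sketch below shows that behaviour offline with a hypothetical stand-in `judge` function that counts how often its body actually runs.

```python
import functools

call_count = 0


@functools.cache
def judge(prompt, response, n=0):
    """Hypothetical stand-in for an LLM judge call; n only varies the cache key."""
    global call_count
    call_count += 1
    return f'evaluation #{call_count}'


first = judge('question', 'answer', 0)
again = judge('question', 'answer', 0)  # Identical arguments: served from the cache.
fresh = judge('question', 'answer', 1)  # Different n: the function body runs again.

print(first, again, fresh, call_count)
```

One thing to watch: the cache key is built from the arguments as they were passed, so a call that uses keyword arguments may be stored separately from an otherwise identical positional call. Sticking to one calling style avoids paying for the same evaluation twice.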


In [21]:
for guidance, score in scores.items():
  avg_score = score / (NUM_ITERATIONS * len(questions))
  nearest = AnswerRating(str(round(avg_score)))
  print(f'{guidance}: {avg_score:.2f} - {nearest.name}')

Terse: 4.33 - GOOD
Moderate: 4.00 - GOOD
Cited: 5.00 - VERY_GOOD
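
The averages above follow directly from the per-question ratings printed earlier. The sketch below recomputes them offline from those ratings and illustrates one caveat when mapping a mean back to a label: Python's built-in `round` rounds exact halves to the nearest even integer. `round_half_up` is a hypothetical helper shown as one possible alternative, not something the original code uses.

```python
import math

# Ratings per guidance prompt, copied from the printed results (one per question).
ratings = {
    'Terse': [5, 5, 3],
    'Moderate': [5, 4, 3],
    'Cited': [5, 5, 5],
}

averages = {name: sum(values) / len(values) for name, values in ratings.items()}
for name, avg in averages.items():
    print(f'{name}: {avg:.2f}')


def round_half_up(x):
    """Round to the nearest integer, sending exact halves upwards."""
    return math.floor(x + 0.5)


# Built-in round() uses round-half-to-even, so a mean of exactly 4.5 maps to 4.
print(round(4.5), round_half_up(4.5))
```

With three questions and one iteration the means are multiples of one third, so no exact half occurs here. With an even number of samples it can, and the choice of rounding then decides which label is reported.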


#### Pairwise evaluation

In [22]:
QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answers the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

# AI-generated Response

### Response A
{baseline_model_response}

### Response B
{response}
"""

In [24]:
class AnswerComparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'


@functools.cache
def eval_pairwise(prompt, response_a, response_b, n=1):
    """Determine the better of two answers to the same prompt."""

    chat = client.chats.create(model='gemini-2.0-flash')

    response = chat.send_message(
        message=QA_PAIRWISE_PROMPT.format(prompt=prompt, baseline_model_response=response_a, response=response_b)
    )
    verbose_eval = response.text

    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=AnswerComparison
    )

    response = chat.send_message(
        message="Convert the final score.",
        config=structured_output_config,
    )

    structured_eval = response.parsed

    return verbose_eval, structured_eval

In [25]:
question = questions[0]
answer_a = answer_question(question, terse_guidance)
answer_b = answer_question(question, cited_guidance)

text_eval, struct_eval = eval_pairwise(
    prompt=question,
    response_a=answer_a,
    response_b=answer_b,
)

display(Markdown(text_eval))
print(struct_eval)

STEP 1: Analyze Response A based on the question answering quality criteria:
Response A fulfills the prompt, but lacks the details provided in the document it is referencing.

STEP 2: Analyze Response B based on the question answering quality criteria:
Response B directly answers the question by providing a breakdown of the metrics used to evaluate long context performance as defined by the document provided.

STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
Response B is a much better response than Response A because it provides the breakdown of the metrics that the document provides.

STEP 4: Output your preference of "B" to the pairwise_choice field according to the Rating Rubric.

STEP 5: Output your assessment reasoning in the explanation field.
Response B is superior because it provides a comprehensive and well-organized breakdown of the metrics used to evaluate long context performance, as defined by the referenced document. This includes diagnostic and realistic evaluations, along with specific metrics such as Negative Log-Likelihood (NLL), recall rate, human evaluation scores, BLEURT, chrF, Word Error Rate (WER), and accuracy. The response also offers additional context and background, such as the importance of long context, comparisons with other models, trade-offs, responsible deployment, and divergence. In contrast, Response A is too general and lacks the depth and specificity found in Response B.


AnswerComparison.B


In [26]:
@functools.total_ordering
class QAGuidancePrompt:
  """A question-answering guidance prompt or system instruction."""

  def __init__(self, prompt, questions, n_comparisons=NUM_ITERATIONS):
    """Create the prompt. Provide questions to evaluate against, and number of evals to perform."""
    self.prompt = prompt
    self.questions = questions
    self.n = n_comparisons

  def __str__(self):
    return self.prompt

  def _compare_all(self, other):
    """Compare two prompts on all questions over n trials."""
    results = [self._compare_n(other, q) for q in self.questions]
    mean = sum(results) / len(results)
    return round(mean)

  def _compare_n(self, other, question):
    """Compare two prompts on a question over n trials."""
    results = [self._compare(other, question, n) for n in range(self.n)]
    mean = sum(results) / len(results)
    return mean

  def _compare(self, other, question, n=1):
    """Compare two prompts on a single question."""
    answer_a = answer_question(question, self.prompt)
    answer_b = answer_question(question, other.prompt)

    _, result = eval_pairwise(
        prompt=question,
        response_a=answer_a,
        response_b=answer_b,
        n=n,  # Cache buster
    )

    # Convert the enum to the standard Python numeric comparison values.
    if result is AnswerComparison.A:
      return 1
    elif result is AnswerComparison.B:
      return -1
    else:
      return 0

  def __eq__(self, other):
    """Equality check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) == 0

  def __lt__(self, other):
    """Ordering check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) < 0

Ranking prompts against each other

In [27]:
terse_prompt = QAGuidancePrompt(terse_guidance, questions)
moderate_prompt = QAGuidancePrompt(moderate_guidance, questions)
cited_prompt = QAGuidancePrompt(cited_guidance, questions)

# Sort in reverse order, so that best is first
sorted_results = sorted([terse_prompt, moderate_prompt, cited_prompt], reverse=True)
for i, p in enumerate(sorted_results):
  if i:
    print('---')

  print(f'#{i+1}: {p}')

#1: Answer the following question in a single sentence, or as close to that as possible.
---
#2: Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.
---
#3: Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.
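
The ranking works because `functools.total_ordering` derives the remaining comparison operators from `__eq__` and `__lt__`, which lets `sorted` drive the pairwise judge. The sketch below shows the same mechanism offline. `stub_judge` is a hypothetical, deterministic stand-in for the LLM judge (it simply prefers the longer answer) and `Candidate` is an illustrative class, not the notebook's `QAGuidancePrompt`.

```python
import functools


def stub_judge(answer_a, answer_b):
    """Hypothetical judge: 1 if A wins, -1 if B wins, 0 for a tie (longer answer wins)."""
    if len(answer_a) > len(answer_b):
        return 1
    if len(answer_a) < len(answer_b):
        return -1
    return 0


@functools.total_ordering
class Candidate:
    """An answer that is ordered by asking the judge to compare it with another."""

    def __init__(self, name, answer):
        self.name = name
        self.answer = answer

    def __eq__(self, other):
        if not isinstance(other, Candidate):
            return NotImplemented
        return stub_judge(self.answer, other.answer) == 0

    def __lt__(self, other):
        if not isinstance(other, Candidate):
            return NotImplemented
        return stub_judge(self.answer, other.answer) < 0


candidates = [
    Candidate('short', 'ok'),
    Candidate('long', 'a much longer answer'),
    Candidate('medium', 'a fair answer'),
]

# Sort in reverse so the preferred candidate comes first.
ranked = [c.name for c in sorted(candidates, reverse=True)]
print(ranked)
```

The stub is consistent, so the order is stable. A real LLM judge can return inconsistent preferences (A over B, B over C, C over A), in which case the result may depend on which pairs the sort happens to compare. That is worth keeping in mind when a pairwise ranking disagrees with the pointwise averages.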
