In [1]:
!pip install -Uq "google-genai==1.7.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-adk 1.18.0 requires google-genai<2.0.0,>=1.45.0, but you have google-genai 1.7.0 which is incompatible.
google-cloud-aiplatform 1.125.0 requires google-genai<2.0.0,>=1.37.0, but you have google-genai 1.7.0 which is incompatible.

In [2]:
from google import genai
from google.genai import types

from IPython.display import Markdown, display

genai.__version__

'1.7.0'

In [3]:
from kaggle_secrets import UserSecretsClient

client = genai.Client(api_key=UserSecretsClient().get_secret("GOOGLE_API_KEY"))

In [4]:
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
    genai.models.Models.generate_content = retry.Retry(predicate=is_retriable)(genai.models.Models.generate_content)
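
The cell above wraps `generate_content` so that rate-limit (429) and service-unavailable (503) errors are retried automatically. The following is a minimal, standard-library-only sketch of that idea, not the `google.api_core` implementation: `FakeAPIError`, `retry_on`, and `flaky_call` are hypothetical names used purely for illustration, and the delay is set to zero so the example runs instantly.

```python
import functools
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def retry_on(predicate, max_attempts=5, delay=0.0):
    """Retry the wrapped function while predicate(exception) returns True."""
    def decorator(fn):
        @functools.wraps(fn)  # functools.wraps sets __wrapped__ on the wrapper
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    # Give up on the last attempt or on non-retriable errors.
                    if attempt == max_attempts or not predicate(e):
                        raise
                    time.sleep(delay * 2 ** (attempt - 1))  # exponential backoff
        return wrapper
    return decorator


def is_retriable(e):
    return isinstance(e, FakeAPIError) and e.code in {429, 503}


calls = {"count": 0}


@retry_on(is_retriable)
def flaky_call():
    """Fail twice with a retriable error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeAPIError(429)
    return "ok"


result = flaky_call()
print(result, calls["count"])
```

Errors outside the predicate (for example a 400) are re-raised immediately, so only transient failures cost extra calls.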

### Evaluation

We'll evaluate a summarisation task using the Gemini 1.5 Pro technical report.

In [5]:
!wget -nv -O gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

document_file = client.files.upload(file='gemini.pdf')

2026-02-03 12:54:48 URL:https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf [7228817/7228817] -> "gemini.pdf" [1]


#### Summarise a document
The summarisation request used here is fairly basic. It targets the training content specifically but provides no guidance otherwise.

In [6]:
request = 'Tell me about the training process used here.'

def summarize_doc(request):
    """Execute the request on the uploaded document."""
    config = types.GenerateContentConfig(temperature=0.0)

    response = client.models.generate_content(
        model='gemini-2.0-flash',
        config=config,
        contents=[request, document_file]
    )

    return response.text

summary = summarize_doc(request)
Markdown(summary)

Here's a breakdown of the training process used for Gemini 1.5 Pro, based on the provided document:

**1. Model Architecture:**

*   **Mixture-of-Experts (MoE):** Gemini 1.5 Pro uses a sparse MoE architecture. This means it has a large number of parameters, but only a subset of them are activated for any given input. A learned routing function directs inputs to a subset of the model's parameters (experts) for processing. This allows for scaling the model's size without a proportional increase in computational cost during inference.
*   **Transformer-based:** It builds upon the Transformer architecture, which is the foundation for many modern language models.

**2. Training Data:**

*   **Multimodal and Multilingual:** The model is trained on a diverse dataset that includes text, images, audio, and video content. The data is sourced from various domains, including web documents and code.
*   **Pre-training:** The model undergoes pre-training on this large, diverse dataset.
*   **Instruction Tuning:** After pre-training, the model is fine-tuned using a collection of multimodal data containing paired instructions and appropriate responses. This helps the model better follow instructions and generate desired outputs.
*   **Human Preference Data:** Further tuning is based on human preference data, likely using techniques like Reinforcement Learning from Human Feedback (RLHF) to align the model's behavior with human expectations.

**3. Training Infrastructure:**

*   **TPUv4 Accelerators:** The model is trained on multiple 4096-chip pods of Google's TPUv4 accelerators.
*   **Distributed Training:** The training is distributed across multiple datacenters.

**4. Long-Context Training:**

*   The model incorporates significant architecture changes that enable long-context understanding of inputs up to 10 million tokens without degrading performance.

**5. Safety Mitigations:**

*   **Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF):** These techniques are used to mitigate safety risks.
*   **Focus on Adversarial Queries:** The safety mitigation is focused on adversarial or "harm-inducing" queries, where an unprotected model is likely to produce harmful responses.
*   **Multimodal Safety Data:** New image-to-text SFT data is incorporated, as text-only safety data was found to be less effective for harm-inducing image-to-text queries.

**In summary, the training process for Gemini 1.5 Pro involves a combination of large-scale pre-training on diverse multimodal data, instruction tuning with paired instructions and responses, and safety mitigations using SFT and RLHF. The model is trained on Google's TPUv4 accelerators using distributed training techniques.**

#### Define an evaluator

For a task like this, we want to evaluate several aspects: how well the model followed the prompt ("instruction following"), whether the response sticks to information in the provided context ("groundedness"), how easy the text is to read ("fluency"), and other factors such as "verbosity" or "quality".

In this step, we define an evaluation agent using a pre-written "summarisation" prompt and use it to gauge the quality of the generated summary.

In [10]:
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [11]:
import enum

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'

def eval_summary(prompt, ai_response):
    """Evaluate the generated summary against the prompt used."""

    chat = client.chats.create(model='gemini-2.0-flash')

    # Generate the full written evaluation.
    response = chat.send_message(
        message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
    )

    verbose_eval = response.text

    # Ask for the final score again, constrained to the SummaryRating enum.
    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=SummaryRating,
    )

    response = chat.send_message(
        message="Convert the final score.",
        config=structured_output_config,
    )
    structured_eval = response.parsed

    return verbose_eval, structured_eval

In [12]:
text_eval, struct_eval = eval_summary(prompt=[request, document_file], ai_response=summary)
Markdown(text_eval)

## Evaluation
STEP 1:
The response contains an overview of the training process from the document, but is not very concise.

STEP 2:
I'm giving this response a rating of 4. It mostly follows instructions, is grounded, and is fluent. It is not very concise.

## Rating: 4


In [13]:
struct_eval

<SummaryRating.GOOD: '4'>
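
The structured result is an ordinary Python `enum` member, so it can be converted to a number or looked up from a raw string. A small sketch, re-declaring `SummaryRating` so that it runs on its own; `raw_score` is a hypothetical stand-in for a score string returned by the model.

```python
import enum


class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


raw_score = '4'                     # hypothetical raw score string
rating = SummaryRating(raw_score)   # look up the member by its value
numeric = int(rating.value)         # convert to a number for aggregation
print(rating.name, numeric)

# A value outside the rubric is rejected rather than silently accepted.
try:
    SummaryRating('7')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```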

In [14]:
new_prompt = "Explain like I'm 5 the training process"

if not new_prompt:
  raise ValueError("Try setting a new summarisation prompt.")

def run_and_eval_summary(prompt):
  """Generate and evaluate the summary using the given prompt."""
  # Use the `prompt` argument rather than the global `new_prompt`.
  summary = summarize_doc(prompt)
  display(Markdown(summary + '\n-----'))

  text, struct = eval_summary([prompt, document_file], summary)
  display(Markdown(text + '\n-----'))
  print(struct)

run_and_eval_summary(new_prompt)

Okay, I can explain the training process of a large language model like Gemini 1.5 in a way that a 5-year-old can understand.

Imagine you have a puppy, and you want to teach it to understand and respond to your commands. Here's how it's similar to training a big computer brain:

1.  **Lots and Lots of Examples:**
    *   The puppy needs to see and hear many examples of what you want it to do. For example, you show it a ball and say "Fetch!" over and over.
    *   The computer brain also needs to see lots of examples. It reads millions of books, articles, and websites. It also sees pictures, videos, and hears sounds. This helps it learn about the world.

2.  **Learning the Rules:**
    *   The puppy starts to learn that "Fetch!" means to go get the ball and bring it back. It learns the rules of the game.
    *   The computer brain learns the rules of language, like how words go together to make sentences. It also learns about different topics, like animals, planets, and history.

3.  **Practice and Correction:**
    *   When the puppy does something wrong, you gently correct it. Maybe it brings back a shoe instead of the ball. You say, "No, fetch the ball!"
    *   The computer brain also makes mistakes. When it does, the people who are training it tell it what it did wrong. The computer brain then adjusts itself to do better next time.

4.  **Getting Smarter Over Time:**
    *   With lots of practice and correction, the puppy gets better and better at understanding what you want it to do.
    *   The computer brain also gets smarter over time. It can answer questions, write stories, and even translate languages!

So, training a big computer brain is like teaching a puppy, but with lots and lots more examples and a lot of math! The computer brain learns from all the information it sees and hears, and it gets better and better at understanding and responding to what people want.
-----

## Evaluation
STEP 1: The response fulfills the prompt, however, it does not provide a summary of the document given. Rather it's a canned answer that it provides given the request to explain a training process like I'm 5.
STEP 2: The rating is a 2 as it does not follow instructions, but is grounded in reality.

## Rating
2

-----

SummaryRating.BAD


#### Pointwise evaluation

The technique used above, where we evaluated a single input/output pair against some criteria, is known as pointwise evaluation. It is useful for judging a single output in an absolute sense, such as "was it good or bad?"

In this exercise, we will try different guidance prompts with a set of questions.

In [17]:
terse_guidance = "Answer the following question in a single sentence, or as close to that as possible."
moderate_guidance = "Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question."
cited_guidance = "Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible."
guidance_options = {
    'Terse': terse_guidance,
    'Moderate': moderate_guidance,
    'Cited': cited_guidance,
}

questions = [
    "What metric(s) are used to evaluate long context performance?",
    "How does the model perform on code tasks?",
    "How many layers does it have?",
    # "Why is it called Gemini?",
]

if not questions:
  raise NotImplementedError('Add some questions to evaluate!')

In [18]:
import functools

@functools.cache  # Memoise results (unbounded cache) so repeated (question, guidance) calls skip the API.
def answer_question(question, guidance):
    """Generate an answer to the question using the uploaded document and guidance."""
    config = types.GenerateContentConfig(
        temperature=0.0,
        system_instruction=guidance
    )

    response = client.models.generate_content(
        model='gemini-2.0-flash',
        config=config,
        contents=[question, document_file]
    )

    return response.text

answer = answer_question(questions[0], terse_guidance)
Markdown(answer)

Metrics used to evaluate long context performance include next-token prediction, near-perfect retrieval, long-document question answering, long-video question answering, and long-context automatic speech recognition.
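
`functools.cache` memoises on the call arguments, so asking the same question with the same guidance reuses the stored answer instead of calling the API again. The sketch below uses a hypothetical `fake_answer` function with a call log in place of the model, and shows how an extra argument (like the `n` parameter that the evaluator functions accept) forces a fresh call.

```python
import functools

call_log = []  # records every call that actually runs the function body


@functools.cache
def fake_answer(question, guidance, n=0):
    """Hypothetical stand-in for a model call; `n` only varies the cache key."""
    call_log.append((question, guidance, n))
    return f'{guidance}: {question} (call {len(call_log)})'


first = fake_answer('Q1', 'terse')
second = fake_answer('Q1', 'terse')     # same arguments: served from the cache
third = fake_answer('Q1', 'terse', 1)   # different arguments: body runs again

print(first == second, len(call_log))
```

With `temperature=0.0` the cached and fresh answers are likely to be similar anyway, but caching also saves the latency and quota of a repeated request.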


In [19]:
answer = answer_question(questions[0], cited_guidance)
Markdown(answer)

Based on the document "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context", here's a breakdown of the metrics used to evaluate long-context performance:

**1. Diagnostic Long-Context Evaluations:**

*   **Perplexity over Long Sequences:**
    *   **Metric:** Negative Log-Likelihood (NLL) of tokens at different positions in the input sequences.
    *   **Interpretation:** A lower NLL indicates better prediction and more effective use of long-context information. The trend of the NLL curve (downward or upward) signifies the model's ability to reason over long contexts.
*   **Text Haystack (Needle-in-a-Haystack Retrieval):**
    *   **Metric:** Recall rate (percentage of successful retrievals of the "needle" or secret number).
    *   **Interpretation:** A high recall rate demonstrates the model's ability to reliably retrieve specific information from a large amount of distractor context.
*   **Video Haystack:**
    *   **Metric:** Recall rate of the secret word embedded as text in a random frame of a long video.
    *   **Interpretation:** Tests the model's ability to retrieve specific information across multiple hours of video.
*   **Audio Haystack:**
    *   **Metric:** Accuracy (percentage of times the model correctly identifies the "secret keyword" in a long audio segment).
    *   **Interpretation:** Assesses the model's ability to understand audio over long contexts.
*   **Improved Diagnostics:**
    *   **Multiple Needles-in-a-Haystack:** Retrieval performance of multiple needles in a single turn.
    *   **Multi-round Co-reference Resolution (MRCR):** String similarity score between the model output and the correct response in a multi-turn conversation.

**2. Realistic Long-Context Evaluations:**

*   **In-context Language Learning (Machine Translation from One Book - MTOB):**
    *   **Metrics:**
        *   Human evaluation scores (on a scale of 0 to 6, with 6 being excellent translation).
        *   Automatic metrics: BLEURT (for Kalamang to English) and chrF (for English to Kalamang).
    *   **Interpretation:** Measures the model's ability to learn a new language from a single set of linguistic documentation.
*   **Long-Document QA:**
    *   **Metrics:**
        *   AutoAIS (Automatic Attributable to Identified Sources) score.
        *   AIS Human Evaluation.
        *   Number of Sentences per answer.
    *   **Interpretation:** Assesses the model's ability to answer questions about long documents, requiring understanding of relationships between pieces of information spanning large portions of text.
*   **Long-Context Audio (Automatic Speech Recognition - ASR):**
    *   **Metric:** Word Error Rate (WER).
    *   **Interpretation:** A lower WER indicates better transcription accuracy for long audio segments.
*   **Long-Context Video QA:**
    *   **Metric:** Accuracy.
    *   **Interpretation:** Assesses the model's ability to answer questions about long videos.

**Additional Background and Context:**

*   **Context Window:** The size of the context window (the amount of text, audio, or video the model can consider at once) is a key factor in long-context performance. Gemini 1.5 Pro significantly extends the context length compared to previous models.
*   **Multimodality:** Gemini 1.5 Pro is a multimodal model, meaning it can process and reason over different types of data (text, audio, video, images) simultaneously.
*   **Mixture-of-Experts (MoE):** Gemini 1.5 Pro uses a MoE architecture, which allows the model to have a large number of parameters while only activating a subset of them for any given input. This improves efficiency and scalability.
*   **Retrieval-Augmented Generation:** Some models use external retrieval mechanisms to access relevant information from a database or corpus. Gemini 1.5 Pro's large context window reduces the need for external retrieval in some cases.
*   **Safety and Responsible Deployment:** The document also discusses the importance of safety and responsible deployment of large language models, including measures to mitigate harmful outputs and biases.

In summary, the document uses a combination of diagnostic and realistic evaluations to assess Gemini 1.5 Pro's long-context capabilities across different modalities. The metrics used include perplexity, recall, accuracy, human evaluation scores, and automatic metrics like BLEURT and WER.


Now let's set up a question-answering evaluator, much like before, but using the pointwise QA evaluation prompt.

In [22]:
QA_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user prompt and an AI-generated response.
You should first read the user prompt carefully for analyzing the task, and then evaluate the quality of the response based on the rules provided in the Evaluation section below.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

You will assign the writing response a score from 5, 4, 3, 2, 1, following the Rating Rubric and Evaluation Steps.
Give step-by-step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.

## Criteria Definition
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [23]:
class AnswerRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


@functools.cache
def eval_answer(prompt, ai_response, n=1):
    """Evaluate the generated answer against the prompt/question used."""
    chat = client.chats.create(model='gemini-2.0-flash')

    response = chat.send_message(
        message=QA_PROMPT.format(prompt=prompt, response=ai_response)
    )
    verbose_eval = response.text

    # structured output
    config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=AnswerRating,
    )

    response = chat.send_message(
      message="Convert the final score.",
      config=config,
      )
    structured_eval = response.parsed

    return verbose_eval, structured_eval

In [24]:
text_eval, struct_eval = eval_answer(prompt=questions[0], ai_response=answer)
display(Markdown(text_eval))
print(struct_eval)

STEP 1: The model gives a thorough response about the different metrics that are used to evaluate long context performance. It also includes the interpretation of the metrics and the meaning behind them.
STEP 2: The answer follows instructions, is grounded, complete, and fluent.

Score: 5


AnswerRating.VERY_GOOD


In [25]:
struct_eval

<AnswerRating.VERY_GOOD: '5'>

In [26]:
import collections
import itertools

NUM_ITERATIONS = 1

scores = collections.defaultdict(int)
responses = collections.defaultdict(list)

for question in questions:
    display(Markdown(f'## {question}'))
    for guidance, guide_prompt in guidance_options.items():
        for n in range(NUM_ITERATIONS):
            answer = answer_question(question, guide_prompt)
            written_eval, struct_eval = eval_answer(question, answer, n)
            print(f'{guidance}: {struct_eval}')
            scores[guidance] += int(struct_eval.value)
            responses[(guidance, question)].append((answer, written_eval))

## What metric(s) are used to evaluate long context performance?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD


## How does the model perform on code tasks?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD


## How many layers does it have?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.OK
Cited: AnswerRating.VERY_GOOD


In [28]:
for guidance, score in scores.items():
  avg_score = score / (NUM_ITERATIONS * len(questions))
  nearest = AnswerRating(str(round(avg_score)))
  print(f'{guidance}: {avg_score:.2f} - {nearest.name}')

Terse: 5.00 - VERY_GOOD
Moderate: 4.33 - GOOD
Cited: 5.00 - VERY_GOOD
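
The cell above maps each average back to the nearest rubric label with `round`. One detail worth knowing is that Python's built-in `round` rounds exact halves to the nearest even integer, so an average of 4.5 maps to `GOOD` rather than `VERY_GOOD`. A minimal sketch, re-declaring `AnswerRating` so that it runs on its own; `nearest_rating` and the score lists are hypothetical.

```python
import enum


class AnswerRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


def nearest_rating(scores):
    """Average a list of integer scores and map the result to a rubric label."""
    avg = sum(scores) / len(scores)
    return avg, AnswerRating(str(round(avg)))


# Same scores as the 'Moderate' row above: two 5s and one 3.
avg, label = nearest_rating([5, 5, 3])
print(f'{avg:.2f} - {label.name}')

# Exact halves go to the even neighbour: 4.5 -> 4 and 3.5 -> 4.
_, half_up = nearest_rating([5, 4])
_, half_down = nearest_rating([4, 3])
print(half_up.name, half_down.name)
```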


#### Pairwise evaluation

In [29]:
QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answer the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

# AI-generated Response

### Response A
{baseline_model_response}

### Response B
{response}
"""

In [30]:
class AnswerComparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'


@functools.cache
def eval_pairwise(prompt, response_a, response_b, n=1):
    """Determine the better of two answers to the same prompt."""

    chat = client.chats.create(model='gemini-2.0-flash')

    response = chat.send_message(
        message=QA_PAIRWISE_PROMPT.format(prompt=prompt, baseline_model_response=response_a, response=response_b)
    )
    verbose_eval = response.text

    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=AnswerComparison
    )

    response = chat.send_message(
        message="Convert the final score.",
        config=structured_output_config,
    )

    structured_eval = response.parsed

    return verbose_eval, structured_eval

In [31]:
question = questions[0]
answer_a = answer_question(question, terse_guidance)
answer_b = answer_question(question, cited_guidance)

text_eval, struct_eval = eval_pairwise(
    prompt=question,
    response_a=answer_a,
    response_b=answer_b,
)

display(Markdown(text_eval))
print(struct_eval)

STEP 1: Analyze Response A based on the question answering quality criteria:
Response A provides a list of metrics, but it is too short and doesn't provide enough detail.

STEP 2: Analyze Response B based on the question answering quality criteria:
Response B is very thorough and provides a lot of detail regarding the metrics used to evaluate long context performance.

STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
Response B is much more helpful and provides more detail to the user.

STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
B

STEP 5: Output your assessment reasoning in the explanation field.
Response B provides a comprehensive list of metrics used to evaluate long context performance, and includes useful context for each metric. Response A is too short.

AnswerComparison.B


In [32]:
@functools.total_ordering
class QAGuidancePrompt:
  """A question-answering guidance prompt or system instruction."""

  def __init__(self, prompt, questions, n_comparisons=NUM_ITERATIONS):
    """Create the prompt. Provide questions to evaluate against, and number of evals to perform."""
    self.prompt = prompt
    self.questions = questions
    self.n = n_comparisons

  def __str__(self):
    return self.prompt

  def _compare_all(self, other):
    """Compare two prompts on all questions over n trials."""
    # Use the questions stored on this instance, not the module-level list.
    results = [self._compare_n(other, q) for q in self.questions]
    mean = sum(results) / len(results)
    return round(mean)

  def _compare_n(self, other, question):
    """Compare two prompts on a question over n trials."""
    results = [self._compare(other, question, n) for n in range(self.n)]
    mean = sum(results) / len(results)
    return mean

  def _compare(self, other, question, n=1):
    """Compare two prompts on a single question."""
    answer_a = answer_question(question, self.prompt)
    answer_b = answer_question(question, other.prompt)

    _, result = eval_pairwise(
        prompt=question,
        response_a=answer_a,
        response_b=answer_b,
        n=n,  # Cache buster
    )

    # Convert the enum to the standard Python numeric comparison values.
    if result is AnswerComparison.A:
      return 1
    elif result is AnswerComparison.B:
      return -1
    else:
      return 0

  def __eq__(self, other):
    """Equality check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) == 0

  def __lt__(self, other):
    """Ordering check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) < 0
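
`functools.total_ordering` fills in `<=`, `>`, and `>=` from the `__eq__` and `__lt__` methods defined above, which is what allows `sorted` to rank prompts. The sketch below shows the same pattern with a deterministic comparison in place of the LLM judge: `RankedPrompt` and its `quality` field are hypothetical and stand in for the pairwise verdicts.

```python
import functools


@functools.total_ordering
class RankedPrompt:
    """Hypothetical prompt wrapper ordered by a fixed quality number."""

    def __init__(self, name, quality):
        self.name = name
        self.quality = quality

    def _compare(self, other):
        """Return 1 if self wins, -1 if other wins, 0 for a tie."""
        return (self.quality > other.quality) - (self.quality < other.quality)

    def __eq__(self, other):
        if not isinstance(other, RankedPrompt):
            return NotImplemented
        return self._compare(other) == 0

    def __lt__(self, other):
        if not isinstance(other, RankedPrompt):
            return NotImplemented
        return self._compare(other) < 0


moderate = RankedPrompt('moderate', 1)
terse = RankedPrompt('terse', 3)
cited = RankedPrompt('cited', 2)

# sorted() relies on __lt__; reverse=True puts the best prompt first.
ranking = [p.name for p in sorted([moderate, terse, cited], reverse=True)]
print(ranking)

# >= is not defined in the class; total_ordering derives it.
print(terse >= cited)
```

With a real LLM judge the pairwise verdicts may not be perfectly repeatable or transitive, so a sort built on them is best read as an approximate ranking.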

Ranking prompts against each other

In [33]:
terse_prompt = QAGuidancePrompt(terse_guidance, questions)
moderate_prompt = QAGuidancePrompt(moderate_guidance, questions)
cited_prompt = QAGuidancePrompt(cited_guidance, questions)

# Sort in reverse order, so that best is first
sorted_results = sorted([terse_prompt, moderate_prompt, cited_prompt], reverse=True)
for i, p in enumerate(sorted_results):
  if i:
    print('---')

  print(f'#{i+1}: {p}')

#1: Answer the following question in a single sentence, or as close to that as possible.
---
#2: Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.
---
#3: Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.
