In [1]:
!pip install -Uq "google-genai==1.7.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-adk 1.18.0 requires google-genai<2.0.0,>=1.45.0, but you have google-genai 1.7.0 which is incompatible.
google-cloud-aiplatform 1.125.0 requires google-genai<2.0.0,>=1.37.0, but you have google-genai 1.7.0 which is incompatible.

In [3]:
from google import genai
from google.genai import types

from IPython.display import Markdown, display

genai.__version__

'1.7.0'

In [7]:
from kaggle_secrets import UserSecretsClient

client = genai.Client(api_key=UserSecretsClient().get_secret("GOOGLE_API_KEY"))

In [8]:
from google.api_core import retry

# Retry only on rate-limit (429) and service-unavailable (503) API errors.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
    genai.models.Models.generate_content = retry.Retry(predicate=is_retriable)(genai.models.Models.generate_content)
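
The cell above wraps `generate_content` so that calls failing with a 429 or 503 are retried, and the `__wrapped__` check appears intended to stop the wrapper from being applied twice when the cell is re-run. As a rough illustration of the idea only (this is not the `google.api_core.retry` implementation), the sketch below shows a predicate-driven retry decorator with exponential backoff. `FakeAPIError`, `retry_on`, and `flaky_call` are hypothetical names introduced for this example.

```python
import functools
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable(exc):
    """Retry only on rate limiting (429) and service unavailable (503)."""
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def retry_on(predicate, max_attempts=5, initial_delay=0.01, multiplier=2.0):
    """Return a decorator that retries calls whose exception satisfies `predicate`."""

    def decorator(func):
        @functools.wraps(func)  # also sets wrapper.__wrapped__ = func
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on non-retriable errors or when attempts run out.
                    if not predicate(exc) or attempt == max_attempts:
                        raise
                    time.sleep(delay)
                    delay *= multiplier  # exponential backoff

        return wrapper

    return decorator


calls = {"count": 0}


@retry_on(is_retriable)
def flaky_call():
    """Fail twice with a 429, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeAPIError(429)
    return "ok"


result = flaky_call()
print(result, calls["count"])
```

Because `functools.wraps` records the original function as `__wrapped__`, checking for that attribute is a cheap way to detect that a function has already been decorated.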

### Evaluation

We'll evaluate a summarisation task using the Gemini 1.5 Pro technical report.

In [9]:
!wget -nv -O gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

document_file = client.files.upload(file='gemini.pdf')

2025-11-14 06:15:08 URL:https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf [7228817/7228817] -> "gemini.pdf" [1]


#### Summarise a document
The summarisation request used here is fairly basic. It targets the training content specifically but provides no guidance otherwise.

In [10]:
request = 'Tell me about the training process used here.'

def summarize_doc(request):
    """Execute the request on the uploaded document."""
    config = types.GenerateContentConfig(temperature=0.0)

    response = client.models.generate_content(
        model='gemini-2.0-flash',
        config=config,
        contents=[request, document_file]
    )

    return response.text

summary = summarize_doc(request)
Markdown(summary)

Based on the document you provided, here's a breakdown of the training process used for Gemini 1.5 Pro:

**1. Data:**

*   **Multimodal and Multilingual Data:** The model is trained on a diverse dataset that includes text, images, audio, and video content. The text data is sourced from various domains, including web documents and code.
*   **Pre-training Dataset:** The pre-training dataset includes data sourced across many different domains, including web documents and code, and incorporates image, audio, and video content.
*   **Instruction-Tuning Phase:** Gemini 1.5 Pro is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses, with further tuning based on human preference data.

**2. Architecture:**

*   **Mixture-of-Experts (MoE) Transformer:** Gemini 1.5 Pro is based on a sparse MoE Transformer architecture. This allows the model to have a large number of parameters while only activating a subset for any given input.

**3. Infrastructure:**

*   **TPUv4 Accelerators:** The model is trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters.

**4. Training Process:**

*   **Pre-training:** The model is initially pre-trained on the large multimodal dataset.
*   **Instruction Tuning:** After pre-training, the model is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses.
*   **Human Preference Tuning:** Further tuning is performed based on human preference data.

**5. Key Improvements:**

*   **Architecture:** Improvements across the model stack, including architecture, data, optimization, and systems.
*   **Long-Context Understanding:** Significant architecture changes enable understanding of inputs up to 10 million tokens without performance degradation.

**In summary:** Gemini 1.5 Pro is trained using a large, diverse multimodal dataset on Google's TPUv4 infrastructure. It uses a MoE Transformer architecture and undergoes pre-training, instruction tuning, and human preference tuning. The training process incorporates improvements across the model stack to enable long-context understanding and overall performance.

#### Define an evaluator

For a task like this, we want to evaluate several aspects: how well the model followed the prompt ("instruction following"), whether the response draws only on information in the provided context ("groundedness"), how easy the text is to read ("fluency"), and other factors such as "verbosity" or "quality".

In this step, we define an evaluation agent using a pre-written "summarisation" prompt and use it to gauge the quality of the generated summary.

In [11]:
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [14]:
import enum

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'

def eval_summary(prompt, ai_response):
    """Evaluate the generated summary against the prompt used."""

    chat = client.chats.create(model='gemini-2.0-flash')

    # Generate the full text evaluation.
    response = chat.send_message(
        message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
    )

    verbose_eval = response.text

    # Ask the same chat to convert its final score into the enum.
    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=SummaryRating,
    )

    response = chat.send_message(
        message="Convert the final score.",
        config=structured_output_config,
    )
    structured_eval = response.parsed

    return verbose_eval, structured_eval
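
`eval_summary` fills `SUMMARY_PROMPT` with `str.format`, which converts each argument to text with `str()`. If `prompt` is a list that contains a file handle, the evaluator would therefore receive the list's printed representation rather than the file's contents. A minimal sketch, using a hypothetical `FakeFile` class and a shortened `TEMPLATE` (both stand-ins for illustration, not part of the SDK), shows the effect:

```python
class FakeFile:
    """Hypothetical stand-in for an uploaded file handle."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"FakeFile(name={self.name!r})"


# Shortened stand-in for the tail of SUMMARY_PROMPT.
TEMPLATE = "### Prompt\n{prompt}\n\n## AI-generated Response\n{response}\n"

request = "Tell me about the training process used here."

# Passing a list: str.format() calls str() on it, which uses repr() per element.
filled = TEMPLATE.format(
    prompt=[request, FakeFile("files/example")],
    response="A short summary.",
)
print(filled)
```

Under this assumption, a groundedness judgement is made without the source document in view, which is worth keeping in mind when reading the scores.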

In [15]:
text_eval, struct_eval = eval_summary(prompt=[request, document_file], ai_response=summary)
Markdown(text_eval)

## Evaluation
STEP 1:
The response summarizes the training process well. It follows instructions and provides a comprehensive overview of the data, architecture, infrastructure, training process, and key improvements. The response is grounded and provides accurate details from the document. The information is presented in a well-organized and easy-to-read manner.

STEP 2:
Rating: 5


In [16]:
struct_eval

<SummaryRating.VERY_GOOD: '5'>
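
Since each enum member's value is the rubric score as a string, structured ratings can be converted to integers and aggregated. The sketch below assumes the same `SummaryRating` definition as above; the `ratings` list is hypothetical sample data, not output from the notebook, and `to_score` is a helper name introduced here.

```python
import enum
from statistics import mean


class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


def to_score(rating):
    """Convert a SummaryRating member to its integer rubric score."""
    return int(rating.value)


# Hypothetical ratings, e.g. collected from several evaluation runs.
ratings = [SummaryRating.VERY_GOOD, SummaryRating.GOOD, SummaryRating.VERY_GOOD]

average = mean(to_score(r) for r in ratings)
print(f"Average rating: {average:.2f}")
```

Averaging over several runs or prompts gives a steadier signal than a single rating, since the evaluator is itself a model whose output can vary.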

In [18]:
new_prompt = "Explain like I'm 5 the training process"

if not new_prompt:
    raise ValueError("Try setting a new summarisation prompt.")

def run_and_eval_summary(prompt):
    """Generate and evaluate the summary using the given prompt."""
    # Use the function argument rather than the global `new_prompt`.
    summary = summarize_doc(prompt)
    display(Markdown(summary + '\n-----'))

    text, struct = eval_summary([prompt, document_file], summary)
    display(Markdown(text + '\n-----'))
    print(struct)

run_and_eval_summary(new_prompt)

Okay, I can explain the training process of a large language model like Gemini 1.5 in a way that a 5-year-old can understand.

Imagine you have a puppy, and you want to teach it to understand and follow your instructions.

1.  **Gathering Lots of Examples (Data):** First, you need to show the puppy lots and lots of examples. You might show it pictures of cats and say "cat," pictures of dogs and say "dog," and so on. For Gemini, this means feeding it tons of text, pictures, videos, and sounds from the internet and books. It's like showing the puppy everything in the world!

2.  **Teaching the Puppy (Training):** Now, you start teaching the puppy what things mean. You might say, "Fetch the ball!" and then reward the puppy with a treat when it brings you the ball. For Gemini, this means the computer is learning to predict what word or picture comes next in a sequence. If it predicts correctly, it gets a "treat" (a small adjustment to its internal settings).

3.  **Making the Puppy Smarter (Fine-tuning):** After the puppy knows the basics, you can teach it more complicated tricks. You might say, "Sit and stay!" and then reward the puppy when it does both things correctly. For Gemini, this means giving it specific instructions and examples of how to answer questions, write stories, or translate languages.

4.  **Testing the Puppy (Evaluation):** Finally, you need to test the puppy to see if it has learned everything correctly. You might give it a new command and see if it follows it. For Gemini, this means giving it tests to see if it can answer questions correctly, write good stories, and translate languages accurately.

So, the training process is all about showing the computer lots of examples, teaching it what things mean, making it smarter with specific instructions, and then testing it to see if it has learned everything correctly. It's like training a puppy, but with computers instead of dogs!
-----

## Evaluation

### STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.
Instruction following: The response follows the instruction of explaining like I'm 5.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response is concise.
Fluency: The response is well-organized and easy to read.

### STEP 2: Score based on the rubric.
5

-----

SummaryRating.VERY_GOOD
