##### Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemini API: Context Caching Quickstart

<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Caching.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

This notebook introduces context caching with the Gemini API and provides examples of interacting with the Apollo 11 transcript using the Python SDK. For a more comprehensive look, check out [the caching guide](https://ai.google.dev/gemini-api/docs/caching?lang=python).

### Install dependencies

In [None]:
%pip install -q -U "google-genai>=1.0.0"

### Configure your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../quickstarts/Authentication.ipynb) for an example.

In [None]:
from google.colab import userdata
from google import genai

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)

## Upload a file

A common pattern with the Gemini API is to ask several questions about the same document. Context caching is designed for this case, and can be more efficient because the same tokens do not have to be passed through the model for each new request.

This example will be based on the transcript from the Apollo 11 mission.

Start by downloading that transcript.

In [None]:
!wget -q https://storage.googleapis.com/generativeai-downloads/data/a11.txt
!head a11.txt

INTRODUCTION

This is the transcription of the Technical Air-to-Ground Voice Transmission (GOSS NET 1) from the Apollo 11 mission.

Communicators in the text may be identified according to the following list.

Spacecraft:
CDR	Commander	Neil A. Armstrong
CMP	Command module pilot   	Michael Collins
LMP	Lunar module pilot	Edwin E. ALdrin, Jr.


Now upload the transcript using the [File API](../quickstarts/File_API.ipynb).

In [None]:
document = client.files.upload(file="a11.txt")

## Cache the prompt

Next, create a [`CachedContent`](https://ai.google.dev/api/python/google/generativeai/protos/CachedContent) object specifying the prompt you want to use, including the file and other fields you wish to cache. In this example the [`system_instruction`](../quickstarts/System_instructions.ipynb) is set and the document is provided as the cached content.

Note that caches are model specific: a cache created for one model cannot be used with another, because their tokenization might differ slightly.

In [None]:
# Note that caching requires a frozen model, e.g. one with a `-001` suffix.
MODEL_ID = "gemini-2.5-flash-preview-04-17"  # @param ["gemini-1.5-flash-8b-latest", "gemini-1.5-flash-002", "gemini-1.5-pro-002", "gemini-2.0-flash-001", "gemini-2.5-flash-preview-04-17", "gemini-2.5-pro-preview-03-25"] {"allow-input":true, isTemplate: true}

apollo_cache = client.caches.create(
    model=MODEL_ID,
    config={
        'contents': [document],
        'system_instruction': 'You are an expert at analyzing transcripts.',
    },
)

apollo_cache

CachedContent(name='cachedContents/lbe3mahxf23r', display_name='', model='models/gemini-2.5-flash-preview-04-17', create_time=datetime.datetime(2025, 4, 18, 9, 33, 11, 724277, tzinfo=TzInfo(UTC)), update_time=datetime.datetime(2025, 4, 18, 9, 33, 11, 724277, tzinfo=TzInfo(UTC)), expire_time=datetime.datetime(2025, 4, 18, 10, 33, 6, 88300, tzinfo=TzInfo(UTC)), usage_metadata=CachedContentUsageMetadata(audio_duration_seconds=None, image_count=None, text_count=None, total_token_count=322698, video_duration_seconds=None))
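
Because caches are model specific, a quick check before generating can catch a mismatch early. The sketch below is a hypothetical helper (`uses_same_model` is not part of the SDK). It assumes, as the output above shows, that the cache reports its model with a `models/` prefix while `MODEL_ID` omits it.

```python
def uses_same_model(cache_model: str, model_id: str) -> bool:
    """Return True when a cache's model name matches the model ID in use.

    Hypothetical helper: strips the optional "models/" prefix from both
    names before comparing them.
    """
    def strip_prefix(name: str) -> str:
        return name.removeprefix("models/")

    return strip_prefix(cache_model) == strip_prefix(model_id)


# Values copied from the output above.
print(uses_same_model("models/gemini-2.5-flash-preview-04-17",
                      "gemini-2.5-flash-preview-04-17"))
```

In this notebook the call would be `uses_same_model(apollo_cache.model, MODEL_ID)`.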

In [None]:
from IPython.display import Markdown

display(Markdown(f"As you can see in the output, you just cached **{apollo_cache.usage_metadata.total_token_count}** tokens."))

As you can see in the output, you just cached **322698** tokens.

## Manage the cache expiry

Once you have a `CachedContent` object, you can update the expiry time to keep it alive while you need it.

In [None]:
from google.genai import types

client.caches.update(
    name=apollo_cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s")  # 2 hours in seconds
)

apollo_cache = client.caches.get(name=apollo_cache.name) # Get the updated cache
apollo_cache

CachedContent(name='cachedContents/lbe3mahxf23r', display_name='', model='models/gemini-2.5-flash-preview-04-17', create_time=datetime.datetime(2025, 4, 18, 9, 33, 11, 724277, tzinfo=TzInfo(UTC)), update_time=datetime.datetime(2025, 4, 18, 9, 33, 12, 390573, tzinfo=TzInfo(UTC)), expire_time=datetime.datetime(2025, 4, 18, 11, 33, 11, 918775, tzinfo=TzInfo(UTC)), usage_metadata=CachedContentUsageMetadata(audio_duration_seconds=None, image_count=None, text_count=None, total_token_count=322698, video_duration_seconds=None))
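
The `ttl` value above is a number of seconds followed by `s`. As a minimal sketch, the hypothetical helper below (`ttl_seconds` is not part of the SDK) builds that string from a `datetime.timedelta`, which can be easier to read than a raw number of seconds. It only handles whole seconds.

```python
from datetime import timedelta


def ttl_seconds(duration: timedelta) -> str:
    """Format a timedelta as a whole-second TTL string such as "7200s"."""
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("TTL must be a positive duration")
    return f"{seconds}s"


print(ttl_seconds(timedelta(hours=2)))  # prints 7200s
```

The result could then be passed as `ttl=ttl_seconds(timedelta(hours=2))` in the `UpdateCachedContentConfig` shown above.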

## Use the cache for generation

As the `CachedContent` object refers to a specific model and parameters, pass the same model ID to `generate_content` and reference the cache by name through the `cached_content` field of `GenerateContentConfig`. Apart from that, generate content as you would without a cache.

In [None]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents='Find a lighthearted moment from this transcript',
    config=types.GenerateContentConfig(
        cached_content=apollo_cache.name,
    )
)

display(Markdown(response.text))

Here is a lighthearted moment from the transcript, where the crew and Mission Control chat about baseball and eating contests:

**Time: 02 05 57 48 - 02 05 58 30**

**CMP (COLUMBIA):** Roger. I assume Houston didn't play yesterday.
**CC:** That's correct.
**CMP (COLUMBIA):** I'd like to enter Aldrin in the oatmeal eating contest next time.
**CC:** Is he pretty good at that?
**CMP (COLUMBIA):** He's doing his share up here.
**CC:** Let's see. You all just finished a meal not long ago, too, didn't you?
**LMP:** I'm still eating.
**CMP (COLUMBIA):** He's on his - He's on his 19th bowl.

You can inspect token usage through `usage_metadata`. Note that the cached prompt tokens are included in both `prompt_token_count` and `total_token_count`, as the numbers below show.

In [None]:
response.usage_metadata

GenerateContentResponseUsageMetadata(cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=322698)], cached_content_token_count=322698, candidates_token_count=20669, candidates_tokens_details=None, prompt_token_count=322707, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=322707)], thoughts_token_count=20474, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=343376, traffic_type=None)

In [None]:
display(Markdown(f"""
  As you can see in the `usage_metadata`, the token usage is split between:
  *  {response.usage_metadata.cached_content_token_count} tokens for the cache,
  *  {response.usage_metadata.prompt_token_count} tokens for the input (including the cache, so {response.usage_metadata.prompt_token_count - response.usage_metadata.cached_content_token_count} for the actual prompt),
  *  {response.usage_metadata.candidates_token_count} tokens for the output,
  *  {response.usage_metadata.total_token_count} tokens in total.
"""))


  As you can see in the `usage_metadata`, the token usage is split between:
  *  322698 tokens for the cache,
  *  322707 tokens for the input (including the cache, so 9 for the actual prompt),
  *  20669 tokens for the output,
  *  343376 tokens in total.
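
The arithmetic behind this summary can be checked by hand. The sketch below uses the numbers from the output above rather than a live response. `split_usage` is a hypothetical helper, not an SDK function.

```python
def split_usage(prompt_tokens: int, cached_tokens: int, output_tokens: int) -> dict:
    """Split token usage into cached input, new input, output, and total."""
    return {
        "cached": cached_tokens,
        # prompt_token_count already includes the cached tokens.
        "new_prompt": prompt_tokens - cached_tokens,
        "output": output_tokens,
        # In the output above, the total equals prompt plus output tokens.
        "total": prompt_tokens + output_tokens,
    }


usage = split_usage(prompt_tokens=322707, cached_tokens=322698, output_tokens=20669)
print(usage)
```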


You can ask new questions of the model, and the cache is reused.

In [None]:
chat = client.chats.create(
  model=MODEL_ID,
  config={"cached_content": apollo_cache.name}
)

response = chat.send_message(message="Give me a quote from the most important part of the transcript.")
display(Markdown(response.text))

Based on the historical significance of the Apollo 11 mission, the most important part of this transcript is likely the communication during the lunar landing and the first steps on the Moon.

Here is the famous quote from that moment:

04 13 24 48 CDR (TRANQ)
THAT'S ONE SMALL STEP FOR (A) MAN, ONE GIANT LEAP FOR MANKIND.

This quote, spoken by Neil Armstrong upon stepping onto the lunar surface, is widely considered one of the most significant statements in human history and is the defining moment of the Apollo 11 mission.

In [None]:
response = chat.send_message(
    message="What was recounted after that?",
    config={"cached_content": apollo_cache.name}
)
display(Markdown(response.text))

Immediately after the famous quote, Commander Neil Armstrong (CDR) recounted his initial observations of the lunar surface and his experience of moving around:

1.  **Surface Description:** He described the surface as "fine and powdery." He noted that he could pick it up loosely with his toe and that it adhered in fine layers "like powdered charcoal" to the sole and sides of his boots. He estimated he only sank a small fraction of an inch (maybe an eighth of an inch) and could clearly see his footprints and the treads in the particles.
2.  **Mobility Assessment:** He stated there seemed to be "no difficulty in moving around," finding it perhaps even easier than the 1/6g simulations performed on the ground. He confirmed it was "actually no trouble to walk around."
3.  **Engine Impact:** He mentioned that the descent engine did not leave a crater of any size and that the LM had about 1 foot of clearance on the ground. He noted seeing "some evidence of rays emanating from the descent engine, but a very insignificant amount."

CapCom Charlie Duke (CC) acknowledged these transmissions from Houston, stating, "Neil, this is Houston. We're copying."

Neil then transitioned to the next activity, asking Buzz Aldrin, "Okay, Buzz, we ready to bring down the camera?"

In [None]:
response.usage_metadata

GenerateContentResponseUsageMetadata(cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=322698)], cached_content_token_count=322698, candidates_token_count=1031, candidates_tokens_details=None, prompt_token_count=322846, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=322846)], thoughts_token_count=757, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=323877, traffic_type=None)

In [None]:
display(Markdown(f"""
  As you can see in the `usage_metadata`, the token usage is split between:
  *  {response.usage_metadata.cached_content_token_count} tokens for the cache,
  *  {response.usage_metadata.prompt_token_count} tokens for the input (including the cache, so {response.usage_metadata.prompt_token_count - response.usage_metadata.cached_content_token_count} for the actual prompt),
  *  {response.usage_metadata.candidates_token_count} tokens for the output,
  *  {response.usage_metadata.total_token_count} tokens in total.
"""))


  As you can see in the `usage_metadata`, the token usage is split between:
  *  322698 tokens for the cache,
  *  322846 tokens for the input (including the cache, so 148 for the actual prompt),
  *  1031 tokens for the output,
  *  323877 tokens in total.


Since cached tokens are cheaper than regular ones, this prompt was much cheaper than it would have been without caching. Check the [pricing page](https://ai.google.dev/pricing) for the up-to-date discount on cached tokens.
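
To get a feel for the saving, the sketch below compares the input cost with and without caching for the token counts above. The two prices are placeholders chosen for illustration only, not actual Gemini API prices, and the cache storage cost is ignored. Substitute the values from the pricing page.

```python
# Placeholder prices in dollars per million input tokens (NOT real prices).
PRICE_PER_MILLION = 1.00
CACHED_PRICE_PER_MILLION = 0.25


def input_cost(prompt_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate input cost, billing cached tokens at the cached rate."""
    new_tokens = prompt_tokens - cached_tokens
    return (new_tokens * PRICE_PER_MILLION
            + cached_tokens * CACHED_PRICE_PER_MILLION) / 1_000_000


# Token counts taken from the last usage_metadata output above.
without_cache = input_cost(322846)
with_cache = input_cost(322846, cached_tokens=322698)
print(f"without caching: ${without_cache:.4f}, with caching: ${with_cache:.4f}")
```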

## Delete the cache

The cache has a small recurring storage cost (see [pricing](https://ai.google.dev/pricing)), so by default it is only kept for an hour. In this case you extended that to 2 hours (using `ttl`).

Still, if you no longer need your cache, it is good practice to delete it proactively.

In [None]:
print(apollo_cache.name)
client.caches.delete(name=apollo_cache.name)

cachedContents/lbe3mahxf23r


DeleteCachedContentResponse()
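
To make sure a cache is deleted even when a cell raises an error, the create and delete calls can be wrapped in a context manager. This is a minimal sketch: `temporary_cache` is a hypothetical helper, and it assumes only the `client.caches.create` and `client.caches.delete` calls already used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cache(client, model, config):
    """Create a cache, yield it, and always delete it afterwards."""
    cache = client.caches.create(model=model, config=config)
    try:
        yield cache
    finally:
        # Runs on normal exit and on exceptions alike.
        client.caches.delete(name=cache.name)
```

Used as `with temporary_cache(client, MODEL_ID, {'contents': [document]}) as cache:`, the cache would be removed as soon as the block exits.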

## Next Steps
### Useful API references:

If you want to know more about the caching API, you can check the full [API specifications](https://ai.google.dev/api/rest/v1beta/cachedContents) and the [caching documentation](https://ai.google.dev/gemini-api/docs/caching).

### Continue your discovery of the Gemini API

Check the [File API notebook](../quickstarts/File_API.ipynb) to learn more about that API. The [vision capabilities](../quickstarts/Video.ipynb) of the Gemini API are a good reason to use the File API together with caching.
The Gemini API also has configurable [safety settings](../quickstarts/Safety.ipynb) that you might have to customize when dealing with big files.
