# Caching Examples

## Setup

In [1]:
%pip install -U -q "google-genai>=1.0.0"  # Install the Python SDK

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see the [Authentication](../quickstarts/Authentication.ipynb) quickstart for an example.

In [2]:
from google.colab import userdata
from google import genai

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)

MODEL_ID = "gemini-2.5-flash"

In [3]:
from google.genai import types

import requests
import json
import math

questions = requests.get("https://raw.githubusercontent.com/phil-daniel/gemini-batcher/refs/heads/main/examples/demo_files/questions.txt").text.split('\n')
content = requests.get("https://raw.githubusercontent.com/phil-daniel/gemini-batcher/refs/heads/main/examples/demo_files/content.txt").text
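One thing to watch with `split('\n')`: if the downloaded file ends with a newline, the last list element is an empty string, so a negative index such as `questions[-1]` would pick up a blank question. Whether the demo file ends that way has not been checked here. The sketch below shows a more defensive parse; `parse_questions` is a hypothetical helper and not part of the original notebook.

```python
def parse_questions(text: str) -> list[str]:
    """Split raw text into one question per line, dropping blank lines.

    Hypothetical helper for illustration only; the notebook itself
    splits the downloaded text on newline characters.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


# A small stand-in for the downloaded questions file (assumed format:
# one question per line, possibly with trailing blank lines).
sample = "What is caching?\nWhat is a TTL?\n\n"
print(parse_questions(sample))
```

With this helper, `questions = parse_questions(requests.get(...).text)` would guarantee that every entry, including the last one, is a non-empty question.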

## Caching example 1 - Implicit Caching

This example shows how to check whether implicit caching took effect on an API call by inspecting `cached_content_token_count` in the response's `usage_metadata`.

In [4]:
response = client.models.generate_content(
    model=MODEL_ID,
    config=types.GenerateContentConfig(thinking_config=types.ThinkingConfig(thinking_budget=0)),
    contents=[f'Content:\n{content}', f'\nQuestion:\n{questions[-1]}']
)

print(f'Total input tokens: {response.usage_metadata.prompt_token_count}')
print(f'Total input tokens from cache: {response.usage_metadata.cached_content_token_count}')

Total input tokens: 13220
Total input tokens from cache: None


Now ask the same question again and check whether `cached_content_token_count` has changed.

In [5]:
response = client.models.generate_content(
    model=MODEL_ID,
    config=types.GenerateContentConfig(thinking_config=types.ThinkingConfig(thinking_budget=0)),
    contents=[f'Content:\n{content}', f'\nQuestion:\n{questions[-1]}']
)

print(f'Total input tokens: {response.usage_metadata.prompt_token_count}')
print(f'Total input tokens from cache: {response.usage_metadata.cached_content_token_count}')

Total input tokens: 13220
Total input tokens from cache: 12280
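
To make the two outputs above easier to compare, the cached share of the prompt can be computed directly. This is a minimal sketch: `cached_fraction` is a hypothetical helper, and it assumes only the two `usage_metadata` fields already printed above, treating a cached count of `None` as zero.

```python
from types import SimpleNamespace


def cached_fraction(usage_metadata) -> float:
    """Return the share of prompt tokens served from cache (0.0 to 1.0)."""
    total = usage_metadata.prompt_token_count or 0
    # The field is None when nothing was served from cache.
    cached = usage_metadata.cached_content_token_count or 0
    return cached / total if total else 0.0


# Stand-ins for response.usage_metadata, using the counts printed above.
first_call = SimpleNamespace(prompt_token_count=13220, cached_content_token_count=None)
second_call = SimpleNamespace(prompt_token_count=13220, cached_content_token_count=12280)

print(f"First call:  {cached_fraction(first_call):.1%}")
print(f"Second call: {cached_fraction(second_call):.1%}")
```

In the notebook itself, the same function could be called as `cached_fraction(response.usage_metadata)` after each request.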


## Caching example 2 - Explicit Caching

This example demonstrates explicit caching with the Gemini Python SDK. The entire transcript is uploaded to a cache once, and later queries reference that cache instead of resending the transcript in the `contents` parameter each time.

**NOTE: Explicit caching is currently only enabled for paid Gemini tiers**

In [None]:
# Adding the content (the transcript) to the cache.
cache = client.caches.create(
    model=MODEL_ID,
    config=types.CreateCachedContentConfig(
        display_name='transcript_content',  # A human-readable name that makes the cache easy to identify.
        contents=[content],  # The content to cache. Other media types, such as videos and images, can also be cached.
        ttl="300s",  # Time to live: how long the cache remains accessible (here, 300 seconds).
    )
)

response = client.models.generate_content(
    model=MODEL_ID,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0),
        cached_content=cache.name,  # Refer to the previously cached transcript.
    ),
    contents=[f'\nQuestion:\n{questions[-2]}']  # Only the question is passed here, not the transcript.
)

print(f'Total input tokens: {response.usage_metadata.prompt_token_count}')
print(f'Total input tokens from cache: {response.usage_metadata.cached_content_token_count}')
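
A cache created this way stays available until its TTL expires, so it can be worth deleting it explicitly once the queries are done. The sketch below wraps creation and cleanup in a context manager. `temporary_cache` is a hypothetical helper, not part of the SDK; it assumes `client.caches.create(model=..., config=...)` as used above and `client.caches.delete(name=...)` for removal.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cache(client, model, config):
    """Create a cache, yield it, and delete it when the block exits.

    Hypothetical helper: `client` is expected to be a genai.Client and
    `config` a types.CreateCachedContentConfig like the one above.
    """
    cache = client.caches.create(model=model, config=config)
    try:
        yield cache
    finally:
        # Runs even if a query inside the `with` block raises.
        client.caches.delete(name=cache.name)


# Usage, assuming `client`, `types`, `content`, and MODEL_ID from the cells above:
#
# cache_config = types.CreateCachedContentConfig(contents=[content], ttl="300s")
# with temporary_cache(client, MODEL_ID, cache_config) as cache:
#     response = client.models.generate_content(
#         model=MODEL_ID,
#         config=types.GenerateContentConfig(cached_content=cache.name),
#         contents=["Question: ..."],
#     )
```

The design choice here is simply to tie the cache's lifetime to a block of code, so cleanup does not depend on remembering a separate delete call.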