# Context Caching with Gemini

This notebook demonstrates how to use the context caching feature of the Gemini API. Context caching lets you send a large amount of context (such as documents or lengthy instructions) to the model once and then refer to that cached context in subsequent requests, saving up to 75% on the cost of the cached input tokens. When you cache a set of tokens, you choose how long the cache should exist before the tokens are automatically deleted. This caching duration is called the time to live (TTL). In this example, we cache a GitHub repository so we can easily ask questions about it.
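To put the "up to 75%" figure into numbers, here is a minimal cost sketch. It assumes cached input tokens are billed at 25% of the regular input rate and uses a placeholder price of $1.00 per million input tokens. Real prices, cache storage fees, and output tokens are not modeled, so check the pricing page for actual rates. The function name `input_cost` is hypothetical.

```python
def input_cost(uncached_tokens: int, cached_tokens: int,
               price_per_million: float, cached_discount: float = 0.75) -> float:
    """Cost of the input part of one request.

    Assumption: cached tokens cost (1 - cached_discount) of the normal rate.
    Cache storage fees and output tokens are not included.
    """
    per_token = price_per_million / 1_000_000
    cached_rate = per_token * (1 - cached_discount)
    return uncached_tokens * per_token + cached_tokens * cached_rate


# Hypothetical example: a 100,000-token document plus a 50-token question,
# priced at a placeholder rate of $1.00 per million input tokens.
without_cache = input_cost(100_050, 0, price_per_million=1.0)
with_cache = input_cost(50, 100_000, price_per_million=1.0)
print(f"without cache: ${without_cache:.4f}")
print(f"with cache:    ${with_cache:.4f}")
print(f"savings:       {1 - with_cache / without_cache:.1%}")
```

Because the discount applies only to the cached part of the prompt, the saving approaches 75% as the cached share of the input grows; a short uncached question barely moves it.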

How it works:
1. You create a cache including the content (text, files), an optional system instruction, and a time-to-live (TTL).
2. When generating content, you reference the cache name instead of adding the context to the `contents`. The model uses the cached information alongside your new prompt.
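The TTL is passed as a duration string in seconds with an `s` suffix; the cache created later in this notebook uses `ttl="300s"` (five minutes). A small pair of helpers can convert between that string and a `timedelta`. The helper names are hypothetical and not part of the `google-genai` SDK.

```python
from datetime import timedelta


def ttl_string(duration: timedelta) -> str:
    """Format a positive duration as a TTL string such as "300s"."""
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("TTL must be at least one second")
    return f"{seconds}s"


def parse_ttl(ttl: str) -> timedelta:
    """Turn a TTL string such as "300s" back into a timedelta."""
    if not ttl.endswith("s"):
        raise ValueError(f"Expected a value like '300s', got {ttl!r}")
    return timedelta(seconds=float(ttl[:-1]))


print(ttl_string(timedelta(minutes=5)))  # 300s
print(parse_ttl("300s"))                 # 0:05:00
```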

* Pricing: https://ai.google.dev/gemini-api/docs/pricing
* Gemini API Caching Documentation: https://ai.google.dev/gemini-api/docs/caching?lang=python 


In [None]:
%pip install google-genai gitingest

In [1]:
import os
from google import genai
from gitingest import ingest_async

# create client
client = genai.Client()

model_id = "gemini-2.5-pro" # "gemini-2.0-flash"
system_instruction = "You are a helpful coding assistant with the FastMCP GitHub repository available in context. If a user asks a question about FastMCP or how to build an MCP server, use the available information."

# Load the FastMCP repository, excluding tests, CI config, and bulky files (JSON, CSS, JS, lock file)
summary, tree, content = await ingest_async("https://github.com/jlowin/fastmcp", exclude_patterns="*.json, *.css, *.js, uv.lock, python-version, tests/, .github/")



In [2]:
# Create a cached content object
cache = client.caches.create(
    model=model_id,
    config=genai.types.CreateCachedContentConfig(
        system_instruction=system_instruction,
        contents=[content],
        ttl="300s",
    ),
)

# Display the cache details
print(
    f"Cache Details:\n"
    f"name: {cache.name}\n"
    f"model: {cache.model}\n"
    f"expire_time: {cache.expire_time.astimezone().isoformat(timespec='seconds')}\n"
    f"Token Count: {cache.usage_metadata.total_token_count} tokens"
)


Cache Details:
name: cachedContents/391i55qc71miw8dfql5q9cl0cbv6wvbs4qimfyy1
model: models/gemini-2.5-pro
expire_time: 2025-09-21T16:21:15+02:00
Token Count: 692469 tokens
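A cache lives until its TTL runs out. If you need it longer, or want to stop paying for cache storage sooner, the SDK's `client.caches` module provides `update`, `list`, and `delete`. The wrappers below are a minimal sketch that assumes the `client` and `cache` objects from the cells above; the function names are hypothetical, and the config is passed as a plain dict, which the SDK accepts in place of `types.UpdateCachedContentConfig`.

```python
def extend_cache(client, cache_name: str, ttl: str = "600s"):
    """Set a new TTL on an existing cache and return the updated cache object."""
    return client.caches.update(name=cache_name, config={"ttl": ttl})


def list_cache_names(client) -> list:
    """Return the names of the caches you have created (the pager is fully iterated)."""
    return [c.name for c in client.caches.list()]


def delete_cache(client, cache_name: str) -> None:
    """Delete a cache right away instead of waiting for its TTL to expire."""
    client.caches.delete(name=cache_name)


# Usage in this notebook (requires a live client and the cache created above):
# cache = extend_cache(client, cache.name, ttl="900s")
# print(list_cache_names(client))
# delete_cache(client, cache.name)
```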


In [3]:
# Generate content using the cached prompt and document
response = client.models.generate_content(
    model=model_id,
    contents="Build a simple MCP server for reading and writing local files under /tmp/mcp",
    config=genai.types.GenerateContentConfig(
        cached_content=cache.name,
    ),
)

# Print usage metadata for insights into the API call
usage = response.usage_metadata
print(
    f"Cached Tokens used: {usage.cached_content_token_count}\n"
    f"No Cache Tokens used: {usage.prompt_token_count - usage.cached_content_token_count}\n"
    f"Thoughts Tokens used: {usage.thoughts_token_count}\n"
    f"Output Tokens used: {usage.candidates_token_count}"
)

Cached Tokens used: 692469
No Cache Tokens used: 18
Thoughts Tokens used: 906
Output Tokens used: 1929
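The cell above subtracts `cached_content_token_count` from `prompt_token_count`, which works here because the explicit cache guarantees a cached count. When no cache is hit, that field can be `None` (the implicit caching loop at the end of this notebook checks for exactly that), and the subtraction would fail. A helper that treats missing counts as zero is a safer sketch; `usage_summary` is a hypothetical name, and, like the cell above, it assumes `prompt_token_count` includes the cached tokens.

```python
from types import SimpleNamespace


def usage_summary(usage) -> dict:
    """Split usage metadata into cached, uncached, thinking, and output token counts.

    Any count reported as None is treated as 0.
    """
    prompt = usage.prompt_token_count or 0
    cached = usage.cached_content_token_count or 0
    return {
        "cached": cached,
        "uncached": prompt - cached,
        "thoughts": usage.thoughts_token_count or 0,
        "output": usage.candidates_token_count or 0,
    }


# Stand-in for response.usage_metadata with the counts printed above
# (prompt_token_count = 692469 cached + 18 uncached).
usage = SimpleNamespace(
    prompt_token_count=692487,
    cached_content_token_count=692469,
    thoughts_token_count=906,
    candidates_token_count=1929,
)
print(usage_summary(usage))
# {'cached': 692469, 'uncached': 18, 'thoughts': 906, 'output': 1929}
```

In the notebook itself you would call `usage_summary(response.usage_metadata)` directly.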


In [4]:
# Print the generated text
print(response.text)

Of course. Here is a simple but complete MCP server built with FastMCP that allows for reading, writing, and listing files within a sandboxed `/tmp/mcp` directory.

This server demonstrates several core FastMCP concepts:
*   **Tools (`@mcp.tool`)** for actions that have side effects (like writing a file).
*   **Resources (`@mcp.resource`)** for exposing data (like listing files).
*   **Resource Templates** for dynamically generating resources based on parameters (like reading a specific file).

### `file_server.py`

```python
import os
from pathlib import Path
from fastmcp import FastMCP, Context
from fastmcp.exceptions import ToolError, ResourceError

# --- Configuration ---
# Define a safe base directory to prevent access to other parts of the filesystem.
# All file operations will be restricted to this "sandbox".
BASE_DIR = Path("/tmp/mcp_file_server").resolve()

# --- Server Setup ---
# Create the FastMCP server instance.
mcp = FastMCP(name="Local File Server")

# --- Security Help
```

## Implicit Caching

The Gemini API also supports [implicit caching](https://ai.google.dev/gemini-api/docs/caching?lang=python), which automatically gives you a 75% discount on cached tokens when your requests hit the cache. If you send a request to a Gemini 2.5 model that shares a common prefix with one of your previous requests, it is eligible for a cache hit. The minimum input token count for context caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro.
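Implicit caching needs no setup, but two things raise the chance of a hit: the prompt must reach the minimum size, and the large, shared content should come first so that repeated requests share a prefix (the Gemini caching docs recommend putting large and common content at the beginning of the prompt). The sketch below encodes the thresholds quoted above, which may change over time, in a hypothetical lookup table. Meeting the minimum only makes a request eligible; it does not guarantee a hit.

```python
# Minimum prompt sizes for a cache hit, as quoted above (check the docs for current values).
MIN_CACHEABLE_TOKENS = {
    "gemini-2.5-flash": 1024,
    "gemini-2.5-pro": 2048,
}


def may_hit_implicit_cache(model_id: str, prompt_tokens: int) -> bool:
    """Return True if the prompt reaches the minimum size for an implicit cache hit."""
    minimum = MIN_CACHEABLE_TOKENS.get(model_id)
    if minimum is None:
        raise ValueError(f"No minimum known for {model_id!r}")
    return prompt_tokens >= minimum


def build_contents(shared_prefix, question):
    """Put the large, shared content first and the varying question last."""
    return [shared_prefix, question]


print(may_hit_implicit_cache("gemini-2.5-pro", 2596))  # True
```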




In [6]:
import os
from google import genai

# create client
client = genai.Client()
model_id = "gemini-2.5-pro"  # or "gemini-2.5-flash"

# Upload a large PDF file
file_path = "../assets/2025q1-alphabet-earnings-release.pdf"
pdf_file = client.files.upload(file=file_path)

# Count the tokens in the PDF
tokens = client.models.count_tokens(model=model_id, contents=[pdf_file])
print(f"Tokens: {tokens.total_tokens}")

Tokens: 2581


In [7]:
# Ask the model to summarize the earnings release
instruction = "Summarize the earnings release for the first quarter of 2025"
response_1 = client.models.generate_content(
    model=model_id,
    contents=[pdf_file, instruction],
)

# Print usage metadata for insights into the API call
print(f"input tokens: {response_1.usage_metadata.prompt_token_count}")

input tokens: 2596


In [8]:
# 2nd request which uses the cached prefix (pdf file)
instruction = "What are focus areas for the second quarter of 2025?"
response_2 = client.models.generate_content(
    model=model_id,
    contents=[pdf_file, instruction],
)

print(f"cached tokens: {response_2.usage_metadata.cached_content_token_count}")

cached tokens: 2160
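Only part of the prompt was served from the cache. A quick calculation shows what that means for cost. This is a sketch under two assumptions: cached tokens are billed at 25% of the normal input rate, and the second prompt is about the same size as the first (2,596 tokens; only the first request's count was printed). `cache_stats` is a hypothetical helper.

```python
def cache_stats(prompt_tokens: int, cached_tokens: int, cached_discount: float = 0.75):
    """Return (share of the prompt served from cache, estimated input-cost saving).

    Assumes cached tokens are billed at (1 - cached_discount) of the normal input rate.
    """
    hit_ratio = cached_tokens / prompt_tokens
    uncached = prompt_tokens - cached_tokens
    relative_cost = (uncached + cached_tokens * (1 - cached_discount)) / prompt_tokens
    return hit_ratio, 1 - relative_cost


# Numbers from the requests above: about 2596 input tokens, 2160 of them cached.
ratio, saving = cache_stats(2596, 2160)
print(f"cached share: {ratio:.1%}, estimated input saving: {saving:.1%}")
# cached share: 83.2%, estimated input saving: 62.4%
```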


In [10]:
from google import genai

# create client
client = genai.Client()
model_id = "gemini-2.5-flash" # "gemini-2.5-pro"


cache_count = 0
for i in range(10):
    # upload big pdf file
    file_path = "../assets/2025q1-alphabet-earnings-release.pdf"
    pdf_file = client.files.upload(file=file_path)
    # Ask the model to summarize the earnings release
    instruction = "Summarize the earnings release for the first quarter of 2025"
    response_1 = client.models.generate_content(
        model=model_id,
        contents=[pdf_file, instruction],
    )
    # 2nd request which uses the cached prefix (pdf file)
    instruction = "What are focus areas for the second quarter of 2025?"
    response_2 = client.models.generate_content(
        model=model_id,
        contents=[pdf_file, instruction],
    )
    if response_2.usage_metadata.cached_content_token_count:
        print("cached")
        cache_count += 1
    else:
        print("no cached")

cached
cached
cached
cached
cached
cached
cached
cached
cached
cached
