##### Copyright 2026 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemini API: Cost and latency optimization patterns

<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/Cost_and_Latency_Optimization.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

<!-- Community Contributor Badge -->
<table>
  <tr>
    <!-- Author Avatar Cell -->
    <td bgcolor="#d7e6ff">
      <a href="https://github.com/pankaj0695" target="_blank" title="View Pankaj's profile on GitHub">
        <img src="https://github.com/pankaj0695.png?size=100"
             alt="pankaj0695's GitHub avatar"
             width="100"
             height="100">
      </a>
    </td>
    <!-- Text Content Cell -->
    <td bgcolor="#d7e6ff">
      <h2><font color='black'>This notebook was contributed by <a href="https://github.com/pankaj0695" target="_blank"><font color='#217bfe'><strong>Pankaj Gupta</strong></font></a>.</font></h2>
      <h5><font color='black'><a href="https://www.linkedin.com/in/pankajgupta0695/" target="_blank"><font color="#078efb">LinkedIn</font></a> - See <a href="https://github.com/pankaj0695" target="_blank"><font color="#078efb"><strong>Pankaj</strong></font></a>'s other notebooks <a href="https://github.com/search?q=repo%3Agoogle-gemini%2Fcookbook%20%22pankaj0695%22&type=code" target="_blank"><font color="#078efb">here</font></a>.</font></h5><br>
      <!-- Footer -->
      <font color='black'><small><em>Have a cool Gemini example? Feel free to <a href="https://github.com/google-gemini/cookbook/blob/main/CONTRIBUTING.md" target="_blank"><font color="#078efb">share it too</font></a>!</em></small></font>
    </td>
  </tr>
</table>

This notebook demonstrates practical techniques to reduce **cost** and **latency** when using the Gemini API. You will run the same tasks using different optimization strategies and compare results with measurable metrics (tokens, time).

**What you will learn:**
1. Count and estimate tokens before making requests.
2. Use **streaming** for lower perceived latency (faster time-to-first-token).
3. Reduce context size with prompt trimming and summarization.
4. Compare models (Flash vs Pro) for cost/latency tradeoffs.
5. Use the **Batch API** for high-throughput, non-urgent workloads.

By the end, you will have a reusable "playbook" for making your Gemini apps faster and cheaper.

## Setup

### Install SDK

In [3]:
%pip install -U -q "google-genai>=1.0.0"

### Set up your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see the [Authentication ![image](https://storage.googleapis.com/generativeai-downloads/images/colab_icon16.png)](../quickstarts/Authentication.ipynb) quickstart for an example.

In [None]:
from google.colab import userdata
from google import genai
from google.genai import types

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)

### Choose a model

Select a model to use throughout this guide. You will compare different models later in the notebook.

In [6]:
MODEL_ID = "gemini-2.5-flash" # @param ["gemini-2.5-flash-lite", "gemini-2.5-flash", "gemini-2.5-pro", "gemini-3-flash-preview", "gemini-3-pro-preview"] {"allow-input":true, isTemplate: true}

### Helper functions for timing

These helpers measure request latency and display results in a consistent format.

In [None]:
# @title Timing helpers
import time

def timed_generate(model_id, contents, config=None):
    """Run a synchronous request and return (response, elapsed seconds)."""
    start = time.perf_counter()
    response = client.models.generate_content(
        model=model_id,
        contents=contents,
        config=config
    )
    elapsed = time.perf_counter() - start
    return response, elapsed


def print_metrics(label, response, elapsed):
    """Print token usage and wall-clock time for one response."""
    usage = response.usage_metadata
    print(f"\n=== {label} ===")
    print(f"Input tokens:  {usage.prompt_token_count}")
    print(f"Output tokens: {usage.candidates_token_count}")
    print(f"Total tokens:  {usage.total_token_count}")
    print(f"Time:          {elapsed:.2f}s")

### Define a sample task

Use a consistent prompt throughout to compare optimization strategies fairly.

In [9]:
SAMPLE_PROMPT = """
Explain the concept of neural networks to a high school student.
Cover: what they are, how they learn and give one real-world example.
Keep your answer under 200 words.
"""

## 1. Counting tokens before requests

Knowing your token usage **before** making a request helps you estimate cost and stay within context limits. Use `client.models.count_tokens` to count tokens without generating a response.

In [10]:
token_count = client.models.count_tokens(
    model=MODEL_ID,
    contents=SAMPLE_PROMPT
)

print(f"Prompt token count: {token_count.total_tokens}")

Prompt token count: 44
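
A token count can be turned into a rough cost estimate before anything is sent. The sketch below is only an illustration: `estimate_cost_usd` is a hypothetical helper, and the two prices are placeholders rather than real Gemini prices, so substitute current values from the pricing page.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Rough cost estimate from token counts and per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Placeholder prices (USD per 1M tokens); NOT real Gemini pricing.
EXAMPLE_INPUT_PRICE = 1.00
EXAMPLE_OUTPUT_PRICE = 4.00

# 44 prompt tokens (counted above) and an assumed budget of 300 output tokens.
cost = estimate_cost_usd(44, 300, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
print(f"Estimated cost per request: ${cost:.6f}")
```

Output tokens are not known in advance, so the estimate uses an assumed upper bound for them.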


You can also check the model's context window to ensure your input fits:

In [11]:
model_info = client.models.get(model=MODEL_ID)

print(f"Model: {MODEL_ID}")
print(f"Input token limit:  {model_info.input_token_limit:,} tokens")
print(f"Output token limit: {model_info.output_token_limit:,} tokens")

Model: gemini-2.5-flash
Input token limit:  1,048,576 tokens
Output token limit: 65,536 tokens
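
The token count and the input limit can be combined into a simple pre-flight check. This is a minimal sketch; `fits_in_context` and its `reserved_tokens` margin are hypothetical names introduced for illustration, not part of the SDK.

```python
def fits_in_context(prompt_tokens, input_token_limit, reserved_tokens=0):
    """Return True if the prompt plus a reserved margin fits the input limit."""
    return prompt_tokens + reserved_tokens <= input_token_limit


# Input limit printed above for gemini-2.5-flash.
INPUT_LIMIT = 1_048_576

# The 44-token sample prompt fits easily.
print(fits_in_context(44, INPUT_LIMIT))

# A hypothetical 1,040,000-token prompt with a 10,000-token margin does not.
print(fits_in_context(1_040_000, INPUT_LIMIT, reserved_tokens=10_000))
```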


## 2. Baseline: standard synchronous request

Start with a standard (non-streaming) request to establish a baseline for latency and token usage.

In [12]:
baseline_response, baseline_time = timed_generate(MODEL_ID, SAMPLE_PROMPT)

print_metrics("Baseline (synchronous)", baseline_response, baseline_time)
print(f"\nResponse preview: {baseline_response.text[:200]}...")


=== Baseline (synchronous) ===
Input tokens:  44
Output tokens: 175
Total tokens:  985
Time:          5.43s

Response preview: Imagine a neural network as a simplified digital "brain." It's a computer program inspired by how our brains work, made of interconnected "neurons" arranged in layers. Each neuron takes some input, do...
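
In the output above, the total (985) is much larger than input plus output (44 + 175 = 219). On Gemini 2.5 models the difference is most likely thinking tokens, which count toward the total but are not part of the visible response. The SDK is understood to report them separately as `usage_metadata.thoughts_token_count`; treat that field name as an assumption and check it against your SDK version. The arithmetic itself can be sketched with a hypothetical helper:

```python
def unaccounted_tokens(prompt_tokens, output_tokens, total_tokens):
    """Tokens in the total that are neither prompt nor visible output."""
    return total_tokens - prompt_tokens - output_tokens


# Numbers from the baseline run above.
extra = unaccounted_tokens(prompt_tokens=44, output_tokens=175, total_tokens=985)
print(f"Tokens not shown as input or output: {extra}")
```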


## 3. Streaming for faster time-to-first-token

Streaming returns chunks as they are generated, reducing **perceived latency**. The total generation time may be similar, but users see output faster.

**When to use streaming:**
- Chat or conversational UIs
- Long responses where users want to start reading immediately

In [13]:
start = time.perf_counter()
first_chunk_time = None
full_text = ""

for chunk in client.models.generate_content_stream(
    model=MODEL_ID,
    contents=SAMPLE_PROMPT
):
    if first_chunk_time is None:
        first_chunk_time = time.perf_counter() - start
    if chunk.text:
        full_text += chunk.text

total_time = time.perf_counter() - start

print("=== Streaming ===")
print(f"Time to first chunk: {first_chunk_time:.2f}s")
print(f"Total time:          {total_time:.2f}s")
print(f"\nResponse preview: {full_text[:200]}...")

=== Streaming ===
Time to first chunk: 3.65s
Total time:          4.70s

Response preview: Imagine a computer system built loosely like the human brain! A neural network is a collection of interconnected "nodes" (like brain cells) organized into layers: an input layer, one or more hidden la...


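**Result:** In this run the first chunk arrived after 3.65s, compared with 5.43s for the complete baseline response, while total streaming time (4.70s) stayed in the same range. Exact numbers vary between runs.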
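
The measurement loop can be factored into a reusable helper. The sketch below is an illustration under stated assumptions: `measure_stream` is a hypothetical name, and `fake_stream` stands in for the model so the code runs without an API key.

```python
import time


def measure_stream(chunks):
    """Consume an iterable of text chunks; return (first_chunk_s, total_s, text)."""
    start = time.perf_counter()
    first_chunk_time = None
    parts = []
    for text in chunks:
        if first_chunk_time is None:
            first_chunk_time = time.perf_counter() - start
        if text:
            parts.append(text)
    total_time = time.perf_counter() - start
    return first_chunk_time, total_time, "".join(parts)


def fake_stream():
    """Stand-in for a model stream: three chunks, 50 ms apart."""
    for piece in ["Neural ", "networks ", "learn."]:
        time.sleep(0.05)
        yield piece


ttfc, total, text = measure_stream(fake_stream())
print(f"Time to first chunk: {ttfc:.2f}s, total: {total:.2f}s")
print(text)
```

With the real client, the same helper could be fed `(chunk.text for chunk in client.models.generate_content_stream(model=MODEL_ID, contents=SAMPLE_PROMPT))`.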

## 4. Context reduction: prompt trimming

Fewer input tokens mean lower cost and often faster responses. Techniques include:
- Remove unnecessary context
- Summarize long documents before including them
- Use specific, concise instructions

Compare a verbose prompt with a concise one:

In [None]:
verbose_prompt = """
I would really like you to help me understand something. I'm a high school 
student and I've been hearing a lot about artificial intelligence and machine 
learning lately. Could you please explain to me, in simple terms that I can 
understand, what neural networks are? I'd like to know what they are, how they 
actually learn from data, and maybe you could give me one example of how they 
are used in the real world? Please try to keep your explanation relatively 
brief, maybe around 200 words or so if possible. Thank you so much!
"""

concise_prompt = """
Explain neural networks to a high school student: what they are, how they 
learn, one real-world example. Under 200 words.
"""

verbose_tokens = client.models.count_tokens(model=MODEL_ID, contents=verbose_prompt)
concise_tokens = client.models.count_tokens(model=MODEL_ID, contents=concise_prompt)

print(f"Verbose prompt: {verbose_tokens.total_tokens} tokens")
print(f"Concise prompt: {concise_tokens.total_tokens} tokens")
print(f"Token savings:  {verbose_tokens.total_tokens - concise_tokens.total_tokens} tokens")

Verbose prompt: 130 tokens
Concise prompt: 35 tokens
Token savings:  95 tokens


In [16]:
verbose_response, verbose_time = timed_generate(MODEL_ID, verbose_prompt)
concise_response, concise_time = timed_generate(MODEL_ID, concise_prompt)

print_metrics("Verbose prompt", verbose_response, verbose_time)
print_metrics("Concise prompt", concise_response, concise_time)


=== Verbose prompt ===
Input tokens:  130
Output tokens: 253
Total tokens:  1369
Time:          7.03s

=== Concise prompt ===
Input tokens:  35
Output tokens: 168
Total tokens:  903
Time:          5.35s


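**Takeaway:** Concise prompts reduce input tokens (here, from 130 to 35) and can improve latency. For prompts this small, much of the 7.03s vs. 5.35s gap likely comes from the shorter response (168 vs. 253 output tokens); input savings matter more for large contexts.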
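
Small per-request savings add up at volume. The sketch below works through the arithmetic for the two prompts measured above; `token_savings` is a hypothetical helper and the daily request count is an assumed figure for illustration.

```python
def token_savings(before_tokens, after_tokens):
    """Return (tokens saved, percentage saved) when a prompt is shortened."""
    saved = before_tokens - after_tokens
    percent = 100 * saved / before_tokens
    return saved, percent


# Prompt sizes measured above.
saved, percent = token_savings(130, 35)
print(f"Saved {saved} input tokens per request ({percent:.1f}%)")

# Hypothetical traffic figure, for illustration only.
REQUESTS_PER_DAY = 100_000
print(f"Input tokens saved per day: {saved * REQUESTS_PER_DAY:,}")
```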

## 5. Summarization for long documents

When working with long documents, summarize them first to reduce context size for downstream tasks.

In [17]:
long_document = """
Neural networks are a subset of machine learning and are at the heart of deep 
learning algorithms. Their name and structure are inspired by the human brain, 
mimicking the way that biological neurons signal to one another. Neural networks 
are composed of node layers, containing an input layer, one or more hidden layers, 
and an output layer. Each node, or artificial neuron, connects to another and has 
an associated weight and threshold. If the output of any individual node is above 
the specified threshold value, that node is activated, sending data to the next 
layer of the network. Otherwise, no data is passed along to the next layer.

Neural networks rely on training data to learn and improve their accuracy over time. 
Once these learning algorithms are fine-tuned for accuracy, they are powerful tools 
in computer science and artificial intelligence, allowing us to classify and cluster 
data at a high velocity. Tasks in speech recognition or image recognition can take 
minutes versus hours when compared to the manual identification by human experts.

Deep learning neural networks, or artificial neural networks, attempt to mimic the 
human brain through a combination of data inputs, weights, and bias. These elements 
work together to accurately recognize, classify, and describe objects within the data.
""" * 3

print(f"Original document tokens: {client.models.count_tokens(model=MODEL_ID, contents=long_document).total_tokens}")

Original document tokens: 809


In [18]:
# Step 1: Summarize the document
summary_response, _ = timed_generate(
    MODEL_ID,
    f"Summarize this in 2-3 sentences:\n\n{long_document}"
)
summary = summary_response.text

print(f"Summary tokens: {client.models.count_tokens(model=MODEL_ID, contents=summary).total_tokens}")
print(f"\nSummary: {summary}")

Summary tokens: 60

Summary: Neural networks are a subset of machine learning algorithms, inspired by the human brain's structure, composed of interconnected node layers that process data based on weights and thresholds. They learn from training data to classify and cluster information at high speeds, making them powerful tools for tasks like speech and image recognition.


In [19]:
# Step 2: Use summary for downstream task (instead of full document)
question = "Based on this context, what makes neural networks powerful?"

# Using full document
full_context_response, full_time = timed_generate(
    MODEL_ID,
    f"Context: {long_document}\n\nQuestion: {question}"
)

# Using summary
summary_context_response, summary_time = timed_generate(
    MODEL_ID,
    f"Context: {summary}\n\nQuestion: {question}"
)

print_metrics("Full document context", full_context_response, full_time)
print_metrics("Summary context", summary_context_response, summary_time)


=== Full document context ===
Input tokens:  825
Output tokens: 128
Total tokens:  1905
Time:          5.53s

=== Summary context ===
Input tokens:  76
Output tokens: 34
Total tokens:  417
Time:          2.16s
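
Summarization is not free: the summarization request reads the full document once, so the approach only saves input tokens when the summary is reused. The sketch below estimates the break-even point on input tokens alone (output and thinking tokens are ignored for simplicity). `break_even_queries` is a hypothetical helper, and 809 is used as a lower bound on the summarization input, since the exact figure for that request was not printed.

```python
import math


def break_even_queries(summarize_input_tokens, full_context_tokens,
                       summary_context_tokens):
    """Downstream queries needed before summarizing saves input tokens overall."""
    saved_per_query = full_context_tokens - summary_context_tokens
    if saved_per_query <= 0:
        return None  # the summary is not smaller, so it never pays off
    return math.ceil(summarize_input_tokens / saved_per_query)


# 809 document tokens approximate the summarization input (lower bound);
# 825 and 76 are the downstream input token counts reported above.
n = break_even_queries(809, 825, 76)
print(f"Summarizing pays off from downstream query number {n}")
```

For a single question, sending the full document is cheaper; the summary starts to win once it is reused.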


## 6. Model comparison: Flash vs Pro

Different models offer tradeoffs between speed, cost, and capability:
- **Flash/Flash-Lite**: Faster and cheaper, good for simpler tasks
- **Pro**: Higher capability, better for complex reasoning

Compare the same task across models:

In [21]:
models_to_compare = [
    "gemini-2.5-flash-lite",
    "gemini-2.5-flash",
    "gemini-2.5-pro",
]

results = []

for model in models_to_compare:
    try:
        response, elapsed = timed_generate(model, SAMPLE_PROMPT)
        results.append({
            "model": model,
            "input_tokens": response.usage_metadata.prompt_token_count,
            "output_tokens": response.usage_metadata.candidates_token_count,
            "time": elapsed
        })
        print(f"{model}: {elapsed:.2f}s")
    except Exception as e:
        print(f"{model}: Error - {e}")

gemini-2.5-flash-lite: 1.91s
gemini-2.5-flash: 3.98s
gemini-2.5-pro: 16.49s


In [22]:
print("\n=== Model Comparison ===")
print(f"{'Model':<25} {'Input':<10} {'Output':<10} {'Time':<10}")
print("-" * 55)
for r in results:
    print(f"{r['model']:<25} {r['input_tokens']:<10} {r['output_tokens']:<10} {r['time']:.2f}s")


=== Model Comparison ===
Model                     Input      Output     Time      
-------------------------------------------------------
gemini-2.5-flash-lite     44         179        1.91s
gemini-2.5-flash          44         173        3.98s
gemini-2.5-pro            44         160        16.49s


**Guidance:**
- Use **Flash-Lite** for simple tasks requiring speed
- Use **Flash** for balanced performance
- Use **Pro** when quality/reasoning is critical
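
This guidance can be encoded as a small routing table so that each request uses the cheapest model that is adequate for it. The sketch below is one possible shape, not a prescribed pattern: the task categories and the `choose_model` helper are hypothetical, while the model IDs are the ones compared above.

```python
# Model IDs from the comparison above; the task categories are hypothetical.
MODEL_BY_TASK = {
    "simple": "gemini-2.5-flash-lite",   # e.g. classification, short extraction
    "balanced": "gemini-2.5-flash",      # e.g. general-purpose generation
    "complex": "gemini-2.5-pro",         # e.g. multi-step reasoning
}


def choose_model(task_kind, default="gemini-2.5-flash"):
    """Map a coarse task category to a model ID, falling back to a default."""
    return MODEL_BY_TASK.get(task_kind, default)


print(choose_model("simple"))
print(choose_model("complex"))
print(choose_model("unknown"))
```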

## 7. Batch API for offline workloads

The [Batch API](../quickstarts/Batch_mode.ipynb) is ideal for non-latency-critical tasks:
- **50% cost discount** compared to the standard API
- Process large volumes asynchronously (24-hour SLO)
- Great for pre-processing datasets, evaluations, bulk generation

**When to use Batch API:**
- You have many requests that don't need immediate responses
- Cost savings are more important than latency
- Processing datasets or running evaluations

In [None]:
batch_prompts = [
    "Explain photosynthesis in one sentence.",
    "What is the capital of France?",
    "Describe gravity to a child.",
    "What causes rainbows?",
    "Explain why the sky is blue."
]

# Format as inline requests (list of request dicts)
batch_requests = [
    {"contents": [{"parts": [{"text": prompt}]}]}
    for prompt in batch_prompts
]

print(f"Prepared {len(batch_requests)} batch requests")

Prepared 5 batch requests


In [26]:
# Create the batch job with inline requests
batch_job = client.batches.create(
    model=MODEL_ID,
    src=batch_requests,
    config={"display_name": "cost-latency-example-batch"}
)

print(f"Batch job created: {batch_job.name}")
print(f"State: {batch_job.state.name}")

Batch job created: batches/1g59mu1a101y5fpzb0xjxkvglhfmeudt7ky1
State: JOB_STATE_PENDING


In [27]:
# @title Poll for batch completion (may take a few minutes)

while batch_job.state.name in ["JOB_STATE_PENDING", "JOB_STATE_RUNNING"]:
    print(f"Status: {batch_job.state.name}... waiting 30s")
    time.sleep(30)
    batch_job = client.batches.get(name=batch_job.name)

print(f"\nFinal state: {batch_job.state.name}")

Status: JOB_STATE_PENDING... waiting 30s
Status: JOB_STATE_PENDING... waiting 30s
Status: JOB_STATE_PENDING... waiting 30s
Status: JOB_STATE_PENDING... waiting 30s
Status: JOB_STATE_RUNNING... waiting 30s
Status: JOB_STATE_RUNNING... waiting 30s

Final state: JOB_STATE_SUCCEEDED
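
The loop above waits indefinitely if a job never leaves the pending or running state. A variant with a timeout might look like the sketch below. `poll_until_done` and its parameters are hypothetical, and the state sequence is simulated so the code runs without an API key.

```python
import time

# States treated as "still in progress", as in the loop above.
PENDING_STATES = {"JOB_STATE_PENDING", "JOB_STATE_RUNNING"}


def poll_until_done(get_state, interval_s=30.0, timeout_s=3600.0, sleep=time.sleep):
    """Call get_state() until it leaves PENDING_STATES or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    state = get_state()
    while state in PENDING_STATES:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {state} after {timeout_s}s")
        sleep(interval_s)
        state = get_state()
    return state


# Simulated job: pending twice, running once, then finished.
states = iter(["JOB_STATE_PENDING", "JOB_STATE_PENDING",
               "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final_state = poll_until_done(lambda: next(states), interval_s=0.01)
print(final_state)
```

With the real client, `get_state` could be `lambda: client.batches.get(name=batch_job.name).state.name`.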


In [28]:
# Retrieve results (inline responses for inline batch jobs)
if batch_job.state.name == "JOB_STATE_SUCCEEDED":
    print("=== Batch Results ===")
    for i, inline_response in enumerate(batch_job.dest.inlined_responses):
        print(f"\n[{i+1}] {batch_prompts[i]}")
        if inline_response.response:
            text = inline_response.response.candidates[0].content.parts[0].text
            print(f"    → {text[:100]}...")
        else:
            print(f"    → Error: {inline_response.error}")
else:
    print(f"Batch job did not succeed. State: {batch_job.state.name}")

=== Batch Results ===

[1] Explain photosynthesis in one sentence.
    → Photosynthesis is the process by which plants convert light energy from the sun into chemical energy...

[2] What is the capital of France?
    → Paris....

[3] Describe gravity to a child.
    → Okay, imagine you have a special invisible super-helper, like a giant magnet for everything!

When y...

[4] What causes rainbows?
    → Rainbows are a beautiful optical phenomenon caused by the interaction of **sunlight** with **water d...

[5] Explain why the sky is blue.
    → The sky is blue because of a phenomenon called **Rayleigh scattering**, which describes how light in...


## Summary: when to use what

| Technique | Best for | Cost impact | Latency impact |
|-----------|----------|-------------|----------------|
| **Token counting** | Budget planning, staying within limits | None directly; helps avoid unexpected usage | None |
| **Streaming** | Chat UIs, long responses | None | Faster perceived latency |
| **Prompt trimming** | All requests | Lower input cost | Faster |
| **Summarization** | Long document workflows | Lower downstream cost | Faster downstream |
| **Flash-Lite model** | Simple, speed-critical tasks | Lowest per-token price of the three (see Pricing) | Fastest |
| **Flash model** | Balanced workloads | Lower per-token price than Pro (see Pricing) | Fast |
| **Batch API** | Offline, bulk processing | 50% discount | Async (up to 24h) |

**General recommendations:**
1. Always count tokens to understand your usage
2. Use streaming for user-facing applications
3. Trim prompts and summarize long contexts
4. Choose the right model for your task complexity
5. Use Batch API for anything that doesn't need real-time responses

## Next steps

### Useful API references
- [Pricing](https://ai.google.dev/pricing) - Understand token costs per model
- [Rate limits and quotas](https://ai.google.dev/gemini-api/docs/rate-limits)
- [Batch API documentation](https://ai.google.dev/gemini-api/docs/batch-mode)

### Related examples
- [Counting Tokens](../quickstarts/Counting_Tokens.ipynb) - Deep dive on token counting
- [Streaming](../quickstarts/Streaming.ipynb) - More streaming patterns
- [Batch Mode](../quickstarts/Batch_mode.ipynb) - Advanced batch workflows

### Continue your discovery of the Gemini API
- [Get started](../quickstarts/Get_started.ipynb) - Introduction to the Gemini API
- [Caching](../quickstarts/Caching.ipynb) - Reduce costs with context caching