##### Copyright 2025 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemini API: Batching and Chunking

<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/batching_and_chunking.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

<!-- Community Contributor Badge -->
<table>
  <tr>
    <!-- Author Avatar Cell -->
    <td bgcolor="#d7e6ff">
      <a href="https://github.com/phil-daniel" target="_blank" title="View Phillip's profile on GitHub">
        <img src="https://github.com/phil-daniel.png?size=100"
             alt="phil-daniel's GitHub avatar"
             width="100"
             height="100">
      </a>
    </td>
    <!-- Text Content Cell -->
    <td bgcolor="#d7e6ff">
      <h2><font color='black'>This notebook was contributed by <a href="https://github.com/phil-daniel" target="_blank"><font color='#217bfe'><strong>Phillip</strong></font></a> as part of Google Summer of Code 2025.</font></h2>
      <h5><font color='black'>Find more information about the Gemini-Batcher project and more chunking and batching examples <a href="URL"><font color="#078efb">here</font></a>.</font></h5><br>
    </td>
  </tr>
</table>

This notebook introduces two useful techniques for working with large inputs:
- Batching - Combining multiple inputs into a single request.
- Chunking - Splitting long inputs into multiple smaller pieces.

Together, these methods can help models process large inputs more efficiently whilst staying within their token limits.

This guide provides simple examples of each technique and discusses more complex strategies that can be implemented.

## Setup

### Install SDK

In [1]:
%pip install -U -q "google-genai>=1.0.0"  # Install the Python SDK

### Set up your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see the [Authentication](../quickstarts/Authentication.ipynb) quickstart for an example.

In [2]:
from google.colab import userdata
from google import genai

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)

Now select the model you want to use in this guide, either by choosing one from the list or typing its name. Keep in mind that some models, such as the 2.5 ones, are thinking models and therefore take slightly longer to respond (see the [thinking notebook](./Get_started_thinking.ipynb) for more details, in particular how to switch thinking off).

In [3]:
MODEL_ID = "gemini-2.5-flash" # @param ["gemini-2.5-flash-lite-preview-06-17","gemini-2.0-flash","gemini-2.5-flash","gemini-2.5-pro"] {"allow-input":true, isTemplate: true}

### Loading example material & additional libraries

Each example uses the same sample material - a video lecture transcript and a set of questions about the lecture - with the goal being for the model to answer each question using only information from the transcript.

In [4]:
from google.genai import types

import requests
import json
import math

questions = requests.get("https://raw.githubusercontent.com/phil-daniel/gemini-batcher/refs/heads/main/examples/demo_files/questions.txt").text.split('\n')
content = requests.get("https://raw.githubusercontent.com/phil-daniel/gemini-batcher/refs/heads/main/examples/demo_files/content.txt").text

print(f'Example question: {questions[0]}')

Example question: What is the goal of MIT 6.00 (Introduction to Computer Science and Programming)?


## Batching

Batching describes the process of combining multiple individual API calls into a single API call. Imagine you need to buy three things from a shop. Rather than going to the shop three separate times and buying one item each time, it is more efficient to go once and get everything you need. The technique can provide multiple benefits, including:
- Reduced latency - Only a single HTTP call is made instead of several, which reduces the total latency. In addition, many LLM APIs enforce rate limits on the number of requests, so sending fewer requests makes it less likely that these limits are reached.
- Improved cost efficiency - In some situations, combining your inputs into a single API call can reduce the number of tokens required. For example, given a paragraph costing 400 tokens to process, and 5 questions each costing 10 tokens, asking the questions one at a time would take ≈ (400 + 10) * 5 = 2050 tokens, whereas batching the questions would only take ≈ 400 + (10 * 5) = 450 tokens, giving a significant improvement.
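
The arithmetic in the cost example can be reproduced with a short sketch. The function names are hypothetical, and the 400-token and 10-token figures are the illustrative values from the text rather than measured counts; real usage also includes the system prompt and formatting overhead.

```python
def tokens_without_batching(context_tokens: int, question_tokens: int, n_questions: int) -> int:
    """Each request resends the full context alongside a single question."""
    return (context_tokens + question_tokens) * n_questions


def tokens_with_batching(context_tokens: int, question_tokens: int, n_questions: int) -> int:
    """The context is sent once, followed by every question."""
    return context_tokens + question_tokens * n_questions


# Illustrative values from the example above: a 400-token paragraph, 5 questions of 10 tokens.
unbatched = tokens_without_batching(400, 10, 5)
batched = tokens_with_batching(400, 10, 5)
print(f"Unbatched: {unbatched} tokens, batched: {batched} tokens")
```

The gap between the two totals widens as the shared context grows, because the unbatched total multiplies the context size by the number of questions.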

### Batching example 1 - no batching (baseline)

In this example, the baseline number of tokens required to answer the first five questions is calculated. Each question is sent to the model sequentially, along with the entire transcript.

The response is returned in JSON format for easier comparison to the batched example.

In [5]:
system_prompt = "Answer the questions using *only* the content provided, with each answer being a different string in the JSON response."

total_input_tokens_no_batching = 0
total_output_tokens_no_batching = 0

for question in questions[:5]:
    response = client.models.generate_content(
        model=MODEL_ID,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=list[str],
            system_instruction=system_prompt,
        ),
        contents=[f'Content:\n{content}', f'\nQuestion:\n{question}']
    )
    total_input_tokens_no_batching += response.usage_metadata.prompt_token_count
    total_output_tokens_no_batching += response.usage_metadata.candidates_token_count

print(f'Sample Question: {questions[4]}\nResponse: {json.loads(response.text)[0]}')

print(f'Total input tokens used with no batching: {total_input_tokens_no_batching}')
print(f'Total output tokens used with no batching: {total_output_tokens_no_batching}')

Sample Question: How is the course structured (lectures, recitations, workload)?
Response: The course is structured with two hours of lectures per week, held on Tuesdays and Thursdays at 11:00, and one hour of recitation per week on Fridays. Students are expected to dedicate nine hours a week to outside-the-class work, primarily focused on problem sets involving Python programming. Recitations cover material not found in lectures or readings, and attendance is expected.
Total input tokens used with no batching: 66239
Total output tokens used with no batching: 543


### Batching example 2 - with batching
In this example, the model is asked the same five questions, but rather than being asked individually, they are answered all at once. This results in a significant reduction in the number of input tokens used as the model is only provided with the large content once rather than five times.

The response is returned in JSON format to allow for easier separation of each question's answer.

In [6]:
system_prompt = "Answer the questions using *only* the content provided, with each answer being a different string in the JSON response."

batched_questions = ("\n").join(questions[:5])

batched_response = client.models.generate_content(
    model=MODEL_ID,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[str],
        system_instruction=system_prompt,
        thinking_config=types.ThinkingConfig(thinking_budget=0,)
    ),
    contents=[f'Content:\n{content}', f'\nQuestions:\n{batched_questions}']
)

answers = batched_response.text
batched_answers = json.loads(answers.strip())

print(f'Sample Question: {questions[4]}\nResponse: {batched_answers[-1]}')

total_input_tokens_with_batching = batched_response.usage_metadata.prompt_token_count
total_output_tokens_with_batching = batched_response.usage_metadata.candidates_token_count

print(f'Total input tokens used with batching: {total_input_tokens_with_batching}')
print(f'Total output tokens used with batching: {total_output_tokens_with_batching}')

Sample Question: How is the course structured (lectures, recitations, workload)?
Response: The course is structured with two hours of lecture per week (Tuesdays and Thursdays at 11:00), one hour of recitation per week (on Fridays), and nine hours per week of outside-the-class work, primarily focused on problem sets involving programming in Python.
Total input tokens used with batching: 13299
Total output tokens used with batching: 404


### Batching results
From running the two examples above, it's clear that batching can significantly reduce the number of input tokens needed to handle the same number of queries, making requests more efficient and cost-effective.

The larger the content chunk, the more effective batching becomes, since sending multiple queries individually requires repeating the same large content block each time. With batching, that shared context is only included once, so the relative savings grow with the chunk size.

One side effect of batching is that, when using the exact same prompt, the responses tend to be shorter, as reflected in the lower number of output tokens. However, it is possible to adjust this behaviour by altering the system prompt.
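
To put a number on the savings, the following sketch computes the relative reduction from the token counts printed by the two examples above. The helper name `percent_reduction` is hypothetical, and the counts are copied from this particular run; they will likely differ slightly between runs and models.

```python
def percent_reduction(before: int, after: int) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100


# Token counts copied from the printed outputs of the two batching examples.
input_saving = percent_reduction(66239, 13299)
output_saving = percent_reduction(543, 404)

print(f"Input tokens reduced by {input_saving:.1f}%")
print(f"Output tokens reduced by {output_saving:.1f}%")
```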

## Chunking

Chunking is the opposite of batching: it describes the process of breaking a large input down into multiple smaller pieces, referred to as chunks. As a real-life example, imagine you are eating a steak. It is too large to eat in a single mouthful, so you cut it into pieces and eat one piece at a time.

In the context of LLMs, models have token limits, which restrict the amount of data that can be ingested in a single API call, so developers must be aware of how much content is being sent to the model.

There are also other benefits to using chunking, including:
- Improved performance - If an error occurs during an API call, only the affected chunk needs to be reprocessed rather than the entire input, which is significantly quicker.
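
A minimal sketch of this per-chunk recovery is shown below. The names `process_chunks`, `process_chunk` and `flaky_upper` are hypothetical: `process_chunk` stands in for whatever function sends one chunk to the model, and `flaky_upper` simulates a call that fails once so the retry path can be observed without network access.

```python
from typing import Callable, Optional


def process_chunks(
    chunks: list[str],
    process_chunk: Callable[[str], str],
    max_attempts: int = 3,
) -> list[Optional[str]]:
    """Process each chunk independently, retrying only the chunks that fail."""
    results: list[Optional[str]] = []
    for chunk in chunks:
        result = None
        for _ in range(max_attempts):
            try:
                result = process_chunk(chunk)
                break  # This chunk succeeded, so move on to the next one.
            except Exception:
                # Only this chunk is retried; results from earlier chunks are kept.
                continue
        results.append(result)  # None marks a chunk that never succeeded.
    return results


calls = {"count": 0}


def flaky_upper(chunk: str) -> str:
    """Stand-in for an API call that fails on its very first invocation."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated transient error")
    return chunk.upper()


results = process_chunks(["first", "second"], flaky_upper)
print(results, f"after {calls['count']} calls")
```

In a real pipeline it would usually be better to catch only the specific errors that are worth retrying and to wait between attempts.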

### Chunking techniques

Since Gemini models are natively multimodal, different media types require their own chunking strategies. The following examples demonstrate only simple text chunking methods; techniques for other media types are discussed later.

It is also worth noting that Gemini models come with large context windows (1,048,576 input tokens for 2.5 Pro and Flash), so chunking may not be needed in some use cases.

#### Fixed chunking

In fixed chunking, the content is split into non-overlapping chunks each containing a set number of characters, in this case 10,000. An example of this can be seen below.

<img src="https://raw.githubusercontent.com/phil-daniel/gemini-batcher/refs/heads/main/docs/concepts/images/fixed_chunking.svg" alt="A visual example of fixed chunking." width="600">

In [7]:
chunk_char_size = 10000
chunked_content = []
chunk_count = math.ceil(len(content) / chunk_char_size)

for i in range(chunk_count):
    chunk_start_pos = i * chunk_char_size
    chunk_end_pos = min(chunk_start_pos + chunk_char_size, len(content))
    chunked_content.append(content[chunk_start_pos : chunk_end_pos])

print(f'Number of chunks: {len(chunked_content)}')

Number of chunks: 6


#### Sliding window chunking

One disadvantage of fixed chunking is that since it breaks context at arbitrary positions, important information may get split between chunks, meaning that neither chunk contains enough information to fully answer a question.

A simple solution to this is to follow a sliding window approach, where an overlap (called the window) between adjacent chunks is introduced. This increases the likelihood that a complete answer can be found within a single chunk, but it can also increase the total number of chunks.

An example of this can be seen below.

<img src="https://raw.githubusercontent.com/phil-daniel/gemini-batcher/refs/heads/main/docs/concepts/images/sliding_window.svg" alt="A visual example of sliding window chunking." width="600">

In [8]:
chunk_char_size = 10000
window_char_size = 2500

chunked_content = []
chunk_count = math.ceil(len(content) / (chunk_char_size - window_char_size))

for i in range(chunk_count):
    chunk_start_pos = i * (chunk_char_size - window_char_size)
    chunk_end_pos = min(chunk_start_pos + chunk_char_size, len(content))
    chunked_content.append(content[chunk_start_pos : chunk_end_pos])

print(f'Number of chunks: {len(chunked_content)}')

Number of chunks: 8
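
Both cells can also be expressed as one reusable function, sketched below. The name `sliding_window_chunks` and its parameters are hypothetical and not part of the notebook. Unlike the cell above, this variant stops as soon as a chunk reaches the end of the text, so it never emits a final chunk that is entirely contained in the previous one and may therefore return fewer chunks.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts `overlap` characters before the previous chunk ended.
    An overlap of 0 gives plain fixed chunking.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This chunk already reaches the end of the text.
        start += step
    return chunks


sample = "abcdefghij"  # 10 characters, small enough to inspect by eye
print(sliding_window_chunks(sample, chunk_size=4, overlap=2))
print(sliding_window_chunks(sample, chunk_size=4, overlap=0))
```

With the notebook's settings this would be called as `sliding_window_chunks(content, 10000, 2500)`.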


After generating the content chunks, each one can then be sent to the Gemini API individually, ensuring that the input stays within the token limit.

### Other chunking techniques

Below are several other chunking methods for different content types. It is not an exhaustive list and depending on your use case, a combination of these methods or an entirely different technique may provide better results.

- Text
    - Semantic Chunking: This involves breaking the content into chunks based on semantic meaning. Sentences are grouped together if they discuss similar topics, making it more likely that a question can be answered entirely by a single chunk. One implementation would calculate an embedding for each sentence using `SentenceTransformer` and then compute the cosine similarity between sentence embeddings. This could also be extended to matching batches of questions to chunks by comparing their cosine similarity.
- Audio
    - Fixed/Sliding Window Chunking by duration: Similar techniques can be used for audio. Rather than chunking by the number of characters, the input can be split based on time duration.
    - Text Methods via Transcripts: Many models, such as Gemini, can be used to create a transcript of an audio file. This allows the text-based methods (fixed/sliding window/semantic) to be used, as these models can also provide timestamps for when each sentence occurred.
    - Speaker Diarization: Analysis can also be completed on the audio itself to detect when the speaker changes or there is a natural break in speech, which can also often act as good chunking positions. One common library for this use is `pyannote.audio`.
- Video
    - Audio Methods: Each of the methods mentioned when discussing the audio techniques can also be used for video content by isolating the audio.
    - Visual Content: Finally, you could analyse the pictures shown in the video to detect a change in the scene, for example a camera cut, which could provide a good chunking position. A useful library for this is `PySceneDetect` which detects when visual scene changes occur.
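
As an illustration of the semantic chunking idea, the sketch below groups consecutive sentences whose embeddings are similar. It makes several assumptions: the two-dimensional vectors are hand-written toy values rather than real embeddings (in practice they would come from an embedding model such as `SentenceTransformer`), the 0.5 threshold is arbitrary, and only adjacent sentences are compared. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_chunks(
    sentences: list[str],
    embeddings: list[list[float]],
    threshold: float = 0.5,
) -> list[str]:
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) >= threshold:
            groups[-1].append(sentences[i])  # Same topic: extend the current chunk.
        else:
            groups.append([sentences[i]])  # Topic shift: begin a new chunk.
    return [" ".join(group) for group in groups]


sentences = [
    "Python is a programming language.",
    "It is used to write programs.",
    "Recitations are held on Fridays.",
]
# Toy embeddings (assumption): the first two point in a similar direction, the third does not.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]

chunks = semantic_chunks(sentences, toy_embeddings)
for chunk in chunks:
    print(chunk)
```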


## Combining batching and chunking

In general, larger batches increase the number of output tokens required, while larger chunks increase the number of input tokens required. This means that by combining batching and chunking, you can maximise the model's token usage, reducing the number of API calls required to complete a task.

A simple way to do this is to use a binary search-style algorithm: repeatedly call the model, and if the input token limit is exceeded, split the content into two smaller chunks; if the output token limit is exceeded, halve the batch size. This is repeated until both the input and the model's response fit within the token limits.

This can be implemented using the following features of the Gemini SDK:
- The input token limit of the model being used. This can be accessed using `client.models.get(model=model_name).input_token_limit`.
- The `count_tokens()` function (described [here](https://ai.google.dev/api/tokens#example-request)) to check the input token size, which can then be compared to the limit.
- The `FinishReason` of an API call (i.e. `response.candidates[0].finish_reason`). If this is equal to `types.FinishReason.MAX_TOKENS` then the response ended because the output token limit was exceeded. More information about this can be found [here](https://ai.google.dev/api/generate-content#FinishReason).
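
The halving strategy described above could be sketched as follows. This is one possible reading of the algorithm, not the Gemini-Batcher implementation. To keep the example runnable without network access, the token counter and the model call are passed in as functions: `fake_count_tokens` and `fake_call_model` are hypothetical stand-ins, and in real use they would wrap `count_tokens()` and `generate_content()`, with the `truncated` flag derived from the `MAX_TOKENS` finish reason.

```python
from typing import Callable

Result = tuple[str, list[str], list[str]]  # (chunk, questions, answers)


def process(
    content: str,
    questions: list[str],
    count_tokens: Callable[[str, list[str]], int],
    call_model: Callable[[str, list[str]], tuple[list[str], bool]],
    input_token_limit: int,
) -> list[Result]:
    """Halve the content or the question batch until every call fits the limits."""
    # Input too large: split the content in two and handle each half separately.
    if len(content) > 1 and count_tokens(content, questions) > input_token_limit:
        mid = len(content) // 2
        return (
            process(content[:mid], questions, count_tokens, call_model, input_token_limit)
            + process(content[mid:], questions, count_tokens, call_model, input_token_limit)
        )

    answers, truncated = call_model(content, questions)

    # Output was cut off: halve the batch and ask each half again.
    if truncated and len(questions) > 1:
        mid = len(questions) // 2
        return (
            process(content, questions[:mid], count_tokens, call_model, input_token_limit)
            + process(content, questions[mid:], count_tokens, call_model, input_token_limit)
        )
    return [(content, questions, answers)]


def fake_count_tokens(content: str, questions: list[str]) -> int:
    """Stand-in (assumption): treats one character as one token."""
    return len(content) + sum(len(q) for q in questions)


def fake_call_model(content: str, questions: list[str]) -> tuple[list[str], bool]:
    """Stand-in (assumption): the output is 'truncated' for more than two questions."""
    return [f"answer to {q}" for q in questions], len(questions) > 2


results = process(
    "x" * 100, ["q1", "q2", "q3", "q4"], fake_count_tokens, fake_call_model, input_token_limit=60
)
print(f"{len(results)} calls completed within both limits")
```

Splitting the content at its midpoint is equivalent to fixed chunking without overlap, so an overlapping split may be preferable when answers are likely to straddle the boundary.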

## Next Steps

- Check out the [Gemini Documentation](https://ai.google.dev/gemini-api/docs) for more information.
- Check out the [Gemini-Batcher](https://github.com/phil-daniel/gemini-batcher) Google Summer of Code project for more chunking and batching examples.
