# Gemini API: Batching

Batching is the process of combining multiple individual API calls into a single API call. Imagine you need to buy three things from a shop. Rather than making three separate trips and buying one item each time, it is more efficient to go once and get everything you need. The technique can provide multiple benefits, including:
* Reduced latency - Only a single HTTP call is made rather than several, which removes the repeated round trips. In addition, many LLM APIs enforce rate limits on the number of requests, and batching helps you stay within them.
* Improved cost efficiency - In some situations, combining your inputs into a single API call can reduce the number of tokens required. For example, given a paragraph costing 400 tokens to process, and 5 questions each costing 10 tokens, asking the questions one at a time would take ≈ (400 + 10) * 5 = 2050 tokens, whereas batching the questions would only take ≈ 400 + (10 * 5) = 450 tokens, giving a significant improvement. An example of this is shown below.
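
The arithmetic in the cost example above can be reproduced with a small helper. This is only a rough sketch: the function name `estimate_input_tokens` is hypothetical, and it ignores the system prompt and any formatting overhead that a real request would add.

```python
def estimate_input_tokens(content_tokens: int, question_tokens: list[int], batched: bool) -> int:
    """Rough input-token estimate that ignores system prompt and formatting overhead."""
    if batched:
        # The shared content is sent once, followed by every question.
        return content_tokens + sum(question_tokens)
    # The shared content is resent with each individual question.
    return sum(content_tokens + q for q in question_tokens)


question_costs = [10] * 5  # five questions at roughly 10 tokens each
unbatched = estimate_input_tokens(400, question_costs, batched=False)
batched = estimate_input_tokens(400, question_costs, batched=True)
print(unbatched, batched)  # 2050 450
```

The saving grows with the size of the shared content and with the number of questions that reuse it.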

## Setup

In [1]:
from google import genai
from google.genai import types
import json
import os
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

content = ""
with open("demo_files/content.txt", 'r', encoding='utf-8') as file:
    content = file.read()
    
questions = []
with open("demo_files/questions.txt", 'r', encoding='utf-8') as file:
    for question in file:
        questions.append(question.strip())
# Only getting the first 5 questions for this example.
questions = questions[:5]

client = genai.Client(api_key=GEMINI_API_KEY)

### No Batching

In this example, we provide the model with a large block of content and ask it five questions based on that content, one at a time.

In [None]:
total_input_tokens_no_batching = 0
total_output_tokens_no_batching = 0

system_prompt = """
    Answer the question using the content provided, with each answer being a different string in the JSON response.
    * **Accuracy and Precision:** Provide direct, factual answers.
    * **Source Constraint:** Use *only* information explicitly present in the content. Do not infer, speculate, or bring in outside knowledge.
    * **Completeness:** Ensure each answer fully addresses the question, *to the extent possible with the given transcript*.
"""

for question in questions:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=list[str],
            system_instruction=system_prompt,
            thinking_config=types.ThinkingConfig(thinking_budget=0)
        ),
        contents=[f'Content:\n{content}', f'\nQuestion:\n{question}']
    )
    total_input_tokens_no_batching += response.usage_metadata.prompt_token_count
    total_output_tokens_no_batching += response.usage_metadata.candidates_token_count

print(f'Total input tokens used with no batching: {total_input_tokens_no_batching}')
print(f'Total output tokens used with no batching: {total_output_tokens_no_batching}')

Total input tokens used with no batching: 66584
Total output tokens used with no batching: 711


### Batching

In this batched example, we ask the model the same questions, but ask them all at once. The system prompt is slightly modified to refer to multiple questions, and the API is again asked to respond in JSON format. This makes it significantly easier to parse the response and split it into the individual answers.

The significant reduction in the number of input tokens used in this example is because the large content only needs to be provided to the API once.

In [None]:
system_prompt = """
    Answer the questions using the content provided, with each answer being a different string in the JSON response.
    * **Accuracy and Precision:** Provide direct, factual answers.
    * **Source Constraint:** Use *only* information explicitly present in the content. Do not infer, speculate, or bring in outside knowledge.
    * **Completeness:** Ensure each answer fully addresses the question, *to the extent possible with the given transcript*.
"""

batched_questions = ("\n").join(questions)

batched_response = client.models.generate_content(
    model="gemini-2.5-flash",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[str],
        system_instruction=system_prompt,
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
    contents=[f'Content:\n{content}', f'\nQuestions:\n{batched_questions}']
)

answers = batched_response.text
batched_answers = json.loads(answers.strip())

total_input_tokens_with_batching = batched_response.usage_metadata.prompt_token_count
total_output_tokens_with_batching = batched_response.usage_metadata.candidates_token_count

print(f'Total input tokens used with batching: {total_input_tokens_with_batching}')
print(f'Total output tokens used with batching: {total_output_tokens_with_batching}')

Total input tokens used with batching: 13368
Total output tokens used with batching: 375
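
Once the batched JSON response has been parsed into a list, each answer still has to be matched back to its question. The sketch below assumes the model returns the answers in the same order as the questions, which is not guaranteed. The helper name `pair_answers` and the sample data are hypothetical and stand in for `questions` and `batched_answers`.

```python
def pair_answers(questions: list[str], answers: list[str]) -> list[tuple[str, str]]:
    """Pair each question with the answer at the same position."""
    # A length mismatch usually means the model merged, skipped, or truncated answers.
    if len(questions) != len(answers):
        raise ValueError(
            f"Expected {len(questions)} answers but received {len(answers)}"
        )
    return list(zip(questions, answers))


# Placeholder data standing in for the real questions and parsed answers.
sample_questions = ["Question one?", "Question two?"]
sample_answers = ["Answer one.", "Answer two."]

for question, answer in pair_answers(sample_questions, sample_answers):
    print(f"Q: {question}\nA: {answer}\n")
```

Failing loudly on a count mismatch is a deliberate choice here, since silently zipping lists of different lengths would drop questions without any warning.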


When batching, it is important to take into account the context limit of the model used. This is because:
* Batching too many questions together may increase the number of input tokens over the model's limit, causing errors. One solution to this is to also break down the content into multiple chunks, which is called chunking.
* Batching too many questions may increase the number of output tokens over the model's limit, meaning that not all of the questions are answered.
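
One way to respect these limits is to split the questions into several smaller batches, each of which fits a token budget alongside the content. The following is a minimal sketch under stated assumptions: `estimate_tokens` is a hypothetical and crude heuristic of roughly four characters per token, and `split_into_batches` and its parameters are illustrative names, not part of the SDK. A real tokenizer-based count could be substituted for the heuristic.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def split_into_batches(
    questions: list[str], content_tokens: int, max_input_tokens: int
) -> list[list[str]]:
    """Greedily group questions so content plus questions fits the input budget."""
    budget = max_input_tokens - content_tokens
    if budget <= 0:
        raise ValueError("The content alone exceeds the input token budget")

    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for question in questions:
        cost = estimate_tokens(question)
        if cost > budget:
            raise ValueError("A single question exceeds the remaining budget")
        # Start a new batch when adding this question would overflow the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(question)
        used += cost
    if current:
        batches.append(current)
    return batches


# Five 40-character questions (about 10 estimated tokens each) with room for 25 tokens.
demo_batches = split_into_batches(["q" * 40] * 5, content_tokens=400, max_input_tokens=425)
print([len(batch) for batch in demo_batches])  # [2, 2, 1]
```

Each resulting batch would then be sent as its own request, so the content is resent once per batch rather than once per question. The same idea could be applied to the output limit by capping the number of questions per batch.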


In [2]:
uploaded_file = client.files.upload(file="./demo_files/audio.mp3")

In [None]:
from pydantic import BaseModel

class TranscriptedSentence(BaseModel):
    start_time: float
    end_time: float
    sentence: str

response = client.models.generate_content(
    model="gemini-2.5-flash",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[TranscriptedSentence],
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
    contents=[
        "Transcribe the attached file and output the result as JSON. "
        "Each entry must be a single sentence with the following fields: "
        "start_time (in seconds as float), end_time (in seconds as float) "
        "and the sentence itself.",
        uploaded_file
    ]
)

print(response.text)

[
  {
    "start_time": 4.549,
    "end_time": 8.769,
    "sentence": "I think I've always liked making things from a very young age."
  },
  {
    "start_time": 8.769,
    "end_time": 13.569,
    "sentence": "Um, I remember when I was very young, I used to tell people that I wanted to be an inventor."
  },
  {
    "start_time": 14.281,
    "end_time": 21.281,
    "sentence": "I decided I wanted to study computer science at university, um, part way through my computer science AS level."
  },
  {
    "start_time": 21.579,
    "end_time": 27.279,
    "sentence": "But it was probably my uncle, he used to make websites that got me interested in computing when I was maybe 10 or 11 years old."
  },
  {
    "start_time": 27.533,
    "end_time": 44.883,
    "sentence": "I'm glad that I didn't choose computer science when I went to university because, you know, it's been transformed um in the couple of decades since then into a subject that's really worthwhile studying that has a much deeper um

In [1]:
import tempfile
import os

with tempfile.TemporaryDirectory() as temp_dir:
    print("Temporary directory:", temp_dir)

Temporary directory: /var/folders/tm/jfd6krc55c72j0gl30902gj80000gn/T/tmpk1do4jnj
