# Gemini API: Batching

Batching is the process of combining multiple individual API calls into a single API call. Imagine you need to buy three things from a shop: rather than making three separate trips and buying one item each time, it is more efficient to go once and get everything you need. The technique can provide multiple benefits, including:
* Reduced latency - Rather than making repeated HTTP calls, only a single call is made, which removes the per-request overhead. In addition, many LLM APIs limit how many requests can be made in a given period, and batching uses fewer requests for the same work, so those limits are less likely to be hit.
* Improved cost efficiency - In some situations, combining your inputs into a single API call can reduce the number of tokens required. For example, given a paragraph costing 400 tokens to process and 5 questions each costing 10 tokens, asking the questions one at a time would take ≈ (400 + 10) * 5 = 2050 tokens, whereas batching the questions would only take ≈ 400 + (10 * 5) = 450 tokens, a significant improvement. An example of this is shown below.
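
The arithmetic in the cost bullet can be checked with a few lines of Python. This is only a sketch of the estimate: the function names are hypothetical, and it assumes every question costs the same number of tokens and ignores any system prompt.

```python
def unbatched_tokens(content_tokens: int, question_tokens: int, num_questions: int) -> int:
    """Each call resends the content alongside a single question."""
    return (content_tokens + question_tokens) * num_questions


def batched_tokens(content_tokens: int, question_tokens: int, num_questions: int) -> int:
    """One call sends the content once, followed by every question."""
    return content_tokens + question_tokens * num_questions


# The figures from the example above: a 400-token paragraph and five 10-token questions.
separate = unbatched_tokens(400, 10, 5)
combined = batched_tokens(400, 10, 5)
print(separate, combined)  # 2050 450
```

The saving grows with the size of the shared content and with the number of questions, since the content is the part that is no longer repeated.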

## Setup

In [15]:
from google import genai
from google.genai import types
import json
import os
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

content = ""
with open("content.txt", 'r', encoding='utf-8') as file:
    content = file.read()
    
questions = []
with open("questions.txt", 'r', encoding='utf-8') as file:
    for question in file:
        questions.append(question.strip())
# Only getting the first 5 questions for this example.
questions = questions[:5]

client = genai.Client(api_key=GEMINI_API_KEY)

### No Batching

In this example, we provide the model with a large block of content and ask it five questions based on that content, one at a time.

In [None]:
total_input_tokens_no_batching = 0
total_output_tokens_no_batching = 0

system_prompt = """
    Answer the question using the content provided.
    * **Accuracy and Precision:** Provide direct, factual answers.
    * **Source Constraint:** Use *only* information explicitly present in the content. Do not infer, speculate, or bring in outside knowledge.
    * **Completeness:** Ensure each answer fully addresses the question, *to the extent possible with the given content*.
"""

for question in questions:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        config=types.GenerateContentConfig(
            system_instruction=system_prompt
        ),
        contents=[f'Content:\n{content}', f'\nQuestion:\n{question}']
    )
    total_input_tokens_no_batching += response.usage_metadata.prompt_token_count
    total_output_tokens_no_batching += response.usage_metadata.candidates_token_count

print(f'Total input tokens used with no batching: {total_input_tokens_no_batching}')
print(f'Total output tokens used with no batching: {total_output_tokens_no_batching}')

Total input tokens used with no batching: 66454
Total output tokens used with no batching: 638


### Batching

In this batched example, we ask the model the same questions, but all at once. A slightly modified system prompt is also used, requiring the model to respond in JSON format. This makes it significantly easier to parse the response and split it into individual answers.

The significant reduction in the number of input tokens used in this example is because the large content only needs to be provided to the API once.

In [21]:
system_prompt = """
    Answer the questions using the content provided.
    * **Accuracy and Precision:** Provide direct, factual answers.
    * **Source Constraint:** Use *only* information explicitly present in the content. Do not infer, speculate, or bring in outside knowledge.
    * **Completeness:** Ensure each answer fully addresses the question, *to the extent possible with the given content*.
    Respond using a JSON array, where each element is an object containing the answer to one question, in the same order as the questions.
    e.g. [
        {"answer": "Answer to question 1"},
        {"answer": "Answer to question 2"}
    ]
"""

batched_questions = ("\n").join(questions)

batched_response = client.models.generate_content(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        system_instruction=system_prompt
    ),
    contents=[f'Content:\n{content}', f'\nQuestions:\n{batched_questions}']
)

# Parse the JSON array returned by the batched call (not the last unbatched `response`).
batched_answer = json.loads(batched_response.text)
batched_answer = [ans['answer'] for ans in batched_answer]

# Read token usage from the batched call.
total_input_tokens_with_batching = batched_response.usage_metadata.prompt_token_count
total_output_tokens_with_batching = batched_response.usage_metadata.candidates_token_count

print(f'Total input tokens used with batching: {total_input_tokens_with_batching}')
print(f'Total output tokens used with batching: {total_output_tokens_with_batching}')

Total input tokens used with batching: 13360
Total output tokens used with batching: 367
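
The parsing step above trusts the model to return exactly one answer per question, in order. A minimal sketch of a more defensive version follows; `pair_answers` and the sample strings are hypothetical and stand in for `questions` and the text of the batched response.

```python
import json


def pair_answers(questions: list[str], raw_json: str) -> list[tuple[str, str]]:
    """Parse a batched JSON response and match each answer to its question."""
    # json.loads raises json.JSONDecodeError on malformed or truncated JSON.
    parsed = json.loads(raw_json)
    if not isinstance(parsed, list):
        raise ValueError("Expected a JSON array of answers")
    if len(parsed) != len(questions):
        raise ValueError(f"Expected {len(questions)} answers but received {len(parsed)}")
    return [(question, item["answer"]) for question, item in zip(questions, parsed)]


# Hypothetical stand-ins for the real questions and response text.
sample_questions = ["What is batching?", "Why use it?"]
sample_response_text = '[{"answer": "Combining calls."}, {"answer": "Fewer input tokens."}]'

for question, answer in pair_answers(sample_questions, sample_response_text):
    print(f"{question} -> {answer}")
```

Failing loudly on a length mismatch is a design choice here: silently zipping a short answer list against the questions would attach answers to the wrong questions.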


When batching, it is important to take into account the input and output token limits of the model used. This is because:
* Batching too many questions together may push the number of input tokens over the model's limit, causing the request to fail. If the content itself is too large, one solution is to also break it down into multiple smaller pieces, which is called chunking.
* Batching too many questions may push the number of output tokens over the model's limit, in which case the response is cut off, so not all of the questions are answered and the JSON may be incomplete.
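
One way to stay under the input limit is to split the questions into several smaller batches, each sent together with the content. The sketch below is an illustration under stated assumptions, not part of the example above: `estimate_tokens` is a crude placeholder (about four characters per token) that a real token count, such as one obtained from the SDK's token counting, would replace; `group_questions` and the numbers used are hypothetical; and the system prompt is not counted.

```python
def estimate_tokens(text: str) -> int:
    """Crude placeholder: assumes roughly four characters per token."""
    return max(1, len(text) // 4)


def group_questions(questions: list[str], content_tokens: int, max_input_tokens: int) -> list[list[str]]:
    """Greedily pack questions into batches that fit alongside the content."""
    budget = max_input_tokens - content_tokens
    if budget <= 0:
        raise ValueError("Content alone exceeds the input limit; chunk the content first")
    batches, current, used = [], [], 0
    for question in questions:
        cost = estimate_tokens(question)
        if cost > budget:
            raise ValueError("A single question does not fit in the remaining budget")
        # Start a new batch when adding this question would exceed the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(question)
        used += cost
    if current:
        batches.append(current)
    return batches


# Hypothetical numbers chosen only to make the split visible.
example = ["a" * 40, "b" * 40, "c" * 40]  # about 10 estimated tokens each
batches = group_questions(example, content_tokens=400, max_input_tokens=425)
print([len(batch) for batch in batches])  # [2, 1]
```

Each batch resends the content, so smaller batches give back some of the token savings. The same grouping idea can cap the number of questions per batch to keep the expected output under the output limit.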
