# Gemini API: Chunking

Chunking is the opposite of batching: it is the process of breaking a large input into multiple smaller pieces, referred to as chunks. To take a real-life example again, imagine you are eating a steak. It is too large to eat in a single mouthful, so you cut it into pieces and eat one piece at a time.

In the context of LLMs, models have token limits that restrict how much data can be ingested in a single API call, so developers must be aware of how much content is being sent to the model.

Chunking also has other benefits, including:
* Improved Performance - If an error occurs during an API call, only the affected chunk needs to be reprocessed rather than the entire input, which is significantly quicker.

## Setup

In [4]:
from google import genai
from google.genai import types
import os
from dotenv import load_dotenv
import math

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

# Load the document that will be chunked.
with open("demo_files/content.txt", 'r', encoding='utf-8') as file:
    content = file.read()

# Load the questions, one per line.
questions = []
with open("demo_files/questions.txt", 'r', encoding='utf-8') as file:
    for line in file:
        questions.append(line.strip())
# Only the first question is used in this example.
question = questions[0]

client = genai.Client(api_key=GEMINI_API_KEY)

## Chunking Techniques

Since Gemini is natively multimodal, different media types require different chunking strategies. Only simple text chunking methods are demonstrated in this example; other techniques are discussed later.

It is also worth noting that the Google Gemini Models come with large context windows (1,048,576 input tokens for 2.5 Pro and Flash), so chunking may not be needed in some use cases.
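
Whether chunking is needed at all can be checked by counting tokens before sending a request. The following is a minimal sketch, assuming the `client` from the setup cell and the `count_tokens` method of the `google-genai` SDK. The helper name `fits_in_context`, the default model name, and the `reserve` parameter are illustrative assumptions and not part of the original notebook.

```python
# Input token limit quoted above for Gemini 2.5 Pro and Flash.
INPUT_TOKEN_LIMIT = 1_048_576


def fits_in_context(client, text, model="gemini-2.5-flash",
                    limit=INPUT_TOKEN_LIMIT, reserve=0):
    """Return (fits, token_count) for `text`.

    `reserve` is a hypothetical safety margin (in tokens) kept free for the
    rest of the prompt, e.g. the question sent alongside the content.
    """
    response = client.models.count_tokens(model=model, contents=text)
    token_count = response.total_tokens
    return token_count + reserve <= limit, token_count
```

A call such as `fits, n_tokens = fits_in_context(client, content)` would then indicate whether the content could be sent in a single request or should be chunked first.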

### Fixed Chunking

The example content used in this demonstration has 53,405 characters, which is well within the input token limit. For this example, however, imagine that the limit is ~10,000 characters.

In fixed chunking, the content is split into non-overlapping chunks of 10,000 characters each, with the final chunk holding whatever remains.

TODO: Add visualisation.

In [None]:
chunk_char_size = 10000
chunked_content = []
chunk_count = math.ceil(len(content) / chunk_char_size)

for i in range(chunk_count):
    chunk_start_pos = i * chunk_char_size
    chunk_end_pos = min(chunk_start_pos + chunk_char_size, len(content))
    chunked_content.append(content[chunk_start_pos : chunk_end_pos])

print(f'Number of chunks: {len(chunked_content)}')
# TODO: Add API calls to Gemini to demonstrate using the chunks?
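
One way to use the chunks is to ask the same question once per chunk and collect the replies. The following is a rough sketch under stated assumptions, not a definitive implementation: it assumes the `client`, `chunked_content` and `question` variables from the cells above and the `generate_content` method of the `google-genai` SDK. The helper name `ask_each_chunk`, the prompt wording, and the default model name are illustrative.

```python
def ask_each_chunk(client, chunks, question, model="gemini-2.5-flash"):
    """Send `question` once per chunk and collect the text of each reply."""
    answers = []
    for index, chunk in enumerate(chunks):
        # Hypothetical prompt: ask the model to stay within the given chunk.
        prompt = (
            "Answer the question using only the text below. "
            "If the text does not contain the answer, reply 'NOT FOUND'.\n\n"
            f"Text (chunk {index + 1} of {len(chunks)}):\n{chunk}\n\n"
            f"Question: {question}"
        )
        response = client.models.generate_content(model=model, contents=prompt)
        answers.append(response.text)
    return answers
```

Calling `ask_each_chunk(client, chunked_content, question)` makes one API call per chunk. The per-chunk answers still need to be combined afterwards, for example by discarding the `NOT FOUND` replies.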

### Sliding Window Chunking

One disadvantage of fixed chunking is that it may break context at arbitrary positions, so important information can be split between chunks. For example, half a sentence may land in the first chunk and the other half in the second, meaning neither chunk can fully answer a question about that sentence. The sliding window approach addresses this by using overlapping chunks, so each chunk shares some content with the previous chunk (this shared region is called the window).

The trade-off is that overlapping increases the number of chunks, and therefore the number of API calls needed.

TODO: Add visualisation.

In [5]:
chunk_char_size = 10000
window_char_size = 2500

chunked_content = []
chunk_count = math.ceil(len(content) / (chunk_char_size - window_char_size))

for i in range(chunk_count):
    chunk_start_pos = i * (chunk_char_size - window_char_size)
    chunk_end_pos = min(chunk_start_pos + chunk_char_size, len(content))
    chunked_content.append(content[chunk_start_pos : chunk_end_pos])

print(f'Number of chunks: {len(chunked_content)}')

Number of chunks: 8
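
The loop above derives the chunk count from the step size alone. Given the 53,405-character length stated earlier, the eighth chunk starts at position 52,500 and lies entirely inside the seventh, which already runs from 45,000 to the end of the text. A minimal sketch of a variant that stops as soon as a chunk reaches the end is shown here. The function name `sliding_window_chunks` and its parameters are illustrative and not part of the original notebook.

```python
def sliding_window_chunks(text, chunk_size, overlap):
    """Split `text` into chunks of up to `chunk_size` characters, where each
    chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # the end is covered, a further chunk would add nothing new
        start += step
    return chunks


sample = "x" * 53_405  # same length as the example content
print(len(sliding_window_chunks(sample, 10_000, 2_500)))  # 7
print(len(sliding_window_chunks(sample, 10_000, 0)))      # 6, same as fixed chunking
```

Setting `overlap` to 0 reduces this to fixed chunking, so the same helper can cover both techniques.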


### Other Chunking Methods

TODO: these methods increase in complexity, requiring more time to complete.


This list is not exhaustive; a combination of these techniques, or a different technique altogether, may perform better depending on the use case.

- Text
    - Semantic Chunking: This involves breaking down the content into chunks based on semantic meaning. Sentences are grouped together if they discuss similar topics, making it more likely that a question can be answered entirely by a single chunk. One implementation would calculate an embedding for each sentence using `SentenceTransformer` and then compute the cosine similarity between neighbouring sentences, starting a new chunk where the similarity drops.
- Audio
    - Fixed/Sliding Window Chunking by duration: Similar techniques can be used for audio. Instead of chunking by character count, the input is split by time duration.
    - Text Methods via Transcripts: Models such as Google's Speech-to-Text or OpenAI's Whisper can be used to create a transcript of a file. This allows the text-based methods (fixed/sliding window/semantic) to be used, and because both can also provide timestamps for the transcribed speech, text chunks can be mapped back to positions in the audio.
    - Speaker Diarization: Analysis can also be performed on the audio itself to detect when the speaker changes or where there is a natural break in speech, both of which often make good chunking positions. One common library for this is `pyannote.audio`.
- Video
    - Audio Methods: Each of the audio methods above can also be used for video content by isolating the audio track.
    - Visual Content: Finally, you could analyse the frames of the video to detect a scene change, for example a camera cut, which could provide a good chunking position. A useful library for this is `PySceneDetect`, which detects when visual scene changes occur.
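
The semantic chunking idea from the Text entry above can be sketched as follows. This is a simplified illustration under stated assumptions rather than a reference implementation: `embed` stands for any callable that maps a list of sentences to a list of vectors (the `encode` method of a `SentenceTransformer` model would be one candidate), the names `cosine_similarity` and `semantic_chunks` are hypothetical helpers, and the `threshold` value is an arbitrary starting point that would need tuning.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences, starting a new chunk whenever the
    similarity between neighbouring sentences falls below `threshold`."""
    if not sentences:
        return []
    vectors = embed(sentences)
    chunks = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(prev_vec, vec) >= threshold:
            chunks[-1].append(sentence)  # same topic, extend current chunk
        else:
            chunks.append([sentence])    # topic shift, start a new chunk
    return [" ".join(chunk) for chunk in chunks]
```

Only neighbouring sentences are compared here, which keeps the chunks contiguous in the original text. Comparing each sentence against the running average of the current chunk is a common alternative when single off-topic sentences would otherwise cause too many splits.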
