# Chunking

Chunking is the process of breaking down large documents into smaller, manageable pieces or "chunks." This is particularly useful in the context of information retrieval and natural language processing, where working with smaller segments of text can improve the efficiency and accuracy of various tasks, such as search, summarization, and question answering.

I used to think chunking was a simple task, since it is just breaking a document into smaller pieces. However, I have learned that chunking can be pretty tricky, and it does affect the performance of downstream tasks. You want chunks that are neither too small nor too large.

There are some common trade-offs to consider when chunking documents:

- **Chunk Size**: Smaller chunks may lead to more precise information retrieval but can also result in loss of context. Larger chunks retain more context but may include irrelevant information. Very large chunks dilute the relevant information and increase token usage in later stages.

- **Context Preservation**: Maintaining context is crucial for understanding the meaning of text. If chunks are too small, they may not provide enough context for accurate interpretation. Fixed-size chunks often have this issue, as the meaning of a sentence can be lost if it is cut off in the middle.

- **Overlapping Chunks**: Sometimes, overlapping chunks can help preserve context by ensuring that important information is not lost between adjacent chunks. However, this can also lead to redundancy and increased processing time.


## Fixed-size chunking

Fixed-size chunking is the most basic form of chunking, where documents are divided into chunks of a predetermined size, regardless of the content. This method is straightforward but can lead to issues with context and meaning, as it does not take into account the natural breaks in the text.

Chunk size is crucial: chunks that are too large can blur nuanced meaning, while chunks that are too small can lack context.
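As a rough illustration of fixed-size chunking with overlap, here is a minimal sketch that counts size in words. The function name `fixed_size_chunks` and the parameters `chunk_size` and `overlap` are my own choices for this example, not part of any library, and the default values are arbitrary.

```python
def fixed_size_chunks(text, chunk_size=200, overlap=20):
    """Split text into chunks of chunk_size words; adjacent chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(fixed_size_chunks(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Note how `w3` and `w6` appear in two chunks each: that is the redundancy cost of overlap, and the cut points ignore sentence boundaries entirely.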

![image.png](attachment:image.png)

---

![image-2.png](attachment:image-2.png)

## Variable-size chunking

Variable-size chunking is a more advanced technique that creates chunks based on the content and structure of the text. Because it looks for natural breaks, such as paragraph or section boundaries, it is more flexible than fixed-size chunking and groups related information together, resulting in chunks that are more aligned with the author's intent.

![image.png](attachment:image.png)

### Recursive Character Text Splitting

In the context of chunking for vector databases, recursive character text splitting breaks text into chunks at specific separator characters, such as newlines. It is called recursive because it works through a list of separators in order of priority, typically paragraph breaks first, then line breaks, then spaces, and falls back to the next separator only when a piece is still too large. This respects the structure of the document and keeps related ideas together, which can improve the relevance of search results.

Imagine you have a book filled with chapters and paragraphs. If you were to split the text only at fixed intervals, you might end up cutting off sentences or even thoughts in the middle. However, by using recursive character text splitting, you can choose to split the text at natural breaks, like the end of a paragraph. This way, you create chunks that are more meaningful and contextually rich, making it easier for a system to retrieve relevant information.
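To make the idea concrete, here is a simplified sketch of a recursive splitter. It is my own toy version, not the implementation of any particular library: the name `recursive_split`, the `max_chars` limit, and the separator order are all assumptions chosen for illustration.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", " ")):
    """Split text at the coarsest separator that yields pieces of at most max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut at fixed intervals
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    first, rest = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(first):
        candidate = buffer + first + piece if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer.strip():
            chunks.append(buffer)
        buffer = ""
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # This piece alone is too long: recurse with the finer separators
            chunks.extend(recursive_split(piece, max_chars, rest))
    if buffer.strip():
        chunks.append(buffer)
    return chunks
```

Short paragraphs get packed together into one chunk, and only a paragraph that exceeds `max_chars` on its own is broken down further, first at line breaks and then at spaces.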

## Mixed chunking

You can combine fixed-size and variable-size chunking to take advantage of both methods. For instance, use a variable-size chunker to divide text at paragraph markers, and then apply a fixed-size filter. If a chunk is too small, you can merge it with the next one, and if a chunk is too large, you can split it in the middle or at another marker within the chunk.

```python
def mixed_chunking(source_text, min_length=25):
    """
    Splits the given source_text into chunks using a mix of variable-size and fixed-size chunking.
    It first splits the text at AsciiDoc section markers and then merges any chunk that has
    fewer than min_length words with the chunk that follows it. Splitting chunks that are
    too large is not implemented here.

    Args:
    - source_text (str): The text to be chunked.
    - min_length (int): Minimum number of words a chunk should have.

    Returns:
    - list: A list of text chunks.
    """

    # Variable-size step: split the text at AsciiDoc section markers
    pieces = source_text.split("\n==")
    # str.split drops the separator, so restore the "==" prefix on every piece after the first
    chunks = [pieces[0]] + ["==" + piece for piece in pieces[1:]]

    # Fixed-size step: merge chunks that are too small into the next one
    new_chunks = []
    chunk_buffer = ""

    for chunk in chunks:
        # Rejoin with the newline that the split consumed
        new_buffer = f"{chunk_buffer}\n{chunk}" if chunk_buffer else chunk
        if len(new_buffer.split()) < min_length:  # Count words on any whitespace
            chunk_buffer = new_buffer  # Too small: carry over to the next chunk
        else:
            new_chunks.append(new_buffer)  # Large enough: add to chunks
            chunk_buffer = ""

    if chunk_buffer:
        new_chunks.append(chunk_buffer)  # Add the last chunk, if necessary

    return new_chunks
```
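The function above only merges chunks that are too small. The other half of the idea, splitting a chunk that is too large, could look roughly like the sketch below. The helper `split_oversized`, its `max_length` word limit, and the choice of a blank line as the inner `marker` are assumptions I made for this example.

```python
def split_oversized(chunk, max_length=100, marker="\n\n"):
    """Split a chunk until every piece has at most max_length words.

    Prefers the `marker` occurrence closest to the middle of the chunk and
    falls back to the middle word boundary when the marker is absent.
    """
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    words = chunk.split()
    if not words:
        return []  # nothing but whitespace
    if len(words) <= max_length:
        return [chunk]

    # Collect every position where the marker occurs
    positions = []
    start = chunk.find(marker)
    while start != -1:
        positions.append(start)
        start = chunk.find(marker, start + 1)

    if positions:
        cut = min(positions, key=lambda p: abs(p - len(chunk) // 2))
        left, right = chunk[:cut], chunk[cut + len(marker):]
    else:
        # No marker: cut at the middle word (this normalizes whitespace)
        middle = len(words) // 2
        left, right = " ".join(words[:middle]), " ".join(words[middle:])

    return (split_oversized(left, max_length, marker)
            + split_oversized(right, max_length, marker))
```

Running each merged chunk through a helper like this gives both a lower and an upper bound on chunk size while still cutting at natural markers where possible.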


## Advanced Chunking

### Semantic Chunking

Semantic chunking is a technique that enhances the way we break down text into manageable pieces, or "chunks," by focusing on the meaning of the sentences rather than just their length or structure. Imagine reading a story where each paragraph is a puzzle piece. If the pieces are cut randomly, you might miss the bigger picture. Semantic chunking ensures that the pieces fit together based on their meaning, allowing you to see the full story more clearly.

Here's how it works: the algorithm processes the text one sentence at a time. For each sentence, it checks if it shares a similar meaning with the previous sentences in the current chunk. To do this, both the current chunk and the next sentence are transformed into vectors, which are like numerical representations of their meanings. If the distance between these vectors is small enough, it means the sentences are similar, and they stay together in the same chunk. This continues until the next sentence is too different, prompting a new chunk to start. 

For example, if a passage discusses the benefits of exercise and then suddenly shifts to a different topic such as nutrition, semantic chunking would recognize that shift and create separate chunks for each topic. This method not only preserves the context but also allows for a more coherent understanding of the text.
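Here is a small sketch of that loop. To keep it self-contained, `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, so it only detects shared vocabulary rather than shared meaning. The function names and the `max_distance` threshold are my own assumptions and would need tuning with a real model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A stand-in for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    """Cosine distance between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(sentences, max_distance=0.8):
    """Group consecutive sentences while each new one stays close to the current chunk."""
    chunks, current = [], []
    for sentence in sentences:
        if current:
            distance = cosine_distance(embed(" ".join(current)), embed(sentence))
            if distance > max_distance:
                chunks.append(" ".join(current))  # topic shifted: close the chunk
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = [
    "Exercise improves heart health.",
    "Regular exercise also improves sleep.",
    "Nutrition depends on vitamins and minerals.",
    "Good nutrition needs vitamins daily.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With these four sentences, the two exercise sentences end up in one chunk and the two nutrition sentences in another, because the third sentence shares no words with the chunk before it.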

![image.png](attachment:image.png)



### Large Language Model-Based Chunking

Large Language Model-Based (LLM-based) chunking is an innovative technique that leverages the capabilities of advanced language models to create meaningful chunks of text. Imagine having a highly intelligent assistant who not only understands the content of your document but can also organize it in a way that makes it easier to digest. This is precisely what LLM-based chunking aims to achieve.

In this method, you provide a document to a language model along with specific instructions on how you want the text to be chunked. For example, you might instruct the model to group sentences that discuss similar concepts together and to create new chunks when the topic shifts. The language model then processes the text and generates the chunks accordingly, much like a chef preparing a dish by combining the right ingredients in the right way. This approach is particularly effective because it allows for a nuanced understanding of the text, capturing the flow of ideas and maintaining coherence.
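A minimal sketch of this flow is shown below. It deliberately avoids any specific model API: `call_llm` is a hypothetical callable that takes a prompt string and returns the model's text response, and the prompt wording and the `<<<CHUNK>>>` delimiter are assumptions I picked for illustration.

```python
CHUNK_PROMPT = (
    "Split the document below into chunks. Keep sentences about the same "
    "concept together and start a new chunk when the topic shifts. "
    "Return the chunks separated by a line containing only {delimiter}.\n\n"
    "{document}"
)


def llm_chunks(document, call_llm, delimiter="<<<CHUNK>>>"):
    """Ask a language model to chunk `document`.

    call_llm is a hypothetical callable (prompt: str) -> str that wraps
    whichever model API you use; it is not part of any specific library.
    """
    prompt = CHUNK_PROMPT.format(delimiter=delimiter, document=document)
    response = call_llm(prompt)
    # Drop empty parts in case the model adds a leading or trailing delimiter
    return [part.strip() for part in response.split(delimiter) if part.strip()]


def stub_llm(prompt):
    """Canned response so the example runs without a real model."""
    return "Exercise is good.\n<<<CHUNK>>>\nEat vegetables.\n"


print(llm_chunks("Exercise is good. Eat vegetables.", stub_llm))
# ['Exercise is good.', 'Eat vegetables.']
```

In practice you would also want to check that the returned chunks actually reproduce the original text, since a model may paraphrase or drop sentences.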

One of the key advantages of LLM-based chunking is the quality of its results. Although it operates somewhat like a black box, meaning its inner workings are complex and not always transparent, it often yields excellent results in terms of search relevance and information retrieval. As the costs associated with using large language models decrease, this technique is becoming more accessible and economically viable for various applications.

![image-2.png](attachment:image-2.png)


### Context-Aware Chunking

Context-aware chunking takes the concept of chunking a step further by adding a layer of context to each chunk of text. Think of it as a storyteller who not only shares a tale but also provides background information to help the audience understand the significance of each part. This technique ensures that each chunk is not just a standalone piece but is enriched with context that ties it back to the overall narrative.

Here's how it works: when you apply context-aware chunking, a language model processes the entire document, creating chunks while also summarizing the context for each one. For instance, if a blog post ends with a list of contributors, that list might be hard to interpret on its own. However, the language model can add a brief explanation, such as "These individuals supported the project," which helps clarify the purpose of that chunk. This added context is beneficial both when searching for information and when retrieving the chunk later, as it provides a clearer understanding of its relevance.
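The sketch below shows one way to attach context to chunks. It simplifies the description above by assuming the chunks already exist and only adding the context step. `describe` is a hypothetical callable, for example a wrapper around a language model call, that receives the whole document plus one chunk and returns a short explanation. The dictionary keys are my own naming.

```python
def add_context(document, chunks, describe):
    """Attach a short context note to every chunk.

    describe is a hypothetical callable (document: str, chunk: str) -> str,
    e.g. a wrapper around a language model; it is not a real library function.
    """
    enriched = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        enriched.append({
            "text": chunk,        # the original chunk, unchanged
            "context": context,   # the generated explanation
            # What you would embed and index: context first, then the chunk
            "indexed_text": f"{context}\n\n{chunk}" if context else chunk,
        })
    return enriched


def stub_describe(document, chunk):
    """Canned explanation so the example runs without a real model."""
    return "These individuals supported the project."


post = "How we built the project...\n\nAlice, Bob, Carol"
for item in add_context(post, ["Alice, Bob, Carol"], stub_describe):
    print(item["indexed_text"])
```

Keeping the original `text` next to the `indexed_text` lets you search over the enriched version while still returning the untouched chunk when that is preferable.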

The beauty of context-aware chunking lies in its ability to improve search relevance without slowing down the search process, because the extra work happens once, during pre-processing, rather than at query time. While this pre-processing does require significant computational resources, the payoff is substantial: users can find more relevant information quickly and easily, leading to a more satisfying experience.

![image.png](attachment:image.png)

