# Advanced Chunking Methods for RAG

LLMs are known to hallucinate, and one of the most effective ways to reduce that _is_ **R**etrieval **A**ugmented **G**eneration (RAG). RAG allows us to provide up-to-date and relevant information to LLMs, along with citable sources for their generations.

However, using _RAG_ does not necessarily provide you with perfect performance out of the box. We need to build RAG the right way for it to serve our LLMs well.

The very first step in getting RAG right is how we process data for RAG. One of the biggest yet overlooked factors here is _chunking_. Chunking is the process of taking a long piece of text and breaking it into smaller pieces — we do this to optimize the **Capture of Meaning in Embeddings**, optimize for the **LLM Context Window**, and improve **User Experience and Trust**.

### Capture of Meaning in Embeddings

RAG relies on vector embeddings of text. One text, small or large, must be compressed into a _single_ vector. If we consider the size and contents of a book, how much meaning is captured in that book? A book will contain many hundreds or thousands of pieces of information and meaning — by compressing all of them into a single vector we are diluting and averaging the meaning of the full book.
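A toy illustration of this dilution, assuming hand-made 3-dimensional vectors in place of real embeddings and a plain average in place of whatever pooling a real embedding model performs. The topic names, vectors, and query are all hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise average, standing in for 'one vector for the whole book'."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical passage vectors, one axis per topic.
passages = {
    "cooking": [1.0, 0.0, 0.0],
    "sailing": [0.0, 1.0, 0.0],
    "chess": [0.0, 0.0, 1.0],
}
query = [0.0, 0.9, 0.1]  # a question that is mostly about sailing

whole_book = mean_vector(list(passages.values()))
score_book = cosine(query, whole_book)
score_passage = cosine(query, passages["sailing"])

# The single passage matches the query far better than the averaged vector.
print(f"whole book: {score_book:.2f}, sailing passage: {score_passage:.2f}")
```

Real embedding models do not literally average passage vectors, so treat this only as intuition for why a focused chunk tends to match a specific query better than one vector for an entire book.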

In some scenarios this dilution and averaging of meaning could be useful. For example, we may want to search for whole books based on their tone or themes, and for that we could avoid chunking. However, most RAG use-cases are about finding _specific_ pieces of information, and this is where we use _chunking_.

By breaking a long piece of text into many smaller pieces we can optimize our chunks to contain a singular (or close to singular) focus. That results in much better vector embeddings, as each one captures a specific meaning or piece of information rather than an average of many.
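As a baseline, here is a minimal sketch of a chunker that groups whole sentences up to a word budget. The function name `chunk_text`, the `max_words` parameter, and the regex-based sentence split are assumptions made for illustration; this is a naive starting point, not a recommended production method:

```python
import re

def chunk_text(text, max_words=50):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

example = "One two three. Four five six. Seven eight nine."
print(chunk_text(example, max_words=6))
```

Splitting on sentence boundaries keeps each chunk readable, but a word budget alone does not guarantee that a chunk has a single focus.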

### LLM Context Window

LLMs are being built with increasingly large context windows. Anthropic's Claude 3 models boast a 200K token context window — or around _150K words_. That is big, but is still not big enough to process truly meaningful amounts of data. A Harry Potter LLM, for example, could only process ~10% of the Harry Potter books.
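The arithmetic behind that estimate can be sketched as follows. The 0.75 words-per-token ratio is implied by the 200K token and 150K word figures above; the 1.5 million word corpus is a hypothetical round number chosen for illustration, not a measured word count:

```python
def words_that_fit(context_tokens, words_per_token=0.75):
    """Approximate number of words that fit in a context window."""
    return int(context_tokens * words_per_token)

def fraction_of_corpus(context_tokens, corpus_words, words_per_token=0.75):
    """Share of a corpus (0.0 to 1.0) that fits in one context window."""
    return min(1.0, words_that_fit(context_tokens, words_per_token) / corpus_words)

context_tokens = 200_000
corpus_words = 1_500_000  # hypothetical corpus size

print(words_that_fit(context_tokens))
print(f"{fraction_of_corpus(context_tokens, corpus_words):.0%}")
```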

There are also other problems with "stuffing" the LLM context window. First is **recall performance**: research has shown that LLMs often miss or forget information placed in the middle of their context windows [LOST], so even after stuffing the context window with information, much of it may be overlooked.

The most problematic issue with context window stuffing is _cost_. At Claude 3 Opus's launch price of \$15 / million input tokens, filling the 200K context window costs roughly \$3 in input tokens for a _single interaction_. Naturally, that _does not_ scale.
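This arithmetic is easy to check for any price. The sketch below uses \$3 and \$15 per million input tokens as illustrative rates (assumptions; substitute your provider's current pricing), and the function name `input_cost_usd` is hypothetical:

```python
def input_cost_usd(input_tokens, usd_per_million_tokens):
    """Input-side cost of one LLM call; output tokens are billed separately."""
    return input_tokens / 1_000_000 * usd_per_million_tokens

full_window = 200_000   # tokens in a completely filled context window
small_prompt = 2_000    # hypothetical prompt built from a few short chunks

for price in (3.0, 15.0):  # illustrative USD per million input tokens
    stuffed = input_cost_usd(full_window, price)
    compact = input_cost_usd(small_prompt, price)
    print(f"${price:.2f}/M tokens: ${stuffed:.2f} stuffed vs ${compact:.4f} compact")
```

Because cost is linear in input tokens, sending 100x fewer tokens costs 100x less per call, whatever the rate.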

Chunking our text and returning just a few paragraphs (or fewer) to an LLM massively reduces cost while improving recall. At the same time it frees us from context window limits, as it allows us to filter a potentially unbounded corpus down to a digestible few chunks for our LLM.
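The filtering step can be sketched as a top-k similarity search over chunks. To stay self-contained, this example uses word counts as a toy stand-in for a real embedding model; the `embed` and `top_k` functions and the sample chunks are all hypothetical:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The harbor closes at sunset during winter.",
    "Chess openings reward early control of the center.",
    "Winter sailing requires checking harbor ice reports.",
    "Slow cooking keeps a stew tender.",
]
context = top_k("When does the harbor close in winter?", chunks, k=2)
print(context)
```

Only the selected chunks are passed to the LLM, so prompt size depends on `k` and chunk length rather than on the size of the corpus.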

### User Experience and Trust

Another problem we will experience _without_ chunking is response times. Every token that we add to the input of an LLM means that the LLM will take longer to respond. By maximizing the context window we are also maximizing latency. Naturally, this leads to a degraded user experience — no one wants to wait longer for responses.

Last but not least is **trust**. Most people, whether technical or not, understand that LLMs _hallucinate_. That makes it hard for people to trust the output of an LLM, and many will resort to third-party web search to validate the information that our LLM has provided to them.

That lack of trust gives a poor user experience. However, when we use chunks and RAG we can provide users with precise references to the original information that our LLM answers have been constructed from. That allows users to rapidly validate information for themselves without needing to rely on third-party tooling.
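One way to keep those references available is to store source metadata alongside each chunk and number the chunks when building the prompt. This is a minimal sketch; the `Chunk` fields, the `build_context` helper, and the sample documents are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str    # hypothetical: document title or URL
    location: str  # hypothetical: page, section, or paragraph reference

def build_context(chunks):
    """Number each chunk so an answer can cite it and a UI can link back."""
    context_lines = []
    references = []
    for i, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{i}] {chunk.text}")
        references.append(f"[{i}] {chunk.source}, {chunk.location}")
    return "\n".join(context_lines), references

retrieved = [
    Chunk("The harbor closes at sunset during winter.", "Harbor Handbook", "section 2.1"),
    Chunk("Winter sailing requires checking harbor ice reports.", "Sailing Guide", "page 14"),
]
context, references = build_context(retrieved)
print(context)
print(references)
```

The numbered context goes into the LLM prompt, while the reference list is shown to the user next to the answer so that each claim can be traced back to its chunk.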

## References

[LOST] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, [Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/abs/2307.03172) (2023), TACL