# Implement RAG chunking strategies with LangChain and Mistral API

The overall goal is to chunk text well enough to implement RAG effectively.

Some key components of chunking include: 
* Chunking strategy: Choosing the right chunking strategy for your RAG application is important as it determines the boundaries for setting chunks. We will explore some of these in the next section. 
* Chunk size: The maximum number of tokens in each chunk. Finding an appropriate chunk size usually takes some experimentation.
* Chunk overlap: The number of tokens overlapping between chunks to preserve context. This is an optional parameter.
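
To make chunk size and chunk overlap concrete, here is a minimal sketch of a sliding window over a token list. It is an illustration only: the function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for real model tokens.

```python
def fixed_size_chunks(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Assumption: words approximate tokens for the purpose of this example.
tokens = "the quick brown fox jumps over the lazy dog today".split()
chunks = fixed_size_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With an overlap of 1, the last word of each chunk is repeated as the first word of the next, which is how overlap carries context across a boundary.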



## Choosing the right chunking strategy for your RAG application
There are several different chunking strategies to choose from. It is important to select the most effective chunking technique for the specific use case of your LLM application. Some commonly used chunking processes include:


* **Fixed-size chunking**: Splitting text based on a chunk size and optional chunk overlap. This approach is the most common and the most straightforward.

* **Recursive chunking**: Iterating through a list of default separators until one of them produces chunks of the preferred size. Default separators include `["\n\n", "\n", " ", ""]`. This chunking method uses hierarchical separators so that paragraphs, followed by lines and then words, are kept together as much as possible.

* **Semantic chunking**: Splitting text in a way that groups sentences based on the semantic similarity of their embeddings. Embeddings of high semantic similarity are closer together than those of low semantic similarity. This results in context-aware chunks.

* **Document-based chunking**: Splitting based on document structure. This splitter can utilize Markdown text, images, tables and even Python code classes and functions as ways of determining structure. In doing so, large documents can be chunked and processed by the LLM.

* **Agentic chunking**: Leverages [agentic AI](https://www.ibm.com/think/topics/ai-agents) by allowing the LLM to determine appropriate document splitting based on semantic meaning as well as content structure such as paragraph types, section headings, step-by-step instructions and more. This chunker is experimental and attempts to simulate human reasoning when processing long documents.
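
The recursive strategy in the list above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, sizes are measured in characters, there is no overlap, and the separator at each chunk boundary is dropped.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, trying
    coarse separators first and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep = separators[0] if separators else ""
    finer = separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the result still fits
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Chunking splits documents.\n\n"
    "Small chunks retrieve precisely. Large chunks keep context.\n\n"
    "Overlap helps."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs survive intact, while the long middle paragraph is broken at word boundaries because no paragraph or line separator can make it fit.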

We will focus on **semantic chunking** in this tutorial because it produces context-aware chunks.

In [1]:
!pip install -q langchain langchain-community mistralai langchain_experimental langchain-text-splitters langchain_chroma transformers bs4 langchain_huggingface sentence-transformers


In [None]:
# imports 
import getpass

from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader

In [4]:
MISTRAL_APIKEY = getpass.getpass("Please enter your Mistral API key (hit enter): ")

In [None]:
!pip install langchain-mistralai
from langchain_mistralai import ChatMistralAI # Normal Mistral SDK is not compatible with `langchain`

### Step 4. Initialize your LLM


In [34]:
llm = ChatMistralAI(model="mistral-small-latest", mistral_api_key=MISTRAL_APIKEY)

### Step 5. Load your document

The context we are using for our RAG pipeline is the official IBM announcement for the release of Granite 3.1. We can load the blog to a `Document` directly from the webpage by using LangChain's `WebBaseLoader`.

In [19]:
url = "https://www.ibm.com/new/announcements/ibm-granite-3-1-powerful-performance-long-context-and-more"
doc = WebBaseLoader(url).load()

### Step 6. Semantic chunking
Semantic chunking requires an embedding or encoder model. We can use the `granite-embedding-30m-english` model as our embedding model. We can also print one of the chunks for a better understanding of their structure.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")
text_splitter = SemanticChunker(embeddings_model)
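semantic_chunks = text_splitter.create_documents([doc[0].page_content])

# Inspect one chunk to see how the splitter grouped related sentences
print(semantic_chunks[0].page_content)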
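
To show the idea behind semantic splitting without downloading a model, the sketch below embeds each sentence, measures the cosine distance between neighbouring sentences, and starts a new chunk wherever that distance exceeds a percentile threshold. Everything here is an assumption made for illustration: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the 95th-percentile cutoff reflects our understanding of `SemanticChunker`'s default behaviour rather than a guarantee.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low = math.floor(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_split(sentences, q=95):
    """Group consecutive sentences, breaking where neighbours are most dissimilar."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))  # close the chunk at a topic shift
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Granite models support long context windows.",
    "Granite models handle long documents.",
    "Chroma stores vector embeddings.",
    "Chroma retrieves similar vector embeddings.",
]
chunks = semantic_split(sentences)
for chunk in chunks:
    print(chunk)
```

The two Granite sentences share vocabulary, as do the two Chroma sentences, so the only break falls at the topic change between them.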

### Step 7. Create vector store

Now that we have our semantic chunks, let's move along with our RAG implementation. For this tutorial, we will take the chunks produced by the semantic split and convert them to vector embeddings. An open source vector store we can use is [Chroma DB](https://python.langchain.com/docs/integrations/vectorstores/chroma/). We can easily access Chroma functionality through the `langchain_chroma` package.

Let's initialize our Chroma vector database, provide it with our embeddings model and add our documents produced by semantic chunking.

In [56]:
import os

vector_db = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings_model,
    persist_directory=os.path.abspath("../chroma_langchain_db"),  # Where to save data locally, remove if not necessary
)

In [57]:
import re

# Normalize whitespace in each chunk before indexing
for chunk in semantic_chunks:
    # Replace one or more newlines with a single space
    chunk.page_content = re.sub(r'\n+', ' ', chunk.page_content)
    # Collapse consecutive whitespace into a single space
    chunk.page_content = re.sub(r'\s+', ' ', chunk.page_content).strip()

# Then add the cleaned chunks to the vector store
vector_db.add_documents(semantic_chunks)

['6f0092e9-5390-456f-8cd5-0baed4c68923',
 'f9da10c8-f264-4299-90c0-5694c4fcf2ca',
 '8f96f844-0bd6-4e02-993e-cf1306264683',
 '179da3fc-dd37-40ae-9e2a-79e6a7c5f607',
 '167ca692-e8d5-4edf-a0b6-944ec89ce88f']
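
For intuition about what a vector store does when documents are added and later queried, here is a minimal in-memory sketch. It is not Chroma: `ToyVectorStore`, `toy_embed`, and the method names are hypothetical, and a bag-of-words vector stands in for a real embedding model.

```python
import math
import re
import uuid
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (vector, text) pairs in memory and ranks them against a query."""

    def __init__(self, embedding_function):
        self.embedding_function = embedding_function
        self.records = {}  # id -> (vector, text)

    def add_texts(self, texts):
        ids = []
        for text in texts:
            doc_id = str(uuid.uuid4())  # one generated ID per stored text
            self.records[doc_id] = (self.embedding_function(text), text)
            ids.append(doc_id)
        return ids

    def similarity_search(self, query, k=2):
        query_vector = self.embedding_function(query)
        scored = [
            (cosine_similarity(query_vector, vector), text)
            for vector, text in self.records.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore(toy_embed)
ids = store.add_texts([
    "Docling parses documents into Markdown or JSON.",
    "Granite 3.1 supports long context windows.",
    "Chroma stores vector embeddings.",
])
top = store.similarity_search("What is Docling?", k=1)
print(ids)
print(top)
```

As in the cell above, adding texts returns one generated ID per stored item, and a query is answered by embedding it and ranking the stored vectors by similarity.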

### Step 8. Structure the prompt template
Next, we can move on to creating a prompt template for our LLM. This prompt template allows us to ask multiple questions without altering the initial prompt structure. We can also provide our vector store as the retriever. This step finalizes the RAG structure.

In [40]:
from langchain.chains import create_retrieval_chain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

# ChatMistralAI applies Mistral's own chat formatting, so the template needs no model-specific role tokens.
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {input}"""

qa_chain_prompt = PromptTemplate.from_template(prompt_template)
combine_docs_chain = create_stuff_documents_chain(llm, qa_chain_prompt)
rag_chain = create_retrieval_chain(vector_db.as_retriever(), combine_docs_chain)
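
Conceptually, the chain built above retrieves passages for the question, joins them into the `{context}` slot of the template, and returns the model's reply under an `answer` key. The sketch below mimics that flow with plain functions. `run_rag`, `fake_retriever`, and `echo_llm` are hypothetical stand-ins, and joining passages with a blank line is an assumption about the default behaviour.

```python
def run_rag(question, retrieve, generate, template):
    """Retrieve passages, stuff them into the template, and call the model."""
    documents = retrieve(question)
    prompt = template.format(context="\n\n".join(documents), input=question)
    return {"input": question, "context": documents, "answer": generate(prompt)}


template = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "{context}\n\n"
    "Question: {input}"
)


def fake_retriever(question):
    # Hypothetical stand-in for vector_db.as_retriever()
    return [
        "Docling parses documents into Markdown or JSON.",
        "Docling is open-sourced under the MIT License.",
    ]


def echo_llm(prompt):
    # Hypothetical stand-in for the chat model: returns the prompt it received
    return prompt


result = run_rag("What is Docling?", fake_retriever, echo_llm, template)
print(result["answer"])
```

Because `echo_llm` simply returns its input, printing `result["answer"]` shows the fully stuffed prompt that a real model would receive.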

### Step 9. Prompt the RAG chain

Using our completed RAG workflow, let's invoke a user query. First, we can strategically prompt the model without any additional context from the vector store we built to test whether the model is using its built-in knowledge or truly using the RAG context. The Granite 3.1 announcement blog references [Docling](https://github.com/DS4SD/docling?tab=readme-ov-file), IBM's tool for parsing various document types and converting them into Markdown or JSON. Let's ask the LLM about Docling.

A model that was not trained on information about Docling cannot give us correct information without outside tools or context, and it may hallucinate an answer instead. Now, let's try providing the same query to the RAG chain we built.

In [43]:
rag_output = rag_chain.invoke({"input": "What is Docling?"})
rag_output['answer']

'Docling is an open-source tool developed by IBM for preprocessing and extracting information from various document formats, making them more accessible for large language models (LLMs) like Granite. It can parse documents in formats such as PDF, DOCX, images, PPTX, XLSX, HTML, and AsciiDoc, converting them into model-friendly formats like Markdown or JSON. This enables the information to be easily accessed by models for tasks like retrieval-augmented generation (RAG) and other workflows.\n\nDocling goes beyond simple optical character recognition (OCR) and text extraction by integrating contextual and element-based preprocessing techniques. For example, it can extract tables spanning multiple pages as a single table or separate body text, images, and tables based on their original context. The tool is designed to work seamlessly with agentic frameworks like LlamaIndex, LangChain, and Bee, and is open-sourced under the MIT License.\n\nAdditional features under development include equat

Great! The Mistral model used the RAG context to give us correct information about Docling while preserving semantic coherence. Without RAG, the model would have had to rely on its built-in knowledge alone.

## Summary
In this tutorial, you created a RAG pipeline and used semantic chunking to improve the system's retrieval accuracy. Using the `mistral-small-latest` model together with the Granite embedding model, we produced an appropriate model response to a user query related to the document provided as context. The text we used for this RAG implementation was loaded from a blog on ibm.com announcing the release of Granite 3.1. The answer drew on information supplied through the retrieved context rather than on the model's built-in knowledge alone.

For further reading, check out the results of a [project](https://developer.ibm.com/articles/awb-enhancing-llm-performance-document-chunking-with-watsonx/) that compares LLM performance with HTML structured chunking against watsonx chunking.