[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/rag_chunking_strategies.ipynb)


# RAG Series Part 3: Choosing the right chunking strategy for RAG

In this notebook, we will explore and evaluate different chunking techniques for RAG.


## Step 1: Install required libraries


In [1]:
! pip install -qU langchain langchain-community langchain-openai langchain-mongodb langchain-experimental ragas pymongo tqdm

## Step 2: Setup pre-requisites

- Set the MongoDB connection string. Follow the steps [here](https://www.mongodb.com/docs/manual/reference/connection-string/) to get the connection string from the Atlas UI.

- Set the OpenAI API key. Steps to obtain an API key are [here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).


In [2]:
import getpass
import os

from openai import OpenAI

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
openai_client = OpenAI()

In [4]:
MONGODB_URI = getpass.getpass("Enter your MongoDB connection string:")

## Step 3: Load the dataset


In [5]:
from langchain_community.document_loaders import WebBaseLoader

web_loader = WebBaseLoader(
    [
        "https://peps.python.org/pep-0483/",
        "https://peps.python.org/pep-0008/",
        "https://peps.python.org/pep-0257/",
    ]
)

pages = web_loader.load()

In [6]:
len(pages)

3

## Step 4: Define chunking functions


In [34]:
from typing import Dict, List, Optional

from langchain.text_splitter import (
    Language,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)
from langchain_core.documents import Document
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

In [8]:
def fixed_token_split(
    docs: List[Document], chunk_size: int, chunk_overlap: int
) -> List[Document]:
    """
    Fixed token chunking

    Args:
        docs (List[Document]): List of documents to chunk
        chunk_size (int): Chunk size (number of tokens)
        chunk_overlap (int): Token overlap between chunks

    Returns:
        List[Document]: List of chunked documents
    """
    splitter = TokenTextSplitter(
        encoding_name="cl100k_base", chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(docs)
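
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is an illustration only: whitespace-separated words stand in for `cl100k_base` tokens, and `sliding_window_chunks` is a hypothetical helper, not part of LangChain.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split a token list into fixed-size windows sharing `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many tokens after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace "tokens" instead of real tokenizer output
tokens = "a b c d e f g h i j".split()
no_overlap = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=0)
with_overlap = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

With an overlap of 1, the last token of each window is repeated as the first token of the next, so text cut at a boundary still appears intact in one of the two neighbouring chunks.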

In [9]:
def recursive_split(
    docs: List[Document],
    chunk_size: int,
    chunk_overlap: int,
    language: Optional[Language] = None,
) -> List[Document]:
    """
    Recursive chunking

    Args:
        docs (List[Document]): List of documents to chunk
        chunk_size (int): Chunk size (number of tokens)
        chunk_overlap (int): Token overlap between chunks
        language (Optional[Language], optional): Language enum name. Defaults to None.

    Returns:
        List[Document]: List of chunked documents
    """
    separators = ["\n\n", "\n", " ", ""]

    if language is not None:
        try:
            separators = RecursiveCharacterTextSplitter.get_separators_for_language(
                language
            )
        except (NameError, ValueError):
            print(f"No separators found for language {language}. Using defaults.")

    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=separators,
    )
    return splitter.split_documents(docs)
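
The idea behind recursive chunking can be sketched in plain Python: split on the coarsest separator first, and fall back to finer separators only for pieces that are still too long. This is a simplified sketch under stated assumptions. It measures length in characters rather than tokens, ignores overlap, and `recursive_split_text` is a hypothetical helper rather than LangChain's implementation.

```python
from typing import List


def recursive_split_text(text: str, separators: List[str], max_len: int) -> List[str]:
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing finer to split on: fall back to hard cuts
        return [text[i : i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they still fit
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators
            chunks.extend(recursive_split_text(piece, finer, max_len))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Intro paragraph.\n\n"
    "Second paragraph that is quite a bit longer than the first.\n\n"
    "End."
)
chunks = recursive_split_text(sample, ["\n\n", "\n", " ", ""], max_len=30)
print(chunks)
```

Short paragraphs survive as whole chunks, while the long middle paragraph is broken at word boundaries instead of mid-word.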

In [10]:
def semantic_split(docs: List[Document]) -> List[Document]:
    """
    Semantic chunking

    Args:
        docs (List[Document]): List of documents to chunk

    Returns:
        List[Document]: List of chunked documents
    """
    splitter = SemanticChunker(
        OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
    )
    return splitter.split_documents(docs)
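
The `percentile` breakpoint type can be pictured as follows: embed consecutive sentences, measure the distance between neighbours, and start a new chunk wherever that distance exceeds a chosen percentile of all distances. The sketch below is a simplified illustration, not `SemanticChunker` itself. The toy 2-D vectors, the 95th-percentile default, and the helper names are assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: List[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def semantic_groups(
    sentences: List[str], vectors: List[Sequence[float]], pct: float = 95.0
) -> List[List[str]]:
    """Group sentences, breaking where neighbour distance exceeds the percentile."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups


# Toy embeddings (assumed): the first two and last two sentences point the same way
sentences = ["s0", "s1", "s2", "s3"]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
groups = semantic_groups(sentences, vectors)
print(groups)
```

Unlike the fixed-size strategies, the number and size of chunks here depend entirely on the content, which is why `semantic_split` takes no `chunk_size` argument.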

## Step 5: Generate the evaluation dataset


In [None]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import RunConfig
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

RUN_CONFIG = RunConfig(max_workers=4, max_wait=180)

In [16]:
# Generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

# Set question type distribution
distributions = {simple: 0.5, multi_context: 0.4, reasoning: 0.1}

testset = generator.generate_with_langchain_docs(
    pages, 10, distributions, run_config=RUN_CONFIG
)

Filename and doc_id are the same for all nodes.                 
Generating: 100%|██████████| 10/10 [01:16<00:00,  7.68s/it]


In [17]:
testset = testset.to_pandas()

In [18]:
len(testset)

10

In [19]:
testset.head()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the purpose of the Callable type in Py... | [ items, the\nfirst is an int, the second is a... | The Callable type in Python's typing module is... | simple | [{'source': 'https://peps.python.org/pep-0483/... | True |
| 1 | What naming convention should be used for type... | [�L’ instead.\n\n\nASCII Compatibility\nIdenti... | | simple | [{'source': 'https://peps.python.org/pep-0008/... | True |
| 2 | What is the recommended approach for implement... | [ations.\n\nComparisons to singletons like Non... | When implementing ordering operations with ric... | simple | [{'source': 'https://peps.python.org/pep-0008/... | True |
| 3 | Why should blank lines be removed from the beg... | [ fits on a line, place the closing quotes\non... | Blank lines should be removed from the beginni... | simple | [{'source': 'https://peps.python.org/pep-0257/... | True |
| 4 | What are some ways to declare types in Python? | [class Derived(Base[T_co]):\n ...\n\n\nA ty... | Type variables can be declared in unconstraine... | simple | [{'source': 'https://peps.python.org/pep-0483/... | True |


## Step 6: Evaluate chunking strategies


In [None]:
from langchain_mongodb import MongoDBAtlasVectorSearch
from pymongo import MongoClient

client = MongoClient(MONGODB_URI, appname="devrel.showcase.chunking_strategies")
DB_NAME = "evals"
COLLECTION_NAME = "chunking"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

In [32]:
def create_vector_store(docs: List[Document]) -> MongoDBAtlasVectorSearch:
    """
    Create MongoDB Atlas vector store

    Args:
        docs (List[Document]): List of documents to create the vector store

    Returns:
        MongoDBAtlasVectorSearch: MongoDB Atlas vector store
    """
    vector_store = MongoDBAtlasVectorSearch.from_documents(
        documents=docs,
        embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
        collection=MONGODB_COLLECTION,
        index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    )

    return vector_store
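
`similarity_search` only returns results if an Atlas Vector Search index named `vector_index` exists on the `evals.chunking` collection. The notebook does not show that index, so the definition below is an assumption: it targets the `embedding` field (the default field that `MongoDBAtlasVectorSearch` writes vectors to), uses the 1536 dimensions produced by `text-embedding-3-small` by default, and picks cosine similarity as one reasonable choice.

```python
import json

# Assumed definition for the "vector_index" Atlas Vector Search index
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",  # default embedding field of MongoDBAtlasVectorSearch
            "numDimensions": 1536,  # default output size of text-embedding-3-small
            "similarity": "cosine",  # assumption: cosine similarity
        }
    ]
}

# Print as JSON, e.g. to paste into the Atlas UI's JSON editor
index_json = json.dumps(vector_index_definition, indent=2)
print(index_json)
```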

In [22]:
import nest_asyncio
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall
from tqdm import tqdm

# Allow nested use of asyncio (used by RAGAS)
nest_asyncio.apply()

# Disable tqdm locks
tqdm.get_lock().locks = []

QUESTIONS = testset.question.to_list()
GROUND_TRUTH = testset.ground_truth.to_list()

In [30]:
def perform_eval(docs: List[Document]) -> Dict[str, float]:
    """
    Perform RAGAS evaluation

    Args:
        docs (List[Document]): List of documents to create the vector store

    Returns:
        Dict[str, float]: Dictionary of evaluation metrics
    """
    eval_data = {
        "question": QUESTIONS,
        "ground_truth": GROUND_TRUTH,
        "contexts": [],
    }

    print(f"Deleting existing documents in the collection {DB_NAME}.{COLLECTION_NAME}")
    MONGODB_COLLECTION.delete_many({})
    print("Deletion complete")
    vector_store = create_vector_store(docs)

    # Getting relevant documents for questions in the evaluation dataset
    print("Getting contexts for evaluation set")
    for question in tqdm(QUESTIONS):
        eval_data["contexts"].append(
            [doc.page_content for doc in vector_store.similarity_search(question, k=3)]
        )
    # RAGAS expects a Dataset object
    dataset = Dataset.from_dict(eval_data)

    print("Running evals")
    result = evaluate(
        dataset=dataset,
        metrics=[context_precision, context_recall],
        run_config=RUN_CONFIG,
        raise_exceptions=False,
    )
    return result
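
For intuition about the two metrics, here is a rough sketch of how they can be computed once relevance judgments are available. In RAGAS those judgments come from an LLM. The boolean lists below are hand-made stand-ins, and the formulas reflect my reading of the RAGAS documentation rather than its actual code.

```python
from typing import List


def context_precision_at_k(relevance: List[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at this relevant rank
    return weighted / hits if hits else 0.0


def context_recall_score(supported: List[bool]) -> float:
    """Share of ground-truth statements supported by the retrieved contexts."""
    return sum(supported) / len(supported) if supported else 0.0


# Hypothetical judgments for one question with k=3 retrieved chunks
precision = context_precision_at_k([True, False, True])
recall = context_recall_score([True, True, False, True])
print(round(precision, 4), recall)
```

Context precision rewards ranking relevant chunks first, while context recall checks whether the retrieved chunks cover the ground-truth answer. That is why `perform_eval` needs both the retrieved `contexts` and the `ground_truth` column.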

In [24]:
for chunk_size in [100, 200, 500, 1000]:
    chunk_overlap = int(0.15 * chunk_size)
    print(f"CHUNK SIZE: {chunk_size}")
    print("------ Fixed token without overlap ------")
    print(f"Result: {perform_eval(fixed_token_split(pages, chunk_size, 0))}")
    print("------ Fixed token with overlap ------")
    print(
        f"Result: {perform_eval(fixed_token_split(pages, chunk_size, chunk_overlap))}"
    )
    print("------ Recursive with overlap ------")
    print(f"Result: {perform_eval(recursive_split(pages, chunk_size, chunk_overlap))}")
    print("------ Recursive Python splitter with overlap ------")
    print(
        f"Result: {perform_eval(recursive_split(pages, chunk_size, chunk_overlap, Language.PYTHON))}"
    )
print("------ Semantic chunking ------")
print(f"Result: {perform_eval(semantic_split(pages))}")

CHUNK SIZE: 100
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:01<00:00,  5.22it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:23<00:00,  1.17s/it]


Result: {'context_precision': 0.8583, 'context_recall': 0.7833}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:01<00:00,  5.12it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:18<00:00,  1.09it/s]


Result: {'context_precision': 0.9000, 'context_recall': 0.9500}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.93it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.10s/it]


Result: {'context_precision': 0.9000, 'context_recall': 0.9833}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.90it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.15s/it]


Result: {'context_precision': 0.9833, 'context_recall': 0.9833}
CHUNK SIZE: 200
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.94it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:21<00:00,  1.09s/it]


Result: {'context_precision': 0.9000, 'context_recall': 0.9000}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:01<00:00,  5.10it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:20<00:00,  1.03s/it]


Result: {'context_precision': 1.0000, 'context_recall': 0.9383}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:01<00:00,  5.13it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.12s/it]


Result: {'context_precision': 0.9000, 'context_recall': 0.9008}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.75it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:21<00:00,  1.10s/it]


Result: {'context_precision': 1.0000, 'context_recall': 0.8583}
CHUNK SIZE: 500
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.99it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.11s/it]


Result: {'context_precision': 0.8833, 'context_recall': 0.9500}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.77it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:17<00:00,  1.15it/s]


Result: {'context_precision': 0.7000, 'context_recall': 0.9000}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.65it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:20<00:00,  1.02s/it]


Result: {'context_precision': 0.5667, 'context_recall': 0.8236}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:01<00:00,  5.11it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:15<00:00,  1.30it/s]


Result: {'context_precision': 0.6000, 'context_recall': 0.8800}
CHUNK SIZE: 1000
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:01<00:00,  5.18it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:18<00:00,  1.08it/s]


Result: {'context_precision': 0.9000, 'context_recall': 0.8909}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.27it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:17<00:00,  1.15it/s]


Result: {'context_precision': 0.7833, 'context_recall': 0.8909}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:03<00:00,  2.64it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:19<00:00,  1.02it/s]


Result: {'context_precision': 0.7833, 'context_recall': 0.8800}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.64it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:19<00:00,  1.01it/s]


Result: {'context_precision': 0.8000, 'context_recall': 0.8709}
------ Semantic chunking ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set


100%|██████████| 10/10 [00:02<00:00,  4.69it/s]


Running evals


Evaluating: 100%|██████████| 20/20 [00:23<00:00,  1.16s/it]

Result: {'context_precision': 0.9000, 'context_recall': 0.8187}
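
The results above are easier to compare side by side. The snippet below copies the printed scores into a list and sorts them. Ranking by the plain average of the two metrics is an assumption made for this summary, not something RAGAS prescribes. With only 10 evaluation questions, small differences between runs should be read with caution.

```python
# (strategy, chunk_size, context_precision, context_recall), copied from the output above
results = [
    ("Fixed token without overlap", 100, 0.8583, 0.7833),
    ("Fixed token with overlap", 100, 0.9000, 0.9500),
    ("Recursive with overlap", 100, 0.9000, 0.9833),
    ("Recursive Python with overlap", 100, 0.9833, 0.9833),
    ("Fixed token without overlap", 200, 0.9000, 0.9000),
    ("Fixed token with overlap", 200, 1.0000, 0.9383),
    ("Recursive with overlap", 200, 0.9000, 0.9008),
    ("Recursive Python with overlap", 200, 1.0000, 0.8583),
    ("Fixed token without overlap", 500, 0.8833, 0.9500),
    ("Fixed token with overlap", 500, 0.7000, 0.9000),
    ("Recursive with overlap", 500, 0.5667, 0.8236),
    ("Recursive Python with overlap", 500, 0.6000, 0.8800),
    ("Fixed token without overlap", 1000, 0.9000, 0.8909),
    ("Fixed token with overlap", 1000, 0.7833, 0.8909),
    ("Recursive with overlap", 1000, 0.7833, 0.8800),
    ("Recursive Python with overlap", 1000, 0.8000, 0.8709),
    ("Semantic chunking", None, 0.9000, 0.8187),
]

# Assumption: rank by the unweighted mean of precision and recall
ranked = sorted(results, key=lambda r: (r[2] + r[3]) / 2, reverse=True)

print(f"{'strategy':<32}{'size':>6}{'precision':>11}{'recall':>8}")
for strategy, size, precision, recall in ranked:
    size_label = "-" if size is None else str(size)
    print(f"{strategy:<32}{size_label:>6}{precision:>11.4f}{recall:>8.4f}")

best = ranked[0]
```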



