[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/evals/ragas-evaluation.ipynb)


# RAG Series Part 3: Data Modeling Strategies for RAG

In this notebook, we will explore and evaluate different chunking techniques for RAG.


## Step 1: Install required libraries


In [27]:
! pip install -qU langchain langchain-community langchain-text-splitters langchain-openai langchain-experimental ragas lxml bs4 nest_asyncio

## Step 2: Set up prerequisites

- Set the MongoDB connection string. Follow the steps [here](https://www.mongodb.com/docs/manual/reference/connection-string/) to get the connection string from the Atlas UI.

- Set the OpenAI API key. Steps to obtain an API key are [here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).


In [8]:
import os
import getpass
from openai import OpenAI

In [9]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
openai_client = OpenAI()

In [10]:
MONGODB_URI = getpass.getpass("Enter your MongoDB connection string:")
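
Since `getpass` hides what we paste, a malformed connection string is easy to miss until a later step fails. The sketch below is an optional sanity check, not part of the original notebook: `looks_like_mongodb_uri` is a hypothetical helper that only inspects the scheme and host part with the standard library. It does not connect to the cluster or validate credentials.

```python
from urllib.parse import urlparse


def looks_like_mongodb_uri(uri: str) -> bool:
    """Return True if the string has a MongoDB scheme and a non-empty host part.

    Hypothetical helper for illustration: it checks the shape of the string only.
    """
    parsed = urlparse(uri.strip())
    return parsed.scheme in {"mongodb", "mongodb+srv"} and bool(parsed.netloc)
```

A call such as `looks_like_mongodb_uri(MONGODB_URI)` right after the prompt can catch a pasted hostname without the `mongodb+srv://` prefix.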

## Step 3: Load the dataset


In [102]:
from langchain_community.document_loaders import WebBaseLoader

web_loader = WebBaseLoader(
    [
        "https://peps.python.org/pep-0483/",
        "https://peps.python.org/pep-0008/",
        "https://peps.python.org/pep-0257/",
    ]
)

pages = web_loader.load()

In [79]:
len(pages)

3

## Step 4: Define chunking functions


In [95]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

In [81]:
def fixed_token(docs, chunk_size, chunk_overlap):
    """Split docs into fixed-size token windows that overlap by chunk_overlap tokens."""
    splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_documents(docs)
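
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of the sliding-window idea behind fixed token chunking. It is an illustration under assumptions, not the splitter's actual code: `sliding_token_windows` is a hypothetical helper, and whitespace-separated words stand in for real tokenizer tokens.

```python
def sliding_token_windows(tokens, chunk_size, chunk_overlap):
    """Return windows of up to chunk_size tokens; neighbors share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
        start += step
    return chunks


# Assumption: words stand in for tokens produced by a real tokenizer.
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in sliding_token_windows(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
# the quick brown fox
# fox jumps over the
# the lazy dog again
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap produces more chunks for the same input.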

In [90]:
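def recursive_split(docs, chunk_size, chunk_overlap):
    """Split docs recursively on paragraph, line, and word boundaries; sizes are in cl100k_base tokens."""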
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return splitter.split_documents(docs)
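
The idea behind recursive splitting is to try the coarsest separator first and fall back to finer ones only for pieces that are still too large. The sketch below illustrates that idea in a simplified form. It is not the library's implementation: `recursive_chunks` is a hypothetical helper, it measures length in characters rather than tokens, and it ignores overlap.

```python
def recursive_chunks(text, max_len, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_len characters, preferring coarse separators."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily merge neighboring pieces while the result still fits.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too large: retry it with the finer separators.
            chunks.extend(recursive_chunks(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_chunks("aaa bbb\n\nccc ddd eee fff", max_len=8))
# ['aaa bbb', 'ccc ddd', 'eee fff']
```

The first paragraph fits as a whole, so it stays intact. Only the second paragraph is broken down further, at word boundaries.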

In [124]:
def recursive_python_split(docs, chunk_size, chunk_overlap, language):
    """Split docs with separators for `language` (e.g. Language.PYTHON); sizes are in characters."""
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=language,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return splitter.split_documents(docs)

In [84]:
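def semantic_split(docs):
    """Split docs where the embedding distance between adjacent sentences exceeds a percentile threshold."""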
    splitter = SemanticChunker(
        OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
    )
    return splitter.split_documents(docs)
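
To illustrate what a percentile breakpoint means, the sketch below computes the cosine distance between each pair of adjacent sentence embeddings and starts a new chunk wherever that distance exceeds the chosen percentile of all distances. It is a simplified illustration, not the `SemanticChunker` code: the function names are hypothetical, the toy 2-D vectors stand in for real embeddings, and `pct=95` is an assumed threshold.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation, pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def percentile_breakpoints(sentences, vectors, pct=95):
    """Group sentences, breaking where the adjacent distance exceeds the pct-th percentile."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    if not distances:
        return [" ".join(sentences)] if sentences else []

    threshold = percentile(distances, pct)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


# Toy data (hypothetical): two sentences per topic, with vectors pointing in different directions.
sentences = ["Use 4 spaces.", "Avoid tabs.", "Docstrings use triple quotes.", "Keep them short."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(percentile_breakpoints(sentences, vectors))
# ['Use 4 spaces. Avoid tabs.', 'Docstrings use triple quotes. Keep them short.']
```

Because the threshold is relative to the distances in the document itself, a higher percentile yields fewer, larger chunks, and a lower one yields more, smaller chunks.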