[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/evals/ragas-evaluation.ipynb)

# RAG Series Part 3: Data Modeling Strategies for RAG

In this notebook, we will explore and evaluate different chunking techniques for RAG.

## Step 1: Install required libraries

In [93]:
! pip install -qU pypdf langchain langchain-openai langchain-experimental ragas

## Step 2: Set up prerequisites

- Set the MongoDB connection string. Follow the steps [here](https://www.mongodb.com/docs/manual/reference/connection-string/) to get the connection string from the Atlas UI.

- Set the OpenAI API key. Steps to obtain an API key are [here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).

In [2]:
import os
import getpass
from openai import OpenAI

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
openai_client = OpenAI()

Enter your OpenAI API Key:········


In [4]:
MONGODB_URI = getpass.getpass("Enter your MongoDB connection string:")

Enter your MongoDB connection string:········
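The cells above prompt for both secrets on every run. When re-running the notebook often, it can be convenient to check an environment variable first and only prompt as a fallback. The sketch below shows one way to do that; the helper name `get_secret` and the environment variable name in the usage comment are assumptions for illustration, not part of the original notebook.

```python
import getpass
import os


def get_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var if it is set, otherwise prompt for it."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Fall back to an interactive prompt that does not echo the input.
    return getpass.getpass(prompt)


# Hypothetical usage (the variable name MONGODB_URI is an assumption):
# MONGODB_URI = get_secret("MONGODB_URI", "Enter your MongoDB connection string:")
```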


## Step 3: Load the dataset

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://arxiv.org/pdf/2312.10997")
pages = loader.load()

In [42]:
len(pages)

21

In [7]:
pages[1]

Document(page_content='2\nFig. 1. Technology tree of RAG research. The stages of involving RAG mainly include pre-training, fine-tuning, and inference. With the emergence of LLMs,\nresearch on RAG initially focused on leveraging the powerful in context learning abilities of LLMs, primarily concentrating on the inference stage. Subsequent\nresearch has delved deeper, gradually integrating more with the fine-tuning of LLMs. Researchers have also been exploring ways to enhance language models\nin the pre-training stage through retrieval-augmented techniques.\nadvanced RAG, and modular RAG. This review contex-\ntualizes the broader scope of RAG research within the\nlandscape of LLMs.\n•We identify and discuss the central technologies integral\nto the RAG process, specifically focusing on the aspects\nof “Retrieval”, “Generation” and “Augmentation”, and\ndelve into their synergies, elucidating how these com-\nponents intricately collaborate to form a cohesive and\neffective RAG framework.\n
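Before chunking, it can help to look at how much text each page holds, since page length influences how many chunks each strategy will produce. The sketch below computes a few size statistics for any objects that expose a `page_content` attribute. The helper name `summarize_pages` is hypothetical, and stand-in objects are used so the example runs without downloading the PDF.

```python
from types import SimpleNamespace


def summarize_pages(pages):
    """Return basic size statistics for objects exposing `page_content`."""
    lengths = [len(page.page_content) for page in pages]
    return {
        "num_pages": len(lengths),
        "total_chars": sum(lengths),
        "shortest_page": min(lengths),
        "longest_page": max(lengths),
    }


# Stand-in objects (not the real PDF pages) so the sketch is self-contained.
demo_pages = [
    SimpleNamespace(page_content="Retrieval-augmented generation"),
    SimpleNamespace(page_content="Chunking"),
]
demo_summary = summarize_pages(demo_pages)
```

With the notebook's data, the same helper could be called as `summarize_pages(pages)`.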

## Step 4: Define chunking functions

In [94]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

In [101]:
def fixed_token_split_no_overlap(docs):
    """Split documents into fixed-size chunks of up to 200 tokens, with no overlap."""
    splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=0)
    return splitter.split_documents(docs)

In [102]:
def fixed_token_split_overlap(docs):
    """Split documents into chunks of up to 200 tokens that share 30 tokens with their neighbor."""
    splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=30)
    return splitter.split_documents(docs)
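To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing over a token list. It is a simplified illustration, not the library's implementation: the function name `sliding_window_chunks` is hypothetical, and whitespace-separated words stand in for real tokens.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many tokens after the previous one.
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace-separated words stand in for real tokens here.
tokens = "a b c d e f g h i j".split()
no_overlap = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=0)
with_overlap = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1)
```

With an overlap of 1, the last token of each window reappears as the first token of the next, which is how overlapping chunks preserve some context across chunk boundaries.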

In [112]:
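def paragraph_split(docs):
    """Split on newlines, packing lines into chunks of up to 200 cl100k_base tokens."""
    # chunk_overlap is not set here, so the splitter's default applies.
    splitter = CharacterTextSplitter.from_tiktoken_encoder(encoding_name="cl100k_base", separator="\n", chunk_size=200)
    return splitter.split_documents(docs)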
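The idea behind separator-based splitting can be sketched in plain Python: split the text on the separator, then pack consecutive pieces into a chunk until the size budget would be exceeded. This is a simplified illustration under stated assumptions: the function name `merge_lines` is hypothetical, word counts stand in for token counts, and overlap is left out.

```python
def merge_lines(text, max_words):
    """Split text on newlines, then pack consecutive lines into chunks of at most max_words words."""
    chunks, current, current_len = [], [], 0
    for line in text.split("\n"):
        n_words = len(line.split())
        if n_words == 0:
            continue  # skip blank lines
        # Close the current chunk if adding this line would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += n_words
    if current:
        chunks.append("\n".join(current))
    return chunks


text = "one two three\nfour five\nsix seven eight nine\nten"
line_chunks = merge_lines(text, max_words=5)
```

Note that in this sketch a single line longer than the budget is kept whole, because the only split points are the newlines.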

In [116]:
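def recursive_split(docs):
    """Split recursively on progressively finer separators into chunks of up to 200 tokens, with a 30-token overlap."""
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(encoding_name="cl100k_base", chunk_size=200, chunk_overlap=30)
    return splitter.split_documents(docs)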
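Recursive splitting tries the coarsest separator first and only falls back to finer ones for pieces that are still too long. The sketch below illustrates that idea in plain Python. It is not the library's implementation: the function name `recursive_chunks` and the separator list are assumptions, sizes are measured in characters rather than tokens, and overlap is omitted.

```python
def recursive_chunks(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split with the coarsest separator first, recursing into pieces that are still too long."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts of max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_chunks(part, max_chars, finer))
    # Greedily merge neighboring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph here.\n\nSecond paragraph is quite a bit longer than the first."
recursive_result = recursive_chunks(text, max_chars=30)
```

In this example the short first paragraph survives intact, while the longer second paragraph is broken at word boundaries.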

In [120]:
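def semantic_split(docs):
    """Split where the embedding distance between consecutive sentences exceeds a percentile threshold."""
    splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
    return splitter.split_documents(docs)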
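To build intuition for percentile-based breakpoints, the sketch below groups sentences by starting a new chunk wherever the distance between consecutive sentence embeddings exceeds a percentile of all such distances. It is a simplified view, not the library's implementation: the function names are hypothetical, toy 2-D vectors stand in for OpenAI embeddings, and the threshold of 95 is an assumed value for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in [0, 100])."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_groups(sentences, embeddings, pct=95):
    """Start a new group wherever the distance to the previous sentence exceeds the pct-th percentile."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    threshold = percentile(distances, pct)
    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # A large jump in embedding space marks a topic boundary.
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups


# Toy vectors (assumed for illustration): the last sentence points in a different direction.
sentences = ["Cats purr.", "Cats nap.", "Dogs bark.", "Stocks fell."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
groups = semantic_groups(sentences, embeddings)
```

Unlike the fixed-size strategies, the chunk boundaries here depend on the content itself, so chunk lengths can vary widely.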