<a href="https://colab.research.google.com/github/kaippatel/dmls-langchain-workshop/blob/main/RAG_Text_Splitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **RAG Text Splitting options**


## **📌Overview**
RAG (Retrieval-Augmented Generation) text splitting divides documents into smaller, coherent segments (chunks) so that the retriever can fetch only the passages relevant to a query. This preprocessing step trades context preservation (e.g., semantic or overlapping splits) against processing efficiency, so that the generator receives meaningful passages and can produce accurate, context-aware output. This notebook compares several splitting strategies on the same document.
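
As a rough illustration of what an overlapping split means, here is a minimal plain-Python sketch. The function name `chunk_with_overlap` and the sample string are hypothetical and are not part of LangChain; the splitters used later in this notebook are more careful about where they cut.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Illustrative input only: the alphabet makes the shared characters easy to see.
chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.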

In [None]:
# First uninstall conflicting versions
!pip uninstall -y google-generativeai google-ai-generativelanguage

# Then install all dependencies in one go
!pip install -qU \
  python-dotenv \
  langchain-core \
  langchain-google-genai \
  chromadb \
  pypdf \
  langchain-community \
  google-generativeai

In [None]:
from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import ChatPromptTemplate

## **Obtain a Google Gemini API Key (GOOGLE COLAB SETUP):**

If you have a Google Gemini API Key:
- Copy your API key and replace "YOUR_API_KEY" in the code below

Otherwise:  
- Go to Google AI Studio: [Google AI Studio](https://aistudio.google.com/prompts/new_chat)
- Sign in with your Google account and create a new API key.
- Copy your API key and replace "YOUR_API_KEY" in the code below

In [None]:
# Set your Google API key manually
import os
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"

## **Load Environment Variables (LOCAL SETUP)**

In [None]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

---

## **Imports**  

In [None]:
import os  # needed below even if the Colab API-key cell was skipped

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

## **Colab File Setup**   

In [None]:
from google.colab import files
uploaded = files.upload()  # Upload your 1984.txt when prompted


## **Set the persistence directory**

In [None]:
persistent_directory = "/content/chroma_db"


## **Load Document**

In [None]:
loader = TextLoader("1984.txt")  # Use uploaded filename
documents = loader.load()

# **Function to create vector stores**

In [None]:
def create_vector_store(docs, store_name):
    store_path = os.path.join(persistent_directory, store_name)
    if not os.path.exists(store_path):
        print(f"\n--- Creating vector store: {store_name} ---")
        db = Chroma.from_documents(
            docs,
            embeddings,
            persist_directory=store_path
        )
        print(f"--- Finished creating vector store: {store_name} ---")
    else:
        print(f"Vector store '{store_name}' already exists, skipping.")


## **Splitting Options**

- **Character Splitting:** Divides text into fixed-size chunks (e.g., 500 characters). Simple, but risks cutting words or sentences mid-context. Ideal for raw preprocessing.
- **Sentence Splitting:** Splits text at sentence boundaries (using NLP tools like spaCy or NLTK). Preserves semantic coherence but may create uneven chunk sizes.
- **Token Splitting:** Splits text into token units (e.g., words/subwords aligned with model tokenizers like GPT). Ensures chunks fit model limits without breaking tokens.
- **Recursive Splitting:** Hierarchical approach that splits text using multiple separators (paragraph → sentence → word) in turn. Balances context retention and chunk size consistency.

**Use Case Alignment:**

- Character: speed-focused tasks.
- Sentence/Token: NLP/model inputs.
- Recursive: context-heavy retrieval (e.g., RAG).

In [None]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
    TokenTextSplitter,
    TextSplitter,
)

# Character Based Text Splitter

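`CharacterTextSplitter` splits the text on a single separator (`\n\n` by default) and then merges adjacent pieces until adding the next one would exceed `chunk_size`.

It does not fall back to a finer separator such as `\n`. If the text contains a section without the separator (for example, a long run of paragraphs with no blank lines between them), that section becomes a single chunk larger than requested.

A warning is logged for each chunk that exceeds the specified size.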
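
To see why oversized chunks appear, here is a simplified plain-Python model of splitting on one separator and merging the pieces. It is a sketch under assumptions and not LangChain's implementation: `split_and_merge` is a hypothetical helper, and it ignores `chunk_overlap` entirely.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; it may exceed chunk_size when the piece
            # itself contains no separator to split on.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Illustrative input: two short paragraphs, one 40-character run, one short tail.
text = "short one\n\nshort two\n\n" + "x" * 40 + "\n\nend"
chunks = split_and_merge(text, "\n\n", chunk_size=25)
for c in chunks:
    print(len(c), repr(c))
```

The 40-character run has no `\n\n` inside it, so it comes out as a single chunk even though `chunk_size` is 25.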


In [None]:
# 1.1 Character-based splitting with the default separator (\n\n)
print("\n--- Character-based Splitting ---")
char_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
char_docs = char_splitter.split_documents(documents)
create_vector_store(char_docs, "chroma_db_char")

In [None]:
# 1.2 Character-based splitting with space as the separator
print("\n--- Character-based Splitting with \" \"---")
char_splitter = CharacterTextSplitter(separator=" ", chunk_size=1000, chunk_overlap=100)
char_docs_space = char_splitter.split_documents(documents)
create_vector_store(char_docs_space, "chroma_db_char_space")

# Sentence Based Text Splitting

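Sentence-based splitting aims to end chunks at sentence boundaries, which helps keep each chunk semantically coherent.

Note that `SentenceTransformersTokenTextSplitter`, used below, does not detect sentence boundaries: it splits by the tokens of a sentence-transformers model (it needs the `sentence-transformers` package) so that each chunk fits that model's input window.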
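
Chunking on sentence boundaries can be sketched in plain Python as follows. This is a minimal illustration with a naive regular expression. The helper names `split_sentences` and `group_sentences` are hypothetical, and a real pipeline would more likely use spaCy or NLTK for boundary detection.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def group_sentences(sentences: list[str], chunk_size: int) -> list[str]:
    """Pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Illustrative text only.
sample = "The cat sat. It was warm! Was it asleep? Nobody knew."
sentences = split_sentences(sample)
chunks = group_sentences(sentences, chunk_size=30)
print(chunks)
# ['The cat sat. It was warm!', 'Was it asleep? Nobody knew.']
```

Chunk lengths vary because sentences are never cut, which is the uneven-size trade-off of this strategy.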

In [None]:
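# 2 "Sentence"-based splitting (token windows of a sentence-transformers model)
print("\n--- Sentence-based Splitting ---")
# Note: the per-chunk token budget is set by tokens_per_chunk (default: the
# model's maximum sequence length); chunk_size does not limit chunk length here.
sent_splitter = SentenceTransformersTokenTextSplitter(chunk_size=1000)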
sent_docs = sent_splitter.split_documents(documents)
create_vector_store(sent_docs, "chroma_db_sent")

# Token-based Splitting
Splits text into chunks of a fixed number of tokens (words or subword units). `TokenTextSplitter` tokenizes with `tiktoken`, using the GPT-2 encoding by default.
Useful for transformer models with strict token limits.
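
The windowing logic behind token-based splitting can be sketched without a real tokenizer. In this hedged example, whitespace-separated words stand in for tokens. That is an assumption made to keep the sketch self-contained, since `tiktoken` produces subword tokens instead, and `split_by_tokens` is a hypothetical name.

```python
def split_by_tokens(text: str, tokens_per_chunk: int, overlap: int = 0) -> list[str]:
    """Slide a fixed-size window over the token list and join each window back into text."""
    if not 0 <= overlap < tokens_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than tokens_per_chunk")
    tokens = text.split()  # stand-in tokenizer: one token per whitespace-separated word
    step = tokens_per_chunk - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(tokens):
            break  # this window already covers the last token
    return chunks


chunks = split_by_tokens("one two three four five six seven", tokens_per_chunk=3, overlap=1)
print(chunks)
# ['one two three', 'three four five', 'five six seven']
```

Because the budget is counted in tokens rather than characters, every chunk is guaranteed to fit a model limit expressed in tokens.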

In [None]:
!pip install tiktoken
# 3 Token-based splitting
print("\n--- Using Token-based Splitting ---")
token_splitter = TokenTextSplitter(chunk_overlap=0, chunk_size=512)
token_docs = token_splitter.split_documents(documents)
create_vector_store(token_docs, "chroma_db_token")

# **Recursive Character-based Splitting**
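Attempts to split text at natural boundaries within the character limit. `RecursiveCharacterTextSplitter` tries a list of separators in order (by default paragraph breaks `\n\n`, line breaks `\n`, spaces, and finally individual characters) and only moves to a finer separator when a piece is still too long.

This balances coherence against adherence to the character limit.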
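
The recursive idea can be sketched in plain Python as below. This is a simplified model and not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it drops the separator when it moves down to a finer level.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces at this level
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


# Illustrative text: the middle paragraph is too long and is re-split on spaces.
text = "alpha beta\n\ngamma delta epsilon zeta\n\neta"
chunks = recursive_split(text, chunk_size=12)
print(chunks)
# ['alpha beta', 'gamma delta', 'epsilon zeta', 'eta']
```

Unlike single-separator character splitting, no chunk here exceeds `chunk_size`, because the final empty-string separator always permits a hard cut.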


In [None]:
# 4 Recursive Character-based splitting
print("\n--- Using Recursive Character-based Splitting ---")
rec_char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100)
rec_char_docs = rec_char_splitter.split_documents(documents)
create_vector_store(rec_char_docs, "chroma_db_rec_char")

In [None]:
def query_vector_store(store_name, query):
    # point at your existing persistent_directory
    store_path = os.path.join(persistent_directory, store_name)
    if os.path.exists(store_path):
        print(f"\n--- Querying the Vector Store {store_name} ---")
        db = Chroma(
            persist_directory=store_path,
            embedding_function=embeddings
        )
        retriever = db.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": 1, "score_threshold": 0.5},
        )
        relevant_docs = retriever.invoke(query)
        # Display the relevant results with metadata
        print(f"\n--- Relevant Documents for {store_name} ---")
        for i, doc in enumerate(relevant_docs, 1):
            print(f"Document {i}:\n{doc.page_content}\n")
            if doc.metadata:
                print(f"Source: {doc.metadata.get('source', 'Unknown')}\n")
    else:
        print(f"Vector store {store_name} does not exist.")

# Define the user's question
query = "Where is Oceania?"

# Query each vector store
query_vector_store("chroma_db_char", query)
query_vector_store("chroma_db_char_space", query)
query_vector_store("chroma_db_sent", query)
query_vector_store("chroma_db_token", query)
query_vector_store("chroma_db_rec_char", query)
query_vector_store("chroma_db_custom", query)  # not created in this notebook, so it is reported as missing

# **Setup LLM**

In [None]:
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma

# Set up Google Gemini LLM
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-exp-image-generation")

# Your question
query = "What is the name of the city or place where 1984 is set?"

# List all your split-based stores
stores = [
    "chroma_db_char",
    "chroma_db_char_space",
    "chroma_db_sent",
    "chroma_db_token",
    "chroma_db_rec_char",
    "chroma_db_custom",
]

for store_name in stores:
    store_path = os.path.join(persistent_directory, store_name)
    if not os.path.exists(store_path):
        print(f"Skipping missing store: {store_name}")
        continue

    # Load the vector store with the same embeddings you used to create it
    db = Chroma(
        persist_directory=store_path,
        embedding_function=embeddings
    )

    # Build a RetrievalQA chain for this store
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=db.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": 3, "score_threshold": 0.5}
        ),
        return_source_documents=True
    )

    # Invoke the chain
    result = qa_chain.invoke({"query": query})

    # Print the output
    print(f"\n=== Results using store: {store_name} ===")
    print("\n--- Answer ---")
    print(result["result"])

    # Print the single "chunk used" (top-ranked doc)
    if result["source_documents"]:
        top = result["source_documents"][0]
        print("\n--- Chunk Used ---")
        print(top.page_content)
    else:
        print("\n--- Chunk Used ---")
        print("No chunk passed the threshold.")
