


**Lecture Title:  RAG Chunking Strategies: From Basic to Advanced**

**I. Introduction: Retrieval-Augmented Generation (RAG) and Chunking**

*   **What is RAG?**
    *   Combines pre-trained large language models (LLMs) with external knowledge retrieval.
    *   Improves LLM responses by providing relevant context from a knowledge base.
    *   Reduces hallucinations (making up facts) and improves accuracy.
    *   Allows LLMs to access up-to-date information without retraining.

*   **The RAG Process (High-Level):**
    1.  **User Query:** The user provides a question or prompt.
    2.  **Retrieval:**  A retriever component searches a knowledge base (e.g., a vector database) for documents/chunks relevant to the query.
    3.  **Augmentation:** The retrieved chunks are combined with the original query, forming an augmented prompt.
    4.  **Generation:** The LLM uses the augmented prompt to generate a response.
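
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not a production retriever: the word-overlap score stands in for vector similarity, and `knowledge_base`, `retrieve`, and `augment` are hypothetical names chosen for this sketch. The generation step is left as a comment because it depends on whichever LLM you call.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, chunk):
    """Toy relevance score: number of word occurrences shared by query and chunk."""
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk))
    return sum((q & c).values())

def retrieve(query, chunks, k=2):
    """Step 2 (Retrieval): return the k highest-scoring chunks."""
    return sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)[:k]

def augment(query, retrieved):
    """Step 3 (Augmentation): combine retrieved chunks with the original query."""
    context = "\n".join(f"- {ch}" for ch in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical knowledge base of pre-chunked text
knowledge_base = [
    "Chunk overlap repeats text across adjacent chunks.",
    "K-Means groups similar embeddings into clusters.",
    "A context window is the maximum input length of an LLM.",
]

query = "What is a context window?"            # Step 1 (User Query)
top_chunks = retrieve(query, knowledge_base, k=1)
prompt = augment(query, top_chunks)
print(prompt)                                   # Step 4 would send this prompt to an LLM
```

In a real system the scoring function would typically be replaced by embedding similarity against a vector database, but the shape of the pipeline stays the same.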

*   **Why Chunking is Crucial:**
    *   **Context Window Limits:** LLMs have a maximum input length (context window).  We can't feed entire documents.
    *   **Efficiency:**  Smaller, relevant chunks lead to faster retrieval and generation.
    *   **Precision:**  Well-chosen chunks improve the quality of the retrieved information, leading to better LLM responses.  Irrelevant information can confuse the LLM.
    *   **Cost Optimization:** Smaller chunks mean fewer tokens to process, which reduces computational cost.

*   **The Chunking Challenge:**  Finding the *optimal* way to split text into meaningful, contextually relevant pieces.  Too small, and you lose context.  Too large, and you exceed context limits or introduce irrelevant information.

**II.  Basic Chunking Strategies**

*   **A.  Fixed-Size Chunking (with `RecursiveCharacterTextSplitter`)**
    *   **Concept:** Divide the text into chunks of a predetermined size (e.g., 100 characters, 512 tokens).
    *   **Code Example (from the provided file):**
        ```python
        from langchain.text_splitter import RecursiveCharacterTextSplitter

        document = "..."  # Your long text document

        # Small chunk size (demonstrates context loss)
        small_splitter = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=2)
        small_chunks = small_splitter.split_text(document)
        print("Smaller Chunks -")
        print(small_chunks)

        # Larger chunk size (more context, but potentially less precise)
        large_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=10)
        large_chunks = large_splitter.split_text(document)
        print("Larger Chunks -")
        print(large_chunks)
        ```
    *   **Parameters:**
        *   `chunk_size`:  The target length of each chunk (in characters, by default).
        *   `chunk_overlap`:  The number of characters to overlap between adjacent chunks.  Helps preserve context.
    *   **Pros:**
        *   Simple to implement.
        *   Guarantees chunks won't exceed a specific size.
    *   **Cons:**
        *   Can split sentences, paragraphs, or semantic units in the middle, disrupting meaning.
        *   May not be optimal for all types of content.
    *   **Example (fixed-size chunking by word count, without overlap):**
      ```python
      def fixed_size_chunk(text, chunk_size=512):
          words = text.split()
          chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
          return chunks
      ```
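
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal character-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class also looks for separators); `fixed_size_chars` is a hypothetical helper that only shows what the two numbers mean.

```python
def fixed_size_chars(text, chunk_size=20, chunk_overlap=5):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "Chunking splits long documents into smaller pieces."
chunks = fixed_size_chars(text, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk begins with the last five characters of the previous one, and because the cut points ignore word boundaries, the final chunk here starts in the middle of a word. That is the "disrupting meaning" drawback listed under Cons.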

*   **B.  Sentence-Based Chunking**
    *   **Concept:** Split the text into individual sentences.  Assumes sentences are relatively self-contained units of meaning.
    *   **Implementation:**
        *   Can use libraries like `nltk` or `spaCy` for sentence tokenization.
        *   The provided code *repeats* the fixed-size chunking example, which is incorrect.  Here's a corrected example using `nltk`:
            ```python
            import nltk
            nltk.download('punkt')  # Sentence tokenizer data (one-time download); newer NLTK releases may need 'punkt_tab' instead

            from nltk.tokenize import sent_tokenize

            text = "This is the first sentence.  This is the second sentence.  And this is a third."
            sentences = sent_tokenize(text)
            for i, sentence in enumerate(sentences):
                print(f"Sentence {i+1}: {sentence}")
            ```
    *   **Pros:**
        *   Preserves sentence-level coherence.
        *   Often a good starting point for many tasks.
    *   **Cons:**
        *   Sentences can vary greatly in length.  Some may be too short to be useful, others too long.
        *   May not capture relationships *between* sentences.
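
One common way to soften the length problem is to pack consecutive sentences into a chunk until a size budget is reached. The sketch below assumes a naive regex splitter as a stand-in for `sent_tokenize`, and `pack_sentences` and `max_chars` are hypothetical names for this illustration.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def pack_sentences(sentences, max_chars=80):
    """Greedily merge consecutive sentences while the chunk stays within max_chars."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # Budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("RAG retrieves context. It then augments the prompt. "
        "Finally, the LLM generates an answer grounded in the retrieved chunks.")
chunks = pack_sentences(split_sentences(text), max_chars=60)
for chunk in chunks:
    print(len(chunk), chunk)
```

Short sentences are merged, but a single sentence longer than `max_chars` is still emitted whole, so a hard size limit would need an additional fallback split.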

*   **C.  Document-Based Chunking**
    *   **Concept:**  Treat each document as a single chunk.  Only applicable if your documents are already relatively small and self-contained.
    *   **Implementation:**
        *   The provided code demonstrates loading a PDF using `PyPDFLoader`:
            ```python
            from langchain.document_loaders import PyPDFLoader

            pdf_loader = PyPDFLoader("sample.pdf") # Ensure sample.pdf exists
            documents = pdf_loader.load()
            ```
        *   `PyPDFLoader.load()` returns one `Document` per page, so the PDF is effectively chunked by page rather than kept as a single chunk.
    *   **Pros:**
        *   Simple if your documents are already appropriately sized.
    *   **Cons:**
        *   Often impractical, as documents are frequently too large for LLM context windows.
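
Before committing to document-level chunks, it can help to check which documents actually fit a size budget. The following is a rough sketch: `partition_by_size` and `max_words` are hypothetical, and a whitespace word count is only an approximation of the token count an LLM would see.

```python
def partition_by_size(documents, max_words=300):
    """Return (usable_as_is, needs_splitting) based on a whitespace word count."""
    whole, too_long = [], []
    for doc in documents:
        target = whole if len(doc.split()) <= max_words else too_long
        target.append(doc)
    return whole, too_long

docs = ["Short FAQ answer about refunds.", "word " * 1000]
whole, too_long = partition_by_size(docs, max_words=300)
print(f"{len(whole)} usable as-is, {len(too_long)} need further splitting")
```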

**III.  Advanced Chunking Strategies**

*   **A.  Semantic-Based Chunking (Dynamic Chunking)**
    *   **Concept:**  Group sentences or phrases that are semantically related into the same chunk.  Uses sentence embeddings and clustering.
    *   **Code Example (from the provided file):**
        ```python
        from sentence_transformers import SentenceTransformer
        from sklearn.cluster import KMeans
        import numpy as np

        model = SentenceTransformer('all-MiniLM-L6-v2') # Or any suitable sentence transformer

        sentences = [
            "Astronauts are sent to space.",
            "The Martian is about survival on Mars.",
            "Interstellar deals with space exploration.",
            "Space travel involves many challenges."
        ]

        embeddings = model.encode(sentences)
        kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)  # n_init set explicitly to suppress a scikit-learn warning
        labels = kmeans.fit_predict(embeddings)

        chunks = {}
        for i, label in enumerate(labels):
            if label not in chunks:
                chunks[label] = []
            chunks[label].append(sentences[i])

        for label, chunk in chunks.items():
            print(f"Semantic Chunk {label + 1}: {', '.join(chunk)}")
        ```
    *   **Explanation:**
        1.  **Sentence Embeddings:**  Convert each sentence into a numerical vector (embedding) that represents its meaning.  Sentence transformers are pre-trained models designed for this.
        2.  **Clustering:**  Use a clustering algorithm (like K-Means) to group similar embeddings together.  Sentences with similar meanings will be in the same cluster.
        3.  **Chunk Formation:**  Create chunks based on the cluster assignments.
    *   **Pros:**
        *   Creates chunks that are thematically coherent.
        *   Can adapt to the content of the text.
    *   **Cons:**
        *   More computationally expensive than basic methods.
        *   Requires choosing an appropriate number of clusters (`n_clusters`).
        *   The quality of the chunks depends on the quality of the sentence embeddings.
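
K-Means groups sentences regardless of where they appear in the text. A different, order-preserving variant compares each sentence with the one before it and starts a new chunk when similarity drops. The sketch below illustrates that similarity-breakpoint idea under loud assumptions: `embed` is a toy word-count embedding standing in for `model.encode`, and the `threshold` value is arbitrary.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy embedding: word counts. A real pipeline would use a sentence transformer."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_breakpoint_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])
        else:
            chunks[-1].append(curr)
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "Space travel involves many challenges.",
    "Space travel requires careful planning.",
    "Clustering groups similar vectors.",
    "Clustering needs a number of groups.",
]
chunks = semantic_breakpoint_chunks(sentences, threshold=0.2)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
```

This avoids choosing `n_clusters` up front, at the price of choosing a similarity threshold instead.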

*   **B.  Overlapping Chunking**
    *   **Concept:**  Create chunks that overlap with each other.  Ensures that if a relevant piece of information falls near a chunk boundary, it's still captured.
    *   **Code Example (from the provided file):**
        ```python
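        def overlapping_chunk(text, chunk_size=5, overlap=2):
            # The step must be positive, otherwise range() fails
            if not 0 <= overlap < chunk_size:
                raise ValueError("overlap must be >= 0 and smaller than chunk_size")
            words = text.split()
            chunks = []
            for i in range(0, len(words), chunk_size - overlap):
                chunks.append(' '.join(words[i:i + chunk_size]))
                # Stop once a chunk reaches the end, so the tail is not repeated
                if i + chunk_size >= len(words):
                    break
            return chunks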

        text = "This is an example of overlapping chunking to maintain context between chunks."
        chunks = overlapping_chunk(text, chunk_size=5, overlap=2)
        ```
    *   **Pros:**
        *   Reduces the risk of missing important information due to arbitrary chunk boundaries.
        *   Improves context preservation.
    *   **Cons:**
        *   Increases the number of chunks.
        *   Can lead to some redundancy.
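
The cost of overlap can be estimated with simple arithmetic. Assuming windows advance by `chunk_size - overlap` words and stop as soon as one reaches the end of the text, the number of chunks is `1 + ceil((n_words - chunk_size) / (chunk_size - overlap))`. The helper name `count_chunks` is hypothetical.

```python
import math

def count_chunks(n_words, chunk_size, overlap):
    """Number of windows needed to cover n_words, stopping at the first window that reaches the end."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if n_words <= 0:
        return 0
    if n_words <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((n_words - chunk_size) / step)

# Same 1000-word text and 100-word chunks, with increasing overlap
counts = {overlap: count_chunks(1000, 100, overlap) for overlap in (0, 20, 50)}
for overlap, n in counts.items():
    print(f"overlap={overlap:>2} -> {n} chunks")
```

With these numbers, a 50% overlap nearly doubles the chunk count compared with no overlap, which means roughly twice as many embeddings to compute and store.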

*   **C.  Recursive Chunking**
    *   **Concept:**  Split the text hierarchically, using different separators at different levels.  Useful for documents with clear structure (e.g., headings, paragraphs).
    *   **Code Example (from the provided file, using `RecursiveCharacterTextSplitter`):**
        ```python
        from langchain.text_splitter import RecursiveCharacterTextSplitter

        text = """This is a paragraph.
        This is another paragraph. This is a new paragraph.

        Here is some additional content."""

        splitter = RecursiveCharacterTextSplitter(
            separators=["\n\n", "\n", " "],  # Try these separators in order
            chunk_size=50,
            chunk_overlap=10
        )

        chunks = splitter.split_text(text)
        ```
    *   **Explanation:**
        *   The `separators` list defines the order in which the splitter tries to split the text.  It first tries to split on double newlines (`\n\n`), then single newlines (`\n`), and finally spaces (` `).
    *   **Pros:**
        *   Adapts to the structure of the document.
        *   Can create chunks of varying sizes, reflecting the natural organization of the text.
    *   **Cons:**
        *   Requires careful selection of separators.
        *   May not be suitable for unstructured text.
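
To show what "try these separators in order" means, here is a deliberately simplified recursive splitter in plain Python. It is a sketch of the idea only, not LangChain's implementation: it omits `chunk_overlap`, and `recursive_split` is a hypothetical function name.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the remaining ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard character cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate            # Piece still fits: keep merging
            continue
        if current:
            chunks.append(current)         # Close the chunk built so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too large on its own: retry with finer separators
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]

text = "Para one is short.\n\nPara two is a bit longer and needs splitting.\n\nEnd."
chunks = recursive_split(text, ["\n\n", "\n", " "], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraphs survive intact, and only the paragraph that exceeds `chunk_size` is broken further, at spaces rather than mid-word.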

*   **D.  Agentic Chunking (using an LLM)**
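    *   **Concept:** Use an LLM to decide where chunk boundaries should fall, for example by prompting it to group sentences by topic.
    *   **Note:** The example from the provided file only uses the T5 tokenizer to cut a sliding window of tokens (`max_length` tokens per chunk, advancing by `stride`). The model is loaded but never called, so the code illustrates strided token windows rather than LLM-driven boundary selection.
    *   **Code Example (from the provided file):**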
          ```python
          from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

          tokenizer = AutoTokenizer.from_pretrained("t5-base")
          model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

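          def chunk_text(text, max_length=512, stride=256):
              # Sliding window over token IDs: each chunk holds up to max_length
              # tokens and starts stride tokens after the previous one
              input_ids = tokenizer(text, return_tensors="pt").input_ids[0]
              chunks = []
              i = 0
              while i < len(input_ids):
                  end_idx = min(i + max_length, len(input_ids))
                  chunk_ids = input_ids[i:end_idx]
                  chunks.append(tokenizer.decode(chunk_ids, skip_special_tokens=True))
                  if end_idx == len(input_ids):
                      break  # The tail is already covered; avoid a redundant chunk
                  i += stride
              return chunks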
          ```
    *   **Pros:**
        *   Can place chunk boundaries where the topic actually shifts, keeping related context together.
        *   Leverages the LLM's understanding of language and context.
    *   **Cons:**
        *   Computationally expensive.
        *   Can be slower than other methods.
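
A minimal sketch of LLM-driven grouping might look like the following. Everything here is an assumption made for illustration: `llm` is any callable that maps a prompt string to a response string, the prompt wording and JSON reply format are invented for this sketch, and `fake_llm` returns a canned answer so the example runs without a model.

```python
import json

def build_prompt(sentences):
    """Number the sentences and ask for a grouping as JSON."""
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))
    return (
        "Group the numbered sentences into topically coherent chunks.\n"
        "Reply with a JSON list of lists of sentence indices only.\n\n" + numbered
    )

def agentic_chunk(sentences, llm):
    """Ask llm (a callable: prompt str -> response str) to choose the grouping."""
    groups = json.loads(llm(build_prompt(sentences)))
    # Validate the reply: every sentence index must appear exactly once
    seen = sorted(i for group in groups for i in group)
    if seen != list(range(len(sentences))):
        raise ValueError("LLM response must use every sentence index exactly once")
    return [" ".join(sentences[i] for i in group) for group in groups]

def fake_llm(prompt):
    """Stand-in for a real model call; returns a canned grouping."""
    return "[[0, 1], [2]]"

sentences = ["Mars is cold.", "Mars has thin air.", "T5 is a seq2seq model."]
chunks = agentic_chunk(sentences, fake_llm)
for chunk in chunks:
    print(chunk)
```

Validating the response matters in practice, since a real model may skip, repeat, or invent indices.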

*   **E.  Content-Aware Chunking**
    *   **Concept:**  Split the text based on its content, using markers like headings, section titles, or other structural cues.
    *   **Code Example (from the provided file):**
        ```python
        sample_text = """
        Introduction
        ... (rest of the text) ...
        """

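        def content_aware_chunk(text):
            # Start a new chunk at every line that looks like a heading
            chunks = []
            current_chunk = []
            for line in text.splitlines():
                # Strip first so indented headings are still recognized
                if line.strip().startswith(('##', '###', 'Introduction', 'Conclusion')):
                    # Flush the previous chunk unless it is only whitespace
                    if any(part.strip() for part in current_chunk):
                        chunks.append('\n'.join(current_chunk))
                    current_chunk = [line]
                else:
                    current_chunk.append(line)
            if any(part.strip() for part in current_chunk):
                chunks.append('\n'.join(current_chunk))
            return chunks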

        content_chunks = content_aware_chunk(sample_text)
        ```
    *   **Pros:**
        *   Creates chunks that align with the logical structure of the document.
        *   Improves the coherence of the retrieved information.
    *   **Cons:**
        *   Requires the text to have a clear and consistent structure.
        *   May need to be customized for different document types.
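
For Markdown sources specifically, a regular expression over heading lines is one way to generalize the idea and keep each heading as metadata for its chunk. This is a sketch under assumptions: it only recognizes ATX-style headings (`#` to `######`), and `split_by_headings` is a hypothetical helper.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_headings(markdown_text):
    """Return (heading, body) pairs; text before the first heading gets heading None."""
    sections = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            # Close the previous section if it has a heading or any content
            if heading is not None or any(ln.strip() for ln in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2), []
        else:
            body.append(line)
    if heading is not None or any(ln.strip() for ln in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections

doc = """# Intro
RAG needs chunks.

## Methods
Split on headings.
Keep the heading as metadata.
"""
sections = split_by_headings(doc)
for heading, body in sections:
    print(f"[{heading}] {body!r}")
```

Storing the heading alongside the body lets the retriever return it with the chunk, which often helps the LLM place the text in context.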

*   **F.  Token-Based Chunking**
    *   **Concept:** Split the text into chunks based on a maximum number of *tokens*, rather than characters.  More precise, as it accounts for the way LLMs process text.
    *   **Code Example (from the provided file):**
        ```python
        from transformers import GPT2Tokenizer

        tokenizer = GPT2Tokenizer.from_pretrained("gpt2") # Or any other tokenizer

        def token_based_chunk(text, max_tokens=200):
            tokens = tokenizer(text)["input_ids"]
            chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
            return [tokenizer.decode(chunk) for chunk in chunks]

        token_chunks = token_based_chunk("Sample text for token-based chunking.")
        ```
    *   **Pros:**
        *   Directly controls the size of the chunks in terms of LLM input tokens.
        *   More accurate than character-based chunking for ensuring chunks fit within context limits.
    *   **Cons:**
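        *   Requires a tokenizer, ideally the one used by the target LLM, since token counts differ between tokenizers.
        *   The character length of the chunks varies, and a boundary can fall in the middle of a word or sentence.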

**IV. Choosing the Right Chunking Strategy**

*   **No One-Size-Fits-All:** The best strategy depends on:
    *   **The nature of your documents:**  Are they well-structured?  Do they contain short, self-contained units of information?  Are they very long?
    *   **Your LLM's context window:**  Larger context windows allow for larger chunks.
    *   **Your retrieval needs:**  Do you need very precise retrieval, or is broader context more important?
    *   **Computational resources:**  More complex strategies are more computationally expensive.

*   **Recommendations:**
    *   **Start Simple:** Begin with fixed-size or sentence-based chunking.  These are easy to implement and often provide good results.
    *   **Experiment:**  Try different chunk sizes and strategies.  Evaluate the quality of your RAG system's responses.
    *   **Consider Structure:** If your documents have a clear structure, use recursive or content-aware chunking.
    *   **Semantic Chunking for Precision:**  If you need highly relevant chunks, use semantic chunking.
    *   **Overlap for Context:** Use overlapping chunking to avoid missing information at chunk boundaries.
    *   **Token-Based for Accuracy:** Use token-based chunking for precise control over chunk size.
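
The interaction between context window and chunk size comes down to a token budget. As a rough sketch, assuming you reserve some tokens for the instructions and some for the answer (the defaults of 200 and 500 below are arbitrary placeholders, and `max_chunks_in_prompt` is a hypothetical helper), the number of retrieved chunks that fit is:

```python
def max_chunks_in_prompt(context_window, chunk_tokens, prompt_tokens=200, answer_tokens=500):
    """How many retrieved chunks fit after reserving room for the prompt and the answer."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    available = context_window - prompt_tokens - answer_tokens
    return max(available // chunk_tokens, 0)

# A 4096-token context window with three candidate chunk sizes
budget = {size: max_chunks_in_prompt(4096, size) for size in (256, 512, 1024)}
for size, n in budget.items():
    print(f"{size}-token chunks -> up to {n} per prompt")
```

Smaller chunks let the retriever pass more distinct passages to the LLM, while larger chunks trade that breadth for more surrounding context per passage.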

**V.  Evaluating Chunking Strategies**

*   **Qualitative Evaluation:**
    *   Manually inspect the chunks.  Do they make sense?  Do they capture the relevant information?
    *   Test your RAG system with a variety of queries and evaluate the quality of the responses.

*   **Quantitative Evaluation:**
    *   **Retrieval Metrics:**  Measure the precision and recall of your retriever.  Are you retrieving the right chunks?
    *   **End-to-End Metrics:**  Evaluate the overall performance of your RAG system using metrics like accuracy, fluency, and relevance.
    *   **Context Relevance:** Evaluate the relevance of retrieved chunks to the query.
    *   **Context Recall:** Measure whether all the relevant information from the source text is retrieved.
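
Retrieval precision and recall can be computed directly once you have, for each test query, a set of chunk IDs judged relevant. The sketch below assumes such labels exist; the chunk IDs and the helper name `precision_recall_at_k` are made up for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall over the top-k retrieved chunk IDs."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

retrieved = ["c3", "c7", "c1", "c9"]   # Ranked output of the retriever
relevant = {"c1", "c3", "c4"}          # Hand-labeled relevant chunks for this query
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

Averaging these values over a set of queries gives a simple way to compare two chunking strategies on the same corpus.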

**VI. Conclusion**

Effective chunking is critical for building high-performing RAG systems. By understanding the different chunking strategies and their trade-offs, you can choose the best approach for your specific needs and create a RAG system that provides accurate, relevant, and contextually rich responses.  Remember to experiment and evaluate your choices to optimize performance.
