# Semantic Chunking for RAG

LLMs are prone to hallucination, and there are several strategies to mitigate it. One such strategy is Retrieval Augmented Generation (RAG), where the LLM is given a knowledge base to retrieve information from. Because its answers are grounded in that knowledge base, the LLM is less likely to hallucinate.

RAG is a step-by-step process: load the documents, split them into chunks using a framework such as LangChain or LlamaIndex, generate vector embeddings for the chunks, and store those embeddings in a vector database. At query time, the chunks most similar to the question are retrieved and passed to the LLM as context.

So, broadly, we can divide RAG into two main parts: storing and retrieval.

When enhancing a RAG pipeline, one thing we need to look at is the retrieval strategy and the techniques involved.
We can improve retrieval by choosing a suitable chunking strategy, but finding the right chunk size for a given text is a hard problem in general.
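
To make the chunk-size problem concrete, here is a minimal sketch of fixed-size character chunking. The function name `fixed_size_chunks` and the sample text are assumptions made for illustration and are not part of LangChain. Whatever size we pick, the cuts tend to land in the middle of a sentence.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample text, used only for this demonstration.
sample = "BERT is pre-trained on unlabeled text. It is then fine-tuned on labeled data."

for size in (20, 40):
    print(size, fixed_size_chunks(sample, size))
```

Neither size respects sentence boundaries here, which is the motivation for splitting on meaning rather than on character counts.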

Today, we will see how semantic chunking works.

Semantic Chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach ensures the information’s integrity during retrieval, leading to a more accurate and contextually appropriate outcome.
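
The idea can be sketched in a few lines of plain Python. This is a toy version under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and the fixed `threshold` of 0.8 is a hypothetical value chosen for the example. It is not the LangChain implementation, only an illustration of "start a new chunk where neighbouring sentences stop being similar".

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(text: str, threshold: float = 0.8) -> list[str]:
    """Start a new chunk wherever adjacent sentences are farther apart than `threshold`."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(nxt)) > threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks


# Hypothetical text with one obvious topic shift.
text = (
    "BERT uses masked language modeling. Masked language modeling hides some tokens. "
    "The weather in Paris was sunny. Paris weather stayed sunny all week."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

With this sample, the two BERT sentences stay together and the two weather sentences form a second chunk, because the only large distance is at the topic shift.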

Let's experiment with semantic chunking and see the results.

## Tech Stack Used
#### LangChain - Open source AI framework to load and split the data and to create embeddings
#### SingleStore - As a robust vector database to store vector embeddings
#### Groq and HuggingFace - To choose our LLMs and embedding models

### Download data

In [8]:
! wget "https://arxiv.org/pdf/1810.04805.pdf"

--2024-07-15 05:48:13--  https://arxiv.org/pdf/1810.04805.pdf
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.131.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/1810.04805 [following]
--2024-07-15 05:48:13--  http://arxiv.org/pdf/1810.04805
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 775166 (757K) [application/pdf]
Saving to: ‘1810.04805.pdf.1’


2024-07-15 05:48:13 (41.3 MB/s) - ‘1810.04805.pdf.1’ saved [775166/775166]



## Process the PDF Content

In [9]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [10]:
loader = PyPDFLoader("1810.04805.pdf")
documents = loader.load()

In [11]:
print(len(documents))

16


## Perform Naive Chunking (RecursiveCharacterTextSplitter)

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False
)
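
For intuition about what this splitter does, here is a heavily simplified sketch of recursive splitting. It tries coarse separators first (paragraphs, then lines, then words) and only falls back to a hard character cut when nothing else fits. The function `recursive_split` is a hypothetical illustration, not LangChain's code: the real splitter also handles overlap, regex separators, and other details.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Simplified recursive splitting: coarse separators first, finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
        elif len(piece) <= chunk_size:
            if current:
                chunks.append(current)
            current = piece
        else:
            if current:
                chunks.append(current)
            # The piece itself is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


# Hypothetical two-paragraph sample.
demo = "Para one is short.\n\nPara two is a bit longer than the limit allows."
for chunk in recursive_split(demo, chunk_size=30):
    print(repr(chunk))
```

The split points depend only on separators and lengths, never on meaning, which is why we call this strategy naive here.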

In [13]:
naive_chunks = text_splitter.split_documents(documents)
for chunk in naive_chunks[10:15]:
  print(chunk.page_content + "\n")

BERT BERT 
E[CLS] E1 E[SEP] ... ENE1’... EM’
C
T1
T[SEP] ...
 TN
T1’...
 TM’
[CLS] Tok 1 [SEP] ... Tok NTok 1 ... TokM 
Question Paragraph Start/End Span 
BERT 
E[CLS] E1 E[SEP] ... ENE1’... EM’
C
T1
T[SEP] ...
 TN
T1’...
 TM’
[CLS] Tok 1 [SEP] ... Tok NTok 1 ... TokM 
Masked Sentence A Masked Sentence B 
Pre-training Fine-Tuning NSP Mask LM Mask LM 
Unlabeled Sentence A and B Pair SQuAD 
Question Answer Pair NER MNLI Figure 1: Overall pre-training and ﬁne-tuning procedures for BERT. Apart from output layers, the same architec-
tures are used in both pre-training and ﬁne-tuning. The same pre-trained model parameters are used to initialize
models for different down-stream tasks. During ﬁne-tuning, all parameters are ﬁne-tuned. [CLS] is a special
symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating ques-
tions/answers).
ing and auto-encoder objectives have been used
for pre-training such models (Howard and Ruder,

2018; Radford et al., 201

## Instantiate Embedding Model

In [33]:
pip install sentence-transformers --quiet

Note: you may need to restart the kernel to use updated packages.


In [38]:
from langchain.embeddings import HuggingFaceEmbeddings

embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")


### Setup the API Key for LLM

In [43]:
from groq import Groq
from langchain_groq import ChatGroq

In [44]:
import os

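# os.getenv expects the NAME of an environment variable, not the key itself.
# Set GROQ_API_KEY in your environment before running this cell.
groq_api_key = os.getenv("GROQ_API_KEY")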
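
Since `os.getenv` silently returns `None` when a variable is missing, a small guard can surface the problem early. The helper `require_env` and the variable `DEMO_API_KEY` below are hypothetical names used only for this sketch.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable, set here only so the example runs on its own.
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```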

## Perform Semantic Chunking

In [45]:
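from langchain_experimental.text_splitter import SemanticChunker

# Reuse the HuggingFace embedding model created earlier; the splitter breaks the text
# where the distance between neighbouring sentences exceeds a percentile of all distances.
semantic_chunker = SemanticChunker(embed_model, breakpoint_threshold_type="percentile")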
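
To see what a percentile threshold means in practice, here is a small worked sketch in plain Python. The distances are made-up values, and the default of 95 used below is an assumption about the library's default for this mode, so check the version you have installed. The `percentile` function mirrors the linear-interpolation method that `numpy.percentile` uses by default.

```python
def percentile(values: list[float], q: float) -> float:
    """Linear-interpolation percentile of a non-empty list, with q in [0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoints(distances: list[float], q: float = 95.0) -> list[int]:
    """Indices i where the distance between sentence i and i+1 exceeds the q-th percentile."""
    threshold = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical cosine distances between consecutive sentences.
distances = [0.10, 0.12, 0.08, 0.75, 0.11, 0.09, 0.15, 0.13, 0.10, 0.14, 0.12]
print(percentile(distances, 95))
print(breakpoints(distances))
```

Only the one unusually large gap ends up above the threshold, so the text would be cut once, after the fourth sentence. A lower percentile produces more, smaller chunks.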

In [46]:
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])

In [51]:
for semantic_chunk in semantic_chunks:
  if "Effect of Pre-training Tasks" in semantic_chunk.page_content:
    print(semantic_chunk.page_content)
    print(len(semantic_chunk.page_content))

Dev Set
Tasks MNLI-m QNLI MRPC SST-2 SQuAD
(Acc) (Acc) (Acc) (Acc) (F1)
BERT BASE 84.4 88.4 86.7 92.7 88.5
No NSP 83.9 84.9 86.5 92.6 87.9
LTR & No NSP 82.1 84.3 77.5 92.1 77.8
+ BiLSTM 82.1 84.1 75.7 91.6 84.9
Table 5: Ablation over the pre-training tasks using the
BERT BASE architecture. “No NSP” is trained without
the next sentence prediction task. “LTR & No NSP” is
trained as a left-to-right LM without the next sentence
prediction, like OpenAI GPT. “+ BiLSTM” adds a ran-
domly initialized BiLSTM on top of the “LTR + No
NSP” model during ﬁne-tuning. ablation studies can be found in Appendix C. 5.1 Effect of Pre-training Tasks
We demonstrate the importance of the deep bidi-
rectionality of BERT by evaluating two pre-
training objectives using exactly the same pre-
training data, ﬁne-tuning scheme, and hyperpa-
rameters as BERT BASE :
No NSP : A bidirectional model which is trained
using the “masked LM” (MLM) but without the
“next sentence prediction” (NSP) task. LTR & No NSP : A left

### Store the chunks in our database

In [56]:
!pip install singlestoredb --quiet


In [57]:
from langchain_community.vectorstores import SingleStoreDB

In [77]:
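# Assumes the SINGLESTOREDB_URL environment variable holds your connection string.
# A dedicated table keeps these chunks separate from any other chunks in the same database.
semantic_chunk_vectorstore = SingleStoreDB.from_documents(
    semantic_chunks,
    embedding=embed_model,
    table_name="semantic_chunks",
)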

### Instantiate Retrieval Step

In [78]:
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})
semantic_chunk_retriever.invoke("Describe the Feature-based Approach with BERT?")

[Document(page_content='The right part of the paper represents the\nDev set results. For the feature-based approach,\nwe concatenate the last 4 layers of BERT as the\nfeatures, which was shown to be the best approach\nin Section 5.3. From the table it can be seen that ﬁne-tuning is\nsurprisingly robust to different masking strategies. However, as expected, using only the M ASK strat-\negy was problematic when applying the feature-\nbased approach to NER. Interestingly, using only\nthe R NDstrategy performs much worse than our\nstrategy as well.')]
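
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors are closest to it. Here is a minimal sketch of that ranking step with hand-written three-dimensional vectors. The vectors and labels are hypothetical, and cosine similarity is used purely for illustration; the vector store may be configured with a different metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 1) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(stored, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical 3-dimensional embeddings; a real embedding model produces far more dimensions.
stored = [
    ("feature-based approach", [0.9, 0.1, 0.0]),
    ("masking strategies", [0.1, 0.9, 0.1]),
    ("SQuAD results", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], stored, k=1))
```

With `k=1`, as in the retriever above, the LLM sees only the single best-matching chunk, so the quality of each chunk matters a lot.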

### Instantiate Augmentation Step (for Content Augmentation)

In [84]:
from langchain_core.prompts import ChatPromptTemplate

rag_template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)
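
The augmentation step is essentially placeholder substitution. The sketch below reproduces it with plain `str.format` so the final prompt text is visible. The context sentence is a hypothetical stand-in for a retrieved chunk, and the real prompt object additionally wraps the text in a chat message.

```python
template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

filled = template.format(
    question="What is SQuADv2.0?",
    # Hypothetical retrieved text, used only for this demonstration.
    context="SQuAD 2.0 extends SQuAD 1.1 by allowing questions with no answer.",
)
print(filled)
```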

### Instantiate the Generation Step

In [88]:
# Define the userdata dictionary with your API key
userdata = {
    "GROQ_API_KEY": "Add your Groq API Key"
}

# Initialize the chat model
chat_model = ChatGroq(
    temperature=0,
    model_name="mixtral-8x7b-32768",
    api_key=userdata.get("GROQ_API_KEY")
)

## RAG Pipeline Utilizing Semantic Chunking

In [89]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

semantic_rag_chain = (
    {"context" : semantic_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | chat_model
    | StrOutputParser()
)
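
The data flow of this chain can be mimicked with ordinary functions: the question is fanned out to the retriever and also passed through unchanged, then the prompt is built, and finally the model generates the answer. Everything below is a hypothetical stand-in (`make_chain` and the `fake_*` helpers are not LangChain APIs); it only mirrors the order of the steps.

```python
from typing import Callable


def make_chain(
    retrieve: Callable[[str], list[str]],
    build_prompt: Callable[[str, list[str]], str],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose retrieval, prompt building, and generation into one callable."""
    def chain(question: str) -> str:
        # Fan-out: the same question feeds the retriever and is also kept as-is.
        inputs = {"context": retrieve(question), "question": question}
        prompt = build_prompt(inputs["question"], inputs["context"])
        return generate(prompt)
    return chain


# Hypothetical stand-ins for the retriever, the prompt template, and the chat model.
def fake_retrieve(question: str) -> list[str]:
    return ["BERT is a bidirectional Transformer encoder."]


def fake_prompt(question: str, context: list[str]) -> str:
    return f"Question: {question}\nContext: {' '.join(context)}"


def fake_generate(prompt: str) -> str:
    return f"ANSWER BASED ON -> {prompt}"


demo_chain = make_chain(fake_retrieve, fake_prompt, fake_generate)
print(demo_chain("What is BERT?"))
```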

### Ask Question 1

In [90]:
semantic_rag_chain.invoke("Describe the Feature-based Approach with BERT?")

'The Feature-based Approach with BERT involves concatenating the last 4 layers of BERT as features for a given task. In the context provided, this approach was applied to a Named Entity Recognition (NER) task, and it was shown to be the best approach in Section 5.3 of the paper. The right part of the paper shows the Dev set results for this approach.\n\nFrom the table, it can be observed that fine-tuning is surprisingly robust to different masking strategies. However, using only the MASK strategy was problematic when applying the feature-based approach to NER. Interestingly, using only the RND strategy performed much worse than the strategy used in the paper.\n\nTherefore, the Feature-based Approach with BERT involves using the last 4 layers of BERT as features and fine-tuning these features for a given task. This approach was shown to be robust to different masking strategies, except for the MASK strategy, and it outperformed the RND strategy in a NER task.'

### Ask Question 2

In [93]:
semantic_rag_chain.invoke("What is SQuADv2.0?")

'SQuAD v2.0, or Squad 2.0, is a version of the Stanford Question Answering Dataset (SQuAD) that extends the problem definition of SQuAD 1.1 by allowing for the possibility that no short answer exists in the provided paragraph. This makes the problem more realistic. To extend the SQuAD v1.1 BERT model for this task, questions that do not have an answer are treated as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, the score of the no-answer span is compared to the score of the best non-null span. The TriviaQA data used for this task consists of paragraphs from TriviaQA-Wiki formed of the first 400 tokens in documents, that contain at least one of the provided possible answers.'

### Ask Question 3

In [94]:
semantic_rag_chain.invoke("What is the purpose of Ablation Studies?")

"Ablation studies are used to understand the impact of different components or settings of a machine learning model on its performance. In the provided context, ablation studies are used to answer specific questions about the BERT model's pre-training process.\n\nFor example, one ablation study investigates the effect of the number of training steps on the model's accuracy. By comparing the performance of BERT after pre-training for different numbers of steps, the study finds that BERT BASE achieves higher fine-tuning accuracy when pre-trained for 1 million steps compared to 500k steps.\n\nAnother ablation study evaluates the impact of different masking strategies on the model's performance. By comparing the accuracy of BERT when using different masking rates and masking procedures, the study finds that masking 80% of the tokens and randomly selecting 20% of the remaining tokens to be the target tokens results in the highest MNLI and NER accuracy.\n\nOverall, ablation studies help to i

## Implement a RAG pipeline using Naive Chunking Strategy

In [95]:
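# Use a separate table so the naive chunks are not mixed with other chunks in the same database.
naive_chunk_vectorstore = SingleStoreDB.from_documents(
    naive_chunks,
    embedding=embed_model,
    table_name="naive_chunks",
)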
naive_chunk_retriever = naive_chunk_vectorstore.as_retriever(search_kwargs={"k" : 5})
naive_rag_chain = (
    {"context" : naive_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | chat_model
    | StrOutputParser()
)
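
To compare the two pipelines side by side, a small helper can run the same questions through both chains. The `compare` function and the lambda stand-ins below are hypothetical; in the notebook you would pass `semantic_rag_chain.invoke` and `naive_rag_chain.invoke` instead. Keep in mind that the two retrievers above use different `k` values (1 and 5), so the amount of context each chain sees is not the same.

```python
from typing import Callable


def compare(
    chains: dict[str, Callable[[str], str]],
    questions: list[str],
) -> dict[str, dict[str, str]]:
    """Run every question through every chain and collect the answers by question."""
    return {q: {name: run(q) for name, run in chains.items()} for q in questions}


# Hypothetical stand-ins so the example runs without an LLM or a database.
demo_chains = {
    "semantic": lambda q: f"semantic answer to: {q}",
    "naive": lambda q: f"naive answer to: {q}",
}

results = compare(demo_chains, ["What is SQuADv2.0?"])
for question, answers in results.items():
    for name, answer in answers.items():
        print(f"[{name}] {question} -> {answer}")
```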

### Ask Question 1

In [96]:
naive_rag_chain.invoke("Describe the Feature-based Approach with BERT?")

'The Feature-based Approach with BERT involves extracting the activations from one or more layers of the pre-trained BERT model without fine-tuning any of its parameters. These contextual embeddings are then used as input to a separately initialized two-layer 768-dimensional BiLSTM before the classification layer. The results presented in the document show that this method performs competitively with state-of-the-art methods, particularly when concatenating token representations from the top four hidden layers of the pre-trained Transformer. This demonstrates the effectiveness of BERT for both fine-tuning and feature-based approaches. In the context provided, the feature-based approach uses the last four layers of BERT, which was found to be the best approach in Section 5.3. Fine-tuning, on the other hand, is shown to be robust to different masking strategies during MLM pre-training.'

### Ask Question 2

In [97]:
naive_rag_chain.invoke("What is SQuADv2.0?")

'SQuAD 2.0 is a version of the Stanford Question Answering Dataset (SQuAD) that extends the problem definition of SQuAD 1.1 by allowing for the possibility that no short answer exists in the provided paragraph. This makes the problem more realistic. To handle this, the SQuAD v1.1 BERT model is extended to treat questions that do not have an answer as having an answer span with start and end at the [CLS] token, and the probability space for the start and end answer span positions is extended to include the position of the [CLS] token. The score of the no-answer span is then compared to the score of the best non-null span for prediction. The TriviaQA data used consists of paragraphs from TriviaQA-Wiki formed of the first 400 tokens in documents, that contain at least one of the provided possible answers.'

### Ask Question 3

In [98]:
naive_rag_chain.invoke("What is the purpose of Ablation Studies?")

'Ablation studies are used to evaluate the effect of different components or settings in a machine learning model. In the provided context, there are two ablation studies mentioned:\n\n1. Effect of Number of Training Steps: This study investigates the impact of the number of training steps on the performance of BERT. By comparing the MNLI Dev accuracy of BERT BASE after fine-tuning from a checkpoint pre-trained for different numbers of steps, the study aims to answer questions about the necessity and convergence rate of MLM pre-training.\n\n2. Ablation for Different Masking Procedures: This study evaluates the effect of different masking strategies used during pre-training with the masked language model (MLM) objective. By comparing the MNLI Dev accuracy and NER Dev set results of BERT BASE using different masking rates and approaches, the study aims to reduce the mismatch between pre-training and fine-tuning.\n\nIn summary, ablation studies help to understand the importance and impact