# **Implementing a RAG pipeline for querying and extracting relevant information from a list of PDF documents**

Because of security and privacy concerns about exposing sensitive data to chat models, training or fine-tuning a model on private data is not recommended.

Instead, a RAG (retrieval-augmented generation) pipeline is used to extract relevant information from sensitive data at query time, without retraining the model.



In [None]:
# We will use the LangChain framework to build an end-to-end pipeline for the RAG flow.
# LangChain is a popular framework for building LLM use-case pipelines.

!pip install -r requirements.txt

In [21]:
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain import chains, vectorstores
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
import os
import PyPDF2
import pypdf
import re
from langchain_community.embeddings import HuggingFaceEmbeddings

# Indexing the documents and storing them in a **vector store** database

The documents are split into fixed-size chunks, and each chunk is stored as an embedding vector.

This makes it easy to retrieve the chunks that are relevant to a query.
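
The idea of fixed-size chunking can be illustrated without any library. The following is a minimal sketch in plain Python; the function name `chunk_text` and the sample string are hypothetical and are not part of this notebook's pipeline, which relies on LangChain's splitter instead.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into consecutive windows of at most chunk_size characters.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4))                   # windows without overlap
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))  # windows sharing 2 characters
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a chunk boundary.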

In [33]:
# Quick sanity check: open the PDF by path (no dangling file handle) to count its pages
pdf_document_load = PyPDF2.PdfReader("/content/sell_my_dream.pdf")

In [34]:
len(pdf_document_load.pages)

4

In [39]:
# Start building the pipeline by initializing the model

# Load environment variables and API keys
load_dotenv()

# Instantiate chat model
def init_chatmodel():
  chatmodel = ChatOpenAI(model="gpt-3.5-turbo")
  return chatmodel

# instantiate Embeddings
def init_embeddings():
  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
  return embeddings

# Load a PDF document and return its pages for further processing
def load_pdf_document(pdfPath):
  pdf_document = PyPDFLoader(pdfPath)
  documents = pdf_document.load()
  return documents

# Split the documents into chunks (of at most 1000 characters) for indexing
def split_pdf_documents(documents):
  doc_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
  docs = doc_splitter.split_documents(documents)
  return docs
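
A character-based splitter of this kind typically cuts the text at a separator first and then merges the pieces back together until the size limit is reached. The following is a simplified sketch of that approach, not the library's actual implementation; the function name `split_on_separator` and the sample text are hypothetical.

```python
def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Split on a separator, then greedily merge pieces up to chunk_size characters.

    Hypothetical helper for illustration only.
    """
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is oversized on its own and is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "aaaa\n\nbbbb\n\ncccccccccccc"
print(split_on_separator(text, chunk_size=10))
```

In this sketch, a piece that contains no separator is never cut, even when it is longer than `chunk_size`, so the limit is only a target for merging.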

In [40]:
# Document embedding is done with a Hugging Face sentence-transformers model
"""
This method uses the HuggingFaceEmbeddings model for embedding generation
and Chroma for storing and querying the embeddings, which supports
downstream tasks such as document retrieval and clustering.
Chroma adds each document's embedding to the collection under a unique
document identifier.
"""

def create_vectorDB_store(textdocs, embeddingFun):
  vectorDB = vectorstores.Chroma.from_documents(textdocs, embeddingFun)
  return vectorDB


In [41]:
# Retrieval Part
"""
Given a query, the retriever fetches the most relevant document chunks
from the vector store using similarity search.
"""

# The retriever returns the n most similar chunks, where n is chosen by the user.

def get_retrieved_docs(n_documents, stored_VectorDB):
  retrieved_docs = stored_VectorDB.as_retriever(search_type="similarity", search_kwargs={"k":n_documents})
  return retrieved_docs
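
To make the similarity search step concrete, here is a minimal sketch of top-k retrieval in plain Python. It assumes a toy word-count embedding in place of a real embedding model, and the helper names (`embed`, `cosine_similarity`, `top_k`) and sample sentences are made up for illustration; the notebook itself delegates this work to Chroma.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine_similarity(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]


# Made-up sample chunks, unrelated to the PDF used in this notebook.
chunks = [
    "The poet recited verses about the sea.",
    "Stock prices fell sharply on Monday.",
    "A poet and a painter walked by the sea.",
]
print(top_k("poet by the sea", chunks, k=2))
```

A real embedding model captures meaning rather than exact word overlap, but the ranking logic (embed the query, score every chunk, keep the best k) follows the same pattern.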

# Combining all components to create the RAG pipeline

In [47]:
# Integrate all components into RAG pipeline for end-to-end chain flow

loaded_docs = load_pdf_document("/content/sell_my_dream.pdf")
textdocs = split_pdf_documents(loaded_docs)
print("number of chunks: ", len(textdocs))
vectordb = create_vectorDB_store(textdocs, embeddingFun=init_embeddings())
retrievers = get_retrieved_docs(3, vectordb)


number of chunks:  4


In [48]:
# get relevant docs as per query
query = "Why does the author compare Neruda to a Renaissance pope?"
relevant_docs = retrievers.invoke(query)

In [53]:
# print those relevant docs
for i, doc in enumerate(relevant_docs, 1):
  print(f"Document {i}: \n {doc.metadata} ")
  print("\n-------------------------------------------------------\n")

Document 1: 
 {'page': 2, 'source': '/content/sell_my_dream.pdf'} 

-------------------------------------------------------

Document 2: 
 {'page': 2, 'source': '/content/sell_my_dream.pdf'} 

-------------------------------------------------------

Document 3: 
 {'page': 3, 'source': '/content/sell_my_dream.pdf'} 

-------------------------------------------------------



In [None]:
# Combine the retriever and the LLM into a single chain.
chatmodel = init_chatmodel()

rag_chain = chains.RetrievalQA.from_chain_type(
    llm=chatmodel,
    chain_type="stuff",
    retriever=retrievers,
    return_source_documents=True
)
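
The "stuff" chain type places all retrieved chunks into a single prompt together with the question. The following is a minimal sketch of that idea; the function name `build_stuff_prompt` and the prompt wording are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block, then append the question.

    Hypothetical helper; the wording of the instructions is an assumption.
    """
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Why does the author compare Neruda to a Renaissance pope?",
    ["<text of retrieved chunk 1>", "<text of retrieved chunk 2>"],
)
print(prompt)
```

Because every chunk is sent in one request, this approach only works while the combined chunks fit within the model's context window.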

In [None]:
# Calling RAG pipeline with query:
query_result1 = rag_chain.invoke({"query": "Why does the author compare Neruda to a Renaissance pope?"})

print("answer: ", query_result1["result"])

In [None]:
# We can also see which documents contributed to the answer:
for document in query_result1["source_documents"]:
  print(re.sub(r"\s+", " ", document.page_content.strip()))
  print("\n-------------------------------------------------------\n")

In [None]:
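# Setting up memory context from the previous query.
# RetrievalQA only reads the "query" key and ignores "chat_history", so a
# ConversationalRetrievalChain is used for the follow-up question.
conversational_chain = chains.ConversationalRetrievalChain.from_llm(
    llm=chatmodel,
    retriever=retrievers,
    return_source_documents=True
)
chat_history1 = [(query_result1["query"], query_result1["result"])]
query_result2 = conversational_chain.invoke({
    "question": "Are there any other reasons of comparing Neruda to a Renaissance pope?",
    "chat_history": chat_history1
})
print("answer: ", query_result2["answer"])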
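
A follow-up question only makes sense together with the earlier exchange, so the history has to reach the model in some form. One common approach is to ask the model to rewrite the follow-up as a standalone question before retrieval. The sketch below shows how such a prompt could be assembled from a list of (question, answer) tuples; the helper names and prompt wording are hypothetical, and the answer text is a placeholder.

```python
def format_chat_history(chat_history: list[tuple[str, str]]) -> str:
    """Render (question, answer) pairs as alternating speaker lines."""
    lines = []
    for question, answer in chat_history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


def build_condense_prompt(chat_history: list[tuple[str, str]], follow_up: str) -> str:
    """Ask for a standalone rewrite of the follow-up question (wording is an assumption)."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{format_chat_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


history = [("Why does the author compare Neruda to a Renaissance pope?", "<previous answer>")]
print(build_condense_prompt(history, "Are there any other reasons?"))
```

The standalone question produced by the model can then be used both for retrieval and for the final answer, so the retriever does not have to interpret pronouns or references to earlier turns.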

In [None]:
for document in query_result2["source_documents"]:
  print(re.sub(r"\s+", " ", document.page_content.strip()))
  print("\n-------------------------------------------------------\n")