In this notebook, we implement question-answering RAG (retrieval-augmented generation) on a given PDF.


## Data Preprocessing

### Load Data

In [19]:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
DOC_PATH = "../data/raw/QUIC_Protocol.pdf"
CHROMA_PATH = "../embedings/pdf-qa-system" 

# load the PDF; each page becomes one Document
loader = PyPDFLoader(DOC_PATH)
pages = loader.load()

### Index Data

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split the pages into chunks of at most 500 characters, with a 50-character overlap between neighbouring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(pages)
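
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut on separators such as paragraph and line breaks); the helper `sliding_window_chunks` and the `sample` string are hypothetical and only illustrate how consecutive chunks share characters.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# hypothetical sample text: 1200 digits, standing in for a page of the PDF
sample = "".join(str(i % 10) for i in range(1200))
demo_chunks = sliding_window_chunks(sample, chunk_size=500, chunk_overlap=50)

print([len(c) for c in demo_chunks])
print(demo_chunks[0][-50:] == demo_chunks[1][:50])  # neighbouring chunks share 50 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.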

In [24]:
# embed chunks as vectors

from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# get OpenAI Embedding model
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# embed the chunks as vectors and load them into the database
db_chroma = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)

In [25]:
db_chroma

<langchain_community.vectorstores.chroma.Chroma at 0x799082a9b5e0>

In [28]:
# this is an example of a user question (query)
query = 'What are the main applications of QUIC?'

# retrieve context: the top 5 chunks closest to the query vector
# (Chroma's default distance is squared L2, not cosine, unless the collection
# is configured otherwise; lower scores mean closer matches)
docs_chroma = db_chroma.similarity_search_with_score(query, k=5)

# join the retrieved chunks into a single context string for the prompt
context_text = "\n\n".join([doc.page_content for doc, _score in docs_chroma])
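
The retrieval step can be pictured as a nearest-neighbour lookup over vectors. The sketch below uses hand-made 2-D vectors instead of real embeddings, and the names `toy_vectors`, `toy_query`, and `top_k` are hypothetical. It ranks chunks by distance to the query under both squared L2 and cosine distance; for unit-length vectors, such as the toy ones here, both metrics give the same ranking.

```python
import math


def l2_squared(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k, distance):
    """Return the k (name, score) pairs with the smallest distance to the query."""
    scored = [(name, distance(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# hypothetical unit-length "embeddings" for three chunks
toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.6, 0.8],
    "chunk_c": [0.0, 1.0],
}
toy_query = [0.8, 0.6]

nearest_l2 = top_k(toy_query, toy_vectors, k=2, distance=l2_squared)
nearest_cos = top_k(toy_query, toy_vectors, k=2, distance=cosine_distance)

print([name for name, _ in nearest_l2])
print([name for name, _ in nearest_cos])
```

As with `similarity_search_with_score`, the score here is a distance, so a smaller value means a better match.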

### Retrieve and Generate Answer with LLM

In [29]:
context_text

'crosoft. Popular applications like YouTube, Google Search, and Chrome already use QUIC\nto deliver faster and more reliable user experiences.\nWeb Application\nIn web applications, the QUIC protocol can significantly improve browser loading speeds\nand reduce page rendering times. This is important to improve user experience. Currently,\nmany mainstream browsers support the QUIC protocol, such as Google Chrome and Mozilla\nFirefox.\nReal-time Audio and Video Communication\n\nConclusion\nQUIC is a game-changing protocol that overcomes the limitations of TCP, offering faster\nconnections, built-in encryption, and efficient multiplexing. As the backbone of HTTP/3, it\nimproves web performance, especially for latency-sensitive applications like streaming and\ngaming. Its ability to handle changes in the mobile network and improve reliability makes it\nvital for modern internet use. Looking ahead, QUIC is poised to drive greater adoption in\n\nOverview\nThe QUIC protocol, pronounced ”quick

In [30]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
import openai


# you can use a prompt template
PROMPT_TEMPLATE = """
Prepare a Multiple Choice Question (MCQ) based on the following context. Provide four options and indicate the correct answer.

Context: {context}

Format the output as follows:
1. **Question**: [Insert question here]  
   a) [Option 1]  
   b) [Option 2]  
   c) [Option 3]  
   d) [Option 4]  
   **Correct Answer**: [Correct option]
"""

# fill the prompt template with the context (this template has no query placeholder)
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

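# NOTE: this overrides the retrieved context with a fixed string;
# remove the next line to build the prompt from the retrieved chunks instead
context_text = "QUIC features"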
prompt = prompt_template.format(context=context_text)
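
The MCQ template takes only a context. For plain question answering, the prompt would need a slot for the user query as well. Below is a minimal sketch with ordinary `str.format`; the names `QA_TEMPLATE` and `build_qa_prompt` and the template wording are assumptions for illustration, not part of the notebook's pipeline.

```python
# hypothetical question-answering template with two placeholders
QA_TEMPLATE = """Answer the question using only the context below.

Context: {context}

Question: {question}
"""


def build_qa_prompt(context: str, question: str) -> str:
    """Fill both placeholders and return the finished prompt string."""
    return QA_TEMPLATE.format(context=context, question=question)


qa_prompt = build_qa_prompt(
    context="As the backbone of HTTP/3, it improves web performance.",
    question="What are the main applications of QUIC?",
)
print(qa_prompt)
```

The same two-placeholder string could be passed to `ChatPromptTemplate.from_template` in place of `PROMPT_TEMPLATE`, with `context_text` and `query` as the values.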


In [33]:
## Generate answer with LLM

# call the chat model to generate the MCQ from the formatted prompt
model = ChatOpenAI()
response_text = model.predict(prompt)

# Legacy alternative, kept for reference only: it needs openai<1.0, and
# text-davinci-003 has since been retired by OpenAI.
# response = openai.Completion.create(
#     engine="text-davinci-003",  # Or any other model
#     prompt=prompt,
#     max_tokens=150,
#     temperature=0.7
# )


In [34]:
response_text

'1. **Question**: What is one of the key features of QUIC protocol?\n   a) Low latency\n   b) High bandwidth\n   c) Improved security\n   d) Increased packet loss\n   **Correct Answer**: a) Low latency'
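
Because the prompt fixes the output format, the response can be turned into a structured record. The following is a minimal sketch using regular expressions; `parse_mcq` is a hypothetical helper, and it assumes the model follows the requested format exactly, which is not guaranteed.

```python
import re


def parse_mcq(text: str) -> dict:
    """Extract the question, the lettered options, and the answer from an MCQ string."""
    question = re.search(r"\*\*Question\*\*:\s*(.+)", text)
    answer = re.search(r"\*\*Correct Answer\*\*:\s*(.+)", text)
    # option lines look like "   a) Low latency", possibly with trailing spaces
    options = dict(re.findall(r"^\s*([a-d])\)\s*(.+?)\s*$", text, flags=re.MULTILINE))
    return {
        "question": question.group(1).strip() if question else None,
        "options": options,
        "correct_answer": answer.group(1).strip() if answer else None,
    }


# the response shown above, copied verbatim
sample_response = (
    "1. **Question**: What is one of the key features of QUIC protocol?\n"
    "   a) Low latency\n"
    "   b) High bandwidth\n"
    "   c) Improved security\n"
    "   d) Increased packet loss\n"
    "   **Correct Answer**: a) Low latency"
)

mcq = parse_mcq(sample_response)
print(mcq["question"])
print(mcq["options"])
print(mcq["correct_answer"])
```

Missing fields come back as `None` or an empty dict rather than raising, so a malformed response can be detected and retried.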