<a href="https://colab.research.google.com/github/okkidoggi/Movie-Booking-API/blob/master/rag_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a PDF Question and Answer System Using Retrieval-Augmented Generation

---

### Introduction

In this tutorial, we will build a question-and-answer (Q&A) system that uses Retrieval-Augmented Generation (RAG) to answer questions about the contents of PDF files. We will use LangChain to load, split, and index the PDFs, and OpenAI models to embed the text and generate answers.

The guide breaks the process into short steps, so you can follow along whether you are new to coding or already have some experience.

We will use OpenAI's `gpt-4o-mini` chat model as our large language model (LLM), which gives the Q&A system a conversational interface.

### Steps to Create Your Q&A System

**Step 1: Install Required Libraries**  
To get started, install the libraries the project needs. The commands below use the notebook `!` prefix; in a terminal, run them without it:


In [7]:
!pip -q install langchain_community langchain_chroma langchain_openai

In [1]:
!pip -q install langchain openai chromadb pypdf python-dotenv

**Step 2: Initialize Embeddings and the Language Model**  
Next, we set up the embedding model and the chat model. The OpenAI API key is read from the `OPENAI_API_KEY` environment variable, which `load_dotenv()` can populate from a local `.env` file:

In [8]:
# Load required libraries
import os

from dotenv import load_dotenv
from langchain.chains.question_answering import load_qa_chain
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [9]:
# Load OPENAI_API_KEY from a local .env file into the environment.
# OpenAIEmbeddings and ChatOpenAI read the key from that variable.
load_dotenv()

# Load the embedding and LLM model
embeddings_model = OpenAIEmbeddings()
llm = ChatOpenAI(model="gpt-4o-mini")

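A missing API key otherwise only surfaces when the first request is sent, so it can help to fail early. The helper below is a minimal sketch of that check; `require_env` is a hypothetical name, not part of LangChain or python-dotenv, and the demo passes a plain dictionary in place of `os.environ`.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise if it is unset or blank."""
    environ = os.environ if environ is None else environ
    value = (environ.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it")
    return value


# Demo with a plain dict standing in for os.environ (the key is a placeholder)
demo_key = require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-placeholder"})
print(demo_key)
```

In the notebook, calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would stop execution before any model is created.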


**Step 3: Perform Q&A Over a PDF File**  
Next, we will set up our Q&A system to work with PDF files. You will need to provide either the link to where each PDF is hosted or the local path where it is stored.

For our example, we will use two PDFs: a research paper (`attention_paper_3295222.3295349.pdf`) and a second document (`prompt_engineering.pdf`).
Download the PDFs, place them in your current working directory, and set their paths in the following variables:

In [10]:
pdf_link = "attention_paper_3295222.3295349.pdf"
loader = PyPDFLoader(pdf_link, extract_images=False)
pages = loader.load_and_split()

In [17]:
pdf_link2 = "prompt_engineering.pdf"
loader2 = PyPDFLoader(pdf_link2, extract_images=False)
pages2 = loader2.load_and_split()

Once the text has been extracted from the PDFs, we break it into smaller chunks using the `RecursiveCharacterTextSplitter` from LangChain. Chunking keeps each piece within the LLM's token limit and lets the retriever return only the passages relevant to a question. Because `length_function=len`, both `chunk_size` and `chunk_overlap` are measured in characters.

In [20]:
# Split data into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=20,
    length_function=len,
    add_start_index=True,
)
chunks = text_splitter.split_documents(pages)
chunks2 = text_splitter.split_documents(pages2)
# Combine chunks from both PDFs
all_chunks = chunks + chunks2
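To make `chunk_size` and `chunk_overlap` concrete, the sketch below slides a fixed-size character window over a string. It is a simplified stand-in: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(demo_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, which is what the overlap is for: text that straddles a boundary still appears intact in at least one chunk.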

**Step 4: Create Embeddings and Store Them in the Vector Database**  
Now, we’re ready to create embeddings for the chunks we just split and store them in a vector database. We’ll be using Chroma as our vector database. Here’s how to do that:

In [21]:
# Store data into the database
db = Chroma.from_documents(all_chunks, embedding=embeddings_model, persist_directory="test_index")


In this step, `Chroma.from_documents` takes the chunks to embed, the model used to create the embeddings, and the directory where the database is stored for future use.
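Because the database is written to disk, the embedding step only needs to run when the directory is still empty. The sketch below shows one way to express that build-once, load-afterwards pattern. The helper `load_or_build` and the stand-in callables are hypothetical; the demo uses a temporary directory so that it runs without an API key.

```python
import os
import tempfile


def load_or_build(persist_dir, build, load):
    """Load an existing index from persist_dir, or build one if the directory is empty."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)


# Stand-ins for the real Chroma calls, recording which path was taken
calls = []


def fake_build(path):
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "index.bin"), "w") as fh:
        fh.write("vectors")
    calls.append("build")
    return "index"


def fake_load(path):
    calls.append("load")
    return "index"


with tempfile.TemporaryDirectory() as tmp:
    index_dir = os.path.join(tmp, "test_index")
    load_or_build(index_dir, fake_build, fake_load)  # first run: builds
    load_or_build(index_dir, fake_build, fake_load)  # second run: loads

print(calls)  # ['build', 'load']
```

In the notebook, `build` would wrap `Chroma.from_documents(all_chunks, embedding=embeddings_model, persist_directory=path)` and `load` would wrap `Chroma(persist_directory=path, embedding_function=embeddings_model)`.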

**Step 5: Load the Existing Database**  
Once the information is safely stored in the database, you don’t have to repeat the previous steps every time. Instead, we can load the pre-existing database using the following code snippet:

In [13]:
# Load the database
vectordb = Chroma(persist_directory="test_index", embedding_function=embeddings_model)

# Load the retriever
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
chain = load_qa_chain(llm, chain_type="stuff")

Note: `load_qa_chain` is deprecated in current LangChain releases and emits a deprecation warning when called. The warning links to a migration guide for each chain type:  
stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain  
map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain  
refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain  
map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/docs/how_to/#qa-with-rag


The retriever fetches the chunks from the database that are most likely to contain the answer to the user’s question. In this example, the `search_kwargs` parameter, with `k` set to 3, retrieves the three most relevant chunks. The chain created by `load_qa_chain` with `chain_type="stuff"` then places all retrieved chunks into a single prompt for the LLM.
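The idea behind top-`k` retrieval can be sketched with toy vectors: score every stored vector against the query vector and keep the `k` best. The example assumes cosine similarity and hand-written three-dimensional vectors purely for illustration; the real embeddings come from `OpenAIEmbeddings`, and the distance metric and index that Chroma uses internally are not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (chunk text, made-up embedding)
toy_index = [
    ("attention mechanism", [1.0, 0.0, 0.0]),
    ("prompt templates", [0.0, 1.0, 0.0]),
    ("self-attention layers", [0.9, 0.1, 0.0]),
    ("few-shot prompting", [0.1, 0.9, 0.0]),
    ("baking bread", [0.0, 0.0, 1.0]),
]
toy_query = [1.0, 0.2, 0.0]

top_hits = top_k(toy_query, toy_index, k=3)
print(top_hits)  # ['self-attention layers', 'attention mechanism', 'few-shot prompting']
```

A larger `k` gives the model more context to work with but also a longer prompt, so the value is a trade-off between coverage and token usage.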

**Step 6: Create a Function to Generate Responses**  
Next, we will create a function that helps generate answers to user questions. This function will take the user’s question as input, retrieve the relevant information, and use the Q&A chain to provide a response.

In [26]:
# A utility function for answer generation
def ask(question):
    context = retriever.invoke(question)
    answer = chain.invoke({"input_documents": context, "question": question})
    return answer
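Conceptually, a "stuff" chain joins the retrieved chunks and inserts them into one prompt alongside the question. The sketch below illustrates that idea with plain strings. The template wording and the name `build_stuff_prompt` are assumptions made for illustration, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, documents):
    """Join all retrieved chunks into a single prompt that ends with the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


demo_prompt = build_stuff_prompt(
    "What is attention?",
    ["chunk one", "chunk two"],
)
print(demo_prompt)
```

Since every chunk lands in the same prompt, this approach only works while `k` chunks of `chunk_size` characters fit within the model's context window.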

**Step 7: Ask Questions and Get Answers**  
Now, we are ready to use our Q&A system! To ask a question, simply use the following lines of code in your script:

In [25]:
# Take the user input and call the function to generate output
user_question = input("User: ")

answer = ask(user_question)
print("Answer:", answer['output_text'])

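User: why is the sky blue?
Answer: I don't know.

The question is unrelated to either PDF, so the retrieved chunks contain no answer and the model declines rather than guessing.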


### Conclusion

In this tutorial, we built a RAG Q&A system using LangChain and OpenAI. We covered all the key steps: installing the libraries, loading and chunking the PDFs, storing embeddings in Chroma, retrieving relevant chunks, and generating answers with an LLM.

You can reuse the same pipeline to add Q&A features to your own projects by swapping in your own documents and adjusting the chunk size, the number of retrieved chunks, and the model.