# Simple RAG system

In this notebook we will build a simple RAG system using a PDF document, OpenAI embeddings, and FAISS for efficient similarity search. The system loads a PDF document, splits it into chunks, creates an embedding for each chunk, and retrieves the most relevant chunks for a user query.


In [1]:
from dotenv import load_dotenv
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load environment variables from .env file
load_dotenv()

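# Read the API key and fail early with a clear message if it is missing
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = api_key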

### Load PDF documents

We will use `PyPDFLoader` to extract text from the PDF. It reads the PDF page by page and stores the extracted text in a list of document objects, where each document contains the content of a single page.

In [2]:
# Path to the PDF file
path = "Understanding_Climate_Change.pdf"

loader = PyPDFLoader(path)
documents = loader.load()

### Preprocessing

##### Split the document into chunks
We use the `RecursiveCharacterTextSplitter` to split the document into smaller chunks. Splitting keeps each piece small enough to embed and lets retrieval return focused passages instead of whole pages.

In [3]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, length_function=len)
texts = text_splitter.split_documents(documents)

We are splitting the document into chunks of up to 1000 characters, with 200 characters of overlap between consecutive chunks.
- `chunk_size` sets the maximum length of each chunk, which keeps chunks manageable for indexing and retrieval.
- `chunk_overlap` repeats the end of one chunk at the start of the next, so context that straddles a split point is preserved in both chunks.
- `length_function=len` tells the splitter to measure chunk length in characters. Chunks are at most `chunk_size` characters long; many are shorter, because the splitter prefers to break at natural separators such as paragraph and line breaks.
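To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter in plain Python. It is a simplification for illustration only: `split_with_overlap` is a hypothetical helper, not part of LangChain, and it does not implement the recursive separator logic that `RecursiveCharacterTextSplitter` uses.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Naive fixed-window splitter (illustrative; not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each printed chunk starts with the last four characters of the previous one, which is the same idea as the 200-character overlap used above, only at a smaller scale.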

##### Replace tabs with spaces
In many cases, PDFs may contain tab characters (`\t`) that were used for indentation but aren't necessary for the final processed text. We will replace these with spaces.

In [4]:
# Replace tab characters with spaces in the text content
for text in texts:
    text.page_content = text.page_content.replace('\t', ' ')  # Replace tabs with spaces

### Generate embeddings using OpenAI
Once the text is cleaned and processed, we can create embeddings for each of the chunks using the OpenAI API. An embedding is a high-dimensional vector that represents the meaning of a chunk, so chunks with similar meaning end up close to each other in the vector space.
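The idea that vectors capture meaning can be illustrated with cosine similarity, a common way to compare embeddings. The sketch below uses made-up three-dimensional vectors in place of real embeddings; the values and labels are assumptions chosen for illustration and do not come from the OpenAI API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real embeddings have far more dimensions
toy_vectors = {
    "greenhouse gases trap heat": [0.9, 0.1, 0.0],
    "CO2 warms the atmosphere": [0.8, 0.2, 0.1],
    "recipe for apple pie": [0.0, 0.1, 0.9],
}

query = toy_vectors["greenhouse gases trap heat"]
for label, vector in toy_vectors.items():
    print(f"{label}: {cosine_similarity(query, vector):.3f}")
```

In this toy setup the two climate-related vectors score much closer to each other than either does to the unrelated one, which is the property retrieval relies on.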

In [5]:
# Initialize the OpenAI embeddings model
embeddings = OpenAIEmbeddings()

We will also use FAISS to store and index the embeddings, which lets us run similarity searches against them efficiently.

In [6]:
# Create vector store using FAISS
vector_store = FAISS.from_documents(texts, embeddings)

Here, we create a FAISS vector store by:
- Generating embeddings for each chunk of text.
- Storing the embeddings in FAISS, which allows fast similarity search later.

`FAISS.from_documents` creates a flat (brute-force) index by default, which compares the query against every stored vector at search time.
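What a flat index does at query time can be sketched in a few lines of plain Python. This is an illustration of the brute-force idea only, assuming squared Euclidean distance as the metric; `flat_search` and the sample vectors are hypothetical and are not FAISS code.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(index_vectors, query_vector, k):
    """Score every stored vector against the query and return the k nearest.

    Returns a list of (distance, position) pairs, closest first.
    """
    scored = [
        (squared_l2(vector, query_vector), position)
        for position, vector in enumerate(index_vectors)
    ]
    scored.sort()  # smallest distance first
    return scored[:k]


# Hypothetical 2-dimensional vectors standing in for chunk embeddings
stored = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
results = flat_search(stored, [1.0, 1.0], k=2)
for distance, position in results:
    print(f"vector {position} at squared distance {distance:.3f}")
```

Because every stored vector is scored, the result is exact, and the cost grows linearly with the number of chunks. That trade-off is reasonable for a single PDF like the one used here.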

### Setup the retriever
Now that we have created the embeddings and stored them in FAISS, we can query the vector store to retrieve relevant information based on user queries. The retriever will help us fetch the top N most relevant document chunks based on a given query.

In [7]:
# Create a retriever that fetches top 2 relevant documents
chunks_query_retriever = vector_store.as_retriever(search_kwargs={"k": 2})

### Test querying the system
Now, we will query the system with a question and retrieve the most relevant context.

In [9]:
test_query = "What is the main cause of climate change?"

# Retrieve the most relevant documents for the query
docs = chunks_query_retriever.invoke(test_query)

# Extract content from the relevant documents
context = [doc.page_content for doc in docs]

# Display the relevant context
print("Retrieved Context:")
for i, c in enumerate(context):
    print(f"Context {i + 1}:")
    print(c)
    print("\n")


Retrieved Context:
Context 1:
Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential 
for life on Earth, as it keeps the planet warm enough to support life. However, human 
activities have intensified this natural process, leading to a warmer climate. 
Fossil Fuels 
Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and 
natural gas used for electricity, heating, and transportation. The industrial revolution marked 
the beginning of a significant increase in fossil fuel consumption, which continues to rise 
today. 
Coal


Context 2:
Most of these climate changes are attributed to very small variations in Earth's orbit that 
change the amount of solar energy our planet receives. During

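The first retrieved chunk directly addresses the query: it names the increase in greenhouse gases as the primary cause of recent climate change. The second chunk is a short fragment about natural variations in Earth's orbit, so it is on topic but contributes less to the answer.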