# Simple RAG (Retrieval-Augmented Generation) System

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## Key Components

1. PDF processing and text extraction
2. Text chunking for manageable processing
3. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings
4. Retriever setup for querying the processed documents
5. Evaluation of the RAG system

## Method Details

### Document Preprocessing

1. The PDF is loaded using PyPDFLoader.
2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.
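
To make the roles of chunk size and overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the `RecursiveCharacterTextSplitter` algorithm, which additionally tries to break on separators such as paragraph breaks and newlines. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows; consecutive windows share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of illustrative text
pieces = split_with_overlap(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(repr(piece))
```

The overlap means the tail of one chunk reappears at the head of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.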

### Text Cleaning

A custom function `replace_t_with_space` is applied to clean the text chunks. As its name suggests, it replaces tab characters (`\t`) with spaces, a common artifact of PDF text extraction.
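
A minimal sketch of what such a cleaning step could look like, assuming (from the function name) that it replaces tab characters with spaces in each chunk. `Chunk` is a hypothetical stand-in for LangChain's `Document`, which also exposes a `page_content` attribute, and `replace_tabs_with_spaces` is an illustrative name; the actual helper in the repository may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def replace_tabs_with_spaces(chunks: list[Chunk]) -> list[Chunk]:
    """Replace every tab character in each chunk's text with a single space (in place)."""
    for chunk in chunks:
        chunk.page_content = chunk.page_content.replace("\t", " ")
    return chunks


chunks = [Chunk("Greenhouse\tgases trap\theat."), Chunk("No tabs here.")]
cleaned = replace_tabs_with_spaces(chunks)
print([c.page_content for c in cleaned])
```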

### Vector Store Creation

1. OpenAI embeddings are used to create vector representations of the text chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.
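
The idea behind similarity search can be illustrated without FAISS or an embedding API. The sketch below uses made-up 3-dimensional vectors in place of real embeddings and an exhaustive squared-L2 scan, which is conceptually what an exact (flat) FAISS index computes. The function names and sample data are hypothetical.

```python
import heapq


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_chunks(query_vec, index, k=2):
    """Return the k (distance, text) pairs closest to query_vec by exhaustive scan."""
    scored = ((squared_l2(query_vec, vec), text) for text, vec in index)
    return heapq.nsmallest(k, scored)


# Made-up vectors standing in for real embeddings.
index = [
    ("Greenhouse gases trap heat.", [0.9, 0.1, 0.0]),
    ("Fossil fuels release CO2.", [0.8, 0.3, 0.1]),
    ("Orbital variations drove past climate shifts.", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a question about causes

for distance, text in nearest_chunks(query_vec, index, k=2):
    print(f"{distance:.3f}  {text}")
```

A smaller distance means the chunk's vector lies closer to the query's vector, which is taken as a proxy for semantic similarity.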

### Retriever Setup

1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.

### Encoding Function

The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.

## Key Features

1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.
2. Configurable Chunking: Allows adjustment of chunk size and overlap.
3. Efficient Retrieval: Uses FAISS for fast similarity search.
4. Evaluation: Includes a function to evaluate the RAG system's performance.

## Usage Example

The code includes a test query: "What is the main cause of climate change?". This demonstrates how to use the retriever to fetch relevant context from the processed document.
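
A hedged sketch of what a helper like `retrieve_context_per_question` might do: ask the retriever for documents and keep only their text. `Doc`, `StubRetriever`, and `retrieve_context` are hypothetical stand-ins so the example runs without an API key. With LangChain, the retriever would be the object returned by `as_retriever()`, whose `invoke` method is the current replacement for the deprecated `get_relevant_documents`.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def _words(text: str) -> set[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip("?.,!") for w in text.lower().split()}


class StubRetriever:
    """Hypothetical retriever that ranks documents by words shared with the query."""

    def __init__(self, docs: list[Doc], k: int = 2):
        self.docs = docs
        self.k = k

    def invoke(self, query: str) -> list[Doc]:
        query_words = _words(query)
        ranked = sorted(
            self.docs,
            key=lambda d: len(query_words & _words(d.page_content)),
            reverse=True,
        )
        return ranked[: self.k]


def retrieve_context(question: str, retriever) -> list[str]:
    """Return the text of each document the retriever considers relevant."""
    return [doc.page_content for doc in retriever.invoke(question)]


retriever = StubRetriever(
    [
        Doc("The main cause of recent climate change is greenhouse gases."),
        Doc("Burning fossil fuels releases CO2."),
        Doc("Climate change has many effects."),
    ],
    k=2,
)
context = retrieve_context("What is the main cause of climate change?", retriever)
for i, text in enumerate(context, start=1):
    print(f"Context {i}: {text}")
```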

## Evaluation

The system includes an `evaluate_rag` function to assess the performance of the retriever. The function is imported from the cloned repository, so its metrics are not defined in this notebook itself.

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

# Package Installation and Imports

The cell below installs all necessary packages required to run this notebook.


In [None]:
# Install required packages
!pip install pypdf==5.6.0
!pip install PyMuPDF==1.26.1
!pip install python-dotenv==1.1.0
!pip install langchain-community==0.3.25
!pip install langchain_openai==0.3.23
!pip install rank_bm25==0.2.2
!pip install faiss-cpu==1.11.0
!pip install deepeval==3.1.0

In [2]:
# Clone the repository to access helper functions and evaluation modules
!git clone https://github.com/k4mrul/RAG_TECHNIQUES.git
import sys
sys.path.append('RAG_TECHNIQUES')

# If you need to run with the latest data
# !cp -r RAG_TECHNIQUES/data .

Cloning into 'RAG_TECHNIQUES'...
remote: Enumerating objects: 1776, done.
remote: Counting objects: 100% (1112/1112), done.
remote: Compressing objects: 100% (423/423), done.
remote: Total 1776 (delta 739), reused 693 (delta 689), pack-reused 664 (from 4)
Receiving objects: 100% (1776/1776), 36.52 MiB | 19.18 MiB/s, done.
Resolving deltas: 100% (1125/1125), done.


In [None]:
import os
import sys
from dotenv import load_dotenv
from google.colab import userdata



# Load environment variables from a .env file
load_dotenv()

# Set the OpenAI API key environment variable (comment out if not using OpenAI)
if not userdata.get('OPENAI_API_KEY'):
    os.environ["OPENAI_API_KEY"] = input("Please enter your OpenAI API key: ")
else:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Original path append replaced for Colab compatibility

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from helper_functions import (EmbeddingProvider,
                              retrieve_context_per_question,
                              replace_t_with_space,
                              get_langchain_embedding_provider,
                              show_context)

from evaluation.evalute_rag import evaluate_rag

from langchain_community.vectorstores import FAISS


### Read Docs

In [None]:
# Download required data files
import os
os.makedirs('data', exist_ok=True)

# Download the PDF document used in this notebook
!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf


--2025-06-14 07:31:48--  https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206372 (202K) [application/octet-stream]
Saving to: ‘data/Understanding_Climate_Change.pdf’


2025-06-14 07:31:48 (5.89 MB/s) - ‘data/Understanding_Climate_Change.pdf’ saved [206372/206372]


In [None]:
path = "data/Understanding_Climate_Change.pdf"

### Encode document

In [None]:
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings (Tested with OpenAI and Amazon Bedrock)
    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)
    #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)

    # Create vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore

In [None]:
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

### Create retriever

In [None]:
# This line creates a 'retriever' object from the 'chunks_vector_store'.
# A retriever acts as a smart search tool for the vectorized text chunks
# extracted from the PDF document.
#
# 'chunks_vector_store' contains all the numerical representations (embeddings)
# of the document's text chunks.
#
# The '.as_retriever()' method converts this organized collection into a
# specialized tool capable of finding relevant chunks based on a query.
#
# The 'search_kwargs={"k": 2}' parameter specifies that when a query is made,
# the retriever should return the top 2 most similar or relevant text chunks
# from the 'chunks_vector_store'.
#
# We need this retriever to efficiently identify and extract the most pertinent
# information from the document in response to a user's question. This retrieved
# context is then typically used by a language model to generate accurate and
# well-informed answers, avoiding the need to process the entire document every time.
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})

### Test retriever

In [None]:
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

Context 1:
Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential 
for life on Earth, as it keeps the planet warm enough to support life. However, human 
activities have intensified this natural process, leading to a warmer climate. 
Fossil Fuels 
Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and 
natural gas used for electricity, heating, and transportation. The industrial revolution marked 
the beginning of a significant increase in fossil fuel consumption, which continues to rise 
today. 
Coal


Context 2:
Most of these climate changes are attributed to very small variations in Earth's orbit that 
change the amount of solar energy our planet receives. During the Holocene epoch,



### Evaluate results

In [None]:
# Note: this currently works with OpenAI only
evaluate_rag(chunks_query_retriever)

{'questions': ['1. **Multiple Choice: Causes of Climate Change**',
  '   - What is the primary cause of the current climate change trend?',
  '     A) Solar radiation variations',
  '     B) Natural cycles of the Earth',
  '     C) Human activities, such as burning fossil fuels',
  '     D) Volcanic eruptions',
  '',
  '2. **True or False: Climate Change Impacts**',
  '   - True or False: Climate change only affects the temperature of the planet, not weather patterns, sea levels, or ecosystems.',
  '',
  '3. **Short Answer: Mitigation Strategies**',
  '   - Describe two effective strategies that could be implemented to mitigate the effects of climate change.',
  '',
  '4. **Matching: Climate Change Terminology**',
  '   - Match the following terms with their correct definitions:',
  '     A) Greenhouse Gases',
  '     B) Carbon Footprint',
  '     C) Renewable Energy',
  '     D) Adaptation',
  '     - Definitions:',
  '       1. The total amount of greenhouse gases produced to directl

The output you're seeing is the result of the RAG system's evaluation using the deepeval library (as suggested by the installation of deepeval==3.1.0). It provides insights into the quality of the retrieved information for a set of generated questions.

Here's what each part means:

`questions`: A list of test questions generated automatically to assess how well the RAG system performs. The questions cover various aspects of the processed document (`Understanding_Climate_Change.pdf`).

`results`: A list of evaluation scores in JSON format, typically one for each question or part of a question. Each JSON object contains three key metrics:

- "Relevance": This score indicates how pertinent or on-topic the retrieved information (or generated answer, if the evaluation also covered generation) is to the specific question. A score of 5 generally means it's highly relevant, while a lower score indicates less relevance.

- "Completeness": This score measures how thoroughly the retrieved information addresses all parts of the question. A higher score (e.g., 5) means the retrieved context fully covers what was asked, whereas a lower score suggests missing details.

- "Conciseness": This score assesses how brief and to-the-point the retrieved information is, without unnecessary verbosity. A higher score means it's succinct and directly answers the query, while a lower score might imply it's too wordy or includes irrelevant information.

In essence, these metrics help you understand the quality of the `chunks_query_retriever` by quantifying how well it finds relevant, complete, and concise information in response to various questions about your document.
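
Assuming each entry in `results` is a JSON string with numeric `Relevance`, `Completeness`, and `Conciseness` fields as described above (the exact output format of `evaluate_rag` is not visible in the truncated output), per-metric averages could be computed as sketched below. The function name `average_scores` is hypothetical and the sample scores are invented for illustration.

```python
import json
from statistics import mean

METRICS = ("Relevance", "Completeness", "Conciseness")


def average_scores(results: list[str]) -> dict[str, float]:
    """Average each metric across a list of JSON-encoded score objects."""
    parsed = [json.loads(item) for item in results]
    return {metric: float(mean(entry[metric] for entry in parsed)) for metric in METRICS}


# Invented sample data, not actual output from the notebook.
sample_results = [
    '{"Relevance": 5, "Completeness": 4, "Conciseness": 3}',
    '{"Relevance": 4, "Completeness": 4, "Conciseness": 5}',
]
print(average_scores(sample_results))
```

Tracking these averages while varying `chunk_size`, `chunk_overlap`, or `k` is one simple way to compare retriever configurations.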

## **Notebook Summary: RAG System for PDF Processing**

This notebook demonstrates the construction of a Retrieval-Augmented Generation (RAG) system designed to process PDF documents, extract relevant information, and retrieve it efficiently based on user queries.

### **Overall Process and Sequence of Steps**

1. **Environment Setup**: Installed necessary Python packages (including `pypdf`, `langchain-community`, `langchain_openai`, `faiss-cpu`, and `deepeval`) and cloned a GitHub repository containing helper functions.
2. **API Key Configuration**: Loaded environment variables with `python-dotenv`, then set the OpenAI API key from Colab's `userdata` secrets (prompting for it if no secret is set) so that the embedding model can be accessed.
3. **Document Loading**: Downloaded a sample PDF document, `"Understanding_Climate_Change.pdf"`, from a GitHub repository into a local data directory.
4. **PDF Encoding Function (`encode_pdf`)**:
   - Loaded the PDF content using `PyPDFLoader`.
   - Split the document into smaller, overlapping text chunks using `RecursiveCharacterTextSplitter` (configured with `chunk_size=1000` and `chunk_overlap=200`).
   - Applied a custom text cleaning function (`replace_t_with_space`) to clean the text chunks.
   - Generated vector embeddings for each text chunk using OpenAI's embedding model.
   - Created a FAISS vector store from these embeddings, enabling efficient similarity search.
5. **Retriever Creation**: Configured a retriever object from the FAISS vector store using `.as_retriever()`, set to fetch the top 2 (`k=2`) most relevant text chunks for any given query.
6. **Retriever Testing**: Tested the retriever with a sample query (`"What is the main cause of climate change?"`) to demonstrate its ability to retrieve relevant context using helper functions (`retrieve_context_per_question`, `show_context`).
7. **RAG Evaluation**: Evaluated the RAG system's performance using the `deepeval` library via the `evaluate_rag` function, which assesses the relevance, completeness, and conciseness of the retrieved information for various generated questions.

### **Tools, Libraries, and Their Functions**

| Tool/Library | Function |
|-------------|----------|
| **pypdf** / **PyMuPDF** | Loading and parsing PDF documents. |
| **python-dotenv** | Loading environment variables (API keys) from a `.env` file. |
| **langchain-community** | Provides `PyPDFLoader` and other community integrations for document loading. |
| **langchain_openai** | Integrates OpenAI models for generating embeddings. |
| **faiss-cpu** | Efficient similarity search and clustering of dense vectors; creates and manages the vector store. |
| **rank_bm25** | BM25 keyword-based ranking, an alternative retrieval method; not called directly in this notebook (installed as a dependency for other libraries). |
| **deepeval** | Framework for evaluating RAG system performance (relevance, completeness, conciseness). |
| **PyPDFLoader** | Loads content from PDF files into document objects. |
| **RecursiveCharacterTextSplitter** | Breaks down large texts into smaller, manageable, overlapping chunks for efficient embedding and retrieval. |
| **OpenAI Embeddings** | Converts text chunks into numerical vector representations. |
| **FAISS Vector Store** | Stores numerical embeddings and enables fast nearest-neighbor search to retrieve semantically similar text chunks. |
| **Retriever (`.as_retriever()`)** | Interface that takes a query and returns relevant documents from the vector store. |
| **Helper Functions** (`retrieve_context_per_question`, `show_context`, `replace_t_with_space`, `get_langchain_embedding_provider`) | Custom utilities for retrieving, displaying, cleaning text data, and embedding selection. |

### **End Goal**

The ultimate goal is to provide a foundational understanding and practical implementation of a RAG system. It demonstrates how to transform unstructured PDF data into a structured, searchable format (a vector store) and use this structure to efficiently retrieve information pertinent to specific queries. Finally, it establishes a mechanism to evaluate the quality of the retrieval process, laying the groundwork for more complex information retrieval and question-answering systems.