# **Retrieval-Augmented Generation (RAG) System**

## **Objective**
The goal of this project was to implement a **Retrieval-Augmented Generation (RAG)** system to answer user queries by retrieving relevant context from a **PDF book**.

## **Process Overview**

### 1. **PDF Processing**
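We used **LangChain's `PyPDFLoader`** to load the content of the PDF and split it into **chunks** of up to 1000 characters, with a 200-character overlap, using the **`RecursiveCharacterTextSplitter`**. Chunking keeps each piece small enough to embed as a single unit, lets retrieval return a focused passage instead of a whole page, and the overlap reduces the chance that a sentence is cut off at a chunk boundary.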
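To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, and `split_with_overlap` is a hypothetical helper that is not part of LangChain.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. This is a plain
    sliding window, not the separator-aware algorithm LangChain uses.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy input: 50 characters, windows of 20 with 5 characters shared.
sample = "abcdefghij" * 5
chunks = split_with_overlap(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(len(chunk), chunk)
```

A larger overlap repeats more text across chunks, which costs storage and embedding time but makes it less likely that an answer is split between two chunks.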

### 2. **Embedding the Content**
- The text chunks were embedded into **vectors** with the **SentenceTransformers** model `all-MiniLM-L6-v2`, loaded through LangChain's `HuggingFaceEmbeddings` wrapper.
- These embeddings were stored in **FAISS**, a vector store that enables **efficient similarity search**.
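Conceptually, a vector store maps each embedding to its source chunk and answers "which stored vectors are closest to this one?". The sketch below is a brute-force index using Euclidean (L2) distance. `FlatL2Index` is a hypothetical class written for illustration; FAISS provides heavily optimized implementations and several index types, and the two-dimensional vectors here are made up.

```python
import math


class FlatL2Index:
    """Exhaustive nearest-neighbour search over a small list of vectors."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors: list[list[float]] = []
        self.payloads: list[str] = []

    def add(self, vector: list[float], payload: str) -> None:
        if len(vector) != self.dim:
            raise ValueError(f"expected a vector of length {self.dim}")
        self.vectors.append(list(vector))
        self.payloads.append(payload)

    def search(self, query: list[float], k: int = 2) -> list[tuple[float, str]]:
        # Compare the query with every stored vector, smallest distance first.
        scored = [
            (math.dist(query, vector), payload)
            for vector, payload in zip(self.vectors, self.payloads)
        ]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


# Made-up 2-D "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions.
index = FlatL2Index(dim=2)
index.add([0.0, 0.0], "title page")
index.add([1.0, 0.0], "editors page")
index.add([5.0, 5.0], "chapter 7")

hits = index.search([0.9, 0.1], k=2)
for distance, payload in hits:
    print(f"{distance:.3f}  {payload}")
```

Exhaustive search scales linearly with the number of chunks, which is acceptable for a single book; approximate indexes become relevant for much larger collections.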

### 3. **Context Retrieval**
- For each user **query**, the question was converted into an **embedding vector**.
- The **FAISS vector store**, exposed as a **LangChain retriever** with `k = 2`, returned the document chunks whose embeddings were closest to the query embedding, so matches reflect **semantic similarity** rather than shared keywords.
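The ranking step can be sketched with cosine similarity. The example also checks a useful identity: for unit-length vectors, squared L2 distance equals `2 - 2 * cosine`, so ranking by smallest L2 distance and ranking by largest cosine similarity give the same order. The vectors are made up, `top_k` is a hypothetical helper, and whether a given embedding model returns unit-length vectors is an assumption to verify for your setup.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunk_vectors: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vectors)),
        key=lambda i: cosine_similarity(query, chunk_vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-D vectors standing in for a query and three chunk embeddings.
query = normalize([1.0, 0.2, 0.0])
chunk_vectors = [
    normalize(v) for v in ([0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.1])
]

best = top_k(query, chunk_vectors, k=2)
print("most similar chunk indices:", best)

# For unit vectors: ||q - v||^2 == 2 - 2 * cos(q, v)
for vector in chunk_vectors:
    squared_l2 = math.dist(query, vector) ** 2
    print(round(squared_l2, 6), round(2 - 2 * cosine_similarity(query, vector), 6))
```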

### 4. **Integration**
- The retriever, vector store, and embedding model were combined into a complete system that could retrieve relevant document content and generate an answer for the query.
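The code in this notebook covers retrieval; for the generation step, one common pattern is to paste the retrieved chunks into a prompt for a language model. The sketch below shows only the prompt assembly. `build_prompt` and its wording are hypothetical, and the choice of language model is left open because the notebook does not specify one.

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    numbered = "\n\n".join(
        f"Context {i}:\n{text.strip()}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks; in the real pipeline these come from the retriever.
prompt = build_prompt(
    "Who edited the book?",
    ["  First retrieved chunk.  ", "Second retrieved chunk."],
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which context it relied on.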

## **Libraries and Tools Used**

- **LangChain**: For document loading, embedding management, and creating a retriever.
- **FAISS**: To store and search for embeddings efficiently.
- **SentenceTransformers**: To convert text into vector embeddings.
- **PyPDFLoader** (a LangChain document loader): To read and load the PDF document.
- **RecursiveCharacterTextSplitter** (a LangChain text splitter): To break the document into chunks for easier processing.

## **Outcome**
- We successfully built a **RAG system** capable of processing a large PDF, embedding its content, and answering queries based on semantic similarity.
- The system retrieves relevant context from the document and generates answers based on the retrieved information.

## **Challenges Faced**
- Efficient **chunking** and **embedding** of large documents.
- Ensuring **semantic retrieval**, not just keyword matching, for accurate answers.
- Setting up the correct **environment** and handling dependencies.

## **Final Thoughts**
- This **RAG system**, built with **LangChain** and **FAISS**, retrieves passages by semantic meaning rather than exact wording, which makes question answering over large documents practical.
- The setup can be further improved with **fine-tuned models**, refined **chunking strategies**, and better **query processing**.

---

**Let’s continue building and improving this powerful system for larger, more complex datasets!**




In [None]:
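import import_ipynb  # allows notebooks (.ipynb) to be imported as modules
# The star import is expected to supply the names used below: PyPDFLoader,
# RecursiveCharacterTextSplitter, HuggingFaceEmbeddings, FAISS,
# replace_t_with_space, retrieve_context_per_question and show_context.
from helper_functions import *
from dotenv import load_dotenv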

### Read Docs

In [2]:
path = "AKA Book.pdf"

In [3]:
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using SentenceTransformers embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Step 1: Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Step 2: Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len)
    texts = text_splitter.split_documents(documents)

    # Step 3: Clean text (optional, if a function like replace_t_with_space exists in your code)
    cleaned_texts = replace_t_with_space(texts) if 'replace_t_with_space' in globals() else texts

    # Step 4: Build the embedding function. HuggingFaceEmbeddings loads the
    # SentenceTransformers model "all-MiniLM-L6-v2" (a free embedding model)
    # itself, so no separate SentenceTransformer instance is needed.
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    # Step 5: Create FAISS vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore

In [4]:
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)



### Create retriever

In [14]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})

### Test Retriever

In [15]:
test_query = "Who edited the book 'Beckett's Industrial Chocolate Manufacture and Use'"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

Context 1:
Beckett’s Industrial Chocolate  
Manufacture and Use


Context 2:
Beckett’s Industrial 
Chocolate 
Manufacture 
and Use
FIFTH EDITION
EDITED BY
Stephen T. Beckett
Mark S. Fowler
Gregory R. Ziegler
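The helpers `retrieve_context_per_question` and `show_context` come from `helper_functions` and are not shown in this notebook. A minimal sketch of what they might look like follows, assuming the retriever exposes LangChain's `invoke` method and returns documents with a `page_content` attribute. `FakeDocument` and `FakeRetriever` are stand-ins so the example runs without a vector store; the actual helpers may differ.

```python
from dataclasses import dataclass


def retrieve_context_per_question(question, retriever):
    """Return the text of each document the retriever finds for `question`."""
    docs = retriever.invoke(question)
    return [doc.page_content for doc in docs]


def show_context(context):
    """Print each retrieved chunk under a numbered heading."""
    for i, text in enumerate(context, start=1):
        print(f"Context {i}:")
        print(text)
        print()


# --- Stand-ins for demonstration only (not part of LangChain) ---
@dataclass
class FakeDocument:
    page_content: str


class FakeRetriever:
    """Ignores the query and returns its first two documents."""

    def __init__(self, docs):
        self.docs = docs

    def invoke(self, query):
        return self.docs[:2]


retriever = FakeRetriever(
    [FakeDocument("first chunk"), FakeDocument("second chunk"), FakeDocument("third chunk")]
)
context = retrieve_context_per_question("any question", retriever)
show_context(context)
```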


