# RAG (Retrieval-Augmented Generation) with LangChain

## Overview

This notebook implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve the chunks most relevant to a question.

## Key Components

The basic RAG pipeline operates in two main phases:

1. Data Indexing
2. Retrieval & Generation


### Data Indexing Process

1. Data Loading: All documents or other information to be used are imported.
2. Document Splitting: Documents are divided into smaller chunks, for instance, sections of no more than 500 characters each.
3. Data Embedding: Each chunk is converted into a vector using an embedding model.
4. Data Storing: The embeddings are saved in a vector database, where they can be searched efficiently.
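
The splitting step can be illustrated with a minimal sketch. It assumes a plain fixed-size sliding window over characters; the helper name `split_with_overlap` is hypothetical, and the `RecursiveCharacterTextSplitter` used later in this notebook is more sophisticated because it prefers to break on separators such as paragraphs and sentences.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 120  # 1200 characters of dummy text
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

The overlap means the last 50 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.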

### Retrieval and Generation Process

1. Retrieval: When a user asks a question, the input is first transformed into a vector (the query vector) using the same embedding model as in the Data Indexing phase. The query vector is then compared against the vectors in the vector database to find the most similar ones (e.g., using the Euclidean distance metric). This step identifies the knowledge chunks most likely to contain the answer.
2. Generation: The LLM takes the user's question together with the chunks retrieved from the vector database and generates an answer from the combined input.
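
The two steps can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: hand-made 2-D vectors stand in for real embeddings, the chunk texts are placeholders, and the helper names (`euclidean`, `retrieve`, `build_prompt`) and the prompt wording are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query_vector, index, k=2):
    """Return the k chunks whose vectors are closest to query_vector."""
    ranked = sorted(index, key=lambda item: euclidean(query_vector, item[0]))
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, chunks):
    """Combine the question and the retrieved chunks into one LLM prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy index of (vector, chunk) pairs; a real index would hold model embeddings.
index = [
    ([0.9, 0.1], "placeholder chunk about LLM serving"),
    ([0.1, 0.9], "placeholder chunk about an unrelated topic"),
    ([0.8, 0.3], "placeholder chunk about request batching"),
]
query_vector = [1.0, 0.0]  # assumed embedding of the user's question

top_chunks = retrieve(query_vector, index, k=2)
prompt = build_prompt("What is TGI?", top_chunks)
print(prompt)
```

In a full pipeline the resulting `prompt` string would be sent to the LLM; a vector database replaces the linear scan in `retrieve` with an index that scales to many chunks.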

## Implementation Details

### Data Loading

1. The PDF is loaded using PyPDFLoader.
2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.

### Text Cleaning

A custom function `replace_tab_with_space` replaces every tab character with a space in each text chunk. This likely addresses tab artifacts introduced when the text is extracted from the PDF.

### Vector Store Creation

1. Hugging Face embeddings (the `sentence-transformers/all-mpnet-base-v2` model) are used to create vector representations of the text chunks.
2. A Chroma vector store is created from these embeddings for efficient similarity search.

### Encoding Function

The `convert_pdf_to_embeddings` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.

## Usage Example

The code includes a test query: "What is TGI?". This demonstrates how to run a similarity search against the vector store to fetch relevant context from the processed document.

### Install dependencies

In [None]:
pip install -r requirements.txt

### Read environment variables

In [None]:
import os

# Set the OpenAI API key environment variable (skip this cell if not using OpenAI).
# If the variable is already set, it is left unchanged.
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = input("Please enter your OpenAI API key: ")

### Setup Path

In [None]:
import os
import sys

# Add the parent directory to the path since we work with notebooks
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
# This pdf is generated from https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi
path = "../data/LLM_Inference_TGI_Adyen.pdf"
print(f"Document path = {path}")

### Generate Document Embeddings

In [50]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
#from langchain.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI


# ----- Data Indexing Process -----

def replace_tab_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document.

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents
    

def convert_pdf_to_embeddings(path, chunk_size=1000, chunk_overlap=200):
    """
    Converts a PDF document into a Chroma vector store using Hugging Face embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The maximum size of each text chunk, in characters.
        chunk_overlap: The number of characters shared by consecutive chunks.

    Returns:
        A Chroma vector store containing the encoded document content.
    """

    # load your pdf doc
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_tab_with_space(texts)

    # Create embeddings (Hugging Face model here; OpenAI and Amazon Bedrock were also tested)
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

    # Create the Chroma vector store from the cleaned chunks
    vectorstore = Chroma.from_documents(cleaned_texts, embeddings)

    return vectorstore

In [None]:
# Get Vector Store
chunks_vector_store = convert_pdf_to_embeddings(path, chunk_size=500, chunk_overlap=50)
print(type(chunks_vector_store))

### Test retriever

In [52]:
# Run query and get the answer
query = "What is TGI?"
docs = chunks_vector_store.similarity_search(query)
print(docs[0].page_content)

To better illustrate how TGI's continuous batching algorithm works, let's walk through a
specific example with the following initial setup seen in Table 1. Initially, no requests are
being processed so the total token budget is equal to MBT.
prefill(batch)
# Main loop to manage requests
while batch:
# Update the token budget based on current batch
batch_max_tokens = sum(request.input_tokens + request.max_new_tokens for request in batch)
token_budget = max_batch_total_tokens - batch_max_tokens
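
The query above prints only the first match. A small helper can collect the text of several matches at once, for example to pass them to an LLM as context. The sketch below is an assumption-laden illustration: `top_k_contents` is a hypothetical helper, and `FakeDoc` and `FakeStore` are stand-ins that mimic only the `similarity_search(query, k=...)` method and the `page_content` attribute so the example runs without Chroma installed.

```python
def top_k_contents(vector_store, query, k=3):
    """Return the text of the k most similar chunks for a query."""
    docs = vector_store.similarity_search(query, k=k)
    return [doc.page_content for doc in docs]


# --- Stand-in objects so the sketch runs without a real vector store ---
class FakeDoc:
    def __init__(self, page_content):
        self.page_content = page_content


class FakeStore:
    def __init__(self, texts):
        self._docs = [FakeDoc(t) for t in texts]

    def similarity_search(self, query, k=4):
        # A real store ranks by similarity; this stub returns the first k docs.
        return self._docs[:k]


store = FakeStore(["chunk A", "chunk B", "chunk C", "chunk D"])
contents = top_k_contents(store, "What is TGI?", k=2)
print(contents)
```

With the store built earlier in this notebook, the equivalent call would be `top_k_contents(chunks_vector_store, query, k=2)`.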
