# Introduction

This Jupyter notebook walks you through building a retrieval-augmented generation (RAG) system from scratch.

I strongly recommend reading the [README](./readme.md) for background on the topic before diving into the code.


# Set up the dev environment


## Python

It all starts with Python as usual. Install it as described [here](https://wiki.python.org/moin/BeginnersGuide/Download).


## Git

- Create a folder named "RAGify" or any other name you would like.
- Make a Git repository out of this folder. If you are new to Git, [check here for help](https://docs.github.com/en/get-started/getting-started-with-git/set-up-git).


In [None]:
# Create a folder named "RAGify" or any other name you would like
import os

#os.makedirs("RAGify", exist_ok=True)

# Navigate into the folder
#%cd RAGify

# Initialize a git repository
#!git init

## Python Virtual Environment

- [Check here](https://realpython.com/python-virtual-environments-a-primer/) to learn why a venv is useful
- Run the cell below to create a venv


In [None]:
# Create a Python virtual environment
#!python -m venv rag_venv

# Add the virtual environment folder to ".gitignore" file
with open(".gitignore", "a") as f:
    f.write("rag_venv/\n")


- Activate the virtual environment:
  - On Windows - `.\rag_venv\Scripts\activate`
  - On macOS or Linux - `source rag_venv/bin/activate`
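
To confirm that the notebook kernel is actually running inside the virtual environment, a quick check along these lines can help. This is a minimal sketch: `in_virtualenv` is a hypothetical helper name, and it relies on the standard-library behavior that `sys.prefix` differs from `sys.base_prefix` inside an environment created with `python -m venv`.

```python
import sys

def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Inside a virtual environment:", in_virtualenv())
print("Interpreter:", sys.executable)
```

If this prints `False` even though you activated `rag_venv` in a terminal, the notebook is probably using a different kernel than the one from the venv.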


## Install Packages


In [None]:
# Create requirements.txt file
requirements = """
streamlit
sentence-transformers
pypdf
langchain
faiss-cpu
google-generativeai
"""

with open("requirements.txt", "w") as f:
    f.write(requirements)

# Install all dependencies into the environment of the running kernel
%pip install -r requirements.txt
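
After installation, it can be useful to verify that every package is importable from the current kernel. The sketch below is one way to do that. `find_missing` and `MODULES` are hypothetical names introduced here, and the import names are listed separately because some differ from the pip package names in `requirements.txt`.

```python
import importlib.util

# Import names for the packages listed in requirements.txt
MODULES = [
    "streamlit",
    "sentence_transformers",  # pip: sentence-transformers
    "pypdf",
    "langchain",
    "faiss",                  # pip: faiss-cpu
    "google.generativeai",    # pip: google-generativeai
]

def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent or broken
            spec = None
        if spec is None:
            missing.append(name)
    return missing

missing = find_missing(MODULES)
print("Missing modules:", missing or "none")
```

An empty result only shows that the modules can be found. It does not check versions or that each import succeeds at runtime.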


# Process input PDF files


In [2]:
# Import the PdfReader class from the pypdf library
from pypdf import PdfReader

def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF file.

    Args:
        pdf_path (str): The file path to the PDF.

    Returns:
        str: Extracted text from all pages of the PDF.
    """
    # Open the PDF file in binary read mode
    with open(pdf_path, 'rb') as file:
        # Create a PdfReader object to read the PDF
        reader = PdfReader(file)

        # Initialize an empty string to store the extracted text
        text = ''

        # Iterate through each page in the PDF
        for page in reader.pages:
            # Extract text from the current page; fall back to '' if the
            # page yields no text (e.g. a scanned image without a text layer)
            # The '\n' adds a newline character after each page's text
            text += (page.extract_text() or '') + '\n'

    # Return the accumulated text from all pages
    return text

# List of file paths for the PDFs to process
pdf_paths = [
    './input_files/01-about_blunder_mifflin.pdf',
    './input_files/02-employee_handbook.pdf',
    './input_files/03-relationships_policy.pdf',
    './input_files/04-prank_protocol.pdf',
    './input_files/05-birthday_party_committee_rules.pdf',
]

# Use a list comprehension to extract text from all PDFs
# This creates a list where each item is the extracted text from one PDF
documents = [extract_text_from_pdf(pdf_path) for pdf_path in pdf_paths]

# At this point, 'documents' is a list of strings, where each string
# contains the full text of one PDF file
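
Text extraction can silently produce empty strings, for example when a PDF consists only of scanned images. A small sanity check before chunking helps catch that early. The sketch below uses a hypothetical helper, `find_empty_documents`, and toy data in place of `pdf_paths` and `documents`.

```python
def find_empty_documents(paths, texts, min_chars=1):
    """Return the paths whose extracted text is shorter than min_chars
    once leading and trailing whitespace is stripped."""
    return [
        path
        for path, text in zip(paths, texts)
        if len((text or "").strip()) < min_chars
    ]

# Toy data standing in for pdf_paths and documents
sample_paths = ["a.pdf", "b.pdf"]
sample_texts = ["Welcome to the handbook.\n", "\n\n"]
print(find_empty_documents(sample_paths, sample_texts))
```

In the notebook, the equivalent call would be `find_empty_documents(pdf_paths, documents)`. Any path it returns deserves a manual look before you continue.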

# Text Chunking with LangChain


In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_chunks(text, chunk_size=500, chunk_overlap=50):
    """
    Split a large text into smaller chunks.

    Args:
        text (str): The input text to be split.
        chunk_size (int): The maximum size of each chunk, in characters.
        chunk_overlap (int): The number of characters to overlap between chunks.

    Returns:
        list: A list of text chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    chunks = text_splitter.split_text(text)
    return chunks

# Create chunks for all documents
all_chunks = []
for doc in documents:
    all_chunks.extend(create_chunks(doc))
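
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window chunker in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to split on separators such as paragraph breaks and spaces. `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

demo = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. The overlap keeps a sentence that straddles a chunk boundary at least partly present in both chunks.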

# Embedding Creation with Sentence Transformers

In [4]:
from sentence_transformers import SentenceTransformer

def create_embeddings(chunks):
    """
    Generate embeddings for a list of text chunks.

    Args:
        chunks (list): A list of text chunks to embed.

    Returns:
        numpy.ndarray: An array of shape (len(chunks), embedding_dim).
    """
    # Initialize the SentenceTransformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Generate embeddings for all chunks
    embeddings = model.encode(chunks)

    return embeddings

# Create embeddings for all chunks
embeddings = create_embeddings(all_chunks)
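
Embeddings are useful because texts with similar meaning end up as vectors that point in similar directions. The sketch below shows one common way to compare two vectors, cosine similarity, using toy three-dimensional vectors rather than real model output. `cosine_similarity` is a hypothetical helper written in plain Python for clarity.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; report 0.0 by convention here
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
v_policy = [1.0, 0.0, 1.0]
v_similar = [2.0, 0.0, 2.0]    # same direction as v_policy
v_unrelated = [0.0, 1.0, 0.0]  # orthogonal to v_policy

print(cosine_similarity(v_policy, v_similar))
print(cosine_similarity(v_policy, v_unrelated))
```

The first pair scores close to 1 and the second scores 0. With real data you could pass two rows of `embeddings` to compare two chunks the same way.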



# Vector Database Setup with FAISS

In [5]:
import faiss
import numpy as np

def setup_faiss_index(embeddings):
    """
    Create and populate a FAISS index with the given embeddings.

    Args:
        embeddings (array-like): Embedding vectors, one row per chunk.

    Returns:
        faiss.Index: The populated FAISS index.
    """
    # Convert embeddings to numpy array if not already
    embeddings_np = np.array(embeddings).astype('float32')

    # Create a FAISS index
    # We use IndexFlatL2, which performs exact L2 distance search
    dimension = embeddings_np.shape[1]
    index = faiss.IndexFlatL2(dimension)

    # Add vectors to the index
    index.add(embeddings_np)

    return index

# Create and populate the FAISS index
faiss_index = setup_faiss_index(embeddings)

# Save index to disk for future use
faiss.write_index(faiss_index, "blunder_mifflin_index.faiss")
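
Conceptually, an exact L2 index compares a query vector against every stored vector and returns the closest ones. The plain-Python sketch below mimics that on toy data so the idea is visible without FAISS. `l2_top_k`, `toy_vectors`, and `toy_chunks` are hypothetical names, and the code is an illustration of the technique, not how FAISS is implemented.

```python
def l2_top_k(vectors, query, k=3):
    """Return (squared_distance, index) pairs for the k vectors nearest to query."""
    scored = [
        (sum((v - q) ** 2 for v, q in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    # Smallest squared distance first
    scored.sort()
    return scored[:k]

# Toy 2-dimensional "embeddings" and the chunks they belong to
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_chunks = ["chunk A", "chunk B", "chunk C"]

for dist, idx in l2_top_k(toy_vectors, [0.9, 1.2], k=2):
    print(toy_chunks[idx], round(dist, 2))
```

With the real index, the corresponding call is `faiss_index.search(query_vectors, k)`, where `query_vectors` is a 2-D `float32` array. It returns an array of distances and an array of indices. Because vectors were added in the same order as `all_chunks`, each returned index can be used to look up the matching chunk text.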