<a href="https://colab.research.google.com/github/otoledanosole/learning_local_rag/blob/main/local_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Creation of a Local Retrieval Augmented Generation (RAG) Pipeline**

Objective of the document

## **Example of what we are trying to build**

<img src="https://github.com/otoledanosole/learning_local_rag/blob/main/images/Nvidia_rag-pipeline-ingest-query-flow-b-2048x960.png?raw=true" alt="Retrieval Augmented Generation (RAG) Sequence Diagram from Nvidia" />

We are going to use the post ["RAG 101: Demystifying Retrieval Augmented Generation Pipelines"](https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/) from the Nvidia Technical Blog as a reference for what we are going to build, except that we are not going to use an existing framework.

Instead of using LangChain or LlamaIndex as they do in the example, we are going to build our own pipeline in order to really understand what RAG is.


## **What is RAG?**

RAG stands for Retrieval Augmented Generation.

RAG is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external resources.

It was first introduced in a 2020 paper called [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401#).

To understand it with a real example, imagine a psychologist:
>A psychologist can create strategies or apply treatments to their patients based on their general understanding of different psychology techniques. Sometimes, however, a patient requires special expertise, so the psychologist sends clerks to a library that stores specific studies they can use.

Like a good psychologist, large language models (LLMs) can respond to a wide variety of queries. But if we look under the hood of an LLM, we find a neural network with parameters that essentially represent the general patterns of how humans use words to form sentences. Some pre-trained LLMs have been shown to store factual knowledge in their parameters, working as a parameterized implicit knowledge base.

While this is interesting, such models have downsides: the information they are based on is difficult to update or revise, they may produce "hallucinations", and they cannot provide in-depth knowledge of a specific topic. Some hybrid models combine parametric and non-parametric memories.

The goal of RAG is to take information from a source, pre-process it and pass it to an LLM so it can generate outputs based on that information.


>To understand what retrieval-augmented generation means, we can roughly break it down into three steps:
* Retrieval - Find relevant information given a query, also known as a prompt. For example: "What pushes a person to act against his beliefs?" → retrieves the information related to this topic from a given source, for example the book "Social Psychology" by Solomon Asch.
* Augmented - Take the relevant information and augment our input to an LLM with it.
* Generation - Take the output of the first two steps and pass it to an LLM for generative outputs.
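
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the passages are invented for the example, retrieval is done by counting shared words (the pipeline built later uses embeddings instead), and `generate` is a hypothetical placeholder rather than a call to a real LLM.

```python
import re

# Hypothetical mini knowledge base, invented for this example.
# In the real pipeline these would be chunks of a document.
passages = [
    "Group pressure can lead a person to agree with a majority even when the majority is wrong.",
    "Vitamin C is a water-soluble vitamin found in citrus fruits.",
    "Sleep plays a role in memory consolidation.",
]

def words(text: str) -> set[str]:
    """Lowercase a text and return the set of words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_words = words(query)
    ranked = sorted(passages, key=lambda p: len(query_words & words(p)), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Augmentation: place the retrieved passages in front of the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer the question using this context:\n{context_block}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Generation: placeholder standing in for a call to an LLM."""
    return f"[an LLM would answer here, given a prompt of {len(prompt)} characters]"

query = "What pushes a person to act against his beliefs?"
context = retrieve(query, passages)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```

Word overlap is a deliberately crude relevance score; it is used here only so the example runs without any external library.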






## **Benefits of RAG**

The main advantages of using RAG are:
- Empowering LLM solutions with real-time data access
- Preserving data privacy
- Mitigating LLM hallucinations

**Empowering LLM solutions with real-time data access**
>Generally, LLMs trained with "internet" data have a fairly good understanding of language in general. However, it also means that their responses may be too generic for some applications.
Because data is constantly changing, RAG facilitates direct access to the resources that the LLM uses. These resources can consist of real-time and personalized data.
RAG helps to create specific responses based on specific data.

**Preserving data privacy**
>One of the main benefits of using RAG is privacy. When the LLM is self-hosted, sensitive data is not exposed to a third party.

**Mitigating LLM hallucinations**
>LLMs are good at generating *good-looking* text; however, that does not mean the text is factual. A hallucination is when an LLM provides a faulty response in such a convincing way that it sounds real.
RAG helps mitigate hallucinations.



## **Where can we use RAG?**

The main use of RAG is to take your relevant documents and turn them into part of a prompt to be used by an LLM.

In fact, almost any business can turn its internal data into resources called knowledge bases and use them to enhance LLMs.

Some uses could be:
- Email data analysis: Imagine that you work for a company that receives tons of emails and you need to search for a specific topic or conversation. You can use RAG to get all the relevant information from the emails and then use an LLM to generate a response.
- Company internal knowledge base: Another use could be to create an internal knowledge base. Imagine that you are working for a software company and you find a bug in a system. You can write a document on how to fix it and add it to a RAG pipeline so your colleagues can use an LLM to search for answers.
- Textbook reading: Imagine that you need to read a book about something. You can use RAG to go through the book and extract the relevant information.

## **Why do you want to run it locally?**

Let's go through a few arguments for why it is a good idea to run it locally:
- The first one is data privacy. If you are using sensitive or private data, you may not want to send it over the internet to another company.
- Another benefit is speed. You don't depend on network availability, and you can use large amounts of data without having to send them through an API.
- We must also look at the cost. Running it locally may need a big initial investment, but in the long run that investment can pay for itself. If an external service is used, you have to pay API fees.
- Another good argument is that you are not locked in to a specific vendor if you run your own hardware/software. Imagine that your system relies on ChatGPT: if OpenAI (or whichever company you depend on) shuts down, the system stops working, whereas a local system keeps running.

## **What we are trying to build**

Using the example made by [Daniel Bourke](https://github.com/mrdbourke/simple-local-rag/tree/main), we are going to build a RAG pipeline to chat with a PDF document.

In our case, we are going to use an open source [nutrition textbook](https://pressbooks.oer.hawaii.edu/humannutrition2/).

We are going to write code for the following steps:


1. Open a PDF document
2. Format the text of the PDF textbook so it is ready for an embedding model. This process is called text splitting/chunking.
3. Embed all of the chunks of text in the textbook, turning them into a numerical representation which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.

We can classify the above steps in two major groups:
1. Document processing and embedding creation (steps 1-3)
2. Search and answer (steps 4-6)

<img src="https://github.com/otoledanosole/learning_local_rag/blob/main/images/Learning_Local_RAG.png?raw=true" alt="Retrieval Augmented Generation (RAG) Sequence Diagram" />
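
To make step 4 (vector search) more concrete, here is a minimal sketch of ranking chunks by similarity to a query. It rests on several assumptions: cosine similarity is used as the similarity measure (one common choice, not necessarily the one used later), the three-dimensional vectors are invented by hand, and real embeddings produced by a model have many more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made "embeddings" for three chunks of text.
chunk_embeddings = {
    "Proteins are made of amino acids.": [0.9, 0.1, 0.0],
    "Water is essential for hydration.": [0.1, 0.9, 0.1],
    "Vitamins are needed in small amounts.": [0.2, 0.2, 0.9],
}

# Pretend this is the embedding of the query "What are proteins made of?"
query_embedding = [0.8, 0.2, 0.1]

# Score every chunk against the query and keep the most similar one.
scores = {
    chunk: cosine_similarity(query_embedding, embedding)
    for chunk, embedding in chunk_embeddings.items()
}
best_chunk = max(scores, key=scores.get)
print(best_chunk)
```

The chunk whose vector points in the direction closest to the query vector is the one that would be passed on to the prompt in step 5.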


## **Key words**

**TOKEN**


>A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"]. A token can be a whole word, part of a word or a group of punctuation characters. As a rule of thumb, 1 token ~= 4 characters in English and 100 tokens ~= 75 words.<br>
Text gets broken into tokens before being passed to an LLM such as ChatGPT. We can find more information about what a token is here: [What are tokens and how to count them?](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)<br>
With the OpenAI [Tokenizer](https://platform.openai.com/tokenizer) tool, we can see how a text is converted into tokens:<br>
<img src="https://github.com/otoledanosole/learning_local_rag/blob/main/images/Tokenized_Hello_World.png?raw=true" alt="Tokenization of words"/>
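
The split shown in the definition can be reproduced with a small sketch. Note the assumptions: `naive_tokenize` is a hypothetical helper that simply separates words from punctuation with a regular expression, which is not how the tokenizers of real LLMs work (they learn sub-word vocabularies), and `estimate_token_count` only applies the "1 token ~= 4 characters" rule of thumb.

```python
import re

def naive_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def estimate_token_count(text: str) -> float:
    """Rough estimate based on the rule of thumb: 1 token ~= 4 characters."""
    return len(text) / 4

print(naive_tokenize("hello, world!"))        # ['hello', ',', 'world', '!']
print(estimate_token_count("hello, world!"))  # 3.25
```

The estimate is only a quick way to judge whether a piece of text is likely to fit within a model's token limit; an exact count requires the model's own tokenizer.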

## **Requirements and setup**

Because we are going to use Google Colab to run our RAG, we need to do some initial configurations:

In [6]:
import os                                                 # Standard library module for interacting with the operating system

if "COLAB_GPU" in os.environ:
    print("[INFO] Running in Google Colab, installing requirements.")
    !pip install -U torch                                 # requires torch 2.1.1+ (for efficient sdpa implementation)
    !pip install PyMuPDF                                  # for reading PDFs with Python
    !pip install tqdm                                     # for progress bars
    !pip install sentence-transformers                    # for embedding models
    !pip install accelerate                               # for quantization model loading
    !pip install bitsandbytes                             # for quantizing models (less storage space)
    !pip install flash-attn --no-build-isolation          # for faster attention mechanism = faster LLM inference

[INFO] Running in Google Colab, installing requirements.
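
After installing, it can be useful to confirm that the packages are importable before moving on. The sketch below is one possible way to do it with the standard library; `find_missing_modules` is a hypothetical helper, and the list uses import names, which can differ from the pip names (for example, PyMuPDF is imported as `fitz`).

```python
import importlib.util

# Import names of the packages installed above.
required_modules = [
    "torch",
    "fitz",                   # PyMuPDF
    "tqdm",
    "sentence_transformers",  # sentence-transformers
    "accelerate",
    "bitsandbytes",
    "flash_attn",             # flash-attn
]

def find_missing_modules(module_names: list[str]) -> list[str]:
    """Return the names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing_modules(required_modules)
if missing:
    print(f"[INFO] Missing modules: {missing}")
else:
    print("[INFO] All required modules are available.")
```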


## **Document processing and embedding creation**

For the first part of the RAG creation we need:
- PDF document of choice
- Embedding model of choice

The steps that we are going to follow are:
1. Import PDF document
2. Process text for embedding (split into chunks of sentences)
3. Embed text chunks with embedding model
4. Save embeddings for later use
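
As a preview of step 2, splitting into chunks of sentences can be sketched as grouping a list of sentences into fixed-size groups. This is an illustration under assumptions: `split_list` is a hypothetical helper, the chunk size of 3 is arbitrary, and the sentences are placeholders rather than text from the PDF.

```python
def split_list(items: list[str], chunk_size: int) -> list[list[str]]:
    """Group items into consecutive lists of at most chunk_size elements."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# Placeholder sentences standing in for the sentences of one page.
sentences = [f"Sentence {n}." for n in range(1, 8)]

# Group the sentences and join each group back into one piece of text.
sentence_groups = split_list(sentences, chunk_size=3)
chunks = [" ".join(group) for group in sentence_groups]

for chunk in chunks:
    print(chunk)
```

Each resulting chunk is what would be passed to the embedding model in step 3. The last chunk can be shorter than the others when the number of sentences is not a multiple of the chunk size.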

### Import PDF document

We are going to start with a PDF, but this will work with different kinds of documents like text files, email chains, support documentation, articles from the internet, etc.

We are going to use PyMuPDF as the library to open the PDFs.

First we'll download the PDF if it doesn't exist.

In [7]:
import os
import requests

# Path where the PDF will be stored
pdf_path = "Human-Nutrition-2020-Edition.pdf"

# Download the PDF only if it is not already on disk
if not os.path.exists(pdf_path):
  print("[INFO] File doesn't exist, downloading...")

  # URL of the PDF. This has to be the DOWNLOAD URL; otherwise we will not get the PDF
  url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

  # The local file name to save the downloaded file (same as the PDF path)
  filename = pdf_path

  # Send a GET request to the URL with the requests library
  response = requests.get(url)

  # Check if the request was successful
  # Response status codes are listed at: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
  if response.status_code == 200:  # 200 = OK: the resource has been fetched and transmitted in the message body
    # Open the file and save it
    with open(filename, "wb") as file:
      file.write(response.content)
    print(f"[INFO] The file has been downloaded and saved as {filename}")
  else:
    print(f"[INFO] Failed to download the file. Status code: {response.status_code}")

else:
  print(f"[INFO] The file {pdf_path} already exists")


[INFO] The file Human-Nutrition-2020-Edition.pdf already exists


PDF Downloaded!
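
A status code of 200 only tells us that the server answered; if the URL is not the download URL, the saved file could be an HTML page instead of a PDF. As an optional sanity check, the sketch below looks at the first bytes of a file, since PDF files normally begin with `%PDF-`. `looks_like_pdf` is a hypothetical helper, and the two temporary files are created only to demonstrate it.

```python
import os
import tempfile

def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the usual PDF header bytes."""
    with open(path, "rb") as file:
        return file.read(5) == b"%PDF-"

# Demonstration with two small temporary files instead of a real download.
with tempfile.TemporaryDirectory() as tmp_dir:
    pdf_like_path = os.path.join(tmp_dir, "pdf_like.pdf")
    html_path = os.path.join(tmp_dir, "html_page.pdf")

    with open(pdf_like_path, "wb") as file:
        file.write(b"%PDF-1.7\n")                    # starts like a PDF
    with open(html_path, "wb") as file:
        file.write(b"<html><body>Error</body></html>")  # a web page saved by mistake

    results = (looks_like_pdf(pdf_like_path), looks_like_pdf(html_path))

print(results)  # (True, False)
```

In the notebook, the same check could be run as `looks_like_pdf(pdf_path)` once the download has finished.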

Now we need to open it. We are going to use [PyMuPDF](https://github.com/pymupdf/PyMuPDF) (`import fitz`) to open and read our PDF document.

We are going to build a small function to read the PDF. Keep in mind that not every document reads the same way, so the extracted text may need different formatting steps.

We'll save each page to a dictionary and then append that dictionary to a list for ease of use later.

In [15]:
import fitz # Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
from tqdm.auto import tqdm # pip install tqdm

# We are going to define helper functions

def text_formatter(text: str) -> str:
  r"""
  Performs minor formatting on the text.

  When you import text, it is important to check its format because it may not match the source word for word.
  That is why it is important to write some text preprocessing steps.
  In our case with the PDF, we replace new lines ("\n") with a space (" ")
  and strip the whitespace at both ends (.strip()).

  It is important to check and test the necessary formatting for each text,
  so we may need more text formatting functions.
  Better formatting = better text = better and more accurate responses from the LLM.

  Parameters:
    text (str): string of text

  Returns:
    cleaned_text (str): the same text with the formatting applied
  """
  cleaned_text = text.replace("\n", " ").strip()
  return cleaned_text


def open_and_read_pdf(pdf_path: str) -> list[dict]:
  """
  Opens a PDF file, reads its text page by page and collects statistics.
  Only the text is extracted; images and other elements are ignored.

  Parameters:
    pdf_path (str): The file path to the PDF document to be opened and read.

  Returns:
    list[dict]: A list of dictionaries, each containing the (adjusted) page number,
    character count, word count, sentence count, token count, and the extracted text
    for each page.
  """
  doc = fitz.open(pdf_path)  # Open the PDF with PyMuPDF
  pages_and_texts = []  # One dictionary per page will be appended to this list

  for page_number, page in tqdm(enumerate(doc)):  # tqdm shows a progress bar
      text = page.get_text()  # Get the text from the page
      text = text_formatter(text)  # Format the text
      pages_and_texts.append({"page_number": page_number - 41,  # In this PDF the numbered pages start on PDF page 43, so we shift the index. Keep in mind that the offset is different for other documents
                              "page_char_count": len(text),  # How many characters the text has
                              "page_word_count": len(text.split(" ")),  # How many words the text has (roughly)
                              "page_sentence_count_raw": len(text.split(". ")),  # How many sentences the text has (roughly)
                              "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                              "text": text})

  return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2] #We get the first two samples

0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
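
Notice that the empty page reports a word count of 1. This comes from how `str.split(" ")` behaves: splitting an empty string on a single space returns a list with one empty string, and consecutive spaces produce empty strings too. The sketch below compares it with `str.split()` without arguments, which splits on runs of whitespace. The two function names are hypothetical and only serve the comparison.

```python
def count_words_single_space(text: str) -> int:
    """Word count as computed above: split on every single space character."""
    return len(text.split(" "))

def count_words_any_whitespace(text: str) -> int:
    """Alternative: split on runs of whitespace, which drops empty strings."""
    return len(text.split())

samples = ["", "Human Nutrition: 2020 Edition", "Image by  Allison  Calabrese"]
for sample in samples:
    print(repr(sample), count_words_single_space(sample), count_words_any_whitespace(sample))
# '' 1 0
# 'Human Nutrition: 2020 Edition' 4 4
# 'Image by  Allison  Calabrese' 6 4
```

Since the counts are only rough statistics, either version is usable; the second one is simply less sensitive to double spaces, which can appear after new lines are replaced with spaces.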

In [17]:
import random #Import random library

random.sample(pages_and_texts, k=3) #We obtain 3 random pages from our list of dictionaries

[{'page_number': 713,
  'page_char_count': 1499,
  'page_word_count': 282,
  'page_sentence_count_raw': 14,
  'page_token_count': 374.75,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0  been sufficient scientific research into the particular  nutritional requirements for infants. Consequently, all of the  DRI values for infants are AIs derived from nutrient values in  human breast milk. For older babies and children, AI values are  derived from human milk coupled with data on adults. The AI is  meant for a healthy target group and is not meant to be  sufficient for certain at-risk groups, such as premature infants.  2. Tolerable Upper Intake Levels. The UL was established to help  distinguish healthful and harmful nutrient intakes. Developed  in part as a response to the growing usage of dietary  supplements, ULs indicate the highest level of continuous  intake of a particular nutrient that may be taken without  causing health problems. When a nutrient does not have any  known is

## Sources

https://arxiv.org/abs/2005.11401#

https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/

https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/