<a href="https://colab.research.google.com/github/otoledanosole/learning_local_rag/blob/main/local_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Creation of a Local Retrieval Augmented Generation (RAG)**

Objective of the document

## **Example of what we are trying to build**

<img src="https://github.com/otoledanosole/learning_local_rag/blob/main/images/Nvidia_rag-pipeline-ingest-query-flow-b-2048x960.png?raw=true" alt="Retrieval Augmented Generation (RAG) Sequence Diagram from Nvidia" />

We are going to use the post ["RAG 101: Demystifying Retrieval Augmented Generation Pipelines"](https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/) from the Nvidia Technical Blog as a reference for what we are going to build, except that we are not going to use an existing framework.

Instead of using LangChain or LlamaIndex as they do in the example, we are going to build our own framework in order to really understand what RAG is.


## **What is RAG?**

RAG stands for Retrieval Augmented Generation.

RAG is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.

It was first introduced in a 2020 paper called [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401#).

To understand it with a real-world example, imagine a psychologist:
>A psychologist can create strategies or apply treatments to their patients based on a general understanding of different psychology techniques, but sometimes a patient requires special expertise, so the psychologist sends clerks to a library that stores specific studies they can use.

Like a good psychologist, large language models (LLMs) can respond to a wide variety of queries. But if we look under the hood of an LLM, we find a neural network with parameters that essentially represent the general patterns of how humans use words to form sentences. Some pre-trained LLMs have been shown to store factual knowledge in their parameters, working as a parameterized implicit knowledge base.

While this is interesting, such models have downsides: the information they are based on is difficult to update or revise, they may produce "hallucinations", and they cannot provide in-depth knowledge of a specific topic. Hybrid models address these problems by combining parametric memory (the model's weights) with non-parametric memory (an external store that can be searched), and RAG is one such approach.

The goal of RAG is to take information from a source, pre-process it and pass it to an LLM so it can generate outputs based on that information.


>To understand what retrieval-augmented generation means, we can roughly break it down into three steps:
* Retrieval - Find relevant information given a query (the user's question, usually sent as a prompt). For example: "What pushes a person to act against his beliefs?" → retrieves the information related to this topic from a given source, for example the book "Social Psychology" by Solomon Asch.
* Augmented - Take the relevant information and augment our input to an LLM with it.
* Generation - Pass the result of the first two steps to an LLM to generate an answer.
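
The three steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the function names (`retrieve`, `augment`, `generate`) and the sample passages are hypothetical, retrieval is done by simple word overlap instead of the vector search used later, and `generate` is a placeholder where a real LLM call would go.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Augmented: place the retrieved passages in front of the query."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Generation: hypothetical stub standing in for a call to a real LLM."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"


# Made-up passages standing in for an external source such as a book.
passages = [
    "Group pressure can push a person to conform against their own beliefs.",
    "Vitamin C is found in citrus fruits.",
    "The capital of France is Paris.",
]

query = "What pushes a person to act against his beliefs?"
context = retrieve(query, passages)
prompt = augment(query, context)
answer = generate(prompt)

print(prompt)
print(answer)
```

The important part is the shape of the flow: the LLM never sees the whole source, only the query plus the few passages that retrieval selected.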






## **Benefits of RAG**

The main advantages of using RAG are:
- Empowering LLM solutions with real-time data access
- Preserving data privacy
- Mitigating LLM hallucinations

**Empowering LLM solutions with real-time data access**
>Generally, LLMs trained on "internet" data have a fairly good understanding of language in general. However, this also means that their responses may be too generic for some applications.
Because data is constantly changing, RAG gives the LLM direct access to the resources it answers from. These resources can consist of real-time and personalized data.
RAG helps to create specific responses based on specific data.

**Preserving data privacy**
>One of the main benefits of using RAG is privacy. When the LLM is self-hosted, sensitive data is not exposed to third parties.

**Mitigating LLM hallucinations**
>LLMs are good at generating *good-looking* text; however, that doesn't mean the text is factual. A hallucination is when an LLM provides faulty responses in such a convincing way that they sound real.
RAG helps mitigate hallucinations by grounding the answer in the retrieved information.



## **Where can we use RAG?**

The main use of RAG is to take your relevant documents and turn them into part of a prompt to be used by an LLM.

In fact, almost any business can turn its internal data into resources called knowledge bases and enhance LLMs.

Some uses could be:
- Email data analysis: Imagine that you work for a company that receives tons of emails and you need to search for a specific topic or conversation. You can use RAG to retrieve the relevant information from the emails and then use an LLM to generate a response.
- Company internal knowledge base: Another use could be to create an internal knowledge base. Imagine that you work for a software company and you find a bug in a system. You can write a document on how to fix it and add it to a RAG system so your colleagues can use an LLM to search for answers.
- Textbook reading: Imagine that you need to read a book about some topic. You can use RAG to go through the book and extract the relevant information.

## **Why do you want to run it locally?**

Let's go through a few arguments for why it is a good idea to run it locally:
- The first one is data privacy. If you are using sensitive or private data, you may not want to send it over the internet to another company.
- Another benefit is speed. You don't depend on network availability, and you can use large amounts of data without having to send them through an API.
- We must also look at the cost. Running it locally may need a big initial investment, but in the long run the investment pays for itself. If an external service is used, you have to pay API fees.
- Another good argument is that you are not locked in to a specific vendor if you run your own hardware/software. Imagine that your system depends on ChatGPT: if OpenAI (or whichever company you rely on) shuts down, your system stops working, whereas a system you run yourself keeps running.

## **What we are trying to build**

Using the example made by [Daniel Bourke](https://github.com/mrdbourke/simple-local-rag/tree/main), we are going to build a RAG pipeline to chat with a PDF document.

In our case, we are going to use an open source [nutrition textbook](https://pressbooks.oer.hawaii.edu/humannutrition2/).

We are going to write code for the following steps:


1. Open a PDF document.
2. Format the text of the PDF textbook so it is ready for an embedding model. This process is called text splitting/chunking.
3. Embed all of the chunks of text in the textbook, turning them into numerical representations which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.

We can classify the above steps in two major groups:
1. Document processing and embedding creation (steps 1-3)
2. Search and answer (steps 4-6)
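
To make the two groups concrete, here is a minimal sketch of how they could be laid out in code. It rests on several assumptions: all function names are hypothetical, a short hard-coded string stands in for the text extracted from the PDF (step 1), and a plain word-count vector stands in for a real embedding model (step 3). It is meant to show the shape of the pipeline, not the final implementation.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, sentences_per_chunk: int = 2) -> list[str]:
    """Step 2: split text into sentences and group them into chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def embed(text: str) -> Counter:
    """Step 3 (toy stand-in): word counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(text: str) -> list[tuple[str, Counter]]:
    """Group 1: document processing and embedding creation."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(text)]


def search(query: str, index: list[tuple[str, Counter]], top_k: int = 1) -> list[str]:
    """Step 4: return the chunks whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 5: combine the retrieved chunks and the query into one prompt."""
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up text standing in for the content extracted from the PDF (step 1).
text = (
    "Proteins are made of amino acids. They help build and repair tissue. "
    "Water makes up most of the human body. Dehydration impairs physical performance."
)

index = build_index(text)                    # group 1, run once per document
query = "What are proteins made of?"
results = search(query, index)               # group 2, run once per query
prompt = build_prompt(query, results)        # step 6 would send this to an LLM
print(prompt)
```

The split matters in practice: the index is built once and stored, while search and prompt creation run again for every new query.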

<img src="https://github.com/otoledanosole/learning_local_rag/blob/main/images/Learning_Local_RAG.png?raw=true" alt="Retrieval Augmented Generation (RAG) Sequence Diagram" />


## Sources

https://arxiv.org/abs/2005.11401#

https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/

https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/