# **Retrieval Augmented Generation (RAG)**

In this hands-on tutorial, we will explore Retrieval Augmented Generation (RAG) and apply it to the [IDRIS Documentation](http://www.idris.fr/) database. RAG combines information retrieval from a database with large language models (LLMs): the retrieved passages are added to the prompt so that the model can ground its answer in them.

We will leverage the [Langchain](https://www.langchain.com/) library to implement RAG and demonstrate its capabilities in the context of the IDRIS Documentation. By retrieving relevant information from a database, we can provide additional context and knowledge to the LLM, enabling it to generate more accurate and informative responses.

Let's dive in and see how retrieved context changes the answers an LLM produces.

![image](./images/rag.jpg)

In [None]:
from pathlib import Path
import re
import os
import datasets
from collections import Counter
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification
import torch
from tqdm.notebook import tqdm
import random
from utils import seed_everything, clean_idris_doc
import numpy as np

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import BSHTMLLoader, DirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

DSDIR = Path(os.environ['DSDIR'])
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
seed_everything(53)

## **Create a Vector Database**
In this section, we will create a Vector Database (VDB) from the IDRIS documentation. The IDRIS documentation consists of HTML files that were scraped from the IDRIS website using the `urls_to_html.py` script in the [Use-Case repository](https://github.com/idriscnrs/SpeLLM-Use-Case). The VDB will serve as the foundation for the Retrieval Augmented Generation (RAG) technique. The VDB will store vector representations of the documents, allowing us to efficiently retrieve relevant information during the generation phase.

In [None]:
DOC_HTML_PATH = DSDIR / "idris_doc_html"

Let's explore the IDRIS documentation database:

In [None]:
for path in DOC_HTML_PATH.iterdir():
    print(path.name)

In [None]:
for path in (DOC_HTML_PATH / "jean-zay/gpu").iterdir():
    print(path.name)

In [None]:
(DOC_HTML_PATH / "support_avance.html").read_text()

Feel free to explore the database in more detail:

We can use the [`BSHTMLLoader`](https://python.langchain.com/docs/modules/data_connection/document_loaders/html#loading-html-with-beautifulsoup4) class to load an HTML file into a langchain document. This class automatically extracts the text from the HTML file using BeautifulSoup.

In [None]:
loader = BSHTMLLoader(DOC_HTML_PATH / "support_avance.html")
doc = loader.load()[0]
print(doc.page_content)
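
To make the text-extraction step more concrete, here is a minimal sketch that pulls the text nodes out of an HTML string using only the Python standard library. It is an illustration of the idea, not how `BSHTMLLoader` is implemented; the names `TextExtractor` and `html_to_text` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside <script>/<style>
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical page, only for illustration
sample_html = (
    "<html><head><title>Support</title>"
    "<style>p {color: red}</style></head>"
    "<body><h1>Advanced support</h1>"
    "<p>Contact the <b>IDRIS</b> team.</p></body></html>"
)
print(html_to_text(sample_html))
```

Real pages also contain menus, footers, and other navigation text, which is why a cleaning step is still needed afterwards.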

We can use the `clean_idris_doc` function (a set of regex substitutions) to remove the artifacts left over from the IDRIS webpage:

In [None]:
print(clean_idris_doc(doc.page_content))

The [`DirectoryLoader`](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) class from langchain can iterate over a directory and load the files using a specific method. We will use it to load every HTML file from our database with the `BSHTMLLoader`.

In [None]:
loader = DirectoryLoader(
    DOC_HTML_PATH, glob="**/*.html", loader_cls=BSHTMLLoader
)
docs = loader.load()
len(docs)

In [None]:
print(clean_idris_doc(docs[0].page_content))

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span>  Create a function that loads every HTML file into Langchain documents. The content of each document should be extracted from the HTML file and cleaned using the `clean_idris_doc` function. The function should return a list of Langchain documents.

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**<br>
~1-2min

In [None]:
docs = create_docs(DOC_HTML_PATH)
print(docs[0].page_content)

<hr style="border:1px solid red">

### **Split documents into chunks**

To make it easier to retrieve relevant information, we need to split the documents into smaller chunks. This allows us to have more granular control over the retrieval process and ensures that the generated responses are more accurate and informative. By breaking down the documents into smaller units, we can effectively match the user's query with the most relevant chunks of information. This step is crucial in optimizing the retrieval process and enhancing the overall performance of the system.

![image](./images/rag_split.jpg)

For that, we will use [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter) of Langchain.
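
The idea behind recursive splitting can be sketched in a few lines of plain Python: try the coarsest separator first, merge neighbouring pieces while they fit in the chunk size, and fall back to finer separators only for pieces that are still too long. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and the sketch ignores chunk overlap and regex separators.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Simplified sketch: no overlap, no regex separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer_separators = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: retry with finer separators
            chunks.extend(recursive_split(part, chunk_size, finer_separators))
    if current:
        chunks.append(current)
    return chunks


example = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer and needs splitting.\n\n"
    "End."
)
for chunk in recursive_split(example, chunk_size=30):
    print(repr(chunk))
```

In this example the short paragraphs stay whole, and only the long one is cut at word boundaries.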

In [None]:
doc_example = docs[1]

In [None]:
print(doc_example.page_content)

In [None]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=10,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,  # Needed so that r"(?<=\. )" is interpreted as a regex
)
splitted_docs = splitter.split_documents([doc_example])

for doc in splitted_docs:
    print(doc.page_content)
    print("#"*20)

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Experiment with different combinations to optimize the future retrieval process. Discuss the results and seek guidance from an instructor to further enhance the performance of the system.

<hr style="border:1px solid red"> 

For the rest of the hands-on, we are going to use the following configuration:

In [None]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=30,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,  # Needed so that r"(?<=\. )" is interpreted as a regex
)
splitted_docs = splitter.split_documents(docs)

### **Create a vector database from split documents**

![image](./images/index.jpg)

In [None]:
HF_MODELS_PATH = DSDIR / "HuggingFace_Models"
EMBEDDING_PATH = HF_MODELS_PATH / "intfloat/multilingual-e5-large"
VDB_PATH = Path("./chroma_vdb")

First, we will load an embedding model. We chose [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) because we need an embedding model that was trained on French and English. We will use the [`HuggingFaceEmbeddings`](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceEmbeddings.html) from langchain, as it is easier to integrate with the langchain vector database.

In [None]:
embedding = HuggingFaceEmbeddings(
    model_name=str(EMBEDDING_PATH),  # Does not accept Path
    model_kwargs={"device": "cuda"},
)

Let's test the embedding by calculating the dot product of different vector embeddings for different sentences.

In [None]:
sentence1 = "This is a cat."
sentence2 = "This is a dog."
sentence3 = "I like train."
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
print(np.dot(embedding1, embedding2))
print(np.dot(embedding1, embedding3))
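
A dot product is only comparable across sentence pairs when the vectors have the same length (norm). If the embeddings are not unit-normalized, cosine similarity is the safer measure because it depends only on the direction of the vectors. The sketch below uses small made-up vectors rather than real embeddings to show the difference; the helper names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Toy vectors, not real embeddings
short = [1.0, 0.0]
long_same_direction = [10.0, 0.0]
other_direction = [0.6, 0.8]

# The dot product grows with the vector norm...
print(dot(short, long_same_direction))
# ...while cosine similarity only reflects the angle between the vectors
print(cosine_similarity(short, long_same_direction))
print(cosine_similarity(short, other_direction))
```

The same function can be applied to `embedding1`, `embedding2`, and `embedding3` from the cell above to compare the two measures on real embeddings.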

Feel free to experiment more with the embedding model and explore its capabilities:

With the embedding model, we can create a vector database from the split documents. The database will be a [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) database, which is the easiest to use.

In [None]:
def create_vdb(docs, embedding, vdb_path):
    """Create a vector database from the documents"""

    if vdb_path.exists():
        if any(vdb_path.iterdir()):
            raise FileExistsError(
                f"Vector database directory {vdb_path} is not empty"
            )
    else:
        vdb_path.mkdir(parents=True)

    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=embedding,
        persist_directory=str(vdb_path),  # Does not accept Path
    )
    vectordb.persist()  # Save database to use it later

    print(f"vector database created in {vdb_path}")
    return vectordb

vectordb = create_vdb(splitted_docs, embedding, VDB_PATH)

We do not have to create the database every time we want to use it. We can simply reuse the Chroma vector database that we have already set up with the following command:

In [None]:
vectordb = Chroma(
    embedding_function=embedding,
    persist_directory=str(VDB_PATH)
)

## **Generation with retrieved information**

In this section, we will explore how to enhance the generation process of the LLM by incorporating retrieved information from the vector database. By leveraging the Retrieval Augmented Generation (RAG) technique, we can provide additional context and knowledge to the LLM, resulting in more accurate and informative responses.

Now that we have a vector database, we can retrieve relevant information based on the user's query and use it to enrich the input of the LLM.

First, let's try to query the database with a simple sentence to check the retrieved information:

![image](./images/retrieval.jpg)

In [None]:
query = "What are the available training ?"

In [None]:
docs = vectordb.similarity_search(query, k=6)
for doc in docs:
    print(doc.page_content)
    print("#"*20)

In the previous cell, we used the simplest way to retrieve information from the vector database: a plain similarity search. Other retrieval algorithms exist, such as [Maximal Marginal Relevance (MMR)](https://python.langchain.com/docs/modules/model_io/prompts/example_selector_types/mmr).

MMR is relevant in our case because we have both a French and an English version of the documentation. A plain similarity search can return the same passage twice, once in each language, which wastes space in the prompt.

MMR first fetches `fetch_k` candidates, then selects `k` of them by trading off relevance to the query against similarity to the chunks already selected. Near-duplicate chunks, such as two translations of the same passage that a multilingual embedding model tends to place close together, are therefore less likely to be selected together.

![image](./images/mmr.jpg)

In [None]:
docs = vectordb.max_marginal_relevance_search(query, k=6, fetch_k=10)
for doc in docs:
    print(doc.page_content)
    print("#"*20)
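
To see what MMR does, here is a minimal pure-Python sketch of the selection loop. At each step it picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to an already selected document. The function names and the toy vectors are assumptions made for this illustration; this is not the code LangChain or Chroma runs internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy: similarity to the closest already selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: documents 0 and 1 are near-duplicates (think of two
# translations of the same passage), document 2 covers something else.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.9, 0.10, 0.0],
    [0.9, 0.12, 0.0],
    [0.7, -0.30, 0.6],
]

print(mmr(query_vec, doc_vecs, k=2))                   # [0, 2]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the redundancy term disappears and the selection falls back to a plain similarity ranking, which keeps both near-duplicates.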

Let's see how the LLM will answer without using RAG:

In [None]:
# Initialize the model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    DSDIR / "HuggingFace_Models/microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Allow using code that was not written by HuggingFace
    attn_implementation="flash_attention_2"  # Optimize the model with Flash Attention
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(DSDIR / "HuggingFace_Models/microsoft/phi-2")

In [None]:
def generation(prompt, **gen_parameters):
    """Generate text from a prompt and print it."""
    model_inp = tokenizer(prompt, return_tensors="pt").to("cuda")
    # generate() runs successive forward passes, producing one new token per step (auto-regressive decoding)
    out = model.generate(**model_inp, **gen_parameters)  # Passes input_ids and attention_mask
    print(tokenizer.decode(out[0]))

In [None]:
generation(query, do_sample=False, max_new_tokens=50)

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Write a function named `rag_generation` that utilizes the Retrieval Augmented Generation (RAG) technique to generate answers to questions. The function should take into account the different elements we have explored previously. We advise you to use templates to format your prompt.

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
rag_generation(query, k=4, fetch_k=8, max_new_tokens=200)
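
The task above advises using a template to format the prompt. One possible way to do that, sketched with plain string formatting: the template wording, the `---` separator, and the `build_prompt` name are all assumptions chosen for illustration, and other phrasings may work better with a given model.

```python
# Hypothetical template: the wording is an assumption, not a recommended prompt
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question, passages):
    """Insert the retrieved passages and the question into the template."""
    context = "\n---\n".join(passage.strip() for passage in passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Made-up passages standing in for retrieved chunks
example_prompt = build_prompt(
    "What are the available training ?",
    ["First retrieved chunk. ", " Second retrieved chunk."],
)
print(example_prompt)
```

Ending the prompt with `Answer:` nudges a base model to continue with the answer, and keeping the passages visibly separated makes the printed prompt easier to inspect.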

<hr style="border:1px solid red"> 

## **Bonus: Re-ranking**
In the context of the vector database retrieval process, re-ranking is an additional step that can be performed to enhance the accuracy of the retrieved information. While the initial retrieval pass focuses on speed, re-ranking utilizes a slower but more accurate model to score the similarity between the queries and the retrieved information.

By applying re-ranking, we can refine the ranking of the retrieved information based on a more comprehensive analysis. This helps to ensure that the most relevant and informative results are presented to the user.

It is important to note that re-ranking is an optional step and its implementation depends on the specific requirements of the project. It can be particularly useful in scenarios where precision and accuracy are of utmost importance, such as in information retrieval systems or question-answering applications.

The combination of the initial retrieval pass and the re-ranking step provides a powerful approach to optimize the retrieval process and enhance the overall performance of the system.

![image](./images/rerank.jpg)

In this bonus section, we will see a simple way to use the [BGE reranker large model](https://huggingface.co/BAAI/bge-reranker-large) to improve the process we defined previously. Note that the model was trained on English and Chinese, so it is not well suited to re-ranking French passages.

In [None]:
# Import the reranker model and its tokenizer
rerank_tokenizer = AutoTokenizer.from_pretrained(DSDIR / "HuggingFace_Models/BAAI/bge-reranker-large")
rerank_model = AutoModelForSequenceClassification.from_pretrained(DSDIR / "HuggingFace_Models/BAAI/bge-reranker-large")
rerank_model.eval()

Let's try the reranker model on two basic examples:

In [None]:
pairs = [
    ["What are the available training ?", "To find out the dates of the sessions scheduled for these different training courses, consult the website: https://cours.idris.fr/"],
    ["What are the available training ?", "To use the A100 GPUs, you must first load the cpuarch/amd."]
]
with torch.no_grad():
    inputs = rerank_tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = rerank_model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
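
The printed scores are raw logits: they are unbounded, and a higher value means the passage is judged more relevant to the query. A minimal sketch of how such scores can be turned into a short, ordered list of passages is given below. The scores and passages are made up for the example, and the helper names are hypothetical; the sigmoid is only an optional way to map logits into the (0, 1) range for readability.

```python
import math


def sigmoid(x):
    """Map an unbounded logit to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def rerank(docs, scores, top_n):
    """Return the top_n docs ordered from highest to lowest score."""
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]


# Made-up passages and logits, for illustration only
passages = ["GPU loading instructions", "Training course schedule", "Unrelated page"]
logits = [0.3, 5.2, -8.1]

for passage, logit in zip(passages, logits):
    print(f"{passage}: {sigmoid(logit):.3f}")

print(rerank(passages, logits, top_n=2))
```

Ordering the kept passages by score, rather than by their original retrieval rank, places the most relevant context first in the prompt.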

Now we will define the retrieval process. We will simply use the function we defined previously and enhance it with reranking:

In [None]:
def rag_rerank_generation(query, k=10, rerank=4, **gen_parameters):
    """Retrieve k chunks, keep the `rerank` best after re-ranking, then generate and print the answer."""
    docs = vectordb.similarity_search(query, k=k)

    # Re-rank: score each (query, chunk) pair with the reranker model
    rerank_inp = [[query, doc.page_content] for doc in docs]
    with torch.no_grad():
        inputs = rerank_tokenizer(rerank_inp, padding=True, truncation=True, return_tensors='pt', max_length=512)
        scores = rerank_model(**inputs, return_dict=True).logits.view(-1, ).float()

    # Keep the best-scored chunks (never ask for more chunks than were retrieved)
    _, indices = scores.topk(min(rerank, len(docs)))

    # Concatenate the kept chunks, most relevant first
    retrieved_infos = " ".join(docs[idx].page_content for idx in indices.tolist())

    text_input = f"With the following information: {retrieved_infos}\nAnswer this question: {query}\nAnswer:"

    model_inp = tokenizer(text_input, return_tensors="pt").to("cuda")
    input_nb_tokens = model_inp['input_ids'].shape[1]

    # Pass input_ids and attention_mask to generate()
    out = model.generate(**model_inp, **gen_parameters)

    print(f"LLM input:\n{text_input}\n" + "#"*50)
    # Decode only the newly generated tokens, not the prompt
    print(f"LLM output:\n{tokenizer.decode(out[0][input_nb_tokens:])}")

In [None]:
rag_rerank_generation(query, k=10, rerank=4, max_new_tokens=53)