<a href="https://colab.research.google.com/github/phamnguyenlongvu/LLMs/blob/main/LLM_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction:
RAG (Retrieval-Augmented Generation) allows us to ask questions about our own documents, which were not included in the training data, without fine-tuning the LLM. Given a question, a RAG system first runs a retrieval step that fetches relevant documents from a vector database in which those documents were indexed, then passes them to the LLM as context for generating the answer.

In [None]:
!pip install langchain # Framework designed to simplify the creation of applications using LLMs
!pip install chromadb # Vector database
!pip install sentence_transformers
!pip install pypdf
!pip install huggingface_hub
!pip install transformers
!pip install accelerate
!pip install bitsandbytes # 8-bit and 4-bit quantization
!pip install langchain_community
!pip install -U "huggingface_hub[cli]"

### What is a Retrieval Augmented Generation (RAG)?
LLMs have proven their ability to understand context and provide accurate answers across various NLP tasks, such as summarization, Q&A, and text generation. While they give very good answers to questions about information they were trained on, they tend to hallucinate when the topic involves information they do "not know", that is, information not included in their training data. RAG addresses this by combining external resources with LLMs. The two main components of RAG are a retriever and a generator.

The retriever encodes our data so that the relevant parts can be easily retrieved when querying it. The encoding is done using text embeddings, produced by a model trained to create a vector representation of the information. A common choice for implementing a retriever is a vector database. There are multiple options, such as ChromaDB, FAISS, and Pinecone.

For the generator, the obvious option is an LLM.
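The retrieve-then-generate flow can be sketched without any model at all. The snippet below is a toy illustration, not the pipeline built later in this notebook: `retrieve` ranks documents by shared words (a crude stand-in for embedding similarity), and `generate` is a hypothetical placeholder for the LLM call. All function names and sample documents are assumptions made for the example.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    """Rank documents by the number of words shared with the query
    (a crude stand-in for embedding similarity) and return the top k."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

def build_prompt(query, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Hypothetical placeholder: a real system would call the LLM here."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"

docs = [
    "Chroma is a vector database for storing embeddings.",
    "Quantization lowers the precision of model weights.",
    "Paris is the capital of France.",
]
query = "What is the capital of France?"
top_docs = retrieve(query, docs, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
print(generate(prompt))
```

The key point is that the LLM never sees the whole corpus, only the few passages the retriever selected for this question.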

In [None]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time

from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

### Log in to Hugging Face

In [None]:
!huggingface-cli login

### Initialize model, tokenizer, quantization and query pipeline
Define bitsandbytes configuration.

Quantization is a compression technique that involves mapping high-precision values to lower-precision ones. For an LLM, that means reducing the precision of its weights and activations, making it less memory intensive. This does affect the capabilities of the model, including its accuracy.

Instead of using high-precision data types, such as 32-bit floating-point numbers, quantization represents values using lower-precision data types, such as 8-bit integers. This significantly reduces memory usage and can speed up model execution while maintaining acceptable accuracy.
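As a rough illustration of the mapping described above, here is a minimal sketch of symmetric "absmax" quantization to 8-bit integers in plain Python. This is a simplified scheme chosen for clarity; it is not what bitsandbytes does internally for NF4, and the function names and sample weights are made up for the example.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers by scaling with the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0  # guard against all-zero input
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.53, 0.98, -1.27, 0.0]   # made-up sample weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print(f"scale: {scale:.5f}, max round-trip error: {max_error:.5f}")
```

Each weight is now stored as one small integer plus a shared scale, and the round-trip error is bounded by half the scale. That bound is the accuracy cost mentioned above.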

- Load a model in 4-bit or 8-bit quantization (check the result with print(model.get_memory_footprint()))

- Changing the compute data type: bnb_4bit_compute_dtype = torch.bfloat16

- Using the NF4 data type: designed for weights initialized from a normal distribution

- Nested quantization: bnb_4bit_use_double_quant = True

- Offloading between CPU and GPU: llm_int8_enable_fp32_cpu_offload=True
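To get a feel for why these options matter, the back-of-the-envelope estimate below computes the memory needed for the weights alone at different precisions. It assumes a round figure of 7 billion parameters and ignores activations, the KV cache, and quantization constants, so the real footprint reported by the model will differ somewhat. The helper name is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory for the weights only, in GiB (1 GiB = 1024**3 bytes)."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 7_000_000_000  # assumption: round figure for a "7B" model

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gib(NUM_PARAMS, bits):6.2f} GiB")
```

Going from 16-bit to 4-bit weights cuts this estimate by a factor of four, which is what makes a 7B model fit on a single consumer GPU.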



In [None]:
model_id = "meta-llama/Llama-2-7b-chat-hf"

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

bnb_config = transformers.BitsAndBytesConfig(
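    load_in_4bit=True, # Load weights in 4-bit: roughly 4x less memory than 16-bit weights
    bnb_4bit_quant_type='nf4', # NF4 data type, designed for normally distributed weights
    bnb_4bit_use_double_quant=True, # Nested quantization: also quantize the quantization constants
    bnb_4bit_compute_dtype=bfloat16 # Data type used for computation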
)

print(device)

### Define model and tokenizer

In [None]:
time_start = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    trust_remote_code=True,
    max_new_tokens=1024
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_end = time()
print(f"Prepare model, tokenizer: {round(time_end-time_start, 3)} sec.")

### Define the query pipeline

In [None]:
time_start = time()
query_pipeline = transformers.pipeline(
        "text-generation", # Task for pipeline
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        max_length=1024,
        device_map="auto",)
time_end = time()
print(f"Prepare pipeline: {round(time_end-time_start, 3)} sec.")

In [None]:
def test_model(tokenizer, pipeline, message):
    time_start = time()
    sequences = pipeline(
        message,
        do_sample=True, # Enable sampling for more varied output
        top_k=10, # Limit the sampling pool to the K most probable tokens at each step
        num_return_sequences=1, # Number of output sequences to generate for each input
        eos_token_id=tokenizer.eos_token_id,
        max_length=200) # Maximum number of tokens in the sequence (prompt plus generated text)
    time_end = time()
    total_time = f"{round(time_end-time_start, 3)} sec."

    question = sequences[0]['generated_text'][:len(message)]
    answer = sequences[0]['generated_text'][len(message):]

    return f"Question: {question}\nAnswer: {answer}\nTotal time: {total_time}"

In [None]:
from IPython.display import display, Markdown
def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [None]:
response = test_model(tokenizer,
                      query_pipeline,
                      "Please explain what the EU AI Act is.")
display(Markdown(colorize_text(response)))

### Hugging Face Pipeline

In [None]:
llm = HuggingFacePipeline(pipeline=query_pipeline)


time_start = time()
question = "Please explain what EU AI Act is."
response = llm(prompt=question)
time_end = time()
total_time = f"{round(time_end-time_start, 3)} sec."
full_response =  f"Question: {question}\nAnswer: {response}\nTotal time: {total_time}"
display(Markdown(colorize_text(full_response)))

### Load document

In [None]:
loader = PyPDFLoader("/content/aiact_final_draft.pdf")
documents = loader.load()

### Split data in chunks
After loading the data, the next step in the indexing pipeline is splitting the documents into manageable chunks, because:
- Ease of search: large chunks of data are harder to search over.
- Context window size: LLMs allow only a finite number of tokens in prompts and completions.

Chunking strategies depend on:
- Nature of content
- Embedding model being used
- Expected length and complexity of user queries
- Application Specific Requirements

Levels of text splitting:
- Character Splitting: Simply dividing the text into chunks of N characters. It is easy, but it doesn't take the structure of the document into account.
- Recursive Character Text Splitter: We specify a series of separators, tried in order, which will be used to split our docs, so that chunks break at paragraph or line boundaries where possible.
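To make chunk size and overlap concrete, here is a minimal sketch of the simpler of the two levels, fixed-size character splitting with overlap. It is not the algorithm used by RecursiveCharacterTextSplitter, which additionally tries to break on separators. The function name and sample sentence are made up for illustration.

```python
def split_by_characters(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, where each window
    starts chunk_size - chunk_overlap characters after the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "Retrieval augmented generation splits documents into chunks before indexing them."
chunks = split_by_characters(sample, chunk_size=30, chunk_overlap=10)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The overlap means the last characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole in at least one chunk. The cost is that words are split arbitrarily, which is the weakness the recursive splitter tries to avoid.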



In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
all_splits = text_splitter.split_documents(documents)

### Embedding

In [None]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

### Vector databases:
Vector databases store high-dimensional vectors, which can have hundreds of dimensions, and support indexing and similarity search over them (vector index, vector search):

- Chroma: An open-source embedding database designed for storing and searching embeddings. Despite the name, it is not specific to color data.
- Milvus
- Weaviate
- FAISS

The choice between ChromaDB and FAISS depends on the nature of the data and the specific requirements of the application.
- Persistent document store with metadata and a simple API -> ChromaDB
- General-purpose library for similarity search on large-scale vector data -> FAISS
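At its core, a vector search compares a query vector with stored vectors and returns the closest ones. The sketch below shows that idea with a hypothetical brute-force store using cosine similarity. Real vector databases add approximate indexes, persistence, and metadata filtering on top of this. The class name and the three-dimensional vectors are invented for the example; real embeddings have hundreds of dimensions.

```python
import math

class TinyVectorStore:
    """Hypothetical brute-force store: keeps (vector, text) pairs and
    scans all of them for every query."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vector, k=2):
        """Return the k (score, text) pairs with the highest cosine similarity."""
        scored = [(self._cosine(query_vector, v), t) for v, t in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "doc about cats")
store.add([0.9, 0.1, 0.0], "doc about kittens")
store.add([0.0, 0.0, 1.0], "doc about tax law")

results = store.search([1.0, 0.05, 0.0], k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Scanning every vector is fine for a few thousand chunks. Dedicated indexes matter once the collection grows to the point where a full scan per query is too slow.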

In [None]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

### Define retriever

In [None]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

In [None]:
def test_rag(qa, query):

    time_start = time()
    doc = vectordb.similarity_search(query)
    print(f"Query: {query}")
    print(f"Retrieved documents: {len(doc)}")

    response = qa.run(query)
    time_end = time()
    total_time = f"{round(time_end-time_start, 3)} sec."

    full_response =  f"Question: {query}\nAnswer: {response}\nTotal time: {total_time}"
    display(Markdown(colorize_text(full_response)))

In [None]:
query = "How is the testing of high-risk AI systems in real world conditions performed?"
test_rag(qa, query)

In [None]:
query = "What are the operational obligations of notified bodies?"
test_rag(qa, query)

In [None]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # Document objects expose metadata and page_content directly
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")