Question ? 

Architecture du modèle

Méthode d’entraînement

GPT BERT


Qualité et taille des données d’entraînement
Nombre de paramètres

quel sont les métrique de performance comment les mettre en place et les analyser ? 
Comment fine-tuning un model ? 
Banchmark MMLU, GLUE ou SuperGLUE ? 


AI agent

# RAG

## What is RAG?

RAG stands for Retrieval Augmented Generation.

![RAG Overview](./assets/images/rag_concepts.png)

It was introduced in the paper [*Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*](https://arxiv.org/abs/2005.11401).

Each step can be roughly broken down to:

* **Retrieval** - Seeking relevant information from a source given a query. For example, getting relevant passages of Wikipedia text from a database given a question.
* **Augmented** - Using the relevant retrieved information to modify an input to a generative model (e.g. an LLM).
* **Generation** - Generating an output given an input. For example, in the case of an LLM, generating a passage of text given an input prompt.


![RAG Pipeline](./assets/images/rag_pipeline.png)


## Why RAG?

The main goal of RAG is to improve the generation outptus of LLMs.

Two primary improvements can be seen as:
1. **Preventing hallucinations** 
2. **Work with custom data**

## Why RAG use for?

For example you could use RAG for:
* **Customer support Q&A chat** 
* **Email chain analysis**
* **Company internal documentation chat**
* **Textbook Q&A** 

## Terms

**Token** A piece of text

**Embedding** A learned numerical representation (vector) of a piece of data. 

**Embedding model** A model designed to accept input data and output a numerical representation.

**Similarity search/vector search** Similarity search/vector search aims to find two vectors which are close together in high-demensional space.

**Large Language Model (LLM)** A model which has been trained to numerically represent the patterns in text.

**LLM context window** The number of tokens a LLM can accept as input. For example, as of March 2024, GPT-4 has a default 
context window of 32k tokens. In a RAG pipeline, if a model has a larger context window, it can accept more reference items.

**Prompt** A common term for describing the input to a generative LLM.

Basic RAG steps : 
- Setup 
  - Set environment variables
  - Setup embedding model
  - Create vector database
  - Setup llm model
  - Define Prompt Template
- Pre Indexing
  - Load Documents
  - Cleaning & formatting
  - Metadata extraction
- Indexing
  - Chunking
  - Embedding
  - Persist data
  - Create retriever
- Retrieval
  - Retrieve documents 
  - Format documents
- Generation
  - Build prompt
  - Generate Response


## Lanscape 

![RAG advanced Overview](./assets/images/advanced_overview.png)

Architecture détaillée d'un système de *Retrieval-Augmented Generation* (RAG)  

### 1. **Query Translation (Traduction de la requête)**
Cette section vise à transformer la question initiale en une forme plus adaptée à la récupération d’informations.  
- **Techniques utilisées** :  
  - *Multi-query* : Décompose une question en plusieurs sous-questions.  
  - *RAG-Fusion* : Fusionne les résultats de différentes requêtes pour améliorer la qualité des réponses.  
  - *Décomposition* : Scinde une requête complexe en plusieurs étapes plus simples.  
  - *Step-back* : Reformule la question pour obtenir des résultats plus pertinents.  
  - *HyDE* (*Hypothetical Document Embeddings*) : Génère des documents hypothétiques basés sur la question pour améliorer la recherche.  

### 2. **Routing (Routage de la requête)**
Cette section détermine le meilleur moyen d’accéder aux informations demandées.  
- **Routage logique** :  
  - Un LLM choisit la base de données (*Relational DB*, *Vector DB*, *Graph DB*) en fonction de la requête.  
- **Routage sémantique** :  
  - La question est encodée (*embedding*) et comparée à différents modèles de prompts pour sélectionner le plus adapté.  

### 3. **Query Construction (Construction de la requête)**
Cette section traduit la question en une requête adaptée à la base de données utilisée.  
- **Relational DBs** : Convertit la requête en SQL via *Text-to-SQL*, notamment avec *PGVector*.  
- **GraphDBs** : Utilise *Text-to-Cypher* pour interagir avec des bases orientées graphe.  
- **VectorDBs** : Un *Self-query retriever* génère automatiquement des filtres de métadonnées pour optimiser la recherche.  

### 4. **Indexing (Indexation des documents)**
L’indexation permet d’organiser et de structurer les données pour améliorer la récupération d’informations.  
- **Optimisation des segments (Chunk Optimization)** : Divise les documents en segments optimaux en fonction des caractères, sections, et délimiteurs sémantiques (*Semantic Splitter*).  
- **Indexation multi-représentation** :  
  - Convertit les documents en unités plus compactes, comme des résumés (*Parent Document, Dense X*).  
- **Embeddings spécialisés** :  
  - Utilise des modèles avancés comme *ColBERT* pour un encodage sémantique amélioré.  
- **Indexation hiérarchique** :  
  - Implémente une organisation en arbre pour résumer les documents à différents niveaux d’abstraction (*RAPTOR*).  


### 5. **Retrieval (Récupération de l'information)**
Cette phase consiste à rechercher les documents les plus pertinents et à les affiner.  
- **Classement des documents** (*Ranking*) :  
  - Utilisation de *Re-Rank*, *RankGPT*, et *RAG-Fusion* pour trier et filtrer les documents selon leur pertinence.  
- **Affinement** (*Refinement*) :  
  - Compression des documents pour en extraire l’essentiel (via *CRAG*).  
- **Récupération active** (*Active Retrieval*) :  
  - Recherche supplémentaire dans de nouvelles sources si les résultats initiaux sont jugés insuffisants.  

### 6. **Generation (Génération de réponse)**
Après la récupération des documents pertinents, un modèle génératif produit la réponse finale.  
- **Self-RAG, RRR (Response Refinement & Retrieval)** :  
  - Ajuste la qualité des réponses en réécrivant la question ou en récupérant d’autres documents.  

Cette architecture optimise la pertinence et la qualité des réponses en combinant plusieurs stratégies de recherche et de traitement des données.



## LLM MODEL

| Entreprise | Modèle         | Nombre de paramètres | Fenêtre de contexte | Licence      | Remarques                                                                      |
|------------|----------------|----------------------|---------------------|--------------|--------------------------------------------------------------------------------|
| **Mistral**| Mistral 7B     | 7 milliards          | Jusqu’à 32 k tokens | Open source  | Conçu pour offrir de bonnes performances tout en restant accessible en ressources.|

### Ollama commands

```bash
ollama pull mistral:7b 
```

```bash
ollama run mistral:7b
```

In [None]:
import subprocess

def pull_model(model):
    print(f"--- Pulling {model} model ---")
    subprocess.run(["ollama", "pull", model], check=True)
    print(f"Model {model} pulled successfully.")

In [5]:
model="mistral:7b"
pull_model(model)

--- Pulling mistral:7b model ---
Model mistral:7b pulled successfully.


## DEMOS

In [None]:
%pip install ollama
%pip install chromadb
%pip install langchain
%pip install langchain-core
%pip install langchain_community

In [None]:
import time
from ollama import chat
from langchain.llms import Ollama
from langchain.prompts import PromptTemplate
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

#### SETUP ####

def load_embedding_model(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """
    Load the embedding model for text vectorization.
    :param model_name: Name of the SentenceTransformer model to load.
    :return: Loaded embedding model instance.
    """
    print(f"[INFO] Loading embedding model: {model_name}")
    start_time = time.time()
    model_kwargs = {"device": "cpu"}
    encode_kwargs = {"normalize_embeddings": True}
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
    )
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Embedding model loaded in {elapsed_time:.2f}s")
    return embeddings


#### PRE-INDEXING ####

def load_document(file_path):
    """
    Load document content from a file.
    :param file_path: Path to the document.
    :return: String content of the file.
    """
    print(f"[INFO] Loading document from {file_path}")
    start_time = time.time()
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Document loaded. Length: {len(content)} characters. {elapsed_time:.2f}s")
    return content


def chunk_text(doc_content, chunk_size=1000, chunk_overlap=200):
    """
    Split text into overlapping chunks.
    :param doc_content: Input text.
    :param chunk_size: Size of each chunk.
    :param chunk_overlap: Overlap between consecutive chunks.
    :return: List of text chunks.
    """
    print("[INFO] Chunking document...")
    start_time = time.time()
    docs = [Document(page_content=doc_content)]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_documents(docs)
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Document split into {len(chunks)} chunks. {elapsed_time:.2f}s")
    return chunks

#### INDEXING ####

def create_vectorstore_indexing_chunks(chunks, embeddings, persist_directory="./chroma_db"):
    """
    Create chromaDB database, embeded chunks and Store embeddings.
    :param chunks: List of text chunks.
    :param embeddings: List of embeddings.
    :param persist_directory: Path where the database is stored.
    """
    print("[INFO] Storing embeddings in ChromaDB")
    start_time = time.time()
    vectorstore = Chroma.from_documents(chunks, embedding=embeddings, persist_directory=persist_directory)
    vectorstore.persist()
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Documents stored in ChromaDB. {elapsed_time:.2f}s")
    return vectorstore

#### RETRIEVAL ####

def create_retriever(vectorstore):
    """Create a retriever from the ChromaDB vector store."""
    print("[INFO] Creating retriever")
    return vectorstore.as_retriever()

#### GENERATION ####

def define_prompt_template():
    """Define the prompt template for the LLM."""
    print("[INFO] Defining prompt template...")
    start_time = time.time()
    prompt_template = PromptTemplate(
        template="Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        input_variables=["context", "question"]
    )
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Prompt template generated. {elapsed_time:.2f}s")
    return prompt_template


def generate_response(question, retriever, prompt_template):
    """
    Generate a response based on retrieved context and user query.
    :param context: Retrieved documents.
    :param question: User question.
    :return: Generated response.
    """
    print("[INFO] Generating response...")
    start_time = time.time()
    docs = retriever.invoke(question)
    formatted_context = "\n\n".join(doc.page_content for doc in docs)

    # Build prompt
    full_prompt = prompt_template.format(question=question, context=formatted_context)

    # Execution
    response = chat(
        model="mistral:7b",
        messages=[{'role': 'user', 'content': full_prompt}],
        stream=False
    )
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Response generated. {elapsed_time:.2f}s")
    return StrOutputParser().parse(response["message"]["content"])


#### EXECUTION ####

print("\n#### STEP 1: Setup ####")
start_time = time.time()
embedding_model = load_embedding_model()
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")

print("\n#### STEP 2: Pre-Indexing ####")
start_time = time.time()
doc_content = load_document('./assets/ressources/base.txt')
chunks = chunk_text(doc_content)
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")

print("\n#### STEP 3: Indexing ####")
start_time = time.time()
vectorstore = create_vectorstore_indexing_chunks(chunks, embedding_model)
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")

print("\n#### STEP 4: Retrieval & Generation ####")
start_time = time.time()
question = "Can you list the team members of M-Motors?"
retriever = create_retriever(vectorstore)
prompt_template = define_prompt_template()
response = generate_response(question, retriever, prompt_template)
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")

print("\n#### FINAL RESPONSE ####")
start_time = time.time()
print(response)
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")


#### STEP 1: Setup ####
[INFO] Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
[SUCCESS] Embedding model loaded in 2.74s
[SUCCESS] 2.74s

#### STEP 2: Pre-Indexing ####
[INFO] Loading document from ./assets/ressources/base.txt
[SUCCESS] Document loaded. Length: 4388 characters. 0.00s
[INFO] Chunking document...
[SUCCESS] Document split into 6 chunks. 0.01s
[SUCCESS] 0.01s

#### STEP 3: Indexing ####
[INFO] Storing embeddings in ChromaDB
[SUCCESS] Documents stored in ChromaDB. 8.29s
[SUCCESS] 8.29s

#### STEP 4: Retrieval & Generation ####
[INFO] Creating retriever
[INFO] Defining prompt template...
[SUCCESS] Prompt template generated. 0.02s
[INFO] Generating response...
[SUCCESS] Response generated. 553.79s
[SUCCESS] 553.81s

#### FINAL RESPONSE ####
 M-Motors is a successful and highly reputable company specializing in the selling of used vehicles, having become one of the top 10 companies nationwide after 30 years. The company offers a diverse range of brands, models

## without LangChain

In [None]:
%pip install chromadb
%pip install sentence-transformers

In [21]:
import os
import time
import ollama
import chromadb
from sentence_transformers import SentenceTransformer

#### SETUP ####

def load_embedding_model(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """
    Load the embedding model for text vectorization.
    :param model_name: Name of the SentenceTransformer model to load.
    :return: Loaded embedding model instance.
    """
    print(f"[INFO] Loading embedding model: {model_name}")
    start_time = time.time()
    model = SentenceTransformer(model_name)
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Embedding model loaded. {elapsed_time:.2f}s")
    return model


def create_vector_database(persist_directory="./chroma_db"):
    """
    Initialize a persistent ChromaDB client and collection.
    :param persist_directory: Path where the database is stored.
    :return: ChromaDB collection instance.
    """
    print(f"[INFO] Creating ChromaDB database at {persist_directory}")
    start_time = time.time()
    client = chromadb.PersistentClient(path=persist_directory)
    collection = client.get_or_create_collection(name="documents")
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] ChromaDB database initialized. {elapsed_time:.2f}s")
    return collection

#### PRE-INDEXING ####

def load_document(file_path):
    """
    Load document content from a file.
    :param file_path: Path to the document.
    :return: String content of the file.
    """
    print(f"[INFO] Loading document from {file_path}")
    start_time = time.time()
    with open(file_path, 'r', encoding='utf-8') as fichier:
        content = fichier.read()
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Document loaded. Length: {len(content)} characters. {elapsed_time:.2f}s")
    return content


def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """
    Split text into overlapping chunks.
    :param text: Input text.
    :param chunk_size: Size of each chunk.
    :param chunk_overlap: Overlap between consecutive chunks.
    :return: List of text chunks.
    """
    print("[INFO] Chunking document...")
    start_time = time.time()
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap 
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Document split into {len(chunks)} chunks. {elapsed_time:.2f}s")
    return chunks

#### INDEXING ####

def create_embeddings(model, text_chunks):
    """
    Generate embeddings for text chunks.
    :param model: Embedding model.
    :param text_chunks: List of text chunks.
    :return: List of embeddings.
    """
    print("[INFO] Generating embeddings...")
    start_time = time.time()
    embeddings = model.encode(text_chunks).tolist()
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Generated {len(embeddings)} embeddings in {elapsed_time:.2f}s")
    return embeddings


def store_embeddings(collection, text_chunks, embeddings):
    """
    Store embeddings and corresponding text chunks in ChromaDB.
    :param collection: ChromaDB collection.
    :param text_chunks: List of text chunks.
    :param embeddings: List of embeddings.
    """
    print("[INFO] Storing embeddings in ChromaDB...")
    start_time = time.time()
    collection.add(
        ids=[str(i) for i in range(len(text_chunks))],
        documents=text_chunks,
        embeddings=embeddings
    )
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Documents stored in ChromaDB. {elapsed_time:.2f}s")

#### RETRIEVAL ####

def retrieve_documents(collection, model, query, top_k=3):
    """
    Retrieve the most relevant documents for a given query.
    :param collection: ChromaDB collection.
    :param model: Embedding model.
    :param query: Query string.
    :param top_k: Number of documents to retrieve.
    :return: List of retrieved documents.
    """
    print(f"[INFO] Retrieving top-{top_k} documents for query: {query}")
    start_time = time.time()
    query_embedding = model.encode([query]).tolist()[0]
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    retrieved_docs = results["documents"][0]
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Retrieved {len(retrieved_docs)} documents. {elapsed_time:.2f}s")
    return retrieved_docs

#### GENERATION ####

def generate_response(retrieved_docs, question):
    """
    Generate a response based on retrieved context and user query.
    :param context: Retrieved documents.
    :param question: User question.
    :return: Generated response.
    """
    print("[INFO] Generating response...")    
    start_time = time.time()
    formatted_context = "\n\n".join(retrieved_docs)
    full_prompt = f"Context:\n{formatted_context}\n\nQuestion: {question}\nAnswer:"
    response = ollama.chat(
        model="mistral:7b",
        messages=[{"role": "user", "content": full_prompt}],
        stream=False
    )
    elapsed_time = time.time() - start_time
    print(f"[SUCCESS] Response generated. {elapsed_time:.2f}s")
    return response["message"]["content"]

#### EXECUTION ####

print("\n#### STEP 1: Setup ####")
start_time = time.time()
embedding_model = load_embedding_model()
collection = create_vector_database()
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")

print("\n#### STEP 2: Pre-Indexing ####")
start_time = time.time()
document = load_document("./assets/ressources/base.txt")
chunks = chunk_text(document)
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")

print("\n#### STEP 3: Indexing ####")
start_time = time.time()
embeddings = create_embeddings(embedding_model, chunks)
store_embeddings(collection, chunks, embeddings)
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")

print("\n#### STEP 4: Retrieval & Generation ####")
start_time = time.time()
query = "Can you list the team members of M-Motors?"
retrieved_docs = retrieve_documents(collection, embedding_model, query)
response = generate_response(retrieved_docs, query)
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")

print("\n#### FINAL RESPONSE ####")
start_time = time.time()
print(response)
elapsed_time = time.time() - start_time
print(f"[SUCCESS] {elapsed_time:.2f}s")


#### STEP 1: Setup ####
[INFO] Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
[SUCCESS] Embedding model loaded. 8.03s
[INFO] Creating ChromaDB database at ./chroma_db
[SUCCESS] ChromaDB database initialized. 0.17s
[SUCCESS] 8.20s

#### STEP 2: Pre-Indexing ####
[INFO] Loading document from ./assets/ressources/base.txt
[SUCCESS] Document loaded. Length: 4388 characters. 0.00s
[INFO] Chunking document...
[SUCCESS] Document split into 6 chunks. 0.00s
[SUCCESS] 0.00s

#### STEP 3: Indexing ####
[INFO] Generating embeddings...


Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
Insert of existing embedding ID: 3
Insert of existing embedding ID: 4
Insert of existing embedding ID: 5


[SUCCESS] Generated 6 embeddings in 1.05s
[INFO] Storing embeddings in ChromaDB...
[SUCCESS] Documents stored in ChromaDB. 0.06s
[SUCCESS] 1.10s

#### STEP 4: Retrieval & Generation ####
[INFO] Retrieving top-3 documents for query: Can you list the team members of M-Motors?
[SUCCESS] Retrieved 3 documents. 0.04s
[INFO] Generating response...
[SUCCESS] Response generated. 199.26s
[SUCCESS] 199.30s

#### FINAL RESPONSE ####
 The team members of M-Motors are DIOUF Makhtar, Charlery Malcolm, RENÉ Marie, BENGUIGUI Avidan, RENEVIER Joachim, REKIK Kylian, and BERNARD Anne-Flore.
[SUCCESS] 0.00s
