# <center>Panongbene Sawadogo</center>

📩 **Contact** : amet1900@gmail.com

🌐 **Linkedin** : https://www.linkedin.com/in/panongbene-jean-mohamed-sawadogo-33234a168/

🗓️ **Dernière modification** : 20 August 2025

# <center>**Embedding chunks for Retrieval-Augmented Generation (RAG)**</center>

In this document, I implement and compare different techniques for **embedding text chunks**, in an optimal way, to build an effective **Retrieval-Augmented Generation (RAG)** system.

**Embedding** involves representing a text (word, sentence, paragraph, or chunk) as a numerical vector in a high-dimensional space. These representations enable the measurement of semantic similarity between texts and are essential in a **Retrieval-Augmented Generation (RAG)** system.

The goal is to explore how to transform these text units into vector representations (embeddings) that can then be **compared, indexed, and leveraged** efficiently within a RAG pipeline.

More specifically, this work aims to:

1. **Evaluate different embedding methods** (based on general-purpose models or models specialized in semantic similarity).
2. **Analyze the quality of vector representations** to ensure that semantically related chunks are correctly aligned in the vector space.
4. **Improve RAG accuracy** by ensuring that the generation model receives context that is both relevant and coherent with the user’s query.
5. **Compare the strengths and limitations** of different embedding approaches in terms of quality, performance, and computational cost.

In short, this document highlights the crucial importance of the embedding step in a RAG system: it directly determines the model’s ability to retrieve the most relevant information and to generate reliable, contextualized responses.

# Libraries

In [82]:
#!pip install numpy
#!pip install pandas
#!pip install PyPDF2
#!pip install -U gensim
#!pip install matplotlib
#!pip install bitsandbytes
#!pip install InstructorEmbedding
#!pip install --upgrade transformers
#!pip install tensorflow tensorflow-hub

In [1]:
import os
import re
import sys
import json
import nltk
import torch
import PyPDF2
import warnings
import requests
import threading
import numpy as np
import pandas as pd
from PIL import Image
from rich.panel import Panel
import tensorflow_hub as hub
from rich.syntax import Syntax
import matplotlib.pyplot as plt
from rich.console import Console
from dataclasses import dataclass
from gensim.models import KeyedVectors
from nltk.tokenize import sent_tokenize
from FlagEmbedding import BGEM3FlagModel
from InstructorEmbedding import INSTRUCTOR
from IPython.display import Markdown, display
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Dict, Tuple, Union, Callable, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor, AutoModel, TextStreamer, CLIPModel, LlavaForConditionalGeneration, TextIteratorStreamer, BitsAndBytesConfig

In [2]:
warnings.filterwarnings('ignore')

In [3]:
# Download NLTK resources if not already done (only once)
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

## Loading data

In [4]:
def extraire_texte_pdf(chemin_pdf):
    """
    """
    # Check if the file exists
    if not os.path.exists(chemin_pdf):
        raise FileNotFoundError(f"Le fichier {chemin_pdf} n'existe pas.")
    
    # Check the file extension
    if not chemin_pdf.lower().endswith('.pdf'):
        raise ValueError("Le fichier doit avoir l'extension .pdf")
    
    result_extract = []

    with open(chemin_pdf, 'rb') as fichier:
        # Create a PdfReader object
        lecteur_pdf = PyPDF2.PdfReader(fichier)
        
        # Get the number of pages
        nombre_pages = len(lecteur_pdf.pages)
        print(f"Nombre de pages dans le PDF: {nombre_pages}")
        
        # Extract the text from each page
        for numero_page in range(nombre_pages):
            page = lecteur_pdf.pages[numero_page]
            result_extract.append(page.extract_text().strip())

    return result_extract

In [5]:
%%time
# Extract the text
texte_extrait = extraire_texte_pdf('docs/snd.pdf')

Nombre de pages dans le PDF: 135
CPU times: user 2.27 s, sys: 38.6 ms, total: 2.31 s
Wall time: 2.34 s


# Sentence or Paragraph Segmentation

In [6]:
def segmenter_texte(text: str, mode: str = "phrases", chunk_size: int = 1, language: str = "french" ) -> List[str]:
    """
    Segments a text into chunks of sentences or paragraphs.
    Parameters:
        text: Text to segment
        mode: "phrases" or "paragraphes"
        chunk_size: Number of elements per chunk
        language: Language used for sentence tokenization
    Returns:
        List of text chunks
    Raises:
        ValueError: If parameters are invalid
    """
    # Validation of entries
    if not isinstance(text, str) or not text.strip():
        raise ValueError("Text must not be empty")
    
    if mode not in {"phrases", "paragraphes"}:
        raise ValueError("Mode must be 'sentences' or 'paragraphs'")
    
    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError("The chunk size must be a positive integer")

    # Text cleaning
    text = text.strip()
    
    if mode == "phrases":
        # Sentence Segmentation with NLTK
        elements = sent_tokenize(text, language=language)
    else:
        # Paragraph segmentation
        elements = [p for p in text.split('\n') if p.strip()]
    
    # Creating chunks
    chunks = []
    for i in range(0, len(elements), chunk_size):
        chunk = elements[i:i + chunk_size]
        
        # Join with space for sentences, line break for paragraphs
        separator = " " if mode == "phrases" else "\n"
        chunks.append(separator.join(chunk))
    
    return chunks

In [7]:
%%time
paragraph_segments = []
for one_page in texte_extrait:
    paragraph_segments+=segmenter_texte(one_page, "phrases", 2)

CPU times: user 25.8 ms, sys: 3.17 ms, total: 29 ms
Wall time: 29.5 ms


In [8]:
len(paragraph_segments)

778

In [9]:
paragraph_segments[100]

'Face à cette situation, le Sénégal demeure parmi le s 25 pays à plus faible \ndéveloppement humain, avec un IDH de 0,517 en 2022. Dans le même ordre \nd’idées, l’indice du capital humain du Sénégal est resté relativement faible \n(0,42), comparé à des pays comme la Malaisie (0,61)  ou le Ghana (0,504).'

# Word Embeddings (word level)

### Principle

Word embeddings at the word level consist of representing each word in a text as a numerical vector in a high-dimensional space, where the proximity between vectors reflects the semantic or contextual similarity between words. This technique relies on machine learning models such as **Word2Vec** (CBOW or Skip-gram), **GloVe**, or contextual embeddings from models like **BERT**, which are trained on large text corpora. During training, words that appear in similar contexts (for example, “dog” and “cat” in animal-related sentences) are projected close to each other in the vector space. For instance, in a Word2Vec embedding, the vector for “king” may be close to that of “queen” in terms of direction, reflecting a semantic relationship.

---

### Advantages

1. **Capturing semantic similarity**: Word embeddings model semantic relationships between words (synonymy, antonymy, analogy), such as “man” is to “king” what “woman” is to “queen,” which facilitates applications like translation or text generation.
2. **Dimensionality reduction**: By transforming words into vectors (typically 50–300 dimensions), this method compresses information compared to a one-hot representation (where each word is a binary vector equal to the vocabulary size), improving computational efficiency.
4. **Generalization across contexts**: Once trained on a diverse corpus, embeddings can be applied to new texts without retraining, offering flexibility across domains.
5. **Partial interpretability**: Arithmetic relationships between vectors (e.g., vector(“king”) – vector(“man”) + vector(“woman”) ≈ vector(“queen”)) allow qualitative analysis of semantic similarities.

---

### Limitations

1. **Lack of dynamic context**: Static embeddings (like Word2Vec or GloVe) assign a single vector per word, ignoring variations in meaning depending on context (e.g., “bank” can mean a financial institution or a riverbank), which limits precision in ambiguous texts.
2. **Dependence on training corpus quality**: Embeddings reflect the biases and gaps of the text they are trained on. For example, a corpus lacking linguistic diversity may produce less relevant vectors for certain words or domains.
3. **High initial training cost**: Training embeddings on large corpora requires significant resources (time, compute power), making it costly, although pre-trained models mitigate this issue.
4. **Out-of-vocabulary (OOV) problem**: Words missing from the training vocabulary (e.g., neologisms or domain-specific terms) are not represented, forcing the use of techniques like subword embeddings (e.g., FastText) or ignoring these words.
5. **Limitations with complex relations**: While capable of capturing similarities, word-level embeddings struggle to model syntactic relations or long-range dependencies, which require contextual approaches such as transformers.

---

### Improvements and Suggestions

* **Use contextual embeddings**: Move to models like **BERT** or **ELMo**, which generate dynamic embeddings based on context, overcoming the limitations of static embeddings.
* **Domain-specific training**: Adapt embeddings to a specialized domain (e.g., medical or legal) via fine-tuning to improve relevance.
* **Handling OOV words**: Incorporate subword techniques (such as **BPE** or **WordPiece**) to represent unknown words using subunits.
* **Resource optimization**: Leverage pre-trained embeddings (available via Hugging Face or TensorFlow Hub) to reduce training costs.
* **Evaluation**: Assess embeddings with metrics like **cosine similarity** or benchmarks like **WordSim-353** to validate their semantic quality.

In [None]:
def load_word_embeddings(embedding_type='word2vec', embedding_path=None):
    """
    Load pre-trained word embeddings into a Gensim KeyedVectors model.

    This function supports multiple embedding formats such as Word2Vec, 
    GloVe, and FastText. It uses Gensim's `KeyedVectors` to load the 
    embeddings from the specified file path.

    Parameters
    ----------
    embedding_type : str, optional (default='word2vec')
        The type of embeddings to load. Supported values are:
        - 'word2vec' : Loads Word2Vec binary format embeddings.
        - 'glove'    : Loads GloVe embeddings converted to Word2Vec format 
                       (without headers).
        - 'fasttext' : Loads FastText embeddings in text format.

    embedding_path : str
        Path to the embeddings file to be loaded.

    Returns
    -------
    model : gensim.models.KeyedVectors
        The loaded embedding model, which allows vector lookup, similarity 
        computations, and other operations.

    Raises
    ------
    ValueError
        If the provided embedding_type is not supported.

    Examples
    --------
    >>> model = load_word_embeddings('word2vec', 'GoogleNews-vectors-negative300.bin')
    >>> vector = model['king']  # Get embedding vector for the word 'king'
    >>> model.most_similar('paris')
    """

    if embedding_type.lower() == 'word2vec':
        # Load Word2Vec binary embeddings
        model = KeyedVectors.load_word2vec_format(embedding_path, binary=True)
    elif embedding_type.lower() == 'glove':
        # Load GloVe embeddings (converted to Word2Vec format without header)
        model = KeyedVectors.load_word2vec_format(embedding_path, binary=False, no_header=True)
    elif embedding_type.lower() == 'fasttext':
        # Load FastText embeddings (in text format)
        model = KeyedVectors.load_word2vec_format(embedding_path, binary=False)
    else:
        raise ValueError("Unsupported embedding type. Choose 'word2vec', 'glove', or 'fasttext'.")

    print(f"{embedding_type} model loaded successfully!")
    return model


In [None]:
def load_word_embeddings(embedding_type='word2vec', embedding_path=None):
    """
    Load pre-trained word embeddings into a Gensim KeyedVectors model.

    This function supports multiple embedding formats such as Word2Vec, 
    GloVe, and FastText. It uses Gensim's `KeyedVectors` to load the 
    embeddings from the specified file path.
    
    Parameters
        embedding_type : (default='word2vec') The type of embeddings to load. Supported values are:
            - 'word2vec' : Loads Word2Vec binary format embeddings.
            - 'glove'    : Loads GloVe embeddings converted to Word2Vec format 
                           (without headers).
            - 'fasttext' : Loads FastText embeddings in text format.
        embedding_path : Path to the embeddings file to be loaded.

    Returns
        model : gensim.models.KeyedVectors
            The loaded embedding model, which allows vector lookup, similarity 
            computations, and other operations.
        Raises ValueError If the provided embedding_type is not supported.

    Examples
        >>> model = load_word_embeddings('word2vec', 'GoogleNews-vectors-negative300.bin')
        >>> vector = model['king']  # Get embedding vector for the word 'king'
        >>> model.most_similar('paris')
    """    
    if embedding_type.lower() == 'word2vec':
        # Load Word2Vec binary embeddings
        model = KeyedVectors.load_word2vec_format(embedding_path, binary=True)
    elif embedding_type.lower() == 'glove':
        # Load GloVe embeddings (converted to Word2Vec format without header)
        model = KeyedVectors.load_word2vec_format(embedding_path, binary=False, no_header=True)
    elif embedding_type.lower() == 'fasttext':
        # Load FastText embeddings (in text format
        model = KeyedVectors.load_word2vec_format(embedding_path, binary=False)
    else:
        raise ValueError("Unsupported embedding type. Choose 'word2vec', 'glove', or 'fasttext'.")
    
    print(f"Modèle {embedding_type} chargé avec succès !")
    return model

In [None]:
def get_word_embedding(model, word):
    """
    Retrieve the embedding vector for a given word from a pre-trained model.

    This function takes a loaded embedding model (e.g., Word2Vec, GloVe, 
    or FastText via Gensim KeyedVectors) and returns the vector representation 
    of the specified word. If the word is not present in the model's vocabulary, 
    it will handle the error gracefully.

    Parameters
        model : gensim.models.KeyedVectors
            The pre-trained embedding model loaded with Gensim.
        
        word : str
            The word for which the embedding vector should be retrieved.

    Returns
        numpy.ndarray or None
            A vector (1D NumPy array) representing the word's embedding 
            if the word exists in the model. Returns `None` if the word 
            is not found in the vocabulary.

    Raises 
        KeyError
            If the word is not in the model's vocabulary (caught internally 
            and handled with a warning).

    Examples
        >>> model = load_word_embeddings('word2vec', 'GoogleNews-vectors-negative300.bin')
        >>> vector = get_word_embedding(model, 'king')
        >>> print(vector.shape)
        (300,)
        >>> get_word_embedding(model, 'unknownword')
        None
    """
    if word in model:
        return model[word]
    else:
        print(f"The word '{word}' is not in the vocabulary.")
        return None

In [None]:
# Replace with the path to your embeddings file
embedding_path = "GoogleNews-vectors-negative300.bin"  # Ex: Word2Vec
#embedding_path = "glove.6B.300d.txt"  # Ex: GloVe
#embedding_path = "wiki-news-300d-1M.vec"  # Ex: FastText

# Load the model
model = load_word_embeddings(embedding_type='word2vec', embedding_path=embedding_path)

if model:
    # Get the embedding of a word
    word = "king"
    embedding = get_word_embedding(model, word)
    
    if embedding is not None:
        print(f"Warning: The word '{word}':\n{embedding[:5]}... (shape: {embedding.shape})")

# Contextual Embeddings (sentence / chunk level)

### Principle

Contextual embeddings at the sentence or chunk level represent a word, sentence, or segment of text (chunk) as a numerical vector in a high-dimensional space, while taking into account the context in which they appear. Unlike static word embeddings (such as Word2Vec), these representations are generated dynamically by transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly Optimized BERT Approach), or DistilBERT (a lightweight version of BERT). These models analyze text bidirectionally and use attention mechanisms to capture relationships between words in a sequence. For example, in the sentence *“The bank financed the project”*, the word *“bank”* will have a different embedding depending on whether it refers to a financial institution or the shore of a river, reflecting its contextual meaning. These embeddings are particularly suitable for chunks, sentences, or short documents, as they exploit local structure to produce rich representations, often extracted from the intermediate or final layers of the models after fine-tuning or direct inference.

---

### Advantages

1. **Capture polysemy and contextual semantics**: Thanks to their bidirectional and contextual nature, these embeddings distinguish between different meanings of a word depending on its environment. For example, in *“I’m going to the bank to fish”* vs *“I’m going to the bank to borrow”*, the vectors for *“bank”* will be distinct, improving accuracy in tasks such as word sense disambiguation or semantic search.
2. **Well-suited for chunks, sentences, and short documents**: Models like BERT or DistilBERT are optimized to process moderately long sequences (up to 512 tokens for standard BERT), making them ideal for segmenting and encoding textual units such as sentences or chunks in RAG pipelines, where local context is crucial.
3. **High performance in advanced NLP tasks**: These embeddings excel in applications such as text classification, question answering (QA), or text generation, as they incorporate complex syntactic and semantic relationships captured by attention mechanisms.
4. **Reuse of pre-trained models**: Thanks to libraries such as Hugging Face Transformers, contextual embeddings can be quickly obtained from pre-trained models, reducing the need for custom training and accelerating development.
5. **Flexibility with fine-tuning**: These models can be adapted to specific corpora (e.g., medical or legal) to improve their relevance in specialized domains, enabling customization without training from scratch.

---

### Limitations

1. **Often high-dimensional embeddings**: The generated vectors (typically 768 dimensions for BERT or 512 for DistilBERT) are memory-intensive, which can be problematic for indexing large corpora or running on low-power devices like mobile phones.
2. **High computational cost for large corpora**: Generating embeddings for each chunk or sentence requires intensive computation, especially on large texts or massive datasets, making this method costly in terms of time and resources (CPU/GPU), even with lighter models like DistilBERT.
3. **Sequence length limitations**: Models like BERT are limited to 512 tokens (or 1024 for some variants), which restricts their application to long documents. Chunks exceeding this limit must be truncated or split, potentially losing context.
4. **Dependence on training data quality**: The embeddings reflect the biases or gaps in the corpora on which the models were trained (e.g., lack of linguistic diversity), which may affect performance on out-of-domain texts.
5. **Implementation complexity**: While pre-trained models ease usage, integrating these embeddings into a RAG pipeline or application requires careful resource management (batch optimization, memory handling) and expertise in fine-tuning to maximize efficiency.

---

### Improvements and Suggestions

* **Use lighter models**: Choose DistilBERT or other optimized versions (like TinyBERT) to reduce embedding dimensionality and computational cost while maintaining good performance.
* **Handle long sequences**: Implement techniques such as sliding windows or smart truncation with overlap to process documents exceeding token limits while preserving context.
* **Optimize resources**: Use frameworks like ONNX or TensorRT to speed up inference on large corpora, or leverage parallel GPU computation.
* **Domain-specific fine-tuning**: Adapt the model on targeted corpora to reduce biases and improve embedding relevance in specific fields.
* **Evaluation**: Benchmark embeddings with tasks like GLUE or contextual similarity metrics to validate their effectiveness in specific tasks such as RAG.

---

This technique requires proper infrastructure and careful configuration for optimal performance.

In [50]:
def load_contextual_embeddings(model_name='bert-base-uncased'):
    """
    Load a pre-trained transformer model and its tokenizer to generate contextual embeddings.

    This function initializes a Hugging Face transformer model (e.g., BERT, RoBERTa, DistilBERT) 
    along with its tokenizer. The model can then be used to compute contextual embeddings for words, 
    sentences, or chunks of text. Unlike static embeddings (e.g., Word2Vec), contextual embeddings 
    are dynamically generated depending on the surrounding context.

    Parameters:
        model_name (str, optional): The name or path of the pre-trained model from the Hugging Face Hub. 
            Defaults to `'bert-base-uncased'`. 
            Examples include:
                - 'bert-base-uncased'
                - 'roberta-base'
                - 'distilbert-base-uncased'
                - or a custom fine-tuned model path.

    Returns:
        tuple: A tuple containing:
            - tokenizer (PreTrainedTokenizer): The tokenizer associated with the model, 
              used to preprocess input text.
            - model (PreTrainedModel): The transformer model loaded for generating embeddings.

    Example:
        >>> tokenizer, model = load_contextual_embeddings("bert-base-uncased")
        >>> inputs = tokenizer("The bank financed the project", return_tensors="pt")
        >>> outputs = model(**inputs)
        >>> embeddings = outputs.last_hidden_state
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    print(f"Model {model_name} loaded successfully!")
    return tokenizer, model

In [51]:
def get_contextual_embedding(tokenizer, model, text, pooling_strategy='mean'):
    """
    Generate a contextual embedding for a given text using a transformer model.
    Parameters
        tokenizer : PreTrainedTokenizer
            The tokenizer corresponding to the model (e.g., from Hugging Face Transformers).
        model : PreTrainedModel
            The transformer model (e.g., BERT, RoBERTa) used to generate contextual embeddings.
        text : str
            The input text (sentence, chunk, or document) for which the embedding will be computed.
        pooling_strategy : str, optional, default='mean'
            The strategy to pool token embeddings into a single vector representation:
                - 'mean' : average pooling of all token embeddings.
                - 'cls'  : use the [CLS] token embedding.
                - 'max'  : max pooling across token embeddings.
    
        Returns
        numpy.ndarray
            A 1D vector representing the contextual embedding of the input text.
    """
    
    # Tokenisation
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    
    # Forward pass
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Retrieving hidden states (last layer)
    last_hidden_states = outputs.last_hidden_state.squeeze(0)
    
    # Pooling
    if pooling_strategy == 'mean':
        embedding = torch.mean(last_hidden_states, dim=0)
    elif pooling_strategy == 'max':
        embedding, _ = torch.max(last_hidden_states, dim=0)
    elif pooling_strategy == 'cls':
        embedding = last_hidden_states[0]  # Token [CLS]
    else:
        raise ValueError("Stratégie de pooling non supportée. Choisissez 'mean', 'max' ou 'cls'.")
    
    return embedding.numpy()

In [53]:
# Model selection (eg: 'bert-base-uncased', 'roberta-base', 'distilbert-base-uncased')
model_name = 'bert-base-uncased'

# Loading the model
tokenizer, model = load_contextual_embeddings(model_name)

Modèle bert-base-uncased chargé avec succès !
Embedding contextuel pour le texte 'Face à cette situation, le Sénégal demeure parmi le s 25 pays à plus faible 
développement humain, avec un IDH de 0,517 en 2022. Dans le même ordre 
d’idées, l’indice du capital humain du Sénégal est resté relativement faible 
(0,42), comparé à des pays comme la Malaisie (0,61)  ou le Ghana (0,504).':
Shape: (768,)
Premiers éléments: [-0.45379516  0.24887244 -0.10547437 -0.28441253  0.2409869 ]...


In [58]:
%%time
# Text to encode
paragraph_segments_embbeding = []

print('Number of paragraphs : ', len(paragraph_segments))
# Embedding generation
for one_para in paragraph_segments:
    embedding = get_contextual_embedding(tokenizer, model, one_para, pooling_strategy='mean')
    paragraph_segments_embbeding.append(embedding)

if embedding is not None:
    print(f"Contextual embedding for texte '{text}':")
    print(f"Shape: {embedding.shape}")
    print(f"First elements: {embedding[:5]}...")

Nombre de paragraph :  778
Embedding contextuel pour le texte 'Face à cette situation, le Sénégal demeure parmi le s 25 pays à plus faible 
développement humain, avec un IDH de 0,517 en 2022. Dans le même ordre 
d’idées, l’indice du capital humain du Sénégal est resté relativement faible 
(0,42), comparé à des pays comme la Malaisie (0,61)  ou le Ghana (0,504).':
Shape: (768,)
Premiers éléments: [-0.49004218 -0.07687725 -0.17948622 -0.17795789  0.31216305]...
CPU times: user 2min 49s, sys: 1min 25s, total: 4min 14s
Wall time: 1min


# Sentence Embeddings (optimized for sentence similarity)

### Principle

Sentence embeddings, optimized for sentence similarity, consist of generating compact numerical vectors that represent sentences, paragraphs, or entire chunks while capturing their overall semantic meaning. Unlike word embeddings or contextual embeddings, which focus on individual words or their local context, these models—such as Sentence-BERT (SBERT), LaBSE (Language-agnostic BERT Sentence Embeddings), or the Universal Sentence Encoder—are specifically designed to encode larger textual units. These models rely on transformer architectures (often derived from BERT) fine-tuned with sentence pairs and tasks such as similarity classification (e.g., Natural Language Inference or STS – Semantic Textual Similarity). For example, SBERT applies a pooling technique (mean or max) over word embeddings to produce a single vector per sentence, so that sentences like “The cat is sleeping” and “The feline is resting” would be close in semantic similarity. These embeddings are especially useful in RAG pipelines, where they facilitate retrieving relevant chunks by measuring vector proximity.

---

### Advantages

1. **Optimized for semantic similarity and retrieval**: These embeddings are trained to maximize the correlation between vector distance and semantic similarity, making models like SBERT ideal for tasks such as similar text retrieval or paraphrase detection. For instance, “What is the weather today?” and “What’s the current forecast?” will produce close vectors.
2. **Directly suited for RAG**: By producing compact representations (often 384 or 768 dimensions) for entire chunks or sentences, these embeddings allow efficient indexing and fast retrieval in RAG systems, improving the relevance of generated answers.
3. **Multilingual performance**: Models like LaBSE are designed to work across multiple languages, offering a robust solution for multilingual corpora without the need for retraining per language.
4. **Relative computational efficiency**: Compared to contextual embeddings applied at each token, sentence embeddings reduce computational overhead by generating a single vector per text unit, which is advantageous for real-time applications.
5. **Easy reuse**: Thanks to frameworks like Hugging Face, pre-trained models (e.g., SBERT’s `all-mpnet-base-v2`) are directly accessible, allowing quick integration into NLP pipelines.

---

### Limitations

1. **Quality depends on training corpus**: The performance of embeddings is strongly influenced by the diversity and representativeness of the fine-tuning corpus. For example, a model trained on formal English text may underperform on informal dialogues or low-resource languages.
2. **Less effective on very long texts**: These models are optimized for short to moderately long sequences (generally up to 512 tokens for SBERT). Their ability to capture context decreases with long paragraphs or documents, requiring prior segmentation or truncation that may lead to information loss.
3. **Sensitivity to model biases**: Embeddings reflect biases present in the training data (e.g., gender stereotypes or cultural gaps), which can impact accuracy in sensitive or diverse contexts.
4. **Dependence on fine-tuning**: Although pre-trained, these models often require adjustment for domain-specific tasks (e.g., similarity in the medical domain) to achieve full potential, adding a configuration step.
5. **Limitations with complex relations**: While excellent for similarity, these embeddings may lack granularity to capture long-distance syntactic dependencies or subtle nuances, which can be a drawback for highly detailed analysis.

---

### Improvements and Suggestions

* **Domain adaptation**: Fine-tune on a specific corpus (e.g., medical or legal) to improve embedding quality in a given context.
* **Handling long texts**: Combine with hierarchical segmentation or sliding windows to process large documents, applying embeddings to smaller sub-units.
* **Bias reduction**: Use debiasing techniques (e.g., Bolukbasi et al. methods) to mitigate stereotypes in embeddings.
* **Resource optimization**: Leverage lighter versions (e.g., SBERT’s `miniLM`) or parallel computation to reduce computational cost on large corpora.
* **Evaluation**: Validate embeddings with benchmarks such as STS Benchmark or cosine similarity metrics in a RAG pipeline to assess effectiveness.

In [67]:
def load_sentence_embedding_model(model_name='all-MiniLM-L6-v2'):
    """
    Load a pre-trained sentence embedding model.

    This function loads a sentence embedding model from the 
    <<sentence_transformers>> library. Such models are optimized 
    for generating dense vector representations of entire 
    sentences, paragraphs, or chunks, making them particularly 
    useful for semantic similarity, clustering, and RAG pipelines.

    Parameters
        model_name : str, optional (default='all-MiniLM-L6-v2')
            The name or path of the pre-trained sentence embedding model
            to load. Examples include:
            - 'all-MiniLM-L6-v2' (lightweight and efficient)
            - 'all-mpnet-base-v2' (higher accuracy, larger model)
            - 'LaBSE' (multilingual performance)

    Returns
        model : SentenceTransformer
            A SentenceTransformer model instance that can be used to 
            encode sentences or chunks into embeddings.

    Example
        >>> model = load_sentence_embedding_model('all-MiniLM-L6-v2')
        >>> embeddings = model.encode(["The cat is sleeping", "The feline rests"])
        >>> print(embeddings.shape)  # (2, 384)
    """
    if model_name.lower() in ['universal-sentence-encoder', 'universal-sentence-encoder-multilingual']:
        # Loading via TensorFlow Hub
        model = hub.load(f"https://tfhub.dev/google/{model_name}/4")
        print(f"Model {model_name} loaded via TF Hub!")
    else:
        # Loading via SentenceTransformers (SBERT/LaBSE)
        model = SentenceTransformer(model_name)
        print(f"Model {model_name} loaded via SentenceTransformers!")
    
    return model

In [68]:
def encode_sentences(model, sentences, model_type='sbert'):
    """
    Encode a list of sentences into embeddings using a specified model.

    This function converts input sentences into dense vector 
    representations (embeddings) depending on the type of model provided. 
    These embeddings capture semantic meaning and can be used for 
    similarity search, clustering, or downstream NLP tasks.

    Parameters
        model : object
            The embedding model instance to use. This can be:
            - A SentenceTransformer model (if `model_type='sbert'`)
            - A TensorFlow Hub model such as USE (if `model_type='use'`)
            - Any compatible embedding model.
        
        sentences : list of str
            A list of sentences or short texts to encode into embeddings.
    
        model_type : str, optional (default='sbert')
            The type of embedding model. Supported values:
            - 'sbert' : for models loaded via Sentence-BERT (SentenceTransformer)
            - 'use'   : for Universal Sentence Encoder models
            - (other custom options can be added as needed)

    Returns
        embeddings : numpy.ndarray
            A 2D array of shape (n_sentences, embedding_dim) containing 
            the embeddings for each input sentence.

    Example
        >>> model = load_sentence_embedding_model('all-MiniLM-L6-v2')
        >>> sentences = ["The cat is sleeping", "The feline rests"]
        >>> embeddings = encode_sentences(model, sentences, model_type='sbert')
        >>> print(embeddings.shape)  # (2, 384)
    """
    if isinstance(sentences, str):
        sentences = [sentences]
    
    try:
        if model_type in ['sbert', 'labse']:
            embeddings = model.encode(sentences, convert_to_numpy=True)
        elif model_type == 'use':
            embeddings = model(sentences).numpy()
        else:
            raise ValueError("Unsupported model type")
        
        return embeddings
    
    except Exception as e:
        print(f"Erreur d'encodage: {e}")
        return None

In [78]:
%%time
print("\n=== Sentence-BERT ===")
sbert_model = load_sentence_embedding_model('all-MiniLM-L6-v2')
if sbert_model:
    embeddings = encode_sentences(sbert_model, paragraph_segments, 'sbert')
    print(f"Shape des embeddings: {embeddings.shape}")  # (3, 384)
    print(f"Similarité phrase 1-2: {np.dot(embeddings[0], embeddings[1]):.2f}")
    print(f"Similarité phrase 1-3: {np.dot(embeddings[0], embeddings[2]):.2f}")


=== Sentence-BERT ===
Modèle all-MiniLM-L6-v2 chargé via SentenceTransformers!
Shape des embeddings: (778, 384)
Similarité phrase 1-2: 1.00
Similarité phrase 1-3: 0.70
CPU times: user 963 ms, sys: 954 ms, total: 1.92 s
Wall time: 5.84 s


# Instruction-tuned & Domain-specific Embeddings

### Principle

Instruction-tuned & domain-specific embeddings are vector representations generated by models trained with **explicit instructions** to optimize performance on specific tasks such as semantic search, clustering, classification, or text retrieval in RAG systems. Unlike generic embeddings (like Word2Vec or BERT), these models—such as OpenAI text-embedding-3-small/large, E5 (Efficient Text Embedding), Instructor-XL, GTE (General Text Embedder), or Cohere Embed—are fine-tuned on datasets annotated with instructions or text–text pairs, often accompanied by contextual metadata. For example, a model like Instructor-XL may be trained with instructions such as “Find the documents most similar to this query” or “Classify this text by topic,” guiding the embeddings toward practical tasks. Moreover, these models can be specialized for specific domains (biomedical, legal, multilingual) using targeted corpora, enabling fine adaptation to particular application contexts such as medical knowledge bases or multilingual legal documents.

---

### Advantages

1. **More robust and adapted to practical RAG use cases**: These embeddings are designed to maximize relevance in real-world scenarios, such as retrieving relevant chunks in a RAG pipeline. For instance, OpenAI text-embedding-3 is optimized for search tasks, producing vectors that improve the accuracy of generated responses compared to generic embeddings.
2. **Support for specialized embeddings**: Models can be fine-tuned for specific domains, such as E5 for biomedical (handling terms like “oncology” or “DNA”) or LaBSE for multilingual use (supporting English, Spanish, Chinese, etc.), allowing customization according to application needs and improving performance on specialized corpora.
3. **Effectiveness in instruction-driven tasks**: Thanks to training with explicit instructions, these embeddings better capture the intent behind queries or classifications, making models like GTE particularly useful for intentional searches or thematic clustering.
4. **Reusability and scalability**: Pretrained models (available through APIs such as OpenAI or Cohere, or libraries like SentenceTransformers) enable rapid integration into pipelines while supporting large-scale corpora with consistent performance.
5. **Improved contextual robustness**: Instruction-based training helps reduce semantic ambiguities and better handle linguistic variations, offering greater stability across diverse environments.

---

### Limitations

1. **Strong dependence on training data quality**: Embedding performance depends on the representativeness and quality of the datasets used for fine-tuning. For instance, a model trained only on academic texts may underperform on informal dialogues or uncovered domains.
2. **Potentially high computational cost**: Instruction-based training, especially for large-scale models such as OpenAI text-embedding-3-large or Instructor-XL, requires significant resources (GPU, training time), and inference itself can be costly on large corpora—though lighter versions (such as text-embedding-3-small) help mitigate this.
3. **Risk of over-optimization**: Excessive fine-tuning on a specific task or domain may reduce generalization to other contexts, making the model less versatile if instructions do not cover all possible variations.
4. **Complexity of access and maintenance**: Some models (such as those from OpenAI or Cohere) require paid APIs or licenses, and their integration into a RAG pipeline may involve ongoing cost and update management.
5. **Limitations with very long or unstructured texts**: While effective on chunks or sentences, these embeddings may lose efficiency with very long documents or texts lacking clear structure, requiring pre-segmentation or truncation that can introduce errors.

---

### Improvements and Suggestions

* **Custom fine-tuning**: Adapt the model to a specific corpus (e.g., legal or multilingual) with tailored instructions to improve domain relevance.
* **Resource optimization**: Use lighter versions (like text-embedding-3-small or GTE-base) or apply quantization techniques (e.g., post-training quantization) to reduce computational costs.
* **Bias management**: Apply debiasing methods (such as Bolukbasi’s adjustments) to mitigate the effects of biased training data.
* **Pre-segmentation**: Combine with techniques such as sliding windows or hierarchical segmentation to handle long texts by applying embeddings to optimized sub-units.
* **Evaluation**: Benchmark embeddings using MTEB (Massive Text Embedding Benchmark) or precision metrics within a RAG pipeline to validate effectiveness.

---

This technique requires careful attention to resource use and training data quality.


In [84]:
class InstructionEmbeddings:
    def __init__(self, model_name: str = "e5-small-v2"):
        """
        Initialize the embedding model with instruction (without OpenAI)
        Available models:
            * **E5**: 'e5-small-v2', 'e5-base-v2', 'e5-large-v2'
            * **Instructor**: 'instructor-small', 'instructor-base', 'instructor-xl'
            * **GTE**: 'gte-small', 'gte-base'
            * **Cohere**: requires an API key but has an open-source alternative
        """
        self.model_name = model_name.lower()
        self.model = None
        self._load_model()

    def _load_model(self):
        """
        Loads and initializes the pre-trained model and its tokenizer.

        This internal method prepares the model for inference by:
        - Loading the model weights from the specified pre-trained checkpoint.
        - Initializing the corresponding tokenizer for text processing.
        - Configuring device placement (CPU or GPU) and model-specific settings.
    
        This ensures that the model is ready to generate embeddings or perform other 
        NLP tasks.
    
        Returns
            None
    
        Notes
            This is a private/internal method intended to be called during class 
            initialization or when the model needs to be reloaded.
        """
        
        if 'e5' in self.model_name:
            # E5 models (Embeddings from Bidirectional Encoder Representations)
            self.model = SentenceTransformer(f"intfloat/{self.model_name}")
            print(f"Loaded E5 model: {self.model_name}")
        
        elif 'instructor' in self.model_name:
            # Instructor Models (Instruction-tuned)
            self.model = INSTRUCTOR(f"hkunlp/{self.model_name}")
            print(f"Modèle Instructor chargé: {self.model_name}")
        
        elif 'gte' in self.model_name:
            # GTE (General Text Embeddings)
            self.model = SentenceTransformer(f"thenlper/{self.model_name}")
            print(f"Loaded GTE model: {self.model_name}")
        
        else:
            raise ValueError(f"Modèle non supporté: {self.model_name}. Choisissez parmi E5, Instructor ou GTE.")


    def encode(self, 
               texts: Union[str, List[str]], 
               instruction: str = None,
               task_type: str = None) -> np.ndarray:
        """
        Encodes one or multiple texts into numerical embeddings using a pre-trained model.

        This function generates vector representations of input text(s), which capture 
        semantic meaning and can be used for tasks like semantic search, clustering, 
        classification, or other NLP applications. Optional instructions or task types 
        can guide the model to produce embeddings tailored to specific contexts.
    
        Parameters
            texts : str or List[str]
                A single text string or a list of text strings to encode.
            instruction : str, optional
                An optional instruction to guide the embedding generation. For example, 
                specifying "summarize" or "analyze sentiment" may influence the resulting embedding.
            task_type : str, optional
                An optional task type indicator that can modify how embeddings are computed, 
                depending on the model's capabilities (e.g., "classification", "retrieval").
        
            Returns
            np.ndarray
                A NumPy array of embeddings:
                - If `texts` is a single string, returns a 1D array.
                - If `texts` is a list, returns a 2D array where each row corresponds to a text.
    
        Example
            >>> embedding = encoder.encode("This is a sample text.")
            >>> embedding.shape
            (768,)
            
            >>> embeddings = encoder.encode(["Text 1", "Text 2"])
            >>> embeddings.shape
            (2, 768)
        """
        if isinstance(texts, str):
            texts = [texts]

        # E5 Models - Adds task prefix automatically
        if 'e5' in self.model_name:
            if task_type == 'search':
                texts = [f"query: {text}" for text in texts]
            elif task_type == 'passage':
                texts = [f"passage: {text}" for text in texts]
            return self.model.encode(texts, convert_to_numpy=True)
        
        # Instructor Models - Requires explicit instruction
        elif 'instructor' in self.model_name:
            if instruction is None:
                instruction = "Represent the text for retrieval:"
            paired_inputs = [[instruction, text] for text in texts]
            return self.model.encode(paired_inputs, convert_to_numpy=True)
        
        # GTE Models - No instructions required
        elif 'gte' in self.model_name:
            return self.model.encode(texts, convert_to_numpy=True)
        
        else:
            raise ValueError("Unsupported model")

In [89]:
print("\n=== E5 Embedding ===")
e5_embedder = InstructionEmbeddings("e5-large-v2")
embeddings = e5_embedder.encode(
    paragraph_segments, 
    task_type="search"
)
print(f"E5 embedding shape: {embeddings.shape}")  # (1, 1024)
print(f"Sample values: {embeddings[0][:5]}...")

print("\n=== Instructor Embedding ===")
instructor_embedder = InstructionEmbeddings("instructor-xl")
embeddings = instructor_embedder.encode(
    ["Quantum computing principles"], 
    instruction="Represent the scientific text for retrieval:"
)
print(f"Instructor embedding shape: {embeddings.shape}")  # (1, 768)
print(f"Sample values: {embeddings[0][:5]}...")

print("\n=== GTE Embedding ===")
gte_embedder = InstructionEmbeddings("gte-base")
embeddings = gte_embedder.encode(
    ["This is a general text embedding example"]
)
print(f"GTE embedding shape: {embeddings.shape}")  # (1, 768)
print(f"Sample values: {embeddings[0][:5]}...")


=== E5 Embedding ===
Modèle E5 chargé: e5-large-v2
E5 embedding shape: (778, 1024)
Sample values: [ 0.00914955 -0.04589565  0.03256417  0.02174237 -0.01066862]...

=== Instructor Embedding ===




Modèle Instructor chargé: instructor-xl
Instructor embedding shape: (1, 1024)
Sample values: [ 0.00064827 -0.06732617  0.00796025 -0.0156702   0.16671199]...

=== GTE Embedding ===
Modèle GTE chargé: gte-base
GTE embedding shape: (1, 768)
Sample values: [ 0.01220494 -0.02685339  0.02171323  0.04435849  0.07564417]...


# Document-level Embeddings (long context)

### Principle

Document-level embeddings (long context) involve generating a single numerical vector representing an entire long document, which can contain several thousand tokens, rather than being limited to individual sentences or chunks. This approach relies on advanced transformer models designed to handle large contexts, such as Longformer, which uses sparse attention to efficiently process sequences up to 4,096 tokens or more, or variants of GPT embeddings with extended context windows (for example, architectures like xAI’s Grok or models like Qwen2.5 embeddings). The principle is based on capturing the overall semantics of the document by aggregating contextual information across its entire length, often through mechanisms such as pooling (mean, max, or attention-weighted) applied to intermediate layer outputs. For example, a 3,000-word report on climate change would be encoded into a single vector reflecting the main themes (like CO2 emissions or environmental policies), rather than being split into independent segments. These embeddings are particularly useful for tasks requiring a holistic understanding, such as document classification or semantic search over long texts in RAG systems.

---

### Advantages

1. **Capturing global semantics**: These embeddings synthesize the entire content of a document, allowing the representation of themes or cross-sectional ideas that span multiple sections, such as the development of an argument in an essay or a multi-chapter analysis.
2. **Efficiency for long documents**: Models like Longformer or Qwen2.5, with their extended context capabilities, eliminate the need to pre-segment documents, simplifying RAG pipelines and reducing context loss caused by cuts.
3. **Improved relevance in RAG**: By providing a unified representation, these embeddings enable more precise search over entire documents, avoiding biases introduced by arbitrary segmentation and improving the quality of generated responses.
4. **Adaptability to complex corpora**: They are particularly suited to narrative texts, technical reports, or books where relationships between distant parts are crucial, offering a robust solution for applications such as legal or scientific document analysis.
5. **Reduced redundancy**: Unlike overlapping-chunk approaches, a single vector per document decreases the amount of data to index, optimizing resources in large knowledge bases.

---

### Limitations

1. **High computational cost**: Generating an embedding for a document of several thousand tokens requires significant resources (GPU, memory), especially with models like GPT embeddings or Longformer, making inference slow on large corpora.
2. **Loss of granularity**: By producing a single vector, these embeddings may obscure local details or semantic variations within the document, which can be problematic for tasks requiring fine-grained analysis (e.g., extracting a specific quote).
3. **Model quality dependence**: Performance depends on the model’s ability to generalize over long contexts. A poorly trained or domain-mismatched model may produce unrepresentative embeddings, especially for heterogeneous texts.
4. **Hardware limitations**: Even with optimized architectures like Longformer, the maximum sequence length (often 4,096 or 8,192 tokens) may be insufficient for extremely long documents, requiring truncation or prior segmentation.
5. **Integration complexity**: Implementing these embeddings in a RAG pipeline requires specific resource management and compatibility with existing infrastructures, adding a layer of difficulty compared to sentence- or chunk-level embeddings.

---

### Improvements and Suggestions

* **Resource optimization**: Use lightweight versions or techniques like quantization to reduce computational load while preserving embedding quality.
* **Hybrid segmentation**: Combine with initial segmentation (e.g., by sections) to handle overly long documents, then apply document-level embeddings to the subunits.
* **Domain-specific fine-tuning**: Adapt the model on a corpus of long documents in a specific domain (e.g., financial reports or scientific articles) to improve representativeness.
* **Custom attention mechanisms**: Integrate attention mechanisms focused on key parts of the document (e.g., summaries or conclusions) to enhance embedding relevance.
* **Evaluation**: Test embeddings with metrics such as retrieval accuracy in a RAG context or thematic coherence on benchmarks like LongBench to validate their effectiveness.

This technique requires robust infrastructure and careful configuration.

In [94]:
class LongDocumentEmbedder:
    def __init__(self, model_name: str = "allenai/longformer-base-4096"):
        """
        Initialize a model for long-document embeddings
        Available models:
            * Longformer: `'allenai/longformer-base-4096'`
            * GPT-NeoX: `'EleutherAI/gpt-neox-20b'`
            * Qwen: `'Qwen/Qwen1.5-7B'`
            * BGE-M3: `'BAAI/bge-m3'` (supports 8,192 tokens)
        """
        self.model_name = model_name
        self.tokenizer = None
        self.model = None
        self._load_model()
    
    def _load_model(self):
        """
        Loads the pre-trained model and its corresponding tokenizer.
        This internal method initializes the language model and tokenizer for generating 
        embeddings or performing other NLP tasks. It ensures that the model is ready for 
        inference and handles any necessary configuration, such as device placement 
        (CPU/GPU) or model-specific settings.
    
        Returns
            None
    
        Notes
            This is an internal method and is typically called automatically during the 
            initialization of the embedding class.
        """
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        
        # Special configuration for some models
        if "longformer" in self.model_name.lower():
            self.model = AutoModel.from_pretrained(
                self.model_name,
                attention_window=512,
                gradient_checkpointing=True
            )
        else:
            self.model = AutoModel.from_pretrained(self.model_name)
            
        print(f"Model {self.model_name} successfully loaded!")
        print(f"Maximum context length: {self.tokenizer.model_max_length} tokens")
            

    def _chunk_text(self, text: str, chunk_size: int = 4000) -> List[str]:
        """
        Splits a long text into smaller, manageable chunks for processing.

        This function breaks the input text into segments of up to `chunk_size` tokens or characters, 
        depending on the implementation. It aims to create "smart" chunks, which means it tries 
        to avoid splitting sentences or paragraphs abruptly, preserving semantic coherence 
        within each chunk.
    
        Parameters
            text : str
                The input text string that needs to be divided into chunks.
            chunk_size : int, optional (default=4000)
                The maximum size of each chunk. The function will attempt to keep chunks below this 
                limit while maintaining logical boundaries in the text.
    
        Returns
            List[str]
                A list of text chunks, each as a string, ready for further processing 
                (e.g., embeddings generation or document analysis).
    
        Example
            >>> text = "This is a long document that needs to be split into smaller parts..."
            >>> chunks = self._chunk_text(text, chunk_size=1000)
            >>> len(chunks)
            3
        """
        # Simple breakdown by sentences (to be improved as needed)
        sentences = text.split('. ')
        chunks = []
        current_chunk = ""
        
        for sentence in sentences:
            if len(current_chunk) + len(sentence) < chunk_size:
                current_chunk += sentence + ". "
            else:
                chunks.append(current_chunk)
                current_chunk = sentence + ". "
        
        if current_chunk:
            chunks.append(current_chunk)
            
        return chunks

    def encode(self, text: str, pooling: str = "mean") -> np.ndarray:
        """
        Encodes a given text into a fixed-size numerical vector (embedding).
        This function generates a vector representation of the input text using a 
        pre-trained language model. The output embedding captures the semantic meaning 
        of the text and can be used for tasks such as semantic search, clustering, 
        or downstream NLP applications.
    
        Parameters
            text : str
                The input text string to be converted into an embedding.
            pooling : str, optional (default="mean")
                The pooling strategy to aggregate token-level embeddings into a single vector:
                - "mean": average of all token embeddings
                - "max": maximum value of each dimension across token embeddings
                - "attention": attention-weighted aggregation (if supported by the model)
    
        Returns
            np.ndarray
                A 1-dimensional NumPy array representing the text embedding.
        
        Example
            >>> embedding = encoder.encode("This is a sample document.", pooling="mean")
            >>> embedding.shape
            (768,)
        """
        # Trim text if necessary
        chunks = self._chunk_text(text)
        all_embeddings = []
        
        for chunk in chunks:
            inputs = self.tokenizer(
                chunk,
                return_tensors="pt",
                truncation=True,
                padding=True,
                max_length=512#self.tokenizer.model_max_length
            )
            
            with torch.no_grad():
                outputs = self.model(**inputs)
            
            # Retrieving hidden states
            last_hidden = outputs.last_hidden_state.squeeze(0)
            
            # Pooling
            if pooling == "mean":
                chunk_embedding = torch.mean(last_hidden, dim=0)
            elif pooling == "max":
                chunk_embedding, _ = torch.max(last_hidden, dim=0)
            elif pooling == "cls":
                chunk_embedding = last_hidden[0]
            else:
                raise ValueError("Méthode de pooling non supportée")
            
            all_embeddings.append(chunk_embedding.numpy())
        
        # Combining chunk embeddings
        doc_embedding = np.mean(all_embeddings, axis=0)
        
        # L2 Normalization
        doc_embedding = doc_embedding / np.linalg.norm(doc_embedding)
        
        return doc_embedding

In [96]:
print("\n=== Longformer (4096 tokens) ===")
long_embedder = LongDocumentEmbedder("allenai/longformer-base-4096")

long_text = paragraph_segments[100][0:50]   # Your long document
embedding = long_embedder.encode(long_text)
print(f"Dimension de l'embedding: {embedding.shape}")
print(f"Valeurs exemple: {embedding[:5]}...")

print("\n=== BGE-M3 (8192 tokens) ===")
bge_embedder = LongDocumentEmbedder("BAAI/bge-m3")
embedding = bge_embedder.encode(long_text)
print(f"Embedding dimension: {embedding.shape}")


=== Longformer (4096 tokens) ===
Modèle allenai/longformer-base-4096 chargé avec succès!
Longueur maximale de contexte: 1000000000000000019884624838656 tokens
Dimension de l'embedding: (768,)
Valeurs exemple: [-0.00249754 -0.00301268  0.01257276  0.00879303 -0.01029909]...

=== BGE-M3 (8192 tokens) ===
Modèle BAAI/bge-m3 chargé avec succès!
Longueur maximale de contexte: 8192 tokens
Dimension de l'embedding: (1024,)
