# <center>Panongbene Sawadogo</center>

📩 **Contact** : amet1900@gmail.com

🌐 **Linkedin** : https://www.linkedin.com/in/panongbene-jean-mohamed-sawadogo-33234a168/

🗓️ **Last modified**: 16 August 2025

# <center>**Document Segmentation Techniques for RAG Implementation**</center>

> In this document, I implement and compare several techniques for segmenting documents into *chunks* in order to build a Retrieval-Augmented Generation (RAG) system.
> The goal is to explore how to divide text into coherent and usable units in order to:
>
> * improve information retrieval accuracy,
> * optimize indexing,
> * and provide the model with relevant context for generating high-quality responses.

# Libraries

In [1]:
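#!pip install numpy
#!pip install pandas
#!pip install PyPDF2
#!pip install matplotlib
#!pip install bitsandbytes
#!pip install --upgrade transformers
#!pip install nltk
#!pip install rich
#!pip install torch
#!pip install scikit-learn
#!pip install sentence-transformers
#!pip install FlagEmbedding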

In [2]:
import os
import re
import sys
import json
import nltk
import torch
import PyPDF2
import threading
import numpy as np
import pandas as pd
from rich.panel import Panel
from rich.syntax import Syntax
import matplotlib.pyplot as plt
from rich.console import Console
from dataclasses import dataclass
from nltk.tokenize import sent_tokenize
from FlagEmbedding import BGEM3FlagModel
from IPython.display import Markdown, display
from typing import List, Dict, Tuple, Union, Callable, Optional
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, TextIteratorStreamer, BitsAndBytesConfig

In [3]:
# Download NLTK resources if not already done (only once)
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

## Loading data

In [4]:
with open('docs/SuiteNumerique.md', 'r', encoding='utf-8') as f:
    sample_text = f.read()

# Fixed-length segmentation

#### *Principle*

This technique consists of dividing the text into blocks of fixed size, without taking into account specific technical or grammatical criteria. For example, a block length is predefined (in number of tokens or characters), and the text is split into equal segments corresponding to that length. This approach does not consider the natural boundaries of the text, such as sentences, paragraphs, or syntactic structures. For instance, if a block length of 100 characters is chosen, the text will be cut every 100 characters, regardless of whether this happens in the middle of a word or an idea.
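
The same idea can be applied to tokens instead of characters. The sketch below is only an illustration: it treats whitespace-separated words as a stand-in for real tokens, and the helper name `split_into_word_chunks` is hypothetical, not part of this notebook.

```python
from typing import List


def split_into_word_chunks(text: str, words_per_chunk: int) -> List[str]:
    """Split text into chunks of at most `words_per_chunk` whitespace-separated words."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be strictly positive")
    words = text.split()
    # Slice the word list in steps of `words_per_chunk`; the last chunk may be shorter
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


sentence = "The project was a success thanks to the team that worked hard"
word_chunks = split_into_word_chunks(sentence, 5)
print(word_chunks)
```

Counting in words avoids cutting in the middle of a word, but the split still ignores sentence boundaries: here the second chunk ends on "that", in the middle of the clause.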

---

#### *Advantages*

The main advantage of this method lies in its simplicity of implementation, fast execution, and high flexibility. It does not require complex linguistic analysis or advanced preprocessing, which makes it easy to integrate into various text processing pipelines. Moreover, it is robust and adaptable to different types of content, whether source code, English text, Arabic, or any other language or format. This universality makes it particularly useful in scenarios where data diversity is high and where standardization is preferable to customization.

---

#### *Disadvantages*

However, this method has several significant limitations. The main issue is the risk of cutting an idea or sentence in the middle, which can harm the semantic coherence of the generated blocks. For example, a sentence like “The project was a success thanks to the team that worked hard” could be split between “The project was a success thanks to the team” and “that worked hard,” making each part difficult to interpret on its own. Another drawback is the possible need to add padding (e.g., with spaces or neutral characters) for the last block, which might not reach the predefined length, thus introducing noise or inconsistency in the processed data. Finally, this approach may lead to a loss of contextual information, especially for tasks that require a global understanding of the text, such as in Retrieval-Augmented Generation (RAG) systems.

In [5]:
def split_text_into_fixed_chunks(text, chunk_size, padding_char=" "):
    """
    Splits a given text into fixed-size chunks.
    This function divides the input text into consecutive blocks of a specified length (`chunk_size`).
    If the last block is shorter than the desired length, it is padded with the specified character
    (`padding_char`) until it reaches the required size.
    Parameters:
        text: The input text to split. Must not be empty.
        chunk_size: The length of each chunk. Must be strictly positive.
        padding_char: The character used to pad the last chunk if it is shorter
                                      than `chunk_size`. Defaults to a space (" ").
    Returns:
        list[str]: A list of text chunks, each exactly `chunk_size` characters long (the last one is padded if needed).
    Raises:
        ValueError: If the input text is empty or if `chunk_size` is not strictly positive.
    Example:
        >>> split_text_into_fixed_chunks("HelloWorld", 4)
        ['Hell', 'oWor', 'ld  ']

        >>> split_text_into_fixed_chunks("abcdef", 3, "_")
        ['abc', 'def']
    """

    if not text.strip():
        raise ValueError("The text must not be empty")
    if chunk_size <= 0:
        raise ValueError("The chunk size must be strictly positive")

    chunks = []
    text_length = len(text)

    for i in range(0, text_length, chunk_size):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)

    # Pad the last chunk if it is shorter than chunk_size
    if chunks and len(chunks[-1]) < chunk_size:
        padding_needed = chunk_size - len(chunks[-1])
        chunks[-1] += padding_char * padding_needed

    return chunks

In [None]:
# Example
split_text_into_fixed_chunks(sample_text, 100, padding_char=" ")

# Sentence or Paragraph Segmentation

#### *Principle*

Sentence or paragraph-based segmentation involves dividing a text into coherent units based on natural boundaries defined by linguistic or typographic structure. In sentence segmentation, the text is split at terminal points (e.g., periods, exclamation marks, or question marks), taking grammatical rules into account to identify complete units of meaning. For paragraph segmentation, the text is divided according to line breaks or separation markers (such as double line breaks), grouping multiple sentences into logical blocks that represent a single idea or subtopic. This method often relies on natural language processing (NLP) tools, such as NLTK, SpaCy, or regular expressions, to automatically detect these boundaries.

---

#### *Advantages*

1. **Preservation of semantic coherence**: By respecting natural boundaries, this method maintains the integrity of ideas and the relationships between elements within a sentence or paragraph. For example, a sentence like “Despite the challenges, the team achieved its goals through dedication” remains intact, allowing for clear understanding.
2. **Improved embedding quality**: Embedding models (such as BERT or SentenceTransformers) perform better with units that capture complete context. A sentence or paragraph provides a richer vector representation than isolated small fragments.
3. **Noise reduction**: Unlike arbitrary splits (e.g., fixed-size chunks), this approach avoids cutting through the middle of an idea, minimizing the risk of misinterpretation or loss of information.
4. **Adaptability to complex tasks**: In applications like RAG, where retrieving relevant text is crucial, well-defined segments enable more precise searches and coherent responses, especially for queries requiring contextual understanding.
5. **Ease of implementation with existing tools**: Libraries like spaCy or NLTK offer ready-to-use sentence segmentation (e.g., NLTK's `sent_tokenize` or spaCy's `doc.sents`), reducing the need for custom development. Paragraph segmentation usually needs nothing more than a split on blank lines.

---

#### *Disadvantages*

1. **Dependence on detection errors**: Automatic segmentation can fail when faced with poorly structured texts, such as incomplete sentences, abbreviations (e.g., “Dr.”), or typographical errors (e.g., a missing period). For example, “Dr. Smith spoke.” could be misinterpreted as two sentences.
2. **Inconsistency in segment lengths**: Sentences or paragraphs naturally vary in length (e.g., one sentence may have 5 words, another 20), which can be problematic for models with token limits or requiring uniformity in processed data.
3. **Increased complexity**: Compared to fixed-size splitting, this method requires NLP tools or custom rules, increasing implementation complexity and software dependencies.
4. **Loss of context between segments**: While internal coherence is preserved, relationships between adjacent sentences or paragraphs can be broken if no overlap or linking mechanism is applied. For instance, an idea developed over two consecutive paragraphs might be fragmented.
5. **Lower performance on very large texts**: For massive documents, paragraph segmentation can produce units that are too long, exceeding the processing limits of some models (e.g., 512 tokens for BERT), requiring additional splitting or truncation.
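
The abbreviation problem from the first disadvantage can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not what NLTK does internally: `naive_split`, `abbreviation_aware_split`, and the tiny `ABBREVIATIONS` set are hypothetical names introduced for illustration.

```python
import re
from typing import List

# Hypothetical, deliberately tiny abbreviation list (illustration only)
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e."}


def naive_split(text: str) -> List[str]:
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_split(text: str) -> List[str]:
    """Merge fragments that end with a known abbreviation into the next fragment."""
    sentences: List[str] = []
    buffer = ""
    for fragment in naive_split(text):
        buffer = f"{buffer} {fragment}".strip()
        if buffer.split()[-1] in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation: keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text = "Dr. Smith spoke. The audience listened."
print(naive_split(text))               # 'Dr.' is wrongly isolated
print(abbreviation_aware_split(text))  # 'Dr. Smith spoke.' stays whole
```

A hand-written abbreviation list does not scale across languages and domains, which is one reason to prefer a trained tokenizer when one is available.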

In [7]:
def segmenter_texte(text: str, mode: str = "sentences", chunk_size: int = 1, language: str = "french" ) -> List[str]:
    """
    Segments a text into chunks of sentences or paragraphs.
    Parameters:
        text: Text to segment
        mode: "sentences" or "paragraphs"
        chunk_size: Number of elements per chunk
        language: Language used for sentence tokenization
    Returns:
        List of text chunks
    Raises:
        ValueError: If parameters are invalid
    """
    # Validate inputs
    if not isinstance(text, str) or not text.strip():
        raise ValueError("Text must not be empty")
    
    # The accepted values match the default argument and the error message
    if mode not in {"sentences", "paragraphs"}:
        raise ValueError("Mode must be 'sentences' or 'paragraphs'")
    
    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError("The chunk size must be a positive integer")

    # Text cleaning
    text = text.strip()
    
    if mode == "sentences":
        # Sentence segmentation with NLTK
        elements = sent_tokenize(text, language=language)
    else:
        # Paragraph segmentation: every non-empty line is treated as a paragraph
        elements = [p for p in text.split('\n') if p.strip()]
    
    # Creating chunks
    chunks = []
    for i in range(0, len(elements), chunk_size):
        chunk = elements[i:i + chunk_size]
        
        # Join with space for sentences, line break for paragraphs
        separator = " " if mode == "sentences" else "\n"
        chunks.append(separator.join(chunk))
    
    return chunks

In [8]:
sentence_segments = segmenter_texte(sample_text, "sentences", 2)

In [None]:
sentence_segments

# Hierarchical segmentation (by headings and subheadings)

### Principle

Hierarchical segmentation by headings and subheadings involves dividing a text according to its document structure, by identifying and leveraging heading levels (e.g., H1, H2, H3 in an HTML document, or conventions like "1. Introduction", "1.1 Objectives") and subheadings that organize content into sections and subsections. This method relies on recognizing typographical or syntactic markers (such as numbering, separator lines, or specific tags) to delimit text blocks associated with each hierarchical level. For example, in a technical manual, a heading like “2. Installation” might be followed by a subheading “2.1 Prerequisites,” each corresponding to a distinct semantic unit. This approach can be automated using text processing tools (e.g., HTML parsing with BeautifulSoup or pattern detection with regular expressions) or implemented manually based on the known structure of the document.
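
For numbered conventions such as "1. Introduction" or "1.1 Objectives", the heading level can be read from the depth of the numbering. The following is a minimal sketch under that assumption; `NUMBERED_HEADING` and `parse_numbered_heading` are hypothetical names, not part of this notebook.

```python
import re
from typing import Optional, Tuple

# Matches "1. Introduction", "1.1 Objectives", "2.3.4 Details", ...
NUMBERED_HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def parse_numbered_heading(line: str) -> Optional[Tuple[int, str]]:
    """Return (level, title) for a numbered heading, or None for any other line."""
    match = NUMBERED_HEADING.match(line.strip())
    if match is None:
        return None
    numbering, title = match.groups()
    # "2" is level 1, "2.1" is level 2, "2.1.3" is level 3, and so on
    level = numbering.count(".") + 1
    return level, title.strip()


for line in ["2. Installation", "2.1 Prerequisites", "Some body text."]:
    print(repr(line), "->", parse_numbered_heading(line))
```

A pattern this permissive also matches ordinary sentences that start with a number (for example "2024 was a busy year"), so in practice it would need extra constraints such as a maximum line length.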

---

### Advantages

1. **Effective for structured documents**: This method excels at processing formal documents such as reports, technical manuals, academic articles, or books, where headings and subheadings serve as clear anchors for organizing information. For instance, a user manual can be segmented into sections like “Configuration” and “Troubleshooting” for quick navigation.
2. **Preserves contextual structure**: By respecting the hierarchy, each segment retains its context within the overall structure, facilitating comprehension and retrieval of specific information. This is particularly useful in search systems or retrieval-augmented generation (RAG) applications.
3. **Ease of indexing and navigation**: Hierarchical segments can be indexed with their corresponding headings, allowing targeted search (e.g., “see section 3.2”) and intuitive navigation, improving user experience in digital interfaces.
4. **Reduces unnecessary overlaps**: Unlike methods based on fixed sizes or arbitrary overlaps, this segmentation aligns with the author’s intent, minimizing cuts in the middle of ideas or themes.
5. **Adaptable to updates**: For evolving documents (such as wikis or knowledge bases), hierarchical segmentation allows adding or modifying sections without disrupting the overall structure, facilitating maintenance.

---

### Disadvantages

1. **Less suited for unstructured texts**: This method is ineffective for texts lacking headings or explicit structure, such as emails, conversation transcripts, or informal blog posts. For example, continuous text without tags or numbering cannot be reliably segmented.
2. **Dependent on structure quality**: If headings or subheadings are poorly defined, inconsistent (e.g., missing levels or numbering errors), or absent in parts of the document, segmentation may produce incomplete or misaligned segments.
3. **Implementation complexity**: Automation requires specific tools (e.g., HTML parsing, pattern detection with regex) or manual annotation, increasing workload compared to simpler methods like fixed-size splitting.
4. **Risk of unbalanced segments**: Sections defined by headings can vary greatly in length (e.g., a 50-word subsection versus a 500-word one), which can be problematic for models with processing limits (like token limits in AI models).
5. **Loss of fine-grained context**: Although the hierarchy is preserved, transitions between subsections or details within a segment may be overlooked, especially if segmentation does not consider overlaps or intra-section relationships.

---

### Improvements and Suggestions

* **Automated detection**: Use libraries like BeautifulSoup (for HTML) or regular expressions to robustly identify headings and subheadings, even across different formats (Markdown, Word, PDF).
* **Contextual overlap**: Include overlap between sections (e.g., include the end of one subsection in the next) to preserve logical transitions.
* **Length normalization**: Apply a maximum length limit or split overly long sections into subsegments while respecting hierarchy.
* **Manual validation**: For critical documents, combine automatic segmentation with human review to correct detection errors.
* **Evaluation**: Test this method using metrics such as semantic coverage or retrieval accuracy in a RAG context.

This technique is particularly suitable for domains where document structure is an advantage, such as technical documentation or knowledge management, but requires careful preparation for less organized texts.
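
The length-normalization suggestion can be sketched as follows: an overly long section is cut into sub-segments, and each sub-segment repeats the section title so that the hierarchical context is not lost. The function name `split_long_section` and the word-based limit are assumptions made for this example.

```python
from typing import List


def split_long_section(title: str, content: str, max_words: int = 50) -> List[str]:
    """Split a section body into sub-chunks of at most `max_words` words,
    each prefixed with the section title."""
    if max_words <= 0:
        raise ValueError("max_words must be strictly positive")
    words = content.split()
    if not words:
        return [title]  # a heading with no body still yields one chunk
    return [
        f"{title}\n" + " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


parts = split_long_section("2.1 Prerequisites", "a b c d e", max_words=2)
for part in parts:
    print(repr(part))
```

Splitting on a word count is the simplest choice; splitting on sentence boundaries inside the section would preserve more coherence at the cost of less uniform sizes.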

In [10]:
@dataclass
class DocumentSection:
    title: str
    level: int
    content: str
    subsections: List['DocumentSection']

In [11]:
def segmentation_hiérarchique(
    texte: str,
    motifs_titres: List[Tuple[str, int]] = [
        (r'^\#\s(.+)$', 1),         # Markdown # Title
        (r'^\#\#\s(.+)$', 2),       # Markdown ## Subtitle
        (r'^\#\#\#\s(.+)$', 3),     # Markdown ### Sub-subtitle
        (r'^I\.\s(.+)$', 1),        # I. Title
        (r'^[A-Z]\.\s(.+)$', 2),    # A. Subtitle
        (r'^\d+\.\s(.+)$', 3),      # 1. Sub-subtitle
        (r'^[A-Z][^\.]+:$', 1)      # TITLE: (a line ending with a colon)
    ]
) -> DocumentSection:
    """
    Segments a text hierarchically based on its structure.
    Parameters:
        texte: Text to be segmented
        motifs_titres: List of tuples (regex, level) to identify headings    
    Returns:
        Hierarchical structure of the document with sections and subsections
    """
    if not texte.strip():
        raise ValueError("Text must not be empty")
    
    # Preprocessing: strip each line and drop empty lines
    lignes = [ligne.strip() for ligne in texte.split('\n') if ligne.strip()]
    racine = DocumentSection(title="Racine", level=0, content="", subsections=[])
    pile = [racine]
    
    for ligne in lignes:
        titre, niveau = detecter_titre(ligne, motifs_titres)
        if titre:
            # Create a new section
            nouvelle_section = DocumentSection(
                title=titre,
                level=niveau,
                content="",
                subsections=[]
            )            
            # Move up the stack to the appropriate parent
            while len(pile) > 1 and pile[-1].level >= niveau:
                pile.pop()
            
            # Add to parent section
            pile[-1].subsections.append(nouvelle_section)
            pile.append(nouvelle_section)
        else:
            # Add content to the current section
            if pile[-1].content:
                pile[-1].content += "\n" + ligne
            else:
                pile[-1].content = ligne
    
    return racine

In [12]:
def detecter_titre(ligne: str, motifs: List[Tuple[str, int]]) -> Tuple[Optional[str], int]:
    """
    Detects if a line is a heading and returns its text and level.
    """
    for motif, niveau in motifs:
        match = re.match(motif, ligne)
        if match:
            titre = match.group(1) if len(match.groups()) > 0 else ligne
            return (titre.strip(), niveau)
    return (None, 0)

In [13]:
def afficher_structure(section: DocumentSection, indent: int = 0) -> None:
    """
    Recursively displays the document structure.
    """
    if section.title != "Racine" or section.subsections:
        print(" " * indent + f"{section.title} (level {section.level})")
    for subsection in section.subsections:
        afficher_structure(subsection, indent + 2)

In [None]:
structure = segmentation_hiérarchique(sample_text)
print("=== Hierarchical structure ===")
afficher_structure(structure)

print("\n=== Contents of a section ===")
print(structure.subsections[0].subsections[0].content)

# Overlapping Segmentation (Overlapping Chunks)

### Principle

Overlapping segmentation (overlapping chunks) involves dividing a text into blocks (chunks) of a predefined size, where each chunk partially overlaps with the previous or next chunk to preserve context between segments. This method requires defining a chunk size (e.g., 200 characters or 50 tokens) and an overlap length (e.g., 50 characters or 10 tokens), which represents the portion shared between two consecutive chunks. For example, if a text is split into chunks of 100 characters with a 20-character overlap, the second chunk will start 80 characters after the beginning of the first, including the last 20 characters of the first chunk. This approach can be applied at the character, word, or token level and is often combined with other segmentation techniques (e.g., by sentence) to optimize contextual coherence. Implementation can be done using simple loops or text-processing tools like NLTK, adjusting slicing indices according to the overlap.
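
The 100-character / 20-character example above can be checked with a short character-level sketch. The function name `overlapping_char_chunks` is hypothetical and the input is a synthetic 250-character string.

```python
from typing import List


def overlapping_char_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Slide a window of `chunk_size` characters, moving `chunk_size - overlap` each time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be strictly positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the end of the text is covered
    return chunks


text = "".join(str(i % 10) for i in range(250))  # synthetic 250-character input
char_chunks = overlapping_char_chunks(text, chunk_size=100, overlap=20)
print([len(c) for c in char_chunks])
print(char_chunks[0][-20:] == char_chunks[1][:20])
```

The windows start at characters 0, 80, and 160, and the last 20 characters of each chunk are the first 20 of the next one.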

---

### Advantages

1. **Reduced context loss**: Overlapping maintains logical transitions and semantic relationships between chunks, preventing abrupt cuts in the middle of ideas. For example, if a sentence like “The project succeeded thanks to the team who worked hard” is split, the overlap can include “the team” in the next chunk, aiding comprehension.
2. **Improved embedding quality**: Embedding models (e.g., BERT or SentenceTransformers) benefit from this contextual continuity, producing richer and more coherent vector representations, which is crucial for tasks such as semantic search or text generation.
3. **Flexibility in RAG applications**: In retrieval-augmented systems, overlapping chunks allow for more relevant information retrieval, even for queries that depend on context spanning multiple segments.
4. **Adaptable to varied texts**: This method works well with texts of different lengths and structures (narrative, technical, conversational), as long as the overlap size is adjusted according to semantic density.
5. **Easy optimization**: The overlap percentage can be finely tuned (e.g., 10%, 20% of chunk size) to balance coherence and resource usage, making the method adaptable to specific constraints.

---

### Disadvantages

1. **Increased total corpus size**: Overlapping generates redundant content, increasing the total number of segments and, consequently, the amount of data to store and process. For example, a 1000-character text split into 200-character chunks yields 5 chunks without overlap but 7 chunks with a 50-character overlap (a step of 150 characters), i.e. 40% more segments.
2. **Higher computational cost**: Indexing and searching a larger corpus require more resources (CPU time, memory), which can be an issue for large datasets or low-power systems.
3. **Risk of excessive redundancy**: If the overlap is too large (e.g., 50% of chunk size), much of the content becomes duplicated, diluting the diversity of retrieved information and unnecessarily increasing processing load.
4. **Complexity in setting boundaries**: Choosing the optimal chunk and overlap size requires prior analysis of the text, and poor calibration can either fail to preserve context or overload the system.
5. **Challenges with short texts**: In very short documents, a large overlap can make chunks nearly identical, reducing segmentation efficiency and increasing noise in the processed data.
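
The growth in corpus size can be estimated before chunking anything. The sketch below assumes a sliding window that stops as soon as the end of the input is covered; `count_chunks` is a hypothetical helper name.

```python
import math


def count_chunks(n_elements: int, chunk_size: int, overlap: int = 0) -> int:
    """Number of chunks produced by a sliding window of `chunk_size` elements
    that advances by `chunk_size - overlap` and stops once the input is covered."""
    if n_elements <= 0 or chunk_size <= 0:
        raise ValueError("n_elements and chunk_size must be strictly positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if n_elements <= chunk_size:
        return 1
    step = chunk_size - overlap
    return math.ceil((n_elements - chunk_size) / step) + 1


for overlap in (0, 50, 100):
    print(f"overlap={overlap}: {count_chunks(1000, 200, overlap)} chunks")
```

With 1000 elements and chunks of 200, the count goes from 5 (no overlap) to 7 (overlap of 50) and 9 (overlap of 100), which makes the cost of a large overlap concrete.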

---

### Improvements and Suggestions

* **Dynamic overlap adjustment**: Use semantic analysis (e.g., with NLP models) to adapt overlap length based on information density or contextual transitions.
* **Redundancy filtering**: Apply deduplication or similarity thresholding (e.g., using cosine similarity) to remove overly similar chunks, reducing corpus size.
* **Batch optimization**: Process chunks in batches with variable overlap sizes to balance performance and coherence, especially for large text volumes.
* **Integration with other methods**: Combine this segmentation with sentence or paragraph splitting to align overlaps with natural boundaries, improving semantic quality.
* **Evaluation**: Measure the impact of overlap with metrics such as retrieval precision or contextual coherence in a RAG context.

This technique is particularly useful in scenarios where preserving context is critical, such as long-document analysis or conversational AI systems, but it must be carefully calibrated to avoid unnecessary overload.
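
Redundancy filtering can be sketched without any model. The example below swaps the cosine similarity of embeddings for a Jaccard similarity on word sets, purely to keep it dependency-free; the names `jaccard` and `drop_near_duplicates` and the 0.8 threshold are assumptions for illustration.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def drop_near_duplicates(chunks: List[str], max_similarity: float = 0.8) -> List[str]:
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept: List[str] = []
    for chunk in chunks:
        if all(jaccard(chunk, other) <= max_similarity for other in kept):
            kept.append(chunk)
    return kept


candidates = [
    "the team worked hard on the project",
    "the team worked hard on the project today",
    "birds migrate south in winter",
]
filtered = drop_near_duplicates(candidates)
print(filtered)
```

The second candidate shares 6 of 7 distinct words with the first (similarity of about 0.86), so it is dropped. The pairwise comparison is quadratic in the number of chunks, so a large corpus would need an approximate method.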

In [15]:
def overlapping_chunks(text: str, chunk_size: int = 5, overlap_size: int = 2, mode: str = "sentences", lang: str = "french") -> List[str]:
    """
    Segments a text with overlapping chunks.
    Parameters:
        text: Text to segment
        chunk_size: Number of elements per chunk
        overlap_size: Number of overlapping elements
        mode: "sentences", "tokens", or "paragraphs"
        lang: Language for tokenization
    Returns:
        List of overlapping chunks
    Raises:
        ValueError: If the parameters are invalid
    """
    # Validation of parameters
    if not text.strip():
        raise ValueError("Text must not be empty")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if overlap_size >= chunk_size or overlap_size < 0:
        raise ValueError("overlap_size must be non-negative and less than chunk_size")
    
    # Tokenization by mode
    if mode == "sentences":
        elements = sent_tokenize(text, language=lang)
    elif mode == "paragraphs":
        elements = [p for p in text.split('\n') if p.strip()]
    elif mode == "tokens":
        elements = re.findall(r'\w+|\S', text)  # Simple tokenization
    else:
        raise ValueError("Mode must be 'sentences', 'paragraphs' or 'tokens'")
    
    # Size Check
    if len(elements) <= chunk_size:
        return [" ".join(elements)]
    
    # Generating chunks with overlap
    chunks = []
    step = chunk_size - overlap_size
    for i in range(0, len(elements), step):
        chunk = elements[i:i + chunk_size]
        chunks.append(" ".join(chunk))
        
        # Stop if you reach the end
        if i + chunk_size >= len(elements):
            break
    
    return chunks

In [None]:
with open('docs/SuiteNumerique.md', 'r', encoding='utf-8') as f:
    sample_text = f.read()
    
print("=== Sentence overlap ===")
chunks = overlapping_chunks(sample_text, chunk_size=3, overlap_size=1, mode="sentences")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk[:80]}...")

print("\n=== Paragraph overlap ===")
chunks = overlapping_chunks(sample_text, chunk_size=2, overlap_size=1, mode="paragraphs")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk[:80]}...")

print("\n=== Token overlap ===")
chunks = overlapping_chunks("Ceci est un exemple de tokenisation.", chunk_size=4, overlap_size=2, mode="tokens")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")

# Semantic-based Segmentation (Semantic Chunking)

### Principle

Semantic-based segmentation (semantic chunking) involves dividing a text into segments based on changes in topic or theme, rather than on syntactic criteria or fixed sizes. This method uses embedding models (e.g., BERT, SentenceTransformers) to convert text into vector representations, then applies clustering algorithms (like K-means or DBSCAN) or similarity threshold detection to identify semantic transitions. The process begins by splitting the text into preliminary units (e.g., sentences or small blocks) and then evaluating semantic coherence between these units. When a significant topic change is detected (e.g., a drop in cosine similarity between two consecutive blocks), a new segment is started. For example, in a text shifting from "Birds migrate in winter" to "Sustainable construction techniques," semantic chunking would create two distinct chunks reflecting the different themes, even if the sentences are contiguous.
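
The threshold-detection variant can be illustrated with hand-made vectors. In the sketch below the two-dimensional `toy_embeddings` are invented for the example and stand in for the output of a real embedding model; `cosine` and `split_on_similarity_drop` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_on_similarity_drop(
    sentences: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Start a new chunk whenever two consecutive sentences fall below `threshold`."""
    if len(sentences) != len(embeddings):
        raise ValueError("one embedding per sentence is required")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # topic change detected: open a new chunk
        chunks[-1].append(sentences[i])
    return chunks


sentences = [
    "Birds migrate in winter.",
    "They fly south in flocks.",
    "Sustainable construction uses recycled materials.",
]
# Invented 2-D vectors: the first two point in a similar direction, the third does not
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
semantic_chunks = split_on_similarity_drop(sentences, toy_embeddings)
print(semantic_chunks)
```

The two bird sentences stay together and the construction sentence opens a new chunk, because the similarity between the second and third vectors is about 0.31, below the 0.5 threshold.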

---

### Advantages

1. **Alignment with complete ideas**: Each chunk represents a coherent semantic unit, capturing a specific topic or idea. This ensures that segments do not cut off a thought midway, improving readability and comprehension.
2. **Improved relevance in RAG**: In retrieval-augmented systems, this segmentation allows retrieving chunks that are more relevant to a given query, as they align with specific concepts rather than arbitrary boundaries. For example, a query on "bird migration" will retrieve only the relevant chunk without contextual noise.
3. **Adaptability to different writing styles**: This method works with texts of various natures (narrative, technical, conversational) because it relies on meaning rather than a predefined structure, making its application universal.
4. **Optimization of embeddings**: By grouping units that share semantic similarity, embedding models produce more consistent and accurate representations, enhancing search or classification performance.
5. **Support for thematic analysis**: This segmentation facilitates extraction of main themes in large corpora, useful for document summarization or theme-based sentiment analysis.

---

### Disadvantages

1. **Requires complex NLP processing**: This method relies on embedding models (often resource-intensive) and clustering algorithms, increasing implementation complexity compared to simple segmentation methods (by fixed size or sentences). For example, using BERT may require GPU access to process large volumes of text.
2. **High computational cost**: Calculating embeddings for each text unit, followed by similarity analysis or clustering, can be slow, especially for large documents, increasing processing time and hardware requirements.
3. **Dependence on model quality**: The segmentation accuracy depends on the performance of the NLP model used. A poorly trained or domain-inappropriate model (e.g., a general model on medical text) can produce incoherent segments or misdetect topic changes.
4. **Difficult parameter tuning**: Defining similarity thresholds or clustering parameters (e.g., number of clusters or maximum distance) requires manual calibration or experimental validation, which can be error-prone or subjective.
5. **Limitations with ambiguous texts**: In texts where topic transitions are gradual or implicit (e.g., smooth narratives), semantic chunking may produce overly fragmented chunks or merge distinct ideas, affecting coherence.

---

### Improvements and Suggestions

* **Resource optimization**: Use lightweight models (e.g., DistilBERT) or precomputed embeddings to reduce computational load while maintaining good semantic performance.
* **Dynamic threshold adjustment**: Implement adaptive detection of topic changes based on local semantic density rather than a fixed threshold to better capture transitions.
* **Text preprocessing**: Clean and normalize the text (remove noise, correct errors) before segmentation to improve embedding quality.
* **Hybrid validation**: Combine semantic segmentation with methods based on natural boundaries (e.g., sentences) to refine results and reduce errors.
* **Evaluation**: Test segmentation using metrics like semantic coherence (e.g., the average embedding similarity between sentences of the same chunk) or retrieval accuracy in a RAG context to fine-tune parameters.

This technique is especially suitable for applications where semantic understanding is critical, such as complex document analysis or conversational AI systems, but it requires robust infrastructure and NLP expertise.
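
One simple way to avoid a hand-tuned fixed threshold is to derive the cutoff from the similarities themselves. The sketch below is a simplification of the dynamic-threshold suggestion: it uses document-level statistics rather than local density, and the name `adaptive_boundaries` and the factor `k` are assumptions.

```python
import statistics
from typing import List


def adaptive_boundaries(similarities: List[float], k: float = 1.0) -> List[int]:
    """Return sentence indices at which a new chunk starts.

    similarities[i] is the similarity between sentence i and sentence i + 1.
    A boundary is placed wherever the similarity is more than `k` standard
    deviations below the mean of all similarities.
    """
    if len(similarities) < 2:
        return []
    cutoff = statistics.mean(similarities) - k * statistics.pstdev(similarities)
    return [i + 1 for i, sim in enumerate(similarities) if sim < cutoff]


sims = [0.82, 0.79, 0.31, 0.85, 0.80]
boundaries = adaptive_boundaries(sims)
print(boundaries)
```

Here the mean is 0.714 and the standard deviation about 0.203, so the cutoff is roughly 0.51 and only the drop to 0.31 triggers a boundary, before sentence 3. Larger values of `k` produce fewer, more conservative boundaries.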

In [17]:
class SemanticChunker:
    """
    Semantics-based text segmenter that detects topic changes
    """
    
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2', threshold: float = 0.85, window_size: int = 3):
        """
        Initialize the semantic segmenter.
        Parameters:
          model_name: Name of the SentenceTransformer model to use
          threshold: Similarity threshold for detecting topic changes
          window_size: Size of the sliding window for calculating similarity
        """
        self.model = SentenceTransformer(model_name)
        self.threshold = threshold
        self.window_size = window_size
    
    def embed_sentences(self, sentences: List[str]) -> np.ndarray:
        """Converts sentences to embeddings."""
        return self.model.encode(sentences, convert_to_tensor=False)
    
    def calculate_similarities(self, embeddings: np.ndarray) -> List[float]:
        """Calculates similarities between consecutive sentences."""
        similarities = []
        for i in range(len(embeddings) - 1):
            sim = cosine_similarity(
                embeddings[i].reshape(1, -1),
                embeddings[i + 1].reshape(1, -1)
            )[0][0]
            similarities.append(sim)
        return similarities
    
    def find_chunk_boundaries(self, similarities: List[float]) -> List[int]:
        """Find chunk boundaries based on similarities"""
        boundaries = []
        window = []
        
        for i, sim in enumerate(similarities):
            window.append(sim)
            if len(window) > self.window_size:
                window.pop(0)
            
            # If the window mean falls below the threshold, mark a boundary
            if len(window) == self.window_size and np.mean(window) < self.threshold:
                boundaries.append(i + 1)  # +1: the boundary sits after the drop
        
        return boundaries
    
    def chunk_text(self, text: str) -> List[Tuple[int, int, str]]:
        """
        Segments the text into semantic chunks.
        Returns:
            List of tuples (start_idx, end_idx, chunk_text)
        """
        # Split into sentences (still a simplistic splitter, to be improved as needed)
        # while recording each sentence's span in the original text, so that the
        # chunk offsets computed below stay exact
        spans = [m.span() for m in re.finditer(r'[^.!?]+[.!?]*', text)
                 if m.group().strip('.!? \t\r\n')]
        sentences = [text[s:e].strip() for s, e in spans]
        if not sentences:
            return []
        
        # Sentence embeddings
        embeddings = self.embed_sentences(sentences)
        
        # Calculating similarities between consecutive sentences
        similarities = self.calculate_similarities(embeddings)
        if not similarities:
            return [(0, len(text), text)]
        
        # Finding the limits of chunks
        boundaries = self.find_chunk_boundaries(similarities)
        
        # Rebuild chunks from boundaries
        chunks = []
        start_idx = 0
        
        # Start offset of each sentence in the original text, plus a final
        # sentinel equal to len(text)
        sentence_offsets = [s for s, _ in spans] + [len(text)]
        
        for boundary in boundaries:
            end_idx = boundary
            chunk_sentences = sentences[start_idx:end_idx]
            chunk_start = sentence_offsets[start_idx]
            chunk_end = sentence_offsets[end_idx] if end_idx < len(sentence_offsets) else len(text)
            chunk_text = text[chunk_start:chunk_end].strip()
            
            if chunk_text:
                chunks.append((chunk_start, chunk_end, chunk_text))
            start_idx = end_idx
        
        # Add the last chunk
        if start_idx < len(sentences):
            chunk_start = sentence_offsets[start_idx]
            chunk_text = text[chunk_start:].strip()
            if chunk_text:
                chunks.append((chunk_start, len(text), chunk_text))
        
        return chunks

In [None]:
# Initialize the semantic chunker
chunker = SemanticChunker(threshold=0.8)

# Segment the text
chunks = chunker.chunk_text(sample_text)

# Display the results
print("Text segmented into semantic chunks:")
for i, (start, end, chunk) in enumerate(chunks, 1):
    print(f"\nChunk {i} (positions {start}-{end}):")
    print(chunk)
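
To see how the windowed rule in `find_chunk_boundaries` behaves, here is a minimal standalone sketch of the same logic applied to hand-picked similarity values. The function name `window_boundaries` and the numbers are illustrative assumptions, not model output.

```python
from collections import deque
from statistics import mean
from typing import List


def window_boundaries(similarities: List[float], threshold: float, window_size: int) -> List[int]:
    """Return sentence indices where the rolling mean similarity is below the threshold."""
    window = deque(maxlen=window_size)  # keeps only the last `window_size` values
    boundaries = []
    for i, sim in enumerate(similarities):
        window.append(sim)
        if len(window) == window_size and mean(window) < threshold:
            boundaries.append(i + 1)  # the break sits just before sentence i + 1
    return boundaries


# Made-up similarities between six consecutive sentences
toy_similarities = [0.9, 0.9, 0.5, 0.4, 0.9]
boundaries = window_boundaries(toy_similarities, threshold=0.8, window_size=2)
print(boundaries)  # [3, 4, 5]
```

A single dip in similarity produces three consecutive boundaries here, because the low values stay in the window for several steps. With a high threshold this tends to yield many one-sentence chunks, so keeping only the first boundary of each run may be worth considering.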

# Multi-level segmentation (hierarchical semantic chunking)

### Principle

Multi-level segmentation, also called hierarchical semantic chunking, combines segmentation based on document structure (titles, subtitles, and sections) with semantic segmentation inside each structural unit. The approach has two complementary steps. First, the document is split along its explicit hierarchy (for example, by identifying titles like "1. Introduction" or "2.1 Methodology" using tools such as BeautifulSoup for HTML or regular expressions for plain text and Markdown), which creates high-level segments. Then, each segment is subdivided wherever its topic changes, using embedding models (e.g., BERT or SentenceTransformers) together with clustering or similarity-detection techniques (like K-means or cosine-similarity thresholds). For instance, in a section titled "2. Results," a semantic sub-segmentation could separate "Data Analysis" from "Trend Interpretation," even though both topics fall under the same title, giving fine granularity while preserving the overall structure.
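
The first (structural) step can be sketched with a regular expression alone. The example below is a minimal illustration that assumes Markdown-style `#` headings; `split_by_headings` and the sample document are hypothetical, and any text before the first heading is ignored for brevity.

```python
import re
from typing import List, Tuple

# Markdown-style headings: one or more '#' followed by the title
HEADING = re.compile(r'^(#+)\s*(.*)$', re.MULTILINE)


def split_by_headings(text: str) -> List[Tuple[int, str, str]]:
    """Return (level, title, body) for each heading found in `text`."""
    matches = list(HEADING.finditer(text))
    sections = []
    for idx, match in enumerate(matches):
        body_start = match.end()
        # The body runs until the next heading, or until the end of the text
        body_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        level = len(match.group(1))
        sections.append((level, match.group(2).strip(), text[body_start:body_end].strip()))
    return sections


doc = (
    "# Results\nIntro line.\n"
    "## Data Analysis\nWe counted rows.\n"
    "## Trend Interpretation\nValues rise."
)
sections = split_by_headings(doc)
for level, title, body in sections:
    print(level, title, "->", body)
```

Each `body` returned here is what the second (semantic) step would then subdivide.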

---

### Advantages

1. **Balance between document structure and semantic relevance**: This method leverages the explicit hierarchy to organize segments at a macro level while applying semantic analysis to refine sub-segments, providing an optimal combination of structural clarity and thematic coherence.
2. **Improved relevance in RAG**: Multi-level segments enable more precise retrieval in RAG systems, as they align units to specific topics while maintaining their context within the overall structure. For example, a query on "trend interpretation" will directly target the relevant sub-segment within the "Results" section.
3. **Adaptability to various formats**: This approach works well with structured documents (reports, manuals) while accommodating more fluid internal sections, making it versatile across heterogeneous corpora.
4. **Preservation of hierarchical context**: By keeping titles as metadata, this segmentation facilitates navigation and understanding, allowing users or models to follow the document's organizational logic.
5. **Optimization for thematic analysis**: Combining the two methods allows extraction of themes at different levels (global via structure, local via semantics), which is valuable for tasks such as summarization or content classification.
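
Keeping titles as metadata, as mentioned in point 4, can be as simple as carrying the full heading path with each chunk. The sketch below assumes the sections were already split into `(level, title, body)` triples; `attach_heading_paths` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def attach_heading_paths(sections: List[Tuple[int, str, str]]) -> List[Dict[str, object]]:
    """Turn (level, title, body) triples into chunks that carry their heading path."""
    path: List[Tuple[int, str]] = []  # stack of (level, title) for the open headings
    chunks = []
    for level, title, body in sections:
        # Close headings of the same or a deeper level before opening the new one
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, title))
        if body:
            chunks.append({"heading_path": [t for _, t in path], "text": body})
    return chunks


# Hypothetical pre-split sections: (heading level, title, body text)
sections = [
    (1, "Results", "Overview of the findings."),
    (2, "Data Analysis", "We counted rows."),
    (2, "Trend Interpretation", "Values rise."),
    (1, "Conclusion", "Done."),
]
chunks = attach_heading_paths(sections)
for chunk in chunks:
    print(" > ".join(chunk["heading_path"]), "|", chunk["text"])
```

At retrieval time the heading path can be prepended to the chunk text or stored as a filterable field, depending on the index used.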

---

### Disadvantages

1. **More complex pipeline to implement**: Combining structure-based segmentation (requiring parsing tools) and semantic segmentation (requiring NLP models and clustering algorithms) increases pipeline complexity, demanding technical expertise and careful integration.
2. **High computational cost**: Internal semantic analysis, involving computing embeddings for each sub-segment and applying clustering, adds significant load, especially for large documents or massive corpora, increasing resource requirements (CPU/GPU, memory).
3. **Dependence on the quality of both steps**: Accuracy depends on both correct title detection (sensitive to formatting errors) and semantic model performance (sensitive to training data or domain). Weakness in either step can compromise overall results.
4. **Extended processing time**: The dual analysis (structural then semantic) lengthens processing time compared to simpler methods, which can be a limitation for real-time applications.
5. **Risk of over-segmentation**: If semantic thresholds are too strict or the structure is misinterpreted, segmentation may produce excessive sub-segments, reducing indexing or retrieval efficiency and increasing noise in the data.

---

### Improvements and Suggestions

* **Resource optimization**: Use lightweight models (like DistilBERT) for semantic segmentation and efficient tools (like simple regex) for title detection, reducing computational load.
* **Hybrid calibration**: Dynamically adjust semantic similarity thresholds based on section length or density to avoid over-segmentation.
* **Structured preprocessing**: Normalize title formats (e.g., converting "1.1" and "Section 1.1" into a single standard) to improve initial detection.
* **Integration of overlaps**: Add overlap between semantic sub-segments to preserve contextual transitions while respecting the hierarchy.
* **Evaluation**: Test segmentation using metrics such as retrieval precision (in RAG) or overlap scores on generated answers (like ROUGE against reference texts), comparing results with single-strategy methods (structure-only or semantics-only).

This technique is particularly suited for complex documents requiring both clear organization and fine-grained content analysis, such as technical reports or knowledge bases, but it requires robust infrastructure and careful configuration.
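
One possible reading of the hybrid calibration suggestion is to derive the threshold from each section's own similarity distribution instead of using a fixed value. The sketch below assumes a "mean minus k standard deviations" heuristic; the heuristic, the function names, and the sample values are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def adaptive_threshold(similarities: List[float], k: float = 1.0, floor: float = 0.0) -> float:
    """Threshold placed k standard deviations below the section's mean similarity."""
    if not similarities:
        return floor
    return max(floor, mean(similarities) - k * pstdev(similarities))


def split_points(similarities: List[float], k: float = 1.0) -> List[int]:
    """Indices of the sentences that start a new chunk under the adaptive threshold."""
    threshold = adaptive_threshold(similarities, k)
    return [i + 1 for i, sim in enumerate(similarities) if sim < threshold]


# Made-up similarity profiles for two sections with different baselines
dense_section = [0.90, 0.88, 0.60, 0.91, 0.89]
loose_section = [0.55, 0.50, 0.20, 0.52, 0.54]
print(split_points(dense_section))  # [3]
print(split_points(loose_section))  # [3]
```

Both sections are split only at their clear dip, whereas a fixed threshold of 0.8 would cut the second section before every sentence.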

In [19]:
@dataclass
class ChunkNode:
    """Represents a node in the chunk hierarchy"""
    text: str
    start: int
    end: int
    level: int
    children: List['ChunkNode']
    semantic_group: Optional[int] = None

In [20]:
class HierarchicalSemanticChunker:
    """
    Hierarchical segmenter that combines structure and semantics
    """
    
    def __init__(self, 
                 model_name: str = 'all-MiniLM-L6-v2',
                 semantic_threshold: float = 0.82,
                 window_size: int = 3,
                 heading_pattern: str = r'^(#+)\s*(.*)$'):
        """
        Initialize the hierarchical segmenter.
        Parameters:
            model_name: Embedding model to use
            semantic_threshold: Threshold for semantic segmentation
            window_size: Window size for similarity calculation
            heading_pattern: Regex pattern to detect headings
        """
        self.model = SentenceTransformer(model_name)
        self.semantic_threshold = semantic_threshold
        self.window_size = window_size
        self.heading_pattern = re.compile(heading_pattern, re.MULTILINE)
    
    def _detect_structure(self, text: str) -> List[Tuple[int, int, int, str]]:
        """
        Detects the document structure (headings and sections).
        Returns:
            List of tuples `(start, end, level, text)`
        """
        structure = []
        last_pos = 0
        
        for match in self.heading_pattern.finditer(text):
            # Text before current title
            if match.start() > last_pos:
                structure.append((last_pos, match.start(), 0, text[last_pos:match.start()].strip()))
            
            # The title itself
            level = len(match.group(1))  # Number of # determines level
            structure.append((match.start(), match.end(), level, match.group(2).strip()))
            last_pos = match.end()
        
        # Add text after the last title
        if last_pos < len(text):
            structure.append((last_pos, len(text), 0, text[last_pos:].strip()))
        
        return structure
    
    def _semantic_chunking(self, text: str) -> List[Tuple[int, int, str]]:
        """Semantic segmentation of a text (similar to the previous implementation)"""
        sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
        if len(sentences) < 2:
            return [(0, len(text), text)]
        
        embeddings = self.model.encode(sentences, convert_to_tensor=False)
        similarities = []
        
        for i in range(len(embeddings) - 1):
            sim = cosine_similarity(embeddings[i].reshape(1, -1), 
                                   embeddings[i + 1].reshape(1, -1))[0][0]
            similarities.append(sim)
        
        boundaries = []
        window = []
        
        for i, sim in enumerate(similarities):
            window.append(sim)
            if len(window) > self.window_size:
                window.pop(0)
            
            if len(window) == self.window_size and np.mean(window) < self.semantic_threshold:
                boundaries.append(i + 1)
        
        chunks = []
        start_idx = 0
        sentence_offsets = [0]
        current_pos = 0
        
        for s in sentences:
            current_pos += len(s) + 1  # +1 for the space after the sentence
            sentence_offsets.append(current_pos)
        
        for boundary in boundaries:
            end_idx = boundary
            chunk_start = sentence_offsets[start_idx]
            chunk_end = sentence_offsets[end_idx] if end_idx < len(sentence_offsets) else len(text)
            chunk_text = text[chunk_start:chunk_end].strip()
            
            if chunk_text:
                chunks.append((chunk_start, chunk_end, chunk_text))
            start_idx = end_idx
        
        if start_idx < len(sentences):
            chunk_start = sentence_offsets[start_idx]
            chunk_text = text[chunk_start:].strip()
            if chunk_text:
                chunks.append((chunk_start, len(text), chunk_text))
        
        return chunks
    
    def _build_hierarchy(self, structure: List[Tuple[int, int, int, str]]) -> ChunkNode:
        """Builds the chunk hierarchy from the detected structure"""
        root = ChunkNode(text="", start=0, end=0, level=-1, children=[])
        stack = [root]
        
        for start, end, level, text in structure:
            node = ChunkNode(text=text, start=start, end=end, level=level, children=[])
            
            if level == 0:
                # Body text (level 0) belongs to the most recent heading, or to the
                # root if no heading precedes it, and never becomes a parent itself
                if text:
                    stack[-1].children.append(node)
                continue
            
            # Find the appropriate parent: pop headings of the same or a deeper level
            while len(stack) > 1 and stack[-1].level >= level:
                stack.pop()
            
            stack[-1].children.append(node)
            stack.append(node)
        
        return root
    
    def _add_semantic_chunks(self, node: ChunkNode, text: str):
        """Adds semantic segmentation to the leaves of the hierarchy"""
        if not node.children:
            # Only split text nodes that are long enough (more than 20 words)
            if len(node.text.split()) > 20:
                semantic_chunks = self._semantic_chunking(node.text)
                
                # A single chunk means no split point was found: keep the node as a leaf
                if len(semantic_chunks) > 1:
                    for chunk_start, chunk_end, chunk_text in semantic_chunks:
                        # Adjust positions relative to the full text
                        abs_start = node.start + chunk_start
                        abs_end = node.start + chunk_end
                        child = ChunkNode(text=chunk_text, start=abs_start, end=abs_end,
                                          level=node.level + 1, children=[])
                        node.children.append(child)
            # Do not recurse into the semantic children just created: a long chunk
            # that cannot be split further would otherwise recurse without end
            return
        
        for child in node.children:
            self._add_semantic_chunks(child, text)
    
    def _assign_semantic_groups(self, node: ChunkNode, text: str, group_counter: int = 0):
        """Assigns semantic groups to chunks to identify common themes"""
        if not node.children:
            # Compute embeddings for leaves only
            embedding = self.model.encode([node.text], convert_to_tensor=False)[0]
            node.embedding = embedding
            
            # Find an existing similar group or create a new one
            best_group = None
            best_sim = -1
            
            # Browse existing groups (simplified implementation)
            # In a real implementation, one would maintain a list of group centroids
            if hasattr(self, '_group_centroids'):
                for group_id, centroid in self._group_centroids.items():
                    sim = cosine_similarity([embedding], [centroid])[0][0]
                    if sim > self.semantic_threshold and sim > best_sim:
                        best_sim = sim
                        best_group = group_id
            
            if best_group is not None:
                node.semantic_group = best_group
                # Update the centroid
                self._group_centroids[best_group] = (self._group_centroids[best_group] + embedding) / 2
            else:
                node.semantic_group = group_counter
                if not hasattr(self, '_group_centroids'):
                    self._group_centroids = {}
                self._group_centroids[group_counter] = embedding
                group_counter += 1
        
        for child in node.children:
            group_counter = self._assign_semantic_groups(child, text, group_counter)
        
        return group_counter
    
    def chunk_document(self, text: str) -> ChunkNode:
        """Segments a document hierarchically and semantically"""
        # Step 1: Structure detection (headings/sections)
        structure = self._detect_structure(text)
        
        # Step 2: Building the Hierarchy
        root = self._build_hierarchy(structure)
        
        # Step 3: Semantic segmentation of textual content
        self._add_semantic_chunks(root, text)
        
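        # Step 4: Semantic grouping of chunks
        # (reset the centroids so that repeated calls do not reuse groups from earlier documents)
        self._group_centroids = {}
        self._assign_semantic_groups(root, text)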
        
        return root
    
    def print_chunks(self, node: ChunkNode, indent: int = 0):
        """Displays the chunk hierarchy (for visualization)"""
        prefix = "  " * indent
        if node.level >= 0:
            group_info = f" [Group {node.semantic_group}]" if node.semantic_group is not None else ""
            print(f"{prefix}Level {node.level}: {node.text[:60]}...{group_info} (pos: {node.start}-{node.end})")
        
        for child in node.children:
            self.print_chunks(child, indent + 1)

In [21]:
sample_text_ = sample_text[0:1625]

In [None]:
# Initialize the hierarchical chunker
chunker = HierarchicalSemanticChunker(
    model_name='all-MiniLM-L6-v2',
    semantic_threshold=0.8,
    window_size=2
)

# Segment the document
document_tree = chunker.chunk_document(sample_text_)

# Display the results
print("Hierarchical document structure with semantic segmentation:")
chunker.print_chunks(document_tree)

# Example of accessing chunks
def get_all_chunks(node):
    chunks = []
    if node.text and not node.children:  # Leaf with content
        chunks.append((node.start, node.end, node.text, node.level, node.semantic_group))
    for child in node.children:
        chunks.extend(get_all_chunks(child))
    return chunks

all_chunks = get_all_chunks(document_tree)
print("\nFlat list of all chunks:")
for chunk in all_chunks:
    print(f"Lvl {chunk[3]} - Group {chunk[4]}: {chunk[2][:50]}...")

# Segmentation based on sliding windows

### Principle

Sliding window–based segmentation involves processing a text continuously using a fixed-size window that moves incrementally by X tokens (or characters) at each step to generate chunks. The window size defines the length of each segment (e.g., 100 tokens), while the increment (X) determines the offset between the starts of consecutive windows (e.g., 50 tokens). This method treats the text as a linear sequence, naturally producing overlaps between chunks based on the difference between the window size and the increment. For example, with a 200-character window and a 100-character increment, the first chunk covers characters 1–200, the second 101–300, and so on. This approach can be implemented with simple loops or tokenization tools (like NLTK or SpaCy) and is particularly suited for texts where local context is more important than semantic or structural boundaries.
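
The 200-character window with a 100-character increment described above can be sketched as a simple loop. This is a minimal character-based version; `sliding_window_chunks` is a hypothetical helper, and offsets are 0-based with an exclusive end, so characters 1–200 appear as `(0, 200)`.

```python
from typing import List, Tuple


def sliding_window_chunks(text: str, window_size: int, step: int) -> List[Tuple[int, int, str]]:
    """Return (start, end, chunk) triples; `end` is exclusive and offsets are 0-based."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + window_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


text = "x" * 450  # placeholder text of 450 characters
chunks = sliding_window_chunks(text, window_size=200, step=100)
print([(start, end) for start, end, _ in chunks])
# [(0, 200), (100, 300), (200, 400), (300, 450)]
```

The same loop works on a list of tokens instead of a string if token-level windows are needed.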

---

### Advantages

1. **Precise size control**: Segmentation allows exact definition of chunk length (e.g., 50, 100, or 200 tokens), providing uniformity that facilitates adaptation to AI model processing limits (such as token limits in BERT or GPT).
2. **Precise context control**: By adjusting the increment, the degree of overlap between chunks can be regulated, ensuring context is preserved in a controlled manner. For example, a smaller increment (e.g., 25 tokens on a 100-token window) maximizes overlap and continuity.
3. **Flexible application**: This method works effectively on various types of text (narrative, technical, conversational) without requiring prior semantic or structural analysis, making it quick to implement in processing pipelines.
4. **Optimization for large corpora**: The systematic nature of sliding allows automated and scalable processing, ideal for indexing large volumes of data in RAG systems or search engines.
5. **Ease of tuning**: Parameters (window size and increment) can be adjusted according to specific needs, such as text density or performance constraints, without changing the core logic.

---

### Disadvantages

1. **Unnecessary duplication if misconfigured**: If the increment is too small relative to the window size (e.g., 10 tokens on a 100-token window), a large portion of content is repeated across chunks, increasing redundancy and corpus size without adding semantic value.
2. **Loss of semantic coherence**: Since segmentation ignores natural boundaries (sentences, topics), it may split ongoing ideas, making chunks less interpretable on their own. For example, a sentence like “The project succeeded thanks to the team” could be divided between two chunks, diluting its meaning.
3. **Higher computational cost with small increments**: A small increment generates more overlapping chunks, increasing processing load for indexing and search, especially on long texts.
4. **Sensitivity to parameter tuning**: Performance depends heavily on the choice of window size and increment. Poor configuration (e.g., a window too large for dense text) can either lose context or overload the system with redundant data.
5. **Less suited for structured texts**: This method ignores hierarchies (titles, sections) or semantic transitions, making it less effective for documents where structure is an asset, like manuals or reports.
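
The duplication and cost trade-offs in points 1 and 3 can be quantified with a little arithmetic. The sketch below assumes the last window is truncated at the end of the text; `chunk_count` and `redundancy_factor` are hypothetical helpers.

```python
import math


def chunk_count(text_length: int, window_size: int, step: int) -> int:
    """Number of windows needed to cover `text_length` units (last window truncated)."""
    if text_length <= 0:
        return 0
    if text_length <= window_size:
        return 1
    return math.ceil((text_length - window_size) / step) + 1


def redundancy_factor(window_size: int, step: int) -> float:
    """Approximate number of chunks each unit appears in, ignoring edge effects."""
    return window_size / step


# A 10,000-token text with a 100-token window and three different increments
for step in (100, 50, 10):
    print(step, chunk_count(10_000, 100, step), redundancy_factor(100, step))
# 100 100 1.0
# 50 199 2.0
# 10 991 10.0
```

With a 10-token increment, each token is indexed about ten times and the number of chunks grows almost tenfold compared with no overlap.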

---

### Improvements and Suggestions

* **Dynamic increment adjustment**: Adapt the increment based on semantic density or sentence length to minimize duplication while preserving context.
* **Integration of natural boundaries**: Combine the sliding window with sentence or paragraph detection to align cuts with coherent units, reducing abrupt splits.
* **Redundancy filtering**: Apply similarity-based deduplication (e.g., with a cosine threshold) to remove overly similar chunks, optimizing corpus size.
* **Batch optimization**: Process text in larger blocks with internal sliding windows, reducing computational overhead for very long documents.
* **Evaluation**: Test segmentation using metrics like contextual coverage or retrieval accuracy in a RAG context to fine-tune parameters.

This technique is especially suited for scenarios requiring granular control over context and size, such as indexing massive corpora or training AI models, but careful calibration is needed to avoid drawbacks.
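
The suggestion to integrate natural boundaries can be sketched by sliding the window over sentences rather than characters, so no sentence is cut in half. The example assumes a naive regex sentence splitter; `sentence_window_chunks` and the sample text are hypothetical.

```python
import re
from typing import List


def sentence_window_chunks(text: str, sentences_per_chunk: int, step: int) -> List[str]:
    """Slide a window of whole sentences over the text."""
    if sentences_per_chunk <= 0 or step <= 0:
        raise ValueError("sentences_per_chunk and step must be positive")
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks = []
    start = 0
    while start < len(sentences):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # the last window reached the final sentence
        start += step
    return chunks


text = (
    "The project succeeded thanks to the team. Costs stayed low. "
    "Users were happy. A second phase is planned."
)
chunks = sentence_window_chunks(text, sentences_per_chunk=2, step=1)
for chunk in chunks:
    print(chunk)
```

Chunk sizes are no longer uniform with this variant, so a maximum length check may still be needed before sending chunks to a model with a token limit.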
