---
## LLM Semantic Chunker

The LLM Semantic Chunker takes a direct approach to document chunking by literally asking a Language Model to identify semantic boundaries. The process begins by dividing the input text into small, fixed-size pieces of around 50 tokens using a standard recursive splitter, creating manageable units for the LLM to analyze. These pieces are then wrapped with special tags like `<start_chunk_1>` and `<end_chunk_1>` to maintain their identity throughout the process.

The core of the chunking process involves presenting text to the LLM in windows of approximately 800 tokens (containing multiple small pieces) at a time. For each window, the LLM is instructed to identify natural semantic breaks, responding in a specific format like `split_after: X, Y, Z` where X, Y, Z are chunk numbers. These splits must be in ascending order and must start from the current position, with at least one split being required to ensure the process continues moving forward.

The chunker maintains a sliding window approach, progressively moving through the document based on the LLM's last suggested split point. This continues until either the end of the document is reached or the remaining text becomes too short to require further splitting (less than ~4 chunks). The suggested split points are then used to reassemble the small pieces into final chunks, with each chunk combining all pieces between two split points.


Internally, the system prompt follows:
```python
"You are an assistant specialized in splitting text into thematically consistent sections. "
"The text has been divided into chunks, each marked with <|start_chunk_X|> and <|end_chunk_X|> tags, where X is the chunk number. "
"Your task is to identify the points where splits should occur, such that consecutive chunks of similar themes stay together. "
"Respond with a list of chunk IDs where you believe a split should be made. For example, if chunks 1 and 2 belong together but chunk 3 starts a new topic, you would suggest a split after chunk 2. THE CHUNKS MUST BE IN ASCENDING ORDER."
"Your response should be in the form: 'split_after: 3, 5'."

```

In [1]:
document  = open("/home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed/MD/TBS_Handbook-2022.md").read()

In [5]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()


True

In [14]:
def analyze_chunks(chunks, use_tokens=False):
    """Analyze chunk statistics with flexible handling of chunk count"""
    
    if not chunks:
        print("No chunks to analyze")
        return
    
    print(f"\n{'='*70}")
    print(f"CHUNK ANALYSIS")
    print(f"{'='*70}")
    print(f"\nNumber of Chunks: {len(chunks)}")
    
    # Calculate statistics
    if use_tokens:
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        sizes = [len(encoding.encode(chunk)) for chunk in chunks]
        unit = "tokens"
    else:
        sizes = [len(chunk) for chunk in chunks]
        unit = "characters"
    
    print(f"\nChunk Size Statistics:")
    print(f"  Average: {sum(sizes)/len(sizes):.1f} {unit}")
    print(f"  Min: {min(sizes)} {unit}")
    print(f"  Max: {max(sizes)} {unit}")
    print(f"  Total: {sum(sizes)} {unit}")
    
    # Show sample chunks
    print(f"\n{'='*70}")
    print(f"SAMPLE CHUNKS")
    print(f"{'='*70}")
    
    # First chunk
    print(f"\nüìÑ Chunk 1 of {len(chunks)} ({sizes[0]} {unit}):")
    print(chunks[0][:300] + "..." if len(chunks[0]) > 300 else chunks[0])
    
    # Middle chunk (if exists)
    if len(chunks) >= 3:
        mid_idx = len(chunks) // 2
        print(f"\nüìÑ Chunk {mid_idx + 1} of {len(chunks)} ({sizes[mid_idx]} {unit}):")
        print(chunks[mid_idx][:300] + "..." if len(chunks[mid_idx]) > 300 else chunks[mid_idx])
    
    # Last chunk
    if len(chunks) > 1:
        print(f"\nüìÑ Chunk {len(chunks)} of {len(chunks)} ({sizes[-1]} {unit}):")
        print(chunks[-1][:300] + "..." if len(chunks[-1]) > 300 else chunks[-1])
    
    # Check for overlap between consecutive chunks (if we have at least 2 chunks)
    if len(chunks) >= 2:
        # Use the last two chunks for overlap analysis
        chunk1, chunk2 = chunks[-2], chunks[-1]
        
        print(f"\n{'='*70}")
        print(f"OVERLAP ANALYSIS (between last two chunks)")
        print(f"{'='*70}")
        
        if use_tokens:
            tokens1 = encoding.encode(chunk1)
            tokens2 = encoding.encode(chunk2)
            
            # Find overlapping tokens
            overlap_found = False
            for i in range(min(len(tokens1), len(tokens2)), 0, -1):
                if tokens1[-i:] == tokens2[:i]:
                    overlap = encoding.decode(tokens1[-i:])
                    print(f"\n‚úì Overlapping text ({i} tokens):")
                    print(overlap[:200] + "..." if len(overlap) > 200 else overlap)
                    overlap_found = True
                    break
            
            if not overlap_found:
                print("\n‚úó No token overlap found between consecutive chunks")
        else:
            # Find overlapping characters
            overlap_found = False
            for i in range(min(len(chunk1), len(chunk2)), 0, -1):
                if chunk1[-i:] == chunk2[:i]:
                    print(f"\n‚úì Overlapping text ({i} chars):")
                    print(chunk1[-i:][:200] + "..." if len(chunk1[-i:]) > 200 else chunk1[-i:])
                    overlap_found = True
                    break
            
            if not overlap_found:
                print("\n‚úó No character overlap found between consecutive chunks")

In [10]:
!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

Collecting git+https://github.com/brandonstarxel/chunking_evaluation.git
  Cloning https://github.com/brandonstarxel/chunking_evaluation.git to /tmp/pip-req-build-_hia14lm
  Running command git clone --filter=blob:none --quiet https://github.com/brandonstarxel/chunking_evaluation.git /tmp/pip-req-build-_hia14lm
  Resolved https://github.com/brandonstarxel/chunking_evaluation.git to commit d451fc4cf56e417b755994b4ca5212fd5057c0d2
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
[?25hCollecting tiktoken (from chunking_evaluation==0.1.0)
  Downloading tiktoken-0.12.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.7 kB)
Collecting fuzzywuzzy (from chunking_evaluation==0.1.0)
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting chromadb (from chunking_evaluation==0.1.0)
  Downloading chromadb-1.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_

In [15]:
import tiktoken
import re
from ollama import chat

class LocalLLMSemanticChunker:
    """LLM-based semantic chunker using Ollama (local, free)"""
    
    def __init__(self, model_name="llama3.2"):
        self.model_name = model_name
        self.encoding = tiktoken.encoding_for_model("gpt-4")
        
    def _create_initial_chunks(self, text, chunk_size=50):
        """Split text into small initial chunks of ~50 tokens"""
        tokens = self.encoding.encode(text)
        chunks = []
        
        for i in range(0, len(tokens), chunk_size):
            chunk_tokens = tokens[i:i + chunk_size]
            chunk_text = self.encoding.decode(chunk_tokens)
            chunks.append(chunk_text)
            
        return chunks
    
    def _tag_chunks(self, chunks):
        """Add tags to chunks for LLM processing"""
        tagged_text = ""
        for i, chunk in enumerate(chunks):
            tagged_text += f"<|start_chunk_{i}|>\n{chunk}\n<|end_chunk_{i}|>\n"
        return tagged_text
    
    def _get_split_points(self, tagged_text, current_chunk=0):
        """Ask local LLM to identify semantic split points"""
        
        # Limit the text to avoid token limits
        tokens = self.encoding.encode(tagged_text)
        if len(tokens) > 4000:  # Smaller window for local models
            tagged_text = self.encoding.decode(tokens[:4000])
        
        prompt = f"""You are an assistant specialized in splitting text into thematically consistent sections.

The text has been divided into chunks, each marked with <|start_chunk_X|> and <|end_chunk_X|> tags.

Rules:
- Identify natural semantic boundaries where topics change
- Each final section should be 200-1000 words
- Splits must be in ascending order
- Splits must be equal or larger than {current_chunk}

Text:
{tagged_text}

Respond ONLY with: 'split_after: X, Y, Z' where X, Y, Z are chunk numbers.
YOU MUST RESPOND WITH AT LEAST ONE SPLIT."""

        try:
            response = chat(
                model=self.model_name,
                messages=[{"role": "user", "content": prompt}],
                options={
                    "temperature": 0.2,
                    "num_ctx": 8192
                }
            )
            
            result = response.message.content.strip()
            print(f"  LLM response: {result[:100]}...")
            
            # Extract numbers from response
            split_points = [int(x) for x in re.findall(r'\d+', result)]
            
            # Filter splits to be >= current_chunk
            split_points = [s for s in split_points if s >= current_chunk]
            
            if not split_points:
                # Fallback: create a split halfway through
                num_chunks = tagged_text.count("<|start_chunk_")
                split_points = [current_chunk + num_chunks // 2]
            
            return sorted(set(split_points))
            
        except Exception as e:
            print(f"  Error calling LLM: {e}")
            # Fallback: split every 10 chunks
            return [current_chunk + 10]
    
    def split_text(self, text, max_iterations=50):
        """Main method to split text into semantic chunks with sliding window"""
        
        print("Creating initial chunks...")
        initial_chunks = self._create_initial_chunks(text)
        print(f"Created {len(initial_chunks)} initial chunks")
        
        all_splits = [0]  # Start with first chunk
        current_position = 0
        iteration = 0
        
        while current_position < len(initial_chunks) - 10 and iteration < max_iterations:
            iteration += 1
            
            # Get next window of chunks (max ~800 tokens = ~16 chunks of 50 tokens each)
            window_end = min(current_position + 20, len(initial_chunks))
            window_chunks = initial_chunks[current_position:window_end]
            
            print(f"\nIteration {iteration}: Processing chunks {current_position} to {window_end}")
            
            # Tag and get split points
            tagged_text = self._tag_chunks(window_chunks)
            relative_splits = self._get_split_points(tagged_text, 0)
            
            # Convert relative splits to absolute positions
            absolute_splits = [current_position + s for s in relative_splits]
            
            print(f"  Found splits at: {absolute_splits}")
            
            # Add new splits
            for split in absolute_splits:
                if split > current_position and split not in all_splits:
                    all_splits.append(split)
            
            # Move position to last split
            if absolute_splits:
                current_position = max(absolute_splits)
            else:
                current_position += 10  # Move forward if no splits found
        
        # Add final chunk
        all_splits.append(len(initial_chunks))
        all_splits = sorted(set(all_splits))
        
        print(f"\n‚úÖ Total split points: {len(all_splits)}")
        
        # Create final chunks
        final_chunks = []
        for i in range(len(all_splits) - 1):
            start_idx = all_splits[i]
            end_idx = all_splits[i + 1]
            
            chunk_text = " ".join(initial_chunks[start_idx:end_idx])
            if chunk_text.strip():
                final_chunks.append(chunk_text.strip())
        
        print(f"‚úÖ Created {len(final_chunks)} final semantic chunks")
        return final_chunks


# Use local LLM instead of OpenAI
llm_chunker = LocalLLMSemanticChunker(model_name="llama3.2")

llm_chunker_chunks = llm_chunker.split_text(document)

analyze_chunks(llm_chunker_chunks, use_tokens=True)

Creating initial chunks...
Created 454 initial chunks

Iteration 1: Processing chunks 0 to 20
  LLM response: After analyzing the text, I suggest splitting it into sections after the following chunk numbers:

s...
  Found splits at: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]

Iteration 2: Processing chunks 18 to 38
  LLM response: After analyzing the text, I suggest splitting it into sections after the following chunk numbers:

s...
  Found splits at: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]

Iteration 2: Processing chunks 18 to 38
  LLM response: split_after: 0, 1

The first split is after the introduction of Tunis Business School and its main l...
  Found splits at: [18, 19, 20]

Iteration 3: Processing chunks 20 to 40
  LLM response: split_after: 0, 1

The first split is after the introduction of Tunis Business School and its main l...
  Found splits at: [18, 19, 20]

Iteration 3: Processing chunks 20 to 40
  LLM response: split_afte

In [17]:
print(llm_chunker_chunks)

['Educating to Lead\n\nMinistry of Higher Education and Scientific Research University of Tunis\n\nTunis Business School\n\n‚ÄúEducating Future Leaders and Managers for a Global Economy‚Äù\n\nSCHOOL HANDBOOK\n\nVersion: September, 2022', 'Last update: February 5, 2023\n\n# DISCLAIMER\n\nThis Handbook provides information about the school, its programs, guidelines, and regulations. It has been approved by the Scientific Council. It is the only body in the school that can formally', 'modify this handbook.\n\nTunis Business School reserves the right to amend any policy at any time. The most updated version is the online version (updated on 5 February 2023). It is the responsibility of the students to be familiar with the', 'content of this handbook.\n\n# TABLE OF CONTENTS\n\nDISCLAIMER.\n\nTABLE OF CONTENTS . 3\n\n1. ABOUT TUNIS BUSINESS SCHOOL .\n\n1.1. MANDATE... ............... ................. ..... 5   \n1.', '2. INNOVATIVE INSTITUTION ...... ..... 5   \n1.3. INTERNATIONAL STANDARDS

---
## Embedding Generation

Now we'll generate embeddings for each chunk using a sentence transformer model. These embeddings will be used to build a vector index for semantic search and retrieval.

In [22]:
!pip install faiss-cpu sentence-transformers



In [26]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import os

# Load embedding model
print("Loading embedding model...")
# Get HF token from environment or use the one from cache
hf_token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")

try:
    embedding_model = SentenceTransformer(
        'paraphrase-multilingual-mpnet-base-v2',
        token=hf_token,
        trust_remote_code=True
    )
    print("Model loaded successfully!")
except Exception as e:
    print(f"Error loading with token: {e}")
    print("Trying without authentication (using cache if available)...")
    embedding_model = SentenceTransformer(
        'paraphrase-multilingual-mpnet-base-v2',
        local_files_only=True  # Use cached model if available
    )
    print("Model loaded from cache!")

# Generate embeddings for all chunks
print(f"\nGenerating embeddings for {len(llm_chunker_chunks)} chunks...")
chunk_embeddings = embedding_model.encode(
    llm_chunker_chunks,
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"‚úÖ Generated embeddings with shape: {chunk_embeddings.shape}")
print(f"   Embedding dimension: {chunk_embeddings.shape[1]}")

# Create FAISS index for efficient similarity search
print("\nBuilding FAISS index...")
dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # L2 distance (Euclidean)
index.add(chunk_embeddings.astype('float32'))

print(f"‚úÖ FAISS index built with {index.ntotal} vectors")

# Store metadata for retrieval
import tiktoken
token_encoding = tiktoken.get_encoding("cl100k_base")

chunk_metadata = [
    {
        "chunk_id": i,
        "text": chunk,
        "token_count": len(token_encoding.encode(chunk))
    }
    for i, chunk in enumerate(llm_chunker_chunks)
]

print(f"‚úÖ Metadata stored for {len(chunk_metadata)} chunks")

Loading embedding model...
Model loaded successfully!

Generating embeddings for 104 chunks...
Model loaded successfully!

Generating embeddings for 104 chunks...


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 4/4 [00:02<00:00,  1.86it/s]

‚úÖ Generated embeddings with shape: (104, 768)
   Embedding dimension: 768

Building FAISS index...
‚úÖ FAISS index built with 104 vectors
‚úÖ Metadata stored for 104 chunks





---
## Retrieval Test

Let's test the retrieval system with some sample queries to see how well it finds relevant chunks.

In [29]:
def retrieve_chunks(query, top_k=3):
    """
    Retrieve the most relevant chunks for a given query
    
    Args:
        query: The search query (string)
        top_k: Number of top results to return
    
    Returns:
        List of tuples (chunk_text, similarity_score, metadata)
    """
    # Generate query embedding
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)
    
    # Search in FAISS index
    distances, indices = index.search(query_embedding.astype('float32'), top_k)
    
    # Prepare results
    results = []
    for i, (idx, distance) in enumerate(zip(indices[0], distances[0])):
        # Convert L2 distance to similarity score (lower is better, so invert)
        similarity_score = 1 / (1 + distance)
        
        results.append({
            "rank": i + 1,
            "chunk_id": int(idx),
            "similarity_score": float(similarity_score),
            "distance": float(distance),
            "text": llm_chunker_chunks[idx],
            "token_count": chunk_metadata[idx]["token_count"]
        })
    
    return results


def display_retrieval_results(query, results):
    """Display retrieval results in a formatted way"""
    print(f"\n{'='*80}")
    print(f"QUERY: {query}")
    print(f"{'='*80}")
    
    for result in results:
        print(f"\nüìå Rank {result['rank']} | Chunk #{result['chunk_id']} | Score: {result['similarity_score']:.4f} | Distance: {result['distance']:.4f}")
        print(f"   Tokens: {result['token_count']}")
        print(f"\n   {result['text'][:400]}...")
        print(f"\n{'-'*80}")


# Test queries
test_queries = [
    "What are the admission requirements?",
    "Tell me about the computer science program",
    "What courses are available in the first year?",
    "Quels sont les frais de scolarit√©?"  
]

print(f"\n{'#'*80}")
print(f"RETRIEVAL TESTING")
print(f"{'#'*80}")

for query in test_queries:
    results = retrieve_chunks(query, top_k=3)
    display_retrieval_results(query, results)

# Interactive query (optional)
print(f"\n{'#'*80}")
print(f"INTERACTIVE RETRIEVAL")
print(f"{'#'*80}")
print("\nTry your own query:")
print("(Leave empty to skip)")

user_query = input("\nEnter your query: ").strip()
if user_query:
    results = retrieve_chunks(user_query, top_k=5)
    display_retrieval_results(user_query, results)
else:
    print("Skipped interactive query.")


################################################################################
RETRIEVAL TESTING
################################################################################

QUERY: What are the admission requirements?

üìå Rank 1 | Chunk #47 | Score: 0.0952 | Distance: 9.4988
   Tokens: 51

   Humanities,   
and Social Science areas)   
5. Computer Science Courses (12 semester credits)   
6. Senior Project (Option I or Option II) (12 semester credits)

The list of Business Core Requirement Courses at TBS is the following (...

--------------------------------------------------------------------------------

üìå Rank 2 | Chunk #44 | Score: 0.0920 | Distance: 9.8723
   Tokens: 50

   Core Course with a unique identifier 40 offered at the freshman level.

# 2.3. Undergraduate Curriculum

Graduating students at TBS will be earning a Bachelor of Science in Business Administration (BSBA). In order to receive this degree...

----------------------------------------------------------

---
## Save to Persistent Storage

Save the FAISS index, embeddings, and chunk metadata to disk so they can be loaded by the dashboard app without reprocessing.

In [30]:
import pickle
import os

# Create directory for processed data if it doesn't exist
output_dir = "/home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed"
os.makedirs(output_dir, exist_ok=True)

# Define file paths
index_path = os.path.join(output_dir, "faiss_index.bin")
metadata_path = os.path.join(output_dir, "chunks_metadata.pkl")
chunks_path = os.path.join(output_dir, "chunks.pkl")
embeddings_path = os.path.join(output_dir, "embeddings.npy")

print("Saving to persistent storage...")
print(f"Output directory: {output_dir}")

# 1. Save FAISS index
print(f"\n1. Saving FAISS index...")
faiss.write_index(index, index_path)
print(f"   ‚úÖ Saved to: {index_path}")
print(f"   Index contains {index.ntotal} vectors")

# 2. Save chunks and metadata
print(f"\n2. Saving chunks and metadata...")
with open(metadata_path, 'wb') as f:
    pickle.dump({
        'chunks': llm_chunker_chunks,
        'metadata': chunk_metadata
    }, f)
print(f"   ‚úÖ Saved to: {metadata_path}")
print(f"   Contains {len(llm_chunker_chunks)} chunks")

# 3. Save embeddings (optional, for backup)
print(f"\n3. Saving embeddings...")
np.save(embeddings_path, chunk_embeddings)
print(f"   ‚úÖ Saved to: {embeddings_path}")
print(f"   Shape: {chunk_embeddings.shape}")

# 4. Save just the chunks as a separate file (for easy access)
print(f"\n4. Saving chunks separately...")
with open(chunks_path, 'wb') as f:
    pickle.dump(llm_chunker_chunks, f)
print(f"   ‚úÖ Saved to: {chunks_path}")

print(f"\n{'='*80}")
print("‚úÖ ALL DATA SAVED SUCCESSFULLY!")
print(f"{'='*80}")
print("\nFiles created:")
print(f"  ‚Ä¢ {index_path} ({os.path.getsize(index_path) / (1024*1024):.2f} MB)")
print(f"  ‚Ä¢ {metadata_path} ({os.path.getsize(metadata_path) / (1024*1024):.2f} MB)")
print(f"  ‚Ä¢ {embeddings_path} ({os.path.getsize(embeddings_path) / (1024*1024):.2f} MB)")
print(f"  ‚Ä¢ {chunks_path} ({os.path.getsize(chunks_path) / (1024*1024):.2f} MB)")
print("\nThese files can now be loaded by the dashboard app!")
print("Run: streamlit run dashboard/app.py")

Saving to persistent storage...
Output directory: /home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed

1. Saving FAISS index...
   ‚úÖ Saved to: /home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed/faiss_index.bin
   Index contains 104 vectors

2. Saving chunks and metadata...
   ‚úÖ Saved to: /home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed/chunks_metadata.pkl
   Contains 104 chunks

3. Saving embeddings...
   ‚úÖ Saved to: /home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed/embeddings.npy
   Shape: (104, 768)

4. Saving chunks separately...
   ‚úÖ Saved to: /home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed/chunks.pkl

‚úÖ ALL DATA SAVED SUCCESSFULLY!

Files created:
  ‚Ä¢ /home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed/faiss_index.bin (0.30 MB)
  ‚Ä¢ /home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed/chunks_metadata.pkl (0.08 MB)
  ‚Ä¢ /home/codepips/Home/Portfolio/Projects/ŸÖÿ≥ÿßÿ±/data/processed/e