A production-quality RAG (Retrieval-Augmented Generation) system that goes beyond basic document Q&A by implementing semantic chunking, hierarchical summarization, temporal version tracking, and context-aware chunk boundaries powered by sentence transformers.
```
┌────────────────────────────────────────────────────────────────────┐
│                         Streamlit UI / CLI                         │
│        Upload │ Query │ Chunk Analysis │ Versions │ Manage         │
└────────┬───────────────────┬───────────────────────┬───────────────┘
         │                   │                       │
         ▼                   ▼                       ▼
  ┌────────────┐   ┌────────────────┐   ┌─────────────────────┐
  │  Document  │   │      RAG       │   │      Version        │
  │  Ingestor  │   │    Pipeline    │   │      Manager        │
  │            │   │                │   │                     │
  │  PDF/TXT   │   │  Query→Embed   │   │ Hash-based dedup    │
  │  Markdown  │   │  Retrieve→LLM  │   │ Temporal tracking   │
  └─────┬──────┘   └──────┬─────────┘   │ Diff generation     │
        │                 │             └─────────────────────┘
        ▼                 ▼
  ┌──────────────────────┐   ┌───────────────────────────┐
  │   Semantic Chunker   │   │       Hierarchical        │
  │                      │   │        Summarizer         │
  │ Sentence Embedding   │   │                           │
  │ Cosine Similarity    │   │ Level 0: Chunk summaries  │
  │ Breakpoint Detection │   │ Level 1: Section summary  │
  │ Size Constraints     │   │ Level 2: Doc overview     │
  └──────────┬───────────┘   └───────────────────────────┘
             │
             ▼
  ┌───────────────────────────┐
  │   ChromaDB Vector Store   │
  │                           │
  │ Cosine similarity search  │
  │ Metadata filtering        │
  │ Version-aware retrieval   │
  └───────────────────────────┘
```
Traditional RAG systems split documents into fixed-size chunks (e.g., 500 tokens). This project uses sentence-transformer embeddings to detect natural topic boundaries:
```
Traditional: |--500 tokens--|--500 tokens--|--500 tokens--|
                     ↑ may split mid-sentence or mid-topic

Smart:       |--Topic A (variable)--|--Topic B--|--Topic C (variable)--|
                     ↑ boundaries align with meaning shifts
```
Algorithm:
- Split the text into sentences
- Embed each sentence with all-MiniLM-L6-v2
- Compute cosine similarity between consecutive sentence pairs
- When similarity drops below a threshold → start a new chunk
- Enforce min/max size constraints (merge tiny chunks, split huge ones)
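The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual `chunking_engine` code: the `embed` callable stands in for the sentence-transformer model, and the size constraint is simplified to a word-count merge.

```python
import math
from typing import Callable, List

def semantic_chunk(
    sentences: List[str],
    embed: Callable[[List[str]], List[List[float]]],
    threshold: float = 0.65,
    min_words: int = 5,
) -> List[List[str]]:
    """Group sentences into chunks, breaking wherever the cosine
    similarity of consecutive sentence embeddings drops below `threshold`."""
    if not sentences:
        return []

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    vecs = embed(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:
            chunks.append(current)  # similarity dipped: topic shift
            current = []
        current.append(sentences[i])
    chunks.append(current)

    # Enforce the minimum size: fold tiny chunks into their predecessor.
    merged: List[List[str]] = []
    for ch in chunks:
        words = sum(len(s.split()) for s in ch)
        if merged and words < min_words:
            merged[-1].extend(ch)
        else:
            merged.append(ch)
    return merged
```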
Long documents get a 3-level summary tree:
| Level | Granularity | Purpose |
|---|---|---|
| 0 | Per-chunk summaries | Fine-grained retrieval context |
| 1 | Section summaries (groups of 4 chunks) | Mid-level understanding |
| 2 | Document overview | Quick document-level answers |
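The tree construction itself is straightforward once chunking is done. The sketch below is illustrative rather than the project's `summarizer.py`: the `summarize` callable stands in for the LLM call, and the field names in the returned dict are hypothetical.

```python
from typing import Callable, List

def build_summary_tree(
    chunk_texts: List[str],
    summarize: Callable[[str], str],
    group_size: int = 4,
) -> dict:
    """Build the 3-level tree described above: per-chunk summaries (level 0),
    section summaries over groups of `group_size` chunks (level 1),
    and a single document overview (level 2)."""
    level0 = [summarize(t) for t in chunk_texts]
    level1 = [
        summarize(" ".join(level0[i:i + group_size]))
        for i in range(0, len(level0), group_size)
    ]
    level2 = summarize(" ".join(level1))
    return {"chunks": level0, "sections": level1, "overview": level2}
```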
- SHA-256 hashing detects actual content changes (skips no-op re-uploads)
- Every version is stored with timestamp, change description, and metadata
- RAG pipeline adds temporal notes when older versions are retrieved
- Version diff tool compares any two versions
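The dedup check boils down to comparing content hashes before recording a new version. A minimal sketch, with illustrative field names rather than the actual `versioning.py` schema:

```python
import hashlib
import time
from typing import Optional

def make_version(text: str, history: list, note: str = "") -> Optional[dict]:
    """Append a new version record only when the SHA-256 of the content
    differs from the latest version (the no-op re-upload check above)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if history and history[-1]["hash"] == digest:
        return None  # unchanged content: skip re-processing
    version = {
        "hash": digest,
        "timestamp": time.time(),
        "note": note,
        "version": len(history) + 1,
    }
    history.append(version)
    return version
```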
Each chunk carries a semantic density score: the mean pairwise cosine similarity of its internal sentences. High-density chunks are coherent; low-density chunks may span topics. This metadata helps the retrieval stage prefer tightly focused chunks.
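Concretely, the score is a mean over all sentence pairs in the chunk. A self-contained sketch (operating on precomputed embeddings, not the project's actual code):

```python
import math
from itertools import combinations
from typing import List

def semantic_density(vectors: List[List[float]]) -> float:
    """Mean pairwise cosine similarity of a chunk's sentence embeddings."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a single-sentence chunk is trivially coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```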
```
Knowledge-Assistant/
├── config.py            # Centralized configuration (env vars)
├── chunking_engine.py   # Semantic chunker (core innovation)
├── summarizer.py        # Hierarchical summarization engine
├── versioning.py        # Document version tracking
├── vector_store.py      # ChromaDB wrapper
├── ingestor.py          # Document loading & ingestion pipeline
├── rag_pipeline.py      # Query → Retrieve → Generate
├── app.py               # Streamlit web UI
├── main.py              # CLI entry point
├── requirements.txt     # Python dependencies
├── .env.example         # Environment variable template
├── data/
│   ├── documents/       # Uploaded raw files
│   ├── versions/        # Version history (JSON + text)
│   └── chroma_db/       # ChromaDB persistent storage
└── sample_docs/
    └── sample.txt       # Sample document for testing
```
```bash
cd Knowledge-Assistant

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # On macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Copy the environment template
cp .env.example .env
# Default uses Ollama locally, so no API key is needed.
# Make sure Ollama is running:  ollama serve
# Pull a model:                 ollama pull llama3.2

# Launch the web UI
python main.py ui
# or directly:
streamlit run app.py

# Ingest a single file
python main.py ingest sample_docs/sample.txt

# Ingest all files in a directory
python main.py ingest ./my_documents/

# Ask a question
python main.py query "What are the main topics covered?"

# View statistics
python main.py stats
```

```python
from chunking_engine import SemanticChunker

chunker = SemanticChunker(
    similarity_threshold=0.65,  # Break when similarity < this
    min_chunk_size=100,         # Merge chunks smaller than this
    max_chunk_size=2000,        # Split chunks larger than this
)

chunks = chunker.chunk(long_document_text)
for chunk in chunks:
    print(f"Chunk {chunk.chunk_index}: {chunk.word_count} words, "
          f"density={chunk.semantic_density:.3f}")
    print(chunk.text[:200])
    print("---")
```

```
Sentence:    S1    S2    S3    S4    S5    S6    S7    S8
Similarity:    0.85  0.78  0.82 [0.31] 0.76  0.88 [0.42]
                                   ↓                  ↓
                              BREAK HERE         BREAK HERE
                             (topic shift)      (topic shift)

Result: [Chunk 1: S1-S4]  [Chunk 2: S5-S7]  [Chunk 3: S8]
```
All settings are configurable via environment variables (.env file):
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `llama3.2` | LLM model for generation |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Sentence transformer model |
| `SIMILARITY_THRESHOLD` | `0.65` | Semantic breakpoint threshold |
| `MIN_CHUNK_SIZE` | `100` | Minimum chunk size (chars) |
| `MAX_CHUNK_SIZE` | `2000` | Maximum chunk size (chars) |
| `HIERARCHICAL_LEVELS` | `3` | Levels in summary hierarchy |
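A centralized `config.py` for these variables might look roughly like the following (the names and defaults come from the table above; the exact structure of the project's config module may differ):

```python
import os

# Read each setting from the environment, falling back to the documented default.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.2")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.65"))
MIN_CHUNK_SIZE = int(os.getenv("MIN_CHUNK_SIZE", "100"))
MAX_CHUNK_SIZE = int(os.getenv("MAX_CHUNK_SIZE", "2000"))
HIERARCHICAL_LEVELS = int(os.getenv("HIERARCHICAL_LEVELS", "3"))
```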
| Tab | Feature |
|---|---|
| Upload | Upload PDF/TXT/MD or paste text directly |
| Query | Ask questions with source attribution |
| Chunk Analysis | Visualize how text gets split semantically |
| Versions | Browse version history, compare changes |
| Manage | View stats, delete documents, reset KB |
- No heavy NLP dependency for sentence splitting: uses efficient regex heuristics
- ChromaDB with HNSW cosine index for fast approximate nearest-neighbour search
- Streaming-friendly: chunks are processed one at a time (low memory)
- Deduplication: SHA-256 hashing skips re-processing unchanged documents
- Rich CLI: pretty terminal output with tables and panels via `rich`
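A regex-based sentence splitter in the spirit of the heuristic above might look like this (an illustrative sketch, not the project's actual splitter): break on `.`, `!`, or `?` followed by whitespace and an uppercase letter, so common lowercase-followed abbreviations like "e.g." are less likely to cause a false break.

```python
import re
from typing import List

# Split after sentence-ending punctuation only when the next
# non-space character is uppercase (a cheap abbreviation guard).
_SENT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text: str) -> List[str]:
    """Split `text` into sentences using the regex heuristic above."""
    return [s.strip() for s in _SENT_RE.split(text) if s.strip()]
```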
This project demonstrates:
- Semantic chunking: why meaning-based splits beat fixed-size splits
- Sentence transformers: using pre-trained models for local embeddings
- Hierarchical summarization: multi-level document understanding
- Vector databases: ChromaDB for embedding storage and retrieval
- RAG architecture: retrieval-augmented generation end-to-end
- Version control for data: tracking document evolution over time
- Streamlit: building interactive ML/AI web applications