A production-quality RAG (Retrieval-Augmented Generation) system that goes beyond basic document Q&A by implementing semantic chunking, hierarchical summarization, temporal version tracking, and context-aware chunk boundaries powered by sentence transformers.
```
┌────────────────────────────────────────────────────────────────────┐
│                         Streamlit UI / CLI                         │
│        Upload │ Query │ Chunk Analysis │ Versions │ Manage         │
└────────┬───────────────────┬───────────────────────┬───────────────┘
         │                   │                       │
         ▼                   ▼                       ▼
  ┌────────────┐   ┌────────────────┐   ┌─────────────────────┐
  │  Document  │   │      RAG       │   │      Version        │
  │  Ingestor  │   │    Pipeline    │   │      Manager        │
  │            │   │                │   │                     │
  │  PDF/TXT   │   │  Query→Embed   │   │ Hash-based dedup    │
  │  Markdown  │   │  Retrieve→LLM  │   │ Temporal tracking   │
  └─────┬──────┘   └──────┬─────────┘   │ Diff generation     │
        │                 │             └─────────────────────┘
        ▼                 ▼
  ┌──────────────────────┐   ┌───────────────────────────┐
  │   Semantic Chunker   │   │       Hierarchical        │
  │                      │   │        Summarizer         │
  │ Sentence Embedding   │   │                           │
  │ Cosine Similarity    │   │ Level 0: Chunk summaries  │
  │ Breakpoint Detection │   │ Level 1: Section summary  │
  │ Size Constraints     │   │ Level 2: Doc overview     │
  └──────────┬───────────┘   └───────────────────────────┘
             │
             ▼
  ┌───────────────────────────┐
  │   ChromaDB Vector Store   │
  │                           │
  │ Cosine similarity search  │
  │ Metadata filtering        │
  │ Version-aware retrieval   │
  └───────────────────────────┘
```
Traditional RAG systems split documents into fixed-size chunks (e.g., 500 tokens). This project uses sentence-transformer embeddings to detect natural topic boundaries:
```
Traditional: |--500 tokens--|--500 tokens--|--500 tokens--|
                     ↑ may split mid-sentence or mid-topic

Smart:       |--Topic A (variable)--|--Topic B--|--Topic C (variable)--|
                     ↑ boundaries align with meaning shifts
```
Algorithm:
- Split the text into sentences
- Embed each sentence with all-MiniLM-L6-v2
- Compute cosine similarity between consecutive sentence pairs
- When similarity drops below a threshold → start a new chunk
- Enforce min/max size constraints (merge tiny chunks, split huge ones)
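The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual `chunking_engine` code: the `embed` callable stands in for the sentence-transformer model, and the size constraint is simplified to a word-count merge.

```python
import math
from typing import Callable, List

def semantic_chunk(
    sentences: List[str],
    embed: Callable[[List[str]], List[List[float]]],
    threshold: float = 0.65,
    min_words: int = 5,
) -> List[List[str]]:
    """Group sentences into chunks, breaking wherever the cosine
    similarity of consecutive sentence embeddings drops below `threshold`."""
    if not sentences:
        return []

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    vecs = embed(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:
            chunks.append(current)  # similarity dipped: topic shift
            current = []
        current.append(sentences[i])
    chunks.append(current)

    # Enforce the minimum size: fold tiny chunks into their predecessor.
    merged: List[List[str]] = []
    for ch in chunks:
        words = sum(len(s.split()) for s in ch)
        if merged and words < min_words:
            merged[-1].extend(ch)
        else:
            merged.append(ch)
    return merged
```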
Long documents get a 3-level summary tree:
| Level | Granularity | Purpose |
|---|---|---|
| 0 | Per-chunk summaries | Fine-grained retrieval context |
| 1 | Section summaries (groups of 4 chunks) | Mid-level understanding |
| 2 | Document overview | Quick document-level answers |
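The tree construction itself is straightforward once chunking is done. The sketch below is illustrative rather than the project's `summarizer.py`: the `summarize` callable stands in for the LLM call, and the field names in the returned dict are hypothetical.

```python
from typing import Callable, List

def build_summary_tree(
    chunk_texts: List[str],
    summarize: Callable[[str], str],
    group_size: int = 4,
) -> dict:
    """Build the 3-level tree described above: per-chunk summaries (level 0),
    section summaries over groups of `group_size` chunks (level 1),
    and a single document overview (level 2)."""
    level0 = [summarize(t) for t in chunk_texts]
    level1 = [
        summarize(" ".join(level0[i:i + group_size]))
        for i in range(0, len(level0), group_size)
    ]
    level2 = summarize(" ".join(level1))
    return {"chunks": level0, "sections": level1, "overview": level2}
```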
- SHA-256 hashing detects actual content changes (skips no-op re-uploads)
- Every version is stored with timestamp, change description, and metadata
- RAG pipeline adds temporal notes when older versions are retrieved
- Version diff tool compares any two versions
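The dedup check boils down to comparing content hashes before recording a new version. A minimal sketch, with illustrative field names rather than the actual `versioning.py` schema:

```python
import hashlib
import time
from typing import Optional

def make_version(text: str, history: list, note: str = "") -> Optional[dict]:
    """Append a new version record only when the SHA-256 of the content
    differs from the latest version (the no-op re-upload check above)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if history and history[-1]["hash"] == digest:
        return None  # unchanged content: skip re-processing
    version = {
        "hash": digest,
        "timestamp": time.time(),
        "note": note,
        "version": len(history) + 1,
    }
    history.append(version)
    return version
```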
Each chunk carries a semantic density score: the mean pairwise cosine similarity of its internal sentences. High-density chunks are coherent; low-density chunks may span topics. This metadata helps the retrieval stage prefer tightly focused chunks.
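Concretely, the score is a mean over all sentence pairs in the chunk. A self-contained sketch (operating on precomputed embeddings, not the project's actual code):

```python
import math
from itertools import combinations
from typing import List

def semantic_density(vectors: List[List[float]]) -> float:
    """Mean pairwise cosine similarity of a chunk's sentence embeddings."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a single-sentence chunk is trivially coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```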
```
Knowledge-Assistant/
├── config.py            # Centralized configuration (env vars)
├── chunking_engine.py   # Semantic chunker (core innovation)
├── summarizer.py        # Hierarchical summarization engine
├── versioning.py        # Document version tracking
├── vector_store.py      # ChromaDB wrapper
├── ingestor.py          # Document loading & ingestion pipeline
├── rag_pipeline.py      # Query → Retrieve → Generate
├── app.py               # Streamlit web UI
├── main.py              # CLI entry point
├── requirements.txt     # Python dependencies
├── .env.example         # Environment variable template
├── data/
│   ├── documents/       # Uploaded raw files
│   ├── versions/        # Version history (JSON + text)
│   └── chroma_db/       # ChromaDB persistent storage
└── sample_docs/
    └── sample.txt       # Sample document for testing
```
```bash
cd Knowledge-Assistant

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # On macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Copy the environment template
cp .env.example .env
# Default uses Ollama locally, so no API key is needed.
# Make sure Ollama is running:  ollama serve
# Pull a model:                 ollama pull llama3.2

# Launch the web UI
python main.py ui
# or directly:
streamlit run app.py

# Ingest a single file
python main.py ingest sample_docs/sample.txt

# Ingest all files in a directory
python main.py ingest ./my_documents/

# Ask a question
python main.py query "What are the main topics covered?"

# View statistics
python main.py stats
```

```python
from chunking_engine import SemanticChunker

chunker = SemanticChunker(
    similarity_threshold=0.65,  # Break when similarity < this
    min_chunk_size=100,         # Merge chunks smaller than this
    max_chunk_size=2000,        # Split chunks larger than this
)

chunks = chunker.chunk(long_document_text)
for chunk in chunks:
    print(f"Chunk {chunk.chunk_index}: {chunk.word_count} words, "
          f"density={chunk.semantic_density:.3f}")
    print(chunk.text[:200])
    print("---")
```

```
Sentence:    S1    S2    S3    S4    S5    S6    S7    S8
Similarity:    0.85  0.78  0.82 [0.31] 0.76  0.88 [0.42]
                                   ↓                  ↓
                              BREAK HERE         BREAK HERE
                             (topic shift)      (topic shift)

Result: [Chunk 1: S1-S4]  [Chunk 2: S5-S7]  [Chunk 3: S8]
```
All settings are configurable via environment variables (.env file):
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `llama3.2` | LLM model for generation |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Sentence transformer model |
| `SIMILARITY_THRESHOLD` | `0.65` | Semantic breakpoint threshold |
| `MIN_CHUNK_SIZE` | `100` | Minimum chunk size (chars) |
| `MAX_CHUNK_SIZE` | `2000` | Maximum chunk size (chars) |
| `HIERARCHICAL_LEVELS` | `3` | Levels in summary hierarchy |
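A centralized `config.py` for these variables might look roughly like the following (the names and defaults come from the table above; the exact structure of the project's config module may differ):

```python
import os

# Read each setting from the environment, falling back to the documented default.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.2")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.65"))
MIN_CHUNK_SIZE = int(os.getenv("MIN_CHUNK_SIZE", "100"))
MAX_CHUNK_SIZE = int(os.getenv("MAX_CHUNK_SIZE", "2000"))
HIERARCHICAL_LEVELS = int(os.getenv("HIERARCHICAL_LEVELS", "3"))
```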
| Tab | Feature |
|---|---|
| Upload | Upload PDF/TXT/MD or paste text directly |
| Query | Ask questions with source attribution |
| Chunk Analysis | Visualize how text gets split semantically |
| Versions | Browse version history, compare changes |
| Manage | View stats, delete documents, reset KB |
- No heavy NLP dependency for sentence splitting: uses efficient regex heuristics
- ChromaDB with HNSW cosine index for fast approximate nearest-neighbour search
- Streaming-friendly: chunks are processed one at a time (low memory)
- Deduplication: SHA-256 hashing skips re-processing unchanged documents
- Rich CLI: pretty terminal output with tables and panels via `rich`
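A regex-based sentence splitter in the spirit of the heuristic above might look like this (an illustrative sketch, not the project's actual splitter): break on `.`, `!`, or `?` followed by whitespace and an uppercase letter, so common lowercase-followed abbreviations like "e.g." are less likely to cause a false break.

```python
import re
from typing import List

# Split after sentence-ending punctuation only when the next
# non-space character is uppercase (a cheap abbreviation guard).
_SENT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text: str) -> List[str]:
    """Split `text` into sentences using the regex heuristic above."""
    return [s.strip() for s in _SENT_RE.split(text) if s.strip()]
```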
This project demonstrates:
- Semantic chunking: why meaning-based splits beat fixed-size splits
- Sentence transformers: using pre-trained models for local embeddings
- Hierarchical summarization: multi-level document understanding
- Vector databases: ChromaDB for embedding storage and retrieval
- RAG architecture: retrieval-augmented generation end-to-end
- Version control for data: tracking document evolution over time
- Streamlit: building interactive ML/AI web applications