🧠 Personal Knowledge Assistant with Smart Chunking

A production-quality RAG (Retrieval-Augmented Generation) system that goes beyond basic document Q&A by implementing semantic chunking, hierarchical summarization, temporal version tracking, and context-aware chunk boundaries powered by sentence transformers.


πŸ—οΈ Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        Streamlit UI / CLI                        β”‚
β”‚         Upload β”‚ Query β”‚ Chunk Analysis β”‚ Versions β”‚ Manage      β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚              β”‚                      β”‚
       β–Ό              β–Ό                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Document β”‚  β”‚     RAG      β”‚    β”‚     Version       β”‚
β”‚ Ingestor β”‚  β”‚   Pipeline   β”‚    β”‚     Manager       β”‚
β”‚          β”‚  β”‚              β”‚    β”‚                   β”‚
│ PDF/TXT  │  │ Query→Embed  │    │ Hash-based dedup  │
│ Markdown │  │ Retrieve→LLM │    │ Temporal tracking │
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚ Diff generation   β”‚
     β”‚           β”‚                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β–Ό           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Semantic Chunker  β”‚    β”‚   Hierarchical           β”‚
β”‚                     β”‚    β”‚   Summarizer              β”‚
β”‚ Sentence Embedding  β”‚    β”‚                          β”‚
β”‚ Cosine Similarity   β”‚    β”‚ Level 0: Chunk summaries β”‚
β”‚ Breakpoint Detectionβ”‚    β”‚ Level 1: Section summary β”‚
β”‚ Size Constraints    β”‚    β”‚ Level 2: Doc overview    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    ChromaDB Vector Store β”‚
β”‚                          β”‚
β”‚  Cosine similarity searchβ”‚
β”‚  Metadata filtering      β”‚
β”‚  Version-aware retrieval β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🌟 Key Features & Innovations

1. Semantic Chunking (vs Fixed-Size)

Traditional RAG systems split documents into fixed-size chunks (e.g., 500 tokens). This project uses sentence-transformer embeddings to detect natural topic boundaries:

Traditional:  |--500 tokens--|--500 tokens--|--500 tokens--|
                    ↑ may split mid-sentence or mid-topic

Smart:        |--Topic A (variable)--|--Topic B--|--Topic C (variable)--|
                    ↑ boundaries align with meaning shifts

Algorithm:

  1. Split text into sentences
  2. Embed each sentence with all-MiniLM-L6-v2
  3. Compute cosine similarity between consecutive sentence pairs
  4. When similarity drops below threshold → new chunk boundary
  5. Enforce min/max size constraints (merge tiny, split huge)
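Steps 2-4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the real pipeline obtains embeddings from all-MiniLM-L6-v2, whereas here the embedding vectors are assumed given.

```python
import numpy as np

def consecutive_similarities(embeddings):
    """Cosine similarity between each consecutive pair of sentence
    embeddings (step 3 of the algorithm above)."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalise rows
    return (E[:-1] * E[1:]).sum(axis=1)               # row-wise dot products

def breakpoints(similarities, threshold=0.65):
    """Step 4: sentence indices where a new chunk should begin."""
    return [i + 1 for i, s in enumerate(similarities) if s < threshold]
```

With toy 2-D vectors, two identical sentences score 1.0 and two orthogonal ones score 0.0, so a break falls between the second and third sentence.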

2. Hierarchical Summarization

Long documents get a 3-level summary tree:

Level  Granularity                              Purpose
0      Per-chunk summaries                      Fine-grained retrieval context
1      Section summaries (groups of 4 chunks)   Mid-level understanding
2      Document overview                        Quick document-level answers
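In outline, the tree is built bottom-up. The sketch below substitutes simple string joining for the LLM summarization call that the real summarizer.py would make; the function name is illustrative.

```python
def build_summary_tree(chunk_summaries, group_size=4, summarize=" ".join):
    """Build the 3-level tree: level 0 is the per-chunk summaries,
    level 1 summarizes groups of `group_size` chunks, and level 2 is a
    single document overview. `summarize` stands in for the LLM call."""
    level1 = [summarize(chunk_summaries[i:i + group_size])
              for i in range(0, len(chunk_summaries), group_size)]
    level2 = summarize(level1)
    return {0: chunk_summaries, 1: level1, 2: level2}
```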

3. Temporal Awareness & Versioning

  • SHA-256 hashing detects actual content changes (skips no-op re-uploads)
  • Every version is stored with timestamp, change description, and metadata
  • RAG pipeline adds temporal notes when older versions are retrieved
  • Version diff tool compares any two versions

4. Context-Aware Chunk Boundaries

Each chunk carries a semantic density score: the mean pairwise cosine similarity of its internal sentences. High-density chunks are coherent; low-density chunks may span topics. This metadata helps the retrieval stage prefer tightly focused chunks.
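As a rough sketch (assuming the chunk's sentence embeddings are available; the helper name is illustrative), the score is the mean of the upper triangle of the chunk's cosine-similarity matrix:

```python
import numpy as np

def semantic_density(sentence_embeddings):
    """Mean pairwise cosine similarity of a chunk's sentences; values
    near 1.0 indicate a tightly focused chunk, low values a topic-spanning one."""
    E = np.asarray(sentence_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit vectors
    sims = E @ E.T                                    # cosine similarity matrix
    iu = np.triu_indices(len(E), k=1)                 # each pair counted once
    return float(sims[iu].mean())
```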


πŸ“ Project Structure

Knowledge-Assistant/
├── config.py              # Centralized configuration (env vars)
├── chunking_engine.py     # 🔑 Semantic chunker (core innovation)
├── summarizer.py          # Hierarchical summarization engine
├── versioning.py          # Document version tracking
├── vector_store.py        # ChromaDB wrapper
├── ingestor.py            # Document loading & ingestion pipeline
├── rag_pipeline.py        # Query → Retrieve → Generate
├── app.py                 # Streamlit web UI
├── main.py                # CLI entry point
├── requirements.txt       # Python dependencies
├── .env.example           # Environment variable template
├── data/
│   ├── documents/         # Uploaded raw files
│   ├── versions/          # Version history (JSON + text)
│   └── chroma_db/         # ChromaDB persistent storage
└── sample_docs/
    └── sample.txt         # Sample document for testing

🚀 Quick Start

1. Clone & Install

cd Knowledge-Assistant

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On macOS/Linux

# Install dependencies
pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Default uses Ollama locally: no API key needed!
# Make sure Ollama is running: ollama serve
# Pull a model: ollama pull llama3.2

3. Launch the Web UI

python main.py ui
# or directly:
streamlit run app.py

4. CLI Usage

# Ingest a single file
python main.py ingest sample_docs/sample.txt

# Ingest all files in a directory
python main.py ingest ./my_documents/

# Ask a question
python main.py query "What are the main topics covered?"

# View statistics
python main.py stats

🧪 How Smart Chunking Works (Deep Dive)

from chunking_engine import SemanticChunker

chunker = SemanticChunker(
    similarity_threshold=0.65,  # Break when similarity < this
    min_chunk_size=100,         # Merge chunks smaller than this
    max_chunk_size=2000,        # Split chunks larger than this
)

chunks = chunker.chunk(long_document_text)

for chunk in chunks:
    print(f"Chunk {chunk.chunk_index}: {chunk.word_count} words, "
          f"density={chunk.semantic_density:.3f}")
    print(chunk.text[:200])
    print("---")

Cosine Similarity Breakpoint Detection

Sentence:    S1    S2    S3    S4    S5    S6    S7    S8
Similarity:     0.85  0.78  0.82  │0.31│  0.76  0.88  │0.42│
                                    ↑                     ↑
                               BREAK HERE          BREAK HERE
                              (topic shift)       (topic shift)

Result:  [Chunk 1: S1-S4]  [Chunk 2: S5-S7]  [Chunk 3: S8]
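The walk-through above can be replayed in a few lines of plain Python. The similarity values are the ones from the diagram, and the grouping logic is a simplified stand-in for what chunking_engine.py does before size constraints are applied.

```python
def group_by_breakpoints(sentences, similarities, threshold=0.65):
    """Start a new chunk wherever consecutive-sentence similarity
    drops below the threshold."""
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], similarities):
        if sim < threshold:        # topic shift: close the current chunk
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks

sentences = [f"S{i}" for i in range(1, 9)]
sims = [0.85, 0.78, 0.82, 0.31, 0.76, 0.88, 0.42]
print(group_by_breakpoints(sentences, sims))
# [['S1', 'S2', 'S3', 'S4'], ['S5', 'S6', 'S7'], ['S8']]
```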

🔧 Configuration

All settings are configurable via environment variables (.env file):

Variable              Default                  Description
OLLAMA_BASE_URL       http://localhost:11434   Ollama server URL
OLLAMA_MODEL          llama3.2                 LLM model for generation
EMBEDDING_MODEL       all-MiniLM-L6-v2         Sentence transformer model
SIMILARITY_THRESHOLD  0.65                     Semantic breakpoint threshold
MIN_CHUNK_SIZE        100                      Minimum chunk size (chars)
MAX_CHUNK_SIZE        2000                     Maximum chunk size (chars)
HIERARCHICAL_LEVELS   3                        Levels in summary hierarchy
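Reading these with environment-variable fallbacks is one line per setting. A sketch of the usual pattern follows; config.py's actual variable handling may differ, but the names and defaults mirror the table above.

```python
import os

# Names and defaults mirror the configuration table.
OLLAMA_BASE_URL      = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL         = os.getenv("OLLAMA_MODEL", "llama3.2")
EMBEDDING_MODEL      = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.65"))
MIN_CHUNK_SIZE       = int(os.getenv("MIN_CHUNK_SIZE", "100"))
MAX_CHUNK_SIZE       = int(os.getenv("MAX_CHUNK_SIZE", "2000"))
HIERARCHICAL_LEVELS  = int(os.getenv("HIERARCHICAL_LEVELS", "3"))
```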

📊 Streamlit UI Features

Tab                Feature
📚 Upload          Upload PDF/TXT/MD or paste text directly
🔍 Query           Ask questions with source attribution
📊 Chunk Analysis  Visualize how text gets split semantically
🕐 Versions        Browse version history, compare changes
⚙️ Manage          View stats, delete documents, reset KB

🔬 Technical Highlights

  • No heavy NLP dependency for sentence splitting β€” uses efficient regex heuristics
  • ChromaDB with HNSW cosine index for fast approximate nearest-neighbour search
  • Streaming-friendly β€” chunks are processed one at a time (low memory)
  • Deduplication β€” SHA-256 hashing skips re-processing unchanged documents
  • Rich CLI β€” pretty terminal output with tables and panels via rich

πŸ“ Learning Objectives

This project demonstrates:

  1. Semantic chunking: why meaning-based splits beat fixed-size splits
  2. Sentence transformers: using pre-trained models for local embeddings
  3. Hierarchical summarization: multi-level document understanding
  4. Vector databases: ChromaDB for embedding storage and retrieval
  5. RAG architecture: retrieval-augmented generation end-to-end
  6. Version control for data: tracking document evolution over time
  7. Streamlit: building interactive ML/AI web applications
