
Private Doc Brain

A local, privacy-first question-answering system for your personal documents. Ask natural language questions about your files — PDFs, Word docs, emails, text files — without any data ever leaving your machine.

Why This Exists

Most document Q&A tools require uploading your files to a cloud service. Private Doc Brain runs entirely on your hardware using Ollama for local LLM inference and ChromaDB for local vector storage. No API keys, no subscriptions, no data exposure.


Features

  • Multi-format document support — PDF, DOCX, TXT, MD, EML
  • Hybrid search — combines BM25 keyword search with semantic vector search via Reciprocal Rank Fusion for better retrieval than either alone
  • HyDE (Hypothetical Document Embeddings) — optional mode where the LLM generates a hypothetical answer first, then uses that to find more relevant chunks
  • Incremental indexing — SHA256-based change detection means only modified or new files get re-embedded
  • Source citations — every answer shows which document and chunk it came from
  • Conversational memory — maintains the last 6 turns so you can ask follow-up questions
  • Fully local — Ollama runs the embedding model and LLM on your machine, ChromaDB stores vectors on disk

Architecture

private-doc-brain/
├── main.py          # CLI entry point (ingest / chat / list / remove)
├── config.py        # Centralized settings (paths, models, chunking, retrieval)
├── ingest.py        # Document parsing, chunking, and indexing pipeline
├── ollama.py        # HTTP client for Ollama (embeddings + chat)
├── search.py        # Hybrid retrieval: BM25 + vector search + RRF + HyDE
├── brain.py         # Interactive REPL, prompt engineering, citation display
├── requirements.txt
├── docs/            # Put your documents here
├── .chroma_db/      # ChromaDB vector store (auto-generated)
└── .index_state.json  # Index metadata (auto-generated)

How It Works

  1. Ingest — Documents in docs/ are parsed, split into ~500-character overlapping chunks, embedded using nomic-embed-text, and stored in ChromaDB.
  2. Search — At query time, a BM25 index is built in-memory from all stored chunks. Your question is embedded and used for vector search. Both result sets are merged using Reciprocal Rank Fusion and the top 5 chunks are selected.
  3. HyDE (optional) — Before embedding the question, the LLM generates a hypothetical document snippet that would answer it. That synthetic snippet is embedded instead of the raw question, which tends to match real document language more closely.
  4. Chat — The top chunks are injected into a prompt along with the conversation history. llama3.2 streams the response in real time. Citations are printed afterward.
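The conversational memory used in step 4 (the last 6 turns, per the Features list) can be sketched with a bounded deque; the class name and prompt format here are illustrative, not taken from brain.py:

```python
from collections import deque

class ChatMemory:
    """Keeps only the most recent turns; older ones are discarded automatically."""

    def __init__(self, max_turns=6):  # 6 turns, matching the README
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append({"user": question, "assistant": answer})

    def as_prompt(self):
        """Render the retained turns as plain text for prompt injection."""
        lines = []
        for turn in self.turns:
            lines.append(f"User: {turn['user']}")
            lines.append(f"Assistant: {turn['assistant']}")
        return "\n".join(lines)
```

Because `deque(maxlen=6)` evicts from the front automatically, follow-up questions like the payment-schedule example below can resolve references to earlier turns without the history growing unboundedly.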

Tech Stack

Component        Technology
Language         Python 3.10+
LLM backend      Ollama (llama3.2)
Embeddings       Ollama (nomic-embed-text)
Vector store     ChromaDB (cosine similarity)
Keyword search   rank-bm25
PDF parsing      pdfplumber
DOCX parsing     python-docx
Terminal output  colorama

Setup

1. Install Ollama

Download from https://ollama.com, then pull the required models:

ollama pull nomic-embed-text
ollama pull llama3.2

Start the Ollama server (it may already be running as a background service):

ollama serve

2. Install Python dependencies

pip install -r requirements.txt

3. Add your documents

Copy any PDFs, Word docs, emails, or text files into the docs/ directory:

docs/
├── contract.pdf
├── notes.md
├── report.docx
└── archive.eml

Usage

Index your documents

python main.py ingest

Only new or modified files are re-indexed on subsequent runs.
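The change detection behind this can be sketched with hashlib: hash each file, compare against the digests recorded in .index_state.json, and re-embed only the mismatches. The exact state-file schema here is an assumption; ingest.py may store more metadata:

```python
import hashlib
import json
from pathlib import Path

STATE_FILE = Path(".index_state.json")  # matches the auto-generated file above

def file_digest(path):
    """SHA256 of a file's bytes, read in blocks to bound memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def files_to_reindex(docs_dir="docs"):
    """Return files whose content changed since the last ingest, updating state."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    changed = []
    for path in sorted(Path(docs_dir).glob("**/*")):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if state.get(str(path)) != digest:  # new file or modified content
            changed.append(path)
            state[str(path)] = digest
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return changed
```

Hashing content rather than comparing modification times means a touched-but-unchanged file is not re-embedded.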

Ask questions

python main.py chat

With HyDE enabled (recommended for better semantic recall):

python main.py chat --hyde

You'll enter an interactive session. Type your question and press Enter. Type exit or press Ctrl+C to quit.

You: What were the key terms of the vendor contract?
[streams answer with citations...]

You: What about the payment schedule?
[follow-up using conversation context...]

Manage the index

# List all indexed documents
python main.py list

# Remove a specific file from the index
python main.py remove contract.pdf

Configuration

All settings are in config.py:

Setting          Default                 Description
DOCS_DIR         docs/                   Input directory
CHROMA_DIR       .chroma_db/             Vector store path
OLLAMA_BASE_URL  http://localhost:11434  Ollama endpoint
EMBED_MODEL      nomic-embed-text        Embedding model
CHAT_MODEL       llama3.2                Chat/generation model
CHUNK_SIZE       500                     Target chunk size (chars)
CHUNK_OVERLAP    50                      Overlap between chunks (chars)
TOP_K_VECTOR     20                      Vector search candidates
TOP_K_BM25       20                      BM25 search candidates
TOP_K_FINAL      5                       Chunks passed to LLM
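Based on that table, config.py presumably amounts to a flat module of constants along these lines (names and values mirror the table; this is an inference, not the file itself):

```python
# config.py — centralized settings, mirroring the defaults table above
DOCS_DIR = "docs/"                           # input directory
CHROMA_DIR = ".chroma_db/"                   # vector store path
OLLAMA_BASE_URL = "http://localhost:11434"   # Ollama endpoint
EMBED_MODEL = "nomic-embed-text"             # embedding model
CHAT_MODEL = "llama3.2"                      # chat/generation model
CHUNK_SIZE = 500                             # target chunk size (chars)
CHUNK_OVERLAP = 50                           # overlap between chunks (chars)
TOP_K_VECTOR = 20                            # vector search candidates
TOP_K_BM25 = 20                              # BM25 search candidates
TOP_K_FINAL = 5                              # chunks passed to the LLM
```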

Implementation Notes

Chunking without a tokenizer — Chunks are split on sentence and paragraph boundaries using character count (4 chars ≈ 1 token) as a lightweight approximation. This avoids adding a tokenizer dependency while still respecting semantic boundaries.
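A minimal sketch of that strategy (the real ingest.py may differ in details): split on paragraph breaks, then sentence-ending punctuation, and greedily pack pieces up to ~500 characters, carrying a 50-character tail forward as overlap:

```python
import re

CHUNK_SIZE = 500    # target chunk size in characters (~125 tokens at 4 chars/token)
CHUNK_OVERLAP = 50  # characters carried over between adjacent chunks

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Greedily pack sentence/paragraph pieces into roughly size-char chunks."""
    # Split on paragraph breaks first, then on sentence-ending punctuation.
    pieces = []
    for para in re.split(r"\n\s*\n", text):
        pieces.extend(re.split(r"(?<=[.!?])\s+", para.strip()))
    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        if current and len(current) + len(piece) + 1 > size:
            chunks.append(current)
            current = current[-overlap:]  # overlap preserves context across chunks
        current = (current + " " + piece).strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries rather than fixed offsets keeps each chunk semantically coherent, which matters because chunks are embedded and retrieved individually.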

Reciprocal Rank Fusion — Rather than manually weighting BM25 vs. vector scores (which are on incompatible scales), RRF uses the rank position of each chunk in each result list. This is robust and requires no tuning.
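RRF as described fits in a few lines. Each chunk's fused score is the sum of 1/(k + rank) over the result lists it appears in; k=60 is the conventional constant from the original RRF paper, though the value this project uses isn't stated here:

```python
def rrf_merge(*ranked_lists, k=60, top_n=5):
    """Fuse ranked lists of chunk IDs by summing 1/(k + rank) per list."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; chunks found by both searches naturally rise.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because only rank positions are used, the incompatible score scales of BM25 and cosine similarity never need to be normalized against each other.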

HyDE — The prompt instructs the model to write as if extracting text from a real document, avoiding meta-language like "according to...". The resulting snippet lives in the same embedding space as actual document text, improving recall when question phrasing diverges from document phrasing.
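The exact prompt isn't reproduced in this README; a hedged sketch of the idea looks like this, with the generated passage (not the raw question) then sent to the embedding model:

```python
def hyde_prompt(question):
    """Ask the LLM for a document-style passage rather than a direct answer."""
    return (
        "Write a short passage that could appear verbatim in a document "
        "answering the question below. Write as plain document text: no "
        "meta-language such as 'according to' or 'the document states'.\n\n"
        f"Question: {question}\nPassage:"
    )
```

The passage the LLM returns lives in the same embedding space as real document text, so its vector tends to land nearer the relevant chunks than the question's vector would.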

Embedding batching — Texts are embedded 32 at a time to avoid memory spikes when indexing large document sets.
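The batching itself is a simple slicing generator; `embed_batch` below is a hypothetical stand-in for whatever ollama.py exposes:

```python
def batched(items, batch_size=32):
    """Yield successive fixed-size batches; the last batch may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Hypothetical usage during indexing:
#   for batch in batched(chunk_texts):
#       vectors.extend(embed_batch(batch))  # one Ollama call per 32 texts
```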

Session-scoped BM25 — The BM25 index is built once per chat session from all ChromaDB chunks loaded into memory. ChromaDB remains the source of truth; BM25 is a fast in-memory layer.
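The project delegates scoring to the rank-bm25 package; to make the in-memory layer concrete, here is a stdlib-only sketch of the same BM25 scoring idea (k1=1.5 and b=0.75 are common defaults, not values taken from this project):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25."""
    n = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n
    df = Counter()  # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            f = tf[term]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Building this index once per session is cheap relative to LLM inference, and it avoids persisting a second index alongside ChromaDB.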


Requirements

  • Python 3.10+
  • Ollama running locally with nomic-embed-text and llama3.2 pulled
  • ~4GB RAM for llama3.2 (or more for larger models)

License

MIT
