A local, privacy-first question-answering system for your personal documents. Ask natural language questions about your files — PDFs, Word docs, emails, text files — without any data ever leaving your machine.
Most document Q&A tools require uploading your files to a cloud service. Private Doc Brain runs entirely on your hardware using Ollama for local LLM inference and ChromaDB for local vector storage. No API keys, no subscriptions, no data exposure.
- Multi-format document support — PDF, DOCX, TXT, MD, EML
- Hybrid search — combines BM25 keyword search with semantic vector search via Reciprocal Rank Fusion for better retrieval than either alone
- HyDE (Hypothetical Document Embeddings) — optional mode where the LLM generates a hypothetical answer first, then uses that to find more relevant chunks
- Incremental indexing — SHA256-based change detection means only modified or new files get re-embedded
- Source citations — every answer shows which document and chunk it came from
- Conversational memory — maintains the last 6 turns so you can ask follow-up questions
- Fully local — Ollama runs the embedding model and LLM on your machine, ChromaDB stores vectors on disk
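The incremental-indexing idea can be sketched as follows. The helper names and the flat `{filename: hash}` state layout are illustrative assumptions, not the project's actual `.index_state.json` schema:

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash file contents, so a touch without an edit is not a change."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def files_to_reindex(docs_dir: Path, state_file: Path) -> list[Path]:
    """Return only files whose content hash differs from the saved state."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    return [
        path
        for path in sorted(docs_dir.iterdir())
        if path.is_file() and state.get(path.name) != file_sha256(path)
    ]
```

Because the comparison is content-based, re-saving a file without changing it costs nothing beyond the hash itself.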
private-doc-brain/
├── main.py # CLI entry point (ingest / chat / list / remove)
├── config.py # Centralized settings (paths, models, chunking, retrieval)
├── ingest.py # Document parsing, chunking, and indexing pipeline
├── ollama.py # HTTP client for Ollama (embeddings + chat)
├── search.py # Hybrid retrieval: BM25 + vector search + RRF + HyDE
├── brain.py # Interactive REPL, prompt engineering, citation display
├── requirements.txt
├── docs/ # Put your documents here
├── .chroma_db/ # ChromaDB vector store (auto-generated)
└── .index_state.json # Index metadata (auto-generated)
- Ingest — Documents in `docs/` are parsed, split into ~500-character overlapping chunks, embedded using `nomic-embed-text`, and stored in ChromaDB.
- Search — At query time, a BM25 index is built in-memory from all stored chunks. Your question is embedded and used for vector search. Both result sets are merged using Reciprocal Rank Fusion and the top 5 chunks are selected.
- HyDE (optional) — Before embedding the question, the LLM generates a hypothetical document snippet that would answer it. That synthetic snippet is embedded instead of the raw question, which tends to match real document language more closely.
- Chat — The top chunks are injected into a prompt along with the conversation history. `llama3.2` streams the response in real time, and citations are printed afterward.
| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| LLM backend | Ollama (llama3.2) |
| Embeddings | Ollama (nomic-embed-text) |
| Vector store | ChromaDB (cosine similarity) |
| Keyword search | rank-bm25 |
| PDF parsing | pdfplumber |
| DOCX parsing | python-docx |
| Terminal output | colorama |
Download from https://ollama.com, then pull the required models:
ollama pull nomic-embed-text
ollama pull llama3.2

Start the Ollama server (it may already be running as a background service):
ollama serve

Install the Python dependencies:

pip install -r requirements.txt

Copy any PDFs, Word docs, emails, or text files into the `docs/` directory:
docs/
├── contract.pdf
├── notes.md
├── report.docx
└── archive.eml
python main.py ingest

Only new or modified files are re-indexed on subsequent runs.
python main.py chat

With HyDE enabled (recommended for better semantic recall):

python main.py chat --hyde

You'll enter an interactive session. Type your question and press Enter. Type `exit` or press Ctrl+C to quit.
You: What were the key terms of the vendor contract?
[streams answer with citations...]
You: What about the payment schedule?
[follow-up using conversation context...]
# List all indexed documents
python main.py list
# Remove a specific file from the index
python main.py remove contract.pdf

All settings are in `config.py`:
| Setting | Default | Description |
|---|---|---|
| `DOCS_DIR` | `docs/` | Input directory |
| `CHROMA_DIR` | `.chroma_db/` | Vector store path |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama endpoint |
| `EMBED_MODEL` | `nomic-embed-text` | Embedding model |
| `CHAT_MODEL` | `llama3.2` | Chat/generation model |
| `CHUNK_SIZE` | 500 | Target chunk size (chars) |
| `CHUNK_OVERLAP` | 50 | Overlap between chunks (chars) |
| `TOP_K_VECTOR` | 20 | Vector search candidates |
| `TOP_K_BM25` | 20 | BM25 search candidates |
| `TOP_K_FINAL` | 5 | Chunks passed to LLM |
Chunking without a tokenizer — Chunks are split on sentence and paragraph boundaries using character count (4 chars ≈ 1 token) as a lightweight approximation. This avoids adding a tokenizer dependency while still respecting semantic boundaries.
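A minimal sketch of such a chunker, assuming greedy sentence packing with a character budget (illustrative, not the project's actual `ingest.py`):

```python
import re

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Greedily pack sentences into ~chunk_size-char chunks.

    Splits on sentence-ending punctuation and carries the tail of each
    finished chunk forward so context isn't cut mid-thought.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # overlap carried into next chunk
        current = (current + " " + sent).strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at sentence boundaries means a chunk can slightly exceed the target when a single sentence is long, which is the usual trade-off of boundary-respecting chunkers.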
Reciprocal Rank Fusion — Rather than manually weighting BM25 vs. vector scores (which are on incompatible scales), RRF uses the rank position of each chunk in each result list. This is robust and requires no tuning.
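The fusion step can be sketched in a few lines; the function name and `k=60` constant (a common default from the RRF literature) are assumptions, not necessarily what `search.py` uses:

```python
def rrf_merge(ranked_lists, k=60, top_n=5):
    """Merge ranked result lists via Reciprocal Rank Fusion.

    Each chunk's fused score is the sum over lists of 1 / (k + rank),
    so only rank positions matter, never the raw BM25/cosine scores.
    """
    scores = {}
    for results in ranked_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

bm25_hits = ["c1", "c2", "c3"]
vector_hits = ["c3", "c1", "c4"]
print(rrf_merge([bm25_hits, vector_hits]))  # → ['c1', 'c3', 'c2', 'c4']
```

Note how `c1`, ranked high in both lists, beats `c3` even though `c3` tops one list: agreement across retrievers is rewarded without any score normalization.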
HyDE — The prompt instructs the model to write as if extracting text from a real document, avoiding meta-language like "according to...". The resulting snippet lives in the same embedding space as actual document text, improving recall when question phrasing diverges from document phrasing.
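The shape of the HyDE step might look like the sketch below. The prompt wording is an illustrative assumption (the project's actual prompt may differ), and the LLM/embedding calls are injected as plain callables so the logic is testable without a running Ollama server:

```python
# Hypothetical prompt text — not the project's exact wording.
HYDE_PROMPT = (
    "Write a short passage that could appear in a real document and that "
    "directly answers the question below. Quote as the document would; "
    "do not use meta-language like 'according to'.\n\n"
    "Question: {question}\n\nPassage:"
)

def hyde_query_vector(question, generate, embed):
    """Embed a hypothetical answer instead of the raw question.

    `generate` maps prompt -> text; `embed` maps text -> vector.
    """
    passage = generate(HYDE_PROMPT.format(question=question))
    return embed(passage)
```

The returned vector is then used for the normal ChromaDB search; BM25 still sees the original question, so keyword recall is unaffected.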
Embedding batching — Texts are embedded 32 at a time to avoid memory spikes when indexing large document sets.
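The batching pattern is straightforward; this is a generic sketch (the callable `embed_batch` stands in for whatever the project's Ollama client exposes):

```python
def batched(items, batch_size=32):
    """Yield successive fixed-size batches; the last may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(texts, embed_batch):
    """Embed texts 32 at a time to bound peak memory on large corpora."""
    vectors = []
    for batch in batched(texts):
        vectors.extend(embed_batch(batch))
    return vectors
```

A batch size of 32 keeps each request small enough to avoid memory spikes while still amortizing per-call overhead.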
Session-scoped BM25 — The BM25 index is built once per chat session from all ChromaDB chunks loaded into memory. ChromaDB remains the source of truth; BM25 is a fast in-memory layer.
- Python 3.10+
- Ollama running locally with `nomic-embed-text` and `llama3.2` pulled
- ~4 GB RAM for `llama3.2` (or more for larger models)
MIT