DocuMind is a fully local RAG agent built with FastMCP, ChromaDB, and Ollama models (phi4-mini:3.8b-q4_K_M and embeddinggemma:300m-qat-q8_0), managed with uv and linted/formatted with Ruff.
All components run on your machine:
- LLM inference via local Ollama
- Embedding generation via local Ollama
- Vector storage/query via local ChromaDB
No cloud APIs are required.
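The README does not show the contents of config.py; as a rough sketch, it might centralize the local endpoint, model tags, and collection names used throughout. Only the two model tags, the chroma_db/ path, and the documents/conversation_memory collection names come from this document — every other constant name is an assumption:

```python
# config.py (hypothetical sketch) -- one place for local endpoints and model tags.
# Constant names are illustrative; only the values are taken from the README.

OLLAMA_BASE_URL = "http://127.0.0.1:11434"        # default local Ollama endpoint
CHAT_MODEL = "phi4-mini:3.8b-q4_K_M"              # local chat/inference model
EMBEDDING_MODEL = "embeddinggemma:300m-qat-q8_0"  # local embedding model

CHROMA_PATH = "./chroma_db"                       # runtime-created, ignored by git
DOCUMENTS_COLLECTION = "documents"                # ingested document chunks
MEMORY_COLLECTION = "conversation_memory"         # client conversation history
```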
```mermaid
C4Container
    title DocuMind - C4 Container Diagram

    Person(user, "User", "Ingests documents and chats with the assistant")
    System_Ext(ollama, "Ollama", "Local model runtime for chat and embeddings")

    System_Boundary(documind, "DocuMind (Local)") {
        Container(client, "Interactive Client (client.py)", "Python CLI", "Runs chat loop, calls MCP tools, stores conversation memory")
        Container(server, "FastMCP Server (server.py)", "Python / FastMCP", "Exposes add_document, semantic_search, collection_stats")
        ContainerDb(chroma, "ChromaDB", "Local vector database", "Stores ingested documents and conversation memory")
    }

    Rel(user, client, "Uses", "CLI")
    Rel(client, server, "Invokes tools", "MCP over stdio or SSE")
    Rel(server, ollama, "Generates embeddings", "HTTP")
    Rel(client, ollama, "Runs chat + embeddings", "HTTP")
    Rel(server, chroma, "Reads/writes document vectors", "Local DB API")
    Rel(client, chroma, "Reads/writes conversation memory", "Local DB API")
```
Install/pull models:

```
ollama pull phi4-mini:3.8b-q4_K_M
ollama pull embeddinggemma:300m-qat-q8_0
ollama list
```

Project layout:

```
<project-root>/
├── pyproject.toml
├── uv.lock
├── .python-version
├── config.py
├── server.py
├── client.py
├── ingest.py
├── scripts/
├── data/
└── chroma_db/   # runtime-created, ignored by git
```
Install dependencies:

```
cd <project-root>
python3 -m uv sync
```

Note: v1 ingestion supports text/markdown files only (.txt, .md, .markdown).
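How ingest.py splits files before embedding is not shown here; a plausible approach for plain-text/markdown input is an overlapping word-window splitter. This is a sketch — the function name and the window/overlap sizes are assumptions, not the project's actual values:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows for embedding.

    chunk_size and overlap are counted in words; the overlap keeps context
    that straddles a chunk boundary retrievable from either side.
    """
    words = text.split()
    if not words:
        return []
    step = max(1, chunk_size - overlap)  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window reached the end
            break
    return chunks
```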
Ingest one or more files:

```
uv run python ingest.py data/my_notes.txt
uv run python ingest.py data/notes.txt data/report.md
```

Start Ollama if needed:
```
ollama serve
```

Run FastMCP server over stdio (default):
```
cd <project-root>
uv run python server.py --transport stdio
```

Enable verbose MCP context logs on the server:
```
cd <project-root>
uv run python server.py --transport stdio --log-level DEBUG --to-client-debug
```

Run FastMCP server over SSE:
```
cd <project-root>
uv run python server.py --transport sse --host 127.0.0.1 --port 8000
```

The client persists conversation history in Chroma (the conversation_memory collection) and supports stdio and SSE transports.
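The exact record shape the client writes to the conversation_memory collection is not documented here. As an illustration, each turn could be stored as one Chroma entry whose metadata carries the session id (which is what would let --session-id resume a prior conversation). The field names below are assumptions:

```python
import time
import uuid

def memory_record(session_id: str, role: str, content: str) -> dict:
    """Build one conversation-memory entry in the id/document/metadata
    shape that Chroma's collection.add(...) expects.

    Hypothetical sketch: the README only states that history is persisted
    in a conversation_memory collection, not the exact schema.
    """
    return {
        "id": f"{session_id}-{uuid.uuid4().hex[:8]}",  # unique per turn
        "document": content,                            # the text that gets embedded
        "metadata": {
            "session_id": session_id,  # lets --session-id filter one conversation
            "role": role,              # "user" or "assistant"
            "timestamp": time.time(),  # preserves turn ordering
        },
    }
```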
Launch interactive client with default session id (stdio):
```
cd <project-root>
uv run python client.py
```

Launch with a custom persisted session:
```
cd <project-root>
uv run python client.py --session-id my-session
```

Override the server launch command used by the client (stdio mode):
```
cd <project-root>
uv run python client.py --transport stdio --server-command "uv run python server.py --transport stdio"
```

Connect the client to an already running SSE server:
```
cd <project-root>
uv run python client.py --transport sse --sse-url "http://127.0.0.1:8000/sse"
```

Client log forwarding is always enabled; --log-level only changes verbosity:
```
cd <project-root>
uv run python client.py --log-level INFO
uv run python client.py --log-level DEBUG
```

MCP tools exposed by the server:

- `add_document(text, doc_id=None, source="")`
- `semantic_search(query, n_results=5, source_filter="")`
- `collection_stats()`
Run script-based checks:
```
cd <project-root>
./scripts/ruff_check.sh
./scripts/smoke_ingest.sh
```

Direct Ruff commands:
```
cd <project-root>
python3 -m uv run ruff format .
python3 -m uv run ruff format --check . && python3 -m uv run ruff check .
```

Verify collection count:
```
cd <project-root>
uv run python -c "import chromadb; c=chromadb.PersistentClient('./chroma_db'); print(c.get_or_create_collection('documents').count())"
```

Troubleshooting:

- Connection refused on localhost:11434: ensure `ollama serve` is running.
- Missing model errors: re-run `ollama pull phi4-mini:3.8b-q4_K_M` and `ollama pull embeddinggemma:300m-qat-q8_0`.
- Empty search results: check that ingestion completed and the collection count is non-zero.
- ChromaDB embedding dimension mismatch: keep one embedding model per collection; clear `chroma_db/` and re-ingest if the model changes.
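The dimension-mismatch error comes from a simple invariant: a collection is created with vectors of one fixed length, and switching embedding models changes that length. A tiny helper (purely illustrative, not part of the project) makes the invariant explicit:

```python
def check_dims(collection_dim: int, new_embedding: list[float]) -> None:
    """Raise the same class of error Chroma reports when an embedding's
    length doesn't match the collection it is being added to."""
    if len(new_embedding) != collection_dim:
        raise ValueError(
            f"Embedding dimension {len(new_embedding)} does not match "
            f"collection dimensionality {collection_dim}; clear chroma_db/ "
            "and re-ingest with a single embedding model."
        )
```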