iData is a self-hosted enterprise document intelligence platform. Drop files into a watched folder or sync them from cloud storage, and the system automatically ingests, parses, chunks, and indexes them. A multi-agent RAG backend then answers natural-language questions about your documents with inline citations, SQL data lookups, and interactive graph generation — all through a React web interface.
```
 Drag & Drop              Remote Storage (Google Drive, S3, NAS…)
      │                                  │
      │                         User rclone Backend
      │                         (host, port 13000)
      │                                  │
      └─────────────────┬────────────────┘
                        │
                        ▼
                local-documents/
                        │
                        ▼
       Document Collector ──(file events)──► MongoDB
       (Docker, watchdog)                    (document metadata)
                        │
                        ▼
            Document Parser Backend
            (Docling / Mistral OCR)
                   │         │
                   ▼         ▼
                Milvus     MongoDB
          (vector chunks)  (chunks + images)
                        │
                        ▼
             Agent Backend (Flask)
        ┌──────────────────────────┐
        │    Orchestrator Agent    │
        │ ┌──────────────────────┐ │
        │ │    Document Agent    │ │
        │ │    (RAG + skills)    │ │
        │ └──────────────────────┘ │
        │ ┌──────────────────────┐ │
        │ │      SQL Agent       │ │
        │ │     (PostgreSQL)     │ │
        │ └──────────────────────┘ │
        └──────────────────────────┘
                        │
                        ▼
           Frontend (React + Vite)
```
Pipeline:
- Files enter `local-documents/` by drag and drop, or by syncing from remote storage via the User rclone Backend running on the host.
- Document Collector (Docker) detects new files via a filesystem watchdog and records their metadata in MongoDB (see the watch-loop sketch after this list).
- Document Parser Backend picks up unprocessed files, parses them with Docling or Mistral OCR, chunks the content with LangChain HybridChunker, generates OpenAI embeddings, and writes chunks to Milvus and MongoDB.
- Agent Backend receives chat messages and routes them through the Orchestrator, which delegates to a Document Agent and/or SQL Agent, then synthesizes a cited answer.
- Frontend renders the response with markdown, images, inline citations, and interactive charts.
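For orientation, here is a minimal sketch of the kind of watch loop the Document Collector runs, using the `watchdog` library. The handler body and the print placeholder are illustrative assumptions, not the actual collector code:

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Hypothetical: the real collector records document metadata in MongoDB here
        if not event.is_directory:
            print("new document detected:", event.src_path)

observer = Observer()
observer.schedule(NewFileHandler(), "local-documents/", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```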
| Service | Port | Runs | Description |
|---|---|---|---|
| `agent_backend` | 5001 | Docker | Flask API + multi-agent orchestrator |
| `document_parser_backend` | 4999 | Docker | Document parsing worker |
| `document_collector` | 8001 | Docker | Folder watchdog |
| `frontend` | 3000 | Docker | React + TypeScript web interface |
| `mongodb` | 27017 | Docker | Document metadata + chunks + skills |
| `standalone` (Milvus) | 19530 | Docker | Vector database |
| `minio` | 9000 / 9001 | Docker | Object storage (Milvus dependency) |
| `postgres` | 5432 | Docker | User accounts + structured data |
| `pgadmin` | 5433 | Docker | PostgreSQL admin UI |
| `user_rclone_backend` | 13000 | Local | rclone sync API (must run on host) |
`user_rclone_backend` must run directly on the host: rclone requires access to the host filesystem and uses browser-based OAuth flows that do not work inside a container.
| Layer | Technologies |
|---|---|
| Agent Framework | LangGraph, DeepAgents, LangChain |
| LLM / Embeddings | OpenAI (GPT-4.1-mini, text-embedding), Mistral (OCR) |
| Document Parsing | Docling 2.61.1, Mistral OCR, LangChain HybridChunker |
| Vector DB | Milvus 2.6.5 + MinIO + etcd |
| Databases | MongoDB (documents, skills, chunks), PostgreSQL (users, structured data) |
| Backend | Flask (agent + parser), FastAPI (collector + rclone) |
| Frontend | React 18, TypeScript, Material-UI, Vite |
| Cloud Sync | rclone |
| Containerization | Docker Compose |
- Docker + Docker Compose
- Python 3.10+ (for `user_rclone_backend`)
- rclone installed on the host (for remote sync)
- An OpenAI API key (embeddings + LLM)
- A Mistral API key (required only for OCR-based PDF parsing)
Copy `.env.example` to `.env` and fill in all values:

```bash
cp .env.example .env
```

Key variables to set:
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key |
| `MISTRAL_API_KEY` | Mistral API key (OCR only) |
| `SECRET_KEY` | Random string for JWT signing; generate with `openssl rand -hex 32` |
| `MONGODB_USERNAME` / `MONGODB_PASSWORD` | MongoDB credentials |
| `POSTGRES_USER` / `POSTGRES_PASSWORD` | PostgreSQL credentials |
| `SQL_AGENTDB_URI` | Full PostgreSQL URI for the SQL agent |
Edit `configs/config.toml`:

```toml
[folder_paths]
nas_documents_path = "/app/local-documents"
local_documents_path = "/app/local-documents"
```

Then start the stack:

```bash
docker compose up --build
```

Milvus takes ~30 seconds to initialize. The agent backend is ready when it logs `MongoDB connector initialized`.
Required for remote storage sync. Run directly on the host:
```bash
cd user_rclone_backend
pip install -r requirements.txt   # first time only
python -m api.main
```

Starts on port 13000.
Drag and drop: copy files into `local-documents/` on the host. The watchdog detects them immediately and queues them for parsing.
Remote sync: use the frontend's Remote Files page to configure an rclone remote (Google Drive, S3, NAS, etc.) and trigger a sync. Files land in `local-documents/` and are picked up automatically.
Supported formats: PDF, DOCX, PPTX, XLSX, and common image types.
The backend runs a multi-agent orchestrator built with LangGraph and DeepAgents. Every factual claim in a response is backed by retrieved evidence and carries an inline citation.
The top-level agent that receives user queries. It follows a strict loop:
Plan → Delegate → Reflect → Delegate more (if needed) → Synthesize → Answer
It decides which subagent(s) to call, reconciles their outputs, and produces a final cited response. It never fabricates answers — if evidence is missing it says so and proposes next retrieval steps.
Orchestrator tools:
| Tool | Purpose |
|---|---|
| `document-agent` (subagent) | Delegate unstructured document queries |
| `sql-agent` (subagent) | Delegate structured data / SQL queries |
| `think_tool` | Internal reflection between retrieval steps |
| `add/get/list/update/delete_skill_tool` | Manage the persistent skill library |
| `save_user_info` | Persist user preferences and goals across conversations |
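As a rough illustration, wiring the orchestrator and its subagents with DeepAgents might look like the sketch below. The prompts are placeholders, the tool objects (`think_tool`, `save_user_info`) are assumed to be defined elsewhere, and this is not the project's actual wiring code:

```python
from deepagents import create_deep_agent

# Subagent specs: name and description are what the orchestrator sees when routing
document_agent = {
    "name": "document-agent",
    "description": "Unstructured document retrieval (RAG + skills)",
    "prompt": "Return structured evidence packets with citations; never answer directly.",
}
sql_agent = {
    "name": "sql-agent",
    "description": "Structured data / SQL queries against PostgreSQL",
    "prompt": "Introspect schemas, validate SQL before executing, return markdown tables.",
}

orchestrator = create_deep_agent(
    tools=[think_tool, save_user_info],  # assumed to be defined elsewhere
    instructions="Plan -> Delegate -> Reflect -> Synthesize. Cite every claim.",
    subagents=[document_agent, sql_agent],
)
```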
Routing logic:
| Query type | Subagent |
|---|---|
| "What does the contract say about X?" | Document Agent |
| "How many orders were placed in Q3?" | SQL Agent |
| "Compare the policy doc with the sales numbers" | Both |
Output capabilities:
The orchestrator synthesizes subagent evidence into a structured response that can include:
- Inline citations: every factual claim links to its source chunk or document:

  ```markdown
  The notice period is 30 days [1](/documents/<doc_id>/<chunk_id>).

  Sources:
  - [1](/documents/...) — Section 4.2, termination clause
  ```

- Interactive plots: when retrieved data contains numbers, trends, or comparisons, the orchestrator emits a `<graph>` block with a Plotly.js spec that the frontend renders as an interactive chart:

  ```
  <graph>
  {
    "data": [{"type": "bar", "x": ["Q1","Q2","Q3"], "y": [120, 145, 98]}],
    "layout": {"title": "Quarterly Sales"}
  }
  </graph>
  ```

  Supported chart types: `bar`, `line`, `scatter`, `pie`, `histogram`. Graphs are only generated from retrieved data, never fabricated (see the extraction sketch after this list).

- Image embedding: images extracted during document parsing can be included inline using their stored asset URL:

  ```markdown
  
  ```
Specialist subagent for all unstructured knowledge retrieval. It returns structured evidence packets to the orchestrator — it does not produce final user-facing answers directly.
The system maintains two separate Milvus collections:
| Collection | Granularity | What is indexed |
|---|---|---|
| `document_catalogue` | One entry per document | LLM-generated summary of the whole document |
| Chunk store | One entry per chunk | Individual text/table segments from parsed documents |
The document catalogue is the agent's first point of contact with the corpus. When a document is parsed, an LLM generates a one-sentence context summary of its entire content. That summary — along with the document's path and metadata — is stored in the catalogue.
Why this matters for retrieval quality:
Searching directly against chunks is prone to false negatives. A chunk is a small window of text; a query about a document's overall topic or a concept that spans multiple sections may not match any single chunk well. The catalogue summary captures what the document is about at a high level, making it a much stronger first-pass signal for identifying which documents are relevant before any chunk retrieval happens.
The catalogue search uses hybrid BM25 + dense vector search with weighted fusion (70% dense, 30% BM25), followed by FlashRank cross-encoder reranking — fetching 2× the requested k as candidates, then reranking down to the final k. This combination handles both keyword-specific queries (BM25) and semantic/conceptual queries (dense) reliably.
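For intuition, a minimal sketch of that fusion-and-rerank step. The hit format (`{"doc_id": ..., "score": ...}` with scores normalized to [0, 1]) and the `summaries` lookup are assumptions; the real logic lives inside the catalogue search tool:

```python
from flashrank import Ranker, RerankRequest

def fuse_and_rerank(query, dense_hits, bm25_hits, summaries, k):
    """Weighted BM25 + dense fusion, then cross-encoder reranking."""
    candidates = 2 * k  # over-fetch, then rerank down to the final k

    # Weighted fusion: 70% dense, 30% BM25
    fused = {}
    for hit in dense_hits[:candidates]:
        fused[hit["doc_id"]] = fused.get(hit["doc_id"], 0.0) + 0.7 * hit["score"]
    for hit in bm25_hits[:candidates]:
        fused[hit["doc_id"]] = fused.get(hit["doc_id"], 0.0) + 0.3 * hit["score"]
    top = sorted(fused, key=fused.get, reverse=True)[:candidates]

    # FlashRank cross-encoder reranking over the per-document LLM summaries
    passages = [{"id": doc_id, "text": summaries[doc_id]} for doc_id in top]
    reranked = Ranker().rerank(RerankRequest(query=query, passages=passages))
    return [r["id"] for r in reranked[:k]]
```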
Access control via the document catalogue:
Because the catalogue is the gateway to all document retrieval, it is also the natural place to enforce access control. Each request carries a Context object injected server-side:
```python
@dataclass
class Context:
    user_id: str
    path_filters: Optional[List[str]]  # e.g. ["finance/", "reports/2025/"]
```

`path_filters` is a list of path prefixes the current user is allowed to access. These filters are enforced automatically and silently at two levels:
| Layer | How filters are applied |
|---|---|
| Catalogue search | Only documents whose `local_path` matches an allowed prefix are discoverable |
| Chunk retrieval | `chunk_retriever_tool` automatically builds `local_path like "<prefix>%"` expressions and ANDs them into every Milvus query |
This means the agent cannot retrieve — or even discover — a document outside the user's allowed paths, regardless of what it queries. The access boundary is enforced in the retrieval layer, not in the prompt, so it cannot be bypassed by rephrasing a question.
In practice this allows you to model resource access with folder structure:
```
local-documents/
├── finance/        # path_filters=["finance/"] → finance team only
├── legal/          # path_filters=["legal/"]   → legal team only
├── engineering/    # path_filters=["engineering/"]
└── shared/         # included in all users' path_filters
```
A user with `path_filters=["finance/", "shared/"]` will only ever see chunks and catalogue entries from those two folders. A user with no `path_filters` (e.g. an admin) has unrestricted access to the full corpus.
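A minimal sketch of how that automatic filter construction could look (the function name and internals are illustrative, not the actual tool code):

```python
from typing import List, Optional

def path_filter_expr(path_filters: Optional[List[str]]) -> str:
    """Build the Milvus boolean clause ANDed into every chunk query."""
    if not path_filters:  # no filters (e.g. an admin): unrestricted
        return ""
    clauses = [f'local_path like "{prefix}%"' for prefix in path_filters]
    return "(" + " or ".join(clauses) + ")"

print(path_filter_expr(["finance/", "shared/"]))
# (local_path like "finance/%" or local_path like "shared/%")
```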
The document agent accesses content through three progressively deeper levels. Each level is only used when the previous one is insufficient:
```
Level 1 — Document Catalogue Search
  └─ Identifies relevant document IDs from LLM summaries
  └─ Cheap: one-per-document, broad signal, fast

Level 2 — Chunk Retrieval (scoped)
  └─ Pulls top-k chunks from Milvus, filtered to relevant doc IDs
  └─ Precise: targeted evidence extraction with citation granularity

Level 3 — Full Document Retrieval
  └─ Loads the complete parsed markdown from MongoDB
  └─ Last resort: used when chunks lack enough context or detail
```
This tiered design means the agent is not firing expensive full-document reads on every query, but it also never silently gives up when chunks are insufficient.
The document agent follows this sequence on every request (see the sketch after this list):

1. Discover: call `document_catalogue_search_tool` with a query that captures the user's intent, key entities, and any known document type. Extract the returned document IDs.
2. Retrieve chunks: call `chunk_retriever_tool` with `filter_expr='document_id == "<id>"'` to scope the Milvus search to the relevant documents identified in step 1.
3. Escalate: if chunks are too short or missing context, call `retrieve_full_content_tool(document_id)` on the most promising document(s).
4. Reflect: call `think_tool` after each step to assess what was found, what is missing, and whether to continue or return.
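For intuition, the same sequence as straight-line Python, using the tool signatures from the table below. In reality the LLM drives these tool calls, and `sufficient` is a hypothetical coverage check:

```python
def answer_document_query(query: str, k: int = 8):
    evidence = []
    doc_ids = document_catalogue_search_tool(query, k=5)          # 1. Discover
    for doc_id in doc_ids:
        chunks = chunk_retriever_tool(                            # 2. Retrieve chunks
            query, k=k, filter_expr=f'document_id == "{doc_id}"'
        )
        if not sufficient(chunks):                                # hypothetical check
            chunks = [retrieve_full_content_tool(doc_id)]         # 3. Escalate
        evidence.extend(chunks)
        think_tool(f"{doc_id}: {len(chunks)} evidence pieces")    # 4. Reflect
    return evidence
```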
| Tool | Description |
|---|---|
| `document_catalogue_search_tool(query, k)` | Hybrid BM25 + dense search with FlashRank reranking over per-document LLM summaries; returns document IDs |
| `chunk_retriever_tool(query, k, filter_expr)` | Milvus chunk search; path filters from user session applied automatically |
| `retrieve_full_content_tool(document_id)` | Load complete parsed markdown + context summary from MongoDB |
| `think_tool(reflection)` | Deliberate reflection step; assess coverage and decide next action |
| `add_skill_tool` | Add a reusable technique or workflow to the skill library |
| `get_skill_tool` | Retrieve a skill by name |
| `list_skills_tool` | List all skills (LRU order) |
| `update_skill_tool` | Update a skill's description or content |
| `delete_skill_tool` | Remove a skill |
| `save_user_info` | Persist user name, preferences, and goals across conversations |
Chunk filter expressions: `chunk_retriever_tool` accepts Milvus filter expressions to scope retrieval:

```python
# Scope to a specific document (most common — used after catalogue search)
filter_expr='document_id == "507f1f77bcf86cd799439011"'

# Filter by chunk type
filter_expr='chunk_type == "table"'

# Filter by page range
filter_expr='pages[0] >= 10 and pages[0] <= 20'

# Filter by file path prefix (folder scoping)
filter_expr='local_path like "/app/local-documents/reports/%"'

# Combine
filter_expr='chunk_type == "table" and pages[0] >= 5'
```

Path filters from the user's active session context (folders selected in the UI) are applied automatically on top of any custom filter expression.
Specialist subagent for structured data queries against PostgreSQL. It introspects schemas dynamically, writes safe SQL, validates queries before execution, and returns results as markdown tables.
SQL Agent tools:
| Tool | Description |
|---|---|
| `list_table_schemas_tool()` | List all documented tables (LRU order) |
| `get_table_schema_tool(table_name)` | Get detailed schema docs for a table |
| `add_table_schema_tool(...)` | Document a table's columns, relationships, and query patterns |
| `update_table_schema_tool(...)` | Update existing schema documentation |
| `delete_table_schema_tool(...)` | Remove schema documentation |
| `think_tool` | Reflection between query steps |
| LangChain SQLDatabaseToolkit | Live schema introspection, query execution, query validation |
Schema documentation is stored in MongoDB (same LRU mechanism as skills) and injected into the SQL agent's context automatically.
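For reference, a minimal sketch of the live-introspection side using LangChain's SQLDatabaseToolkit. The URI env var matches the configuration table above and the model choice follows the tech stack section; this is a sketch, not the project's actual agent code:

```python
import os

from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

db = SQLDatabase.from_uri(os.environ["SQL_AGENTDB_URI"])
toolkit = SQLDatabaseToolkit(db=db, llm=ChatOpenAI(model="gpt-4.1-mini"))

# Yields list-tables, schema, query, and query-checker tools
for tool in toolkit.get_tools():
    print(tool.name)
```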
The agent can build and maintain a persistent library of reusable techniques, workflows, and domain knowledge. Skills are stored in MongoDB with LRU eviction (cap: 20 skills total, 10 injected into context per request).
Skills can be created, updated, and deleted by the agent mid-conversation. They are surfaced automatically in the system prompt for future sessions.
Example use cases:
- Storing a multi-step analysis process the user defined once
- Remembering a specific SQL query pattern for a recurring report
- Saving domain knowledge extracted from documents for faster future retrieval
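A minimal sketch of how the LRU eviction described above could work over a MongoDB collection (pymongo; the collection, field, and database names are assumptions, not taken from the codebase):

```python
from datetime import datetime, timezone

from pymongo import ASCENDING, MongoClient

MAX_SKILLS = 20  # total cap from the text

def upsert_skill(db, name: str, content: str) -> None:
    # Touch last_used on every write so eviction tracks recency
    db.skills.update_one(
        {"name": name},
        {"$set": {"content": content, "last_used": datetime.now(timezone.utc)}},
        upsert=True,
    )
    # Evict the least-recently-used skills beyond the cap
    excess = db.skills.count_documents({}) - MAX_SKILLS
    if excess > 0:
        for doc in db.skills.find().sort("last_used", ASCENDING).limit(excess):
            db.skills.delete_one({"_id": doc["_id"]})

db = MongoClient("mongodb://localhost:27017").idata  # database name is an assumption
upsert_skill(db, "quarterly-report", "1. Query Q3 totals 2. Plot bar chart ...")
```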
| Method | Path | Description |
|---|---|---|
| `POST` | `/auth/login` | Log in, returns JWT access + refresh tokens |
| `POST` | `/auth/signup` | Create a new user account |
| `POST` | `/auth/refresh` | Refresh an expired access token |
| Method | Path | Description |
|---|---|---|
| `GET` | `/chats` | List all chat sessions for the current user |
| `POST` | `/chats` | Create a new chat session |
| `GET` | `/chats/<id>/messages` | Get message history for a chat |
| `POST` | `/chats/<id>/messages` | Send a message and stream the agent response |
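A hedged end-to-end sketch against these endpoints with `requests`. The JSON field names (`access_token`, `id`, `message`) and payload shapes are assumptions, since the README does not document the exact schemas:

```python
import requests

BASE = "http://localhost:5001"  # agent_backend port from the services table

# Log in (field names below are assumptions)
tokens = requests.post(
    f"{BASE}/auth/login", json={"username": "alice", "password": "secret"}
).json()
headers = {"Authorization": f"Bearer {tokens['access_token']}"}

# Create a chat session, then stream a message through the orchestrator
chat = requests.post(f"{BASE}/chats", headers=headers).json()
response = requests.post(
    f"{BASE}/chats/{chat['id']}/messages",
    headers=headers,
    json={"message": "What does the contract say about notice periods?"},
    stream=True,
)
for line in response.iter_lines():
    if line:
        print(line.decode())
```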
| Method | Path | Description |
|---|---|---|
| `GET` | `/files` | List all indexed documents |
| `GET` | `/api/images/<id>` | Retrieve a stored image asset |
| Method | Path | Description |
|---|---|---|
| `GET` | `/documents` | List all tracked documents and parse status |
| `GET` | `/documents/<id>` | Status and metadata for a single document |
| `GET` | `/documents/<id>/<chunk_id>` | Retrieve a specific Milvus chunk |
| `GET` | `/api/images/<id>` | Stream a stored image |
| `GET` | `/documents/search/<query>` | Milvus similarity search |
| `POST` | `/documents/reset` | Mark all documents for re-parsing (dev) |
| Method | Path | Description |
|---|---|---|
| `GET` | `/rclone/remotes` | List configured rclone remotes |
| `POST` | `/rclone/remotes` | Add a new remote (e.g. Google Drive) |
| `GET` | `/rclone/remotes/<name>/files` | Browse files on a remote |
| `POST` | `/rclone/sync` | Trigger a sync from a remote path to `local-documents/` |
- Chat interface — multi-turn conversations with the orchestrator, markdown + image + graph rendering
- Tool picker — select which agent capabilities are active per session
- Document browser — view indexed documents and parse status
- Remote file manager — configure rclone remotes and browse/sync cloud storage
- Spotlight search — semantic search across all indexed content
- Document viewer — open source documents directly from citation links
- Theme system — Pastel, GitHub, GitHub Dark, Yale Blue, Dark Blue
To populate the SQL agent's database with sample data for testing:
```bash
docker exec -i postgres psql -U $POSTGRES_USER -d $POSTGRES_DB \
  -f /docker-entrypoint-initdb.d/schema.sql
docker exec -i postgres psql -U $POSTGRES_USER -d $POSTGRES_DB \
  -f /repo/add-data-copy-csv.sql
```

All services write structured JSON logs to `logs/project.log.jsonl` (1 GB max, 3 rotating backups). Tail a service's container logs with:

```bash
docker compose logs -f <service_name>
```
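One way such a rotating JSON-lines logger could be configured, as a sketch with the stdlib `logging` module; the actual field layout of the project's log records is not documented here and is an assumption:

```python
import json
import logging
import os
from logging.handlers import RotatingFileHandler

class JsonLinesFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Field names are illustrative, not the project's actual schema
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

os.makedirs("logs", exist_ok=True)
handler = RotatingFileHandler(
    "logs/project.log.jsonl",
    maxBytes=1_000_000_000,  # 1 GB max per the text
    backupCount=3,           # 3 rotating backups
)
handler.setFormatter(JsonLinesFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger(__name__).info("MongoDB connector initialized")
```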