
iData

iData is a self-hosted enterprise document intelligence platform. Drop files into a watched folder or sync them from cloud storage, and the system automatically ingests, parses, chunks, and indexes them. A multi-agent RAG backend then answers natural-language questions about your documents with inline citations, SQL data lookups, and interactive graph generation — all through a React web interface.


Architecture

Drag & Drop          Remote Storage (Google Drive, S3, NAS…)
     │                        │
     │               User rclone Backend
     │                (host, port 13000)
     │                        │
     └────────────────────────┘
                  │
           local-documents/
                  │
                  ▼
        Document Collector ──(file events)──► MongoDB
        (Docker, watchdog)                   (document metadata)
                                                    │
                                                    ▼
                                       Document Parser Backend
                                       (Docling / Mistral OCR)
                                        │              │
                                        ▼              ▼
                                     Milvus         MongoDB
                                  (vector chunks)  (chunks + images)
                                             │
                                             ▼
                                   Agent Backend (Flask)
                              ┌──────────────────────────┐
                              │      Orchestrator Agent   │
                              │  ┌──────────────────────┐ │
                              │  │    Document Agent    │ │
                              │  │  (RAG + skills)      │ │
                              │  └──────────────────────┘ │
                              │  ┌──────────────────────┐ │
                              │  │      SQL Agent       │ │
                              │  │  (PostgreSQL)        │ │
                              │  └──────────────────────┘ │
                              └──────────────────────────┘
                                             │
                                             ▼
                                   Frontend (React + Vite)

Pipeline:

  1. Files enter local-documents/ by drag and drop, or by syncing from remote storage via the User rclone Backend running on the host.
  2. Document Collector (Docker) detects new files via a filesystem watchdog and records their metadata in MongoDB (see the sketch after this list).
  3. Document Parser Backend picks up unprocessed files, parses them with Docling or Mistral OCR, chunks the content with LangChain HybridChunker, generates OpenAI embeddings, and writes chunks to Milvus and MongoDB.
  4. Agent Backend receives chat messages and routes them through the Orchestrator, which delegates to a Document Agent and/or SQL Agent, then synthesizes a cited answer.
  5. Frontend renders the response with markdown, images, inline citations, and interactive charts.
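
As a concrete illustration of step 2, the sketch below watches local-documents/ with the watchdog library and records new files in MongoDB. The collection name and metadata fields are assumptions for illustration, not the collector's actual schema.

# Minimal sketch: watch local-documents/ and record new files in MongoDB.
# Collection name and metadata fields are illustrative assumptions.
from datetime import datetime, timezone
from pathlib import Path

from pymongo import MongoClient
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

documents = MongoClient("mongodb://localhost:27017")["idata"]["documents"]

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        # Mark the file unparsed so a parser worker can pick it up later.
        documents.insert_one({
            "local_path": event.src_path,
            "name": Path(event.src_path).name,
            "parsed": False,
            "created_at": datetime.now(timezone.utc),
        })

observer = Observer()
observer.schedule(NewFileHandler(), path="local-documents/", recursive=True)
observer.start()
try:
    observer.join()
except KeyboardInterrupt:
    observer.stop()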

Services

Service                   Port          Runs     Description
agent_backend             5001          Docker   Flask API + multi-agent orchestrator
document_parser_backend   4999          Docker   Document parsing worker
document_collector        8001          Docker   Folder watchdog
frontend                  3000          Docker   React + TypeScript web interface
mongodb                   27017         Docker   Document metadata + chunks + skills
standalone (Milvus)       19530         Docker   Vector database
minio                     9000 / 9001   Docker   Object storage (Milvus dependency)
postgres                  5432          Docker   User accounts + structured data
pgadmin                   5433          Docker   PostgreSQL admin UI
user_rclone_backend       13000         Local    rclone sync API (must run on host)

user_rclone_backend must run directly on the host: rclone needs access to the host filesystem and to browser-based OAuth flows, neither of which works from inside a container.


Tech Stack

Layer              Technologies
Agent Framework    LangGraph, DeepAgents, LangChain
LLM / Embeddings   OpenAI (GPT-4.1-mini, text-embedding), Mistral (OCR)
Document Parsing   Docling 2.61.1, Mistral OCR, LangChain HybridChunker
Vector DB          Milvus 2.6.5 + MinIO + etcd
Databases          MongoDB (documents, skills, chunks), PostgreSQL (users, structured data)
Backend            Flask (agent + parser), FastAPI (collector + rclone)
Frontend           React 18, TypeScript, Material-UI, Vite
Cloud Sync         rclone
Containerization   Docker Compose

Prerequisites

  • Docker + Docker Compose
  • Python 3.10+ (for user_rclone_backend)
  • rclone installed on the host (for remote sync)
  • An OpenAI API key (embeddings + LLM)
  • A Mistral API key (required only for OCR-based PDF parsing)

Setup

1. Configure environment

Copy .env.example to .env and fill in all values:

cp .env.example .env

Key variables to set:

Variable                              Description
OPENAI_API_KEY                        OpenAI API key
MISTRAL_API_KEY                       Mistral API key (OCR only)
SECRET_KEY                            Random string for JWT signing — generate with openssl rand -hex 32
MONGODB_USERNAME / MONGODB_PASSWORD   MongoDB credentials
POSTGRES_USER / POSTGRES_PASSWORD     PostgreSQL credentials
SQL_AGENTDB_URI                       Full PostgreSQL URI for the SQL agent
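
A filled-in .env might look like this (every value below is a placeholder, and the database name in SQL_AGENTDB_URI is only an example):

OPENAI_API_KEY=sk-...
MISTRAL_API_KEY=...
# SECRET_KEY generated with: openssl rand -hex 32
SECRET_KEY=6f1c9a...
MONGODB_USERNAME=idata
MONGODB_PASSWORD=change-me
POSTGRES_USER=idata
POSTGRES_PASSWORD=change-me
SQL_AGENTDB_URI=postgresql://idata:change-me@postgres:5432/agentdb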

2. Configure watched folders

Edit configs/config.toml:

[folder_paths]
nas_documents_path = "/app/local-documents"
local_documents_path = "/app/local-documents"

3. Start the Docker stack

docker compose up --build

Milvus takes ~30 seconds to initialize. The agent backend is ready when it logs MongoDB connector initialized.

4. Start the user rclone backend (local)

Required for remote storage sync. Run directly on the host:

cd user_rclone_backend
pip install -r requirements.txt   # first time only
python -m api.main

Starts on port 13000.

5. Ingest documents

Drag and drop — copy files into local-documents/ on the host. The watchdog detects them immediately and queues them for parsing.

Remote sync — use the frontend's Remote Files page to configure a rclone remote (Google Drive, S3, NAS, etc.) and trigger a sync. Files land in local-documents/ and are picked up automatically.

Supported formats: PDF, DOCX, PPTX, XLSX, and common image types.


Agent System

The backend runs a multi-agent orchestrator built with LangGraph and DeepAgents. Every factual claim in a response is backed by retrieved evidence and carries an inline citation.

Orchestrator

The top-level agent that receives user queries. It follows a strict loop:

Plan → Delegate → Reflect → Delegate more (if needed) → Synthesize → Answer

It decides which subagent(s) to call, reconciles their outputs, and produces a final cited response. It never fabricates answers — if evidence is missing it says so and proposes next retrieval steps.

Orchestrator tools:

Tool                                    Purpose
document-agent (subagent)               Delegate unstructured document queries
sql-agent (subagent)                    Delegate structured data / SQL queries
think_tool                              Internal reflection between retrieval steps
add/get/list/update/delete_skill_tool   Manage the persistent skill library
save_user_info                          Persist user preferences and goals across conversations

Routing logic:

Query type                                        Subagent
"What does the contract say about X?"             Document Agent
"How many orders were placed in Q3?"              SQL Agent
"Compare the policy doc with the sales numbers"   Both

Output capabilities:

The orchestrator synthesizes subagent evidence into a structured response that can include:

  • Inline citations — every factual claim links to its source chunk or document:

    The notice period is 30 days [1](/documents/<doc_id>/<chunk_id>).
    
    Sources:
    - [1](/documents/...) — Section 4.2, termination clause
    
  • Interactive plots — when retrieved data contains numbers, trends, or comparisons, the orchestrator emits a <graph> block with a Plotly.js spec that the frontend renders as an interactive chart:

    <graph>
    {
      "data": [{"type": "bar", "x": ["Q1","Q2","Q3"], "y": [120, 145, 98]}],
      "layout": {"title": "Quarterly Sales"}
    }
    </graph>
    

    Supported chart types: bar, line, scatter, pie, histogram. Graphs are only generated from retrieved data — never fabricated.

  • Image embedding — images extracted during document parsing can be included inline using their stored asset URL:

    ![diagram](/api/images/<image_id>)
    

Document Agent

Specialist subagent for all unstructured knowledge retrieval. It returns structured evidence packets to the orchestrator — it does not produce final user-facing answers directly.


Document Catalogue

The system maintains two separate Milvus collections:

Collection           Granularity              What is indexed
document_catalogue   One entry per document   LLM-generated summary of the whole document
Chunk store          One entry per chunk      Individual text/table segments from parsed documents

The document catalogue is the agent's first point of contact with the corpus. When a document is parsed, an LLM generates a one-sentence context summary of its entire content. That summary — along with the document's path and metadata — is stored in the catalogue.

Why this matters for retrieval quality:

Searching directly against chunks is prone to false negatives. A chunk is a small window of text; a query about a document's overall topic or a concept that spans multiple sections may not match any single chunk well. The catalogue summary captures what the document is about at a high level, making it a much stronger first-pass signal for identifying which documents are relevant before any chunk retrieval happens.

The catalogue search uses hybrid BM25 + dense vector search with weighted fusion (70% dense, 30% BM25), followed by FlashRank cross-encoder reranking — fetching 2× the requested k as candidates, then reranking down to the final k. This combination handles both keyword-specific queries (BM25) and semantic/conceptual queries (dense) reliably.
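
A sketch of how that pipeline can be expressed with pymilvus and flashrank; the collection name, field names, and embedding model below are assumptions, not the actual implementation:

# Illustrative sketch: hybrid dense + BM25 search with 70/30 weighted
# fusion, then FlashRank reranking of 2*k candidates down to k.
from flashrank import Ranker, RerankRequest
from openai import OpenAI
from pymilvus import AnnSearchRequest, Collection, WeightedRanker, connections

connections.connect("default", host="localhost", port="19530")

def embed(text: str) -> list[float]:
    resp = OpenAI().embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def catalogue_search(query: str, k: int) -> list[dict]:
    catalogue = Collection("document_catalogue")
    dense = AnnSearchRequest(data=[embed(query)], anns_field="summary_vector",
                             param={"metric_type": "COSINE"}, limit=2 * k)
    sparse = AnnSearchRequest(data=[query], anns_field="summary_sparse",
                              param={"metric_type": "BM25"}, limit=2 * k)
    hits = catalogue.hybrid_search([dense, sparse],
                                   rerank=WeightedRanker(0.7, 0.3),  # dense, BM25
                                   limit=2 * k,
                                   output_fields=["document_id", "summary"])[0]
    passages = [{"id": h.entity.get("document_id"),
                 "text": h.entity.get("summary")} for h in hits]
    # Cross-encoder pass picks the final k from the fused candidate pool.
    return Ranker().rerank(RerankRequest(query=query, passages=passages))[:k]

The weighted fusion covers both query styles for recall; the cross-encoder pass then trades a little latency for better top-k precision.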

Access control via the document catalogue:

Because the catalogue is the gateway to all document retrieval, it is also the natural place to enforce access control. Each request carries a Context object injected server-side:

@dataclass
class Context:
    user_id: str
    path_filters: Optional[List[str]]  # e.g. ["finance/", "reports/2025/"]

path_filters is a list of path prefixes the current user is allowed to access. These filters are enforced automatically and silently at two levels:

Layer              How filters are applied
Catalogue search   Only documents whose local_path matches an allowed prefix are discoverable
Chunk retrieval    chunk_retriever_tool automatically builds local_path like "<prefix>%" expressions and ANDs them into every Milvus query

This means the agent cannot retrieve — or even discover — a document outside the user's allowed paths, regardless of what it queries. The access boundary is enforced in the retrieval layer, not in the prompt, so it cannot be bypassed by rephrasing a question.
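
A sketch of what this enforcement can look like; the helper below is illustrative, not the actual chunk_retriever_tool internals:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Context:
    user_id: str
    path_filters: Optional[List[str]]

def scoped_filter(ctx: Context, user_expr: str = "") -> str:
    """AND the user's allowed path prefixes into any custom filter expression."""
    if not ctx.path_filters:
        return user_expr  # no filters: unrestricted (e.g. an admin)
    prefix_clauses = " or ".join(
        f'local_path like "{p}%"' for p in ctx.path_filters)
    scope = f"({prefix_clauses})"
    return f"({user_expr}) and {scope}" if user_expr else scope

ctx = Context(user_id="alice", path_filters=["finance/", "shared/"])
print(scoped_filter(ctx, 'chunk_type == "table"'))
# (chunk_type == "table") and (local_path like "finance/%" or local_path like "shared/%")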

In practice this allows you to model resource access with folder structure:

local-documents/
├── finance/          # path_filters=["finance/"]  → finance team only
├── legal/            # path_filters=["legal/"]    → legal team only
├── engineering/      # path_filters=["engineering/"]
└── shared/           # included in all users' path_filters

A user with path_filters=["finance/", "shared/"] will only ever see chunks and catalogue entries from those two folders. A user with no path_filters (e.g. an admin) has unrestricted access to the full corpus.


Tiered Access Levels

The document agent accesses content through three progressively deeper levels. Each level is only used when the previous one is insufficient:

Level 1 — Document Catalogue Search
  └─ Identifies relevant document IDs from LLM summaries
  └─ Cheap: one-per-document, broad signal, fast

Level 2 — Chunk Retrieval (scoped)
  └─ Pulls top-k chunks from Milvus, filtered to relevant doc IDs
  └─ Precise: targeted evidence extraction with citation granularity

Level 3 — Full Document Retrieval
  └─ Loads the complete parsed markdown from MongoDB
  └─ Last resort: used when chunks lack enough context or detail

This tiered design keeps the agent from firing expensive full-document reads on every query, while ensuring it never silently gives up when chunks are insufficient.


Retrieval Workflow

The document agent follows this sequence on every request:

  1. Discover — call document_catalogue_search_tool with a query that captures the user's intent, key entities, and any known document type. Extract the returned document IDs.
  2. Retrieve chunks — call chunk_retriever_tool with filter_expr='document_id == "<id>"' to scope the Milvus search to the relevant documents identified in step 1.
  3. Escalate — if chunks are too short or missing context, call retrieve_full_content_tool(document_id) on the most promising document(s).
  4. Reflect — call think_tool after each step to assess what was found, what is missing, and whether to continue or return.
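
Written against the documented tool signatures, a typical pass has this shape (values are illustrative, and the tools are assumed to be callable as plain functions):

# 1. Discover: find candidate documents from catalogue summaries.
docs = document_catalogue_search_tool(
    query="termination notice period in the vendor contract", k=5)

# 2. Retrieve chunks, scoped to the best candidate's document ID.
chunks = chunk_retriever_tool(
    query="notice period for termination", k=8,
    filter_expr=f'document_id == "{docs[0]["document_id"]}"')

# 3. Escalate only if the chunks lack sufficient context.
if not chunks_are_sufficient(chunks):  # hypothetical coverage check
    full_doc = retrieve_full_content_tool(document_id=docs[0]["document_id"])

# 4. Reflect before returning an evidence packet to the orchestrator.
think_tool("Clause found; coverage sufficient, returning evidence.")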

Document Agent Tools

Tool                                          Description
document_catalogue_search_tool(query, k)      Hybrid BM25 + dense search with FlashRank reranking over per-document LLM summaries — returns document IDs
chunk_retriever_tool(query, k, filter_expr)   Milvus chunk search; path filters from user session applied automatically
retrieve_full_content_tool(document_id)       Load complete parsed markdown + context summary from MongoDB
think_tool(reflection)                        Deliberate reflection step — assess coverage and decide next action
add_skill_tool                                Add a reusable technique or workflow to the skill library
get_skill_tool                                Retrieve a skill by name
list_skills_tool                              List all skills (LRU order)
update_skill_tool                             Update a skill's description or content
delete_skill_tool                             Remove a skill
save_user_info                                Persist user name, preferences, and goals across conversations

Chunk filter expressions

chunk_retriever_tool accepts Milvus filter expressions to scope retrieval:

# Scope to a specific document (most common — used after catalogue search)
filter_expr='document_id == "507f1f77bcf86cd799439011"'

# Filter by chunk type
filter_expr='chunk_type == "table"'

# Filter by page range
filter_expr='pages[0] >= 10 and pages[0] <= 20'

# Filter by file path prefix (folder scoping)
filter_expr='local_path like "/app/local-documents/reports/%"'

# Combine
filter_expr='chunk_type == "table" and pages[0] >= 5'

Path filters from the user's active session context (folders selected in the UI) are applied automatically on top of any custom filter expression.


SQL Agent

Specialist subagent for structured data queries against PostgreSQL. It introspects schemas dynamically, writes safe SQL, validates queries before execution, and returns results as markdown tables.

SQL Agent tools:

Tool                                Description
list_table_schemas_tool()           List all documented tables (LRU order)
get_table_schema_tool(table_name)   Get detailed schema docs for a table
add_table_schema_tool(...)          Document a table's columns, relationships, and query patterns
update_table_schema_tool(...)       Update existing schema documentation
delete_table_schema_tool(...)       Remove schema documentation
think_tool                          Reflection between query steps
LangChain SQLDatabaseToolkit        Live schema introspection, query execution, query validation

Schema documentation is stored in MongoDB (same LRU mechanism as skills) and injected into the SQL agent's context automatically.
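
For reference, a minimal sketch of wiring up LangChain's SQLDatabaseToolkit; the connection URI and model are placeholders:

from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

# URI placeholder: in this project it would come from SQL_AGENTDB_URI.
db = SQLDatabase.from_uri("postgresql://idata:change-me@localhost:5432/agentdb")
toolkit = SQLDatabaseToolkit(db=db, llm=ChatOpenAI(model="gpt-4.1-mini"))

for tool in toolkit.get_tools():
    # sql_db_query, sql_db_schema, sql_db_list_tables, sql_db_query_checker
    print(tool.name)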


Skill System

The agent can build and maintain a persistent library of reusable techniques, workflows, and domain knowledge. Skills are stored in MongoDB with LRU eviction (cap: 10 skills total, 20 injected into context per request).

Skills can be created, updated, and deleted by the agent mid-conversation. They are surfaced automatically in the system prompt for future sessions.

Example use cases:

  • Storing a multi-step analysis process the user defined once
  • Remembering a specific SQL query pattern for a recurring report
  • Saving domain knowledge extracted from documents for faster future retrieval
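
A hypothetical skill document, to make the stored shape concrete (field names are illustrative, not the actual schema):

# Hypothetical shape of a stored skill; field names are illustrative.
skill = {
    "name": "quarterly-revenue-report",
    "description": "How to assemble the recurring Q-over-Q revenue summary",
    "content": (
        "1. sql-agent: total revenue per quarter from the orders table\n"
        "2. document-agent: pull commentary from the latest board deck\n"
        "3. Render a bar chart of the quarterly totals in a <graph> block"
    ),
    "last_used_at": "2025-01-15T09:30:00Z",  # drives LRU eviction
}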


API Reference

Agent Backend — localhost:5001

Auth

Method   Path            Description
POST     /auth/login     Log in, returns JWT access + refresh tokens
POST     /auth/signup    Create a new user account
POST     /auth/refresh   Refresh an expired access token
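
A hypothetical login call using Python requests; the JSON field names and response keys are assumptions about the payload shape:

import requests

# Field and key names below are assumed, not confirmed by the API docs.
resp = requests.post("http://localhost:5001/auth/login",
                     json={"username": "alice", "password": "secret"})
resp.raise_for_status()
headers = {"Authorization": f"Bearer {resp.json()['access_token']}"}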

Chat

Method   Path                   Description
GET      /chats                 List all chat sessions for the current user
POST     /chats                 Create a new chat session
GET      /chats/<id>/messages   Get message history for a chat
POST     /chats/<id>/messages   Send a message and stream the agent response
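
And a hypothetical streamed message, again assuming a JSON body shape; the Bearer token comes from the login step:

import requests

headers = {"Authorization": "Bearer <access token>"}
with requests.post("http://localhost:5001/chats/<id>/messages",
                   json={"message": "What is the notice period?"},
                   headers=headers, stream=True) as resp:
    # Print the agent's response incrementally as chunks arrive.
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)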

Files

Method   Path               Description
GET      /files             List all indexed documents
GET      /api/images/<id>   Retrieve a stored image asset

Document Parser Backend — localhost:4999

Method   Path                         Description
GET      /documents                   List all tracked documents and parse status
GET      /documents/<id>              Status and metadata for a single document
GET      /documents/<id>/<chunk_id>   Retrieve a specific Milvus chunk
GET      /api/images/<id>             Stream a stored image
GET      /documents/search/<query>    Milvus similarity search
POST     /documents/reset             Mark all documents for re-parsing (dev)

User rclone Backend — localhost:13000 (host process)

Method   Path                           Description
GET      /rclone/remotes                List configured rclone remotes
POST     /rclone/remotes                Add a new remote (e.g. Google Drive)
GET      /rclone/remotes/<name>/files   Browse files on a remote
POST     /rclone/sync                   Trigger a sync from a remote path to local-documents/
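
A hypothetical sync trigger; the payload field names are assumptions:

import requests

# Payload field names are assumed; remote name/path are examples.
resp = requests.post("http://localhost:13000/rclone/sync",
                     json={"remote": "gdrive:reports/2025",
                           "destination": "local-documents/"})
resp.raise_for_status()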

Frontend Features

  • Chat interface — multi-turn conversations with the orchestrator, markdown + image + graph rendering
  • Tool picker — select which agent capabilities are active per session
  • Document browser — view indexed documents and parse status
  • Remote file manager — configure rclone remotes and browse/sync cloud storage
  • Spotlight search — semantic search across all indexed content
  • Document viewer — open source documents directly from citation links
  • Theme system — Pastel, GitHub, GitHub Dark, Yale Blue, Dark Blue

PostgreSQL Mock Data

To populate the SQL agent's database with sample data for testing:

docker exec -i postgres psql -U $POSTGRES_USER -d $POSTGRES_DB \
  -f /docker-entrypoint-initdb.d/schema.sql

docker exec -i postgres psql -U $POSTGRES_USER -d $POSTGRES_DB \
  -f /repo/add-data-copy-csv.sql

Logs

All services write structured JSON logs to logs/project.log.jsonl (1 GB max, 3 rotating backups).
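
Because the log is newline-delimited JSON, it is easy to filter programmatically; a minimal example (the per-event fields are whatever each service emits):

import json

with open("logs/project.log.jsonl") as f:
    for line in f:
        event = json.loads(line)  # one JSON object per line
        print(event)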

docker compose logs -f <service_name>
