⚠️ Pre-release — In Active DevelopmentThis project is functional and has been tested end-to-end in a local environment, but it has not undergone formal QA, security audit, or enterprise support validation for public deployment. The Admin UI in particular is in a primitive early state. Use in production environments at your own discretion and with appropriate review.
A production-grade MCP (Model Context Protocol) gateway that exposes data.world's catalog, governance, and knowledge graph capabilities as AI-native tools — enabling AI agents to discover, understand, and reason about enterprise data before querying it.
Most MCP servers for data platforms are thin API wrappers. This gateway is different: every tool is designed around what an agent wants to accomplish, not what API endpoint it maps to. describe_dataset doesn't expose the data.world schema endpoint — it returns everything an agent needs to understand a dataset in a single call: schema, governance, certification status, responsible parties, and compliance tags.
When paired with a SQL or file-system MCP (the "data access layer"), this gateway acts as the knowledge and intelligence layer in a multi-MCP agent architecture:
┌─────────────────────────────────────────────────────┐
│ AI Agent (Claude, GPT, etc.) │
└────────────────┬────────────────────┬────────────────┘
│ │
┌────────────▼──────────┐ ┌──────▼──────────────────┐
│ data.world MCP │ │ Data MCP │
│ Knowledge + Wisdom │ │ Access Layer │
│ Layer │ │ │
│ • What data exists? │ │ • SQL (Postgres, │
│ • Who governs it? │ │ Snowflake, BigQuery) │
│ • Is it certified? │ │ • Files (S3, ADLS) │
│ • What does it mean? │ │ • REST APIs │
│ • Who can access it? │ │ • Streaming sources │
└───────────────────────┘ └─────────────────────────┘
The agent asks the gateway which dataset to query, what columns exist and what they mean, whether the data is certified, and who to contact for governance questions — then the data MCP executes the actual retrieval. The user receives verified, cited, auditable answers.
| Sub-project | Description | Status |
|---|---|---|
| SP1 | Core MCP Gateway (7 tools) | ✅ Complete |
| SP2 | Enterprise Auth (Okta JWT) | ✅ Complete |
| SP3 | Admin API (control plane) | ✅ Complete |
| SP4 | AI-Powered Instance Discovery | ✅ Complete |
| SP5 | Admin UI (React frontend) | ✅ Complete — see Admin UI notes |
| SP6 | Okta SSO for Admin UI | 🔲 Planned |
| — | Production hardening pass | 🔲 Planned |
| Tool | What it does |
|---|---|
search_catalog |
Full-text search across the data.world catalog. Supports filtering by owner, tags, domain, and 8 responsible-party roles. Returns source_url for inline citation. |
describe_dataset |
Full schema (all tables, columns, types), governance metadata, certification status, quality score, compliance tags (GDPR, SOX, CCPA, HIPAA), and responsible-party contacts for a specific dataset. |
list_collections |
Enumerates curated domain collections with member counts and IRIs for further navigation. Enterprise tier. |
| Tool | What it does |
|---|---|
get_access_policy |
Access level (open/restricted/private), policy description, approved groups, and compliance classification for a dataset. Enterprise tier. |
| Tool | What it does |
|---|---|
get_glossary_terms |
Business vocabulary definitions with synonyms, owning team, and linked datasets/columns. Resolves terms before data interpretation. Enterprise tier. |
get_lineage |
Upstream/downstream data dependency graph with configurable depth and direction. Enterprise tier. |
get_related_resources |
Graph traversal from any catalog IRI — finds linked datasets, glossary terms, and related resources. IRIs are provided by other tool responses. Enterprise tier. |
Every tool response includes a source_url field pointing directly to the originating page in data.world, and a next_step hint guiding agents to the logical follow-on tool call. Agents can use source_url to produce inline markdown citations:
The Hospital Outcome of Care Surgical Measures dataset covers quality metrics for surgical procedures across US hospitals...
See docs/citation-system-prompt.md for a system prompt snippet that instructs agents to use inline citations.
┌──────────────────────────────────────────────────────────────────────┐
│ Control Plane │
│ │
│ Admin API (FastAPI, port 8000) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Tool config management · Discovery scan orchestration │ │
│ │ Telemetry · Audit log · Recommendation review │ │
│ └──────────────────────────┬──────────────────────────────────────┘ │
│ │ pg_notify('config_changed') │
│ ▼ │
│ PostgreSQL (shared state) │
│ │ asyncpg LISTEN │
│ ▼ │
│ MCP Gateway (FastMCP, port 8001) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 7 MCP tools · Okta JWT auth · Telemetry middleware │ │
│ │ Live tool toggle (no restart) · Source link citations │ │
│ └──────────────────────────┬──────────────────────────────────────┘ │
│ │ │
└─────────────────────────────┼────────────────────────────────────────┘
│ HTTPS
┌─────────▼─────────┐
│ data.world API │
│ (public or │
│ enterprise) │
└───────────────────┘
The Admin API and MCP Gateway communicate only through PostgreSQL LISTEN/NOTIFY — no direct service-to-service calls. A crash or restart of either service has zero impact on the other.
- Python 3.10+
- Node.js 18+ (for the Admin UI)
- Docker Desktop (for PostgreSQL)
- A data.world account with an API token (get one here)
git clone https://github.com/illegal-request/data.world-FastMCP-Platform.git
cd data.world-FastMCP-Platform
# Install the MCP gateway (editable)
pip install -e .
# Install the Admin API
pip install -e "admin_api/[dev]"cp .env.example .env
# Edit .env and set DATAWORLD_API_TOKEN to your data.world tokenFor the Admin API, create a separate .env:
cp admin_api/.env.example admin_api/.env # if it doesn't exist, copy from root .env
# Set ADMIN_BOOTSTRAP_KEY to a random string (≥32 chars)
# Example: python -c "import secrets; print(secrets.token_urlsafe(48))"docker compose up postgrescd admin_api
python -m alembic upgrade head
cd ..Terminal 1 — Admin API:
cd admin_api
python -m uvicorn dataworld_admin.main:app --host 0.0.0.0 --port 8000Terminal 2 — Admin UI:
cd admin_ui
npm install
npm run dev
# Serves at http://localhost:5173Terminal 3 — MCP Gateway (for Claude Code / Claude Desktop):
python -m dataworld_mcp
# Default transport: stdio — launched by the MCP client, not manuallyThe gateway runs in stdio mode by default, meaning your MCP client (Claude Desktop, Claude Code, Cursor) starts it as a subprocess. You only start it manually if you want to run it as an HTTP server.
# MCP gateway tests
pytest
# Admin API tests
cd admin_api && pytestAdd to your claude_desktop_config.json:
{
"mcpServers": {
"dataworld": {
"command": "python",
"args": ["-m", "dataworld_mcp"],
"cwd": "/path/to/data.world-FastMCP-Platform",
"env": {
"DATAWORLD_API_TOKEN": "your_token_here"
}
}
}
}Start the gateway as an HTTP server:
MCP_TRANSPORT=streamable-http MCP_PORT=8001 python -m dataworld_mcpThe project includes a .mcp.json pre-configured for localhost:8001. If using Claude Code, this is picked up automatically when you open the project directory.
Alternatively, use SSE transport:
MCP_TRANSPORT=sse MCP_PORT=8001 python -m dataworld_mcpAnd update .mcp.json:
{
"mcpServers": {
"dataworld": {
"type": "sse",
"url": "http://localhost:8001/sse"
}
}
}For inline citations, add to your agent's system prompt (or paste from docs/citation-system-prompt.md):
When tool results include
source_urlfields, cite them as inline markdown hyperlinks —[descriptive text](url)— woven naturally into your response prose. Do not collect them into a reference list at the end.
All configuration is via environment variables. Copy .env.example to .env to get started.
| Variable | Required | Default | Description |
|---|---|---|---|
DATAWORLD_API_TOKEN |
✅ Yes | — | Your data.world API token |
DATAWORLD_BASE_URL |
No | https://api.data.world/v0 |
API base URL. Override for enterprise single-tenant: https://api.{company}.app.data.world/v0 |
DATAWORLD_UI_BASE_URL |
No | https://data.world |
UI base URL for source_url construction. Override for enterprise: https://{company}.app.data.world |
MCP_TRANSPORT |
No | stdio |
Transport protocol: stdio, streamable-http, or sse |
MCP_PORT |
No | 8001 |
Port for HTTP/SSE transports |
AUTH_MODE |
No | env_token |
Authentication mode: env_token (single token from env) or okta (per-user JWT) |
OKTA_ISSUER |
If AUTH_MODE=okta |
— | Okta authorization server URL |
OKTA_AUDIENCE |
If AUTH_MODE=okta |
— | Expected JWT audience |
OKTA_CLIENT_ID |
If AUTH_MODE=okta |
— | Okta application client ID |
OKTA_CLIENT_SECRET |
If AUTH_MODE=okta |
— | Okta application client secret |
DATABASE_URL |
No | — | PostgreSQL connection string. Required for live tool configuration and telemetry. |
| Variable | Required | Default | Description |
|---|---|---|---|
ADMIN_BOOTSTRAP_KEY |
✅ Yes | — | Static token for Admin UI login (≥32 chars). Generate with python -c "import secrets; print(secrets.token_urlsafe(48))" |
DATABASE_URL |
✅ Yes | — | PostgreSQL connection string |
DATAWORLD_API_TOKEN |
✅ Yes | — | data.world API token (used by discovery scanner) |
DATAWORLD_BASE_URL |
No | https://api.data.world/v0 |
data.world API base URL |
DISCOVERY_LLM_MODEL |
No | — | LiteLLM model string for AI-powered scan analysis (e.g. claude-sonnet-4-5). Omit to use the template analyser fallback. |
ANTHROPIC_API_KEY |
If using Anthropic | — | Anthropic API key (required if DISCOVERY_LLM_MODEL uses an Anthropic model) |
| Concern | Local | Enterprise |
|---|---|---|
| API host | https://api.data.world/v0 |
https://api.{company}.app.data.world/v0 |
| UI base URL | https://data.world |
https://{company}.app.data.world |
| Authentication | Single API token (env_token) |
Per-user Okta JWT (okta mode) |
| Transport | stdio (client-managed) |
streamable-http or sse (hosted server) |
| Credentials | .env file |
Secrets manager (AWS, Azure, Vault, K8s) — planned, see Known Issues |
Set AUTH_MODE=okta and provide all four OKTA_* variables. In this mode:
- Agents authenticate with their Okta JWT
- The gateway validates the JWT with your Okta authorization server
- The validated user's data.world token is fetched via RFC 8693 token exchange
- Tool calls execute with the individual user's data.world permissions
Note: The RFC 8693 token exchange endpoint for enterprise data.world instances has not been confirmed with data.world enterprise support. OktaTokenProvider._exchange() is intentionally isolated — only this method changes when the endpoint is confirmed. See Known Issues.
The project includes Dockerfiles for both the gateway and the Admin API:
# Build and run all services
docker compose upNote: the Docker Compose file uses development defaults (plain dataworld/dataworld Postgres credentials). Update for any non-local deployment.
The discovery engine scans your data.world instance and uses an LLM to generate instance-specific tool descriptions — so agents arrive pre-tuned to your actual domain taxonomy, collection structure, and responsible-party roles.
Configure DISCOVERY_LLM_MODEL with any LiteLLM-compatible model string:
| Provider | Model string example |
|---|---|
| Anthropic | claude-sonnet-4-5 |
| OpenAI | gpt-4o-mini |
| Azure OpenAI | azure/gpt-4o-mini |
| Google Gemini | gemini/gemini-1.5-flash |
| Amazon Bedrock | bedrock/anthropic.claude-sonnet-4-5 |
| None | Omit — template analyser is used |
Run a discovery scan from the Admin UI at http://localhost:5173.
⚠️ The Admin UI is in a primitive early development state. The core functionality works — login, tool configuration, discovery scan management, telemetry dashboard — but the interface has not undergone design review, usability testing, or full feature development. Expect rough edges, missing polish, and incomplete workflows.
Access the Admin UI at http://localhost:5173 after starting the dev server (npm run dev in admin_ui/).
Login with your ADMIN_BOOTSTRAP_KEY value from your .env file.
What works:
- Viewing and toggling MCP tools on/off (changes propagate to the live gateway without restart)
- Viewing and editing tool descriptions
- Running basic and advanced discovery scans
- Reviewing and approving discovery recommendations
- Telemetry dashboard (tool call counts, latency)
SP6 — Okta SSO for Admin UI: The "Login with Okta SSO" button is a stub. Full Okta OIDC login for admin sessions is planned as SP6 and has not been implemented.
data.world-FastMCP-Platform/
├── src/dataworld_mcp/ # MCP Gateway (the core product)
│ ├── tools/ # The 7 MCP tools
│ │ ├── catalog.py # search_catalog, describe_dataset, list_collections
│ │ ├── governance.py # get_access_policy
│ │ ├── knowledge.py # get_glossary_terms, get_lineage, get_related_resources
│ │ └── url_builder.py # Source URL construction utility
│ ├── auth/ # Okta JWT validation
│ ├── client/ # data.world API client (httpx + tenacity)
│ ├── telemetry/ # Tool call event buffering and middleware
│ ├── config.py # Environment configuration (single source of truth)
│ ├── config_listener.py # PostgreSQL LISTEN for live config updates
│ └── server.py # FastMCP server instance
├── admin_api/ # Control plane API (FastAPI)
│ ├── src/dataworld_admin/
│ │ ├── discovery/ # LLM-powered catalog scan engine
│ │ ├── tools/ # Tool config management
│ │ ├── telemetry/ # Telemetry persistence
│ │ └── auth/ # Admin API authentication
│ └── alembic/ # Database migrations
├── admin_ui/ # Admin frontend (React + TypeScript)
│ └── src/
├── tests/ # MCP gateway test suite
├── docs/ # Architecture docs, briefs, system prompt guidance
├── docker-compose.yml
├── Dockerfile.gateway
└── pyproject.toml
- Add the tool function to the appropriate file in
src/dataworld_mcp/tools/using the@mcp.tool()decorator - Follow the XML docstring format (
<usecase>/<instructions>) for consistent tool selection behaviour - Include
source_urlin the response (extract from API response or construct withdataset_url()) - Append the citation hint to
next_step:"source_url fields in results can be used as inline markdown citations [title](url) in agent responses." - Add tests in
tests/ - Import the module in
src/dataworld_mcp/__main__.pyto register the tool
@mcp.tool()
async def my_tool(param: str) -> dict:
"""
<usecase>
Use when the agent wants to [accomplish X]. Call after [Y] to [Z].
</usecase>
<instructions>
Provide [param description]. Returns [response structure].
</instructions>
"""The <usecase> block drives tool selection. The <instructions> block drives tool use. Keep them separate — mixing them degrades tool selection accuracy.
See KNOWN_ISSUES.md for the full list. Key items:
- Okta token exchange endpoint unconfirmed —
AUTH_MODE=oktarequires validation with data.world enterprise support before production use - Secrets manager not implemented — credentials are read from
.envfiles; AWS/Azure/Vault/K8s secrets integration is planned - Admin UI is primitive — SP6 (Okta SSO), full UI polish, and complete workflow coverage are planned
- Enterprise source URLs (Group 2 tools) — constructed URLs may not resolve on some enterprise instance configurations
- Lineage node-level citations —
get_lineagereturns a top-levelsource_urlonly; per-node source links are deferred to a future release
| Item | Description |
|---|---|
| SP6: Okta SSO for Admin UI | Replace bootstrap key login with full Okta OIDC for admin sessions |
| Production hardening | Secrets manager integration, confirmed token exchange endpoint, scanner service account provisioning |
| Dedicated citation agent | Opinionated agent for deployment to enterprise agent marketplaces (Gemini Enterprise, A2A), built on this gateway |
MCP resource_link content type |
Migrate to the MCP 2025-06-18 spec's typed resource_link content blocks when FastMCP adds first-class support |
Lineage node-level source_url |
Per-node source links in get_lineage responses (V2) |
This project is in active development. Issues and pull requests are welcome.
Before contributing:
- Run the test suite:
pytest(gateway) andcd admin_api && pytest(Admin API) - Follow the existing tool description format (XML docstrings with
<usecase>/<instructions>) - New tools must include
source_urlin responses and append the citation hint tonext_step - Keep
url_builder.pyas the single source of truth for URL construction — do not readDATAWORLD_UI_BASE_URLdirectly in tool files
[License TBD — not yet specified for this pre-release]