Skip to content

illegal-request/data.world-FastMCP-Platform

Repository files navigation

data.world FastMCP Platform

⚠️ Pre-release — In Active Development

This project is functional and has been tested end-to-end in a local environment, but it has not undergone formal QA, security audit, or enterprise support validation for public deployment. The Admin UI in particular is in a primitive early state. Use in production environments at your own discretion and with appropriate review.

A production-grade MCP (Model Context Protocol) gateway that exposes data.world's catalog, governance, and knowledge graph capabilities as AI-native tools — enabling AI agents to discover, understand, and reason about enterprise data before querying it.


What This Is

Most MCP servers for data platforms are thin API wrappers. This gateway is different: every tool is designed around what an agent wants to accomplish, not what API endpoint it maps to. describe_dataset doesn't expose the data.world schema endpoint — it returns everything an agent needs to understand a dataset in a single call: schema, governance, certification status, responsible parties, and compliance tags.

When paired with a SQL or file-system MCP (the "data access layer"), this gateway acts as the knowledge and intelligence layer in a multi-MCP agent architecture:

┌─────────────────────────────────────────────────────┐
│                   AI Agent (Claude, GPT, etc.)       │
└────────────────┬────────────────────┬────────────────┘
                 │                    │
    ┌────────────▼──────────┐  ┌──────▼──────────────────┐
    │  data.world MCP       │  │  Data MCP               │
    │  Knowledge + Wisdom   │  │  Access Layer           │
    │  Layer                │  │                         │
    │  • What data exists?  │  │  • SQL (Postgres,       │
    │  • Who governs it?    │  │    Snowflake, BigQuery)  │
    │  • Is it certified?   │  │  • Files (S3, ADLS)     │
    │  • What does it mean? │  │  • REST APIs            │
    │  • Who can access it? │  │  • Streaming sources    │
    └───────────────────────┘  └─────────────────────────┘

The agent asks the gateway which dataset to query, what columns exist and what they mean, whether the data is certified, and who to contact for governance questions — then the data MCP executes the actual retrieval. The user receives verified, cited, auditable answers.


Sub-project Status

Sub-project Description Status
SP1 Core MCP Gateway (7 tools) ✅ Complete
SP2 Enterprise Auth (Okta JWT) ✅ Complete
SP3 Admin API (control plane) ✅ Complete
SP4 AI-Powered Instance Discovery ✅ Complete
SP5 Admin UI (React frontend) ✅ Complete — see Admin UI notes
SP6 Okta SSO for Admin UI 🔲 Planned
Production hardening pass 🔲 Planned

The 7 MCP Tools

Catalog Layer

Tool What it does
search_catalog Full-text search across the data.world catalog. Supports filtering by owner, tags, domain, and 8 responsible-party roles. Returns source_url for inline citation.
describe_dataset Full schema (all tables, columns, types), governance metadata, certification status, quality score, compliance tags (GDPR, SOX, CCPA, HIPAA), and responsible-party contacts for a specific dataset.
list_collections Enumerates curated domain collections with member counts and IRIs for further navigation. Enterprise tier.

Governance Layer

Tool What it does
get_access_policy Access level (open/restricted/private), policy description, approved groups, and compliance classification for a dataset. Enterprise tier.

Knowledge Graph Layer

Tool What it does
get_glossary_terms Business vocabulary definitions with synonyms, owning team, and linked datasets/columns. Resolves terms before data interpretation. Enterprise tier.
get_lineage Upstream/downstream data dependency graph with configurable depth and direction. Enterprise tier.
get_related_resources Graph traversal from any catalog IRI — finds linked datasets, glossary terms, and related resources. IRIs are provided by other tool responses. Enterprise tier.

Source Link Citations

Every tool response includes a source_url field pointing directly to the originating page in data.world, and a next_step hint guiding agents to the logical follow-on tool call. Agents can use source_url to produce inline markdown citations:

The Hospital Outcome of Care Surgical Measures dataset covers quality metrics for surgical procedures across US hospitals...

See docs/citation-system-prompt.md for a system prompt snippet that instructs agents to use inline citations.


Architecture

┌──────────────────────────────────────────────────────────────────────┐
│  Control Plane                                                        │
│                                                                       │
│  Admin API (FastAPI, port 8000)                                       │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │  Tool config management · Discovery scan orchestration          │ │
│  │  Telemetry · Audit log · Recommendation review                  │ │
│  └──────────────────────────┬──────────────────────────────────────┘ │
│                             │ pg_notify('config_changed')             │
│                             ▼                                         │
│  PostgreSQL (shared state)                                            │
│                             │ asyncpg LISTEN                          │
│                             ▼                                         │
│  MCP Gateway (FastMCP, port 8001)                                     │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │  7 MCP tools · Okta JWT auth · Telemetry middleware             │ │
│  │  Live tool toggle (no restart) · Source link citations          │ │
│  └──────────────────────────┬──────────────────────────────────────┘ │
│                             │                                         │
└─────────────────────────────┼────────────────────────────────────────┘
                              │ HTTPS
                    ┌─────────▼─────────┐
                    │  data.world API   │
                    │  (public or       │
                    │   enterprise)     │
                    └───────────────────┘

The Admin API and MCP Gateway communicate only through PostgreSQL LISTEN/NOTIFY — no direct service-to-service calls. A crash or restart of either service has zero impact on the other.


Prerequisites

  • Python 3.10+
  • Node.js 18+ (for the Admin UI)
  • Docker Desktop (for PostgreSQL)
  • A data.world account with an API token (get one here)

Local Development Setup

1. Clone and install

git clone https://github.com/illegal-request/data.world-FastMCP-Platform.git
cd data.world-FastMCP-Platform

# Install the MCP gateway (editable)
pip install -e .

# Install the Admin API
pip install -e "admin_api/[dev]"

2. Configure environment

cp .env.example .env
# Edit .env and set DATAWORLD_API_TOKEN to your data.world token

For the Admin API, create a separate .env:

cp admin_api/.env.example admin_api/.env   # if it doesn't exist, copy from root .env
# Set ADMIN_BOOTSTRAP_KEY to a random string (≥32 chars)
# Example: python -c "import secrets; print(secrets.token_urlsafe(48))"

3. Start PostgreSQL

docker compose up postgres

4. Run database migrations

cd admin_api
python -m alembic upgrade head
cd ..

5. Start the services

Terminal 1 — Admin API:

cd admin_api
python -m uvicorn dataworld_admin.main:app --host 0.0.0.0 --port 8000

Terminal 2 — Admin UI:

cd admin_ui
npm install
npm run dev
# Serves at http://localhost:5173

Terminal 3 — MCP Gateway (for Claude Code / Claude Desktop):

python -m dataworld_mcp
# Default transport: stdio — launched by the MCP client, not manually

The gateway runs in stdio mode by default, meaning your MCP client (Claude Desktop, Claude Code, Cursor) starts it as a subprocess. You only start it manually if you want to run it as an HTTP server.

6. Running tests

# MCP gateway tests
pytest

# Admin API tests
cd admin_api && pytest

Connecting to AI Clients

Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "dataworld": {
      "command": "python",
      "args": ["-m", "dataworld_mcp"],
      "cwd": "/path/to/data.world-FastMCP-Platform",
      "env": {
        "DATAWORLD_API_TOKEN": "your_token_here"
      }
    }
  }
}

Claude Code / Cursor (HTTP mode)

Start the gateway as an HTTP server:

MCP_TRANSPORT=streamable-http MCP_PORT=8001 python -m dataworld_mcp

The project includes a .mcp.json pre-configured for localhost:8001. If using Claude Code, this is picked up automatically when you open the project directory.

Alternatively, use SSE transport:

MCP_TRANSPORT=sse MCP_PORT=8001 python -m dataworld_mcp

And update .mcp.json:

{
  "mcpServers": {
    "dataworld": {
      "type": "sse",
      "url": "http://localhost:8001/sse"
    }
  }
}

Agent system prompt

For inline citations, add to your agent's system prompt (or paste from docs/citation-system-prompt.md):

When tool results include source_url fields, cite them as inline markdown hyperlinks — [descriptive text](url) — woven naturally into your response prose. Do not collect them into a reference list at the end.


Configuration Reference

All configuration is via environment variables. Copy .env.example to .env to get started.

MCP Gateway (.env)

Variable Required Default Description
DATAWORLD_API_TOKEN ✅ Yes Your data.world API token
DATAWORLD_BASE_URL No https://api.data.world/v0 API base URL. Override for enterprise single-tenant: https://api.{company}.app.data.world/v0
DATAWORLD_UI_BASE_URL No https://data.world UI base URL for source_url construction. Override for enterprise: https://{company}.app.data.world
MCP_TRANSPORT No stdio Transport protocol: stdio, streamable-http, or sse
MCP_PORT No 8001 Port for HTTP/SSE transports
AUTH_MODE No env_token Authentication mode: env_token (single token from env) or okta (per-user JWT)
OKTA_ISSUER If AUTH_MODE=okta Okta authorization server URL
OKTA_AUDIENCE If AUTH_MODE=okta Expected JWT audience
OKTA_CLIENT_ID If AUTH_MODE=okta Okta application client ID
OKTA_CLIENT_SECRET If AUTH_MODE=okta Okta application client secret
DATABASE_URL No PostgreSQL connection string. Required for live tool configuration and telemetry.

Admin API (admin_api/.env)

Variable Required Default Description
ADMIN_BOOTSTRAP_KEY ✅ Yes Static token for Admin UI login (≥32 chars). Generate with python -c "import secrets; print(secrets.token_urlsafe(48))"
DATABASE_URL ✅ Yes PostgreSQL connection string
DATAWORLD_API_TOKEN ✅ Yes data.world API token (used by discovery scanner)
DATAWORLD_BASE_URL No https://api.data.world/v0 data.world API base URL
DISCOVERY_LLM_MODEL No LiteLLM model string for AI-powered scan analysis (e.g. claude-sonnet-4-5). Omit to use the template analyser fallback.
ANTHROPIC_API_KEY If using Anthropic Anthropic API key (required if DISCOVERY_LLM_MODEL uses an Anthropic model)

Enterprise Deployment

What changes from local setup

Concern Local Enterprise
API host https://api.data.world/v0 https://api.{company}.app.data.world/v0
UI base URL https://data.world https://{company}.app.data.world
Authentication Single API token (env_token) Per-user Okta JWT (okta mode)
Transport stdio (client-managed) streamable-http or sse (hosted server)
Credentials .env file Secrets manager (AWS, Azure, Vault, K8s) — planned, see Known Issues

Okta authentication mode

Set AUTH_MODE=okta and provide all four OKTA_* variables. In this mode:

  1. Agents authenticate with their Okta JWT
  2. The gateway validates the JWT with your Okta authorization server
  3. The validated user's data.world token is fetched via RFC 8693 token exchange
  4. Tool calls execute with the individual user's data.world permissions

Note: The RFC 8693 token exchange endpoint for enterprise data.world instances has not been confirmed with data.world enterprise support. OktaTokenProvider._exchange() is intentionally isolated — only this method changes when the endpoint is confirmed. See Known Issues.

Docker deployment

The project includes Dockerfiles for both the gateway and the Admin API:

# Build and run all services
docker compose up

Note: the Docker Compose file uses development defaults (plain dataworld/dataworld Postgres credentials). Update for any non-local deployment.

AI-Powered Discovery

The discovery engine scans your data.world instance and uses an LLM to generate instance-specific tool descriptions — so agents arrive pre-tuned to your actual domain taxonomy, collection structure, and responsible-party roles.

Configure DISCOVERY_LLM_MODEL with any LiteLLM-compatible model string:

Provider Model string example
Anthropic claude-sonnet-4-5
OpenAI gpt-4o-mini
Azure OpenAI azure/gpt-4o-mini
Google Gemini gemini/gemini-1.5-flash
Amazon Bedrock bedrock/anthropic.claude-sonnet-4-5
None Omit — template analyser is used

Run a discovery scan from the Admin UI at http://localhost:5173.


Admin UI

⚠️ The Admin UI is in a primitive early development state. The core functionality works — login, tool configuration, discovery scan management, telemetry dashboard — but the interface has not undergone design review, usability testing, or full feature development. Expect rough edges, missing polish, and incomplete workflows.

Access the Admin UI at http://localhost:5173 after starting the dev server (npm run dev in admin_ui/).

Login with your ADMIN_BOOTSTRAP_KEY value from your .env file.

What works:

  • Viewing and toggling MCP tools on/off (changes propagate to the live gateway without restart)
  • Viewing and editing tool descriptions
  • Running basic and advanced discovery scans
  • Reviewing and approving discovery recommendations
  • Telemetry dashboard (tool call counts, latency)

SP6 — Okta SSO for Admin UI: The "Login with Okta SSO" button is a stub. Full Okta OIDC login for admin sessions is planned as SP6 and has not been implemented.


Developer Guide

Project structure

data.world-FastMCP-Platform/
├── src/dataworld_mcp/          # MCP Gateway (the core product)
│   ├── tools/                  # The 7 MCP tools
│   │   ├── catalog.py          # search_catalog, describe_dataset, list_collections
│   │   ├── governance.py       # get_access_policy
│   │   ├── knowledge.py        # get_glossary_terms, get_lineage, get_related_resources
│   │   └── url_builder.py      # Source URL construction utility
│   ├── auth/                   # Okta JWT validation
│   ├── client/                 # data.world API client (httpx + tenacity)
│   ├── telemetry/              # Tool call event buffering and middleware
│   ├── config.py               # Environment configuration (single source of truth)
│   ├── config_listener.py      # PostgreSQL LISTEN for live config updates
│   └── server.py               # FastMCP server instance
├── admin_api/                  # Control plane API (FastAPI)
│   ├── src/dataworld_admin/
│   │   ├── discovery/          # LLM-powered catalog scan engine
│   │   ├── tools/              # Tool config management
│   │   ├── telemetry/          # Telemetry persistence
│   │   └── auth/               # Admin API authentication
│   └── alembic/                # Database migrations
├── admin_ui/                   # Admin frontend (React + TypeScript)
│   └── src/
├── tests/                      # MCP gateway test suite
├── docs/                       # Architecture docs, briefs, system prompt guidance
├── docker-compose.yml
├── Dockerfile.gateway
└── pyproject.toml

Adding a new tool

  1. Add the tool function to the appropriate file in src/dataworld_mcp/tools/ using the @mcp.tool() decorator
  2. Follow the XML docstring format (<usecase> / <instructions>) for consistent tool selection behaviour
  3. Include source_url in the response (extract from API response or construct with dataset_url())
  4. Append the citation hint to next_step: "source_url fields in results can be used as inline markdown citations [title](url) in agent responses."
  5. Add tests in tests/
  6. Import the module in src/dataworld_mcp/__main__.py to register the tool

Tool description format

@mcp.tool()
async def my_tool(param: str) -> dict:
    """
    <usecase>
    Use when the agent wants to [accomplish X]. Call after [Y] to [Z].
    </usecase>
    <instructions>
    Provide [param description]. Returns [response structure].
    </instructions>
    """

The <usecase> block drives tool selection. The <instructions> block drives tool use. Keep them separate — mixing them degrades tool selection accuracy.


Known Issues & Limitations

See KNOWN_ISSUES.md for the full list. Key items:

  • Okta token exchange endpoint unconfirmedAUTH_MODE=okta requires validation with data.world enterprise support before production use
  • Secrets manager not implemented — credentials are read from .env files; AWS/Azure/Vault/K8s secrets integration is planned
  • Admin UI is primitive — SP6 (Okta SSO), full UI polish, and complete workflow coverage are planned
  • Enterprise source URLs (Group 2 tools) — constructed URLs may not resolve on some enterprise instance configurations
  • Lineage node-level citationsget_lineage returns a top-level source_url only; per-node source links are deferred to a future release

Roadmap

Item Description
SP6: Okta SSO for Admin UI Replace bootstrap key login with full Okta OIDC for admin sessions
Production hardening Secrets manager integration, confirmed token exchange endpoint, scanner service account provisioning
Dedicated citation agent Opinionated agent for deployment to enterprise agent marketplaces (Gemini Enterprise, A2A), built on this gateway
MCP resource_link content type Migrate to the MCP 2025-06-18 spec's typed resource_link content blocks when FastMCP adds first-class support
Lineage node-level source_url Per-node source links in get_lineage responses (V2)

Contributing

This project is in active development. Issues and pull requests are welcome.

Before contributing:

  1. Run the test suite: pytest (gateway) and cd admin_api && pytest (Admin API)
  2. Follow the existing tool description format (XML docstrings with <usecase> / <instructions>)
  3. New tools must include source_url in responses and append the citation hint to next_step
  4. Keep url_builder.py as the single source of truth for URL construction — do not read DATAWORLD_UI_BASE_URL directly in tool files

License

[License TBD — not yet specified for this pre-release]

About

This project focuses on the development of a Data.World MCP server designed to support customer's ability to deploy the framework and make their instance available for agentic workloads.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors