data.world FastMCP Platform

⚠️ Pre-release — In Active Development

This project is functional and has been tested end-to-end in a local environment, but it has not undergone formal QA, security audit, or enterprise support validation for public deployment. The Admin UI in particular is in a primitive early state. Use in production environments at your own discretion and with appropriate review.

An MCP (Model Context Protocol) gateway, designed for production use, that exposes data.world's catalog, governance, and knowledge graph capabilities as AI-native tools, enabling AI agents to discover, understand, and reason about enterprise data before querying it.

What This Is

Most MCP servers for data platforms are thin API wrappers. This gateway is different: every tool is designed around what an agent wants to accomplish, not what API endpoint it maps to. describe_dataset doesn't expose the data.world schema endpoint — it returns everything an agent needs to understand a dataset in a single call: schema, governance, certification status, responsible parties, and compliance tags.

When paired with a SQL or file-system MCP (the "data access layer"), this gateway acts as the knowledge and intelligence layer in a multi-MCP agent architecture:

┌─────────────────────────────────────────────────────┐
│                   AI Agent (Claude, GPT, etc.)       │
└────────────────┬────────────────────┬────────────────┘
                 │                    │
    ┌────────────▼──────────┐  ┌──────▼──────────────────┐
    │  data.world MCP       │  │  Data MCP               │
    │  Knowledge + Wisdom   │  │  Access Layer           │
    │  Layer                │  │                         │
    │  • What data exists?  │  │  • SQL (Postgres,       │
    │  • Who governs it?    │  │    Snowflake, BigQuery)  │
    │  • Is it certified?   │  │  • Files (S3, ADLS)     │
    │  • What does it mean? │  │  • REST APIs            │
    │  • Who can access it? │  │  • Streaming sources    │
    └───────────────────────┘  └─────────────────────────┘

The agent asks the gateway which dataset to query, what columns exist and what they mean, whether the data is certified, and who to contact for governance questions — then the data MCP executes the actual retrieval. The user receives verified, cited, auditable answers.

Sub-project Status

Sub-project	Description	Status
SP1	Core MCP Gateway (7 tools)	✅ Complete
SP2	Enterprise Auth (Okta JWT)	✅ Complete
SP3	Admin API (control plane)	✅ Complete
SP4	AI-Powered Instance Discovery	✅ Complete
SP5	Admin UI (React frontend)	✅ Complete — see Admin UI notes
SP6	Okta SSO for Admin UI	🔲 Planned
—	Production hardening pass	🔲 Planned

The 7 MCP Tools

Catalog Layer

Tool	What it does
`search_catalog`	Full-text search across the data.world catalog. Supports filtering by owner, tags, domain, and 8 responsible-party roles. Returns `source_url` for inline citation.
`describe_dataset`	Full schema (all tables, columns, types), governance metadata, certification status, quality score, compliance tags (GDPR, SOX, CCPA, HIPAA), and responsible-party contacts for a specific dataset.
`list_collections`	Enumerates curated domain collections with member counts and IRIs for further navigation. Enterprise tier.

Governance Layer

Tool	What it does
`get_access_policy`	Access level (open/restricted/private), policy description, approved groups, and compliance classification for a dataset. Enterprise tier.

Knowledge Graph Layer

Tool	What it does
`get_glossary_terms`	Business vocabulary definitions with synonyms, owning team, and linked datasets/columns. Resolves terms before data interpretation. Enterprise tier.
`get_lineage`	Upstream/downstream data dependency graph with configurable depth and direction. Enterprise tier.
`get_related_resources`	Graph traversal from any catalog IRI — finds linked datasets, glossary terms, and related resources. IRIs are provided by other tool responses. Enterprise tier.

Source Link Citations

Every tool response includes a source_url field pointing directly to the originating page in data.world, and a next_step hint guiding agents to the logical follow-on tool call. Agents can use source_url to produce inline markdown citations:

The [Hospital Outcome of Care Surgical Measures](<source_url>) dataset covers quality metrics for surgical procedures across US hospitals...

See docs/citation-system-prompt.md for a system prompt snippet that instructs agents to use inline citations.
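As a rough illustration of the citation pattern, the sketch below turns a tool result into an inline markdown link. Only `source_url` comes from the documented response; the `title` key, the `cite` helper, and the example URL are hypothetical and used purely for illustration.

```python
def cite(result: dict, fallback_text: str = "source") -> str:
    """Return an inline markdown link for a tool result, or plain text if no URL."""
    text = result.get("title") or fallback_text  # 'title' is an assumed field name
    url = result.get("source_url")
    if not url:
        return text
    # Escape closing brackets so the link text cannot break the markdown syntax.
    safe_text = text.replace("]", "\\]")
    return f"[{safe_text}]({url})"


result = {
    "title": "Hospital Outcome of Care Surgical Measures",
    "source_url": "https://data.world/example-owner/example-dataset",  # placeholder URL
}
print(f"The {cite(result)} dataset covers quality metrics for surgical procedures.")
```

Falling back to plain text when `source_url` is absent keeps the sentence readable instead of emitting a broken link.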

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│  Control Plane                                                        │
│                                                                       │
│  Admin API (FastAPI, port 8000)                                       │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │  Tool config management · Discovery scan orchestration          │ │
│  │  Telemetry · Audit log · Recommendation review                  │ │
│  └──────────────────────────┬──────────────────────────────────────┘ │
│                             │ pg_notify('config_changed')             │
│                             ▼                                         │
│  PostgreSQL (shared state)                                            │
│                             │ asyncpg LISTEN                          │
│                             ▼                                         │
│  MCP Gateway (FastMCP, port 8001)                                     │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │  7 MCP tools · Okta JWT auth · Telemetry middleware             │ │
│  │  Live tool toggle (no restart) · Source link citations          │ │
│  └──────────────────────────┬──────────────────────────────────────┘ │
│                             │                                         │
└─────────────────────────────┼────────────────────────────────────────┘
                              │ HTTPS
                    ┌─────────▼─────────┐
                    │  data.world API   │
                    │  (public or       │
                    │   enterprise)     │
                    └───────────────────┘

The Admin API and MCP Gateway communicate only through PostgreSQL LISTEN/NOTIFY — no direct service-to-service calls. A crash or restart of either service has zero impact on the other.
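The listener side of this pattern could look roughly like the sketch below, which uses asyncpg's `add_listener`. The channel name comes from the diagram above; the JSON payload shape (`{"tool": ..., "enabled": ...}`) and the function names are assumptions, not the project's actual implementation.

```python
import asyncio
import json

CHANNEL = "config_changed"  # channel name taken from the architecture diagram


def apply_config_change(enabled: dict, payload: str) -> dict:
    """Return a new tool-enabled map after applying one notification payload.

    Assumed payload shape: {"tool": "<name>", "enabled": true|false}.
    """
    change = json.loads(payload)
    updated = dict(enabled)
    updated[change["tool"]] = bool(change["enabled"])
    return updated


async def listen_for_config(dsn: str, enabled: dict) -> None:
    """Subscribe to the channel and keep `enabled` in sync until cancelled."""
    import asyncpg  # imported lazily so the pure helper above has no dependency

    conn = await asyncpg.connect(dsn)

    def on_notify(connection, pid, channel, payload):
        # asyncpg invokes listeners with (connection, pid, channel, payload).
        enabled.update(apply_config_change(enabled, payload))

    await conn.add_listener(CHANNEL, on_notify)
    try:
        await asyncio.Event().wait()  # wait until cancelled; updates arrive via callback
    finally:
        await conn.remove_listener(CHANNEL, on_notify)
        await conn.close()
```

On the publishing side, a statement such as `SELECT pg_notify('config_changed', '<json payload>')` would trigger the callback. Keeping the payload handling in a pure function makes it testable without a database.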

Prerequisites

Python 3.10+
Node.js 18+ (for the Admin UI)
Docker Desktop (for PostgreSQL)
A data.world account with an API token

Local Development Setup

1. Clone and install

git clone https://github.com/illegal-request/data.world-FastMCP-Platform.git
cd data.world-FastMCP-Platform

# Install the MCP gateway (editable)
pip install -e .

# Install the Admin API
pip install -e "admin_api/[dev]"

2. Configure environment

cp .env.example .env
# Edit .env and set DATAWORLD_API_TOKEN to your data.world token

For the Admin API, create a separate .env:

cp admin_api/.env.example admin_api/.env   # if it doesn't exist, copy from root .env
# Set ADMIN_BOOTSTRAP_KEY to a random string (≥32 chars)
# Example: python -c "import secrets; print(secrets.token_urlsafe(48))"

3. Start PostgreSQL

# -d runs PostgreSQL in the background so this terminal stays free for the next step
docker compose up -d postgres

4. Run database migrations

cd admin_api
python -m alembic upgrade head
cd ..

5. Start the services

Terminal 1 — Admin API:

cd admin_api
python -m uvicorn dataworld_admin.main:app --host 0.0.0.0 --port 8000

Terminal 2 — Admin UI:

cd admin_ui
npm install
npm run dev
# Serves at http://localhost:5173

Terminal 3 — MCP Gateway (for Claude Code / Claude Desktop):

python -m dataworld_mcp
# Default transport: stdio — launched by the MCP client, not manually

The gateway runs in stdio mode by default, meaning your MCP client (Claude Desktop, Claude Code, Cursor) starts it as a subprocess. You only start it manually if you want to run it as an HTTP server.

6. Running tests

# MCP gateway tests
pytest

# Admin API tests
cd admin_api && pytest

Connecting to AI Clients

Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "dataworld": {
      "command": "python",
      "args": ["-m", "dataworld_mcp"],
      "cwd": "/path/to/data.world-FastMCP-Platform",
      "env": {
        "DATAWORLD_API_TOKEN": "your_token_here"
      }
    }
  }
}

Claude Code / Cursor (HTTP mode)

Start the gateway as an HTTP server:

MCP_TRANSPORT=streamable-http MCP_PORT=8001 python -m dataworld_mcp

The project includes a .mcp.json pre-configured for localhost:8001. If using Claude Code, this is picked up automatically when you open the project directory.

Alternatively, use SSE transport:

MCP_TRANSPORT=sse MCP_PORT=8001 python -m dataworld_mcp

And update .mcp.json:

{
  "mcpServers": {
    "dataworld": {
      "type": "sse",
      "url": "http://localhost:8001/sse"
    }
  }
}

Agent system prompt

For inline citations, add to your agent's system prompt (or paste from docs/citation-system-prompt.md):

When tool results include source_url fields, cite them as inline markdown hyperlinks — [descriptive text](url) — woven naturally into your response prose. Do not collect them into a reference list at the end.

Configuration Reference

All configuration is via environment variables. Copy .env.example to .env to get started.

MCP Gateway (`.env`)

Variable	Required	Default	Description
`DATAWORLD_API_TOKEN`	✅ Yes	—	Your data.world API token
`DATAWORLD_BASE_URL`	No	`https://api.data.world/v0`	API base URL. Override for enterprise single-tenant: `https://api.{company}.app.data.world/v0`
`DATAWORLD_UI_BASE_URL`	No	`https://data.world`	UI base URL for `source_url` construction. Override for enterprise: `https://{company}.app.data.world`
`MCP_TRANSPORT`	No	`stdio`	Transport protocol: `stdio`, `streamable-http`, or `sse`
`MCP_PORT`	No	`8001`	Port for HTTP/SSE transports
`AUTH_MODE`	No	`env_token`	Authentication mode: `env_token` (single token from env) or `okta` (per-user JWT)
`OKTA_ISSUER`	If `AUTH_MODE=okta`	—	Okta authorization server URL
`OKTA_AUDIENCE`	If `AUTH_MODE=okta`	—	Expected JWT audience
`OKTA_CLIENT_ID`	If `AUTH_MODE=okta`	—	Okta application client ID
`OKTA_CLIENT_SECRET`	If `AUTH_MODE=okta`	—	Okta application client secret
`DATABASE_URL`	No	—	PostgreSQL connection string. Required for live tool configuration and telemetry.
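To make the interplay between these variables concrete, here is a minimal sketch of a settings loader that applies the defaults from the table and fails fast when `AUTH_MODE=okta` is missing its companion variables. `GatewaySettings` and `load_settings` are hypothetical names; the project's own `config.py` may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

OKTA_VARS = ("OKTA_ISSUER", "OKTA_AUDIENCE", "OKTA_CLIENT_ID", "OKTA_CLIENT_SECRET")
TRANSPORTS = ("stdio", "streamable-http", "sse")
AUTH_MODES = ("env_token", "okta")


@dataclass(frozen=True)
class GatewaySettings:
    api_token: str
    base_url: str
    ui_base_url: str
    transport: str
    port: int
    auth_mode: str
    database_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> GatewaySettings:
    """Read the documented variables and reject invalid combinations."""
    token = env.get("DATAWORLD_API_TOKEN")
    if not token:
        raise ValueError("DATAWORLD_API_TOKEN is required")
    transport = env.get("MCP_TRANSPORT", "stdio")
    if transport not in TRANSPORTS:
        raise ValueError(f"MCP_TRANSPORT must be one of {TRANSPORTS}")
    auth_mode = env.get("AUTH_MODE", "env_token")
    if auth_mode not in AUTH_MODES:
        raise ValueError(f"AUTH_MODE must be one of {AUTH_MODES}")
    if auth_mode == "okta":
        missing = [name for name in OKTA_VARS if not env.get(name)]
        if missing:
            raise ValueError("AUTH_MODE=okta requires: " + ", ".join(missing))
    return GatewaySettings(
        api_token=token,
        base_url=env.get("DATAWORLD_BASE_URL", "https://api.data.world/v0"),
        ui_base_url=env.get("DATAWORLD_UI_BASE_URL", "https://data.world"),
        transport=transport,
        port=int(env.get("MCP_PORT", "8001")),
        auth_mode=auth_mode,
        database_url=env.get("DATABASE_URL"),  # optional: enables live config and telemetry
    )
```

Passing the environment as a mapping, rather than reading `os.environ` inside the function body, lets tests supply a plain dict.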

Admin API (`admin_api/.env`)

Variable	Required	Default	Description
`ADMIN_BOOTSTRAP_KEY`	✅ Yes	—	Static token for Admin UI login (≥32 chars). Generate with `python -c "import secrets; print(secrets.token_urlsafe(48))"`
`DATABASE_URL`	✅ Yes	—	PostgreSQL connection string
`DATAWORLD_API_TOKEN`	✅ Yes	—	data.world API token (used by discovery scanner)
`DATAWORLD_BASE_URL`	No	`https://api.data.world/v0`	data.world API base URL
`DISCOVERY_LLM_MODEL`	No	—	LiteLLM model string for AI-powered scan analysis (e.g. `claude-sonnet-4-5`). Omit to use the template analyser fallback.
`ANTHROPIC_API_KEY`	If using Anthropic	—	Anthropic API key (required if `DISCOVERY_LLM_MODEL` uses an Anthropic model)

Enterprise Deployment

What changes from local setup

Concern	Local	Enterprise
API host	`https://api.data.world/v0`	`https://api.{company}.app.data.world/v0`
UI base URL	`https://data.world`	`https://{company}.app.data.world`
Authentication	Single API token (`env_token`)	Per-user Okta JWT (`okta` mode)
Transport	`stdio` (client-managed)	`streamable-http` or `sse` (hosted server)
Credentials	`.env` file	Secrets manager (AWS, Azure, Vault, K8s) — planned, see Known Issues

Okta authentication mode

Set AUTH_MODE=okta and provide all four OKTA_* variables. In this mode:

Agents authenticate with their Okta JWT
The gateway validates the JWT with your Okta authorization server
The validated user's data.world token is fetched via RFC 8693 token exchange
Tool calls execute with the individual user's data.world permissions

Note: The RFC 8693 token exchange endpoint for enterprise data.world instances has not been confirmed with data.world enterprise support. OktaTokenProvider._exchange() is intentionally isolated — only this method changes when the endpoint is confirmed. See Known Issues.
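For orientation, the request body defined by RFC 8693 can be built as in the sketch below. The parameter names and URN values come from the RFC itself. Which endpoint to post to, whether data.world expects the `jwt` subject token type, and whether an `audience` is needed are all unconfirmed assumptions, so this is not a description of what `OktaTokenProvider._exchange()` does.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode

TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
JWT_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:jwt"  # assumed subject token type


def build_exchange_form(okta_jwt: str, audience: Optional[str] = None) -> str:
    """Encode an RFC 8693 request body (application/x-www-form-urlencoded)."""
    fields = {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": okta_jwt,
        "subject_token_type": JWT_TOKEN_TYPE,
    }
    if audience:
        fields["audience"] = audience  # optional parameter in RFC 8693
    return urlencode(fields)


# Decode the body again to inspect which fields would be sent.
body = build_exchange_form("header.payload.signature", audience="example-audience")
print(sorted(parse_qs(body)))
```

Isolating body construction from the HTTP call mirrors the note above: only the endpoint-specific part would need to change once the endpoint is confirmed.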

Docker deployment

The project includes Dockerfiles for both the gateway and the Admin API:

# Build and run all services
docker compose up

Note: the Docker Compose file uses development defaults (plain dataworld/dataworld Postgres credentials). Update for any non-local deployment.

AI-Powered Discovery

The discovery engine scans your data.world instance and uses an LLM to generate instance-specific tool descriptions — so agents arrive pre-tuned to your actual domain taxonomy, collection structure, and responsible-party roles.

Configure DISCOVERY_LLM_MODEL with any LiteLLM-compatible model string:

Provider	Model string example
Anthropic	`claude-sonnet-4-5`
OpenAI	`gpt-4o-mini`
Azure OpenAI	`azure/gpt-4o-mini`
Google Gemini	`gemini/gemini-1.5-flash`
Amazon Bedrock	`bedrock/anthropic.claude-sonnet-4-5`
None	Omit — template analyser is used

Run a discovery scan from the Admin UI at http://localhost:5173.

Admin UI

⚠️ The Admin UI is in a primitive early development state. The core functionality works — login, tool configuration, discovery scan management, telemetry dashboard — but the interface has not undergone design review, usability testing, or full feature development. Expect rough edges, missing polish, and incomplete workflows.

Access the Admin UI at http://localhost:5173 after starting the dev server (npm run dev in admin_ui/).

Log in with the ADMIN_BOOTSTRAP_KEY value from admin_api/.env.

What works:

Viewing and toggling MCP tools on/off (changes propagate to the live gateway without restart)
Viewing and editing tool descriptions
Running basic and advanced discovery scans
Reviewing and approving discovery recommendations
Telemetry dashboard (tool call counts, latency)

SP6 — Okta SSO for Admin UI: The "Login with Okta SSO" button is a stub. Full Okta OIDC login for admin sessions is planned as SP6 and has not been implemented.

Developer Guide

Project structure

data.world-FastMCP-Platform/
├── src/dataworld_mcp/          # MCP Gateway (the core product)
│   ├── tools/                  # The 7 MCP tools
│   │   ├── catalog.py          # search_catalog, describe_dataset, list_collections
│   │   ├── governance.py       # get_access_policy
│   │   ├── knowledge.py        # get_glossary_terms, get_lineage, get_related_resources
│   │   └── url_builder.py      # Source URL construction utility
│   ├── auth/                   # Okta JWT validation
│   ├── client/                 # data.world API client (httpx + tenacity)
│   ├── telemetry/              # Tool call event buffering and middleware
│   ├── config.py               # Environment configuration (single source of truth)
│   ├── config_listener.py      # PostgreSQL LISTEN for live config updates
│   └── server.py               # FastMCP server instance
├── admin_api/                  # Control plane API (FastAPI)
│   ├── src/dataworld_admin/
│   │   ├── discovery/          # LLM-powered catalog scan engine
│   │   ├── tools/              # Tool config management
│   │   ├── telemetry/          # Telemetry persistence
│   │   └── auth/               # Admin API authentication
│   └── alembic/                # Database migrations
├── admin_ui/                   # Admin frontend (React + TypeScript)
│   └── src/
├── tests/                      # MCP gateway test suite
├── docs/                       # Architecture docs, briefs, system prompt guidance
├── docker-compose.yml
├── Dockerfile.gateway
└── pyproject.toml

Adding a new tool

1. Add the tool function to the appropriate file in src/dataworld_mcp/tools/ using the @mcp.tool() decorator
2. Follow the XML docstring format (<usecase> / <instructions>) for consistent tool selection behaviour
3. Include source_url in the response (extract from API response or construct with dataset_url())
4. Append the citation hint to next_step: "source_url fields in results can be used as inline markdown citations [title](url) in agent responses."
5. Add tests in tests/
6. Import the module in src/dataworld_mcp/__main__.py to register the tool
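The response requirements in the steps above can be sketched as a small helper. The hint text is quoted from this guide; the `build_response` name and the `results` key are hypothetical, and existing tools may assemble their responses differently.

```python
CITATION_HINT = (
    "source_url fields in results can be used as inline markdown citations "
    "[title](url) in agent responses."
)


def build_response(results: list, source_url: str, next_step: str) -> dict:
    """Assemble a tool response with source_url and a next_step ending in the hint."""
    return {
        "results": results,  # assumed key for the tool's payload
        "source_url": source_url,
        "next_step": f"{next_step.rstrip()} {CITATION_HINT}",
    }


response = build_response(
    results=[],
    source_url="https://data.world/example-owner/example-dataset",  # placeholder URL
    next_step="Call describe_dataset for full schema details.",
)
print(response["next_step"])
```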

Tool description format

@mcp.tool()
async def my_tool(param: str) -> dict:
    """
    <usecase>
    Use when the agent wants to [accomplish X]. Call after [Y] to [Z].
    </usecase>
    <instructions>
    Provide [param description]. Returns [response structure].
    </instructions>
    """

The <usecase> block drives tool selection. The <instructions> block drives tool use. Keep them separate — mixing them degrades tool selection accuracy.
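One way to keep the two blocks separate and present is a small check over a tool's docstring. The helper below is a hypothetical sketch using only the standard library; it is not part of the repository.

```python
import re

_BLOCK = r"<{tag}>\s*(.+?)\s*</{tag}>"


def parse_tool_docstring(doc: str) -> dict:
    """Extract the <usecase> and <instructions> blocks; raise if either is absent."""
    blocks = {}
    for tag in ("usecase", "instructions"):
        match = re.search(_BLOCK.format(tag=tag), doc or "", flags=re.DOTALL)
        if match is None:
            raise ValueError(f"docstring is missing a <{tag}> block")
        # Collapse internal whitespace so indentation does not affect the result.
        blocks[tag] = " ".join(match.group(1).split())
    return blocks


example = """
    <usecase>
    Use when the agent wants to find datasets by keyword.
    </usecase>
    <instructions>
    Provide a search phrase. Returns matching datasets.
    </instructions>
"""
print(parse_tool_docstring(example)["usecase"])
```

A check like this could run in the test suite against each tool function's `__doc__`.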

Known Issues & Limitations

See KNOWN_ISSUES.md for the full list. Key items:

Okta token exchange endpoint unconfirmed — AUTH_MODE=okta requires validation with data.world enterprise support before production use
Secrets manager not implemented — credentials are read from .env files; AWS/Azure/Vault/K8s secrets integration is planned
Admin UI is primitive — SP6 (Okta SSO), full UI polish, and complete workflow coverage are planned
Enterprise source URLs (Group 2 tools) — constructed URLs may not resolve on some enterprise instance configurations
Lineage node-level citations — get_lineage returns a top-level source_url only; per-node source links are deferred to a future release

Roadmap

Item	Description
SP6: Okta SSO for Admin UI	Replace bootstrap key login with full Okta OIDC for admin sessions
Production hardening	Secrets manager integration, confirmed token exchange endpoint, scanner service account provisioning
Dedicated citation agent	Opinionated agent for deployment to enterprise agent marketplaces (Gemini Enterprise, A2A), built on this gateway
MCP `resource_link` content type	Migrate to the MCP 2025-06-18 spec's typed `resource_link` content blocks when FastMCP adds first-class support
Lineage node-level `source_url`	Per-node source links in `get_lineage` responses (V2)

Contributing

This project is in active development. Issues and pull requests are welcome.

Before contributing:

Run the test suite: pytest (gateway) and cd admin_api && pytest (Admin API)
Follow the existing tool description format (XML docstrings with <usecase> / <instructions>)
New tools must include source_url in responses and append the citation hint to next_step
Keep url_builder.py as the single source of truth for URL construction — do not read DATAWORLD_UI_BASE_URL directly in tool files
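To illustrate why URL construction is centralised, here is a minimal sketch of a builder function. The real `dataset_url()` signature is not shown in this README, so the parameters below are hypothetical, and the `{base}/{owner}/{dataset}` path layout is an assumption that may not hold on every enterprise instance.

```python
from urllib.parse import quote

DEFAULT_UI_BASE_URL = "https://data.world"  # documented default for DATAWORLD_UI_BASE_URL


def dataset_url(owner: str, dataset_id: str, ui_base_url: str = DEFAULT_UI_BASE_URL) -> str:
    """Build a dataset page URL from a UI base URL (hypothetical signature)."""
    base = ui_base_url.rstrip("/")  # tolerate a trailing slash in configuration
    # Percent-encode each path segment so unusual identifiers stay valid.
    return f"{base}/{quote(owner, safe='')}/{quote(dataset_id, safe='')}"


print(dataset_url("example-owner", "example-dataset"))
print(dataset_url("example-owner", "example-dataset", "https://example.app.data.world/"))
```

With one function owning the base URL, switching an instance from the public host to an enterprise host changes every `source_url` in one place.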

License

[License TBD — not yet specified for this pre-release]
