api-catalog

A local, searchable catalog of ~5,500 APIs exposed over a REST endpoint and an MCP server. Used personally by the author to let Claude Code and other LLM agents find APIs by name, tag, or natural-language description.

What it is

api-catalog ingests API descriptions from a handful of public sources (the MCP server registry, the public-apis community list, APIs.guru, plus a few hand-written YAML cards) and loads them into PostgreSQL. Each API record has a name, URL, description, auth scheme, base_url, optional endpoints (method, path, example request/response body), and one or more hierarchical tags stored as PostgreSQL ltree paths (e.g. AI.Language_Models.Chat).

Three retrieval paths are exposed: (1) full-text search via a generated tsvector column, (2) tag browsing via ltree prefix/descendant queries, and (3) semantic search via pgvector. Embeddings are computed by calling a local Ollama instance running nomic-embed-text (768-dim output), one embedding per API card, stored in an embeddings table; semantic queries embed the query string the same way and order by cosine distance.

A FastAPI process (api_server.py) exposes HTTP endpoints, and a stdio MCP server (mcp_server.py) exposes six tools that wrap the same queries so an MCP client (Claude Code, etc.) can call them directly.

A scheduled_refresh.py script re-scrapes sources into a staging table, validates a minimum row count, and swaps tables in a transaction so the live catalog is never half-updated.
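The validate-then-swap refresh described above can be sketched as follows. This is a minimal illustration, not the contents of scheduled_refresh.py: the table names apis and apis_staging, the function name swap_staging, and the rename-based swap are all assumptions. It takes any DB-API connection (for example one from psycopg2).

```python
def swap_staging(conn, min_rows=1000):
    """Promote apis_staging to apis if staging holds at least min_rows rows.

    Table names are hypothetical; conn is any DB-API 2.0 connection.
    Returns the number of rows that were promoted.
    """
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM apis_staging")
    (count,) = cur.fetchone()
    if count < min_rows:
        # Refuse to replace a good catalog with a suspiciously small scrape.
        raise ValueError(f"staging has {count} rows, need at least {min_rows}")
    try:
        # PostgreSQL DDL is transactional, so other sessions see either the
        # old table or the new one once the commit lands, never a mix.
        cur.execute("ALTER TABLE apis RENAME TO apis_old")
        cur.execute("ALTER TABLE apis_staging RENAME TO apis")
        cur.execute("DROP TABLE apis_old")
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    return count
```

The min_rows default mirrors REFRESH_MIN_API_COUNT from the Configuration section. If the row-count check fails, the live table is left untouched.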

Status

Working. Used by one person (the author) in a home setup. Not multi-tenant, no auth on the REST API, no rate limiting. The code runs; the data pipeline runs. It has not been tested by anyone else.

Data provenance — read this before redistributing

The ~5,500 API records are aggregated from third-party sources. The code in this repo is Apache-2.0, but the scraped catalog data is not covered by that license, which is why it is not checked into the repo. Sources and their licenses at time of writing:

Source	Approx. count	License / terms
MCP server registry	~4,100	Per-entry; check the upstream registry before republishing
`public-apis` (github.com/public-apis/public-apis)	~1,000	MIT on the list itself; individual API terms vary
APIs.guru	~170	CC0 on the directory metadata
Hand-written cards	4	Written by the author; Apache-2.0

If you plan to publish a built database dump alongside this code, verify each source's current terms first. The safe path is to ship only the code and the scraper, and let each user build their own catalog locally.

How to run

Prerequisites

Python 3.10+
PostgreSQL 14+ with the vector (pgvector) and ltree extensions
Ollama running locally with nomic-embed-text pulled (only required for semantic search; full-text and tag browse work without it)

Setup

git clone <this-repo>
cd api-catalog

pip install psycopg2-binary fastapi uvicorn pyyaml requests tabulate

cp .env.example .env
# Edit .env: DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME, OLLAMA_HOST

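# psql does not read .env; export DB_NAME in your shell first (add -h/-p/-U if your server is not on the psql defaults)
psql -d "$DB_NAME" -c "CREATE EXTENSION IF NOT EXISTS vector;"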
psql -d "$DB_NAME" -c "CREATE EXTENSION IF NOT EXISTS ltree;"

# Build the catalog from the three upstream sources
python scheduled_refresh.py

# Or, if you already have card YAML files locally:
python migrate_to_postgres.py

# Compute embeddings (needs Ollama + nomic-embed-text)
python compute_embeddings.py
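
For orientation, the embedding step can be pictured roughly like this. It is a sketch under stated assumptions, not the contents of compute_embeddings.py: the helper names (card_text, embed, to_pgvector) and the choice of which card fields to embed are hypothetical. The HTTP call uses Ollama's /api/embeddings endpoint, which takes a model name and a prompt and returns an "embedding" array.

```python
import json
import urllib.request

OLLAMA_HOST = "http://127.0.0.1:11434"  # default from the Configuration table
OLLAMA_MODEL = "nomic-embed-text"


def card_text(card):
    """Flatten the fields worth embedding into one string (hypothetical choice)."""
    parts = [card.get("name", ""), card.get("description", "")]
    parts.extend(card.get("tags", []))
    return "\n".join(p for p in parts if p)


def embed(text, host=OLLAMA_HOST, model=OLLAMA_MODEL):
    """POST to Ollama's /api/embeddings and return the vector as a list of floats."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]


def to_pgvector(vec):
    """Render floats as a pgvector text literal such as '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"
```

The literal produced by to_pgvector can be passed as a query parameter and cast with ::vector on the PostgreSQL side. The same embed call would be used for query strings at search time so that cards and queries live in the same vector space.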

Run the servers

python api_server.py    # REST on 127.0.0.1:8002
python mcp_server.py    # MCP over stdio

MCP client config:

{
  "api-catalog": {
    "type": "stdio",
    "command": "python",
    "args": ["/path/to/api-catalog/mcp_server.py"]
  }
}

CLI

python query_pg.py stats
python query_pg.py search "email sending"
python query_pg.py browse "AI.Language_Models"
python query_pg.py card "openai-chat"
python query_pg.py tags "Communication"
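
The browse command takes an ltree path and returns everything at or below it. In PostgreSQL this is the ltree descendant test (path <@ 'AI.Language_Models'), which compares whole labels rather than raw string prefixes. The pure-Python function below only illustrates that semantics; the name is_descendant is made up for this example and is not part of query_pg.py.

```python
def is_descendant(path, ancestor):
    """Mimic ltree's `path <@ ancestor`: true when ancestor is a label-wise
    prefix of path (a path counts as its own descendant)."""
    path_labels = path.split(".")
    ancestor_labels = ancestor.split(".")
    return path_labels[: len(ancestor_labels)] == ancestor_labels


tags = [
    "AI.Language_Models.Chat",
    "AI.Language_Models",
    "AI.Language_Models_Extra",  # shares a string prefix, but is a different label
    "Communication.Email",
]
matches = [t for t in tags if is_descendant(t, "AI.Language_Models")]
```

Here matches contains the first two tags only. A plain string startswith check would wrongly include AI.Language_Models_Extra, which is why label-wise comparison matters.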

MCP tools

Tool	Description
`search_apis`	Full-text search by name/description
`semantic_search`	pgvector cosine similarity over embeddings
`browse_by_tag`	List APIs under an `ltree` tag path
`get_api_card`	Full API details including endpoints
`list_tags`	Walk the tag hierarchy
`catalog_stats`	Row counts and source breakdown
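
Under the hood, an MCP client invokes these tools with JSON-RPC 2.0 tools/call requests, sent over stdio as one JSON message per line. The sketch below builds such a message by hand to show the shape. The argument names (query, limit) are assumptions, since the tool input schemas are not listed here, and a real client must complete the MCP initialize handshake before calling tools.

```python
import json


def tool_call(request_id, tool, arguments):
    """Build one newline-delimited JSON-RPC 2.0 `tools/call` message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the message stays on one line.
    return json.dumps(message) + "\n"


# Hypothetical arguments; check the tool's declared input schema for real names.
line = tool_call(1, "search_apis", {"query": "email sending", "limit": 10})
```

In practice an MCP client library (or Claude Code itself) produces these messages, so this is mainly useful for debugging mcp_server.py by hand.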

REST endpoints

Method	Path	Description
GET	`/stats`	Catalog statistics
GET	`/tags?prefix=AI`	List tags with optional prefix
GET	`/search?q=email&limit=50`	Full-text search
GET	`/browse/{tag_path}`	APIs under a tag
GET	`/card/{api_name}`	API details + endpoints
GET	`/embeddings/search?q=email`	Semantic search
GET	`/health`	Health check
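
A small client for the endpoints above might look like the following. It is a sketch using only the standard library; the helper names build_url and get_json are made up for this example, and the JSON response shapes are not documented here, so the code returns whatever the server sends without assuming field names.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8002"  # API_HOST / API_PORT defaults


def build_url(path, base=BASE_URL, **params):
    """Join base, path, and URL-encoded query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{base}{path}" + (f"?{query}" if query else "")


def get_json(path, **params):
    """GET an endpoint and decode the JSON body (requires api_server.py running)."""
    with urllib.request.urlopen(build_url(path, **params), timeout=10) as resp:
        return json.load(resp)


search_url = build_url("/search", q="email sending", limit=50)
browse_url = build_url("/browse/" + urllib.parse.quote("AI.Language_Models"))
```

With the server running, get_json("/stats") or get_json("/embeddings/search", q="email") returns the decoded response for the corresponding endpoint.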

API card format

name: openai-chat
slug: openai-chat
description: OpenAI Chat Completions API
url: https://platform.openai.com/docs/api-reference/chat
auth:
  type: bearer
  header: Authorization
tags:
  - AI.Language_Models.Chat
base_url: https://api.openai.com/v1
endpoints:
  - method: POST
    path: /chat/completions
    description: Create a chat completion

See schema/api_card.yaml and schema/tags.yaml.
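
Before loading a hand-written card, a quick sanity check along these lines can catch common mistakes. This is a sketch, not the project's validator: the authoritative rules live in schema/api_card.yaml, and the required-field list and function name validate_card below are assumptions. It operates on a card already parsed into a dict (for example with yaml.safe_load).

```python
import re

# Letters, digits, and underscores are accepted as ltree labels by all versions.
LABEL = re.compile(r"[A-Za-z0-9_]+")
REQUIRED_FIELDS = ("name", "description", "url", "base_url", "tags")  # assumed
HTTP_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE", "HEAD", "OPTIONS"}


def validate_card(card):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not card.get(field):
            errors.append(f"missing field: {field}")
    for tag in card.get("tags") or []:
        # Every dot-separated label must be non-empty and use the ltree alphabet.
        if not all(LABEL.fullmatch(label) for label in tag.split(".")):
            errors.append(f"invalid ltree path: {tag}")
    for endpoint in card.get("endpoints") or []:
        if str(endpoint.get("method", "")).upper() not in HTTP_METHODS:
            errors.append(f"unknown method: {endpoint.get('method')}")
        if not str(endpoint.get("path", "")).startswith("/"):
            errors.append(f"path should start with '/': {endpoint.get('path')}")
    return errors
```

Applied to the openai-chat card above, this returns an empty list. A tag such as "AI.Language Models" (with a space) would be reported as an invalid ltree path.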

Database schema

apis          — Core records (name, url, description, auth, source)
tags          — ltree paths
api_tags      — Many-to-many junction
endpoints     — HTTP call examples
embeddings    — pgvector, 768-dim
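
To make the semantic path concrete: pgvector's <=> operator returns cosine distance, i.e. 1 minus cosine similarity, and results are ordered by it ascending. The sketch below shows a query of that shape plus a pure-Python equivalent of the ranking. The column names (api_id, embedding, id) and the function names are assumptions; the real schema may differ.

```python
import math

# Hypothetical column names; %s placeholders follow psycopg2's paramstyle.
SEMANTIC_SQL = """
SELECT a.name, e.embedding <=> %s::vector AS distance
FROM embeddings e
JOIN apis a ON a.id = e.api_id
ORDER BY distance
LIMIT %s
"""


def cosine_distance(a, b):
    """1 - cos(a, b): 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def rank(query_vec, rows, limit=10):
    """rows is an iterable of (name, vector); nearest vectors come first."""
    ordered = sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))
    return [name for name, _ in ordered[:limit]]
```

Because cosine distance ignores vector length, only the direction of the 768-dim embeddings matters for ranking.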

Configuration

Variable	Default	Description
`DB_HOST`	`127.0.0.1`	PostgreSQL host
`DB_PORT`	`5432`	PostgreSQL port
`DB_USER`	—	Database user
`DB_PASSWORD`	—	Database password
`DB_NAME`	—	Database name
`OLLAMA_HOST`	`http://127.0.0.1:11434`	Ollama endpoint
`OLLAMA_MODEL`	`nomic-embed-text`	Embedding model
`API_HOST`	`127.0.0.1`	REST bind address
`API_PORT`	`8002`	REST port
`LOG_LEVEL`	`INFO`	Log verbosity
`LOG_DIR`	`./logs`	Log directory
`REFRESH_MIN_API_COUNT`	`1000`	Minimum rows for a refresh to be considered valid
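
One way to read these settings in Python is sketched below. It is an illustration, not the project's actual loader: the function name load_config is made up, and it assumes the values are already in the process environment (for example exported from .env by the shell or a dotenv tool). Defaults are copied from the table; the three variables without a default are treated as required.

```python
import os

DEFAULTS = {
    "DB_HOST": "127.0.0.1",
    "DB_PORT": "5432",
    "OLLAMA_HOST": "http://127.0.0.1:11434",
    "OLLAMA_MODEL": "nomic-embed-text",
    "API_HOST": "127.0.0.1",
    "API_PORT": "8002",
    "LOG_LEVEL": "INFO",
    "LOG_DIR": "./logs",
    "REFRESH_MIN_API_COUNT": "1000",
}
REQUIRED = ("DB_USER", "DB_PASSWORD", "DB_NAME")
INTEGER_KEYS = ("DB_PORT", "API_PORT", "REFRESH_MIN_API_COUNT")


def load_config(env=None):
    """Merge environment values over the defaults and convert numeric fields."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    config = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    config.update({key: env[key] for key in REQUIRED})
    for key in INTEGER_KEYS:
        config[key] = int(config[key])
    return config
```

Passing a plain dict instead of os.environ makes the loader easy to exercise without touching the real environment.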

Known limitations

No auth on the REST API. Bind to localhost or put it behind a reverse proxy if you expose it.
No rate limiting on embedding calls; a full re-embed of ~5,500 records hits Ollama hard.
Card quality varies. The MCP registry entries are machine-generated and often lack endpoint examples; the hand-written cards are richer.
Tag ontology (schema/tags.yaml) was authored by hand and is opinionated. Re-tagging a scraped record is best-effort keyword matching.
Semantic search quality is bounded by nomic-embed-text. Good enough for "find me an email API", not for fine-grained disambiguation.
Scraper assumes source formats stay stable. When public-apis or the MCP registry reshape their data, the scraper needs manual fixes.
No tests included.
Catalog data is not checked in (see Data provenance). You must run the refresh yourself.

License

Apache-2.0 on the code. Data license depends on upstream source — see Data provenance above.
