loke-jad/api-catalog

api-catalog

A local, searchable catalog of ~5,500 APIs, served through a REST endpoint and an MCP server. The author uses it personally to let Claude Code and other LLM agents find APIs by name, tag, or natural-language description.

What it is

api-catalog ingests API descriptions from a handful of public sources (the MCP server registry, the public-apis community list, APIs.guru, plus a few hand-written YAML cards) and loads them into PostgreSQL. Each API record has a name, URL, description, auth scheme, a base_url, optional endpoints (method, path, example request/response body), and one or more hierarchical tags stored as PostgreSQL ltree paths (e.g. AI.Language_Models.Chat).

Three retrieval paths are exposed: (1) full-text search via a generated tsvector column, (2) tag browsing via ltree prefix/descendant queries, and (3) semantic search via pgvector.

Embeddings are computed by calling a local Ollama instance running nomic-embed-text (768-dim output), one embedding per API card, stored in an embeddings table; semantic queries embed the query string the same way and order by cosine distance.

A FastAPI process (api_server.py) exposes HTTP endpoints, and a stdio MCP server (mcp_server.py) exposes six tools that wrap the same queries so an MCP client (Claude Code, etc.) can call them directly.

A scheduled_refresh.py script re-scrapes sources into a staging table, validates a minimum row count, and swaps tables in a transaction so the live catalog is never half-updated.
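As a rough illustration of the three retrieval paths described above, the sketch below builds one parameterized SQL query per path. The table and column names (apis.search_vector, tags.path, embeddings.embedding, the id/api_id/tag_id keys) are assumptions made for the example and may not match the repo's actual schema. The operators are standard: @@ for tsvector matching, <@ for ltree descendants, and <=> for pgvector cosine distance.

```python
"""Hypothetical query builders for the three retrieval paths.

All table and column names here are assumptions for illustration only.
Each function returns (sql, params) suitable for cursor.execute(sql, params).
"""


def fulltext_query(text, limit=20):
    # websearch_to_tsquery parses free-form input; ts_rank orders the hits.
    sql = (
        "SELECT a.name, a.description FROM apis a "
        "WHERE a.search_vector @@ websearch_to_tsquery('english', %s) "
        "ORDER BY ts_rank(a.search_vector, "
        "websearch_to_tsquery('english', %s)) DESC "
        "LIMIT %s"
    )
    return sql, (text, text, limit)


def browse_query(tag_path):
    # "<@" matches the given ltree path itself and all of its descendants.
    sql = (
        "SELECT DISTINCT a.name FROM apis a "
        "JOIN api_tags at ON at.api_id = a.id "
        "JOIN tags t ON t.id = at.tag_id "
        "WHERE t.path <@ %s::ltree "
        "ORDER BY a.name"
    )
    return sql, (tag_path,)


def semantic_query(query_embedding, limit=20):
    # pgvector accepts a '[x,y,...]' text literal cast to vector;
    # "<=>" is cosine distance, so smaller values are closer matches.
    literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    sql = (
        "SELECT a.name FROM embeddings e "
        "JOIN apis a ON a.id = e.api_id "
        "ORDER BY e.embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (literal, limit)
```

With psycopg2, each pair would be passed as cur.execute(*fulltext_query("email sending")). Keeping the values in params rather than formatting them into the SQL string avoids injection through user-supplied search text.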

Status

Working. Used by one person (the author) in a home setup. Not multi-tenant, no auth on the REST API, no rate limiting. The code runs; the data pipeline runs. It has not been tested by anyone else.

Data provenance — read this before redistributing

The ~5,500 API records are aggregated from third-party sources. The code in this repo is Apache-2.0, but the scraped catalog data is not covered by that license, which is why it is not checked into the repo. Sources and their licenses at time of writing:

| Source | Approx. count | License / terms |
|---|---|---|
| MCP server registry | ~4,100 | Per-entry; check the upstream registry before republishing |
| public-apis (github.com/public-apis/public-apis) | ~1,000 | MIT on the list itself; individual API terms vary |
| APIs.guru | ~170 | CC0 on the directory metadata |
| Hand-written cards | 4 | Written by the author; Apache-2.0 |

If you plan to publish a built database dump alongside this code, verify each source's current terms first. The safe path is to ship only the code and the scraper, and let each user build their own catalog locally.

How to run

Prerequisites

  • Python 3.10+
  • PostgreSQL 14+ with the vector (pgvector) and ltree extensions
  • Ollama running locally with nomic-embed-text pulled (only required for semantic search; full-text and tag browse work without it)

Setup

git clone <this-repo>
cd api-catalog

pip install psycopg2-binary fastapi uvicorn pyyaml requests tabulate

cp .env.example .env
# Edit .env: DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME, OLLAMA_HOST

psql -d "$DB_NAME" -c "CREATE EXTENSION IF NOT EXISTS vector;"
psql -d "$DB_NAME" -c "CREATE EXTENSION IF NOT EXISTS ltree;"

# Build the catalog from the three upstream sources
python scheduled_refresh.py

# Or, if you already have card YAML files locally:
python migrate_to_postgres.py

# Compute embeddings (needs Ollama + nomic-embed-text)
python compute_embeddings.py
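
For orientation, an embedding step of this kind can be sketched as below. This is not the contents of compute_embeddings.py. It assumes Ollama's /api/embeddings endpoint (JSON body with model and prompt, response with an embedding array) and uses only the standard library. The card_text helper and its choice of fields are hypothetical; the README does not say which card fields are embedded.

```python
import json
import math
import urllib.request

OLLAMA_HOST = "http://127.0.0.1:11434"  # default from the Configuration table
OLLAMA_MODEL = "nomic-embed-text"


def card_text(card):
    """Hypothetical helper: flatten a card into the text to embed."""
    parts = [card.get("name", ""), card.get("description", "")]
    parts.extend(card.get("tags") or [])
    return "\n".join(p for p in parts if p)


def embed(text, host=OLLAMA_HOST, model=OLLAMA_MODEL):
    """Request a single embedding vector from a local Ollama instance."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["embedding"]


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Usage (needs a running Ollama with the model pulled):
#   vec = embed(card_text({"name": "openai-chat", "description": "..."}))
#   len(vec) should be 768 for nomic-embed-text
```

The query string has to go through the same embed call as the cards; distances between vectors from different models are not meaningful.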

Run the servers

python api_server.py    # REST on 127.0.0.1:8002
python mcp_server.py    # MCP over stdio

MCP client config:

{
  "api-catalog": {
    "type": "stdio",
    "command": "python",
    "args": ["/path/to/api-catalog/mcp_server.py"]
  }
}

CLI

python query_pg.py stats
python query_pg.py search "email sending"
python query_pg.py browse "AI.Language_Models"
python query_pg.py card "openai-chat"
python query_pg.py tags "Communication"

MCP tools

| Tool | Description |
|---|---|
| search_apis | Full-text search by name/description |
| semantic_search | pgvector cosine similarity over embeddings |
| browse_by_tag | List APIs under an ltree tag path |
| get_api_card | Full API details including endpoints |
| list_tags | Walk the tag hierarchy |
| catalog_stats | Row counts and source breakdown |

REST endpoints

| Method | Path | Description |
|---|---|---|
| GET | /stats | Catalog statistics |
| GET | /tags?prefix=AI | List tags with optional prefix |
| GET | /search?q=email&limit=50 | Full-text search |
| GET | /browse/{tag_path} | APIs under a tag |
| GET | /card/{api_name} | API details + endpoints |
| GET | /embeddings/search?q=email | Semantic search |
| GET | /health | Health check |
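
A minimal client for the endpoints listed above might look like the following sketch. It assumes the default bind address 127.0.0.1:8002 and that responses are JSON; the response shapes are not documented here, so the code only decodes them without interpreting fields. build_url and get_json are illustrative names, not part of the repo.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8002"  # API_HOST / API_PORT defaults


def build_url(path, **params):
    """Join a path and optional query parameters onto the base URL."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    query = urllib.parse.urlencode(params)
    return f"{url}?{query}" if query else url


def get_json(path, **params):
    """GET an endpoint and decode the body as JSON (assumed format)."""
    with urllib.request.urlopen(build_url(path, **params), timeout=10) as resp:
        return json.loads(resp.read())


# Usage (needs api_server.py running):
#   get_json("health")
#   get_json("search", q="email sending", limit=50)
#   get_json("browse/AI.Language_Models")
#   get_json("embeddings/search", q="email")
```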

API card format

name: openai-chat
slug: openai-chat
description: OpenAI Chat Completions API
url: https://platform.openai.com/docs/api-reference/chat
auth:
  type: bearer
  header: Authorization
tags:
  - AI.Language_Models.Chat
base_url: https://api.openai.com/v1
endpoints:
  - method: POST
    path: /chat/completions
    description: Create a chat completion

See schema/api_card.yaml and schema/tags.yaml.
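
Before loading hand-written cards, a quick sanity check along these lines can catch malformed entries. This is a sketch, not the validation the repo performs: the required-field list is an assumption (schema/api_card.yaml is the authority), and the tag check only enforces the conservative ltree label alphabet of letters, digits, and underscores. The card is taken as an already-parsed dict, for example the result of yaml.safe_load.

```python
import re

# Assumed required fields; consult schema/api_card.yaml for the real list.
REQUIRED_FIELDS = ("name", "description", "url", "tags")
# Conservative ltree label alphabet: letters, digits, underscore.
LTREE_LABEL = re.compile(r"^[A-Za-z0-9_]+$")
HTTP_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE", "HEAD", "OPTIONS"}


def validate_card(card):
    """Return a list of human-readable problems; empty means the card passed."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not card.get(field):
            errors.append(f"missing field: {field}")
    for tag in card.get("tags") or []:
        # Every dot-separated label must be non-empty and use the safe alphabet.
        if not all(LTREE_LABEL.match(label) for label in str(tag).split(".")):
            errors.append(f"invalid ltree path: {tag}")
    for ep in card.get("endpoints") or []:
        if str(ep.get("method", "")).upper() not in HTTP_METHODS:
            errors.append(f"unsupported method: {ep.get('method')}")
        if not str(ep.get("path", "")).startswith("/"):
            errors.append(f"endpoint path must start with '/': {ep.get('path')}")
    return errors
```

Run against the card shown above, validate_card returns an empty list; a tag such as "AI.Language Models" (with a space) is reported as an invalid ltree path.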

Database schema

apis          — Core records (name, url, description, auth, source)
tags          — ltree paths
api_tags      — Many-to-many junction
endpoints     — HTTP call examples
embeddings    — pgvector, 768-dim

Configuration

| Variable | Default | Description |
|---|---|---|
| DB_HOST | 127.0.0.1 | PostgreSQL host |
| DB_PORT | 5432 | PostgreSQL port |
| DB_USER | | Database user |
| DB_PASSWORD | | Database password |
| DB_NAME | | Database name |
| OLLAMA_HOST | http://127.0.0.1:11434 | Ollama endpoint |
| OLLAMA_MODEL | nomic-embed-text | Embedding model |
| API_HOST | 127.0.0.1 | REST bind address |
| API_PORT | 8002 | REST port |
| LOG_LEVEL | INFO | Log verbosity |
| LOG_DIR | ./logs | Log directory |
| REFRESH_MIN_API_COUNT | 1000 | Minimum rows for a refresh to be considered valid |
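
To show how REFRESH_MIN_API_COUNT fits into the refresh, here is a sketch of a validate-then-swap step. It is not the code in scheduled_refresh.py: the table names (apis, apis_staging, apis_old) and the rename-based swap are assumptions, and a real swap would also have to cover the related tables (tags, api_tags, endpoints, embeddings) and their foreign keys. It relies on psycopg2's documented behaviour that using a connection as a context manager commits on success and rolls back on an exception.

```python
MIN_API_COUNT = 1000  # REFRESH_MIN_API_COUNT default


def validate_staging(row_count, minimum=MIN_API_COUNT):
    """Refuse to swap if the scrape produced suspiciously few rows."""
    if row_count < minimum:
        raise ValueError(
            f"staging has {row_count} rows, need at least {minimum}; "
            "keeping the live catalog"
        )


def swap_statements(live="apis", staging="apis_staging", old="apis_old"):
    """SQL for a rename-based swap (table names are hypothetical)."""
    return [
        f"DROP TABLE IF EXISTS {old}",
        f"ALTER TABLE {live} RENAME TO {old}",
        f"ALTER TABLE {staging} RENAME TO {live}",
    ]


def swap_tables(conn, row_count):
    """Validate, then run the swap in a single transaction.

    conn is expected to be a psycopg2 connection: "with conn" commits if
    the block succeeds and rolls back if any statement raises, so readers
    see either the old catalog or the new one.
    """
    validate_staging(row_count)
    with conn:
        with conn.cursor() as cur:
            for stmt in swap_statements():
                cur.execute(stmt)
```

PostgreSQL DDL is transactional, which is what makes the rename sequence safe to roll back as a unit.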

Known limitations

  • No auth on the REST API. Bind to localhost or put it behind a reverse proxy if you expose it.
  • No rate limiting on embedding calls; a full re-embed of ~5,500 records hits Ollama hard.
  • Card quality varies. The MCP registry entries are machine-generated and often lack endpoint examples; the hand-written cards are richer.
  • Tag ontology (schema/tags.yaml) was authored by hand and is opinionated. Re-tagging a scraped record is best-effort keyword matching.
  • Semantic search quality is bounded by nomic-embed-text. Good enough for "find me an email API", not for fine-grained disambiguation.
  • Scraper assumes source formats stay stable. When public-apis or the MCP registry reshape their data, the scraper needs manual fixes.
  • No tests included.
  • Catalog data is not checked in (see Data provenance). You must run the refresh yourself.

License

Apache-2.0 on the code. Data license depends on upstream source — see Data provenance above.

About

Searchable catalog of public APIs and MCP servers, with semantic search and an MCP server frontend (pgvector + Postgres).
