phase1-B: build-catalog + validate-catalog (TDD)#12

Merged
rafael5 merged 2 commits into main from phase1-B on May 11, 2026

rafael5 commented May 11, 2026

Summary

Phase-1 Track B of the AI-discoverability plan: ship the two scripts that turn profile/tools.json from hand-curated to generated, plus the catalog-level validator that enforces the post-P1-A contract.

  • profile/build/validate-catalog.py — strict Draft 2020-12 validation of tools.json against tools.schema.json and of task_index.json against task_index.schema.json, plus a key-collision guard between the two top-level shapes (post-P1-A invariant).
  • profile/build/build-catalog.py — fetches each of the six onboarded repos' dist/repo.meta.json, validates it, translates it to a summary tools.<key> entry (one *_url per exposes.<kind>), and emits a deterministic tools.json carrying the hand-curated org / workflow / discovery_protocol narrative from the prior file.
  • TDD throughout: RED tests written first, then GREEN implementations. 22 pytest cases (10 + 12), all green.

TDD coverage

test_validate_catalog.py (10 tests)

  • Baseline pair validates clean (smoke).
  • Unknown top-level key in tools.json fails under additionalProperties: false (covers re-inlined task_index, generic unknown key, inlined-facts block in a tool entry).
  • Unknown top-level key in task_index.json fails.
  • Malformed typed ID in primary fails the typedID regex.
  • Malformed typed ID in see_also fails.
  • Missing required field in a tool entry fails.
  • Missing file reports a clean error.
  • main(argv) exits 0 against the committed baseline.

test_build_catalog.py (12 tests)

  • Three synthetic repo.meta.json fixtures — minimal, rich, extra-exposes-kind — exercise the translation surface.
  • Required top-level keys emitted; task_index is NOT emitted; hand-curated org/workflow/discovery_protocol/description carried verbatim.
  • Single-element language arrays collapse to strings; multi-element stay as arrays.
  • Each exposes.<kind> becomes <kind>_url, resolved against the manifest's repo-root raw URL.
  • agent_instructions resolves to a github.com/.../blob/<branch>/... URL.
  • Tools key strips the tool: prefix; entry id keeps it.
  • Extra-exposes-kind passes through (proves the script is data-driven from manifests, not from a hardcoded allow-list).
  • Generated output validates clean against tools.schema.json.
  • dumps() is deterministic across input-order shuffles; sorted keys; trailing newline.
  • B5 two-run determinism: building twice → byte-identical output.
  • Invalid manifest raises a clear error rather than emitting a malformed entry.
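
For reviewers who want the translation rules above in one place, here is a minimal sketch. It is not the code in build-catalog.py: the function name `translate`, the assumption that each `exposes` value is a repo-relative path, and the `example-org` URL prefix are all hypothetical.

```python
from urllib.parse import urljoin


def translate(meta: dict, raw_prefix: str) -> tuple[str, dict]:
    """Turn one manifest dict into (tools key, summary entry).

    Hypothetical helper: assumes exposes values are repo-relative paths
    and that raw_prefix ends with a slash.
    """
    entry = {"id": meta["id"], "role": meta["role"]}

    # Single-element language arrays collapse to a string; longer ones stay arrays.
    langs = meta["language"]
    entry["language"] = langs[0] if len(langs) == 1 else list(langs)

    # Every exposes.<kind> becomes <kind>_url; there is no allow-list of kinds.
    for kind, rel_path in meta.get("exposes", {}).items():
        entry[f"{kind}_url"] = urljoin(raw_prefix, rel_path)

    # The tools key strips the "tool:" prefix; the entry id keeps it.
    key = meta["id"].removeprefix("tool:")
    return key, entry
```

Because the loop iterates over whatever kinds the manifest declares, an unfamiliar kind passes through unchanged, which is the property the extra-exposes-kind test checks.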

Verification (all green)

pytest profile/build/                                                  # 22 passed
python3 profile/build/validate-catalog.py                              # exits 0 on committed pair
python3 profile/build/build-catalog.py | python3 -m json.tool >/dev/null  # parses
python3 profile/build/build-catalog.py > out1.json
python3 profile/build/build-catalog.py > out2.json && diff out1.json out2.json  # byte-identical
python3 profile/build/validate-catalog.py --tools out1.json --task-index profile/task_index.json  # exits 0
make phase0-smoke                                                      # PASS — manifests unchanged

Drift vs. committed profile/tools.json (what P1-D's drift gate will need to address)

Running build-catalog.py against the live TIER_1 + TIER_2 manifests produces a tools.json that diverges from the committed hand-curated baseline in four categorical ways. This is expected — the build script's job is to produce a regenerated version, and the drift is exactly what the P1-D drift gate (make catalog && git diff --exit-code) will need to reconcile. Documenting here so P1-D's reviewer can plan it.

1. Four hand-curated entries dropped (no dist/repo.meta.json on any of them):

  • m-cli-extras — tier-3, no manifest.
  • m-stdlib-vscode — tier-3, no manifest.
  • tree-sitter-m-vscode — tier-3, no manifest.
  • m-tools — archived seed repo.

These need a decision in P1-D: either (a) keep them as hand-merged additions overlaid on the generator output, (b) onboard them to the Phase-0 contract first (T3-* tasks per current-state-inventory-priority.md §3.2), or (c) drop them and add a top-level unonboarded list elsewhere.

2. consumed_by lost on all 6 onboarded tools. repo.meta.schema.json has consumes but not consumed_by — the inverse edge is currently hand-maintained in tools.json only. P1-D options: (a) compute consumed_by in build-catalog.py from the inverse consumes graph, or (b) extend repo.meta.schema.json to allow consumed_by (less clean — consumers shouldn't know who consumes them).
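
Option (a) could look roughly like the sketch below. It assumes `consumes` is a list of typed IDs such as `tool:m-stdlib` and that the inverse edge should be emitted as a sorted list; `add_consumed_by` is a hypothetical name, not a function in this PR.

```python
from collections import defaultdict


def add_consumed_by(tools: dict) -> dict:
    """Return a copy of the tools mapping with consumed_by derived from consumes.

    Hypothetical sketch: assumes each entry has an "id" and an optional
    "consumes" list of typed IDs.
    """
    # Build the inverse graph: consumed id -> set of consumer ids.
    inverse = defaultdict(set)
    for entry in tools.values():
        for dep in entry.get("consumes", []):
            inverse[dep].add(entry["id"])

    out = {}
    for key, entry in tools.items():
        new_entry = dict(entry)  # shallow copy; the input mapping is not mutated
        consumers = sorted(inverse.get(entry["id"], ()))
        if consumers:
            # Sorted so the generated file stays deterministic.
            new_entry["consumed_by"] = consumers
        out[key] = new_entry
    return out
```

Sorting the consumer list keeps this step compatible with the byte-identical output requirement (B5).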

3. role text drifts on all 6 tools. The manifests (post-Phase-0) have shorter role strings; the committed tools.json carries the original longer descriptions:

  • m-cli: "Canonical CLI toolchain — m fmt / lint / test / coverage / watch / lsp / doc / new / ..." → "Canonical M CLI — fmt / lint / test / coverage / watch / lsp / doc / new"
  • m-stdlib: "Pure-M (and selectively $ZF-bound) runtime standard library — STD* modules" → "Pure-M runtime standard library — STD* modules"
  • m-standard: "Citable, machine-readable M language reference reconciling AnnoStd / YottaDB / IRIS / VA SAC" → "Machine-readable M language reference"
  • m-modern-corpus, m-test-engine, tree-sitter-m — similar drift.

The manifest is canonical going forward; P1-D should let the generator overwrite.

4. exposes key naming drift (one instance and one addition):

  • m-stdlib: manifest exposes modules (→ modules_url); committed baseline calls the pointer manifest_url. Manifest is canonical.
  • m-modern-corpus: manifest exposes licenses (→ licenses_url); committed baseline doesn't carry it. Generator additively picks it up.

Deferred to P1-D (per the plan and the task brief)

  • Makefile targets catalog and a strict validate-catalog replacing the current python -m json.tool parse-only target.
  • CI workflow make catalog && git diff --exit-code profile/tools.json profile/task_index.json drift gate, plus the make validate-catalog step.
  • Resolving the drift catalogued above (one of: regenerate tools.json from the generator output and reconcile, or extend the generator's surface, or both).

Files added

  • profile/build/build-catalog.py (executable, 274 lines)
  • profile/build/validate-catalog.py (executable, 128 lines)
  • profile/build/test_build_catalog.py (236 lines, 12 tests)
  • profile/build/test_validate_catalog.py (135 lines, 10 tests)

No existing files modified — Track-B work is purely additive in profile/build/. profile/tools.json and profile/task_index.json are untouched (the drift question is P1-D's call, not this PR's).

Test plan

  • Both pytest profile/build/test_validate_catalog.py and pytest profile/build/test_build_catalog.py green.
  • python3 profile/build/validate-catalog.py exits 0 against the committed tools.json + task_index.json.
  • python3 profile/build/build-catalog.py | python3 -m json.tool >/dev/null parses.
  • Running build-catalog.py twice produces byte-identical output (B5 determinism).
  • make phase0-smoke still green (manifests unchanged).
  • CI green on this branch.

Do not auto-merge — drift catalogued above needs P1-D-level reconciliation before this generator should overwrite profile/tools.json.

🤖 Generated with Claude Code

Phase-1 Track B of the AI-discoverability plan: implement the two
scripts that turn `profile/tools.json` from a hand-curated file into a
generated artifact, plus the catalog-level validator that enforces the
post-P1-A contract.

profile/build/validate-catalog.py
  * Validates `profile/tools.json` against `tools.schema.json` and
    `profile/task_index.json` against `task_index.schema.json`, both
    via Draft202012Validator from jsonschema.
  * Asserts no data-key collision between the two top-level shapes —
    a future-proofing guard: after P1-A the two documents share no
    top-level data key. The five meta-keys ($schema / schema_compat /
    schema_version / kind / _comment) are expected on both and skipped.
  * argparse: --tools (default profile/tools.json),
    --task-index (default profile/task_index.json).
  * Returns 0 on success; non-zero with structured stderr on failure.
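
The key-collision guard described above can be sketched as a set intersection that skips the five meta-keys. This is an illustration only: `colliding_keys` is a hypothetical name, and the real script may report collisions differently.

```python
# The five meta-keys that are expected on both documents and therefore skipped.
META_KEYS = {"$schema", "schema_compat", "schema_version", "kind", "_comment"}


def colliding_keys(tools_doc: dict, task_index_doc: dict) -> list[str]:
    """Return top-level data keys that appear in both documents, sorted.

    Hypothetical helper; an empty list means the post-P1-A invariant holds.
    """
    shared = (set(tools_doc) & set(task_index_doc)) - META_KEYS
    return sorted(shared)
```

A caller would treat any non-empty result as a failure, for example when a `task_index` block is re-inlined into tools.json.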

profile/build/build-catalog.py
  * TIER_1 + TIER_2 constants: the six onboarded repos' raw-GitHub
    repo.meta.json URLs (m-cli, m-stdlib, m-standard, tree-sitter-m,
    m-test-engine, m-modern-corpus).
  * For each manifest: fetch → validate against repo.meta.schema.json
    (reusing validate-repo-meta.py's logic so the schema-check path
    stays in one place) → translate to a tools.<key> summary entry.
  * Translation: id / repo / role / language / license /
    agent_instructions / verified_on / status (default "active") /
    repo_meta_url straight from the manifest; each exposes.<kind>
    becomes <kind>_url with the URL resolved against the repo's
    main-branch raw prefix; consumes / consumed_by passed through.
  * Top-level narrative ($schema / schema_compat / schema_version /
    kind / description / org / workflow / discovery_protocol) is
    copied verbatim from the prior tools.json so we don't lose
    hand-curated content. task_index is NOT emitted — it stays in
    its own file post-P1-A.
  * --write PATH (default stdout), --prior PATH (default the
    committed tools.json), --no-network (dry-run framing only),
    --urls (override TIER_1+TIER_2).
  * Deterministic: sorted keys, 2-space indent, trailing newline,
    ensure_ascii=False (em dashes pass through). Running twice
    against the same input produces byte-identical output.
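
The determinism settings listed above map directly onto standard `json.dumps` arguments. A minimal sketch, assuming the serializer is a thin wrapper (the wrapper shown here is illustrative, not copied from the script):

```python
import json


def dumps(doc: dict) -> str:
    """Serialize with sorted keys, 2-space indent, raw non-ASCII, trailing newline."""
    return json.dumps(doc, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# Two dicts with the same content but different insertion order.
first = {"b": 1, "a": {"z": "fmt \u2014 lint", "y": 2}}
second = {"a": {"y": 2, "z": "fmt \u2014 lint"}, "b": 1}

# sort_keys applies at every nesting level, so the output is order-independent.
assert dumps(first) == dumps(second)
```

With `ensure_ascii=False` the dash in the sample string is written as the character itself rather than as a `\u2014` escape, which keeps diffs of the generated file readable.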

profile/build/test_validate_catalog.py (10 tests)
  * Baseline pair validates clean (smoke).
  * Unknown top-level keys fail under additionalProperties: false
    (covers task_index re-inlined, generic unknown key, inlined-facts
    block in a tool entry, surprise field in task_index).
  * Malformed typed IDs fail under the typedID regex (in primary
    and in see_also).
  * Missing required field in a tool entry fails.
  * Missing file path reports clean error.
  * main(argv) exits 0 on the committed baseline.

profile/build/test_build_catalog.py (12 tests)
  * Three synthetic repo.meta.json payloads — minimal, rich,
    extra-exposes-kind — exercise the manifest → tools entry
    translation surface end-to-end.
  * Build emits all required top-level keys.
  * Build never emits task_index (post-P1-A contract).
  * Build preserves hand-curated top-level from prior_tools.
  * Minimal meta → summary entry; single-element language collapses
    to string; agent_instructions resolves to a github.com/.../blob/
    URL; manifest_url derives from exposes.manifest.
  * Rich meta → multi-language array stays as array; multiple
    exposes become multiple *_url pointers.
  * Extra-exposes-kind passes through (no hardcoded allow-list).
  * Tools key strips the `tool:` prefix from id.
  * Generated output validates against tools.schema.json.
  * dumps() is deterministic across input-order shuffles; emits
    trailing newline and sorted keys.
  * Two-run determinism (B5).
  * Invalid manifest raises a clear error rather than emitting a
    malformed entry.

Verification
  * pytest profile/build/ → 22 passed (10 + 12).
  * python3 profile/build/validate-catalog.py → exits 0 against the
    committed baseline.
  * python3 profile/build/build-catalog.py > generated.json runs
    twice → byte-identical output (B5 determinism).
  * Generated output validates clean against tools.schema.json.
  * make phase0-smoke → PASS (manifests unchanged).

Deferred to P1-D
  * Makefile `catalog` + `validate-catalog` targets — P1-D's job.
  * CI workflow `make catalog && git diff --exit-code` drift gate
    and `make validate-catalog` step — P1-D's job.
  * The drift between this branch's generator output and the
    committed `profile/tools.json` is intentional and surfaces
    exactly what the P1-D drift gate will need to address — see
    PR body for the catalogued differences.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rafael5 added a commit that referenced this pull request May 11, 2026
Coordinated companion to the three tier-3 onboarding PRs that
landed in parallel today:

  - tree-sitter-m-vscode #3  (squash-merge 1251518)
  - m-stdlib-vscode      #2  (squash-merge e92f660)
  - m-cli-extras         #2  (squash-merge 6e6fccf)

Each entry gains Phase-0 pointer URLs:
  - repo_meta_url   → dist/repo.meta.json
  - tree-sitter-m-vscode: + extension_info_url, package_json_url, language_configuration_url
  - m-stdlib-vscode:      + extension_info_url, package_json_url
  - m-cli-extras:         + plugins_url

Also drops the placeholder "not yet onboarded" notes lines and
fixes two stale fields surfaced by the onboarding PRs:
  - tree-sitter-m-vscode license: AGPL-3.0 → MIT (matches package.json)
  - tree-sitter-m-vscode agent_instructions: CLAUDE.md → AGENTS.md
  - m-stdlib-vscode agent_instructions: README.md placeholder → AGENTS.md

Phase 2 (tier-2) and tier-3 onboardings now both COMPLETE — every
non-archived repo in the org-catalog carries the Phase-0 contract.
Mechanical pickup will happen via P1-B's build-catalog.py (PR #12,
open for review).

make phase0-smoke still PASS; tools.json validates against
tools.schema.json (P1-A's strict shape).
…s PR

The three tier-3 repos (tree-sitter-m-vscode, m-stdlib-vscode,
m-cli-extras) all shipped dist/repo.meta.json today (PRs #3, #2, #2
in their respective repos; org-side companion .github PR #13 merged).

Adds TIER_3 = [...] alongside TIER_1 + TIER_2; defaults the URL list
to TIER_1 + TIER_2 + TIER_3 so build-catalog covers all nine
manifest-bearing org repos.

Without this commit the regenerated catalog would silently drop the
three tier-3 entries, looking like a drift-vs-committed bug.

Tests unchanged (still 22 green); local diff against committed
tools.json now shows the four real semantic gaps that P1-D will need
to address (m-tools archived-entry handling, consumed_by inverse-edge
computation, m-stdlib manifest_url/modules_url naming, additive
licenses_url/pyproject_toml_url payload pointers).
@rafael5 rafael5 merged commit c07dfa3 into main May 11, 2026
1 check passed
@rafael5 rafael5 deleted the phase1-B branch May 11, 2026 03:22
rafael5 added a commit that referenced this pull request May 11, 2026
Phase 3 launch state captured. Both upstream blockers from §0 closed:

- Phase 1 (org routing layer) CLOSED 2026-05-10 — A/B/C/D all merged
  (PRs #10/#11/#12/#16); make catalog + make validate-catalog green
  in CI; make catalog byte-idempotent against origin/main.
- Phase 2 (tier-2 + tier-3 manifests) CLOSED 2026-05-10 — all 3 tier-2
  + all 3 tier-3 repos onboarded same day; tools.json carries 9
  manifest-bearing entries; m-tools archived holdout rehosted under
  docs/history/ via PR #17.

§0 status column refreshed; verification commands inlined so any
future session can re-confirm the launch state without spelunking
git history. Recipe 7's MCP-server soft-dep noted as Phase-4
follow-up but not gating.

Pure documentation change; no plan-structure edits beyond §0. Tracks
A → B+C+D → E and stage matrices unchanged.