phase5-B: link-check gate + fix validate-repo-meta binary-URL decode bug (TDD)#33

Merged: rafael5 merged 1 commit into main from phase5-B on May 11, 2026

@rafael5 rafael5 commented May 11, 2026

Summary

Phase 5 Track B per phase5-plan.md §3. Gate 2 of 4. Bundles two pieces because B0 was a documented prerequisite for B1+B2.

B0 — validate-repo-meta.py binary-URL fix (Phase-4 carry-over)

Pre-Track-B, resolve_one() blindly UTF-8-decoded every fetched body. Any exposes.* URL pointing at a binary asset (e.g. m-dev-tools-mcp's v0.1.0 wheel) crashed with UnicodeDecodeError; the documented workaround was ARGS=--no-resolve on every call.

Fix: detect non-text extensions (.whl / .tar.gz / .tgz / .zip / .gz) and skip the body-read entirely — HEAD-equivalent semantics. Text URLs unchanged.

  • New _BINARY_EXTENSIONS constant + _is_binary_url() helper.
  • resolve_one() branches on binary-ness; binary URLs only check "did the server answer?".
  • tests/test_validate_repo_meta.py — 5 cases pinning the behaviour (text decode + JSON-validate, invalid-JSON text, binary .whl no-decode, .tar.gz no-decode, binary-URL transport failure still surfaces).

Verified: make check-repo-meta META=…/m-dev-tools-mcp/dist/repo.meta.json in full-resolve mode now returns OK against the v0.1.0 wheel URL.

B1 + B2 — check-links.py

Walks every URL surfaced by the catalog:

| Source | Surfaces |
| --- | --- |
| profile/llms.txt | Markdown link targets (line number captured as label) |
| profile/tools.json | top-level $schema + every tools.<key>.<field>_url value |
| profile/task_index.json | top-level $schema + every doc field on every category/row |
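
The inventory step for two of the sources above could look roughly like this. It is a sketch under assumptions: the function names, the label format, the regex (inline Markdown links only), and the shape of tools.json (a `tools` object keyed by tool name) are all hypothetical.

```python
import re

# Inline Markdown links of the form [text](http...). Assumption: no reference-style links.
_MD_LINK = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def inventory_llms_txt(text):
    """Return (label, url) rows, one per link, labelled with the 1-based line number."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for url in _MD_LINK.findall(line):
            rows.append((f"llms.txt:{lineno}", url))
    return rows


def inventory_tools_json(doc):
    """Return (label, url) rows for the top-level $schema and every *_url field."""
    rows = []
    if "$schema" in doc:
        rows.append(("tools.json:$schema", doc["$schema"]))
    # Assumed shape: {"tools": {"<key>": {"<field>_url": "https://..."}}}
    for key, entry in doc.get("tools", {}).items():
        for field, value in entry.items():
            if field.endswith("_url") and isinstance(value, str):
                rows.append((f"tools.{key}.{field}", value))
    return rows
```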

Per-URL: HEAD with 15s timeout; 405→GET fallback; 200/301/302 = OK (urllib follows redirects when --allow-redirects is on, default). Anything else = FAIL with HTTP status + reason captured.
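
The per-URL behaviour can be sketched with the standard library as follows. The function name check_url() and the 15s timeout come from this PR; the `opener` parameter, the (status, detail) return shape, and the message format are assumptions made so the sketch is testable without a network.

```python
import urllib.error
import urllib.request

OK_STATUSES = {200, 301, 302}


def check_url(url, timeout=15.0, opener=urllib.request.urlopen):
    """Return ("OK" | "FAIL", detail). `opener` is a hypothetical injection point."""

    def attempt(method):
        req = urllib.request.Request(url, method=method)
        with opener(req, timeout=timeout) as resp:
            return resp.status

    try:
        try:
            status = attempt("HEAD")
        except urllib.error.HTTPError as exc:
            if exc.code != 405:
                raise  # handled by the outer HTTPError clause
            # 405 Method Not Allowed: the server refuses HEAD, so retry with GET.
            status = attempt("GET")
    except urllib.error.HTTPError as exc:
        return ("FAIL", f"HTTP {exc.code} {exc.reason}")
    except (urllib.error.URLError, OSError) as exc:
        # DNS failures, refused connections, and timeouts land here.
        return ("FAIL", f"transport error: {exc}")
    # urllib's default opener follows redirects, so this is normally the final status.
    if status in OK_STATUSES:
        return ("OK", f"HTTP {status}")
    return ("FAIL", f"HTTP {status}")
```

HTTPError is caught before URLError because it is a subclass of it; reversing the order would hide the status code.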

Modes:

  • --offline — inventory only. No network. Status column reads INVENTORIED. Used by per-PR CI.
  • default — full HEAD walk. Used by the weekly cron firing.

Exit codes: 0 all OK / INVENTORIED; 1 any FAIL; 2 fixture error.
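
One way to map per-URL statuses onto these exit codes is sketched below. The severity ranking, the function names, and the reading of "fixture error" as an unreadable or unparsable catalog file are assumptions, not taken from check-links.py.

```python
import json

EXIT_OK, EXIT_FAIL, EXIT_FIXTURE = 0, 1, 2

# Assumed ranking: FAIL outranks the two passing statuses.
_SEVERITY = {"OK": 0, "INVENTORIED": 0, "FAIL": 1}


def worst_status(statuses):
    """Return the most severe status seen; an empty run counts as OK (assumption)."""
    return max(statuses, key=_SEVERITY.__getitem__, default="OK")


def run(load_inventory, check):
    """load_inventory() yields (label, url) rows; check(url) returns a status string."""
    try:
        inventory = load_inventory()
    except (OSError, json.JSONDecodeError):
        # Assumed meaning of "fixture error": the catalog could not be read or parsed.
        return EXIT_FIXTURE
    statuses = [check(url) for _label, url in inventory]
    return EXIT_FAIL if worst_status(statuses) == "FAIL" else EXIT_OK
```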

13 new TDD cases (tests/test_check_links.py):

| Surface | Cases |
| --- | --- |
| Inventory | 3 (tools.json *_url, task_index doc, llms.txt Markdown links) |
| check_url() | 5 (200 OK, 404 FAIL, 405→GET fallback, DNS failure, redirect-follow OK) |
| Driver | 2 (worst-first aggregator, all-OK case) |
| CLI | 3 (--offline must NOT touch network, with urlopen patched to raise; 404 → rc=1; smoke against committed catalog) |
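
The "--offline must NOT touch network" case relies on patching urlopen so that any call fails loudly. A self-contained sketch of that testing technique follows; the run() function is a hypothetical stand-in for the real driver, not the code under test in tests/test_check_links.py.

```python
import urllib.request
from unittest import mock


def run(urls, offline):
    """Hypothetical stand-in driver: inventory only when offline, else probe each URL."""
    rows = []
    for url in urls:
        if offline:
            rows.append((url, "INVENTORIED"))
        else:
            with urllib.request.urlopen(url, timeout=15) as resp:
                rows.append((url, "OK" if resp.status == 200 else "FAIL"))
    return rows


def test_offline_never_touches_network():
    # Any network call raises immediately, so an accidental fetch cannot pass silently.
    with mock.patch("urllib.request.urlopen",
                    side_effect=AssertionError("network touched")) as fake:
        rows = run(["https://example.invalid/a"], offline=True)
    assert rows == [("https://example.invalid/a", "INVENTORIED")]
    fake.assert_not_called()
```

The patch works here because run() looks up `urllib.request.urlopen` at call time; a module that did `from urllib.request import urlopen` would need the patch applied to its own namespace instead.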

B3 — Make + CI wiring

  • make check-links — invokes the script with --offline (PR-mode).
  • .github/workflows/ci.yml: in PR mode, the check job gains a Link inventory step. On the weekly cron firing, the handshake job gains a parallel step that runs the full live HEAD walk, so a broken upstream URL surfaces within 7 days.

Verified locally

  • pytest profile/build/: 84 / 84 (66 prior + 5 validate-repo-meta + 13 check-links)
  • make check-links — clean (58 URLs catalogued offline)
  • make check-freshness — still clean (Track A unchanged)
  • make check-repo-meta full mode (no --no-resolve) — OK against the v0.1.0 wheel URL

What's NOT in this PR

  • JSON-body validation in the cron firing — check-links only HEAD-checks; deeper "is the body well-formed" lives in build-catalog + validate-catalog. Duplicating that path here would be churn.
  • Retry-with-backoff for transient 5xx on the cron firing — phase5-plan.md §9 explicitly defers this until CI demonstrates flake.

Test plan

  • All 18 new pytest cases pass (5 + 13)
  • Full suite stays green (84 / 84 total)
  • Both make targets exit 0 on main state
  • Pre-existing make check-repo-meta calls (which still use --no-resolve in places) continue to work — the fix is purely additive
  • CI green

…bug (TDD)

Phase 5 Track B per phase5-plan.md §3. Gate 2 of 4. Two pieces:

### B0 — validate-repo-meta.py binary-URL fix (Phase-4 carry-over)

Pre-Track-B, ``resolve_one()`` blindly UTF-8-decoded every fetched
body. Any ``exposes.*`` URL pointing at a binary asset (e.g.
m-dev-tools-mcp's v0.1.0 wheel) crashed with UnicodeDecodeError; the
documented workaround was ``ARGS=--no-resolve`` on every call.

Fix: detect non-text extensions (``.whl`` / ``.tar.gz`` / ``.tgz`` /
``.zip`` / ``.gz``) and skip the body-read entirely — HEAD-equivalent
semantics. Text URLs unchanged.

* New _BINARY_EXTENSIONS module constant + _is_binary_url() helper.
* resolve_one() branches on binary-ness; binary URLs only check
  "did the server answer?".
* tests/test_validate_repo_meta.py — 5 cases pinning the behaviour:
  text-URL decodes + JSON-validates; invalid-JSON text URL reports;
  binary .whl URL does NOT decode (the bug-fix assertion); .tar.gz
  same defense; transport failure on a binary URL still surfaces.

Verified: ``make check-repo-meta META=…/m-dev-tools-mcp/dist/repo.meta.json``
in full-resolve mode now returns OK against the v0.1.0 wheel URL.
The ``ARGS=--no-resolve`` workaround can be dropped from any
call site that wants the stricter check (out of scope here — just
fix the bug, leave the call sites alone).
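
The branch in ``resolve_one()`` and two of the pinned behaviours can be illustrated as follows. This is a hedged sketch: the `opener` and `timeout` parameters, the return values, and the use of plain `json.loads` in place of the real validation are assumptions, not the code in validate-repo-meta.py.

```python
import io
import json
import urllib.request
from urllib.parse import urlparse

_BINARY_EXTENSIONS = (".whl", ".tar.gz", ".tgz", ".zip", ".gz")


def _is_binary_url(url):
    return urlparse(url).path.lower().endswith(_BINARY_EXTENSIONS)


def resolve_one(url, opener=urllib.request.urlopen, timeout=10.0):
    """Hypothetical signature: parsed JSON for text URLs, None for binary ones."""
    with opener(url, timeout=timeout) as resp:
        if _is_binary_url(url):
            # Binary asset: the server answered, so stop here and never read the body.
            return None
        return json.loads(resp.read().decode("utf-8"))


def test_binary_wheel_is_not_decoded():
    # 0xff is never valid UTF-8, so decoding this body would raise UnicodeDecodeError.
    fake = lambda url, timeout=None: io.BytesIO(b"\xff\xfe\x00")
    url = "https://example.invalid/pkg-0.1.0-py3-none-any.whl"
    assert resolve_one(url, opener=fake) is None


def test_binary_transport_failure_still_surfaces():
    def fake(url, timeout=None):
        raise OSError("connection refused")
    try:
        resolve_one("https://example.invalid/pkg.tar.gz", opener=fake)
    except OSError:
        pass
    else:
        raise AssertionError("transport failure was swallowed")
```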

### B1 + B2 — check-links.py

Walks every URL surfaced by the catalog:

* profile/llms.txt — Markdown link targets (one row per link, line
  number captured as the label)
* profile/tools.json — top-level $schema + every tools.<key>.<field>_url
  value across every entry
* profile/task_index.json — top-level $schema + every doc field on
  every category/row

Per-URL behaviour: HEAD with a 15s timeout. On 405 Method Not Allowed,
retry with GET. 200/301/302 = OK (urllib follows redirects natively
when --allow-redirects is on, which is the default). Anything else
= FAIL with the HTTP status + reason captured.

Modes:

* --offline — inventory only. No network round-trip. Status column
  reads INVENTORIED. Used by per-PR CI.
* default — full HEAD walk. Used by the weekly cron firing.

Exit codes: 0 all OK (or all INVENTORIED), 1 any FAIL, 2 fixture
error.

13 new TDD cases (tests/test_check_links.py):

* Inventory (3): tools.json $schema + *_url surfacing; task_index doc
  fields; llms.txt Markdown link extraction.
* check_url (5): 200 OK, 404 FAIL, 405→GET fallback (HEAD then GET
  retry assertion), DNS failure FAIL, redirect-follow OK.
* Driver (2): worst-first aggregation; all-OK aggregator.
* CLI (3): --offline must not touch the network (urlopen patched to
  raise); a 404 surfaces as rc=1; smoke against committed catalog.

### B3 — Make + CI wiring

* make check-links — invokes the script with --offline (PR-mode).
* .github/workflows/ci.yml: PR-mode adds a "Link inventory" step to
  the `check` job. The weekly cron firing on the `handshake` job
  adds a parallel step running the full live HEAD walk so a broken
  upstream URL surfaces within 7 days.

### Verification locally

* pytest profile/build/ — 84 / 84 (66 prior + 5 validate-repo-meta
  + 13 check-links)
* make check-links — clean (58 URLs catalogued offline)
* make check-freshness — still clean (Phase-5 Track A unchanged)
* make check-repo-meta full mode (no --no-resolve) green against
  v0.1.0 wheel URL

### What's NOT in this PR

* JSON-body validation in the cron firing — check-links only HEAD-
  checks; deeper "is the body well-formed" lives in build-catalog +
  validate-catalog. Adding a JSON-parse pass in check-links would
  duplicate that path.
* Retry-with-backoff for transient 5xx on the cron firing —
  phase5-plan.md §9 explicitly defers this until CI demonstrates
  flake.
@rafael5 rafael5 merged commit 16bbd08 into main May 11, 2026
2 checks passed
@rafael5 rafael5 deleted the phase5-B branch May 11, 2026 14:38
rafael5 added a commit that referenced this pull request May 11, 2026
Captures Phase 5 exit per phase5-plan.md §6 + §10. Mirrors
phase4-evidence.md shape: one section per gate, "what this proves"
roll-up, then each §10 done-criterion cited green.

Verified locally (gate outputs in the evidence doc):

* pytest profile/build/ — 116/116 (51 prior + 65 across Phase 5)
* make check-freshness — clean (worst=OK)
* make check-links — clean (offline; 59 URLs catalogued)
* make check-licenses — clean (worst=SKIP; 9 INVENTORIED + 1 SKIP
  for m-modern-corpus's mixed-per-subdir)
* make check-schema-compat — clean (no bumps in this PR; no non-
  additive changes)
* make handshake — 8/8 steps
* make recipes-check — 4/4 clean
* make validate-catalog — OK
* make check-docs-prose — clean

All seven §10 done-criteria cited green in the evidence doc:

1. check-freshness.py + 15 TDD + make target (PR #32 / e9e00cb)
2. check-links.py + 13 TDD + B0 binary-URL fix in
   validate-repo-meta.py + 5 TDD (PR #33 / 16bbd08)
3. check-licenses.py + 18 TDD + per-license signature dict (PR #34
   / 5d9a995)
4. check-schema-compat.py + 14 TDD + fetch-depth:0 wiring (PR #35
   / 8792787)
5. CI per-PR runs all four --offline variants
6. Weekly cron firing runs --strict freshness + live link-check +
   full LICENSE-fetch
7. This evidence file

Also:

* docs/ai-discoverability/README.md phase table — Phase 5 row
  flipped from "in flight" → "Closed 2026-05-11" with evidence link.
* AI-discoverability-architecture.md "Phase 5 — in flight" section
  rewritten to "Phase 5 — closed 2026-05-11"; notes the operational
  loop is complete and future phases would address growth, not
  enforcement coverage.