Skip to content

Export skills function added by Mirror Site maintainer#2138

Merged
Patrick-Erichsen merged 3 commits into
openclaw:mainfrom
wuchengan734-a11y:mirror_sync_v1
May 22, 2026
Merged

Export skills function added by Mirror Site maintainer#2138
Patrick-Erichsen merged 3 commits into
openclaw:mainfrom
wuchengan734-a11y:mirror_sync_v1

Conversation

@wuchengan734-a11y
Copy link
Copy Markdown
Contributor

@wuchengan734-a11y wuchengan734-a11y commented May 11, 2026

Summary

We are the ClawHub China mirror site from bytedance. To keep our mirror in sync with the upstream skill registry, we need an efficient way to bulk-fetch skill data. Currently there is no batch export API — the only option is to download skills one by one, which triggers rate limits and takes hours for 62k+ skills.

This PR adds a new GET /api/v1/skills/export endpoint that returns a merged ZIP archive of up to 1000 skills per request, with cursor-based pagination for full-catalog traversal.

What's changed

  • New endpoint: GET /api/v1/skills/export?startDate=<ms>&endDate=<ms>&limit=<n>&cursor=<token>

    • Returns a ZIP containing skill files organized in <slug>/ subdirectories
    • Includes _manifest.json (skill metadata array) and _meta.json per skill
    • Supports cursor-based pagination via X-Next-Cursor / X-Has-More response headers
  • Authentication & Authorization (convex/lib/apiTokenAuth.ts)

    • New requireExportAuth() — enforces Bearer token + admin-only role check
    • No token → 401, valid token but non-admin → 403, admin → allowed
    • Role whitelist configurable via EXPORT_ALLOWED_ROLES array
  • Security hardening (convex/lib/skillZip.ts)

    • validateSlug() — prevents Zip Slip via malicious slug paths (../../etc)
    • validateFilePath() — blocks absolute paths, .. traversal, backslashes, empty segments
    • buildMergedExportZip() — detects duplicate ZIP paths to prevent silent overwrites
  • Error recording (convex/httpApiV1/skillsV1.ts)

    • Missing versions or blobs are logged to _errors.json inside the ZIP (not silently skipped)
    • X-Export-Errors response header reports error count
  • Rate limiting (convex/lib/httpRateLimit.ts)

    • New export tier: { ip: 10, key: 60, adminKey: 600 } requests/min
  • Query layer (convex/skills.ts)

    • listByDateRange internalQuery — range scan on skillSearchDigest.by_active_created index
    • Chunked parallel blob reads (50 concurrent) to stay within Convex transaction limits

Files changed (16)

File Change
packages/schema/src/routes.ts Add skillsExport route constant
convex/lib/apiTokenAuth.ts Add requireExportAuth (admin-only, enabled)
convex/lib/httpRateLimit.ts Add export rate limit tier
convex/lib/skillZip.ts Add validateSlug, validateFilePath, buildMergedExportZip
convex/skills.ts Add listByDateRange internalQuery
convex/httpApiV1/skillsV1.ts Add exportSkillsV1Handler with security checks
convex/httpApiV1.ts Wire up httpAction export
convex/http.ts Register GET route
convex/devSeedExport.ts Dev seed: 200 test skills for export testing
test_export_api.sh Integration test script (31 tests)
convex/lib/access.ts Minor type alignment
convex/_generated/api.d.ts Auto-generated
packages/schema/dist/* Built output
readme_mirror_sync.md Mirror sync documentation

Testing

All checks below were actually executed against the local Convex dev server and verified.

  • TypeScript: npx tsc --noEmit — 0 errors
  • Unit tests: bun run test (vitest) — 180 test files, 1701 tests passed, 0 failed
  • Integration tests: bash test_export_api.sh31/31 passed, covering:
    • Auth (6 tests): no token→401, invalid→401, user→403, admin→200, case-insensitive bearer, empty bearer
    • Param validation (4 tests): missing startDate/endDate→400, startDate>endDate→400
    • Empty results (3 tests): future date range→empty ZIP, correct headers
    • Export output (5 tests): valid ZIP, correct Content-Type/Disposition, error/date headers
    • ZIP content (5 tests): _manifest.json valid, skill subdirs with _meta.json, no path traversal
    • Pagination (2 tests): cursor + hasMore flow, second page fetch
    • Rate limit (2 tests): X-RateLimit headers present
    • Edge cases (4 tests): limit clamping, single-point date range

Notes

No breaking changes

  • All existing endpoints are unaffected
  • The new export rate limit tier is additive (existing read/write/download unchanged)
  • skillZip.ts additions are new exports; existing buildDeterministicZip unchanged

Reviewer attention points

  1. Memory usage: The handler loads up to 1000 skills × N files into memory before zipping. For the current average file size (~10KB/file, ~5 files/skill) this is ~50MB peak, within Convex Action limits. Consider adding a total-size guard if skill file sizes grow significantly.
  2. adminKey: 600: The admin rate limit for export is generous (600 req/min). This was intentional to allow fast full-catalog sync, but could be reduced to 60-120 if the team prefers a more conservative setting.
  3. Uint8Array.from(zipSync(...)) in buildMergedExportZip creates a copy of the ZIP buffer. Could be optimized to return zipSync() directly if memory pressure is a concern.

Context

We are the ClawHub China mirror site operators. China mainland users experience high latency and intermittent connectivity when accessing clawhub.ai directly. This export API enables us to efficiently bulk-sync the full skill catalog to our mirror, keeping it up-to-date with hourly incremental exports using date-range queries + cursor pagination. Without this endpoint, syncing 62k+ skills requires individual downloads that take 10+ hours and frequently hit rate limits.

@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented May 11, 2026

@wuchengan734-a11y is attempting to deploy a commit to the Amantus Machina Team on Vercel.

A member of the Team first needs to authorize it.

@openclaw-barnacle openclaw-barnacle Bot added the triage: dirty-candidate Candidate: broad unrelated surfaces; may need splitting or cleanup. label May 11, 2026
@wuchengan734-a11y
Copy link
Copy Markdown
Contributor Author

/review

Comment thread convex/skills.ts Outdated
Comment on lines +9393 to +9398
const startIndexKey: IndexKey = decodedCursor ?? [undefined, endDate];
const endIndexKey: IndexKey = [undefined, startDate];

const result = await getPage(ctx, {
table: "skillSearchDigest",
index: "by_active_created",
Copy link
Copy Markdown
Contributor

@vercel vercel Bot May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The listByDateRange query used for skill exports does not filter out suspicious skills, potentially exposing flagged malicious content through the export API.

Fix on Vercel

@wuchengan734-a11y wuchengan734-a11y changed the title Mirror sync v1 Export skills function added by Mirror Site maintainer May 11, 2026
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 11, 2026

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-22 19:20 UTC / May 22, 2026, 3:20 PM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds an authenticated GET /api/v1/skills/export ZIP export endpoint with updatedAt pagination, manifest metadata, a new export rate-limit tier, route constants, and handler/query tests.

Reproducibility: yes. for the review blockers: source inspection of the PR head shows the auth-before-rate-limit ordering, owner visibility mismatch, storage-read-before-cap path, and metadata collision path. I did not execute the branch, and no inspectable runtime proof was provided.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Summary: The PR is not quality-ready because real behavior proof is missing and source inspection still finds blocking export boundary and availability defects.

Rank-up moves:

  • Add redacted terminal/log proof of the updated authenticated mirror flow, including paginated requests, response headers, and ZIP manifest/file inspection.
  • Fix the active-owner visibility, failed-auth throttling, pre-read size cap, and generated metadata path collision defects.
  • Add focused regression tests that fail on those edge cases.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body lists commands, but there is no inspectable redacted terminal output, logs, screenshot, recording, or linked artifact showing an authenticated paginated export, response headers, and ZIP inspection; contributors should redact private data and update the PR body so a fresh review can run.

Risk before merge

  • Merging creates a durable mirror export contract for route shape, cursor semantics, headers, auth behavior, rate limits, and ZIP layout before there is inspectable real behavior proof.
  • The export list can diverge from existing public visibility by exporting active skill rows whose owner is no longer public/active.
  • Invalid or missing API-token export attempts currently bypass the export rate limiter because auth returns before throttling runs.
  • The requested page can perform many storage reads and materialize blobs before enforcing the total byte cap, which can pressure Convex action availability.
  • A valid skill file named _export_skill_meta.json can collide with generated metadata and fail the whole export page.

Maintainer options:

  1. Repair Export Boundaries First (recommended)
    Fix the active-owner visibility check, auth-before-rate-limit ordering, pre-read size cap, and metadata filename collision, then add focused regression tests before merge.
  2. Confirm the Mirror Contract
    Maintainers should explicitly decide whether this is a general authenticated public feed or a privileged operator API and align docs, headers, rate limits, and audit behavior with that decision.
  3. Pause Bulk Export
    If ClawHub is not ready to operate a bulk catalog export API, pause or close this branch until a narrower sync contract is agreed.

Next step before merge
This needs contributor-supplied real behavior proof and maintainer API-contract review before any automated repair or merge path is safe.

Security
Needs attention: The diff introduces a bulk file export boundary and still needs fixes for public-owner visibility and failed-auth throttling.

Review findings

  • [P2] Match the public owner gate before exporting — convex/skills.ts:10481
  • [P2] Throttle failed export authentication — convex/httpApiV1/skillsV1.ts:1580-1584
  • [P2] Preflight byte caps before fetching blobs — convex/httpApiV1/skillsV1.ts:1711
Review details

Best possible solution:

Land a documented mirror export endpoint only after it reuses the established public visibility boundary, throttles all attempts, preflights file/byte caps, reserves generated ZIP paths, and has redacted proof from the mirror flow.

Do we have a high-confidence way to reproduce the issue?

Yes for the review blockers: source inspection of the PR head shows the auth-before-rate-limit ordering, owner visibility mismatch, storage-read-before-cap path, and metadata collision path. I did not execute the branch, and no inspectable runtime proof was provided.

Is this the best way to solve the issue?

No. The bulk export direction may be useful, but the current patch is not the narrowest maintainable solution until it matches existing public visibility, throttles failed auth, enforces caps before blob reads, reserves generated ZIP paths, and has mirror-flow proof.

Label justifications:

  • P2: This is a normal-priority feature PR with limited blast radius but real API-boundary and availability defects before merge.
  • merge-risk: 🚨 compatibility: Merging creates a new external export API contract that mirror operators may depend on for route, pagination, headers, and ZIP layout.
  • merge-risk: 🚨 security-boundary: The diff changes bulk access to skill files and currently does not exactly match existing public visibility and auth-throttling boundaries.
  • merge-risk: 🚨 availability: The handler can perform large storage/blob work before caps are enforced and can fail pages on generated metadata path collisions.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🧂 unranked krab, and The PR is not quality-ready because real behavior proof is missing and source inspection still finds blocking export boundary and availability defects.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body lists commands, but there is no inspectable redacted terminal output, logs, screenshot, recording, or linked artifact showing an authenticated paginated export, response headers, and ZIP inspection; contributors should redact private data and update the PR body so a fresh review can run.

Full review comments:

  • [P2] Match the public owner gate before exporting — convex/skills.ts:10481
    isExportableSkillDigest only checks the skill visibility fields, but the existing public list drops digest rows without a public owner. A digest for an inactive/deleted publisher can still satisfy isPublicSkillDoc here and be handed to the export ZIP even though normal public listing hides it; require digestToOwnerInfo(digest)?.owner or an equivalent active-owner gate before exporting.
    Confidence: 0.86
  • [P2] Throttle failed export authentication — convex/httpApiV1/skillsV1.ts:1580-1584
    The handler returns 401 before calling applyRateLimit, so missing or invalid tokens never consume the export IP bucket. Move the export rate-limit check before required auth, then pass the rate headers through the auth error response so failed-auth traffic is still throttled.
    Confidence: 0.91
  • [P2] Preflight byte caps before fetching blobs — convex/httpApiV1/skillsV1.ts:1711
    This fetches every queued storage blob before enforcing the total export byte cap, and later calls arrayBuffer() before checking whether the file would exceed the cap. Use the version.files[].size metadata to stop or skip before ctx.storage.get so a large valid request cannot burn storage/action resources beyond the advertised cap.
    Confidence: 0.88
  • [P2] Reserve generated export metadata paths — convex/httpApiV1/skillsV1.ts:1794-1795
    A skill can publish a normal text file named _export_skill_meta.json; the handler then appends generated metadata at the same path and buildMergedExportZip throws on the duplicate, failing the entire page. Reject or namespace reserved generated filenames before adding user files to the ZIP.
    Confidence: 0.84

Overall correctness: patch is incorrect
Overall confidence: 0.88

Security concerns:

  • [medium] Export omits the active-owner visibility check — convex/skills.ts:10481
    The export query only requires isPublicSkillDoc, while current public list paths also require a public owner, so inactive-owner content can cross the new bulk export boundary.
    Confidence: 0.86
  • [medium] Failed export auth bypasses the export throttle — convex/httpApiV1/skillsV1.ts:1580
    Missing or invalid API tokens return before applyRateLimit, leaving unauthorized export attempts outside the intended IP quota.
    Confidence: 0.91

What I checked:

  • Current main has no export route: ApiRoutes on current main lists the v1 skill/search/download routes but no skillsExport, so this PR is not redundant on main. (packages/schema/src/routes.ts:14, 232017ba6f0e)
  • PR registers a new bulk export endpoint: The PR branch adds ApiRoutes.skillsExport and wires it to exportSkillsV1Http, creating a new external API contract. (convex/http.ts:75, 6c19b992b451)
  • Auth runs before export throttling: exportSkillsV1Handler returns 401 before applyRateLimit, so missing or invalid tokens bypass the new export IP quota and rate headers. (convex/httpApiV1/skillsV1.ts:1580, 6c19b992b451)
  • Export filter omits the public owner gate: The export query only requires latestVersionId and isPublicSkillDoc, while existing public list construction also requires digestToOwnerInfo(digest)?.owner so inactive-owner rows stay hidden. (convex/skills.ts:10481, 6c19b992b451)
  • Byte cap is enforced after storage reads: The handler fetches all queued storage blobs before checking the total export byte cap, even though version file metadata already carries size. (convex/httpApiV1/skillsV1.ts:1711, 6c19b992b451)
  • Metadata path can still collide: The handler appends generated _export_skill_meta.json under each skill root, while normal sanitized publish paths can include that JSON filename; the ZIP builder then throws on duplicate paths. (convex/httpApiV1/skillsV1.ts:1794, 6c19b992b451)

Likely related people:

  • Patrick-Erichsen: Blame and log show Patrick owning the current v1 HTTP handlers, rate-limit helper, and ZIP helper on main, and he also pushed the PR follow-up that changed the export auth/rate contract. (role: recent area contributor; confidence: high; commits: b753b1f7ab0e, 9be35a3d4329, 5ebd7c3d8624; files: convex/httpApiV1/skillsV1.ts, convex/lib/httpRateLimit.ts, convex/lib/skillZip.ts)
  • Vyctor H. Brzezowski: Recent main history includes the public skill-list pagination cursor fix that this export endpoint builds on for cursor semantics. (role: recent adjacent contributor; confidence: medium; commits: 07bf41e1091e; files: convex/skills.ts, convex/skills.publicListCursor.test.ts)
  • Jason (Json): Recent main history includes category-aware related-skill work on the skillSearchDigest listing surface used by the export query. (role: recent adjacent contributor; confidence: medium; commits: 5be50db7ec4a; files: convex/skills.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 232017ba6f0e.

@wuchengan734-a11y
Copy link
Copy Markdown
Contributor Author

/review

@wuchengan734-a11y
Copy link
Copy Markdown
Contributor Author

/review

@clawsweeper clawsweeper Bot added P2 Normal backlog priority with limited blast radius. impact:security Security boundary, credential, authz, sandbox, or sensitive-data risk. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed impact:security Security boundary, credential, authz, sandbox, or sensitive-data risk. labels May 17, 2026
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 20, 2026

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 22, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@wuchengan734-a11y
Copy link
Copy Markdown
Contributor Author

/review

Add a new REST API endpoint that allows authenticated admin users to
export skills in bulk as a merged ZIP archive, designed for the ClawHub
China mirror site to efficiently sync skill data.

- New endpoint: GET /api/v1/skills/export?startDate=&endDate=&limit=&cursor=
- Admin-only auth via requireExportAuth (Bearer token + role check)
- Zip Slip protection: validateSlug + validateFilePath
- Duplicate ZIP path detection in buildMergedExportZip
- Per-skill metadata written to _export_skill_meta.json (avoids collision with skill files)
- Error recording: missing version/blob logged to _errors.json
- Dedicated rate limit tier: export { ip: 10, key: 60, adminKey: 600 }
- Cursor-based pagination on skillSearchDigest.by_active_created index
- Chunked parallel blob reads (50 concurrent)

Co-Authored-By: Claude <noreply@anthropic.com>
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 22, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 22, 2026
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels May 22, 2026
@Patrick-Erichsen
Copy link
Copy Markdown
Contributor

Patrick-Erichsen commented May 22, 2026

@wuchengan734-a11y I pushed a follow-up commit to this branch with some updates:

  • export now requires a normal API token but is no longer site-admin-only
  • authenticated users use the export key tier at 60 requests/min; no separate mirror role
  • exportable content is limited to public/installable/non-hidden/non-deleted/non-malicious skills
  • incremental sync now uses updatedAt
  • ZIP entries are namespaced as //... so duplicate slugs across publishers do not collide
  • added request file/byte caps and regression coverage

Verified with focused handler/query tests, format, lint, TypeScript, and autoreview. Could you please verify the updated branch against your mirror flow?


The above was generated by my AI, but @wuchengan734-a11y , I don't think we need to require this to be admin only. I don't think we would grant you guys admin access just to do this export.

Will the 60 req/min work for your needs? We could definitely bump this but I just used the key: 60 that you had scaffolded for authenticated users.

Note that I also updated the exported folder structure to use <publisher>/<slug> since slugs are no longer globally unique.

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 22, 2026

🦞🧹
ClawSweeper could not start a re-review for this item.

Reason: re-review requires an open issue or PR.

Re-review progress:

@Patrick-Erichsen Patrick-Erichsen merged commit e22bb7d into openclaw:main May 22, 2026
4 of 5 checks passed
@Patrick-Erichsen
Copy link
Copy Markdown
Contributor

@wuchengan734-a11y also note that in #2376 I capped the pagination at 250 skills. Beyond that Convex was crashing consistently.

@wuchengan734-a11y
Copy link
Copy Markdown
Contributor Author

Thanks for the update. I agree that requiring site-admin access is unnecessary for our mirror flow, so using a normal API token sounds good.

60 req/min should likely be fine for incremental sync. We’ll verify whether it is sufficient for initial backfill/full mirror runs and let you know if we need a higher burst or rate limit.

The / folder structure also makes sense given that slugs are no longer globally unique. We’ll update/verify our mirror-side parsing accordingly.

We’ll test the updated branch against our mirror flow, thanks for your feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P2 Normal backlog priority with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: dirty-candidate Candidate: broad unrelated surfaces; may need splitting or cleanup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants