scrapers: generalize recording manifest + automate record→ship pipeline (keep hand-written production scrapers) #51

@hanwencheng

Context

The agentkeys-workflow-collection skill's recorder + src/scrapers/openrouter-cdp.ts + src/scrapers/openai-cdp.ts (shipped in #66ac92d) proved the architecture: hand-written production scrapers + recorder as iteration scaffolding. The current pipeline:

  1. /agentkeys-workflow-collection drives the recorder → iterates until signup mints a real key → flows.ts accumulates proven fixes.
  2. Human ports the working flow from flows.ts into a new src/scrapers/<service>-cdp.ts by hand.

Step 2 is the gap. We do NOT want to fix it by making the emitter magically produce production-ready scrapers (the string-template approach is fragile — every flows.ts change needs the template re-synced, and service-specific knowledge will always leak through). Instead, keep the hand-written scrapers as the source of truth and make the recorder produce artifacts + tooling that shrink the porting step to a near-mechanical transcription.

Scope

1. Generalize the manifest interface + helper functions

src/workflow-recorder/artifacts.ts::Manifest is currently cobbled together around the OpenRouter/OpenAI happy path. Make it general:

  • Flow-shape agnostic: signup AND login both serializable to the same manifest (today the manifest has flow: "signup" | "login" but many selectors/outcomes are signup-specific).
  • Generic step-outcome vocabulary ("fill-email", "click-continue", "wait-verification", "extract-key") instead of the current mix of flow-specific labels.
  • Typed "detected" fields: regexes (from/subject/URL), selectors (email/password/TOS/Continue/Create), timings (per-step ms), captcha-kind encountered (turnstile / hcaptcha / none / PoW-custom).
  • manifest.json becomes the contract, not the debug dump. Any consumer (recorder, ship-scraper skill, drift-detector) reads the same shape.

Files: src/workflow-recorder/artifacts.ts, src/workflow-recorder/flows.ts, src/workflow-recorder/email-analyzer.ts (now in src/lib/).
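
A minimal sketch of what a flow-agnostic manifest could look like, to make the bullets above concrete. Every type and field name here (StepOutcome, Detected, continueButton, the "failed" state, isManifest) is an assumption for illustration and is not the current shape in artifacts.ts; only the flow values, the four step labels, the captcha kinds, and state: completed come from this issue.

```typescript
// Hypothetical manifest contract; names are illustrative, not the existing artifacts.ts.
type Flow = "signup" | "login";
type StepKind = "fill-email" | "click-continue" | "wait-verification" | "extract-key";
type CaptchaKind = "turnstile" | "hcaptcha" | "none" | "pow-custom";

interface StepOutcome {
  kind: StepKind;    // generic vocabulary shared by signup and login
  ok: boolean;
  ms: number;        // per-step timing in milliseconds
  selector?: string; // selector that worked for this step, if any
  notes?: string;    // free-form human notes for service-specific helpers
}

interface Detected {
  regexes: { from?: string; subject?: string; url?: string };
  selectors: { email?: string; password?: string; tos?: string; continueButton?: string; create?: string };
  captcha: CaptchaKind;
}

interface Manifest {
  service: string;
  flow: Flow;
  state: "completed" | "failed"; // "failed" is an assumed counterpart to "completed"
  steps: StepOutcome[];
  detected: Detected;
}

const STEP_KINDS: readonly string[] = ["fill-email", "click-continue", "wait-verification", "extract-key"];

// Shallow runtime check so every consumer can reject a manifest that drifted from the contract.
function isManifest(value: unknown): value is Manifest {
  if (typeof value !== "object" || value === null) return false;
  const m = value as Record<string, unknown>;
  if (typeof m.service !== "string") return false;
  if (m.flow !== "signup" && m.flow !== "login") return false;
  if (!Array.isArray(m.steps)) return false;
  return m.steps.every(
    (s) => typeof s === "object" && s !== null && STEP_KINDS.includes((s as { kind?: unknown }).kind as string),
  );
}

// A login recording uses the same shape as signup; it simply has no "wait-verification" step.
const loginExample: Manifest = {
  service: "example-service",
  flow: "login",
  state: "completed",
  steps: [
    { kind: "fill-email", ok: true, ms: 420, selector: "input[type=email]" },
    { kind: "click-continue", ok: true, ms: 180 },
    { kind: "extract-key", ok: true, ms: 950 },
  ],
  detected: { regexes: {}, selectors: { email: "input[type=email]" }, captcha: "none" },
};
```

The point of the guard is that the recorder, the ship-scraper skill, and a drift detector could all validate against one definition instead of each assuming a shape.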

2. Improve agentkeys-workflow-collection skill — emitter uses the manifest interface

The current emitDraftScraper string-templates a scraper inline. Replace it with an emitter that reads the manifest and composes a scraper from:

  • A stable shell (argv parsing, env-var read, CDP connect, JSON-event emit, exit-cleanly) — identical across services, copied verbatim.
  • A service-specific body generated by walking the manifest's step sequence and emitting the corresponding lib/playwright-patterns calls.
  • Inherited behavior from the lib (no inlining of humanType / clickOuterCreate / etc.) — matches the hand-written scraper structure.

Minimal changes per service: the emitter should produce a file that is >80% identical to a hand-written scraper. Service-specific helpers (OpenRouter's dismissOpenRouterOnboardingModals, OpenAI's completeOpenAIPostVerifyProfile) still require human input (recorded as "notes" fields in the manifest).
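
One way the manifest walk could look, as a rough sketch only. The helper names on the right-hand side of the table (fillEmail, clickContinue, waitForVerification, extractKey) are placeholders, not the real lib/playwright-patterns exports, and the /* BODY */ marker in the shell is likewise an assumption. Notes recorded in the manifest surface as TODO comments so the human-input spots stay visible in the emitted file.

```typescript
type StepKind = "fill-email" | "click-continue" | "wait-verification" | "extract-key";

interface Step {
  kind: StepKind;
  selector?: string;
  notes?: string; // human notes from the manifest, e.g. a service-specific modal to dismiss
}

// Hypothetical mapping from generic step kind to a single lib call.
const CALLS: Record<StepKind, (s: Step) => string> = {
  "fill-email": (s) => `await fillEmail(page, ${JSON.stringify(s.selector ?? "")}, email);`,
  "click-continue": (s) => `await clickContinue(page, ${JSON.stringify(s.selector ?? "")});`,
  "wait-verification": () => `await waitForVerification(page, inbox);`,
  "extract-key": () => `const key = await extractKey(page);`,
};

// Walk the recorded steps in order and emit one lib call per step.
function emitBody(steps: Step[]): string {
  const lines: string[] = [];
  for (const step of steps) {
    if (step.notes) lines.push(`// TODO(human): ${step.notes}`);
    lines.push(CALLS[step.kind](step));
  }
  return lines.join("\n");
}

// The shell is copied verbatim; only the marker is replaced with the generated body.
// A function replacement avoids "$" in selectors being treated as a replace pattern.
function emitScraper(shell: string, steps: Step[]): string {
  const body = emitBody(steps);
  return shell.replace("/* BODY */", () => body);
}

const exampleShell = "// shell: argv, env, CDP connect\n/* BODY */\n// shell: emit JSON event, exit";
const exampleOutput = emitScraper(exampleShell, [
  { kind: "fill-email", selector: "#email" },
  { kind: "extract-key", notes: "dismiss onboarding modals first" },
]);
```

Because the table holds one entry per step kind, adding a new kind to the vocabulary is a single-row change, and nothing service-specific needs to live in the emitter itself.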

Update ~/.claude/skills/agentkeys-workflow-collection/SKILL.md Phase 4 to document the new emitter output + how to review it.

3. New skill: /agentkeys-ship-scraper

Takes the last-successful recording for a service and ships a production scraper. Flow:

  1. `--service ` argument.
  2. Find most recent manifest with state: completed under provisioner-scripts/recordings/<slug>-*-reference/ (reference) or <slug>-<ts>/ (latest).
  3. Emit via the new manifest-driven emitter → src/scrapers/<slug>-cdp.ts.
  4. Run tsc --noEmit on the emitted file; surface errors as human decisions.
  5. Run the scraper once (live) to prove minting; write outcome into manifest.
  6. Stage the new file for PR.
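
Step 2 of the flow above could be sketched as a pure selection function over already-loaded recordings. This assumes that a reference recording is preferred over a plain timestamped one, and that directory names sort lexicographically by recency; neither is stated in this issue, and the Recording type and the example directory names are made up.

```typescript
interface Recording {
  dir: string;   // directory name under provisioner-scripts/recordings/
  state: string; // manifest state; only "completed" recordings are eligible
}

// Return the newest recording by directory name, or undefined for an empty list.
function newest(recordings: Recording[]): Recording | undefined {
  const sorted = [...recordings].sort((a, b) => (a.dir < b.dir ? 1 : a.dir > b.dir ? -1 : 0));
  return sorted[0];
}

// Assumed policy: newest completed "-reference" recording first, else newest completed recording.
// Caveat: a plain prefix match would also pick up a different slug that starts with `${slug}-`.
function pickRecording(slug: string, recordings: Recording[]): Recording | undefined {
  const completed = recordings.filter((r) => r.state === "completed" && r.dir.startsWith(`${slug}-`));
  const reference = completed.filter((r) => r.dir.endsWith("-reference"));
  return newest(reference) ?? newest(completed);
}

// Illustrative directory names only.
const exampleRecordings: Recording[] = [
  { dir: "openrouter-20240101", state: "completed" },
  { dir: "openrouter-20240301", state: "completed" },
  { dir: "openrouter-20240201-reference", state: "completed" },
  { dir: "openrouter-20240401", state: "failed" },
  { dir: "openai-20240501", state: "completed" },
];
```

Keeping the selection separate from filesystem access makes the policy easy to test and easy to change if the reference-versus-latest preference turns out to be the other way around.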

Works for both login and signup flows:

  • Signup flow: emits full create-account → email-verify → API-key-mint path.
  • Login flow: emits login-with-credentials → API-key-mint (shorter; no email verify).

Skill lives at ~/.claude/skills/agentkeys-ship-scraper/SKILL.md.

Acceptance criteria

  • Manifest interface + helper functions reviewed for flow-agnosticism; login-path recording produces a manifest shaped identically to signup.
  • emitDraftScraper rewritten to compose from lib calls (not string-template). Output for OpenRouter recording matches hand-written openrouter-cdp.ts to within ~20 lines (docstring / ordering allowed to differ; behavior identical).
  • /agentkeys-ship-scraper end-to-end: invoke on OpenRouter's reference recording → emitted scraper mints a real key live → no regression vs current src/scrapers/openrouter-cdp.ts.
  • Same test for OpenAI.
  • A third service (user's choice — Brave-login if we add a BYOK flow, or whatever unblocks next) goes record → emit → ship without hand-porting.
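
The "within ~20 lines" criterion could be checked mechanically. The sketch below is one possible reading of it: count the lines that fall outside the longest common subsequence of the two files, so an edited line counts once as removed and once as added. The function name and the counting convention are assumptions, and the check says nothing about behavioral equivalence, which still needs the live run.

```typescript
// Count lines present in only one of the two texts, using a longest-common-subsequence table.
function differingLines(a: string, b: string): number {
  const x = a.split("\n");
  const y = b.split("\n");
  // dp[i][j] holds the LCS length of x[i..] and y[j..].
  const dp: number[][] = Array.from({ length: x.length + 1 }, () =>
    new Array<number>(y.length + 1).fill(0),
  );
  for (let i = x.length - 1; i >= 0; i--) {
    for (let j = y.length - 1; j >= 0; j--) {
      dp[i][j] = x[i] === y[j] ? dp[i + 1][j + 1] + 1 : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  const common = dp[0][0];
  return x.length - common + (y.length - common);
}

// Example gate: the emitted file must stay within a line budget of the hand-written one.
function withinBudget(emitted: string, handWritten: string, budget = 20): boolean {
  return differingLines(emitted, handWritten) <= budget;
}
```

The table is quadratic in file length, which is acceptable for scraper files of a few hundred lines.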

Out of scope

Why this architecture

Keeping production scrapers hand-written means every service gets a clear, auditable file with service-specific logic visible. The emitter's job is to produce a starting point, not the final artifact. This matches the way we actually debug: when OpenRouter's modal chain changes, you edit openrouter-cdp.ts (not the emitter template), re-record, regenerate, diff, ship.

Related commit: #66ac92d `feat(scrapers): deterministic OpenRouter + OpenAI production scrapers`
