feat(record): add live recording command for API capture (#300)

yee94 · yee.wang · jackwener · web-flow · commit dff0fe510c96 · 2026-03-23T19:19:54.000+08:00
* feat(record): add live recording command for API capture

- Add `opencli record <url>` command that injects fetch/XHR interceptors
  into all tabs in the automation window, polls captured requests, and
  auto-generates YAML candidate adapters
- Support multi-tab recording: new tabs discovered during polling are
  automatically injected
- Add --timeout (default 60s) for agent-friendly non-blocking operation;
  stops on Enter, timeout, or SIGINT — whichever comes first
- Fix idempotent re-injection: restores original fetch/XHR before
  re-patching so guard flag no longer blocks subsequent record runs
- Add --poll interval option (default 2000ms)
- Expand SKILL.md with full Record Workflow section: interceptor
  internals, page-type capture expectations, YAML→TS conversion guide,
  and troubleshooting table
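
A minimal sketch of the idempotent re-injection described above, assuming a
simplified fetch-like host object. `Host`, `installRecorder`,
`__rec_orig_fetch`, and `__rec_buffer` are hypothetical names chosen for
illustration and may not match what `src/record.ts` uses:

```typescript
// Simplified stand-in for window.fetch; the real interceptor also patches XHR.
type FetchLike = (url: string) => Promise<{ json(): Promise<unknown> }>;

interface Host {
  fetch: FetchLike;
  __rec_orig_fetch?: FetchLike;                    // hypothetical: pristine fetch
  __rec_buffer?: { url: string; body: unknown }[]; // hypothetical: capture buffer
}

function installRecorder(host: Host): void {
  // Restore the pristine fetch first, so a second install never wraps a wrapper.
  if (host.__rec_orig_fetch) host.fetch = host.__rec_orig_fetch;
  const original = host.fetch;
  host.__rec_orig_fetch = original;

  // Reuse the buffer from an earlier install so captured entries survive.
  const buffer = host.__rec_buffer ?? [];
  host.__rec_buffer = buffer;

  host.fetch = async (url) => {
    const res = await original(url);
    const body = await res.json();
    buffer.push({ url, body });
    // Hand the already-parsed body back to the caller.
    return { json: async () => body };
  };
}
```

Restoring before patching, rather than returning early on a guard flag, is
what lets a later `record` run install fresh wrappers on a tab that was
already patched.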

* fix(record): fix XHR listener leak, pathChain syntax error, readline hang & args interpolation

- XHR send(): add __rec_listener_added guard to prevent duplicate event
  listeners when XHR is reused (abort → open → send)
- pathChain: when findArrayPath returns '' (root-level array), data access
  is just 'data' not 'data?.' which was invalid JS syntax
- waitForEnter(): return cleanup fn so timeout path can close readline.Interface
  preventing the process from hanging on stdin after auto-timeout
- buildRecordedYaml: replace search/page query param values with template
  vars ({{args.keyword}}, {{args.page}}) so generated YAML actually uses
  the declared args instead of hardcoding the recorded URL
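
The pathChain and args-interpolation fixes above can be sketched roughly as
follows. `buildAccessExpr`, `templatizeUrl`, and the two key sets are
hypothetical names and values for illustration; which query parameters the
real `buildRecordedYaml` treats as search or page params is an assumption:

```typescript
// Build the JS expression that reaches the array inside the response body.
function buildAccessExpr(arrayPath: string): string {
  // '' means the body itself is the array; 'data?.' alone would not parse.
  if (arrayPath === '') return 'data';
  return 'data?.' + arrayPath.split('.').join('?.');
}

// Assumed parameter names; the real lists may differ.
const SEARCH_KEYS = new Set(['q', 'query', 'keyword', 'search']);
const PAGE_KEYS = new Set(['page', 'pageNum']);

// Replace recorded search/page values with template vars. Plain string
// handling is used so '{{' and '}}' are not percent-encoded.
function templatizeUrl(recorded: string): string {
  const qIndex = recorded.indexOf('?');
  if (qIndex === -1) return recorded;
  const base = recorded.slice(0, qIndex);
  const parts = recorded.slice(qIndex + 1).split('&').map((pair) => {
    const key = pair.split('=')[0];
    if (SEARCH_KEYS.has(key)) return `${key}={{args.keyword}}`;
    if (PAGE_KEYS.has(key)) return `${key}={{args.page}}`;
    return pair;
  });
  return `${base}?${parts.join('&')}`;
}

console.log(buildAccessExpr(''));                        // data
console.log(buildAccessExpr('content.operatorRecords')); // data?.content?.operatorRecords
```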

---------

Co-authored-by: yee.wang <yee.wang@lazada.com>
Co-authored-by: jackwener <jakevingoo@gmail.com>
diff --git a/SKILL.md b/SKILL.md
@@ -259,6 +259,18 @@ opencli synthesize <site>
 # Generate: one-shot explore → synthesize → register
 opencli generate <url> --goal "hot"
 
+# Record: YOU operate the page, opencli captures every API call → YAML candidates
+# Opens the URL in automation window, injects fetch/XHR interceptor into ALL tabs,
+# polls every 2s, auto-stops after 60s (or press Enter to stop early).
+opencli record <url>                            # Record; site name is inferred from the domain
+opencli record <url> --site mysite             # Set the site name explicitly
+opencli record <url> --timeout 120000          # Custom timeout (ms, default 60000)
+opencli record <url> --poll 1000               # Shorter poll interval (ms, default 2000)
+opencli record <url> --out .opencli/record/x   # Custom output directory
+# Output:
+#   .opencli/record/<site>/captured.json        ← Raw captured data (with url/method/body)
+#   .opencli/record/<site>/candidates/*.yaml    ← High-confidence candidate adapters (score ≥ 8, with an array result)
+
 # Strategy Cascade: auto-probe PUBLIC → COOKIE → HEADER
 opencli cascade <api-url>
 
@@ -289,6 +301,129 @@ opencli bilibili hot -f csv     # CSV
 opencli bilibili hot -v         # Show each pipeline step and data flow
 ```
 
+## Record Workflow
+
+`record` is a manual recording workflow for pages that `explore` cannot discover automatically (login-gated actions, complex interactions, in-SPA routing).
+
+### How It Works
+
+```
+opencli record <url>
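+  → Open the automation window and navigate to the target URL
+  → Inject the fetch/XHR interceptor into all tabs (idempotent, safe to re-inject)
+  → Poll every 2s: inject into newly discovered tabs and drain every tab's capture buffer
+  → Stop on timeout (default 60s) or when Enter is pressed
+  → Analyze the captured JSON requests: dedupe → score → generate candidate YAML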
+```
+
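+**Interceptor behavior**:
+- Patches both `window.fetch` and `XMLHttpRequest`
+- Captures only responses with `Content-Type: application/json`
+- Drops responses that are plain objects with fewer than 2 keys (filters out tracking/ping calls)
+- Per-tab isolation: each tab has its own buffer, drained separately on every poll
+- Idempotent injection: re-injecting into the same tab restores the original functions before re-patching, without losing captured data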
+
+### 使用步骤
+
+```bash
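+# 1. Start recording (use --timeout to leave enough time to operate the page)
+opencli record "https://example.com/page" --timeout 120000
+
+# 2. Use the page normally in the automation window that opens:
+#    - Open lists, search, click items, switch tabs
+#    - Any action that triggers a network request is captured
+
+# 3. Press Enter to stop when done (or wait for the timeout)
+
+# 4. Inspect the results
+cat .opencli/record/<site>/captured.json        # Raw capture
+ls  .opencli/record/<site>/candidates/          # Candidate YAML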
+```
+
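+### Page Types and Expected Captures
+
+| Page type | Expected captures | Notes |
+|-----------|-------------------|-------|
+| List/search page | Many (5~20+) | Every search or page turn triggers new requests |
+| Detail page (read-only) | Few (1~5) | First-screen data arrives in one response; later actions go through form/redirect |
+| In-SPA route change | Moderate | Route changes trigger new API calls, but first-screen requests fire before injection |
+| Login-gated page | Depends on the actions | Make sure Chrome is already logged in to the target site |
+
+> **Note**: If the page issues most of its requests before navigation completes (server-side rendering / SSR hydration), the interceptor misses them.
+> Workaround: once the page has loaded, manually trigger actions that produce new requests (search, paginate, switch tabs, expand collapsed items, etc.).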
+
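+### Candidate YAML → TS CLI Conversion
+
+The generated candidate YAML is a starting point and usually needs converting to TypeScript (especially for internal systems such as tae):
+
+**Candidate YAML structure** (auto-generated):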
+```yaml
+site: tae
+name: getList          # Name inferred from the URL path
+strategy: cookie
+browser: true
+pipeline:
+  - navigate: https://...
+  - evaluate: |
+      (async () => {
+        const res = await fetch('/approval/getList.json?procInsId=...', { credentials: 'include' });
+        const data = await res.json();
+        return (data?.content?.operatorRecords || []).map(item => ({ ... }));
+      })()
+```
+
+**Converted to a TS CLI** (following the style of `src/clis/tae/add-expense.ts`):
+```typescript
+import { cli, Strategy } from '../../registry.js';
+
+cli({
+  site: 'tae',
+  name: 'get-approval',
+  description: 'View the approval flow and operation records of an expense report',
+  domain: 'tae.alibaba-inc.com',
+  strategy: Strategy.COOKIE,
+  browser: true,
+  args: [
+    { name: 'proc_ins_id', type: 'string', required: true, positional: true, help: 'Process instance ID (procInsId)' },
+  ],
+  columns: ['step', 'operator', 'action', 'time'],
+  func: async (page, kwargs) => {
+    await page.goto('https://tae.alibaba-inc.com/expense/pc.html?_authType=SAML');
+    await page.wait(2);
+    const result = await page.evaluate(`(async () => {
+      const res = await fetch('/approval/getList.json?taskId=&procInsId=${kwargs.proc_ins_id}', {
+        credentials: 'include'
+      });
+      const data = await res.json();
+      return data?.content?.operatorRecords || [];
+    })()`);
+    return (result as any[]).map((r, i) => ({
+      step: i + 1,
+      operator: r.operatorName || r.userId,
+      action: r.operationType,
+      time: r.operateTime,
+    }));
+  },
+});
+```
+
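+**Conversion notes**:
+1. Extract dynamic IDs in the URL (`procInsId`, `taskId`, etc.) into `args`
+2. Use the real body structure in `captured.json` to determine the correct data path (e.g. `content.operatorRecords`)
+3. The tae system wraps every response in `{ success, content, errorCode, errorMsg }`, so read data from `content.*`
+4. Authentication: cookie (`credentials: 'include'`); no extra headers needed
+5. Put the file in `src/clis/<site>/`; no manual registration is needed, it is discovered automatically after `npm run build`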
+
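+### Troubleshooting
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| 0 requests captured | Interceptor injection failed, or the page has no JSON API | Check that the daemon is running: `curl localhost:19825/status` |
+| Few captures (1~3) | The page is a read-only detail page and first-screen data was sent before injection | Trigger more requests manually (search/paginate), or use a list page instead |
+| 0 candidate YAML files | None of the captured JSON contains an array structure | Read `captured.json` directly and write the TS CLI by hand |
+| A newly opened tab was not intercepted | The tab was closed within one poll interval | Shorten the interval: `--poll 500` |
+| Data is not continuous across `record` runs | Expected: every `record` run starts a new automation window | No action needed |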
+
 ## Creating Adapters
 
 > [!TIP]
diff --git a/src/cli.ts b/src/cli.ts
@@ -183,6 +183,30 @@ export function runCli(BUILTIN_CLIS: string, USER_CLIS: string): void {
       process.exitCode = r.ok ? 0 : 1;
     });
 
+  // ── Built-in: record ─────────────────────────────────────────────────────
+
+  program
+    .command('record')
+    .description('Record API calls from a live browser session → generate YAML candidates')
+    .argument('<url>', 'URL to open and record')
+    .option('--site <name>', 'Site name (inferred from URL if omitted)')
+    .option('--out <dir>', 'Output directory for candidates')
+    .option('--poll <ms>', 'Poll interval in milliseconds', '2000')
+    .option('--timeout <ms>', 'Auto-stop after N milliseconds (default: 60000)', '60000')
+    .action(async (url, opts) => {
+      const { recordSession, renderRecordSummary } = await import('./record.js');
+      const result = await recordSession({
+        BrowserFactory: getBrowserFactory() as any,
+        url,
+        site: opts.site,
+        outDir: opts.out,
+        pollMs: parseInt(opts.poll, 10),
+        timeoutMs: parseInt(opts.timeout, 10),
+      });
+      console.log(renderRecordSummary(result));
+      process.exitCode = result.candidateCount > 0 ? 0 : 1;
+    });
+
   program
     .command('cascade')
     .description('Strategy cascade: find simplest working strategy')
diff --git a/src/record.ts b/src/record.ts