eval(pkm): use framework transcript recording #216

Merged
sparkleMing merged 1 commit into main from codex/pkm-eval-transcript-refactor
May 28, 2026

Conversation

@sparkleMing
Collaborator

Summary

  • upgrade dart_agent_core to 1.0.13 so evals use the framework transcript recorder and EventBus close handling
  • pass the eval controller into PkmAgent.runWithContent while preserving the default production controller path
  • remove PKM harness-owned trajectory reconstruction from outcomes; keep outcomes focused on final state and the workspace diff
  • update PKM graders to read tool order and tool payloads from the framework transcript
  • add focused grader regression tests for workspace diff routing and read-before-write transcript ordering
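
A minimal sketch of the read-before-write check described above, assuming the transcript exposes tool calls in recorded order. The ToolCall class and the 'read_file' / 'write_file' tool names are hypothetical stand-ins for illustration; the actual transcript entry type and tool names are not shown in this PR.

```dart
// Hypothetical stand-in for one recorded tool call; the real transcript
// entry type in dart_agent_core is not shown in this PR.
class ToolCall {
  const ToolCall(this.name, this.path);
  final String name; // assumed tool names: 'read_file' or 'write_file'
  final String path;
}

/// Returns true when every write to a path is preceded, in transcript
/// order, by at least one read of that same path.
bool readsPrecedeWrites(List<ToolCall> calls) {
  final readPaths = <String>{};
  for (final call in calls) {
    if (call.name == 'read_file') {
      readPaths.add(call.path);
    } else if (call.name == 'write_file' && !readPaths.contains(call.path)) {
      // A write happened before any read of the same path.
      return false;
    }
  }
  return true;
}
```

Because the check walks the list once in order, it depends only on the recorded call sequence and not on any state reconstructed by the harness.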

Validation

  • Analyzing 2 items...
    No issues found! (ran in 5.6s)
  • 00:00 +0: loading /Users/ming/Downloads/project/opensource/memex/test/agent/eval/pkm_agent/graders_test.dart
    00:00 +0: PKM eval graders routes from workspace diff, not outcome trajectory fields
    00:00 +1: PKM eval graders read-before-write uses transcript tool-call order
    00:00 +2: All tests passed!
  • Analyzing 4 items...
    No issues found! (ran in 3.9s)

Did not run the full live PKM suite because runs the whole suite from a single test entrypoint.

@github-actions

PR AI Review

  • Risk level: LOW
  • Human review required: NO
  • Golden path impact: NONE
  • Confidence: medium
  • Workflow run: 26578118635

This PR refactors the PKM eval harness: bumps dart_agent_core to 1.0.13 so evals use the framework transcript recorder; adds an optional controller parameter to PkmAgent.runWithContent/createAgent for eval injection; removes harness-owned trajectory reconstruction, switching graders to read from outcome workspace diff and transcript; adds two grader regression tests. Production code change is minimal (one optional parameter added), risk is low.
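
As a rough illustration of how a workspace diff can be derived from two snapshots, the sketch below compares a before and after map of path to file content. The WorkspaceDiff class and the diffSnapshots function are hypothetical; the harness's actual _diffSnapshots implementation and the shape of outcome.workspaceDiff are not shown in this PR.

```dart
// Hypothetical result type; the real workspace diff shape is not shown.
class WorkspaceDiff {
  WorkspaceDiff(this.added, this.removed, this.modified);
  final Set<String> added;
  final Set<String> removed;
  final Set<String> modified;
}

/// Compares two snapshots (path -> file content) and classifies each path
/// as added, removed, or modified. Unchanged paths are omitted.
WorkspaceDiff diffSnapshots(
  Map<String, String> before,
  Map<String, String> after,
) {
  final added = after.keys.where((p) => !before.containsKey(p)).toSet();
  final removed = before.keys.where((p) => !after.containsKey(p)).toSet();
  final modified = after.keys
      .where((p) => before.containsKey(p) && before[p] != after[p])
      .toSet();
  return WorkspaceDiff(added, removed, modified);
}
```

A grader built on this kind of diff only looks at final file state, so it stays independent of how the agent arrived there.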

Affected Areas

  • agent
  • tests

Golden Path

  • No golden path impact was identified.
  • Rationale: Changes are limited to eval test infrastructure and a minor dart_agent_core version bump. The new optional controller parameter in production code is backward-compatible and does not affect any core user flow.

Findings

  • info Removed harness-owned trajectory reconstruction in favor of framework transcript. Evidence: test/agent/eval/pkm_agent/harness.dart: removed ~80 lines of _replayToolHistoryFromState and replaced with _diffSnapshots, test/agent/eval/pkm_agent/graders.dart: graders now read from transcript.toolCalls and outcome.workspaceDiff.
    Recommendation: Change is sound — reduces coupling between harness and agent internal state. No action needed.
  • info PkmAgent gains optional controller parameter, backward-compatible. Evidence: lib/agent/pkm_agent/pkm_agent.dart: AgentController? controller added to createAgent() and runWithContent(), Default path preserved: final agentController = controller ?? AgentController().
    Recommendation: Parameter is optional with unchanged default; existing callers are unaffected. No architecture violation.
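
The optional-parameter fallback quoted in the evidence above can be sketched as follows. The AgentController class here is a hypothetical stand-in with an invented label field for illustration (the real class comes from dart_agent_core), and resolveController is an assumed helper name, not a function from the PR.

```dart
// Hypothetical stand-in; the real AgentController comes from dart_agent_core.
class AgentController {
  AgentController({this.label = 'default'});
  final String label; // invented field, used only to tell instances apart
}

/// Mirrors the `controller ?? AgentController()` fallback: callers that pass
/// nothing get a fresh default controller, while an eval can inject its own.
AgentController resolveController([AgentController? controller]) {
  return controller ?? AgentController();
}
```

Existing call sites that omit the argument keep the default path, which is why the change is backward-compatible.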

Test Gaps

  • agent The new optional controller parameter has no direct unit test for the production code path (it is only validated indirectly via eval integration tests). Suggested check: Acceptable, because the eval tests cover this path; if more callers start using this parameter, add direct unit tests.

AI review is advisory. Maintainers should verify the result before merging.

@github-actions (bot) added the "ai: low risk" label (AI review classified the PR as low risk) on May 28, 2026
@sparkleMing merged commit 1a37500 into main on May 28, 2026
3 checks passed
@github-actions

PR Preflight Summary

  • Combined result: Low risk: both preflights completed and quality passed; use the normal manual merge flow.
  • Policy preflight: LOW RISK. No blocking, high-risk, or warning policy signal was found.
  • Flutter quality: PASS. Analyzer and test baselines found no newly introduced issue.
  • PR head: 4a8ef445f13ece62207c3079485a4006f3b96aec
  • Policy run: 26578118633
  • Flutter run: 26578119137

PR Policy Preflight

  • Decision: LOW RISK
  • Changed files: 7
  • Changed lines: 499
  • Diff truncated: false

No deterministic policy findings.

PR Flutter Quality

  • Overall: PASS
  • Analyzer baseline: PASS
  • Test baseline: PASS

Flutter Analyzer Baseline

  • Base issues: 307
  • PR issues: 307
  • New issues: 0

No new analyzer issues introduced by this PR.

Flutter Test Baseline

  • Base failures: 0
  • PR failures: 0
  • New failures: 0

No new Flutter test failures introduced by this PR.
