eval(pkm): use framework transcript recording #216
PR AI Review
This PR refactors the PKM eval harness. It bumps dart_agent_core to 1.0.13 so that evals use the framework's built-in transcript recorder, adds an optional controller parameter to PkmAgent.runWithContent/createAgent so that evals can inject their own controller, removes the harness-owned reconstruction of the tool-call trajectory so that graders read from the outcome workspace diff and the transcript instead, and adds two grader regression tests. The production code change is minimal (one optional parameter added), so the risk is low.
Affected Areas
Golden Path
Findings
Test Gaps
PR Preflight Summary
PR Policy Preflight
No deterministic policy findings.
PR Flutter Quality
Flutter Analyzer Baseline
No new analyzer issues introduced by this PR.
Flutter Test Baseline
No new Flutter test failures introduced by this PR.
Summary
Validation
No issues found! (ran in 5.6s)
00:00 +0: PKM eval graders routes from workspace diff, not outcome trajectory fields
00:00 +1: PKM eval graders read-before-write uses transcript tool-call order
00:00 +2: All tests passed!
No issues found! (ran in 3.9s)
Did not run the full live PKM suite, because the whole suite runs from a single test entrypoint.
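The second grader test listed above checks read-before-write using the transcript's tool-call order. As a rough, language-neutral illustration of that idea (not the PR's Dart code), the check might be sketched in Python as below. The `ToolCall` record, the tool names `read_file` and `write_file`, and the example paths are all hypothetical; the real dart_agent_core transcript schema is not shown on this page and may differ.

```python
from dataclasses import dataclass


# Hypothetical transcript entry; the actual transcript format is an assumption.
@dataclass(frozen=True)
class ToolCall:
    tool: str  # assumed tool names: "read_file" / "write_file"
    path: str


def read_before_write_violations(calls):
    """Return paths that were written before any earlier read of the same path."""
    seen_reads = set()
    violations = []
    for call in calls:  # calls are assumed to be in transcript order
        if call.tool == "read_file":
            seen_reads.add(call.path)
        elif call.tool == "write_file" and call.path not in seen_reads:
            violations.append(call.path)
    return violations


transcript = [
    ToolCall("read_file", "notes/a.md"),
    ToolCall("write_file", "notes/a.md"),
    ToolCall("write_file", "notes/b.md"),  # never read first
]
print(read_before_write_violations(transcript))  # ['notes/b.md']
```

This sketch treats every unread write as a violation. A real grader would probably exempt writes that create new files, and it would take the call order from the recorded transcript rather than from fields rebuilt by the harness.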