eval(pkm): use framework transcript recording by sparkleMing · Pull Request #216 · memex-lab/memex

sparkleMing · 2026-05-28T13:36:55Z

Summary

upgrade dart_agent_core to 1.0.13 so evals use the framework transcript recorder and EventBus close handling
pass the eval controller into PkmAgent.runWithContent while preserving the default production controller path
remove PKM harness-owned trajectory reconstruction from outcomes; keep outcomes focused on final state and the workspace diff
update PKM graders to read tool order and tool payloads from the framework transcript
add focused grader regression tests for workspace diff routing and read-before-write transcript ordering
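
To illustrate the read-before-write check described above, here is a minimal sketch of a grader that walks recorded tool calls in order. The `ToolCall` class, the tool names `read_file` and `write_file`, and the `path` argument key are all assumptions made for illustration; the real transcript types in dart_agent_core and the real PKM tool names may differ.

```dart
// Hypothetical shape of one recorded tool call (not the framework's type).
class ToolCall {
  final String name;
  final Map<String, Object?> arguments;
  const ToolCall(this.name, this.arguments);
}

/// Returns every path that was written before any read of the same path.
/// Tool names and the 'path' key are assumed, not taken from the PR.
List<String> writesWithoutPriorRead(List<ToolCall> calls) {
  final seenReads = <String>{};
  final violations = <String>[];
  for (final call in calls) {
    final path = call.arguments['path'];
    if (path is! String) continue; // skip calls that carry no path
    if (call.name == 'read_file') {
      seenReads.add(path);
    } else if (call.name == 'write_file' && !seenReads.contains(path)) {
      violations.add(path);
    }
  }
  return violations;
}

void main() {
  const calls = [
    ToolCall('read_file', {'path': 'notes/a.md'}),
    ToolCall('write_file', {'path': 'notes/a.md'}),
    ToolCall('write_file', {'path': 'notes/b.md'}),
  ];
  print(writesWithoutPriorRead(calls)); // [notes/b.md]
}
```

Because the check depends only on the order of calls in the transcript, it needs no knowledge of the agent's internal state, which appears to be the point of the refactor.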

Validation

Analyzing 2 items...
No issues found! (ran in 5.6s)
00:00 +0: loading /Users/ming/Downloads/project/opensource/memex/test/agent/eval/pkm_agent/graders_test.dart
00:00 +0: PKM eval graders routes from workspace diff, not outcome trajectory fields
00:00 +1: PKM eval graders read-before-write uses transcript tool-call order
00:00 +2: All tests passed!
Analyzing 4 items...
No issues found! (ran in 3.9s)

Did not run the full live PKM suite, because it runs as a whole from a single test entrypoint.

github-actions · 2026-05-28T13:38:29Z

PR AI Review

Risk level: LOW
Human review required: NO
Golden path impact: NONE
Confidence: medium
Workflow run: 26578118635

This PR refactors the PKM eval harness: bumps dart_agent_core to 1.0.13 so evals use the framework transcript recorder; adds an optional controller parameter to PkmAgent.runWithContent/createAgent for eval injection; removes harness-owned trajectory reconstruction, switching graders to read from outcome workspace diff and transcript; adds two grader regression tests. Production code change is minimal (one optional parameter added), risk is low.
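
As a rough illustration of how an outcome's workspace diff could be derived from two snapshots, the sketch below compares a before and an after map of relative path to file content. The `WorkspaceDiff` class and the `diffSnapshots` signature are assumptions; the PR's actual `_diffSnapshots` helper and `outcome.workspaceDiff` shape are not shown in this conversation.

```dart
/// Hypothetical result type; the real workspaceDiff shape may differ.
class WorkspaceDiff {
  final List<String> added;
  final List<String> removed;
  final List<String> modified;
  const WorkspaceDiff(this.added, this.removed, this.modified);
}

/// Compares two snapshots that map relative path -> file content.
WorkspaceDiff diffSnapshots(
  Map<String, String> before,
  Map<String, String> after,
) {
  final added = <String>[];
  final removed = <String>[];
  final modified = <String>[];
  for (final entry in after.entries) {
    final old = before[entry.key];
    if (old == null) {
      added.add(entry.key); // path exists only after the run
    } else if (old != entry.value) {
      modified.add(entry.key); // same path, different content
    }
  }
  for (final path in before.keys) {
    if (!after.containsKey(path)) removed.add(path);
  }
  // Sort so graders and test expectations see a stable order.
  return WorkspaceDiff(added..sort(), removed..sort(), modified..sort());
}

void main() {
  final diff = diffSnapshots(
    {'a.md': 'old', 'b.md': 'same', 'c.md': 'gone'},
    {'a.md': 'new', 'b.md': 'same', 'd.md': 'fresh'},
  );
  print('${diff.added} ${diff.modified} ${diff.removed}');
  // [d.md] [a.md] [c.md]
}
```

A grader that routes on such a diff only inspects final file state, so it stays valid even if the agent's tool-call history format changes.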

Affected Areas

agent
tests

Golden Path

No golden path impact was identified.
Rationale: Changes are limited to eval test infrastructure and a minor dart_agent_core version bump. The new optional controller parameter in production code is backward-compatible and does not affect any core user flow.

Findings

info Removed harness-owned trajectory reconstruction in favor of framework transcript. Evidence: test/agent/eval/pkm_agent/harness.dart: removed ~80 lines of _replayToolHistoryFromState and replaced with _diffSnapshots, test/agent/eval/pkm_agent/graders.dart: graders now read from transcript.toolCalls and outcome.workspaceDiff.
Recommendation: Change is sound — reduces coupling between harness and agent internal state. No action needed.
info PkmAgent gains optional controller parameter, backward-compatible. Evidence: lib/agent/pkm_agent/pkm_agent.dart: AgentController? controller added to createAgent() and runWithContent(), Default path preserved: final agentController = controller ?? AgentController().
Recommendation: Parameter is optional with unchanged default; existing callers are unaffected. No architecture violation.
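
The optional-parameter pattern quoted in the evidence above can be sketched as follows. `AgentController` here is a stand-in class with a made-up `label` field, and `PkmAgentSketch.resolveController` is a hypothetical helper; only the `controller ?? AgentController()` fallback comes from the review text.

```dart
// Stand-in for the framework's AgentController; its real API is not shown.
class AgentController {
  final String label; // hypothetical field, used only to tell instances apart
  AgentController([this.label = 'default']);
}

class PkmAgentSketch {
  /// Falls back to a fresh controller when the caller injects none,
  /// mirroring the quoted `controller ?? AgentController()` line.
  static AgentController resolveController({AgentController? controller}) {
    final agentController = controller ?? AgentController();
    return agentController;
  }
}

void main() {
  // Production-style call: no controller passed, default is created.
  print(PkmAgentSketch.resolveController().label); // default

  // Eval-style call: the injected instance is used as-is.
  final evalController = AgentController('eval');
  final resolved =
      PkmAgentSketch.resolveController(controller: evalController);
  print(identical(resolved, evalController)); // true
}
```

Existing callers that omit the argument take the first path unchanged, which is why the review treats the change as backward-compatible.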

Test Gaps

agent The new optional controller parameter has no direct unit test for the production code path (only indirectly validated via eval integration tests). Suggested check: Acceptable, because the eval tests cover this path; if more callers start using this parameter in the future, add unit tests.

AI review is advisory. Maintainers should verify the result before merging.

github-actions · 2026-05-28T13:42:11Z

PR Preflight Summary

Combined result: Low risk: both preflights completed and quality passed; use the normal manual merge flow.
Policy preflight: LOW RISK. No blocking, high-risk, or warning policy signal was found.
Flutter quality: PASS. Analyzer and test baselines found no newly introduced issue.
PR head: 4a8ef445f13ece62207c3079485a4006f3b96aec
Policy run: 26578118633
Flutter run: 26578119137

PR Policy Preflight

Decision: LOW RISK
Changed files: 7
Changed lines: 499
Diff truncated: false

No deterministic policy findings.

PR Flutter Quality

Overall: PASS
Analyzer baseline: PASS
Test baseline: PASS

Flutter Analyzer Baseline

Base issues: 307
PR issues: 307
New issues: 0

No new analyzer issues introduced by this PR.

Flutter Test Baseline

Base failures: 0
PR failures: 0
New failures: 0

No new Flutter test failures introduced by this PR.

eval(pkm): use framework transcript recording

4a8ef44

github-actions Bot added the "ai: low risk" label (AI review classified the PR as low risk) May 28, 2026

sparkleMing merged commit 1a37500 into main May 28, 2026
3 checks passed

sparkleMing merged 1 commit into main from codex/pkm-eval-transcript-refactor