Should agent completion be evaluated from final response or external evidence? #1673

productmakerjason · 2026-06-02T05:47:09Z

productmakerjason
Jun 2, 2026

Hi OpenAI Evals community — I’m testing an agent-eval boundary:

a model/agent saying “completed” is not the same as verified completion.

For agent workflows, what should an eval treat as the source of truth?

Final response, tool trace, target-system state, external scorer, or receipt-like evidence?

I’m trying to understand whether “completion proof” is better modeled as an eval template or app-level harness logic.