fix(sanity): parse OpenClaw trajectory format correctly by ScuttleBot · Pull Request #361 · pinchbench/skill

ScuttleBot · 2026-04-27T12:29:16Z

Problem

The sanity check grader was looking for type='message' with message.role='assistant', but OpenClaw trajectory files actually use type='model.completed' with data.assistantTexts[].

This caused ALL models to fail the sanity check with 0%, triggering fail-fast and aborting benchmark runs. We had 42 Vultr instances sitting idle because they all hit this.

Root Cause

Schema mismatch between what the grader expected and what OpenClaw actually produces:

Grader expected:

{"type": "message", "message": {"role": "assistant", "content": [...]}}

OpenClaw actually produces:

{"type": "model.completed", "data": {"assistantTexts": ["Hello, I'm ready!..."]}}
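
To make the mismatch concrete, here is a minimal sketch of a predicate in the style the grader expected, applied to both entry shapes shown above. The function name `old_style_match` is hypothetical and is not taken from the grader's source; it only illustrates why an entry of type `model.completed` can never satisfy a check written for `message` entries.

```python
def old_style_match(entry: dict) -> bool:
    """Hypothetical stand-in for the pre-fix check:
    type == 'message' and message.role == 'assistant'."""
    message = entry.get("message")
    if not isinstance(message, dict):
        return False
    return entry.get("type") == "message" and message.get("role") == "assistant"


# The shape the grader expected (content left empty for illustration).
expected_shape = {"type": "message", "message": {"role": "assistant", "content": []}}

# The shape OpenClaw actually produces, as quoted in this PR.
openclaw_shape = {
    "type": "model.completed",
    "data": {"assistantTexts": ["Hello, I'm ready!..."]},
}

print(old_style_match(expected_shape))  # True
print(old_style_match(openclaw_shape))  # False: the response is never found
```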

Fix

Updated the grader to check:

type='model.completed' with data.assistantTexts
type='trace.artifacts' with data.assistantTexts (final summary)
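
A minimal sketch of the updated check, assuming each trajectory entry has already been parsed into a dict. The names `has_assistant_response` and `RESPONSE_TYPES` are illustrative and not the grader's actual code; only the two entry types and the `data.assistantTexts` field come from this PR.

```python
from typing import Any, Iterable

# Entry types that may carry assistant text, per the fix described above.
RESPONSE_TYPES = {"model.completed", "trace.artifacts"}


def has_assistant_response(entries: Iterable[dict[str, Any]]) -> bool:
    """Return True if any entry of a relevant type holds non-blank assistant text."""
    for entry in entries:
        if entry.get("type") not in RESPONSE_TYPES:
            continue
        data = entry.get("data")
        if not isinstance(data, dict):
            continue
        texts = data.get("assistantTexts")
        if not isinstance(texts, list):
            continue
        # Ignore non-string items and whitespace-only strings.
        if any(isinstance(t, str) and t.strip() for t in texts):
            return True
    return False
```

Under these assumptions, a trajectory containing only old-style `message` entries returns `False`, while one with a single `model.completed` entry carrying text returns `True`.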

Testing

Verified against an actual transcript from a GPT-5.4 run that was incorrectly failing: the model responded correctly, but the grader could not find the response because of the schema mismatch.

The sanity check grader was looking for type='message' with message.role='assistant', but OpenClaw trajectory files use type='model.completed' with data.assistantTexts[]. This caused ALL models to fail the sanity check with 0%, triggering fail-fast and aborting benchmark runs. Updated the grader to check both model.completed and trace.artifacts entries for assistantTexts, matching the actual trajectory schema.

kilo-code-bot · 2026-04-27T12:30:10Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The fix is clean and correctly addresses the schema mismatch. The new checks properly handle both model.completed and trace.artifacts entry types from the OpenClaw trajectory format, and the any(t.strip() for t in texts) guard is a nice touch to avoid false positives from whitespace-only responses.
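
For illustration, the guard quoted above can be exercised in isolation. The wrapper name `has_non_blank` is hypothetical; the expression inside it is the one cited in the review. The sketch contrasts it with a plain truthiness test on the list, which would accept whitespace-only responses.

```python
def has_non_blank(texts: list[str]) -> bool:
    """True only if at least one string contains a non-whitespace character."""
    return any(t.strip() for t in texts)


whitespace_only = ["   ", "\n"]

print(bool(whitespace_only))           # True: a naive non-empty check would pass
print(has_non_blank(whitespace_only))  # False: the guard rejects it
print(has_non_blank(["", "ok"]))       # True
print(has_non_blank([]))               # False
```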

The new commit only bumps BENCHMARK_VERSION to rc4 — no logic changes.

Files Reviewed (2 files)

tasks/task_sanity.md
BENCHMARK_VERSION

Reviewed by claude-sonnet-4.6 · 110,925 tokens

bump BENCHMARK_VERSION to rc4

e37e8c6

ScuttleBot merged commit 5e9bc20 into main Apr 27, 2026
1 check passed

ScuttleBot deleted the fix/sanity-check-transcript-schema branch April 27, 2026 12:40

pull Bot pushed a commit to Stars1233/skill that referenced this pull request Apr 27, 2026

Revert "fix(sanity): parse OpenClaw trajectory format correctly (pinc…

0347a7f

…hbench#361)" This reverts commit 5e9bc20.

ScuttleBot merged 2 commits into main from fix/sanity-check-transcript-schema
