fix(sanity): parse OpenClaw trajectory format correctly #361

Merged
ScuttleBot merged 2 commits into main from fix/sanity-check-transcript-schema
Apr 27, 2026
@ScuttleBot
Contributor

Problem

The sanity check grader was looking for type='message' with message.role='assistant', but OpenClaw trajectory files actually use type='model.completed' with data.assistantTexts[].

This caused ALL models to fail the sanity check with 0%, triggering fail-fast and aborting benchmark runs. We had 42 Vultr instances sitting idle because they all hit this.

Root Cause

Schema mismatch between what the grader expected and what OpenClaw actually produces:

Grader expected:

{"type": "message", "message": {"role": "assistant", "content": [...]}}

OpenClaw actually produces:

{"type": "model.completed", "data": {"assistantTexts": ["Hello, I'm ready!..."]}}
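To make the mismatch concrete, here is a minimal sketch of why the old lookup could never match. `old_check` is a hypothetical reconstruction based only on the description above, not the grader's actual code, and the two sample entries are trimmed versions of the shapes shown.

```python
def old_check(entry: dict) -> bool:
    """Hypothetical reconstruction of the pre-fix predicate (an assumption)."""
    return (
        entry.get("type") == "message"
        and entry.get("message", {}).get("role") == "assistant"
    )


# Shape the grader expected.
expected = {"type": "message", "message": {"role": "assistant", "content": []}}

# Shape OpenClaw actually writes (text shortened for the example).
actual = {"type": "model.completed", "data": {"assistantTexts": ["Hello, I'm ready!"]}}

old_matches_expected = old_check(expected)  # True
old_matches_actual = old_check(actual)      # False: no "message" key, wrong type
```

Because the real entries never carry type='message', a predicate of this shape returns False for every line of the trajectory, which is consistent with every model scoring 0%.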

Fix

Updated the grader to check:

  1. type='model.completed' with data.assistantTexts
  2. type='trace.artifacts' with data.assistantTexts (final summary)
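The two checks above could be sketched roughly as follows. This is an illustration under assumptions, not the code in tasks/task_sanity.md: the function name `has_assistant_response`, the constant `ASSISTANT_ENTRY_TYPES`, and the premise that the trajectory has already been parsed into a list of dicts are all hypothetical.

```python
from typing import Any, Iterable

# Entry types that may carry data.assistantTexts, per the fix description.
ASSISTANT_ENTRY_TYPES = {"model.completed", "trace.artifacts"}


def has_assistant_response(entries: Iterable[dict[str, Any]]) -> bool:
    """Return True if any relevant entry contains non-blank assistant text."""
    for entry in entries:
        if entry.get("type") not in ASSISTANT_ENTRY_TYPES:
            continue
        data = entry.get("data")
        if not isinstance(data, dict):
            continue
        texts = data.get("assistantTexts")
        if not isinstance(texts, list):
            continue
        # Ignore non-string items and whitespace-only strings.
        if any(isinstance(t, str) and t.strip() for t in texts):
            return True
    return False


# Usage with hand-written sample entries (not taken from a real run).
sample = [
    {"type": "session.started", "data": {}},
    {"type": "model.completed", "data": {"assistantTexts": ["Hello, I'm ready!"]}},
]
found = has_assistant_response(sample)
```

The type and shape checks are defensive additions for the sketch; whether the real grader validates entries this strictly is not stated in this PR.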

Testing

Verified against actual transcript from a GPT-5.4 run that was incorrectly failing — the model responded correctly but the grader couldn't find the response due to the schema mismatch.

The sanity check grader was looking for type='message' with
message.role='assistant', but OpenClaw trajectory files use
type='model.completed' with data.assistantTexts[].

This caused ALL models to fail the sanity check with 0%, triggering
fail-fast and aborting benchmark runs.

Updated the grader to check both model.completed and trace.artifacts
entries for assistantTexts, matching the actual trajectory schema.
@kilo-code-bot
Contributor

kilo-code-bot Bot commented Apr 27, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The fix is clean and correctly addresses the schema mismatch. The new checks properly handle both model.completed and trace.artifacts entry types from the OpenClaw trajectory format, and the any(t.strip() for t in texts) guard is a nice touch to avoid false positives from whitespace-only responses.
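For readers unfamiliar with the guard mentioned above, a small sketch of how `any(t.strip() for t in texts)` differs from a plain truthiness test on the list. The helper name `has_nonblank` is hypothetical and the inputs are made up for illustration.

```python
def has_nonblank(texts: list[str]) -> bool:
    """True only if at least one string has non-whitespace characters."""
    return any(t.strip() for t in texts)


whitespace_only = ["   ", "\n"]

naive_result = bool(whitespace_only)            # True: the list is non-empty
guarded_result = has_nonblank(whitespace_only)  # False: every item strips to ""
mixed_result = has_nonblank(["", "ok"])         # True: "ok" survives strip()
empty_result = has_nonblank([])                 # False: any() of nothing
```

A plain `if texts:` would count a whitespace-only response as a pass, which is the false positive the guard avoids.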

The new commit only bumps BENCHMARK_VERSION to rc4 — no logic changes.

Files Reviewed (2 files)
  • tasks/task_sanity.md
  • BENCHMARK_VERSION

Reviewed by claude-sonnet-4.6 · 110,925 tokens

@ScuttleBot ScuttleBot merged commit 5e9bc20 into main Apr 27, 2026
1 check passed
@ScuttleBot ScuttleBot deleted the fix/sanity-check-transcript-schema branch April 27, 2026 12:40
pull Bot pushed a commit to Stars1233/skill that referenced this pull request Apr 27, 2026