Windows robustness for claude/codex backends (+ hardened JSON fallback) #79
Merged
…ubprocess encoding, tolerant JSON, test-eval dirs
Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend
(local Claude CLI) on Windows. None changes the OpenAI/GPT happy path.
1. skillopt/engine/trainer.py — the final test-eval directory
(test_eval_final/) is written to before being created; add
os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs.
Without it, summary.json raises FileNotFoundError when a rollout yields
zero predictions.
2. skillopt/model/claude_backend.py
a. Pass the prompt via stdin (not argv): on Windows the whole command line
is capped at ~32 KB and a large optimizer prompt (the success-analyst
minibatch carrying several report trajectories) overflows it with
[WinError 206], killing the run after retries.
b. Pass the system prompt via --append-system-prompt-file (a temp file),
not argv. The system prompt here is the skill being optimized, which
SkillOpt grows over training; since the ~32 KB cap applies to the SUM of
all argv, a grown skill would re-hit [WinError 206] even with the prompt
on stdin.
c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True
and no encoding=, stdin is encoded with the system codepage; on a zh-CN
box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters
raises UnicodeEncodeError before the CLI even starts, failing every retry.
3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its
subprocess.run(input=...) call (identical unpinned-encoding pattern).
4. skillopt/utils/json_utils.py — extract_json() returned None for valid-
looking JSON that strict json.loads rejects (unescaped ASCII quotes inside
CJK string values, trailing commas), silently dropping the analyst's edits
on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied.
Add a json_repair fallback, but only on a single unambiguous object — a
balanced-brace extractor plus a refuse-on-multiple-objects guard — so a
chain-of-thought "scratch + final" response can't make repair silently
return the wrong (discarded) object, which would be worse than None (None is
detectable and retryable; a wrong-but-valid edit is applied blind). Declare
json_repair in requirements.txt and the claude/qwen optional extras so the
fallback is actually present (it otherwise no-ops, dropping edits silently).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dca74a6)
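The claude_backend.py changes described above (prompt on stdin, system prompt in a temp file, pinned UTF-8) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this commit: `run_cli`, its parameters, and the base command are hypothetical. Only the `--append-system-prompt-file` flag and the `encoding="utf-8", errors="replace"` arguments come from the commit message.

```python
import os
import subprocess
import tempfile


def run_cli(cmd, prompt, system_prompt=None, timeout=600):
    """Run a CLI with the prompt on stdin and the system prompt in a temp file.

    `cmd` is the base argv of the CLI (hypothetical; supplied by the caller).
    Returns (returncode, stdout).
    """
    argv = list(cmd)
    sys_path = None
    try:
        if system_prompt:
            # delete=False and an explicit close, so the child process can
            # open the file on Windows; it is removed in the finally block.
            with tempfile.NamedTemporaryFile(
                "w", suffix=".txt", encoding="utf-8", delete=False
            ) as fh:
                fh.write(system_prompt)
                sys_path = fh.name
            # Only a short path goes on argv, however large the skill grows.
            argv += ["--append-system-prompt-file", sys_path]
        proc = subprocess.run(
            argv,
            input=prompt,        # large prompt goes through stdin, not argv
            capture_output=True,
            text=True,
            encoding="utf-8",    # do not fall back to the system codepage
            errors="replace",    # undecodable output bytes do not raise
            timeout=timeout,
        )
        return proc.returncode, proc.stdout
    finally:
        if sys_path and os.path.exists(sys_path):
            os.remove(sys_path)
```

With this shape, argv stays a few hundred bytes no matter how large the prompt or the skill becomes, and the encoding no longer depends on the machine's locale.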
Follow-up fixes on top of the cherry-picked Windows-robustness change:
1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not
just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"')
no longer starts a candidate object, so extract_json() returns None for
prose pseudo-JSON instead of repairing it into a bogus dict — which would
be strictly worse than dropping the edit, since extract_json feeds the
optimizer's skill edits.
2. Pick the repair candidate BEFORE importing json_repair, so the missing-
dependency RuntimeWarning only fires when there is genuinely a single
malformed object that could have been repaired. Ordinary no-JSON / prose
replies (the common case) now return None silently instead of warning on
every call.
3. Resolve dependency-metadata inconsistency: json_repair is optional, so add
it to the `all` extra (it was already in `claude`/`qwen`) and demote it
from a hard requirement to an optional/commented entry in requirements.txt,
matching the project's convention for backend-specific deps.
Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-
text, single-object repair, and multi-object ambiguity. Existing 22 json
tests still pass with and without json_repair installed.
Co-Authored-By: Claude <noreply@anthropic.com>
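The gating order described in this commit (string-aware scan, single-candidate check, and only then the optional import) might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: `top_level_brace_objects` and `extract_json_sketch` are hypothetical stand-ins for the real helpers, and the string tracking is deliberately simple, so an odd number of unescaped quotes inside a value can make it miss a candidate and return None.

```python
import json
import warnings


def top_level_brace_objects(text):
    """Return balanced top-level {...} spans, skipping braces inside "..." strings."""
    spans, depth, start = [], 0, None
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True          # applies outside objects too
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start : i + 1])
    return spans


def extract_json_sketch(text):
    """Strict parse first; tolerant repair only for one unambiguous object."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        pass
    candidates = top_level_brace_objects(text)
    if len(candidates) != 1:
        return None                   # no object, or ambiguous: stay silent
    candidate = candidates[0]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # Import only now, so the warning fires only when repair was possible.
    try:
        import json_repair
    except ImportError:
        warnings.warn(
            "json_repair is not installed; malformed JSON object dropped",
            RuntimeWarning,
        )
        return None
    repaired = json_repair.loads(candidate)
    return repaired if isinstance(repaired, dict) else None
```

Returning None for the ambiguous case keeps the failure detectable and retryable, which is the property the commit message argues for.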
Yif-Yang added a commit that referenced this pull request on Jun 23, 2026:
Add an Acknowledgements section crediting @samuelgoofus-boop for the Windows-robustness work on the Claude/Codex backends (originally #77, merged via #79).
Co-authored-by: Claude <noreply@anthropic.com>
Yif-Yang added a commit that referenced this pull request on Jun 23, 2026:
The contributor is already credited via the Co-authored-by trailer carried into main by #79; a dedicated README section is unnecessary.
Co-authored-by: Claude <noreply@anthropic.com>
Yif-Yang added a commit that referenced this pull request on Jun 23, 2026:
…82)
Follow-up to the string-aware brace scan: that change only skipped double-quoted prose, so brace-shaped text in single quotes, backticks, or bare prose (e.g. `{op: delete}`, '{x: 1}') still reached json_repair and was fabricated into a bogus dict. That is strictly worse than None, since extract_json feeds the optimizer's skill edits.
Add a _looks_json_like() guard before repair: a genuine JSON object's first non-space char after `{` is `"` (a key) or `}` (empty). Prose pseudo-objects start with a bare word and are rejected, while legitimate repair targets (trailing commas, unescaped quotes inside string values) all begin with `"` and pass, including objects whose string VALUES contain single quotes or backticks, which must not be rejected.
Found by an independent GPT-5.5 re-review of the merged #79 code.
Adds regression tests for single-quoted / backticked / bare prose (-> None) and for legitimate objects with quote/backtick string values (still repaired).
Tests: 30 pass (+3 skip) without json_repair, 33 pass with it, both clean under -W error::RuntimeWarning.
Co-authored-by: Claude <noreply@anthropic.com>
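The guard rule stated in this commit message (the first non-space character after `{` must be `"` or `}`) is small enough to sketch directly. The function below is a hypothetical reconstruction from that one-sentence description, not the merged `_looks_json_like()` itself.

```python
def looks_json_like(candidate):
    """Return True if the first non-space char after '{' is '"' or '}'.

    Hypothetical sketch of the guard described above: real JSON objects start
    with a double-quoted key or are empty; prose pseudo-objects such as
    {op: delete} start with a bare word and are rejected before any repair.
    """
    if not candidate.startswith("{"):
        return False
    rest = candidate[1:].lstrip()
    # rest[:1] is '' for a lone '{', which is not in the tuple.
    return rest[:1] in ('"', "}")
```

Because only the first character after the opening brace is inspected, single quotes or backticks that appear later inside string values do not affect the result.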
Summary
Brings in the Windows-robustness work from #77 (by @samuelgoofus-boop, cherry-picked with authorship preserved) plus follow-up fixes that harden the tolerant-JSON fallback it introduced. Supersedes #77.
From #77 (cherry-picked, credit to @samuelgoofus-boop)
- Pass the system prompt via a temp file (`--append-system-prompt-file`) and feed the user prompt via stdin instead of argv, so a grown skill (>~30 KB) no longer overflows the Windows ~32 KB command-line cap (WinError 206). Pin UTF-8 so a zh-CN codepage (cp936) can't raise `UnicodeEncodeError`.
- `encoding="utf-8", errors="replace"` on the subprocess.
- `os.makedirs(..., exist_ok=True)` before the three test-eval rollout dirs.
- `extract_json()` fallback via optional `json_repair`, gated to a single unambiguous object.
Follow-up fixes (this PR)
- `_top_level_brace_objects()` now ignores `{` inside quoted prose in its outer scan, not just inside an object. Prose like `'"set it to {x}"'` no longer gets "repaired" into a bogus dict; `extract_json()` returns `None`. This matters because `extract_json` feeds optimizer skill edits: a plausible-but-wrong object is strictly worse than dropping the edit. (Flagged by the automated reviewer on "Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs" #77.)
- The repair candidate is picked before importing `json_repair`, so the missing-dependency `RuntimeWarning` only fires when there is genuinely a repairable single object. Ordinary no-JSON replies return `None` silently.
- `json_repair` is optional: added to the `all` extra (already in `claude`/`qwen`) and demoted from a hard `requirements.txt` pin to the commented optional convention.
Test plan
- `tests/test_json_utils.py`: 22 existing pass with and without `json_repair` installed (previously emitted 4 `RuntimeWarning`s without it; now 0).
- New regression tests: prose-with-braces → `None`, no-warning-on-plain-text, single-object repair, multi-object ambiguity → `None`.
- `claude --append-system-prompt-file` + stdin prompt verified live against Claude Code `2.1.183`; behavior identical to the old inline flag.
- `text=True` + `encoding`/`errors` combo verified valid on Python 3.10.
- `import skillopt` clean.
🤖 Generated with Claude Code