Skip to content

Windows robustness for claude/codex backends (+ hardened JSON fallback)#79

Merged
Yif-Yang merged 2 commits into
mainfrom
fix/windows-robustness-pr77
Jun 23, 2026
Merged

Windows robustness for claude/codex backends (+ hardened JSON fallback)#79
Yif-Yang merged 2 commits into
mainfrom
fix/windows-robustness-pr77

Conversation

@Yif-Yang

Copy link
Copy Markdown
Contributor

Summary

Brings in the Windows-robustness work from #77 (by @samuelgoofus-boop, cherry-picked with authorship preserved) plus follow-up fixes that harden the tolerant-JSON fallback it introduced. Supersedes #77.

From #77 (cherry-picked, credit to @samuelgoofus-boop)

  • Claude backend: write the system prompt to a temp file (--append-system-prompt-file) and feed the user prompt via stdin instead of argv, so a grown skill (>~30 KB) no longer overflows the Windows ~32 KB command-line cap (WinError 206). Pin UTF-8 so a zh-CN codepage (cp936) can't raise UnicodeEncodeError.
  • Codex backend: pin encoding="utf-8", errors="replace" on the subprocess.
  • Trainer: os.makedirs(..., exist_ok=True) before the three test-eval rollout dirs.
  • JSON: tolerant extract_json() fallback via optional json_repair, gated to a single unambiguous object.

Follow-up fixes (this PR)

  1. String-aware brace scan_top_level_brace_objects() now ignores { inside quoted prose in its outer scan, not just inside an object. Prose like '"set it to {x}"' no longer gets "repaired" into a bogus dict; extract_json() returns None. This matters because extract_json feeds optimizer skill edits — a plausible-but-wrong object is strictly worse than dropping the edit. (Flagged by the automated reviewer on Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs #77.)
  2. No warning spam — pick the repair candidate before importing json_repair, so the missing-dependency RuntimeWarning only fires when there is genuinely a repairable single object. Ordinary no-JSON replies return None silently.
  3. Dependency metadatajson_repair is optional: added to the all extra (already in claude/qwen) and demoted from a hard requirements.txt pin to the commented optional convention.

Test plan

  • tests/test_json_utils.py — 22 existing pass with and without json_repair installed (previously emitted 4 RuntimeWarnings without it; now 0).
  • New regression tests: prose-with-braces → None, no-warning-on-plain-text, single-object repair, multi-object ambiguity → None.
  • claude --append-system-prompt-file + stdin prompt verified live against Claude Code 2.1.183; behavior identical to the old inline flag.
  • Codex text=True + encoding/errors combo verified valid on Python 3.10.
  • All four changed modules compile; import skillopt clean.

Note: a pre-existing --schema--json-schema issue in claude_backend.py (rejected by current Claude Code) is not addressed here and deserves its own PR.

🤖 Generated with Claude Code

samuelgoofus-boop and others added 2 commits June 23, 2026 10:28
…ubprocess encoding, tolerant JSON, test-eval dirs

Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend
(local Claude CLI) on Windows. None changes the OpenAI/GPT happy path.

1. skillopt/engine/trainer.py — the final test-eval directory
   (test_eval_final/) is written to before being created; add
   os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs.
   Without it, summary.json raises FileNotFoundError when a rollout yields
   zero predictions.

2. skillopt/model/claude_backend.py
   a. Pass the prompt via stdin (not argv): on Windows the whole command line
      is capped at ~32 KB and a large optimizer prompt (the success-analyst
      minibatch carrying several report trajectories) overflows it with
      [WinError 206], killing the run after retries.
   b. Pass the system prompt via --append-system-prompt-file (a temp file),
      not argv. The system prompt here is the skill being optimized, which
      SkillOpt grows over training; since the ~32 KB cap applies to the SUM of
      all argv, a grown skill would re-hit [WinError 206] even with the prompt
      on stdin.
   c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True
      and no encoding=, stdin is encoded with the system codepage; on a zh-CN
      box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters
      raises UnicodeEncodeError before the CLI even starts, failing every retry.

3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its
   subprocess.run(input=...) call (identical unpinned-encoding pattern).

4. skillopt/utils/json_utils.py — extract_json() returned None for valid-
   looking JSON that strict json.loads rejects (unescaped ASCII quotes inside
   CJK string values, trailing commas), silently dropping the analyst's edits
   on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied.
   Add a json_repair fallback, but only on a single unambiguous object — a
   balanced-brace extractor plus a refuse-on-multiple-objects guard — so a
   chain-of-thought "scratch + final" response can't make repair silently
   return the wrong (discarded) object, which would be worse than None (None is
   detectable and retryable; a wrong-but-valid edit is applied blind). Declare
   json_repair in requirements.txt and the claude/qwen optional extras so the
   fallback is actually present (it otherwise no-ops, dropping edits silently).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dca74a6)
Follow-up fixes on top of the cherry-picked Windows-robustness change:

1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not
   just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"')
   no longer starts a candidate object, so extract_json() returns None for
   prose pseudo-JSON instead of repairing it into a bogus dict — which would
   be strictly worse than dropping the edit, since extract_json feeds the
   optimizer's skill edits.

2. Pick the repair candidate BEFORE importing json_repair, so the missing-
   dependency RuntimeWarning only fires when there is genuinely a single
   malformed object that could have been repaired. Ordinary no-JSON / prose
   replies (the common case) now return None silently instead of warning on
   every call.

3. Resolve dependency-metadata inconsistency: json_repair is optional, so add
   it to the `all` extra (it was already in `claude`/`qwen`) and demote it
   from a hard requirement to an optional/commented entry in requirements.txt,
   matching the project's convention for backend-specific deps.

Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-
text, single-object repair, and multi-object ambiguity. Existing 22 json
tests still pass with and without json_repair installed.

Co-Authored-By: Claude <noreply@anthropic.com>
@Yif-Yang Yif-Yang merged commit 14c045f into main Jun 23, 2026
3 of 4 checks passed
@Yif-Yang Yif-Yang deleted the fix/windows-robustness-pr77 branch June 23, 2026 11:00
Yif-Yang added a commit that referenced this pull request Jun 23, 2026
Add an Acknowledgements section crediting @samuelgoofus-boop for the
Windows-robustness work on the Claude/Codex backends (originally #77,
merged via #79).

Co-authored-by: Claude <noreply@anthropic.com>
Yif-Yang added a commit that referenced this pull request Jun 23, 2026
The contributor is already credited via the Co-authored-by trailer carried
into main by #79; a dedicated README section is unnecessary.

Co-authored-by: Claude <noreply@anthropic.com>
Yif-Yang added a commit that referenced this pull request Jun 23, 2026
…82)

Follow-up to the string-aware brace scan: that change only skipped
double-quoted prose, so brace-shaped text in single quotes, backticks, or
bare prose (e.g. `{op: delete}`, '{x: 1}') still reached json_repair and was
fabricated into a bogus dict — strictly worse than None, since extract_json
feeds the optimizer's skill edits.

Add a _looks_json_like() guard before repair: a genuine JSON object's first
non-space char after `{` is `"` (a key) or `}` (empty). Prose pseudo-objects
start with a bare word and are rejected, while legitimate repair targets
(trailing commas, unescaped quotes inside string values) all begin with `"`
and pass — including objects whose string VALUES contain single quotes or
backticks, which must not be rejected.

Found by an independent GPT-5.5 re-review of the merged #79 code. Adds
regression tests for single-quoted / backticked / bare prose (-> None) and
for legitimate objects with quote/backtick string values (still repaired).
Tests: 30 pass (+3 skip) without json_repair, 33 pass with it, both clean
under -W error::RuntimeWarning.

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants