Windows robustness for claude/codex backends (+ hardened JSON fallback) by Yif-Yang · Pull Request #79 · microsoft/SkillOpt

Yif-Yang · 2026-06-23T10:34:21Z

Summary

Brings in the Windows-robustness work from #77 (by @samuelgoofus-boop, cherry-picked with authorship preserved) plus follow-up fixes that harden the tolerant-JSON fallback it introduced. Supersedes #77.

From #77 (cherry-picked, credit to @samuelgoofus-boop)

Claude backend: write the system prompt to a temp file (--append-system-prompt-file) and feed the user prompt via stdin instead of argv, so a grown skill (>~30 KB) no longer overflows the Windows ~32 KB command-line cap (WinError 206). Pin UTF-8 so a zh-CN codepage (cp936) can't raise UnicodeEncodeError.
Codex backend: pin encoding="utf-8", errors="replace" on the subprocess.
Trainer: os.makedirs(..., exist_ok=True) before the three test-eval rollout dirs.
JSON: tolerant extract_json() fallback via optional json_repair, gated to a single unambiguous object.

Follow-up fixes (this PR)

String-aware brace scan — _top_level_brace_objects() now ignores { inside quoted prose in its outer scan, not just inside an object. Prose like '"set it to {x}"' no longer gets "repaired" into a bogus dict; extract_json() returns None. This matters because extract_json feeds optimizer skill edits — a plausible-but-wrong object is strictly worse than dropping the edit. (Flagged by the automated reviewer on Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs #77.)
No warning spam — pick the repair candidate before importing json_repair, so the missing-dependency RuntimeWarning only fires when there is genuinely a repairable single object. Ordinary no-JSON replies return None silently.
Dependency metadata — json_repair is optional: added to the all extra (already in claude/qwen) and demoted from a hard requirements.txt pin to the commented optional convention.

Test plan

tests/test_json_utils.py — 22 existing pass with and without json_repair installed (previously emitted 4 RuntimeWarnings without it; now 0).
New regression tests: prose-with-braces → None, no-warning-on-plain-text, single-object repair, multi-object ambiguity → None.
claude --append-system-prompt-file + stdin prompt verified live against Claude Code 2.1.183; behavior identical to the old inline flag.
Codex text=True + encoding/errors combo verified valid on Python 3.10.
All four changed modules compile; import skillopt clean.

Note: a pre-existing --schema → --json-schema issue in claude_backend.py (rejected by current Claude Code) is not addressed here and deserves its own PR.

🤖 Generated with Claude Code

…ubprocess encoding, tolerant JSON, test-eval dirs Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend (local Claude CLI) on Windows. None changes the OpenAI/GPT happy path. 1. skillopt/engine/trainer.py — the final test-eval directory (test_eval_final/) is written to before being created; add os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs. Without it, summary.json raises FileNotFoundError when a rollout yields zero predictions. 2. skillopt/model/claude_backend.py a. Pass the prompt via stdin (not argv): on Windows the whole command line is capped at ~32 KB and a large optimizer prompt (the success-analyst minibatch carrying several report trajectories) overflows it with [WinError 206], killing the run after retries. b. Pass the system prompt via --append-system-prompt-file (a temp file), not argv. The system prompt here is the skill being optimized, which SkillOpt grows over training; since the ~32 KB cap applies to the SUM of all argv, a grown skill would re-hit [WinError 206] even with the prompt on stdin. c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True and no encoding=, stdin is encoded with the system codepage; on a zh-CN box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters raises UnicodeEncodeError before the CLI even starts, failing every retry. 3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its subprocess.run(input=...) call (identical unpinned-encoding pattern). 4. skillopt/utils/json_utils.py — extract_json() returned None for valid- looking JSON that strict json.loads rejects (unescaped ASCII quotes inside CJK string values, trailing commas), silently dropping the analyst's edits on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied. Add a json_repair fallback, but only on a single unambiguous object — a balanced-brace extractor plus a refuse-on-multiple-objects guard — so a chain-of-thought "scratch + final" response can't make repair silently return the wrong (discarded) object, which would be worse than None (None is detectable and retryable; a wrong-but-valid edit is applied blind). Declare json_repair in requirements.txt and the claude/qwen optional extras so the fallback is actually present (it otherwise no-ops, dropping edits silently). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit dca74a6)

Follow-up fixes on top of the cherry-picked Windows-robustness change: 1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"') no longer starts a candidate object, so extract_json() returns None for prose pseudo-JSON instead of repairing it into a bogus dict — which would be strictly worse than dropping the edit, since extract_json feeds the optimizer's skill edits. 2. Pick the repair candidate BEFORE importing json_repair, so the missing- dependency RuntimeWarning only fires when there is genuinely a single malformed object that could have been repaired. Ordinary no-JSON / prose replies (the common case) now return None silently instead of warning on every call. 3. Resolve dependency-metadata inconsistency: json_repair is optional, so add it to the `all` extra (it was already in `claude`/`qwen`) and demote it from a hard requirement to an optional/commented entry in requirements.txt, matching the project's convention for backend-specific deps. Adds regression tests for prose-with-braces (-> None), no-warning-on-plain- text, single-object repair, and multi-object ambiguity. Existing 22 json tests still pass with and without json_repair installed. Co-Authored-By: Claude <noreply@anthropic.com>

@samuelgoofus-boop

Add an Acknowledgements section crediting @samuelgoofus-boop for the Windows-robustness work on the Claude/Codex backends (originally #77, merged via #79). Co-authored-by: Claude <noreply@anthropic.com>

The contributor is already credited via the Co-authored-by trailer carried into main by #79; a dedicated README section is unnecessary. Co-authored-by: Claude <noreply@anthropic.com>

…82) Follow-up to the string-aware brace scan: that change only skipped double-quoted prose, so brace-shaped text in single quotes, backticks, or bare prose (e.g. `{op: delete}`, '{x: 1}') still reached json_repair and was fabricated into a bogus dict — strictly worse than None, since extract_json feeds the optimizer's skill edits. Add a _looks_json_like() guard before repair: a genuine JSON object's first non-space char after `{` is `"` (a key) or `}` (empty). Prose pseudo-objects start with a bare word and are rejected, while legitimate repair targets (trailing commas, unescaped quotes inside string values) all begin with `"` and pass — including objects whose string VALUES contain single quotes or backticks, which must not be rejected. Found by an independent GPT-5.5 re-review of the merged #79 code. Adds regression tests for single-quoted / backticked / bare prose (-> None) and for legitimate objects with quote/backtick string values (still repaired). Tests: 30 pass (+3 skip) without json_repair, 33 pass with it, both clean under -W error::RuntimeWarning. Co-authored-by: Claude <noreply@anthropic.com>

samuelgoofus-boop and others added 2 commits June 23, 2026 10:28

Yif-Yang merged commit 14c045f into main Jun 23, 2026
3 of 4 checks passed

Yif-Yang deleted the fix/windows-robustness-pr77 branch June 23, 2026 11:00

Yif-Yang mentioned this pull request Jun 23, 2026

docs(readme): acknowledge community contributor @samuelgoofus-boop #80

Merged

This was referenced Jun 23, 2026

Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs #77

Closed

docs(readme): remove Acknowledgements section #81

Merged

Yif-Yang mentioned this pull request Jun 23, 2026

fix(json_utils): reject prose pseudo-JSON in single quotes/backticks #82

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Windows robustness for claude/codex backends (+ hardened JSON fallback)#79

Windows robustness for claude/codex backends (+ hardened JSON fallback)#79
Yif-Yang merged 2 commits into
mainfrom
fix/windows-robustness-pr77

Yif-Yang commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Yif-Yang commented Jun 23, 2026

Summary

From #77 (cherry-picked, credit to @samuelgoofus-boop)

Follow-up fixes (this PR)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants