fix(stage1): clean-install the app before launch (clears stale per-run state) #91
Merged
…n state)

installAndLaunch did `simctl install` / `adb install -r` without removing the prior install, so app/Keychain state accumulated across runs. The iOS app persists an auth token; each VISUAL run recreates the Rails DB, so the stale token is invalid → the app errors to a "Back to Start Screen" launch state and fails Stage 1's renders-cleanly rubric. (Confirmed: a manual `simctl uninstall` between runs fixed it; this automates that.)

Prepend a best-effort uninstall before install on both platforms:

  iOS:     xcrun simctl uninstall booted <bundleId>
  Android: adb uninstall <package>

Errors are ignored (e.g. app not installed on the first run), so each run gets a clean install. Reported `command` strings include the uninstall step.

First of the Stage-1/Stage-2 capture-stability fixes (settle-wait before capture + capture-retry are separate follow-ups). Code is unit-tested here; the end-to-end determinism win shows once the others land too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
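The uninstall-then-install sequence described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `TargetPlatform`, `InstallStep`, `cleanInstallSteps`, and `runSteps` are hypothetical names, the real `installAndLaunch` signature is not shown in the PR, and the `"; "` separator in the reported string is an assumption.

```typescript
// Hypothetical shapes; the PR does not show installAndLaunch's real signature.
type TargetPlatform = "ios" | "android";

interface InstallStep {
  command: string;
  // When true, a failure is swallowed (e.g. the app is not installed yet).
  ignoreErrors: boolean;
}

// Build the best-effort uninstall followed by the install for one platform.
// Note: paths are interpolated unquoted, which is fine for a sketch only.
function cleanInstallSteps(
  platform: TargetPlatform,
  appId: string,
  artifactPath: string,
): InstallStep[] {
  if (platform === "ios") {
    return [
      { command: `xcrun simctl uninstall booted ${appId}`, ignoreErrors: true },
      { command: `xcrun simctl install booted ${artifactPath}`, ignoreErrors: false },
    ];
  }
  return [
    { command: `adb uninstall ${appId}`, ignoreErrors: true },
    { command: `adb install -r ${artifactPath}`, ignoreErrors: false },
  ];
}

// Run the steps through an injected runner so the policy is testable
// without a simulator. Returns the reported command string, which
// includes the uninstall step.
function runSteps(steps: InstallStep[], run: (command: string) => void): string {
  for (const step of steps) {
    try {
      run(step.command);
    } catch (err) {
      if (!step.ignoreErrors) throw err;
    }
  }
  return steps.map((s) => s.command).join("; ");
}
```

Marking the uninstall step as ignorable keeps the first run (nothing installed yet) from failing, while a genuine install failure still propagates.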
dadachi added a commit that referenced this pull request May 23, 2026
…blind sleep) (#92)

Stage 1 waited a fixed 3s after launch then captured once, so the judged frame could still be mid-launch / mid-transition, which is the main source of the "renders-cleanly" flakiness (the welcome+list overlap reads, etc.).

Replace the single capture with waitForStableCapture: after the initial wait, poll captures until two consecutive frames are byte-identical (the screen has settled), capped at stabilityTimeoutMs (default 8s; on cap the last frame is used). Structural transitions produce very different consecutive frames, so the poll waits them out; truly-animated screens hit the cap and degrade to today's behavior (no worse).

The loop is dependency-injected (captureOnce / sleep / now) so it's unit-tested without a sim.

Second of the Stage-1/2 stability fixes (after #91 clean-install; capture-retry on a judge fail is the remaining one).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
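A stability poll of the kind described above might look like the sketch below. The names `waitForStableCapture`, `captureOnce`, `sleep`, `now`, and `stabilityTimeoutMs` come from the commit message; the interface shapes, `initialWaitMs`, `pollIntervalMs`, the `settled` flag, and the `framesEqual` helper are assumptions made for illustration, not the merged implementation.

```typescript
// Injected dependencies, so the loop can run against a fake clock in tests.
interface StableCaptureDeps {
  captureOnce: () => Promise<Uint8Array>;
  sleep: (ms: number) => Promise<void>;
  now: () => number;
}

// Option names other than stabilityTimeoutMs are hypothetical.
interface StableCaptureOptions {
  initialWaitMs: number;
  pollIntervalMs: number;
  stabilityTimeoutMs: number;
}

// Byte-for-byte comparison of two captured frames.
function framesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// Poll until two consecutive frames are identical, or the cap is reached.
// On cap, the last captured frame is returned with settled = false.
async function waitForStableCapture(
  deps: StableCaptureDeps,
  opts: StableCaptureOptions,
): Promise<{ frame: Uint8Array; settled: boolean }> {
  await deps.sleep(opts.initialWaitMs);
  const deadline = deps.now() + opts.stabilityTimeoutMs;
  let previous = await deps.captureOnce();
  while (deps.now() < deadline) {
    await deps.sleep(opts.pollIntervalMs);
    const current = await deps.captureOnce();
    if (framesEqual(previous, current)) {
      return { frame: current, settled: true };
    }
    previous = current;
  }
  return { frame: previous, settled: false };
}
```

Because time only advances through the injected `sleep` and `now`, a test can drive the loop with a fake clock and a scripted list of frames, with no simulator involved.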
dadachi added a commit that referenced this pull request May 23, 2026
…il (#93)

Final Stage-1 stability fix. After settling (#92) the judged frame can still occasionally be bad; if Layer 3 fails *only* on render quality (renders-cleanly), re-settle + re-judge up to maxJudgeRetries (default 1); a fresh frame may render clean.

Crucially this does NOT retry deterministic content failures: if a content criterion like no-substrate-leak failed, re-capturing the same screen can't change it, so isTransientRenderFail returns false and no judge pass is wasted.

judgeWithRetry + isTransientRenderFail are dependency-injected / pure, so the retry policy is unit-tested without a sim or the real vision judge (retry on transient, no retry on content, cap respected, initial-capture-failure path).

Completes the three stability fixes (#91 clean-install, #92 settle, this); end-to-end validation is a real-device run.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
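The retry policy above could be expressed along these lines. `judgeWithRetry`, `isTransientRenderFail`, and `maxJudgeRetries` are named in the commit message; the `JudgeVerdict` shape, the `failedCriteria` field, the `captureAndJudge` callback, and the `attempts` counter are assumptions for this sketch, and the initial-capture-failure path mentioned in the commit is not modelled.

```typescript
// Hypothetical verdict shape: the names of the rubric criteria that failed.
interface JudgeVerdict {
  pass: boolean;
  failedCriteria: string[];
}

// Criteria treated as render-quality (possibly transient) failures.
const RENDER_QUALITY_CRITERIA = ["renders-cleanly"];

// Pure predicate: transient only when something failed and every failed
// criterion is a render-quality one. Any content failure makes it false.
function isTransientRenderFail(verdict: JudgeVerdict): boolean {
  if (verdict.pass || verdict.failedCriteria.length === 0) return false;
  return verdict.failedCriteria.every((c) => RENDER_QUALITY_CRITERIA.includes(c));
}

// Re-run the injected settle+capture+judge step while the failure is
// transient, up to maxJudgeRetries extra passes.
async function judgeWithRetry(
  captureAndJudge: () => Promise<JudgeVerdict>,
  maxJudgeRetries: number = 1,
): Promise<{ verdict: JudgeVerdict; attempts: number }> {
  let verdict = await captureAndJudge();
  let attempts = 1;
  while (isTransientRenderFail(verdict) && attempts <= maxJudgeRetries) {
    verdict = await captureAndJudge();
    attempts++;
  }
  return { verdict, attempts };
}
```

With the default of one retry, a run that fails only on renders-cleanly is judged at most twice, while a run that also fails no-substrate-leak is judged exactly once.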
Merged
dadachi added a commit that referenced this pull request May 23, 2026
Validation-harness stability: Stage 1/2 capture and iOS Stage 2 reliability. Internal to the agent's `NATIVEAPPTEMPLATE_VISUAL` validation; no CLI flags or generated-output changes since 0.2.0.

Since 0.2.0:
- fix: clean-install the app before launch (clears stale per-run state) (#91)
- fix: settle the screen before judging (stability poll vs blind sleep) (#92)
- fix: retry the capture+judge on a transient render-quality fail (#93)
- chore: target iPhone 17 Pro simulator instead of iPhone 17 (#94)
- fix: Stage 2 foregrounds the app + dismisses the post-sign-in Keychain dialog (#95)
- fix: recover from the intermittent iOS launch error + dismiss the paid org modal (#96)

Outcome: paid VISUAL=2 now runs fully green (Layer 3 2/2: iOS Stage 2 46/46, Android 48/48), validated end-to-end on a real sim/emulator.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First of the Stage-1/Stage-2 capture-stability fixes (the open work item after the 0.2.0 release). Addresses the failure we're most certain about.
Problem
`installAndLaunch` installed without removing the prior install (`simctl install`, `adb install -r`), so app + Keychain state accumulated across runs. The iOS app persists an auth token; every VISUAL run recreates the Rails DB (`db:prepare`), so the stale token is invalid → the app errors to a "Back to Start Screen" launch state → Stage 1's `renders-cleanly` rubric fails (and Stage 2 is skipped). Confirmed during the `sentova` runs: a manual `simctl uninstall` between runs fixed it; this automates that.
Fix
Best-effort uninstall before install, both platforms:
- iOS: `xcrun simctl uninstall booted <bundleId>`
- Android: `adb uninstall <package>`
Errors are ignored (e.g. app not installed on the first run), so each run gets a clean install. The reported `command` strings include the uninstall step.
Tests
`xcrun simctl uninstall booted …`. `npm run ci` → 75/75.
Scope
Code is unit-tested here. The end-to-end determinism win (a clean 2/2) also needs the other two stability fixes (settle-wait before the Stage 1 screenshot and capture-retry on a renders-cleanly fail), which are separate follow-up PRs. Validating the full outcome requires real-device runs, which are slow and will themselves stay flaky until all three fixes land.
🤖 Generated with Claude Code