Recovery tool for Codex App 0.130→0.131 startup crash: SQLx checksum drift + 30s backfill timeout

### What version of the Codex App are you using (From “About Codex” dialog)?

- App: 26.519.2081.0 (`Codex.exe` ProductVersion: 26.519.21041)
- Bundled backend CLI (Linux ELF run inside WSL2): codex-cli 0.131.0-alpha.9
- Last known-good version before this crash: 0.130.0-alpha.5

### What subscription do you have?

ChatGPT Pro

### What platform is your computer?

Microsoft Windows NT 10.0.26200.0 x64

Detail: Windows 11 Enterprise, build 26200, 64-bit. Codex App backend runs as a Linux ELF inside WSL2 (Ubuntu-24.04, kernel 6.6.87.2-microsoft-standard-WSL2).

### What issue are you seeing?

After Codex Desktop auto-updated from `0.130.0-alpha.5` to `0.131.0-alpha.9`, the app refuses to start. The first user-facing dialog is:

![Redacted screenshot of Codex local database startup error](https://raw.githubusercontent.com/xdifu/codex-repair/main/docs/assets/codex-local-db-dialog-redacted.png)

```text
Codex cannot access its local database.

The app cannot finish launching until its SQLite database is accessible.

Database path: /mnt/c/Users/<user>/.codex/state_5.sqlite: failed to initialize state runtime at /mnt/c/Users/<user>/.codex

Close other Codex applications, then click Retry to check whether access has been restored.

Error: (code=1, signal=null).
Most recent error: Error: failed to initialize sqlite state db at /mnt/c/Users/<user>/.codex/state_5.sqlite: failed to initialize state runtime at /mnt/c/Users/<user>/.codex: migration 1 was previously applied but has been modified
```

Two distinct bugs fire in sequence:

### Symptom A (fires immediately on launch)

```text
Codex cannot access its local database
  Location: /mnt/c/Users/<user>/.codex/state_5.sqlite
  Cause: failed to initialize state runtime at /mnt/c/Users/<user>/.codex:
         migration 1 was previously applied but has been modified
```

The "Repair Codex local data now? [y/N]" prompt is destructive — accepting wipes thread metadata. Declining leaves Codex unusable.

### Symptom B (fires after Symptom A is patched)

```text
timed out waiting for state db backfill at /mnt/c/Users/<user>/.codex
  after 30s (status: running)
```

The GUI gives up after 30 s even though the backend's own backfill lease is 900 s (PR #11377). On my install (325 active + 40 archived sessions, ~3.5 GB of session jsonl in total), the backend would have completed in ~50 s.

### Root cause A — `logs_2.sqlite` migrations modified in place

The SQL bytes of migration 1 (`logs`) and migration 2 (`logs feedback log body`) were edited in place between `0.130.x` and `0.131.x`. sqlx hashes each migration's SQL bytes with SHA-384 at build time, stores the hash in the binary, and refuses any DB whose stored checksum doesn't match — even when the resulting final table schema is fully forward-compatible. This violates sqlx's documented "migrations are immutable once published" contract.

Concrete checksum drift (full anchor list in attached `codex-checksums-0.131.0.json`):

| Migration | DB-stored hash (post-0.130) | Binary-embedded hash (0.131) |
|-----------|------------------------------|------------------------------|
| `logs_2` m1 `logs`                    | `F477E605…` | `009639EAFE599BE9…` |
| `logs_2` m2 `logs feedback log body` | `5C82B1A6…` | `CF6C93AF074A9022…` |

All 32 `state_5.sqlite` migration checksums *do* match between versions, so this is isolated to `logs_2.sqlite`. Both old and new SQL produce the same final 12-column `logs` table — the difference is in the SQL bytes themselves, not in any schema-meaningful change.
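
To make the failure mode concrete, here is a minimal Python sketch of how a byte-level edit changes a SHA-384 checksum while leaving the resulting schema untouched. The two SQL strings are hypothetical stand-ins, not the real `logs_2` migrations, and the two-column table is an illustration only.

```python
import hashlib
import sqlite3

# Two hypothetical versions of the same migration: identical schema,
# different bytes (a comment and some whitespace were edited).
OLD_SQL = "CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT);\n"
NEW_SQL = (
    "-- logs table\n"
    "CREATE TABLE logs (\n"
    "    id INTEGER PRIMARY KEY,\n"
    "    message TEXT\n"
    ");\n"
)


def checksum(sql: str) -> str:
    """SHA-384 over the raw SQL bytes, hex-encoded."""
    return hashlib.sha384(sql.encode("utf-8")).hexdigest()


def columns(sql: str) -> list[tuple[str, str]]:
    """Apply the SQL to a scratch in-memory DB and return (name, type) per column."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql)
        return [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(logs)")]
    finally:
        conn.close()


print("checksums equal:", checksum(OLD_SQL) == checksum(NEW_SQL))
print("schemas equal:  ", columns(OLD_SQL) == columns(NEW_SQL))
```

The checksums differ while the column lists match, which is the same situation a checksum-only validator cannot tell apart from a real schema change.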

### Root cause B — hard-coded 30 s GUI backfill cap

The GUI startup gate waits for `state_5.sqlite.backfill_state.status='complete'` with a hard-coded 30 s deadline. The string literal `"timed out waiting for state db backfill at {} after {}s (status: {})"` is in the binary, but there is no env var, config knob, or CLI flag controlling that 30 s. The backend's own lease is 900 s, so the GUI cap is internally inconsistent with the backend's design.

For users with non-trivial session histories (a few hundred MB+), a cold backfill routinely takes 30–120 s, so the GUI gives up while the backend is still making progress.
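
The startup gate described above behaves like a polling loop with a fixed deadline. The following Python sketch models that behavior against a scratch database; it is not the GUI's code. The single-row `backfill_state` layout and the function name `wait_for_backfill` are assumptions, and the demo uses a 0.2 s deadline so it finishes quickly.

```python
import sqlite3
import time


def wait_for_backfill(conn, timeout_secs, poll_secs=0.05):
    """Poll backfill_state until status is 'complete' or the deadline passes.

    Returns the last observed status, so the caller can tell a timeout
    with status 'running' apart from a finished backfill.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        row = conn.execute("SELECT status FROM backfill_state LIMIT 1").fetchone()
        status = row[0] if row else "missing"
        if status == "complete" or time.monotonic() >= deadline:
            return status
        time.sleep(poll_secs)


# Demo against a scratch DB (assumed layout: one row, one status column).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE backfill_state (status TEXT NOT NULL)")
conn.execute("INSERT INTO backfill_state (status) VALUES ('running')")

timed_out_status = wait_for_backfill(conn, timeout_secs=0.2)  # work still in progress
conn.execute("UPDATE backfill_state SET status = 'complete'")
final_status = wait_for_backfill(conn, timeout_secs=0.2)
print(timed_out_status, final_status)
```

A timeout that returns `running` means the work is still in progress, not that it failed. That is the case where a caller could reasonably keep waiting up to the backend's lease instead of aborting.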

### `0.132.0` does NOT fix either

The [`rust-v0.132.0` changelog](https://github.com/openai/codex/releases/tag/rust-v0.132.0) (latest as of 2026-05-20) lists 24 bug fixes; none touches the migrator validation path or the GUI startup timeout.

The closest upstream work is [PR #16924](https://github.com/openai/codex/pull/16924) (merged 2026-04-06), which relaxes the migrator only when the DB has migrations the binary doesn't know about (DB newer than binary). That does NOT cover the symmetric case here, where the binary has the same migration's SQL bytes hashing to a different value than what's recorded in the DB.

### What steps can reproduce the bug?

1. Install Codex Desktop `0.130.0-alpha.5` (or any 0.130.x release) and use it daily for ≥1 day so `logs_2.sqlite` accumulates rows and `_sqlx_migrations` is populated with checksums computed from that version's migration SQL bytes.
2. Let auto-update jump to `0.131.x` (`0.131.0-alpha.9` in my case via MSIX `OpenAI.Codex 26.519.2081.0`).
3. Launch Codex → immediate `migration 1 was previously applied but has been modified` crash (Symptom A).
4. Manually `UPDATE _sqlx_migrations SET checksum = ?` in `logs_2.sqlite` for `version IN (1, 2)` with the binary-expected values, then relaunch. With > ~50 MB total session jsonl on disk: GUI hits the 30 s backfill timeout (Symptom B).

A reproducible recovery toolkit at https://github.com/xdifu/codex-repair extracts the binary-expected checksums automatically (via SHA-384 anchor scanning + DB description-based cluster localization) and applies the schema-verified fixes. `python codex-repair.py doctor` against any affected install reports the same drift; `python codex-repair.py extract-checksums --json` produces the full evidence list attached here.
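
For readers who want to check for drift by hand, the comparison amounts to reading `_sqlx_migrations.checksum` and comparing it with the hashes the binary expects. This is a minimal sketch, not the toolkit's implementation: `find_drift` is a hypothetical helper, the demo table is trimmed to two columns, and the hashes come from placeholder SQL rather than the real migrations.

```python
import hashlib
import sqlite3


def find_drift(conn, expected_hex):
    """Return {version: (stored_hex, expected_hex)} for rows whose checksum differs."""
    drift = {}
    rows = conn.execute("SELECT version, checksum FROM _sqlx_migrations ORDER BY version")
    for version, blob in rows:
        stored = bytes(blob).hex().upper()
        want = expected_hex.get(version)
        if want is not None and stored != want.upper():
            drift[version] = (stored, want.upper())
    return drift


# Placeholder SQL stands in for the real migrations.
old_sql = {
    1: b"CREATE TABLE logs (id INTEGER);",
    2: b"ALTER TABLE logs ADD COLUMN body TEXT;",
}
new_sql = {1: old_sql[1] + b"\n", 2: old_sql[2]}  # migration 1 edited in place

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _sqlx_migrations (version BIGINT PRIMARY KEY, checksum BLOB NOT NULL)")
for version, sql in old_sql.items():
    conn.execute(
        "INSERT INTO _sqlx_migrations VALUES (?, ?)",
        (version, hashlib.sha384(sql).digest()),
    )

expected = {v: hashlib.sha384(sql).hexdigest() for v, sql in new_sql.items()}
drift = find_drift(conn, expected)
print(sorted(drift))
```

Run it against a copy of the database rather than the live file, since Codex may hold the original open.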

### Investigation summary

1. Located the actual backend binary: `%USERPROFILE%\.codex\bin\wsl\<hash>\codex` (Linux ELF in WSL2, not the MSIX-bundled Windows `Codex.exe`). The crash path `/mnt/c/...` was the WSL view of the Windows drive.
2. Extracted all 34 embedded migration checksums (32 for `state_5.sqlite`, 2 for `logs_2.sqlite`) from the backend ELF by scanning for `(sql, sha384(sql))` byte-adjacency anchors.
3. Diffed extracted checksums against each DB's `_sqlx_migrations` rows — found mismatch only for `logs_2` m1 and m2.
4. Verified the actual `logs` table schema (`PRAGMA table_info(logs)` shows all 12 expected columns including `feedback_log_body`, `thread_id`, `process_uuid`, `estimated_bytes`) is fully compatible with both the old and new migration SQL — proving the change is cosmetic.
5. Rewrote `_sqlx_migrations.checksum` for the 2 affected rows. Symptom A cleared; Symptom B appeared.
6. Confirmed Symptom B's 30 s timeout is hard-coded by grepping the binary for the timeout literal and related env-var names; no config path exists.
7. Backfilled manually from Python (parsing `sessions/**/*.jsonl` first-line `session_meta`), bypassing the 30 s GUI cap, then `UPDATE backfill_state SET status='complete'`. Codex started cleanly with full thread history intact.
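
The anchor scan in step 2 can be sketched as follows. This simplified version assumes the 48-byte SHA-384 digest immediately follows the SQL bytes and that each migration starts with a known marker. The real binary layout may differ, and `find_anchors` is a hypothetical name, not a function from the toolkit.

```python
import hashlib

DIGEST_LEN = 48  # SHA-384 digest size in bytes


def find_anchors(blob: bytes, start_marker: bytes, max_sql_len: int = 4096):
    """Yield (sql_bytes, digest) pairs where sha384(sql_bytes) directly follows the SQL.

    For each occurrence of start_marker, try every candidate end offset up to
    max_sql_len and accept the first one whose trailing 48 bytes match the hash.
    """
    pos = blob.find(start_marker)
    while pos != -1:
        limit = min(len(blob) - DIGEST_LEN, pos + max_sql_len)
        for end in range(pos + len(start_marker), limit + 1):
            sql = blob[pos:end]
            candidate = blob[end:end + DIGEST_LEN]
            if hashlib.sha384(sql).digest() == candidate:
                yield sql, candidate
                break
        pos = blob.find(start_marker, pos + 1)


# Synthetic "binary": padding, one SQL string, its digest, more padding.
sql = b"CREATE TABLE logs (id INTEGER PRIMARY KEY);"
blob = b"\x00" * 16 + sql + hashlib.sha384(sql).digest() + b"\xff" * 16

anchors = list(find_anchors(blob, b"CREATE TABLE"))
print(len(anchors))
```

Because a match requires a full 48-byte hash to agree with the preceding bytes, false positives are not a practical concern. The cost is repeated hashing, which the `max_sql_len` bound keeps manageable.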

The full archeology and the 5-phase timeline are in [`docs/root-cause-analysis.md`](https://github.com/xdifu/codex-repair/blob/main/docs/root-cause-analysis.md).
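
The manual backfill in step 7 only needs the first line of each session file, which is why it stays fast even on a multi-GB history. Below is a minimal sketch of that read path. The record shape (`{"type": "session_meta", "payload": {...}}`) and the helper name `read_session_meta` are assumptions for illustration. Check them against a real session file before relying on them.

```python
import json
import tempfile
from pathlib import Path


def read_session_meta(sessions_dir):
    """Yield (path, payload) for each .jsonl file whose first line is a session_meta record."""
    for path in sorted(Path(sessions_dir).rglob("*.jsonl")):
        with path.open("r", encoding="utf-8") as fh:
            first_line = fh.readline()  # never reads the rest of the file
        try:
            record = json.loads(first_line)
        except json.JSONDecodeError:
            continue  # skip empty or corrupt files
        if isinstance(record, dict) and record.get("type") == "session_meta":
            yield path, record.get("payload", {})


# Demo with a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    day_dir = Path(tmp) / "sessions" / "2026" / "05"
    day_dir.mkdir(parents=True)
    meta = {"type": "session_meta", "payload": {"id": "thread-1"}}
    event = {"type": "event"}
    (day_dir / "a.jsonl").write_text(
        json.dumps(meta) + "\n" + json.dumps(event) + "\n", encoding="utf-8"
    )
    (day_dir / "b.jsonl").write_text("not json\n", encoding="utf-8")
    found = [(p.name, payload) for p, payload in read_session_meta(Path(tmp) / "sessions")]

print(found)
```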


### What is the expected behavior?

**Symptom A**: a Codex App update should never fail to open a `0.130.x`-created `logs_2.sqlite` when the final table schema is fully forward-compatible. Either:

- **Fix 1 (preferred, OpenAI-internal hygiene):** never modify a published migration. Express the new desired schema as `003_…sql` / `004_…sql` rather than editing `001_…sql` / `002_…sql` in place. This is the canonical sqlx approach and avoids any client-side compatibility shim.

- **Fix 2 (defensive, in `codex-rs/state/src/runtime.rs`):** in the migrator's `VersionMismatch` arm, resolve the binary-expected SQL for the failing migration, parse its `CREATE TABLE` / `ALTER TABLE` / `ADD COLUMN` targets, run `PRAGMA table_info(<table>)` on the live DB, and if every expected column is already present, `UPDATE _sqlx_migrations SET checksum = <new_hash>` and log a warning. This is symmetric to [PR #16924](https://github.com/openai/codex/pull/16924), which already forgives the *opposite* direction.

- **Fix 3 (minimum-effort escape hatch):** accept `CODEX_TOLERATE_MODIFIED_MIGRATIONS=1` env var that triggers Fix 2's code path explicitly.
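
The logic behind Fix 2 can be prototyped in Python before anyone writes the Rust. One simplification, stated plainly: instead of parsing the migration SQL, this sketch replays the binary's migrations into a scratch in-memory database and reads the expected columns from there. The function names and the placeholder SQL are hypothetical. Only the `feedback_log_body` column name comes from the real schema.

```python
import hashlib
import sqlite3


def table_columns(conn, table):
    """Column names reported by PRAGMA table_info (table name is trusted input here)."""
    return {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}


def reconcile_checksums(live, table, migrations):
    """Rewrite checksums only if the live table already has every expected column.

    migrations: ordered list of (version, sql_text) as shipped in the new binary.
    Returns the versions whose stored checksum was rewritten.
    """
    scratch = sqlite3.connect(":memory:")
    try:
        for _, sql in migrations:
            scratch.executescript(sql)
        expected = table_columns(scratch, table)
    finally:
        scratch.close()

    missing = expected - table_columns(live, table)
    if missing:
        raise RuntimeError(f"refusing to rewrite checksums; missing columns: {sorted(missing)}")

    updated = []
    for version, sql in migrations:
        digest = hashlib.sha384(sql.encode("utf-8")).digest()
        cur = live.execute(
            "UPDATE _sqlx_migrations SET checksum = ? WHERE version = ? AND checksum != ?",
            (digest, version, digest),
        )
        if cur.rowcount:
            updated.append(version)
    live.commit()
    return updated


OLD = [
    (1, "CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT);"),
    (2, "ALTER TABLE logs ADD COLUMN feedback_log_body TEXT;"),
]
NEW = [(1, "CREATE TABLE logs (\n  id INTEGER PRIMARY KEY,\n  message TEXT\n);"), OLD[1]]

live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE _sqlx_migrations (version BIGINT PRIMARY KEY, checksum BLOB NOT NULL)")
for version, sql in OLD:
    live.executescript(sql)
    live.execute(
        "INSERT INTO _sqlx_migrations VALUES (?, ?)",
        (version, hashlib.sha384(sql.encode("utf-8")).digest()),
    )
live.commit()

updated = reconcile_checksums(live, "logs", NEW)
print(updated)
```

The guard is the important part: if the new SQL would add a column the live table lacks, the function raises instead of papering over a real schema difference.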

**Symptom B**: the GUI's 30 s backfill cap should either be removed (let the backend run to its own 900 s lease boundary with a progress indicator) or made configurable via `~/.codex/config.toml` (e.g. `[startup] backfill_timeout_secs = 30`). The current 30 s vs 900 s mismatch between GUI and backend is internally inconsistent and breaks any install with non-trivial session history.


### Additional information

### Platform reach

This is **not a Windows-only bug**, even though Windows users hit it first/worst:

- **Symptom A** (sqlx migration checksum drift): 100% platform-agnostic. The
  migration SQL bytes and their SHA-384 hashes are computed at compile time from
  `codex-rs/state/migrations/logs_2/*.sql` and baked into the Rust binary, so
  every platform's binary embeds the same hashes. Any user — macOS, Linux, or
  Windows — going `0.130.x → 0.131.x` with prior `logs_2.sqlite` history hits
  the same `migration 1 was previously applied but has been modified` wall.
  Mac users have already reported drift symptoms in #20608, #18364, #17304.

- **Symptom B** (30 s GUI backfill timeout): the 30 s constant itself is also
  in the cross-platform Rust source. But the *practical* trigger rate is much
  higher on Windows because the backend runs inside WSL2 and accesses
  `sessions/*.jsonl` via the 9P protocol over `/mnt/c/` — roughly 5–10× slower
  than native APFS/ext4. A 200-MB history that backfills in ~8 s on macOS will
  routinely take 40–120 s on Windows. Mac users with multi-GB histories or
  spinning-rust HDDs are still latently affected.

A fix should target the cross-platform Rust source (`codex-rs/state/src/runtime.rs`),
not any Windows-specific code path.

### User-side recovery (already works today, no upstream fix needed)

A standalone Python toolkit is published under Apache-2.0 at **https://github.com/xdifu/codex-repair**.

Capabilities:

1. Auto-locates the active backend binary (no hard-coded hash subdir).
2. Extracts every embedded migration checksum by scanning the ELF for `(sql, sha384(sql))` anchors, using DB-known migration `description` strings as a cluster locator (robust across future binary versions — no version-pinned constants).
3. Diffs against each DB's `_sqlx_migrations`.
4. **Verifies schema compatibility** via `PRAGMA table_info` before rewriting any checksum (refuses unsafe updates).
5. Reproduces backfill in Python independent of the 30 s GUI cap, then marks `backfill_state.status='complete'` with full session metadata.
6. Has a `--use-isolated-copy` mode that copies the DBs to a temp dir before reading, so it's safe to run a diagnose pass while Codex is open.

Usage by an affected user is one command:

```powershell
python codex-repair.py fix --apply
```

Run history on my install: full repair from initial crash to healthy state (365 threads indexed, no errors) completed in under 5 minutes once root cause was identified.
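
For anyone who wants to inspect the databases by hand while Codex is open, the idea behind an isolated copy can be sketched with Python's standard library. This version uses SQLite's online backup API (`Connection.backup`) rather than a plain file copy, because a file copy of a WAL-mode database can miss pages that are still in the `-wal` file. This is my own sketch, not necessarily how `--use-isolated-copy` is implemented, and `isolated_copy` is a hypothetical helper name.

```python
import sqlite3
import tempfile
from pathlib import Path


def isolated_copy(src_path, dest_dir):
    """Snapshot a SQLite DB into dest_dir via the backup API and return the new path."""
    src_path = Path(src_path).resolve()
    dest_path = Path(dest_dir) / src_path.name
    # Open the source read-only so this helper cannot modify the live file.
    src = sqlite3.connect(src_path.as_uri() + "?mode=ro", uri=True)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()
    return dest_path


# Demo with a throwaway database.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "logs_demo.sqlite"
    seed = sqlite3.connect(original)
    seed.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
    seed.executemany("INSERT INTO logs (message) VALUES (?)", [("a",), ("b",)])
    seed.commit()
    seed.close()

    copy_dir = Path(tmp) / "isolated"
    copy_dir.mkdir()
    copy_path = isolated_copy(original, copy_dir)

    check = sqlite3.connect(copy_path)
    copied_rows = check.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
    check.close()

print(copied_rows)
```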

### Related upstream issues / PRs

- **#23251** — `WSL CLI cannot share Windows Codex App CODEX_HOME: migration 1 was previously applied but has been modified` (open; my own earlier report; describes the WSL-sharing subset of this same root cause)
- **#17304** — `Desktop project sidebar hides active threads after state DB migration drift` (open; family of related drift bugs)
- **#17354**, **#17540**, **#18364**, **#19873** — overlapping sidebar / thread-disappearing reports stemming from `_sqlx_migrations` drift after auto-update
- **#16924** — `fix(sqlite): don't hard fail migrator if DB is newer` (merged; opposite direction)
- **#11377** — `feat: prevent double backfill` (introduced 900 s lease — the figure that makes the 30 s GUI cap so glaringly inconsistent)
- **#16877** — `Make thread metadata updates tolerate pending backfill` (open)
- **#13772** — `Move sqlite logs to a dedicated database` (context for why `logs_2.sqlite` exists as a separate DB from `state_5.sqlite`)

### Attached evidence

`codex-checksums-0.131.0.json` — full list of all 34 migration anchors (32 state_5 + 2 logs_2) extracted from my `0.131.0-alpha.9` backend binary at `%USERPROFILE%\.codex\bin\wsl\7945a00f33bdc140\codex`. Anyone with the same backend version can reproduce by running:

```bash
python codex-repair.py extract-checksums --json > codex-checksums-0.131.0.json
```

and diffing against mine to confirm identical hashes per migration.

### Note

I am happy to contribute the schema-diff helper or a `CODEX_TOLERATE_MODIFIED_MIGRATIONS` runtime flag as a PR upstream if maintainers consider Fix 2 or Fix 3 the right direction — per the contributing guide, I'll wait for an explicit invitation before opening one.


Recovery tool for Codex App 0.130→0.131 startup crash: SQLx checksum drift + 30s backfill timeout #23787
