igerber · igerber · Jun 1, 2026 · Jun 1, 2026 · Jun 1, 2026
diff --git a/.claude/commands/ai-review-local.md b/.claude/commands/ai-review-local.md
@@ -16,7 +16,7 @@ Two backends are supported:
 
 | Backend | Latency | Cost | Quality |
 |---|---|---|---|
-| `api` (`gpt-5.4`) | 30-60s | $0.05-0.50/run, metered via `OPENAI_API_KEY` | Single-shot — won't grep, can't load files on its own initiative |
+| `api` (`gpt-5.5`) | 30-60s | ~$0.10-1.00/run, metered via `OPENAI_API_KEY` | Single-shot — won't grep, can't load files on its own initiative |
 | `codex` (any auth) | 3-15 min | depends on your `codex login` mode (subscription vs API key) — see codex docs | Agentic — matches CI Codex reviewer, can grep / load files / multi-turn |
 
 Choose with `--backend {auto,codex,api}` (default `auto`):
@@ -66,8 +66,8 @@ Notes:
   *Api backend only.*
 - `--force-fresh`: Skip delta-diff mode, run a full fresh review even if previous state exists
 - `--full-registry`: Include the entire REGISTRY.md instead of selective sections
-- `--model <name>`: Override the model (default: `gpt-5.4`). Applies to both backends.
-- `--timeout <seconds>`: HTTP request timeout. If omitted, defaults to 900 for reasoning models (gpt-5.4, *-pro, o1/o3/o4) and 300 otherwise. *Api backend only.*
+- `--model <name>`: Override the model (default: `gpt-5.5`). Applies to both backends.
+- `--timeout <seconds>`: HTTP request timeout. If omitted, defaults to 900 for reasoning models (gpt-5.4/gpt-5.5, *-pro, o1/o3/o4) and 300 otherwise. *Api backend only.*
 - `--dry-run`: Print the compiled prompt without invoking the chosen backend
   (no API call, no codex subprocess)
 
@@ -107,7 +107,7 @@ Step 5 invokes the chosen backend:
 
 Parse `$ARGUMENTS` for the optional flags listed above. All flags are optional —
 the default behavior (auto-detect backend, standard context for api or
-agentic loading for codex, selective registry, gpt-5.4)
+agentic loading for codex, selective registry, gpt-5.5)
 requires no arguments.
 
 ### Step 2: Validate Prerequisites
@@ -417,8 +417,8 @@ would be silently ignored.
 Note: `--force-fresh` is a skill-only flag — it controls whether delta diffs are
 generated in Step 4 and is NOT passed to the script.
 
-**Reasoning model handling:** If the model is `gpt-5.4`, contains `-pro`, or starts with
-`o1`/`o3`/`o4` (e.g., `gpt-5.4`, `gpt-5.4-pro`, `o3`, `o4-mini`):
+**Reasoning model handling:** If the model is `gpt-5.4`/`gpt-5.5`, contains `-pro`, or starts with
+`o1`/`o3`/`o4` (e.g., `gpt-5.5`, `gpt-5.4-pro`, `o3`, `o4-mini`):
 - The script auto-resolves `--timeout` to 900s for reasoning models when omitted, so
   no extra flag is required unless overriding
 - Run the Bash command with `run_in_background: true` (bypasses the 600s Bash tool timeout cap)

diff --git a/.claude/scripts/openai_review.py b/.claude/scripts/openai_review.py
@@ -871,6 +871,8 @@ def apply_token_budget(
 PRICING = {
     "gpt-5.4": (2.50, 15.00),
     "gpt-5.4-pro": (30.00, 180.00),
+    "gpt-5.5": (5.00, 30.00),
+    "gpt-5.5-pro": (30.00, 180.00),
     "gpt-4.1": (2.00, 8.00),
     "gpt-4.1-mini": (0.40, 1.60),
     "o3": (2.00, 8.00),
@@ -1140,7 +1142,7 @@ def compile_prompt(
 # ---------------------------------------------------------------------------
 
 ENDPOINT = "https://api.openai.com/v1/responses"
-DEFAULT_MODEL = "gpt-5.4"
+DEFAULT_MODEL = "gpt-5.5"
 DEFAULT_TIMEOUT = 300  # seconds
 REASONING_TIMEOUT = 900  # seconds
 DEFAULT_MAX_TOKENS = 16384
@@ -1149,13 +1151,13 @@ def compile_prompt(
 
 def _is_reasoning_model(model: str) -> bool:
     """Return True for models that use internal chain-of-thought reasoning."""
-    return model.startswith(("o1", "o3", "o4", "gpt-5.4")) or "-pro" in model
+    return model.startswith(("o1", "o3", "o4", "gpt-5.4", "gpt-5.5")) or "-pro" in model
 
 
 def _resolve_timeout(timeout: "int | None", model: str) -> int:
     """Auto-resolve omitted --timeout based on model class.
 
-    Reasoning models (o1/o3/o4/gpt-5.4/*-pro) get REASONING_TIMEOUT (900s).
+    Reasoning models (o1/o3/o4/gpt-5.4/gpt-5.5/*-pro) get REASONING_TIMEOUT (900s).
     Non-reasoning models get DEFAULT_TIMEOUT (300s).
     Explicit values are passed through unchanged.
     """
@@ -1321,7 +1323,7 @@ def _build_codex_cmd(
 ) -> "list[str]":
     """Construct the argv for `codex exec`.
 
-    Pinned to match the CI Codex action's invocation (gpt-5.4 + xhigh effort +
+    Pinned to match the CI Codex action's invocation (gpt-5.5 + xhigh effort +
     read-only sandbox) so local reviews give CI-equivalent quality.
 
     NOTE: the effort key MUST be `model_reasoning_effort`, not
@@ -1705,7 +1707,7 @@ def main() -> None:
         default=None,
         help=(
             f"HTTP request timeout in seconds (api backend only). If omitted, "
-            f"defaults to {REASONING_TIMEOUT} for reasoning models (gpt-5.4, "
+            f"defaults to {REASONING_TIMEOUT} for reasoning models (gpt-5.4/gpt-5.5, "
             f"*-pro, o1/o3/o4) and {DEFAULT_TIMEOUT} otherwise. Ignored under "
             f"--backend codex (codex manages its own time budget)."
         ),

diff --git a/.github/workflows/ai_pr_review.yml b/.github/workflows/ai_pr_review.yml
@@ -516,7 +516,7 @@ jobs:
           sandbox: read-only
           safety-strategy: drop-sudo
           # Recommended by OpenAI for review quality/consistency:
-          model: gpt-5.4
+          model: gpt-5.5
           effort: xhigh
 
       - name: Post PR comment (new on every event except initial open)

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,6 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 - **Covariate names that collide with reserved structural terms now raise `ValueError` instead of silently corrupting the coefficient dict (`DifferenceInDifferences`, `MultiPeriodDiD`, `TwoWayFixedEffects`).** These estimators build their `coefficients` dict by zipping a variable-name list -- structural term names PLUS the user covariate column names appended verbatim -- with the fitted coefficient vector. A covariate whose name equaled a reserved structural name (`const`; the treatment/time column names; the `{treatment}:{time}` interaction; MultiPeriodDiD `period_{p}` dummies and `{treatment}:period_{p}` interactions; `TwoWayFixedEffects` `ATT`; fixed-effect / unit / time dummy names; or an internal `_`-prefixed working column such as `_treat_time` / `_did_treatment` / `_treatment_post`) silently **overwrote** that structural coefficient via Python dict last-write-wins -- e.g. a covariate named `const` dropped the intercept -- with no error or warning. A new shared `validate_covariate_names` helper (`diff_diff/utils.py`) is now called in each of the three `fit()` methods before the design matrix is built; it raises `ValueError` on a collision (the comparison is case-sensitive, so e.g. `Const` is still allowed) **and** on duplicate names within `covariates` (which collapse to a single dict entry the same way). Fixed-effect/unit/time dummy reserved names are taken from the same `pd.get_dummies(..., drop_first=True)` call used to build them, so they match exactly (including for pandas `Categorical` columns with a non-default category order). For `TwoWayFixedEffects` the guard fires on **all** variance paths: the default within-transform path returns only `{"ATT": att}` (no covariate is a dict key there), but a covariate named `_treatment_post` would still clobber the internal interaction column, so guarding both paths is uniform and forward-compatible. **Potentially breaking:** a fit that previously *succeeded* with a colliding (or duplicated) covariate name -- silently returning a corrupted coefficient dict -- now raises; rename the covariate column(s). The staggered / influence-function estimators (CallawaySantAnna, SunAbraham, StaggeredTripleDifference, EfficientDiD, TwoStageDiD, ImputationDiD, WooldridgeDiD, dCDH, StackedDiD) key results by `(g, t)` tuples / relative-time indices, never covariate names, and `TripleDifference` / `SyntheticControl` / `SyntheticDiD` do not expose covariates by name, so none are affected. New tests in `tests/test_utils.py`, `tests/test_estimators.py`, and `tests/test_estimators_vcov_type.py`.
 
+### Changed
+- **CI + local AI PR-reviewer model upgraded `gpt-5.4` → `gpt-5.5`.** The CI Codex reviewer (`.github/workflows/ai_pr_review.yml`) and the local `/ai-review-local` default (`.claude/scripts/openai_review.py` `DEFAULT_MODEL`) now run `gpt-5.5` @ `xhigh` effort / `read-only` sandbox (all other invocation settings unchanged). Validated empirically before the swap via the `tools/reviewer-eval/` A/B harness: on a real-bug corpus plus a k=6 big-diff de-risk, `gpt-5.5` matched-or-beat `gpt-5.4` on every test-backed recall case (including a bug buried in a ~3k-line methodology diff), added zero false positives, and ran faster; an end-to-end CI canary confirmed the action environment (`openai/codex-action@v1`, codex CLI 0.135.0) runs `gpt-5.5` and catches a planted P0. `gpt-5.5` is also added to the reasoning-model set (`_is_reasoning_model`, for the api-backend timeout/token limits) and to the `PRICING` table at OpenAI's confirmed standard rates (`gpt-5.5` $5/$30, `gpt-5.5-pro` $30/$180 per 1M input/output tokens) — the production CI + local reviewer run `gpt-5.5` via the flat-rate **codex backend**, but `--backend auto` falls back to the metered API path when the codex CLI is unavailable, so the cost estimate must stay accurate there. `gpt-5.4` remains accepted.
+
 ## [3.5.0] - 2026-06-01
 
 ### Added