Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 6 additions & 6 deletions .claude/commands/ai-review-local.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ Two backends are supported:

| Backend | Latency | Cost | Quality |
|---|---|---|---|
| `api` (`gpt-5.4`) | 30-60s | $0.05-0.50/run, metered via `OPENAI_API_KEY` | Single-shot — won't grep, can't load files on its own initiative |
| `api` (`gpt-5.5`) | 30-60s | ~$0.10-1.00/run, metered via `OPENAI_API_KEY` | Single-shot — won't grep, can't load files on its own initiative |
| `codex` (any auth) | 3-15 min | depends on your `codex login` mode (subscription vs API key) — see codex docs | Agentic — matches CI Codex reviewer, can grep / load files / multi-turn |

Choose with `--backend {auto,codex,api}` (default `auto`):
Expand Down Expand Up @@ -66,8 +66,8 @@ Notes:
*Api backend only.*
- `--force-fresh`: Skip delta-diff mode, run a full fresh review even if previous state exists
- `--full-registry`: Include the entire REGISTRY.md instead of selective sections
- `--model <name>`: Override the model (default: `gpt-5.4`). Applies to both backends.
- `--timeout <seconds>`: HTTP request timeout. If omitted, defaults to 900 for reasoning models (gpt-5.4, *-pro, o1/o3/o4) and 300 otherwise. *Api backend only.*
- `--model <name>`: Override the model (default: `gpt-5.5`). Applies to both backends.
- `--timeout <seconds>`: HTTP request timeout. If omitted, defaults to 900 for reasoning models (gpt-5.4/gpt-5.5, *-pro, o1/o3/o4) and 300 otherwise. *Api backend only.*
- `--dry-run`: Print the compiled prompt without invoking the chosen backend
(no API call, no codex subprocess)

Expand Down Expand Up @@ -107,7 +107,7 @@ Step 5 invokes the chosen backend:

Parse `$ARGUMENTS` for the optional flags listed above. All flags are optional —
the default behavior (auto-detect backend, standard context for api or
agentic loading for codex, selective registry, gpt-5.4)
agentic loading for codex, selective registry, gpt-5.5)
requires no arguments.

### Step 2: Validate Prerequisites
Expand Down Expand Up @@ -417,8 +417,8 @@ would be silently ignored.
Note: `--force-fresh` is a skill-only flag — it controls whether delta diffs are
generated in Step 4 and is NOT passed to the script.

**Reasoning model handling:** If the model is `gpt-5.4`, contains `-pro`, or starts with
`o1`/`o3`/`o4` (e.g., `gpt-5.4`, `gpt-5.4-pro`, `o3`, `o4-mini`):
**Reasoning model handling:** If the model is `gpt-5.4`/`gpt-5.5`, contains `-pro`, or starts with
`o1`/`o3`/`o4` (e.g., `gpt-5.5`, `gpt-5.4-pro`, `o3`, `o4-mini`):
- The script auto-resolves `--timeout` to 900s for reasoning models when omitted, so
no extra flag is required unless overriding
- Run the Bash command with `run_in_background: true` (bypasses the 600s Bash tool timeout cap)
Expand Down
12 changes: 7 additions & 5 deletions .claude/scripts/openai_review.py
Original file line number Diff line number Diff line change
Expand Up @@ -871,6 +871,8 @@ def apply_token_budget(
PRICING = {
"gpt-5.4": (2.50, 15.00),
"gpt-5.4-pro": (30.00, 180.00),
"gpt-5.5": (5.00, 30.00),
"gpt-5.5-pro": (30.00, 180.00),
"gpt-4.1": (2.00, 8.00),
"gpt-4.1-mini": (0.40, 1.60),
"o3": (2.00, 8.00),
Expand Down Expand Up @@ -1140,7 +1142,7 @@ def compile_prompt(
# ---------------------------------------------------------------------------

ENDPOINT = "https://api.openai.com/v1/responses"
DEFAULT_MODEL = "gpt-5.4"
DEFAULT_MODEL = "gpt-5.5"
DEFAULT_TIMEOUT = 300 # seconds
REASONING_TIMEOUT = 900 # seconds
DEFAULT_MAX_TOKENS = 16384
Expand All @@ -1149,13 +1151,13 @@ def compile_prompt(

def _is_reasoning_model(model: str) -> bool:
"""Return True for models that use internal chain-of-thought reasoning."""
return model.startswith(("o1", "o3", "o4", "gpt-5.4")) or "-pro" in model
return model.startswith(("o1", "o3", "o4", "gpt-5.4", "gpt-5.5")) or "-pro" in model


def _resolve_timeout(timeout: "int | None", model: str) -> int:
"""Auto-resolve omitted --timeout based on model class.

Reasoning models (o1/o3/o4/gpt-5.4/*-pro) get REASONING_TIMEOUT (900s).
Reasoning models (o1/o3/o4/gpt-5.4/gpt-5.5/*-pro) get REASONING_TIMEOUT (900s).
Non-reasoning models get DEFAULT_TIMEOUT (300s).
Explicit values are passed through unchanged.
"""
Expand Down Expand Up @@ -1321,7 +1323,7 @@ def _build_codex_cmd(
) -> "list[str]":
"""Construct the argv for `codex exec`.

Pinned to match the CI Codex action's invocation (gpt-5.4 + xhigh effort +
Pinned to match the CI Codex action's invocation (gpt-5.5 + xhigh effort +
read-only sandbox) so local reviews give CI-equivalent quality.

NOTE: the effort key MUST be `model_reasoning_effort`, not
Expand Down Expand Up @@ -1705,7 +1707,7 @@ def main() -> None:
default=None,
help=(
f"HTTP request timeout in seconds (api backend only). If omitted, "
f"defaults to {REASONING_TIMEOUT} for reasoning models (gpt-5.4, "
f"defaults to {REASONING_TIMEOUT} for reasoning models (gpt-5.4/gpt-5.5, "
f"*-pro, o1/o3/o4) and {DEFAULT_TIMEOUT} otherwise. Ignored under "
f"--backend codex (codex manages its own time budget)."
),
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/ai_pr_review.yml
Original file line number Diff line number Diff line change
Expand Up @@ -516,7 +516,7 @@ jobs:
sandbox: read-only
safety-strategy: drop-sudo
# Recommended by OpenAI for review quality/consistency:
model: gpt-5.4
model: gpt-5.5
effort: xhigh

- name: Post PR comment (new on every event except initial open)
Expand Down
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Fixed
- **Covariate names that collide with reserved structural terms now raise `ValueError` instead of silently corrupting the coefficient dict (`DifferenceInDifferences`, `MultiPeriodDiD`, `TwoWayFixedEffects`).** These estimators build their `coefficients` dict by zipping a variable-name list -- structural term names PLUS the user covariate column names appended verbatim -- with the fitted coefficient vector. A covariate whose name equaled a reserved structural name (`const`; the treatment/time column names; the `{treatment}:{time}` interaction; MultiPeriodDiD `period_{p}` dummies and `{treatment}:period_{p}` interactions; `TwoWayFixedEffects` `ATT`; fixed-effect / unit / time dummy names; or an internal `_`-prefixed working column such as `_treat_time` / `_did_treatment` / `_treatment_post`) silently **overwrote** that structural coefficient via Python dict last-write-wins -- e.g. a covariate named `const` dropped the intercept -- with no error or warning. A new shared `validate_covariate_names` helper (`diff_diff/utils.py`) is now called in each of the three `fit()` methods before the design matrix is built; it raises `ValueError` on a collision (the comparison is case-sensitive, so e.g. `Const` is still allowed) **and** on duplicate names within `covariates` (which collapse to a single dict entry the same way). Fixed-effect/unit/time dummy reserved names are taken from the same `pd.get_dummies(..., drop_first=True)` call used to build them, so they match exactly (including for pandas `Categorical` columns with a non-default category order). For `TwoWayFixedEffects` the guard fires on **all** variance paths: the default within-transform path returns only `{"ATT": att}` (no covariate is a dict key there), but a covariate named `_treatment_post` would still clobber the internal interaction column, so guarding both paths is uniform and forward-compatible. **Potentially breaking:** a fit that previously *succeeded* with a colliding (or duplicated) covariate name -- silently returning a corrupted coefficient dict -- now raises; rename the covariate column(s). The staggered / influence-function estimators (CallawaySantAnna, SunAbraham, StaggeredTripleDifference, EfficientDiD, TwoStageDiD, ImputationDiD, WooldridgeDiD, dCDH, StackedDiD) key results by `(g, t)` tuples / relative-time indices, never covariate names, and `TripleDifference` / `SyntheticControl` / `SyntheticDiD` do not expose covariates by name, so none are affected. New tests in `tests/test_utils.py`, `tests/test_estimators.py`, and `tests/test_estimators_vcov_type.py`.

### Changed
- **CI + local AI PR-reviewer model upgraded `gpt-5.4` → `gpt-5.5`.** The CI Codex reviewer (`.github/workflows/ai_pr_review.yml`) and the local `/ai-review-local` default (`.claude/scripts/openai_review.py` `DEFAULT_MODEL`) now run `gpt-5.5` @ `xhigh` effort / `read-only` sandbox (all other invocation settings unchanged). Validated empirically before the swap via the `tools/reviewer-eval/` A/B harness: on a real-bug corpus plus a k=6 big-diff de-risk, `gpt-5.5` matched-or-beat `gpt-5.4` on every test-backed recall case (including a bug buried in a ~3k-line methodology diff), added zero false positives, and ran faster; an end-to-end CI canary confirmed the action environment (`openai/codex-action@v1`, codex CLI 0.135.0) runs `gpt-5.5` and catches a planted P0. `gpt-5.5` is also added to the reasoning-model set (`_is_reasoning_model`, for the api-backend timeout/token limits) and to the `PRICING` table at OpenAI's confirmed standard rates (`gpt-5.5` $5/$30, `gpt-5.5-pro` $30/$180 per 1M input/output tokens) — the production CI + local reviewer run `gpt-5.5` via the flat-rate **codex backend**, but `--backend auto` falls back to the metered API path when the codex CLI is unavailable, so the cost estimate must stay accurate there. `gpt-5.4` remains accepted.

## [3.5.0] - 2026-06-01

### Added
Expand Down
Loading