gpt-5.5 xhigh sometimes short-circuits with reasoning_output_tokens=516 and wrong final_answer in Codex Desktop #29353

### Summary

I am seeing a reproducible pattern in Codex Desktop with `gpt-5.5` + `xhigh` reasoning.

For the same reasoning puzzle, some runs appear to take a very short reasoning path, produce no intermediate `commentary` output, go directly to `final_answer`, and return the wrong answer. In those failed runs, the local session logs consistently show:

```text
reasoning_output_tokens: 516
phases: ["final_answer"]
```

Runs that first emit a short `commentary` message continue reasoning much longer and return the correct answer.

### Environment

- Product: Codex Desktop app
- Model: `gpt-5.5`
- Reasoning effort: `xhigh`
- Date tested: 2026-06-22, Asia/Shanghai
- Local session logs inspected under `~/.codex/sessions/...`
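For anyone who wants to check their own logs, here is a minimal sketch of how the two fields could be pulled out of a session file. It assumes the session files are JSON Lines and that `reasoning_output_tokens` and `phase` appear as literal keys somewhere in each record. Both are assumptions about the log layout, not a documented schema, and `find_key` and `summarize` are hypothetical helper names.

```python
import json
from typing import Any, Iterable, Iterator


def find_key(obj: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in a nested JSON value."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


def summarize(lines: Iterable[str]) -> dict:
    """Collect token counts and phase labels from JSON Lines text.

    Assumption: one JSON object per line; lines that do not parse are skipped.
    """
    tokens, phases = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        tokens.extend(
            v for v in find_key(record, "reasoning_output_tokens") if isinstance(v, int)
        )
        phases.extend(v for v in find_key(record, "phase") if isinstance(v, str))
    return {"reasoning_output_tokens": tokens, "phases": phases}
```

To use it, open one session file and pass the file object in, for example `summarize(open(path, encoding="utf-8"))`, where `path` is a file you picked under the sessions directory.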

### Related external discussion

There is also an external discussion reporting a similar `516` reasoning-token behavior:

- https://github.com/router-for-me/CLIProxyAPI/discussions/3937
- My added reproduction comment there: https://github.com/router-for-me/CLIProxyAPI/discussions/3937#discussioncomment-17383083

### Reproduction prompt

```text
In a black bag, there are candies in three flavors. Each flavor comes in two shapes: round and five-pointed star. The two shapes can be distinguished by touch.

The quantities of candies by flavor and shape are shown in the table below. Before the activity begins, contestants need to decide how many candies to draw. What is the minimum number of candies that must be drawn to guarantee that they have apple-flavored and peach-flavored candies of different shapes at the same time?

Either a round apple-flavored candy together with a five-pointed-star peach-flavored candy, or a round peach-flavored candy together with a five-pointed-star apple-flavored candy, satisfies the requirement.

| Shape | Apple | Peach | Watermelon |
|---|---:|---:|---:|
| Round | 7 | 9 | 8 |
| Five-pointed star | 7 | 6 | 4 |
```

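The expected answer is `21`. Because the shapes can be distinguished by touch, the contestant can choose how many candies of each shape to draw. Drawing 9 round candies and 12 five-pointed-star candies guarantees the required pair: only 10 star candies are not apple and only 11 are not peach, so 12 star candies must include both an apple and a peach one; only 8 round candies are watermelon, so 9 round candies must include at least one apple or peach one, which pairs with the star candy of the other flavor.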
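The expected answer can be checked by brute force. The sketch below enumerates every possible flavor composition for each choice of round and star draw counts and finds the smallest total that always yields a valid pair. It also computes the minimum when the shapes cannot be chosen by touch, which comes out to `29`. That may be the interpretation behind the wrong runs, though nothing in the logs confirms it. All function names are illustrative.

```python
from functools import lru_cache

# Counts from the table in the prompt: (apple, peach, watermelon)
ROUND = (7, 9, 8)
STAR = (7, 6, 4)


def draws(counts, k):
    """All (apple, peach, watermelon) compositions of k candies of one shape."""
    a_max, p_max, w_max = counts
    for a in range(min(a_max, k) + 1):
        for p in range(min(p_max, k - a) + 1):
            w = k - a - p
            if 0 <= w <= w_max:
                yield a, p, w


def satisfied(round_draw, star_draw):
    """True if the draw holds apple and peach candies of different shapes."""
    ra, rp, _ = round_draw
    sa, sp, _ = star_draw
    return (ra > 0 and sp > 0) or (rp > 0 and sa > 0)


@lru_cache(maxsize=None)
def guaranteed(r, s):
    """True if every possible outcome of drawing r round and s star works."""
    return all(
        satisfied(x, y) for x in draws(ROUND, r) for y in draws(STAR, s)
    )


def min_with_touch():
    """Smallest (total, round, star) when the contestant picks shapes by touch."""
    return min(
        (r + s, r, s)
        for r in range(sum(ROUND) + 1)
        for s in range(sum(STAR) + 1)
        if guaranteed(r, s)
    )


def min_without_touch():
    """Smallest total when the shape split is out of the contestant's control."""
    n_round, n_star = sum(ROUND), sum(STAR)
    for n in range(n_round + n_star + 1):
        splits = range(max(0, n - n_star), min(n_round, n) + 1)
        if all(guaranteed(r, n - r) for r in splits):
            return n
    return None


print(min_with_touch())     # (21, 9, 12)
print(min_without_touch())  # 29
```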

### Observed results

I opened 5 fresh Codex Desktop conversations with the same English prompt:

| Thread | Visible assistant phases | Final answer | reasoning_output_tokens | Duration |
|---|---|---:|---:|---:|
| `019eeb3b-052c...` | `["final_answer"]` | Wrong, `29` | `516` | `15.4s` |
| `019eeb3c-df82...` | `["final_answer"]` | Wrong, `29` | `516` | `14.4s` |
| `019eeb3b-5843...` | `["commentary", "final_answer"]` | Correct, `21` | `5884` | `114.0s` |
| `019eeb3b-8fb2...` | `["commentary", "commentary", "final_answer"]` | Correct, `21` | `3904` | `77.7s` |
| `019eeb3d-a846...` | `["commentary", "final_answer"]` | Correct, `21` | `7766` | `145.2s` |

I also reproduced a similar pattern with the Chinese version of the same prompt:

| Visible assistant phases | Final answer | reasoning_output_tokens |
|---|---:|---:|
| `["final_answer"]` | Wrong, `29` | `516` |
| `["final_answer"]` | Wrong, `29` | `516` |
| `["commentary", "final_answer"]` | Correct, `21` | `3624` |
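The split in the two tables above can be stated compactly. The sketch below copies the eight runs from those tables and groups them by whether any `commentary` phase appeared. The `Run` type and helper names are illustrative only and are not part of Codex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    phases: tuple
    answer: int
    reasoning_tokens: int


# Five English-prompt runs followed by three Chinese-prompt runs,
# copied from the results tables.
RUNS = [
    Run(("final_answer",), 29, 516),
    Run(("final_answer",), 29, 516),
    Run(("commentary", "final_answer"), 21, 5884),
    Run(("commentary", "commentary", "final_answer"), 21, 3904),
    Run(("commentary", "final_answer"), 21, 7766),
    Run(("final_answer",), 29, 516),
    Run(("final_answer",), 29, 516),
    Run(("commentary", "final_answer"), 21, 3624),
]


def is_direct_final(run):
    """True if the run went straight to final_answer with no commentary."""
    return "commentary" not in run.phases


def split_runs(runs):
    """Return (direct_final_runs, commentary_runs)."""
    direct = [r for r in runs if is_direct_final(r)]
    with_commentary = [r for r in runs if not is_direct_final(r)]
    return direct, with_commentary


direct, with_commentary = split_runs(RUNS)
print({r.reasoning_tokens for r in direct})               # {516}
print({r.answer for r in direct})                         # {29}
print(min(r.reasoning_tokens for r in with_commentary))   # 3624
```

With only eight runs this shows a correlation, not a cause. It does show that the grouping by phase lines up exactly with both the token count and the answer.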

### Why this looks suspicious

The wrong runs are not merely lower-quality outputs. They share a very specific signature:

- no visible intermediate `commentary`
- direct `final_answer`
- very short completion time
- exactly `516` reasoning output tokens
- same wrong interpretation of the problem

The successful runs follow a different path:

- emit an initial `commentary` message
- spend much longer reasoning
- use thousands of reasoning tokens
- return the correct answer

This makes me wonder whether Codex Desktop or the model route is sometimes short-circuiting or truncating the reasoning path before the intended `xhigh` reasoning actually happens.

### Expected behavior

For `gpt-5.5` with `xhigh`, identical fresh conversations should not consistently fall into a fixed `reasoning_output_tokens=516` direct-final path that correlates with the wrong answer.

Even if the model can still make mistakes, the fixed `516` reasoning token count plus the phase difference suggests there may be a routing, truncation, or response-stream handling issue worth investigating.

### Additional note

I searched existing `openai/codex` issues/discussions for `516`. I found one issue mentioning `reasoning 516`, but it appeared to be about context-window display rather than this direct-final/wrong-answer pattern.
