
gpt-5.5 xhigh sometimes short-circuits with reasoning_output_tokens=516 and wrong final_answer in Codex Desktop #29353

Description

@gydx6

Summary

I am seeing a reproducible pattern in Codex Desktop with gpt-5.5 + xhigh reasoning.

For the same reasoning puzzle, some runs appear to take a very short reasoning path, produce no intermediate commentary output, go directly to final_answer, and return the wrong answer. In those failed runs, the local session logs consistently show:

reasoning_output_tokens: 516
phases: ["final_answer"]

Runs that first emit a short commentary message continue reasoning much longer and return the correct answer.

Environment

  • Product: Codex Desktop app
  • Model: gpt-5.5
  • Reasoning effort: xhigh
  • Date tested: 2026-06-22, Asia/Shanghai
  • Local session logs inspected under ~/.codex/sessions/...

Related external discussion

There is also an external discussion reporting a similar 516 reasoning-token behavior:

Reproduction prompt

In a black bag, there are candies in three flavors. Each flavor comes in two shapes: round and five-pointed star. The two shapes can be distinguished by touch.

The quantities of candies by flavor and shape are shown in the table below. Before the activity begins, contestants need to decide how many candies to draw. What is the minimum number of candies that must be drawn to guarantee that they have apple-flavored and peach-flavored candies of different shapes at the same time?

Either a round apple-flavored candy together with a five-pointed-star peach-flavored candy, or a round peach-flavored candy together with a five-pointed-star apple-flavored candy, satisfies the requirement.

| Shape | Apple | Peach | Watermelon |
|---|---:|---:|---:|
| Round | 7 | 9 | 8 |
| Five-pointed star | 7 | 6 | 4 |

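The expected answer is 21. Because the shapes can be distinguished by touch, the contestant can choose how many candies of each shape to draw. For example, drawing 9 round candies and 12 five-pointed-star candies guarantees the required pair: there are only 8 round watermelon candies, so 9 round draws must include an apple or a peach, and there are only 10 non-apple stars and 11 non-peach stars, so 12 star draws must include both an apple star and a peach star.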
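The expected answer can be checked by brute force. The following is a minimal sketch, assuming the contestant fixes in advance how many round and how many star candies to draw and that the flavors drawn are always the worst case. The function names are illustrative and not part of any Codex tooling. For comparison it also computes the answer when shape is ignored, which under this model comes out to 29; whether the failed runs actually reasoned that way is an assumption.

```python
from itertools import product

# Counts taken from the table in the prompt.
ROUND = {"apple": 7, "peach": 9, "watermelon": 8}
STAR = {"apple": 7, "peach": 6, "watermelon": 4}


def possible_presence(counts, k):
    """All (has_apple, has_peach) outcomes that can occur when drawing k candies of one shape."""
    outcomes = set()
    for a in range(counts["apple"] + 1):
        for p in range(counts["peach"] + 1):
            w = k - a - p  # the remainder must be watermelon
            if 0 <= w <= counts["watermelon"]:
                outcomes.add((a > 0, p > 0))
    return outcomes


def guaranteed(r, s):
    """True if every possible draw of r round and s star candies contains a valid pair."""
    for (ra, rp), (sa, sp) in product(possible_presence(ROUND, r), possible_presence(STAR, s)):
        if not ((ra and sp) or (rp and sa)):
            return False
    return True


def min_with_touch():
    """Smallest (total, round, star) split that guarantees the pair when shapes can be chosen."""
    total_r, total_s = sum(ROUND.values()), sum(STAR.values())
    return min(
        (r + s, r, s)
        for r in range(total_r + 1)
        for s in range(total_s + 1)
        if guaranteed(r, s)
    )


def min_without_touch():
    """Smallest blind draw that guarantees the pair: largest failing selection plus one."""
    worst = 0
    ranges = [range(ROUND[f] + 1) for f in ("apple", "peach", "watermelon")]
    ranges += [range(STAR[f] + 1) for f in ("apple", "peach", "watermelon")]
    for ra, rp, rw, sa, sp, sw in product(*ranges):
        if not ((ra and sp) or (rp and sa)):
            worst = max(worst, ra + rp + rw + sa + sp + sw)
    return worst + 1


if __name__ == "__main__":
    print(min_with_touch())     # (21, 9, 12)
    print(min_without_touch())  # 29
```

The blind-draw case uses "largest failing selection plus one" because any subset of a failing selection also fails, so every draw size up to that maximum can fail.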

Observed results

I opened 5 fresh Codex Desktop conversations with the same English prompt:

| Thread | Visible assistant phases | Final answer | reasoning_output_tokens | Duration |
|---|---|---|---:|---:|
| 019eeb3b-052c... | ["final_answer"] | Wrong, 29 | 516 | 15.4s |
| 019eeb3c-df82... | ["final_answer"] | Wrong, 29 | 516 | 14.4s |
| 019eeb3b-5843... | ["commentary", "final_answer"] | Correct, 21 | 5884 | 114.0s |
| 019eeb3b-8fb2... | ["commentary", "commentary", "final_answer"] | Correct, 21 | 3904 | 77.7s |
| 019eeb3d-a846... | ["commentary", "final_answer"] | Correct, 21 | 7766 | 145.2s |

I also reproduced a similar pattern with the Chinese version of the same prompt:

| Visible assistant phases | Final answer | reasoning_output_tokens |
|---|---|---:|
| ["final_answer"] | Wrong, 29 | 516 |
| ["final_answer"] | Wrong, 29 | 516 |
| ["commentary", "final_answer"] | Correct, 21 | 3624 |

Why this looks suspicious

The wrong runs are not merely lower-quality outputs. They share a very specific shape:

  • no visible intermediate commentary
  • direct final_answer
  • very short completion time
  • exactly 516 reasoning output tokens
  • same wrong interpretation of the problem

The successful runs follow a different path:

  • emit an initial commentary message
  • spend much longer reasoning
  • use thousands of reasoning tokens
  • return the correct answer
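The split between the two paths can be tabulated directly from the eight runs reported in this issue. This is a small sketch over hand-copied values from the two results tables, not a parser for the session log files, whose schema is not shown here. The `RUNS` list and `summarize` helper are illustrative names.

```python
from collections import defaultdict

EXPECTED = 21

# (prompt language, visible phases, final answer, reasoning_output_tokens),
# copied by hand from the English and Chinese results in this issue.
RUNS = [
    ("en", ["final_answer"], 29, 516),
    ("en", ["final_answer"], 29, 516),
    ("en", ["commentary", "final_answer"], 21, 5884),
    ("en", ["commentary", "commentary", "final_answer"], 21, 3904),
    ("en", ["commentary", "final_answer"], 21, 7766),
    ("zh", ["final_answer"], 29, 516),
    ("zh", ["final_answer"], 29, 516),
    ("zh", ["commentary", "final_answer"], 21, 3624),
]


def summarize(runs):
    """Group runs by whether any commentary phase was emitted before the final answer."""
    groups = defaultdict(lambda: {"runs": 0, "correct": 0, "tokens": set()})
    for _lang, phases, answer, tokens in runs:
        key = "commentary" if "commentary" in phases else "direct_final"
        group = groups[key]
        group["runs"] += 1
        group["correct"] += int(answer == EXPECTED)
        group["tokens"].add(tokens)
    return dict(groups)


if __name__ == "__main__":
    for path, stats in summarize(RUNS).items():
        print(path, stats["runs"], stats["correct"], sorted(stats["tokens"]))
```

In this sample, all four direct-final runs are wrong with exactly 516 reasoning tokens, and all four commentary runs are correct with at least 3624. Eight runs is a small sample, so this shows a correlation rather than a cause.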

This makes me wonder whether Codex Desktop or the model route is sometimes short-circuiting/truncating the reasoning path before the intended xhigh reasoning behavior actually happens.

Expected behavior

For gpt-5.5 with xhigh, identical fresh conversations should not consistently fall into a fixed reasoning_output_tokens=516 direct-final path that correlates with the wrong answer.

Even if the model can still make mistakes, the fixed 516 reasoning token count plus the phase difference suggests there may be a routing, truncation, or response-stream handling issue worth investigating.

Additional note

I searched existing openai/codex issues/discussions for 516. I found one issue mentioning reasoning 516, but it appeared to be about context-window display rather than this direct-final/wrong-answer pattern.

Labels: app, bug, model-behavior
