Severe output quality variance across accounts and time periods #18104

@hratio

Description

What version of Codex CLI is running?

0.121.0

What subscription do you have?

Pro

Which model were you using?

gpt-5.4 xhigh

What platform is your computer?

Linux 6.19.11-arch1-1

What terminal emulator and version are you using (if applicable)?

No response

What issue are you seeing?

I know this is a hot topic that is hard to quantify and provide solid evidence for. I did, however, take the time to put this simple task through the /feedback loop. (A few paragraphs have been processed by GPT to better clarify my points.)

Before calling this out as a skill issue, please hear me out.

Quick background:

  • I juggled 6 Plus accounts up until getting a single 5x Pro sub.
  • Monorepo > 200 projects, MCP Servers: 1 (Serena)
  • 100 lines AGENTS.md, a linked and auto-generated semi-lean MEMORY.md architecture for all projects.
  • Using a custom "symphony" orchestrator for agentic runs with a well-tested research/planning/execution/review multi-agent workflow. Ad-hoc codex-cli work uses the whole range of models, preferably 5.4 medium/high for coding, occasionally xhigh for planning or debugging.

The problem

The models seem to, at fairly regular intervals, usually for 2-3 days at a time, become dumb.

Anywhere from 4 to 6 of the accounts would be dumb at once, but each stayed consistent for the duration. Same machine, same workflows, same ~/.codex/. (One account would always stay dumb, for some reason.)

Having switched to a single account, I'm stuck with the dumbness, and it creates a lot of frustration. (Side note: yesterday around 22:00 UTC there was a sudden switch from bad to good within the same long-running session I had open for debugging some specific test failures, before any compaction, same model. It went from "You go boy ✅🚀" to actually doing the job, which was nice.)

There are also recurring signs in the generated responses that suggest some kind of nerfing is happening. The severity varies, but it has been especially bad for the last 60 hours, give or take. Also, there is no relationship between context size when I compare the good days against the bad ones.

Bad days

  • Walls of text with non-reasoned explanations.
  • Excuses: "it wasn't me", "it was already there".
  • Rebuts criticism without putting in any reasoning effort, making things up that just "fit" whatever is in the context.
  • Might skip tool calls or reasoning when asked about, or called out on, factual errors. Lots of hand-waving.
  • Starts reasoning about or implementing stuff totally unrelated to the context at hand.
  • Hallmarks of very low reasoning
    • Fancy summaries
    • Catchy headers
    • Short, concise bullet points
    • Unrelated information not in context
    • "So the precise answer is: [another wall of text]"

It will very often end turns prematurely with a question about whether X or Y should be fixed, when one message earlier it introduced the error itself. When told to go fix it, it might just reiterate what needs to be fixed and end turn again. When it eventually starts applying patches, it might revert unrelated changes in files not even touched in the session, undo things that did work, leave failures behind, and end the turn with a summary of what is still not working. It literally gives up very early.

Good days

  • Will often say things like: "I'm not going to hand-wave this" and do extra tool calls.
  • Rebuts criticism well and provides arguments and clarification
  • Single concise paragraph with coherent sentences explaining what went wrong and how it will remediate it.
  • Immediately jumps to fix it
  • Understands and reiterates mistakes and takes responsibility without too much fuss.
  • Runs checks, ignores failures in unrelated files, and adds a note about it.
  • Short summary without noise, maybe with a footnote or heads-up about unrelated files that appear in the diff.

Assuming no nerfing is being done and this is all just my own hallucination: if subtle changes in how issues are written, onboarded, and even created by agents themselves can drive the LLM to behave this drastically differently, would it not be beneficial for you as a service provider to call that out, so we can stop going in circles on this topic? To state plainly that even the smallest changes in a prompt can yield responses that seemingly come from completely different models?

Assuming nerfing actually does occur, for whatever reason, please state that up front in a way that lets us save our quota for when normal inference is working again.

I’d gladly take much slower response times, especially in agentic workflows, if the quality stayed consistent. I’d much rather have quality of service measured by output quality than by response time or availability. If I could opt into time slots where quality is consistent, and get hard failures or warnings when I’m routed to some sub-par GPU, I’d choose that over spending endless amounts of time cleaning up slop that normally is not generated. It just wastes compute, time, and trust for an otherwise excellent model and CLI.

Sudden quality change example

Here we are discussing concurrent schema migrations in a test suite. We go back and forth, with the model not being as helpful as it usually is.

• Yes. My earlier wording was sloppy.

  - Table location: schema-local. In this repo, Kysely is configured with migrationTableSchema: args.schema, so the bookkeeping tables are A.kysely_migration / A.kysely_migration_lock, B.kysely_migration / B.kysely_migration_lock, etc. See packages/libs/data-access/dd-data/src/migrations.ts:77.
  - Coordination lock: database-global. On Postgres, Kysely uses one fixed pg_advisory_xact_lock(...) value for migrations, regardless of schema. See node_modules/.pnpm/kysely@0.28.15/node_modules/kysely/dist/esm/dialect/postgres/postgres-adapter.js:13.
  - The bug: Kysely acquires that global migration lock too late. It first does schema/table/lock-row setup, then takes the lock. See node_modules/.pnpm/kysely@0.28.15/node_modules/kysely/dist/esm/migration/migrator.js:263 and node_modules/.pnpm/kysely@0.28.15/node_modules/kysely/dist/esm/migration/migrator.js:414.

  So the precise truth is:

  - Migration tables are not global.
  - Migration locking/coordination is global.
  - Concurrent per-schema migrations in the same DB are therefore not safely isolated, because the unsafe setup phase happens before the global lock.

  When I said “effectively database-global,” I meant behaviorally for concurrency, not “the tables are stored globally.”
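(Aside from me, not from the session: the "lock taken too late" ordering the model describes can be shown with a small, self-contained simulation. This is a sketch under stated assumptions, not Kysely's actual code; the Mutex below merely stands in for pg_advisory_xact_lock, and all event names are illustrative.)

```typescript
// Schematic simulation of the ordering bug: each "migrator" runs its setup
// phase BEFORE taking the single database-global lock, so two concurrent
// per-schema migrations can interleave during the unsafe setup phase.

class Mutex {
  private tail: Promise<void> = Promise.resolve();
  async lock(): Promise<() => void> {
    let release!: () => void;
    const gate = new Promise<void>((res) => (release = res));
    const prev = this.tail;
    this.tail = this.tail.then(() => gate);
    await prev;      // wait until every earlier holder has released
    return release;  // caller invokes this to release the lock
  }
}

const globalLock = new Mutex();
const events: string[] = [];
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function migrate(schema: string): Promise<void> {
  // Unsafe setup phase, OUTSIDE the global lock (mirrors the migrator doing
  // schema/table/lock-row setup before acquiring the advisory lock)
  events.push(`${schema}: setup start`);
  await sleep(10);
  events.push(`${schema}: setup end`);

  // Only now is the database-global lock taken
  const release = await globalLock.lock();
  events.push(`${schema}: locked, running migrations`);
  await sleep(5);
  events.push(`${schema}: done`);
  release();
}

await Promise.all([migrate("A"), migrate("B")]);
// Both setup phases begin before either schema reaches its locked section.
const log = events;
console.log(log.join("\n"));
```

Running this shows both "setup start" events firing before either migrator holds the lock, which is exactly the window in which the real bookkeeping-table setup can collide.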


› But you have not provided a definite answer, you hand-wave and provide a lot of noise. Focus on what I'm asking you. Exactly what causes the failure of concurrent migrations when we've set the migrationTableSchema? Is our usage of the migrator wrong? Search GitHub, and take a look at this post: https://github.com/kysely-org/kysely-ctl/issues/349 Does it help us?

• Searching the web
• Searched site:github.com/kysely-org/kysely-ctl/issues/349 migrationTableSchema concurrent schema migrations
• Searching the web
• Searched https://github.com/kysely-org/kysely-ctl/issues/349
• Searching the web
• Searched site:github.com/kysely-org/kysely/issues withSchema migrationTableSchema "relation" "kysely_migration"
• Searching the web
• Searched https://github.com/kysely-org/kysely/issues/1269
• Searching the web
• Searched site:github.com/kysely-org/kysely/issues "WithSchemaPlugin" migrations postgres

Then suddenly a switch happens IN THE ACTIVE TURN: the wording completely changes, it starts using proper tools, creates a reproduction outside of the test files, and isolates the exact issue. It creates a plan, goes back and forth until it has proof of what is happening and a concrete solution. No sugar-coating or stating the obvious. All by itself. Night and day within the same session, within a timeline of 10 minutes.


• I’m searching Kysely’s GitHub issues for migrationTableSchema + withSchema failures and checking whether issue kysely-ctl#349 is relevant.

• Ran gh api repos/kysely-org/kysely/issues/1269/comments
  └ [{"url":"https://api.github.com/repos/kysely-org/kysely/issues/comments/2485497987","html_url":"https://github.com/kysely-org/kysely/issues/1269#issuecomment-2485497987","issue_url":"https://api.github.com/
    repos/kysely-org/kysely/issues/1269","id":2485497987,"node_id":"IC_kwDOFFVhD86UJbCD","user":{"login":"jroitgrund","id":1767394,"node_id":"MDQ6VXNlcjE3NjczOTQ=","avatar_url":"https://avatars.githubuserconten
    … +126 lines (ctrl + t to view transcript)
    ions":{"url":"https://api.github.com/repos/kysely-org/kysely/issues/comments/3536540367/reactions","total_count":3,"+1":3,"-1":0,"laugh":0,"hooray":0,"confused":0,"heart":0,"rocket":0,"eyes":0},"performed_v
    ia_github_app":null}]

• Ran gh api repos/kysely-org/kysely/issues/80/comments
  └ [{"url":"https://api.github.com/repos/kysely-org/kysely/issues/comments/1100851277","html_url":"https://github.com/kysely-org/kysely/issues/80#issuecomment-1100851277","issue_url":"https://api.github.com/re
    pos/kysely-org/kysely/issues/80","id":1100851277,"node_id":"IC_kwDOFFVhD85BnahN","user":{"login":"koskimas","id":846508,"node_id":"MDQ6VXNlcjg0NjUwOA==","avatar_url":"https://avatars.githubusercontent.com/u
    … +56 lines (ctrl + t to view transcript)

• I’m checking the actual migration bodies for unqualified SQL or schema handling mistakes.
• I’m running the exact failing target again and capturing the first concrete stack/error around the missing relation.
• The failing query is not a select from schema.kysely_migration. It is the introspector query inside Migrator.#doesTableExist(). I’m checking that exact query now.

The wrong assumption is that migrationTableSchema scopes the migrator’s catalog lookup. In kysely@0.28.15, it does not.

If you want, I can patch a local workaround next: either bypass getTables() for migration-table existence checks or replace it with a schema-filtered pg_catalog lookup.

---

• I’ve verified the patched dependency removes the missing-table crash. The nx rerun now gets past that and fails later on an unrelated assertion in validator_clean.e2e.spec.ts. I’m adding the repo memory note for this Kysely pitfall, then I’ll summarize.

Look at that beauty! When it works, it freaking works!
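For anyone hitting the same Kysely pitfall: the "schema-filtered pg_catalog lookup" the model offered as a workaround could be sketched like this. The function name and return shape are hypothetical (this is not the patch from the session); it just builds a parameterized query against pg_catalog.pg_tables instead of relying on an unscoped catalog lookup.

```typescript
// Hypothetical sketch: a schema-scoped existence check for the migration
// bookkeeping table. The { text, values } pair can be fed to any Postgres
// client that supports $1/$2-style parameterized queries.

interface ParamQuery {
  text: string;
  values: string[];
}

function migrationTableExistsSql(
  schema: string,
  table = "kysely_migration",
): ParamQuery {
  return {
    text:
      "select exists (" +
      "select 1 from pg_catalog.pg_tables " +
      "where schemaname = $1 and tablename = $2" +
      ") as present",
    values: [schema, table],
  };
}

const q = migrationTableExistsSql("tenant_a");
console.log(q.text, "--", q.values.join(", "));
```

Because the lookup is filtered by schemaname, tenant A's check can no longer be confused by (or race with) bookkeeping tables in tenant B's schema.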

What steps can reproduce the bug?

Uploaded thread: 019d94d0-243b-76f0-a9ce-4d25b716b691

What is the expected behavior?

No response

Additional information

No response

Metadata

Labels: agent (Issues related to the core agent loop), bug (Something isn't working)