Transient error handling and concurrency limit for postgres based indexed storage#3184

Merged
vigoo merged 3 commits into main from indexed-storage-overload-fix on Apr 15, 2026

@vigoo commented Apr 15, 2026

The oplog layer intentionally calls .expect on the indexed storage operations because we wanted the executor to stop running as soon as it could no longer access the oplog, for example to minimize the effects of a split-brain scenario.
While checking failed benchmark logs I noticed that, since the switch to the postgres-based indexed storage implementation, we can easily overload the sqlx pool with many parallel agents. This manifests as pool timeout errors, which are also treated as critical issues and crash the executor.

We should not crash the executor just because it is overloaded; this PR introduces a few things to avoid that:

  • error classification for the indexed storage implementations (transient vs other)
  • transient errors such as pool timeouts no longer cause an immediate panic, just a configurable inline retry
  • the indexed storage now has a semaphore, so parallel writes are limited by it rather than by the sqlx layer, which has a timeout. This also leaves a configurable number of free pool slots for non-indexed-storage use
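
The three points above can be illustrated with a minimal, synchronous sketch. It is not the code from this PR: the real implementation is presumably async and sqlx-based, and `StorageError`, `RetryConfig`, their fields, the closure signature of `retry_storage_op`, and the hand-rolled `Semaphore` are all assumptions made for illustration, using only the standard library.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical error classification: transient vs. everything else.
#[derive(Debug, Clone, PartialEq)]
pub enum StorageError {
    /// Expected to clear up on its own, e.g. a connection pool timeout.
    Transient(String),
    /// Any other failure; the caller may still treat it as fatal.
    Other(String),
}

/// Hypothetical retry settings; the real configuration is not shown in this PR.
#[derive(Debug, Clone)]
pub struct RetryConfig {
    pub max_attempts: u32,
    pub delay: Duration,
}

/// Runs `op`, retrying inline only while the error is transient and attempts remain.
pub fn retry_storage_op<T>(
    config: &RetryConfig,
    op_name: &str,
    key: &str,
    mut op: impl FnMut() -> Result<T, StorageError>,
) -> Result<T, StorageError> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(StorageError::Transient(reason)) if attempt < config.max_attempts => {
                eprintln!("{op_name}({key}): attempt {attempt} failed ({reason}), retrying");
                thread::sleep(config.delay);
                attempt += 1;
            }
            // Non-transient errors and exhausted retries go back to the caller.
            Err(err) => return Err(err),
        }
    }
}

/// Minimal blocking counting semaphore that caps concurrent storage operations.
pub struct Semaphore {
    free: Mutex<usize>,
    released: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Semaphore { free: Mutex::new(permits), released: Condvar::new() }
    }

    /// Waits for a free permit (without a timeout), runs `f`, then returns the permit.
    /// A panic inside `f` would leak the permit; a real version would use a guard type.
    pub fn with_permit<T>(&self, f: impl FnOnce() -> T) -> T {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.released.wait(free).unwrap();
        }
        *free -= 1;
        drop(free);
        let result = f();
        *self.free.lock().unwrap() += 1;
        self.released.notify_one();
        result
    }
}
```

In this sketch the semaphore would be created with fewer permits than the pool has connections (pool size minus the reserved slots), so indexed storage callers queue on the semaphore, which never times out, instead of on the pool's acquire timeout. The caller can keep its `.expect` on the final result, since only non-transient errors or exhausted retries reach it.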

let level = self.level;
let key = Self::compressed_oplog_key(&owned_agent_id.agent_id);
retry_storage_op(&self.retry_config, "compressed_exists", &key, || {
    let is = is.clone();

Not that it matters much, but do we also have to clone on line 141?

@vigoo merged commit 6811d06 into main on Apr 15, 2026
86 of 91 checks passed
@vigoo deleted the indexed-storage-overload-fix branch on April 15, 2026 09:52
@github-actions (bot) locked and limited conversation to collaborators on Apr 15, 2026

2 participants