fix: transactor nonce gaps causing stale queued txs#896
Merged
Conversation
When transactions are lost from the mempool (e.g. after an Erigon node restart), the transactor's local nonce counter can drift permanently ahead of the chain's pending nonce, creating an unrecoverable nonce gap. All subsequent transactions pile up in Erigon's queued pool and never execute. Add drift detection in PendingNonceAt: if the local nonce exceeds the chain's pending nonce by more than maxNonceDrift (default: 5), reset to the chain nonce. This uses no extra RPC calls — the chain's pending nonce is already queried on every call. Also fix a context leak where defer cancel() inside the SendTransaction retry loop caused all context cancellations to pile up until function return instead of being released per iteration.
When all retry attempts fail with DeadlineExceeded, the for loop exits and falls through to the success path, incorrectly calling watcher.Sent() and incrementing the nonce for a transaction that was never actually sent. This creates a permanent nonce gap that leads to stale queued transactions in the mempool. Track the send error and check it after the loop exits. If all retries failed, return the error so the defer puts the nonce back into the channel for reuse.
owen-eth
approved these changes
Mar 5, 2026
aloknerurkar
reviewed
Mar 5, 2026
| // after a node restart). Reset to chain nonce to close the gap. | ||
| // This does not add any extra RPC calls — we already query the | ||
| // chain's pending nonce above. | ||
| if nonce > pendingNonce+t.maxNonceDrift { |
Collaborator
There was a problem hiding this comment.
Typically during heavy load (especially in stress test) this will replace existing transactions. The Watcher.Allow allows certain number of transactions to be enqueued beyond the backend nonce. This number is much higher (around 1024 I think). This nonceDrift is overloading this behaviour.
In order to do this type of healing, we should also keep track of when last successful txn was sent and when context.Deadline is occurring (on line 100). We need to ensure we are trying to send transactions but we are stuck for sometime before we can heal.
kant777
added a commit
that referenced
this pull request
Mar 5, 2026
Instead of the transactor detecting nonce drift on every PendingNonceAt call (which conflicts with maxPendingTxs=2048), move stuck detection to the Monitor. When the confirmed nonce doesn't advance for 30s while there are pending txs, the Monitor signals the transactor to reset its local nonce via a NonceOverride channel. This cleanly separates concerns: the Monitor already tracks confirmed nonces and pending transactions, making it the right place for stuck detection. The transactor simply reacts to the override signal. Replaces the maxNonceDrift approach from PR #896 (bug 2 fix). Retains the retry loop fall-through fix (bug 1) from PR #896.
4 tasks
kant777
added a commit
that referenced
this pull request
Mar 5, 2026
* fix: replace maxNonceDrift with monitor-driven nonce reset Instead of the transactor detecting nonce drift on every PendingNonceAt call (which conflicts with maxPendingTxs=2048), move stuck detection to the Monitor. When the confirmed nonce doesn't advance for 30s while there are pending txs, the Monitor signals the transactor to reset its local nonce via a NonceOverride channel. This cleanly separates concerns: the Monitor already tracks confirmed nonces and pending transactions, making it the right place for stuck detection. The transactor simply reacts to the override signal. Replaces the maxNonceDrift approach from PR #896 (bug 2 fix). Retains the retry loop fall-through fix (bug 1) from PR #896. * review: fix data race in log, add monitor stuck detection tests - Fix data race: len(m.waitMap) was read without lock in the stuck detection log. Replaced hasPendingTxs() with pendingTxCount() that returns the count under lock, reused in both the condition and log. - Add SetStuckDuration() for testing with short durations. - Add TestStuckDetectionSendsNonceOverride: verifies monitor sends override when confirmed nonce doesn't advance with pending txs. - Add TestNoOverrideWhenNonceAdvances: verifies no override fires when the confirmed nonce advances before stuckDuration.
aloknerurkar
pushed a commit
that referenced
this pull request
Mar 9, 2026
* fix: self-healing nonce drift detection in transactor When transactions are lost from the mempool (e.g. after an Erigon node restart), the transactor's local nonce counter can drift permanently ahead of the chain's pending nonce, creating an unrecoverable nonce gap. All subsequent transactions pile up in Erigon's queued pool and never execute. Add drift detection in PendingNonceAt: if the local nonce exceeds the chain's pending nonce by more than maxNonceDrift (default: 5), reset to the chain nonce. This uses no extra RPC calls — the chain's pending nonce is already queried on every call. Also fix a context leak where defer cancel() inside the SendTransaction retry loop caused all context cancellations to pile up until function return instead of being released per iteration. * fix: prevent nonce gap when SendTransaction retries exhaust When all retry attempts fail with DeadlineExceeded, the for loop exits and falls through to the success path, incorrectly calling watcher.Sent() and incrementing the nonce for a transaction that was never actually sent. This creates a permanent nonce gap that leads to stale queued transactions in the mempool. Track the send error and check it after the loop exits. If all retries failed, return the error so the defer puts the nonce back into the channel for reuse.
aloknerurkar
pushed a commit
that referenced
this pull request
Mar 9, 2026
* fix: replace maxNonceDrift with monitor-driven nonce reset Instead of the transactor detecting nonce drift on every PendingNonceAt call (which conflicts with maxPendingTxs=2048), move stuck detection to the Monitor. When the confirmed nonce doesn't advance for 30s while there are pending txs, the Monitor signals the transactor to reset its local nonce via a NonceOverride channel. This cleanly separates concerns: the Monitor already tracks confirmed nonces and pending transactions, making it the right place for stuck detection. The transactor simply reacts to the override signal. Replaces the maxNonceDrift approach from PR #896 (bug 2 fix). Retains the retry loop fall-through fix (bug 1) from PR #896. * review: fix data race in log, add monitor stuck detection tests - Fix data race: len(m.waitMap) was read without lock in the stuck detection log. Replaced hasPendingTxs() with pendingTxCount() that returns the count under lock, reused in both the condition and log. - Add SetStuckDuration() for testing with short durations. - Add TestStuckDetectionSendsNonceOverride: verifies monitor sends override when confirmed nonce doesn't advance with pending txs. - Add TestNoOverrideWhenNonceAdvances: verifies no override fires when the confirmed nonce advances before stuckDuration.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
We noticed queued transactions accumulating on erigon nodes in production. Root cause is in the transactor's nonce management — two separate bugs that both lead to nonce gaps.
What changed
Bug 1: SendTransaction retry loop falls through to success path
When all retry attempts fail with
DeadlineExceeded, the for loop exits and the code falls through towatcher.Sent()+ nonce increment — even though the tx was never actually sent. This permanently skips a nonce and creates queued transactions that can never execute.Fixed by tracking the send error across the loop and returning it when all retries exhaust. The existing defer puts the nonce back in the channel for reuse.
Bug 2: No recovery when mempool loses transactions
If a transaction is successfully submitted to the mempool but later lost (node restart, mempool wipe), the local nonce counter has already moved past it. There was no way to recover — the transactor would keep sending txs with future nonces, all of which queue up behind the gap.
Added drift detection in
PendingNonceAt: if the local nonce is more thanmaxNonceDrift(default 5) ahead of what the chain reports, reset to the chain nonce. No extra RPC calls — we already query the chain's pending nonce on every call.Other
WithMaxNonceDriftandWithLoggerfunctional options (backwards compatible, no callers need to change)defer cancel()inside the retry loop that was leaking contextsTesting
Added 3 new tests:
TestSendTransactionRetriesExhausted— verifies nonce is preserved when all retries failTestNonceDriftSelfHealing— verifies drift beyond threshold triggers resetTestNonceDriftWithinThreshold— verifies normal pending txs aren't affected