Skip to content

fix: transactor nonce gaps causing stale queued txs#896

Merged
kant777 merged 2 commits intomainfrom
fix/transactor-nonce-drift-self-healing
Mar 5, 2026
Merged

fix: transactor nonce gaps causing stale queued txs#896
kant777 merged 2 commits intomainfrom
fix/transactor-nonce-drift-self-healing

Conversation

@kant777
Copy link
Contributor

@kant777 kant777 commented Mar 5, 2026

Context

We noticed queued transactions accumulating on erigon nodes in production. Root cause is in the transactor's nonce management — two separate bugs that both lead to nonce gaps.

What changed

Bug 1: SendTransaction retry loop falls through to success path

When all retry attempts fail with DeadlineExceeded, the for loop exits and the code falls through to watcher.Sent() + nonce increment — even though the tx was never actually sent. This permanently skips a nonce and creates queued transactions that can never execute.

Fixed by tracking the send error across the loop and returning it when all retries exhaust. The existing defer puts the nonce back in the channel for reuse.

Bug 2: No recovery when mempool loses transactions

If a transaction is successfully submitted to the mempool but later lost (node restart, mempool wipe), the local nonce counter has already moved past it. There was no way to recover — the transactor would keep sending txs with future nonces, all of which queue up behind the gap.

Added drift detection in PendingNonceAt: if the local nonce is more than maxNonceDrift (default 5) ahead of what the chain reports, reset to the chain nonce. No extra RPC calls — we already query the chain's pending nonce on every call.

Other

  • Added WithMaxNonceDrift and WithLogger functional options (backwards compatible, no callers need to change)
  • Fixed defer cancel() inside the retry loop that was leaking contexts

Testing

Added 3 new tests:

  • TestSendTransactionRetriesExhausted — verifies nonce is preserved when all retries fail
  • TestNonceDriftSelfHealing — verifies drift beyond threshold triggers reset
  • TestNonceDriftWithinThreshold — verifies normal pending txs aren't affected

kant777 added 2 commits March 5, 2026 00:27
When transactions are lost from the mempool (e.g. after an Erigon node
restart), the transactor's local nonce counter can drift permanently
ahead of the chain's pending nonce, creating an unrecoverable nonce gap.
All subsequent transactions pile up in Erigon's queued pool and never
execute.

Add drift detection in PendingNonceAt: if the local nonce exceeds the
chain's pending nonce by more than maxNonceDrift (default: 5), reset to
the chain nonce. This uses no extra RPC calls — the chain's pending
nonce is already queried on every call.

Also fix a context leak where defer cancel() inside the SendTransaction
retry loop caused all context cancellations to pile up until function
return instead of being released per iteration.
When all retry attempts fail with DeadlineExceeded, the for loop
exits and falls through to the success path, incorrectly calling
watcher.Sent() and incrementing the nonce for a transaction that
was never actually sent. This creates a permanent nonce gap that
leads to stale queued transactions in the mempool.

Track the send error and check it after the loop exits. If all
retries failed, return the error so the defer puts the nonce back
into the channel for reuse.
@harshsingh1002 harshsingh1002 changed the title Fix transactor nonce gaps causing stale queued txs fix: transactor nonce gaps causing stale queued txs Mar 5, 2026
@kant777 kant777 requested review from aloknerurkar and owen-eth March 5, 2026 09:33
@kant777 kant777 merged commit a331744 into main Mar 5, 2026
6 of 8 checks passed
@kant777 kant777 deleted the fix/transactor-nonce-drift-self-healing branch March 5, 2026 18:59
// after a node restart). Reset to chain nonce to close the gap.
// This does not add any extra RPC calls — we already query the
// chain's pending nonce above.
if nonce > pendingNonce+t.maxNonceDrift {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically during heavy load (especially in stress test) this will replace existing transactions. The Watcher.Allow allows certain number of transactions to be enqueued beyond the backend nonce. This number is much higher (around 1024 I think). This nonceDrift is overloading this behaviour.

In order to do this type of healing, we should also keep track of when last successful txn was sent and when context.Deadline is occurring (on line 100). We need to ensure we are trying to send transactions but we are stuck for sometime before we can heal.

kant777 added a commit that referenced this pull request Mar 5, 2026
Instead of the transactor detecting nonce drift on every PendingNonceAt
call (which conflicts with maxPendingTxs=2048), move stuck detection to
the Monitor. When the confirmed nonce doesn't advance for 30s while
there are pending txs, the Monitor signals the transactor to reset its
local nonce via a NonceOverride channel.

This cleanly separates concerns: the Monitor already tracks confirmed
nonces and pending transactions, making it the right place for stuck
detection. The transactor simply reacts to the override signal.

Replaces the maxNonceDrift approach from PR #896 (bug 2 fix).
Retains the retry loop fall-through fix (bug 1) from PR #896.
kant777 added a commit that referenced this pull request Mar 5, 2026
* fix: replace maxNonceDrift with monitor-driven nonce reset

Instead of the transactor detecting nonce drift on every PendingNonceAt
call (which conflicts with maxPendingTxs=2048), move stuck detection to
the Monitor. When the confirmed nonce doesn't advance for 30s while
there are pending txs, the Monitor signals the transactor to reset its
local nonce via a NonceOverride channel.

This cleanly separates concerns: the Monitor already tracks confirmed
nonces and pending transactions, making it the right place for stuck
detection. The transactor simply reacts to the override signal.

Replaces the maxNonceDrift approach from PR #896 (bug 2 fix).
Retains the retry loop fall-through fix (bug 1) from PR #896.

* review: fix data race in log, add monitor stuck detection tests

- Fix data race: len(m.waitMap) was read without lock in the stuck
  detection log. Replaced hasPendingTxs() with pendingTxCount() that
  returns the count under lock, reused in both the condition and log.
- Add SetStuckDuration() for testing with short durations.
- Add TestStuckDetectionSendsNonceOverride: verifies monitor sends
  override when confirmed nonce doesn't advance with pending txs.
- Add TestNoOverrideWhenNonceAdvances: verifies no override fires
  when the confirmed nonce advances before stuckDuration.
aloknerurkar pushed a commit that referenced this pull request Mar 9, 2026
* fix: self-healing nonce drift detection in transactor

When transactions are lost from the mempool (e.g. after an Erigon node
restart), the transactor's local nonce counter can drift permanently
ahead of the chain's pending nonce, creating an unrecoverable nonce gap.
All subsequent transactions pile up in Erigon's queued pool and never
execute.

Add drift detection in PendingNonceAt: if the local nonce exceeds the
chain's pending nonce by more than maxNonceDrift (default: 5), reset to
the chain nonce. This uses no extra RPC calls — the chain's pending
nonce is already queried on every call.

Also fix a context leak where defer cancel() inside the SendTransaction
retry loop caused all context cancellations to pile up until function
return instead of being released per iteration.

* fix: prevent nonce gap when SendTransaction retries exhaust

When all retry attempts fail with DeadlineExceeded, the for loop
exits and falls through to the success path, incorrectly calling
watcher.Sent() and incrementing the nonce for a transaction that
was never actually sent. This creates a permanent nonce gap that
leads to stale queued transactions in the mempool.

Track the send error and check it after the loop exits. If all
retries failed, return the error so the defer puts the nonce back
into the channel for reuse.
aloknerurkar pushed a commit that referenced this pull request Mar 9, 2026
* fix: replace maxNonceDrift with monitor-driven nonce reset

Instead of the transactor detecting nonce drift on every PendingNonceAt
call (which conflicts with maxPendingTxs=2048), move stuck detection to
the Monitor. When the confirmed nonce doesn't advance for 30s while
there are pending txs, the Monitor signals the transactor to reset its
local nonce via a NonceOverride channel.

This cleanly separates concerns: the Monitor already tracks confirmed
nonces and pending transactions, making it the right place for stuck
detection. The transactor simply reacts to the override signal.

Replaces the maxNonceDrift approach from PR #896 (bug 2 fix).
Retains the retry loop fall-through fix (bug 1) from PR #896.

* review: fix data race in log, add monitor stuck detection tests

- Fix data race: len(m.waitMap) was read without lock in the stuck
  detection log. Replaced hasPendingTxs() with pendingTxCount() that
  returns the count under lock, reused in both the condition and log.
- Add SetStuckDuration() for testing with short durations.
- Add TestStuckDetectionSendsNonceOverride: verifies monitor sends
  override when confirmed nonce doesn't advance with pending txs.
- Add TestNoOverrideWhenNonceAdvances: verifies no override fires
  when the confirmed nonce advances before stuckDuration.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants