fix: transactor nonce gaps causing stale queued txs by kant777 · Pull Request #896 · primev/mev-commit

kant777 · 2026-03-05T08:45:55Z

Context

We noticed queued transactions accumulating on erigon nodes in production. Root cause is in the transactor's nonce management — two separate bugs that both lead to nonce gaps.

What changed

Bug 1: SendTransaction retry loop falls through to success path

When all retry attempts fail with DeadlineExceeded, the for loop exits and the code falls through to watcher.Sent() + nonce increment — even though the tx was never actually sent. This permanently skips a nonce and creates queued transactions that can never execute.

Fixed by tracking the send error across the loop and returning it when all retries exhaust. The existing defer puts the nonce back in the channel for reuse.

Bug 2: No recovery when mempool loses transactions

If a transaction is successfully submitted to the mempool but later lost (node restart, mempool wipe), the local nonce counter has already moved past it. There was no way to recover — the transactor would keep sending txs with future nonces, all of which queue up behind the gap.

Added drift detection in PendingNonceAt: if the local nonce is more than maxNonceDrift (default 5) ahead of what the chain reports, reset to the chain nonce. No extra RPC calls — we already query the chain's pending nonce on every call.

Other

Added WithMaxNonceDrift and WithLogger functional options (backwards compatible, no callers need to change)
Fixed defer cancel() inside the retry loop that was leaking contexts

Testing

Added 3 new tests:

TestSendTransactionRetriesExhausted — verifies nonce is preserved when all retries fail
TestNonceDriftSelfHealing — verifies drift beyond threshold triggers reset
TestNonceDriftWithinThreshold — verifies normal pending txs aren't affected

When transactions are lost from the mempool (e.g. after an Erigon node restart), the transactor's local nonce counter can drift permanently ahead of the chain's pending nonce, creating an unrecoverable nonce gap. All subsequent transactions pile up in Erigon's queued pool and never execute. Add drift detection in PendingNonceAt: if the local nonce exceeds the chain's pending nonce by more than maxNonceDrift (default: 5), reset to the chain nonce. This uses no extra RPC calls — the chain's pending nonce is already queried on every call. Also fix a context leak where defer cancel() inside the SendTransaction retry loop caused all context cancellations to pile up until function return instead of being released per iteration.

When all retry attempts fail with DeadlineExceeded, the for loop exits and falls through to the success path, incorrectly calling watcher.Sent() and incrementing the nonce for a transaction that was never actually sent. This creates a permanent nonce gap that leads to stale queued transactions in the mempool. Track the send error and check it after the loop exits. If all retries failed, return the error so the defer puts the nonce back into the channel for reuse.

aloknerurkar · 2026-03-05T19:03:47Z

x/contracts/transactor/transactor.go

+		// after a node restart). Reset to chain nonce to close the gap.
+		// This does not add any extra RPC calls — we already query the
+		// chain's pending nonce above.
+		if nonce > pendingNonce+t.maxNonceDrift {


Typically during heavy load (especially in stress test) this will replace existing transactions. The Watcher.Allow allows certain number of transactions to be enqueued beyond the backend nonce. This number is much higher (around 1024 I think). This nonceDrift is overloading this behaviour.

In order to do this type of healing, we should also keep track of when last successful txn was sent and when context.Deadline is occurring (on line 100). We need to ensure we are trying to send transactions but we are stuck for sometime before we can heal.

Instead of the transactor detecting nonce drift on every PendingNonceAt call (which conflicts with maxPendingTxs=2048), move stuck detection to the Monitor. When the confirmed nonce doesn't advance for 30s while there are pending txs, the Monitor signals the transactor to reset its local nonce via a NonceOverride channel. This cleanly separates concerns: the Monitor already tracks confirmed nonces and pending transactions, making it the right place for stuck detection. The transactor simply reacts to the override signal. Replaces the maxNonceDrift approach from PR #896 (bug 2 fix). Retains the retry loop fall-through fix (bug 1) from PR #896.

* fix: replace maxNonceDrift with monitor-driven nonce reset Instead of the transactor detecting nonce drift on every PendingNonceAt call (which conflicts with maxPendingTxs=2048), move stuck detection to the Monitor. When the confirmed nonce doesn't advance for 30s while there are pending txs, the Monitor signals the transactor to reset its local nonce via a NonceOverride channel. This cleanly separates concerns: the Monitor already tracks confirmed nonces and pending transactions, making it the right place for stuck detection. The transactor simply reacts to the override signal. Replaces the maxNonceDrift approach from PR #896 (bug 2 fix). Retains the retry loop fall-through fix (bug 1) from PR #896. * review: fix data race in log, add monitor stuck detection tests - Fix data race: len(m.waitMap) was read without lock in the stuck detection log. Replaced hasPendingTxs() with pendingTxCount() that returns the count under lock, reused in both the condition and log. - Add SetStuckDuration() for testing with short durations. - Add TestStuckDetectionSendsNonceOverride: verifies monitor sends override when confirmed nonce doesn't advance with pending txs. - Add TestNoOverrideWhenNonceAdvances: verifies no override fires when the confirmed nonce advances before stuckDuration.

* fix: self-healing nonce drift detection in transactor When transactions are lost from the mempool (e.g. after an Erigon node restart), the transactor's local nonce counter can drift permanently ahead of the chain's pending nonce, creating an unrecoverable nonce gap. All subsequent transactions pile up in Erigon's queued pool and never execute. Add drift detection in PendingNonceAt: if the local nonce exceeds the chain's pending nonce by more than maxNonceDrift (default: 5), reset to the chain nonce. This uses no extra RPC calls — the chain's pending nonce is already queried on every call. Also fix a context leak where defer cancel() inside the SendTransaction retry loop caused all context cancellations to pile up until function return instead of being released per iteration. * fix: prevent nonce gap when SendTransaction retries exhaust When all retry attempts fail with DeadlineExceeded, the for loop exits and falls through to the success path, incorrectly calling watcher.Sent() and incrementing the nonce for a transaction that was never actually sent. This creates a permanent nonce gap that leads to stale queued transactions in the mempool. Track the send error and check it after the loop exits. If all retries failed, return the error so the defer puts the nonce back into the channel for reuse.

* fix: replace maxNonceDrift with monitor-driven nonce reset Instead of the transactor detecting nonce drift on every PendingNonceAt call (which conflicts with maxPendingTxs=2048), move stuck detection to the Monitor. When the confirmed nonce doesn't advance for 30s while there are pending txs, the Monitor signals the transactor to reset its local nonce via a NonceOverride channel. This cleanly separates concerns: the Monitor already tracks confirmed nonces and pending transactions, making it the right place for stuck detection. The transactor simply reacts to the override signal. Replaces the maxNonceDrift approach from PR #896 (bug 2 fix). Retains the retry loop fall-through fix (bug 1) from PR #896. * review: fix data race in log, add monitor stuck detection tests - Fix data race: len(m.waitMap) was read without lock in the stuck detection log. Replaced hasPendingTxs() with pendingTxCount() that returns the count under lock, reused in both the condition and log. - Add SetStuckDuration() for testing with short durations. - Add TestStuckDetectionSendsNonceOverride: verifies monitor sends override when confirmed nonce doesn't advance with pending txs. - Add TestNoOverrideWhenNonceAdvances: verifies no override fires when the confirmed nonce advances before stuckDuration.

kant777 added 2 commits March 5, 2026 00:27

harshsingh1002 changed the title ~~Fix transactor nonce gaps causing stale queued txs~~ fix: transactor nonce gaps causing stale queued txs Mar 5, 2026

kant777 requested review from aloknerurkar and owen-eth March 5, 2026 09:33

owen-eth approved these changes Mar 5, 2026

View reviewed changes

kant777 merged commit a331744 into main Mar 5, 2026
6 of 8 checks passed

kant777 deleted the fix/transactor-nonce-drift-self-healing branch March 5, 2026 18:59

aloknerurkar reviewed Mar 5, 2026

View reviewed changes

kant777 mentioned this pull request Mar 5, 2026

fix: monitor-driven nonce reset for stuck transactions #899

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: transactor nonce gaps causing stale queued txs#896

fix: transactor nonce gaps causing stale queued txs#896
kant777 merged 2 commits intomainfrom
fix/transactor-nonce-drift-self-healing

kant777 commented Mar 5, 2026

Uh oh!

Uh oh!

aloknerurkar Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kant777 commented Mar 5, 2026

Context

What changed

Bug 1: SendTransaction retry loop falls through to success path

Bug 2: No recovery when mempool loses transactions

Other

Testing

Uh oh!

Uh oh!

aloknerurkar Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants