Skip to content

fix: monitor-driven nonce reset for stuck transactions#899

Merged
kant777 merged 2 commits intomainfrom
fix/monitor-driven-nonce-reset
Mar 5, 2026
Merged

fix: monitor-driven nonce reset for stuck transactions#899
kant777 merged 2 commits intomainfrom
fix/monitor-driven-nonce-reset

Conversation

@kant777
Copy link
Contributor

@kant777 kant777 commented Mar 5, 2026

Summary

  • Removes the maxNonceDrift approach from PR fix: transactor nonce gaps causing stale queued txs #896 which conflicted with maxPendingTxs=2048 during heavy load
  • Adds stuck detection to the Monitor: when the confirmed nonce doesn't advance for 30s while there are pending txs in the waitMap, the Monitor sends the confirmed nonce on a NonceOverride channel
  • The transactor does a non-blocking read on NonceOverride in PendingNonceAt and resets its local nonce when signaled
  • Retains the retry loop fall-through fix (bug 1) from PR fix: transactor nonce gaps causing stale queued txs #896

Changes

  • x/contracts/txmonitor/txmonitor.go: Added NonceOverride() <-chan uint64, hasPendingTxs(), stuck detection loop with stuckDuration (30s default)
  • x/contracts/transactor/transactor.go: Removed maxNonceDrift/WithMaxNonceDrift, added NonceOverride() to Watcher interface, non-blocking override check in PendingNonceAt
  • x/contracts/transactor/transactor_test.go: Replaced drift tests with monitor-driven override tests, added NonceOverride() to test watchers

Test plan

  • TestNonceOverrideFromMonitor — verifies transactor resets nonce when override signal is sent
  • TestNonceOverrideNotTriggeredWithoutSignal — verifies local nonce is preserved when no override
  • TestSendTransactionRetriesExhausted — verifies nonce recycling on retry exhaustion (from PR fix: transactor nonce gaps causing stale queued txs #896)
  • Existing TestTrasactor and txmonitor tests pass

Instead of the transactor detecting nonce drift on every PendingNonceAt
call (which conflicts with maxPendingTxs=2048), move stuck detection to
the Monitor. When the confirmed nonce doesn't advance for 30s while
there are pending txs, the Monitor signals the transactor to reset its
local nonce via a NonceOverride channel.

This cleanly separates concerns: the Monitor already tracks confirmed
nonces and pending transactions, making it the right place for stuck
detection. The transactor simply reacts to the override signal.

Replaces the maxNonceDrift approach from PR #896 (bug 2 fix).
Retains the retry loop fall-through fix (bug 1) from PR #896.
- Fix data race: len(m.waitMap) was read without lock in the stuck
  detection log. Replaced hasPendingTxs() with pendingTxCount() that
  returns the count under lock, reused in both the condition and log.
- Add SetStuckDuration() for testing with short durations.
- Add TestStuckDetectionSendsNonceOverride: verifies monitor sends
  override when confirmed nonce doesn't advance with pending txs.
- Add TestNoOverrideWhenNonceAdvances: verifies no override fires
  when the confirmed nonce advances before stuckDuration.
@kant777 kant777 merged commit 08508a4 into main Mar 5, 2026
4 of 5 checks passed
@kant777 kant777 deleted the fix/monitor-driven-nonce-reset branch March 5, 2026 20:01
aloknerurkar pushed a commit that referenced this pull request Mar 9, 2026
* fix: replace maxNonceDrift with monitor-driven nonce reset

Instead of the transactor detecting nonce drift on every PendingNonceAt
call (which conflicts with maxPendingTxs=2048), move stuck detection to
the Monitor. When the confirmed nonce doesn't advance for 30s while
there are pending txs, the Monitor signals the transactor to reset its
local nonce via a NonceOverride channel.

This cleanly separates concerns: the Monitor already tracks confirmed
nonces and pending transactions, making it the right place for stuck
detection. The transactor simply reacts to the override signal.

Replaces the maxNonceDrift approach from PR #896 (bug 2 fix).
Retains the retry loop fall-through fix (bug 1) from PR #896.

* review: fix data race in log, add monitor stuck detection tests

- Fix data race: len(m.waitMap) was read without lock in the stuck
  detection log. Replaced hasPendingTxs() with pendingTxCount() that
  returns the count under lock, reused in both the condition and log.
- Add SetStuckDuration() for testing with short durations.
- Add TestStuckDetectionSendsNonceOverride: verifies monitor sends
  override when confirmed nonce doesn't advance with pending txs.
- Add TestNoOverrideWhenNonceAdvances: verifies no override fires
  when the confirmed nonce advances before stuckDuration.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants