feat(derivation): deriveForce skip#967
Conversation
…Number When local verify sees batchInfo.lastBlockNumber missing locally, the new dispatcher distinguishes four cases: - A consistent: header present + versioned hash matches L1 → existing rebuildBlob path, safe advances as before. - B self-heal: header present + versioned hash mismatch → existing self-heal, now calling deriveForce(_, 0) to keep the full-overwrite semantics. - C fill-gap: header missing AND L2 head stays flat across the 3×2s retry window → pull the real L1 blob via fetchRollupDataByTxHash, then deriveForce(_, localLatestL2) writes only blocks above the local head, leaving existing blocks untouched (no EL reorg). Falls through to the shared verifyBatchRoots + advanceSafe. - D wait-next-poll: header missing AND L2 head grew within the window → continue, sequencer is alive and P2P will deliver the missing blocks before the next poll cycle. deriveForce now takes a skipNumber so scenarios B and C share one implementation: skipNumber=0 means "overwrite everything", skipNumber>0 skips blocks already present locally and only appends the missing tail. The parent header anchor is max(firstNum-1, skipNumber) so the first block we actually write always has its parent in the EL. The retry loop surfaces non-NotFound RPC errors from HeaderByNumber / BlockNumber so transient client failures are visible instead of being masked as "sequencer stopped". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scenario C (header missing + L2 head flat) appends blocks above the local head via deriveForce. Even though it doesn't reorg existing blocks, blocksync / broadcast reactors may race with derivation on the same height range as P2P delivers blocks for the missing tail. Wrap the fill-gap block writes with StopReactorsBeforeReorg / StartReactorsAfterReorg so the reactor lifecycle matches scenario B self-heal: derivation owns the writes during the operation, then reactors restart from the post-write head and catch up via P2P. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
The previous reactor stop/start pattern around deriveForce returned early on derive errors, leaving consensus reactors permanently stopped: Stop is idempotent on retry, but Start was never reached, so blocksync / broadcast stayed paused indefinitely until process restart. Wrap stop/derive/start in withReactorsQuiesced so Start runs in a defer regardless of body outcome. Restart height is read from the L2 EL latest (truth source) rather than the lastHeader returned by deriveForce — that header may be nil or stale if writes failed mid-batch, but the EL always knows what's actually applied. Use context.Background() for the deferred BlockNumber read so a cancelled parent ctx doesn't block reactor recovery. Both scenario B (self-heal) and scenario C (fill-gap) call sites now share the same helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous implementation logged the post-write BlockNumber read failure but still passed the zero-valued cur to StartReactorsAfterReorg, which would have told blocksync to resume from height 0 and re-fetch the entire chain from genesis. withReactorsQuiesced now captures the pre-write height before stopping reactors and uses it as a safe lower-bound fallback if the post-write EL read fails. If even the pre-write read fails we don't stop reactors at all and return the error so the batch retries next poll, rather than committing to a quiesce window with no way to pick a sensible restart height. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the inline 6s retry loop in the per-batch dispatcher with a single observation captured once per derivationBlock pull. The previous-pull's L2 latest is cached on Derivation.lastObservedL2Latest; the next pull compares its current latest against the cache to derive l2Grew, then runs through the per-batch loop with that single bool: header present → A / B (rebuildBlob). header missing AND l2Grew → D (alive, skip; next poll re-tries). header missing AND !l2Grew → C (stopped → L1 blob fill-gap). The poll interval (default 15s) is naturally the observation window — no inline sleep, no extra RPCs in the per-batch loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Path B entry-point's alive-vs-stopped signal is the cross-pull
movement of L2 latest. Capture it once per pull, just above the
`for _, lg := range logs` loop, so:
- All batches in a pull share the same verdict — fillGap's own
block writes can't pollute later batches' growth comparison.
(The previous fix attempt moved the read into the per-batch
branch, which broke multi-batch sequencer-stopped recovery: only
one batch was rescued per poll instead of all accumulated logs.)
- Snapshot is fresher than placing it at the top of derivationBlock,
where the L1 prep RPCs (getLatestConfirmedBlockNumber,
fetchRollupLog, LastCommittedBatchIndex) sit between the read
and the consumer.
- Reads are colocated with their use site for easier review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: corey <corey.zhang@bitget.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Number
When local verify sees batchInfo.lastBlockNumber missing locally, the new dispatcher distinguishes four cases:
deriveForce now takes a skipNumber so scenarios B and C share one implementation: skipNumber=0 means "overwrite everything", skipNumber>0 skips blocks already present locally and only appends the missing tail. The parent header anchor is max(firstNum-1, skipNumber) so the first block we actually write always has its parent in the EL.
The retry loop surfaces non-NotFound RPC errors from HeaderByNumber / BlockNumber so transient client failures are visible instead of being masked as "sequencer stopped".