Tick scheduler silently skips intervals when a duty overruns 800ms (proposer never attests on its own slot) #413

@MegaRedHand

Summary

handle_tick schedules the next tick only after on_tick returns, but computes the delay from the timestamp captured at entry (crates/blockchain/src/lib.rs:694-708 on main):

let now_ms = ...;                 // captured at tick entry
self.on_tick(now_ms, ctx).await;  // may run long (block building proves inline)
let ms_to_next_interval = ms_until_next_interval(now_ms, genesis_time_ms);
send_after(Duration::from_millis(ms_to_next_interval), ...);

If on_tick takes longer than the remaining time to the next interval boundary, the next tick fires past that boundary and the skipped interval's duty never runs.

Concrete case: the tick at interval 0 builds a block. Block building proves inline on the actor thread (proposer signature aggregation / proof merging, ~1.3s observed). With entry at t=0, ms_until_next_interval returns 800ms, but send_after only starts counting at t≈1300ms, so the next tick fires at t≈2100ms, which falls in interval 2 (1600-2400ms). Interval 1 (attestation production) is silently skipped.
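The arithmetic in the concrete case can be reproduced with a small sketch. It assumes 800ms intervals counted from genesis (inferred from the numbers above) and guesses at the body of ms_until_next_interval; interval_index and next_fire_ms are hypothetical helpers that exist only for this illustration.

```rust
/// Interval length, inferred from the 800ms figure in this issue (assumption).
const INTERVAL_MS: u64 = 800;

/// Assumed behaviour of the real helper: milliseconds from `now_ms` to the
/// next interval boundary, counted from genesis.
fn ms_until_next_interval(now_ms: u64, genesis_time_ms: u64) -> u64 {
    let elapsed = now_ms.saturating_sub(genesis_time_ms);
    INTERVAL_MS - (elapsed % INTERVAL_MS)
}

/// Hypothetical helper: index of the interval that contains `now_ms`.
fn interval_index(now_ms: u64, genesis_time_ms: u64) -> u64 {
    now_ms.saturating_sub(genesis_time_ms) / INTERVAL_MS
}

/// Hypothetical model of the current scheduling: the delay is computed from
/// the entry timestamp, but the timer only starts once on_tick has returned.
fn next_fire_ms(entry_ms: u64, on_tick_duration_ms: u64, genesis_time_ms: u64) -> u64 {
    let delay = ms_until_next_interval(entry_ms, genesis_time_ms);
    entry_ms + on_tick_duration_ms + delay
}
```

With entry at 0 and a 1300ms on_tick, next_fire_ms(0, 1300, 0) is 2100 and interval_index(2100, 0) is 2, so no tick ever lands in interval 1.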

Impact

  • A proposing node produces no attestations for any slot it proposes, whenever block building exceeds 800ms (which it does whenever it proves).
  • N-node devnets lose ~1/N of attestations every slot.
  • 1-node devnets can never justify (proposer every slot → zero attestations after slot 0); 2-node devnets can't reach the 2/3 threshold either. Finalization requires ≥4 nodes purely because of this skip.
  • Any other duty overrun has the same effect on subsequent intervals (e.g. a slow interval-2 tick would skip the interval-3 safe-target update).

Evidence

Single-node devnet (4 validators, aggregator), devnet5 + leanVM 0520822, release build:

  • Slot 0: 4 attestations published at interval 1, aggregation completes (interval 0 had no block build).
  • Slots 1–26: a "Building block slot=N" line every slot (each ~1.3s incl. inline prove), zero "Published attestation" lines, and justified_slot=0 finalized_slot=0 throughout.
  • Same topology with 4 nodes (proposer rotates, 3/4 attest each slot): justification and finalization advance normally, which points to the scheduling mechanism rather than any crypto issue.

Possible directions

  1. Schedule each tick for the next interval boundary regardless of how long on_tick took (compute the delay from a fresh timestamp, and if a boundary was missed, fire immediately so the skipped interval's duty still runs, possibly tagged with its intended interval).
  2. Move block building off the actor thread (spawn_blocking, like the aggregation worker) so on_tick returns within the interval.
  3. Derive slot/interval from the tick's scheduled time instead of the wall clock at processing time, so late ticks still execute their intended duty.

(1) and (3) change catch-up semantics for genuinely-late nodes and need care; (2) is the most contained but only fixes the block-building instance.
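As a rough illustration of how (1) and (3) could combine, the sketch below plans the next tick from a fresh timestamp and tags it with its intended interval. It is not the existing scheduler: NextTick and plan_next_tick are hypothetical names, the 800ms interval length is inferred from this issue, and the catch-up policy (replay one missed interval at a time) is only one possible choice.

```rust
/// Interval length, inferred from the 800ms figure in this issue (assumption).
const INTERVAL_MS: u64 = 800;

/// Hypothetical scheduling decision produced after a tick has been handled.
#[derive(Debug, PartialEq, Eq)]
enum NextTick {
    /// The next boundary has already passed: run that interval's duty now.
    Immediate { interval: u64 },
    /// The next boundary is still ahead: sleep `delay_ms`, then run it.
    After { delay_ms: u64, interval: u64 },
}

/// Plan the tick that follows `last_interval`, using a timestamp taken
/// *after* on_tick returned instead of the one captured at entry.
/// Intervals are numbered globally from genesis in this sketch.
fn plan_next_tick(last_interval: u64, fresh_now_ms: u64, genesis_time_ms: u64) -> NextTick {
    let interval = last_interval + 1;
    let start_ms = genesis_time_ms + interval * INTERVAL_MS;
    if fresh_now_ms >= start_ms {
        // Boundary missed while on_tick was running: do not skip the duty.
        NextTick::Immediate { interval }
    } else {
        NextTick::After { delay_ms: start_ms - fresh_now_ms, interval }
    }
}
```

For the case in this issue, plan_next_tick(0, 1300, 0) yields Immediate { interval: 1 }, so the attestation duty still runs, about 500ms late. A node that is many intervals behind would replay every missed interval one by one under this policy, which is the catch-up concern noted above.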

Found while soak-testing the zk-alloc allocator (#412); the bug is independent of that PR and reproduces with the default allocator whenever block building exceeds one interval.
