fix(consensus): close vote-race in broadcastBlockHash, restore [pro, con] BFT semantics (batch D) [DRAFT - consensus]#888
Conversation
…con] BFT semantics (batch D) Closes audit-sweep CRITICAL #2 (broadcastBlockHash.ts vote race) from the 2026-05-28 bug hunt. The pre-existing implementation had three distinct, interacting consensus-safety defects. Bug 1 — fire-and-forget .then chains Each `peer.longCall` was wrapped in `promise.then(async response => { ... pro++; con++; signatures[...] = ...; })`. The `.then` chain was discarded; only the original `longCall` promises were awaited via `Promise.all(promises)`. That means `pro/con` increments and signature writes ran AFTER the routine returned. The caller (`voteOnBlock`) then consumed a half-populated vote tally and a half-merged signatures map. Bug 2 — return value contradicts BFT semantics Last line returned `[signatureCount, shard.length - signatureCount]` rather than `[pro, con]`. Vote count and signature count are not the same metric. Signature count includes: - our own pre-existing signature - signatures peer-merged via the response.extra payload - signatures relayed by other shard members concurrently calling our `manageProposeBlockHash` handler (because `block` IS `getSharedState.candidateBlock` — same object reference) BFT threshold (2/3 + 1 of shard.length) was being compared against this inflated number. Blocks could pass `isBlockValid` on fewer actual peer-agreement votes than BFT requires. Safety violation. Bug 3 — concurrent writes to shared signatures map `block.validation_data.signatures` is the same object as `getSharedState.candidateBlock.validation_data.signatures`. Outbound `.then` callbacks and inbound `manageProposeBlockHash` handler invocations from every shard peer race to write into it. Individual `obj[key] = value` is atomic in JS, but the read-verify-async-write pattern (incoming sig → await verify → write) interleaves across multiple paths. Fix shape (Path C — full rewrite with explicit aggregation) 1. Per-peer flow extracted to `proposeAndCollect`. Wraps `peer.longCall` in try/catch so a network failure becomes a `con` vote with rejection reason, not a routine-level abort. Implements the same `Promise.allSettled` discipline batch A applied to `broadcastNewBlock`/`broadcastOurSyncData`. 2. `verifyIncomingSignatures` helper runs all per-peer signature verifications in parallel via `Promise.all`, returns the verified subset as `Record<string, string>`. Drops invalid signatures with logging; never throws. 3. Outer routine: `await Promise.all(shard.map(...))` aggregates every peer outcome before counting anything. Single deterministic serial-merge loop over the outcomes increments `pro/con` and writes verified signatures into `block.validation_data.signatures`. No `.then` chains. 4. Return value: `[pro, con]` — actual vote counts. Matches `isBlockValid(pro, totalVotes)` 2/3+1 BFT threshold semantics in `PoRBFT.ts:642-645` and the caller signature in `voteOnBlock` (`PoRBFT.ts:614-617`). 5. Allowed-codes list extended to include 404 (candidate block not formed) in addition to 401 (validator not in shard). Both are valid "con" responses the handler can return; previously 404 was being treated as a retryable failure in `longCall`, producing 3× retries before eventually counting as con. Why not Bun workers Race wasn't CPU-bound; it was async-completion-ordering. Workers fix CPU bottlenecks, not promise misuse. Awaiting the chain fixes the race; threads wouldn't. TxValidatorPool already uses worker_threads for crypto. Verification - tsc --noEmit clean. - Manual trace of the original race: 1. Shard A,B,C,D,E. We call A, B, C, D in parallel. 2. Old code: outer Promise.all(promises) resolves when all four longCall HTTP responses come back. pro/con still 0; signatures still {}. Routine returns [0, shard.length] (signatureCount=0). isBlockValid fails. Block rejected even though all peers agreed. 3. New code: outer Promise.all(map) waits for proposeAndCollect (longCall + verify + classify) to complete on EVERY peer. Outcomes aggregated. Merge loop runs. Routine returns [pro=4, con=0]. isBlockValid passes correctly. - Manual trace of the signature inflation: 1. Shard A,B,C,D,E. We propose. A relays C's and D's signatures. B relays D's and E's signatures. 2. Old code: signatureCount = our + A + C + D + B + E = 6. But shard.length = 5, so [6, -1] gets returned. isBlockValid compares pro=6 against threshold (10/3 + 1) = 4. Passes. Reality: only A and B actually voted pro. C, D, E were relayed in. 3. New code: pro = 2 (A and B), con = 3 (C, D, E never directly called us). Threshold = 4. Correctly rejects. The relayed signatures from A and B still get merged into our signatures map for downstream block-finalisation use, but they don't inflate our vote count. Out of scope for this PR - Cross-RPC vote-relay (the TODO at the end of the previous implementation): peers transmitting their inbound vote counts to other peers to handle network partitions. Tracked separately. - e2e test that exercises the race scenarios. The traces above are walkable on devnet via the existing `testing/devnet/scripts/test-transfer-e2e.sh` harness with a second-validator chaos injection.
|
Warning Review limit reached
More reviews will be available in 45 minutes. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThe broadcast block hash consensus routine is refactored to eliminate post-return signature mutations. Per-peer calls now execute in parallel via ChangesConsensus Broadcast Refactoring
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@greptile review |
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/libs/consensus/v2/routines/broadcastBlockHash.ts (1)
260-263: 💤 Low valueComment mentions
Object.assignbut code uses direct property assignment.The comment describes "each
Object.assignoperation" but the actual merge at line 269 uses direct property assignment. Consider updating the comment to match the implementation.📝 Suggested comment update
- // Serial merge: each `Object.assign` operation against the - // shared `signatures` map is atomic, but we want a single - // deterministic order so log lines reflect what landed and - // operators can replay the merge if needed. + // Serial merge: we iterate outcomes in deterministic order so + // log lines reflect what landed and operators can replay the + // merge if needed. All writes complete before the function + // returns, eliminating the post-return mutation race.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/libs/consensus/v2/routines/broadcastBlockHash.ts` around lines 260 - 263, The comment referencing "each `Object.assign` operation" is inaccurate because the merge uses direct property assignment into the shared `signatures` map; update the comment near the merge in broadcastBlockHash (the block merge that writes into `signatures`) to describe the actual implementation (e.g., "each property is assigned directly into the shared `signatures` map in a deterministic, serial order") or alternatively change the merge code to use `Object.assign(signatures, ...)` if you prefer the comment to stay as-is; ensure the comment and the implementation (the merge into `signatures` inside broadcastBlockHash) match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/libs/consensus/v2/routines/broadcastBlockHash.ts`:
- Around line 260-263: The comment referencing "each `Object.assign` operation"
is inaccurate because the merge uses direct property assignment into the shared
`signatures` map; update the comment near the merge in broadcastBlockHash (the
block merge that writes into `signatures`) to describe the actual implementation
(e.g., "each property is assigned directly into the shared `signatures` map in a
deterministic, serial order") or alternatively change the merge code to use
`Object.assign(signatures, ...)` if you prefer the comment to stay as-is; ensure
the comment and the implementation (the merge into `signatures` inside
broadcastBlockHash) match.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 613d21a8-961e-44ad-8f98-398f647b3d34
📒 Files selected for processing (1)
src/libs/consensus/v2/routines/broadcastBlockHash.ts
Greptile SummaryThis PR rewrites
Confidence Score: 4/5The three correctness bugs are properly fixed and the new BFT vote-counting logic is consistent with isBlockValid's 2/3+1 threshold; the main open question is whether relayed-but-lost signatures from the own-sig-missing con branch could prevent finalization under network partition. The core fixes — awaited vote tallying, returning [pro, con] instead of [signatureCount, ...], and serial post-collect merge — are all implemented correctly. The remaining edge case (verified third-party signatures silently discarded on the own-sig-missing con path) could affect finalization liveness under network partition, worth addressing before production. src/libs/consensus/v2/routines/broadcastBlockHash.ts — specifically the con return path at lines 228-238 where verified is populated but not forwarded to the merge loop Important Files Changed
Sequence DiagramsequenceDiagram
participant BH as broadcastBlockHash
participant PAC as proposeAndCollect xN
participant Peer as Remote Peer
participant VIS as verifyIncomingSignatures
participant Merge as Serial Merge Loop
BH->>BH: structuredClone validation_data snapshot
BH->>PAC: Promise.all shard.map proposeAndCollect
par Per peer
PAC->>Peer: longCall proposeBlockHash
alt Network error
PAC-->>BH: vote con
else HTTP 401 or 404
PAC-->>BH: vote con
else HTTP 200 empty sigs
PAC-->>BH: vote con
else HTTP 200 with sigs
PAC->>VIS: verifyIncomingSignatures
VIS-->>PAC: verified subset
alt own sig missing
PAC-->>BH: vote con
else own sig verified
PAC-->>BH: vote pro signaturesToMerge
end
end
end
BH->>Merge: serial loop outcomes
Merge->>Merge: pro or con increment merge sigs
BH-->>BH: return pro con
Reviews (6): Last reviewed commit: "fix(consensus): require verified self-si..." | Re-trigger Greptile |
Greptile SummaryThis PR rewrites
Confidence Score: 3/5The core vote-race and wrong-return-value bugs are correctly addressed, but a peer that returns 200 with no verifiable signatures still increments the pro counter, meaning the BFT tally can include votes that carry no cryptographic attestation into the block. The rewrite fixes the three described bugs cleanly and the overall structure is sound. The main residual concern is that src/libs/consensus/v2/routines/broadcastBlockHash.ts — specifically the Important Files Changed
Sequence DiagramsequenceDiagram
participant CR as consensusRoutine (PoRBFT)
participant BB as broadcastBlockHash
participant PAC as proposeAndCollect (×N)
participant VIS as verifyIncomingSignatures
participant Peer as Peer N (longCall)
participant MPB as manageProposeBlockHash (remote)
CR->>BB: broadcastBlockHash(block, shard)
BB->>BB: build proposeParams [hash, validation_data ref, ourId]
BB->>PAC: Promise.all(shard.map(proposeAndCollect))
par for each shard peer
PAC->>Peer: longCall(proposeBlockHash, proposeParams)
Peer->>MPB: HTTP POST consensus_routine/proposeBlockHash
MPB-->>Peer: 200 + validation_data (or 401/404)
Peer-->>PAC: RPCResponse
alt "result === 200"
PAC->>VIS: verifyIncomingSignatures(extra.signatures, block.hash, peerId)
VIS->>VIS: Promise.all(verify each sig via TxValidatorPool)
VIS-->>PAC: "verified: Record<string,string>"
PAC-->>BB: "{vote:pro, signaturesToMerge: verified}"
else "result === 401 / 404"
PAC-->>BB: "{vote:con, signaturesToMerge:{}}"
else longCall throws
PAC-->>BB: "{vote:con, rejectionReason: msg}"
end
end
BB->>BB: serial merge loop — count pro/con, write signaturesToMerge into block.validation_data.signatures
BB-->>CR: [pro, con]
CR->>CR: isBlockValid(pro, shard.length) → finalizeBlock or throw BlockInvalidError
Reviews (2): Last reviewed commit: "fix(consensus): close vote-race in broad..." | Re-trigger Greptile |
… validation_data fan-out (PR #888 iter 1) Four Greptile findings — all real. P1 — pro vote without cryptographic attestation The previous iter classified any HTTP 200 as `vote: "pro"`, even when: (a) `extra.signatures` was empty, or (b) every entry in `extra.signatures` failed verification. BFT quorum could be reached on votes carrying no auditable proof. A node with a broken/rotated signing key contributed to threshold without leaving a verifiable trace. Fix: tighten the pro contract to require the peer's OWN signature on our block hash to be PRESENT in `extra.signatures` AND survive verification. Anything weaker downgrades to a `con` with explicit rejectionReason: - "200 with empty signatures map" - "200 without verifiable own signature on block hash" Relayed third-party signatures (other validators' attestations the peer happens to have collected) are still merged into our signatures map for downstream finalisation, but they no longer qualify as the peer's own vote. P2 — double error log on verify exception `verifyIncomingSignatures` catch path logged "Signature verification threw for X". The outer loop then ran the `else` branch on `isValid: false` and logged "Invalid signature relayed by Y for X; dropping" — two unrelated- looking errors for a single event, confusing log triage. Fix: add `loggedFailure: boolean` to the per-entry result. Outer loop skips its own log when the inner catch already emitted one. Genuine `isValid: false` from a clean verify (not exception) still produces a single "Invalid signature relayed" log. P2 — `proposeParams[1]` live reference race `proposeParams[1]` held `block.validation_data` directly. That is the same object as `getSharedState.candidateBlock.validation_data`. Inbound `manageProposeBlockHash` handlers run concurrently with our outbound broadcast (every shard member calls every other simultaneously) and mutate `signatures` on that same object. Effect: each peer in our `Promise.all` fan-out received a slightly different payload depending on which inbound calls landed first. Receivers see inconsistent snapshots of our validation_data — verifies are still correct (signatures are self-contained crypto), but log-based debugging is harder because "what we sent to peer A" no longer matches "what we sent to peer B". Fix: `structuredClone(block.validation_data)` once before fan-out. Every peer sees the same frozen payload. Receivers still verify and merge what they get; the freeze only affects what we ship. Verification - tsc --noEmit clean. - Pro vote with empty extra.signatures: now logs error + returns con with reason. Cannot inflate BFT tally. - Pro vote where peer's own pubkey is not in verified set: now downgrades to con with explicit "missing own signature" reason. - Double-log scenario: catch path emits "verification threw", outer loop sees `loggedFailure: true`, skips its own log. Single log per event. - Concurrent inbound during fan-out: `validationDataSnapshot` is the immutable frozen copy; mutations to live `block.validation_data.signatures` from inbound handlers do not reach the outbound serialisation. Each peer call gets bit-identical `proposeParams[1]`.
|
@greptile review |
1 similar comment
|
@greptile review |
…ork testing pass Single document covering everything surfaced during the testing pass that ran 2026-05-26 → 2026-05-31 against dev.node2. Organised by topic, not chronology, so the team has one place to look up what was found, what was fixed, and what is still pending — without scrolling Telegram or hopping between PRs. Sections: - L2PS encryption nonce reuse (SDK #87 + 3 follow-up security findings) - Governance hash mismatch on dev.node2 (still broken; real fix in DEM-727) - Test-coverage gap that allowed the governance break to slip through - Side finding — nonces were not checked or incremented per tx (Hovhannes's batch C #884–#887, batch D #888 in flight) - UX surfacing for L2PS (Demo #11) - Operational risks still open - Linear ticket map across DEM-722 epic - Source material cross-links Cross-links the raw battery output and serializer analysis already in flight via Node #876. Closes DEM-729. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
@greptile review |
Plain-English summary
What your code does (the consensus voting handshake)
When a node thinks it has a candidate block ready, it asks every other node in the shard: "do you agree the block's hash should be X?" by calling each peer's
proposeBlockHashendpoint.Each peer answers with either:
The sender tallies the yes/no votes. If more than 2/3 voted yes → the block is valid → it gets finalised onto the chain.
What was broken (three things at once, all amplifying each other)
Half-counted votes — the tally code (
pro++/con++) ran inside a.then(...)callback that wasn't being waited for. Outer code returned before the votes finished counting. Result: routine sometimes returnedpro=0even though every peer agreed.Wrong number returned — instead of returning the vote count, the routine returned the signature count — a totally different number. Signatures pile up from multiple sources (peer's own attestation, signatures the peer relayed from other peers, signatures other peers were concurrently writing into the same shared object because everyone calls everyone). The BFT "2/3 majority" check was comparing this inflated number against the threshold → blocks were getting approved on fewer real votes than required.
Shared mutable map race — because all five shard members call each other in parallel, they were all writing signatures into the same JavaScript object at the same time. JS is single-threaded so individual writes don't tear, but the read-verify-write pattern interleaves badly across paths.
What the fix does
{vote: "pro" | "con", signaturesToMerge: ...}. All outcomes are collected first, then a single serial loop counts votes and merges verified signatures. No more racing writes.[pro, con](real vote counts), not signature count. BFT check now compares apples to apples.convotes with an explicit reason, not silent aborts.Why this matters
The combination of bug 2 + bug 3 was the dangerous one: blocks could pass BFT without enough validators actually agreeing. With the fix, the math under
isBlockValidfinally matches the protocol's intent (2/3 + 1 of shard size).Technical detail
Closes audit-sweep CRITICAL #2 from the 2026-05-28 bug hunt: `broadcastBlockHash.ts` vote race.
Three interacting bugs in the pre-existing implementation
1. Fire-and-forget `.then` chains
`peer.longCall(...).then(async response => { ... pro++; signatures[...] = ... })` — the `.then` is discarded. Outer `await Promise.all(promises)` waits for the original longCall promises only. `pro/con` increments and signature writes complete AFTER the routine returns. Caller (`voteOnBlock`) consumes a half-populated tally.
2. Return value contradicts BFT semantics
Last line returned `[signatureCount, shard.length - signatureCount]` instead of `[pro, con]`. Signature count includes:
BFT `isBlockValid(pro, totalVotes)` was being passed signature count vs threshold computed from `shard.length`. Blocks could pass BFT on fewer actual peer-agreement votes than 2/3+1 requires. Consensus safety violation.
3. Concurrent writes to shared signatures map
`block.validation_data.signatures` IS `getSharedState.candidateBlock.validation_data.signatures` (same object — verified at `createBlock.ts:67`). Outbound `.then` callbacks AND inbound `manageProposeBlockHash` handler invocations race to write into it.
Fix shape (Path C — full rewrite with explicit aggregation)
Why not Bun workers (asked during implementation)
Race wasn't CPU-bound. Workers fix CPU, not promise misuse. `TxValidatorPool` already uses `worker_threads` for crypto.
Manual verification traces
Race scenario (4 peers, all agree):
Signature-inflation scenario (5-node shard, A relays C+D, B relays D+E):
Specifically wanted feedback on
Out of scope
Test plan