
fix(consensus): close vote-race in broadcastBlockHash, restore [pro, con] BFT semantics (batch D) [DRAFT - consensus] #888

Merged
tcsenpai merged 2 commits into stabilisation from bugfix/audit-sweep-batch-d-vote-race-2026-05-31
Jun 1, 2026

@tcsenpai
Contributor

@tcsenpai tcsenpai commented May 31, 2026

Plain-English summary

What your code does (the consensus voting handshake)

When a node thinks it has a candidate block ready, it asks every other node in the shard: "do you agree the block's hash should be X?" by calling each peer's proposeBlockHash endpoint.

Each peer answers with either:

  • 200 = yes + a signature attesting to that hash, plus any other signatures it's already collected from other peers
  • 401/404 = no (you're not in the shard / I haven't formed a candidate yet)

The sender tallies the yes/no votes. If more than 2/3 voted yes → the block is valid → it gets finalised onto the chain.

What was broken (three things at once, all amplifying each other)

  1. Half-counted votes — the tally code (pro++/con++) ran inside a .then(...) callback that wasn't being waited for. Outer code returned before the votes finished counting. Result: routine sometimes returned pro=0 even though every peer agreed.

  2. Wrong number returned — instead of returning the vote count, the routine returned the signature count — a totally different number. Signatures pile up from multiple sources (peer's own attestation, signatures the peer relayed from other peers, signatures other peers were concurrently writing into the same shared object because everyone calls everyone). The BFT "2/3 majority" check was comparing this inflated number against the threshold → blocks were getting approved on fewer real votes than required.

  3. Shared mutable map race — because all five shard members call each other in parallel, they were all writing signatures into the same JavaScript object at the same time. JS is single-threaded so individual writes don't tear, but the read-verify-write pattern interleaves badly across paths.

What the fix does

  • Replace fire-and-forget callbacks with awaited promises. Vote tally now finishes before the function returns.
  • Each per-peer call returns a small typed object: {vote: "pro" | "con", signaturesToMerge: ...}. All outcomes are collected first, then a single serial loop counts votes and merges verified signatures. No more racing writes.
  • Return [pro, con] (real vote counts), not signature count. BFT check now compares apples to apples.
  • Network failures count as con votes with an explicit reason, not silent aborts.

Why this matters

The combination of bug 2 + bug 3 was the dangerous one: blocks could pass BFT without enough validators actually agreeing. With the fix, the math under isBlockValid finally matches the protocol's intent (2/3 + 1 of shard size).


Technical detail

Closes audit-sweep CRITICAL #2 from the 2026-05-28 bug hunt: `broadcastBlockHash.ts` vote race.

Three interacting bugs in the pre-existing implementation

1. Fire-and-forget `.then` chains

`peer.longCall(...).then(async response => { ... pro++; signatures[...] = ... })` — the `.then` is discarded. Outer `await Promise.all(promises)` waits for the original longCall promises only. `pro/con` increments and signature writes complete AFTER the routine returns. Caller (`voteOnBlock`) consumes a half-populated tally.
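
A minimal sketch of the discarded-`.then` pattern and its awaited counterpart. The helpers `fakeLongCall` and `fakeVerify` are hypothetical stand-ins for `peer.longCall` and an async signature check; nothing here is taken from the real `broadcastBlockHash.ts`.

```typescript
// Hypothetical stand-ins: `fakeLongCall` plays the role of peer.longCall and
// `fakeVerify` plays the role of an async signature check.
type FakeResponse = { result: number }

function delay(ms: number): Promise<void> {
    return new Promise<void>(resolve => setTimeout(resolve, ms))
}

async function fakeLongCall(result: number): Promise<FakeResponse> {
    await delay(5)
    return { result }
}

async function fakeVerify(): Promise<void> {
    await delay(5)
}

// Buggy shape: the promise returned by .then() is dropped, so Promise.all
// waits only for the raw calls, not for the tallying callbacks.
async function tallyFireAndForget(codes: number[]): Promise<[number, number]> {
    let pro = 0
    let con = 0
    const promises = codes.map(code => {
        const call = fakeLongCall(code)
        call.then(async response => {
            await fakeVerify()
            if (response.result === 200) pro++
            else con++
        })
        return call // only this promise is awaited below
    })
    await Promise.all(promises)
    return [pro, con] // read before any callback has finished counting
}

// Fixed shape: the whole per-peer flow is one awaited promise, and counting
// happens in a single loop after every outcome is known.
async function tallyAwaited(codes: number[]): Promise<[number, number]> {
    const votes = await Promise.all(
        codes.map(async code => {
            const response = await fakeLongCall(code)
            await fakeVerify()
            return response.result === 200 ? "pro" : "con"
        }),
    )
    let pro = 0
    let con = 0
    for (const vote of votes) {
        if (vote === "pro") pro++
        else con++
    }
    return [pro, con]
}
```

With four agreeing peers (`[200, 200, 200, 200]`), the first function resolves to `[0, 0]` in this sketch, while the second resolves to `[4, 0]`.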

2. Return value contradicts BFT semantics

Last line returned `[signatureCount, shard.length - signatureCount]` instead of `[pro, con]`. Signature count includes:

  • our own pre-existing signature
  • signatures peer-merged via `response.extra`
  • signatures relayed by other shard peers concurrently calling our `manageProposeBlockHash` handler (because `block` IS `getSharedState.candidateBlock` — same object reference)

BFT `isBlockValid(pro, totalVotes)` was being passed signature count vs threshold computed from `shard.length`. Blocks could pass BFT on fewer actual peer-agreement votes than 2/3+1 requires. Consensus safety violation.

3. Concurrent writes to shared signatures map

`block.validation_data.signatures` IS `getSharedState.candidateBlock.validation_data.signatures` (same object — verified at `createBlock.ts:67`). Outbound `.then` callbacks AND inbound `manageProposeBlockHash` handler invocations race to write into it.

Fix shape (Path C — full rewrite with explicit aggregation)

  1. `proposeAndCollect` per-peer flow — try/catch around `longCall` so network failures become `con` votes with reason, not aborts. Matches the `Promise.allSettled` discipline batch A applied elsewhere.
  2. `verifyIncomingSignatures` parallel sig verify, returns verified subset.
  3. Outer routine — `await Promise.all(shard.map(...))`. Aggregates outcomes before counting. Single deterministic serial merge loop.
  4. Return — `[pro, con]` actual vote counts.
  5. Allowed codes — added 404 (candidate not formed) alongside 401, was being treated as retry-worthy before.
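
The collect-then-merge shape in the steps above can be sketched roughly as follows. The `PeerLike` and `Outcome` types and the `...Sketch` function names are hypothetical simplifications, and signature verification is left out to keep the example short; this is an illustration of the structure, not the merged implementation.

```typescript
// All types below are hypothetical simplifications, not the project's real ones.
type Outcome = {
    peerId: string
    vote: "pro" | "con"
    signaturesToMerge: Record<string, string>
    rejectionReason?: string
}

interface PeerLike {
    id: string
    longCall(): Promise<{ result: number; signatures: Record<string, string> }>
}

// Per-peer flow: always resolves, so one failing peer cannot abort the round.
async function proposeAndCollectSketch(peer: PeerLike): Promise<Outcome> {
    try {
        const response = await peer.longCall()
        if (response.result !== 200) {
            return {
                peerId: peer.id,
                vote: "con",
                signaturesToMerge: {},
                rejectionReason: `status ${response.result}`,
            }
        }
        return { peerId: peer.id, vote: "pro", signaturesToMerge: response.signatures }
    } catch (error) {
        // A network failure becomes a con vote with a reason.
        const reason = error instanceof Error ? error.message : String(error)
        return { peerId: peer.id, vote: "con", signaturesToMerge: {}, rejectionReason: reason }
    }
}

async function broadcastSketch(
    shard: PeerLike[],
    signatures: Record<string, string>,
): Promise<[number, number]> {
    // Every per-peer flow finishes before anything is counted.
    const outcomes = await Promise.all(shard.map(peer => proposeAndCollectSketch(peer)))

    let pro = 0
    let con = 0
    // Single serial loop: count votes and merge signatures in shard order.
    for (const outcome of outcomes) {
        if (outcome.vote === "pro") {
            pro++
            for (const identity of Object.keys(outcome.signaturesToMerge)) {
                signatures[identity] = outcome.signaturesToMerge[identity]
            }
        } else {
            con++
        }
    }
    return [pro, con]
}
```

Because `proposeAndCollectSketch` never rejects, plain `Promise.all` is enough here; the try/catch plays the role that `Promise.allSettled` would otherwise play.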

Why not Bun workers (asked during implementation)

Race wasn't CPU-bound. Workers fix CPU, not promise misuse. `TxValidatorPool` already uses `worker_threads` for crypto.

Manual verification traces

Race scenario (4 peers, all agree):

  • Old: `Promise.all(promises)` resolves on HTTP completion. `pro=0`, returns `[0, shard.length]`. Block rejected.
  • New: `Promise.all(map(proposeAndCollect))` waits on full per-peer flow. Returns `[4, 0]`. Block passes.

Signature-inflation scenario (5-node shard, A relays C+D, B relays D+E):

  • Old: signatureCount = us+A+B+C+D+E = 6. Returns `[6, -1]`. Threshold = floor(10/3)+1 = 4. Passes on inflated count.
  • New: `pro=2` (A, B directly voted); C, D, E never directly called us. Returns `[2, 3]`. Correctly rejected. Relayed signatures still merged into map for downstream finalisation.
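
The arithmetic in the two traces above can be checked with a small sketch. `bftThreshold` and `isBlockValidSketch` are hypothetical names; the formula `floor(2n/3) + 1` follows the threshold quoted in the trace, and the `>=` comparison is an assumption about how the real `isBlockValid` compares against it.

```typescript
// Hypothetical restatement of the 2/3 + 1 check; the real isBlockValid
// lives in PoRBFT.ts and may differ in detail.
function bftThreshold(shardSize: number): number {
    return Math.floor((2 * shardSize) / 3) + 1
}

// Assumption: a block is valid when pro votes reach the threshold.
function isBlockValidSketch(pro: number, totalVotes: number): boolean {
    return pro >= bftThreshold(totalVotes)
}

const shardSize = 5
console.log(bftThreshold(shardSize))          // 4
console.log(isBlockValidSketch(6, shardSize)) // true  (old code: inflated signature count)
console.log(isBlockValidSketch(2, shardSize)) // false (new code: real vote count)
console.log(isBlockValidSketch(4, shardSize)) // true  (race scenario after the fix)
```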

Specifically wanted feedback on

  • Whether ANY caller of `broadcastBlockHash` was implicitly relying on the inflated signature-count return value.
  • Whether `isBlockValid`'s 2/3+1 threshold was tuned with the inflated count in mind (it should not have been, but I want explicit confirmation).
  • Whether the 404 allowed-codes addition is safe (was a 404 "candidate block not formed" previously triggering 3× `longCall` retry by accident, masking some other behaviour?).

Out of scope

  • Cross-RPC vote-relay (TODO comment carried over).
  • e2e test — devnet harness exists, scripted test follow-up.

Test plan

  • `bun install`, confirm clean tsc.
  • Boot devnet (5 nodes). Confirm consensus rounds still complete.
  • Inject chaos: stop one node mid-round. Confirm `pro=3, con=1` shape (down from 4 healthy).
  • Confirm BFT correctly rejects when 3/5 vote con (threshold = 4, so `pro=2` should fail).

…con] BFT semantics (batch D)

Closes audit-sweep CRITICAL #2 (broadcastBlockHash.ts vote race) from
the 2026-05-28 bug hunt. The pre-existing implementation had three
distinct, interacting consensus-safety defects.

Bug 1 — fire-and-forget .then chains
  Each `peer.longCall` was wrapped in `promise.then(async response =>
  { ... pro++; con++; signatures[...] = ...; })`. The `.then` chain
  was discarded; only the original `longCall` promises were awaited
  via `Promise.all(promises)`. That means `pro/con` increments and
  signature writes ran AFTER the routine returned. The caller
  (`voteOnBlock`) then consumed a half-populated vote tally and a
  half-merged signatures map.

Bug 2 — return value contradicts BFT semantics
  Last line returned `[signatureCount, shard.length - signatureCount]`
  rather than `[pro, con]`. Vote count and signature count are not
  the same metric. Signature count includes:
    - our own pre-existing signature
    - signatures peer-merged via the response.extra payload
    - signatures relayed by other shard members concurrently calling
      our `manageProposeBlockHash` handler (because `block` IS
      `getSharedState.candidateBlock` — same object reference)
  BFT threshold (2/3 + 1 of shard.length) was being compared against
  this inflated number. Blocks could pass `isBlockValid` on fewer
  actual peer-agreement votes than BFT requires. Safety violation.

Bug 3 — concurrent writes to shared signatures map
  `block.validation_data.signatures` is the same object as
  `getSharedState.candidateBlock.validation_data.signatures`.
  Outbound `.then` callbacks and inbound `manageProposeBlockHash`
  handler invocations from every shard peer race to write into it.
  Individual `obj[key] = value` is atomic in JS, but the
  read-verify-async-write pattern (incoming sig → await verify →
  write) interleaves across multiple paths.

Fix shape (Path C — full rewrite with explicit aggregation)

  1. Per-peer flow extracted to `proposeAndCollect`. Wraps
     `peer.longCall` in try/catch so a network failure becomes a
     `con` vote with rejection reason, not a routine-level abort.
     Implements the same `Promise.allSettled` discipline batch A
     applied to `broadcastNewBlock`/`broadcastOurSyncData`.

  2. `verifyIncomingSignatures` helper runs all per-peer signature
     verifications in parallel via `Promise.all`, returns the
     verified subset as `Record<string, string>`. Drops invalid
     signatures with logging; never throws.

  3. Outer routine: `await Promise.all(shard.map(...))` aggregates
     every peer outcome before counting anything. Single
     deterministic serial-merge loop over the outcomes increments
     `pro/con` and writes verified signatures into
     `block.validation_data.signatures`. No `.then` chains.

  4. Return value: `[pro, con]` — actual vote counts. Matches
     `isBlockValid(pro, totalVotes)` 2/3+1 BFT threshold semantics
     in `PoRBFT.ts:642-645` and the caller signature in
     `voteOnBlock` (`PoRBFT.ts:614-617`).

  5. Allowed-codes list extended to include 404 (candidate block not
     formed) in addition to 401 (validator not in shard). Both are
     valid "con" responses the handler can return; previously 404
     was being treated as a retryable failure in `longCall`,
     producing 3× retries before eventually counting as con.

Why not Bun workers
  Race wasn't CPU-bound; it was async-completion-ordering. Workers
  fix CPU bottlenecks, not promise misuse. Awaiting the chain fixes
  the race; threads wouldn't. TxValidatorPool already uses
  worker_threads for crypto.

Verification
  - tsc --noEmit clean.
  - Manual trace of the original race:
    1. Shard A,B,C,D,E. We call A, B, C, D in parallel.
    2. Old code: outer Promise.all(promises) resolves when all four
       longCall HTTP responses come back. pro/con still 0; signatures
       still {}. Routine returns [0, shard.length] (signatureCount=0).
       isBlockValid fails. Block rejected even though all peers
       agreed.
    3. New code: outer Promise.all(map) waits for proposeAndCollect
       (longCall + verify + classify) to complete on EVERY peer.
       Outcomes aggregated. Merge loop runs. Routine returns
       [pro=4, con=0]. isBlockValid passes correctly.
  - Manual trace of the signature inflation:
    1. Shard A,B,C,D,E. We propose. A relays C's and D's signatures.
       B relays D's and E's signatures.
    2. Old code: signatureCount = our + A + C + D + B + E = 6. But
       shard.length = 5, so [6, -1] gets returned. isBlockValid
       compares pro=6 against threshold (floor(10/3) + 1) = 4. Passes.
       Reality: only A and B actually voted pro. C, D, E were
       relayed in.
    3. New code: pro = 2 (A and B), con = 3 (C, D, E never directly
       called us). Threshold = 4. Correctly rejects. The relayed
       signatures from A and B still get merged into our signatures
       map for downstream block-finalisation use, but they don't
       inflate our vote count.

Out of scope for this PR
  - Cross-RPC vote-relay (the TODO at the end of the previous
    implementation): peers transmitting their inbound vote counts
    to other peers to handle network partitions. Tracked
    separately.
  - e2e test that exercises the race scenarios. The traces above
    are walkable on devnet via the existing
    `testing/devnet/scripts/test-transfer-e2e.sh` harness with a
    second-validator chaos injection.
@coderabbitai
Contributor

coderabbitai Bot commented May 31, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ade400fe-e89a-4af4-a28f-6f05efcdf01d

📥 Commits

Reviewing files that changed from the base of the PR and between 8decf73 and 4cd4654.

📒 Files selected for processing (1)
  • src/libs/consensus/v2/routines/broadcastBlockHash.ts

Walkthrough

The broadcast block hash consensus routine is refactored to eliminate post-return signature mutations. Per-peer calls now execute in parallel via Promise.all, with network failures and non-200 responses converted to deterministic "con" vote outcomes. Signature verification occurs synchronously before merging, and all signatures are merged serially into the block before the function returns.

Changes

Consensus Broadcast Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| Type contract and signature verification<br>src/libs/consensus/v2/routines/broadcastBlockHash.ts (lines 2–72) | Adds RPCResponse import, defines PeerVoteOutcome type to classify peer responses, and implements verifyIncomingSignatures helper that validates each relayed signature against the candidate block hash, dropping invalid ones with error logging and returning only verified {identity: signature} pairs. |
| Per-peer proposal and collection<br>src/libs/consensus/v2/routines/broadcastBlockHash.ts (lines 74–201) | Introduces proposeAndCollect helper that wraps the consensus_routine call, uses try/catch to convert network errors into "con" votes with rejection reason, handles non-200 responses with optional diagnostic tx-set diff logging, and on success invokes signature verification to produce verified signature subsets for merging. |
| Orchestration, signature merge, and return<br>src/libs/consensus/v2/routines/broadcastBlockHash.ts (lines 203–291) | Refactors broadcastBlockHash to execute per-peer outcomes via Promise.all, counts pro/con votes from outcomes, performs deterministic serial merge of verified signatures from each "pro" outcome into block.validation_data.signatures, ensures all merges complete before returning, and returns actual [pro, con] peer vote counts instead of signature-derived values. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A rabbit hops through consensus threads,
Where race conditions filled the dreads.
Now promises align in neat arrays,
Signatures merge before the routine strays.
No ghosts haunt the block when return's displayed! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main change: fixing a vote-race bug in broadcastBlockHash and restoring proper BFT vote semantics ([pro, con] counts), which aligns with the substantial rewrite eliminating race-prone patterns and async ordering issues. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@tcsenpai
Contributor Author

@greptile review

@tcsenpai tcsenpai marked this pull request as ready for review May 31, 2026 14:59
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/libs/consensus/v2/routines/broadcastBlockHash.ts (1)

260-263: 💤 Low value

Comment mentions Object.assign but code uses direct property assignment.

The comment describes "each Object.assign operation" but the actual merge at line 269 uses direct property assignment. Consider updating the comment to match the implementation.

📝 Suggested comment update
-    // Serial merge: each `Object.assign` operation against the
-    // shared `signatures` map is atomic, but we want a single
-    // deterministic order so log lines reflect what landed and
-    // operators can replay the merge if needed.
+    // Serial merge: we iterate outcomes in deterministic order so
+    // log lines reflect what landed and operators can replay the
+    // merge if needed. All writes complete before the function
+    // returns, eliminating the post-return mutation race.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/libs/consensus/v2/routines/broadcastBlockHash.ts` around lines 260 - 263,
The comment referencing "each `Object.assign` operation" is inaccurate because
the merge uses direct property assignment into the shared `signatures` map;
update the comment near the merge in broadcastBlockHash (the block merge that
writes into `signatures`) to describe the actual implementation (e.g., "each
property is assigned directly into the shared `signatures` map in a
deterministic, serial order") or alternatively change the merge code to use
`Object.assign(signatures, ...)` if you prefer the comment to stay as-is; ensure
the comment and the implementation (the merge into `signatures` inside
broadcastBlockHash) match.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 613d21a8-961e-44ad-8f98-398f647b3d34

📥 Commits

Reviewing files that changed from the base of the PR and between 924df40 and 8decf73.

📒 Files selected for processing (1)
  • src/libs/consensus/v2/routines/broadcastBlockHash.ts

@greptile-apps

greptile-apps Bot commented May 31, 2026

Greptile Summary

This PR rewrites broadcastBlockHash.ts to fix three interacting consensus bugs: fire-and-forget .then callbacks that left vote tallies empty when the caller consumed them, returning signatureCount instead of actual [pro, con] vote counts to the BFT check, and concurrent .then writes racing against inbound manageProposeBlockHash calls on the shared validation_data.signatures object.

  • Vote-race fix: per-peer calls are wrapped in proposeAndCollect, which always resolves (never rejects), so Promise.all collects every outcome before the serial merge loop increments pro/con and writes signatures.
  • BFT semantics restored: the function now returns [pro, con] (actual vote counts), not [signatureCount, shard.length - signatureCount], so isBlockValid(pro, shard.length) finally compares apples to apples against the 2/3+1 threshold.
  • Signature validation tightened: a 200 response now only counts as pro if the responder's own cryptographic signature is present in the returned bundle and passes TxValidatorPool.verify; responses that lack a verifiable own-signature are demoted to con with an explicit rejection reason.

Confidence Score: 4/5

The three correctness bugs are properly fixed and the new BFT vote-counting logic is consistent with isBlockValid's 2/3+1 threshold; the main open question is whether relayed-but-lost signatures from the own-sig-missing con branch could prevent finalization under network partition.

The core fixes — awaited vote tallying, returning [pro, con] instead of [signatureCount, ...], and serial post-collect merge — are all implemented correctly. The remaining edge case (verified third-party signatures silently discarded on the own-sig-missing con path) could affect finalization liveness under network partition, worth addressing before production.

src/libs/consensus/v2/routines/broadcastBlockHash.ts — specifically the con return path at lines 228-238 where verified is populated but not forwarded to the merge loop

Important Files Changed

Filename Overview
src/libs/consensus/v2/routines/broadcastBlockHash.ts Full rewrite fixing fire-and-forget vote tallying, wrong BFT return value, and shared-map race; new per-peer validation logic is correct; one edge case where third-party verified signatures on the con-due-to-missing-own-sig path are silently dropped could affect liveness in rare scenarios.

Sequence Diagram

sequenceDiagram
    participant BH as broadcastBlockHash
    participant PAC as proposeAndCollect xN
    participant Peer as Remote Peer
    participant VIS as verifyIncomingSignatures
    participant Merge as Serial Merge Loop
    BH->>BH: structuredClone validation_data snapshot
    BH->>PAC: Promise.all shard.map proposeAndCollect
    par Per peer
        PAC->>Peer: longCall proposeBlockHash
        alt Network error
            PAC-->>BH: vote con
        else HTTP 401 or 404
            PAC-->>BH: vote con
        else HTTP 200 empty sigs
            PAC-->>BH: vote con
        else HTTP 200 with sigs
            PAC->>VIS: verifyIncomingSignatures
            VIS-->>PAC: verified subset
            alt own sig missing
                PAC-->>BH: vote con
            else own sig verified
                PAC-->>BH: vote pro signaturesToMerge
            end
        end
    end
    BH->>Merge: serial loop outcomes
    Merge->>Merge: pro or con increment merge sigs
    BH-->>BH: return pro con

Reviews (6): Last reviewed commit: "fix(consensus): require verified self-si..." | Re-trigger Greptile

@greptile-apps

greptile-apps Bot commented May 31, 2026

Greptile Summary

This PR rewrites broadcastBlockHash.ts to fix three interlocking consensus bugs: fire-and-forget .then callbacks that let the vote tally return before peer responses were counted, an incorrect return value (signatureCount instead of [pro, con] vote counts) that made the BFT threshold comparison meaningless, and concurrent writes to the shared validation_data.signatures map. The fix introduces explicit await-ed Promise.all over per-peer proposeAndCollect flows, a serial post-collection merge loop, and restores correct [pro, con] semantics throughout.

  • Vote-race fix: each peer call is now fully await-ed end-to-end inside proposeAndCollect, so pro/con counts are complete before the function returns.
  • BFT semantics restored: the return value is now actual vote counts, not an inflated signature count derived from the shared candidateBlock object.
  • Signature merge serialised: all writes to block.validation_data.signatures happen in a single deterministic loop after all peer outcomes settle, eliminating the old write race between concurrent .then callbacks.

Confidence Score: 3/5

The core vote-race and wrong-return-value bugs are correctly addressed, but a peer that returns 200 with no verifiable signatures still increments the pro counter, meaning the BFT tally can include votes that carry no cryptographic attestation into the block.

The rewrite fixes the three described bugs cleanly and the overall structure is sound. The main residual concern is that proposeAndCollect classifies any HTTP 200 as vote: "pro" regardless of whether the peer's signatures pass verifyIncomingSignatures — so a node with a broken or rotated signing key contributes to quorum without leaving a verifiable trace in validation_data.signatures. Additionally, the double-log pattern in verifyIncomingSignatures emits two distinct error messages for a single verification exception, which will make log-based debugging confusing.

src/libs/consensus/v2/routines/broadcastBlockHash.ts — specifically the pro vote counting in proposeAndCollect and the error-logging path in verifyIncomingSignatures.

Important Files Changed

Filename Overview
src/libs/consensus/v2/routines/broadcastBlockHash.ts Full rewrite fixing three interacting bugs (fire-and-forget vote tally, wrong return value, shared-map write race); introduces a double-log on verification exceptions and a gap where a peer returning 200 with no verifiable signatures still increments the pro counter.

Sequence Diagram

sequenceDiagram
    participant CR as consensusRoutine (PoRBFT)
    participant BB as broadcastBlockHash
    participant PAC as proposeAndCollect (×N)
    participant VIS as verifyIncomingSignatures
    participant Peer as Peer N (longCall)
    participant MPB as manageProposeBlockHash (remote)

    CR->>BB: broadcastBlockHash(block, shard)
    BB->>BB: build proposeParams [hash, validation_data ref, ourId]
    BB->>PAC: Promise.all(shard.map(proposeAndCollect))

    par for each shard peer
        PAC->>Peer: longCall(proposeBlockHash, proposeParams)
        Peer->>MPB: HTTP POST consensus_routine/proposeBlockHash
        MPB-->>Peer: 200 + validation_data (or 401/404)
        Peer-->>PAC: RPCResponse
        alt "result === 200"
            PAC->>VIS: verifyIncomingSignatures(extra.signatures, block.hash, peerId)
            VIS->>VIS: Promise.all(verify each sig via TxValidatorPool)
            VIS-->>PAC: "verified: Record<string,string>"
            PAC-->>BB: "{vote:pro, signaturesToMerge: verified}"
        else "result === 401 / 404"
            PAC-->>BB: "{vote:con, signaturesToMerge:{}}"
        else longCall throws
            PAC-->>BB: "{vote:con, rejectionReason: msg}"
        end
    end

    BB->>BB: serial merge loop — count pro/con, write signaturesToMerge into block.validation_data.signatures
    BB-->>CR: [pro, con]
    CR->>CR: isBlockValid(pro, shard.length) → finalizeBlock or throw BlockInvalidError

Reviews (2): Last reviewed commit: "fix(consensus): close vote-race in broad..." | Re-trigger Greptile

… validation_data fan-out (PR #888 iter 1)

Four Greptile findings — all real.

P1 — pro vote without cryptographic attestation
  The previous iter classified any HTTP 200 as `vote: "pro"`,
  even when:
    (a) `extra.signatures` was empty, or
    (b) every entry in `extra.signatures` failed verification.

  BFT quorum could be reached on votes carrying no auditable
  proof. A node with a broken/rotated signing key contributed to
  threshold without leaving a verifiable trace.

  Fix: tighten the pro contract to require the peer's OWN
  signature on our block hash to be PRESENT in `extra.signatures`
  AND survive verification. Anything weaker downgrades to a
  `con` with explicit rejectionReason:
    - "200 with empty signatures map"
    - "200 without verifiable own signature on block hash"

  Relayed third-party signatures (other validators' attestations
  the peer happens to have collected) are still merged into our
  signatures map for downstream finalisation, but they no longer
  qualify as the peer's own vote.
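
A rough sketch of the tightened pro contract described above. The function name `classify200` and its parameters are hypothetical; `verified` is assumed to be the identity-to-signature map that survives verification, and the two rejection reasons are the ones quoted in this commit message.

```typescript
type Classification = { vote: "pro" | "con"; rejectionReason?: string }

// Hypothetical helper: decide how to count an HTTP 200 response.
// `received` is the raw signatures map from the peer, `verified` the subset
// that passed verification against our block hash.
function classify200(
    peerId: string,
    received: Record<string, string>,
    verified: Record<string, string>,
): Classification {
    if (Object.keys(received).length === 0) {
        return { vote: "con", rejectionReason: "200 with empty signatures map" }
    }
    // The peer's OWN signature must be present and verified; relayed
    // third-party signatures alone do not make this a pro vote.
    if (!Object.prototype.hasOwnProperty.call(verified, peerId)) {
        return {
            vote: "con",
            rejectionReason: "200 without verifiable own signature on block hash",
        }
    }
    return { vote: "pro" }
}

console.log(classify200("A", {}, {}).vote)                               // con
console.log(classify200("A", { C: "sig-C" }, { C: "sig-C" }).vote)       // con
console.log(classify200("A", { A: "sig-A", C: "sig-C" }, { A: "sig-A" }).vote) // pro
```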

P2 — double error log on verify exception
  `verifyIncomingSignatures` catch path logged
  "Signature verification threw for X". The outer loop then ran
  the `else` branch on `isValid: false` and logged "Invalid
  signature relayed by Y for X; dropping" — two unrelated-
  looking errors for a single event, confusing log triage.

  Fix: add `loggedFailure: boolean` to the per-entry result.
  Outer loop skips its own log when the inner catch already
  emitted one. Genuine `isValid: false` from a clean verify
  (not exception) still produces a single "Invalid signature
  relayed" log.
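
A minimal sketch of the `loggedFailure` flag described above. The helper names, the `verify` callbacks, and the `log` array (standing in for the project's logger) are hypothetical; only the flag and the two log messages come from this commit message.

```typescript
type VerifyResult = { identity: string; isValid: boolean; loggedFailure: boolean }
type Entry = { identity: string; verify: () => Promise<boolean> }

// Inner step: an exception is logged here, once, and flagged.
async function verifyOne(entry: Entry, log: string[]): Promise<VerifyResult> {
    try {
        const isValid = await entry.verify()
        return { identity: entry.identity, isValid, loggedFailure: false }
    } catch {
        log.push(`Signature verification threw for ${entry.identity}`)
        return { identity: entry.identity, isValid: false, loggedFailure: true }
    }
}

// Outer loop: logs an invalid signature only if the inner step did not
// already log for the same entry.
async function verifyAll(entries: Entry[], relayer: string, log: string[]): Promise<string[]> {
    const results = await Promise.all(entries.map(entry => verifyOne(entry, log)))
    const accepted: string[] = []
    for (const result of results) {
        if (result.isValid) {
            accepted.push(result.identity)
        } else if (!result.loggedFailure) {
            log.push(`Invalid signature relayed by ${relayer} for ${result.identity}; dropping`)
        }
    }
    return accepted
}
```

In this sketch each failed entry produces exactly one log line, whether the failure was a thrown exception or a clean `false` from the verifier.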

P2 — `proposeParams[1]` live reference race
  `proposeParams[1]` held `block.validation_data` directly. That
  is the same object as
  `getSharedState.candidateBlock.validation_data`. Inbound
  `manageProposeBlockHash` handlers run concurrently with our
  outbound broadcast (every shard member calls every other
  simultaneously) and mutate `signatures` on that same object.

  Effect: each peer in our `Promise.all` fan-out received a
  slightly different payload depending on which inbound calls
  landed first. Receivers see inconsistent snapshots of our
  validation_data — verifies are still correct (signatures are
  self-contained crypto), but log-based debugging is harder
  because "what we sent to peer A" no longer matches "what we
  sent to peer B".

  Fix: `structuredClone(block.validation_data)` once before
  fan-out. Every peer sees the same frozen payload. Receivers
  still verify and merge what they get; the freeze only affects
  what we ship.
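
A small sketch of why the snapshot matters, using a hypothetical `ValidationData` shape. `structuredClone` is the standard global deep-copy function; note that it yields an independent copy rather than an object frozen in the `Object.freeze` sense.

```typescript
// Hypothetical, simplified shape of validation_data.
type ValidationData = { signatures: Record<string, string> }

// Stands in for block.validation_data, which is shared with inbound handlers.
const live: ValidationData = { signatures: { us: "sig-us" } }

// Taken once, before the fan-out to peers.
const snapshot: ValidationData = structuredClone(live)

// Simulate an inbound handler writing into the shared object mid-broadcast.
live.signatures["A"] = "sig-A"

console.log(Object.keys(live.signatures).length)     // 2
console.log(Object.keys(snapshot.signatures).length) // 1
```

Every outbound call that is handed `snapshot` sees the same single-entry map in this sketch, regardless of what lands in `live` while the calls are in flight.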

Verification
  - tsc --noEmit clean.
  - Pro vote with empty extra.signatures: now logs error +
    returns con with reason. Cannot inflate BFT tally.
  - Pro vote where peer's own pubkey is not in verified set:
    now downgrades to con with explicit "missing own signature"
    reason.
  - Double-log scenario: catch path emits "verification threw",
    outer loop sees `loggedFailure: true`, skips its own log.
    Single log per event.
  - Concurrent inbound during fan-out: `validationDataSnapshot`
    is the immutable frozen copy; mutations to live
    `block.validation_data.signatures` from inbound handlers do
    not reach the outbound serialisation. Each peer call gets
    bit-identical `proposeParams[1]`.
@tcsenpai
Contributor Author

@greptile review

1 similar comment
@tcsenpai
Contributor Author

@greptile review

@Shitikyan Shitikyan mentioned this pull request Jun 1, 2026
Shitikyan added a commit that referenced this pull request Jun 1, 2026
…ork testing pass

Single document covering everything surfaced during the testing pass that
ran 2026-05-26 → 2026-05-31 against dev.node2. Organised by topic, not
chronology, so the team has one place to look up what was found, what
was fixed, and what is still pending — without scrolling Telegram or
hopping between PRs.

Sections:
- L2PS encryption nonce reuse (SDK #87 + 3 follow-up security findings)
- Governance hash mismatch on dev.node2 (still broken; real fix in DEM-727)
- Test-coverage gap that allowed the governance break to slip through
- Side finding — nonces were not checked or incremented per tx (Hovhannes's batch C #884#887, batch D #888 in flight)
- UX surfacing for L2PS (Demo #11)
- Operational risks still open
- Linear ticket map across DEM-722 epic
- Source material cross-links

Cross-links the raw battery output and serializer analysis already in
flight via Node #876. Closes DEM-729.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@tcsenpai
Contributor Author

tcsenpai commented Jun 1, 2026

@greptile review

@tcsenpai tcsenpai merged commit f62a609 into stabilisation Jun 1, 2026
4 checks passed