
fix: use bounded incoming message buffers for all protocols#2268

Merged
gilcu3 merged 4 commits into main from 2247-remote-dos----node-oom-crash
Mar 4, 2026
Conversation

@gilcu3
Contributor

@gilcu3 gilcu3 commented Feb 27, 2026

Closes #2247

Added per-protocol incoming message buffer capacity constants throughout threshold-signatures. Comms::with_buffer_capacity(max) rejects messages for new headers once the cap
is reached; messages for existing entries still flow. Honest participants always use the same buffer capacity for each protocol.
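The push-side behavior described above can be sketched as follows. This is a minimal illustration only: `MessageBuffer`, `MessageError`, and the method names here are illustrative stand-ins, not the crate's actual API.

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative sketch only: the names below are stand-ins, not the
// crate's actual API.
#[derive(Debug, PartialEq)]
enum MessageError {
    BufferFull,
}

struct MessageBuffer {
    max_entries: usize,
    messages: HashMap<u64, VecDeque<Vec<u8>>>, // header -> queued payloads
}

impl MessageBuffer {
    fn with_capacity(max_entries: usize) -> Self {
        Self { max_entries, messages: HashMap::new() }
    }

    // Reject a message that would create a new header entry once the cap
    // is reached; messages for headers that already have an entry flow.
    fn push(&mut self, header: u64, payload: Vec<u8>) -> Result<(), MessageError> {
        if !self.messages.contains_key(&header) && self.messages.len() >= self.max_entries {
            return Err(MessageError::BufferFull);
        }
        self.messages.entry(header).or_default().push_back(payload);
        Ok(())
    }
}
```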

Each protocol now declares its own maximum:

  • Simple protocols (sign, presign, CKD, DKG): small constants (0–7), derived by counting waitpoints
  • Triple generation: $131 \cdot N \cdot (P-1) + 7$, derived from the sub-protocol structure (and empirical tests), where N is the number of triples (used in batch triple generation) and P is the number of participants. The formula is exact for N = 1 but only an upper bound for N > 1: an optimization in one of the sub-protocols uses hashing to decide some execution branches, which makes computing the exact total with a closed formula impossible.
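The triple-generation bound above, 131 · N · (P - 1) + 7, can be computed with checked arithmetic (a later review comment notes the real code uses safe arithmetic in the formula computation). The function name here is an illustrative sketch, not the crate's API.

```rust
// Hedged sketch: computes 131 * N * (P - 1) + 7 with checked arithmetic
// so an absurd participant count cannot overflow. Illustrative name only.
fn triple_gen_max_entries(n_triples: usize, participants: usize) -> Option<usize> {
    let others = participants.checked_sub(1)?;
    131usize
        .checked_mul(n_triples)?
        .checked_mul(others)?
        .checked_add(7)
}
```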

The majority of the code consists of added tests that guarantee the new buffer capacity bounds do not break the existing implementations.

This solves the issue because the buffer is always bounded. A malicious participant can still stall the protocol but can no longer cause an OOM, which is a strict improvement. A better solution might be possible, but it would certainly require breaking changes, so this one seems the best we can get for now.

As a byproduct, the computed numbers make it extremely clear how many rounds each protocol has, which is something we did not know previously (for the triples, for example).

@gilcu3 gilcu3 linked an issue Feb 27, 2026 that may be closed by this pull request
@gilcu3 gilcu3 force-pushed the 2247-remote-dos----node-oom-crash branch 3 times, most recently from b5605db to 362dfd8 on February 27, 2026 19:35
@gilcu3 gilcu3 marked this pull request as ready for review February 27, 2026 19:39
@claude

claude bot commented Feb 27, 2026

Code Review

Solid fix for the OOM DoS vector (issue #2247). The approach of bounding per-protocol incoming message buffers with protocol-specific constants is well-designed. The formula derivation for triple generation is clearly documented, and the test coverage across all protocol variants is thorough.

One observation worth noting (not blocking):

pop creates entries beyond max_entries: MessageBuffer::pop (internal.rs:252-262) unconditionally calls messages_lock.entry(header).or_default(), which can grow the HashMap beyond the stated max_entries. This is by design (the honest protocol needs to receive on its own waitpoints), and the bounding in push prevents attacker-driven unbounded growth. The tests confirm that honest execution matches the constant. Just noting this for future maintainers: max_entries bounds attacker-created entries, not total entries. A clarifying comment on the max_entries field or the pop method would help.
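A minimal sketch of the observation above (illustrative names, not the actual internal.rs code): `pop` unconditionally creates an entry so the honest protocol can register interest in its own waitpoints, which is exactly why `max_entries` bounds attacker-created entries rather than total entries.

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative sketch, not the crate's actual code.
struct MessageBuffer {
    max_entries: usize, // enforced by push, deliberately not by pop
    messages: HashMap<u64, VecDeque<Vec<u8>>>,
}

impl MessageBuffer {
    fn pop(&mut self, header: u64) -> Option<Vec<u8>> {
        // May grow the map past max_entries, by design: the honest
        // protocol must be able to wait on its own headers.
        self.messages.entry(header).or_default().pop_front()
    }
}
```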

No critical issues found.

✅ Approved

@gilcu3 gilcu3 force-pushed the 2247-remote-dos----node-oom-crash branch from e8ce1ed to 0b9e5d5 on March 3, 2026 13:34
@gilcu3
Contributor Author

gilcu3 commented Mar 3, 2026

@claude review

@claude

claude bot commented Mar 3, 2026

PR title type suggestion: This PR introduces bounded message buffers as a new defensive feature/systematic improvement across all protocols, rather than fixing an existing bug. The type should probably be feat: or refactor: instead of fix:.

Suggested title: feat: use bounded incoming message buffers for all protocols or refactor: add bounded incoming message buffers for all protocols

@claude

claude bot commented Mar 3, 2026

Code Review

This is a well-designed fix for the OOM DoS vector. The bounded buffer approach with protocol-specific constants is sound, and the test coverage is thorough. A few observations:

BufferFull error propagation terminates the protocol: in crates/node/src/protocol.rs:131, the ? on protocol.message() propagates MessageError::BufferFull as an anyhow::Error, aborting the entire protocol run. The PR description acknowledges this trade-off (a malicious participant can still stall the protocol). This is strictly better than an OOM, but it is worth confirming that this is the desired production behavior rather than silently dropping the message and logging a warning. If a node receives a burst of legitimate messages that happen to arrive before the protocol has started polling (e.g., during a slow initialization), the buffer could fill and terminate the run prematurely. Consider whether a warn! + continue (drop the message) would be more resilient than hard-failing the protocol, since the protocol would simply stall on missing messages rather than crash. Not blocking if the team has considered this.
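The warn-and-drop alternative suggested above could look roughly like this. It is a hedged sketch: the error type and handling site are simplified stand-ins for the real node code, and eprintln! stands in for a tracing warn! macro.

```rust
// Hedged sketch of warn-and-drop handling; names are stand-ins for the
// real node code.
#[derive(Debug)]
enum MessageError {
    BufferFull,
    Other(String),
}

fn handle_incoming(result: Result<(), MessageError>) -> Result<(), MessageError> {
    match result {
        // Drop the message and keep the protocol alive: it will stall on
        // missing messages instead of aborting the whole run.
        Err(MessageError::BufferFull) => {
            eprintln!("incoming buffer full; dropping message");
            Ok(())
        }
        other => other,
    }
}
```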

No other critical issues found. Safe arithmetic is used in the formula computation, the From<MessageError> for ProtocolError wiring is correct for test utils, and all Comms::new() callers are migrated to with_buffer_capacity().

✅ Approved

kevindeforth previously approved these changes Mar 3, 2026
Contributor

@kevindeforth kevindeforth left a comment


Thank you!
Only nits.

pub(crate) const REDJUBJUB_SIGN_MAX_INCOMING_COORDINATOR_ENTRIES: usize = 1;
/// Maximum incoming buffer entries for non-coordinator participants in the `RedJubjub` sign protocol.
#[cfg(test)]
pub(crate) const REDJUBJUB_SIGN_MAX_INCOMING_PARTICIPANT_ENTRIES: usize = 1;
Contributor

do we need this somewhere else, or could we define it inside the test module?

Contributor Author

Left it there just to keep some uniformity. At the moment it could be just in the test module, as you mention.

Contributor

@SimonRastikian SimonRastikian left a comment

I reviewed most of the PR.
As explained in the meeting, we all know this is not the cleanest solution. I hope you could open an Issue describing a cleaner way to solve it. (This is independent of my approval)

The changes I hope to see are more helper functions that clean up the code. Otherwise, thanks for tackling this.


for &p in &participants {
let comms = Comms::with_buffer_capacity(usize::MAX);
let comms_ref = comms.clone();
Contributor

Nit: Please add a comment as discussed in our meeting about cloning Arc.

Contributor Author

I believe this is not something you normally find in code, but I can add it in one place if that would help the future reader. Maybe I can refactor a bit to make it clearer.

@gilcu3
Contributor Author

gilcu3 commented Mar 3, 2026

I reviewed most of the PR. As explained in the meeting, we all know this is not the cleanest solution. I hope you could open an Issue describing a cleaner way to solve it. (This is independent of my approval)

Opened #2285

@gilcu3 gilcu3 force-pushed the 2247-remote-dos----node-oom-crash branch from 0b9e5d5 to 3cf34a6 on March 3, 2026 16:47
Contributor

@SimonRastikian SimonRastikian left a comment


Thank you! :)

@gilcu3 gilcu3 added this pull request to the merge queue Mar 4, 2026
Merged via the queue into main with commit c4dcc06 Mar 4, 2026
10 checks passed
@gilcu3 gilcu3 deleted the 2247-remote-dos----node-oom-crash branch March 4, 2026 06:43
Successfully merging this pull request may close these issues: Remote DoS -- Node OOM crash