
feat: rate limit for network messages #11646

Merged: 17 commits, Jun 27, 2024
Conversation

@Trisfald (Contributor) commented Jun 21, 2024

Implementation of rate limits for incoming network messages.
Original issue: #11617. Also supersedes #11618.

Note: rate limits are implemented but not defined with this PR; in practice, nothing should change for a node.

PR summary

This PR adds:

  • A module to arbitrate rate limits using a token bucket algorithm (see token_bucket.rs)
  • A convenience type to handle all rate limits of a network connection (see messages_limits.rs)
  • Changes to peer_actor.rs to implement the rate limits on received messages
  • Changes to the network configuration
  • A new metric to count messages dropped due to rate limits
  • Unit tests

Leftovers

  • Make rate limits configurable, likely through config file overrides (done)
  • Use more accurate token allocation for some network messages, in particular those containing a dynamic number of elements. For reference: analysis to be done in another PR

/// Returns `None` if the message is not meant to be rate limited in any scenario.
fn get_key_and_token_cost(message: &PeerMessage) -> Option<(RateLimitedPeerMessageKey, u32)> {
    use RateLimitedPeerMessageKey::*;
    match message {
        // ... (diff collapsed: one arm per `PeerMessage` variant)
@Trisfald (PR author) commented:
More boilerplate than I would like, but it makes sure we don't forget to consider rate limits when adding new messages.
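The exhaustive-match pattern described above can be sketched as follows. This is a minimal, self-contained illustration: `PeerMessage`, `RateLimitedPeerMessageKey`, the variant names, and the token costs here are stand-ins, not the real nearcore types or values.

```rust
// Stand-in types to illustrate the dispatch pattern; not the nearcore definitions.
#[derive(Debug)]
enum PeerMessage {
    Block,
    Transaction,
    Ping,
}

#[derive(Debug, PartialEq)]
enum RateLimitedPeerMessageKey {
    Block,
    Transaction,
}

/// Returns `None` if the message is not meant to be rate limited in any scenario.
fn get_key_and_token_cost(message: &PeerMessage) -> Option<(RateLimitedPeerMessageKey, u32)> {
    use RateLimitedPeerMessageKey::*;
    // Matching exhaustively (no `_` arm) is the point: adding a new message
    // variant fails to compile until an explicit rate-limiting decision is made.
    match message {
        PeerMessage::Block => Some((Block, 1)),
        PeerMessage::Transaction => Some((Transaction, 1)),
        PeerMessage::Ping => None,
    }
}

fn main() {
    assert_eq!(get_key_and_token_cost(&PeerMessage::Ping), None);
    assert_eq!(
        get_key_and_token_cost(&PeerMessage::Block),
        Some((RateLimitedPeerMessageKey::Block, 1))
    );
}
```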


codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 97.26651% with 12 lines in your changes missing coverage. Please review.

Project coverage is 71.73%. Comparing base (af022d8) to head (f2a9246).

Files Patch % Lines
chain/network/src/rate_limits/messages_limits.rs 96.62% 7 Missing and 1 partial ⚠️
chain/network/src/config.rs 85.71% 1 Missing and 2 partials ⚠️
chain/network/src/rate_limits/token_bucket.rs 99.39% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11646      +/-   ##
==========================================
+ Coverage   71.66%   71.73%   +0.07%     
==========================================
  Files         788      790       +2     
  Lines      161300   161740     +440     
  Branches   161300   161740     +440     
==========================================
+ Hits       115589   116020     +431     
- Misses      40674    40681       +7     
- Partials     5037     5039       +2     
Flag Coverage Δ
backward-compatibility 0.23% <0.00%> (-0.01%) ⬇️
db-migration 0.23% <0.00%> (-0.01%) ⬇️
genesis-check 1.35% <0.00%> (-0.01%) ⬇️
integration-tests 37.79% <11.61%> (-0.06%) ⬇️
linux 69.11% <96.35%> (+0.06%) ⬆️
linux-nightly 71.22% <97.26%> (+0.06%) ⬆️
macos 52.60% <20.49%> (-0.06%) ⬇️
pytests 1.59% <0.00%> (-0.01%) ⬇️
sanity-checks 1.38% <0.00%> (-0.01%) ⬇️
unittests 66.32% <95.89%> (+0.08%) ⬆️
upgradability 0.28% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

@Trisfald (PR author) commented:

Added configuration of rate limits through overrides, in a fashion similar to the existing network settings overrides.

@saketh-are (Collaborator) left a comment:

Thanks Andrea, this looks great overall. I left one comment about the internals of TokenBucket for your thoughts.

chain/network/src/rate_limits/token_bucket.rs (review thread resolved)
@Trisfald (PR author) commented:

I think this draft is now ready for review. IMO a good plan would be to merge this PR, which has the rate limits implementation, and then open another PR with the actual rate limit settings.

@Trisfald Trisfald marked this pull request as ready for review June 26, 2024 16:52
@Trisfald Trisfald requested a review from a team as a code owner June 26, 2024 16:52
@saketh-are (Collaborator) left a comment:

Looks great. The plan of adding the default limits in a separate PR sounds reasonable.

{
    let labels = [peer_msg.msg_variant()];
    metrics::PEER_MESSAGE_RECEIVED_BY_TYPE_TOTAL.with_label_values(&labels).inc();
    metrics::PEER_MESSAGE_RECEIVED_BY_TYPE_BYTES
        .with_label_values(&labels)
        .inc_by(msg.len() as u64);
    if !self.received_messages_rate_limits.is_allowed(&peer_msg, now) {
        metrics::PEER_MESSAGE_RATE_LIMITED_BY_TYPE_TOTAL.with_label_values(&labels).inc();
        tracing::debug!(target: "network", "Peer {} is being rate limited for message {}", self.peer_info, peer_msg.msg_variant());
A collaborator commented:

Food for thought: what purpose do you envision this log message serving?

I guess one thing it helps us to understand is which peer is sending the spam, something which the aggregate metrics (which are only labelled by variant) do not capture. Another way to check which peer is spamming us could be to look at the receive rate (in bytes) per connection, something which is reported via debug api and visible on debug pages. However, that would not be fine-grained by message variant.

Just trying to think through how this might be used and whether there might be a more efficient way to represent the same information.

@Trisfald (PR author) commented:

Yes, the purpose would be to understand which peer is spamming. As you said, there are other ways to find the source of the attack.

I think by default we run at INFO level, is that right? I don't think it hurts to have this at debug, but I don't think it brings a lot of value either.

let tokens_to_add = duration.as_secs_f64() * self.refill_rate as f64;
let tokens_to_add = (tokens_to_add * TOKEN_PARTS_NUMBER as f64) as u64;
// Update `last_refill` and `size` only if there's a change. This is done to prevent
// losing token parts to clamping if the duration is too small.
A collaborator commented:

Nice
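The refill logic in the excerpt above subdivides each token into fixed "parts" so that refills over very short durations are not rounded down to zero and lost. The following is a minimal, self-contained sketch of that idea; the constant value, field names, and method signatures are illustrative assumptions, not the PR's actual token_bucket.rs.

```rust
use std::time::{Duration, Instant};

// Each token is subdivided into this many parts so that tiny refill
// durations still accumulate credit. Illustrative value, not the PR's.
const TOKEN_PARTS_NUMBER: u64 = 1_000_000;

struct TokenBucket {
    /// Current content, measured in token *parts*.
    size: u64,
    /// Maximum content, in token parts.
    max_size: u64,
    /// Refill rate, in whole tokens per second.
    refill_rate: u32,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(initial_tokens: u32, max_tokens: u32, refill_rate: u32, start: Instant) -> Self {
        Self {
            size: initial_tokens as u64 * TOKEN_PARTS_NUMBER,
            max_size: max_tokens as u64 * TOKEN_PARTS_NUMBER,
            refill_rate,
            last_refill: start,
        }
    }

    fn refill(&mut self, now: Instant) {
        let duration = now.saturating_duration_since(self.last_refill);
        let tokens_to_add = duration.as_secs_f64() * self.refill_rate as f64;
        let tokens_to_add = (tokens_to_add * TOKEN_PARTS_NUMBER as f64) as u64;
        // Update `last_refill` and `size` only if there's a change, so token
        // parts are not lost to clamping when `duration` is very small.
        if tokens_to_add > 0 {
            self.size = (self.size + tokens_to_add).min(self.max_size);
            self.last_refill = now;
        }
    }

    /// Tries to remove `cost` whole tokens; returns whether the acquisition
    /// succeeded, i.e. whether the message should be allowed.
    fn acquire(&mut self, cost: u32, now: Instant) -> bool {
        self.refill(now);
        let cost_parts = cost as u64 * TOKEN_PARTS_NUMBER;
        if self.size >= cost_parts {
            self.size -= cost_parts;
            true
        } else {
            false
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut bucket = TokenBucket::new(2, 2, 1, start);
    assert!(bucket.acquire(1, start));
    assert!(bucket.acquire(1, start));
    // Bucket is empty; without a refill the next acquisition fails.
    assert!(!bucket.acquire(1, start));
    // After one second at refill_rate = 1, one token has been restored.
    assert!(bucket.acquire(1, start + Duration::from_secs(1)));
}
```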

}

#[cfg(test)]
mod tests {
A collaborator commented:

Props on thorough coverage

impl RateLimits {
/// Creates all buckets as configured in `config`.
/// See also [TokenBucket::new].
pub fn from_config(config: &Config, start_time: Instant) -> Self {
A collaborator commented:

Maybe it would make sense to print some kind of warning if the override is lower than the default.

@Trisfald (PR author) commented:

Changing this configuration is kinda tricky; I don't see many operators doing that, as they need to know what they are doing. Is emitting a warning for config changes a pattern we use in other places?

A collaborator commented:

Oh hmm why do you say the configuration is tricky to change? If I want to bump up my receive rate for a certain message type, do I need to do more than adding a line to the config.json in the right place?

To answer your question, I didn't find an example of warnings specifically, but there are some examples here and here of checks on config values. Of course, a bail! is impossible to miss (the node won't start), while a warning may not even be seen, so if it's not super easy to add I wouldn't waste time on it.

@Trisfald (PR author) commented:

Yes, it's easy once you know what to do: a simple addition to config.json.
It's just that, putting myself in the shoes of an average operator, it might be non-trivial to know what values to use. A knowledge problem rather than a technical one.

Personally, I think we can go without the warning and keep the bail! if validation fails, for now at least.
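For illustration, an override of this kind might look something like the following config.json fragment. The field name `received_messages_rate_limits` appears in the diff above, but the nesting, the per-message key, and the parameter names (`initial_size`, `maximum_size`, `refill_rate`) are assumptions sketched from the discussion, not the PR's final schema.

```json
{
  "network": {
    "received_messages_rate_limits": {
      "transaction": {
        "initial_size": 500,
        "maximum_size": 500,
        "refill_rate": 100.0
      }
    }
  }
}
```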

CHANGELOG.md (review thread resolved)
@@ -311,6 +315,10 @@ impl PeerActor {
// That likely requires bigger changes and account_id here is later used for debug / logging purposes only.
account_id: network_state.config.validator.account_id(),
};
let received_messages_rate_limits = messages_limits::RateLimits::from_config(
&network_state.config.received_messages_rate_limits,
clock.now(),
A contributor commented:

Would it help to pass a Clock instance (clock.clone()) to RateLimits and call it to get the current time from the different methods involved in rate limiting?

@Trisfald (PR author) commented:

I would prefer keeping the parameter an Instant, which carries less responsibility than a clock, because calling now() is all we would do with the clock.
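The design choice under discussion can be sketched as follows: the caller owns the clock and passes only the sampled Instant, so the rate limiter never needs to know how time is obtained. The type and method here are illustrative stand-ins, not the PR's actual API.

```rust
use std::time::Instant;

// Illustrative stand-in: allows at most one message per second, just to
// give `is_allowed` something concrete to decide.
struct RateLimits {
    last_allowed: Option<Instant>,
}

impl RateLimits {
    fn new() -> Self {
        Self { last_allowed: None }
    }

    /// Taking a plain `Instant` keeps this type decoupled from any particular
    /// clock; the caller samples the time (e.g. `clock.now()`) and passes it in,
    /// which also makes the logic trivial to test with synthetic timestamps.
    fn is_allowed(&mut self, now: Instant) -> bool {
        match self.last_allowed {
            Some(t) if now.duration_since(t).as_secs_f64() < 1.0 => false,
            _ => {
                self.last_allowed = Some(now);
                true
            }
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut limits = RateLimits::new();
    // The caller, not `RateLimits`, decides where `now` comes from.
    assert!(limits.is_allowed(start));
    assert!(!limits.is_allowed(start));
}
```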

@Trisfald Trisfald added this pull request to the merge queue Jun 27, 2024
Merged via the queue into near:master with commit ee85b26 Jun 27, 2024
30 checks passed
@Trisfald Trisfald deleted the add-rate-limits branch June 27, 2024 19:26