
feat: rate limit for network messages #11646

Merged: 17 commits, Jun 27, 2024
Conversation

@Trisfald (Contributor) commented Jun 21, 2024

Implementation of rate limits for incoming network messages.
Original issue: #11617. Also supersedes #11618.

Note: rate limits are implemented but not defined with this PR; in practice, nothing should change for a node.

PR summary

This PR adds:

  • A module to arbitrate rate limits using a token bucket algorithm (see token_bucket.rs)
  • A convenience type to handle all rate limits of a network connection (see messages_limits.rs)
  • Changes to peer_actor.rs to implement the rate limits on received messages
  • Changes to the network configuration
  • A new metric to count messages dropped due to rate limits
  • Unit tests

Leftovers

  • Make rate limits configurable, likely through config file overrides (done)
  • Use more accurate token allocation for some network messages, in particular those containing a dynamic number of elements. For reference: analysis to be done in another PR

/// Returns `None` if the message is not meant to be rate limited in any scenario.
fn get_key_and_token_cost(message: &PeerMessage) -> Option<(RateLimitedPeerMessageKey, u32)> {
    use RateLimitedPeerMessageKey::*;
    match message {
        // ... (diff collapsed: one arm per `PeerMessage` variant)
@Trisfald (PR author) commented:
More boilerplate than I would like, but it makes sure we don't forget to consider rate limits when adding new messages.
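The exhaustive-match pattern described above can be sketched as follows. This is a minimal, self-contained illustration: `PeerMessage`, `RateLimitedPeerMessageKey`, the variant names, and the token costs here are stand-ins, not the real nearcore types or values.

```rust
// Stand-in types to illustrate the dispatch pattern; not the nearcore definitions.
#[derive(Debug)]
enum PeerMessage {
    Block,
    Transaction,
    Ping,
}

#[derive(Debug, PartialEq)]
enum RateLimitedPeerMessageKey {
    Block,
    Transaction,
}

/// Returns `None` if the message is not meant to be rate limited in any scenario.
fn get_key_and_token_cost(message: &PeerMessage) -> Option<(RateLimitedPeerMessageKey, u32)> {
    use RateLimitedPeerMessageKey::*;
    // Matching exhaustively (no `_` arm) is the point: adding a new message
    // variant fails to compile until an explicit rate-limiting decision is made.
    match message {
        PeerMessage::Block => Some((Block, 1)),
        PeerMessage::Transaction => Some((Transaction, 1)),
        PeerMessage::Ping => None,
    }
}

fn main() {
    assert_eq!(get_key_and_token_cost(&PeerMessage::Ping), None);
    assert_eq!(
        get_key_and_token_cost(&PeerMessage::Block),
        Some((RateLimitedPeerMessageKey::Block, 1))
    );
}
```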


codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 97.26651% with 12 lines in your changes missing coverage. Please review.

Project coverage is 71.73%. Comparing base (af022d8) to head (f2a9246).

Files Patch % Lines
chain/network/src/rate_limits/messages_limits.rs 96.62% 7 Missing and 1 partial ⚠️
chain/network/src/config.rs 85.71% 1 Missing and 2 partials ⚠️
chain/network/src/rate_limits/token_bucket.rs 99.39% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11646      +/-   ##
==========================================
+ Coverage   71.66%   71.73%   +0.07%     
==========================================
  Files         788      790       +2     
  Lines      161300   161740     +440     
  Branches   161300   161740     +440     
==========================================
+ Hits       115589   116020     +431     
- Misses      40674    40681       +7     
- Partials     5037     5039       +2     
Flag Coverage Δ
backward-compatibility 0.23% <0.00%> (-0.01%) ⬇️
db-migration 0.23% <0.00%> (-0.01%) ⬇️
genesis-check 1.35% <0.00%> (-0.01%) ⬇️
integration-tests 37.79% <11.61%> (-0.06%) ⬇️
linux 69.11% <96.35%> (+0.06%) ⬆️
linux-nightly 71.22% <97.26%> (+0.06%) ⬆️
macos 52.60% <20.49%> (-0.06%) ⬇️
pytests 1.59% <0.00%> (-0.01%) ⬇️
sanity-checks 1.38% <0.00%> (-0.01%) ⬇️
unittests 66.32% <95.89%> (+0.08%) ⬆️
upgradability 0.28% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

@Trisfald (PR author) commented:

Added configuration of rate limits through overrides, in a fashion similar to the existing network settings overrides.

@saketh-are (Collaborator) left a comment:

Thanks Andrea, this looks great overall. I left one comment about the internals of TokenBucket for your thoughts.

chain/network/src/rate_limits/token_bucket.rs (review thread resolved)
@Trisfald (PR author) commented:

I think this draft is now ready for review. IMO a good plan would be to merge this PR, which has the rate limits implementation, and then open another PR with the actual rate limit settings.

@Trisfald Trisfald marked this pull request as ready for review June 26, 2024 16:52
@Trisfald Trisfald requested a review from a team as a code owner June 26, 2024 16:52
@saketh-are (Collaborator) left a comment:

Looks great. The plan of adding the default limits in a separate PR sounds reasonable.

{
    let labels = [peer_msg.msg_variant()];
    metrics::PEER_MESSAGE_RECEIVED_BY_TYPE_TOTAL.with_label_values(&labels).inc();
    metrics::PEER_MESSAGE_RECEIVED_BY_TYPE_BYTES
        .with_label_values(&labels)
        .inc_by(msg.len() as u64);
    if !self.received_messages_rate_limits.is_allowed(&peer_msg, now) {
        metrics::PEER_MESSAGE_RATE_LIMITED_BY_TYPE_TOTAL.with_label_values(&labels).inc();
        tracing::debug!(target: "network", "Peer {} is being rate limited for message {}", self.peer_info, peer_msg.msg_variant());
A collaborator commented:

Food for thought: what purpose do you envision this log message serving?

I guess one thing it helps us to understand is which peer is sending the spam, something which the aggregate metrics (which are only labelled by variant) do not capture. Another way to check which peer is spamming us could be to look at the receive rate (in bytes) per connection, something which is reported via debug api and visible on debug pages. However, that would not be fine-grained by message variant.

Just trying to think through how this might be used and whether there might be a more efficient way to represent the same information.

@Trisfald (PR author) commented:

Yes, the purpose would be to understand which peer is spamming. As you said, there are other ways to find the source of the attack.

I think by default we run at INFO level, is that right? I don't think it hurts to have this at debug, but I don't think it brings a lot of value either.

let tokens_to_add = duration.as_secs_f64() * self.refill_rate as f64;
let tokens_to_add = (tokens_to_add * TOKEN_PARTS_NUMBER as f64) as u64;
// Update `last_refill` and `size` only if there's a change. This is done to prevent
// losing token parts to clamping if the duration is too small.
A collaborator commented:

Nice
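The refill logic in the excerpt above subdivides each token into fixed "parts" so that refills over very short durations are not rounded down to zero and lost. The following is a minimal, self-contained sketch of that idea; the constant value, field names, and method signatures are illustrative assumptions, not the PR's actual token_bucket.rs.

```rust
use std::time::{Duration, Instant};

// Each token is subdivided into this many parts so that tiny refill
// durations still accumulate credit. Illustrative value, not the PR's.
const TOKEN_PARTS_NUMBER: u64 = 1_000_000;

struct TokenBucket {
    /// Current content, measured in token *parts*.
    size: u64,
    /// Maximum content, in token parts.
    max_size: u64,
    /// Refill rate, in whole tokens per second.
    refill_rate: u32,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(initial_tokens: u32, max_tokens: u32, refill_rate: u32, start: Instant) -> Self {
        Self {
            size: initial_tokens as u64 * TOKEN_PARTS_NUMBER,
            max_size: max_tokens as u64 * TOKEN_PARTS_NUMBER,
            refill_rate,
            last_refill: start,
        }
    }

    fn refill(&mut self, now: Instant) {
        let duration = now.saturating_duration_since(self.last_refill);
        let tokens_to_add = duration.as_secs_f64() * self.refill_rate as f64;
        let tokens_to_add = (tokens_to_add * TOKEN_PARTS_NUMBER as f64) as u64;
        // Update `last_refill` and `size` only if there's a change, so token
        // parts are not lost to clamping when `duration` is very small.
        if tokens_to_add > 0 {
            self.size = (self.size + tokens_to_add).min(self.max_size);
            self.last_refill = now;
        }
    }

    /// Tries to remove `cost` whole tokens; returns whether the acquisition
    /// succeeded, i.e. whether the message should be allowed.
    fn acquire(&mut self, cost: u32, now: Instant) -> bool {
        self.refill(now);
        let cost_parts = cost as u64 * TOKEN_PARTS_NUMBER;
        if self.size >= cost_parts {
            self.size -= cost_parts;
            true
        } else {
            false
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut bucket = TokenBucket::new(2, 2, 1, start);
    assert!(bucket.acquire(1, start));
    assert!(bucket.acquire(1, start));
    // Bucket is empty; without a refill the next acquisition fails.
    assert!(!bucket.acquire(1, start));
    // After one second at refill_rate = 1, one token has been restored.
    assert!(bucket.acquire(1, start + Duration::from_secs(1)));
}
```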

}

#[cfg(test)]
mod tests {
A collaborator commented:

Props on thorough coverage

impl RateLimits {
/// Creates all buckets as configured in `config`.
/// See also [TokenBucket::new].
pub fn from_config(config: &Config, start_time: Instant) -> Self {
A collaborator commented:

Maybe it would make sense to print some kind of warning if the override is lower than the default.

@Trisfald (PR author) commented:

Changing this configuration is kinda tricky; I don't see many operators doing that, as they need to know what they are doing. Is emitting a warning for config changes a pattern we use in other places?

A collaborator commented:

Oh hmm why do you say the configuration is tricky to change? If I want to bump up my receive rate for a certain message type, do I need to do more than adding a line to the config.json in the right place?

To answer your question, I didn't find an example of warnings specifically, but there are some examples here and here of checks on config values. Of course, a bail! is impossible to miss (the node won't start), while a warning may not even be seen, so if it's not super easy to add I wouldn't waste time on it.

@Trisfald (PR author) commented:

Yes, it's easy once you know what to do: a simple addition to config.json.
It's just that, putting myself in the shoes of an average operator, it might be non-trivial to know what values to use. A knowledge problem rather than a technical one.

Personally, I think we can go without the warning and keep the bail! if validation fails, for now at least.
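For illustration, an override of this kind might look something like the following config.json fragment. The field name `received_messages_rate_limits` appears in the diff above, but the nesting, the per-message key, and the parameter names (`initial_size`, `maximum_size`, `refill_rate`) are assumptions sketched from the discussion, not the PR's final schema.

```json
{
  "network": {
    "received_messages_rate_limits": {
      "transaction": {
        "initial_size": 500,
        "maximum_size": 500,
        "refill_rate": 100.0
      }
    }
  }
}
```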

CHANGELOG.md (review thread resolved)
@@ -311,6 +315,10 @@ impl PeerActor {
// That likely requires bigger changes and account_id here is later used for debug / logging purposes only.
account_id: network_state.config.validator.account_id(),
};
let received_messages_rate_limits = messages_limits::RateLimits::from_config(
&network_state.config.received_messages_rate_limits,
clock.now(),
A contributor commented:

Would it help to pass a Clock instance (clock.clone()) to RateLimits and call it to get the current time from the different methods involved in rate limiting?

@Trisfald (PR author) commented:

I would prefer keeping the parameter an Instant, which carries less responsibility than a clock, because calling now() is all we would do with the clock.
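The design choice under discussion can be sketched as follows: the caller owns the clock and passes only the sampled Instant, so the rate limiter never needs to know how time is obtained. The type and method here are illustrative stand-ins, not the PR's actual API.

```rust
use std::time::Instant;

// Illustrative stand-in: allows at most one message per second, just to
// give `is_allowed` something concrete to decide.
struct RateLimits {
    last_allowed: Option<Instant>,
}

impl RateLimits {
    fn new() -> Self {
        Self { last_allowed: None }
    }

    /// Taking a plain `Instant` keeps this type decoupled from any particular
    /// clock; the caller samples the time (e.g. `clock.now()`) and passes it in,
    /// which also makes the logic trivial to test with synthetic timestamps.
    fn is_allowed(&mut self, now: Instant) -> bool {
        match self.last_allowed {
            Some(t) if now.duration_since(t).as_secs_f64() < 1.0 => false,
            _ => {
                self.last_allowed = Some(now);
                true
            }
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut limits = RateLimits::new();
    // The caller, not `RateLimits`, decides where `now` comes from.
    assert!(limits.is_allowed(start));
    assert!(!limits.is_allowed(start));
}
```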

@Trisfald Trisfald added this pull request to the merge queue Jun 27, 2024
Merged via the queue into near:master with commit ee85b26 Jun 27, 2024
30 checks passed
@Trisfald Trisfald deleted the add-rate-limits branch June 27, 2024 19:26