
[ADDED] TLS connection rate limiter #2573

Merged
merged 1 commit into nats-io:main on Jan 13, 2022

Conversation

@julius-welink (Contributor) commented Sep 27, 2021:

  • Link to issue, e.g. Resolves #NNN
  • Documentation added (if applicable)
  • Tests added
  • Branch rebased on top of current main (git pull --rebase origin main)
  • Changes squashed to a single commit (described here)
  • Build is green in Travis CI
  • You have certified that the contribution is your original work and that you license the work to the project under the Apache 2 license

Changes proposed in this pull request:

TLS handshake negotiation is very CPU intensive, especially if RSA is used. For this reason I see that nats has "Lame Duck Mode". Unfortunately, during a network outage it is still possible for many clients to try to connect at the same time. The rationale behind tls.connection_rate_limit is to prevent the CPU from being overwhelmed by a surge of clients connecting at once: connections over the limit are dropped before TLS is initialized, which effectively throttles the TLS connection rate.
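For illustration, a minimal config sketch of the option as named above; the exact placement inside the tls block and the meaning of the value are assumptions and may differ from what was ultimately merged:

```
tls {
  cert_file: "./server-cert.pem"   # illustrative paths
  key_file:  "./server-key.pem"
  # Maximum number of new TLS connections accepted per rate-limit window;
  # connections beyond this are dropped before the TLS handshake starts.
  connection_rate_limit: 100
}
```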

The good thing about this change is that it does not affect current nats functionality unless enabled. Also, I will soon be testing this in a real project with more than 500 clients.

Other solutions that I have considered:

  1. A SYN rate limit in iptables. Unfortunately it is not possible to configure this from within a Docker container, and I use Docker.
  2. A sidecar HAProxy. This could work, but I believe a rate limiter implemented directly in nats-server is a more straightforward solution; HAProxy would be an additional point of failure or misconfiguration.

There is no related issue as far as I know; I am making this change for myself. A PR should be a good way to get the tests running, since Travis CI does not cooperate with me :) I am willing to spend more time on this PR if there is a chance it could be accepted into the main nats-server repo. Please let me know what else I need to do to get this approved.

I'll wait for some feedback before squashing commits. Do I need to update docs? If yes, how?

/cc @nats-io/core

@julius-welink julius-welink marked this pull request as draft September 27, 2021 07:37
@julius-welink julius-welink marked this pull request as ready for review September 27, 2021 07:37
@julius-welink (Contributor Author) commented:

I'm seeing some test failures unrelated to my PR.

@kozlovic (Member) left a review:


I have made some comments. I think the idea is good, but I have mainly commented on the code; I would like others, say @philpennock or @derekcollison, to comment on the approach itself.

Review comments were left on: server/opts.go, server/rate_counter.go, server/server.go, test/configs/tls_rate.conf, test/tls_test.go
@derekcollison (Member) commented:

Most clients with official support also implement jittered reconnect logic to help smooth this out, and we use all available cores to process TLS since, IIRC, we do that post-accept and in a separate Go routine.

So with that said, this still could be useful under extreme conditions and we have seen some enterprise customers with large RSA certs get bit by this when the cluster itself was not sized properly.

@julius-welink (Contributor Author) commented Sep 28, 2021:

Thanks for the review! Will address the above comments soon. I have some more considerations:

  • Was not aware of jitter reconnect. Maybe I can ditch this effort and use it? (NO)

I doubt this will solve my problems. I'll have thousands of clients, and it is acceptable to me that clients have to queue for up to 5 or 10 minutes to reconnect. It's either that or I will have to reserve a lot of extra CPUs just to handle reconnect spikes. The problem with relying only on jitter reconnect is that clients do not know whether the server is busy; a small jitter does not solve the issue, while a large jitter makes it harder for clients to reconnect under normal server load. The proposed TLS rate limit setting would have good synergy with a back-off strategy, however.

  • Make sure that throttling only affects clients and not server -> server connections (YES)

Yes, it applies only to the client accept loop, go s.acceptConnections(l, "Client" ..., as opposed to the other connection types:

go s.acceptConnections(l, "Gateway"
go s.acceptConnections(l, "Leafnode"
...
  • I have some special clients that I wish were not throttled, but this would probably require some protocol changes, or perhaps I could have an exception list with IP addresses 🤔

It seems feasible to make exceptions based on IP addresses, but let's make that a separate task; I do not necessarily need it right now.
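For illustration only, a hypothetical Go sketch of how an IP-based exemption could sit in front of the rate limiter; none of this is in the PR, and the names (shouldThrottle, exemptIPs) are invented:

```go
package server

import "net"

// Hypothetical sketch, not part of this PR: skip the TLS rate limiter for an
// allow-list of source IPs, otherwise fall back to the limiter's allow() check.
func shouldThrottle(conn net.Conn, exemptIPs map[string]bool, allow func() bool) bool {
	host, _, err := net.SplitHostPort(conn.RemoteAddr().String())
	if err == nil && exemptIPs[host] {
		return false // exempted clients are never throttled
	}
	return !allow() // throttle when the rate limiter rejects the connection
}
```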

@kozlovic (Member) left a review:


Just some more little changes. Thanks!

Review comments were left on: server/opts.go, server/rate_counter.go, server/rate_counter_test.go, server/server.go, test/tls_test.go
@kozlovic (Member) left a review:


LGTM. From a code perspective, this looks good to me. Want to let @derekcollison decide if we merge this or if the rate limit approach is not warranted at this time.

@julius-welink (Contributor Author) commented Sep 30, 2021:

> LGTM. From a code perspective, this looks good to me. Want to let @derekcollison decide if we merge this or if the rate limit approach is not warranted at this time.

ok thanks! Let me know what you want to do. If this is up for merging, then I'll squash commits.

server/server.go (outdated diff):
@@ -2409,6 +2415,15 @@ func (s *Server) createClient(conn net.Conn) *client {

// Check for TLS
if !isClosed && tlsRequired {
if s.connRateCounter != nil && !s.connRateCounter.allow() {
c.Warnf("Rejecting connection due to TLS rate limiting")
A reviewer (Member) commented on the diff:


I think this is a change in the log volume surfacing as a result of client actions, right? We will log freely for peers within a cluster, or across gateways, but tend to be cautious about log messages which scale with end-user connections. (Particularly if the client IP is decorated into the log, as that can be PII). We should definitely log this, but I think (regrettably) it might need to be at debug or trace level. @derekcollison can better say what the policy is here.

We should probably add new fields to the type Varz in server/monitor.go; something like ThrottledConnections?

@julius-welink (Contributor Author) replied:


What would you say if I aggregated the logs within a 1s window?

Rejected 100 connections due to TLS rate limiting

A reviewer (Member) replied:


Seems sane enough, but keeping track and doing the conditional logging seems like more work than adding a counter to the stats tracking and incrementing it. :)

@julius-welink (Contributor Author) replied:


You're right, quite a bit of code: 5e5eb63

If you think that's fine, I could rebase the whole PR onto the latest nats codebase, possibly opening a new PR with the changes squashed into a single commit. Let me know whether this is up for merging or not.
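For context, a rough Go sketch of a fixed-window rate counter with the aggregated rejection log discussed above; it mirrors the idea (an allow() check plus one summary line per window) but is only illustrative, not the code in commit 5e5eb63:

```go
package server

import (
	"fmt"
	"sync"
	"time"
)

// rateCounter admits at most `limit` connections per interval and batches
// rejection logging into a single line per window (illustrative sketch).
type rateCounter struct {
	mu       sync.Mutex
	limit    int
	count    int
	rejected int
	end      time.Time
	interval time.Duration
}

func newRateCounter(limit int) *rateCounter {
	return &rateCounter{
		limit:    limit,
		interval: time.Second,
		end:      time.Now().Add(time.Second),
	}
}

// allow reports whether another connection fits into the current window.
// When a window rolls over, it emits one aggregated log line covering all
// connections rejected during the previous window.
func (r *rateCounter) allow() bool {
	now := time.Now()
	r.mu.Lock()
	defer r.mu.Unlock()
	if now.After(r.end) {
		if r.rejected > 0 {
			fmt.Printf("Rejected %d connections due to TLS rate limiting\n", r.rejected)
		}
		r.count, r.rejected = 0, 0
		r.end = now.Add(r.interval)
	}
	if r.count >= r.limit {
		r.rejected++
		return false
	}
	r.count++
	return true
}
```

The accept path would then call something like connRateCounter.allow() before starting the TLS handshake, as in the diff shown earlier.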

@derekcollison (Member) commented:

I am ok with it, but still feel mostly the same as my original comment.

Review comments were left on: server/rate_counter.go, server/rate_counter_test.go, server/server.go, test/tls_test.go
@kozlovic (Member) commented Jan 7, 2022:

@julius-welink Also, I forgot to mention that you have conflicts, so you may want to rebase from the main branch.

@kozlovic (Member) left a review:


LGTM. @derekcollison I think there is still value and I am ok with this PR, should we merge?

@julius-welink (Contributor Author) commented Jan 13, 2022:

@derekcollison:
> I am ok with it, but still feel mostly the same as my original comment.
> ... cluster itself was not sized properly.

[image: CPU usage chart; the spike occurs when clients reconnect]

This spike is when clients reconnect. To properly size the cluster I'd need to spend 20x more on hosting fees, just to account for these reconnect spikes. If I don't, NATS simply stops responding under the high load and does not let any clients in; it has happened. In the case of a mass reconnect, I'd rather put clients in a queue. Perhaps you have some insight into how enterprises deal with this?

> large RSA certs

I'm playing around with EC (Elliptic Curve) certs. I haven't tested them in production yet, but I hope they will reduce CPU usage.

Something else on my mind. Our problem statement is: "NATS would simply stop responding due to high load". It should be technically possible to stop accepting new connections when NATS detects high CPU usage. I've seen you measure "event loop" delay, and under high load NATS was logging warnings. This seems a bit less reliable (how would you even test it?), but it could be an alternative to this PR. Do you think it's feasible to go in that direction instead?

@derekcollison (Member) commented:

I understand your concerns on costs. What I was trying to highlight is that in a total system design you need to account and size for client rollover during server upgrades, network outages, or even server failures themselves. When using large RSA keys, that becomes something that needs to be tested.

One answer is of course ECC keys, which can help. We also do jitter in the clients, and if you control the clients you can help out here by increasing this value, etc.
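For reference, a short nats.go client sketch of the reconnect wait and jitter knobs being discussed; the URL and durations are placeholders, not recommendations:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Spread reconnect attempts out so thousands of clients do not all start
	// a TLS handshake at the same instant after an outage (values illustrative).
	nc, err := nats.Connect("tls://nats.example.com:4222",
		nats.ReconnectWait(2*time.Second),                         // base delay between attempts
		nats.ReconnectJitter(500*time.Millisecond, 5*time.Second), // added random delay (plain, TLS)
		nats.MaxReconnects(-1),                                    // never stop trying to reconnect
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```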

@kozlovic (Member) commented:

@derekcollison Still, do you think that this PR as-is has merit? Code-wise I approved it, and I want to know whether we merge it.

@julius-welink (Contributor Author) commented Jan 13, 2022:

@derekcollison As I mentioned above:
> The problem with only using jitter reconnect is that clients do not know whether the server is busy; a small jitter does not solve the issue, while a large jitter makes it harder for clients to reconnect under normal server load.

> client rollover during server upgrades

I do account for controlled client rollover. Not sure how you can reasonably account for a total network outage though, without blowing the budget :)

Also, backend systems exist behind nats that do benefit from client throttling.

@kozlovic kozlovic merged commit c9c603b into nats-io:main Jan 13, 2022
@julius-welink (Contributor Author) commented:

@kozlovic @philpennock thanks for the reviews!
