[ADDED] TLS connection rate limiter #2573
Conversation
I'm seeing some test failures unrelated to my PR.
I have made some comments. I think the idea is good, but I have mainly commented on the code; I would like others, say @philpennock or @derekcollison, to comment on the approach itself.
Most officially supported clients also implement jittered reconnect logic to help smooth out reconnect storms, and we use all available cores to process the TLS handshake since, IIRC, we do that post-accept in a separate goroutine. With that said, this could still be useful under extreme conditions; we have seen some enterprise customers with large RSA certs get bitten by this when the cluster itself was not sized properly.
Thanks for the review! Will address the above comments soon. I have some more considerations:
I doubt this will solve my problems. I'll have thousands of clients. It is acceptable to me that clients have to queue for up to 5 or 10 minutes to reconnect. It's either that or I will have to reserve a lot of extra CPUs just to handle reconnect spikes. The problem with relying only on jittered reconnects is that clients do not know whether the server is busy. A small jitter does not solve the issue, and a large jitter makes it harder for clients to reconnect under normal server load. The proposed TLS rate limit setting would have good synergy with a back-off strategy, however.
Yes, it applies only to
Seems feasible to make exceptions based on IP addresses. But let's make this a separate task (?). I do not necessarily need this right now.
Just some more little changes. Thanks!
LGTM. From a code perspective, this looks good to me. Want to let @derekcollison decide if we merge this or if the rate limit approach is not warranted at this time.
OK, thanks! Let me know what you want to do. If this is up for merging, then I'll squash commits.
server/server.go (Outdated)

@@ -2409,6 +2415,15 @@ func (s *Server) createClient(conn net.Conn) *client {

	// Check for TLS
	if !isClosed && tlsRequired {
		if s.connRateCounter != nil && !s.connRateCounter.allow() {
			c.Warnf("Rejecting connection due to TLS rate limiting")
I think this is a change in the log volume surfacing as a result of client actions, right? We will log freely for peers within a cluster, or across gateways, but tend to be cautious about log messages which scale with end-user connections. (Particularly if the client IP is decorated into the log, as that can be PII). We should definitely log this, but I think (regrettably) it might need to be at debug or trace level. @derekcollison can better say what the policy is here.
We should probably add new fields to the type Varz in server/monitor.go; something like ThrottledConnections?
What would you say if I aggregated logs within a 1-second window?
Rejected 100 connections due to TLS rate limiting
Seems sane enough, but keeping track and doing the conditional logging seems like more work than adding a counter to the stats tracking and incrementing it. :)
You're right, quite a bit of code:
5e5eb63
If you think that's fine, I could rebase the whole PR on the latest nats codebase, and possibly open a new PR with the changes squashed into a single commit. Let me know if this is up for merging or not.
I am ok with it, but still feel mostly the same as my original comment.
@julius-welink Also forgot to mention that you have conflicts, so you may want to rebase from the main branch.
Force-pushed from c7c5463 to 0e31ad9.
Force-pushed from b3b013a to a47e5e0.
LGTM. @derekcollison I think there is still value and I am ok with this PR, should we merge?
This spike here is when clients reconnect. To properly size the cluster I'd need to spend 20x more on hosting fees, just to account for these reconnect spikes. If I don't, NATS would simply stop responding due to high load and not let in any clients. It has happened. In the case of a mass reconnect, I'd rather put clients in a queue. Perhaps you have insights into how enterprises deal with this?
I'm playing around with EC (Elliptic Curve) certs. I haven't yet tested them in production, but I hope they will reduce CPU usage. Something else on my mind. This is our problem statement: "NATS would simply stop responding due to high load". It should be technically possible to stop accepting new connections when NATS detects high CPU usage. I've seen you do measurements of "event loop" delay; under high load NATS was logging warnings. This seems a bit less reliable, though; how would you even test it? Nevertheless, it could be an alternative to this PR. Do you think it's feasible to go in this direction instead?
I understand your concerns on costs. What I was trying to highlight is that in a total system design, you need to account and size for client rollover during server upgrades, network outages, or even server failures themselves. When using large RSA keys that becomes something that needs to be tested. One answer is of course ECC keys, which can help. We also do jitter in the clients, and if you control the clients you can help out here by increasing this value, etc.
@derekcollison Still, do you think that this PR as-is has merit? Code-wise, I approved it, and want to know if we merge it.
@derekcollison I do account for controlled client rollover. Not sure how you can reasonably account for a total network outage, though, without blowing the budget :) Also, backend systems exist behind NATS that do benefit from client throttling.
@kozlovic @philpennock thanks for the reviews!
Resolves #NNN
(git pull --rebase origin main)

Changes proposed in this pull request:
TLS handshake negotiation is very CPU intensive, especially if RSA is used. For this reason, I see that NATS has "Lame Duck Mode". Unfortunately, during a network outage it is still possible for many clients to try to connect at the same time. The rationale behind tls.connection_rate_limit is to prevent the CPU from being overwhelmed by a surge of connecting clients: the connection is dropped before TLS is initialized, effectively throttling the TLS connection rate.
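As a sketch of how the proposed setting might be enabled, here is a hypothetical server configuration fragment. The option name `connection_rate_limit` comes from this PR; the per-second semantics, the limit value, and the file paths are illustrative assumptions.

```
tls {
  cert_file: "./server-cert.pem"
  key_file:  "./server-key.pem"
  # Proposed option: cap how many TLS connections are accepted
  # per second; excess connections are dropped before the
  # handshake starts, so they cost almost no CPU.
  connection_rate_limit: 100
}
```

Leaving the option unset would keep today's behavior, which matches the "does not affect current functionality unless enabled" point below.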
The good thing about this change is that it does not affect current NATS functionality unless enabled. Also, I will soon be testing this in a real project with more than 500 clients.
Other solutions that I have considered:
There is no related issue, AFAIK. I am making this change for myself. A PR should be a good way to get the tests running, since TravisCI does not cooperate with me :) I am willing to spend more time on this PR if there is a chance it could be accepted into the main nats-server repo. Please let me know what else I need to do to get this approved.
I'll wait for some feedback before squashing commits. Do I need to update docs? If yes, how?
/cc @nats-io/core