[ADDED] TLS connection rate limiter #2573
Conversation
I'm seeing some test failures unrelated to my PR.
I have made some comments. I think the idea is good, but I have mainly commented on the code; I would like others, say @philpennock or @derekcollison, to comment on the approach itself.
Most officially supported clients also implement jittered reconnect logic to help smooth out reconnect storms, and we use all available cores to process the TLS handshake since, IIRC, we do that post-accept in a separate goroutine. With that said, this could still be useful under extreme conditions; we have seen some enterprise customers with large RSA certs get bitten by this when the cluster itself was not sized properly.
Thanks for the review! Will address the above comments soon. I have some more considerations:
I doubt this will solve my problems. I'll have thousands of clients. It is acceptable to me that clients have to queue for up to 5 or 10 minutes to reconnect. It's either that or I will have to reserve a lot of extra CPUs just to handle reconnect spikes. The problem with relying only on jittered reconnects is that clients do not know whether the server is busy. A small jitter does not solve the issue, and a large jitter makes it harder for clients to reconnect under normal server load. The proposed TLS rate limit setting would have good synergy with a back-off strategy, however.
Yes, it applies only to
Seems feasible to make exceptions based on IP addresses. But let's make this a separate task (?). I do not necessarily need this right now.
Just some more little changes. Thanks!
LGTM. From a code perspective, this looks good to me. Want to let @derekcollison decide if we merge this or if the rate limit approach is not warranted at this time.
OK, thanks! Let me know what you want to do. If this is up for merging, then I'll squash commits.
server/server.go (Outdated)

@@ -2409,6 +2415,15 @@ func (s *Server) createClient(conn net.Conn) *client {

	// Check for TLS
	if !isClosed && tlsRequired {
		if s.connRateCounter != nil && !s.connRateCounter.allow() {
			c.Warnf("Rejecting connection due to TLS rate limiting")
I think this is a change in the log volume surfacing as a result of client actions, right? We will log freely for peers within a cluster, or across gateways, but tend to be cautious about log messages which scale with end-user connections. (Particularly if the client IP is decorated into the log, as that can be PII). We should definitely log this, but I think (regrettably) it might need to be at debug or trace level. @derekcollison can better say what the policy is here.
We should probably add new fields to the type Varz in server/monitor.go; something like ThrottledConnections?
What would you say if I aggregated logs within a 1-second window?
Rejected 100 connections due to TLS rate limiting
Seems sane enough, but keeping track and doing the conditional logging seems like more work than adding a counter to the stats tracking and incrementing it. :)
You're right, quite a bit of code:
5e5eb63
If you think that's fine, I could rebase the whole PR on the latest nats codebase, and possibly open a new PR with the changes squashed into a single commit. Let me know if this is up for merging or not.
I am ok with it, but still feel mostly the same as my original comment.
@julius-welink Also forgot to mention that you have conflicts, so you may want to rebase from the main branch.
Force-pushed from c7c5463 to 0e31ad9.
Force-pushed from b3b013a to a47e5e0.
LGTM. @derekcollison I think there is still value and I am ok with this PR, should we merge?
This spike here is when clients reconnect. To properly size the cluster I'd need to spend 20x more on hosting fees, just to account for these reconnect spikes. If I don't, NATS would simply stop responding due to high load and not let in any clients. It has happened. In the case of a mass reconnect, I'd rather put clients in a queue. Perhaps you have insights into how enterprises deal with this?
I'm playing around with EC (Elliptic Curve) certs. I haven't yet tested them in production, but I hope they will reduce CPU usage. Something else on my mind. This is our problem statement: "NATS would simply stop responding due to high load". It should be technically possible to stop accepting new connections when NATS detects high CPU usage. I've seen you do measurements of "event loop" delay; under high load NATS was logging warnings. This seems a bit less reliable, though; how would you even test it? Nevertheless, it could be an alternative to this PR. Do you think it's feasible to go in this direction instead?
I understand your concerns on costs. What I was trying to highlight is that in a total system design, you need to account and size for client rollover during server upgrades, network outages, or even server failures themselves. When using large RSA keys that becomes something that needs to be tested. One answer is of course ECC keys, which can help. We also do jitter in the clients, and if you control the clients you can help out here by increasing this value, etc.
@derekcollison Still, do you think that this PR as-is has merit? Code-wise, I approved it, and want to know if we merge it.
@derekcollison I do account for controlled client rollover. Not sure how you can reasonably account for a total network outage, though, without blowing the budget :) Also, backend systems exist behind NATS that do benefit from client throttling.
@kozlovic @philpennock thanks for the reviews!
Resolves #NNN
(git pull --rebase origin main)

Changes proposed in this pull request:
TLS handshake negotiation is very CPU intensive, especially if RSA is used. For this reason, I see that NATS has "Lame Duck Mode". Unfortunately, during a network outage it is still possible for many clients to try to connect at the same time. The rationale behind tls.connection_rate_limit is to prevent the CPU from being overwhelmed by a surge of connecting clients: the connection is dropped before TLS is initialized, effectively throttling the TLS connection rate.
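As a sketch of how the proposed setting might be enabled, here is a hypothetical server configuration fragment. The option name `connection_rate_limit` comes from this PR; the per-second semantics, the limit value, and the file paths are illustrative assumptions.

```
tls {
  cert_file: "./server-cert.pem"
  key_file:  "./server-key.pem"
  # Proposed option: cap how many TLS connections are accepted
  # per second; excess connections are dropped before the
  # handshake starts, so they cost almost no CPU.
  connection_rate_limit: 100
}
```

Leaving the option unset would keep today's behavior, which matches the "does not affect current functionality unless enabled" point below.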
The good thing about this change is that it does not affect current NATS functionality unless enabled. Also, I will soon be testing this in a real project with more than 500 clients.
Other solutions that I have considered:
There is no related issue, AFAIK. I am making this change for myself. A PR should be a good way to get the tests running, since TravisCI does not cooperate with me :) I am willing to spend more time on this PR if there is a chance it could be accepted into the main nats-server repo. Please let me know what else I need to do to get this approved.
I'll wait for some feedback before squashing commits. Do I need to update docs? If yes, how?
/cc @nats-io/core