
Rate Limit of Push Consumers for KV Watches #619

Closed
ColinSullivan1 opened this issue Dec 7, 2022 · 13 comments
Labels: enhancement, question

Comments


ColinSullivan1 commented Dec 7, 2022

Is it possible, using nats.c, to apply RateLimit to the underlying push consumer when watching a KV?

Users would like to throttle the delivery of the initial values when watching all keys.

ColinSullivan1 added the enhancement and question labels Dec 7, 2022
@derekcollison

Watchers should be using the ordered consumer, which is flow controlled but not rate limited per se; that could be added, though, since consumers support it.

In general, however, I get concerned when I see rate limiting being used.
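For context, the rate-limit knob that consumers support is already exposed in nats.c through jsConsumerConfig. Below is a minimal sketch of approximating a throttled "watch all" today by subscribing directly to the bucket's underlying stream subjects instead of using the KV watcher API; the bucket name CONFIG and the 8 Mbit/s value are made-up examples (a KV bucket BUCKET is backed by stream KV_BUCKET with subjects $KV.BUCKET.>):

```c
#include <nats/nats.h>
#include <stdio.h>

static void onMsg(natsConnection *nc, natsSubscription *sub, natsMsg *msg, void *closure)
{
    printf("update on %s (%d bytes)\n", natsMsg_GetSubject(msg), natsMsg_GetDataLength(msg));
    natsMsg_Destroy(msg);
}

int main(void)
{
    natsConnection   *nc  = NULL;
    jsCtx            *js  = NULL;
    natsSubscription *sub = NULL;
    jsErrCode        jerr = 0;
    jsSubOptions     so;
    natsStatus       s;

    s = natsConnection_ConnectTo(&nc, NATS_DEFAULT_URL);
    if (s == NATS_OK)
        s = natsConnection_JetStream(&js, nc, NULL);
    if (s == NATS_OK)
        s = jsSubOptions_Init(&so);
    if (s == NATS_OK)
    {
        // The KV bucket "CONFIG" is backed by the stream "KV_CONFIG",
        // whose subjects are "$KV.CONFIG.>".
        so.Stream               = "KV_CONFIG";
        so.Config.DeliverPolicy = js_DeliverAll;     // replay current values first
        so.Config.AckPolicy     = js_AckNone;
        so.Config.RateLimit     = 8 * 1024 * 1024;   // bits per second, enforced by the server
        s = js_Subscribe(&sub, js, "$KV.CONFIG.>", onMsg, NULL, NULL, &so, &jerr);
    }
    if (s != NATS_OK)
        fprintf(stderr, "error: %s (jerr=%d)\n", natsStatus_GetText(s), (int) jerr);

    // ... run until done, then clean up ...
    natsSubscription_Destroy(sub);
    jsCtx_Destroy(js);
    natsConnection_Destroy(nc);
    return (s == NATS_OK) ? 0 : 1;
}
```

Note that this bypasses the KV watcher conveniences (the end-of-initial-values marker, decoding of delete/purge operations), so it is only an approximation of what the issue asks for.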


kozlovic commented Dec 9, 2022

Also, the original request seems to indicate that the user wants to rate-limit the initial values, so that would need to be clarified. Setting the rate limit on the ordered consumer is one thing, but if the setting has to be removed once the initial values have been received (when the library would normally "post" a NULL entry), then the server would need to support a consumer update that removes the rate limit (or the library would have to delete and recreate the consumer, which may not be a good idea). I am not sure the server supports changing the rate limit during a consumer update.

@derekcollison

It does not, but ordered consumers flip out ephemerals as a matter of course when data is missed, so we could use the same trick to switch over after the initial messages are received.


kino71 commented Dec 9, 2022

Hello,
Let me describe our use case in detail.
We are using the KV to store information needed by a process upon startup. This process runs in many instances (50+ and increasing) on dedicated hardware (Linux boxes) in our datacenters.
In the current version, this data is stored in a local CSV text file, propagated daily (the process is restarted after propagation). It used to be a table on a relational DB, but we didn't like the pressure on the DB when many instances restarted.

KV allows us to centralize this information (no need for file distribution), handle the updates (which, albeit rare, do happen during the day), and avoid restarting altogether.
Our concern is about all the instances restarting simultaneously due to some issue such as a network malfunction or the like. All the instances would start watching at the same time, putting a lot of pressure on the network.
In testing we have simulated this scenario by restarting several instances on a single machine. The host network couldn't cope with that much data, but NATS handled it well by disconnecting (slow consumer) and reconnecting (you might recall a fix on nats.c for this).
Our concern is not to overwhelm NATS and the other services that make use of it, should a complete restart happen.

Having said that, I noticed that there is a rate limit for consumers; that's why I asked the original question.
We had just the initial load in mind, but it wouldn't be a problem if the throttling stayed in place (updates are rare in our scenario).

Let me add that if you reckon NATS will protect itself with those "slow consumer" disconnections without any impact, then this throttling is not actually needed.

thanks!
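For reference, the watch-on-startup pattern described above looks roughly like this with nats.c's KV API; this is a sketch only, with a made-up bucket name CONFIG. As noted elsewhere in this thread, the library posts a NULL entry once all initial values have been delivered:

```c
#include <nats/nats.h>
#include <stdio.h>

int main(void)
{
    natsConnection *nc = NULL;
    jsCtx          *js = NULL;
    kvStore        *kv = NULL;
    kvWatcher      *w  = NULL;
    kvEntry        *e  = NULL;
    natsStatus     s;

    s = natsConnection_ConnectTo(&nc, NATS_DEFAULT_URL);
    if (s == NATS_OK)
        s = natsConnection_JetStream(&js, nc, NULL);
    if (s == NATS_OK)
        s = js_KeyValue(&kv, js, "CONFIG");
    if (s == NATS_OK)
        s = kvStore_WatchAll(&w, kv, NULL);

    // Initial load: every current entry is delivered, then a NULL entry
    // signals that the initial values have all been received.
    while (s == NATS_OK)
    {
        s = kvWatcher_Next(&e, w, 5000);
        if ((s != NATS_OK) || (e == NULL))
            break;
        printf("startup value: %s = %s\n", kvEntry_Key(e), kvEntry_ValueString(e));
        kvEntry_Destroy(e);
    }

    // From here, the same kvWatcher_Next() loop (typically on its own
    // thread) keeps picking up live updates as keys change.

    kvWatcher_Destroy(w);
    kvStore_Destroy(kv);
    jsCtx_Destroy(js);
    natsConnection_Destroy(nc);
    return (s == NATS_OK) ? 0 : 1;
}
```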

kozlovic added a commit that referenced this issue Dec 9, 2022
This is expressed in bits per second and allows the server to rate
limit updates sent to the watcher. This applies to both the initial
values and new updates.

Resolves #619

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
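The PR these commits belong to was ultimately closed without being merged (see the end of the thread), so nats.c does not ship this option. Purely as an illustration of the commit message, usage might have looked something like the following; the RateLimit field on kvWatchOptions is an assumption, not a real API:

```c
// Hypothetical sketch only: the RateLimit field below does not exist in the
// released kvWatchOptions; it is assumed from the commit message (bits per
// second, applied to the watcher's underlying push consumer).
static natsStatus startThrottledWatcher(kvWatcher **w, kvStore *kv)
{
    kvWatchOptions o;

    kvWatchOptions_Init(&o);
    o.RateLimit = 8 * 1024 * 1024;   // assumed: bits per second
    return kvStore_WatchAll(w, kv, &o);
}
```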
@derekcollison

The watchers for the KV should be flow controlled, so before going down the path of rate limits, which can be useful but are usually the wrong solution to a problem, I would like to understand a few things.

  1. Were you using the KV watchers?
  2. I will check with @kozlovic, but they should be flow controlled.
  3. What problems were you seeing on the network in your setup?


kino71 commented Dec 11, 2022

Hello @derekcollison,
We are using KV watchers: our process does a "watch all" upon startup and keeps watching for key updates (key updates aren't frequent).
With 5 processes running on a single host and restarted simultaneously, we see a number of "slow consumer" events, reconnections and so on, due to the host's network being overwhelmed. Everything is handled well by NATS and the nats.c library.
In production our process runs as a single instance on each of over 50 dedicated hosts. Our concern was about overwhelming the NATS network if they all restarted at once (which happens after a network failure, for instance).
We used to have a connection to a database, and the "restart-all" scenario was a nightmare...

I'm not able to try the exact same scenario in dev/test, but once in production I could run a few tests on weekends.

@ripienaar

I use randomised sleeps at start for this kind of process and it works well to control the impact of a cold start. I've had success with thousands of subscribers.

If you have 5 processes each doing this on one node, flow control probably won't help much, but a rate limit might. You could also consider one process that manages a local cache of the data, so your other processes just read, for example, a local JSON file and watch it for changes. That way you get 1/5th of the traffic.

A cold start that's a thundering herd will always be problematic, so the real solution isn't to find a more thundering-herd-resistant tech but to control the real problem instead.
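A minimal sketch of the randomised startup delay suggested here, assuming an arbitrary 30-second ceiling and POSIX getpid() (the hosts in this thread are Linux boxes):

```c
#include <nats/nats.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>   // getpid()

// Sleep a random amount of time (up to maxDelayMs milliseconds) before
// connecting and creating the KV watcher, so instances restarted at the
// same moment spread their initial load over the window.
static void startupJitter(int maxDelayMs)
{
    // Mix the PID into the seed so processes started within the same
    // second don't all pick the same delay.
    srand((unsigned int) (time(NULL) ^ getpid()));
    nats_Sleep(rand() % maxDelayMs);
}

int main(void)
{
    startupJitter(30 * 1000);
    // ... natsConnection_ConnectTo(), js_KeyValue(), kvStore_WatchAll(), etc. ...
    return 0;
}
```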


kino71 commented Dec 11, 2022

A randomized sleep is a good idea; we currently restart that way during the controlled startup (daily, very early in the morning).
We might add a random sleep when the restart follows a sudden crash.
In production we have a physical host per process, so there is no way to share anything locally. We will containerize (Docker) this application; at that point we'll be able to do as you suggest.
This week we'll give this implementation (the RateLimit option) a try and report back.
thanks!

@derekcollison

You should not be seeing any slow consumers since the watchers should be flow controlled.

Can you give us more information on the slow consumers you are seeing?


kino71 commented Dec 15, 2022

Hi, I finally had some time to run a few tests.
You are right, we no longer see "slow consumer" errors. I think they only occurred before nats.c was patched (nats.c issues #600, #601, #602).

We'll see what happens in production when all our processes restart at the same time, and then we'll decide whether to use the RateLimit option.

thanks


kozlovic commented Jan 5, 2023

@kino71 Hi, I was wondering if you had time to evaluate the PR. If there is no need, I would rather close it without merging, because we don't want to add features that aren't proving beneficial.


kino71 commented Jan 13, 2023

Hi, I haven't had much time, but we actually didn't get any "slow consumer" errors while restarting many instances simultaneously, so I think we might not need this after all.
thanks!

@kozlovic

@kino71 Thanks for the feedback. Given your answer, I am going to close this issue and the corresponding PR. There is no point adding APIs/configs that we don't see a need for at this moment.
