Resource usage issues after upgrading Jetstream cluster to 2.2.2 #2170
Some more info that might help track this down... In 2.2.1, our stream seemed to be fine for a while, but then we'd start seeing this in the log...
Then the Nats servers would stop responding to any request regarding the stream, for example:
Would just never be responded to; even simpler requests hung as well. The strange thing is that in 2.2.1, all the Grafana graphs looked completely healthy. |
What is the configuration setup for the cluster and servers? What does |
What happened just after 12 to server memory? What event was that? |
We have a 3 node Nats server cluster setup via a Kubernetes stateful set. All three nodes are started like:
Where
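The actual startup command/config was lost in the copy above, but a 3-node JetStream cluster run as a Kubernetes StatefulSet generally takes this shape. All names, paths, and sizes below are illustrative guesses, not the poster's actual values:

```conf
# Hypothetical nats-server config for one node of a 3-node JetStream
# cluster in a StatefulSet; substitute your own names, sizes, and routes.
server_name: $POD_NAME
port: 4222
http_port: 8222            # monitoring endpoints (/varz, /subsz, ...)

jetstream {
  store_dir: /data/jetstream
  max_memory_store: 1GB
  max_file_store: 8GB      # per-server raw storage limit
}

cluster {
  name: js-cluster
  port: 6222
  routes: [
    nats://nats-0.nats.default.svc:6222
    nats://nats-1.nats.default.svc:6222
    nats://nats-2.nats.default.svc:6222
  ]
}
```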
I can't figure how to run that command. The subcommand
Nothing that we can tell. None of the Nats nodes have been restarted in over 24 hours (long before that sharp drop in memory usage). The only things that use Nats are a few processes that read from PostgreSQL CDC and write the changes to the Jetstream stream. Those processes have been running for 6 days now. They automatically reconnected after we deployed 2.2.2, and according to our logs are just happily chugging along...
If I make a request to
Hope that helps and let me know what other info I can provide. Thanks! |
Thanks, much appreciated. Do you have monitoring on for the cluster? I am interested in seeing the output of /varz and /subsz?subs=1 for each server if possible. And if any of them seem to be using more cpu or memory, the /stacksz for those as well. Could also jump on a Zoom or GH tomorrow. |
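For reference, those endpoints are served by the server's HTTP monitoring port (8222 below is an assumption; it depends on your `http_port`/`-m` setting). A small sketch that builds the URLs to pull on each server:

```python
# Build the monitoring URLs requested above (/varz, /subsz?subs=1, /stacksz).
# Port 8222 is the conventional monitoring port -- an assumption here.
from urllib.request import urlopen

ENDPOINTS = ("varz", "subsz?subs=1", "stacksz")

def monitoring_urls(host="localhost", port=8222):
    return [f"http://{host}:{port}/{ep}" for ep in ENDPOINTS]

# To actually pull the data, run against each server in the cluster:
# for url in monitoring_urls("nats-0.nats.default.svc"):
#     print(urlopen(url).read().decode())
```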
Yes, we have monitoring enabled. I can help out with the requested info. 😄 Happy to provide any additional details. |
…ouble for each fetch. Resolves part of #2170 Signed-off-by: Derek Collison <derek@nats.io>
Going to close now but please let me know if you are having any other issues. |
Hey @derekcollison , since upgrading to 2.2.3 our Jetstream cluster is more stable in the sense that it responds to all requests, but it thinks it's out of resources. Server logs are filled with:
varz reports over quota:
But
Our nats config looks like:
If we look at volume usage in AWS, it also shows roughly the same. Thanks for the help! |
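One way to sanity-check the discrepancy is to compare the JetStream storage reported by `/varz` against the configured limit. A minimal Python sketch, using a fabricated response shaped like the 2.2.x `jetstream` subsection (the field names are my reading of that payload, and the figures are illustrative, not the poster's real values):

```python
import json

# Fabricated /varz excerpt; real data would come from
# urllib.request.urlopen("http://localhost:8222/varz").
varz = json.loads("""
{
  "jetstream": {
    "config": {"max_storage": 8589934592},
    "stats":  {"storage": 8697308774}
  }
}
""")

js = varz["jetstream"]
limit = js["config"]["max_storage"]   # configured max_file_store (8GB)
used = js["stats"]["storage"]         # storage the server reports in use
print(f"storage used {used / 1e9:.2f} GB of {limit / 1e9:.2f} GB limit")
print("over quota" if used > limit else "within quota")
```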
You have 3 replicas, so 3 x 2.7GB I believe, plus some overhead for other stuff. What does the following show on any one of the servers?
|
Interesting, I assumed that setting meant something like "the logical size of a single stream". Does Also, am I wrong in thinking these settings apply to single streams? In other words, if I set Thanks! |
If the setting is in the server's JetStream block, it applies to the raw resources used by the server. If it is in the account, it means total usage; so for the $G account it would need to be 2.7GB x 3, so 8.1GB, and I would round up to 10GB for that account. |
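The accounting described above is simple enough to sketch. A hypothetical check using the sizes from this thread (the numbers are this issue's, the helper itself is just an illustration):

```python
# Account-level storage accounting: every replica counts against the
# account's limit, so a replicated stream consumes size * replicas.
stream_size_gb = 2.7       # logical stream size from this thread
replicas = 3
account_limit_gb = 8.0     # account-level max file storage

accounted = stream_size_gb * replicas
print(f"accounted usage: {accounted:.1f} GB vs limit {account_limit_gb} GB")
if accounted > account_limit_gb:
    print("over the account limit -> 'resource limits exceeded for account'")
```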
And pulling varz from same server says you are using way more than 2.7GB? |
OK, checking varz locally myself now and I see something similar. Let me dig in. |
The JetStream usage underneath a server's varz was reporting total system usage incorrectly. |
The fix you made seems to only affect reporting (the /varz endpoint)? I'm still unsure why we're getting "resource limits exceeded for account" when trying to make publish requests to the stream.
Our config seems to be setting 8gb at the server level:
So if I'm understanding that correctly, that means in a 3-node cluster a stream cannot exceed 24gb? As opposed to setting the max_file_store at the account level, which says "a stream cannot exceed 8gb, no matter how many nodes are in the cluster". This is assuming stream size is defined as the sum of all its replicas. So I'm still not sure why we're getting these errors. Is there documentation or any example config files to look at? All I could find was this one: https://github.com/nats-io/nats-server/blob/master/server/configs/js-op.conf Thanks again. |
Yes, the change was around varz's JetStream subsection reporting global values for usage as monitored by all servers, not just for that server. In your case above, 3x8GB of storage means you have 24GB to use any way you want. If you have a 3x-replicated stream, that stream can house about 8GB, since it will be using 24GB once replicated. In your setup what does |
So our 3-node cluster has 24GB to work with, but according to each node, they are full. But according to And according to Something's not adding up, yes? |
You have the main account limited to use 8GB system wide, regardless of the underlying NATS system resources. Change that to something higher since that accounts for all storage including replicas for that account. So for 2.7GB stream but with Replicas = 3 that is just over 8GB and you have hit the limit for that account. |
How do we do that? We couldn't find any config-file info in the Jetstream documentation. Here's our config now:
Thanks! |
If you want to simulate the global account with no credentials etc. but do want the jetstream block, you can do the following.
|
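Derek's example config was not captured above. Based on his description (a named account standing in for the global account, no credentials required, with its own JetStream limits), a sketch might look like the following; the account name, user, and numbers are my illustration, and the key names reflect my reading of the server's account-level JetStream limits. The point is that the account-level file limit must cover replicas, so 3 x 2.7GB rounds up to roughly 10GB:

```conf
# Illustrative only; not the exact config from this thread.
jetstream {
  store_dir: /data/jetstream
  max_file_store: 8GB      # per-server raw storage limit
}

accounts {
  APP {
    jetstream {
      max_memory: 1GB
      max_file: 10GB       # account-wide; counts every replica
    }
    users: [ {user: app, password: app} ]
  }
}

# Clients that connect without credentials land in APP,
# simulating the old global-account behavior.
no_auth_user: app
```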
We started seeing strange resource usage patterns after updating to 2.2.2. Our application code hasn't changed at all, and is very simple. It just creates a Jetstream stream and publishes messages to it (via new style requests). There are no consumers at all.
The memory usage of one of the servers kept growing despite it having zero connections. Also, it looks like the subscriptions are growing indefinitely, though that might be a reporting error.
After upgrading to 2.2.2, you can see the subscriptions keep growing indefinitely, but we figured that was just a reporting issue due to metric names being changed, i.e. gnatsd_varz_subscriptions vs gnatsd_subsz_num_subscriptions (the latter doesn't grow indefinitely, but holds constant). But then seeing memory usage grow on one server, seemingly indefinitely until a sudden drop, caused us to open this issue. Something seems strange.
Also, all the server logs show this error happening a lot:
We have very few connections to the Nats cluster (as seen in the graphs). Each connection makes a subscription to _INBOX.<inbox-uid>.* and makes requests using reply_tos like _INBOX.<inbox-uid>.<request-uid>. Lastly, the data volumes attached to each node are only at 20% capacity. How can we debug this further? Thanks for the help.