Messages stop being delivered to a consumer [v2.9.22, v2.10.14] #4736
Comments
---
Could you please check if the latest 2.10 resolves this issue?
---
We have a similar problem on the latest NATS v2.10.7 (we use Debian 12). It happens rarely, but the symptoms are quite similar: the consumer just stops delivering messages to our Go app. On all 3 servers working in the cluster we got a lot of messages like this:
Around the time the consumer stopped processing:
Before the consumer stopped, on the metadata leader we saw a huge number of messages like this (however, this consumer is from another stream):
After restarting all NATS servers (in sequence):
Additional info (stream):
In the NATS error.log there are errors like:
From time to time some apps just stop receiving messages, but there are no connection errors or anything like that. We use jetstream.Consumer#Consume for retrieving messages.
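For reference, a minimal sketch of that consumption path with the nats.go jetstream package, assuming a durable pull consumer; the connection URL, stream name (EVENTS), and consumer name (events-worker) are placeholders, not the actual setup from this report:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Look up an existing durable pull consumer (placeholder names).
	cons, err := js.Consumer(ctx, "EVENTS", "events-worker")
	if err != nil {
		log.Fatal(err)
	}

	// Consume starts a background pull loop and invokes the callback per message.
	cc, err := cons.Consume(func(msg jetstream.Msg) {
		// ... process the message ...
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Stop()

	select {} // block forever in this sketch
}
```

Consume runs its own pull loop in the background; the symptom described above is that the callback simply stops firing even though the connection stays up.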
---
EDIT: Rephrasing, as you mentioned you're using push: are you using any custom config for the Consume method, or are you using the defaults?
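For context on what a "custom config" for Consume could look like, a hedged sketch of the pull options the jetstream package exposes, reusing the placeholder cons handle from the sketch in the earlier comment; the values are illustrative, not recommendations:

```go
// Illustrative only: Consume with explicit pull options instead of the defaults.
// Assumes `cons` is the jetstream.Consumer from the earlier sketch.
cc, err := cons.Consume(func(msg jetstream.Msg) {
	msg.Ack()
},
	jetstream.PullMaxMessages(500),         // cap on messages buffered per pull request
	jetstream.PullHeartbeat(5*time.Second), // idle heartbeats help detect a stalled pull
	jetstream.ConsumeErrHandler(func(cctx jetstream.ConsumeContext, err error) {
		log.Printf("consume error: %v", err) // pull and heartbeat errors surface here
	}),
)
if err != nil {
	log.Fatal(err)
}
defer cc.Stop()
```

A ConsumeErrHandler is particularly useful when debugging this kind of stall, since missed heartbeats should show up there rather than as connection errors.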
---
We encountered the same problem on nats v2.10.7.
---
@AetheWu, can you share some info?
nats server and client version: server v2.10.7
nats config:
consumer info:
stream info:
---
@AetheWu The consumer reports that it just tried to deliver the message and it was not acked. Can you please explain what you think is not working here?
---
Hi @Jarema, NATS 2.10.9.
Stream report:
Consumer report:
Do we have some misconfiguration, or do we need a different setup for a Kubernetes StatefulSet with a PVC?
---
We have this problem again.
nats config:
stream info:
consumer info:
---
Any news? This problem is really annoying, but there is still no solution.
---
It might be best to start a new issue and describe it there, as this one started on a 2.9.22 server. It would also be good to make sure the issue is reproducible on the 2.10.17-RC4 prerelease.
Observed behavior
A couple of weeks ago, one of the consumers on one of our streams started misbehaving: after an influx of messages that the consumer should process, the messages end up in the pending state for this consumer (nats_consumer_num_pending), from where the consumer processes them. After processing some of the messages, all of a sudden the messages disappear from pending and the consumer is unable to continue processing them.
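For anyone trying to observe the same symptom outside of the exporter metrics, a minimal sketch that polls the consumer info backing the nats_consumer_num_pending metric, using the classic JetStreamContext API that nats.go v1.28.0 ships; the stream and consumer names are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Placeholder stream and consumer names.
		info, err := js.ConsumerInfo("EVENTS", "events-worker")
		if err != nil {
			log.Printf("consumer info error: %v", err)
		} else {
			log.Printf("num_pending=%d num_ack_pending=%d num_redelivered=%d",
				info.NumPending, info.NumAckPending, info.NumRedelivered)
		}
		time.Sleep(10 * time.Second)
	}
}
```

In the failure described here, num_pending reportedly drops while the messages are still in the stream, so comparing it against the stream's message count over time is one way to catch the transition.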
The unprocessed messages still take up space on the stream and will not be processed unless we delete and recreate the affected consumer. When we recreate it, the consumer works normally for 2-30 minutes, after which the problem usually happens again. We can recreate the consumer multiple times and process all messages this way. While the messages are "stuck", new messages for the same consumer end up in the pending queue and get processed normally.
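A rough sketch of that delete-and-recreate workaround, again with the JetStreamContext API and reusing the js handle from the previous sketch; the names and the ConsumerConfig fields are placeholders and would need to match whatever the affected durable actually uses:

```go
// Workaround sketch: drop the stuck durable and recreate it with the same config.
// Stream/consumer names and config values below are placeholders.
if err := js.DeleteConsumer("EVENTS", "events-worker"); err != nil {
	log.Printf("delete consumer: %v", err)
}

if _, err := js.AddConsumer("EVENTS", &nats.ConsumerConfig{
	Durable:       "events-worker",
	AckPolicy:     nats.AckExplicitPolicy,
	DeliverPolicy: nats.DeliverAllPolicy,
	MaxAckPending: 1000,
}); err != nil {
	log.Printf("recreate consumer: %v", err)
}
```

With DeliverAllPolicy in this placeholder config, the new durable starts from the earliest available message, which is presumably how the previously stuck messages get picked up and processed again.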
The first time this happened, our stream had only one replica, and that NATS node was restarted (or it crashed); the problems started appearing shortly after that restart. Restarting the NATS node on which the stream lives fixes the issue temporarily (but loses the "stuck" messages). Scaling the stream up to 3 replicas also fixes the issue temporarily, but likewise loses the stuck messages.
Yesterday we tried recreating the stream, but the problem occurred again, even though there were no NATS node restarts like the first time the issue happened.
This issue happened in 4 of our production environments, in each one affecting different consumers (although the affected consumers stay consistent within each environment). On 3 of these environments the problem hasn't happened again in the last week (the action taken was scaling up to 3 replicas and restarting the NATS nodes), but on one environment it keeps happening every 1-2 days around the same time (when this consumer has a big influx of messages).
Note that while one consumer gets stuck with most of the messages, other consumers sometimes get stuck together with it, but with far fewer messages.
Edit: it's also important to note that this is happening only on a single stream, while all other streams are operating normally.
Some details about the stream and affected consumer:
Expected behavior
All messages intended for a consumer are delivered to it until the stream is empty; there should be no need to recreate the consumer in order to process messages.
Server and client version
server: nats jetstream 2.9.22
client: github.com/nats-io/nats.go v1.28.0
Host environment
Both the client and server run in kOps-managed Kubernetes, on nodes with:
Steps to reproduce
The first time it happened was right after a NATS node restart while the stream had only 1 replica. We tried reproducing it in another environment by killing the node on which a 1-replica stream lives, but didn't manage to trigger the issue.