Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit number of redelivery attempts #728

Closed
robertmircea opened this issue Jan 12, 2019 · 24 comments
Closed

Limit number of redelivery attempts #728

robertmircea opened this issue Jan 12, 2019 · 24 comments

Comments

@robertmircea
Copy link

What would be the best strategy to limit number of redeliveries of the same message to subscribers which are failing to ack?

For example: If the messages is not acked several times while attempting delivery to client, I want the message to be redelivered for up to x times and then dropped.

Derived question: Is it possible to keep a counter on server of how many times a message has been redelivered to client?

@kozlovic
Copy link
Member

What would be the best strategy to limit number of redeliveries of the same message to subscribers which are failing to ack?

ACK it.. this is not being sarcastic ;-). Really, if your application keeps getting a redelivered message and is unable to acknowledge it because say processing of it is failing, and the application really decides that it does not want to have to deal with it, simply ACK it. The server will stop the redelivery.

Messages are not removed from the message log because they are ack'ed by consumers. Messages are always in the message log since they may have to be delivered to new consumers starting at any sequence in the log: https://github.com/nats-io/nats-streaming-server#message-log

The only advantage of having such configuration in the server would be when you try to limit the redelivery of a message to an application that has exited without calling Close() (therefore server does not know). But with pings from server to client, server will ultimately detect that the client is not responding and will suspend redelivery. The default heartbeat interval is 30seconds. If that's too high, it can be lowered through configuration.

@robertmircea
Copy link
Author

I understand what you are saying, but this means that the app needs to be aware and keep track of the messages for a period of time in order to detect repeated delivery attempts (even the failed ones) up to a maximum. My first thought was to keep the message id in Redis for a while for idempotency checks. This is not very elegant and it scales up to a point. A better mechanism would be in my opinion to keep track on the server of number of redeliveries and either server or the client to have a mechanism to decide if more redeliveries are allowed. This would imply to have an explicit Nack command or to allow the client to instruct the server to deliver based on a specific policy. Probably, what I need here is to have the properties of a queuing system in nats server.

@dansouza
Copy link

dansouza commented Jan 31, 2019

@robertmircea if you're getting the same message redelivered over and over, chances are, a single subscriber will see it multiple times - so you don't need to create a counter on Redis for every single message you receive, only the ones that you're seeing twice (meaning they're getting redelivered in a loop).

Keep an in-memory counter of the messages you've seen with a cuckoo filter and if you see a message twice, then you start counting its redeliveries on Redis. This should scale a lot better since it won't hit Redis everytime, only when you suspect that a message is getting redelivered.

Once you hit enough redelivery attempts, you ACK the message (to remove it from the server) and delete it from the cuckoo filter and from Redis.

It won't perfectly track the number of redeliveries if you have many subscribers (since each subscriber will have its own counter that needs to see a message twice), but it should be good enough and scale well.

@vtolstov
Copy link

vtolstov commented Feb 4, 2019

if i have invalid message, for example payload of it created with buggy service, consumers can't success ack this message.
what is the preferred way?

@kozlovic
Copy link
Member

kozlovic commented Feb 4, 2019

Again, from a consumer perspective, if you want to stop the server to redeliver this message, simply ack it. When the message callback is invoked, and say your application cannot process it, you still can call msg.Ack() on it. What you can't do, is remove a "random" message from a message log.

@vtolstov
Copy link

vtolstov commented Feb 4, 2019

my question more from devops perspective, for example i see in prometheus that some service have big error counter in subscribe handler. what can i do? i can't rewrite service now. so what workaround to temporary fix this issue , why service rewritten and deployed?

@robertmircea
Copy link
Author

Again, from a consumer perspective, if you want to stop the server to redeliver this message, simply ack it. When the message callback is invoked, and say your application cannot process it, you still can call msg.Ack() on it. What you can't do, is remove a "random" message from a message log.

I understand this, but this strategy means that on first failed message processing I should remove (ack) the message no matter if the error is transient or permanent. Usually, if it is a transient error this means that I should retry processing for a number of times before giving up. NATS server could help in this situation by keeping the number of retries instead of a boolean flag like redelivery. It is very easy to deduct if it is a redelivery by inspecting the count of retries. If it is greater than zero, it means that it was a redelivery. In my case, if I would like to retry for up to n times, maybe would be more efficient to have an explicit nack protocol command without waiting for timeout. For example, if the handler throws an exception would be more effective to fail fast. In case that redelivery counter would be n, I would explicitly ack it from the client in order to be removed from server.

@vtolstov
Copy link

vtolstov commented Feb 4, 2019

Yes,does it poossible to count delivery attemps per durable group?

@vtolstov
Copy link

vtolstov commented Feb 4, 2019

I can check attemps and do some client side action, for example serialize event to store and ack message

@sujeshthekkepatt
Copy link

Same need. Any updates on this?

@kozlovic
Copy link
Member

kozlovic commented Nov 8, 2019

@sujeshthekkepatt I commented on the other issue related to this: #789

@bmcustodio
Copy link

bmcustodio commented Dec 10, 2019

NATS server could help in this situation by keeping the number of retries instead of a boolean flag like redelivery.

@kozlovic while I understand that limiting the number of redelivery attempts or implementing any kind of dead-letter queue is out of scope, would it be feasible for NATS Streaming to provide a counter of redeliveries as suggested here (possibly as a new field rather than replacing the existing Redelivery one, for compatibility with existing consumers)?

@sujeshthekkepatt
Copy link

@bmcstdio that would be great. I have done a work around using the same concept where we manually add some metadata like number of retries etc and use that fields for pushing to multi stage poison queues and later to a dead letter queue. But as you said having this built in would be perfect. I think they are working on a new revised version of Stan.

@kozlovic
Copy link
Member

@bmcstdio Possibly. The caveat is that this number would likely be "valid" only during the runtime of a server. Meaning that I don't think that it would be feasible to persist the delivery count (but maybe?). Also, in clustering mode, when leader changes, it may not have the redelivery count (again, if that info is not stored/replicated). In this worst case scenario (valid only at runtime/leader election), would that still be valuable?

@bmcustodio
Copy link

@kozlovic I think it would still be valuable, yes, but of course the absolute best would be to persist that information. I am not familiar at all with the NATS/NATS Streaming code base, but I'd be willing to contribute to this somehow (either implementing, reviewing or testing).

@kell18
Copy link

kell18 commented Dec 16, 2019

Redeliveries counter would be nice to have for us as well! Even if only during the runtime (not persistent). Thanks!

kozlovic added a commit to nats-io/stan.go that referenced this issue Dec 17, 2019
This is the first step to support server setting a RedeliveryCount
in PubMsg when those are redelivered.
See nats-io/nats-streaming-server#728

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@kozlovic
Copy link
Member

@bmcstdio @sujeshthekkepatt @kell18 Ok, I have decided to add RedeliveryCount (only runtime at the moment, and won't survive a leader election if leader is different node). Starting with the client's repo that holds the PubMsg protobuf: nats-io/stan.go#295.

@kozlovic
Copy link
Member

Server PR will be submitted shortly after we agree on the field name/type in the client repo. Thanks!

@kozlovic
Copy link
Member

@bmcstdio @sujeshthekkepatt @kell18 Question for you guys.. the redelivery count is per subscription, but in the case of a queue group, would you expect the redelivery count to be for the group, or individual member?

@bmcustodio
Copy link

@kozlovic I'd expect it to be for the group.

@vtolstov
Copy link

vtolstov commented Dec 20, 2019 via email

@robertmircea
Copy link
Author

For group for sure!

@kozlovic
Copy link
Member

Ok, I hear you, but in case you did not notice, if you use clustering, as of now, a message is already redelivered to the same member (subject to change in the future, but as of now that is the case), so that will not make much of a difference :-)

@kozlovic
Copy link
Member

Guys... closing this one in favor of #996

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants