Consumer Ack/Nak Backoffs #2812

Merged: 5 commits merged on Jan 24, 2022

Conversation

@derekcollison (Member):

This introduces both NAK delays and a BackOff designation for consumers.

The NAK proto now accepts either a time duration string, e.g. -NAK 2s, or a formal JSON argument, e.g. -NAK {"delay": 2000000000}. This delays redelivery by that amount of time.

We also now accept a BackOff directive when setting up a consumer. It is an array of time durations representing the delay to apply based on the delivery count. The first delivery is of course immediate, but once a message has been delivered and we decide to deliver it again, we use the BackOff values. MaxDeliver is len(BackOff)+1.

Signed-off-by: Derek Collison derek@nats.io

/cc @nats-io/core
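
For illustration, a rough sketch in Go of both pieces together. The durable name, delays, and helper functions are placeholders rather than anything from this PR; the ConsumerConfig fields are the ones shown in the diff below, and the NAK payload is the JSON form described above.

```go
package example

import (
	"time"

	"github.com/nats-io/nats-server/v2/server"
	"github.com/nats-io/nats.go"
)

// backoffConsumerConfig sketches a consumer that uses the new BackOff field.
// The delays are looked up by delivery count, and MaxDeliver must exceed
// len(BackOff).
func backoffConsumerConfig() server.ConsumerConfig {
	return server.ConsumerConfig{
		Durable:    "worker",
		AckPolicy:  server.AckExplicit,
		MaxDeliver: 5,
		BackOff: []time.Duration{
			250 * time.Millisecond,
			time.Second,
			5 * time.Second,
			30 * time.Second,
		},
	}
}

// delayRedelivery replies to a delivered JetStream message with the JSON NAK
// form, asking the server to wait 2s before redelivering.
func delayRedelivery(msg *nats.Msg) error {
	return msg.Respond([]byte(`-NAK {"delay": 2000000000}`))
}
```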

AckPolicy AckPolicy `json:"ack_policy"`
AckWait time.Duration `json:"ack_wait,omitempty"`
MaxDeliver int `json:"max_deliver,omitempty"`
BackOff []time.Duration `json:"backoff,omitempty"`
@ripienaar (Contributor), Jan 24, 2022:

So BackOff is the list of times the deliveries will take, and it implies the maximum number of delivery attempts? This seems like asking a lot, and it makes the feature inaccessible to those who want uncapped retries.

The server could instead be given a low mark and a high mark for the backoffs; it can then calculate the range of backoffs (essentially this list) at consumer creation.

Once the top mark is reached, that's the time between tries (with some jitter, of course).

So essentially:

{
  "backoff": {
    "low": 1000000000,
    "high" 120000000000
  }
}

The server then calculates a range between 1 second and 120 seconds, and that becomes your backoff list.

If no MaxDeliver is given, every try once 120s is reached is +/- 120s apart (jitter needed); otherwise it is subject to MaxDeliver as today.

I don't think AckWait and the low mark for backoffs are strictly related, tbh; why tie those two together?
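
For what it's worth, a minimal sketch of what that server-side expansion could look like (assuming the time package); the low/high fields and the step count of 10 are part of the proposal and this sketch's assumptions, not anything in this PR.

```go
// backoffSteps expands a proposed low/high pair into an evenly spaced list
// of delays, which could then be used exactly like an explicit BackOff array.
func backoffSteps(low, high time.Duration, steps int) []time.Duration {
	out := make([]time.Duration, steps)
	inc := (high - low) / time.Duration(steps-1)
	for i := range out {
		out[i] = low + time.Duration(i)*inc // linear ramp from low up to high
	}
	return out
}
```

For example, backoffSteps(time.Second, 120*time.Second, 10) ramps from 1s up to 120s; a geometric ramp or added jitter would be variations on the same idea.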

@ripienaar (Contributor), Jan 24, 2022:

Something like this: https://go.dev/play/p/U1HOrPpTX0L. It's not exponential backoff, but it works well in my experience. In that example I give it the number of steps to create, but they could be dynamically determined or user supplied, I suppose. My jitter is quite big, but adjust to taste.

The point is, I don't think we should ask every language / every user to figure out a retry strategy themselves.
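
A minimal sketch of that shape (not the code behind the play link, just an approximation of the idea, assuming math/rand and time; the step count, cap, and jitter window are made up):

```go
// backoffFor walks a fixed number of steps up toward maxDelay, then keeps
// reusing the top step for every further attempt, with random jitter added
// each time so retries never stay in lockstep.
func backoffFor(attempt, steps int, maxDelay, jitter time.Duration) time.Duration {
	if attempt >= steps {
		attempt = steps - 1 // settle at the top mark for unlimited retries
	}
	base := maxDelay * time.Duration(attempt+1) / time.Duration(steps)
	return base + time.Duration(rand.Int63n(int64(jitter)))
}
```

With 10 steps, a cap around a minute, and a generous jitter window, this produces output much like the attempt list further down: it ramps up and then hovers around the cap for as many retries as the user wants.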

@derekcollison (Member, Author):

I thought through using multipliers and doing math based on delivery count, but decided against it, tbh. Also, if you have non-linear backoffs you probably do not want them unlimited. And this gives users full control of the backoff at each delivery count, so they can use any polynomial they want. ;)

Also, if we do want a high-water mark, meaning we keep delivering at the end but reuse the last value, we could figure that out. I don't think low/high is a good approach; it will just cause folks to ask for more customization.
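
For example (purely illustrative, assuming the time package), a client that wants a quadratic curve can just precompute it into the array it sends the server:

```go
// quadraticBackoff precomputes an n-step quadratic curve (1s, 4s, 9s, ...)
// as the BackOff array; the server only ever indexes it by delivery count.
func quadraticBackoff(n int) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = time.Duration((i+1)*(i+1)) * time.Second
	}
	return out
}
```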

@derekcollison (Member, Author):

Even in your example it is finite and has 10 steps. I also think there will eventually be a point where we start formalizing DLQs. For now, we could possibly TERM within the client and move the message to another stream.

Contributor:

Yes, 10 steps, but I don't think that should imply only 10 retries (see the play link); it settles into a jittery, near-maximum retry interval for however many retries the user wants.

Perhaps, as a default, MaxTries == the number of listed durations, but forcing it seems arbitrary.

Below is the output when given only 10 steps:

Attempt 0: 5.69777941s
Attempt 1: 12.982153551s
Attempt 2: 20.016145821s
Attempt 3: 18.035010051s
Attempt 4: 35.537113937s
Attempt 5: 52.24916732s
Attempt 6: 22.482969758s
Attempt 7: 24.131776148s
Attempt 8: 55.133117216s
Attempt 9: 1m0.980279449s
Attempt 10: 1m14.260398084s
Attempt 11: 37.763669287s
Attempt 12: 55.384491574s
Attempt 13: 50.914458836s
Attempt 14: 46.711445515s
Attempt 15: 1m23.338182873s
Attempt 16: 1m22.972644968s
Attempt 17: 1m10.410584091s
Attempt 18: 1m19.03911079s
Attempt 19: 1m18.353353331s
Attempt 20: 32.824778273s
Attempt 21: 53.638149956s
Attempt 22: 1m11.193774911s
Attempt 23: 49.683515637s
Attempt 24: 39.321866378s
Attempt 25: 1m21.842231109s
Attempt 26: 34.297652072s
Attempt 27: 1m0.69383165s
Attempt 28: 1m15.929961588s
Attempt 29: 50.388298971s
Attempt 30: 1m10.658288344s
Attempt 31: 30.804784443s
Attempt 32: 1m15.097786623s
Attempt 33: 1m24.087988459s
Attempt 34: 1m28.075641803s
Attempt 35: 30.553361739s
Attempt 36: 56.650761883s
Attempt 37: 44.629038907s
Attempt 38: 1m11.308658932s
Attempt 39: 33.97338778s
Attempt 40: 48.685383115s
Attempt 41: 1m22.333742716s
Attempt 42: 42.21505002s
Attempt 43: 34.609623071s
Attempt 44: 33.811134652s
Attempt 45: 1m22.620281907s
Attempt 46: 57.160772719s
Attempt 47: 1m3.587659062s
Attempt 48: 1m26.81566311s
Attempt 49: 30.942961397s
Attempt 50: 49.039813105s
Attempt 51: 52.091569721s
Attempt 52: 38.639585917s
Attempt 53: 1m9.704764712s
Attempt 54: 29.550264449s
Attempt 55: 59.803989579s
Attempt 56: 57.409156333s
Attempt 57: 1m11.929879825s
Attempt 58: 1m15.684312746s
Attempt 59: 57.206343734s
Attempt 60: 45.976771718s
Attempt 61: 38.658377901s
Attempt 62: 1m13.884065704s
Attempt 63: 51.61208761s
Attempt 64: 56.764975024s
Attempt 65: 50.625624717s
Attempt 66: 51.159802269s
Attempt 67: 47.702729479s
Attempt 68: 1m13.782286269s
Attempt 69: 41.48333175s
Attempt 70: 54.903119657s
Attempt 71: 1m10.329895923s
Attempt 72: 1m19.121642512s
Attempt 73: 36.362113702s
Attempt 74: 52.741334655s
Attempt 75: 56.982907257s
Attempt 76: 56.428831643s
Attempt 77: 36.491797301s
Attempt 78: 1m26.564819019s
Attempt 79: 1m26.478233018s
Attempt 80: 1m26.295498694s
Attempt 81: 41.098140041s
Attempt 82: 40.888862788s
Attempt 83: 32.421919021s
Attempt 84: 1m15.453062911s
Attempt 85: 48.72145828s
Attempt 86: 1m13.418424343s
Attempt 87: 30.137008055s
Attempt 88: 1m25.339692041s
Attempt 89: 1m0.872812453s
Attempt 90: 1m21.430018631s
Attempt 91: 53.752858217s
Attempt 92: 1m0.402443631s
Attempt 93: 40.412138218s
Attempt 94: 37.172546632s
Attempt 95: 59.017317514s
Attempt 96: 42.315074904s
Attempt 97: 1m27.86467918s
Attempt 98: 1m11.5044885s
Attempt 99: 1m26.218450071s

Contributor:

IMO, the NATS CLI and client libraries should provide wrappers for a few common back-off patterns such as fixed-interval, doubling, etc. Let the wrappers calculate and fill the array. Devs could still fill it in directly if they have a special need (but we would not necessarily expose this in the NATS CLI, as complexity is already high there).
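
A sketch of what such wrappers could look like (assuming the time package); these helper names are hypothetical, not an existing client API:

```go
// FixedBackoff fills the BackOff array with the same interval n times.
func FixedBackoff(d time.Duration, n int) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = d
	}
	return out
}

// DoublingBackoff fills the BackOff array with a doubling sequence,
// e.g. 1s, 2s, 4s, 8s, ...
func DoublingBackoff(start time.Duration, n int) []time.Duration {
	out := make([]time.Duration, n)
	d := start
	for i := range out {
		out[i] = d
		d *= 2
	}
	return out
}
```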

@ripienaar (Contributor), Jan 24, 2022:

Can we then start with this:

{
  "backoff": {
    "sequence": [1,2,4,8]
  }
}

This leaves the discussion open: if we later wish to support a min/max scheme, those are just new keys in the backoff structure, so this is fine and deferrable. Pick a better name than sequence, though. :)

I do find the logic of pinning AckWait and MaxDeliver to these numbers weird, though; we are not currently pinned in that way, so I'm not sure why the addition is needed.

Contributor:

> 2. Set MaxDeliver to -1 or a value larger than the BackOff array and do same behavior, keep using last value in BackOff.

I like number 2 here
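
A sketch of what option 2's lookup could be (assuming the time package); this is not the server's actual implementation, just the shape of it:

```go
// redeliveryDelay picks the delay before the next redelivery: index the
// BackOff array by how many deliveries have already happened, and once the
// array runs out keep reusing its last value, so MaxDeliver can be -1
// (unlimited) or larger than len(backoff).
func redeliveryDelay(backoff []time.Duration, deliveryCount int) time.Duration {
	if len(backoff) == 0 {
		return 0
	}
	idx := deliveryCount - 1 // first redelivery uses backoff[0]
	if idx < 0 {
		idx = 0
	}
	if idx >= len(backoff) {
		idx = len(backoff) - 1
	}
	return backoff[idx]
}
```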

Member:

> Can we then start with this:
>
> {
>   "backoff": {
>     "sequence": [1,2,4,8]
>   }
> }
>
> This leaves the discussion open: if we later wish to support a min/max scheme, those are just new keys in the backoff structure, so this is fine and deferrable. Pick a better name than sequence, though. :)
>
> I do find the logic of pinning AckWait and MaxDeliver to these numbers weird, though; we are not currently pinned in that way, so I'm not sure why the addition is needed.

As Todd described, this needs to be exposed simply as a strategy name. Actual values are extra complexity that won't be provided properly.

Contributor:

That's not what Todd said at all :)

I'd prefer that approach, though.

@derekcollison (Member, Author):

@aricart We can make this simpler in the client, but the server needs to support anything someone could come up with, and there are a ton of options. I went through all of these on Friday and did not feel they would serve us well long term; we would simply get pestered to add new ways of calculating backoffs, etc. This allows a user to do pretty much whatever they want. Clients can offer simple versions that calculate these values as needed.

server/consumer.go (outdated, resolved)
@@ -324,6 +334,11 @@ func (mset *stream) addConsumerWithAssignment(config *ConsumerConfig, oname stri
// Make sure we have sane defaults.
setConsumerConfigDefaults(config)

// Check if we have a BackOff defined that MaxDeliver is within range etc.
if lbo := len(config.BackOff); lbo > 0 && config.MaxDeliver <= lbo {
return nil, errors.New("max deliver required to be > length backoff values")
Member:

Let's add the new error with the proper process.

@@ -3129,6 +3129,14 @@ func (s *Server) jsConsumerCreate(sub *subscription, c *client, a *Account, subj
// Make sure we have sane defaults.
setConsumerConfigDefaults(&req.Config)

// Check if we have a BackOff defined that MaxDeliver is within range etc.
if lbo := len(req.Config.BackOff); lbo > 0 && req.Config.MaxDeliver <= lbo {
err := errors.New("max deliver required to be > length backoff values")
Member:

Remove this, and then have resp.Error = NewJS... that you will get from adding the new error.

@derekcollison (Member, Author):

The error is using the new format and specifies that the consumer could not be created. We could create one specifically for this config error condition, but we do allow errors to have modified descriptions, etc.

@derekcollison (Member, Author):

Will change to plural.

@kozlovic (Member) left a comment:

LGTM

… amount of time.

Signed-off-by: Derek Collison <derek@nats.io>
This allows a consumer to have exponential backoffs vs static AckWait and MaxDeliver.
When BackOff is set it will override AckWait to BackOff[0] and MaxDeliver will be len(BackOff)+1.

Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
… be > len(BackOff) but if larger we will reuse last value in BackOff array.

Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
@ripienaar (Contributor) commented, Jan 25, 2022:

I want to again make the case for jitter here, which I don't think I made strongly earlier, having mainly mentioned that it's needed.

In backoff systems, jitter is essential to avoid ongoing, repeated run-on-the-bank scenarios.

Imagine you have a stream of messages that results in work on a remote system out of your control. You would set a reasonable Max Ack Outstanding, but ultimately you don't control the behavior of the remote (it's a remote SaaS); those systems often adjust what they allow you in order to protect themselves.

BackOff is [30s, 1m, 2m]. You try to deliver the 1000 messages; the remote accepts 100 and starts rejecting you. At 30s all 900 remaining messages are attempted again at about the same time, and you now have 800 left to retry. After 90s you have 700, etc.; that's almost 10 minutes to deliver 1000 messages, each round crushing the remote because all messages are in sync.

Now consider [30s, 1m, 2m] with a jitter window of 20s: at 30s the retries are spread out to roughly 45/sec, well within the remote's capabilities. Total delivery time is about 50 seconds vs 10 minutes. Whatever remains after one retry is further spread out on the second try by jitter, and your peak becomes unrecognisable background noise in no time.

This is the purpose of backoffs; I can give you many, many real-world examples where this caused problems.

I ran voting for a large TV show, like "$country Got Talent" (but not that one). Due to regulations, voters had to pre-register, and when voting started we needed to send them all a message ("1 for Bob", etc.); users replied with their vote.

On that night we had effectively no messages going out: the Vodafone SMSC was rate limiting us based on general network conditions and we essentially had to do flow control. Someone fat-fingered the config and didn't set jitter; it was left at the default of 0. We sent essentially no messages.

Hugely embarrassing, and the company running the votes had to pay a large 7-figure sum in compensation. When jitter is correctly set we can sustain hundreds of thousands of messages a minute essentially indefinitely.

I ran into this exact scenario doing SMS services for 3 Olympic Games and many TV ad campaigns, etc.

With the feature as designed here, it's impossible to build jitter into message delivery, and that's bad news.
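
To make the arithmetic above concrete, a toy model (not server code, assuming math/rand and time; the numbers mirror the scenario): 900 retries that would all become due at the 30s mark get spread across a 20s jitter window, i.e. roughly 45 arrivals per second instead of 900 in one instant.

```go
// jitteredRetryTimes spreads n synchronized retries across a jitter window
// after the base delay. With n=900, base=30s, jitter=20s the remote sees
// about 900/20 = 45 retries per second rather than a single spike.
func jitteredRetryTimes(n int, base, jitter time.Duration) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = base + time.Duration(rand.Int63n(int64(jitter)))
	}
	return out
}
```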

@@ -297,6 +303,10 @@ func setConsumerConfigDefaults(config *ConsumerConfig) {
if config.MaxDeliver == 0 {
config.MaxDeliver = -1
}
// If BackOff was specified that will override the AckWait and the MaxDeliver.
if len(config.BackOff) > 0 {
config.AckWait = config.BackOff[0]


Does this mean that consumers must be aware that they have to acknowledge before the backoff duration at index 0? Anything slower than that means the message will already have been redelivered for the 2nd attempt.

If I'm not wrong, the NAK delay is applied when it reaches the server again, adding to the time already spent in the application code. For me it is easier to think of backoff as an addition to AckWait, with redelivery counting from that moment plus the backoff.

Of course, the application side can choose another backoff to follow that reasoning, but when I read this, it surprised me.
