New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order of delivery #187

Closed
jaquadro opened this Issue Oct 28, 2016 · 7 comments

Comments

Projects
None yet
2 participants
@jaquadro
Copy link

jaquadro commented Oct 28, 2016

Does nats-streaming-server make any guarantees about the order of messages delivered to a subscriber? E.g. at a base level, if a single publisher writes messages A, B, C to the same topic, will a single subscriber on a durable subscription receive the messages in that order?

Under normal operation I've always seen messages delivered in sequence number order, but I've run into a case where if I restart the server and the durable subscription with un-acked messages, and publish a storm of new messages right away, then the subscriber will receive a few of those new messages, than all of the outstanding un-acked messages in the file store, and then resume with the new messages. The file store itself ends up with all messages correctly ordered.

It's a problem for my use case, but don't know if it's within streaming-server's design parameters. I'm using the nodejs client.

@kozlovic

This comment has been minimized.

Copy link
Member

kozlovic commented Oct 28, 2016

With the redelivery feature, order can't be guaranteed, since by definition server will resend messages that have not been acknowledged after a period of time. Suppose your consumer receives messages 1, 2 and 3, does not acknowledge 2. Then message 4 is produced, server sends this message to the consumer. The redelivery timer then kicks in and server will resend message 2. The consumer would see messages: 1, 2, 3, 4, 2, 5, etc...

In conclusion, the server does not offer this guarantee although it tries to redeliver messages first thing on startup. That being said, if the durable is stalled (number of outstanding messages >= MaxInflight), then the redelivery will also be stalled, and new messages will be allowed to be sent. When the consumer resumes acking messages, then it may receive redelivered and new messages interleaved (new messages will be in order though).

@jaquadro

This comment has been minimized.

Copy link
Author

jaquadro commented Oct 28, 2016

Thanks for the fast reply.

Could the behavior in the first paragraph be worked around by setting the redelivery timeout to something really big, effectively disabling it? It seems separate from the issue of restart redelivery not winning the race to get its messages out first.

@kozlovic

This comment has been minimized.

Copy link
Member

kozlovic commented Oct 28, 2016

I don't think you can disable redelivery entirely. You could set AckWait() to a very high value or not use manual ack mode at all. Even then though, there is a possibility that even the auto-acks are missed by a stopping server and so it would resend those messages.

Redelivery is a feature of the NATS Streaming server. It sounds like you are asking for a server without this feature. Not sure there are plans for that at the moment.

@jaquadro

This comment has been minimized.

Copy link
Author

jaquadro commented Oct 28, 2016

Maybe. It's something I can deal with to a point. While I've found NATS to be a close match to what I'm looking for so far, I'm really trying to emulate the behavior of AWS Kinesis where I am guaranteed to read records in sequence-number-order, as determined by the queue when a message was published to it. But this has to be entirely local to the same machine so Kinesis isn't on the table.

If a spurious message were redelivered out of order, then I would discard it because I can keep track of sequence numbers while I'm connected. It just falls apart on a reconnect because I could get a newly published record delivered before an enormous backlog of older data is processed, and all that data would now be behind my latest sequence number.

@kozlovic

This comment has been minimized.

Copy link
Member

kozlovic commented Nov 2, 2016

Here's what was the issue and first a bit of background:

The server stops sending messages (new or redelivered) when the number of messages sent to a subscriber reaches the MaxInflight value. When the server receives ACKs, that number reduces and so it can send more messages.

I realized that this would prevent redelivery all together if the subscriber had stalled (that is, messages sent >= MaxInflight). For durables, it would even mean that if the application is stalled and stopped and then restarted, it would not receive any of the undelivered messages, which will block this durable for ever since the number of unack'ed would never go down.

So I decided to force redelivery. For durables, on restart (server or client), all unacknowledged messages are redelivered. For non-durables, on server restart, a redelivery is attempted before new messages can be sent. Again, however, due to the MaxInflight, it is possible that the redelivery fails (that is, no message is sent because of the MaxInflight value and number of outstanding acks). The redelivery will be attempted 3 times and then forced again (messages will be sent). But after the failed first attempt redelivery, the subscriber is marked as being able to receive new messages. This is why you probably see this mix of old/new messages.

I need to think if I should always force redelivery and not care about MaxInFlight. The problem with that is that a subscriber is already at a point where it is not acking messages and if the server keep resending (potentially thousands of) messages while in that state, that may not be too good. Again, need to think more about it.

@jaquadro

This comment has been minimized.

Copy link
Author

jaquadro commented Nov 2, 2016

This is how things are currently working in my testing:

  • A durable subscriber has stalled for a period of time (because an AWS endpoint is out for several hours, for example).
  • The subscriber crashes for some reason (Node out of memory, PC restart, whatever). [Maybe important detail: the NATS server is a child process and is necessarily restarted during this process]
  • When the application restarts and reconnects the durable subscriber, a batch of up to MaxInFlight messages are redelivered.
  • As those messages are processed and ack'd, more backlogged messages are delivered.

I'm not sure if that is consistent with your description or not, but it matches what I would expect to happen, and what I would want to happen. So that's good.

The mixed messages problem can be worked around by avoiding publishing new messages until I know the backlog has been sent out (not necessarily ack'd), but it does leave those new messages vulnerable as they are being buffered in a non-persistent space during that time. Maybe it would be sufficient to receive any backlogged messages before publishing again? I don't know what potential mixing period is.

@kozlovic

This comment has been minimized.

Copy link
Member

kozlovic commented Nov 7, 2016

Sorry for the delay. If we were to change the redelivery to be forced since the get go, then that would reduce the risk of having mixed messages. On server restart or durable restart, if there are un-acknowledged messages, there would be sent to the client, then the subscriber would be unlocked for receiving new messages.

There would still be a possibility that if the client misses some of the redelivered messages (remember, there is no direct TCP connection between the Streaming server and clients, so this is still a possibility), the server would not know and allow "new" messages to be sent. Say that there were 5 messages that a durable received and did not acknowledge messages 1, 3 and 5. The durable is stopped, 5 more messages are added to the log. The server (or the durable) restarts and messages 1, 3 and 5 are redelivered before messages 6 to 10 can be sent. Say that message 3 did not make it to the client. Server will not know. So the client may see something like 1,5,6,7,8,9,10 and AckWait later message 3.
That's probably unlikely but could happen.

The above is under the condition that we make force redelivery, regardless of MaxInflight. Still need to check the consequences, but there is a good chance that we change that soon.

kozlovic added a commit that referenced this issue Nov 21, 2016

[CHANGED] Redelivery is now always forced
When a consumer was stalled (number of ack pending is >= of MaxInflight)
when a redelivery occurred, the server would be prevented from sending
messages since the number of outstanding acks was already at the limit.
Some code was introduced to force redelivery after a certain number
of failed attempts. This code is now removed and redelivery always
resend messages for which acks have not been received.

This helps with message ordering on server or durable restart.

Resolves #187

@kozlovic kozlovic closed this in #200 Dec 5, 2016

@wafflebot wafflebot bot removed the waffle:needs review label Dec 5, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment