Queue group subscribe/close race-condition creates multiple shadow subscriptions and possible undelivered messages #950
Comments
Thank you for the detailed report. Will have a look asap. Just curious, did you experience the race, or is this just from code inspection?
That probably explains #322, which at the time I could not find an explanation for. I have just written a test and can confirm the race. Working on a fix right now. Thank you again for this great report!
That's great, thanks! We encountered the delayed message delivery, and the message was delivered after we restarted NATS. When I investigated, I noticed that the queue group had multiple shadow subscriptions. Our use case is a bit unusual in that we subscribe, wait to receive a message, and then close the subscription. We do this to control the rate of incoming messages. When we called 'close', our code didn't wait for the operation to complete, so when we subscribed again, there were cases where the 'close' and 'subscribe' would run in parallel. Basically, our code was written in a way that exercised the race often. I've updated our code to wait for the 'close' operation to complete, but it could still be a problem if we scale up to multiple subscribers. I really appreciate your efforts. Thanks again!
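For context, a minimal sketch of that consume-one-then-close pattern, assuming the stan.go client; the cluster ID, client ID, subject, queue group, and durable name are made up for illustration. The key point is that the next subscribe only happens after `Close()` has returned:

```go
package main

import (
	"log"

	stan "github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect("test-cluster", "rate-limited-consumer")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	for {
		msgCh := make(chan *stan.Msg, 1)
		sub, err := sc.QueueSubscribe("work", "workers",
			func(m *stan.Msg) { msgCh <- m },
			stan.DurableName("workers-dur"),
			stan.MaxInflight(1),
			stan.SetManualAckMode())
		if err != nil {
			log.Fatal(err)
		}

		m := <-msgCh // wait for exactly one message
		// ... process m ...
		if err := m.Ack(); err != nil {
			log.Printf("ack failed: %v", err)
		}

		// Waiting for Close() to return before re-subscribing keeps the
		// close and the next subscribe from racing each other.
		if err := sub.Close(); err != nil {
			log.Printf("close failed: %v", err)
		}
	}
}
```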
A race between a durable queue subscriber close and create could cause the server to store multiple shadow subscriptions.
Resolves #950
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
I have a branch up with a possible fix (https://github.com/nats-io/nats-streaming-server/tree/fix_950). If there is any chance you can test it, that would be great!
I can confirm that your fix resolves the issue. Thanks for the super-fast turnaround!
I believe there is a race-condition when a lone durable queue subscriber closes its subscription at the same time that a client subscribes to the same queue. The closing thread sees no other subscribers, and converts the active subscription into a shadow subscription. The subscribing thread does not see the shadow subscription, and so creates a new active one. The result is that the queue group has an active subscription as well as a shadow subscription. If the active subscription is closed, we get two shadow subscriptions. If the race is repeated, we can get more than two shadow subscriptions for the same queue group.
If the subscription was closed while a message was being delivered, the message is unacknowledged and in a pending state bound to the shadow subscription. Since the active and shadow subscriptions are unaware of each other, the message is not adopted by the active subscription and remains undelivered until the shadow subscription is adopted in a subsequent 'subscribe' operation. Depending on the pattern of activity and number of shadow subscriptions, it is possible that the shadow subscription will be orphaned forever, and the message is never successfully delivered.
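To make the interleaving concrete, here is a deliberately simplified, self-contained model of the suspected race. It does not use the server's real types or locking; it only shows how a "last member?" decision taken before the shadow conversion can go stale while a subscribe slips in between:

```go
package main

import (
	"fmt"
	"sync"
)

// queueGroup is a toy stand-in for a durable queue group's state.
type queueGroup struct {
	mu     sync.Mutex
	active int  // number of active members
	shadow bool // a shadow (durable) subscription exists
}

// closeMember models the close path. The "is this the last member?" decision
// and the conversion to a shadow subscription are separated by a window in
// which a concurrent subscribe can run (the between hook).
func (q *queueGroup) closeMember(between func()) {
	q.mu.Lock()
	last := q.active == 1
	q.mu.Unlock()

	between() // <-- race window

	q.mu.Lock()
	q.active--
	if last {
		q.shadow = true // converts to shadow, unaware of the new member
	}
	q.mu.Unlock()
}

// subscribe models the subscribe path: adopt an existing shadow subscription,
// otherwise become a new active member.
func (q *queueGroup) subscribe() {
	q.mu.Lock()
	if q.shadow {
		q.shadow = false // adopt the shadow
	}
	q.active++
	q.mu.Unlock()
}

func main() {
	q := &queueGroup{active: 1}
	// Force the problematic interleaving: subscribe runs inside close's window.
	q.closeMember(q.subscribe)
	fmt.Printf("active=%d shadow=%v\n", q.active, q.shadow) // active=1 shadow=true
}
```

The result is exactly the state described above: one active member and one shadow subscription for the same queue group.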
There are barrier operations that appear to prevent a client-side 'close subscription' request from being processed while a 'subscribe' request is being processed. However, if the 'close subscription' request gets past the barrier first, we can get the two requests to race. The race can also occur if a client previously crashed, and the heartbeat-check times out and closes the client's subscription while another client subscribes.
The race-condition window in the 'subscribe' operation appears to be between lines 4668 and 4767 of nats-streaming-server/server/server.go at commit 910d6e1.
The race-condition window in the 'close subscription' operation appears to be between lines 1093 and 1108 of nats-streaming-server/server/server.go at commit 910d6e1.
I believe that if the operations are running in both windows at the same time for the same durable queue group subscription, we will get the problem described above.
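For completeness, a sketch of the general mitigation idea against the toy model above; this is not the actual change in the fix_950 branch, only an illustration. If the membership check and the shadow conversion happen in one critical section that the subscribe path also takes, a concurrent subscribe either sees the shadow or is counted as another member:

```go
// Illustrative only: a serialized version of closeMember from the toy model.
func (q *queueGroup) closeMemberSerialized() {
	q.mu.Lock()
	defer q.mu.Unlock()

	q.active--
	if q.active == 0 {
		q.shadow = true // convert only when truly the last member
	}
}
```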
Thanks!