
[#632] - fix a race condition causing messages to be lost with non-zero max_msgs in a cluster #633

Closed
wants to merge 1 commit

Conversation

vkhorozvs
Contributor

Resolves #632
/cc @nats-io/core

@vkhorozvs
Contributor Author

The TestRouteQueueSemantics test fails with this patch.

I've probably broken the L1/L2 route semantics, but I doubt I'd be able to take all use cases into account.
A NATS expert needs to take action here.

@kozlovic
Member

kozlovic commented Mar 7, 2018

The test fails because it intentionally generates incorrect protocols, which should result in no send. Your code, however, treats that as a non-delivered message and tries to send it anyway (and succeeds). The function routeSidQueueSubscriber and/or parseRouteSid should then be modified to distinguish between an invalid protocol and the absence of a client and/or sub.
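
For illustration only, here is a minimal Go sketch of the distinction described above. The function name, the QRSID layout, and the return values are hypothetical, not the actual gnatsd code; the point is that the parser reports "malformed protocol" separately from "no matching client/sub", so a caller can drop the former instead of attempting delivery.

```go
// Hypothetical sketch: real gnatsd signatures and SID format may differ.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseQueueRouteSid parses a SID of the (assumed) form "QRSID:<cid>:<sid>".
// ok reports whether the token was well formed at all; a caller that gets
// ok == false should drop the message rather than attempt delivery.
// Whether the client/sub still exists is a separate lookup done later.
func parseQueueRouteSid(sid string) (cid, subID uint64, ok bool) {
	parts := strings.Split(sid, ":")
	if len(parts) != 3 || parts[0] != "QRSID" {
		return 0, 0, false // invalid protocol, not "subscriber gone"
	}
	cid, err1 := strconv.ParseUint(parts[1], 10, 64)
	subID, err2 := strconv.ParseUint(parts[2], 10, 64)
	if err1 != nil || err2 != nil {
		return 0, 0, false
	}
	return cid, subID, true
}

func main() {
	for _, sid := range []string{"QRSID:7:22", "QRSID:bad", "RSID:7:22"} {
		cid, subID, ok := parseQueueRouteSid(sid)
		fmt.Println(sid, "->", cid, subID, ok)
	}
}
```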

@vkhorozvs
Contributor Author

vkhorozvs commented Mar 7, 2018

@kozlovic I thought I had accounted for a malformed protocol message with that isRouteQsub variable, which is set to the boolean value returned from routeSidQueueSubscriber, but I may be wrong.

Unfortunately, I don't know all the use cases the NATS server needs to handle, and I don't even know whether my patch fits well into the NATS design and coding style. For now we've created a custom build with this fix for our own needs, and we're happy with what we've got so far. It would be great if you or another NATS core developer could take this issue further (if you consider the bug critical enough) - you'd do it better than I would.


Just some context: we have retry logic on top of the NATS server that accounts for lost messages (the world is not ideal; even if NATS behaves well, the network or any application may crash and a message can be lost). Without this fix our full automation suite was taking over 20 hours because of lost messages (and the associated timeouts of around 2 minutes per loss). With the fix the same suite runs in under 3 hours. So for us it's a big improvement for test scenarios, and for a couple of real-life scenarios as well. I believe it might be important for other NATS users too.
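
As an aside, client-side retry logic of the kind mentioned above could look roughly like the following sketch. It is not the commenter's code; it assumes the github.com/nats-io/nats.go client, and the subject name, attempt count, and timeout are arbitrary placeholders.

```go
// Rough sketch of request-side retries over NATS, assuming the nats.go client.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// requestWithRetry resends a request a few times if no reply arrives in time,
// papering over the occasional lost message.
func requestWithRetry(nc *nats.Conn, subj string, data []byte, attempts int, timeout time.Duration) (*nats.Msg, error) {
	var err error
	for i := 0; i < attempts; i++ {
		var msg *nats.Msg
		msg, err = nc.Request(subj, data, timeout)
		if err == nil {
			return msg, nil
		}
		log.Printf("attempt %d on %q failed: %v", i+1, subj, err)
	}
	return nil, err
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	if reply, err := requestWithRetry(nc, "work.queue", []byte("job"), 3, 2*time.Second); err == nil {
		log.Printf("reply: %s", reply.Data)
	} else {
		log.Printf("giving up after retries: %v", err)
	}
}
```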

@kozlovic
Member

kozlovic commented Mar 7, 2018

@vkhorozvs Your contribution is much appreciated. Sorry that my email suggested you had to fix it. I am busy at the moment but just wanted to comment on the reason for the test failure. I will have another look a bit later. I am happy to see that you have a fix that works for you for the moment.
We will discuss this internally and decide whether this is something we want to address. If so, I will update the PR with the required fixes. Thanks again!

@kozlovic
Member

kozlovic commented Mar 9, 2018

Just an update: I have been fixing the code so that the tests with an invalid QRSID still pass, and that now works. However, I also added a test equivalent to what you were doing in the issue you reported, and it uncovers a race condition between the sublist results access in processMsg() and the removal of the subscription (see the conceptual sketch below). I can reproduce the race even from master.

I will still update this PR, but the tests will fail with the race condition. That will be addressed in a separate PR.
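
A conceptual Go sketch of the kind of race described above (stand-in types only, not the gnatsd sublist): a snapshot of matching subscriptions is taken under a lock, the lock is released, and a concurrent unsubscribe removes a subscription before the snapshot is used for delivery.

```go
// Stand-in types; illustrates the stale-snapshot window, not gnatsd itself.
package main

import (
	"fmt"
	"sync"
)

type sub struct{ id int }

type sublist struct {
	mu   sync.Mutex
	subs map[int]*sub
}

// match returns a snapshot of the current matches. The lock is released
// before the caller delivers, so the snapshot can go stale.
func (sl *sublist) match() []*sub {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	out := make([]*sub, 0, len(sl.subs))
	for _, s := range sl.subs {
		out = append(out, s)
	}
	return out
}

func (sl *sublist) remove(id int) {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	delete(sl.subs, id)
}

func (sl *sublist) stillPresent(id int) bool {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	_, ok := sl.subs[id]
	return ok
}

func main() {
	sl := &sublist{subs: map[int]*sub{1: {1}, 2: {2}}}
	snapshot := sl.match() // processMsg-style: results captured here
	sl.remove(2)           // a concurrent unsubscribe lands in between
	for _, s := range snapshot {
		fmt.Printf("deliver to sub %d (still registered: %v)\n", s.id, sl.stillPresent(s.id))
	}
}
```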

kozlovic added a commit that referenced this pull request Mar 9, 2018
This PR is based on #633. It improves the parsing of the QRSID so that the
TestRouteQueueSemantics test now passes (when dealing with a malformed
QRSID).
A test similar to what is reported in #632 was also added. This
test, however, uncovers a race condition that will be fixed in a
separate PR.

Resolves #632
@kozlovic
Member

kozlovic commented Mar 9, 2018

Sorry, I was not able to update the PR. I also noticed that we do not get the race with the changes I made on top of your branch. It may be because your branch is from a fairly old gnatsd fork, which may indicate that the issue was introduced later. For the updates to this PR, please check PR #638.

@vkhorozvs
Contributor Author

@kozlovic thank you for the great work.
I'm closing this out in favor of your PR.

vkhorozvs closed this Mar 16, 2018
@kozlovic
Member

@vkhorozvs Thank you for your contribution! As you have seen, I will need to take a second look at the other PR to make sure that we have at-most-once delivery in all cases (which I think we do). If so, we can then merge, and rest assured that you will be mentioned in the release notes. Thanks again!
