
[#632] - fix a race condition causing messages to be lost with non-zero max_msgs in a cluster #633

Closed
wants to merge 1 commit

Conversation

vkhorozvs
Contributor

Resolves #632
/cc @nats-io/core

@vkhorozvs
Contributor Author

The TestRouteQueueSemantics test fails with this patch.

I've probably broken the L1/L2 route semantics, but I doubt I'd be able to take all use cases into account.
A NATS expert needs to take action here.

@kozlovic
Member

kozlovic commented Mar 7, 2018

The test fails because it intentionally generates incorrect protocols, which should result in no send. Your code, however, treats that as a non-delivered message and tries to send it anyway (and succeeds). The function routeSidQueueSubscriber and/or parseRouteSid should then be modified to distinguish between an invalid protocol and the absence of a client and/or sub.
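
For illustration only, here is a minimal Go sketch of the distinction described above. The function name, the QRSID layout, and the return values are hypothetical, not the actual gnatsd code; the point is that the parser reports "malformed protocol" separately from "no matching client/sub", so a caller can drop the former instead of attempting delivery.

```go
// Hypothetical sketch: real gnatsd signatures and SID format may differ.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseQueueRouteSid parses a SID of the (assumed) form "QRSID:<cid>:<sid>".
// ok reports whether the token was well formed at all; a caller that gets
// ok == false should drop the message rather than attempt delivery.
// Whether the client/sub still exists is a separate lookup done later.
func parseQueueRouteSid(sid string) (cid, subID uint64, ok bool) {
	parts := strings.Split(sid, ":")
	if len(parts) != 3 || parts[0] != "QRSID" {
		return 0, 0, false // invalid protocol, not "subscriber gone"
	}
	cid, err1 := strconv.ParseUint(parts[1], 10, 64)
	subID, err2 := strconv.ParseUint(parts[2], 10, 64)
	if err1 != nil || err2 != nil {
		return 0, 0, false
	}
	return cid, subID, true
}

func main() {
	for _, sid := range []string{"QRSID:7:22", "QRSID:bad", "RSID:7:22"} {
		cid, subID, ok := parseQueueRouteSid(sid)
		fmt.Println(sid, "->", cid, subID, ok)
	}
}
```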

@vkhorozvs
Contributor Author

vkhorozvs commented Mar 7, 2018

@kozlovic I thought I had accounted for a malformed protocol message with that isRouteQsub variable, which is set to the boolean value returned from routeSidQueueSubscriber, but I may be wrong.

Unfortunately, I don't know all the use cases the NATS server needs to handle, and I don't even know whether my patch fits well into the NATS design and coding style. For now we've created a custom build with this fix for our own needs, and we're happy with what we've got so far. It would be great if you or another NATS core developer could take this issue further (if you consider the bug critical enough) - you'd do it better than I would.


Just some context: we have retry logic on top of the NATS server that accounts for lost messages (the world is not ideal; even if NATS behaves well, the network or any application may crash and a message can be lost). Without this fix our full automation suite was taking over 20 hours because of lost messages (and the associated timeouts of around 2 minutes per loss). With the fix the same suite runs in under 3 hours. So for us it's a big improvement for test scenarios, and for a couple of real-life scenarios as well. I believe it might be important for other NATS users too.
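
As an aside, client-side retry logic of the kind mentioned above could look roughly like the following sketch. It is not the commenter's code; it assumes the github.com/nats-io/nats.go client, and the subject name, attempt count, and timeout are arbitrary placeholders.

```go
// Rough sketch of request-side retries over NATS, assuming the nats.go client.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// requestWithRetry resends a request a few times if no reply arrives in time,
// papering over the occasional lost message.
func requestWithRetry(nc *nats.Conn, subj string, data []byte, attempts int, timeout time.Duration) (*nats.Msg, error) {
	var err error
	for i := 0; i < attempts; i++ {
		var msg *nats.Msg
		msg, err = nc.Request(subj, data, timeout)
		if err == nil {
			return msg, nil
		}
		log.Printf("attempt %d on %q failed: %v", i+1, subj, err)
	}
	return nil, err
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	if reply, err := requestWithRetry(nc, "work.queue", []byte("job"), 3, 2*time.Second); err == nil {
		log.Printf("reply: %s", reply.Data)
	} else {
		log.Printf("giving up after retries: %v", err)
	}
}
```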

@kozlovic
Member

kozlovic commented Mar 7, 2018

@vkhorozvs Your contribution is much appreciated. Sorry that my email suggested you had to fix it. I am busy at the moment but just wanted to comment on the reason for the test failure. I will have another look a bit later. I am happy to see that you have a fix that works for you for the moment.
We will discuss this internally and decide whether this is something we want to address. If so, I will update the PR with the required fixes. Thanks again!

@kozlovic
Member

kozlovic commented Mar 9, 2018

Just an update: I have been fixing the code so that the tests with an invalid QRSID still pass, and that now works. However, I also added a test equivalent to what you were doing in the issue you reported, and it uncovers a race condition between the sublist results access in processMsg() and the removal of the subscription (see the conceptual sketch below). I can reproduce the race even from master.

I will still update this PR, but the tests will fail with the race condition. That will be addressed in a separate PR.
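
A conceptual Go sketch of the kind of race described above (stand-in types only, not the gnatsd sublist): a snapshot of matching subscriptions is taken under a lock, the lock is released, and a concurrent unsubscribe removes a subscription before the snapshot is used for delivery.

```go
// Stand-in types; illustrates the stale-snapshot window, not gnatsd itself.
package main

import (
	"fmt"
	"sync"
)

type sub struct{ id int }

type sublist struct {
	mu   sync.Mutex
	subs map[int]*sub
}

// match returns a snapshot of the current matches. The lock is released
// before the caller delivers, so the snapshot can go stale.
func (sl *sublist) match() []*sub {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	out := make([]*sub, 0, len(sl.subs))
	for _, s := range sl.subs {
		out = append(out, s)
	}
	return out
}

func (sl *sublist) remove(id int) {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	delete(sl.subs, id)
}

func (sl *sublist) stillPresent(id int) bool {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	_, ok := sl.subs[id]
	return ok
}

func main() {
	sl := &sublist{subs: map[int]*sub{1: {1}, 2: {2}}}
	snapshot := sl.match() // processMsg-style: results captured here
	sl.remove(2)           // a concurrent unsubscribe lands in between
	for _, s := range snapshot {
		fmt.Printf("deliver to sub %d (still registered: %v)\n", s.id, sl.stillPresent(s.id))
	}
}
```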

kozlovic added a commit that referenced this pull request Mar 9, 2018
This PR is based on #633. It improves the parsing of the QRSID so that the
TestRouteQueueSemantics test now passes (when dealing with a malformed
QRSID).
A test similar to what is reported in #632 was also added. This
test, however, uncovers a race condition that will be fixed in a
separate PR.

Resolves #632
@kozlovic
Member

kozlovic commented Mar 9, 2018

Sorry, I was not able to update the PR. I also noticed that we do not get the race with the changes I made on top of your branch. It may be because your branch is from a fairly old gnatsd fork, which may indicate that the issue was introduced later. For the updates to this PR, please check PR #638.

@vkhorozvs
Contributor Author

@kozlovic thank you for the great work.
I'm closing this out in favor of your PR.

vkhorozvs closed this Mar 16, 2018
@kozlovic
Member

@vkhorozvs Thank you for your contribution! As you have seen, I will need to take a second look at the other PR to make sure that we have at-most-once delivery in all cases (which I think we do). If so, we can then merge, and rest assured that you will be mentioned in the release notes. Thanks again!
