-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#632] - fix a race condition causing messages to be lost with non-zero max_msgs in a cluster #633
Conversation
…ro max_msgs in a cluster
A TestRouteQueueSemantics test fails with this patch. Probably I've broken an L1/L2 route semantics, but I doubt I'd be able to take into account all use cases. |
The failure of the test is because in the test we generate incorrect protocols (on purpose), which should result in no send. Your code, however, translates that to a non delivery attempt and try to send it (and succeeds). The function |
@kozlovic I thought that I accounted for a wrong protocol message with that Unfortunately, I don't know all the use cases NATS server needs to handle, and I even don't know if my patch fits well into NATS design/coding style. As of now we've created a custom build for our needs with this fix and we are happy with what we've got so far. It'd be great if you or some other NATS core developer can take this issue further (if you consider a bug enough critical) - you'd do it better than me. Just some context: we have a retry logic behind NATS server which accounts for lost messages (world is not ideal, and even if NATS behaves well, network or any application may crash and a message can be lost). Without this fix our full automation suite was running for over 20 hours due to lost messages (and associated timeouts of around 2 minutes per every loss). After a fix the same suite runs for less than 3 hours. So, for us it's a big improvement for test scenarios and a couple of real-life scenarios as well. I believe it might be important for other NATS users too. |
@vkhorozvs Your contribution is much appreciated. Sorry that my email suggested that you had to fix it. I am busy at the moment but wanted just to comment on the reason for the test failure. I will have another look a bit later. I am happy to see that you have a fix for the moment that works for you. |
Just an update that I have been fixing code so that the tests with invalid QRSID still pass and now it's ok. However, I also added a test equivalent to what you were doing in the issue you reported, and that uncovers a race condition between the sublist results access in processMsg() and when the subscription is removed. I can reproduce the race even from master. I will still update this PR but tests will fail with a race condition. This will be addressed in a separate PR. |
Sorry, was not able to update the PR. Also, noticed that we would not get the race from the changes that I made from your branch. It may be because your branch is from a quite old gnatsd fork, so it may indicate that the issue was introduced later. For the updates to this PR please check PR #638. |
@kozlovic thank you for a great work. |
@vkhorozvs Thank you for your contribution! As you have seen, I will need to have a second look to the other PR to make sure that we have an at-most-once in all cases (which I think we have). If so, we can then merge and be assured that you will be mentioned in the release notes. Thanks again! |
Resolves #632
/cc @nats-io/core