Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADDED] ReplaceDurable option #1136

Merged
merged 2 commits into from
Jan 7, 2021
Merged

[ADDED] ReplaceDurable option #1136

merged 2 commits into from
Jan 7, 2021

Conversation

kozlovic
Copy link
Member

@kozlovic kozlovic commented Jan 5, 2021

When the subscription request for a durable subscription times out
or fail on the client side, but it was accepted in the server, then
if the application tries to restart the subscription request again
it will fail with a "duplicate durable subscription" error until
the connection is closed.

This new option allows the user to decide how the server should behave
when processing a duplicate durable subscription. If disabled, the
default, it behaves as described above, that is, it will reject
the second subscription request and return the "duplicate durable"
error.
If enabled, if the server detects that this is a duplicate, it will
close the active one and accept the new one. It is a suspend followed
by a resume.

From the client perspective, if this is done in the context of #1135,
then everything works well since the original subscription in the
client was actually not started due to subscription request failure.
However, if user try to create multiple duplicate durable subscriptions
for subscription requests (Subscribe() calls) that did not fail, then
their application will not be notified that the subscriptions that are
being replaced are replaced, but they will simply stop receiving messages
on those.

Resolves #1135

Signed-off-by: Ivan Kozlovic ivan@synadia.com

When the subscription request for a durable subscription times out
or fail on the client side, but it was accepted in the server, then
if the application tries to restart the subscription request again
it will fail with a "duplicate durable subscription" error until
the connection is closed.

This new option allows the user to decide how the server should behave
when processing a duplicate durable subscription. If disabled, the
default, it behaves as described above, that is, it will reject
the second subscription request and return the "duplicate durable"
error.
If enabled, if the server detects that this is a duplicate, it will
close the active one and accept the new one. It is a suspend followed
by a resume.

From the client perspective, if this is done in the context of #1135,
then everything works well since the original subscription in the
client was actually not started due to subscription request failure.
However, if user try to create multiple duplicate durable subscriptions
for subscription requests (Subscribe() calls) that did not fail, then
their application will not be notified that the subscriptions that are
being replaced are replaced, but they will simply stop receiving messages
on those.

Resolves #1135

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@coveralls
Copy link

coveralls commented Jan 5, 2021

Coverage Status

Coverage increased (+0.05%) to 91.67% when pulling 2e417c5 on fix_1135 into 4ddbef0 on master.

Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kozlovic
Copy link
Member Author

kozlovic commented Jan 6, 2021

@adklager @yadvlz @etrochim: I wonder if you could have a go with this PR and see if that helps the issue you have observed. If not, that's ok, I will merge and this will be available in the next release.

@ghost
Copy link

ghost commented Jan 6, 2021

I did a go get github.com/nats-io/nats-streaming-server@be195d378bb36e4cb42830331e2d7da2e56a931a
then created a stan.conf file with the contents:

replace_durable: true

and instantiated the server with

nats-streaming-server --sc /home/aklager/stan.conf ...

With this build and setting, I am no longer able to reproduce the initial request timeout error that led to the subsequent duplicate error. This is not the outcome I expected, though it does seem to be working now in a scenario that it previously did not.

@kozlovic
Copy link
Member Author

kozlovic commented Jan 6, 2021

Thank you for testing it. Some comments:

With this build and setting, I am no longer able to reproduce the initial request timeout error that led to the subsequent duplicate error.

To be clear, this PR does not solve any timeout that your application may have experienced when creating the subscription. Some users had the same "timeout" issue and consequent "duplicate" error because they were on purpose making their network being very lossy, etc.. (see #1135)

This is not the outcome I expected, though it does seem to be working now in a scenario that it previously did not.

Again, this PR simply allows users to call Subscribe() from the same connection for the same durable without getting the "duplicate durable" error and instead replace the previous with the latest. If this option is enabled, users should make sure that they are not "double" subscribing unless the previous subscription request had failed. The application would NOT receive the message twice though.

@ghost
Copy link

ghost commented Jan 6, 2021

I am basically putting a breakpoint here https://github.com/nats-io/go-nats/blob/73ffc26dfe70cc3b217df38d209c83b6d07eaf3c/nats.go#L2623 to cause a user timeout failure on the initial request, and then on a subsequent call would see the duplicate error. With the new build, I no longer see the initial timeout, and therefore never make a second call. Any reason why that initial call doesn't still timeout as expected? Sounds like this is not a valid repro.

@ghost
Copy link

ghost commented Jan 6, 2021

To be clear, this PR does not solve any timeout that your application may have experienced when creating the subscription.

But it did. That's what's confusing :) ..and I depended on that error to "enable" the duplicate error.

@kozlovic
Copy link
Member Author

kozlovic commented Jan 6, 2021

Your breakpoint is too late. The request has already been sent, and the message is likely to be available when you resume execution. So the select will get the message and return it.

How did you get the timeout error in the first place, which lead you to open the issue since the second subscribe would have caused a duplicate? Was it just an error that happened one day - maybe the server was overloaded and did not send reply on time, or there was a network disconnect at the bad time, etc..

@ghost
Copy link

ghost commented Jan 6, 2021

The "real" error happened on a heavy loaded server that took longer than the timeout to reply. I've since been able to reproduce though using the breakpoint- which sounds like it was a race condition, and I got lucky. Do you have a suggestion as how to generate the initial error?

@kozlovic
Copy link
Member Author

kozlovic commented Jan 6, 2021

If your server is local and you can "play" with it, you could put it in paused state (ctrl+z), and have the client send the request. That request will timeout. You can then resume the server "fg" (for foreground).

Or if you can modify the code/debugger runtime, set err = nats.ErrTimeout after this line: https://github.com/nats-io/stan.go/blob/5e2953cdd5b1dc02070fec826b0752ea18afbe0d/sub.go#L286

But to be fair, I was more asking for you guys to make sure that this option is a viable solution and does not break any of the existing deployment that you have.

@ghost
Copy link

ghost commented Jan 6, 2021

Thanks for the suggestions.

But to be fair, I was more asking for you guys to make sure that this option is a viable solution and does not break any of the existing deployment that you have.

Got it - works for me 👍

As a trivial future request, we exclusively use command line arguments and not a config file, so that parity would be useful.

@ghost
Copy link

ghost commented Jan 6, 2021

Or if you can modify the code/debugger runtime, set err = nats.ErrTimeout after this line: https://github.com/nats-io/stan.go/blob/5e2953cdd5b1dc02070fec826b0752ea18afbe0d/sub.go#L286

Doing this on the first request does still cause a duplicate durable registration error on subsequent requests.

@kozlovic
Copy link
Member Author

kozlovic commented Jan 6, 2021

Doing this on the first request does still cause a duplicate durable registration error on subsequent requests.

If you set the err to timeout only on the first attempt, and then your app tries to subscribe again, it should work. Of course, only if the server has the new option enabled. I just did it and it works for me.

PS: I am adding the -replace_durable as a command line option as you suggested, so will update the PR soo.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kozlovic kozlovic merged commit 564e335 into master Jan 7, 2021
@kozlovic kozlovic deleted the fix_1135 branch January 7, 2021 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"duplicate durable registration" error after "subscribe request timeout"
3 participants