"duplicate durable registration" error after "subscribe request timeout" #1135

Closed
etrochim opened this issue Dec 31, 2020 · 1 comment · Fixed by #1136

Comments

@etrochim

I've been testing how my stan clients, written using nats.c, behave in the face of network problems that introduce high packet loss (which, unfortunately, happens on rare occasions between two of our geographically dispersed sites). During my testing I've found that, on some occasions, when resubscribing after a connection failure, the subscription request fails with "subscribe request timeout" and then, on retry, fails with "duplicate durable registration". All subsequent resubscribe attempts fail with that message for the duration of the connection.

The reconnect logic my stan clients use is pretty straightforward and, I believe, matches the recommendation:

  • Set the ConnectionLostHandler on the stan connection
  • The ConnectionLostHandler will:
    • Destroy all subscriptions
    • Destroy the connection
    • Attempt to reestablish the connection until it succeeds
    • Attempt to resubscribe each subscription until it succeeds, unless the subscription fails due to NATS_CONNECTION_CLOSED, in which case we stop everything and let the ConnectionLostHandler run again.
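
The recovery steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than nats.c code: `recover`, `ConnectionClosed`, `SubscribeFailed`, and the `connect`/`subscribe` callables are all hypothetical stand-ins, teardown of the old subscriptions and connection is omitted, and the retries are bounded (`max_attempts`) only so the example terminates.

```python
class ConnectionClosed(Exception):
    """Stand-in for a failure such as NATS_CONNECTION_CLOSED."""


class SubscribeFailed(Exception):
    """Stand-in for any other subscribe failure, e.g. a request timeout."""


def recover(connect, subscribe, subjects, max_attempts=50):
    """Reconnect, then resubscribe each subject, retrying each step.

    Returns (conn, subs) on success, or None if the connection closed
    again, in which case the lost-connection handler is expected to rerun.
    """
    conn = None
    for _ in range(max_attempts):
        try:
            conn = connect()
            break
        except ConnectionError:
            continue  # connection attempt failed; try again
    if conn is None:
        raise RuntimeError("could not reconnect")

    subs = []
    for subject in subjects:
        for _ in range(max_attempts):
            try:
                subs.append(subscribe(conn, subject))
                break
            except SubscribeFailed:
                continue  # e.g. a timeout; retry this subscription
            except ConnectionClosed:
                return None  # give up; the handler will fire again
        else:
            raise RuntimeError(f"could not subscribe to {subject}")
    return conn, subs


# Demo with fakes: the first connect fails and the first "orders"
# subscribe times out; everything else succeeds.
attempts = {"connect": 0, "orders": 0, "audit": 0}


def fake_connect():
    attempts["connect"] += 1
    if attempts["connect"] == 1:
        raise ConnectionError("network unreachable")
    return "conn"


def fake_subscribe(conn, subject):
    attempts[subject] += 1
    if subject == "orders" and attempts[subject] == 1:
        raise SubscribeFailed("subscribe request timeout")
    return (conn, subject)


result = recover(fake_connect, fake_subscribe, ["orders", "audit"])
print(result, attempts)
```

Note that a plain retry loop like this is exactly what gets stuck if a retried subscribe keeps failing for a reason that retrying cannot clear.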

I'm simulating an unreliable network by using Linux's NetEm. I've configured the network interface to rate limit to 150kbit and 40% packet loss (unrealistically high but appears to be quite good at hitting many possible failure cases when running for hours/days at a time). Most of the time the re-connection logic works correctly, except, of course, for the originally stated problem.

I haven't confirmed this, but it appears that the SubscriptionRequest message reaches the server while the SubscriptionResponse message doesn't get back to the client before the ConnectionWait period elapses. The server then thinks the client has a valid subscription, but the client disagrees. This traps the client in a state where the subscription request can never be fulfilled until the connection stops and starts again.
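
To make the suspected sequence concrete, here is a toy model of it in Python. It is not the NATS Streaming server's implementation: `ToyServer`, its `active` set, and the `response_lost` flag are assumptions made purely for illustration. The server records the durable before replying, the reply is dropped, and every retry on the same connection is then rejected.

```python
class DuplicateDurable(Exception):
    pass


class SubscribeTimeout(Exception):
    pass


class ToyServer:
    """Tracks which (client_id, durable) pairs have an active subscription."""

    def __init__(self):
        self.active = set()

    def handle_subscribe(self, client_id, durable):
        key = (client_id, durable)
        if key in self.active:
            raise DuplicateDurable("duplicate durable registration")
        self.active.add(key)  # registered before the reply is sent
        return "ok"

    def close_connection(self, client_id):
        # Closing the connection deactivates that client's durables.
        self.active = {k for k in self.active if k[0] != client_id}


def subscribe(server, client_id, durable, response_lost=False):
    """Client-side call: the request always arrives, the reply may not."""
    reply = server.handle_subscribe(client_id, durable)
    if response_lost:
        raise SubscribeTimeout("subscribe request timeout")
    return reply


server = ToyServer()
outcomes = []
# First attempt: reply lost. Second and third: replies would arrive fine.
for lost in (True, False, False):
    try:
        outcomes.append(subscribe(server, "client-1", "dur", response_lost=lost))
    except (SubscribeTimeout, DuplicateDurable) as exc:
        outcomes.append(str(exc))
print(outcomes)
```

In this model the retries fail even though the network has recovered, because the server-side record from the first request is still active.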

Out of curiosity I tested the same scenario with nats-replicator. It too will get trapped in a "duplicate durable registration" error loop until the connection dies again.

Setting the ConnectionWait time to something higher (I've been using 10 seconds) significantly reduces the occurrence of the problem but does not entirely eliminate it.

The only workaround I've been able to find is to simply destroy the connection entirely and reconnect when this error is encountered. However, that seems like a fairly drastic solution. Is there a better workaround for this?
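
A rough sketch of that workaround, again against a toy Python model rather than the real client or server: `ToyServer`, `subscribe_with_workaround`, and the `lost_replies` iterator are hypothetical names. A timeout is retried on the same connection, while a duplicate-durable error triggers a full close and reconnect before the next attempt.

```python
class DuplicateDurable(Exception):
    pass


class SubscribeTimeout(Exception):
    pass


class ToyServer:
    """Tracks which (client_id, durable) pairs have an active subscription."""

    def __init__(self):
        self.active = set()

    def handle_subscribe(self, client_id, durable):
        key = (client_id, durable)
        if key in self.active:
            raise DuplicateDurable("duplicate durable registration")
        self.active.add(key)

    def close_connection(self, client_id):
        self.active = {k for k in self.active if k[0] != client_id}


def subscribe_with_workaround(server, client_id, durable, lost_replies,
                              max_attempts=10):
    """Return how many reconnects were needed before subscribe succeeded.

    lost_replies yields True for each attempt whose reply is dropped.
    """
    reconnects = 0
    for _ in range(max_attempts):
        try:
            server.handle_subscribe(client_id, durable)
            if next(lost_replies, False):
                raise SubscribeTimeout("subscribe request timeout")
            return reconnects
        except SubscribeTimeout:
            continue  # plain retry on the same connection
        except DuplicateDurable:
            # Drastic but effective: drop the connection and start over.
            server.close_connection(client_id)
            reconnects += 1
    raise RuntimeError("could not subscribe")


server = ToyServer()
reconnects = subscribe_with_workaround(server, "client-1", "dur", iter([True]))
print(reconnects)
```

The cost is that every other subscription on the same connection is torn down as well, which is why it feels heavy-handed.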

@kozlovic
Member

kozlovic commented Jan 4, 2021

@etrochim You are correct. As of now, I can't think of any other workaround. You clearly understand what the issue is and how to remedy it, for now.
I did not have a chance to check the code to see how feasible/safe it is to do this, but maybe we could add a NATS Streaming server option that accepts the new durable and internally closes the old one? It is understood that in situations where the user is incorrectly trying to start the same durable from the same connection twice (the first having succeeded), the application will not know that and will have a durable that simply does not receive any messages.

kozlovic added a commit that referenced this issue Jan 5, 2021
When the subscription request for a durable subscription times out
or fails on the client side, but it was accepted by the server, then
if the application tries to restart the subscription request again
it will fail with a "duplicate durable subscription" error until
the connection is closed.

This new option allows the user to decide how the server should behave
when processing a duplicate durable subscription. If disabled, the
default, it behaves as described above, that is, it will reject
the second subscription request and return the "duplicate durable"
error.
If enabled, if the server detects that this is a duplicate, it will
close the active one and accept the new one. It is a suspend followed
by a resume.

From the client perspective, if this is done in the context of #1135,
then everything works well since the original subscription in the
client was actually not started due to subscription request failure.
However, if users try to create multiple duplicate durable subscriptions
for subscription requests (Subscribe() calls) that did not fail, then
their application will not be notified that the subscriptions are
being replaced; they will simply stop receiving messages
on those.

Resolves #1135

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
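
The two behaviors described in the commit message can be compared with a small toy model in Python. This is a sketch under stated assumptions, not the server's code: `ToyServer`, the subscription ids, and the `replace` flag are hypothetical, with `replace` standing in for the new server option, whose actual name is not given here.

```python
class DuplicateDurable(Exception):
    pass


class ToyServer:
    def __init__(self, replace=False):
        self.replace = replace  # hypothetical stand-in for the new option
        self.active = {}        # (client_id, durable) -> subscription id
        self.closed = []        # ids closed because they were replaced
        self.next_id = 0

    def handle_subscribe(self, client_id, durable):
        key = (client_id, durable)
        if key in self.active:
            if not self.replace:
                raise DuplicateDurable("duplicate durable registration")
            # Replace mode: close the old one; its owner is NOT notified.
            self.closed.append(self.active[key])
        self.next_id += 1
        self.active[key] = self.next_id
        return self.next_id

    def receiver(self, client_id, durable):
        """Id of the subscription that would get the next message."""
        return self.active.get((client_id, durable))


# Default behavior: the second request is rejected.
strict = ToyServer()
strict.handle_subscribe("client-1", "dur")
try:
    strict.handle_subscribe("client-1", "dur")
    strict_result = "accepted"
except DuplicateDurable as exc:
    strict_result = str(exc)

# Replace behavior: the second request wins, the first goes quiet.
lenient = ToyServer(replace=True)
first = lenient.handle_subscribe("client-1", "dur")
second = lenient.handle_subscribe("client-1", "dur")
print(strict_result, first, second, lenient.closed)
```

In the replace case the first subscription id ends up in `closed` and no longer receives anything, which mirrors the caveat above: harmless when the first request failed on the client side, silent message loss when it had actually succeeded.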