-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Leaf node fails to reconnect, due to ping messages being held off indefinitely #3682
Comments
Also, raised originally on Slack channel, where JNM replied saying I should create an issue about this |
Great writeup of the issue, will take a look. |
Thanks |
… due to ping messages being held off indefinitely.
FWIW I created a PR for this |
…connect, due to ping messages being held off indefinitely. Always send pings for gateway and leafnode connections
… gateway and solicited leafnode connections
…F connections, to ensure failover of leaf node connections
I have created a fresh PR with only the single line change we agreed |
However, the fresh PR triggers a Travis build test failure in the TestLeafNodeRTT test with "RTT not tracked". But I don't understand enough about this test to understand how to fix this... |
…F connections, to ensure failover of leaf node connections
Fix for #3682 (Take 2): do not delay PINGs for GATEWAY or spoke LEAF connections
Defect
We are using a LeafNode NATS server to connect to a cluster, and see a strange effect which prevents the LeafNode reconnecting properly in the event that its link to the cluster goes down.
The LeafNode has two local clients which connect to it, and which generate traffic on a continuous basis. And in short, this continuous client traffic results in the outbound Pings from LeafNode to cluster being held off/delayed indefinitely, with the message "Leaf Node Ping Timer", "Delaying Ping due to client activity". And because the Pings are held off, the LeafNode doesn't detect a stale connection, and hence doesn't close the connection and attempt to reconnect.
I believe I understand the issue.
In NATS server client.go:: processPingTimer() there is a check to test whether to delay sending an outgoing ping, in two cases:
This makes perfect sense: incoming receive messages mean we still have a link and don't need to send a ping.
However, the first test above is derived from the client.last field ("last packet" time) which is set in two cases:
But I believe that this second case is incorrect: we should only hold off pings when there has been receive traffic on the specified connection, and this isn't so in the second case. We received traffic on one connection, but are resetting the c.last field on another connection where we are forwarding/sending the message.
If I remove the line of code in flushClients() (about line 1113) that updates cp.last, then the Stale Client Connection fires fine in my testing.
Versions of
nats-server
and affected client libraries used:Latest master NATS server as of 2/12/2022, eg git commit c4c8761
OS/Container environment:
Linux ARM64 - but not a factor
Steps or code to reproduce the issue:
Configure leaf node server with local clients sending traffic that an external client is subscribed for, and then break the connection to the external NATS cluster
Expected result:
Leaf node server should automatically trigger a reconnect when the connection to an external NATS cluster is lost, with the detection time as configured by ping_interval and ping_max parameters.
Actual result:
Leaf node server does not automatically reconnect. Instead, it remains in a zombie state indefinitely.
To be more precise, it might possibly recover when the network stack TCP keepalive timeout expires (which by default is 2 hours), but that is much too long to be useful
The text was updated successfully, but these errors were encountered: