Leaf node fails to reconnect, due to ping messages being held off indefinitely #3682

sandykellagher · 2022-12-06T10:54:45Z

Defect

We are using a LeafNode NATS server to connect to a cluster, and see a strange effect which prevents the LeafNode from reconnecting properly when its link to the cluster goes down.

The LeafNode has two local clients which connect to it and generate traffic continuously. In short, this continuous client traffic causes the outbound Pings from the LeafNode to the cluster to be held off indefinitely, with the messages "Leaf Node Ping Timer" and "Delaying Ping due to client activity". Because the Pings are held off, the LeafNode doesn't detect a stale connection, and hence doesn't close the connection and attempt to reconnect.

I believe I understand the issue.

In the NATS server's client.go, processPingTimer() checks whether to delay sending an outgoing ping, in two cases:

- we recently (within the specified pingInterval) received a data message or sub/unsub message from the remote end (Client activity)
- we recently received a ping from the remote end (Remote ping)

This makes perfect sense: incoming messages mean we still have a link and don't need to send a ping.

However, the first test above is derived from the client.last field ("last packet" time) which is set in two cases:

- when the readLoop parsing determines we have received a new message or sub/unsub
- in the flushClients() routine, which is called when we have received some messages that need to be forwarded to other client connections

But I believe that this second case is incorrect: we should only hold off pings when there has been receive traffic on the specified connection, and this isn't so in the second case. We received traffic on one connection, but are resetting the c.last field on another connection where we are forwarding/sending the message.

If I remove the line of code in flushClients() (about line 1113) that updates cp.last, then the Stale Client Connection fires fine in my testing.

Versions of `nats-server` and affected client libraries used:

Latest master NATS server as of 2 December 2022, e.g. git commit c4c8761

OS/Container environment:

Linux ARM64 - but not a factor

Steps or code to reproduce the issue:

Configure a leaf node server with local clients sending traffic that an external client is subscribed to, and then break the connection to the external NATS cluster.

Expected result:

Leaf node server should automatically trigger a reconnect when the connection to an external NATS cluster is lost, with the detection time as configured by ping_interval and ping_max parameters.

Actual result:

Leaf node server does not automatically reconnect. Instead, it remains in a zombie state indefinitely.
To be more precise, it might possibly recover when the network stack TCP keepalive timeout expires (which by default is 2 hours), but that is much too long to be useful.

sandykellagher · 2022-12-06T10:55:46Z

Also, this was originally raised on the Slack channel, where JNM replied saying I should create an issue about it.

derekcollison · 2022-12-06T11:43:20Z

Great writeup of the issue, will take a look.

sandykellagher · 2022-12-06T12:23:22Z

Thanks

sandykellagher · 2022-12-06T13:20:51Z

FWIW I created a PR for this

sandykellagher · 2022-12-07T15:03:18Z

I have created a fresh PR with only the single line change we agreed

sandykellagher · 2022-12-07T15:36:27Z

However, the fresh PR triggers a Travis build test failure in the TestLeafNodeRTT test with "RTT not tracked". But I don't understand enough about this test to understand how to fix this...

sandykellagher added the 🐞 bug label Dec 6, 2022

derekcollison self-assigned this Dec 6, 2022

sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 6, 2022

Fix for NATS server issue nats-io#3682: Leaf node fails to reconnect,…

cf2a935

… due to ping messages being held off indefinitely.

sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 7, 2022

Updated fix for NATS server issue nats-io#3682: Leaf node fails to re…

8a299ef

…connect, due to ping messages being held off indefinitely. Always send pings for gateway and leafnode connections

sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 7, 2022

Updated fix for NATS server issue nats-io#3682. Always send pings for…

dcf66c6

… gateway and solicited leafnode connections

sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 7, 2022

Fix for nats-io#3682: do not delay PINGs for GATEWAY or solicited LEA…

24ba945

…F connections, to ensure failover of leaf node connections

sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 7, 2022

Fix for nats-io#3682: do not delay PINGs for GATEWAY or solicited LEA…

7907950

…F connections, to ensure failover of leaf node connections

derekcollison added a commit that referenced this issue Dec 7, 2022

Merge pull request #3692 from Lawo-Ext/main

effccfd

Fix for #3682 (Take 2): do not delay PINGs for GATEWAY or spoke LEAF connections

derekcollison closed this as completed Dec 15, 2022

sandykellagher mentioned this issue Apr 3, 2023

Disconnection of a NATS server in a cluster is not detected by ping-pong mechanism #4014

Closed

bruth removed the 🐞 bug label Aug 18, 2023
