Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leaf node fails to reconnect, due to ping messages being held off indefinitely #3682

Closed
sandykellagher opened this issue Dec 6, 2022 · 6 comments
Assignees

Comments

@sandykellagher
Copy link
Contributor

Defect

We are using a LeafNode NATS server to connect to a cluster, and see a strange effect which prevents the LeafNode reconnecting properly in the event that its link to the cluster goes down.

The LeafNode has two local clients which connect to it, and which generate traffic on a continuous basis. And in short, this continuous client traffic results in the outbound Pings from LeafNode to cluster being held off/delayed indefinitely, with the message "Leaf Node Ping Timer", "Delaying Ping due to client activity". And because the Pings are held off, the LeafNode doesn't detect a stale connection, and hence doesn't close the connection and attempt to reconnect.

I believe I understand the issue.

In NATS server client.go:: processPingTimer() there is a check to test whether to delay sending an outgoing ping, in two cases:

  • we recently (within specified pingInterval) received a data message or sub/unsub message from the remote end (Client activity)
  • we recently received a ping from the remote end (Remote ping)

This makes perfect sense: incoming receive messages mean we still have a link and don't need to send a ping.

However, the first test above is derived from the client.last field ("last packet" time) which is set in two cases:

  • when the readLoop parsing determines we have received a new message or sub/unsub
  • in flushClients() routine, which is called when we have received some messages that need to be forwarded to other client connections

But I believe that this second case is incorrect: we should only hold off pings when there has been receive traffic on the specified connection, and this isn't so in the second case. We received traffic on one connection, but are resetting the c.last field on another connection where we are forwarding/sending the message.

If I remove the line of code in flushClients() (about line 1113) that updates cp.last, then the Stale Client Connection fires fine in my testing.

Versions of nats-server and affected client libraries used:

Latest master NATS server as of 2/12/2022, eg git commit c4c8761

OS/Container environment:

Linux ARM64 - but not a factor

Steps or code to reproduce the issue:

Configure leaf node server with local clients sending traffic that an external client is subscribed for, and then break the connection to the external NATS cluster

Expected result:

Leaf node server should automatically trigger a reconnect when the connection to an external NATS cluster is lost, with the detection time as configured by ping_interval and ping_max parameters.

Actual result:

Leaf node server does not automatically reconnect. Instead, it remains in a zombie state indefinitely.
To be more precise, it might possibly recover when the network stack TCP keepalive timeout expires (which by default is 2 hours), but that is much too long to be useful

@sandykellagher
Copy link
Contributor Author

Also, raised originally on Slack channel, where JNM replied saying I should create an issue about this

@derekcollison derekcollison self-assigned this Dec 6, 2022
@derekcollison
Copy link
Member

Great writeup of the issue, will take a look.

@sandykellagher
Copy link
Contributor Author

Thanks

sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 6, 2022
… due to ping messages being held off indefinitely.
@sandykellagher
Copy link
Contributor Author

FWIW I created a PR for this

sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 7, 2022
…connect, due to ping messages being held off indefinitely. Always send pings for gateway and leafnode connections
sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 7, 2022
… gateway and solicited leafnode connections
sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 7, 2022
…F connections, to ensure failover of leaf node connections
@sandykellagher
Copy link
Contributor Author

I have created a fresh PR with only the single line change we agreed

@sandykellagher
Copy link
Contributor Author

However, the fresh PR triggers a Travis build test failure in the TestLeafNodeRTT test with "RTT not tracked". But I don't understand enough about this test to understand how to fix this...

sandykellagher added a commit to Lawo-Ext/nats-server that referenced this issue Dec 7, 2022
…F connections, to ensure failover of leaf node connections
derekcollison added a commit that referenced this issue Dec 7, 2022
Fix for #3682 (Take 2): do not delay PINGs for GATEWAY or spoke LEAF connections
@bruth bruth removed the 🐞 bug label Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants