We are using three NATS servers in a cluster, and see a bug that prevents the loss of one server from being detected by the other two. This matters because we use queue group subscriptions: when a server goes offline, we need the queue group subscription for services on the dead server to be removed. Otherwise, NATS can end up selecting entities on the dead server as the nominated responder for pub messages to the queue group.
In our system, there is continual traffic exchanged between all three servers during normal operation. In short, this continuous traffic causes all outbound PINGs from one server to another to be held off indefinitely. And because the PINGs are held off, the servers never detect a stale connection, and hence don't close the connection and attempt to reconnect.
I believe I understand the issue.
In the NATS server's client.go, processPingTimer() checks whether to hold off sending an outgoing ping in two cases:
- we recently (within the configured ping_interval) received traffic (a message or sub/unsub) from the remote end
- we recently received a ping from the remote end (a remote ping)
This makes perfect sense: incoming receive messages mean we still have a link and don't need to send a ping.
However, the first test above is derived from the client.last field ("last packet" time) which is set for both incoming and outgoing traffic on this link.
I think the solution is to store a separate client.ping.lastRxMsg field which is updated for receive traffic only, and then use that field in the test of whether to hold off sending a ping.
Also, with this change, I think the same logic can be applied to all connection types (CLIENT, ROUTER, GATEWAY, LEAF). So we can remove the special-case tests that always send a ping for a GATEWAY.
OS/Container environment
Linux ARM64 - but not a factor
Steps or code to reproduce the issue
Configure 3 server cluster, and then break the connection to one of the servers
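For reference, a minimal cluster-node configuration of the kind used here, showing where ping_interval and ping_max are set (hostnames and cluster name are placeholders):

```conf
# One node of a 3-server cluster (illustrative).
ping_interval: "2m"
ping_max: 2

cluster {
  name: C1
  listen: 0.0.0.0:6222
  routes = [
    nats-route://host2:6222
    nats-route://host3:6222
  ]
}
```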
Expected result
The remaining servers should automatically detect loss of connection, with the detection time as configured by ping_interval and ping_max parameters.
Actual result
The remaining servers do eventually detect loss of connection via a write-buffer deadline being exceeded, but this takes a very long time, presumably depending on the volume of traffic and the size of the buffer.
…connections (#4016)
Alternative fix for issue #4014, where we always send pings for ROUTER, GATEWAY and LEAF spoke connections.
This is my own original work that I license to the project
[x] Link to issue, e.g. Resolves #NNN
Documentation added (if applicable)
Tests added
[x] Branch rebased on top of current main (git pull --rebase origin main)
[x] Changes squashed to a single commit (described [here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
[x] Build is green in Travis CI
[x] You have certified that the contribution is your original work and that you license the work to the project under the [Apache 2 license](https://github.com/nats-io/nats-server/blob/main/LICENSE)
Resolves #
Changes proposed in this pull request:
- In processPingTimer(), add an explicit test for ROUTER connections, and only hold off sending a ping based on a recently received ping
Defect
This is more or less a follow on to #3682