We are using three NATS servers in a cluster, and see a bug that prevents the loss of one server from being detected by the other two. This matters because we use queue group subscriptions: when a server goes offline, we need the queue group subscription for services on the dead server to be removed. Otherwise, NATS can end up selecting entities on the dead server as the nominated responder for pub messages to the queue group.
In our system, there is continual traffic exchanged between all three servers during normal operation. In short, this continuous traffic causes all outbound PINGs from one server to another to be held off indefinitely. And because the PINGs are held off, the servers never detect a stale connection, and hence don't close the connection and attempt to reconnect.
I believe I understand the issue.
In the NATS server's client.go, processPingTimer() checks whether to hold off sending an outgoing ping in two cases:
- we recently (within the configured ping_interval) received traffic (a message or sub/unsub) from the remote end
- we recently received a ping from the remote end (a remote ping)
This makes perfect sense: incoming receive messages mean we still have a link and don't need to send a ping.
However, the first test above is derived from the client.last field ("last packet" time) which is set for both incoming and outgoing traffic on this link.
I think the solution is to store a separate client.ping.lastRxMsg field which is updated for receive traffic only, and then use that field in the test of whether to hold off sending a ping.
Also, with this change, I think the same logic can be applied to all connection types (CLIENT, ROUTER, GATEWAY, LEAF). So we can remove the special-case tests that always send a ping for a GATEWAY.
OS/Container environment
Linux ARM64 - but not a factor
Steps or code to reproduce the issue
Configure 3 server cluster, and then break the connection to one of the servers
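For reference, a minimal cluster-node configuration of the kind used here, showing where ping_interval and ping_max are set (hostnames and cluster name are placeholders):

```conf
# One node of a 3-server cluster (illustrative).
ping_interval: "2m"
ping_max: 2

cluster {
  name: C1
  listen: 0.0.0.0:6222
  routes = [
    nats-route://host2:6222
    nats-route://host3:6222
  ]
}
```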
Expected result
The remaining servers should automatically detect loss of connection, with the detection time as configured by ping_interval and ping_max parameters.
Actual result
The remaining servers do eventually detect loss of connection via a write-buffer deadline being exceeded, but this takes a very long time, presumably depending on the volume of traffic and the size of the buffer.
…connections (#4016)
Alternative fix for issue #4014, where we always send pings for ROUTER, GATEWAY and LEAF spoke connections.
This is my own original work that I license to the project
[x] Link to issue, e.g. Resolves #NNN
Documentation added (if applicable)
Tests added
[x] Branch rebased on top of current main (git pull --rebase origin main)
[x] Changes squashed to a single commit (described [here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
[x] Build is green in Travis CI
[x] You have certified that the contribution is your original work and that you license the work to the project under the [Apache 2 license](https://github.com/nats-io/nats-server/blob/main/LICENSE)
Resolves #
Changes proposed in this pull request:
- In processPingTimer(), add an explicit test for ROUTER connections, and only hold off sending a ping based on a recently received ping
Defect
This is more or less a follow on to #3682