Prevent duplicate routes #5484

wjordan · 2024-06-04T21:04:32Z

This PR attempts to prevent duplicate routes by checking the existing route-connection count before attempting to create a new connection.

Fixes #5483.

Before attempting a new connection, connectToRoute checks s.numRouteConns (the sum of registered routes, unregistered routes, and pending route-connections) and skips the attempt if the server is already at its limit. This check prevents the pileup of pending connections (which would eventually be closed as duplicate routes) described in the linked issue #5483.

In order to track pending route-connections, this PR adds a counter map (s.pendingRouteConns) that is incremented on each route-connection attempt and decremented after the connection is established/aborted.

Signed-off-by: Your Name will.jordan@gmail.com

derekcollison · 2024-06-04T21:07:54Z

@kozlovic could you possibly take a look?

This is an alternate approach to PR #5484 from @wjordan. Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix runs), and there was still a data race (which could easily have been fixed). The main issue is that the increment happens in connectToRoute, which runs in a goroutine, so there was still a chance of duplicates. Instead, I took the view that those duplicates were the result of far too many gossip protocol messages. Suppose that servers A and B are already connected and C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip server C to A, and C would gossip server B to A, all of which was unnecessary. The number of messages grows quickly with the size of the cluster (several thousand for a cluster of 15 or so servers).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
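The A/B/C example can be written out as a small sketch. It only classifies the three gossip messages from the example by whether the recipient already has a route to the advertised server; the gossip type, the routes map, and the redundant function are hypothetical names for illustration, and this thread does not show how PR #5602 actually suppresses the extra messages.

```go
package main

import "fmt"

// gossip is a hypothetical record: from tells to that it should connect to about.
type gossip struct{ from, to, about string }

// redundant reports whether the recipient already has a route to the server
// being advertised, in which case the message carries no new information.
func redundant(routes map[string]map[string]bool, g gossip) bool {
	return routes[g.to][g.about]
}

func main() {
	// State from the example: A and B are connected, then C connects to A.
	routes := map[string]map[string]bool{
		"A": {"B": true, "C": true},
		"B": {"A": true},
		"C": {"A": true},
	}
	msgs := []gossip{
		{"A", "B", "C"}, // A tells B to connect to C
		{"B", "A", "C"}, // B tells A about C
		{"C", "A", "B"}, // C tells A about B
	}
	for _, g := range msgs {
		fmt.Printf("%s -> %s about %s: redundant=%v\n", g.from, g.to, g.about, redundant(routes, g))
	}
}
```

Only the first message tells its recipient something new; the other two advertise servers that A is already connected to.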

kozlovic · 2024-06-27T17:51:25Z

@derekcollison @wjordan Sorry for the delay. I did look at this PR, but with the test from PR #5602 I could still see duplicate routes with @wjordan's approach. The reason this PR did not see them after the changes is likely the way the cluster was formed, starting one server at a time. The ref count added in this PR is done "too late", in connectToRoute(), which runs in a goroutine. When servers start concurrently, there are far more chances for the duplicate-routes situation to occur, even with this PR.

That made me realize that the real issue was excessive gossip protocol traffic, which is what I tackled in PR #5602. @wjordan Let me know whether the other approach solves the issue in your environment. Thanks!
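The "too late" point is a general check-then-act gap: if the count is only updated inside the goroutine that performs the connection, several callers can pass the check before any of them has incremented. One generic way to close that gap is to reserve a slot atomically at the moment the decision is made. The sketch below illustrates only that general technique; it is not what either PR implements (PR #5602 reduces gossip instead), and the slots type, tryReserve, and countAttempts are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// slots is a hypothetical counter of established plus pending route connections.
type slots struct {
	limit int64
	used  atomic.Int64
}

// tryReserve claims one slot with a compare-and-swap loop, so the limit check
// and the increment cannot be separated by another goroutine.
func (s *slots) tryReserve() bool {
	for {
		cur := s.used.Load()
		if cur >= s.limit {
			return false
		}
		if s.used.CompareAndSwap(cur, cur+1) {
			return true
		}
	}
}

// countAttempts runs n concurrent deciders and returns how many were allowed
// to start a connection attempt.
func countAttempts(s *slots, n int) int64 {
	var wg sync.WaitGroup
	var started atomic.Int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() { // each goroutine models a handler deciding whether to connect
			defer wg.Done()
			if s.tryReserve() {
				started.Add(1) // placeholder for launching the dial
			}
		}()
	}
	wg.Wait()
	return started.Load()
}

func main() {
	s := &slots{limit: 3}
	fmt.Println(countAttempts(s, 100)) // prints 3
}
```

Because slots are never released in this sketch, exactly limit attempts proceed no matter how the goroutines interleave. Real code would also need to release a slot when an attempt is aborted.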

derekcollison · 2024-07-22T15:51:55Z

Closing in favor of @kozlovic's PR.

wjordan requested a review from a team as a code owner June 4, 2024 21:04

derekcollison requested a review from kozlovic June 4, 2024 21:07

Prevent duplicate routes

f3d3565

Check route connection count before creating a new connection.

wjordan force-pushed the duplicate-route-fixes branch from ba05173 to f3d3565 Compare June 4, 2024 21:13

wjordan added 3 commits June 4, 2024 14:49

fix race

885b52b

try again to fix race

0ee051c

Instead of cycling through temp clients on each check, update pendingRouteConns when temp router-clients are added/removed.

fix another race

a190b39

use atomic.Int64 for loading counter value as well

wjordan force-pushed the duplicate-route-fixes branch from 1b80ced to a190b39 Compare June 4, 2024 23:27

cleanup cluster in test

60f90df

kozlovic mentioned this pull request Jun 27, 2024

[IMPROVED] Routing: reduce chances of duplicate implicit routes #5602

Merged

derekcollison closed this Jul 22, 2024
