Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent duplicate routes #5484

Closed
wants to merge 5 commits into from
Closed

Conversation

wjordan
Copy link
Contributor

@wjordan wjordan commented Jun 4, 2024

This PR attempts to prevent duplicate routes by checking the existing route-connection count before attempting to create a new connection.

Fixes #5483.

Before attempting a new connection, connectToRoute checks s.numRouteConns (the total sum of registered routes, unregistered routes, and pending route-connections), and skips the connection attempt if the server is already at its limit. This check prevents a pileup of pending connections (which will be eventually closed as duplicate routes) described in the linked issue #5483.

In order to track pending route-connections, this PR adds a counter map (s.pendingRouteConns) that is incremented on each route-connection attempt and decremented after the connection is established/aborted.

Signed-off-by: Your Name will.jordan@gmail.com

@wjordan wjordan requested a review from a team as a code owner June 4, 2024 21:04
@derekcollison
Copy link
Member

@kozlovic could you possibly take a look?

Check route connection count before creating a new connection.
Instead of cycling through temp clients on each check,
update pendingRouteConns when temp router-clients are added/removed.
use atomic.Int64 for loading counter value as well
kozlovic added a commit that referenced this pull request Jun 27, 2024
This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could
still see duplicate routes (up to 125 in one of the matrix), and
still had a data race (that could have easily be fixed). The main
issue is that the increment happens in connectToRoute, which is
running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result
of way too many gossip protocols. Suppose that you have servers A
and B already connected. C connects to A. A gossips to B that it
should connect to C. When that happened, B would gossip to A the
server C and C would gossip to A the server B, which all that was
unnecessary. It would grow quite fast with the size of the cluster
(that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@kozlovic
Copy link
Member

@derekcollison @wjordan Sorry for the delay. I did look at this PR but with PR #5602 test, I still could see duplicate routes with @wjordan approach. The reason this PR did not see after the changes is likely the way the cluster was formed, starting one server at a time. The ref count added in this PR is done "too late", in connectToRoute(), which is running from a go routine. When servers start concurrently, there are way more chances for the duplicate routes situation to occur, even with this PR.

That made me realize that the issue was really the excess of gossip protocol, which is what I tackled in PR #5602. @wjordan Let me know if you think the other approach is solving the issue in your environment. Thanks!

derekcollison added a commit that referenced this pull request Jul 22, 2024
This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could still
see duplicate routes (up to 125 in one of the matrix), and still had a
data race (that could have easily be fixed). The main issue is that the
increment happens in connectToRoute, which is running from a go routine,
so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result of
way too many gossip protocols. Suppose that you have servers A and B
already connected. C connects to A. A gossips to B that it should
connect to C. When that happened, B would gossip to A the server C and C
would gossip to A the server B, which all that was unnecessary. It would
grow quite fast with the size of the cluster (that is, several thousands
for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@derekcollison
Copy link
Member

Closing in favor of @kozlovic's PR.

neilalexander pushed a commit that referenced this pull request Jul 29, 2024
This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could
still see duplicate routes (up to 125 in one of the matrix), and
still had a data race (that could have easily be fixed). The main
issue is that the increment happens in connectToRoute, which is
running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result
of way too many gossip protocols. Suppose that you have servers A
and B already connected. C connects to A. A gossips to B that it
should connect to C. When that happened, B would gossip to A the
server C and C would gossip to A the server B, which all that was
unnecessary. It would grow quite fast with the size of the cluster
(that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
neilalexander pushed a commit that referenced this pull request Jul 29, 2024
This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could
still see duplicate routes (up to 125 in one of the matrix), and
still had a data race (that could have easily be fixed). The main
issue is that the increment happens in connectToRoute, which is
running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result
of way too many gossip protocols. Suppose that you have servers A
and B already connected. C connects to A. A gossips to B that it
should connect to C. When that happened, B would gossip to A the
server C and C would gossip to A the server B, which all that was
unnecessary. It would grow quite fast with the size of the cluster
(that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
ReubenMathew pushed a commit to ReubenMathew/nats-server that referenced this pull request Jul 29, 2024
This is an alternate approach to the PR nats-io#5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could
still see duplicate routes (up to 125 in one of the matrix), and
still had a data race (that could have easily be fixed). The main
issue is that the increment happens in connectToRoute, which is
running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result
of way too many gossip protocols. Suppose that you have servers A
and B already connected. C connects to A. A gossips to B that it
should connect to C. When that happened, B would gossip to A the
server C and C would gossip to A the server B, which all that was
unnecessary. It would grow quite fast with the size of the cluster
(that is, several thousands for a cluster size of 15 or so).

Resolves nats-io#5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
bruth pushed a commit that referenced this pull request Jul 29, 2024
This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could
still see duplicate routes (up to 125 in one of the matrix), and
still had a data race (that could have easily be fixed). The main
issue is that the increment happens in connectToRoute, which is
running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result
of way too many gossip protocols. Suppose that you have servers A
and B already connected. C connects to A. A gossips to B that it
should connect to C. When that happened, B would gossip to A the
server C and C would gossip to A the server B, which all that was
unnecessary. It would grow quite fast with the size of the cluster
(that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
bruth pushed a commit that referenced this pull request Jul 29, 2024
This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could
still see duplicate routes (up to 125 in one of the matrix), and
still had a data race (that could have easily be fixed). The main
issue is that the increment happens in connectToRoute, which is
running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result
of way too many gossip protocols. Suppose that you have servers A
and B already connected. C connects to A. A gossips to B that it
should connect to C. When that happened, B would gossip to A the
server C and C would gossip to A the server B, which all that was
unnecessary. It would grow quite fast with the size of the cluster
(that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'Duplicate Route' connection floods on implicit routes
3 participants