-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent duplicate routes #5484
Prevent duplicate routes #5484
Conversation
@kozlovic could you possibly take a look? |
Check route connection count before creating a new connection.
ba05173
to
f3d3565
Compare
Instead of cycling through temp clients on each check, update pendingRouteConns when temp router-clients are added/removed.
use atomic.Int64 for loading counter value as well
1b80ced
to
a190b39
Compare
This is an alternate approach to the PR #5484 from @wjordan. Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates. Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so). Resolves #5483 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@derekcollison @wjordan Sorry for the delay. I did look at this PR but with PR #5602 test, I still could see duplicate routes with @wjordan approach. The reason this PR did not see after the changes is likely the way the cluster was formed, starting one server at a time. The ref count added in this PR is done "too late", in connectToRoute(), which is running from a go routine. When servers start concurrently, there are way more chances for the duplicate routes situation to occur, even with this PR. That made me realize that the issue was really the excess of gossip protocol, which is what I tackled in PR #5602. @wjordan Let me know if you think the other approach is solving the issue in your environment. Thanks! |
This is an alternate approach to the PR #5484 from @wjordan. Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates. Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so). Resolves #5483 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Closing in favor of @kozlovic's PR. |
This is an alternate approach to the PR #5484 from @wjordan. Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates. Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so). Resolves #5483 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This is an alternate approach to the PR #5484 from @wjordan. Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates. Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so). Resolves #5483 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This is an alternate approach to the PR nats-io#5484 from @wjordan. Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates. Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so). Resolves nats-io#5483 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This is an alternate approach to the PR #5484 from @wjordan. Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates. Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so). Resolves #5483 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This is an alternate approach to the PR #5484 from @wjordan. Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates. Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so). Resolves #5483 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This PR attempts to prevent duplicate routes by checking the existing route-connection count before attempting to create a new connection.
Fixes #5483.
Before attempting a new connection,
connectToRoute
checkss.numRouteConns
(the total sum of registered routes, unregistered routes, and pending route-connections), and skips the connection attempt if the server is already at its limit. This check prevents a pileup of pending connections (which will be eventually closed as duplicate routes) described in the linked issue #5483.In order to track pending route-connections, this PR adds a counter map (
s.pendingRouteConns
) that is incremented on each route-connection attempt and decremented after the connection is established/aborted.Signed-off-by: Your Name will.jordan@gmail.com