Remove sleep statements - Reduce delays when joining a cluster #18932
Conversation
Can one of the admins verify this patch?
Force-pushed from b19ba43 to 2a81070
Unable to reproduce #18751 at all
I can just as easily reproduce the problem on this branch. Maybe my hardware is different and because of that the delay is not suitable for reproduction on your machine, but the problem is definitely there. Maybe these changes aren't the actual cause of the problem, but they make it much more likely to appear. Let's see when we get some more time to put into fixing it. Until then we can't leave these changes in master.
@jbartok Thank you for putting the effort in. I agree with your assessment of this problem, as it's the same as mine :) Since I cannot reproduce it, I will change the test and see if you can still reproduce it on this branch?
After a more thorough review of this problem, I finally figured out exactly what's happening:
There are a number of existing bugs in Hazelcast that are uncovered when …
Prior to the introduction of this PR, setting …
With the introduction of this PR, setting …
Proposed solutions:
In this PR, I will set …
Thanks, and I appreciate your working with me on this
Removed sleep statements and replaced them with events
Force-pushed from b877962 to e429480
Just to be clear, the only difference between this PR and the original PR that was already approved & merged is …
Hopefully this can get approved and merged again quickly :)
Hi @lprimak. I understand what you are saying about …
But the fact of the matter is that if we merge in these changes now, without fixing the underlying problems, then we'll release a version where problems can and will happen. We can't do that. Leave this PR open and as soon as we have the resources to fix the root cause of the problems we will be able to merge this in too.
@jbartok I am still puzzled by the issue you are bringing up. Merging this PR in introduces the race condition only when …
Also, the other solution would be to set the minimum …
What do you think?
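A hedged sketch of the "enforce a minimum" idea floated above: clamping the configured pre-join wait means a user-supplied 0 can no longer disable the wait entirely. The method name and signature here are illustrative, not Hazelcast's actual code.

```java
// Illustrative sketch, not Hazelcast internals: clamp the configured
// WAIT_SECONDS_BEFORE_JOIN-style value to an enforced minimum.
public class MinWaitSketch {
    static long effectiveWaitMillis(long configuredSeconds, long minimumSeconds) {
        // A configured 0 is raised to the minimum; larger values pass through.
        return Math.max(configuredSeconds, minimumSeconds) * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(effectiveWaitMillis(0, 1)); // prints 1000
        System.out.println(effectiveWaitMillis(5, 1)); // prints 5000
    }
}
```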
@@ -35,7 +35,7 @@
 public class AbstractJoinTest extends HazelcastTestSupport {

 protected void testJoin(Config config) throws Exception {
-    config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "0");
+    config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "1");
I ran TcpIpJoinTest several times, even with WAIT_SECONDS_BEFORE_JOIN set to 0 and with a 1500 ms delay in ClusterServiceImpl#finalizeJoin as described in #18751 (comment), and didn't get a failure. It's not a guarantee that it won't fail, but it seems the original circumstances under which this test used to fail no longer hold. wdyt about reverting this change to get a few more PR builder runs?
Doesn't matter to me. I will do it if it'll get this PR merged :)
Keeping behavior the same dictates a value of 1, but if the problem indeed went away due to some other changes, 0 will do just as well.
@jbartok what do you think?
Ok for the client side again :)
Hi all, can we merge this? Now is a prime time to merge PRs like this one :)
Thank you to @lprimak for the PR and nerves of steel, and to @vbekiaris and @sancar for the reviews :) Now we're in a good position to fix any bugs that pop up as a result of this change before releasing it.
The PR #18932 (which eliminates delays when joining a cluster) makes it more probable that duplicate connections are established between nodes. This makes the previous test version, based on the pipeline load metric, quite unstable. The proposed solution is to increase IO_THREAD_COUNT to the maximum possible connection count (including duplicates) per instance. The IOBalancer should then rebalance the connections equally between threads. As a result, each thread should have no more than one active pipeline (whose load value is periodically increased) and possibly several non-active pipelines (whose load values don't change). Related issue: #19801
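The rebalancing invariant described above can be sketched without any Hazelcast code: if connections (including duplicates) are spread round-robin over at least as many IO threads as there are connections, no thread ends up owning more than one. The class and method names below are illustrative assumptions, not the actual IOBalancer API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the invariant, not the real Hazelcast IOBalancer:
// round-robin assignment of pipelines to IO threads.
public class BalanceSketch {
    static Map<Integer, List<String>> assign(List<String> pipelines, int ioThreadCount) {
        Map<Integer, List<String>> owners = new HashMap<>();
        for (int i = 0; i < pipelines.size(); i++) {
            owners.computeIfAbsent(i % ioThreadCount, k -> new ArrayList<>())
                  .add(pipelines.get(i));
        }
        return owners;
    }

    public static void main(String[] args) {
        // 5 connections, duplicates included, with IO_THREAD_COUNT raised to 5:
        List<String> conns = List.of("A", "A-dup", "B", "C", "C-dup");
        Map<Integer, List<String>> owners = assign(conns, 5);
        // With threads >= connections, every thread owns at most one pipeline.
        owners.values().forEach(list -> {
            if (list.size() > 1) throw new AssertionError("thread owns more than one pipeline");
        });
        System.out.println("max pipelines per thread: 1");
    }
}
```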
Re-introduction of #18267
Relates to #18751
Eliminated sleep delays when joining a cluster, shaving 2-3 seconds off the join procedure.
Removed sleep statements and replaced them with events
Relates to #17427 and payara/Payara#4858
Clean version of the "client side" fix for #17428
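The core pattern this PR applies (replacing fixed sleeps with event signaling) can be sketched generically with a CountDownLatch. This is a minimal illustration of the technique, not the actual Hazelcast join code; the thread body and names are assumptions.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Generic sketch: wait for an event instead of sleeping a fixed worst-case delay.
public class EventWaitSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch joined = new CountDownLatch(1);

        Thread member = new Thread(() -> {
            // ... perform the join handshake (elided) ...
            joined.countDown(); // signal completion instead of relying on a sleep
        });
        member.start();

        // Before: Thread.sleep(3000);  // always pays the worst-case delay
        // After: returns as soon as the event fires, with a timeout as a safety net.
        boolean ok = joined.await(5, TimeUnit.SECONDS);
        System.out.println("joined=" + ok);
    }
}
```

The timeout on `await` keeps the safety property of the old sleep (the caller never blocks forever) while removing the fixed 2-3 second cost on the happy path.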