
Remove sleep statements - Reduce delays when joining a cluster #18932

Merged
1 commit merged into hazelcast:master on Oct 14, 2021

Conversation

lprimak
Contributor

@lprimak lprimak commented Jun 18, 2021

Re-introduction of #18267
Relates to #18751

Eliminated sleep delays when joining a cluster; shaves 2-3 seconds off the join procedure.
Removed sleep statements and replaced them with events (see the sketch below).
Relates to #17427 and payara/Payara#4858
Clean version of the "client side" fix for #17428
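
To illustrate the general pattern behind the change: a minimal sketch, assuming a simple latch-based event. This is illustrative only, not the actual Hazelcast join code; the class and method names are hypothetical.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    class JoinWaiter {
        private final CountDownLatch joined = new CountDownLatch(1);

        // Before: callers polled with Thread.sleep(...) in a loop, paying
        // the full sleep interval even when the join had already completed.
        // After: callers block on the latch and wake up as soon as the
        // join-completed event fires.
        boolean awaitJoin(long timeout, TimeUnit unit) throws InterruptedException {
            return joined.await(timeout, unit);
        }

        // Invoked from the join-completion path instead of relying on
        // callers to wake from sleep and re-check state.
        void onJoinCompleted() {
            joined.countDown();
        }
    }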

@hz-devops-test hz-devops-test added the Source: Community PR or issue was opened by a community user label Jun 18, 2021
@devOpsHazelcast
Collaborator

Can one of the admins verify this patch?

@lprimak lprimak changed the title Cluster join delays Remove sleep statements - Reduce delays when joining a cluster Jun 18, 2021
@lprimak lprimak marked this pull request as ready for review June 18, 2021 06:32
@lprimak lprimak requested a review from a team as a code owner June 18, 2021 06:32
@lprimak
Contributor Author

lprimak commented Jun 18, 2021

Unable to reproduce #18751 at all

@jbartok
Contributor

jbartok commented Jun 18, 2021

I can just as easily reproduce the problem on this branch. Maybe my hardware is different and because of that the delay is not suitable for reproduction on your machine, but the problem is definitely there.

[screenshot of the test failure]

Maybe these changes aren't the actual cause of the problem, but they make it much more likely to appear. Let's see when we get some more time to put into fixing it. Until then we can't leave these changes in master.

@lprimak
Contributor Author

lprimak commented Jun 18, 2021

@jbartok Thank you for putting the effort in. I agree with your assessment of this problem, as it's the same as mine :)
I just had a thought: perhaps the test itself is wrong. Instead of assertClusterSize() it needs to be assertClusterSizeEventually() (see the sketch below).

Since I cannot reproduce the failure myself, I will change the test. Can you then try to reproduce it on this branch?
Thank you!
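
For reference, a rough sketch of the difference, assuming the HazelcastTestSupport helpers and two already-started members (the instance names are hypothetical):

    // Inside a test extending HazelcastTestSupport, with two members started:
    // assertClusterSize checks the membership count at a single instant,
    // so it can fail while a join is still in flight.
    assertClusterSize(2, instance1, instance2);

    // assertClusterSizeEventually retries until the assertion passes or a
    // timeout elapses, tolerating the asynchronous join this PR speeds up.
    assertClusterSizeEventually(2, instance1, instance2);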

@lprimak
Contributor Author

lprimak commented Jun 20, 2021

After a more thorough review of this problem, I finally figured out exactly what's happening:

  • WAIT_SECONDS_BEFORE_JOIN is set to 0 in TcpIpJoinTest

There are a number of existing bugs in Hazelcast that are uncovered when WAIT_SECONDS_BEFORE_JOIN is set to 0
(example: #17586).
The bug precisely described by @jbartok in #18751 (comment) is another issue caused by setting WAIT_SECONDS_BEFORE_JOIN to 0.
#18751 needs to be reopened.

Prior to the introduction of this PR, setting WAIT_SECONDS_BEFORE_JOIN to 0 was effectively equivalent to setting it to 3 due to additional sleep delays that are present in the client join logic.

With the introduction of this PR, setting WAIT_SECONDS_BEFORE_JOIN to 0 truly sets it to zero.

Proposed Solutions:

  • Set WAIT_SECONDS_BEFORE_JOIN in TcpIpJoinTest to 1
  • Set minimum value of WAIT_SECONDS_BEFORE_JOIN to 1
  • Fix the underlying issues

In this PR, I will set WAIT_SECONDS_BEFORE_JOIN in TcpIpJoinTest to 1 to have minimal impact and have the tests pass correctly, despite the underlying issues not being fixed yet.
This will leave the join functionality post-PR in the same state as pre-PR, while still removing the extra sleep-induced delays as per this PR's design.
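
For context, setting the property programmatically looks roughly like this. A sketch mirroring the one-line test change in this PR, not the full test:

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.spi.properties.ClusterProperty;

    public class JoinWaitExample {
        public static void main(String[] args) {
            Config config = new Config();
            // The value is in seconds: "1" keeps a minimal pre-join safety
            // window, while "0" disables the wait entirely, which (as argued
            // above) exposes pre-existing join race conditions.
            config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "1");
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            hz.shutdown();
        }
    }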

Thanks, and I appreciate you working with me on this.

@lprimak
Contributor Author

lprimak commented Jun 20, 2021

Just to be clear, the only difference between this PR and the original PR that was already approved & merged is
to set WAIT_SECONDS_BEFORE_JOIN in TcpIpJoinTest to 1

Hopefully this can get approved and merged again quickly :)

@jbartok
Contributor

jbartok commented Jun 29, 2021

Hi @lprimak. I understand what you are saying about WAIT_SECONDS_BEFORE_JOIN: it wasn't possible to actually set it to 0 before, and these changes make that possible. We also acknowledge that your changes don't really break anything, just expose other underlying problems; see this issue I've created: #18980.

But the fact of the matter is that if we merge in these changes now, without fixing the underlying problems, then we'll release a version where problems can and will happen. We can't do that.

Leave this PR open and as soon as we have the resources to fix the root cause of the problems we will be able to merge this in too.

@lprimak
Contributor Author

lprimak commented Jun 29, 2021

@jbartok I am still puzzled by the issue you are bringing up.
Setting WAIT_SECONDS_BEFORE_JOIN to zero already surfaces a lot of issues in current 4.x without this PR and, dare I say it, that configuration mostly does not work today.

Merging this PR introduces the race condition only when WAIT_SECONDS_BEFORE_JOIN is set to zero.
Since this configuration doesn't work currently anyway, I don't see the harm in adding the race condition to an already non-working scenario.

Also, the other solution would be to set the minimum WAIT_SECONDS_BEFORE_JOIN to 1, which would make all of these problems disappear altogether.

What do you think?

@vbekiaris vbekiaris self-requested a review October 1, 2021 08:41
@@ -35,7 +35,7 @@
 public class AbstractJoinTest extends HazelcastTestSupport {

     protected void testJoin(Config config) throws Exception {
-        config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "0");
+        config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "1");
Contributor

I ran TcpIpJoinTest several times, even with WAIT_SECONDS_BEFORE_JOIN set to 0 and with a 1500 ms delay in ClusterServiceImpl#finalizeJoin as described in #18751 (comment), and didn't get a failure. It's not a guarantee that it won't fail, but it seems the original circumstances under which this test used to fail no longer hold. Wdyt about reverting this change to get a few more PR builder runs?

Contributor Author

Doesn't matter to me. I will do it if it'll get this PR merged :)
Keeping the behavior the same dictates 1 second, but if the problem indeed went away due to some other changes, 0 will do just as well.

Contributor Author

@jbartok what do you think?

Contributor

@sancar sancar left a comment

Ok for the client side again :)

@mmedenjak mmedenjak added this to the 5.1 milestone Oct 8, 2021
@hazelcast hazelcast deleted 4 comments from vbekiaris Oct 8, 2021
@mmedenjak
Contributor

Hi all, can we merge this? Now is a prime time to merge PRs like this one :)

@vbekiaris vbekiaris merged commit b787218 into hazelcast:master Oct 14, 2021
@mmedenjak
Contributor

Thank you to @lprimak for the PR and nerves of steel, and to @vbekiaris and @sancar for the reviews :) Now we're in a good position to fix any bugs that pop up as a result of this change before releasing it.

@lprimak lprimak deleted the CLUSTER-JOIN-DELAYS branch October 17, 2021 19:00
arodionov added a commit to arodionov/hazelcast that referenced this pull request Jan 19, 2022
PR hazelcast#18932, which eliminates delays when joining a cluster, makes it more likely that duplicate connections are established between nodes.
That makes the previous version of the test, which was based on the pipeline load metric, quite unstable.

The proposed solution is to increase IO_THREAD_COUNT to the maximum possible connection count (including duplicates) per instance. The IOBalancer should rebalance the connections equally between threads; as a result, no thread should have more than one connection.

Related issue: hazelcast#19801
arodionov added a commit to arodionov/hazelcast that referenced this pull request Jan 23, 2022
PR hazelcast#18932 (which eliminates delays when joining a cluster) makes it more likely that duplicate connections are established between nodes.
That makes the previous version of the test, which was based on the pipeline load metric, quite unstable.

The proposed solution is to increase IO_THREAD_COUNT to the maximum possible connection count (including duplicates) per instance. The IOBalancer should rebalance the connections equally between threads. As a result, each thread should have no more than one active pipeline (whose load value periodically increases) and possibly several non-active pipelines (whose load value does not change).

Related issue: hazelcast#19801
arodionov added a commit that referenced this pull request Jan 24, 2022
PR #18932 (which eliminates delays when joining a cluster) makes it more likely that duplicate connections are established between nodes.
That makes the previous version of the test, which was based on the pipeline load metric, quite unstable.

The proposed solution is to increase IO_THREAD_COUNT to the maximum possible connection count (including duplicates) per instance. The IOBalancer should rebalance the connections equally between threads. As a result, each thread should have no more than one active pipeline (whose load value periodically increases) and possibly several non-active pipelines (whose load value does not change).

Related issue: #19801
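
A rough sketch of the test-side mitigation described in this commit message, assuming ClusterProperty.IO_THREAD_COUNT is the relevant property; the member count and connection arithmetic are illustrative, not taken from the actual test:

    import com.hazelcast.config.Config;
    import com.hazelcast.spi.properties.ClusterProperty;

    public class IoThreadCountExample {
        public static void main(String[] args) {
            // With N members, an instance may briefly hold up to 2 * (N - 1)
            // connections while duplicates exist (one per direction per peer).
            int memberCount = 3; // hypothetical test cluster size
            int maxConnections = 2 * (memberCount - 1);

            // One IO thread per possible connection lets the IOBalancer spread
            // pipelines so that no thread carries more than one connection.
            Config config = new Config();
            config.setProperty(ClusterProperty.IO_THREAD_COUNT.getName(),
                    String.valueOf(maxConnections));
        }
    }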