
Remove sleep statements - Reduce delays when joining a cluster #18932

Merged
1 commit merged into hazelcast:master on Oct 14, 2021

Conversation

lprimak
Contributor

@lprimak lprimak commented Jun 18, 2021

Re-introduction of #18267
Relates to #18751

Eliminated sleep delays when joining a cluster; shaves 2-3 seconds off the join procedure.
Removed sleep statements and replaced them with events (see the sketch below).
Relates to #17427 and payara/Payara#4858
Clean version of the "client side" fix for #17428
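
To illustrate the general pattern behind the change: a minimal sketch, assuming a simple latch-based event. This is illustrative only, not the actual Hazelcast join code; the class and method names are hypothetical.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    class JoinWaiter {
        private final CountDownLatch joined = new CountDownLatch(1);

        // Before: callers polled with Thread.sleep(...) in a loop, paying
        // the full sleep interval even when the join had already completed.
        // After: callers block on the latch and wake up as soon as the
        // join-completed event fires.
        boolean awaitJoin(long timeout, TimeUnit unit) throws InterruptedException {
            return joined.await(timeout, unit);
        }

        // Invoked from the join-completion path instead of relying on
        // callers to wake from sleep and re-check state.
        void onJoinCompleted() {
            joined.countDown();
        }
    }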

@hz-devops-test hz-devops-test added the Source: Community PR or issue was opened by a community user label Jun 18, 2021
@devOpsHazelcast
Collaborator

Can one of the admins verify this patch?

@lprimak lprimak changed the title Cluster join delays Remove sleep statements - Reduce delays when joining a cluster Jun 18, 2021
@lprimak lprimak marked this pull request as ready for review June 18, 2021 06:32
@lprimak lprimak requested a review from a team as a code owner June 18, 2021 06:32
@lprimak
Contributor Author

lprimak commented Jun 18, 2021

Unable to reproduce #18751 at all

@jbartok
Contributor

jbartok commented Jun 18, 2021

I can just as easily reproduce the problem on this branch. Maybe my hardware is different and because of that the delay is not suitable for reproduction on your machine, but the problem is definitely there.

[screenshot of the test failure]

Maybe these changes aren't the actual cause of the problem, but they make it much more likely to appear. Let's see when we get some more time to put into fixing it. Until then we can't leave these changes in master.

@lprimak
Contributor Author

lprimak commented Jun 18, 2021

@jbartok Thank you for putting the effort in. I agree with your assessment of this problem, as it's the same as mine :)
I just had a thought: perhaps the test itself is wrong. Instead of assertClusterSize() it needs to be assertClusterSizeEventually() (see the sketch below).

Since I cannot reproduce the failure myself, I will change the test. Can you then try to reproduce it on this branch?
Thank you!
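
For reference, a rough sketch of the difference, assuming the HazelcastTestSupport helpers and two already-started members (the instance names are hypothetical):

    // Inside a test extending HazelcastTestSupport, with two members started:
    // assertClusterSize checks the membership count at a single instant,
    // so it can fail while a join is still in flight.
    assertClusterSize(2, instance1, instance2);

    // assertClusterSizeEventually retries until the assertion passes or a
    // timeout elapses, tolerating the asynchronous join this PR speeds up.
    assertClusterSizeEventually(2, instance1, instance2);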

@lprimak
Contributor Author

lprimak commented Jun 20, 2021

After a more thorough review of this problem, I finally figured out exactly what's happening:

  • WAIT_SECONDS_BEFORE_JOIN is set to 0 in TcpIpJoinTest

There are a number of existing bugs in Hazelcast that are uncovered when WAIT_SECONDS_BEFORE_JOIN is set to 0
(example: #17586).
The bug precisely described by @jbartok in #18751 (comment) is another issue caused by setting WAIT_SECONDS_BEFORE_JOIN to 0.
#18751 needs to be reopened.

Prior to the introduction of this PR, setting WAIT_SECONDS_BEFORE_JOIN to 0 was effectively equivalent to setting it to 3 due to additional sleep delays that are present in the client join logic.

With the introduction of this PR, setting WAIT_SECONDS_BEFORE_JOIN to 0 truly sets it to zero.

Proposed Solutions:

  • Set WAIT_SECONDS_BEFORE_JOIN in TcpIpJoinTest to 1
  • Set minimum value of WAIT_SECONDS_BEFORE_JOIN to 1
  • Fix the underlying issues

In this PR, I will set WAIT_SECONDS_BEFORE_JOIN in TcpIpJoinTest to 1 to have minimal impact and have the tests pass correctly, despite the underlying issues not being fixed yet.
This will leave the join functionality post-PR in the same state as pre-PR, while still removing the extra sleep-induced delays as per this PR's design.
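
For context, setting the property programmatically looks roughly like this. A sketch mirroring the one-line test change in this PR, not the full test:

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.spi.properties.ClusterProperty;

    public class JoinWaitExample {
        public static void main(String[] args) {
            Config config = new Config();
            // The value is in seconds: "1" keeps a minimal pre-join safety
            // window, while "0" disables the wait entirely, which (as argued
            // above) exposes pre-existing join race conditions.
            config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "1");
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            hz.shutdown();
        }
    }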

Thanks, and I appreciate you working with me on this.

@lprimak
Contributor Author

lprimak commented Jun 20, 2021

Just to be clear, the only difference between this PR and the original PR that was already approved & merged is
to set WAIT_SECONDS_BEFORE_JOIN in TcpIpJoinTest to 1

Hopefully this can get approved and merged again quickly :)

@jbartok
Contributor

jbartok commented Jun 29, 2021

Hi @lprimak. I understand what you are saying about WAIT_SECONDS_BEFORE_JOIN: it wasn't possible to actually set it to 0 before, and these changes make that possible. We also acknowledge that your changes don't really break anything, just expose other underlying problems; see this issue I've created: #18980.

But the fact of the matter is that if we merge in these changes now, without fixing the underlying problems, then we'll release a version where problems can and will happen. We can't do that.

Leave this PR open and as soon as we have the resources to fix the root cause of the problems we will be able to merge this in too.

@lprimak
Contributor Author

lprimak commented Jun 29, 2021

@jbartok I am still puzzled by the issue you are bringing up.
Setting WAIT_SECONDS_BEFORE_JOIN to zero already surfaces a lot of issues in current 4.x without this PR and, dare I say it, that configuration mostly does not work today.

Merging this PR introduces the race condition only when WAIT_SECONDS_BEFORE_JOIN is set to zero.
Since this configuration doesn't work currently anyway, I don't see the harm in adding the race condition to an already non-working scenario.

Also, the other solution would be to set the minimum WAIT_SECONDS_BEFORE_JOIN to 1, which would make all of these problems disappear altogether.

What do you think?

@vbekiaris vbekiaris self-requested a review October 1, 2021 08:41
@@ -35,7 +35,7 @@
 public class AbstractJoinTest extends HazelcastTestSupport {

     protected void testJoin(Config config) throws Exception {
-        config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "0");
+        config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "1");
Contributor

I ran TcpIpJoinTest several times, even with WAIT_SECONDS_BEFORE_JOIN set to 0 and with a 1500 ms delay in ClusterServiceImpl#finalizeJoin as described in #18751 (comment), and didn't get a failure. It's not a guarantee that it won't fail, but it seems the original circumstances under which this test used to fail no longer hold. Wdyt about reverting this change to get a few more PR builder runs?

Contributor Author

Doesn't matter to me. I will do it if it'll get this PR merged :)
Keeping the behavior the same dictates 1 second, but if the problem indeed went away due to some other changes, 0 will do just as well.

Contributor Author

@jbartok what do you think?

Contributor

@sancar sancar left a comment

Ok for the client side again :)

@mmedenjak mmedenjak added this to the 5.1 milestone Oct 8, 2021
@hazelcast hazelcast deleted 4 comments from vbekiaris Oct 8, 2021
@mmedenjak
Contributor

Hi all, can we merge this? Now is a prime time to merge PRs like this one :)

@vbekiaris vbekiaris merged commit b787218 into hazelcast:master Oct 14, 2021
@mmedenjak
Contributor

Thank you to @lprimak for the PR and nerves of steel, and to @vbekiaris and @sancar for the reviews :) Now we're in a good position to fix any bugs that pop up as a result of this change before releasing it.

@lprimak lprimak deleted the CLUSTER-JOIN-DELAYS branch October 17, 2021 19:00
arodionov added a commit to arodionov/hazelcast that referenced this pull request Jan 19, 2022
PR hazelcast#18932, which eliminates delays when joining a cluster, makes it more likely that duplicate connections are established between nodes.
That makes the previous version of the test, which was based on the pipeline load metric, quite unstable.

The proposed solution is to increase IO_THREAD_COUNT to the maximum possible connection count (including duplicates) per instance. The IOBalancer should rebalance the connections equally between threads; as a result, no thread should have more than one connection.

Related issue: hazelcast#19801
arodionov added a commit to arodionov/hazelcast that referenced this pull request Jan 23, 2022
PR hazelcast#18932 (which eliminates delays when joining a cluster) makes it more likely that duplicate connections are established between nodes.
That makes the previous version of the test, which was based on the pipeline load metric, quite unstable.

The proposed solution is to increase IO_THREAD_COUNT to the maximum possible connection count (including duplicates) per instance. The IOBalancer should rebalance the connections equally between threads. As a result, each thread should have no more than one active pipeline (whose load value periodically increases) and possibly several non-active pipelines (whose load value does not change).

Related issue: hazelcast#19801
arodionov added a commit that referenced this pull request Jan 24, 2022
PR #18932 (which eliminates delays when joining a cluster) makes it more likely that duplicate connections are established between nodes.
That makes the previous version of the test, which was based on the pipeline load metric, quite unstable.

The proposed solution is to increase IO_THREAD_COUNT to the maximum possible connection count (including duplicates) per instance. The IOBalancer should rebalance the connections equally between threads. As a result, each thread should have no more than one active pipeline (whose load value periodically increases) and possibly several non-active pipelines (whose load value does not change).

Related issue: #19801
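
A rough sketch of the test-side mitigation described in this commit message, assuming ClusterProperty.IO_THREAD_COUNT is the relevant property; the member count and connection arithmetic are illustrative, not taken from the actual test:

    import com.hazelcast.config.Config;
    import com.hazelcast.spi.properties.ClusterProperty;

    public class IoThreadCountExample {
        public static void main(String[] args) {
            // With N members, an instance may briefly hold up to 2 * (N - 1)
            // connections while duplicates exist (one per direction per peer).
            int memberCount = 3; // hypothetical test cluster size
            int maxConnections = 2 * (memberCount - 1);

            // One IO thread per possible connection lets the IOBalancer spread
            // pipelines so that no thread carries more than one connection.
            Config config = new Config();
            config.setProperty(ClusterProperty.IO_THREAD_COUNT.getName(),
                    String.valueOf(maxConnections));
        }
    }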