[HZ-1013 HZ-1059 HZ-1051] Fix hanging cluster safe query from common pool #21145

Merged

Conversation

vbekiaris

@vbekiaris vbekiaris commented Apr 4, 2022

When all FJP#commonPool threads are busy querying isClusterSafe
(this appears to happen, for example, when querying via the
PartitionService MBean) and partition assignments are not in sync
(e.g. during initial partition arrangement), the callback that must
run after PartitionBackupReplicaAntiEntropyOperation completes never
gets a thread to run on. As a result, neither partition replica sync
nor the cluster-safe query can make any progress.
The fix is to run the callback that processes the replica anti-entropy
operation result on the Hazelcast internal async executor instead of
the common pool.
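
The starvation described above can be reproduced in isolation. The following is a minimal sketch, not Hazelcast code: a small `ForkJoinPool` stands in for the common pool, a single-thread executor stands in for the Hazelcast internal async executor, and every name (`StarvationSketch`, `callbackRuns`, `sharedPool`, `dedicated`) is hypothetical. The blocked workers play the role of the isClusterSafe queries, and the `CompletableFuture` callback plays the role of the anti-entropy result callback.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

class StarvationSketch {

    /**
     * Returns true if the completion callback ran within 500 ms.
     * All names here are illustrative; none of them are Hazelcast APIs.
     */
    static boolean callbackRuns(boolean useDedicatedExecutor) {
        int poolSize = 2;
        ForkJoinPool sharedPool = new ForkJoinPool(poolSize);            // stand-in for the common pool
        ExecutorService dedicated = Executors.newSingleThreadExecutor(); // stand-in for an internal async executor
        CountDownLatch callbackDone = new CountDownLatch(1);
        CountDownLatch workersStarted = new CountDownLatch(poolSize);
        try {
            // Occupy every shared-pool thread with a "query" that waits for the callback.
            for (int i = 0; i < poolSize; i++) {
                sharedPool.execute(() -> {
                    workersStarted.countDown();
                    try {
                        callbackDone.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            if (!workersStarted.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("shared pool workers did not start");
            }

            // The "operation" finishes, and its callback is scheduled on the chosen executor.
            CompletableFuture<String> operation = new CompletableFuture<>();
            Executor callbackExecutor = useDedicatedExecutor ? dedicated : sharedPool;
            operation.whenCompleteAsync((result, error) -> callbackDone.countDown(), callbackExecutor);
            operation.complete("done");

            // On the saturated shared pool the callback stays queued and this times out.
            return callbackDone.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            callbackDone.countDown(); // release the blocked workers
            sharedPool.shutdownNow();
            dedicated.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("callback on shared pool ran: " + callbackRuns(false));
        System.out.println("callback on dedicated executor ran: " + callbackRuns(true));
    }
}
```

In this sketch the workers block on a plain `CountDownLatch`, so the pool does not add compensation threads, which is what lets the queued callback starve. Moving only the callback to a separate executor breaks the cycle without touching the blocked callers.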

Fixes #19672
Fixes #18286
Fixes #19665

Checklist:

  • Send backports/forwardports if fix needs to be applied to past/future releases

Edit: see also #19672 (comment) for how this issue might occur.

@vbekiaris vbekiaris added this to the 5.2 milestone Apr 4, 2022
@vbekiaris vbekiaris changed the title Fix hanging cluster safe query from common pool [HZ-1013] Fix hanging cluster safe query from common pool Apr 4, 2022

@ahmetmircik ahmetmircik left a comment


good finding 👏

@hz-devops-test

The job Hazelcast-pr-EE-compiler of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

--------------------------
---------SUMMARY----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/nio/ssl/MemberTLSChannelInitializer.java:[32,37] error: incompatible types: InboundHandler[] cannot be converted to InboundHandler
--------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hazelcast-enterprise: Compilation failure
--------------------------
---------ERRORS-----------
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/nio/ssl/MemberTLSChannelInitializer.java:[32,37] error: incompatible types: InboundHandler[] cannot be converted to InboundHandler
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/nio/ssl/MemberTLSChannelInitializer.java:[32,37] error: incompatible types: InboundHandler[] cannot be converted to InboundHandler
--------------------------

@AyberkSorgun AyberkSorgun changed the title [HZ-1013] Fix hanging cluster safe query from common pool [HZ-1013 HZ-1059] Fix hanging cluster safe query from common pool May 9, 2022
@AyberkSorgun AyberkSorgun changed the title [HZ-1013 HZ-1059] Fix hanging cluster safe query from common pool [HZ-1013 HZ-1059 HZ-1051] Fix hanging cluster safe query from common pool May 9, 2022
4 participants