Improvements for running Hazelcast persistence on kubernetes #21844

vbekiaris · 2022-07-26T12:54:05Z

Adds automated cluster state management for persistence on Kubernetes
Supports cluster-wide shutdown, rolling restart and partial member recovery from failure on Kubernetes
Fixes behaviour of the readiness probe with persistence enabled (https://github.com/hazelcast/hazelcast-enterprise/issues/3990)
Includes some other minor fixes related to persistence

Design document in EE side PR: https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

EE counterpart: https://github.com/hazelcast/hazelcast-enterprise/pull/5140

Best reviewed commit-by-commit

Backport the fixes in OperationRunnerImpl and MapProxySupport, and the change making pre-join ops AllowedDuringPassiveState, to 5.0.z and 5.1.z

hazelcast/src/main/java/com/hazelcast/spi/properties/ClusterProperty.java

hazelcast/src/main/java/com/hazelcast/instance/impl/Node.java

hazelcast/src/main/java/com/hazelcast/kubernetes/KubernetesClient.java

hasancelik · 2022-08-10T09:53:38Z

hazelcast/src/main/java/com/hazelcast/spi/utils/RestClient.java

+     * Since a watch implies a stream of updates from the server will be consumed, unlike other methods
+     * in this class, it is the responsibility of the consumer to disconnect the connection
+     * (by invoking {@link WatchResponse#disconnect()}) once the watch is no longer required.
+     */
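The contract described in this Javadoc (the consumer owns the connection and must disconnect it) can be sketched as follows. This is an illustration only: `WatchHandle`, `FakeWatch` and `consume` are hypothetical names invented for the example and are not the `RestClient` or `WatchResponse` API from this PR.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a watch handle; NOT the actual WatchResponse API. */
interface WatchHandle {
    /** Returns the next update line, or null when the server closes the stream. */
    String nextLine() throws IOException;

    /** Releases the underlying connection. */
    void disconnect();
}

final class WatchConsumerSketch {
    /** Reads up to maxEvents updates, then always disconnects, even if reading fails. */
    static List<String> consume(WatchHandle watch, int maxEvents) throws IOException {
        List<String> events = new ArrayList<>();
        try {
            String line;
            while (events.size() < maxEvents && (line = watch.nextLine()) != null) {
                events.add(line);
            }
        } finally {
            // The consumer, not the client, is responsible for closing the stream.
            watch.disconnect();
        }
        return events;
    }

    /** In-memory watch used only to exercise the sketch. */
    static final class FakeWatch implements WatchHandle {
        private final BufferedReader reader;
        boolean disconnected;

        FakeWatch(String payload) {
            reader = new BufferedReader(new StringReader(payload));
        }

        @Override
        public String nextLine() throws IOException {
            return reader.readLine();
        }

        @Override
        public void disconnect() {
            disconnected = true;
        }
    }

    public static void main(String[] args) throws IOException {
        FakeWatch watch = new FakeWatch("ADDED\nMODIFIED\nDELETED\n");
        System.out.println(consume(watch, 2) + " disconnected=" + watch.disconnected);
    }
}
```

Putting the disconnect in a `finally` block means the connection is released whether the consumer stops because it has seen enough updates, because the stream ended, or because reading threw.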


The following question is not related to the PR 🙂

What if we wanted to convert the existing discovery mechanism into a more dynamic one, like this watch-based mechanism? In the current version, the most recently started member discovers the existing ones by running the related REST call-based methods.

hazelcast/src/main/java/com/hazelcast/internal/ascii/rest/HttpGetCommandProcessor.java

hazelcast/src/main/java/com/hazelcast/instance/impl/ClusterTopologyIntentTracker.java

hazelcast/src/main/java/com/hazelcast/instance/impl/Node.java

hazelcast/src/main/java/com/hazelcast/kubernetes/KubernetesClient.java

hazelcast/src/main/java/com/hazelcast/instance/impl/Node.java

hazelcast/src/main/java/com/hazelcast/internal/cluster/impl/ClusterStateManager.java

ramizdundar · 2022-08-16T14:47:38Z

hazelcast/src/main/java/com/hazelcast/map/impl/proxy/MapProxySupport.java

+        if (getNodeEngine().isStartCompleted()) {
+            initializeIndexes();
+        } else {
+            initializeLocalIndexes();


Why does this fix HZ-1192? Or, to be more exact, why does the cluster-wide add index fail while the local add index does not fail during recovery?

The cluster-wide index addition clashes with operation execution restrictions during recovery.
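The kind of restriction meant here can be pictured with a small model. This is a sketch with hypothetical types (`SketchOp`, `AllowedWhilePassive`, `PassiveGateSketch`), not Hazelcast's operation runner; it assumes the restriction behaves like a marker-interface check, in the spirit of the AllowedDuringPassiveState marker named in the PR description.

```java
/** Hypothetical operation type; not a Hazelcast class. */
interface SketchOp {
    String name();
}

/** Hypothetical marker, modelled on the AllowedDuringPassiveState idea. */
interface AllowedWhilePassive { }

final class PassiveGateSketch {
    private final boolean passive;

    PassiveGateSketch(boolean passive) {
        this.passive = passive;
    }

    /** Runs the operation unless the gate is passive and the operation is unmarked. */
    String run(SketchOp op) {
        if (passive && !(op instanceof AllowedWhilePassive)) {
            throw new IllegalStateException(op.name() + " is not allowed while passive");
        }
        return op.name() + " executed";
    }

    /** An operation without the marker: rejected while passive. */
    static final class Unmarked implements SketchOp {
        @Override
        public String name() {
            return "unmarked-op";
        }
    }

    /** An operation carrying the marker: accepted in any state. */
    static final class Marked implements SketchOp, AllowedWhilePassive {
        @Override
        public String name() {
            return "marked-op";
        }
    }

    public static void main(String[] args) {
        PassiveGateSketch gate = new PassiveGateSketch(true);
        System.out.println(gate.run(new Marked()));
        try {
            gate.run(new Unmarked());
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this model, work that is submitted as an unmarked operation fails while the gate is passive, whereas work that is marked as allowed, or that does not go through the gate at all, still succeeds.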

I think it is wrong anyway to perform index initialization cluster-wide in MapProxySupport#initialize, and I would remove the initializeIndexes call altogether. We should only concern ourselves with locally owned partitions in proxy initialization.
@ahmetmircik wdyt?

Not sure what the reason was for creating indexes before start is completed.
Isn't it an option to throw an exception if start is not completed yet?

I had another look at this: making this local-only seems like a behavior change. With this change, instead of relying on operation system guarantees, proxies on remote nodes will be created relying on eventing system guarantees. This can introduce unexpected changes in effective behavior when the eventing system is busy and drops events.

As discussed with Ahmet, I will prepare a separate PR for the HZ-1192 fix, so it is easier to track and revert this commit before we merge this PR. For now, I am leaving the commit in to facilitate testing.

Extracted in #22485. I am still keeping the commit as part of this PR, as some testing is still ongoing with those branches. I will revert it before merge.

hazelcast/src/main/java/com/hazelcast/spi/impl/operationservice/Operation.java

hazelcast/src/main/java/com/hazelcast/internal/services/PreJoinAwareService.java

ramizdundar

I have a few minor comments left (basically what remains unresolved), but they are not blockers for merge. Two fairly important TODOs before merge could be:

Update ClusterTopologyIntentTracker Javadoc.
Separate 1192 changes from this PR.

@vbekiaris thank you for your efforts on this huge endeavour.

hazelcast/src/main/java/com/hazelcast/instance/impl/ClusterTopologyIntentTracker.java

This reverts commit f1e60e9.

hz-devops-test · 2022-10-14T09:21:56Z

The job Hazelcast-pr-EE-compiler of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

--------------------------
---------SUMMARY----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hazelcast-enterprise: Compilation failure
--------------------------
---------ERRORS-----------
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------
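This is the error javac reports when an abstract method is added to a supertype and a concrete implementation has not been updated yet; here that presumably means the EE side still needs its counterpart change. A minimal sketch of the situation, using hypothetical stand-in types (none of them are the real Hazelcast classes):

```java
/** Hypothetical stand-in for the intent type named in the build log. */
enum SketchIntent { STABLE, CHANGING }

/** Hypothetical supertype that has just gained a new abstract method. */
interface RestartServiceSketch {
    void setIntentOnMaster(SketchIntent intent);
}

/**
 * Concrete implementation. Without the override below, javac would reject this class
 * with "is not abstract and does not override abstract method setIntentOnMaster".
 */
final class IntegrationServiceSketch implements RestartServiceSketch {
    private SketchIntent lastIntent;

    @Override
    public void setIntentOnMaster(SketchIntent intent) {
        this.lastIntent = intent;
    }

    SketchIntent lastIntent() {
        return lastIntent;
    }

    public static void main(String[] args) {
        IntegrationServiceSketch service = new IntegrationServiceSketch();
        service.setIntentOnMaster(SketchIntent.CHANGING);
        System.out.println(service.lastIntent());
    }
}
```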

vbekiaris · 2022-10-14T09:24:45Z

Thanks for your comments & reviews, they made this PR much better.
I pushed the HZ-1192 revert. I will merge as soon as the PR builder is green and then prepare backports to the 5.2 / 5.2.z branches.

hz-devops-test · 2022-10-14T10:19:37Z

The job Hazelcast-pr-EE-compiler of your PR failed. (Hazelcast internal details: build log, artifacts).
The build log fragments are identical to those in the earlier failure report: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService.

…st#21844)

- Adds automated cluster state management for persistence on kubernetes
- Supports cluster-wide shutdown, rolling restart and partial member recovery from failure on kubernetes [HZ-1190] [HZ-1191] [HZ-1193]
- Fixes behaviour of readiness probe with persistence enabled [HZ-1349]
- Allows tuning either for speedy crash recovery with FROZEN state or availability of in-memory data structures with NO_MIGRATION state for missing members [HZ-1311]
- Fixes backup sync after single member crash recovery [HZ-1349]

Design document in EE side: https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

(cherry picked from commit 1ddc16e)

…22501)

- Adds automated cluster state management for persistence on kubernetes
- Supports cluster-wide shutdown, rolling restart and partial member recovery from failure on kubernetes [HZ-1190] [HZ-1191] [HZ-1193]
- Fixes behaviour of readiness probe with persistence enabled [HZ-1349]
- Allows tuning either for speedy crash recovery with FROZEN state or availability of in-memory data structures with NO_MIGRATION state for missing members [HZ-1311]
- Fixes backup sync after single member crash recovery [HZ-1349]

Design document in EE side: https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

(cherry picked from commit 1ddc16e)

1:1 clean backport of #21844 to 5.2.0 release branch
Also includes backport of #22512

Co-authored-by: Łukasz Dziedziul <lukasz.dziedziul@hazelcast.com>

…22502)

- Adds automated cluster state management for persistence on kubernetes
- Supports cluster-wide shutdown, rolling restart and partial member recovery from failure on kubernetes [HZ-1190] [HZ-1191] [HZ-1193]
- Fixes behaviour of readiness probe with persistence enabled [HZ-1349]
- Allows tuning either for speedy crash recovery with FROZEN state or availability of in-memory data structures with NO_MIGRATION state for missing members [HZ-1311]
- Fixes backup sync after single member crash recovery [HZ-1349]

Design document in EE side: https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

(cherry picked from commit 1ddc16e)

1:1 clean backport from #21844
Also includes backport of #22512

Co-authored-by: Łukasz Dziedziul <lukasz.dziedziul@hazelcast.com>

vbekiaris force-pushed the enhancements/5.2/k8s-persistence branch 2 times, most recently from 32feb6d to 7796bca Compare August 8, 2022 07:47

hazelcast deleted a comment from hz-devops-test Aug 8, 2022

vbekiaris force-pushed the enhancements/5.2/k8s-persistence branch 2 times, most recently from 426c074 to cbaada2 Compare August 8, 2022 15:38

hazelcast deleted a comment from hz-devops-test Aug 8, 2022

vbekiaris changed the title ~~[IGNORE] Draft PR~~ Improvements for running Hazelcast persistence on kubernetes Aug 8, 2022

vbekiaris marked this pull request as ready for review August 8, 2022 16:12

vbekiaris requested review from hasancelik and ramizdundar August 8, 2022 16:13

vbekiaris added Type: Enhancement Team: Core Source: Internal PR or issue was opened by an employee Module: Persistence Module: Kubernetes labels Aug 8, 2022

vbekiaris added this to the 5.2 milestone Aug 8, 2022

vbekiaris requested a review from a team as a code owner August 10, 2022 09:11

hasancelik suggested changes Aug 10, 2022

hasancelik reviewed Aug 10, 2022

hazelcast/src/main/java/com/hazelcast/internal/ascii/rest/HttpGetCommandProcessor.java

ramizdundar assigned vbekiaris Aug 10, 2022

ramizdundar reviewed Aug 16, 2022

hazelcast/src/main/java/com/hazelcast/spi/impl/operationservice/Operation.java

ramizdundar reviewed Aug 16, 2022

hazelcast/src/main/java/com/hazelcast/internal/services/PreJoinAwareService.java

hazelcast deleted a comment from hz-devops-test Oct 14, 2022

ramizdundar approved these changes Oct 14, 2022

vbekiaris mentioned this pull request Oct 14, 2022

Add index only on local during MapProxy init [HZ-1192] #22485

Merged

ahmetmircik reviewed Oct 14, 2022

hazelcast/src/main/java/com/hazelcast/instance/impl/ClusterTopologyIntentTracker.java (outdated)

ahmetmircik approved these changes Oct 14, 2022

vbekiaris added 3 commits October 14, 2022 12:10

javadoc / eliminate todo

24bc2a1

rename ClusterTopologyIntentTracker#shutdown -> destroy

6eeb8cc

Revert "[HZ-1192] Map proxy init: add index only on local"

c8d64c7

This reverts commit f1e60e9.

always put tracker in disco properties map

e7bfa59

olukas approved these changes Oct 14, 2022

vbekiaris merged commit 1ddc16e into hazelcast:master Oct 14, 2022

vbekiaris modified the milestones: 5.2.0, 5.3.0 Oct 14, 2022

vbekiaris mentioned this pull request Oct 14, 2022

Improvements for running Hazelcast persistence on kubernetes [5.2.0] #22501

Merged

vbekiaris mentioned this pull request Oct 14, 2022

Improvements for running Hazelcast persistence on kubernetes [5.2.z] #22502

Merged

hasancelik mentioned this pull request Oct 19, 2022

Permissions which are required for the persistence is not optional for non-persistence clusters #22538

Closed
