Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements for running Hazelcast persistence on kubernetes #21844

Merged
merged 60 commits into from Oct 14, 2022

Conversation

vbekiaris
Copy link
Collaborator

@vbekiaris vbekiaris commented Jul 26, 2022

  • Adds automated cluster state management for persistence on kubernetes
  • Supports cluster-wide shutdown, rolling restart and partial member recovery from failure on kubernetes
  • Fixes behaviour of readiness probe with persistence enabled ( https://github.com/hazelcast/hazelcast-enterprise/issues/3990 )
  • Includes some other minor fixes related to persistence

Design document in EE side PR: https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

EE counterpart: https://github.com/hazelcast/hazelcast-enterprise/pull/5140

Best reviewed commit-by-commit

  • Backport fixes in OperationRunnerImpl , MapProxySupport and making pre-join ops AllowedDuringPassiveState to 5.0.z and 5.1.z

@vbekiaris vbekiaris force-pushed the enhancements/5.2/k8s-persistence branch 2 times, most recently from 32feb6d to 7796bca Compare August 8, 2022 07:47
@hazelcast hazelcast deleted a comment from hz-devops-test Aug 8, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Aug 8, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Aug 8, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Aug 8, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Aug 8, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Aug 8, 2022
@vbekiaris vbekiaris force-pushed the enhancements/5.2/k8s-persistence branch 2 times, most recently from 426c074 to cbaada2 Compare August 8, 2022 15:38
@hazelcast hazelcast deleted a comment from hz-devops-test Aug 8, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Aug 8, 2022
@vbekiaris vbekiaris changed the title [IGNORE] Draft PR Improvements for running Hazelcast persistence on kubernetes Aug 8, 2022
@vbekiaris vbekiaris marked this pull request as ready for review August 8, 2022 16:12
@vbekiaris vbekiaris added this to the 5.2 milestone Aug 8, 2022
@vbekiaris vbekiaris requested a review from a team as a code owner August 10, 2022 09:11
* Since a watch implies a stream of updates from the server will be consumed, unlike other methods
* in this class, it is the responsibility of the consumer to disconnect the connection
* (by invoking {@link WatchResponse#disconnect()}) once the watch is no longer required.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upcoming question is not related to the PR 🙂

What if we want to convert the existing discovery mechanism into a mechanism like this one, a more dynamic version? In the current version, the latest member discovers existing ones via running related REST call-based methods.

if (getNodeEngine().isStartCompleted()) {
initializeIndexes();
} else {
initializeLocalIndexes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this fixes HZ-1192? Or to be more exact, why cluster wide add index fails but local add index doesn't fail during recovery?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cluster-wide index addition clashes with operation execution restrictions during recovery.

I think it is anyway wrong to perform index initialization cluster-wide anyway in MapProxySupport#initialize and I would remove the initializeIndexes call altogether. We should only concern ourselves with locally owned partitions in proxy initialization.
@ahmetmircik wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the reason was to create indexes before start-completed.
Isn't it an option to throw exception if start is not completed yet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look at this again, making this local only seems like a behavior change. With this change, instead of relying on operation system guarantees, remote nodes proxies will be created by eventing system guarantees. This can introduce unexpected changes in effective behavior, when eventing system is busy and it drops events.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed with Ahmet, I will prepare a separate PR for the HZ-1192 fix, so it is easier to track and revert this commit before we merge this PR. For now, I am leaving the commit in to facilitate testing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extracted in #22485 -- I still keep the commit as part of this PR as there is still some testing ongoing with those branches. Will revert it before merge.

@hazelcast hazelcast deleted a comment from hz-devops-test Oct 14, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Oct 14, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Oct 14, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Oct 14, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Oct 14, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Oct 14, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Oct 14, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Oct 14, 2022
Copy link
Contributor

@ramizdundar ramizdundar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have few minor comments (basically what is left unresolved) left but they are not blockers for merge. Two importantish TODOs before merge could be:

  • Update ClusterTopologyIntentTracker Javadoc.
  • Separate 1192 changes from this PR.

@vbekiaris thank you for your efforts on this huge endevour.

@hz-devops-test
Copy link

The job Hazelcast-pr-EE-compiler of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log file
--------------------------
---------SUMMARY----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hazelcast-enterprise: Compilation failure
--------------------------
---------ERRORS-----------
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------

@vbekiaris
Copy link
Collaborator Author

Thanks for your comments & reviews, they made this PR much better.
I pushed the HZ-1192 revert, will merge as soon as PR builder is green and then prepare backports to 5.2 / 5.2.z branches

@hz-devops-test
Copy link

The job Hazelcast-pr-EE-compiler of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log file
--------------------------
---------SUMMARY----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hazelcast-enterprise: Compilation failure
--------------------------
---------ERRORS-----------
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler_2/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/internal/hotrestart/HotRestartIntegrationService.java:[100,7] error: HotRestartIntegrationService is not abstract and does not override abstract method setClusterTopologyIntentOnMaster(ClusterTopologyIntent) in InternalHotRestartService
--------------------------

@vbekiaris vbekiaris merged commit 1ddc16e into hazelcast:master Oct 14, 2022
vbekiaris added a commit to vbekiaris/hazelcast that referenced this pull request Oct 14, 2022
…st#21844)

- Adds automated cluster state management for persistence on kubernetes
- Supports cluster-wide shutdown, rolling restart and partial member
recovery from failure on kubernetes [HZ-1190] [HZ-1191] [HZ-1193]
- Fixes behaviour of readiness probe with persistence enabled [HZ-1349]
- Allows tuning either for speedy crash recovery with FROZEN state
or availability of in-memory data structures with NO_MIGRATION state
for missing members [HZ-1311]
- Fixes backup sync after single member crash recovery [HZ-1349]

Design document in EE side:
https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

(cherry picked from commit 1ddc16e)
@vbekiaris vbekiaris modified the milestones: 5.2.0, 5.3.0 Oct 14, 2022
vbekiaris added a commit to vbekiaris/hazelcast that referenced this pull request Oct 14, 2022
…st#21844)

- Adds automated cluster state management for persistence on kubernetes
- Supports cluster-wide shutdown, rolling restart and partial member
recovery from failure on kubernetes [HZ-1190] [HZ-1191] [HZ-1193]
- Fixes behaviour of readiness probe with persistence enabled [HZ-1349]
- Allows tuning either for speedy crash recovery with FROZEN state
or availability of in-memory data structures with NO_MIGRATION state
for missing members [HZ-1311]
- Fixes backup sync after single member crash recovery [HZ-1349]

Design document in EE side:
https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

(cherry picked from commit 1ddc16e)
vbekiaris added a commit that referenced this pull request Oct 17, 2022
…22501)

- Adds automated cluster state management for persistence on kubernetes
- Supports cluster-wide shutdown, rolling restart and partial member
recovery from failure on kubernetes [HZ-1190] [HZ-1191] [HZ-1193]
- Fixes behaviour of readiness probe with persistence enabled [HZ-1349]
- Allows tuning either for speedy crash recovery with FROZEN state or
availability of in-memory data structures with NO_MIGRATION state for
missing members [HZ-1311]
- Fixes backup sync after single member crash recovery [HZ-1349]

Design document in EE side:

https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

(cherry picked from commit 1ddc16e)
1:1 clean backport of #21844 to 5.2.0 release branch

Also includes backport of #22512 

Co-authored-by: Łukasz Dziedziul <lukasz.dziedziul@hazelcast.com>
vbekiaris added a commit that referenced this pull request Oct 19, 2022
…22502)

- Adds automated cluster state management for persistence on kubernetes
- Supports cluster-wide shutdown, rolling restart and partial member
recovery from failure on kubernetes [HZ-1190] [HZ-1191] [HZ-1193]
- Fixes behaviour of readiness probe with persistence enabled [HZ-1349]
- Allows tuning either for speedy crash recovery with FROZEN state or
availability of in-memory data structures with NO_MIGRATION state for
missing members [HZ-1311]
- Fixes backup sync after single member crash recovery [HZ-1349]

Design document in EE side:

https://github.com/vbekiaris/hazelcast-enterprise/blob/enhancements/5.2/k8s-persistence/docs/design/persistence/04-persistence-kubernetes-improvements.md

(cherry picked from commit 1ddc16e)
1:1 clean backport from #21844 

Also includes backport of #22512 
Co-authored-by: Łukasz Dziedziul <lukasz.dziedziul@hazelcast.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants