Graceful shutdown improvements #7989

merged 5 commits into from Apr 26, 2016


None yet

6 participants

mdogan commented Apr 20, 2016

Goal is to ensure data safety while shutting down multiple nodes concurrently. Data safety is guaranteed when nodes are shutdown (gracefully) even when there are no backups configured.
During shutdown process, all data (partition replicas) owned by shutting down member will be migrated to other alive nodes in the cluster. A member is only allowed to shutdown, when it has no remaining ownership assignments in partition table.

For more info see:

@mdogan mdogan added this to the 3.7 milestone Apr 20, 2016
metanet and others added some commits Apr 7, 2016
@metanet @mdogan metanet Start graceful shutdown impl
- Simplified PartitionStateGenerator interface
	* Merged `initialize`and `reArrange` methods to single `arrange`.
- Introduce ShutdownRequest operation and shutdown requested node state in MigrationManager
- Rollback successfully completed migration if destination is requested to shutdown
- Exclude shutdown requested members during repartitioning
- Check shutdown requested nodes after migrations are completed
- Apply AllowedDuringPassiveState interface only for replica sync and partition table operations
	* Migrations are not allowed during node is in PASSIVE state. Only replications and partition table operations are allowed
@mdogan mdogan Implement safe shutdown await and response mechanisms
- Allow shutdown response operation when node is passive
- Shutdown request & response should not return response
- Lite member should skip safe shutdown procedure
- Allow migration operations when node is passive, except migration-commit
- Master should directly call onShutdownRequest() instead of sending op
- Apply graceful shutdown fixes
	* Send shutdown operation immediately if partition table is not initialized
	* Fix CheckShutdownRequestsTask scheduling in RepartitioningTask
- MigrationCommitOperation should check node state explicitly
- Migration planner logging improvements
- PartitionStateGenerator should fill null gaps in partition table by
swapping cold replicas with hotter ones
- Add graceful shutdown tests
@metanet @mdogan metanet Improve & simplify SHIFT DOWN & UP migrations
- Improve SHIFT UP migration decisions when source is not null
- SHIFT DOWN migration is not needed if source new index is already owned by another node. If it is the case, we can perform 2 MOVE migrations instead of a single SHIFT DOWN and safety is not broken. SHIFT DOWN is only needed if source new index is not owned by any node before.
- Remove unused `MigrationAwareService.clearPartitionReplica()`
@metanet @mdogan metanet Node running and state check improvements
- Check if node has shut down before sending operation failure response
	* This check should be done with node state instead of isRunning() method since nodes have more work to do (migrations) now during shutdown.
- Invocation.engineActive() and MockConnectionManager should check node-state explicitly
- Check if master is running during migration commit if master is the destination
- Master should allow direct shutdown when cluster is FROZEN or PASSIVE
- Remove locking in partition state publishing methods
- Improved & added graceful shutdown tests
@mdogan mdogan Implement PartitionService safety checking and enforcing
- `prepareToSafeShutdown()` should return immediately if node is not joined
- FetchMostRecentPartitionTableTask shouldn't increment version when partition table is not initialized
- `hasMissingReplicaOwners()` should check for unknown members in ptable
- MockConnectionManager.getOrConnect() should not create new connection after it's stopped/shutdown.
- Change migration result assertion in MigrationCommitServiceTest
- Add more graceful shutdown tests, including cluster states
eminn commented Apr 22, 2016

The changes complies with design document and explanatory commit messages in place this PR looks good to be merged ๐Ÿ‘

@enesakar enesakar commented on the diff Apr 25, 2016
@@ -50,10 +57,33 @@ public void tearDown() {
- final protected void terminateInstances(List<HazelcastInstance> terminatingInstances) {
- for (HazelcastInstance instance : terminatingInstances) {
- instance.getLifecycleService().terminate();
+ final protected void stopInstances(List<HazelcastInstance> terminatingInstances, final NodeLeaveType nodeLeaveType) {
enesakar Apr 25, 2016 Member

minor one: I would prefer to rename terminatingInstances to stoppingInstances.

jerrinot Apr 26, 2016 Contributor




@jerrinot jerrinot commented on the diff Apr 26, 2016
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import static org.hamcrest.Matchers.greaterThanOrEqualTo;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertThat;
+import static org.junit.Assert.assertTrue;
+@Category({QuickTest.class, ParallelTest.class})
+public class GracefulShutdownCorrectnessTest extends PartitionCorrectnessTestSupport {
jerrinot Apr 26, 2016 Contributor

this is running 6 minutes on my box. do we want to run it with each merge build?

mdogan Apr 26, 2016 Member

We can cut its time down later, either reducing parameters or partition count or both. For now we need to ensure it's passing on each build to be able to detect any possible hidden issue.
Also if we convert this to a slow test now, it will run loooonger because nighty build doesn't run parallel.

jerrinot Apr 26, 2016 Contributor

does it matter if a nightly build is longer?
cc @hasancelik

mdogan Apr 26, 2016 Member

Doesn't it?

jerrinot Apr 26, 2016 Contributor

If I have to choose between e.g. 1 extra minute in a merge build or 10 minutes in a nightly then I'll pick the 2nd option. It's not a on critical path during development.

mdogan Apr 26, 2016 Member

Then we have to convert nearly all new migration tests to nightly too. Most of them takes more than a minute, maybe more than a few. Are you ok with this?

jerrinot Apr 26, 2016 Contributor

Good balance is needed. obviously the definition of "good balance" is open to interpretation:)

Also it's a question what's impact of 5 minutes long parallel test on a merge build duration. chances are it won't be that bad thanks to JVM-level parallelization.

jerrinot Apr 26, 2016 Contributor

for the record: I don't consider this a blocker for this PR. we can move the tests to a nightly category later.

mdogan Apr 26, 2016 edited Member

Before merge:

11:10:02 [INFO] ------------------------------------------------------------------------
11:10:02 [INFO] Reactor Summary:
11:10:02 [INFO] 
11:10:02 [INFO] Hazelcast Root .................................... SUCCESS [0.510s]
11:10:02 [INFO] hazelcast ......................................... SUCCESS [16:58.865s]
11:10:02 [INFO] hazelcast-client .................................. SUCCESS [5:45.603s]
11:10:02 [INFO] hazelcast-spring .................................. SUCCESS [59.595s]
11:10:02 [INFO] hazelcast-build-utils ............................. SUCCESS [0.681s]
11:10:02 [INFO] ------------------------------------------------------------------------

After merge:

12:41:22 [INFO] ------------------------------------------------------------------------
12:41:22 [INFO] Reactor Summary:
12:41:22 [INFO] 
12:41:22 [INFO] Hazelcast Root .................................... SUCCESS [0.488s]
12:41:22 [INFO] hazelcast ......................................... SUCCESS [18:49.985s]
12:41:22 [INFO] hazelcast-client .................................. SUCCESS [7:56.258s]
12:41:22 [INFO] hazelcast-spring .................................. SUCCESS [1:04.291s]
12:41:22 [INFO] hazelcast-build-utils ............................. SUCCESS [0.681s]
12:41:22 [INFO] ------------------------------------------------------------------------

Effect on core module is 2mins. Client module build time seems increased 2mins too, I guess reason is client tests are using hazelcast.shutdown().



@enesakar enesakar commented on the diff Apr 26, 2016
@@ -691,7 +689,62 @@ public InternalPartition getPartition(int partitionId, boolean triggerOwnerAssig
public boolean prepareToSafeShutdown(long timeout, TimeUnit unit) {
- return partitionReplicaStateChecker.prepareToSafeShutdown(timeout, unit);
+ if (!node.joined()) {
+ return true;
+ }
+ if (node.isLiteMember()) {
+ return true;
enesakar Apr 26, 2016 Member

How do we handle the ongoing tasks run by executor service? Is it in scope of "fault tolerant executor service"?

mdogan Apr 26, 2016 Member

We only guarantee partitions' safety (resources related to partitions) during shutdown. Services should handle safety of other non-partition resources by theirselves.

metanet commented Apr 26, 2016

Thanks for the party guys. Merging this one

@metanet metanet merged commit c6bc1c4 into hazelcast:master Apr 26, 2016

1 check passed

default 9285 tests run, 35 skipped, 0 failed.
@mdogan mdogan deleted the mdogan:graceful-shutdown-pr branch Apr 26, 2016
@enesakar enesakar commented on the diff Apr 26, 2016
@@ -48,6 +50,11 @@ public MigrationCommitOperation(PartitionRuntimeState partitionState) {
public void run() {
+ NodeEngine nodeEngine = getNodeEngine();
+ if (!nodeEngine.isRunning()) {
enesakar Apr 26, 2016 Member

I see usages with nodeEngine.getNode().getState() == NodeState.SHUT_DOWN . Are they equivalent? nodeEngine.isRunning and nodeState?

metanet Apr 26, 2016 Contributor

When a node starts the shut down / termination process, its node state becomes PASSIVE. After this moment, nodeEngine.isRunning() always returns false. node state becomes SHUT_DOWN after the shutdown process is completed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment