Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MigrationQueue deadlock #8217

Closed
ShijunK opened this issue May 19, 2016 · 4 comments

Comments

Projects
None yet
5 participants
@ShijunK
Copy link
Contributor

commented May 19, 2016

The internal queue and counter do not synced (queue size != 0, but counter = 0) in some situation (happens couple times, when try to orderly restart a cluster by invoke shutdown method on hazelcastinstance), which cause deadlock

Usually, it happens there is a migration task failure. I haven't been able to create a reliable producible configuration.

the log keeps loop until timeout

2016-05-19 12:59:41,481 [hz.NODE-K.migration] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Re-partitioning cluster data... Migration queue size: 1
2016-05-19 12:59:41,528 [hz.NODE-K.migration] WARN  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Migration failed: com.hazelcast.partition.MigrationInfo{partitionId=10, source=Address[xyz03]:5701, destination=Address[xyz04]:5703, master=Address[xyz04]:5702
, valid=true, processing=false}
2016-05-19 12:59:41,528 [hz.NODE-K.migration] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] All migration tasks have been completed, queues are empty.
2016-05-19 12:59:42,466 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:43,466 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:44,466 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:45,466 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:46,481 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:47,481 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:48,497 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:49,497 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:50,497 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:51,497 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:52,513 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:53,513 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:54,513 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:55,513 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:56,513 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:56,638 [hz.NODE-K.migration] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Re-partitioning cluster data... Migration queue size: 1
2016-05-19 12:59:56,638 [hz.NODE-K.migration] WARN  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Migration failed: com.hazelcast.partition.MigrationInfo{partitionId=10, source=Address[xyz03]:5701, destination=Address[xyz04]:5703, master=Address[xyz04]:5702
, valid=true, processing=false}
2016-05-19 12:59:56,638 [hz.NODE-K.migration] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] All migration tasks have been completed, queues are empty.
2016-05-19 12:59:57,528 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:58,528 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 12:59:59,528 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:00,528 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:01,528 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:02,528 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:03,528 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:04,544 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:05,544 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:06,544 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:07,560 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:08,575 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:09,575 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:10,575 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1
2016-05-19 13:00:11,575 [Thread-4306493] INFO  com.hazelcast.partition.InternalPartitionService - [xyz04]:5702 [xyz04] [3.6.1] Waiting for cluster migration tasks: 1

@ShijunK

This comment has been minimized.

Copy link
Contributor Author

commented May 19, 2016

Ok, I think I find out why, the MigrationQueue also accept RepoartiioningTask. which is not a instaceof MigrationTask.

When shutdown is claiming about "Waiting for cluster migraiton tasks", it is actually waiting for "ReportioningTask". But ReportioningTask keep failing, because local instance is in shutdown mode.

If ReportioningTask is actually not allowed after shutdown process kick off, shouldn't shutdown process only check the counter, not queue size?

the following code should use MigrationQueue#hasMigrationTasks, not size()

 private boolean checkForActiveMigrations(Level level) {
        final int activeSize = activeMigrations.size();
        if (activeSize != 0) {
            if (logger.isLoggable(level)) {
                logger.log(level, "Waiting for active migration tasks: " + activeSize);
            }
            return true;
        }

        int queueSize = migrationQueue.size();
        if (queueSize != 0) {
            if (logger.isLoggable(level)) {
                logger.log(level, "Waiting for cluster migration tasks: " + queueSize);
            }
            return true;
        }
        return false;
    }

@jerrinot jerrinot added this to the 3.7 milestone May 23, 2016

@metanet metanet modified the milestones: 3.6.3, 3.7 May 23, 2016

@metanet

This comment has been minimized.

Copy link
Contributor

commented May 24, 2016

Hi @ShijunK

As I see in the logs, RepartitioningTask produces the same migration repeatedly and it fails for some reason. From the logs you share, reason of the migration failure is not clear. Do you have any other logs you can share with us? Maybe migrations are failing because of an exception thrown from a service etc.

@ShijunK

This comment has been minimized.

Copy link
Contributor Author

commented May 24, 2016

@metanet
I don't have the logging anymore. The failure is caused by migration target was shutdown after repartingint task (which why it was picked as migration target) before migration complete (then infinite loop). The situation happens during my rolling deployment process, which will shutdown hazelcast node one by one gracefully by invoking HazelcastInstance#shutdown(), then deploy new code, restart hazelcast node, move on to next one.

@jerrinot jerrinot modified the milestones: 3.6.4, 3.6.3 Jun 9, 2016

@metanet metanet self-assigned this Jun 13, 2016

@enesakar enesakar modified the milestones: 3.6.4, 3.6.5 Jun 30, 2016

@mdogan

This comment has been minimized.

Copy link
Member

commented Jul 19, 2016

Fixed by #8230

@mdogan mdogan closed this Jul 19, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.