Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MigrationThread leak #8560

Closed
mtopolnik opened this issue Jul 20, 2016 · 9 comments

Comments

Projects
None yet
3 participants
@mtopolnik
Copy link
Contributor

commented Jul 20, 2016

MigrationThread can leak after shutting down or even forcefully terminating Hazelcast. It retains the entire HazelcastInstance.

InternalPartitionServiceImpl:

    @Override
    public void shutdown(boolean terminate) {
        logger.finest("Shutting down the partition service");
        migrationManager.stop();
        reset();
    }

MigrationManager:

void stop() {
        migrationThread.stopNow();
    }

MigrationThread:

    void stopNow() {
        queue.clear();
        interrupt();
    }

Thread is interrupted, but not joined. In particular, the thread will block forever in some cases, one of which is the following:
RepairPartitionTableTask#commitPromotionsToDestination:

                PromotionCommitOperation op = new PromotionCommitOperation(...);
                Future<Boolean> future = nodeEngine.getOperationService()
                        .createInvocationBuilder(SERVICE_NAME, op, destination)
                        .setTryCount(Integer.MAX_VALUE)
                        .setCallTimeout(Long.MAX_VALUE).invoke();
                boolean result = future.get();

@mtopolnik mtopolnik added this to the 3.7 milestone Jul 20, 2016

@metanet metanet self-assigned this Jul 21, 2016

@mdogan

This comment has been minimized.

Copy link
Member

commented Jul 21, 2016

This looks like an invocation system issue to me. Invocation is not notified when instance is being shutdown.

@mtopolnik

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

What about Hazelcast leaving live threads behind? Shouldn't we join on the migration thread?

@metanet

This comment has been minimized.

Copy link
Contributor

commented Jul 21, 2016

MigrationThread is already supposed to complete here. Its task queue is cleared during shutdown. So, it should complete its ongoing task (if there is any) and complete. Our guess here is even if the system is being shutdown, a new invocation can be initiated and it will not be notified. This is not a partition-service specific problem. We are trying to verify this issue.

@mtopolnik

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

Letting the migration thread die on its own pace is against our fail-fast doctrine and has presently resulted in the loss of 2 engineering man-days. I have submitted a PR which makes the shutdown procedure wait until the thread is dead.

Also noteworthy is that, with that in place, the CacheHotRestartTest is not blocking and timing out, instead it is passing.

@metanet

This comment has been minimized.

Copy link
Contributor

commented Jul 21, 2016

As I stated earlier, we were suspecting a problem in the invocation system, and now we reproduced it. The problem is not solely about the migration thread.

@mtopolnik

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

This issue is solely about the absence of fail-fast behavior in the migration thread shutdown. We should address further problems in a separate issue.

@mdogan

This comment has been minimized.

Copy link
Member

commented Jul 21, 2016

absence of fail-fast behavior in the migration thread shutdown only causes leaking the migration thread and the instance behind it. I agree it's a problem on its own and needs fixing.

But bigger problem is, even with fail fast behaviour, shutdown may block forever.

@metanet

This comment has been minimized.

Copy link
Contributor

commented Jul 21, 2016

Yes, that is why I am saying that the problem is not solely about migration thread.

@mtopolnik

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

That means we all agree here. This issue has a narrow focus, but the problem I met also involves further issues downstream from it, within the invocation system. Making the shutdown procedure block immediately upon facing the downstream issue is the goal of this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.