
[scheduled-executor] Registration objects causing memory leak while using ScheduledExecutor #11221

Closed
Mihajlovski opened this issue Aug 25, 2017 · 19 comments


@Mihajlovski commented Aug 25, 2017

I originally posted this question on Stack Overflow.

We are using the Hazelcast (version 3.8.2) ScheduledExecutor to perform tasks in a cluster, and we have a memory issue.

[Zabbix memory graph: old-gen GC usage]

Profiling showed that the leak is happening in the EventServiceSegment registration maps, mostly (though not exclusively) for .partitionLost events.

[YourKit memory profiling screenshot]

We know about disposing finished future, and here's our code that is performed on each instance periodically.

// Dispose futures whose task has run at least once, then count what remains.
final long futureDisposed = service.getAllScheduledFutures().values().stream()
        .flatMap(Collection::stream)
        .filter(future -> future.getStats().getTotalRuns() > 0)
        .peek(IScheduledFuture::dispose)
        .count();
final long futuresLeft = service.getAllScheduledFutures().values().stream()
        .flatMap(Collection::stream)
        .count();

The logged values look fine: the correct number of tasks is disposed, and the correct number of tasks remains scheduled. At the moment of this heap dump, futuresLeft reported ~2500 scheduled futures left in the system. But the memory is still growing.

Our tasks are generated in batches of hundreds and scheduled with the same delay. The tasks implement Runnable, Serializable and HazelcastInstanceAware. We also hold transient references to some objects we pick up from the userContext of the HazelcastInstance.

Scheduling is done as:

this.scheduler = instance.getScheduledExecutorService(schedulerName);
this.scheduler.schedule(new Task(taskId), delay, TimeUnit.MILLISECONDS);

When tasks are executed, they might get rescheduled, in which case we create a new task object (previously we would schedule the same objects back, but we changed that hoping it would help; it didn't).

The scheduler is configured with durability 1 and a poolSize of 2.

I've gone through the code and googled every keyword I could find, but nothing helped. I can see that these registrations are supposed to be cleaned up when calling dispose, but they are not.

How do we properly clean up this registrations map when tasks are finished?

@mmedenjak mmedenjak added this to the 3.9 milestone Aug 25, 2017
@mmedenjak mmedenjak changed the title Registration objects causing memory leak while using ScheduledExecutor [scheduled-executor] Registration objects causing memory leak while using ScheduledExecutor Aug 25, 2017
@Danny-Hazelcast (Member) commented Aug 25, 2017

I was just looking at the scheduled executor and writing this test: https://github.com/hazelcast/hzCmd-bench/blob/zeta/src/main/java/hzcmd/executor/scheduled/ScheduleDispose.java

Each task is scheduled with a long delay of 3600 seconds, then I call dispose() on the IScheduledFuture on the next line. (edited)

Looking at the GC of the member after running this for 1 hour (http://54.82.84.143/~jenkins/workspace/stable-xx/3.9-SNAPSHOT/2017_08_25-09_56_38/gc.html), it looks like a leak.

(links private)

[GC graph]

@Donnerbart (Contributor) commented Aug 25, 2017

I got a very simple reproducer. Just add this test to ScheduledExecutorServiceBasicTest:

    @Test
    public void schedule_thenDisposeLeakTest() throws Exception {
        int delay = 10000;
        String taskName = "Test";

        HazelcastInstance[] instances = createClusterWithCount(2);
        IScheduledExecutorService executorService = getScheduledExecutor(instances, "s");

        for (int i = 0; i < 10000000; i++) {
            IScheduledFuture<Double> future = executorService.schedule(
                    named(taskName, new PlainCallableTask()), delay, SECONDS);
            future.dispose();
        }
    }

I just monitored it with Visual GC, and it looks like it's leaking:

[Visual GC screenshot]

Like in the hprof linked by Danny, it looks like a lot of ScheduledTaskStatisticsImpl instances are created and never removed:

[YourKit screenshot]

@mmedenjak mmedenjak self-assigned this Aug 25, 2017
@mmedenjak (Contributor) commented Aug 25, 2017

The cause is the task decoration: the decorated instance (LoggingDelegatingFuture) is put into LoggingScheduledExecutor.workQueue. On cancellation, the call is delegated to the wrapped future, which tries to remove itself from the workQueue. The removal fails because the queue contains the decorator instance, not the delegate.
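This failure mode can be sketched with the JDK alone: ScheduledThreadPoolExecutor.decorateTask is the same kind of hook a LoggingScheduledExecutor-style subclass would use. All class names in this sketch are illustrative stand-ins, not Hazelcast's actual implementation:

```java
import java.util.concurrent.*;

public class DecoratorCancelLeak {

    // A decorator that forwards every call to the real (JDK-internal) task.
    static class Decorated<V> implements RunnableScheduledFuture<V> {
        final RunnableScheduledFuture<V> delegate;
        Decorated(RunnableScheduledFuture<V> delegate) { this.delegate = delegate; }
        // The bug: cancel() is only delegated. The delegate then tries to remove
        // *itself* from the work queue, but the queue holds this decorator, so
        // the removal silently fails and the entry lingers until its delay expires.
        @Override public boolean cancel(boolean mayInterrupt) { return delegate.cancel(mayInterrupt); }
        @Override public boolean isPeriodic() { return delegate.isPeriodic(); }
        @Override public void run() { delegate.run(); }
        @Override public boolean isCancelled() { return delegate.isCancelled(); }
        @Override public boolean isDone() { return delegate.isDone(); }
        @Override public V get() throws InterruptedException, ExecutionException { return delegate.get(); }
        @Override public V get(long t, TimeUnit u)
                throws InterruptedException, ExecutionException, TimeoutException { return delegate.get(t, u); }
        @Override public long getDelay(TimeUnit u) { return delegate.getDelay(u); }
        @Override public int compareTo(Delayed o) { return delegate.compareTo(o); }
    }

    static int queueSizeAfterCancel() {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1) {
            @Override protected <V> RunnableScheduledFuture<V> decorateTask(
                    Runnable r, RunnableScheduledFuture<V> task) {
                return new Decorated<>(task);
            }
        };
        executor.setRemoveOnCancelPolicy(true); // should drop cancelled tasks, but...
        ScheduledFuture<?> future = executor.schedule(() -> { }, 1, TimeUnit.HOURS);
        future.cancel(false);
        int size = executor.getQueue().size(); // still 1: the cancelled task leaked
        executor.shutdownNow();
        return size;
    }

    public static void main(String[] args) {
        System.out.println("queue size after cancel: " + queueSizeAfterCancel());
    }
}
```

Without the decorator, the same cancel with setRemoveOnCancelPolicy(true) would leave an empty queue; with it, each cancelled-but-undisposed entry stays on the heap until its delay expires.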

@Mihajlovski (Author) commented Aug 25, 2017

Are you sure that's the same leak? Our heap dump contains different objects, mostly PartitionLostListenerAdapter.

[YourKit screenshot]

@Mihajlovski (Author) commented Aug 25, 2017

And, of course, is there any workaround we can use, so we don't have to restart our instances every now and then?

@mmedenjak (Contributor) commented Aug 25, 2017

It might not be. Can you traverse the references to see which Hazelcast class is holding the reference to those HashMap nodes?

@Mihajlovski (Author) commented Aug 25, 2017

It's (mostly) the EventServiceSegment.registrations and .registrationById maps.
Here's the EventServiceImpl object:

[YourKit screenshot]

And the biggest EventServiceSegment, for hz:core:partitionService:

[YourKit screenshot]

mmedenjak added a commit to mmedenjak/hazelcast that referenced this issue Aug 25, 2017
The workQueue inside LoggingScheduledExecutor contains wrapped
futures of scheduled tasks. This is because we decorate tasks in
LoggingScheduledExecutor.decorateTask.
When we were cancelling the task, we delegated cancellation to the
delegate which tries to remove itself from the workQueue. The
workQueue though doesn't contain the delegate but rather contains the
decorated delegate. This causes the remove to fail without any
message and causes the leak.
To fix this, after we call cancel() on the delegate, we manually
remove the decorator from the workQueue.

Fixes: hazelcast#11221
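The fix from the commit message (after delegating cancel(), manually remove the decorator from the workQueue) can be sketched the same way, again with JDK stand-ins rather than Hazelcast's actual classes:

```java
import java.util.concurrent.*;

public class FixedDecoratorDemo {

    // Decorator whose cancel() cleans up after itself, mirroring the fix.
    static class FixedDecorated<V> implements RunnableScheduledFuture<V> {
        final RunnableScheduledFuture<V> delegate;
        final ThreadPoolExecutor owner;
        FixedDecorated(RunnableScheduledFuture<V> delegate, ThreadPoolExecutor owner) {
            this.delegate = delegate;
            this.owner = owner;
        }
        @Override public boolean cancel(boolean mayInterrupt) {
            boolean cancelled = delegate.cancel(mayInterrupt);
            // The fix: the delegate cannot remove itself (the queue holds the
            // decorator), so remove the decorator explicitly after cancelling.
            if (cancelled) owner.remove(this);
            return cancelled;
        }
        @Override public boolean isPeriodic() { return delegate.isPeriodic(); }
        @Override public void run() { delegate.run(); }
        @Override public boolean isCancelled() { return delegate.isCancelled(); }
        @Override public boolean isDone() { return delegate.isDone(); }
        @Override public V get() throws InterruptedException, ExecutionException { return delegate.get(); }
        @Override public V get(long t, TimeUnit u)
                throws InterruptedException, ExecutionException, TimeoutException { return delegate.get(t, u); }
        @Override public long getDelay(TimeUnit u) { return delegate.getDelay(u); }
        @Override public int compareTo(Delayed o) { return delegate.compareTo(o); }
    }

    static int queueSizeAfterCancel() {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1) {
            @Override protected <V> RunnableScheduledFuture<V> decorateTask(
                    Runnable r, RunnableScheduledFuture<V> task) {
                return new FixedDecorated<>(task, this);
            }
        };
        executor.setRemoveOnCancelPolicy(true);
        ScheduledFuture<?> future = executor.schedule(() -> { }, 1, TimeUnit.HOURS);
        future.cancel(false);
        int size = executor.getQueue().size(); // 0: decorator removed on cancel
        executor.shutdownNow();
        return size;
    }

    public static void main(String[] args) {
        System.out.println("queue size after cancel: " + queueSizeAfterCancel());
    }
}
```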
mmedenjak added a commit to mmedenjak/hazelcast that referenced this issue Aug 25, 2017 (same message; cherry picked from commit a594ec2)
@mmedenjak (Contributor) commented Aug 25, 2017

Hm, but those look like listener registrations and are not related to the scheduled executor usage. Are you using event listeners? This might be legitimate use of heap. In any case, we have certainly detected a leak in the disposal of scheduled executor tasks.

@Mihajlovski (Author) commented Aug 25, 2017

We do use a MembershipListener, but all it does is log, and it's set on the Hazelcast instance, not on the tasks.
And, as said, our tasks are HazelcastInstanceAware.
We didn't explicitly set any option for event listening.

The PartitionLostListenerAdapter has a ScheduledFutureProxy as its listener, which in turn has a ScheduledTaskHandlerImpl as its handler.

As for legitimate use of heap, I don't know. I'm totally open to the possibility that we're misusing it somehow, I just want to figure out how. :)

mmedenjak added a commit to mmedenjak/hazelcast that referenced this issue Aug 28, 2017 (same message as above)
mmedenjak added a commit to mmedenjak/hazelcast that referenced this issue Aug 28, 2017 (same message as above)
@tkountis (Contributor) commented Aug 29, 2017

These are two different leaks, from a quick look I had.
ScheduledFutures are member- and partition-loss aware (or they try to be), meaning that they register themselves as listeners for these kinds of events. However, @Mihajlovski is not using the returned future, but rather re-fetches them all with getAllScheduled, therefore causing extra references to stay live.

I will play with this use case a bit to see what alternative solutions I can come up with.
For now, as a temporary fix to avoid that leak, use the original future as you got it out of .schedule...(). Maybe store it in a local Set, and once you dispose it, remove it.
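The suggested workaround can be sketched with JDK types as stand-ins (ScheduledFuture in place of Hazelcast's IScheduledFuture; the FutureTracker name is hypothetical). With Hazelcast you would also call dispose() on each finished future before dropping the reference:

```java
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.*;

// Keeps the original futures returned by schedule() and periodically sweeps
// the finished ones, instead of re-fetching proxies via getAllScheduledFutures().
public class FutureTracker {
    private final Set<ScheduledFuture<?>> futures = ConcurrentHashMap.newKeySet();

    public void track(ScheduledFuture<?> future) {
        futures.add(future);
    }

    // Removes finished futures and reports how many were swept. With Hazelcast,
    // call IScheduledFuture.dispose() here before dropping the reference.
    public int sweep() {
        int swept = 0;
        for (Iterator<ScheduledFuture<?>> it = futures.iterator(); it.hasNext(); ) {
            if (it.next().isDone()) {
                it.remove();
                swept++;
            }
        }
        return swept;
    }

    public int pending() {
        return futures.size();
    }
}
```

Usage is then: track(executor.schedule(task, delay, unit)) at scheduling time, and call sweep() from the existing periodic cleanup job instead of streaming over getAllScheduledFutures().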

@mmedenjak mmedenjak removed their assignment Aug 29, 2017
@Mihajlovski (Author) commented Aug 29, 2017

Is it enough to call dispose only on the instance that scheduled the task? What happens on the other instances in the cluster if that one fails?

@tkountis tkountis self-assigned this Aug 30, 2017
@tkountis (Contributor) commented Aug 30, 2017

@Mihajlovski I didn't quite get the question, could you please rephrase it?
Just for reference, in case it helps: dispose stops the task (if it's still running), cleans up its listeners and removes the task. This means that any attempt to interact with that future again will fail. Also, calls to getAllScheduled after disposing a task won't return that task.

Bottom line: you only have to dispose the future once, and you can do that by holding the original reference in a map and periodically checking when to clean up.

The leak in your use case is that you never interact with the return value of executor.schedule(), but rather periodically fetch them all over again and again with executor.getAllScheduled(). This creates new futures, which in turn register themselves as listeners.

@tkountis (Contributor) commented Aug 30, 2017

Now, if you store a reference to the future in a local Set<Future> and this node/member goes down, you don't have to worry: you re-fetch the tasks once, the next time you restart, and then keep that list. I hope that makes sense.

We will work on potentially providing a better approach to deal with this.

@Mihajlovski (Author) commented Aug 30, 2017

Rephrase coming :)
We have instances A, B and C. All of them generate tasks. Each stores the futures for the tasks it creates in a local set, and occasionally checks which of them are finished and calls dispose(). Now instance A goes down while some of its futures weren't cleaned up. Do they stay on B and C? Is the system resilient to instance failures, or will the leaks at some point add up to an OOME?

Here's where we picked up the code for cleaning by calling getAllScheduled():
#9799

I would never have guessed that calling getAllScheduled() creates new futures, at least not from reading the documentation.

@tkountis (Contributor) commented Aug 31, 2017

@Mihajlovski Firstly, let me clarify what 'creates new futures' means in this context. When you schedule a task, the real future is handled internally by Hazelcast; you never get access to it. What you get is a proxy future that acts as the messenger of your commands, e.g. when you call .get(), it finds the actual future/task, gets the result and brings it back to you. So when we say getAllScheduled() creates new futures, it just means we create more proxy objects that interact with the same future that Hazelcast handles internally. I hope that helps a bit.

Now, back to your question. All tasks you schedule (except for the OnMember variants) are durable, meaning they have backups in case one node goes down. As long as you don't use getAllScheduled() too often, e.g. just once when you boot, you won't run out of heap.

@Danny-Hazelcast (Member) commented Aug 31, 2017

The schedule-dispose test: leak fixed, improved member GC.

[GC graph]

(Edit: the second leak found is fixed, not the one reported by @Mihajlovski.)

@Mihajlovski (Author) commented Sep 4, 2017

Storing and disposing local futures did the trick for us. I'd suggest a documentation update. :)

@Danny-Hazelcast (Member) commented Oct 23, 2017

For 3.9, setting

hazelcast.executionservice.taskscheduler.remove.oncancel = true

prevents this issue.
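A minimal sketch of enabling it, assuming the property can be supplied as a JVM system property the way Hazelcast group properties generally can (the jar name is a placeholder):

```shell
java -Dhazelcast.executionservice.taskscheduler.remove.oncancel=true -jar your-app.jar
```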
