Memory leak in com.facebook.presto.memory.ClusterMemoryManager#changeListeners #10812

Open
sopel39 opened this issue Jun 12, 2018 · 15 comments

sopel39 (Contributor) commented Jun 12, 2018

Example log: https://api.travis-ci.org/v3/job/389838962/log.txt
Commit: 48ea8f3

Example PR: #10808
log: https://api.travis-ci.org/v3/job/390987547/log.txt

log tail:

2018-06-12T06:46:47.197+0545 INFO Major GC: application 12ms, stopped 6406ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.197+0545 INFO Major GC: application 12ms, stopped 6406ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.197+0545 INFO Major GC: application 12ms, stopped 6406ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:59.581+0545 WARNING Node state update request to http://127.0.0.1:44031/v1/info/state has not returned in 18.21s
2018-06-12T06:46:59.582+0545 WARNING Error fetching node state from http://127.0.0.1:36641/v1/info/state: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2018-06-12T06:46:59.584+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:37427/v1/service/presto/general
2018-06-12T06:47:49.816+0545 INFO Discovery server connect succeeded for refresh (presto/general)
2018-06-12T06:48:13.880+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 1.02m
2018-06-12T06:48:25.391+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:48:36.835+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 28.75s
2018-06-12T06:48:42.965+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:37427/v1/service/presto/general
2018-06-12T06:48:54.927+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:41456/v1/service/presto/general
2018-06-12T06:49:06.700+0545 SEVERE Cannot connect to discovery server for announce: Announcement failed for http://127.0.0.1:41456
2018-06-12T06:49:06.700+0545 SEVERE Service announcement failed after 2.62m. Next request will happen within 8000.00ms
2018-06-12T06:49:06.700+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:41456/v1/service/presto/general
2018-06-12T06:49:06.703+0545 INFO Discovery server connect succeeded for refresh (presto/general)
2018-06-12T06:49:06.703+0545 INFO Discovery server connect succeeded for refresh (presto/general)
2018-06-12T06:49:12.678+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 1.84m
2018-06-12T06:49:12.679+0545 SEVERE Cannot connect to discovery server for announce: Announcement failed for http://127.0.0.1:37427
2018-06-12T06:49:12.679+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:37427/v1/service/presto/general
2018-06-12T06:49:12.679+0545 SEVERE Service announcement failed after 2.72m. Next request will happen within 8000.00ms
2018-06-12T06:49:18.676+0545 SEVERE Service announcement failed after 5.19m. Next request will happen within 2.00ms
2018-06-12T06:49:24.589+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 1.94m
2018-06-12T06:49:24.591+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 1.94m
2018-06-12T06:49:24.591+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:37427/v1/service/presto/general
2018-06-12T06:49:24.593+0545 SEVERE Cannot connect to discovery server for announce: Announcement failed for http://127.0.0.1:37427
2018-06-12T06:49:30.664+0545 SEVERE Service announcement failed after 3.02m. Next request will happen within 8000.00ms
2018-06-12T06:49:54.974+0545 WARNING Error fetching memory info from http://127.0.0.1:36641/v1/memory: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2018-06-12T06:50:00.882+0545 SEVERE Expected service announcement after 4.00ms, but announcement was delayed 42.21s
2018-06-12T06:50:07.174+0545 WARNING Node state update request to http://127.0.0.1:44031/v1/info/state has not returned in 205.80s
2018-06-12T06:50:07.175+0545 WARNING Error fetching node state from http://127.0.0.1:44031/v1/info/state: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2018-06-12T06:50:07.175+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:50:07.175+0545 INFO Major GC: application 7ms, stopped 5955ms:: 2.12GB -> 2.12GB

Should we increase the memory limit for tests? Could there be a memory leak?

CC: @electrum

mbasmanova (Contributor) commented

@findepi Piotr, I'm still seeing these failures. Do you have more fixes in the pipeline? :-)

findepi (Contributor) commented Jun 18, 2018

@mbasmanova I regret to say I don't. The problem deserves more investigation.

findepi (Contributor) commented Jun 19, 2018

@mbasmanova Oh, I just found some lost commits: #10866 & #10865.

mbasmanova (Contributor) commented

@findepi Thank you, Piotr. I got green tests on #10731.

findepi (Contributor) commented Jun 20, 2018

@mbasmanova I am glad to hear that, but this might still be incidental. I've seen a red build even with #10866 (a timeout after 80 minutes, even though the next run took only 40 minutes).

findepi (Contributor) commented Jul 16, 2018

Recently the problem became more severe and more easily reproducible locally.
I am under the impression that MemoryAwareQueryExecution adds listeners in ClusterMemoryManager:

memoryManager.addChangeListener(GENERAL_POOL, none -> start());
memoryManager.addChangeListener(RESERVED_POOL, none -> start());

that are never going to be removed.

I am not entirely sure yet, since all the information I have so far is the 10M instances of com.facebook.presto.memory.ClusterMemoryManager$$Lambda$7855.
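
For illustration only, a minimal self-contained sketch of that failure mode, with a hypothetical PoolListeners class standing in for the real ClusterMemoryManager: listeners registered per query with no removal path accumulate for the lifetime of the manager.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical stand-in for ClusterMemoryManager#changeListeners:
// listeners are only ever added, never removed.
public class PoolListeners
{
    private final Map<String, List<Consumer<Long>>> changeListeners = new ConcurrentHashMap<>();

    public void addChangeListener(String pool, Consumer<Long> listener)
    {
        changeListeners.computeIfAbsent(pool, ignored -> new ArrayList<>()).add(listener);
    }

    public static void main(String[] args)
    {
        PoolListeners manager = new PoolListeners();
        // Each "query" registers two listeners (one for GENERAL_POOL, one for RESERVED_POOL)
        // and never unregisters them, so the lists grow with every query executed.
        for (int query = 0; query < 100_000; query++) {
            manager.addChangeListener("general", bytes -> {});
            manager.addChangeListener("reserved", bytes -> {});
        }
        manager.changeListeners.forEach((pool, listeners) ->
                System.out.println(pool + ": " + listeners.size() + " listeners retained"));
    }
}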

cc @nezihyigitbasi @raghavsethi

findepi changed the title from "TEST_SPECIFIC_MODULES=presto-tests fail with JVM crashing" to "Memory leak in com.facebook.presto.memory.ClusterMemoryManager#changeListeners" Jul 16, 2018
findepi removed the tests label Jul 16, 2018
findepi (Contributor) commented Jul 16, 2018

I see now: the 10M ClusterMemoryManager$$Lambda$7855 instances are queued executions in com.facebook.presto.memory.ClusterMemoryManager#listenerExecutor:

Class Name                                                                                                | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------------------------------------------------
listenerExecutor java.util.concurrent.Executors$FinalizableDelegatedExecutorService @ 0x75bae0cd8         |           16 |            16
|- e java.util.concurrent.ThreadPoolExecutor @ 0x75bae0ce8                                                |           80 |   856,403,752
|  |- workQueue java.util.concurrent.LinkedBlockingQueue @ 0x75bae0d48                                    |           48 |   856,403,288
|  |  |- head java.util.concurrent.LinkedBlockingQueue$Node @ 0x75cefaf20                                 |           24 |   856,402,792
|  |  |  |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0x75cefaf50                              |           24 |   856,402,768
|  |  |  |  |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0x75cefb050                           |           24 |   856,402,512
|  |  |  |  |- item com.facebook.presto.memory.ClusterMemoryManager$$Lambda$7855 @ 0x75cefaf68            |           24 |            24
|  |  |  |  |  |- arg$2 com.facebook.presto.spi.memory.MemoryPoolInfo @ 0x75cefaf80                       |           48 |           208
|  |  |  |  |  |- arg$1 com.facebook.presto.execution.MemoryAwareQueryExecution$$Lambda$7853 @ 0x75c12ab68|           16 |            16
-----------------------------------------------------------------------------------------------------------------------------------------

I didn't dig into why the executions queue up (it might be a bottleneck, since the executor is single-threaded, or some kind of lock).
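
Worth noting: Executors.newSingleThreadExecutor() backs its single worker with an unbounded LinkedBlockingQueue (which matches the FinalizableDelegatedExecutorService in the dominator tree above), so nothing pushes back when tasks are submitted faster than they are drained. A minimal, self-contained sketch (not Presto code) of how such a queue builds up behind one stuck or slow worker:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueBuildup
{
    public static void main(String[] args) throws InterruptedException
    {
        // Same configuration as Executors.newSingleThreadExecutor():
        // one worker thread in front of an unbounded LinkedBlockingQueue.
        ThreadPoolExecutor listenerExecutor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        CountDownLatch blocker = new CountDownLatch(1);
        // The single worker gets stuck on the first task...
        listenerExecutor.execute(() -> {
            try {
                blocker.await();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // ...while callers keep calling execute(); every submission is retained
        // in the queue, together with whatever the submitted lambdas capture.
        for (int i = 0; i < 100_000; i++) {
            listenerExecutor.execute(() -> {});
        }
        System.out.println("queued tasks: " + listenerExecutor.getQueue().size());

        blocker.countDown();
        listenerExecutor.shutdown();
    }
}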

raghavsethi (Contributor) commented

@findepi Does this still need a fix?

findepi (Contributor) commented Jul 19, 2018

@raghavsethi As far as Travis is concerned, I think @arhimondr's #11062 fixes the problem.
Regarding the (apparent) root cause -- listeners being attached to ClusterMemoryManager by MemoryAwareQueryExecution and never removed -- I think this still applies.

nezihyigitbasi (Contributor) commented

The way listeners are set up in MemoryAwareQueryExecution can add a lot of work to the workQueue of ClusterMemoryManager.listenerExecutor by calling execute() on it, causing the issue reported above.

I was able to reproduce a similar issue locally by not closing the server in TestMemoryAwareExecution and inserting an infinite loop into tearDown so the test doesn't terminate (effectively delaying the server close indefinitely).

Here is what's happening:

Say a query transitions to the FAILED state. That transition calls the state change listener registered in MemoryAwareQueryExecution, which calls memoryManager.removePreAllocation(delegate.getQueryId()). Then removePreAllocation() calls listenerExecutor.execute(() -> listener.accept(info)), where the listener calls MemoryAwareQueryExecution.start(). start() adds yet another listener to delegate, which is invoked immediately (since in a terminal state StateMachine calls the state change listener right away), and the cycle starts all over again. This infinite loop continues as long as the query stays in that terminal state. It's a ping-pong between ClusterMemoryManager::removePreAllocation and MemoryAwareQueryExecution::start(): until the query leaves the terminal state, the code keeps adding work to the executor workQueue.
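
A compressed sketch of that feedback loop, using hypothetical stand-in methods rather than the real classes, and drained for only a few rounds so it terminates (the real loop runs through listenerExecutor and keeps going for as long as the query sits in the terminal state):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ListenerPingPong
{
    // Stand-in for ClusterMemoryManager#changeListeners (here: one listener per pool, both calling start()).
    static final List<Runnable> poolListeners = new ArrayList<>();
    // Stand-in for the workQueue of ClusterMemoryManager#listenerExecutor.
    static final Queue<Runnable> queuedWork = new ArrayDeque<>();
    // The query sits in a terminal state (e.g. FAILED) for the whole scenario.
    static final boolean queryInTerminalState = true;

    // Stand-in for StateMachine.addStateChangeListener(): in a terminal state,
    // a newly added listener is invoked immediately.
    static void addStateChangeListener(Runnable listener)
    {
        if (queryInTerminalState) {
            listener.run();
        }
    }

    // Stand-in for MemoryAwareQueryExecution.start(): adds yet another state change
    // listener, which (terminal state) immediately calls removePreAllocation().
    static void start()
    {
        addStateChangeListener(ListenerPingPong::removePreAllocation);
    }

    // Stand-in for ClusterMemoryManager.removePreAllocation(): hands every pool
    // listener to the "executor", i.e. enqueues more work.
    static void removePreAllocation()
    {
        queuedWork.addAll(poolListeners);
    }

    public static void main(String[] args)
    {
        // The two listeners added via addChangeListener(GENERAL_POOL/RESERVED_POOL, none -> start()).
        poolListeners.add(ListenerPingPong::start);
        poolListeners.add(ListenerPingPong::start);

        start();
        // Drain a few rounds of "executor" work; every drained task enqueues two more,
        // so the queue keeps growing even while it is being drained.
        for (int round = 0; round < 5; round++) {
            queuedWork.poll().run();
            System.out.println("round " + round + ": " + queuedWork.size() + " tasks queued");
        }
    }
}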

nezihyigitbasi (Contributor) commented

I ran multiple experiments with TestMemoryAwareExecution disabled, and it seems to help. Previously I was getting 1/3 passes for presto-tests; with it disabled I got 3/3 passes.

raghavsethi (Contributor) commented Jul 20, 2018 via email

raghavsethi (Contributor) commented Jul 21, 2018

Talked to @nezihyigitbasi and will have a fix by Monday.

raghavsethi added a commit to raghavsethi/presto that referenced this issue Jul 23, 2018:
Previously, we would add a listener for state change *on* every state change. This could also cause listeners to be added in a loop in between failed state transitions. Fixes prestodb#10812.
sopel39 (Contributor, Author) commented Nov 16, 2018

@raghavsethi Can it be closed now?

arhimondr reopened this Nov 16, 2018
raghavsethi (Contributor) commented

@dain has a PR that removes MemoryAwareExecution completely.

raghavsethi removed their assignment May 14, 2019