
Synchronously close pools when closing AbstractChannelPoolMap #9857

Merged 1 commit into netty:4.1 on Dec 11, 2019

Conversation

@mihalyr (Contributor) commented Dec 7, 2019

Motivation:

In #9830 the implementation of the get/remove/close methods was changed to
avoid deadlocks on event loops. The change modified the methods to close the
managed ChannelPools asynchronously and return immediately. While this
behavior might be fine for get/remove, it changes what a user expects from a
close() method: after close() returns, resources might still be open.

Modifications:

This change is a follow-up for #9830 to preserve the synchronous
behavior of the AbstractChannelPoolMap#close() method.

Result:

AbstractChannelPoolMap#close() returns once the managed pools have
been closed.

Related commit a322add
Related issues #8238, #9603
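
For illustration only, a minimal usage sketch of the behaviour described in the Result section. The class name, host and port are made up; only the public Netty pool APIs are used:

```java
import java.net.InetSocketAddress;

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.pool.AbstractChannelPoolHandler;
import io.netty.channel.pool.AbstractChannelPoolMap;
import io.netty.channel.pool.ChannelPool;
import io.netty.channel.pool.SimpleChannelPool;
import io.netty.channel.socket.nio.NioSocketChannel;

public final class PoolMapCloseExample {
    public static void main(String[] args) {
        NioEventLoopGroup group = new NioEventLoopGroup();
        Bootstrap bootstrap = new Bootstrap().group(group).channel(NioSocketChannel.class);

        AbstractChannelPoolMap<InetSocketAddress, ChannelPool> poolMap =
                new AbstractChannelPoolMap<InetSocketAddress, ChannelPool>() {
                    @Override
                    protected ChannelPool newPool(InetSocketAddress key) {
                        return new SimpleChannelPool(bootstrap.clone().remoteAddress(key),
                                new AbstractChannelPoolHandler() {
                                    @Override
                                    public void channelCreated(Channel ch) {
                                        // configure the pipeline here
                                    }
                                });
                    }
                };

        poolMap.get(new InetSocketAddress("example.com", 80)); // lazily creates a pool

        // With this change, close() blocks until every managed pool has been
        // closed, so shared resources can safely be shut down right after it returns.
        poolMap.close();
        group.shutdownGracefully();
    }
}
```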

@netty-bot

Can one of the admins verify this patch?

@mihalyr (Contributor, Author) commented Dec 7, 2019

@normanmaurer after taking a look at #9603 I realized there might be users expecting ChannelPoolMap#close() to close the underlying pools before returning. Since this was a synchronous call before my change in #9830, I think it would be safer to keep it synchronous.

The other thing is that it is becoming a bit unclear which operations in ChannelPoolMap execute synchronously and which do not: the interface suggests all methods are synchronous (no Futures returned), yet get and remove may still be executing tasks asynchronously after they return, while close is blocking.

To avoid breaking changes we could leave ChannelPoolMap/AbstractChannelPoolMap fully synchronous, as the current interface suggests, and add a new AsyncChannelPoolMap/AbstractAsyncChannelPoolMap interface/class that returns Futures from get/remove/close, which users can wait on if they like. This could go in tandem with an AsyncCloseablePool or similar interface that would turn the current SimpleChannelPool#closeAsync method into an interface member, so the new pool maps could use it consistently without instanceof checks and internal branching.
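
A rough sketch of what such an async map could look like; none of these types exist in Netty, they are only a proposal:

```java
import io.netty.channel.pool.ChannelPool;
import io.netty.util.concurrent.Future;

// Hypothetical interface, not part of Netty: every operation returns a Future
// so callers can decide whether to block on it or chain listeners instead.
public interface AsyncChannelPoolMap<K, P extends ChannelPool> {
    // Resolves (and lazily creates) the pool mapped to the key.
    Future<P> getAsync(K key);

    // Removes the pool mapped to the key; completes once the pool is closed.
    Future<Boolean> removeAsync(K key);

    // Closes all managed pools; completes once every pool has been closed.
    Future<Void> closeAsync();
}
```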

Please let me know if you think this might be worth adding. Thanks.

@normanmaurer (Member)

@netty-bot test this please

@normanmaurer (Member)

@mihalyr so you are suggesting to revert your commit?

@mihalyr (Contributor, Author) commented Dec 10, 2019

@normanmaurer sorry for the confusion, I think about two options:

  1. Keep #9830 (Asynchronously close pools in AbstractChannelPoolMap) and add this PR (#9857, Synchronously close pools when closing AbstractChannelPoolMap) on top of it, to make sure the close behaviour remains the same.
  2. Revert #9830, which brings us back to the deadlock in #8238 (FixedChannelPool.close deadlock); to solve both issues in a cleaner way we might then need to add new ChannelPoolMap interfaces and implementations that work asynchronously.

If you are good with the commit in this pull request (that is option 1 above), I think it should work for both cases: it fixes #8238 and keeps the close behaviour, although with some added code complexity, which in this small class might be OK. In a new major version, though, some redesign of the ChannelPoolMap interface would probably help clear things up.

Edits: Fixed issue references.

@normanmaurer (Member)

@mihalyr got it... let's merge this then. In the next major release all the pool stuff is gone.

@normanmaurer added this to the 4.1.44.Final milestone Dec 11, 2019
@normanmaurer merged commit af58bfc into netty:4.1 on Dec 11, 2019
@normanmaurer (Member)

@mihalyr thanks

@mihalyr deleted the 4.1-fixedchannelpool-deadlock branch December 11, 2019 11:39
@sherman commented Jun 25, 2020

Hi, guys!

It seems the problem still exists.
I have an event loop group (AnotherTestPool) and some code that uses it under the hood.
When the AbstractChannelPoolMap.close() method is called repeatedly (an open/do something/close cycle) from the same pool, I still get a deadlock. See the following thread dump.

"AnotherTestPool-3-2" #17 prio=5 os_prio=0 cpu=42.68ms elapsed=19.40s tid=0x00007febd5455800 nid=0x64b9 in Object.wait()  [0x00007feb598db000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.2/Native Method)
	- waiting on <0x000000071a4717b0> (a io.netty.util.concurrent.DefaultPromise)
	at java.lang.Object.wait(java.base@11.0.2/Object.java:328)
	at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:274)
	- waiting to re-lock in wait() <0x000000071a4717b0> (a io.netty.util.concurrent.DefaultPromise)
	at io.netty.util.concurrent.DefaultPromise.syncUninterruptibly(DefaultPromise.java:410)
	at io.netty.util.concurrent.DefaultPromise.syncUninterruptibly(DefaultPromise.java:35)
	at io.netty.channel.pool.AbstractChannelPoolMap.close(AbstractChannelPoolMap.java:150)
	at ru.yandex.misc.http.client.EventLoopAwareChannelPool.close(EventLoopAwareChannelPool.java:58)
	at ru.yandex.misc.http.client.HttpClient.close(HttpClient.java:338)
	at ru.yandex.yql.core.clusters.clients.clickhouse.ClickHouseClusterClientTest.lambda$executeQuery$1(ClickHouseClusterClientTest.java:251)
	at ru.yandex.yql.core.clusters.clients.clickhouse.ClickHouseClusterClientTest$$Lambda$127/0x0000000800363840.run(Unknown Source)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)


"globalEventExecutor-2-1" #18 prio=5 os_prio=0 cpu=3.16ms elapsed=19.39s tid=0x00007feb38002000 nid=0x64ba in Object.wait()  [0x00007feb597da000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.2/Native Method)
	- waiting on <0x000000071a4d6d30> (a io.netty.channel.DefaultChannelPromise)
	at java.lang.Object.wait(java.base@11.0.2/Object.java:328)
	at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:274)
	- waiting to re-lock in wait() <0x000000071a4d6d30> (a io.netty.channel.DefaultChannelPromise)
	at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137)
	at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30)
	at io.netty.channel.pool.SimpleChannelPool.close(SimpleChannelPool.java:402)
	at io.netty.channel.pool.SimpleChannelPool$8.call(SimpleChannelPool.java:416)
	at io.netty.channel.pool.SimpleChannelPool$8.call(SimpleChannelPool.java:413)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:96)
	at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
	at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:243)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)

If I move the call to close() to another thread pool, everything works fine.
If I switch to 4.1.42, the problem does not reproduce.
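
Roughly, the workaround looks like this (hypothetical helper name; the pool map itself is created elsewhere):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.netty.channel.pool.AbstractChannelPoolMap;

final class PoolMapCloser {
    // Run the blocking close() on a dedicated thread so that the wait inside
    // AbstractChannelPoolMap.close() never happens on a Netty event loop thread.
    static void closeOffEventLoop(AbstractChannelPoolMap<?, ?> poolMap) throws Exception {
        ExecutorService closer = Executors.newSingleThreadExecutor();
        try {
            closer.submit(poolMap::close).get(); // wait outside the event loop
        } finally {
            closer.shutdown();
        }
    }
}
```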

@mihalyr (Contributor, Author) commented Jun 25, 2020

Hi @sherman, it's been a while since I last touched Netty, but if I remember correctly the issue I fixed was about opening pools causing deadlocks, because get() internally and synchronously closed pools that had been opened as a result of races.

Someone had extended the SimpleChannelPool class with a closeAsync method. My fix was to use this closeAsync method from inside get when the pool is an instance of SimpleChannelPool. The closeAsync method was not exposed through an interface; Norman and I discussed this at the time, and the interface change would have required a major version bump, which we avoided by keeping the method on the class.

Back to AbstractChannelPoolMap.close(): after my first commit I had accidentally made it asynchronous, but I realized this changes the behaviour for people who might depend on the original synchronous behaviour, e.g. for closing pools before shutdown, so I changed it back to synchronous. Therefore, if you use close(), it blocks until all channels are closed, as you have noticed; see here.

Would it be possible for you to implement closeAsync in your ChannelPoolMap like here, just without the syncUninterruptibly() call, and use that instead of close(), or override close()?
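
Something along these lines might work as a user-side sketch (CustomPoolMap and its field are made-up names; only SimpleChannelPool#closeAsync is existing API):

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import io.netty.channel.pool.SimpleChannelPool;
import io.netty.util.concurrent.Future;

final class CustomPoolMap {
    private final ConcurrentMap<InetSocketAddress, SimpleChannelPool> pools =
            new ConcurrentHashMap<>();

    // Starts closing every pool without blocking the calling thread. Callers
    // that need to know when everything is closed can await the returned
    // futures from a non-event-loop thread, or add listeners to them.
    List<Future<Void>> closeAsync() {
        List<Future<Void>> closeFutures = new ArrayList<>();
        for (SimpleChannelPool pool : pools.values()) {
            closeFutures.add(pool.closeAsync());
        }
        pools.clear();
        return closeFutures;
    }
}
```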

@sherman commented Jun 26, 2020

Hi, @mihalyr!
Thank you for the quick response and the clear explanation of the problem. It helped a lot.
Replacing the blocking close() method with the async version works for me.

Maybe the decision to revert your changes in close() wasn't the best idea: the semantics would have changed, but it would not have led to the deadlock.
As far as I can see, the only option here is to completely rewrite the channel pool map, since AbstractChannelPoolMap does not look like a class that is ready to be extended in this way (final methods, private state).

Again, thank you for helping!
