
Synchronously close pools when closing AbstractChannelPoolMap #9857

Merged 1 commit into netty:4.1 on Dec 11, 2019

Conversation

@mihalyr (Contributor) commented Dec 7, 2019

Motivation:

In #9830 the implementation of the get/remove/close methods was changed to
avoid deadlocks on event loops. The change modified the methods to close the
managed ChannelPools asynchronously and return immediately. While this
behavior might be fine for get/remove, it changes what a user expects from a
close() method: after close() returns, resources might still be open.

Modifications:

This change is a follow-up for #9830 to preserve the synchronous
behavior of the AbstractChannelPoolMap#close() method.

Result:

AbstractChannelPoolMap#close() returns once the managed pools have
been closed.

Related commit a322add
Related issues #8238, #9603
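
For illustration only, a minimal usage sketch of the behaviour described in the Result section. The class name, host and port are made up; only the public Netty pool APIs are used:

```java
import java.net.InetSocketAddress;

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.pool.AbstractChannelPoolHandler;
import io.netty.channel.pool.AbstractChannelPoolMap;
import io.netty.channel.pool.ChannelPool;
import io.netty.channel.pool.SimpleChannelPool;
import io.netty.channel.socket.nio.NioSocketChannel;

public final class PoolMapCloseExample {
    public static void main(String[] args) {
        NioEventLoopGroup group = new NioEventLoopGroup();
        Bootstrap bootstrap = new Bootstrap().group(group).channel(NioSocketChannel.class);

        AbstractChannelPoolMap<InetSocketAddress, ChannelPool> poolMap =
                new AbstractChannelPoolMap<InetSocketAddress, ChannelPool>() {
                    @Override
                    protected ChannelPool newPool(InetSocketAddress key) {
                        return new SimpleChannelPool(bootstrap.clone().remoteAddress(key),
                                new AbstractChannelPoolHandler() {
                                    @Override
                                    public void channelCreated(Channel ch) {
                                        // configure the pipeline here
                                    }
                                });
                    }
                };

        poolMap.get(new InetSocketAddress("example.com", 80)); // lazily creates a pool

        // With this change, close() blocks until every managed pool has been
        // closed, so shared resources can safely be shut down right after it returns.
        poolMap.close();
        group.shutdownGracefully();
    }
}
```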

@netty-bot

Can one of the admins verify this patch?

@mihalyr (Contributor, Author) commented Dec 7, 2019

@normanmaurer after taking a look at #9603 I realized there might be users expecting ChannelPoolMap#close() to close the underlying pools before returning. Since this was a synchronous call before my change in #9830, I think it would be safer to keep it synchronous.

The other thing is that it is becoming a bit unclear which operations in ChannelPoolMap execute synchronously and which do not: the interface suggests all methods are synchronous (no Futures returned), yet get and remove may still be executing tasks asynchronously after they return, while close is blocking.

To avoid breaking changes we could leave ChannelPoolMap/AbstractChannelPoolMap fully synchronous, as the current interface suggests, and add a new AsyncChannelPoolMap/AbstractAsyncChannelPoolMap interface/class that returns Futures from get/remove/close, which users can wait on if they like. This could go in tandem with an AsyncCloseablePool or similar interface that would turn the current SimpleChannelPool#closeAsync method into an interface member, so the new pool maps could use it consistently without instanceof checks and internal branching.
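
A rough sketch of what such an async map could look like; none of these types exist in Netty, they are only a proposal:

```java
import io.netty.channel.pool.ChannelPool;
import io.netty.util.concurrent.Future;

// Hypothetical interface, not part of Netty: every operation returns a Future
// so callers can decide whether to block on it or chain listeners instead.
public interface AsyncChannelPoolMap<K, P extends ChannelPool> {
    // Resolves (and lazily creates) the pool mapped to the key.
    Future<P> getAsync(K key);

    // Removes the pool mapped to the key; completes once the pool is closed.
    Future<Boolean> removeAsync(K key);

    // Closes all managed pools; completes once every pool has been closed.
    Future<Void> closeAsync();
}
```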

Please let me know if you think this might be worth adding. Thanks.

@normanmaurer (Member)

@netty-bot test this please

@normanmaurer (Member)

@mihalyr so you are suggesting to revert your commit?

@mihalyr (Contributor, Author) commented Dec 10, 2019

@normanmaurer sorry for the confusion, I think about two options:

  1. Keep #9830 (Asynchronously close pools in AbstractChannelPoolMap) and add this PR (#9857, Synchronously close pools when closing AbstractChannelPoolMap) on top of it, to make sure the close behaviour remains the same.
  2. Revert #9830, which brings us back to the deadlock in #8238 (FixedChannelPool.close deadlock); to solve both issues in a cleaner way we might then need to add new ChannelPoolMap interfaces and implementations that work asynchronously.

If you are good with the commit in this pull request (that is option 1 above), I think it should work for both cases: it fixes #8238 and keeps the close behaviour, although with some added code complexity, which in this small class might be OK. In a new major version, though, some redesign of the ChannelPoolMap interface would probably help clear things up.

Edits: Fixed issue references.

@normanmaurer (Member)

@mihalyr got it... let's merge this then. In the next major release all the pool stuff is gone.

@normanmaurer added this to the 4.1.44.Final milestone Dec 11, 2019
@normanmaurer merged commit af58bfc into netty:4.1 on Dec 11, 2019
@normanmaurer (Member)

@mihalyr thanks

@mihalyr deleted the 4.1-fixedchannelpool-deadlock branch December 11, 2019 11:39
@sherman commented Jun 25, 2020

Hi, guys!

It seems the problem still exists.
I have an event loop group (AnotherTestPool) and some code that uses it under the hood.
When the AbstractChannelPoolMap.close() method is called repeatedly (an open/do something/close cycle) from the same pool, I still get a deadlock. See the following thread dump.

"AnotherTestPool-3-2" #17 prio=5 os_prio=0 cpu=42.68ms elapsed=19.40s tid=0x00007febd5455800 nid=0x64b9 in Object.wait()  [0x00007feb598db000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.2/Native Method)
	- waiting on <0x000000071a4717b0> (a io.netty.util.concurrent.DefaultPromise)
	at java.lang.Object.wait(java.base@11.0.2/Object.java:328)
	at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:274)
	- waiting to re-lock in wait() <0x000000071a4717b0> (a io.netty.util.concurrent.DefaultPromise)
	at io.netty.util.concurrent.DefaultPromise.syncUninterruptibly(DefaultPromise.java:410)
	at io.netty.util.concurrent.DefaultPromise.syncUninterruptibly(DefaultPromise.java:35)
	at io.netty.channel.pool.AbstractChannelPoolMap.close(AbstractChannelPoolMap.java:150)
	at ru.yandex.misc.http.client.EventLoopAwareChannelPool.close(EventLoopAwareChannelPool.java:58)
	at ru.yandex.misc.http.client.HttpClient.close(HttpClient.java:338)
	at ru.yandex.yql.core.clusters.clients.clickhouse.ClickHouseClusterClientTest.lambda$executeQuery$1(ClickHouseClusterClientTest.java:251)
	at ru.yandex.yql.core.clusters.clients.clickhouse.ClickHouseClusterClientTest$$Lambda$127/0x0000000800363840.run(Unknown Source)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)


"globalEventExecutor-2-1" #18 prio=5 os_prio=0 cpu=3.16ms elapsed=19.39s tid=0x00007feb38002000 nid=0x64ba in Object.wait()  [0x00007feb597da000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.2/Native Method)
	- waiting on <0x000000071a4d6d30> (a io.netty.channel.DefaultChannelPromise)
	at java.lang.Object.wait(java.base@11.0.2/Object.java:328)
	at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:274)
	- waiting to re-lock in wait() <0x000000071a4d6d30> (a io.netty.channel.DefaultChannelPromise)
	at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137)
	at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30)
	at io.netty.channel.pool.SimpleChannelPool.close(SimpleChannelPool.java:402)
	at io.netty.channel.pool.SimpleChannelPool$8.call(SimpleChannelPool.java:416)
	at io.netty.channel.pool.SimpleChannelPool$8.call(SimpleChannelPool.java:413)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:96)
	at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
	at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:243)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)

If I move the call to close() to another thread pool, everything works fine.
If I switch to 4.1.42, the problem does not reproduce.
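
Roughly, the workaround looks like this (hypothetical helper name; the pool map itself is created elsewhere):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.netty.channel.pool.AbstractChannelPoolMap;

final class PoolMapCloser {
    // Run the blocking close() on a dedicated thread so that the wait inside
    // AbstractChannelPoolMap.close() never happens on a Netty event loop thread.
    static void closeOffEventLoop(AbstractChannelPoolMap<?, ?> poolMap) throws Exception {
        ExecutorService closer = Executors.newSingleThreadExecutor();
        try {
            closer.submit(poolMap::close).get(); // wait outside the event loop
        } finally {
            closer.shutdown();
        }
    }
}
```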

@mihalyr (Contributor, Author) commented Jun 25, 2020

Hi @sherman, it's been a while since I last touched Netty, but if I remember correctly the issue I fixed was about opening pools causing deadlocks, because get() internally and synchronously closed pools that had been opened as a result of races.

Someone had extended the SimpleChannelPool class with a closeAsync method. My fix was to use this closeAsync method from inside get when the pool is an instance of SimpleChannelPool. The closeAsync method was not exposed through an interface; Norman and I discussed this at the time, and the interface change would have required a major version bump, which we avoided by keeping the method on the class.

Back to AbstractChannelPoolMap.close(): after my first commit I had accidentally made it asynchronous, but I realized this changes the behaviour for people who might depend on the original synchronous behaviour, e.g. for closing pools before shutdown, so I changed it back to synchronous. Therefore, if you use close(), it blocks until all channels are closed, as you have noticed; see here.

Would it be possible for you to implement closeAsync in your ChannelPoolMap like here, just without the syncUninterruptibly() call, and use that instead of close(), or override close()?
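
Something along these lines might work as a user-side sketch (CustomPoolMap and its field are made-up names; only SimpleChannelPool#closeAsync is existing API):

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import io.netty.channel.pool.SimpleChannelPool;
import io.netty.util.concurrent.Future;

final class CustomPoolMap {
    private final ConcurrentMap<InetSocketAddress, SimpleChannelPool> pools =
            new ConcurrentHashMap<>();

    // Starts closing every pool without blocking the calling thread. Callers
    // that need to know when everything is closed can await the returned
    // futures from a non-event-loop thread, or add listeners to them.
    List<Future<Void>> closeAsync() {
        List<Future<Void>> closeFutures = new ArrayList<>();
        for (SimpleChannelPool pool : pools.values()) {
            closeFutures.add(pool.closeAsync());
        }
        pools.clear();
        return closeFutures;
    }
}
```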

@sherman commented Jun 26, 2020

Hi, @mihalyr!
Thank you for the quick response and the clear explanation of the problem. It helped a lot.
Replacing the blocking close() method with the async version works for me.

Maybe the decision to revert your changes in close() wasn't the best idea: the semantics would have changed, but it would not have led to the deadlock.
As far as I can see, the only option here is to completely rewrite the channel pool map, since AbstractChannelPoolMap does not look like a class that is ready to be extended in this way (final methods, private state).

Again, thank you for helping!
