potential deadlock in DefaultApplicationLifecycle #7949

mberndt123 · 2017-10-23T21:49:39Z

Hi there,

I ran into an interesting bug today. In DefaultApplicationLifecycles stop() method, there is a comment that reads as follows:

    // Do we care if one hook executes on another hooks redeeming thread? Hopefully not.
    import play.api.libs.iteratee.Execution.Implicits.trampoline

The problem is, we actually do care if "one hook" is a hook that synchronously shuts down a thread pool (i. e. it waits for all threads to terminate) and "another hook" is a hook that executes in a thread in that thread pool. Then the hook that waits for all threads to terminate will itself run in the thread pool, leading to a deadlock.

If that isn't clear, here's some code to demonstrate the problem.

import play.api.inject.DefaultApplicationLifecycle
import java.util.concurrent.Executors.newFixedThreadPool
import java.util.concurrent.TimeUnit
import scala.concurrent.ExecutionContext
import scala.concurrent.Future

val lifecycle = new DefaultApplicationLifecycle
val threadPool = newFixedThreadPool(42)
lifecycle.addStopHook { () =>
  Future.successful {
    println("shutting down thread pool…")
    threadPool.shutdown()
    if (! threadPool.awaitTermination(3, TimeUnit.SECONDS))
      println("timeout")
  }
}
val ec = ExecutionContext.fromExecutorService(threadPool)
lifecycle.addStopHook { () =>
  Future {
    println("running some shutdown code in thread pool…")
    Thread.sleep(1000)
  }(ec)
}
lifecycle.stop()

This will print "timeout" after 3 seconds.

The text was updated successfully, but these errors were encountered:

gmethvin · 2017-10-29T02:39:14Z

Just curious, is this something you encountered in a real Play app?

I'd be ok with using the default thread pool here though. I can't see a benefit to using trampoline.

richdougherty · 2017-10-29T20:12:28Z

Yeah it's fine to use the default thread pool here. The trampoline is a performance measure that's not really needed during shutdown. If we use the default thread pool we'd want to be careful about how it's shut down. It would need to be the last shutdown hook executed. Perhaps it would be safest for the shutdown code to make its own temporary thread or thread pool.

samupra · 2017-10-31T03:45:23Z

I will be taking this.. :)

mberndt123 · 2017-11-01T13:16:21Z

@gmethvin, Yes, I did encounter this in a real app. We use a DB connection pool and, since we use a blocking driver, a dedicated thread pool that runs all the code that interacts with the DB. The thread pool registers its shutdown hook first, and then the connection pool registers its shutdown hook, and of course the latter runs the DB shutdown code in the aforementioned thread pool. This led to the issue I observed.

Your remark about Play controlling the thread that stop() runs in leads me to think that you don't understand the semantics of the trampoline thread pool. It doesn't matter what thread stop() is called in. The trampoline execution context will run code in whatever thread calls its execute method. Usually that means the thread that completes the underlying promise. In our case, that would be the DB thread pool, leading to the deadlock I observed, because the shutdown hook looks like () => Future.successful(…), meaning it will not switch to another ExecutionContext.

gmethvin · 2017-11-02T01:53:54Z

@mberndt123 Sorry, that's not what I meant to say. The problem is that Play currently assumes it has the freedom to decide where the stop hooks are executed. The assumption is that you don't care where the thread that creates the future is executed, and if you had something that was sensitive to which thread pool it's in, you would return a Future that explicitly uses a specific thread pool.

To be clear, I think this should be changed to use the default execution context. I'm just explaining the logic behind the way it was done originally.

I would recommend we also update the documentation to specify that all stop hooks are executed in the default execution context. That way this behavior can be relied upon in the future.

mberndt123 · 2017-11-03T23:22:05Z

@gmethvin, IIRC Play's default execution context is the actor system's dispatcher. Since the actor system itself is also shut down using a stop hook, I doubt that's going to work.

To me it seems that the stop hooks should run in the thread that calls the ApplicationLoader's load method when the application is started, since that is around before everything else. That would mean blocking that thread while it's waiting for the futures to complete – I guess that's acceptable in this case.

richdougherty · 2017-11-05T20:48:12Z

The thread pool registers its shutdown hook first, and then the connection pool registers its shutdown hook, and of course the latter runs the DB shutdown code in the aforementioned thread pool. This led to the issue I observed.

Are you observing the shutdown hooks running in reverse order, because that's what should be happening? The DB shutdown code should run before the thread pool shutdown code, so the DB shutdown code should have a normal, functioning thread pool while it's shutting down.

mberndt123 · 2017-11-07T23:21:57Z

@richdougherty, I don't think it's useful to discuss the details of my application. I have written the above example code precisely to allow you to understand the problem without having to worry about any of that stuff, and I assure you that it captures the essence of what's going on in my application.

I have also updated it just now to print some more useful messages and trigger the bug more consistently. If you need any more information to reproduce, understand or fix this, please let me know…

richdougherty · 2017-11-08T03:52:44Z

Hi @mberndt123, thanks for the info you've provided so far. You have indeed clearly shown how there can be a dependency between the shutdown code for a thread pool and the thread pool itself, resulting in a deadlock. I think we can straightforwardly reproduce and fix the bug based on the code you showed above.

My question was about something slightly different. Your message here suggests an interaction between two resources may be occurring, i.e. the thread pool should not shut down before the database is shut down. The way you phrased that suggested that the thread pool and database shutdown ordering is incorrect. I.e. the database depends on the thread pool but the thread pool shuts down first. That should not happen.

The logic of shutdown hook registration and execution should prevent issues with shutting down a thread pool before a thread pool consumer has been shut down. If you're seeing issues with a thread pool shutting down before some resource that depends on it, then that's a subtly different bug for us to consider. If there's a bug like this then it wouldn't just be restricted to thread pools, and it wouldn't be fixed by running shutdown hooks on a safe threadpool. For example, this bug could occur if we shut down a cache before we shut down something depending on that cache - nothing to do with thread pools at all.

If you could clarify what you meant in that message and whether you believe the shutdown hooks are running in the correct order then that would be helpful for us, so we can fix all the bugs in this area.

mberndt123 · 2017-11-08T10:43:47Z

Hi @richdougherty, I'm surprised you think so, because I don't think my phrasing suggests that. In any case, it is not what I meant, the shutdown hooks run in the correct order in my application.

//edit: and by correct, I mean the reverse order of how they were registered

richdougherty · 2017-11-08T19:28:33Z

Thanks for confirming.

richdougherty added type:defect help wanted labels Oct 29, 2017

dwijnand added the backlog label Feb 14, 2019

dwijnand removed the status:backlog label May 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

potential deadlock in DefaultApplicationLifecycle #7949

potential deadlock in DefaultApplicationLifecycle #7949

mberndt123 commented Oct 23, 2017 •

edited by mkurz

gmethvin commented Oct 29, 2017 •

edited

richdougherty commented Oct 29, 2017

samupra commented Oct 31, 2017

mberndt123 commented Nov 1, 2017

gmethvin commented Nov 2, 2017 •

edited

mberndt123 commented Nov 3, 2017 •

edited

richdougherty commented Nov 5, 2017

mberndt123 commented Nov 7, 2017 •

edited

richdougherty commented Nov 8, 2017

mberndt123 commented Nov 8, 2017 •

edited

richdougherty commented Nov 8, 2017

potential deadlock in DefaultApplicationLifecycle #7949

potential deadlock in DefaultApplicationLifecycle #7949

Comments

mberndt123 commented Oct 23, 2017 • edited by mkurz

gmethvin commented Oct 29, 2017 • edited

richdougherty commented Oct 29, 2017

samupra commented Oct 31, 2017

mberndt123 commented Nov 1, 2017

gmethvin commented Nov 2, 2017 • edited

mberndt123 commented Nov 3, 2017 • edited

richdougherty commented Nov 5, 2017

mberndt123 commented Nov 7, 2017 • edited

richdougherty commented Nov 8, 2017

mberndt123 commented Nov 8, 2017 • edited

richdougherty commented Nov 8, 2017

mberndt123 commented Oct 23, 2017 •

edited by mkurz

gmethvin commented Oct 29, 2017 •

edited

gmethvin commented Nov 2, 2017 •

edited

mberndt123 commented Nov 3, 2017 •

edited

mberndt123 commented Nov 7, 2017 •

edited

mberndt123 commented Nov 8, 2017 •

edited