Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

potential deadlock in DefaultApplicationLifecycle #7949

Open
mberndt123 opened this issue Oct 23, 2017 · 11 comments
Open

potential deadlock in DefaultApplicationLifecycle #7949

mberndt123 opened this issue Oct 23, 2017 · 11 comments

Comments

@mberndt123
Copy link

mberndt123 commented Oct 23, 2017

Hi there,

I ran into an interesting bug today. In DefaultApplicationLifecycles stop() method, there is a comment that reads as follows:

    // Do we care if one hook executes on another hooks redeeming thread? Hopefully not.
    import play.api.libs.iteratee.Execution.Implicits.trampoline

The problem is, we actually do care if "one hook" is a hook that synchronously shuts down a thread pool (i. e. it waits for all threads to terminate) and "another hook" is a hook that executes in a thread in that thread pool. Then the hook that waits for all threads to terminate will itself run in the thread pool, leading to a deadlock.

If that isn't clear, here's some code to demonstrate the problem.

import play.api.inject.DefaultApplicationLifecycle
import java.util.concurrent.Executors.newFixedThreadPool
import java.util.concurrent.TimeUnit
import scala.concurrent.ExecutionContext
import scala.concurrent.Future

val lifecycle = new DefaultApplicationLifecycle
val threadPool = newFixedThreadPool(42)
lifecycle.addStopHook { () =>
  Future.successful {
    println("shutting down thread pool…")
    threadPool.shutdown()
    if (! threadPool.awaitTermination(3, TimeUnit.SECONDS))
      println("timeout")
  }
}
val ec = ExecutionContext.fromExecutorService(threadPool)
lifecycle.addStopHook { () =>
  Future {
    println("running some shutdown code in thread pool…")
    Thread.sleep(1000)
  }(ec)
}
lifecycle.stop()

This will print "timeout" after 3 seconds.

@gmethvin
Copy link
Member

gmethvin commented Oct 29, 2017

Just curious, is this something you encountered in a real Play app?

I'd be ok with using the default thread pool here though. I can't see a benefit to using trampoline.

@richdougherty
Copy link
Member

Yeah it's fine to use the default thread pool here. The trampoline is a performance measure that's not really needed during shutdown. If we use the default thread pool we'd want to be careful about how it's shut down. It would need to be the last shutdown hook executed. Perhaps it would be safest for the shutdown code to make its own temporary thread or thread pool.

@samupra
Copy link
Contributor

samupra commented Oct 31, 2017

I will be taking this.. :)

@mberndt123
Copy link
Author

@gmethvin, Yes, I did encounter this in a real app. We use a DB connection pool and, since we use a blocking driver, a dedicated thread pool that runs all the code that interacts with the DB. The thread pool registers its shutdown hook first, and then the connection pool registers its shutdown hook, and of course the latter runs the DB shutdown code in the aforementioned thread pool. This led to the issue I observed.

Your remark about Play controlling the thread that stop() runs in leads me to think that you don't understand the semantics of the trampoline thread pool. It doesn't matter what thread stop() is called in. The trampoline execution context will run code in whatever thread calls its execute method. Usually that means the thread that completes the underlying promise. In our case, that would be the DB thread pool, leading to the deadlock I observed, because the shutdown hook looks like () => Future.successful(…), meaning it will not switch to another ExecutionContext.

@gmethvin
Copy link
Member

gmethvin commented Nov 2, 2017

@mberndt123 Sorry, that's not what I meant to say. The problem is that Play currently assumes it has the freedom to decide where the stop hooks are executed. The assumption is that you don't care where the thread that creates the future is executed, and if you had something that was sensitive to which thread pool it's in, you would return a Future that explicitly uses a specific thread pool.

To be clear, I think this should be changed to use the default execution context. I'm just explaining the logic behind the way it was done originally.

I would recommend we also update the documentation to specify that all stop hooks are executed in the default execution context. That way this behavior can be relied upon in the future.

@mberndt123
Copy link
Author

mberndt123 commented Nov 3, 2017

@gmethvin, IIRC Play's default execution context is the actor system's dispatcher. Since the actor system itself is also shut down using a stop hook, I doubt that's going to work.

To me it seems that the stop hooks should run in the thread that calls the ApplicationLoader's load method when the application is started, since that is around before everything else. That would mean blocking that thread while it's waiting for the futures to complete – I guess that's acceptable in this case.

@richdougherty
Copy link
Member

The thread pool registers its shutdown hook first, and then the connection pool registers its shutdown hook, and of course the latter runs the DB shutdown code in the aforementioned thread pool. This led to the issue I observed.

Are you observing the shutdown hooks running in reverse order, because that's what should be happening? The DB shutdown code should run before the thread pool shutdown code, so the DB shutdown code should have a normal, functioning thread pool while it's shutting down.

@mberndt123
Copy link
Author

mberndt123 commented Nov 7, 2017

@richdougherty, I don't think it's useful to discuss the details of my application. I have written the above example code precisely to allow you to understand the problem without having to worry about any of that stuff, and I assure you that it captures the essence of what's going on in my application.

I have also updated it just now to print some more useful messages and trigger the bug more consistently. If you need any more information to reproduce, understand or fix this, please let me know…

@richdougherty
Copy link
Member

Hi @mberndt123, thanks for the info you've provided so far. You have indeed clearly shown how there can be a dependency between the shutdown code for a thread pool and the thread pool itself, resulting in a deadlock. I think we can straightforwardly reproduce and fix the bug based on the code you showed above.

My question was about something slightly different. Your message here suggests an interaction between two resources may be occurring, i.e. the thread pool should not shut down before the database is shut down. The way you phrased that suggested that the thread pool and database shutdown ordering is incorrect. I.e. the database depends on the thread pool but the thread pool shuts down first. That should not happen.

The logic of shutdown hook registration and execution should prevent issues with shutting down a thread pool before a thread pool consumer has been shut down. If you're seeing issues with a thread pool shutting down before some resource that depends on it, then that's a subtly different bug for us to consider. If there's a bug like this then it wouldn't just be restricted to thread pools, and it wouldn't be fixed by running shutdown hooks on a safe threadpool. For example, this bug could occur if we shut down a cache before we shut down something depending on that cache - nothing to do with thread pools at all.

If you could clarify what you meant in that message and whether you believe the shutdown hooks are running in the correct order then that would be helpful for us, so we can fix all the bugs in this area.

@mberndt123
Copy link
Author

mberndt123 commented Nov 8, 2017

Hi @richdougherty, I'm surprised you think so, because I don't think my phrasing suggests that. In any case, it is not what I meant, the shutdown hooks run in the correct order in my application.

//edit: and by correct, I mean the reverse order of how they were registered

@richdougherty
Copy link
Member

Thanks for confirming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants