Avoid lock contention #4761
Conversation
Tremendous research and writeup.
I am surprised by the Fork-Join Pool. Most people use global for brevity, but the strong recommendation in cats-effect is a fixed thread pool. I expect this is probably a big optimization independent of that decision?
- val timeoutResponse = timer.sleep(finite).as(Response.timeout[F])
+ val timeoutResponse = Concurrent[F].cancelable[Response[F]] { callback =>
+   val cancellable =
+     scheduler.schedule(() => callback(Right(Response.timeout[F])), executionContext, finite)
We could probably cache that Right(Response.timeout[F]) in a val as soon as we know F, but it probably won't make a noticeable difference.
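A tiny sketch of what that caching could look like (the class and field names below are made up for illustration, not the actual http4s code):

```scala
import org.http4s.Response

// Illustrative sketch of the caching suggested above: build Right(Response.timeout[F])
// once per stage rather than allocating it on every timed-out request.
final class CachedTimeoutResponse[F[_]] {
  val timedOut: Either[Throwable, Response[F]] = Right(Response.timeout[F])

  // The scheduler callback would then reuse the cached value, e.g.
  //   scheduler.schedule(() => callback(timedOut), executionContext, finite)
}
```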
@@ -347,7 +356,11 @@ private[blaze] class Http1ServerStage[F[_]](
   private[this] val raceTimeout: Request[F] => F[Response[F]] =
     responseHeaderTimeout match {
       case finite: FiniteDuration =>
-        val timeoutResponse = timer.sleep(finite).as(Response.timeout[F])
+        val timeoutResponse = Concurrent[F].cancelable[Response[F]] { callback =>
+          val cancellable =
Very small nit: it's spelled cancelable in cats-effect.
Something to keep in mind with FJP is it doesn't properly bound the number of workers to the physical threads, and it has a fairly dubious starvation-detection algorithm which results in a large number of workers being spuriously created even when none of them are blocked. This is the primary reasoning behind its banishment from Cats Effect. Though, obviously, the fixed thread pool contention issues give all of that back when you have a machine with a relatively high core count (anything above 16 starts to hurt a lot).

The correct answer here, at least in Cats Effect 2, is unfortunately somewhat conditional. Your benchmark is hitting the absolute best of FJP and the absolute worst of fixed thread pool. Applications with more complex handlers or any stretches of CPU-bound work will degrade quite rapidly in practice, which makes FJP start to look a lot worse and fixed thread pool start to look quite a bit better. So the answer really depends.

Cats Effect 3's scheduler basically gives you all the benefits of FJP (though with better optimization and special-casing of various fiber scenarios) without the problem of unbounded workers, which in theory resolves the need to pick between the two, and also solves the artificial microbenchmark problem.
But at any rate, none of this surprises me. :-) Unfortunately, it's a little difficult to see what the best approach is in the short term (http4s running against cats effect 2), since the suggested compute pool change would result in significantly worse performance for many use-cases, despite the significant gains it would produce in this scenario.
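For concreteness, here is a minimal sketch (illustrative only, not code from the PR or from Cats Effect) of the two compute-pool styles being compared, wrapped as ExecutionContexts for cats-effect 2; the object name and pool sizes are assumptions.

```scala
import java.util.concurrent.{Executors, ForkJoinPool}
import scala.concurrent.ExecutionContext
import cats.effect.{ContextShift, IO}

// Illustrative sketch of the two compute-pool styles discussed above,
// wrapped as ExecutionContexts for cats-effect 2.
object ComputePools {
  private val cores = Runtime.getRuntime.availableProcessors()

  // Work-stealing pool, the same family as scala.concurrent.ExecutionContext.global.
  val forkJoin: ExecutionContext =
    ExecutionContext.fromExecutor(new ForkJoinPool(cores))

  // Fixed pool backed by a ThreadPoolExecutor (the traditional CE2 recommendation);
  // all submissions go through the executor's single work queue.
  val fixed: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(cores))

  // A ContextShift[IO] built on whichever pool is chosen as the compute pool.
  def contextShift(ec: ExecutionContext): ContextShift[IO] = IO.contextShift(ec)
}
```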
Techempower is a disservice in both conception and administration, but they continue to be a popular reference. It's wise to put our best foot forward. The right thread pool in that context is certainly whatever gets the best top-line ping route numbers, which is all the nuance anybody gets from those damned things. I look forward to the CE3 advances that make this less contextual. Again, we don't have the benchmarks to see the timer difference on a fixed pool, but I see no reason this wouldn't also help in that usage. And most people are using the global pool with Blaze, which is a fork-join pool. I think this is going to be an improvement for most, if not all, blaze-server users.
I think the changes I've proposed in this PR are non-controversial. Less locking should be better. I'll do some more benchmarking with other configurations of thread pools to make sure that I'm not over-optimising for a specific case. I already did some tests with both http4s'

I also have some ideas for reducing the amount of thread shifting. These should be beneficial in their own right (unless they cause some other issues), and should reduce the lock contention in ThreadPoolExecutor.

I'm glad to hear about the advances in scheduling in CE3. I'm looking forward to using CE3 in production.

The way I see TFB in this context is that http4s itself should be optimised for a wide variety of use-cases for the benefit of its users. TFB is unable to reproduce this "wide variety"; nonetheless it can be useful for detecting performance-degrading phenomena in http4s and quantifying them. As a secondary goal, improving TFB scores is good for marketing.
Common test conditions
I've changed the test setup a bit, so I had to rerun the results I gathered originally as well. Just a short reminder: the goal here is not to compare the thread pool configurations; the goal is to see how the changes proposed in this PR affect throughput under different circumstances.
Tested configurations
Shared FJP
This is the configuration that I used at the beginning. There's one fork-join pool shared by cats-effect and http4s.
Shared FTP
There's one fixed thread pool shared by cats-effect and http4s.
Separate FTP
Http4s' blaze runs on its own fixed thread pool, separate from the one used by cats-effect.
FTP for Cats, FJP for http4s
Again the pools are separate: a fixed thread pool for cats-effect and a fork-join pool for http4s. This is the closest to a default configuration. (A wiring sketch for these configurations follows below.)
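A rough wiring sketch of the shared-pool idea in an http4s 0.21-style application (assumptions only; this is not the benchmark code, and the object name, port, and route are placeholders): one ExecutionContext serves both as the cats-effect compute pool and as blaze's execution context, which corresponds to the shared variants; the separate variants pass a different pool to BlazeServerBuilder.

```scala
import java.util.concurrent.ForkJoinPool
import scala.concurrent.ExecutionContext
import cats.effect.{ContextShift, ExitCode, IO, Timer}
import org.http4s.HttpRoutes
import org.http4s.dsl.io._
import org.http4s.implicits._
import org.http4s.server.blaze.BlazeServerBuilder

// "Shared FJP": one pool drives both the IO runtime and blaze's handlers.
object SharedPoolExample {
  private val shared: ExecutionContext =
    ExecutionContext.fromExecutor(
      new ForkJoinPool(Runtime.getRuntime.availableProcessors()))

  implicit val cs: ContextShift[IO] = IO.contextShift(shared)
  implicit val timer: Timer[IO]     = IO.timer(shared)

  private val routes = HttpRoutes.of[IO] { case GET -> Root / "plaintext" =>
    Ok("Hello, World!")
  }

  // Bind blaze to the same pool the IO runtime runs on.
  val server: IO[ExitCode] =
    BlazeServerBuilder[IO](shared)
      .bindHttp(8080, "0.0.0.0")
      .withHttpApp(routes.orNotFound)
      .serve
      .compile
      .drain
      .as(ExitCode.Success)
}
```

Swapping `new ForkJoinPool(...)` for `Executors.newFixedThreadPool(...)` gives the "Shared FTP" variant; constructing a second pool and passing only that one to `BlazeServerBuilder` gives the "separate" variants.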
Measurements
Shared FJP
Shared FTP
Separate FTP
FTP for Cats, FJP for http4s:
Interpretation
The proposed changes are a clear win when paired with FJP. There is no lock contention, CPU utilization is high, and so is the throughput.
The difference in score for Shared FTP vs Separate FTP clearly shows how bad the lock contention on the thread pool is. Having twice as many threads as cores should be bad, but having two locks instead of one is a huge win.
Generally, in scenarios other than the shared FJP, we see results ranging from -10.6% to +14.6%. Some of the differences are just test variance, but I reran the "FTP for Cats, FJP for http4s" configuration a few times and it's clear that the new solution scores worse in the scenarios with 512 / 1024 parallel connections. Overall there are ups and downs, but I still think less locking is better. I'm looking forward to seeing how this can improve performance when paired with the new thread pool from CE3 that Daniel mentioned.
And that brings us to the next topic. @rossabaker, I see you already merged this PR to series/0.21 and series/0.22. Should I prepare a PR (including measurements) with this change ported to 1.0?
Motivation
I've got a use case in which I need my HTTP server (implemented with http4s-blaze-server) to process lots of lightweight requests (small request body, small response body, cheap business logic). The throughput in that case is determined mainly by http4s-blaze-server itself.
My main observations regarding the performance of http4s-blaze-server in such a case are:
I'm able to reliably reproduce similar conditions using the https://github.com/TechEmpower/FrameworkBenchmarks "plaintext" test, and that's what I'm using to get some reproducible numbers.
I'm working with F = cats.effect.IO. I want to address these issues with a series of small and independent pull requests. This one only addresses the CPU under-utilisation, and only partially.
Diagnosis
Profiling shows that the under-utilisation of CPU is caused (at least partially) by blocking on three locks / monitors:
1. the lock inside the ScheduledThreadPoolExecutor backing Timer[IO]
2. the lock inside the ThreadPoolExecutor used as the compute pool
3. the monitor of Http1ServerStage.parser
Solution
This PR addresses 1. by using the TickWheelExecutor from blaze instead of Timer[IO], and it addresses 3. by creating the cancel token before acquiring the monitor of parser, and then only acquiring it in order to modify the cancelToken field. I've addressed 2. by using a ForkJoinPool instead of the default ThreadPoolExecutor, but that is outside http4s' scope of responsibility. Nonetheless, what could be done in http4s is a reduction of thread shifts, which would reduce the pressure on the thread pool's scheduling mechanism, but that is outside the scope of this PR.
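To make the first change concrete, here is a self-contained sketch of the pattern, following the diff excerpt shown earlier in the conversation; the object name and method signature are illustrative, while `scheduler`, `executionContext`, and `finite` correspond to names already in scope in Http1ServerStage.

```scala
import scala.concurrent.ExecutionContext
import scala.concurrent.duration.FiniteDuration
import cats.effect.Concurrent
import org.http4s.Response
import org.http4s.blaze.util.TickWheelExecutor

// Sketch of the timer change: the timeout response is produced by blaze's
// TickWheelExecutor rather than by Timer[F].sleep (which goes through a
// ScheduledThreadPoolExecutor and its lock).
object TickWheelTimeout {
  def timeoutResponse[F[_]](
      scheduler: TickWheelExecutor,
      executionContext: ExecutionContext,
      finite: FiniteDuration
  )(implicit F: Concurrent[F]): F[Response[F]] =
    F.cancelable[Response[F]] { callback =>
      // Register the timeout on the tick wheel; if it fires first, it completes
      // the async result with the canned timeout response.
      val cancellable =
        scheduler.schedule(() => callback(Right(Response.timeout[F])), executionContext, finite)
      // Cancel token: unschedule the tick if the real response wins the race.
      F.delay(cancellable.cancel())
    }
}
```

The key point is that scheduling on the tick wheel avoids taking a JDK executor lock, and the returned handle makes the effect cancelable so the timer is unscheduled when the handler responds first.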
Verification
The measurements are done using tfb (TechEmpower Framework Benchmarks), specifically this commit RafalSumislawski/FrameworkBenchmarks@1436a3c from my fork. It includes the solution to 2., so that solution is included in the baseline measurement.
This gives us a decent 33% max throughput increase, as well as an improvement in dealing with a large number of parallel requests (which was to be expected from a change that fixes lock contention). The difference will most likely be smaller on machines with fewer cores, and significantly bigger on machines with lots of cores (i.e. servers).
The CPU utilisation for the baseline is ~50%; for this PR it's ~80%. Still not 100%, so more work remains to be done on that front...
http4s-blaze-client may need analogous changes, but I don't want to make them without setting up a benchmark for verification.