
Avoid lock contention #4761

Merged
merged 2 commits into http4s:series/0.21 on Apr 25, 2021

Conversation

RafalSumislawski (Member)

Motivation

I've got a use case in which I need my HTTP server (implemented with http4s-blaze-server) to process lots of lightweight requests (small request body, small response body, cheap business logic). The throughput in that case is determined mainly by http4s-blaze-server.

My main observations regarding performance of http4s-blaze-server in such case are:

  • http4s is the dominant consumer of CPU time (roughly 10% of CPU time goes to the actual application logic)
  • the application is unable to fully utilise the processing power of a multicore CPU. For benchmarking I'm using a Ryzen 5600X (6 cores, 12 threads). On that platform it utilises only 50% of the CPU, despite having 13 blaze selector threads and 12 threads in the main ExecutionContext / ContextShift.
  • there's a high allocation rate, with fs2 as the dominant allocator, despite not really using any form of streaming (EntityEncoder.simple converting non-streams into streams is the reason)
  • from time to time a request is processed twice

I'm able to reliably reproduce similar conditions using the https://github.com/TechEmpower/FrameworkBenchmarks "plaintext" test, and that's what I'm using to get some reproducible numbers.

I'm working with F = cats.effect.IO.

I want to address these issues with a series of small and independent pull requests. This one only addresses the CPU under-utilisation, and only partially.

Diagnosis

Profiling shows that the under-utilisation of the CPU is caused (at least partially) by blocking on three locks / monitors:

  1. The lock inside the ScheduledThreadPoolExecutor inside the Timer[IO] - it needs to be locked while scheduling or cancelling timeouts (illustrated by the sketch after this list)
  2. The lock inside the default ExecutionContext / ContextShift provided by IOApp
  3. The monitor of Http1ServerStage.parser
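
To make 1. concrete, here is a minimal sketch (an illustration of the default wiring, not code from this PR) of how a Timer[IO] funnels every sleep through a ScheduledExecutorService whose delay queue is guarded by a single lock:

  import java.util.concurrent.{Executors, ScheduledExecutorService}
  import scala.concurrent.ExecutionContext
  import scala.concurrent.duration._
  import cats.effect.{IO, Timer}

  // A ScheduledThreadPoolExecutor backs the Timer[IO]; its delay queue is protected by
  // one ReentrantLock, so every scheduled sleep (and every cancellation of a sleep)
  // from every request contends on that single lock.
  val scheduler: ScheduledExecutorService = Executors.newScheduledThreadPool(1)
  implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global, scheduler)

  // A per-request timeout then looks roughly like this; it takes the scheduler's lock
  // once when scheduled and again when cancelled after the response wins the race.
  val requestTimeout: IO[Unit] = timer.sleep(30.seconds)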

Solution

This PR addresses 1. by using blaze's TickWheelExecutor instead of Timer[IO] (a sketch follows below), and it addresses 3. by creating the cancelToken before acquiring the monitor of parser, and then acquiring the monitor only to modify the cancelToken field. I've addressed 2. by using a ForkJoinPool instead of the default ThreadPoolExecutor, but that is outside the scope of responsibility of http4s. Nonetheless, what could be done in http4s is a reduction of thread shifts, which would reduce pressure on the scheduling mechanism of the thread pool, but that is outside the scope of this PR.
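
For 1., the essence of the change is a sketch like the following. It assumes a blaze TickWheelExecutor named scheduler and a FiniteDuration named finite, matching the names in the review diff further down; the cancellation token at the end is my reconstruction, since the diff snippet is truncated, and the real change lives in Http1ServerStage:

  import cats.effect.Concurrent
  import org.http4s.Response
  import org.http4s.blaze.util.TickWheelExecutor
  import scala.concurrent.ExecutionContext
  import scala.concurrent.duration.FiniteDuration

  // The timeout is scheduled on blaze's tick-wheel scheduler and cancelled through the
  // returned handle, so no ScheduledThreadPoolExecutor lock is touched on the hot path.
  def timeoutResponse[F[_]](
      scheduler: TickWheelExecutor,
      executionContext: ExecutionContext,
      finite: FiniteDuration)(implicit F: Concurrent[F]): F[Response[F]] =
    F.cancelable[Response[F]] { callback =>
      val cancellable =
        scheduler.schedule(() => callback(Right(Response.timeout[F])), executionContext, finite)
      // Cancel token: unschedule the timeout when the response wins the race.
      F.delay(cancellable.cancel())
    }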

Verification

The measurements are done using tfb (TechEmpower Framework Benchmarks), specifically this commit RafalSumislawski/FrameworkBenchmarks@1436a3c from my fork. It includes the solution to 2., so that solution is part of the baseline measurement as well.

| parallel requests | Baseline (0.21.22) | This PR |
| --- | --- | --- |
| 256 | 207343 req/s | 246718 req/s |
| 512 | 205574 req/s | 276460 req/s |
| 1024 | 201791 req/s | 275930 req/s |
| 2048 | 193030 req/s | 269806 req/s |
| 4096 | 183165 req/s | 266459 req/s |

This gives us a decent 33% increase in maximum throughput, as well as an improvement in dealing with a large number of parallel requests (which was to be expected from a change that fixes lock contention). The difference will most likely be smaller on machines with fewer cores, and significantly bigger on machines with lots of cores (i.e. servers).

The CPU utilisation is ~50% for the baseline and ~80% for this PR. Still not 100%; more work remains to be done in that area.

http4s-blaze-client may need analogous changes, but I don't want to make them without setting up a benchmark for verification.

rossabaker (Member) left a comment

Tremendous research and writeup.

I am surprised by the Fork-Join Pool. Most people use global for brevity, but the strong recommendation in cats-effect is a fixed thread pool. I expect this is probably a big optimization independent of that decision?

  -        val timeoutResponse = timer.sleep(finite).as(Response.timeout[F])
  +        val timeoutResponse = Concurrent[F].cancelable[Response[F]] { callback =>
  +          val cancellable =
  +            scheduler.schedule(() => callback(Right(Response.timeout[F])), executionContext, finite)

We could probably cache that Right(Response.timeout[F]) in a val as soon as we know F, but it probably won't make a noticeable difference.
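
A minimal sketch of that caching (the field name is hypothetical, not the actual PR code):

  // Allocate the Right once, e.g. as a field of Http1ServerStage, instead of building a
  // new Right(Response.timeout[F]) on every timeout callback.
  private[this] val timeoutResult: Either[Throwable, Response[F]] = Right(Response.timeout[F])

  // ...and inside the cancelable block, reuse it:
  //   scheduler.schedule(() => callback(timeoutResult), executionContext, finite)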

  @@ -347,7 +356,11 @@ private[blaze] class Http1ServerStage[F[_]](
     private[this] val raceTimeout: Request[F] => F[Response[F]] =
       responseHeaderTimeout match {
         case finite: FiniteDuration =>
  -        val timeoutResponse = timer.sleep(finite).as(Response.timeout[F])
  +        val timeoutResponse = Concurrent[F].cancelable[Response[F]] { callback =>
  +          val cancellable =

Very small nit: it's spelled cancelable in cats-effect.

djspiewak (Contributor)

Something to keep in mind with FJP is that it doesn't properly bound the number of workers to the physical threads, and it has a fairly dubious starvation-detection algorithm which results in a large number of workers being spuriously created even when none of them are blocked. This is the primary reasoning behind its banishment from Cats Effect. Though, obviously, the fixed thread pool contention issues give all of that back when you have a machine with a relatively high core count (anything above 16 starts to hurt a lot).

The correct answer here, at least in Cats Effect 2, is unfortunately somewhat conditional. Your benchmark is hitting the absolute best of FJP and the absolute worst of fixed thread pool. Applications with more complex handlers or any stretches of CPU-bound work will degrade quite rapidly in practice, which makes FJP start to look a lot worse and fixed thread pool start to look quite a bit better. So the answer really depends.

Cats Effect 3's scheduler basically gives you all the benefits of FJP (though with better optimization and special-casing of various fiber scenarios) without the problem of unbounded workers, which in theory resolves the need to pick between the two, and also solves the artificial microbenchmark problem.

Timer is also an interesting one. I generally agree that ScheduledExecutorService is quite bad, which is why Blaze has a HWT executor in the first place (Akka and Netty are two other notable frameworks which use this technique), and why Cats Effect is working on implementing something similar within its own scheduler. If our ideas pan out, the CE3 scheduler should be able to implement timed awakes theoretically without any overhead at all, most notably without consuming an additional thread.

But at any rate, none of this surprises me. :-) Unfortunately, it's a little difficult to see what the best approach is in the short term (http4s running against Cats Effect 2), since the suggested compute pool change would result in significantly worse performance for many use-cases, despite the significant gains it would produce in this scenario.
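
To make the trade-off concrete, the two compute pools under discussion are constructed roughly like this in CE2 user code (an illustration, not code from this PR):

  import java.util.concurrent.{Executors, ForkJoinPool}
  import scala.concurrent.ExecutionContext

  val cores = Runtime.getRuntime.availableProcessors()

  // A fixed pool bounds workers exactly to the core count, but every task submission
  // goes through one shared queue guarded by a lock.
  val fixedPool: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(cores))

  // A ForkJoinPool uses per-worker work-stealing deques (far less contention), but it
  // may spawn additional workers beyond `cores` when it suspects a worker is blocked.
  val forkJoinPool: ExecutionContext =
    ExecutionContext.fromExecutor(new ForkJoinPool(cores))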

rossabaker (Member)

Techempower is a disservice in both conception and administration, but they continue to be a popular reference. It's wise to put our best foot forward. The right thread pool in that context is certainly whatever gets the best top-line ping route numbers, which is all the nuance anybody gets from those damned things.

I look forward to the CE3 advances that make this less contextual. Again, we don't have the benchmarks to see the timer difference on a fixed pool, but I see no reason this wouldn't also help in that usage. And most people are using the global pool with Blaze, which is a fork-join pool. I think this is going to be an improvement for most, if not all, blaze-server users.

RafalSumislawski (Member, Author) commented Apr 19, 2021

I think the changes I've proposed in this PR are non-controversial. Less locking should be better. I'll do some more benchmarking with other configurations of thread pools to make sure that I'm not over-optimising for a specific case. I already did some tests with both http4s' ExecutionContext and the ContextShift sharing a ThreadPoolExecutor. In that case the baseline and this PR score similarly, the lock in the ThreadPoolExecutor being the bottleneck. I'll do some more tests where the thread pools are not shared, and I'll post the results when ready.

I also have some ideas for reducing the amount of thread shifting. These should be beneficial in their own right (unless they cause some other issues), and should reduce the lock contention in ThreadPoolExecutor.

I'm glad to hear about the advances in scheduling in CE3. I'm looking forward to using CE3 in production.

The way I see TFB in this context is that http4s itself should be optimised for a wide variety of use-cases for the benefit of its users. TFB is unable to reproduce this "wide variety"; nonetheless it can be useful for detecting performance-degrading phenomena in http4s and quantifying them. As a secondary goal, improving TFB scores is good for marketing.

@rossabaker rossabaker added enhancement Feature requests and improvements module:blaze-server labels Apr 19, 2021
@rossabaker rossabaker added this to the 0.21.23 milestone Apr 19, 2021
@rossabaker rossabaker merged commit ccbc771 into http4s:series/0.21 Apr 25, 2021
RafalSumislawski (Member, Author)

Common test conditions

I've changed the test setup a bit, so I also had to rerun the measurements I gathered originally.
All thread pools used have 12 threads (the number of available processors); that's the default in most cases and I didn't tinker with it.

Just a short reminder: the goal of this is not to compare the thread pool configurations; the goal is to see how the changes proposed in this PR affect throughput under different circumstances.

Tested configurations

Shared FJP

This is the configuration that I used at the beginning. There's one ForkJoinPool used as both the ContextShift and http4s' ExecutionContext. This is the only configuration exhibiting no blocking on locks.

  implicit override val executionContext: ExecutionContext = {
    val fjp = new ForkJoinPool()
    ExecutionContext.fromExecutor(fjp)
  }

  implicit override val contextShift: ContextShift[IO] =
    IO.contextShift(executionContext)

  implicit override val timer: Timer[IO] =
    IO.timer(executionContext)

  val http4sEc = {
    executionContext
  }

Shared FTP

There's one fixed ThreadPoolExecutor shared as both the ContextShift and http4s' ExecutionContext. That means one shared lock.

  implicit override val executionContext: ExecutionContext = {
    val ftp = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    ExecutionContext.fromExecutor(ftp)
  }

  implicit override val contextShift: ContextShift[IO] =
    IO.contextShift(executionContext)

  implicit override val timer: Timer[IO] =
    IO.timer(executionContext)

  val http4sEc = {
    executionContext
  }  

Separate FTP:

Http4s' ExecutionContext and the ContextShift both use a fixed ThreadPoolExecutor, but these are two separate instances. This is similar to what the TFB setup is using on master.

  implicit override val executionContext: ExecutionContext = {
    val ftp = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    ExecutionContext.fromExecutor(ftp)
  }

  implicit override val contextShift: ContextShift[IO] =
    IO.contextShift(executionContext)

  implicit override val timer: Timer[IO] =
    IO.timer(executionContext)

  val http4sEc = {
    val ftp = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    ExecutionContext.fromExecutor(ftp)
  }

FTP for Cats, FJP for http4s:

Again, the ExecutionContext and ContextShift are separate thread pools. This time Cats uses a fixed ThreadPoolExecutor (just like the default from IOApp) and http4s uses a ForkJoinPool, just like ExecutionContext.global.

This is the configuration closest to the defaults.

  implicit override val executionContext: ExecutionContext = {
    val ftp = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    ExecutionContext.fromExecutor(ftp)
  }

  implicit override val contextShift: ContextShift[IO] =
    IO.contextShift(executionContext)

  implicit override val timer: Timer[IO] =
    IO.timer(executionContext)

  val http4sEc = {
    val fjp = new ForkJoinPool()
    ExecutionContext.fromExecutor(fjp)
  }

Measurements

Shared FJP

| parallelism | 0.21.22 | #4761 | relative |
| --- | --- | --- | --- |
| 256 | 211159 | 236896 | 112.19% |
| 512 | 214493 | 276147 | 128.74% |
| 1024 | 209068 | 277240 | 132.61% |
| 2048 | 197243 | 270592 | 137.19% |
| 4096 | 189699 | 242590 | 127.88% |
| best | 214493 | 277240 | 129.25% |

Shared FTP

| parallelism | 0.21.22 | #4761 | relative |
| --- | --- | --- | --- |
| 256 | 123831 | 131733 | 106.38% |
| 512 | 123164 | 122397 | 99.38% |
| 1024 | 120050 | 115718 | 96.39% |
| 2048 | 112269 | 110198 | 98.16% |
| 4096 | 111049 | 105970 | 95.43% |
| best | 123831 | 131733 | 106.38% |

Separate FTP

| parallelism | 0.21.22 | #4761 | relative |
| --- | --- | --- | --- |
| 256 | 173339 | 161148 | 92.97% |
| 512 | 183562 | 177016 | 96.43% |
| 1024 | 173130 | 168456 | 97.30% |
| 2048 | 162228 | 159594 | 98.38% |
| 4096 | 149259 | 150623 | 100.91% |
| best | 183562 | 177016 | 96.43% |

FTP for Cats, FJP for http4s:

| parallelism | 0.21.22 | #4761 | relative |
| --- | --- | --- | --- |
| 256 | 201198 | 204126 | 101.46% |
| 512 | 224797 | 201053 | 89.44% |
| 1024 | 211299 | 208847 | 98.84% |
| 2048 | 187344 | 210324 | 112.27% |
| 4096 | 176081 | 201723 | 114.56% |
| best | 224797 | 210324 | 93.56% |

Interpretation

The proposed changes are a clear win when paired with FJP. There is no lock contention, CPU utilization is high and so is the throughput.

The difference in scores between Shared FTP and Separate FTP clearly shows how bad the lock contention on the thread pool is. Having twice as many threads as cores should be bad, yet having two locks instead of one is a huge win.

Generally in scenarios other than the shared FJP, we see results ranging from -10.6% to +14.6%.

Some of the differences are just test variance. But I reran the "FTP for Cats, FJP for http4s" configuration a few times and it's clear that the new solution scores worse in the scenarios with 512 / 1024 parallel connections.
My hypothesis is that solving lock contention in one place may have made lock contention in the other place (the FTP) even worse. I don't have any idea how to confirm or disprove this hypothesis.

Overall there are ups and downs, but I still think less locking is better. I'm looking forward to seeing how this can improve performance when paired with the new thread pool from CE3 that Daniel mentioned. And that brings us to the next topic: @rossabaker, I see you already merged this PR to series/0.21 and series/0.22. Should I prepare a PR (including measurements) with this change ported to 1.0?
