
Avoid lock contention #4761

Merged
merged 2 commits into http4s:series/0.21 on Apr 25, 2021

Conversation

RafalSumislawski (Member)

Motivation

I've got a use case in which I need my HTTP server (implemented with http4s-blaze-server) to process lots of lightweight requests (small request body, small response body, cheap business logic). The throughput in that case is determined mainly by http4s-blaze-server.

My main observations regarding performance of http4s-blaze-server in such case are:

  • http4s is the dominant consumer of CPU time (roughly 10% of CPU time goes to the actual application logic)
  • the application is unable to fully utilise the processing power of a multicore CPU. For benchmarking I'm using a Ryzen 5600X (6 cores, 12 threads). On that platform it utilises only 50% of the CPU, despite having 13 blaze selector threads and 12 threads in the main ExecutionContext / ContextShift.
  • there's a high allocation rate, with fs2 as the dominant allocator, despite not really using any form of streaming (EntityEncoder.simple converting non-streams into streams is the reason)
  • from time to time a request is processed twice

I'm able to reliably reproduce similar conditions using the https://github.com/TechEmpower/FrameworkBenchmarks "plaintext" test, and that's what I'm using to get some reproducible numbers.

I'm working with F = cats.effect.IO.

I want to address these issues with a series of small and independent pull requests. This one only addresses the CPU under-utilisation, and only partially.

Diagnosis

Profiling shows that the under-utilisation of the CPU is caused (at least partially) by blocking on three locks / monitors:

  1. The lock inside the ScheduledThreadPoolExecutor inside the Timer[IO] - it needs to be locked while scheduling or cancelling timeouts (illustrated by the sketch after this list)
  2. The lock inside the default ExecutionContext / ContextShift provided by IOApp
  3. The monitor of Http1ServerStage.parser
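
To make 1. concrete, here is a minimal sketch (an illustration of the default wiring, not code from this PR) of how a Timer[IO] funnels every sleep through a ScheduledExecutorService whose delay queue is guarded by a single lock:

  import java.util.concurrent.{Executors, ScheduledExecutorService}
  import scala.concurrent.ExecutionContext
  import scala.concurrent.duration._
  import cats.effect.{IO, Timer}

  // A ScheduledThreadPoolExecutor backs the Timer[IO]; its delay queue is protected by
  // one ReentrantLock, so every scheduled sleep (and every cancellation of a sleep)
  // from every request contends on that single lock.
  val scheduler: ScheduledExecutorService = Executors.newScheduledThreadPool(1)
  implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global, scheduler)

  // A per-request timeout then looks roughly like this; it takes the scheduler's lock
  // once when scheduled and again when cancelled after the response wins the race.
  val requestTimeout: IO[Unit] = timer.sleep(30.seconds)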

Solution

This PR addresses 1. by using blaze's TickWheelExecutor instead of Timer[IO] (a sketch follows below), and it addresses 3. by creating the cancelToken before acquiring the monitor of parser, and then acquiring the monitor only to modify the cancelToken field. I've addressed 2. by using a ForkJoinPool instead of the default ThreadPoolExecutor, but that is outside the scope of responsibility of http4s. Nonetheless, what could be done in http4s is a reduction of thread shifts, which would reduce pressure on the scheduling mechanism of the thread pool, but that is outside the scope of this PR.
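
For 1., the essence of the change is a sketch like the following. It assumes a blaze TickWheelExecutor named scheduler and a FiniteDuration named finite, matching the names in the review diff further down; the cancellation token at the end is my reconstruction, since the diff snippet is truncated, and the real change lives in Http1ServerStage:

  import cats.effect.Concurrent
  import org.http4s.Response
  import org.http4s.blaze.util.TickWheelExecutor
  import scala.concurrent.ExecutionContext
  import scala.concurrent.duration.FiniteDuration

  // The timeout is scheduled on blaze's tick-wheel scheduler and cancelled through the
  // returned handle, so no ScheduledThreadPoolExecutor lock is touched on the hot path.
  def timeoutResponse[F[_]](
      scheduler: TickWheelExecutor,
      executionContext: ExecutionContext,
      finite: FiniteDuration)(implicit F: Concurrent[F]): F[Response[F]] =
    F.cancelable[Response[F]] { callback =>
      val cancellable =
        scheduler.schedule(() => callback(Right(Response.timeout[F])), executionContext, finite)
      // Cancel token: unschedule the timeout when the response wins the race.
      F.delay(cancellable.cancel())
    }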

Verification

The measurements are done using tfb (TechEmpower Framework Benchmarks), specifically this commit RafalSumislawski/FrameworkBenchmarks@1436a3c from my fork. It includes the solution to 2., so that solution is part of the baseline measurement as well.

| parallel requests | Baseline (0.21.22) | This PR |
| --- | --- | --- |
| 256 | 207343 req/s | 246718 req/s |
| 512 | 205574 req/s | 276460 req/s |
| 1024 | 201791 req/s | 275930 req/s |
| 2048 | 193030 req/s | 269806 req/s |
| 4096 | 183165 req/s | 266459 req/s |

This gives us a decent 33% increase in maximum throughput, as well as an improvement in dealing with a large number of parallel requests (which was to be expected from a change that fixes lock contention). The difference will most likely be smaller on machines with fewer cores, and significantly bigger on machines with lots of cores (i.e. servers).

The CPU utilisation is ~50% for the baseline and ~80% for this PR. Still not 100%; more work remains to be done in that area.

http4s-blaze-client may need analogous changes, but I don't want to make them without setting up a benchmark for verification.

rossabaker (Member) left a comment

Tremendous research and writeup.

I am surprised by the Fork-Join Pool. Most people use global for brevity, but the strong recommendation in cats-effect is a fixed thread pool. I expect this is probably a big optimization independent of that decision?

  -        val timeoutResponse = timer.sleep(finite).as(Response.timeout[F])
  +        val timeoutResponse = Concurrent[F].cancelable[Response[F]] { callback =>
  +          val cancellable =
  +            scheduler.schedule(() => callback(Right(Response.timeout[F])), executionContext, finite)

We could probably cache that Right(Response.timeout[F]) in a val as soon as we know F, but it probably won't make a noticeable difference.
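
A minimal sketch of that caching (the field name is hypothetical, not the actual PR code):

  // Allocate the Right once, e.g. as a field of Http1ServerStage, instead of building a
  // new Right(Response.timeout[F]) on every timeout callback.
  private[this] val timeoutResult: Either[Throwable, Response[F]] = Right(Response.timeout[F])

  // ...and inside the cancelable block, reuse it:
  //   scheduler.schedule(() => callback(timeoutResult), executionContext, finite)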

  @@ -347,7 +356,11 @@ private[blaze] class Http1ServerStage[F[_]](
     private[this] val raceTimeout: Request[F] => F[Response[F]] =
       responseHeaderTimeout match {
         case finite: FiniteDuration =>
  -        val timeoutResponse = timer.sleep(finite).as(Response.timeout[F])
  +        val timeoutResponse = Concurrent[F].cancelable[Response[F]] { callback =>
  +          val cancellable =

Very small nit: it's spelled cancelable in cats-effect.

djspiewak (Contributor)

Something to keep in mind with FJP is that it doesn't properly bound the number of workers to the physical threads, and it has a fairly dubious starvation-detection algorithm which results in a large number of workers being spuriously created even when none of them are blocked. This is the primary reasoning behind its banishment from Cats Effect. Though, obviously, the fixed thread pool contention issues give all of that back when you have a machine with a relatively high core count (anything above 16 starts to hurt a lot).

The correct answer here, at least in Cats Effect 2, is unfortunately somewhat conditional. Your benchmark is hitting the absolute best of FJP and the absolute worst of fixed thread pool. Applications with more complex handlers or any stretches of CPU-bound work will degrade quite rapidly in practice, which makes FJP start to look a lot worse and fixed thread pool start to look quite a bit better. So the answer really depends.

Cats Effect 3's scheduler basically gives you all the benefits of FJP (though with better optimization and special-casing of various fiber scenarios) without the problem of unbounded workers, which in theory resolves the need to pick between the two, and also solves the artificial microbenchmark problem.

Timer is also an interesting one. I generally agree that ScheduledExecutorService is quite bad, which is why Blaze has a HWT executor in the first place (Akka and Netty are two other notable frameworks which use this technique), and why Cats Effect is working on implementing something similar within its own scheduler. If our ideas pan out, the CE3 scheduler should be able to implement timed awakes theoretically without any overhead at all, most notably without consuming an additional thread.

But at any rate, none of this surprises me. :-) Unfortunately, it's a little difficult to see what the best approach is in the short term (http4s running against Cats Effect 2), since the suggested compute pool change would result in significantly worse performance for many use-cases, despite the significant gains it would produce in this scenario.
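
To make the trade-off concrete, the two compute pools under discussion are constructed roughly like this in CE2 user code (an illustration, not code from this PR):

  import java.util.concurrent.{Executors, ForkJoinPool}
  import scala.concurrent.ExecutionContext

  val cores = Runtime.getRuntime.availableProcessors()

  // A fixed pool bounds workers exactly to the core count, but every task submission
  // goes through one shared queue guarded by a lock.
  val fixedPool: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(cores))

  // A ForkJoinPool uses per-worker work-stealing deques (far less contention), but it
  // may spawn additional workers beyond `cores` when it suspects a worker is blocked.
  val forkJoinPool: ExecutionContext =
    ExecutionContext.fromExecutor(new ForkJoinPool(cores))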

rossabaker (Member)

Techempower is a disservice in both conception and administration, but they continue to be a popular reference. It's wise to put our best foot forward. The right thread pool in that context is certainly whatever gets the best top-line ping route numbers, which is all the nuance anybody gets from those damned things.

I look forward to the CE3 advances that make this less contextual. Again, we don't have the benchmarks to see the timer difference on a fixed pool, but I see no reason this wouldn't also help in that usage. And most people are using the global pool with Blaze, which is a fork-join pool. I think this is going to be an improvement for most, if not all, blaze-server users.

RafalSumislawski (Member, Author) commented Apr 19, 2021

I think the changes I've proposed in this PR are non-controversial. Less locking should be better. I'll do some more benchmarking with other configurations of thread pools to make sure that I'm not over-optimising for a specific case. I already did some tests with both http4s' ExecutionContext and the ContextShift sharing a ThreadPoolExecutor. In that case the baseline and this PR score similarly, the lock in the ThreadPoolExecutor being the bottleneck. I'll do some more tests where the thread pools are not shared, and I'll post the results when ready.

I also have some ideas for reducing the amount of thread shifting. These should be beneficial in their own right (unless they cause some other issues), and should reduce the lock contention in ThreadPoolExecutor.

I'm glad to hear about the advances in scheduling in CE3. I'm looking forward to using CE3 in production.

The way I see TFB in this context is that http4s itself should be optimised for a wide variety of use-cases for the benefit of its users. TFB is unable to reproduce this "wide variety"; nonetheless it can be useful for detecting performance-degrading phenomena in http4s and quantifying them. As a secondary goal, improving TFB scores is good for marketing.

@rossabaker rossabaker added enhancement Feature requests and improvements module:blaze-server labels Apr 19, 2021
@rossabaker rossabaker added this to the 0.21.23 milestone Apr 19, 2021
@rossabaker rossabaker merged commit ccbc771 into http4s:series/0.21 Apr 25, 2021
RafalSumislawski (Member, Author)

Common test conditions

I've changed the test setup a bit, so I also had to rerun the measurements I gathered originally.
All thread pools used have 12 threads (the number of available processors); that's the default in most cases and I didn't tinker with it.

Just a short reminder: the goal of this is not to compare the thread pool configurations; the goal is to see how the changes proposed in this PR affect throughput under different circumstances.

Tested configurations

Shared FJP

This is the configuration that I used at the beginning. There's one ForkJoinPool used as both the ContextShift and http4s' ExecutionContext. This is the only configuration exhibiting no blocking on locks.

  implicit override val executionContext: ExecutionContext = {
    val fjp = new ForkJoinPool()
    ExecutionContext.fromExecutor(fjp)
  }

  implicit override val contextShift: ContextShift[IO] =
    IO.contextShift(executionContext)

  implicit override val timer: Timer[IO] =
    IO.timer(executionContext)

  val http4sEc = {
    executionContext
  }

Shared FTP

There's one fixed ThreadPoolExecutor shared as both the ContextShift and http4s' ExecutionContext. That means one shared lock.

  implicit override val executionContext: ExecutionContext = {
    val ftp = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    ExecutionContext.fromExecutor(ftp)
  }

  implicit override val contextShift: ContextShift[IO] =
    IO.contextShift(executionContext)

  implicit override val timer: Timer[IO] =
    IO.timer(executionContext)

  val http4sEc = {
    executionContext
  }  

Separate FTP:

Http4s' ExecutionContext and the ContextShift both use a fixed ThreadPoolExecutor, but these are two separate instances. This is similar to what the TFB setup is using on master.

  implicit override val executionContext: ExecutionContext = {
    val ftp = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    ExecutionContext.fromExecutor(ftp)
  }

  implicit override val contextShift: ContextShift[IO] =
    IO.contextShift(executionContext)

  implicit override val timer: Timer[IO] =
    IO.timer(executionContext)

  val http4sEc = {
    val ftp = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    ExecutionContext.fromExecutor(ftp)
  }

FTP for Cats, FJP for http4s:

Again, the ExecutionContext and ContextShift are separate thread pools. This time Cats uses a fixed ThreadPoolExecutor (just like the default from IOApp) and http4s uses a ForkJoinPool, just like ExecutionContext.global.

This is the configuration closest to the defaults.

  implicit override val executionContext: ExecutionContext = {
    val ftp = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    ExecutionContext.fromExecutor(ftp)
  }

  implicit override val contextShift: ContextShift[IO] =
    IO.contextShift(executionContext)

  implicit override val timer: Timer[IO] =
    IO.timer(executionContext)

  val http4sEc = {
    val fjp = new ForkJoinPool()
    ExecutionContext.fromExecutor(fjp)
  }

Measurements

Shared FJP

| parallelism | 0.21.22 | #4761 | relative |
| --- | --- | --- | --- |
| 256 | 211159 | 236896 | 112.19% |
| 512 | 214493 | 276147 | 128.74% |
| 1024 | 209068 | 277240 | 132.61% |
| 2048 | 197243 | 270592 | 137.19% |
| 4096 | 189699 | 242590 | 127.88% |
| best | 214493 | 277240 | 129.25% |

Shared FTP

| parallelism | 0.21.22 | #4761 | relative |
| --- | --- | --- | --- |
| 256 | 123831 | 131733 | 106.38% |
| 512 | 123164 | 122397 | 99.38% |
| 1024 | 120050 | 115718 | 96.39% |
| 2048 | 112269 | 110198 | 98.16% |
| 4096 | 111049 | 105970 | 95.43% |
| best | 123831 | 131733 | 106.38% |

Separate FTP

| parallelism | 0.21.22 | #4761 | relative |
| --- | --- | --- | --- |
| 256 | 173339 | 161148 | 92.97% |
| 512 | 183562 | 177016 | 96.43% |
| 1024 | 173130 | 168456 | 97.30% |
| 2048 | 162228 | 159594 | 98.38% |
| 4096 | 149259 | 150623 | 100.91% |
| best | 183562 | 177016 | 96.43% |

FTP for Cats, FJP for http4s:

| parallelism | 0.21.22 | #4761 | relative |
| --- | --- | --- | --- |
| 256 | 201198 | 204126 | 101.46% |
| 512 | 224797 | 201053 | 89.44% |
| 1024 | 211299 | 208847 | 98.84% |
| 2048 | 187344 | 210324 | 112.27% |
| 4096 | 176081 | 201723 | 114.56% |
| best | 224797 | 210324 | 93.56% |

Interpretation

The proposed changes are a clear win when paired with FJP. There is no lock contention, CPU utilization is high and so is the throughput.

The difference in scores between Shared FTP and Separate FTP clearly shows how bad the lock contention on the thread pool is. Having twice as many threads as cores should be bad, yet having two locks instead of one is a huge win.

Generally in scenarios other than the shared FJP, we see results ranging from -10.6% to +14.6%.

Some of the differences are just test variance. But I reran the "FTP for Cats, FJP for http4s" configuration a few times and it's clear that the new solution scores worse in the scenarios with 512 / 1024 parallel connections.
My hypothesis is that solving lock contention in one place may have made lock contention in the other place (the FTP) even worse. I don't have any idea how to confirm or disprove this hypothesis.

Overall there are ups and downs, but I still think less locking is better. I'm looking forward to seeing how this can improve performance when paired with the new thread pool from CE3 that Daniel mentioned. And that brings us to the next topic: @rossabaker, I see you already merged this PR to series/0.21 and series/0.22. Should I prepare a PR (including measurements) with this change ported to 1.0?
