
4096 thread limit in safekeepers. #3036

Closed
arssher opened this issue Dec 8, 2022 · 6 comments · Fixed by #4119
Labels
c/storage/safekeeper Component: storage: safekeeper t/bug Issue Type: Bug

Comments

@arssher
Contributor

arssher commented Dec 8, 2022

follow up of #2722

Essentially, due to our tracing-subscriber usage we are limited to at most 4096 threads; exceeding that causes a panic.

I had a look at the tracing-subscriber & sharded-slab crates. The former depends on the latter; sharded-slab allows MAX_THREADS to be configured statically, but tracing-subscriber just uses the default of 4096.

see also tokio-rs/tracing#1485
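
For illustration, a minimal sketch (not taken from the issue; the thread count and sleep are arbitrary) of the kind of workload expected to trip the limit: with a Registry-based subscriber installed, every live thread that creates a span occupies a sharded-slab shard, and the 4097th concurrent thread panics.

```rust
use std::thread;
use std::time::Duration;

fn main() {
    // The fmt subscriber is built on tracing-subscriber's Registry, whose
    // per-thread span storage is backed by sharded-slab.
    tracing_subscriber::fmt::init();

    let handles: Vec<_> = (0..5000)
        .map(|i| {
            thread::spawn(move || {
                // Creating and entering a span registers this thread in the
                // slab; keeping the thread alive keeps its shard occupied.
                let _span = tracing::info_span!("conn", id = i).entered();
                thread::sleep(Duration::from_secs(60));
            })
        })
        .collect();

    for h in handles {
        let _ = h.join();
    }
}
```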

@arssher arssher added the t/bug Issue Type: Bug label Dec 8, 2022
@arssher arssher self-assigned this Dec 8, 2022
@arssher
Contributor Author

arssher commented Dec 8, 2022

We should fix this quickly; 4k is a low limit.

I'd probably open a PR to tracing that makes the sharded-slab config adjustable. However, tracing-subscriber is evidently tailored to the async world, so we'd better understand the potential problems here first -- e.g. how much memory a high MAX_THREADS would consume.

However, right now I'd just switch the safekeeper to async in roughly the pageserver mode: with separate runtimes (to prevent different jobs from blocking each other) and sync IO. A sketch of that setup follows the list below.

  • It is easy to do.
  • I'd like to avoid spawn_blocking for IO, as the safekeeper is disk intensive and it is not clear how expensive offloading to the thread pool is. IOW, using spawn_blocking in an app that does little besides writing to disk seems pointless.
  • Blocking on IO might hurt performance as well, but considering that the CPU work of safekeepers is very small, and that we can make the runtime large, essentially getting close to a thread-per-connection model, I don't expect the impact to be significant.
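
Here is that sketch of the separate-runtimes setup (runtime names and thread counts are illustrative, not the actual safekeeper configuration): each job class gets its own multi-threaded runtime, so sync IO blocking one pool does not stall the others, and the named threads are easy to tell apart in `top -H`.

```rust
use tokio::runtime::{Builder, Runtime};

fn build_runtime(name: &str, workers: usize) -> Runtime {
    Builder::new_multi_thread()
        .worker_threads(workers)
        .thread_name(name)
        .enable_all()
        .build()
        .expect("failed to build runtime")
}

fn main() {
    // One runtime per job class; blocking (sync) disk IO inside one of them
    // does not steal worker threads from the others.
    let wal_service = build_runtime("wal-service", 8);
    let broker = build_runtime("broker", 2);
    let http = build_runtime("http", 2);

    wal_service.spawn(async { /* accept and serve WAL connections */ });
    broker.spawn(async { /* push timeline state to the broker */ });
    http.block_on(async { /* serve the management API */ });
}
```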

In the broader picture, async without async disk IO doesn't make sense for disk-intensive safekeepers -- they should saturate the disk much earlier than they run short of CPU due to a large number of connections. However, we use tokio for handy cancellation, and several of the libraries we use are async, so the basic model is thread-per-connection with a current-thread runtime in each one. So it would be nice to reconcile tracing-subscriber with that, or use something different (slog or something custom) -- but the latter doesn't look appealing with tracing being used in other parts of the project.
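
For contrast, a sketch of that current thread-per-connection model (the listener loop and handle_connection are hypothetical): each accepted connection gets its own OS thread running a current-thread tokio runtime, which provides cancellation and lets us use async libraries while disk IO stays plain blocking IO.

```rust
use std::net::{TcpListener, TcpStream};
use std::thread;

fn serve(listener: TcpListener) -> std::io::Result<()> {
    for stream in listener.incoming() {
        let stream = stream?;
        thread::spawn(move || {
            // One current-thread runtime per connection: only this
            // connection's futures run here, so a blocking write to disk
            // stalls just this one thread.
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .expect("failed to build per-connection runtime");
            rt.block_on(handle_connection(stream));
        });
    }
    Ok(())
}

// Hypothetical handler: would speak the safekeeper protocol over `stream`.
async fn handle_connection(stream: TcpStream) {
    let _ = stream;
}
```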

An alternative is actual async disk IO -- io_uring etc. -- but right now this is not a priority.

CC @petuhovskiy @LizardWizzard

@LizardWizzard
Contributor

However, right now I'd just switch the safekeeper to async in roughly the pageserver mode: with separate runtimes (to prevent different jobs from blocking each other) and sync IO.

Please note that this created other problems. Mixing sync IO with async sometimes doesn't go really well. To me it's clear that we should move to full async and not stop halfway. Regarding the "not really well" part, see this issue by @koivunej

I'd like to avoid spawn_blocking for IO, as the safekeeper is disk intensive and it is not clear how expensive offloading to the thread pool is. IOW, using spawn_blocking in an app that does little besides writing to disk seems pointless.

I cannot agree with that. Without spawn_blocking you starve executor threads with blocking operations. Fsyncs can take seconds. Got as many in-flight fsyncs as there are executor threads? Everything is stuck now. Such calls cannot be preempted, so at best it causes a latency spike for other tasks in the queue. Not to mention deadlocks, e.g. with the non-stealable queue slot in the tokio scheduler (see the issue I mentioned above).

Tokio's own File operations offload the actual IO to a thread pool, and those threads are started in advance, so to me it is rather unclear how it can make things significantly worse performance-wise. It even makes it easy to increase write concurrency (e.g. schedule several writes at once). In general, async allows for higher concurrency, which is IMO quite suitable for our use case and the number of things we need to do concurrently.
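
For illustration, a small sketch of concurrent writes with tokio's file API (paths and buffer names are made up): each tokio::fs operation is offloaded to the blocking pool internally, so joining the futures keeps several disk writes, fsyncs included, in flight at once.

```rust
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

async fn write_file(path: &str, data: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path).await?;
    file.write_all(data).await?;
    // fsync; like the write, this runs on tokio's blocking thread pool.
    file.sync_all().await?;
    Ok(())
}

async fn flush_both(wal: &[u8], control: &[u8]) -> std::io::Result<()> {
    // Both writes proceed concurrently; no executor thread sits blocked
    // waiting for the disk.
    let (a, b) = tokio::join!(
        write_file("segment.wal", wal),
        write_file("safekeeper.control", control),
    );
    a.and(b)
}
```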

so the basic model is thread-per-connection with a current-thread runtime in each one.

To me this model is rather pointless: async makes a lot of sense for networking when you have a lot of clients, so it would be beneficial to have a runtime that can multiplex several connections in one thread. With a per-connection runtime it seems we pick the worst parts of both worlds.


Do we really have 4096 active threads? If you do not want to go down the async road, it's probably worth considering becoming more aggressive about shutting down idle things. To me that looks like the fastest and least disruptive way to solve the issue.

@arssher
Contributor Author

arssher commented Dec 8, 2022

Got as many in-flight fsyncs as there are executor threads? Everything is stuck now.

What is stuck, if basically the only thing we do is write to and read from disk? It is actually a nontrivial question; it would be interesting to benchmark.

It even makes it easy to increase write concurrency (e.g. schedule several writes at once).

How? I thought each spawn_blocking is an individual operation, no?

Do we really have 4096 active threads?

Currently no (we have had several such situations, but they were bugs), but having 1-2 thousand active computes would do that.

In general, unless we are talking about async disk IO, async makes sense only to save the CPU from context switching so it can do actually useful work instead. But if all we do is actively write to disk, we don't have the problem that async helps with (we won't get to the dozens of thousands of threads where the CPU suffers from mostly context switching -- the disk would be saturated earlier), and we do the job at which async is less efficient, since it has to hand disk IO off to a separate pool. How severe these effects are (what the cost of offloading is) I actually don't know; it would be interesting to benchmark.

@LizardWizzard
Contributor

What is stuck, if basically the only thing we do is write to and read from disk?

The queue of tasks inside the executor. It won't make progress while we're inside a blocking syscall.

How? I thought each spawn_blocking is an individual operation, no?

It is, but you can have several of them and join them.
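
For example, a sketch of that pattern (file names invented): each spawn_blocking call wraps a single blocking write, but issuing several and joining them lets the blocking pool carry them out in parallel.

```rust
use std::io::Write;

async fn write_two_files(wal: Vec<u8>, control: Vec<u8>) -> std::io::Result<()> {
    let a = tokio::task::spawn_blocking(move || -> std::io::Result<()> {
        let mut f = std::fs::File::create("segment.wal")?;
        f.write_all(&wal)?;
        f.sync_all()
    });
    let b = tokio::task::spawn_blocking(move || -> std::io::Result<()> {
        let mut f = std::fs::File::create("safekeeper.control")?;
        f.write_all(&control)?;
        f.sync_all()
    });
    // Both blocking writes run concurrently on tokio's blocking pool;
    // join the handles and surface any IO error.
    let (ra, rb) = tokio::join!(a, b);
    ra.expect("blocking task panicked")?;
    rb.expect("blocking task panicked")?;
    Ok(())
}
```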

async makes sense only to save the CPU from context switching so it can do actually useful work instead

And not only that. It also allows you to increase concurrency without increasing the number of threads. Cancellation support. Just the convenience of coding when you want to implement some sort of pipelining or something like that.

we don't have the problem that async helps with

Looks like a big number of threads is one of those :)

we won't get to the dozens of thousands of threads where the CPU suffers from mostly context switching

And in this case we will have those "dozens of thousands of threads" doing exactly that context switching. Do we know where our limits are? How many active computes can one safekeeper serve?

There are some techniques from Scylla's IO scheduler, where it tries to find the sweet spot between throughput and latency so the disk is not overloaded and latency stays predictable. I can look for materials explaining this if you're interested.

How severe these effects are (what the cost of offloading is) I actually don't know; it would be interesting to benchmark.

It heavily depends on the concurrency, the size of the tasks, etc. I've seen papers where people offload writes to persistent memory (a write takes around 200ns) to another thread and get a performance uplift based partially on that.

The whole thing requires investing in benchmarking anyway. Going partially async is problematic. So the easiest solution to me is still to find ways to decrease the number of threads. An overloaded safekeeper (with 1000 active computes) looks like a bug too, but rather in the control plane, which should be able to balance the load.

@arssher
Contributor Author

arssher commented Dec 8, 2022

The queue of tasks inside the executor. It won't make progress while we're inside a blocking syscall.

Sure. What I meant is that if all threads are in fsync, the disk obviously can't carry out any more requests, so giving out more execution time is (mostly) pointless if writing to disk is the only job.

Cancellation support. Just the convenience of coding

That's why we have a thread-local executor.

Looks like a big number of threads is one of those :)

We don't have that yet; the 4k limit (this issue) is an artificial limitation.

Do we know where our limits are? How many active computes can one safekeeper serve?

More investigation would be nice, but clearly this heavily depends on the workload. Even a dozen instances can fully utilize the disk, e.g. if they do COPY IN; OTOH, with a weak write load (almost read-only instances) I think up to 10k computes per safekeeper might be okay. Perhaps less.

I can look for materials explaining this

Yes, that may be interesting.

@LizardWizzard
Contributor

Sure. What I meant is that if all threads are in fsync, the disk obviously can't carry out any more requests, so giving out more execution time is (mostly) pointless if writing to disk is the only job.

That doesn't seem so obvious to me. Disk bandwidth is asymmetric: if you have n threads and all of them are in fsync, that doesn't mean the read bandwidth is saturated. Modern NVMe drives have many IO queues. What number of threads are we considering? One per core? One per core plus a thread pool? Thousands?

That's why we have a thread-local executor.

As I said, having thousands of threads, each running a single-threaded executor, is pointless. You don't use async's multiplexing capabilities that way; it is only the overhead of two schedulers.

We don't have that yet; the 4k limit (this issue) is an artificial limitation.

To me it is already a problem: 4k constantly context-switching threads waste far more cycles on context switches than nproc threads do.

I think up to 10k computes per safekeeper might be okay. Perhaps less.

10k+ threads sounds really bad to me (even more if we count the ones that stream data to pageservers), especially with a single-threaded executor running in each one.

Resources:

https://www.p99conf.io/session/a-new-io-scheduler-algorithm-for-mixed-workloads/ is a video of a P99 CONF talk. The written version is https://www.scylladb.com/2022/08/03/implementing-a-new-io-scheduler-algorithm-for-mixed-read-write-workloads/ ; there are links inside the post to previous iterations of the algorithm.

@shanyp shanyp added the c/storage/safekeeper Component: storage: safekeeper label Feb 26, 2023
arssher added a commit that referenced this issue Apr 29, 2023
This is a full switch: fs IO operations are also tokio ones, working through the
thread pool. Similar to the pageserver, we have multiple runtimes for easier `top`
usage and isolation.

Notable points:
- Now that the guts of safekeeper.rs are full of .await's, we need to be very
  careful not to drop a task at a random point, leaving the timeline in an
  unclear state. Currently the only writer is the walreceiver and we don't
  have top-level cancellation there, so we are good. But to be safe we should
  probably add a fuse panicking if a task is dropped while an operation on a
  timeline is in progress.
- The timeline lock is a tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since the prometheus Collector is
  synchronous, there is now a special task copying data, once per scrape
  period, from under the async lock to a sync one where the collector can
  take it.
- Anything involving closures becomes significantly more complicated, as
  async fns are already kinda closures + 'async closures are unstable'.
- The main thread now tracks the other main tasks, which got much easier.
- The only sync place left is initial data loading, as otherwise clippy
  complains about the timeline map lock being held across await points --
  which is not bad here, as it happens only in the single-threaded runtime of
  the main thread. But having it sync doesn't hurt either.

I'm concerned about the performance of thread pool IO offloading, async traits
and the many await points; but we can try and see how it goes.

fixes #3036
fixes #3966
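
For illustration, a hedged sketch of the metrics crutch described in the commit message above (the type and field names are invented): a background task periodically copies a snapshot from under the async lock into a std mutex that the synchronous prometheus Collector can read.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;
use tokio::sync::Mutex as AsyncMutex;

#[derive(Clone, Default)]
struct TimelineMetrics {
    flush_lsn: u64,
    commit_lsn: u64,
}

fn spawn_metrics_copier(
    source: Arc<AsyncMutex<TimelineMetrics>>,
    snapshot: Arc<Mutex<TimelineMetrics>>,
    scrape_period: Duration,
) {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(scrape_period);
        loop {
            ticker.tick().await;
            // Take the async lock on the runtime, then publish a copy into
            // the sync mutex that the Collector implementation reads.
            let current = source.lock().await.clone();
            *snapshot.lock().unwrap() = current;
        }
    });
}
```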
arssher added a commit that referenced this issue Jun 8, 2023
This is a full switch: fs IO operations are also tokio ones, working through the
thread pool. Similar to the pageserver, we have multiple runtimes for easier `top`
usage and isolation.

Notable points:
- Now that the guts of safekeeper.rs are full of .await's, we need to be very
  careful not to drop a task at a random point, leaving the timeline in an
  unclear state. Currently the only writer is the walreceiver and we don't
  have top-level cancellation there, so we are good. But to be safe we should
  probably add a fuse panicking if a task is dropped while an operation on a
  timeline is in progress.
- The timeline lock is a tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since the prometheus Collector is
  synchronous, it spawns a thread with a current-thread runtime collecting
  the data.
- Anything involving closures becomes significantly more complicated, as
  async fns are already kinda closures + 'async closures are unstable'.
- The main thread now tracks the other main tasks, which got much easier.
- The only sync place left is initial data loading, as otherwise clippy
  complains about the timeline map lock being held across await points --
  which is not bad here, as it happens only in the single-threaded runtime of
  the main thread. But having it sync doesn't hurt either.

I'm concerned about the performance of thread pool IO offloading, async traits
and the many await points; but we can try and see how it goes.

fixes #3036
fixes #3966
arssher added a commit that referenced this issue Jun 11, 2023
awestover pushed a commit that referenced this issue Jun 14, 2023