
4096 thread limit in safekeepers. #3036

Closed
arssher opened this issue Dec 8, 2022 · 6 comments · Fixed by #4119
Labels
c/storage/safekeeper Component: storage: safekeeper t/bug Issue Type: Bug

Comments

@arssher
Contributor

arssher commented Dec 8, 2022

follow up of #2722

Essentially, due to our tracing-subscriber usage we are limited to at most 4096 threads; exceeding that causes a panic.

I had a look at the tracing-subscriber & sharded-slab crates. The former depends on the latter; sharded-slab allows MAX_THREADS to be configured statically, but tracing-subscriber just uses the default of 4096.

see also tokio-rs/tracing#1485
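
For illustration, a minimal sketch (not taken from the issue; the thread count and sleep are arbitrary) of the kind of workload expected to trip the limit: with a Registry-based subscriber installed, every live thread that creates a span occupies a sharded-slab shard, and the 4097th concurrent thread panics.

```rust
use std::thread;
use std::time::Duration;

fn main() {
    // The fmt subscriber is built on tracing-subscriber's Registry, whose
    // per-thread span storage is backed by sharded-slab.
    tracing_subscriber::fmt::init();

    let handles: Vec<_> = (0..5000)
        .map(|i| {
            thread::spawn(move || {
                // Creating and entering a span registers this thread in the
                // slab; keeping the thread alive keeps its shard occupied.
                let _span = tracing::info_span!("conn", id = i).entered();
                thread::sleep(Duration::from_secs(60));
            })
        })
        .collect();

    for h in handles {
        let _ = h.join();
    }
}
```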

@arssher arssher added the t/bug Issue Type: Bug label Dec 8, 2022
@arssher arssher self-assigned this Dec 8, 2022
@arssher
Contributor Author

arssher commented Dec 8, 2022

We should fix this quickly; 4k is a low limit.

I'd probably open a PR to tracing that makes the sharded-slab config adjustable. However, tracing-subscriber is evidently tailored to the async world, so we'd better understand the potential problems here first -- e.g. how much memory a high MAX_THREADS would consume.

However, right now I'd just switch the safekeeper to async in roughly the pageserver mode: with separate runtimes (to prevent different jobs from blocking each other) and sync IO. A sketch of that setup follows the list below.

  • It is easy to do.
  • I'd like to avoid spawn_blocking for IO, as the safekeeper is disk intensive and it is not clear how expensive offloading to the thread pool is. IOW, using spawn_blocking in an app that does little besides writing to disk seems pointless.
  • Blocking on IO might hurt performance as well, but considering that the CPU work of safekeepers is very small, and that we can make the runtime large, essentially getting close to a thread-per-connection model, I don't expect the impact to be significant.
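
Here is that sketch of the separate-runtimes setup (runtime names and thread counts are illustrative, not the actual safekeeper configuration): each job class gets its own multi-threaded runtime, so sync IO blocking one pool does not stall the others, and the named threads are easy to tell apart in `top -H`.

```rust
use tokio::runtime::{Builder, Runtime};

fn build_runtime(name: &str, workers: usize) -> Runtime {
    Builder::new_multi_thread()
        .worker_threads(workers)
        .thread_name(name)
        .enable_all()
        .build()
        .expect("failed to build runtime")
}

fn main() {
    // One runtime per job class; blocking (sync) disk IO inside one of them
    // does not steal worker threads from the others.
    let wal_service = build_runtime("wal-service", 8);
    let broker = build_runtime("broker", 2);
    let http = build_runtime("http", 2);

    wal_service.spawn(async { /* accept and serve WAL connections */ });
    broker.spawn(async { /* push timeline state to the broker */ });
    http.block_on(async { /* serve the management API */ });
}
```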

In the broader picture, async without async disk IO doesn't make sense for disk-intensive safekeepers -- they should saturate the disk much earlier than they run short of CPU due to a large number of connections. However, we use tokio for handy cancellation, and several of the libraries we use are async, so the basic model is thread-per-connection with a current-thread runtime in each one. So it would be nice to reconcile tracing-subscriber with that, or use something different (slog or something custom) -- but the latter doesn't look appealing with tracing being used in other parts of the project.
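
For contrast, a sketch of that current thread-per-connection model (the listener loop and handle_connection are hypothetical): each accepted connection gets its own OS thread running a current-thread tokio runtime, which provides cancellation and lets us use async libraries while disk IO stays plain blocking IO.

```rust
use std::net::{TcpListener, TcpStream};
use std::thread;

fn serve(listener: TcpListener) -> std::io::Result<()> {
    for stream in listener.incoming() {
        let stream = stream?;
        thread::spawn(move || {
            // One current-thread runtime per connection: only this
            // connection's futures run here, so a blocking write to disk
            // stalls just this one thread.
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .expect("failed to build per-connection runtime");
            rt.block_on(handle_connection(stream));
        });
    }
    Ok(())
}

// Hypothetical handler: would speak the safekeeper protocol over `stream`.
async fn handle_connection(stream: TcpStream) {
    let _ = stream;
}
```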

An alternative is actual async disk IO -- io_uring etc. -- but right now this is not a priority.

CC @petuhovskiy @LizardWizzard

@LizardWizzard
Contributor

However, right now I'd just switch the safekeeper to async in roughly the pageserver mode: with separate runtimes (to prevent different jobs from blocking each other) and sync IO.

Please note that this created other problems. Mixing sync IO with async sometimes doesn't go really well. To me it's clear that we should move to full async and not stop halfway. Regarding the "not really well" part, see this issue by @koivunej

I'd like to avoid spawn_blocking for IO, as the safekeeper is disk intensive and it is not clear how expensive offloading to the thread pool is. IOW, using spawn_blocking in an app that does little besides writing to disk seems pointless.

I cannot agree with that. Without spawn_blocking you starve executor threads with blocking operations. Fsyncs can take seconds. Got as many in-flight fsyncs as there are executor threads? Everything is stuck now. Such calls cannot be preempted, so at best it causes a latency spike for other tasks in the queue. Not to mention deadlocks, e.g. with the non-stealable queue slot in the tokio scheduler (see the issue I mentioned above).

Tokio's own File operations offload the actual IO to a thread pool, and those threads are started in advance, so to me it is rather unclear how it can make things significantly worse performance-wise. It even makes it easy to increase write concurrency (e.g. schedule several writes at once). In general, async allows for higher concurrency, which is IMO quite suitable for our use case and the number of things we need to do concurrently.
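
For illustration, a small sketch of concurrent writes with tokio's file API (paths and buffer names are made up): each tokio::fs operation is offloaded to the blocking pool internally, so joining the futures keeps several disk writes, fsyncs included, in flight at once.

```rust
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

async fn write_file(path: &str, data: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path).await?;
    file.write_all(data).await?;
    // fsync; like the write, this runs on tokio's blocking thread pool.
    file.sync_all().await?;
    Ok(())
}

async fn flush_both(wal: &[u8], control: &[u8]) -> std::io::Result<()> {
    // Both writes proceed concurrently; no executor thread sits blocked
    // waiting for the disk.
    let (a, b) = tokio::join!(
        write_file("segment.wal", wal),
        write_file("safekeeper.control", control),
    );
    a.and(b)
}
```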

so the basic model is thread-per-connection with a current-thread runtime in each one.

To me this model is rather pointless: async makes a lot of sense for networking when you have a lot of clients, so it would be beneficial to have a runtime that can multiplex several connections in one thread. With a per-connection runtime it seems we pick the worst parts of both worlds.


Do we really have 4096 active threads? If you do not want to go down the async road, it's probably worth considering becoming more aggressive about shutting down idle things. To me that looks like the fastest and least disruptive way to solve the issue.

@arssher
Contributor Author

arssher commented Dec 8, 2022

Got as many in-flight fsyncs as there are executor threads? Everything is stuck now.

What is stuck, if basically the only thing we do is write to and read from disk? It is actually a nontrivial question; it would be interesting to benchmark.

It even makes it easy to increase write concurrency (e.g. schedule several writes at once).

How? I thought each spawn_blocking is an individual operation, no?

Do we really have 4096 active threads?

Currently no (we have had several such situations, but they were bugs), but having 1-2 thousand active computes would do that.

In general, unless we are talking about async disk IO, async makes sense only to save the CPU from context switching so it can do actually useful work instead. But if all we do is actively write to disk, we don't have the problem that async helps with (we won't get to the dozens of thousands of threads where the CPU suffers from mostly context switching -- the disk would be saturated earlier), and we do the job at which async is less efficient, since it has to hand disk IO off to a separate pool. How severe these effects are (what the cost of offloading is) I actually don't know; it would be interesting to benchmark.

@LizardWizzard
Contributor

What is stuck, if basically the only thing we do is write to and read from disk?

The queue of tasks inside the executor. It won't make progress while we're inside a blocking syscall.

How? I thought each spawn_blocking is an individual operation, no?

It is, but you can have several of them and join them.
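
For example, a sketch of that pattern (file names invented): each spawn_blocking call wraps a single blocking write, but issuing several and joining them lets the blocking pool carry them out in parallel.

```rust
use std::io::Write;

async fn write_two_files(wal: Vec<u8>, control: Vec<u8>) -> std::io::Result<()> {
    let a = tokio::task::spawn_blocking(move || -> std::io::Result<()> {
        let mut f = std::fs::File::create("segment.wal")?;
        f.write_all(&wal)?;
        f.sync_all()
    });
    let b = tokio::task::spawn_blocking(move || -> std::io::Result<()> {
        let mut f = std::fs::File::create("safekeeper.control")?;
        f.write_all(&control)?;
        f.sync_all()
    });
    // Both blocking writes run concurrently on tokio's blocking pool;
    // join the handles and surface any IO error.
    let (ra, rb) = tokio::join!(a, b);
    ra.expect("blocking task panicked")?;
    rb.expect("blocking task panicked")?;
    Ok(())
}
```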

async makes sense only to save the CPU from context switching so it can do actually useful work instead

And not only that. It also allows you to increase concurrency without increasing the number of threads. Cancellation support. Just the convenience of coding when you want to implement some sort of pipelining or something like that.

we don't have the problem that async helps with

Looks like a big number of threads is one of those :)

we won't get to the dozens of thousands of threads where the CPU suffers from mostly context switching

And in this case we will have those "dozens of thousands of threads" doing exactly that context switching. Do we know where our limits are? How many active computes can one safekeeper serve?

There are some techniques from Scylla's IO scheduler, where it tries to find the sweet spot between throughput and latency so the disk is not overloaded and latency stays predictable. I can look for materials explaining this if you're interested.

How severe these effects are (what the cost of offloading is) I actually don't know; it would be interesting to benchmark.

It heavily depends on the concurrency, the size of the tasks, etc. I've seen papers where people offload writes to persistent memory (a write takes around 200ns) to another thread and get a performance uplift based partially on that.

The whole thing requires investing in benchmarking anyway. Going partially async is problematic. So the easiest solution to me is still to find ways to decrease the number of threads. An overloaded safekeeper (with 1000 active computes) looks like a bug too, but rather in the control plane, which should be able to balance the load.

@arssher
Contributor Author

arssher commented Dec 8, 2022

The queue of tasks inside the executor. It won't make progress while we're inside a blocking syscall.

Sure. What I meant is that if all threads are in fsync, the disk obviously can't carry out any more requests, so giving out more execution time is (mostly) pointless if writing to disk is the only job.

Cancellation support. Just the convenience of coding

That's why we have a thread-local executor.

Looks like a big number of threads is one of those :)

We don't have that yet; the 4k limit (this issue) is an artificial limitation.

Do we know where our limits are? How many active computes can one safekeeper serve?

More investigation would be nice, but clearly this heavily depends on the workload. Even a dozen instances can fully utilize the disk, e.g. if they do COPY IN; OTOH, with a weak write load (almost read-only instances) I think up to 10k computes per safekeeper might be okay. Perhaps less.

I can look for materials explaining this

Yes, that may be interesting.

@LizardWizzard
Contributor

Sure. What I meant is that if all threads are in fsync, the disk obviously can't carry out any more requests, so giving out more execution time is (mostly) pointless if writing to disk is the only job.

That doesn't seem so obvious to me. Disk bandwidth is asymmetric: if you have n threads and all of them are in fsync, that doesn't mean the read bandwidth is saturated. Modern NVMe drives have many IO queues. What number of threads are we considering? One per core? One per core plus a thread pool? Thousands?

That's why we have a thread-local executor.

As I said, having thousands of threads, each running a single-threaded executor, is pointless. You don't use async's multiplexing capabilities that way; it is only the overhead of two schedulers.

We don't have that yet; the 4k limit (this issue) is an artificial limitation.

To me it is already a problem: 4k constantly context-switching threads waste far more cycles on context switches than nproc threads do.

I think up to 10k computes per safekeeper might be okay. Perhaps less.

10k+ threads sounds really bad to me (even more if we count the ones that stream data to pageservers), especially with a single-threaded executor running in each one.

Resources:

https://www.p99conf.io/session/a-new-io-scheduler-algorithm-for-mixed-workloads/ is a video of a P99 CONF talk. The written version is https://www.scylladb.com/2022/08/03/implementing-a-new-io-scheduler-algorithm-for-mixed-read-write-workloads/ ; there are links inside the post to previous iterations of the algorithm.

@shanyp shanyp added the c/storage/safekeeper Component: storage: safekeeper label Feb 26, 2023
arssher added a commit that referenced this issue Apr 29, 2023
This is a full switch: fs IO operations are also tokio ones, working through the
thread pool. Similar to the pageserver, we have multiple runtimes for easier `top`
usage and isolation.

Notable points:
- Now that the guts of safekeeper.rs are full of .await's, we need to be very
  careful not to drop a task at a random point, leaving the timeline in an
  unclear state. Currently the only writer is the walreceiver and we don't
  have top-level cancellation there, so we are good. But to be safe we should
  probably add a fuse panicking if a task is dropped while an operation on a
  timeline is in progress.
- The timeline lock is a tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since the prometheus Collector is
  synchronous, there is now a special task copying data, once per scrape
  period, from under the async lock to a sync one where the collector can
  take it.
- Anything involving closures becomes significantly more complicated, as
  async fns are already kinda closures + 'async closures are unstable'.
- The main thread now tracks the other main tasks, which got much easier.
- The only sync place left is initial data loading, as otherwise clippy
  complains about the timeline map lock being held across await points --
  which is not bad here, as it happens only in the single-threaded runtime of
  the main thread. But having it sync doesn't hurt either.

I'm concerned about the performance of thread pool IO offloading, async traits
and the many await points; but we can try and see how it goes.

fixes #3036
fixes #3966
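
For illustration, a hedged sketch of the metrics crutch described in the commit message above (the type and field names are invented): a background task periodically copies a snapshot from under the async lock into a std mutex that the synchronous prometheus Collector can read.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;
use tokio::sync::Mutex as AsyncMutex;

#[derive(Clone, Default)]
struct TimelineMetrics {
    flush_lsn: u64,
    commit_lsn: u64,
}

fn spawn_metrics_copier(
    source: Arc<AsyncMutex<TimelineMetrics>>,
    snapshot: Arc<Mutex<TimelineMetrics>>,
    scrape_period: Duration,
) {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(scrape_period);
        loop {
            ticker.tick().await;
            // Take the async lock on the runtime, then publish a copy into
            // the sync mutex that the Collector implementation reads.
            let current = source.lock().await.clone();
            *snapshot.lock().unwrap() = current;
        }
    });
}
```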
arssher added a commit that referenced this issue Jun 8, 2023
This is a full switch: fs IO operations are also tokio ones, working through the
thread pool. Similar to the pageserver, we have multiple runtimes for easier `top`
usage and isolation.

Notable points:
- Now that the guts of safekeeper.rs are full of .await's, we need to be very
  careful not to drop a task at a random point, leaving the timeline in an
  unclear state. Currently the only writer is the walreceiver and we don't
  have top-level cancellation there, so we are good. But to be safe we should
  probably add a fuse panicking if a task is dropped while an operation on a
  timeline is in progress.
- The timeline lock is a tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since the prometheus Collector is
  synchronous, it spawns a thread with a current-thread runtime collecting
  the data.
- Anything involving closures becomes significantly more complicated, as
  async fns are already kinda closures + 'async closures are unstable'.
- The main thread now tracks the other main tasks, which got much easier.
- The only sync place left is initial data loading, as otherwise clippy
  complains about the timeline map lock being held across await points --
  which is not bad here, as it happens only in the single-threaded runtime of
  the main thread. But having it sync doesn't hurt either.

I'm concerned about the performance of thread pool IO offloading, async traits
and the many await points; but we can try and see how it goes.

fixes #3036
fixes #3966
arssher added a commit that referenced this issue Jun 11, 2023
awestover pushed a commit that referenced this issue Jun 14, 2023