
rayon feature causes deadlocks when used inside of an existing rayon threadpool #227

Closed
awused opened this issue Feb 21, 2022 · 23 comments · Fixed by #230

Comments

@awused

awused commented Feb 21, 2022

When loading a jpeg with the rayon feature enabled, the calling thread does no work of its own: it just blocks, waiting for mpsc messages to come through, until the work is completed on other threads. If the decode is already running from inside the context of a threadpool, this ties up one of that pool's threads until the other threads can process the spawned work. If an application is using a rayon threadpool - even just the global default rayon pool - to parallelize the loading of images, it's possible for enough jpegs to start loading at once to exhaust the pool and deadlock it.

I believe refactoring it to use rayon::in_place_scope would work, but a more expedient option is probably to just have a dedicated rayon threadpool for decoding. If nothing else this should be documented and maybe made non-default.

This might apply to other image decoders as well if any of them use rayon - I've not inspected their code - but jpegs are especially common.
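
For illustration, the application-side pattern described above might look roughly like the sketch below (hedged: `load_all` and `paths` are made-up names, not code from any real application; only the combination of a parallel iterator with Decoder::decode matters).

    use std::fs::File;

    use jpeg_decoder::Decoder;
    use rayon::prelude::*;

    // Each closure runs on a worker thread of the global rayon pool. If every
    // worker is busy inside decode(), waiting on the decoder's own rayon
    // tasks, no thread is left to run those tasks and the pool deadlocks.
    fn load_all(paths: &[std::path::PathBuf]) -> Vec<Vec<u8>> {
        paths
            .par_iter()
            .map(|path| {
                let mut decoder = Decoder::new(File::open(path).unwrap());
                decoder.decode().unwrap()
            })
            .collect()
    }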

@HeroicKatora
Member

HeroicKatora commented Feb 21, 2022

If I understand correctly, the situation is as follows:

rayon-Pool
|-----|
|  …  |
| Jpg | --> { decode_internal } ----------\
|  …  |                   blocked on multithreaded.rs#L54
|  …  |                                    v
|-----|<--blocked-on free worker--| Component0 | Component1 | Component2 |

Is it possible to write a reliable test/reproduction for this? The best fix for this that I can think of would be to allow the current thread running decode_internal to complete each work item itself, as an alternative to blocking on recv. This would let rayon silently degrade to single-threaded use by having the Jpg worker complete the components on its own. However, I'm not sure how to achieve this with the API. The current parallelization strategy is also somewhat crude in the first place.

@HeroicKatora
Member

HeroicKatora commented Feb 21, 2022

Minimized reproduction:

    use std::fs::File;
    use jpeg_decoder::Decoder;

    // `path` can point at any JPEG file on disk.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .unwrap();
    pool.install(|| {
        let mut decoder = Decoder::new(File::open(&path).unwrap());
        let _ = decoder.decode().unwrap();
    });

@HeroicKatora

This comment was marked as off-topic.

@awused
Author

awused commented Feb 21, 2022

I think keeping a reasonably small dedicated threadpool around would be acceptable enough. The overhead should be minimal, but since it does add permanent overhead it probably makes sense to make it non-default. There are also scoped threadpool crates that are stable and battle-tested enough to be used for the non-rayon path.

If adding any dependencies or keeping idle threadpools around indefinitely isn't viable then rust-lang/rust#93203 seems to be making progress towards stabilization. As an outsider I think it'd be reasonable to disable the rayon feature entirely - to avoid the case where a transitive dependency on image re-enables jpeg_rayon - until that lands and the decoding can be refactored.
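
For reference, a minimal sketch of what the std code path could look like once scoped threads stabilize (placeholder names, not the crate's actual code): workers can borrow per-component data directly and return results through join() instead of a channel.

    use std::thread;

    // Placeholder standing in for the real per-component work.
    fn decode_component(coefficients: &[i16]) -> Vec<u8> {
        coefficients.iter().map(|&c| c as u8).collect()
    }

    fn decode_all_components(component_data: &[Vec<i16>]) -> Vec<Vec<u8>> {
        thread::scope(|s| {
            let handles: Vec<_> = component_data
                .iter()
                .map(|coefficients| s.spawn(move || decode_component(coefficients)))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        })
    }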

@HeroicKatora

This comment was marked as off-topic.

@awused

This comment was marked as off-topic.

@awused
Author

awused commented Feb 25, 2022

To spell it out in no uncertain terms, I see a few options, all of which are reasonable to me as an outsider.

  1. Keep a dedicated rayon threadpool for use with the rayon feature (a rough sketch follows below).
    • When called from the context of an existing rayon threadpool, all the real work will be done in that dedicated threadpool, leaving the calling thread free to block.
    • This requires the least work and is compatible with the existing std implementation.
    • Alternatively, you could construct a new rayon threadpool per call, but that rubs me the wrong way.
  2. Disable the rayon feature entirely, then, when scoped threads (rust-lang/rust#93203) land, migrate the std multi-threading code to use scope.
    • This keeps the rayon version as efficient as it can be (no second threadpool).
    • This requires refactoring both code paths and disabling the rayon feature for an uncertain length of time.
  3. Use one of the existing scoped threadpool implementations in place of the std multi-threading code, then refactor the rayon version to not block the calling thread.
    • This adds an additional dependency and makes the "std" code not really fully "std", while still requiring all the work to migrate both versions.

Of course those were only the simplest, easiest options I could think of and the ones I directly alluded to in my previous comments.
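
For concreteness, option 1 could look something like this sketch (assuming a once_cell dependency; the pool size and every name here are arbitrary, not anything the crate provides):

    use once_cell::sync::Lazy;
    use rayon::ThreadPool;

    // One pool owned by the library. Decode work always runs here, so decode()
    // only ever blocks the caller's thread, never a worker of the caller's own
    // rayon pool.
    static DECODE_POOL: Lazy<ThreadPool> = Lazy::new(|| {
        rayon::ThreadPoolBuilder::new()
            .num_threads(4) // arbitrary; sized for the per-image worker tasks
            .build()
            .expect("failed to build jpeg decode pool")
    });

    fn run_decode_work<R: Send>(work: impl FnOnce() -> R + Send) -> R {
        DECODE_POOL.install(work)
    }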

@HeroicKatora
Member

HeroicKatora commented Feb 25, 2022

Please take a look at PR #230 regarding a dedicated rayon threadpool. This is the (somewhat wrong, as we agree) threadpool-per-call, because the existing structure allowed for this more easily than other models. I should also say that 'threadpool-per-library' rubs me the wrong way, as it can not be initialized effectively when different consumers have different needs. The std multithreading spawns (up to) 4 threads per image, while a lazy_static! threadpool (or similar) would be some static number per library. Maybe we should make this an explicit parameter during construction of the decoder? (E.g. a setter that accepts some struct that internally holds an Arc<ThreadPool>.)
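
A hypothetical shape for that construction parameter (nothing like this exists in the crate today; every name below is made up):

    use std::sync::Arc;

    // The application owns the pool and decides its size, and can share one
    // pool across several decoders via cheap clones.
    #[derive(Clone)]
    pub struct DecoderThreadPool {
        pool: Arc<rayon::ThreadPool>,
    }

    impl DecoderThreadPool {
        pub fn new(threads: usize) -> Self {
            let pool = rayon::ThreadPoolBuilder::new()
                .num_threads(threads)
                .build()
                .expect("failed to build decoder thread pool");
            DecoderThreadPool { pool: Arc::new(pool) }
        }
    }

    // Hypothetical setter on the decoder side, e.g.:
    //     decoder.with_thread_pool(shared_pool.clone());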

We have no need to migrate std multithreading to anything else. It's not the inability to use scoped threads that kept us from adopting a different work structure than the current one; it's the coupling between decoding and the actual work creation. rayon does not have join handles, which makes it very hard to create tasks that depend on the results of previous tasks. To achieve the flexibility in per-component treatment that the specification requires, we must either be able to do that, or introduce multiple 'global' join points, which again loses some of the benefits of multithreading. There's no such thing as a free lunch.

@awused

This comment was marked as off-topic.

@fintelia
Contributor

fintelia commented Feb 25, 2022

Rayon lets you detect whether you're running within a threadpool via rayon::current_thread_index().is_some(). As a quick workaround, would it be better for now to just fall back to our single-threaded decode implementation if we detect that?

@HeroicKatora

This comment was marked as off-topic.

@fintelia
Contributor

That's insufficient. We need to detect if the number of threads available is larger than the number of blocking tasks (so, larger than the number of concurrent decode calls).

When called from a rayon threadpool, we need to ensure that, if we do any blocking operations from within decode, there are more threads available than there are blocking tasks. My proposal is that decode should never do any blocking operations when called from a threadpool. This sidesteps the question of figuring out how many threads are available.

@awused

This comment was marked as off-topic.

@fintelia
Contributor

Moderation note: Let's keep this focused on figuring out a technical solution. There is no need to argue over who is confused, or who misunderstood whom, or whatever.

@HeroicKatora

This comment was marked as off-topic.

@awused

This comment was marked as off-topic.

@HeroicKatora
Member

HeroicKatora commented Feb 25, 2022

My proposal is that decode should never do any blocking operations when called from a threadpool.

Without a decently large refactor, our multithreaded task system always blocks at some point; the single-threaded one does not. This involves workers/tasks internally. As those tasks would always run on a threadpool no matter what we do, implementing this suggestion effectively means 'disable use of rayon'. At least we can keep std multithreading unchanged this way. The code exposition above with an explicit thread pool scope was simply the easiest reproduction that offered enough control to reproduce deterministically. The problem also occurs when we are using the global thread pool; it's just more hidden, not deterministic, and requires multiple decode calls in parallel.

Current assumed situation of our internal architecture:

  • Single Thread (works)
  • Multi Threaded std (works)
  • Multi Threaded rayon (broken): except when we can control the worker pool's number of threads to be strictly greater than the number of concurrent decode calls.

Potential approaches:

  1. Disable rayon. As assumed above, this puts the library into a good state.
  2. Use a new rayon ThreadPool for the decoder: simply spawn more than one thread. As decode(&mut self) can only be called once at a time per decoder, this guarantees the requirement needed to avoid the deadlock.
  3. Share an internally created rayon ThreadPool between a limited number of decoders. This also satisfies the requirement but is more complex. It avoids creating new workers for each Decoder, though, which is one of the main technical advantages over the std multithreading.
  4. Rewrite the task system to remove the blocking code (a rough sketch follows below, after the non-solutions list). Since rayon requires lexical scoping to get results, this would mean identifying serialization points at which all component results can be collected at the same time and then fanned out again in new tasks to run the remaining compute_image computation. This is potentially less concurrent than the existing version and requires some rearchitecting. In any case, the existing std multithreading can likely also cope with this solution, using the result retrieval it already implements but with the new order of computation.

Non-Solutions:

  1. Detect the rayon context. As we can not introspect, nor control, the surrounding ThreadPool that will be in use (no matter if global or local) this can not provide the guarantee we need for forward progress.
  2. Use a dedicated, but global jpeg-decoder ThreadPool. Again, this offers no control over the number of concurrent decodings that may happen. Since rayon can not rescale ThreadPools at runtime, we can't implement the required guarantee identified above.
  3. Changing, replacing or removing std multithreading.
  4. std scoped tasks, as std works fine with either the current structure or lexical scoping. They might prove to have some small runtime advantage in the future though, due to lower communication latency (no explicit mpsc channel for the result).
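
A rough sketch of what approach 4 could look like (placeholder types and function names, not the current code): every component is decoded inside one lexical scope, and a single serialization point collects the results before the equivalent of compute_image runs.

    // Placeholders for the real per-component and recombination steps.
    fn decode_component(data: &[i16]) -> Vec<u8> {
        data.iter().map(|&v| v as u8).collect()
    }
    fn compute_image(planes: &[Vec<u8>]) -> Vec<u8> {
        planes.concat()
    }

    fn decode_then_combine(component_data: Vec<Vec<i16>>) -> Vec<u8> {
        let mut results: Vec<Vec<u8>> = vec![Vec::new(); component_data.len()];
        rayon::in_place_scope(|s| {
            for (slot, data) in results.iter_mut().zip(&component_data) {
                // Each task owns a disjoint &mut slot, so no channel is needed
                // to hand results back and nothing blocks inside the pool.
                s.spawn(move |_| *slot = decode_component(data));
            }
        });
        // Serialization point: all component results are available here and
        // can be fanned out again into the remaining compute_image-style work.
        compute_image(&results)
    }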

@awused
Author

awused commented Feb 25, 2022

Detect the rayon context. As we can not introspect, nor control, the surrounding ThreadPool that will be in use (no matter if global or local) this can not provide the guarantee we need for forward progress.

If called from an existing rayon threadpool, choosing to fall back to the single-threaded implementation or the std multi-threading code is enough to avoid deadlocks and guarantee forward progress. You can always block the thread you have been called on; that is not the problem. The problem only exists when blocking on rayon tasks from within the same threadpool, so if no rayon tasks are spawned then there is no deadlock. This does mean choosing the code path at run time, not at compile time based on feature flags, but the overhead would be negligible since rayon::spawn itself checks whether it's being called from an existing threadpool using the same mechanism.
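
A sketch of the runtime check being described (the two decode closures are placeholders for the existing code paths, not real functions in the crate):

    fn choose_decode_path<R>(
        decode_with_rayon: impl FnOnce() -> R,
        decode_without_rayon: impl FnOnce() -> R,
    ) -> R {
        if rayon::current_thread_index().is_none() {
            // Not running on any rayon worker: blocking this thread while
            // rayon tasks run elsewhere cannot starve the caller's pool.
            decode_with_rayon()
        } else {
            // Already on a rayon worker: spawn nothing into the surrounding
            // pool and never block on tasks that would need it.
            decode_without_rayon()
        }
    }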

@HeroicKatora
Member

HeroicKatora commented Feb 25, 2022

This does mean choosing the code path at run time

Under which conditions, do you suppose, will we ever dynamically choose the rayon-utilizing code path? The whole point is that we can never safely do so. We will not be interacting with rayon at all in this case, no matter what kinds of feature flags are provided (which we will keep only for compatibility with downstream includes), nor would we actually have to declare it as a dependency.

@fintelia
Contributor

Under which conditions, do you suppose, will we ever dynamically choose the rayon-utilizing code path?

If the rayon feature is enabled and rayon::current_thread_index().is_none().

@HeroicKatora
Member

HeroicKatora commented Feb 25, 2022

No, that's not enough. Then any rayon-created tasks will initialize and spawn on the implicit global thread pool instead, which has the exact same problem as any local one: it has a constant number of threads (based on num-cpus or some environment variable, afaik). Again, the local pool in the reproduction is sufficient but not necessary; it just turns out to be enough for a deterministic reproduction (which we can actually turn into a regression test).

@awused
Author

awused commented Feb 25, 2022

Under which conditions, do you suppose, will we ever dynamically choose the rayon-utilizing code path? The whole point is that we can never safely do so.

Whenever the rayon::current_thread_index().is_some() call returns false, it is safe to do so, or at least as safe as using rayon ever will be if you're not managing your own pools. Local or global threadpools don't matter; you're always free to block the thread you're called from.

Poking into the code a bit more, it seems to assume rayon::spawn is analogous to thread::spawn, spawning long-running tasks that wait for more work on channels, so there may be an additional requirement on the minimum number of threads in the pool dedicated to each decoder. If I'm reading this code right, using the global rayon threadpool was never safe, and sharing a rayon threadpool between multiple decode calls will also be unsafe. This pattern would be better served by tokio or another async executor than by rayon.

Again, the local one in the reproduction is sufficient but not necessary, it just turns out to be sufficient enough for a deterministic reproduction (which we can actually turn into a regression test).

I can see two other simple tests: use build_global to change the global threadpool, then call decode from inside that pool in one test and from outside it in the other.

@fintelia
Contributor

Ah, the problem is specifically that the spawned tasks also block:

    while let Ok(message) = rx.recv() {
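
In simplified form, that worker shape looks roughly like this (placeholder message type, paraphrased rather than quoted from the source): the spawned task parks on recv() between work items, occupying one pool thread for as long as the decoder keeps the sending side alive.

    use std::sync::mpsc;

    struct RowData; // stand-in for the decoder's real per-row message

    fn spawn_row_worker() -> mpsc::Sender<RowData> {
        let (tx, rx) = mpsc::channel::<RowData>();
        rayon::spawn(move || {
            while let Ok(_row) = rx.recv() {
                // process the row, then loop straight back to blocking on recv()
            }
        });
        tx
    }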
