
UB when using desul's array lock on SYCL with RangePolicy #6745

Open

aelovikov-intel opened this issue Jan 24, 2024 · 8 comments
Labels
Backend - SYCL, Enhancement

Comments

@aelovikov-intel
Contributor

This test:

ASSERT_TRUE(
    (TestAtomic::Loop<TestAtomic::SuperScalar<4>, TEST_EXECSPACE>(100, 1)));

ultimately ends up being enqueued via SYCL's range version of parallel_for (as opposed to the nd_range alternative).

desul's implementation at https://github.com/desul/desul/blob/fd6cd0639863a48ae0c57ed5db286955d6a412e2/atomics/include/desul/atomics/Lock_Based_Fetch_Op_SYCL.hpp#L35 uses the https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_free_function_queries.asciidoc extension, which specifies in https://github.com/intel/llvm/blob/f694c841e234835bc08578c905b7e904a1a57b4b/sycl/doc/extensions/experimental/sycl_ext_oneapi_free_function_queries.asciidoc?plain=1#L199-L209 that

sub_group this_sub_group() ... shall only be called from within the parallel_for with nd_range as the argument
If prerequisites are not fulfilled the behavior is undefined
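
For illustration, a minimal sketch of the problematic combination (this is not the actual desul code, and it assumes the experimental sycl::ext::oneapi::experimental::this_sub_group() spelling from the linked extension document):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t N = 100;

  // The kernel is submitted with a plain range, but the body calls the
  // experimental free-function query. Per the quoted extension wording,
  // this_sub_group() shall only be called from a parallel_for taking an
  // nd_range, so this call is UB.
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1>) {
    auto sg = sycl::ext::oneapi::experimental::this_sub_group();  // UB here
    (void)sg;
  });
  q.wait();
}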

@masterleinad
Contributor

Yes, that's difficult to work around if we still want the runtime to find a good workgroup size. I'm not sure if this is causing any real issues yet, though.

@masterleinad
Contributor

Was there a good reason for this restriction anyway?

@aelovikov-intel
Contributor Author

aelovikov-intel commented Jan 26, 2024

Was there a good reason for this restriction anyway?

I'd speculate that sycl::nd_item and sycl::sub_group only make sense in the context of using sycl::nd_range submission. The extension (semantically) just eases the access to them but doesn't create them out of thin air.

Tagging @rarutyun @Pennycook @steffenlarsen @gmlueck for the actual reasoning behind specification.

@Pennycook

Pennycook commented Jan 26, 2024

I'd speculate that sycl::nd_item and sycl::sub_group only make sense in the context of using sycl::nd_range submission. The extension (semantically) just eases the access to them but doesn't create them out of thin air.

This is correct.

A parallel_for(range) kernel is not guaranteed to be implemented as a parallel_for(nd_range) kernel with an automatic size. You should think of parallel_for(range) as being more similar to Kokkos::parallel_for, in the sense that each invocation of the lambda is only guaranteed to be executed by a thread of execution providing weakly parallel forward progress guarantees. Implementations can (and do) implement parallel_for(range) with a different number of work-items than you'd expect, potentially with a many-to-one mapping of lambda invocations to work-items.

DPC++ has an extension that enables you to request the ND-range execution model but using a runtime-provided size:

q.parallel_for(
    sycl::nd_range<1>{{N}, sycl::ext::oneapi::experimental::auto_range<1>()},
    [=](sycl::nd_item<1>) {
      /* kernel body */
    });

It's safe to use this extension in conjunction with the free-function queries.
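
A self-contained sketch of that combination (assuming the experimental this_sub_group() spelling, and that <sycl/sycl.hpp> pulls in both extensions):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t N = 100;

  // ND-range execution model with a runtime-chosen work-group size via
  // auto_range; inside an nd_range kernel the free-function queries are
  // well-defined.
  q.parallel_for(
      sycl::nd_range<1>{{N}, sycl::ext::oneapi::experimental::auto_range<1>()},
      [=](sycl::nd_item<1>) {
        auto sg = sycl::ext::oneapi::experimental::this_sub_group();
        (void)sg;
      });
  q.wait();
}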

@masterleinad
Contributor

Even better! Let me update accordingly then.

@aelovikov-intel
Contributor Author

I would still be very careful here. If the range size is a prime number, I'd expect an if around the kernel body, making the control flow non-convergent. Calling a sub-group function from it would be UB as well. @Pennycook, please correct me if I'm wrong.
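
A sketch of the concern (the rounding here is hypothetical, just to illustrate the worry; it is not a claim about what DPC++ actually does):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t N = 97;                               // prime range size
  constexpr size_t wg = 32;                              // hypothetical work-group size
  constexpr size_t rounded = ((N + wg - 1) / wg) * wg;   // 128

  // If the runtime padded the range and masked off the tail, the guard below
  // would be a non-uniform branch; sub-group operations such as group_barrier
  // require convergent control flow, so calling one inside the guard is UB.
  q.parallel_for(sycl::nd_range<1>{{rounded}, {wg}},
                 [=](sycl::nd_item<1> item) {
                   if (item.get_global_id(0) < N) {
                     sycl::group_barrier(item.get_sub_group());  // UB: divergent call
                   }
                 });
  q.wait();
}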

@masterleinad
Contributor

It appears that this feature isn't available with the compiler drops on the Argonne test beds yet, so we'll have to wait a bit.

Implementations can (and do) implement parallel_for(range) with a different number of work-items than you'd expect, potentially with a many-to-one mapping of lambda invocations to work-items.

It seems that we would possibly have worse performance for a parallel_for RangePolicy if we switched to using nd_range with auto_range then.

@Pennycook

I would still be very careful here. If the range size is a prime number, I'd expect an if around the kernel body, making the control flow non-convergent. Calling a sub-group function from it would be UB as well. @Pennycook, please correct me if I'm wrong.

Range-rounding is disabled for ND-range kernels for exactly this reason. Rounding an ND-range would effectively mean that all uses of groups and sub-groups were UB, not just uses with the free functions.

It seems that we would possibly have worse performance for a parallel_for RangePolicy if we switched to using nd_range with auto_range then.

If you don't do any checks of your loop bound, then I agree this is possible. Because the work-group size selected by an auto_range has to evenly divide the total range size, for something like a prime number the only valid choice for the runtime is a group size of 1. I assumed Kokkos already had some logic in place to defend against this case, though, because it's also required for other backends (e.g., in CUDA the block size must evenly divide the total number of threads).
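
As a rough sketch of that divisibility constraint (this helper is hypothetical, not Kokkos or DPC++ code):

#include <cstddef>

// Largest work-group size that evenly divides the range, capped at a device
// maximum. For a prime N larger than max_wg, the only valid choice is 1.
std::size_t largest_dividing_group_size(std::size_t n, std::size_t max_wg) {
  for (std::size_t wg = (max_wg < n ? max_wg : n); wg > 1; --wg)
    if (n % wg == 0) return wg;
  return 1;
}

// largest_dividing_group_size(100, 64) == 50
// largest_dividing_group_size(97, 64)  == 1   (97 is prime)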

If you're talking about cases where parallel_for(range) gives better performance because there's lower scheduling overhead... That can happen, but I think the effect would only be noticeable for very small kernels with a trivial amount of work.

@ajpowelsnl added the Enhancement label Apr 2, 2024