
UB when using desul's array lock on SYCL with RangePolicy #6745

Open

aelovikov-intel opened this issue Jan 24, 2024 · 8 comments
Labels
Backend - SYCL, Enhancement

Comments

@aelovikov-intel
Contributor

This test:

ASSERT_TRUE(
    (TestAtomic::Loop<TestAtomic::SuperScalar<4>, TEST_EXECSPACE>(100, 1)));

ultimately ends up being enqueued via SYCL's range version of parallel_for (as opposed to the nd_range alternative).

desul's implementation at https://github.com/desul/desul/blob/fd6cd0639863a48ae0c57ed5db286955d6a412e2/atomics/include/desul/atomics/Lock_Based_Fetch_Op_SYCL.hpp#L35 uses the https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_free_function_queries.asciidoc extension, which specifies in https://github.com/intel/llvm/blob/f694c841e234835bc08578c905b7e904a1a57b4b/sycl/doc/extensions/experimental/sycl_ext_oneapi_free_function_queries.asciidoc?plain=1#L199-L209 that

sub_group this_sub_group() ... shall only be called from within the parallel_for with nd_range as the argument
If prerequisites are not fulfilled the behavior is undefined
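
For illustration, a minimal sketch of the problematic combination (this is not the actual desul code, and it assumes the experimental sycl::ext::oneapi::experimental::this_sub_group() spelling from the linked extension document):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t N = 100;

  // The kernel is submitted with a plain range, but the body calls the
  // experimental free-function query. Per the quoted extension wording,
  // this_sub_group() shall only be called from a parallel_for taking an
  // nd_range, so this call is UB.
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1>) {
    auto sg = sycl::ext::oneapi::experimental::this_sub_group();  // UB here
    (void)sg;
  });
  q.wait();
}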

@masterleinad
Contributor

Yes, that's difficult to work around if we still want the runtime to find a good workgroup size. I'm not sure if this is causing any real issues yet, though.

@masterleinad
Contributor

Was there a good reason for this restriction anyway?

@aelovikov-intel
Contributor Author

aelovikov-intel commented Jan 26, 2024

Was there a good reason for this restriction anyway?

I'd speculate that sycl::nd_item and sycl::sub_group only make sense in the context of using sycl::nd_range submission. The extension (semantically) just eases the access to them but doesn't create them out of thin air.

Tagging @rarutyun @Pennycook @steffenlarsen @gmlueck for the actual reasoning behind specification.

@Pennycook

Pennycook commented Jan 26, 2024

I'd speculate that sycl::nd_item and sycl::sub_group only make sense in the context of using sycl::nd_range submission. The extension (semantically) just eases the access to them but doesn't create them out of thin air.

This is correct.

A parallel_for(range) kernel is not guaranteed to be implemented as a parallel_for(nd_range) kernel with an automatic size. You should think of parallel_for(range) as being more similar to Kokkos::parallel_for, in the sense that each invocation of the lambda is only guaranteed to be executed by a thread of execution providing weakly parallel forward progress guarantees. Implementations can (and do) implement parallel_for(range) with a different number of work-items than you'd expect, potentially with a many-to-one mapping of lambda invocations to work-items.

DPC++ has an extension that enables you to request the ND-range execution model but using a runtime-provided size:

q.parallel_for(
    sycl::nd_range<1>{{N}, sycl::ext::oneapi::experimental::auto_range<1>()},
    [=](sycl::nd_item<1>) {
      /* kernel body */
    });

It's safe to use this extension in conjunction with the free-function queries.
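
A self-contained sketch of that combination (assuming the experimental this_sub_group() spelling, and that <sycl/sycl.hpp> pulls in both extensions):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t N = 100;

  // ND-range execution model with a runtime-chosen work-group size via
  // auto_range; inside an nd_range kernel the free-function queries are
  // well-defined.
  q.parallel_for(
      sycl::nd_range<1>{{N}, sycl::ext::oneapi::experimental::auto_range<1>()},
      [=](sycl::nd_item<1>) {
        auto sg = sycl::ext::oneapi::experimental::this_sub_group();
        (void)sg;
      });
  q.wait();
}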

@masterleinad
Contributor

Even better! Let me update accordingly then.

@aelovikov-intel
Contributor Author

I would still be very careful here. If the range size is a prime number, I'd expect an if around the kernel body, making the control flow non-convergent. Calling a sub-group function from it would be UB as well. @Pennycook, please correct me if I'm wrong.
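
A sketch of the concern (the rounding here is hypothetical, just to illustrate the worry; it is not a claim about what DPC++ actually does):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t N = 97;                               // prime range size
  constexpr size_t wg = 32;                              // hypothetical work-group size
  constexpr size_t rounded = ((N + wg - 1) / wg) * wg;   // 128

  // If the runtime padded the range and masked off the tail, the guard below
  // would be a non-uniform branch; sub-group operations such as group_barrier
  // require convergent control flow, so calling one inside the guard is UB.
  q.parallel_for(sycl::nd_range<1>{{rounded}, {wg}},
                 [=](sycl::nd_item<1> item) {
                   if (item.get_global_id(0) < N) {
                     sycl::group_barrier(item.get_sub_group());  // UB: divergent call
                   }
                 });
  q.wait();
}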

@masterleinad
Contributor

It appears that this feature isn't available with the compiler drops on the Argonne test beds yet, so we'll have to wait a bit.

Implementations can (and do) implement parallel_for(range) with a different number of work-items than you'd expect, potentially with a many-to-one mapping of lambda invocations to work-items.

It seems that we would possibly have worse performance for a parallel_for RangePolicy if we switched to using nd_range with auto_range then.

@Pennycook

I would still be very careful here. If the range size is a prime number, I'd expect an if around the kernel body, making the control flow non-convergent. Calling a sub-group function from it would be UB as well. @Pennycook, please correct me if I'm wrong.

Range-rounding is disabled for ND-range kernels for exactly this reason. Rounding an ND-range would effectively mean that all uses of groups and sub-groups were UB, not just uses with the free functions.

It seems that we would possibly have worse performance for a parallel_for RangePolicy if we switched to using nd_range with auto_range then.

If you don't do any checks of your loop bound, then I agree this is possible. Because the work-group size selected by an auto_range has to evenly divide the total range size, for something like a prime number the only valid choice for the runtime is a group size of 1. I assumed Kokkos already had some logic in place to defend against this case, though, because it's also required for other backends (e.g., in CUDA the block size must evenly divide the total number of threads).
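
As a rough sketch of that divisibility constraint (this helper is hypothetical, not Kokkos or DPC++ code):

#include <cstddef>

// Largest work-group size that evenly divides the range, capped at a device
// maximum. For a prime N larger than max_wg, the only valid choice is 1.
std::size_t largest_dividing_group_size(std::size_t n, std::size_t max_wg) {
  for (std::size_t wg = (max_wg < n ? max_wg : n); wg > 1; --wg)
    if (n % wg == 0) return wg;
  return 1;
}

// largest_dividing_group_size(100, 64) == 50
// largest_dividing_group_size(97, 64)  == 1   (97 is prime)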

If you're talking about cases where parallel_for(range) gives better performance because there's lower scheduling overhead... That can happen, but I think the effect would only be noticeable for very small kernels with a trivial amount of work.

@ajpowelsnl added the Enhancement label Apr 2, 2024