SYCL RangePolicy: manually specify workgroup size through chunk size #4875

Merged: 1 commit into kokkos:develop on Sep 18, 2022

Conversation

@masterleinad (Contributor)

In some cases, selecting a custom workgroup size gives significantly better performance than relying on the compiler to choose.

@pvelesko left a comment

With the changes I left in the review, the code runs correctly and now outperforms Kokkos::OpenMPTarget for my example.

@masterleinad masterleinad marked this pull request as ready for review March 16, 2022 12:52
@masterleinad (Contributor, Author)

CUDA-10.1-Clang-Tidy timing out is clearly unrelated.

@nmm0 (Contributor) left a comment

lgtm pending approval by @pvelesko

Two resolved review comments on core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp (outdated).
@masterleinad (Contributor, Author)

Retest this please.

@crtrott (Member) left a comment

We need to have a discussion about this. Note that what you use launch bounds for here is NOT the intended use, and is NOT the effect you get on either HIP or CUDA; i.e., it does not actually set ranges. If people now start using it for that purpose because of SYCL, we potentially create issues. I'd rather have you use chunk_size or something like that.

@masterleinad (Contributor, Author)

> We need to have a discussion about this. [...] I'd rather have you use chunk_size or something like that.

Sure, it's slightly abusing the concept. My thought was that it fits close enough to what I'm trying to do here, since it's an optimization hint (one that is actually taken into account). Also, the values for LaunchBounds are normally backend-specific, and I don't see any other (upcoming) use for LaunchBounds in SYCL. Anyway, I'm happy to discuss alternatives. Note that this is for RangePolicy, where we don't have chunk_size.
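(For context: launch bounds are normally attached to a policy as an occupancy hint. A minimal sketch, with illustrative values that are not from this PR:)

    #include <Kokkos_Core.hpp>

    // LaunchBounds<maxThreadsPerBlock, minBlocksPerMultiprocessor> is an
    // occupancy hint for CUDA/HIP kernel compilation; the original version
    // of this PR reused maxTperB as the SYCL workgroup size, which is the
    // repurposing objected to above.
    using BoundedPolicy = Kokkos::RangePolicy<Kokkos::LaunchBounds<256, 1>>;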

@dalg24 (Member) commented Mar 28, 2022

Actually, we do (I had to look it up):

    /** \brief set chunk_size to a discrete value */
    inline RangePolicy set_chunk_size(int chunk_size_) const {
      RangePolicy p        = *this;
      p.m_granularity      = chunk_size_;
      p.m_granularity_mask = p.m_granularity - 1;
      return p;
    }

and, as a side comment, I don't quite understand why we don't just update *this and return a reference to it.
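(For illustration, a minimal usage sketch of set_chunk_size on a RangePolicy; the label, kernel body, and sizes are hypothetical. After this PR, the SYCL backend uses the requested chunk size as the workgroup size for parallel_for:)

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        // Request a granularity of 128; on the SYCL backend (after this PR)
        // the parallel_for launches with workgroups of that size.
        auto policy = Kokkos::RangePolicy<>(0, 1 << 20).set_chunk_size(128);
        Kokkos::parallel_for(
            "example", policy, KOKKOS_LAMBDA(const int i) {
              (void)i;  // per-iteration work would go here
            });
        Kokkos::fence();
      }
      Kokkos::finalize();
    }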

@masterleinad (Contributor, Author)

OK, fair enough.

@masterleinad changed the title from "SYCL RangePolicy: manually specify workgroup size through LaunchBounds" to "SYCL RangePolicy: manually specify workgroup size through chunk size" on Apr 1, 2022
@masterleinad masterleinad added this to the Tentative 3.7 Release milestone Jun 6, 2022
Diff excerpt (core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp):

            (actual_range + wgroup_size - 1) / wgroup_size * wgroup_size;
        FunctorWrapperRangePolicyParallelForCustom<Functor, Policy> f{
            policy.begin(), functor, actual_range};
        sycl::nd_range<1> range(launch_range, Policy::launch_bounds::maxTperB);
A Member left a comment:

What happens if maxTperB is not set (i.e. -1, I think)?

@masterleinad (Contributor, Author)

Fixed.
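(To illustrate the pattern in the hunk above: the global range is rounded up to the next multiple of the workgroup size, so the kernel has to mask out the padding work-items. A standalone SYCL sketch under assumed names and sizes, not the PR's actual code:)

    #include <sycl/sycl.hpp>

    int main() {
      constexpr int actual_range = 1000;  // iteration count (assumed)
      constexpr int wgroup_size  = 128;   // requested workgroup size (assumed)
      // Round up as in the hunk above: 1000 -> 1024, i.e. 8 workgroups of 128.
      constexpr int launch_range =
          (actual_range + wgroup_size - 1) / wgroup_size * wgroup_size;

      sycl::queue q;
      int* data = sycl::malloc_shared<int>(actual_range, q);
      q.parallel_for(sycl::nd_range<1>(launch_range, wgroup_size),
                     [=](sycl::nd_item<1> item) {
                       const int i = item.get_global_id(0);
                       // Mask out the padding work-items past the real range.
                       if (i < actual_range) data[i] = i;
                     })
          .wait();
      sycl::free(data, q);
      return 0;
    }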

@crtrott (Member) commented Sep 18, 2022

@masterleinad we need some clear documentation for this, since this in particular is different from how OpenMP works, where chunk size is the number of consecutive iterations given to a single thread. In practice, though, the only performant choice is almost always 1, so that usage of chunk size is kinda useless anyway, and reinterpreting it as something like the block size/work-group size is probably not a bad idea. We should consider doing this for the other backends, I guess.
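(To make the contrast concrete, a small illustrative OpenMP program, not Kokkos code, showing what chunk size means under static scheduling:)

    #include <cstdio>
    #include <omp.h>

    int main() {
      constexpr int n = 16, chunk = 4;
      // OpenMP meaning: each thread receives `chunk` consecutive iterations
      // at a time (static,4 gives thread 0 iterations 0..3, thread 1
      // iterations 4..7, and so on). The SYCL backend instead reinterprets
      // the same number as the workgroup size of the launched kernel.
    #pragma omp parallel for schedule(static, chunk)
      for (int i = 0; i < n; ++i)
        std::printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
      return 0;
    }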

@crtrott crtrott merged commit 673a0ef into kokkos:develop Sep 18, 2022
@dalg24 (Member) commented Sep 19, 2022

The title line is misleading because the change only applies to parallel_for, not parallel_reduce.
Please comment on whether this can/should be implemented for the other parallel constructs with a RangePolicy.

@masterleinad (Contributor, Author)

> Please comment on whether this can/should be implemented for the other parallel constructs with a RangePolicy.

parallel_for was just the most relevant case, but when I opened the pull request I intended to add versions for the other parallel constructs after merging this one. Since we want to discuss what to do with the different backends anyway, though, I would hold off on that for now.
