For low-level libraries (such as KokkosKernels or Stokhos) it can be useful to be able to do ThreadVectorRanges without the TeamHandle. We don't like that much because it's dangerous to lose the information that you might run code in parallel when you don't expect it, but I understand the design point and it looks valid to me. What I am doing is giving a non-public API way to do this. Basically, you can directly create the meta object for the parallel_for (ThreadVectorRange looks like a non-templated class, but is actually a function call that returns the implementation execution policy, which lives in Impl), and we provide a constructor which does not require the TeamHandle:
instead of:
the latter over my own vectorized types. I then have template expressions and arithmetic operator overloads on these vectorized types, so that I never have to explicitly write vector-level loops and still have basic linear algebra running essentially at the roofline limit. On CPUs, these are also testable on their own, without an enclosing TeamThreadRange and without using a view.
At the time, this seemed like a good way to use ThreadVectorRange for SIMD on CPUs and for coalesced access on GPUs (with an appropriately long vector length) with the same code, at the cost of having to pack the data into vectors, of course.
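To make the pattern concrete, here is a minimal sketch of a packed type with arithmetic operator overloads, assuming plain C++ with no Kokkos dependency. The names `Pack`, `broadcast`, `reduce_sum` and `axpy` are hypothetical and are not part of Kokkos or of my actual code. The operators here evaluate eagerly rather than through template expressions, and the lane loops are ordinary `for` loops; in the setup described above, those loops are where the ThreadVectorRange would go.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-width pack standing in for a "vectorized type".
// N plays the role of the vector length.
template <typename T, std::size_t N>
struct Pack {
  std::array<T, N> lane{};

  // Copy one scalar into every lane.
  static Pack broadcast(T value) {
    Pack p;
    for (std::size_t i = 0; i < N; ++i) p.lane[i] = value;
    return p;
  }

  T& operator[](std::size_t i) { return lane[i]; }
  const T& operator[](std::size_t i) const { return lane[i]; }
};

// Element-wise operators keep the lane loop out of the calling code.
// (Assumption: a plain loop here; a vector-level parallel_for would
// replace it in a Kokkos-based version.)
template <typename T, std::size_t N>
Pack<T, N> operator+(const Pack<T, N>& a, const Pack<T, N>& b) {
  Pack<T, N> r;
  for (std::size_t i = 0; i < N; ++i) r[i] = a[i] + b[i];
  return r;
}

template <typename T, std::size_t N>
Pack<T, N> operator*(const Pack<T, N>& a, const Pack<T, N>& b) {
  Pack<T, N> r;
  for (std::size_t i = 0; i < N; ++i) r[i] = a[i] * b[i];
  return r;
}

// Horizontal sum across all lanes.
template <typename T, std::size_t N>
T reduce_sum(const Pack<T, N>& a) {
  T s = T{};
  for (std::size_t i = 0; i < N; ++i) s += a[i];
  return s;
}

// alpha * x + y, written without any visible lane loop.
template <typename T, std::size_t N>
Pack<T, N> axpy(T alpha, const Pack<T, N>& x, const Pack<T, N>& y) {
  return Pack<T, N>::broadcast(alpha) * x + y;
}
```

Because nothing in this sketch needs a team handle or a view, a call such as `axpy(3.0, x, y)` on two `Pack<double, 4>` values can be unit-tested on the host by itself, which is the property I was relying on.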
Maybe there's a good reason that these constructors for Kokkos::Impl::ThreadVectorRangeBoundariesStruct have been removed and perhaps I should look into the simd type instead to achieve basically the same thing.
Would be very grateful to hear your thoughts on this. Thank you!