subview construction on Cuda backend #615
Can you try creating the RA view outside of the kernel first, and then getting the subview?
This worked (and I was able to remove the …):

```cpp
View<double**, LayoutStride> A("A", N, N);
View<const double**, LayoutStride, RA> B(A);
parallel_for(N, KOKKOS_LAMBDA (const int i)
{
  A(i,0) = 5.0;
  View<const double*, LayoutStride, RA> v(B, i, ALL()); // works
  printf("v(0): %g \n", v(0));
});
```

This results in: …
It seems that … One of my use cases is a team_scratch view that is populated with values and then not changed after some point. Perhaps I need to use different memory traits for these. Is that your take as well?
The issue here is that you are lying and thus get what you deserve ;-) In this case the actual underlying issue is that texture fetches are non-coherent within a kernel. That means updates to the underlying data may or may not be seen, depending on the state of the cache, data flushes, etc. If you change the data of A in a separate kernel, then you should see the changes.
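To illustrate the "separate kernel" point, here is a sketch of the example above split into two launches (assuming the same `A`, `B`, `N`, and `RA` as in the earlier snippet; illustrative, not tested code):

```cpp
// Kernel 1: write through the non-const view A.
parallel_for(N, KOKKOS_LAMBDA (const int i) {
  A(i,0) = 5.0;
});
// Kernel launches on the same stream are ordered, so by the time
// kernel 2 starts, the texture (read-only) cache no longer holds
// stale data for A's allocation.
// Kernel 2: read through the RandomAccess (texture) view B.
parallel_for(N, KOKKOS_LAMBDA (const int i) {
  View<const double*, LayoutStride, RA> v(B, i, ALL());
  printf("v(0): %g \n", v(0));
});
```

Within a single kernel there is no such ordering guarantee for texture fetches, which is why the one-kernel version may print stale values.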
So that example is a little oversimplified, but it did reproduce the compile errors. Also, replacing … This is a little more representative (pseudo-code):

```cpp
parallel_for (policy, …
{
  View<double **> A(team_scratch);
  func_to_populate_A(A);
  parallel_for(TeamThreadRange … (const int& i)
  {
    View<const double*, RA> v(A, i, ALL);
    parallel_for(ThreadVectorRange … (const int& j)
    {
      // do something with v
    });
  });
});
```
You need to get a View<const double**, RA> c_A before the parallel_for and take the subview from that. Then it should work:

```cpp
parallel_for (policy, …
{
  View<double **> A(team_scratch);
  View<const double**, RA> c_A(A);
  func_to_populate_A(A);
  parallel_for(TeamThreadRange … (const int& i)
  {
    View<const double*, RA> v(c_A, i, ALL);
    parallel_for(ThreadVectorRange … (const int& j)
    {
      // do something with v
    });
  });
});
```
Before the …
OK, sorry, I didn't look correctly.
Here we go again (I mean, now comes the real explanation, after I understood this is using scratch space): you can't actually have a RandomAccess const view of scratch memory. This is an issue we probably need to fix. Basically, texture objects can't reference scratch memory, so we would need to not use texture objects. The right way to do this is most likely to enforce the usage of the correct memory space (exec_space::scratch_memory_space) for such scratch-space views. That way everything would work, since the specialization for texture objects only kicks in for CudaSpace and CudaUVMSpace.
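As a rough sketch of what "the correct memory space" looks like for a team-scratch view (hedged: `ExecSpace`, `league_size`, `N`, and the scratch level 0 are placeholders; exact API details may vary across Kokkos versions):

```cpp
using ExecSpace   = Kokkos::DefaultExecutionSpace;
using ScratchView = Kokkos::View<double**, ExecSpace::scratch_memory_space,
                                 Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

// Request enough level-0 scratch per team when building the policy ...
auto policy = Kokkos::TeamPolicy<ExecSpace>(league_size, Kokkos::AUTO)
                  .set_scratch_size(0, Kokkos::PerTeam(ScratchView::shmem_size(N, N)));

// ... and allocate the view from team scratch inside the functor:
//   ScratchView A(team.team_scratch(0), N, N);
```

Because the view's memory space is `scratch_memory_space` rather than `CudaSpace`, the texture-object specialization would never be selected for it.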
Ah, ok. Perhaps for the time being I can use different memory traits so that it doesn't trigger the texture specialization. Currently I default to using the largest memory space for all scratch allocations, whereas I should do something to take better advantage of the faster scratch memory spaces if possible.
Identified a bug: the library erroneously attempts to attach a texture object to a Cuda-space view with unmanaged memory.
Consider other cuda-const-random-access options, such as using `ldg` intrinsics when the view is unmanaged.
Resolve the bug by verifying that a View of const random-access Cuda memory, which currently uses texture objects, actually has a texture object available; otherwise generate a meaningful error message.
Creating a const random-access Cuda memory View assumes the use of texture objects, so verify that the texture object can be created or retrieved.
Will this fix allow my team scratch views (const, random access) to use …
If …
Add error check with meaningful message for issue #615.
I checked out the develop branch, was trying this out, and I am getting compile errors: …
Any ideas?
What is your calling code?
Here is a small reproducer; it has a simpler structure (no nested parallelism) but yields the same error.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>
using Kokkos::parallel_for;
using Kokkos::MemoryTraits;
using Kokkos::RandomAccess;
using Kokkos::subview;
#define KOKKOSALL Kokkos::Impl::ALL_t()
using Kokkos::LayoutStride;
typedef MemoryTraits< RandomAccess > RA;
template <class D, class ... P>
using View = Kokkos::View<D, P ... >;

int main (int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  const int N = 10;
  View<double**, LayoutStride> A("A", N, N);
  View<const double**, LayoutStride, RA> B(A);
  // if use const view, compile error
  //View<const double*[4], LayoutRight, RA> A("A", N);
  parallel_for(N, KOKKOS_LAMBDA (const int i)
  {
    A(i,0) = 5.0;
    View<const double*, LayoutStride, RA> v(A, i, KOKKOSALL);
    //View<const double*, LayoutStride, RA> v(A, i, ALL());
    printf("v(0): %g \n", v(0));
  });
  Kokkos::finalize();
  return 0;
}
```
What is the default execution space?
Should be Cuda. For verification, I added:

```cpp
std::cout << "Default Execution Space: "
          << typeid(Kokkos::DefaultExecutionSpace).name() << "\n";
```

which prints out: …
For future reference, I believe …
I have a reproducing unit test...
Good to know. I'll need to change some things around to get this. Had been using …
That's actually the same:

```cpp
typedef Kokkos::MemoryTraits< Kokkos::Unmanaged | Kokkos::RandomAccess > MemoryRandomAccess;
```
Found the error and the lack of a good error message. When taking a subview inside a functor … Solution: …
What about this (more complex, but more representative) case?

```cpp
parallel_for (policy, …
{
  View<double **> A(team_scratch);
  func_to_populate_A(A);
  parallel_for(TeamThreadRange … (const int& i)
  {
    View<const double*, RA> v(A, i, ALL);
    parallel_for(ThreadVectorRange … (const int& j)
    {
      // do something with v
    });
  });
});
```
In this case the random access trait would not provide any help, even if it did compile, because the team scratch memory is already placed in …
I am having an issue with subviews on the Cuda backend. I was told by @crtrott to use

```cpp
View<> a(b, i, ALL)
```

rather than

```cpp
View<> a(subview(b, i, ALL))
```

I don't recall what the actual difference is, but the preferred method results in a compile-time error (on up-to-date master and develop branches) and the other method does not. However, using the method that compiles results in a runtime error (both branches):

```
:0: : block: [0,0,0], thread: [0,0,0] Assertion `Cannot create Cuda texture object from within a Cuda kernel` failed.
```

The code below should illustrate what I am describing.