Finalize deep_copy_space early avoiding printing to std::cerr for Cuda #5151
In general, I would say looks good. (I would suggest we call abort()
instead of throwing an exception though.)
But I suppose you are looking into this because of @stanmoore1's report on Slack:
Stan Moore [12:14 PM]
I'm still seeing this error in LAMMPS: Kokkos::Cuda::Cuda instance constructor : ERROR device not initialized
and I don't quite follow how either of Cuda::Cuda() or Cuda::Cuda(cudaStream_t, bool) would get called.
Please elaborate.
The failing stacktrace looks like
Thanks, so it is kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp Line 745 in acb3434,
which calls kokkos/core/src/impl/Kokkos_SharedAlloc_timpl.hpp Lines 264 to 271 in acb3434,
and the issue is our default constructing an execution space on L265. This is a release 3.6 defect (introduced in #4478). I suppose we should use
I still think we should just fence the stream we are using for the copy.
Yeah, let's abort.
See fbea785.
We are currently printing an error message to std::cerr when finalizing Cuda execution spaces, as the first commit shows.

The problem arises when we finalize deep_copy_space: that needs to deallocate data, which calls a fence on the default execution space, and at this point we check whether the execution space has already unset m_dev. We didn't see this so far, since we are only printing to std::cerr but not throwing an exception or aborting.

The proposed fix is to finalize deep_copy_space before unsetting m_dev.

While looking at this, I also found that some of the code in finalize is guarded by a check for scratch space, which didn't make sense to me.

Throwing in finalize is debatable, of course, but makes the problem obvious (in the CI).
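
To illustrate the ordering issue, here is a minimal, self-contained sketch. It does not use the Kokkos API or reproduce the actual implementation: MockInstance, fence, release_copy_space, finalize_buggy, finalize_fixed, and finalizes_cleanly are hypothetical names, and the mock throws where the real check only prints to std::cerr. It only shows why releasing the copy space has to happen while the device id is still set.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a device instance; NOT the Kokkos implementation.
struct MockInstance {
  int m_dev = 0;               // device id; -1 means "unset"
  bool has_copy_space = true;  // stands in for deep_copy_space

  // Fencing requires a valid device, mirroring the "device not initialized" check.
  void fence() const {
    if (m_dev < 0) throw std::runtime_error("ERROR device not initialized");
  }

  // Deallocating the copy space fences first (assumption made for this sketch).
  void release_copy_space() {
    if (!has_copy_space) return;
    fence();
    has_copy_space = false;
  }

  // Problematic order: the device id is unset before the copy space is released.
  void finalize_buggy() {
    m_dev = -1;
    release_copy_space();  // throws, because fence() sees m_dev == -1
  }

  // Proposed order: release the copy space first, then unset the device id.
  void finalize_fixed() {
    release_copy_space();
    assert(!has_copy_space);
    m_dev = -1;
  }
};

// Returns true if finalization completes without tripping the device check.
bool finalizes_cleanly(bool use_fixed_order) {
  MockInstance inst;
  try {
    if (use_fixed_order) {
      inst.finalize_fixed();
    } else {
      inst.finalize_buggy();
    }
  } catch (const std::runtime_error&) {
    return false;
  }
  return true;
}
```

In this mock, finalizes_cleanly(true) succeeds while finalizes_cleanly(false) hits the device check, which is the behavior the reordering is meant to avoid.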