Environment:
When running a Horovod job that spans multiple containers that reside on the same host (with isolated GPUs), the following error is raised:
Through debugging, it was determined that this was introduced in #1949 with the line `ErrorCheck("ScaleBufferCudaImpl", cudaGetLastError());`.
The problem appears to be specific to Horovod, since it does not occur with pure MPI + CUDA. It is not yet known whether it is specific to a particular CUDA version. Adding a similar call to the commit just before #1949 also triggers the error, so it is not caused by the other changes introduced in that commit.
As a workaround, we will remove this check until the underlying cause of this error is known.
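One property of `cudaGetLastError()` that may be relevant here: according to the CUDA runtime documentation, it returns the last error produced by any runtime call in the host thread and then resets it to `cudaSuccess`. A check placed after one operation can therefore report an error left behind by an earlier, unrelated call. Whether that is what happens in this case has not been established. The sketch below models that behaviour in plain C++ with no CUDA dependency; every name in it (`FakeError`, `FailingSetupCall`, `CheckLast`, and so on) is hypothetical and only stands in for the real API.

```cpp
#include <string>

// Hypothetical stand-in for the runtime's per-thread "last error" slot.
// This is NOT the CUDA API; it only mimics the documented behaviour of
// cudaGetLastError(): return the most recent error and reset it to success.
enum class FakeError { Success, InvalidValue };

thread_local FakeError g_last_error = FakeError::Success;

// Models an earlier runtime call that fails and records its error.
void FailingSetupCall() { g_last_error = FakeError::InvalidValue; }

// Models a kernel launch that succeeds; it leaves the slot untouched.
void SuccessfulKernelLaunch() {}

// Models cudaGetLastError(): read the slot, then clear it.
FakeError GetLastError() {
  FakeError e = g_last_error;
  g_last_error = FakeError::Success;
  return e;
}

// Hypothetical check in the style of ErrorCheck(name, cudaGetLastError()).
// Returns an empty string on success, otherwise a message that blames `op`.
std::string CheckLast(const std::string& op) {
  FakeError e = GetLastError();
  if (e == FakeError::Success) return "";
  return op + " failed";
}
```

In this model, calling `FailingSetupCall()`, then `SuccessfulKernelLaunch()`, then `CheckLast("ScaleBufferCudaImpl")` blames the kernel for the earlier failure, and a second `CheckLast` call reports nothing because the slot has been cleared.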
cc @romerojosh