cudaGetLastError raises "invalid device ordinal" when running multiple containers on one host #2230

tgaddair · 2020-09-03T01:08:08Z

Environment:

Framework: any
Framework version:
Horovod version: e4554de
MPI version:
CUDA version: 10.0
NCCL version: 2.6.4
Python version:
Spark / PySpark version:
OS and version:
GCC version:

When running a Horovod job that spans multiple containers that reside on the same host (with isolated GPUs), the following error is raised:

RuntimeError: ScaleBufferCudaImpl failed: invalid device ordinal

Debugging traced the error to #1949, which added the line ErrorCheck("ScaleBufferCudaImpl", cudaGetLastError());.

The error appears to be specific to Horovod: it does not occur in a pure MPI + CUDA program. It is not yet known whether it depends on the CUDA version. Adding a similar call to the commit just before that change also triggers the error, so the other changes in that commit are not responsible.
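For illustration only, here is one possible explanation that has not been confirmed for this issue. cudaGetLastError returns the most recent error recorded by any runtime call on the calling host thread and then resets it, so a check placed after a kernel launch can report an error left behind by an earlier, unrelated call. The sketch below simulates that behavior without any CUDA dependency. Every name in it (FakeCudaError, FakeSetDevice, FakeGetLastError, FakeLaunchScaleKernel, CheckAfterLaunch) is a hypothetical stand-in, not Horovod or CUDA code, and the choice of a failed device selection as the earlier call is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for CUDA runtime error codes (not the real enum).
enum class FakeCudaError { Success, InvalidDeviceOrdinal };

// Per-thread "last error" slot, modeled on how the CUDA runtime remembers the
// most recent error produced by a runtime call on the calling host thread.
thread_local FakeCudaError g_last_error = FakeCudaError::Success;

// Modeled on cudaGetLastError(): return the recorded error, then reset it.
FakeCudaError FakeGetLastError() {
  FakeCudaError err = g_last_error;
  g_last_error = FakeCudaError::Success;
  return err;
}

// Modeled on a device-selection call. It fails when the requested ordinal is
// not among the devices visible to this process (assumption: a container with
// isolated GPUs sees fewer devices than the host has).
FakeCudaError FakeSetDevice(int ordinal, int visible_devices) {
  if (ordinal < 0 || ordinal >= visible_devices) {
    g_last_error = FakeCudaError::InvalidDeviceOrdinal;
    return g_last_error;
  }
  return FakeCudaError::Success;
}

// Stand-in for a kernel launch that succeeds. A successful call leaves the
// last-error slot untouched, so an older error stays recorded.
void FakeLaunchScaleKernel() {}

// Stand-in for the post-launch check: reports whatever error is recorded,
// regardless of which earlier call produced it.
std::string CheckAfterLaunch(const std::string& op_name) {
  FakeCudaError err = FakeGetLastError();
  if (err == FakeCudaError::InvalidDeviceOrdinal) {
    return op_name + " failed: invalid device ordinal";
  }
  return op_name + " ok";
}
```

In this simulation, calling FakeSetDevice(1, 1) (ordinal 1 with only one visible device) records the error. A later FakeLaunchScaleKernel() followed by CheckAfterLaunch("ScaleBufferCudaImpl") then reports "ScaleBufferCudaImpl failed: invalid device ordinal" even though the launch itself succeeded. A second check reports "ScaleBufferCudaImpl ok" because the first check reset the slot.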

As a workaround, we will remove this check until the underlying cause of this error is known.

cc @romerojosh


tgaddair added the bug label Sep 3, 2020

tgaddair mentioned this issue Sep 3, 2020

Disable cudaGetLastError to avoid errors when running on colocated containers #2231

Merged

tgaddair mentioned this issue Dec 10, 2020

Disable cudaGetLastError check to fix multi-container same host #2515

Merged
