DDLRUN + DeepSpeed on SUMMIT #61

Closed
agemagician opened this issue Feb 11, 2020 · 7 comments

Comments

@agemagician

Hi,

I am trying to use DeepSpeed on SUMMIT with ddlrun, but it doesn't work properly.
I am testing it with the CIFAR example like this:
ddlrun deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json

Could you please give us an example of using DeepSpeed with Horovod, MPI, and ddlrun?

agemagician changed the title from "DDLRUN + DeepSpeed" to "DDLRUN + DeepSpeed on SUMMIT" on Feb 11, 2020

@ShadenSmith
Contributor

Hello! Thank you for your interest in DeepSpeed. DeepSpeed uses its own launcher and relies on NCCL for communication instead of MPI. Your code needs to call DeepSpeed's small API to run, and Horovod is not used. To launch a DeepSpeed program, you just need a hostfile in a format that is compatible with many MPI implementations. DeepSpeed looks for /job/hostfile by default, or you can pass a hostfile with an argument: --hostfile=path/to/hostfile.

Finally, you can launch with:

deepspeed cifar_deepspeed.py --deepspeed --deepspeed_config=ds_config.json

@agemagician
Author

Thanks for the clarification.
This will be a little tricky on SUMMIT, since I don't know the hostnames of the allocated nodes in advance.
I will check whether bsub provides them somehow.
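
For readers in the same situation, here is a minimal sketch of one way to build the hostfile from inside a bsub job. It assumes that LSF exports LSB_DJOB_HOSTFILE, pointing at a file with one allocated host name per slot, and that the DeepSpeed hostfile uses "hostname slots=N" lines. The function name build_hostfile_lines, the skip_hosts option (for leaving out a launch or batch node if the list contains one), and the output file name are hypothetical. The default of six slots matches the six GPUs per node visible in the logs later in this thread.

```python
import os
from collections import OrderedDict


def build_hostfile_lines(host_names, slots_per_node=6, skip_hosts=()):
    """Turn a per-slot host list into 'hostname slots=N' hostfile lines."""
    unique_hosts = OrderedDict()  # keeps first-seen order, drops repeats
    for name in host_names:
        name = name.strip()
        if name and name not in skip_hosts:
            unique_hosts[name] = True
    return ["{} slots={}".format(host, slots_per_node) for host in unique_hosts]


if __name__ == "__main__":
    # Assumption: LSF sets LSB_DJOB_HOSTFILE inside the job environment.
    lsf_hostfile = os.environ.get("LSB_DJOB_HOSTFILE")
    if lsf_hostfile:
        with open(lsf_hostfile) as handle:
            lines = build_hostfile_lines(handle.readlines())
        # Hypothetical output path; pass it to the launcher via --hostfile.
        with open("hostfile", "w") as out:
            out.write("\n".join(lines) + "\n")
```

The resulting file could then be passed to the launcher with --hostfile=hostfile, as described in the reply above.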

@jeffra
Collaborator

jeffra commented Feb 27, 2020

@agemagician, we just merged in a new PR that should make this a bit easier for you and others who want to use MPI. Please see this new text in our README for more details: https://github.com/microsoft/DeepSpeed/#mpi-compatibility

In your case you should be able to do something like:
ddlrun python cifar10_deepspeed.py --deepspeed_mpi --deepspeed --deepspeed_config ds_config.json

Also make sure to install the Python package mpi4py if you don't already have it.
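
To illustrate why mpi4py is needed, here is a rough sketch of how rank information gathered over MPI can be mapped to the settings that torch.distributed reads from the environment. This is not DeepSpeed's actual implementation. The function names distributed_env and discover_with_mpi are hypothetical, and using rank 0's host name as the master address with port 29500 is an assumption.

```python
import socket


def distributed_env(rank, hosts_by_rank, master_port=29500):
    """Derive torch.distributed-style settings from one host name per rank."""
    my_host = hosts_by_rank[rank]
    # Local rank = number of lower-numbered ranks running on the same host.
    local_rank = sum(1 for host in hosts_by_rank[:rank] if host == my_host)
    return {
        "RANK": str(rank),
        "WORLD_SIZE": str(len(hosts_by_rank)),
        "LOCAL_RANK": str(local_rank),
        "MASTER_ADDR": hosts_by_rank[0],  # assumption: rank 0 hosts the rendezvous
        "MASTER_PORT": str(master_port),
    }


def discover_with_mpi():
    """Gather the inputs through mpi4py; only works under an MPI launcher."""
    from mpi4py import MPI  # imported lazily so distributed_env works without MPI

    comm = MPI.COMM_WORLD
    hosts_by_rank = comm.allgather(socket.gethostname())
    return distributed_env(comm.Get_rank(), hosts_by_rank)
```

Under an MPI launcher, each process would call discover_with_mpi() and copy the result into os.environ before initializing the process group.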

@agemagician
Author

Thanks @jeffra for the update.
I will test it and I will give you my feedback.

@agemagician
Author

ddlrun didn't work; it failed with the following output:

2020-02-28 04:41:22.616358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616342: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616342: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.633562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.633708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.633842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.634003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.634142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.634298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.651291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.651443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.651589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.651739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.651892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.654235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.669154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.669318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.669466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.669618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.669775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.672114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.687058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.687205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.687346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.687501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.687649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.690016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.704951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.705096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.705238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.705392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.705540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.707887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.722871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.722895: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.722962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.722997: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.723030: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.723045: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.723112: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.723150: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.723192: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.723187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.723255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.723292: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723326: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.723325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.723350: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.723414: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.723452: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.723489: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.723493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.723548: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.723584: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.724633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.724656: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.724721: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.724756: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.724790: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.778804: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778808: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778801: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778854: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778881: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778854: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778904: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778849: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778883: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778854: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778896: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778940: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778930: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.990990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.991437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.991584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.991725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.991867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.992162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:23.008098: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.008098: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.008934: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.009407: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.009421: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.009645: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.010798: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15c157940 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.010839: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.010975: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x16f5b8af0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.010998: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.011583: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x157038d60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.011607: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.014651: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1204ed950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.014674: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.014892: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15339d650 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.014917: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.015113: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x16933d290 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.015143: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.033880: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.033886: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.033952: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.033952: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.062768: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.063391: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.063428: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.063558: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.063765: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.064749: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.064869: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.064908: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065097: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065229: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065453: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065515: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065567: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.066307: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.066389: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.066423: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.066552: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.068210: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.068217: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.068321: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.068787: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
2020-02-28 04:41:23.068957: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
2020-02-28 04:41:23.091814: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.094324: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.095155: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.095759: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.096799: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.097594: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.098078: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA

When I tried jsrun, I got a different error:

THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
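
For reference, `invalid device ordinal` is raised when `torch.cuda.set_device` is given an index that is not among the GPUs visible to the calling process. A minimal sketch of one common workaround follows. It assumes that the launcher may expose only a subset of the node's GPUs to each rank; the helper name `pick_device_index` is hypothetical and is not part of `pretrain_bert.py` or DeepSpeed.

```python
def pick_device_index(local_rank: int, visible_count: int) -> int:
    """Map a node-local rank onto a device index that is valid for this process.

    visible_count is the number of GPUs this process can see,
    e.g. the value returned by torch.cuda.device_count().
    """
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    if visible_count <= 0:
        raise ValueError("no CUDA devices are visible to this process")
    # If the launcher exposes a single GPU per rank, every rank maps to index 0.
    # If all GPUs are visible, each rank keeps its own index.
    return local_rank % visible_count


# Hypothetical usage inside a training script (requires PyTorch and a GPU):
#   import torch
#   device = pick_device_index(args.local_rank, torch.cuda.device_count())
#   torch.cuda.set_device(device)
```

Whether this mapping is appropriate depends on how the job launcher assigns GPUs to ranks, so treat it as a starting point for debugging rather than a confirmed fix.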

@agemagician
Author

I tried changing the distributed-backend parameter to ddl, and got a different error:


2020-02-28 05:04:04.719317: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 05:04:04.719482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 05:04:04.734093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.750180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.765597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.765750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.765886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.766037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.766179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.766332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.781044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.798361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.798512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.798643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.798800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.799109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.799260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.816072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816231: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816244: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816305: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816373: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816383: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816429: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816504: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816593: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816633: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816898: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816935: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816962: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816990: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.817047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.817085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.817122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.818449: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818448: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818494: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818492: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818529: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818560: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818577: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818621: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818656: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818716: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:05.014825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.014973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.015120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.015250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.015402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.015546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.027646: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.027664: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.027664: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.030542: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.030785: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1d5295b00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.030805: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.030794: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a9f353c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.030814: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.030831: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.031007: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.031106: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x178883f00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.031136: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.034994: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035028: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035127: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035165: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035255: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035256: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035453: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035580: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035807: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035920: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.036083: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.036213: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.036294: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.036426: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.037278: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b48043b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.037351: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.037394: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.037758: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b2f527d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.037778: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.038390: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x183c1b700 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.038414: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.041883: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.041883: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.041933: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.042908: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.043108: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.043148: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.043477: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.043686: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.044267: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.044469: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1d52f8dd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.044486: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-28 05:04:05.044530: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.044575: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.045110: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.045164: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.045387: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.045488: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a9f986d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.045504: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-28 05:04:05.045932: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.046762: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1788e71f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.046776: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0
2020-02-28 05:04:05.053050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b2fb6290 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.053070: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-28 05:04:05.053248: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b4867ea0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.053262: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-28 05:04:05.053294: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x183c7f160 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.053309: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
	while setting up XLA_GPU_JIT device number 1
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0

@agemagician agemagician reopened this Feb 28, 2020
@agemagician
Author

Oh, that traceback actually came from running the Megatron-LM code, which doesn't use DeepSpeed's distributed code.

I will test it again with the cifar test.

rraminen pushed a commit to rraminen/DeepSpeed that referenced this issue Apr 28, 2021
* test commits in DSE

* Support for progressive layer dropping

* Minor changes on PLD

* update the finetune script

* PLD client

* Remove theta option

Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>

delock pushed a commit to delock/DeepSpeedSYCLSupport that referenced this issue Sep 21, 2022
commit 9454f9ddb4c9da86a18a09c8c26e575620ec2814 (HEAD, origin/xpu-main, origin/HEAD)
Author: Guo Yejun <yejun.guo@intel.com>
Date:   Thu Aug 18 05:45:27 2022 -0700

    pretrain_gpt2.py: add atan op profiler for fwd for a given iteration

commit 531dd5fb1c9c6e6281d81c84e0769c2679e9fe4e
Author: Guo Yejun <yejun.guo@intel.com>
Date:   Fri Aug 19 21:54:55 2022 -0700

    scripts/gpt-3.6b.sh: change train iteration from 6 to 10

commit 1ca3add75ec688ed8aa365b6a5e39c803db79f39
Author: Guo Yejun <yejun.guo@intel.com>
Date:   Wed Aug 3 19:44:16 2022 -0700

    pretrain_gpt2.py: output more times for each train and whole process