Unable to use GPU on 2nd machine #3801

Open
obaid1922 opened this issue Dec 30, 2022 · 0 comments


Hi, I have set up Horovod on a k8s cluster with 2 GPU nodes using spark-operator. I ran the MNIST example with TensorFlow, and it executed successfully on both nodes (utilizing the GPUs on both nodes). However, when I use KerasEstimator on Spark, the training completes successfully, but I think that only one GPU is being used.

I am following this example:
https://docs.databricks.com/_static/notebooks/deep-learning/horovod-spark-estimator-keras.html

Here are the logs:

[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Bootstrap : Using eth0:10.84.52.31<0>
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/IB : No device found.
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.52.31<0>
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Using network Socket
[1,0]:NCCL version 2.11.4+cuda11.4
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Bootstrap : Using eth0:10.84.179.52<0>
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/IB : No device found.
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.179.52<0>
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Using network Socket
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00/02 : 0 1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01/02 : 0 1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [receive] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [receive] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [send] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [receive] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [receive] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all rings
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all trees
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO comm 0x7fd2247488e0 rank 0 nranks 2 cudaDev 0 busId 2000 - Init COMPLETE
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Launch mode Parallel
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all rings
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all trees
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO comm 0x7fad647478a0 rank 1 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
[1,0]:
[1,1]:WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.0086s vs on_train_batch_end time: 0.0658s). Check your callbacks.
[1,0]:WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.0053s vs on_train_batch_end time: 0.0687s). Check your callbacks.
1/4851 [..............................] - ETA: 8:35:39 - loss: 1.0356 - accuracy: 0.4844[1,0]:
9/4851 [..............................] - ETA: 30s - loss: 0.9629 - accuracy: 0.4219 [1,0]:
17/4851 [..............................] - ETA: 31s - loss: 0.9131 - accuracy: 0.4265[1,0]:
24/4851 [..............................] - ETA: 33s - loss: 0.8747 - accuracy: 0.4421[1,0]:
31/4851 [..............................] - ETA: 34s - loss: 0.8364 - accuracy: 0.4768[1,0]:
39/4851 [..............................] - ETA: 34s - loss: 0.7905 - accuracy: 0.5445[1,0]:
48/4851 [..............................] - ETA: 32s - loss: 0.7389 - accuracy: 0.6286[1,0]:
56/4851 [..............................] - ETA: 32s - loss: 0.6957 - accuracy: 0.6816[1,0]:
64/4851 [..............................] - ETA: 32s - loss: 0.6540 - accuracy: 0.7214[1,0]:
71/4851 [..............................] - ETA: 32s - loss: 0.6205 - accuracy: 0.7489[1,0]:
79/4851 [..............................] - ETA: 32s - loss: 0.5844 - accuracy: 0.7743[1,0]:
87/4851 [..............................] - ETA: 32s - loss: 0.5504 - accuracy: 0.7951[1,0]:
95/4851 [..............................] - ETA: 32s - loss: 0.5194 - accuracy: 0.8123[1,0]:
103/4851 [..............................] - ETA: 32s - loss: 0.4912 - accuracy: 0.8269[1,0]:
112/4851 [..............................] - ETA: 31s - loss: 0.4623 - accuracy: 0.8408[1,0]:
121/4851 [..............................] - ETA: 31s - loss: 0.4364 - accuracy: 0.8525[1,0]:
131/4851 [..............................] - ETA: 30s - loss: 0.4106 - accuracy: 0.8637[1,0]:
140/4851 [..............................] - ETA: 30s - loss: 0.3886 - accuracy: 0.8724[1,0]:
148/4851 [..............................] - ETA: 30s - loss: 0.3706 - accuracy: 0.8793[1,0]:
156/4851 [..............................] - ETA: 30s - loss: 0.3542 - accuracy: 0.8855[1,0]:
164/4851 [>.............................] - ETA: 30s - loss: 0.3388 - accuracy: 0.8911[1,0]:
172/4851 [>.............................] - ETA: 30s - loss: 0.3246 - accuracy: 0.8962[1,0]:
180/4851 [>.............................] - ETA: 30s - loss: 0.3116 - accuracy: 0.9008[1,0]:
188/4851 [>.............................] - ETA: 30s - loss: 0.2994 - accuracy: 0.9050[1,0]:
196/4851 [>.............................] - ETA: 30s - loss: 0.2882 - accuracy: 0.9089[1,0]:
204/4851 [>.............................] - ETA: 30s - loss: 0.2778 - accuracy: 0.9125[1,0]:
212/4851 [>.............................] - ETA: 30s - loss: 0.2680 - accuracy: 0.9158[1,0]:
220/4851 [>.............................] - ETA: 30s - loss: 0.2588 - accuracy: 0.9188[1,0]:
227/4851 [>.............................] - ETA: 30s - loss: 0.2513 - accuracy: 0.9213[1,0]:
235/4851 [>.............................] - ETA: 30s - loss: 0.2432 - accuracy: 0.9240[1,0]:
243/4851 [>.............................] - ETA: 30s - loss: 0.2356 - accuracy: 0.9265[1,0]:
251/4851 [>.............................] - ETA: 30s - loss: 0.2285 - accuracy: 0.9288[1,0]:
259/4851 [>.............................] - ETA: 30s - loss: 0.2218 - accuracy: 0.9310[1,0]:
267/4851 [>.............................] - ETA: 30s - loss: 0.2155 - accuracy: 0.9331[1,0]:
275/4851 [>.............................] - ETA: 30s - loss: 0.2095 - accuracy: 0.9351[1,0]:
283/4851 [>.............................] - ETA: 30s - loss: 0.2038 - accuracy: 0.9369[1,0]:
291/4851 [>.............................] - ETA: 30s - loss: 0.1985 - accuracy: 0.9386[1,0]:
299/4851 [>.............................] - ETA: 30s - loss: 0.1933 - accuracy: 0.9403[1,0]:
307/4851 [>.............................] - ETA: 30s - loss: 0.1885 - accuracy: 0.9418[1,0]:
316/4851 [>.............................] - ETA: 30s - loss: 0.1833 - accuracy: 0.9435[1,0]:
325/4851 [=>............................] - ETA: 30s - loss: 0.1784 - accuracy: 0.9450[1,0]:
334/4851 [=>............................] - ETA: 30s - loss: 0.1738 - accuracy: 0.9465[1,0]:
343/4851 [=>............................] - ETA: 30s - loss: 0.1694 - accuracy: 0.9479[1,0]:
351/4851 [=>............................] - ETA: 29s - loss: 0.1656 - accuracy: 0.9491[1,0]:
358/4851 [=>............................] - ETA: 30s - loss: 0.1625 - accuracy: 0.9501[1,0]:
366/4851 [=>............................] - ETA: 29s - loss: 0.1590 - accuracy: 0.9512[1,0]:
374/4851 [=>............................] - ETA: 29s - loss: 0.1557 - accuracy: 0.9522[1,0]:
383/4851 [=>............................] - ETA: 29s - loss: 0.1521 - accuracy: 0.9534[1,0]:
391/4851 [=>............................] - ETA: 29s - loss: 0.1491 - accuracy: 0.9543[1,0]:
400/4851 [=>............................] - ETA: 29s - loss: 0.1458 - accuracy: 0.9554[1,0]:
408/4851 [=>............................] - ETA: 29s - loss: 0.1430 - accuracy: 0.9562[1,0]:
417/4851 [=>............................] - ETA: 29s - loss: 0.1400 - accuracy: 0.9572[1,0]:
422/4851 [=>............................] - ETA: 29s - loss: 0.1384 - accuracy: 0.9577[1,0]:
428/4851 [=>............................] - ETA: 29s - loss: 0.1365 - accuracy: 0.9583[1,0]:
437/4851 [=>............................] - ETA: 29s - loss: 0.1338 - accuracy: 0.9591[1,0]:
447/4851 [=>............................] - ETA: 29s - loss: 0.1314 - accuracy: 0.9600[1,0]:
456/4851 [=>............................] - ETA: 29s - loss: 0.1289 - accuracy: 0.9608[1,0]:
465/4851 [=>............................] - ETA: 29s - loss: 0.1264 - accuracy: 0.9616[1,0]:
474/4851 [=>............................] - ETA: 29s - loss: 0.1241 - accuracy: 0.9623[1,0]:
483/4851 [=>............................] - ETA: 29s - loss: 0.1218 - accuracy: 0.9630[1,0]:
491/4851 [==>...........................] - ETA: 28s - loss: 0.1199 - accuracy: 0.9636[1,0]:
499/4851 [==>...........................] - ETA: 28s - loss: 0.1180 - accuracy: 0.9642[1,0]:
508/4851 [==>...........................] - ETA: 28s - loss: 0.1160 - accuracy: 0.9648[1,0]:
518/4851 [==>...........................] - ETA: 28s - loss: 0.1138 - accuracy: 0.9655[1,0]:
527/4851 [==>...........................] - ETA: 28s - loss: 0.1118 - accuracy: 0.9661[1,0]:
536/4851 [==>...........................] - ETA: 28s - loss: 0.1100 - accuracy: 0.9667[1,0]:
545/4851 [==>...........................] - ETA: 28s - loss: 0.1082 - accuracy: 0.9672[1,0]:
554/4851 [==>...........................] - ETA: 28s - loss: 0.1065 - accuracy: 0.9677[1,0]:
562/4851 [==>...........................] - ETA: 28s - loss: 0.1050 - accuracy: 0.9682[1,0]:
572/4851 [==>...........................] - ETA: 27s - loss: 0.1032 - accuracy: 0.9688[1,0]:
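
One way to read the NCCL lines above is to list which rank finished initialization on which executor and CUDA device. The following is a minimal sketch, assuming the log format shown in this issue; `INIT_RE` and `summarize_nccl_init` are hypothetical helper names for illustration, not part of Horovod or NCCL.

```python
import re

# Matches NCCL "Init COMPLETE" lines in the format pasted above, e.g.
# [1,0]:<host>:246:259 [0] NCCL INFO comm 0x... rank 0 nranks 2 cudaDev 0 busId 2000 - Init COMPLETE
INIT_RE = re.compile(
    r"^\[\d+,\d+\]:(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO comm \S+ "
    r"rank (?P<rank>\d+) nranks (?P<nranks>\d+) cudaDev (?P<cuda_dev>\d+) "
    r"busId (?P<bus_id>\w+) - Init COMPLETE"
)


def summarize_nccl_init(log_text):
    """Return {rank: {host, nranks, cuda_dev, bus_id}} for each completed NCCL init."""
    ranks = {}
    for line in log_text.splitlines():
        match = INIT_RE.match(line.strip())
        if match:
            ranks[int(match.group("rank"))] = {
                "host": match.group("host"),
                "nranks": int(match.group("nranks")),
                "cuda_dev": int(match.group("cuda_dev")),
                "bus_id": match.group("bus_id"),
            }
    return ranks


# The two "Init COMPLETE" lines copied from the logs in this issue.
sample_log = """\
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO comm 0x7fd2247488e0 rank 0 nranks 2 cudaDev 0 busId 2000 - Init COMPLETE
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO comm 0x7fad647478a0 rank 1 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
"""

for rank, info in sorted(summarize_nccl_init(sample_log).items()):
    print(rank, info["host"], "cudaDev", info["cuda_dev"], "busId", info["bus_id"])
```

In these logs, rank 0 and rank 1 both report `Init COMPLETE` with `nranks 2`, on different executors (`exec-1` and `exec-2`), each on `cudaDev 0`. That suggests both processes attached to a GPU. The progress-bar lines carry only the `[1,0]` prefix, so they appear to come from rank 0 alone; that by itself does not show the second GPU is idle. Running `nvidia-smi` on each node during training would confirm actual utilization.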
