You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi I have setup horovod on a k8s cluster with 2 GPU nodes using spark-operator. I have executed the mnist example using tensorflow, and it was executed successfully on both nodes (utlilizing GPUs on both nodes). However when I am using KerasEstimator on spark, the training executes successfully but I think that only one gpu is getting used.
Hi I have setup horovod on a k8s cluster with 2 GPU nodes using spark-operator. I have executed the mnist example using tensorflow, and it was executed successfully on both nodes (utlilizing GPUs on both nodes). However when I am using KerasEstimator on spark, the training executes successfully but I think that only one gpu is getting used.
I am following this example:
https://docs.databricks.com/_static/notebooks/deep-learning/horovod-spark-estimator-keras.html
here are the logs:
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Bootstrap : Using eth0:10.84.52.31<0>
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/IB : No device found.
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.52.31<0>
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Using network Socket
[1,0]:NCCL version 2.11.4+cuda11.4
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Bootstrap : Using eth0:10.84.179.52<0>
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/IB : No device found.
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.179.52<0>
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Using network Socket
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00/02 : 0 1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01/02 : 0 1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [receive] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [receive] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [send] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [receive] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [receive] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all rings
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all trees
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO comm 0x7fd2247488e0 rank 0 nranks 2 cudaDev 0 busId 2000 - Init COMPLETE
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Launch mode Parallel
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all rings
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all trees
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO comm 0x7fad647478a0 rank 1 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
[1,0]:
[1,1]:WARNING:tensorflow:Callback method
on_train_batch_end
is slow compared to the batch time (batch time: 0.0086s vson_train_batch_end
time: 0.0658s). Check your callbacks.[1,0]:WARNING:tensorflow:Callback method
on_train_batch_end
is slow compared to the batch time (batch time: 0.0053s vson_train_batch_end
time: 0.0687s). Check your callbacks.1/4851 [..............................] - ETA: 8:35:39 - loss: 1.0356 - accuracy: 0.4844[1,0]: ����
9/4851 [..............................] - ETA: 30s - loss: 0.9629 - accuracy: 0.4219 [1,0]:
17/4851 [..............................] - ETA: 31s - loss: 0.9131 - accuracy: 0.4265[1,0]:
24/4851 [..............................] - ETA: 33s - loss: 0.8747 - accuracy: 0.4421[1,0]:
31/4851 [..............................] - ETA: 34s - loss: 0.8364 - accuracy: 0.4768[1,0]:
39/4851 [..............................] - ETA: 34s - loss: 0.7905 - accuracy: 0.5445[1,0]:
48/4851 [..............................] - ETA: 32s - loss: 0.7389 - accuracy: 0.6286[1,0]:
56/4851 [..............................] - ETA: 32s - loss: 0.6957 - accuracy: 0.6816[1,0]:
64/4851 [..............................] - ETA: 32s - loss: 0.6540 - accuracy: 0.7214[1,0]:
71/4851 [..............................] - ETA: 32s - loss: 0.6205 - accuracy: 0.7489[1,0]:
79/4851 [..............................] - ETA: 32s - loss: 0.5844 - accuracy: 0.7743[1,0]:
87/4851 [..............................] - ETA: 32s - loss: 0.5504 - accuracy: 0.7951[1,0]:
95/4851 [..............................] - ETA: 32s - loss: 0.5194 - accuracy: 0.8123[1,0]:
103/4851 [..............................] - ETA: 32s - loss: 0.4912 - accuracy: 0.8269[1,0]:
112/4851 [..............................] - ETA: 31s - loss: 0.4623 - accuracy: 0.8408[1,0]:
121/4851 [..............................] - ETA: 31s - loss: 0.4364 - accuracy: 0.8525[1,0]:
131/4851 [..............................] - ETA: 30s - loss: 0.4106 - accuracy: 0.8637[1,0]:
140/4851 [..............................] - ETA: 30s - loss: 0.3886 - accuracy: 0.8724[1,0]:
148/4851 [..............................] - ETA: 30s - loss: 0.3706 - accuracy: 0.8793[1,0]:
156/4851 [..............................] - ETA: 30s - loss: 0.3542 - accuracy: 0.8855[1,0]:
164/4851 [>.............................] - ETA: 30s - loss: 0.3388 - accuracy: 0.8911[1,0]:
172/4851 [>.............................] - ETA: 30s - loss: 0.3246 - accuracy: 0.8962[1,0]:
180/4851 [>.............................] - ETA: 30s - loss: 0.3116 - accuracy: 0.9008[1,0]:
188/4851 [>.............................] - ETA: 30s - loss: 0.2994 - accuracy: 0.9050[1,0]:
196/4851 [>.............................] - ETA: 30s - loss: 0.2882 - accuracy: 0.9089[1,0]:
204/4851 [>.............................] - ETA: 30s - loss: 0.2778 - accuracy: 0.9125[1,0]:
212/4851 [>.............................] - ETA: 30s - loss: 0.2680 - accuracy: 0.9158[1,0]:
220/4851 [>.............................] - ETA: 30s - loss: 0.2588 - accuracy: 0.9188[1,0]:
227/4851 [>.............................] - ETA: 30s - loss: 0.2513 - accuracy: 0.9213[1,0]:
235/4851 [>.............................] - ETA: 30s - loss: 0.2432 - accuracy: 0.9240[1,0]:
243/4851 [>.............................] - ETA: 30s - loss: 0.2356 - accuracy: 0.9265[1,0]:
251/4851 [>.............................] - ETA: 30s - loss: 0.2285 - accuracy: 0.9288[1,0]:
259/4851 [>.............................] - ETA: 30s - loss: 0.2218 - accuracy: 0.9310[1,0]:
267/4851 [>.............................] - ETA: 30s - loss: 0.2155 - accuracy: 0.9331[1,0]:
275/4851 [>.............................] - ETA: 30s - loss: 0.2095 - accuracy: 0.9351[1,0]:
283/4851 [>.............................] - ETA: 30s - loss: 0.2038 - accuracy: 0.9369[1,0]:
291/4851 [>.............................] - ETA: 30s - loss: 0.1985 - accuracy: 0.9386[1,0]:
299/4851 [>.............................] - ETA: 30s - loss: 0.1933 - accuracy: 0.9403[1,0]:
307/4851 [>.............................] - ETA: 30s - loss: 0.1885 - accuracy: 0.9418[1,0]:
316/4851 [>.............................] - ETA: 30s - loss: 0.1833 - accuracy: 0.9435[1,0]:
325/4851 [=>............................] - ETA: 30s - loss: 0.1784 - accuracy: 0.9450[1,0]:
334/4851 [=>............................] - ETA: 30s - loss: 0.1738 - accuracy: 0.9465[1,0]:
343/4851 [=>............................] - ETA: 30s - loss: 0.1694 - accuracy: 0.9479[1,0]:
351/4851 [=>............................] - ETA: 29s - loss: 0.1656 - accuracy: 0.9491[1,0]:
358/4851 [=>............................] - ETA: 30s - loss: 0.1625 - accuracy: 0.9501[1,0]:
366/4851 [=>............................] - ETA: 29s - loss: 0.1590 - accuracy: 0.9512[1,0]:
374/4851 [=>............................] - ETA: 29s - loss: 0.1557 - accuracy: 0.9522[1,0]:
383/4851 [=>............................] - ETA: 29s - loss: 0.1521 - accuracy: 0.9534[1,0]:
391/4851 [=>............................] - ETA: 29s - loss: 0.1491 - accuracy: 0.9543[1,0]:
400/4851 [=>............................] - ETA: 29s - loss: 0.1458 - accuracy: 0.9554[1,0]:
408/4851 [=>............................] - ETA: 29s - loss: 0.1430 - accuracy: 0.9562[1,0]:
417/4851 [=>............................] - ETA: 29s - loss: 0.1400 - accuracy: 0.9572[1,0]:
422/4851 [=>............................] - ETA: 29s - loss: 0.1384 - accuracy: 0.9577[1,0]:
428/4851 [=>............................] - ETA: 29s - loss: 0.1365 - accuracy: 0.9583[1,0]:
437/4851 [=>............................] - ETA: 29s - loss: 0.1338 - accuracy: 0.9591[1,0]:
447/4851 [=>............................] - ETA: 29s - loss: 0.1314 - accuracy: 0.9600[1,0]:
456/4851 [=>............................] - ETA: 29s - loss: 0.1289 - accuracy: 0.9608[1,0]:
465/4851 [=>............................] - ETA: 29s - loss: 0.1264 - accuracy: 0.9616[1,0]:
474/4851 [=>............................] - ETA: 29s - loss: 0.1241 - accuracy: 0.9623[1,0]:
483/4851 [=>............................] - ETA: 29s - loss: 0.1218 - accuracy: 0.9630[1,0]:
491/4851 [==>...........................] - ETA: 28s - loss: 0.1199 - accuracy: 0.9636[1,0]:
499/4851 [==>...........................] - ETA: 28s - loss: 0.1180 - accuracy: 0.9642[1,0]:
508/4851 [==>...........................] - ETA: 28s - loss: 0.1160 - accuracy: 0.9648[1,0]:
518/4851 [==>...........................] - ETA: 28s - loss: 0.1138 - accuracy: 0.9655[1,0]:
527/4851 [==>...........................] - ETA: 28s - loss: 0.1118 - accuracy: 0.9661[1,0]:
536/4851 [==>...........................] - ETA: 28s - loss: 0.1100 - accuracy: 0.9667[1,0]:
545/4851 [==>...........................] - ETA: 28s - loss: 0.1082 - accuracy: 0.9672[1,0]:
554/4851 [==>...........................] - ETA: 28s - loss: 0.1065 - accuracy: 0.9677[1,0]:
562/4851 [==>...........................] - ETA: 28s - loss: 0.1050 - accuracy: 0.9682[1,0]:
572/4851 [==>...........................] - ETA: 27s - loss: 0.1032 - accuracy: 0.9688[1,0]:
The text was updated successfully, but these errors were encountered: