gpubench add more information in output #296

Merged
Uburro merged 1 commit into dev from gpubench-fix-logs
Jan 3, 2025

Uburro commented Jan 3, 2025

k -n slurm logs test3-4mt9j | sed 's/\\n/\n/g' 
Link users from jail
Bind-mount slurm configs from K8S config map
Bind-mount munge key from K8S secret
Starting munge
Waiting until munge started
Start NCCL test benchmark
8 GPUs on each node are going to be benchmarked
cpu-bind=MASK - worker-1, task  0  0 [4980]: mask 0xff00000000000000ff set
{"level":"info","msg":"Starting nccl-benchmark","slurmNode":"worker-1","time":"2025-01-03T09:37:10Z"}
{"error":"exit status 3","level":"fatal","msg":"Failed to execute all_reduce_perf","output":"
# nThread 1 nGpus 8 minBytes 536870912 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   4990 on   worker-1 device  0 [0x8a] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid   4990 on   worker-1 device  1 [0x8b] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid   4990 on   worker-1 device  2 [0x8c] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid   4990 on   worker-1 device  3 [0x8d] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid   4990 on   worker-1 device  4 [0x9c] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid   4990 on   worker-1 device  5 [0x9d] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid   4990 on   worker-1 device  6 [0x9e] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid   4990 on   worker-1 device  7 [0x9f] NVIDIA H100 80GB HBM3
worker-1: Test NCCL failure common.cu:1005 'internal error - please report this issue to the NCCL developers / '
 .. worker-1 pid 4990: Test failure common.cu:891
","slurmNode":"worker-1","time":"2025-01-03T09:37:18Z"}
srun: error: worker-1: task 0: Exited with exit code 1
cpu-bind=MASK - worker-0, task  0  0 [4637]: mask 0xff00000000000000ff set
{"level":"info","msg":"Starting nccl-benchmark","slurmNode":"worker-0","time":"2025-01-03T09:37:10Z"}
{"error":"exit status 3","level":"fatal","msg":"Failed to execute all_reduce_perf","output":"
# nThread 1 nGpus 8 minBytes 536870912 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   4649 on   worker-0 device  0 [0x8a] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid   4649 on   worker-0 device  1 [0x8b] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid   4649 on   worker-0 device  2 [0x8c] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid   4649 on   worker-0 device  3 [0x8d] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid   4649 on   worker-0 device  4 [0x9c] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid   4649 on   worker-0 device  5 [0x9d] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid   4649 on   worker-0 device  6 [0x9e] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid   4649 on   worker-0 device  7 [0x9f] NVIDIA H100 80GB HBM3
worker-0: Test NCCL failure common.cu:1005 'internal error - please report this issue to the NCCL developers / '
 .. worker-0 pid 4649: Test failure common.cu:891
","slurmNode":"worker-0","time":"2025-01-03T09:37:18Z"}
srun: error: worker-0: task 0: Exited with exit code 1
All exit codes not 0 - 1
1
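
The `sed 's/\\n/\n/g'` step above only expands the escaped newlines inside the JSON `output` field so the `all_reduce_perf` text becomes readable. As a rough sketch of the same idea, the raw (un-expanded) log lines could instead be parsed as JSON, which decodes the `\n` escapes and also makes it easy to list the nodes that reported a fatal error. The function names and the abbreviated sample below are hypothetical and are not part of gpubench; the field names (`level`, `msg`, `output`, `slurmNode`) are taken from the log above.

```python
import json

# Abbreviated, hypothetical sample of raw pod logs as they look BEFORE the
# sed step: the "output" field still contains escaped "\n" sequences.
SAMPLE_RAW_LOGS = [
    "Start NCCL test benchmark",
    r'{"level":"info","msg":"Starting nccl-benchmark","slurmNode":"worker-1","time":"2025-01-03T09:37:10Z"}',
    r'{"error":"exit status 3","level":"fatal","msg":"Failed to execute all_reduce_perf","output":"# Using devices\nworker-1: Test NCCL failure common.cu:1005\n","slurmNode":"worker-1","time":"2025-01-03T09:37:18Z"}',
    r'{"error":"exit status 3","level":"fatal","msg":"Failed to execute all_reduce_perf","output":"# Using devices\nworker-0: Test NCCL failure common.cu:1005\n","slurmNode":"worker-0","time":"2025-01-03T09:37:18Z"}',
    "srun: error: worker-0: task 0: Exited with exit code 1",
]


def parse_json_logs(raw_lines):
    """Return the decoded JSON records, skipping plain-text log lines."""
    records = []
    for line in raw_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain text such as "Starting munge" or srun errors
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # not valid JSON after all; ignore it
    return records


def failed_nodes(records):
    """Return the sorted names of nodes that logged a fatal record."""
    return sorted(
        {r["slurmNode"] for r in records if r.get("level") == "fatal" and "slurmNode" in r}
    )


if __name__ == "__main__":
    records = parse_json_logs(SAMPLE_RAW_LOGS)
    for record in records:
        if record.get("level") == "fatal":
            # json.loads has already turned the escaped "\n" into real newlines.
            print(f'--- {record["slurmNode"]}: {record["msg"]} ({record["error"]})')
            print(record["output"])
    print("Failed nodes:", ", ".join(failed_nodes(records)))
```

This only works on the raw log stream; once `sed` has expanded the newlines, each JSON record spans several lines and can no longer be parsed line by line.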

Uburro added the fix label Jan 3, 2025
Uburro force-pushed the gpubench-fix-logs branch from 5e77a07 to 6b2b3df January 3, 2025 09:50
Uburro merged commit f8aeece into dev Jan 3, 2025
Uburro deleted the gpubench-fix-logs branch January 3, 2025 10:23