superbench failed at default most typical run config #519

Closed
jdgh000 opened this issue Apr 15, 2023 · 8 comments
Labels: wontfix (This will not be worked on)

jdgh000 commented Apr 15, 2023

What's the issue, what's expected?:


TASK [Starting Container] ******************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "docker rm --force sb-workspace ||: && docker run -itd --name=sb-workspace  --privileged --net=host --ipc=host  --gpus=all    -w /root -v /root/sb-workspace:/root -v /mnt:/mnt  -v /var/run/docker.sock:/var/run/docker.sock  --entrypoint /bin/bash superbench/superbench && docker exec sb-workspace bash -c  \"chown -R root:root ~ && \\\n  sed -i 's/[# ]*Port.*/Port 22066/g' /etc/ssh/sshd_config && \\\n  service ssh restart && sb help\"\n", "delta": "0:00:36.069805", "end": "2023-04-15 03:01:11.455660", "msg": "non-zero return code", "rc": 125, "start": "2023-04-15 03:00:35.385855", "stderr": "Error response from daemon: No such container: sb-workspace\ndocker: Error response from daemon: could not select device driver \"\" with capabilities: [[gpu]].", "stderr_lines": ["Error response from daemon: No such container: sb-workspace", "docker: Error response from daemon: could not select device driver \"\" with capabilities: [[gpu]]."], "stdout": "28784ba8358530ee44bf82ec37213a691e8573b1b52231c794533c0db781483c", "stdout_lines": ["28784ba8358530ee44bf82ec37213a691e8573b1b52231c794533c0db781483c"]}

PLAY RECAP *********************************************************************
localhost                  : ok=10   changed=5    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0
[2023-04-15 03:01:11,663 jd-MS-7B22:26239][ansible.py:82][WARNING] Run failed, return code 2.
jd@jd-MS-7B22:~/gg/git/superbenchmark$
jd@jd-MS-7B22:~/gg/git/superbenchmark$
jd@jd-MS-7B22:~/gg/git/superbenchmark$ sudo docker container list
[sudo] password for jd:
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
jd@jd-MS-7B22:~/gg/git/superbenchmark$ sudo docker container list --all
CONTAINER ID   IMAGE                   COMMAND       CREATED       STATUS    PORTS     NAMES
28784ba83585   superbench/superbench   "/bin/bash"   5 hours ago   Created             sb-workspace
jd@jd-MS-7B22:~/gg/git/superbenchmark$

How to reproduce it?:
Follow your own instructions at https://aka.ms/superbench.

Log message or snapshot?:
See the log above.

Additional information:
Ubuntu 22.04 bare metal, GTX 2070, CUDA 12.x

abuccts (Member) commented Apr 15, 2023

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

It seems nvidia-container-toolkit is not installed correctly, so Docker cannot mount GPUs via the --gpus argument.
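To illustrate the diagnosis, here is a minimal Python sketch that scans the stderr of the failed Ansible task for the daemon message quoted above. The function name `diagnose_docker_stderr` and the hint strings are hypothetical and are not part of SuperBench or Docker; the stderr text is copied from the log in this issue.

```python
# Substring emitted by the Docker daemon when --gpus cannot be satisfied.
GPU_DRIVER_ERROR = "could not select device driver"


def diagnose_docker_stderr(stderr: str) -> str:
    """Return a short hint for a failed `docker run --gpus ...` invocation.

    Hypothetical helper for illustration only.
    """
    # Check the GPU error first: it is the one that aborts `docker run`.
    if GPU_DRIVER_ERROR in stderr and "[[gpu]]" in stderr:
        return "GPU runtime missing: check the nvidia-container-toolkit installation"
    # The `docker rm --force ... ||:` step tolerates a missing container.
    if "No such container" in stderr:
        return "container did not exist yet (tolerated by `||:`)"
    return "unrecognized error"


# stderr copied from the failed "Starting Container" task in this issue.
stderr = (
    "Error response from daemon: No such container: sb-workspace\n"
    'docker: Error response from daemon: could not select device driver "" '
    "with capabilities: [[gpu]]."
)
print(diagnose_docker_stderr(stderr))
```

Independently of this sketch, a common way to confirm the toolkit works is to run `nvidia-smi` inside a throwaway container started with `--gpus all` before retrying `sb deploy`.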

jdgh000 (Author) commented Apr 15, 2023

I discovered that, but the NVIDIA Container Toolkit is not working either:
NVIDIA/nvidia-container-toolkit#60

jdgh000 (Author) commented Apr 15, 2023

I managed to get the NVIDIA Container Toolkit working, but I am still getting an error. See the log below:

=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2023.04.15 13:15:08 =~=~=~=~=~=~=~=~=~=~=~=
sudo sb run -f local.ini -c resnet.yaml 2>&1 | sudo tee ~/gg/log/sb.run.log 

[2023-04-15 13:15:10,635 guyen-MS-7B22:8261][ansible.py:60][INFO] {'host_pattern': 'all', 'cmdline': '--forks 1 --inventory /home/guyen/gg/git/superbenchmark/local.ini'}
[2023-04-15 13:15:10,647 guyen-MS-7B22:8261][runner.py:45][INFO] Runner uses config: {'superbench': {'benchmarks': {'bert_models': {'enable': True,
                                               'frameworks': ['pytorch'],
                                               'models': ['bert-base',
                                                          'bert-large'],
                                               'modes': [{'name': 'torch.distributed',
                                                          'node_num': 1,
                                                          'proc_num': 8}],
                                               'parameters': {'batch_size': 1,
                                                              'duration': 0,
                                                              'model_action': ['train'],
                                                              'num_steps': 128,
                                                              'num_warmup': 16,
                                                              'precision': ['float32',
                                                                            'float16']}},
                               'computation-communication-overlap': {'enable': True,
                                                                     'frameworks': ['pytorch'],
                                                                     'modes': [{'name': 'torch.distributed',
                                                                                'node_num': 1,
                                                                                'proc_num': 8}]},
                               'cpu-memory-bw-latency': {'enable': False,
                                                         'modes': [{'name': 'local',
                                                                    'parallel': False,
                                                                    'proc_num': 1}],
                                                         'parameters': {'tests': ['bandwidth_matrix',
                                                                                  'latency_matrix',
                                                                                  'max_bandwidth']}},
                               'cublas-function': {'enable': True,
                                                   'modes': [{'name': 'local',
                                                              'parallel': True,
                                                              'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                              'proc_num': 8}]},
                               'cudnn-function': {'enable': True,
                                                  'modes': [{'name': 'local',
                                                             'parallel': True,
                                                             'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                             'proc_num': 8}]},
                               'densenet_models': {'enable': True,
                                                   'frameworks': ['pytorch'],
                                                   'models': ['densenet169',
                                                              'densenet201'],
                                                   'modes': [{'name': 'torch.distributed',
                                                              'node_num': 1,
                                                              'proc_num': 8}],
                                                   'parameters': {'batch_size': 1,
                                                                  'duration': 0,
                                                                  'model_action': ['train'],
                                                                  'num_steps': 128,
                                                                  'num_warmup': 16,
                                                                  'precision': ['float32',
                                                                                'float16']}},
                               'disk-benchmark': {'enable': False,
                                                  'modes': [{'name': 'local',
                                                             'parallel': False,
                                                             'proc_num': 1}],
                                                  'parameters': {'block_devices': ['/dev/nvme0n1']}},
                               'gemm-flops': {'enable': True,
                                              'modes': [{'name': 'local',
                                                         'parallel': True,
                                                         'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                         'proc_num': 8}]},
                               'gpcnet-network-load-test': {'enable': False,
                                                            'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
                                                                       'mca': {'btl': '^uct',
                                                                               'btl_tcp_if_include': 'eth0',
                                                                               'pml': 'ucx'},
                                                                       'name': 'mpi',
                                                                       'proc_num': 1}]},
                               'gpcnet-network-test': {'enable': False,
                                                       'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
                                                                  'mca': {'btl': '^uct',
                                                                          'btl_tcp_if_include': 'eth0',
                                                                          'pml': 'ucx'},
                                                                  'name': 'mpi',
                                                                  'proc_num': 1}]},
                               'gpt_models': {'enable': True,
                                              'frameworks': ['pytorch'],
                                              'models': ['gpt2-small',
                                                         'gpt2-large'],
                                              'modes': [{'name': 'torch.distributed',
                                                         'node_num': 1,
                                                         'proc_num': 8}],
                                              'parameters': {'batch_size': 1,
                                                             'duration': 0,
                                                             'model_action': ['train'],
                                                             'num_steps': 128,
                                                             'num_warmup': 16,
                                                             'precision': ['float32',
                                                                           'float16']}},
                               'gpu-burn': {'enable': True,
                                            'modes': [{'name': 'local',
                                                       'parallel': False,
                                                       'proc_num': 1}],
                                            'parameters': {'doubles': True,
                                                           'tensor_core': True,
                                                           'time': 300}},
                               'gpu-copy-bw:correctness': {'enable': True,
                                                           'modes': [{'name': 'local',
                                                                      'parallel': False}],
                                                           'parameters': {'check_data': True,
                                                                          'copy_type': ['sm',
                                                                                        'dma'],
                                                                          'mem_type': ['htod',
                                                                                       'dtoh',
                                                                                       'dtod'],
                                                                          'num_loops': 1,
                                                                          'num_warm_up': 0,
                                                                          'size': 4096}},
                               'gpu-copy-bw:perf': {'enable': True,
                                                    'modes': [{'name': 'local',
                                                               'parallel': False}],
                                                    'parameters': {'copy_type': ['sm',
                                                                                 'dma'],
                                                                   'mem_type': ['htod',
                                                                                'dtoh',
                                                                                'dtod']}},
                               'ib-loopback': {'enable': True,
                                               'modes': [{'name': 'local',
                                                          'parallel': True,
                                                          'prefix': 'PROC_RANK={proc_rank} '
                                                                    'IB_DEVICES=0,2,4,6 '
                                                                    'NUMA_NODES=1,0,3,2',
                                                          'proc_num': 4},
                                                         {'name': 'local',
                                                          'parallel': True,
                                                          'prefix': 'PROC_RANK={proc_rank} '
                                                                    'IB_DEVICES=1,3,5,7 '
                                                                    'NUMA_NODES=1,0,3,2',
                                                          'proc_num': 4}]},
                               'ib-traffic': {'enable': False,
                                              'modes': [{'name': 'mpi',
                                                         'proc_num': 8}],
                                              'parameters': {'gpu_dev': '$LOCAL_RANK',
                                                             'ib_dev': 'mlx5_$LOCAL_RANK',
                                                             'msg_size': 8388608,
                                                             'numa_dev': '$((LOCAL_RANK/2))'}},
                               'kernel-launch': {'enable': True,
                                                 'modes': [{'name': 'local',
                                                            'parallel': True,
                                                            'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                            'proc_num': 8}]},
                               'lstm_models': {'enable': True,
                                               'frameworks': ['pytorch'],
                                               'models': ['lstm'],
                                               'modes': [{'name': 'torch.distributed',
                                                          'node_num': 1,
                                                          'proc_num': 8}],
                                               'parameters': {'batch_size': 1,
                                                              'duration': 0,
                                                              'model_action': ['train'],
                                                              'num_steps': 128,
                                                              'num_warmup': 16,
                                                              'precision': ['float32',
                                                                            'float16']}},
                               'matmul': {'enable': True,
                                          'frameworks': ['pytorch'],
                                          'modes': [{'name': 'local',
                                                     'parallel': True,
                                                     'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                     'proc_num': 8}]},
                               'mem-bw': {'enable': True,
                                          'modes': [{'name': 'local',
                                                     'parallel': False,
                                                     'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank} '
                                                               'numactl -N '
                                                               '$(({proc_rank}/2))',
                                                     'proc_num': 8}]},
                               'nccl-bw:default': {'enable': True,
                                                   'modes': [{'name': 'local',
                                                              'parallel': False,
                                                              'proc_num': 1}],
                                                   'parameters': {'ngpus': 8}},
                               'nccl-bw:gdr-only': {'enable': True,
                                                    'modes': [{'env': {'NCCL_IB_DISABLE': '0',
                                                                       'NCCL_IB_PCI_RELAXED_ORDERING': '1',
                                                                       'NCCL_MIN_NCHANNELS': '16',
                                                                       'NCCL_NET_GDR_LEVEL': '5',
                                                                       'NCCL_P2P_DISABLE': '1',
                                                                       'NCCL_SHM_DISABLE': '1'},
                                                               'name': 'local',
                                                               'parallel': False,
                                                               'proc_num': 1}],
                                                    'parameters': {'ngpus': 8}},
                               'ort-inference': {'enable': True,
                                                 'modes': [{'name': 'local',
                                                            'parallel': True,
                                                            'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                            'proc_num': 8}],
                                                 'parameters': {'batch_size': 1}},
                               'resnet_models': {'enable': True,
                                                 'frameworks': ['pytorch'],
                                                 'models': ['resnet50',
                                                            'resnet101',
                                                            'resnet152'],
                                                 'modes': [{'name': 'torch.distributed',
                                                            'node_num': 1,
                                                            'proc_num': 8}],
                                                 'parameters': {'batch_size': 128,
                                                                'duration': 0,
                                                                'model_action': ['train'],
                                                                'num_steps': 128,
                                                                'num_warmup': 16,
                                                                'precision': ['float32',
                                                                              'float16']}},
                               'sharding-matmul': {'enable': True,
                                                   'frameworks': ['pytorch'],
                                                   'modes': [{'name': 'torch.distributed',
                                                              'node_num': 1,
                                                              'proc_num': 8}]},
                               'tcp-connectivity': {'enable': False,
                                                    'modes': [{'name': 'local',
                                                               'parallel': False}],
                                                    'parameters': {'port': 22}},
                               'tensorrt-inference': {'enable': True,
                                                      'modes': [{'name': 'local',
                                                                 'parallel': True,
                                                                 'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                                 'proc_num': 8}],
                                                      'parameters': {'batch_size': 1,
                                                                     'precision': 'int8',
                                                                     'pytorch_models': ['resnet50',
                                                                                        'resnet101',
                                                                                        'resnet152',
                                                                                        'densenet169',
                                                                                        'densenet201',
                                                                                        'bert-base',
                                                                                        'bert-large'],
                                                                     'seq_length': 224}},
                               'vgg_models': {'enable': True,
                                              'frameworks': ['pytorch'],
                                              'models': ['vgg11',
                                                         'vgg13',
                                                         'vgg16',
                                                         'vgg19'],
                                              'modes': [{'name': 'torch.distributed',
                                                         'node_num': 1,
                                                         'proc_num': 8}],
                                              'parameters': {'batch_size': 1,
                                                             'duration': 0,
                                                             'model_action': ['train'],
                                                             'num_steps': 128,
                                                             'num_warmup': 16,
                                                             'precision': ['float32',
                                                                           'float16']}}},
                'enable': ['resnet_models'],
                'monitor': {'enable': True,
                            'sample_duration': 1,
                            'sample_interval': 10},
                'var': {'common_model_config': {'batch_size': 1,
                                                'duration': 0,
                                                'model_action': ['train'],
                                                'num_steps': 128,
                                                'num_warmup': 16,
                                                'precision': ['float32',
                                                              'float16']},
                        'default_local_mode': {'enable': True,
                                               'modes': [{'name': 'local',
                                                          'parallel': True,
                                                          'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                          'proc_num': 8}]},
                        'default_pytorch_mode': {'enable': True,
                                                 'frameworks': ['pytorch'],
                                                 'modes': [{'name': 'torch.distributed',
                                                            'node_num': 1,
                                                            'proc_num': 8}]}}},
 'version': 'v0.8'}.
[2023-04-15 13:15:10,648 guyen-MS-7B22:8261][runner.py:46][INFO] Runner writes to: /home/guyen/gg/git/superbenchmark/outputs/2023-04-15_13-15-10.
[2023-04-15 13:15:10,674 guyen-MS-7B22:8261][runner.py:51][INFO] Runner will run: ['resnet_models']
[2023-04-15 13:15:10,674 guyen-MS-7B22:8261][runner.py:202][INFO] Checking SuperBench environment.
[2023-04-15 13:15:10,699 guyen-MS-7B22:8261][ansible.py:127][INFO] Run playbook check_env.yaml ...


PLAY [Runtime Environment Check] ***********************************************


TASK [Checking container status] ***********************************************
changed: [localhost]


TASK [fail] ********************************************************************
skipping: [localhost]


PLAY [Runtime Environment Update] **********************************************


TASK [Gathering Facts] *********************************************************
ok: [localhost]


TASK [Ensure Workspace] ********************************************************
ok: [localhost]


TASK [Updating Config] *********************************************************
ok: [localhost]


TASK [Updating Env Variables] **************************************************
ok: [localhost] => (item=/root/sb-workspace/sb.env)
ok: [localhost] => (item=/tmp/sb.env)


TASK [Updating Hostfile to Remote] *********************************************
ok: [localhost]


TASK [Generating Hostfile to Local] ********************************************
changed: [localhost -> localhost]


PLAY RECAP *********************************************************************

localhost                  : ok=7    changed=2    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
[2023-04-15 13:15:15,163 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][runner.py:414][INFO] Runner is going to run resnet_models in torch.distributed mode, proc rank 0.
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][ansible.py:109][INFO] Run docker exec --env-file /tmp/sb.env sb-workspace bash -c 'python3 -m torch.distributed.launch --use_env --no_python --nproc_per_node=8 sb exec --output-dir outputs/2023-04-15_13-15-10 -c sb.config.yaml -C superbench.enable=resnet_models superbench.benchmarks.resnet_models.parameters.distributed_impl=ddp superbench.benchmarks.resnet_models.parameters.distributed_backend=nccl' on remote ...
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][ansible.py:73][INFO] Run as sudo ...
localhost | CHANGED | rc=0 >>

[2023-04-15 20:15:17,142 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,172 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,235 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,235 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,249 guyen-MS-7B22:312][monitor.py:118][INFO] Start monitoring.

[2023-04-15 20:15:17,299 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,314 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,590 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,623 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,909 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:17,909 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:17,968 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:17,970 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,151 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,152 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,165 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,166 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,216 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,217 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,241 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,241 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,393 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,394 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,477 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,477 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:23,074 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,074 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,074 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,075 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,075 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,075 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,075 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,075 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,075 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,075 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,075 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,076 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,076 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,076 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,076 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,076 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,076 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,077 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,077 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,077 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,078 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,078 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,078 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,079 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,079 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,079 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,079 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,079 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,079 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,079 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,079 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,080 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,080 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,080 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,080 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,080 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,081 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,081 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,081 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,081 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,081 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,081 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,081 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,082 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,082 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,082 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,082 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,083 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,083 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,083 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,083 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,084 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,084 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,084 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,084 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,085 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,085 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,085 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,085 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,085 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,085 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,086 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,086 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,087 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,087 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,087 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,087 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,088 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,088 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,088 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,089 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,089 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,089 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,090 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,092 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,092 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,094 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,094 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:24,085 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,086 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,086 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,087 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,087 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,087 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,088 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,088 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,088 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,088 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,089 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,090 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,090 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,090 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,090 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,090 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,091 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,091 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,101 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,102 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,102 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,111 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,112 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,113 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 13:15:27,814 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-15 13:15:27,815 guyen-MS-7B22:8261][ansible.py:127][INFO] Run playbook fetch_results.yaml ...


PLAY [Fetch Results] ***********************************************************


TASK [Gathering Facts] *********************************************************
ok: [localhost]


TASK [Synchronize Output Directory] ********************************************
changed: [localhost]


PLAY RECAP *********************************************************************

localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
[2023-04-15 13:15:29,883 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
(venv) guyen@guyen-MS-7B22:~/gg/git/superbenchmark$ 
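
In the log above, every worker process first fails pytorch-resnet50 with the NCCL "invalid usage" error. The same processes then fail pytorch-resnet101 and pytorch-resnet152 with "trying to initialize the default process group twice!", which suggests the later failures may be a follow-on of the first one. A minimal sketch for tallying such failures from a saved log, assuming the line layout shown above; `summarize_failures` is a hypothetical helper, not part of SuperBench:

```python
import re
from collections import Counter

# Assumed line layout, taken from the log above:
# [<timestamp> <host>:<pid>][base.py:<line>][ERROR] Run benchmark failed - benchmark: <name>, message: <text>
FAILURE = re.compile(
    r"\[[^\]]*:(?P<pid>\d+)\]\[base\.py:\d+\]\[ERROR\] "
    r"Run benchmark failed - benchmark: (?P<bench>[^,]+), message: (?P<msg>.*)"
)


def summarize_failures(log_text):
    """Count 'Run benchmark failed' lines per (benchmark, message) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE.search(line)
        if match:
            counts[(match.group("bench"), match.group("msg"))] += 1
    return counts
```

Calling `summarize_failures(open(path).read())` on a saved log (the path is whatever file you captured the output in) gives one counter entry per distinct benchmark and error message, which makes it easier to see which error appeared first and which ones only repeat.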

@jdgh000
Author

jdgh000 commented Apr 15, 2023

I left only resnet101 enabled, and now it runs out of memory on an RTX 2070 Super (8 GB).
What is the memory requirement for resnet_models: resnet101?
Is there any smaller training workload that can fit in 8 GB?


 worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
[2023-04-15 20:32:44,711 guyen-MS-7B22:237][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 7.78 GiB total capacity; 6.58 GiB already allocated; 127.62 MiB free; 6.71 GiB reserved in total by PyTorch)
[2023-04-15 20:32:44,712 guyen-MS-7B22:237][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:32:44,712 guyen-MS-7B22:237][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:32:44,713 guyen-MS-7B22:237][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
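
The out-of-memory message above already contains the relevant numbers: a 196.00 MiB request against 127.62 MiB free on a 7.78 GiB card. A minimal sketch for pulling those figures out of the message, assuming the exact wording shown in this log (other PyTorch versions may word it differently); `parse_oom` and `to_mib` are hypothetical helpers:

```python
import re

# Conversion factors to MiB for the two units that appear in the message.
_UNITS = {"MiB": 1.0, "GiB": 1024.0}

# Pattern based on the message format seen in the log above.
OOM = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+ [MG]iB) "
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+ [MG]iB) total capacity; "
    r"(?P<allocated>[\d.]+ [MG]iB) already allocated; "
    r"(?P<free>[\d.]+ [MG]iB) free; "
    r"(?P<reserved>[\d.]+ [MG]iB) reserved in total by PyTorch\)"
)


def to_mib(quantity):
    """Convert a string such as '7.78 GiB' to a float number of MiB."""
    value, unit = quantity.split()
    return float(value) * _UNITS[unit]


def parse_oom(message):
    """Return the sizes from a CUDA OOM message in MiB, or None if no match."""
    match = OOM.search(message)
    if match is None:
        return None
    fields = {k: to_mib(v) for k, v in match.groupdict().items() if k != "gpu"}
    fields["gpu"] = int(match.group("gpu"))
    # How far the failed request exceeded the memory that was still free.
    fields["shortfall"] = fields["requested"] - fields["free"]
    return fields
```

The `shortfall` value only describes the single allocation that failed; it is not an estimate of the total memory the benchmark needs.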

@cp5555
Contributor

cp5555 commented Apr 23, 2023

Sorry, SuperBenchmark doesn't officially support the GeForce series, so running on it can cause many unexpected issues. There is no plan to add GeForce support in upcoming releases.

For GeForce-related code and configuration settings (e.g. for the RTX 2070), it would be great if you could contribute them.

@cp5555 cp5555 self-assigned this Apr 23, 2023
@cp5555 cp5555 added the wontfix This will not be worked on label Apr 23, 2023
@jdgh000
Author

jdgh000 commented Apr 24, 2023

It was due to memory size. I did manage to run some of the smaller training workloads. You can close this.

@jdgh000
Author

jdgh000 commented Apr 24, 2023

configuration

I don't think a contribution is necessary. These are the same CUDA chips under a different brand name, in smaller sizes and sold as gaming chips. Performance will be slower (GDDR5 instead of HBM, 8 GB vs. 32 GB, etc.), but I had no issue running scaled-down workloads.

@cp5555
Contributor

cp5555 commented Apr 24, 2023

Thanks for the discussion, we really appreciate it. We are closing this issue.

@cp5555 cp5555 closed this as completed Apr 24, 2023