
[BUG]: Issue with Colossal-AI on Cuda 11.4 and Docker? #1033

Closed · Adrian-1234 opened this issue May 26, 2022 · 15 comments
Labels: bug (Something isn't working)

Comments

@Adrian-1234

🐛 Describe the bug

Followed the installation guide here:
https://github.com/hpcaitech/ColossalAI

mkdir colossalai
cd colossalai/
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
docker build -t colossalai ./docker
docker run -ti --gpus all --rm --ipc=host colossalai bash

[root@dbf722d6d864 workspace]# colossalai check -i
Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!
CUDA Version: 11.3
PyTorch Version: 1.10.1
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: x

The CUDA extension ^^^ isn't present?

[root@dbf722d6d864 workspace]# colossalai benchmark --gpus 8
Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!
=== Benchmarking Parameters ===
gpus: 8
batch_size: 8
seq_len: 512
dimension: 1024
warmup_steps: 10
profile_steps: 50
layers: 2
model: mlp

Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!

=== size: 8, mode: None ===
Average forward time: 0.0004958677291870118
Average backward time: 0.0010803651809692383
Max allocated GPU memory: 0.26564550399780273
Max cached GPU memory: 0.287109375

=== size: 8, mode: 1d ===
Average forward time: 0.004022541046142578
Average backward time: 0.0007260799407958985
Max allocated GPU memory: 0.2382950782775879
Max cached GPU memory: 0.287109375

=== size: 8, mode: 2.5d, depth: 2 ===
Average forward time: 0.001216425895690918
Average backward time: 0.002291984558105469
Max allocated GPU memory: 0.17383670806884766
Max cached GPU memory: 0.2734375

=== size: 8, mode: 3d ===
Average forward time: 0.000978093147277832
Average backward time: 0.0016768646240234374
Max allocated GPU memory: 0.05128049850463867
Max cached GPU memory: 0.185546875

Colossalai should be built with cuda extension to use the FP16 optimizer

What does this ^^^ really mean?

This is an A100-based system:

$ nvidia-smi
Thu May 26 18:43:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   26C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   26C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   25C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   30C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Environment

This is an A100-based system:

$ nvidia-smi
Thu May 26 18:43:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
Adrian-1234 added the bug label on May 26, 2022
@ver217 (Member) commented May 27, 2022

Hi, could you show us the console output of docker build -t colossalai ./docker?

@Adrian-1234 (Author)

colossal-ai.txt

@Adrian-1234 (Author)

Output uploaded ^^^^^

@FrankLeeeee (Contributor)

Hi @Adrian-1234, in order to build this Dockerfile, you need to have the NVIDIA runtime as the default Docker runtime. This usually requires sudo privileges (more details can be found here). If you find it too troublesome, you can download directly from https://www.colossalai.org/download.
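
As a quick sanity check (a sketch; the exact output format varies across Docker versions), you can confirm which runtime is the default:

# "nvidia" should appear in the runtime list and as the default runtime
docker info | grep -i runtime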

@FrankLeeeee (Contributor)

The CUDA version you see via nvidia-smi is only the maximum CUDA version supported by the CUDA driver on your machine. You can check your torch CUDA version with print(torch.version.cuda).
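
For example, from inside the container (assuming python resolves to the environment where torch is installed):

# prints the CUDA version torch was built with, and whether CUDA is usable
python -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())"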

@Adrian-1234 (Author)

It's version 11.1.

@FrankLeeeee (Contributor)

You can choose the Colossal-AI download link corresponding to your torch version and CUDA 11.1 at https://www.colossalai.org/download directly.

@Adrian-1234 (Author)

Hi Frank,
Regarding the Nvidia runtime, I have carried out the procedure as detailed in https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime
Steps for Ubuntu:

Install nvidia-container-runtime:

sudo apt-get install nvidia-container-runtime

Edit/create the /etc/docker/daemon.json with content:

{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
Restart docker daemon:

sudo systemctl restart docker

Build your image (now GPU available during build):

docker build -t my_image_name:latest .

However, starting the resulting container and running 'colossalai check -i' still shows an 'x' against CUDA extension.

Also, I am confused by your comment about downloading from https://www.colossalai.org/download/

How does this help me build a CUDA-enabled container?

Thanks, Adrian.

@FrankLeeeee (Contributor)

Hi Adrian, I mean that you can install Colossal-AI pre-built with CUDA extensions instead of building Colossal-AI from scratch.

Here are the steps to install Colossal-AI directly:

# in your shell
docker pull hpcaitech/pytorch-cuda:1.10.1-11.3.0
docker run --gpus all -ti hpcaitech/pytorch-cuda:1.10.1-11.3.0 bash

# inside container
pip install colossalai==0.1.5+torch1.10cu11.3 -f https://release.colossalai.org

# check colossalai
colossalai check -i

@FrankLeeeee (Contributor)

If you wish to build Colossal-AI from scratch, add the '-v' flag to your pip install command, i.e. pip install -v . This will show logs while installing Colossal-AI and tell you why the CUDA extension is not built.
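
A minimal sketch of that debugging session inside the container (build.log is just an illustrative file name; the CUDA extension can only compile if nvcc is available at build time):

# the CUDA toolkit compiler must be present for the extension build
nvcc --version
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
# capture the verbose build output to see why the extension was skipped
pip install -v . 2>&1 | tee build.log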

If you do not want to build the docker image on your own, you can pull hpcaitech/colossalai:nightly-cuda11.3-torch1.10 directly.
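
That is, following the same pattern as the commands above:

# in your shell
docker pull hpcaitech/colossalai:nightly-cuda11.3-torch1.10
docker run --gpus all -ti hpcaitech/colossalai:nightly-cuda11.3-torch1.10 bash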

@Adrian-1234 (Author)

$ docker run --gpus all -ti hpcaitech/pytorch-cuda:1.10.1-11.3.0 bash

[root@1b313fc7f472 workspace]# pip install colossalai==0.1.5+torch1.10cu11.3 -f https://release.colossalai.org/
Looking in links: https://release.colossalai.org/
Collecting colossalai==0.1.5+torch1.10cu11.3
  Downloading https://release.colossalai.org/colossalai-0.1.5%2Btorch1.10cu11.3-cp39-cp39-linux_x86_64.whl (8.9 MB)
     |████████████████████████████████| 8.9 MB 1.9 MB/s
Requirement already satisfied: torch>=1.8 in /opt/conda/lib/python3.9/site-packages (from colossalai==0.1.5+torch1.10cu11.3) (1.10.1)
Collecting fabric
  Downloading fabric-2.7.0-py2.py3-none-any.whl (55 kB)
     |████████████████████████████████| 55 kB 335 kB/s
Collecting psutil
  Downloading psutil-5.9.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281 kB)
     |████████████████████████████████| 281 kB 378 kB/s
Collecting pre-commit
  Downloading pre_commit-2.19.0-py2.py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 527 kB/s
Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
     |████████████████████████████████| 96 kB 820 kB/s
Collecting rich
  Downloading rich-12.4.4-py3-none-any.whl (232 kB)
     |████████████████████████████████| 232 kB 497 kB/s
Collecting packaging
  Downloading packaging-21.3-py3-none-any.whl (40 kB)
     |████████████████████████████████| 40 kB 8.4 MB/s
Requirement already satisfied: tqdm in /opt/conda/lib/python3.9/site-packages (from colossalai==0.1.5+torch1.10cu11.3) (4.63.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.9/site-packages (from colossalai==0.1.5+torch1.10cu11.3) (1.22.3)
Requirement already satisfied: typing_extensions in /opt/conda/lib/python3.9/site-packages (from torch>=1.8->colossalai==0.1.5+torch1.10cu11.3) (4.2.0)
Collecting invoke<2.0,>=1.3
  Downloading invoke-1.7.1-py3-none-any.whl (215 kB)
     |████████████████████████████████| 215 kB 948 kB/s
Collecting pathlib2
  Downloading pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting paramiko>=2.4
  Downloading paramiko-2.11.0-py2.py3-none-any.whl (212 kB)
     |████████████████████████████████| 212 kB 333 kB/s
Collecting pynacl>=1.0.1
  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)
     |████████████████████████████████| 856 kB 500 kB/s
Requirement already satisfied: six in /opt/conda/lib/python3.9/site-packages (from paramiko>=2.4->fabric->colossalai==0.1.5+torch1.10cu11.3) (1.16.0)
Requirement already satisfied: cryptography>=2.5 in /opt/conda/lib/python3.9/site-packages (from paramiko>=2.4->fabric->colossalai==0.1.5+torch1.10cu11.3) (36.0.0)
Collecting bcrypt>=3.1.3
  Downloading bcrypt-3.2.2-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (62 kB)
     |████████████████████████████████| 62 kB 285 kB/s
Requirement already satisfied: cffi>=1.1 in /opt/conda/lib/python3.9/site-packages (from bcrypt>=3.1.3->paramiko>=2.4->fabric->colossalai==0.1.5+torch1.10cu11.3) (1.15.0)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.9/site-packages (from cffi>=1.1->bcrypt>=3.1.3->paramiko>=2.4->fabric->colossalai==0.1.5+torch1.10cu11.3) (2.21)
Collecting pyparsing!=3.0.5,>=2.0.2
  Downloading pyparsing-3.0.9-py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 406 kB/s
Collecting toml
  Downloading toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting identify>=1.0.0
  Downloading identify-2.5.1-py2.py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 327 kB/s
Collecting nodeenv>=0.11.1
  Downloading nodeenv-1.6.0-py2.py3-none-any.whl (21 kB)
Collecting cfgv>=2.0.0
  Downloading cfgv-3.3.1-py2.py3-none-any.whl (7.3 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (661 kB)
     |████████████████████████████████| 661 kB 323 kB/s
Collecting virtualenv>=20.0.8
  Downloading virtualenv-20.14.1-py2.py3-none-any.whl (8.8 MB)
     |████████████████████████████████| 8.8 MB 1.4 MB/s
Collecting platformdirs<3,>=2
  Downloading platformdirs-2.5.2-py3-none-any.whl (14 kB)
Collecting distlib<1,>=0.3.1
  Downloading distlib-0.3.4-py2.py3-none-any.whl (461 kB)
     |████████████████████████████████| 461 kB 507 kB/s
Collecting filelock<4,>=3.2
  Downloading filelock-3.7.0-py3-none-any.whl (10 kB)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |████████████████████████████████| 51 kB 3.9 MB/s
Collecting pygments<3.0.0,>=2.6.0
  Downloading Pygments-2.12.0-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 465 kB/s
Installing collected packages: pynacl, platformdirs, filelock, distlib, bcrypt, virtualenv, toml, pyyaml, pyparsing, pygments, pathlib2, paramiko, nodeenv, invoke, identify, commonmark, cfgv, rich, psutil, pre-commit, packaging, fabric, click, colossalai
Successfully installed bcrypt-3.2.2 cfgv-3.3.1 click-8.1.3 colossalai-0.1.5+torch1.10cu11.3 commonmark-0.9.1 distlib-0.3.4 fabric-2.7.0 filelock-3.7.0 identify-2.5.1 invoke-1.7.1 nodeenv-1.6.0 packaging-21.3 paramiko-2.11.0 pathlib2-2.3.7.post1 platformdirs-2.5.2 pre-commit-2.19.0 psutil-5.9.1 pygments-2.12.0 pynacl-1.5.0 pyparsing-3.0.9 pyyaml-6.0 rich-12.4.4 toml-0.10.2 virtualenv-20.14.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[root@1b313fc7f472 workspace]# colossalai check -i
Traceback (most recent call last):
  File "/opt/conda/bin/colossalai", line 5, in <module>
    from colossalai.cli import cli
  File "/opt/conda/lib/python3.9/site-packages/colossalai/cli/__init__.py", line 1, in <module>
    from .cli import cli
  File "/opt/conda/lib/python3.9/site-packages/colossalai/cli/cli.py", line 4, in <module>
    from .benchmark import benchmark
  File "/opt/conda/lib/python3.9/site-packages/colossalai/cli/benchmark/__init__.py", line 4, in <module>
    from .utils import *
  File "/opt/conda/lib/python3.9/site-packages/colossalai/cli/benchmark/utils.py", line 3, in <module>
    from grpc import Call
ModuleNotFoundError: No module named 'grpc'
[root@1b313fc7f472 workspace]#


However:

$ docker run --gpus all -ti hpcaitech/colossalai:nightly-cuda11.3-torch1.10 bash
[root@2e86e3d40e57 workspace]# colossalai check -i

/opt/conda/lib/python3.9/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
CUDA Version: 11.3
PyTorch Version: 1.10.1
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓

Yes !!!!!

@FrankLeeeee (Contributor)

Great! The import error has been fixed in the latest code but not released yet.
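
In the meantime, a likely stopgap (untested here) is to install the missing dependency manually, since the grpc module is provided by the grpcio package on PyPI:

# hypothetical workaround for the ModuleNotFoundError above
pip install grpcio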

@Adrian-1234 (Author)

Cool! Many thanks for your help!

@FrankLeeeee (Contributor)

No problem! I will close this issue for now.

@Adrian-1234 (Author)

In case you are interested, the 8-GPU benchmark now gives me:
[root@2a070a24b99b workspace]# colossalai benchmark --gpus 8
/opt/conda/lib/python3.9/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)

=== Benchmarking Parameters ===
gpus: 8
batch_size: 8
seq_len: 512
dimension: 1024
warmup_steps: 10
profile_steps: 50
layers: 2
model: mlp

[the same apex/pyprof FutureWarning is printed again by each of the 8 worker processes]

=== size: 8, mode: None ===
Average forward time: 0.0004566383361816406
Average backward time: 0.0007556915283203125
Max allocated GPU memory: 0.26564550399780273
Max cached GPU memory: 0.287109375

=== size: 8, mode: 1d ===
Average forward time: 0.0008962774276733399
Average backward time: 0.0008760929107666015
Max allocated GPU memory: 0.2382950782775879
Max cached GPU memory: 0.287109375

=== size: 8, mode: 2.5d, depth: 2 ===
Average forward time: 0.0010547828674316406
Average backward time: 0.0019134521484375
Max allocated GPU memory: 0.17383670806884766
Max cached GPU memory: 0.2734375

=== size: 8, mode: 3d ===
Average forward time: 0.0011489486694335937
Average backward time: 0.001375751495361328
Max allocated GPU memory: 0.05128049850463867
Max cached GPU memory: 0.18359375

[root@2a070a24b99b workspace]#

Is this performance as you'd expect? This is an NVIDIA DGX system.
