
[BUG]: Issue with Colossal-AI on Cuda 11.4 and Docker? #1033

Closed · Adrian-1234 opened this issue May 26, 2022 · 15 comments
Labels: bug (Something isn't working)

Comments

@Adrian-1234

🐛 Describe the bug

Followed the installation guide here:
https://github.com/hpcaitech/ColossalAI

mkdir colossalai
cd colossalai/
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
docker build -t colossalai ./docker
docker run -ti --gpus all --rm --ipc=host colossalai bash

[root@dbf722d6d864 workspace]# colossalai check -i
Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!
CUDA Version: 11.3
PyTorch Version: 1.10.1
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: x

The CUDA extension ^^^ isn't present?

[root@dbf722d6d864 workspace]# colossalai benchmark --gpus 8
Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!
=== Benchmarking Parameters ===
gpus: 8
batch_size: 8
seq_len: 512
dimension: 1024
warmup_steps: 10
profile_steps: 50
layers: 2
model: mlp

Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!

=== size: 8, mode: None ===
Average forward time: 0.0004958677291870118
Average backward time: 0.0010803651809692383
Max allocated GPU memory: 0.26564550399780273
Max cached GPU memory: 0.287109375

=== size: 8, mode: 1d ===
Average forward time: 0.004022541046142578
Average backward time: 0.0007260799407958985
Max allocated GPU memory: 0.2382950782775879
Max cached GPU memory: 0.287109375

=== size: 8, mode: 2.5d, depth: 2 ===
Average forward time: 0.001216425895690918
Average backward time: 0.002291984558105469
Max allocated GPU memory: 0.17383670806884766
Max cached GPU memory: 0.2734375

=== size: 8, mode: 3d ===
Average forward time: 0.000978093147277832
Average backward time: 0.0016768646240234374
Max allocated GPU memory: 0.05128049850463867
Max cached GPU memory: 0.185546875

Colossalai should be built with cuda extension to use the FP16 optimizer

What does this ^^^ really mean?

This is an A100-based system:

$ nvidia-smi
Thu May 26 18:43:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   26C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   26C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   25C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   30C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Environment

This is an A100-based system:

$ nvidia-smi
Thu May 26 18:43:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
Adrian-1234 added the bug label on May 26, 2022
@ver217 (Member) commented May 27, 2022

Hi, could you show us the console output of docker build -t colossalai ./docker?

@Adrian-1234 (Author)

colossal-ai.txt

@Adrian-1234 (Author)

Output uploaded ^^^^^

@FrankLeeeee (Contributor)

Hi @Adrian-1234, in order to build this Dockerfile, you need to have the NVIDIA runtime as the default Docker runtime. This usually requires sudo privileges (more details can be found here). If you find it too troublesome, you can download directly from https://www.colossalai.org/download.
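
As a quick sanity check (a sketch; the exact output format varies across Docker versions), you can confirm which runtime is the default:

# "nvidia" should appear in the runtime list and as the default runtime
docker info | grep -i runtime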

@FrankLeeeee (Contributor)

The CUDA version you see via nvidia-smi is only the maximum CUDA version supported by the CUDA driver on your machine. You can check your torch CUDA version with print(torch.version.cuda).
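
For example, from inside the container (assuming python resolves to the environment where torch is installed):

# prints the CUDA version torch was built with, and whether CUDA is usable
python -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())"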

@Adrian-1234 (Author)

It's version 11.1.

@FrankLeeeee (Contributor)

You can choose the Colossal-AI download link corresponding to your torch version and CUDA 11.1 at https://www.colossalai.org/download directly.

@Adrian-1234 (Author)

Hi Frank,
Regarding the Nvidia runtime, I have carried out the procedure as detailed in https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime
Steps for Ubuntu:

Install nvidia-container-runtime:

sudo apt-get install nvidia-container-runtime

Edit/create the /etc/docker/daemon.json with content:

{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
Restart docker daemon:

sudo systemctl restart docker

Build your image (now GPU available during build):

docker build -t my_image_name:latest .

However, starting the resulting container and running 'colossalai check -i' still shows an 'x' against CUDA extension.

Also, I am confused by your comment about downloading from https://www.colossalai.org/download/

How does this help me build a CUDA-enabled container?

Thanks, Adrian.

@FrankLeeeee (Contributor)

Hi Adrian, I mean that you can install Colossal-AI pre-built with CUDA extensions instead of building Colossal-AI from scratch.

Here are the steps to install Colossal-AI directly:

# in your shell
docker pull hpcaitech/pytorch-cuda:1.10.1-11.3.0
docker run --gpus all -ti hpcaitech/pytorch-cuda:1.10.1-11.3.0 bash

# inside container
pip install colossalai==0.1.5+torch1.10cu11.3 -f https://release.colossalai.org

# check colossalai
colossalai check -i

@FrankLeeeee (Contributor)

If you wish to build Colossal-AI from scratch, add the '-v' flag to your pip install command, i.e. pip install -v . This will show logs while installing Colossal-AI and tell you why the CUDA extension is not built.
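
A minimal sketch of that debugging session inside the container (build.log is just an illustrative file name; the CUDA extension can only compile if nvcc is available at build time):

# the CUDA toolkit compiler must be present for the extension build
nvcc --version
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
# capture the verbose build output to see why the extension was skipped
pip install -v . 2>&1 | tee build.log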

If you do not want to build the docker image on your own, you can pull hpcaitech/colossalai:nightly-cuda11.3-torch1.10 directly.
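
That is, following the same pattern as the commands above:

# in your shell
docker pull hpcaitech/colossalai:nightly-cuda11.3-torch1.10
docker run --gpus all -ti hpcaitech/colossalai:nightly-cuda11.3-torch1.10 bash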

@Adrian-1234 (Author)

$ docker run --gpus all -ti hpcaitech/pytorch-cuda:1.10.1-11.3.0 bash

[root@1b313fc7f472 workspace]# pip install colossalai==0.1.5+torch1.10cu11.3 -f https://release.colossalai.org/
Looking in links: https://release.colossalai.org/
Collecting colossalai==0.1.5+torch1.10cu11.3
  Downloading https://release.colossalai.org/colossalai-0.1.5%2Btorch1.10cu11.3-cp39-cp39-linux_x86_64.whl (8.9 MB)
     |████████████████████████████████| 8.9 MB 1.9 MB/s
Requirement already satisfied: torch>=1.8 in /opt/conda/lib/python3.9/site-packages (from colossalai==0.1.5+torch1.10cu11.3) (1.10.1)
Collecting fabric
  Downloading fabric-2.7.0-py2.py3-none-any.whl (55 kB)
     |████████████████████████████████| 55 kB 335 kB/s
Collecting psutil
  Downloading psutil-5.9.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281 kB)
     |████████████████████████████████| 281 kB 378 kB/s
Collecting pre-commit
  Downloading pre_commit-2.19.0-py2.py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 527 kB/s
Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
     |████████████████████████████████| 96 kB 820 kB/s
Collecting rich
  Downloading rich-12.4.4-py3-none-any.whl (232 kB)
     |████████████████████████████████| 232 kB 497 kB/s
Collecting packaging
  Downloading packaging-21.3-py3-none-any.whl (40 kB)
     |████████████████████████████████| 40 kB 8.4 MB/s
Requirement already satisfied: tqdm in /opt/conda/lib/python3.9/site-packages (from colossalai==0.1.5+torch1.10cu11.3) (4.63.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.9/site-packages (from colossalai==0.1.5+torch1.10cu11.3) (1.22.3)
Requirement already satisfied: typing_extensions in /opt/conda/lib/python3.9/site-packages (from torch>=1.8->colossalai==0.1.5+torch1.10cu11.3) (4.2.0)
Collecting invoke<2.0,>=1.3
  Downloading invoke-1.7.1-py3-none-any.whl (215 kB)
     |████████████████████████████████| 215 kB 948 kB/s
Collecting pathlib2
  Downloading pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting paramiko>=2.4
  Downloading paramiko-2.11.0-py2.py3-none-any.whl (212 kB)
     |████████████████████████████████| 212 kB 333 kB/s
Collecting pynacl>=1.0.1
  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)
     |████████████████████████████████| 856 kB 500 kB/s
Requirement already satisfied: six in /opt/conda/lib/python3.9/site-packages (from paramiko>=2.4->fabric->colossalai==0.1.5+torch1.10cu11.3) (1.16.0)
Requirement already satisfied: cryptography>=2.5 in /opt/conda/lib/python3.9/site-packages (from paramiko>=2.4->fabric->colossalai==0.1.5+torch1.10cu11.3) (36.0.0)
Collecting bcrypt>=3.1.3
  Downloading bcrypt-3.2.2-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (62 kB)
     |████████████████████████████████| 62 kB 285 kB/s
Requirement already satisfied: cffi>=1.1 in /opt/conda/lib/python3.9/site-packages (from bcrypt>=3.1.3->paramiko>=2.4->fabric->colossalai==0.1.5+torch1.10cu11.3) (1.15.0)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.9/site-packages (from cffi>=1.1->bcrypt>=3.1.3->paramiko>=2.4->fabric->colossalai==0.1.5+torch1.10cu11.3) (2.21)
Collecting pyparsing!=3.0.5,>=2.0.2
  Downloading pyparsing-3.0.9-py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 406 kB/s
Collecting toml
  Downloading toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting identify>=1.0.0
  Downloading identify-2.5.1-py2.py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 327 kB/s
Collecting nodeenv>=0.11.1
  Downloading nodeenv-1.6.0-py2.py3-none-any.whl (21 kB)
Collecting cfgv>=2.0.0
  Downloading cfgv-3.3.1-py2.py3-none-any.whl (7.3 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (661 kB)
     |████████████████████████████████| 661 kB 323 kB/s
Collecting virtualenv>=20.0.8
  Downloading virtualenv-20.14.1-py2.py3-none-any.whl (8.8 MB)
     |████████████████████████████████| 8.8 MB 1.4 MB/s
Collecting platformdirs<3,>=2
  Downloading platformdirs-2.5.2-py3-none-any.whl (14 kB)
Collecting distlib<1,>=0.3.1
  Downloading distlib-0.3.4-py2.py3-none-any.whl (461 kB)
     |████████████████████████████████| 461 kB 507 kB/s
Collecting filelock<4,>=3.2
  Downloading filelock-3.7.0-py3-none-any.whl (10 kB)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |████████████████████████████████| 51 kB 3.9 MB/s
Collecting pygments<3.0.0,>=2.6.0
  Downloading Pygments-2.12.0-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 465 kB/s
Installing collected packages: pynacl, platformdirs, filelock, distlib, bcrypt, virtualenv, toml, pyyaml, pyparsing, pygments, pathlib2, paramiko, nodeenv, invoke, identify, commonmark, cfgv, rich, psutil, pre-commit, packaging, fabric, click, colossalai
Successfully installed bcrypt-3.2.2 cfgv-3.3.1 click-8.1.3 colossalai-0.1.5+torch1.10cu11.3 commonmark-0.9.1 distlib-0.3.4 fabric-2.7.0 filelock-3.7.0 identify-2.5.1 invoke-1.7.1 nodeenv-1.6.0 packaging-21.3 paramiko-2.11.0 pathlib2-2.3.7.post1 platformdirs-2.5.2 pre-commit-2.19.0 psutil-5.9.1 pygments-2.12.0 pynacl-1.5.0 pyparsing-3.0.9 pyyaml-6.0 rich-12.4.4 toml-0.10.2 virtualenv-20.14.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[root@1b313fc7f472 workspace]# colossalai check -i
Traceback (most recent call last):
  File "/opt/conda/bin/colossalai", line 5, in <module>
    from colossalai.cli import cli
  File "/opt/conda/lib/python3.9/site-packages/colossalai/cli/__init__.py", line 1, in <module>
    from .cli import cli
  File "/opt/conda/lib/python3.9/site-packages/colossalai/cli/cli.py", line 4, in <module>
    from .benchmark import benchmark
  File "/opt/conda/lib/python3.9/site-packages/colossalai/cli/benchmark/__init__.py", line 4, in <module>
    from .utils import *
  File "/opt/conda/lib/python3.9/site-packages/colossalai/cli/benchmark/utils.py", line 3, in <module>
    from grpc import Call
ModuleNotFoundError: No module named 'grpc'
[root@1b313fc7f472 workspace]#


However:

$ docker run --gpus all -ti hpcaitech/colossalai:nightly-cuda11.3-torch1.10 bash
[root@2e86e3d40e57 workspace]# colossalai check -i

/opt/conda/lib/python3.9/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
CUDA Version: 11.3
PyTorch Version: 1.10.1
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓

Yes !!!!!

@FrankLeeeee (Contributor)

Great! The import error has been fixed in the latest code but not released yet.
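
In the meantime, a likely stopgap (untested here) is to install the missing dependency manually, since the grpc module is provided by the grpcio package on PyPI:

# hypothetical workaround for the ModuleNotFoundError above
pip install grpcio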

@Adrian-1234 (Author)

Cool! Many thanks for your help!

@FrankLeeeee (Contributor)

No problem! I will close this issue for now.

@Adrian-1234 (Author)

In case you are interested, the 8-GPU benchmark now gives me:
[root@2a070a24b99b workspace]# colossalai benchmark --gpus 8
/opt/conda/lib/python3.9/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)

=== Benchmarking Parameters ===
gpus: 8
batch_size: 8
seq_len: 512
dimension: 1024
warmup_steps: 10
profile_steps: 50
layers: 2
model: mlp

[the same apex/pyprof FutureWarning is printed again by each of the 8 worker processes]

=== size: 8, mode: None ===
Average forward time: 0.0004566383361816406
Average backward time: 0.0007556915283203125
Max allocated GPU memory: 0.26564550399780273
Max cached GPU memory: 0.287109375

=== size: 8, mode: 1d ===
Average forward time: 0.0008962774276733399
Average backward time: 0.0008760929107666015
Max allocated GPU memory: 0.2382950782775879
Max cached GPU memory: 0.287109375

=== size: 8, mode: 2.5d, depth: 2 ===
Average forward time: 0.0010547828674316406
Average backward time: 0.0019134521484375
Max allocated GPU memory: 0.17383670806884766
Max cached GPU memory: 0.2734375

=== size: 8, mode: 3d ===
Average forward time: 0.0011489486694335937
Average backward time: 0.001375751495361328
Max allocated GPU memory: 0.05128049850463867
Max cached GPU memory: 0.18359375

[root@2a070a24b99b workspace]#

Is this performance as you'd expect? This is an NVIDIA DGX system.
