🐛 Describe the bug

Running the GPT example with the ZeRO-3 config fails while building the model:

cd ColossalAI/examples/language/gpt
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch
bash: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/lib/libtinfo.so.6: no version information available (required by bash)
Colossalai should be built with cuda extension to use the FP16 optimizer
/home/lcfjr/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA A100-PCIE-80GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-80GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
colossalai - colossalai - 2022-02-24 15:04:02,751 INFO: process rank 0 is bound to device 0
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Build data loader
colossalai - colossalai - 2022-02-24 15:04:02,864 INFO: Build model
Traceback (most recent call last):
File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 118, in
main()
File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 49, in main
model = gpc.config.model.pop('type')(**gpc.config.model)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 402, in gpt2_small
return create_gpt_model(**model_kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 368, in create_gpt_model
model = GPT(**model_kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 261, in init
self.embed = GPTEmbedding(embedding_dim=dim,
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 33, in init
self.word_embeddings = col_nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx, dtype=dtype)
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/layer/colossalai_layer/embedding.py", line 69, in init
weight_initializer(self.embed.weight, fan_in=num_embeddings, fan_out=embedding_dim)
File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/init.py", line 31, in initializer
return nn.init.normal_(tensor, mean, std)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 151, in normal_
return _no_grad_normal_(tensor, mean, std)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
return tensor.normal_(mean, std)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to send a keep-alive heartbeat to the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1150747) of binary: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.00041747093200683594 seconds
Traceback (most recent call last):
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier
store_util.barrier(
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 52, in synchronize
store.set(f"{key_prefix}{rank}", data)
RuntimeError: Broken pipe
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to shutdown the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/lcfjr/.local/bin/torchrun", line 10, in
sys.exit(main())
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_gpt.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-02-24_15:04:10
host : HPC-AI
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1150747)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
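For reference, the sm_80 warning near the top of the log points at the likely root cause: the installed PyTorch wheel was not compiled for the A100's compute capability. A minimal check (a sketch, assuming a standard PyTorch install where device index 0 is the A100) is:

import torch

# The A100 reports compute capability (8, 0), i.e. sm_80. If 'sm_80' is
# not in the list of architectures this build was compiled for, every
# CUDA kernel launch fails with "no kernel image is available for
# execution on the device", matching the traceback above.
print(torch.version.cuda)                   # CUDA version the wheel was built against
print(torch.cuda.get_device_capability(0))  # expected (8, 0) on an A100
print(torch.cuda.get_arch_list())           # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70'] here

If sm_80 is missing from that list, reinstalling a PyTorch build with CUDA 11 / sm_80 support (via the selector at https://pytorch.org/get-started/locally/ that the warning links to) would be the first step; the exact install command depends on the torch and CUDA versions chosen.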
Environment
No response
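(If it helps, the usual details for the Environment section can be gathered with PyTorch's standard helper:)

# Prints Python, PyTorch, CUDA, cuDNN and driver versions
python -m torch.utils.collect_env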