
failed to run gpt example #36

Closed
feifeibear opened this issue Feb 24, 2022 · 3 comments

Comments

@feifeibear
Contributor

🐛 Describe the bug

cd ColossalAI/examples/language/gpt
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

bash: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/lib/libtinfo.so.6: no version information available (required by bash)
Colossalai should be built with cuda extension to use the FP16 optimizer
/home/lcfjr/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA A100-PCIE-80GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-80GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
colossalai - colossalai - 2022-02-24 15:04:02,751 INFO: process rank 0 is bound to device 0
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Build data loader
colossalai - colossalai - 2022-02-24 15:04:02,864 INFO: Build model
Traceback (most recent call last):
File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 118, in
main()
File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 49, in main
model = gpc.config.model.pop('type')(**gpc.config.model)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 402, in gpt2_small
return create_gpt_model(**model_kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 368, in create_gpt_model
model = GPT(**model_kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 261, in init
self.embed = GPTEmbedding(embedding_dim=dim,
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 33, in init
self.word_embeddings = col_nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx, dtype=dtype)
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/layer/colossalai_layer/embedding.py", line 69, in init
weight_initializer(self.embed.weight, fan_in=num_embeddings, fan_out=embedding_dim)
File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/init.py", line 31, in initializer
return nn.init.normal_(tensor, mean, std)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 151, in normal_
return _no_grad_normal_(tensor, mean, std)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
return tensor.normal_(mean, std)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to send a keep-alive heartbeat to the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1150747) of binary: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.00041747093200683594 seconds
Traceback (most recent call last):
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier
store_util.barrier(
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 52, in synchronize
store.set(f"{key_prefix}{rank}", data)
RuntimeError: Broken pipe
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to shutdown the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/lcfjr/.local/bin/torchrun", line 10, in
sys.exit(main())
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_gpt.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-02-24_15:04:10
host : HPC-AI
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1150747)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

No response

@feifeibear
Contributor Author

torch version 1.10.2

└─(16:25:02)──> nvcc --version ──(Thu,Feb24)─┘
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
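A quick way to confirm the wheel/GPU mismatch without running the example (a small sketch, assuming a single visible GPU):

import torch

print(torch.__version__)                    # e.g. 1.10.2+cu102
print(torch.version.cuda)                   # CUDA version the wheel was built against
print(torch.cuda.get_arch_list())           # compute capabilities compiled into the wheel (stops at sm_70 here)
print(torch.cuda.get_device_capability(0))  # local GPU capability, (8, 0) on an A100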

@feifeibear
Contributor Author

The issue comes from the version of torch:
'1.10.2+cu102'
With this cu102 build, torch.nn.init.normal_ cannot initialize a GPU tensor on the A100, since the wheel does not ship sm_80 kernels.
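A minimal repro, assuming the root cause is the missing sm_80 kernel in the cu102 wheel (see the capability warning in the log above) rather than normal_ itself:

import torch

# On a matching install this succeeds; on 1.10.2+cu102 with an A100 it raises
# "CUDA error: no kernel image is available for execution on the device".
t = torch.empty(8, device="cuda")
t.normal_(mean=0.0, std=0.02)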

@feifeibear
Contributor Author

Fixed the issue after I correctly installed the matching PyTorch version.
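For anyone hitting the same error: the fix is a torch wheel built for CUDA 11.x, which includes sm_80 kernels. The exact command used here is not given; one plausible option for this setup (an assumption, adjust the CUDA tag to your driver) is:

pip install torch==1.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html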
