
failed to run gpt example #36

Closed
feifeibear opened this issue Feb 24, 2022 · 3 comments

Comments

@feifeibear
Contributor

🐛 Describe the bug

cd ColossalAI/examples/language/gpt
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

bash: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/lib/libtinfo.so.6: no version information available (required by bash)
Colossalai should be built with cuda extension to use the FP16 optimizer
/home/lcfjr/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA A100-PCIE-80GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-80GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
colossalai - colossalai - 2022-02-24 15:04:02,751 INFO: process rank 0 is bound to device 0
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Build data loader
colossalai - colossalai - 2022-02-24 15:04:02,864 INFO: Build model
Traceback (most recent call last):
File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 118, in
main()
File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 49, in main
model = gpc.config.model.pop('type')(**gpc.config.model)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 402, in gpt2_small
return create_gpt_model(**model_kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 368, in create_gpt_model
model = GPT(**model_kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 261, in init
self.embed = GPTEmbedding(embedding_dim=dim,
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 33, in init
self.word_embeddings = col_nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx, dtype=dtype)
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/layer/colossalai_layer/embedding.py", line 69, in init
weight_initializer(self.embed.weight, fan_in=num_embeddings, fan_out=embedding_dim)
File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/init.py", line 31, in initializer
return nn.init.normal_(tensor, mean, std)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 151, in normal_
return _no_grad_normal_(tensor, mean, std)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
return tensor.normal_(mean, std)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to send a keep-alive heartbeat to the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1150747) of binary: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.00041747093200683594 seconds
Traceback (most recent call last):
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier
store_util.barrier(
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 52, in synchronize
store.set(f"{key_prefix}{rank}", data)
RuntimeError: Broken pipe
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to shutdown the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/lcfjr/.local/bin/torchrun", line 10, in
sys.exit(main())
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_gpt.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-02-24_15:04:10
host : HPC-AI
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1150747)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

No response

@feifeibear
Contributor Author

torch version 1.10.2

└─(16:25:02)──> nvcc --version ──(Thu,Feb24)─┘
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
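A quick way to confirm the wheel/GPU mismatch without running the example (a small sketch, assuming a single visible GPU):

import torch

print(torch.__version__)                    # e.g. 1.10.2+cu102
print(torch.version.cuda)                   # CUDA version the wheel was built against
print(torch.cuda.get_arch_list())           # compute capabilities compiled into the wheel (stops at sm_70 here)
print(torch.cuda.get_device_capability(0))  # local GPU capability, (8, 0) on an A100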

@feifeibear
Contributor Author

The issue comes from the version of torch:
'1.10.2+cu102'
With this cu102 build, torch.nn.init.normal_ cannot initialize a GPU tensor on the A100, since the wheel does not ship sm_80 kernels.
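A minimal repro, assuming the root cause is the missing sm_80 kernel in the cu102 wheel (see the capability warning in the log above) rather than normal_ itself:

import torch

# On a matching install this succeeds; on 1.10.2+cu102 with an A100 it raises
# "CUDA error: no kernel image is available for execution on the device".
t = torch.empty(8, device="cuda")
t.normal_(mean=0.0, std=0.02)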

@feifeibear
Contributor Author

Fixed the issue after I correctly installed the matching PyTorch version.
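For anyone hitting the same error: the fix is a torch wheel built for CUDA 11.x, which includes sm_80 kernels. The exact command used here is not given; one plausible option for this setup (an assumption, adjust the CUDA tag to your driver) is:

pip install torch==1.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html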
