Cannot run 13B model #56

Closed
FrankChu0229 opened this issue Mar 2, 2023 · 4 comments
Labels
compatibility issues arising from specific hardware or system configs

Comments

@FrankChu0229

/content/llama# torchrun --nproc_per_node 2 example.py --ckpt_dir /content/drive/MyDrive/models/13B --tokenizer_path /content/drive/MyDrive/models/tokenizer.model
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
initializing model parallel with size 2
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "example.py", line 72, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 58, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Loading
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2077 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2078) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_13:56:42
  host      : 5fbe06fc63ef
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@FrankChu0229 FrankChu0229 changed the title from "Cannot run 13B model model." to "Cannot run 13B model" on Mar 2, 2023
@notune

notune commented Mar 3, 2023

Had the same problem and found out that you need at least as many GPUs in your system as you pass to --nproc_per_node, so in your case you need at least 2 GPUs.
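
A quick pre-flight check makes the mismatch obvious before torchrun even starts. The snippet below is a minimal sketch (check_gpus.py is not part of the repo) that compares torch.cuda.device_count() against the world size you intend to pass to --nproc_per_node:

# check_gpus.py - hypothetical helper, not part of the llama example.
# Fails fast if the machine has fewer GPUs than the model-parallel size
# you plan to pass to torchrun via --nproc_per_node.
import sys

import torch


def check_gpus(required: int) -> None:
    available = torch.cuda.device_count()
    if available < required:
        sys.exit(
            f"Need {required} GPU(s) for --nproc_per_node {required}, but only "
            f"{available} visible; torch.cuda.set_device(local_rank) would raise "
            f"'invalid device ordinal' on the extra ranks."
        )
    print(f"OK: {available} GPU(s) visible, {required} required.")


if __name__ == "__main__":
    # The 13B checkpoints ship as two shards, hence --nproc_per_node 2.
    check_gpus(int(sys.argv[1]) if len(sys.argv) > 1 else 2)

On a single-GPU Colab instance, python check_gpus.py 2 exits with that message instead of the longer torchrun traceback above.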

@ashishb

ashishb commented Mar 12, 2023

@Noahs-Git Can you run the 13B model with a single GPU or is that not possible at all?

@notune

notune commented Mar 12, 2023

I haven't tried it yet but it should be possible with a few tweaks as discussed here: #101 (comment)
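
For anyone looking for the general shape of those tweaks: the usual approach (a sketch under my own assumptions, not necessarily exactly what #101 describes) is to merge the two 13B shards back into one checkpoint by undoing the fairscale model-parallel splits, then launch with --nproc_per_node 1 on a GPU with enough memory. merge_13b.py below is illustrative only:

# merge_13b.py - illustrative sketch only; assumes the fairscale parallel layout
# of the original example (13B ships as consolidated.00.pth / consolidated.01.pth).
import os

import torch


def merge_dim(name: str):
    """Return the dim along which a tensor was split across shards, or None if replicated."""
    # ColumnParallelLinear weights are split along dim 0.
    if name.endswith(("wq.weight", "wk.weight", "wv.weight",
                      "w1.weight", "w3.weight", "output.weight")):
        return 0
    # RowParallelLinear weights are split along dim 1.
    if name.endswith(("wo.weight", "w2.weight")):
        return 1
    # ParallelEmbedding splits the embedding dimension.
    if name.endswith("tok_embeddings.weight"):
        return 1
    # Norm weights and rope.freqs are replicated; take shard 0's copy.
    return None


shards = [torch.load(f"13B/consolidated.0{i}.pth", map_location="cpu") for i in range(2)]
merged = {}
for name, tensor in shards[0].items():
    dim = merge_dim(name)
    merged[name] = tensor if dim is None else torch.cat([s[name] for s in shards], dim=dim)

os.makedirs("13B-merged", exist_ok=True)
torch.save(merged, "13B-merged/consolidated.00.pth")

params.json can be copied unchanged alongside the merged shard. Keep in mind the merged fp16 weights are on the order of 26 GB, so a single GPU still needs roughly that much memory (or further tricks such as 8-bit loading) before this becomes practical; see the linked discussion for details.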

@albertodepaola albertodepaola added the compatibility issues arising from specific hardware or system configs label Sep 6, 2023
@jspisak
Contributor

jspisak commented Sep 6, 2023

Closing this issue - if the problem persists, please create an issue on llama-recipes. Thanks!

@jspisak jspisak closed this as completed Sep 6, 2023