AssertionError: Loading a checkpoint for MP=0 but world size is 1 #140

Open
IsraelAbebe opened this issue Dec 31, 2023 · 2 comments

@IsraelAbebe

Any idea what this error is and why it happens?

AssertionError: Loading a checkpoint for MP=0 but world size is 1
[2023-12-31 16:19:42,100] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 115) of binary: /usr/bin/python3.9
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-31_16:19:42
  host      : azime-36475.0-balder.hpc.uni-saarland.de
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 115)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

My inference command looks like this:

torchrun --nproc_per_node 1 example.py \
--ckpt_dir LLaMA-7B/7B \
--tokenizer_path LLaMA-7B/tokenizer.model \
--adapter_path LLaMA-7B/llama_adapter_len10_layer30_release.pth \
--quantizer False

I used THESE weights, with the adapters from this repo.

LLaMA-7B/
├── checklist.chk
├── consolidated.00.pth
├── llama_adapter_len10_layer30_release.pth
├── params.json
├── README.md
└── tokenizer.model

@csuhan
Collaborator

csuhan commented Jan 4, 2024

Check if meta-llama/llama#40 helps

@LiuHaolan

I also ran into this problem. Try printing ckpt_dir in your code; I suspect the issue is the Fire Python library failing to parse the arguments correctly.
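
For context, the assertion appears to come from the checkpoint-loading code in the example script, which infers MP from the number of *.pth shard files it finds under ckpt_dir. A paraphrased sketch (function and variable names may differ from the actual repo):

from pathlib import Path

def load_checkpoints(ckpt_dir: str, world_size: int):
    # MP (model parallel size) is taken to be the number of *.pth shards found
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert world_size == len(checkpoints), (
        f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
    )
    return checkpoints

So MP=0 means the glob matched no *.pth files at all, i.e. the ckpt_dir value the script actually received does not point at the directory containing consolidated.00.pth.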

I worked around this by using argparse.
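
A minimal sketch of that workaround, assuming example.py uses fire.Fire(main) as its entry point (as in the original LLaMA example) and that main accepts ckpt_dir, tokenizer_path, adapter_path, and quantizer arguments; the flag names here just mirror the torchrun command above and are otherwise assumptions:

import argparse

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--ckpt_dir", required=True,
                        help="directory containing consolidated.*.pth and params.json")
    parser.add_argument("--tokenizer_path", required=True)
    parser.add_argument("--adapter_path", required=True)
    parser.add_argument("--quantizer", type=lambda s: s.lower() == "true", default=False)
    return parser.parse_args()

if __name__ == "__main__":
    args = get_args()
    print(f"ckpt_dir={args.ckpt_dir!r}")  # sanity-check that the path arrived intact
    # main(...) is the existing inference function already defined in example.py
    main(args.ckpt_dir, args.tokenizer_path, args.adapter_path, quantizer=args.quantizer)

With this in place the same torchrun command can be reused, since argparse accepts the same "--flag value" syntax.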
