an error while training #159
Starting training...
----------End global rank 1 STDOUT----------
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
----------End global rank 1 STDERR----------
I am also facing the same error.
Hello @singhalshikha518 and @ChrisXULC, thanks for raising these! Could you provide some system specs to help us debug (GPU type, OS version, etc.)? Our recommended setup can be found here: https://github.com/mosaicml/llm-foundry#prerequisites. As a fallback, you can also turn off triton by setting the attn_impl in your model config to torch.
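For reference, a minimal sketch of what that override could look like. It assumes the MPT yaml layout where the attention implementation lives under model.attn_config.attn_impl (older configs may expose it as model.attn_impl directly), and the file paths are illustrative, not part of the repo's documented workflow:

from omegaconf import OmegaConf

# Load the training yaml you pass to scripts/train/train.py (path is illustrative).
cfg = OmegaConf.load('scripts/train/yamls/pretrain/mpt-7b.yaml')

# Switch the attention implementation from triton to plain torch.
# Assumption: the key is model.attn_config.attn_impl; older configs may use
# model.attn_impl instead, in which case change the dotlist key accordingly.
override = OmegaConf.from_dotlist(['model.attn_config.attn_impl=torch'])
cfg = OmegaConf.merge(cfg, override)

# Save a copy and launch training with it to avoid the triton code path.
OmegaConf.save(cfg, 'mpt-7b-torch-attn.yaml')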
Closing this as stale -- please reopen if you are still seeing this error.
Hi, just want to start by thanking you and your company for doing such great work in making AI so accessible to the public. I am also having this same error training MPT-7B using triton. I had the same error when I tried going through the Docker image available on the llm-foundry GitHub page, but my current specs are: Torch as the attn_impl does work just fine, but it is highly limiting for sequence length given my memory capacity, so it would be really great to be able to use triton.
File "/root/anaconda3/envs/llava/lib/python3.10/multiprocessing/shared_memory.py", line 104, in init
self._fd = _posixshmem.shm_open(
FileExistsError: [Errno 17] File exists: '/000000_barrier'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/homeLMFlow/cn_llama/packages/new/llm-foundry/scripts/train/train.py", line 256, in
main(cfg)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/train.py", line 151, in main
train_loader = build_dataloader(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/train.py", line 73, in build_dataloader
return build_text_dataloader(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/llmfoundry/data/text_data.py", line 253, in build_text_dataloader
dataset = StreamingTextDataset(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/llmfoundry/data/text_data.py", line 110, in init
super().init(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/streaming/base/dataset.py", line 331, in init
self._worker_barrier = SharedBarrier(worker_barrier_filelock_path, worker_barrier_shm_path)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/streaming/base/shared.py", line 51, in init
shared_barrier_shm = CreateSharedMemory(name=shm_path, size=size)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/streaming/base/shared.py", line 215, in init
shm = SharedMemory(name, False, size)
File "/root/anaconda3/envs/llava/lib/python3.10/multiprocessing/shared_memory.py", line 104, in init
self._fd = _posixshmem.shm_open(
FileNotFoundError: [Errno 2] No such file or directory: '/000000_barrier'
----------End global rank 1 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 30509) exited with code 1
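For what it's worth, the pattern in the traceback above (FileExistsError on '/000000_barrier' followed by FileNotFoundError on the same name) is what the stdlib SharedMemory API produces when a named segment left over from an earlier crashed run collides with a fresh create, or is removed while another rank is still trying to attach. A minimal sketch, assuming a stale segment under /dev/shm is the culprit; the segment name is taken from the traceback, and the cleanup shown here is illustrative rather than part of streaming's API:

from multiprocessing import shared_memory

name = '000000_barrier'  # name taken from the traceback above

try:
    # Mirrors streaming's create path: SharedMemory(name, create=True, size=...)
    shm = shared_memory.SharedMemory(name=name, create=True, size=16)
except FileExistsError:
    # A stale segment already exists under /dev/shm from a previous run:
    # attach to it, unlink it, then retry the create.
    stale = shared_memory.SharedMemory(name=name, create=False)
    stale.close()
    stale.unlink()
    shm = shared_memory.SharedMemory(name=name, create=True, size=16)

# Clean up the demo segment so the next run starts fresh.
shm.close()
shm.unlink()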