an error while training #159

Closed · ChrisXULC opened this issue May 18, 2023 · 5 comments

@ChrisXULC

File "/root/anaconda3/envs/llava/lib/python3.10/multiprocessing/shared_memory.py", line 104, in init
self._fd = _posixshmem.shm_open(
FileExistsError: [Errno 17] File exists: '/000000_barrier'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/homeLMFlow/cn_llama/packages/new/llm-foundry/scripts/train/train.py", line 256, in
main(cfg)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/train.py", line 151, in main
train_loader = build_dataloader(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/train.py", line 73, in build_dataloader
return build_text_dataloader(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/llmfoundry/data/text_data.py", line 253, in build_text_dataloader
dataset = StreamingTextDataset(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/llmfoundry/data/text_data.py", line 110, in init
super().init(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/streaming/base/dataset.py", line 331, in init
self._worker_barrier = SharedBarrier(worker_barrier_filelock_path, worker_barrier_shm_path)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/streaming/base/shared.py", line 51, in init
shared_barrier_shm = CreateSharedMemory(name=shm_path, size=size)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/streaming/base/shared.py", line 215, in init
shm = SharedMemory(name, False, size)
File "/root/anaconda3/envs/llava/lib/python3.10/multiprocessing/shared_memory.py", line 104, in init
self._fd = _posixshmem.shm_open(
FileNotFoundError: [Errno 2] No such file or directory: '/000000_barrier'

----------End global rank 1 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 30509) exited with code 1
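
A hedged workaround sketch for the barrier errors above, assuming the leftover /000000_barrier segment in /dev/shm was left behind by an earlier crashed run (the segment name is taken from the traceback; adjust it if yours differs): unlink the stale segment before relaunching.

```python
# Hedged sketch: remove a stale shared-memory barrier segment left behind by
# a crashed run, so the streaming library can recreate it on the next launch.
# The segment name "000000_barrier" comes from the traceback above (the
# leading slash is added automatically by SharedMemory on POSIX systems).
from multiprocessing import shared_memory


def unlink_stale_segment(name: str) -> None:
    try:
        # create=False attaches to an existing segment without resizing it.
        shm = shared_memory.SharedMemory(name=name, create=False)
    except FileNotFoundError:
        return  # nothing stale to clean up
    shm.close()
    shm.unlink()  # removes the /dev/shm entry backing this name


if __name__ == "__main__":
    unlink_stale_segment("000000_barrier")
```

Note that unlinking only removes the name from /dev/shm; if another rank is still actively using the barrier, skip this and instead restart all ranks together.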

@ChrisXULC (Author)

Starting training...

----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_fp16; MFU cannot be calculated and reported. gpu_flops_available can be manually overridden by setting gpu_flops_available in SpeedMonitor.
warnings.warn(
Traceback (most recent call last):
File "", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-5ef8f334a15fe35aaf5db62d90ceef62-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, None, torch.float16, torch.float32, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('none', True, 64, False, True, True, True, 128, 128), (True, True, True, (False,), True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/homeLMFlow/cn_llama/packages/new/llm-foundry/scripts/train/train.py", line 257, in
main(cfg)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/scripts/train/train.py", line 245, in main
trainer.fit()
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1766, in fit
self._train_loop()
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1940, in _train_loop
total_loss_dict = self._train_batch(use_grad_scaling)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2118, in _train_batch
self._train_microbatches(microbatches, total_loss_dict)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2340, in _train_microbatch
microbatch_loss.backward(create_graph=self._backwards_create_graph)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/flash_attn/flash_attn_triton.py", line 827, in backward
_flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv,
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/flash_attn/flash_attn_triton.py", line 694, in _flash_attn_backward
_bwd_kernel[grid](
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "/homeLMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 73, in run
timings = {config: self._bench(*args, config=config, **kwargs)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 73, in
timings = {config: self._bench(*args, config=config, **kwargs)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 63, in _bench
return do_bench(kernel_call)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/triton/testing.py", line 140, in do_bench
fn()
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 62, in kernel_call
self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
File "/home/LMFlow/cn_llama/packages/new/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 200, in run
return self.fn.run(*args, **kwargs)
File "", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

----------End global rank 1 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 86132) exited with code 1

@singhalshikha518

I am also facing the same error.

@hanlint (Collaborator) commented May 18, 2023

Hello @singhalshikha518 and @ChrisXULC, thanks for raising these! It's possible that RuntimeError: Triton Error [CUDA]: invalid argument arises because of (1) running on older hardware that the Triton kernel does not support, or (2) some incompatibility in CUDA versions, etc. For example, see microsoft/DeepSpeed#3382 or microsoft/DeepSpeed-MII#170 (comment).

Could you provide some system specs to help us debug? (GPU type, OS version, etc.). Our recommended setup can be found here: https://github.com/mosaicml/llm-foundry#prerequisites
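
In case it helps, here is a small sketch for collecting those specs in one place. The package names passed to importlib.metadata ("triton" and "flash-attn") are assumptions about how the wheels are published in your environment.

```python
# Gather GPU, CUDA, and library version info for debugging reports.
import platform
from importlib.metadata import PackageNotFoundError, version

import torch


def pkg_version(name: str) -> str:
    """Return the installed version of a package, or a placeholder."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"


print("OS:", platform.platform())
print("Python:", platform.python_version())
print("Torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
print("Triton:", pkg_version("triton"))
print("Flash-attn:", pkg_version("flash-attn"))
```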

As a fallback, you can also turn off Triton by setting attn_impl: torch. This is slower and uses more memory, but it may work if the Triton kernels are difficult to set up properly in your environment.
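
For reference, a minimal sketch of that override, assuming the model.attn_config.attn_impl key path used by the MPT YAMLs (older configs may expose model.attn_impl directly) and an example YAML path; the same dotted key can usually be passed as a command-line override if your train.py merges CLI arguments into the config.

```python
# Sketch of the attn_impl fallback described above. Assumptions: the key path
# model.attn_config.attn_impl and the YAML path below are examples only;
# check both against your llm-foundry checkout.
from omegaconf import OmegaConf

cfg = OmegaConf.load("scripts/train/yamls/mpt/7b.yaml")  # example path
override = OmegaConf.from_dotlist(["model.attn_config.attn_impl=torch"])
cfg = OmegaConf.merge(cfg, override)

# Confirm the attention implementation before handing cfg to the trainer.
print(OmegaConf.to_yaml(cfg.model))
```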

hanlint self-assigned this May 18, 2023
@hanlint (Collaborator) commented May 26, 2023

Closing this as stale -- please reopen if using torch does not work!

hanlint closed this as completed May 26, 2023
@mikeybellissimo

Hi, I just want to start by thanking you and your company for doing such great work in making AI so accessible to the public.

I am also having this same error training MPT-7B with Triton. I had the same error when I tried going through the Docker image available on the LLM Foundry GitHub page, but my current specs are:
OS: Ubuntu on Windows WSL2
GPU: 3090 (which rules out the error being limited to GPUs that predate Ampere)
Triton: 2.0.0.dev20221202
Flash-attn: 1.0.3.post0
Python: 3.10.9
Torch: Currently 2.0.1 (I've tried with 1.13.1 as well)
CUDA: 11.7

Torch as the attn_impl does work just fine, but it is highly limiting on sequence length given my memory capacity, so it would be really great to be able to use Triton.
Thanks!
Michael
