fine tuning mpt7b using local dataset #143

Closed
singhalshikha518 opened this issue May 16, 2023 · 12 comments

I tried fine-tuning MPT-7B on the Dolly dataset with the command below:

composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml

yaml file: https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml

Before training starts, I get the error below:

[Eval batch=321/321] Eval on eval data:
Eval metrics/eval/LanguageCrossEntropy: 9.1594
Eval metrics/eval/LanguagePerplexity: 9503.6523
/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Traceback (most recent call last):
File "", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-842f0fbd42a6607893f7134cdd9d16f2-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 254, in
main(cfg)
File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 243, in main
trainer.fit()
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1766, in fit
self._train_loop()
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1940, in _train_loop
total_loss_dict = self._train_batch(use_grad_scaling)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2115, in _train_batch
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
loss = closure()
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2115, in
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2340, in _train_microbatch
microbatch_loss.backward(create_graph=self._backwards_create_graph)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/flash_attn/flash_attn_triton.py", line 827, in backward
_flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv,
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/flash_attn/flash_attn_triton.py", line 694, in _flash_attn_backward
_bwd_kernel[grid](
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 73, in run
timings = {config: self._bench(*args, config=config, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 73, in
timings = {config: self._bench(*args, config=config, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 63, in _bench
return do_bench(kernel_call)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/testing.py", line 140, in do_bench
fn()
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 62, in kernel_call
self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 200, in run
return self.fn.run(*args, **kwargs)
File "", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

Paladiamors commented May 17, 2023

I'm getting the same problem, using triton 2.0.0.dev20221202 as recommended in the setup script.

fbiere commented May 17, 2023

I am also seeing this problem.

alextrott16 (Contributor) commented

What kind of hardware are you using? And have you tried starting from our recommended docker image mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04?

Any other details about your environments would be helpful to know.

singhalshikha518 (Author) commented May 18, 2023

@alextrott16
I am using a Slurm cluster with 4 A10 GPUs.
CUDA version: 11.6
nvcc version: Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

GCC version: gcc (GCC) 7.3.1 20180303

torch: 1.13.1+cu116

Along with the above error, when I try multi-node training I get an NCCL error.

singhalshikha518 (Author) commented May 18, 2023

Also, I am getting the error below with 'attn_impl: torch':
File "llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/modeling_mpt.py", line 142, in forward
raise NotImplementedError('MPT does not support training with left padding.')

singhalshikha518 (Author) commented

What kind of hardware are you using? And have you tried starting from our recommended docker image mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04?

Any other details about your environments would be helpful to know.

Looks like the error occurs with torch 1.13.1+cu116,
and with torch 2.0.1 I get an error when saving the checkpoint.
Which torch version should be used for CUDA 11.6?

vchiley (Contributor) commented May 18, 2023

KeyError: ('2-.-0-.-0-842f0fbd42a6607893f7134cdd9d16f2-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

This is the error you see if you try to use torch>=2.0.0 (torch>=2.0.0 requires a version of triton that has issues).
We are working on a workaround.
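
If you are not sure which combination you ended up with, here is a quick version check (a minimal sketch; both packages expose a __version__ attribute):

import torch
import triton

# The pre-MLIR triton flash-attention path expects torch 1.13.x with
# triton 2.0.0.dev20221202, not torch>=2.0.0.
print('torch :', torch.__version__)
print('triton:', triton.__version__)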

vchiley (Contributor) commented May 18, 2023

This error tells you the issue: your dataset is producing batches with left padding, and MPT does not support training with left padding. This is a dataset issue.
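
For anyone hitting this with a custom data pipeline: the padding side is a property of the tokenizer/collator, not the model. Below is a minimal sketch with the Hugging Face tokenizer API, only to illustrate right padding; it is not llm-foundry's finetuning dataloader. (MPT-7B uses the EleutherAI/gpt-neox-20b tokenizer.)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
tokenizer.pad_token = tokenizer.eos_token  # this tokenizer has no pad token by default
tokenizer.padding_side = 'right'           # MPT requires right padding during training

batch = tokenizer(['short example', 'a noticeably longer training example'],
                  padding=True, return_tensors='pt')
print(batch['input_ids'])  # pad tokens now appear at the end of each row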

Louis-y-nlp commented

This question may sound a bit silly, but why is right padding used during training while left padding is chosen during inference?

jwatte commented May 24, 2023

I'm getting the same kernel crash / key error:

[Eval batch=1/13] Eval on eval data
[Eval batch=2/13] Eval on eval data
[Eval batch=3/13] Eval on eval data
[Eval batch=5/13] Eval on eval data
[Eval batch=6/13] Eval on eval data
[Eval batch=7/13] Eval on eval data
[Eval batch=8/13] Eval on eval data
[Eval batch=9/13] Eval on eval data
[Eval batch=11/13] Eval on eval data
[Eval batch=12/13] Eval on eval data
[Eval batch=13/13] Eval on eval data:
         Eval metrics/eval/LanguageCrossEntropy: 10.0889
         Eval metrics/eval/LanguagePerplexity: 24073.9668
Traceback (most recent call last):
  File "<string>", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0--2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

....

  File "/home/ubuntu/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

Using a g5.24xlarge instance with 4xA10G GPUs on EC2.
It's using Torch 1.13.1 already:

ubuntu@ip-172-31-12-71:/opt/mpt-7b/llm-foundry/scripts/train$ pip uninstall torch
Found existing installation: torch 1.13.1

And triton 2.0.0.dev20221202:

ubuntu@ip-172-31-12-71:/opt/mpt-7b/llm-foundry/scripts/train$ pip uninstall triton
Found existing installation: triton 2.0.0.dev20221202

abhi-mosaic (Member) commented

Hi @jwatte, could you try installing this fork of triton we have set up? It uses a pre-MLIR tag that should work with torch 1.13.1:

'triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python',
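
For example, that requirement can be installed directly with pip (assuming the same virtualenv; the quotes keep the shell from interpreting the URL):

pip install 'triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python'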

In general we have not tried training with A10s so it's a bit of uncharted territory. I hope we can get more internally so we can start adding it to our support matrix, but it's unlikely to happen in the next few weeks.

This question may sound a bit silly, but why is right padding used during training while left padding is chosen during inference?

I think the choice at training time is a bit arbitrary, but at inference time left padding is used so that the ends of the sequences line up: since you generate one token at a time, you want the newly generated tokens to be appended right after the last real token of every sequence in the batch.
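
A small illustration of the difference (a sketch with the Hugging Face tokenizer API; the tokenizer is only used for demonstration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
tokenizer.pad_token = tokenizer.eos_token
prompts = ['Hello', 'A much longer prompt with many more tokens']

tokenizer.padding_side = 'right'
print(tokenizer(prompts, padding=True)['input_ids'])  # pads after each prompt

tokenizer.padding_side = 'left'
print(tokenizer(prompts, padding=True)['input_ids'])  # pads before each prompt
# With left padding the last position of every row is a real token, so the next
# generated token can be appended at the same position for the whole batch.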

@abhi-mosaic abhi-mosaic self-assigned this Jun 2, 2023
bmosaicml pushed a commit that referenced this issue Jun 6, 2023
* Tweak example config dict + clarify running tests in subdirectories
abhi-mosaic (Member) commented

Closing this issue as it's gone a bit stale, but I just want to note that we are actively testing A10 support now and will update the support matrix on the top README once we have confirmed that it works.
