Finetuning script broken? #420

Closed
mscherrmann opened this issue Jul 20, 2023 · 4 comments

mscherrmann commented Jul 20, 2023

Hey,

Since finetuning after importing the model into transformers is not possible, I tried the finetuning script that you provide.
As a first step to test your finetuning framework, I ran the function 'test_classification_script()' from 'tests/test_classification.py'.
To do so, I used a Linux server running Ubuntu with 4 x NVIDIA Tesla P100 (16 GB) GPUs.
For the setup, I followed all the steps that you recommend here, i.e.:

I have installed CUDA release 11.7, as the following nvcc output shows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
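
As a sanity check (a quick sketch, not part of the setup steps above), the PyTorch build in the same environment can be compared against this toolkit:

import torch

# Print the CUDA toolkit that PyTorch was built against and confirm a GPU is visible.
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)   # expected to be compatible with 11.7
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))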

To test your finetuning script, I simply did the following in the console:

$ python
>>> from tests.test_classification import test_classification_script
>>> test_classification_script()

Here is the complete output:

Training using config:
tokenizer_name: prajjwal1/bert-tiny
max_seq_len: 32
run_name: test
model:
  name: mosaic_bert
  num_labels: 2
  pretrained_model_name: ${tokenizer_name}
  tokenizer_name: ${tokenizer_name}
train_loader:
  split: train
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: true
  drop_last: true
  num_workers: 4
eval_loader:
  split: validation
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: false
  drop_last: false
  num_workers: 4
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 0.5dur
  alpha_f: 0.02
optimizer:
  name: decoupled_adamw
  lr: 0.0002
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
max_duration: 8ba
eval_interval: 8ba
eval_subset_num_batches: 2
global_train_batch_size: 4
seed: 17
device_eval_batch_size: 4
device_train_microbatch_size: 2
precision: fp32
progress_bar: false
log_to_console: false
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 4
  lr_monitor: {}

Initializing model...
n_params=4.4515e+06
Building train loader...
Found cached dataset glue (...)
Loading cached processed dataset at .../cache-qnli-prajjwal1,bert-tiny-tokenization-train.arrow
Building eval loader...
Found cached dataset glue (...)
Loading cached processed dataset at .../huggingface/datasets/glue/qnli/1.0.0.../cache-qnli-prajjwal1,bert-tiny-tokenization-validation.arrow
/usr/lib/python3/dist-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: fp32; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
Logging config...
tokenizer_name: prajjwal1/bert-tiny
max_seq_len: 32
run_name: test
model:
  name: mosaic_bert
  num_labels: 2
  pretrained_model_name: ${tokenizer_name}
  tokenizer_name: ${tokenizer_name}
train_loader:
  split: train
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: true
  drop_last: true
  num_workers: 4
eval_loader:
  split: validation
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: false
  drop_last: false
  num_workers: 4
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 0.5dur
  alpha_f: 0.02
optimizer:
  name: decoupled_adamw
  lr: 0.0002
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
max_duration: 8ba
eval_interval: 8ba
eval_subset_num_batches: 2
global_train_batch_size: 4
seed: 17
device_eval_batch_size: 4
device_train_microbatch_size: 2
precision: fp32
progress_bar: false
log_to_console: false
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 4
  lr_monitor: {}
n_gpus: 1
device_train_batch_size: 4

Starting training...
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-d962222789c30252d492a16cca3bf467-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('matrix', False, 64, True, True, True, 128, 128), (True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/examples/examples/benchmarks/bert/tests/test_classification.py", line 14, in test_classification_script
    main(config)
  File "/examples/examples/benchmarks/bert/sequence_classification.py", line 317, in main
    trainer.fit()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1940, in _train_loop
    total_loss_dict = self._train_batch(use_grad_scaling)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2115, in _train_batch
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/usr/lib/python3/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
    loss = closure()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2115, in <lambda>
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
    microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2276, in _train_microbatch
    self.state.outputs = self.state.model(self.state.batch)
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/lib/python3/dist-packages/composer/models/huggingface.py", line 314, in forward
    output = self.model(**batch)  # type: ignore (thirdparty)
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 1009, in forward
    outputs = self.bert(
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 677, in forward
    encoder_outputs = self.encoder(
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 514, in forward
    hidden_states = layer_module(hidden_states,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 395, in forward
    attention_output = self.attention(hidden_states, cu_seqlens, seqlen,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 307, in forward
    self_output = self.self(input_tensor, cu_seqlens, max_s, indices,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 237, in forward
    attention = flash_attn_qkvpacked_func(qkv, bias)
  File "/examples/examples/benchmarks/bert/src/flash_attn_triton.py", line 1021, in forward
    o, lse, ctx.softmax_scale = _flash_attn_forward(
  File "/examples/examples/benchmarks/bert/src/flash_attn_triton.py", line 826, in _flash_attn_forward
    _fwd_kernel[grid](  # type: ignore
  File "/usr/lib/python3/dist-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 86, in run
    return self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 41, in _fwd_kernel
  File "/usr/lib/python3/dist-packages/triton/compiler.py", line 1268, in compile
    return CompiledKernel(name, so_cache_manager._make_path(so_name), fn_cache_manager.cache_dir, device)
  File "/usr/lib/python3/dist-packages/triton/compiler.py", line 1301, in __init__
    mod, func, n_regs, n_spills = _triton.code_gen.load_binary(metadata["name"], self.asm["cubin"], self.shared, device)
RuntimeError: CUDA: Error- invalid source

Note that in the output above I replaced paths containing my personal information with (...).

Also note that the commands

  • composer sequence_classification.py yamls/test/sequence_classification.yaml
  • composer sequence_classification.py yamls/test/sequence_classification.yaml model.name=mosaic_bert

yield the same error message.

Did I do something wrong, or is this an error in the code? I would be incredibly grateful for any guidance, as I urgently need to fine-tune my model and this error is currently blocking me from doing so.

Thank you very much!

mscherrmann changed the title from "Finetuning script croken?" to "Finetuning script broken?" on Jul 20, 2023
@dakinggg
Collaborator

I believe that triton flash attention will not work on P100s. Could you try uninstalling flash_attn_triton before running anything? I think then it will fall back to torch attention properly instead of trying to use flash attention and failing.
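
For reference, one quick way to confirm this is to query the GPU's compute capability (a minimal sketch; the exact capability required by the Triton kernels is an assumption here, not taken from the repo):

import torch

# Tesla P100 is a Pascal GPU (compute capability 6.0); the Triton flash attention
# kernels target newer architectures, so the torch attention fallback is needed here.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
if (major, minor) < (7, 5):  # assumed cutoff; adjust to the kernel's real requirement
    print("GPU likely unsupported by the Triton flash attention path; fall back to torch attention.")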

@mscherrmann
Author

Thank you for your quick response! Unfortunately, I do not have a flash_attn_triton package installed. I only find flash_attn, but uninstalling it doesn't help.

@dakinggg
Copy link
Collaborator

Apologies, I think I got the package wrong, and it's actually triton you want to uninstall. flash_attn_triton is a file in our repo. We have a try/catch around importing it, which would disable the triton attention implementation, but I guess for you the import succeeds and then it fails when it starts actually running. So I want to make that import fail so that triton is disabled.
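
For reference, the guard being described is roughly of this shape (an illustrative sketch based on the names in the traceback above, not the exact code in src/bert_layers.py):

# If importing the Triton-based kernel fails (e.g. because triton is uninstalled),
# the function is set to None and the model falls back to plain torch attention.
try:
    from .flash_attn_triton import flash_attn_qkvpacked_func
except ImportError:
    flash_attn_qkvpacked_func = None  # torch attention path is used instead

So uninstalling triton (e.g. pip uninstall triton) should make this import fail and disable the Triton attention path.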

@mscherrmann
Author

mscherrmann commented Jul 26, 2023

That also did not work for me, unfortunately. However, I just switched to pretraining hf-bert, which works fine.

Thank you for your help!

mscherrmann closed this as not planned on Jul 26, 2023