
Error with SFT of LLaVA-Next #1785

Open
GohioAC opened this issue Jun 28, 2024 · 2 comments
Labels
vlm Related to Visual Language Model

Comments


GohioAC commented Jun 28, 2024

I'm trying to instruction-tune LLaVA-NeXT models following the llava_vsft.py example shared for LLaVA-1.5.

python vsft.py \
    --dataset_name="HuggingFaceH4/llava-instruct-mix-vsft" \
    --model_name_or_path="llava-hf/llava-v1.6-mistral-7b-hf" \
    --report_to="tensorboard" \
    --learning_rate=2e-5 \
    --lr_scheduler_type="cosine" \
    --per_device_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --output_dir="data/vsft-llava-1.5-7b-hf" \
    --logging_steps=1 \
    --num_train_epochs=1 \
    --gradient_checkpointing \
    --remove_unused_columns=False \
    --torch_dtype=float16 \
    --fp16=True \
    --max_seq_length=4096 \
    --attn_implementation="flash_attention_2"

The run keeps failing on an 8xH100 VM with the following error:

RuntimeError: Input tensor at index 1 has invalid shape [8, 2785, 32064], but expected [8, 2889, 32064]

The full code and error stack trace are available in this gist.
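For context on the shape `[8, 2785, 32064]`: dim 0 matches the per-device batch size, dim 2 matches the Mistral-7B vocabulary size, and the ranks disagree only on dim 1, the sequence length (2785 vs 2889). One possible workaround, sketched below, is to pad every batch to a fixed length in the data collator so that all data-parallel ranks produce tensors of equal shape. This is an assumption about the cause, not a confirmed fix, and `pad_to_length` and the pad id `0` are illustrative names/values, not part of the TRL example.

```python
# Hedged sketch: right-pad token-id sequences to a fixed length so every
# rank's batch has the same sequence dimension. The pad id (0 here) is a
# placeholder; a real collator would use the tokenizer's pad_token_id and
# pad labels with -100.
def pad_to_length(ids, max_len, pad_id):
    """Right-pad a token-id list to max_len with pad_id."""
    return ids + [pad_id] * (max_len - len(ids))

batch = [[1, 2, 3], [4, 5]]
padded = [pad_to_length(seq, 4, 0) for seq in batch]
# all sequences now share length 4
```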

qgallouedec (Member) commented Jul 2, 2024

Hi, sorry for the delay. Can you double-check the command? When I run it, I get:

Traceback (most recent call last):
  File "/fsx/qgallouedec/trl-2/vsft.py", line 122, in <module>
    trainer.train()
  File "/fsx/qgallouedec/trl-2/trl/trainer/sft_trainer.py", line 440, in train
    output = super().train(*args, **kwargs)
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 2314, in _inner_training_loop
    _grad_norm = self.accelerator.clip_grad_norm_(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 2269, in clip_grad_norm_
    self.unscale_gradients()
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 2219, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
  0%|          | 0/32395 [00:05<?, ?it/s]    
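For what it's worth, this ValueError has a recognizable shape: torch's GradScaler refuses to unscale gradients that are themselves float16, and loading the model with `--torch_dtype=float16` while also passing `--fp16=True` puts fp16 parameters (hence fp16 grads) into the optimizer. The sketch below mimics the dtype check from the `_unscale_grads_` frame in the traceback; it is a simplified illustration, not torch's actual implementation. A common workaround (again an assumption here, not a confirmed fix) is to load weights in float32 and let `--fp16` drive autocast, or to use `--bf16` on H100s.

```python
# Simplified, hypothetical mimic of the dtype check in
# torch/cuda/amp/grad_scaler.py (per the traceback): any fp16 gradient
# aborts unscaling, because fp16 master weights cannot be unscaled
# without precision loss.
def unscale_grads(grad_dtypes):
    for dtype in grad_dtypes:
        if dtype == "float16":
            raise ValueError("Attempting to unscale FP16 gradients.")

# torch_dtype=float16 + fp16=True -> fp16 params, hence fp16 grads:
try:
    unscale_grads(["float16", "float16"])
except ValueError as e:
    print(e)  # prints: Attempting to unscale FP16 gradients.

unscale_grads(["float32", "float32"])  # fp32 master weights: no error
```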

Please also share the versions of trl, transformers, and torch.

GohioAC (Author) commented Jul 3, 2024

I have double-checked the command, code, and output. Versions are as follows:

trl: 0.9.4
transformers: 4.41.2
torch: 2.3.1
cuda: 12.4
python: 3.10.14

@qgallouedec qgallouedec added the vlm Related to Visual Language Model label Jul 4, 2024