
Error with SFT of LLaVA-Next #1785

Open
GohioAC opened this issue Jun 28, 2024 · 2 comments
Labels
vlm Related to Visual Language Model

Comments


GohioAC commented Jun 28, 2024

I'm trying to instruction-tune LLaVA-NeXT models following the llava_vsft.py example shared for LLaVA-1.5.

python vsft.py \
    --dataset_name="HuggingFaceH4/llava-instruct-mix-vsft" \
    --model_name_or_path="llava-hf/llava-v1.6-mistral-7b-hf" \
    --report_to="tensorboard" \
    --learning_rate=2e-5 \
    --lr_scheduler_type="cosine" \
    --per_device_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --output_dir="data/vsft-llava-1.5-7b-hf" \
    --logging_steps=1 \
    --num_train_epochs=1 \
    --gradient_checkpointing \
    --remove_unused_columns=False \
    --torch_dtype=float16 \
    --fp16=True \
    --max_seq_length=4096 \
    --attn_implementation="flash_attention_2"

The run keeps failing on an 8xH100 VM with the following error:

RuntimeError: Input tensor at index 1 has invalid shape [8, 2785, 32064], but expected [8, 2889, 32064]

The full code and error stack trace are available in this gist.
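For context on the shape `[8, 2785, 32064]`: dim 0 matches the per-device batch size, dim 2 matches the Mistral-7B vocabulary size, and the ranks disagree only on dim 1, the sequence length (2785 vs 2889). One possible workaround, sketched below, is to pad every batch to a fixed length in the data collator so that all data-parallel ranks produce tensors of equal shape. This is an assumption about the cause, not a confirmed fix, and `pad_to_length` and the pad id `0` are illustrative names/values, not part of the TRL example.

```python
# Hedged sketch: right-pad token-id sequences to a fixed length so every
# rank's batch has the same sequence dimension. The pad id (0 here) is a
# placeholder; a real collator would use the tokenizer's pad_token_id and
# pad labels with -100.
def pad_to_length(ids, max_len, pad_id):
    """Right-pad a token-id list to max_len with pad_id."""
    return ids + [pad_id] * (max_len - len(ids))

batch = [[1, 2, 3], [4, 5]]
padded = [pad_to_length(seq, 4, 0) for seq in batch]
# all sequences now share length 4
```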

qgallouedec (Member) commented Jul 2, 2024

Hi, sorry for the delay. Can you double-check the command? When I run it, I get:

Traceback (most recent call last):
  File "/fsx/qgallouedec/trl-2/vsft.py", line 122, in <module>
    trainer.train()
  File "/fsx/qgallouedec/trl-2/trl/trainer/sft_trainer.py", line 440, in train
    output = super().train(*args, **kwargs)
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 2314, in _inner_training_loop
    _grad_norm = self.accelerator.clip_grad_norm_(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 2269, in clip_grad_norm_
    self.unscale_gradients()
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 2219, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
  0%|          | 0/32395 [00:05<?, ?it/s]    
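For what it's worth, this ValueError has a recognizable shape: torch's GradScaler refuses to unscale gradients that are themselves float16, and loading the model with `--torch_dtype=float16` while also passing `--fp16=True` puts fp16 parameters (hence fp16 grads) into the optimizer. The sketch below mimics the dtype check from the `_unscale_grads_` frame in the traceback; it is a simplified illustration, not torch's actual implementation. A common workaround (again an assumption here, not a confirmed fix) is to load weights in float32 and let `--fp16` drive autocast, or to use `--bf16` on H100s.

```python
# Simplified, hypothetical mimic of the dtype check in
# torch/cuda/amp/grad_scaler.py (per the traceback): any fp16 gradient
# aborts unscaling, because fp16 master weights cannot be unscaled
# without precision loss.
def unscale_grads(grad_dtypes):
    for dtype in grad_dtypes:
        if dtype == "float16":
            raise ValueError("Attempting to unscale FP16 gradients.")

# torch_dtype=float16 + fp16=True -> fp16 params, hence fp16 grads:
try:
    unscale_grads(["float16", "float16"])
except ValueError as e:
    print(e)  # prints: Attempting to unscale FP16 gradients.

unscale_grads(["float32", "float32"])  # fp32 master weights: no error
```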

Please also share the versions of trl, transformers, and torch.

GohioAC (Author) commented Jul 3, 2024

I have double-checked the command, code, and output. Versions are as follows:

trl: 0.9.4
transformers: 4.41.2
torch: 2.3.1
cuda: 12.4
python: 3.10.14

@qgallouedec qgallouedec added the vlm Related to Visual Language Model label Jul 4, 2024