Question of training utilizing A6000 #71

Closed
xogud3373 opened this issue Nov 15, 2023 · 6 comments

Comments

xogud3373 commented Nov 15, 2023

Hello, first of all, I would like to express my deep gratitude for your excellent research.

I'm currently running training on 8x A6000 GPUs, but I got the errors below.

(screenshot of the error traceback)

Is there a way to resolve this issue without using flash-attention, or by modifying another part of the code?

I used the training command below.

torchrun --nproc_per_node 8 --master_port 29001 video_chatgpt/train/train_mem.py \
          --model_name_or_path ./LLaVA-Lightning-7B-v1-1 \
          --version v1 \
          --data_path video_chatgpt_training.json \
          --video_folder st_outputs1 \
          --tune_mm_mlp_adapter True \
          --mm_use_vid_start_end \
          --bf16 True \
          --output_dir ./Video-ChatGPT_7B-1.1_Checkpoints \
          --num_train_epochs 3 \
          --per_device_train_batch_size 4 \
          --per_device_eval_batch_size 4 \
          --gradient_accumulation_steps 8 \
          --evaluation_strategy "no" \
          --save_strategy "steps" \
          --save_steps 3000 \
          --save_total_limit 3 \
          --learning_rate 2e-5 \
          --weight_decay 0. \
          --warmup_ratio 0.03 \
          --lr_scheduler_type "cosine" \
          --logging_steps 100 \
          --tf32 True \
          --model_max_length 2048 \
          --gradient_checkpointing True \
          --lazy_preprocess True
leesungjae7469 commented Nov 15, 2023

When I tried to train this model, I couldn't train it on an A6000 either.

@CallmeBOKE

Same issue here.

@JakePark-Kor

I ran into the same issue. If anyone has found a solution, please share it :)

@xogud3373 (Author)

I removed the replace_llama_attn_with_flash_attn() call from video_chatgpt/train/train_mem.py and the training then proceeded. Could removing this code cause any issues with performance?
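
For reference, here is a minimal sketch of what that edit looks like. It assumes train_mem.py follows the usual LLaVA-style layout, where the monkey patch is applied before the trainer is imported; the exact import paths in the repo may differ.

    # video_chatgpt/train/train_mem.py (sketch; actual file contents may differ)

    # The flash-attention monkey patch is normally applied before importing the
    # trainer. On GPUs other than A100/H100 it can be commented out, as above.
    # from video_chatgpt.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
    # replace_llama_attn_with_flash_attn()

    from video_chatgpt.train.train import train

    if __name__ == "__main__":
        train()

Since FlashAttention computes exact attention, dropping the patch should mainly affect training speed and memory usage rather than model quality.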

Abyss-J commented Dec 3, 2023

I used A40 GPUs and got the same issue. How should I solve this problem?

mmaaz60 (Member) commented Apr 12, 2024

Hi @everyone,

Flash Attention only works on A100 or H100 GPUs. If you want to train on any other GPU, commenting out the line

replace_llama_attn_with_flash_attn()

should work. Thanks and good luck!

Please let me know if you have any questions.
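
If you would rather keep flash attention for A100/H100 runs and skip it elsewhere, a hedged alternative is to guard the call on the detected GPU. This guard is not part of the repo; it is a sketch that uses torch.cuda.get_device_capability() and assumes the A100/H100-only constraint stated above.

    import torch

    def flash_attn_supported() -> bool:
        # A100 reports compute capability (8, 0) and H100 reports (9, 0);
        # A6000 and A40 report (8, 6), so they fall back to standard attention.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() in [(8, 0), (9, 0)]

    if flash_attn_supported():
        replace_llama_attn_with_flash_attn()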

mmaaz60 closed this as completed Apr 12, 2024