OOM on two 80GB GPUs #49

Open · kyleliang919 opened this issue Jan 9, 2024 · 5 comments

kyleliang919 commented Jan 9, 2024

accelerate launch finetune.py \
    --output-dir output/mistral-yarn-7b-64k \
    --model mistralai/Mistral-7B-v0.1 \
    --architecture mistral \
    --scaling-factor 2 \
    --max-position-embeddings 16384 \
    --dataset emozilla/yarn-train-tokenized-8k-mistral \
    --sliding-window-attention-schedule 4096 \
    --lr-schedule constant \
    --learning-rate 0.000001 \
    --max-train-steps 1000

Both with and without LoRA the run hits an OOM error. This is at only 8K sequence length, so memory consumption should be roughly 4x smaller than training at 16K sequence length.

accelerate is configured to use two GPUs with FSDP.
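
For reference, a minimal sketch of what a fully sharded FSDP launch with CPU parameter offload could look like through accelerate's CLI flags. Flag names and accepted values differ between accelerate versions (older versions expect an integer such as 1 instead of FULL_SHARD), so treat this as an assumption to adapt rather than a verified config:

# FULL_SHARD splits parameters, gradients, and optimizer states across both GPUs;
# --fsdp_offload_params additionally parks the sharded parameters in CPU RAM.
accelerate launch \
    --num_machines 1 \
    --num_processes 2 \
    --mixed_precision bf16 \
    --use_fsdp \
    --fsdp_sharding_strategy FULL_SHARD \
    --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP \
    --fsdp_transformer_layer_cls_to_wrap MistralDecoderLayer \
    --fsdp_offload_params true \
    finetune.py  # ...plus the same finetune.py arguments as above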

@edisonzf2020

+1

@TracyPlus

+1

YL-9 commented May 10, 2024

> Both with and without LoRA the run hits an OOM error [...] accelerate is configured to use two GPUs with FSDP.

I've also encountered this problem. Have you solved it yet? @kyleliang919 @edisonzf2020

@kyleliang919 (Author)

Unfortunately no. I think you probably need at least 320 GB to handle this run.
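
For context, a rough back-of-envelope for full fine-tuning of a ~7.24B-parameter model with AdamW under bf16 mixed precision (standard per-parameter byte counts, not measurements from this run):

    bf16 weights:         7.24e9 × 2 bytes ≈ 14.5 GB
    fp32 master weights:  7.24e9 × 4 bytes ≈ 29 GB
    fp32 Adam m and v:    7.24e9 × 8 bytes ≈ 58 GB
    bf16 gradients:       7.24e9 × 2 bytes ≈ 14.5 GB
    model + optimizer states ≈ 116 GB, before any activations

Even fully sharded across two 80 GB cards that is roughly 58 GB per GPU, leaving only ~20 GB per GPU for activations, the CUDA context, and fragmentation at 8K tokens, which is consistent with the run being this tight.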

YL-9 commented May 13, 2024

> Unfortunately no. I think you probably need at least 320 GB to handle this run.

Thank you for your reply.
I have 4x A100, but there is a separate process on each GPU and it still hits OOM, just like with 2x A100. I don't know how to configure it.
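
In case it helps, a sketch of launching all four GPUs as a single FSDP group on one machine (same caveat as above: flag names vary across accelerate versions, and the finetune.py arguments are elided):

# One machine, one process per GPU, all four processes in one FSDP group, so
# parameters, gradients, and optimizer states are sharded 4 ways instead of replicated.
accelerate launch \
    --num_machines 1 \
    --num_processes 4 \
    --mixed_precision bf16 \
    --use_fsdp \
    --fsdp_sharding_strategy FULL_SHARD \
    --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP \
    --fsdp_transformer_layer_cls_to_wrap MistralDecoderLayer \
    finetune.py  # ...plus the finetune.py arguments from the original report

If every GPU still holds a full copy of the model, the processes are most likely running plain data parallel rather than sharding; double-check that the accelerate config actually being picked up sets distributed_type: FSDP.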
