OOM error of distributed training on 80GB GPUs with Mistral-7b #59

Open
TracyPlus opened this issue Apr 8, 2024 · 3 comments

@TracyPlus

I ran the following train.sh on Mistral-7B:

accelerate launch finetune.py \
    --output-dir output/yarn-mistral-7b-64k \
    --model mistralai/Mistral-7B-v0.1 \
    --architecture mistral \
    --scaling-factor 8 \
    --max-position-embeddings 4096 \
    --dataset emozilla/yarn-train-tokenized-16k-mistral \
    --sliding-window-attention-schedule 65536 \
    --lr-schedule constant \
    --learning-rate 0.000001 \
    --max-train-steps 1000

with the following accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

but I encountered an out-of-memory (OOM) error on my 80 GB A800s:
[Screenshots: CUDA out-of-memory traceback, 2024-04-06]

I don't know if there's something wrong with my distributed training configuration. 🥺
Hope someone can help me. 🙏🙏
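
For context: distributed_type: MULTI_GPU runs plain DDP, so each of the six processes keeps its own full copy of the model weights, gradients, and Adam optimizer state. For a ~7B-parameter model that is on the order of 12-16 bytes per parameter (weights + gradients + fp32 master copies + two Adam moments), i.e. roughly 90-115 GB per GPU before any activations, so it cannot fit in 80 GB no matter how many GPUs are added, and the 64k-token sequences add substantial activation memory on top. The usual fix is to shard that state across processes with DeepSpeed ZeRO or FSDP. Below is a minimal sketch of such an accelerate config using DeepSpeed ZeRO-3; the ZeRO settings are illustrative assumptions, not values taken from this thread:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED        # shard state across ranks instead of replicating it (MULTI_GPU = DDP)
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none   # change to "cpu" if ZeRO-3 alone still runs out of memory
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3                    # ZeRO-3: partition parameters, gradients, and optimizer state
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6                   # one process per GPU; select GPUs with CUDA_VISIBLE_DEVICES=2,3,4,5,6,7
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

If training still runs out of memory at 64k sequence lengths, gradient checkpointing and CPU offload of the optimizer state are the usual next steps.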

TracyPlus changed the title from "OOM" to "OOM error of distributed training on 80GB GPUs with Mistral-7b" on Apr 8, 2024

YL-9 commented May 10, 2024

I also encountered this problem. Have you solved it yet? @TracyPlus

@Kwen-Chen

I also encountered this problem when I used YaRN with Llama 2.

1 similar comment
@kokolerk

I also encountered this problem when I used YaRN with Llama 2.
