OOM error of distributed training on 80GB GPUs with Mistral-7b #59

Open
TracyPlus opened this issue Apr 8, 2024 · 3 comments

@TracyPlus

I ran the following train.sh on Mistral-7B:

accelerate launch finetune.py \
    --output-dir output/yarn-mistral-7b-64k \
    --model mistralai/Mistral-7B-v0.1 \
    --architecture mistral \
    --scaling-factor 8 \
    --max-position-embeddings 4096 \
    --dataset emozilla/yarn-train-tokenized-16k-mistral \
    --sliding-window-attention-schedule 65536 \
    --lr-schedule constant \
    --learning-rate 0.000001 \
    --max-train-steps 1000

with the following accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

but I encountered an out-of-memory (OOM) error on my 80 GB A800s:
[Screenshots: CUDA out-of-memory traceback, 2024-04-06]

I don't know if there's something wrong with my distributed training configuration. 🥺
Hope someone can help me. 🙏🙏
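
For context: distributed_type: MULTI_GPU runs plain DDP, so each of the six processes keeps its own full copy of the model weights, gradients, and Adam optimizer state. For a ~7B-parameter model that is on the order of 12-16 bytes per parameter (weights + gradients + fp32 master copies + two Adam moments), i.e. roughly 90-115 GB per GPU before any activations, so it cannot fit in 80 GB no matter how many GPUs are added, and the 64k-token sequences add substantial activation memory on top. The usual fix is to shard that state across processes with DeepSpeed ZeRO or FSDP. Below is a minimal sketch of such an accelerate config using DeepSpeed ZeRO-3; the ZeRO settings are illustrative assumptions, not values taken from this thread:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED        # shard state across ranks instead of replicating it (MULTI_GPU = DDP)
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none   # change to "cpu" if ZeRO-3 alone still runs out of memory
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3                    # ZeRO-3: partition parameters, gradients, and optimizer state
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6                   # one process per GPU; select GPUs with CUDA_VISIBLE_DEVICES=2,3,4,5,6,7
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

If training still runs out of memory at 64k sequence lengths, gradient checkpointing and CPU offload of the optimizer state are the usual next steps.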

TracyPlus changed the title from "OOM" to "OOM error of distributed training on 80GB GPUs with Mistral-7b" on Apr 8, 2024

YL-9 commented May 10, 2024

I also encountered this problem. Have you solved it yet? @TracyPlus

@Kwen-Chen

I also encountered this problem when I used YaRN with Llama 2.

1 similar comment
@kokolerk

I also encountered this problem when I used YaRN with Llama 2.
