Multiple GPU setup help #812

Open
BotLifeGamer opened this issue Sep 9, 2023 · 8 comments

Comments

@BotLifeGamer

Hello, I haven't found a guide for a multiple-GPU setup for Kohya. Does anyone have a step-by-step guide? I keep getting errors trying to work this out on my own, and there is no clear guide for it. It would be greatly appreciated if someone could point me in the right direction.

@BootsofLagrangian
Contributor

accelerate launch --num_processes=[NUM_YOUR_GPUS_PER_MACHINE] --num_machines=[NUM_YOUR_INDEPENDENT_MACHINES] --multi_gpu --gpu_ids=[GPU_IDS] "train_network.py" args...

If you have 4 GPUs and one machine, give the args as
accelerate launch --num_processes=4 --multi_gpu --num_machines=1 --gpu_ids=0,1,2,3 "train_network.py" args...

@BotLifeGamer
Author

BotLifeGamer commented Sep 9, 2023

accelerate launch --num_processes=[NUM_YOUR_GPUS_PER_MACHINE] --num_machines=[NUM_YOUR_INDEPENDENT_MACHINES] --multi_gpu --gpu_ids=[GPU_IDS] "train_network.py" args...

If you have 4 GPUs and one machine, give the args as accelerate launch --num_processes=4 --multi_gpu --num_machines=1 --gpu_ids=0,1,2,3 "train_network.py" args...

Thanks for the reply. I'm slowly learning everything as I go; a friend and I spent hours trying to figure this out and read the previous posts before I asked. So where do the args go, and into what file, into train_network.py?

@NEXTAltair

When training on Paperspace Gradient with two A6000s, running "accelerate config" from the terminal was enough to get training working with "bmaltais/kohya_ss". When training with "sd-scripts" as well, I have been able to use multiple GPUs just by launching through "accelerate", without setting specific arguments.

In which compute environment are you running?
This machine
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
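
Once accelerate config has been answered, launching through accelerate picks up whatever was saved there, so the command line itself does not need the multi-GPU flags. As a rough sketch (the arguments after the script name stand in for your own training arguments), it reduces to:

accelerate launch train_network.py [your usual train_network.py arguments]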

@BootsofLagrangian
Contributor

accelerate launch --num_processes=[NUM_YOUR_GPUS_PER_MACHINE] --num_machines=[NUM_YOUR_INDEPENDENT_MACHINES] --multi_gpu --gpu_ids=[GPU_IDS] "train_network.py" args...
If you have 4 GPUs and one machine, give the args as accelerate launch --num_processes=4 --multi_gpu --num_machines=1 --gpu_ids=0,1,2,3 "train_network.py" args...

Thanks for the reply. I'm slowly learning everything as I go; a friend and I spent hours trying to figure this out and read the previous posts before I asked. So where do the args go, and into what file, into train_network.py?

You can list the args of train_network.py with the following command in a terminal or prompt in the sd-scripts directory.

python train_network.py -h

And if you want to use multiple GPUs in sd-scripts, you need to know what the accelerate library is.
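
To make the placement concrete (the paths below are placeholders, not ones from this thread): everything before the script name belongs to accelerate, and train_network.py's own options simply follow the script name on the same line, for example:

accelerate launch --num_processes=4 --multi_gpu --num_machines=1 --gpu_ids=0,1,2,3 train_network.py --pretrained_model_name_or_path=./models/base.safetensors --train_data_dir=./training/img --output_dir=./training/model --resolution=1024,1024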

@BotLifeGamer
Author

accelerate launch --num_processes=4 --multi_gpu --num_machines=1 --gpu_ids=0,1,2,3 "train_network.py"

Does this look like I'm on the right path?

D:\Kohya_ss\kohya_ss>accelerate launch --num_processes=2 --multi_gpu --num_machines=1 --gpu_ids=0,1 "train_network.py" -- --resolution 1024
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [AIBOT]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [AIBOT]:29500 (system error: 10049 - The requested address is not valid in its context.).
prepare tokenizer
prepare tokenizer
Using DreamBooth method.
Using DreamBooth method.
prepare images.
0 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 1
resolution: (1024, 1024)
enable_bucket: False

[Dataset 0]
loading image sizes.
0it [00:00, ?it/s]
prepare dataset
No data found. Please verify arguments (train_data_dir must be the parent of folders with images) / 画像がありません。引数指定を確認してください(train_data_dirには画像があるフォルダではなく、画像があるフォル ダの親フォルダを指定する必要があります)
prepare images.
0 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 1
resolution: (1024, 1024)
enable_bucket: False

[Dataset 0]
loading image sizes.
0it [00:00, ?it/s]
prepare dataset
No data found. Please verify arguments (train_data_dir must be the parent of folders with images) / 画像がありません。引数指定を確認してください(train_data_dirには画像があるフォルダではなく、画像があるフォル ダの親フォルダを指定する必要があります)
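
The "No data found" message here is a dataset problem rather than a multi-GPU one: --train_data_dir has to point at the parent folder, with the images inside subfolders whose names start with a repeat count. A minimal layout (the folder and file names below are hypothetical) looks roughly like:

training/img/            <- pass this folder to --train_data_dir
    10_mysubject/        <- "<repeats>_<name>" subfolder; images (and .txt captions) go here
        001.png
        001.txt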

@BootsofLagrangian
Contributor

BootsofLagrangian commented Sep 16, 2023

@BotLifeGamer

Here is an example command line for training a LoRA:

accelerate launch --num_processes=2 --multi_gpu --num_machines=1 --gpu_ids=0,1 "train_network.py" --pretrained_model_name_or_path=[huggingface_path or base model path to use] --network_module=networks.lora --save_model_as=safetensors --caption_extension=".txt" --seed="42" --training_comment=[some comment] --output_name=[output_model_name] --train_data_dir=./training/img --output_dir=./training/model --logging_dir=./training/logs --network_alpha=[LINEAR_ALPHA] --network_dim=[LINEAR_RANK] --network_args "conv_rank=[CONV_RANK]" "conv_alpha=[CONV_ALPHA]" --resolution=%RESOLUTION% --train_batch_size=%BATCH_SIZE% --learning_rate=%LEARNING_RATE% --unet_lr=%UNET_LR% --text_encoder_lr=%TE_LR% --max_train_steps=%TRAINING_STEP% --lr_warmup_steps=%WARMUP_STEP% --save_every_n_epochs=1 --lr_scheduler=%LR_SCHEDULER% --lr_scheduler_num_cycles=%LR_CYCLES% --optimizer_type=%OPTIMIZER% --optimizer_args %OPTIMIZER_ARGS% --max_grad_norm=1.0 --noise_offset=%NOISE_OFFSET% --mixed_precision=%PRECISION% --save_precision=%PRECISION% --enable_bucket --bucket_no_upscale --random_crop --bucket_reso_steps=%BUCKET_RESO_STEPS% --max_token_length=225 --shuffle_caption --xformers --gradient_checkpointing --persistent_data_loader_workers

If you want to do a full fine-tune of the model, use "fine_tune.py" instead of "train_network.py".
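
For example, the accelerate part of the launch stays the same and only the script and its own arguments change (fine_tune.py has a different argument set, which python fine_tune.py -h will list):

accelerate launch --num_processes=2 --multi_gpu --num_machines=1 --gpu_ids=0,1 fine_tune.py [fine_tune.py arguments]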

@Charmandrigo

What is the setup for two machines on the same network? I am failing to get that part set up. My second machine seems to be right, but on the main one I have no idea what to put for the IP and port, because when I run a training it says the port is already in use (by the kohya UI itself running on the main machine).

@BootsofLagrangian
Contributor

@Charmandrigo
Sorry, I only have experience with single-machine training, but I think accelerate supports multi-machine training.
If you run accelerate config, you will find options for multi-machine training with DDP.
And for now, kohya's sd-scripts supports only DDP, not ZeRO or FSDP.
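
For reference, a two-machine DDP launch is normally expressed with accelerate's multi-node flags. A rough sketch (the IP, port, and GPU counts below are placeholders; --num_processes is the total across both machines, and the port should be one nothing else is listening on, so not the one the kohya UI already occupies):

On the main machine (rank 0):
accelerate launch --multi_gpu --num_machines=2 --machine_rank=0 --num_processes=4 --main_process_ip=192.168.1.10 --main_process_port=29501 "train_network.py" args...

On the second machine, run the same command with --machine_rank=1.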
