
Fix multi-gpu SDXL training #1000

Merged · merged 1 commit into kohya-ss:dev on Dec 13, 2023

Conversation

Isotr0py (Contributor)

These changes are related to text encoder training, so I tested them by training the text encoders on 2 GPUs (due to limited VRAM).

@FurkanGozukara

Are you sure the last layer of text_encoder1 is not trained? I don't want single-GPU training to be broken.
By the way, I did dual-GPU training on Kaggle and it works with no errors; I trained Text Encoder 1.

Isotr0py (Contributor, Author) commented Dec 12, 2023

@FurkanGozukara

SDXL uses the output of the penultimate layer of Text Encoder 1 instead of the last layer. As a result, the last layer does not participate in the loss calculation, and DDP raises a RuntimeError during the backward pass because that layer receives no gradient.

When I reproduced the RuntimeError, the grad of text_encoder1's last layer was None in both single-GPU and multi-GPU runs.

Can you check the grad of text_encoder1's last layer, or compare the grads between devices? It seems the grads between GPUs are not synced correctly.
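For context, here is a minimal sketch (not the training script's actual code; the checkpoint, prompt, and dummy loss are purely illustrative) of why the last layer never receives a gradient: SDXL conditions on the penultimate hidden state of Text Encoder 1, so the final transformer layer and final_layer_norm are bypassed entirely.

from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative checkpoint; SDXL's Text Encoder 1 is a CLIP ViT-L text model.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder1 = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a photo of a cat", return_tensors="pt")
out = text_encoder1(**tokens, output_hidden_states=True)

# The penultimate hidden state bypasses the last transformer layer and
# final_layer_norm, so their parameters never contribute to the loss.
penultimate = out.hidden_states[-2]
loss = penultimate.float().pow(2).mean()  # dummy loss, just for illustration
loss.backward()

# Prints the layer-11 and final_layer_norm parameter names: their .grad is None.
print([k for k, v in text_encoder1.named_parameters() if v.grad is None])

Under DDP, parameters that never receive a gradient like this are what trigger the RuntimeError unless the wrapper is configured to tolerate them.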

@FurkanGozukara


How can I check the last-layer difference? I have a trained model right now that I can compare.

Isotr0py (Contributor, Author) commented Dec 12, 2023

You can just add a print after accelerator.backward(loss) in the training loop:

print([k for k,v in text_encoder1.named_parameters() if v.grad is None])

In both single-GPU and multi-GPU runs, it should output:

['text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.11.self_attn.k_proj.bias', 'text_model.encoder.layers.11.self_attn.v_proj.weight', 'text_model.encoder.layers.11.self_attn.v_proj.bias', 'text_model.encoder.layers.11.self_attn.q_proj.weight', 'text_model.encoder.layers.11.self_attn.q_proj.bias', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.11.self_attn.out_proj.bias', 'text_model.encoder.layers.11.layer_norm1.weight', 'text_model.encoder.layers.11.layer_norm1.bias', 'text_model.encoder.layers.11.mlp.fc1.weight', 'text_model.encoder.layers.11.mlp.fc1.bias', 'text_model.encoder.layers.11.mlp.fc2.weight', 'text_model.encoder.layers.11.mlp.fc2.bias', 'text_model.encoder.layers.11.layer_norm2.weight', 'text_model.encoder.layers.11.layer_norm2.bias', 'text_model.final_layer_norm.weight', 'text_model.final_layer_norm.bias']

Or you can print the values and compare between devices (on the main branch, this will print different weight/grad values on each device):

print(accelerator.device, THE WEIGHT/GRAD YOU WANT TO COMPARE)
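A slightly fuller sketch of that check (variable names like text_encoder1 and accelerator are assumed from the training script, and the weight chosen for comparison is arbitrary):

# Sketch only, placed right after accelerator.backward(loss).
unwrapped_te1 = accelerator.unwrap_model(text_encoder1)

# 1. Parameters that never received a gradient
#    (expected: the layer-11 and final_layer_norm entries listed above).
missing = [k for k, v in unwrapped_te1.named_parameters() if v.grad is None]
print(accelerator.device, "params with no grad:", missing)

# 2. Compare one concrete weight across devices; if DDP syncs gradients
#    correctly, every rank should print the same value after the optimizer step.
w = unwrapped_te1.text_model.encoder.layers[0].self_attn.q_proj.weight
print(accelerator.device, "layer0 q_proj weight sum:", w.detach().sum().item())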

jihnenglin added commits to jihnenglin/sd-scripts that referenced this pull request on Dec 13, 2023
kohya-ss merged commit 471d274 into kohya-ss:dev on Dec 13, 2023 (1 check passed)
kohya-ss (Owner)

Thank you so much! I hope this will finally solve the DDP training issue.

I've changed it to set DistributedDataParallelKwargs only if one of the arguments is specified, and also renamed the arguments to ddp_*, to match the other option --ddp_timeout and to make it clear that they are DDP-only. I appreciate your understanding.
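A rough sketch of what that conditional setup could look like with Accelerate (the ddp_gradient_as_bucket_view and ddp_static_graph argument names are assumptions following the ddp_* naming described above, not necessarily the script's actual flags):

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# `args` is the parsed argparse namespace of the training script (assumed).
kwargs_handlers = []
if args.ddp_gradient_as_bucket_view or args.ddp_static_graph:
    # Only build DistributedDataParallelKwargs when a DDP option is requested,
    # so non-DDP and single-GPU runs keep the default behavior.
    kwargs_handlers.append(
        DistributedDataParallelKwargs(
            gradient_as_bucket_view=args.ddp_gradient_as_bucket_view,
            static_graph=args.ddp_static_graph,
        )
    )

accelerator = Accelerator(kwargs_handlers=kwargs_handlers if kwargs_handlers else None)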

@jihnenglin

Thank you very much! I'm still running the SDXL training script, but the output images so far are very promising. Great improvements in detail and texture.

@FurkanGozukara

Multi-GPU training is broken on Kaggle. Does anyone have a guess how to fix it?

#1272
