Better implementation for te autocast #895

Merged
merged 7 commits into kohya-ss:dev on Oct 28, 2023

Conversation

KohakuBlueleaf
Contributor

When we disable training for the TE, we do not prepare it, so we need to explicitly convert it to the target dtype (otherwise it stays in fp32, which may not be the expected behavior).

So basically I do two things here:
1. Explicitly convert the TE to the target dtype/device when it is not being trained.
2. Explicitly add autocast for the TE, since sometimes it may not be prepared (or will definitely not be prepared, e.g. a cached TE).
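
Roughly, in hypothetical code (names like `text_encoder`, `weight_dtype`, `device`, `train_text_encoder`, and `input_ids` are placeholders, not the exact variables in the scripts):

```python
import torch

# Placeholder names; a sketch of the idea, not the actual PR code.
# (1) When the TE is not trained it never goes through accelerator.prepare(),
#     so move/cast it to the training dtype and device by hand.
if not train_text_encoder:
    text_encoder.to(device, dtype=weight_dtype)
    text_encoder.requires_grad_(False)
    text_encoder.eval()

# (2) Wrap the TE forward pass in autocast so every op runs in a supported dtype,
#     even when the TE was not prepared (e.g. when its outputs are being cached).
with torch.autocast(device_type="cuda", dtype=weight_dtype,
                    enabled=weight_dtype != torch.float32):
    encoder_hidden_states = text_encoder(input_ids)[0]
```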

@kohya-ss
Owner

Thank you for this! It seems to be very good. I will review and merge it soon.

Perhaps, when we cache Text Encoder outputs, another option might be to change the dtype of the Text Encoders to fp16/bf16 in advance...

@KohakuBlueleaf
Contributor Author

> Thank you for this! It seems to be very good. I will review and merge it soon.
>
> Perhaps, when we cache Text Encoder outputs, another option might be to change the dtype of the Text Encoders to fp16/bf16 in advance...

Yeah, I added `te.to(weight_dtype)` as well.

@KohakuBlueleaf
Contributor Author

@kohya-ss Oh, I see what you mean.
Basically, it is better to cache the TE outputs in weight_dtype?
I also found that sdxl_train.py caches the TE outputs in fp32.
I think this is bad (the cached context is actually huge, and fp32 slows down the caching process).
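
For reference, a minimal sketch of caching the outputs in weight_dtype instead of fp32 (placeholder names again, not the actual sdxl_train.py code):

```python
import torch

# Run the TE under autocast and store the cached outputs in fp16/bf16
# instead of fp32, which roughly halves the cache size.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=weight_dtype):
    hidden_states = text_encoder(input_ids)[0]
cached_hidden_states = hidden_states.to("cpu", dtype=weight_dtype)
```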

@KohakuBlueleaf
Contributor Author

@kohya-ss I added autocast TE caching to sdxl_train.py, and also added separate LR settings for the TE in sdxl_train.py, which was requested by @Linaqruf.
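
A minimal sketch of the separate LR setting via optimizer parameter groups, assuming a hypothetical `learning_rate_te` argument (the real flag and variable names may differ):

```python
import torch

# Hypothetical names for illustration: unet, text_encoder1, text_encoder2,
# args.learning_rate, args.learning_rate_te.
params_to_optimize = [
    {"params": list(unet.parameters()), "lr": args.learning_rate},
    {"params": list(text_encoder1.parameters()), "lr": args.learning_rate_te},
    {"params": list(text_encoder2.parameters()), "lr": args.learning_rate_te},
]
optimizer = torch.optim.AdamW(params_to_optimize)
```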

@kohya-ss
Owner

> Basically, it is better to cache the TE outputs in weight_dtype?

I think so. In my understanding, it is the same as applying autocast.
(However, we convert the TEs back to float32 and move them to the CPU for sampling images.)

> @kohya-ss I added autocast TE caching to sdxl_train.py, and also added separate LR settings for the TE in sdxl_train.py, which was requested by @Linaqruf.

This is really nice!

@KohakuBlueleaf
Contributor Author

> Basically, it is better to cache the TE outputs in weight_dtype?

> I think so. In my understanding, it is the same as applying autocast. (However, we convert the TEs back to float32 and move them to the CPU for sampling images.)

> @kohya-ss I added autocast TE caching to sdxl_train.py, and also added separate LR settings for the TE in sdxl_train.py, which was requested by @Linaqruf.

> This is really nice!

autocast and changing the dtype are actually different.
autocast casts the tensors for each operation to the target dtype (if the backend supports it for that op) and then runs the computation.
So even if you convert the weights directly, you still need autocast to make sure every operation runs correctly.
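
A small standalone illustration of the difference (not code from this PR):

```python
import copy
import torch

layer = torch.nn.Linear(8, 8).cuda()
x = torch.randn(1, 8, device="cuda")

# (a) Converting the weights directly: parameters are stored in fp16 and every
#     op runs in fp16, including ones that are numerically fragile in half precision.
layer_fp16 = copy.deepcopy(layer).half()
out_a = layer_fp16(x.half())

# (b) autocast: the stored weights are untouched; each op is executed in the dtype
#     the backend picks for it (matmuls in fp16, some reductions kept in fp32).
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out_b = layer(x)

# Even if the weights are already converted to fp16, wrapping the forward pass
# in autocast helps ensure every operation still runs in a supported dtype.
```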

@kohya-ss
Owner

> autocast and changing the dtype are actually different.
> autocast casts the tensors for each operation to the target dtype (if the backend supports it for that op) and then runs the computation.
> So even if you convert the weights directly, you still need autocast to make sure every operation runs correctly.

Hmm, thank you for the clarification. When generating images, we call the model converted to float16 or bfloat16 directly, so I thought there would be no difference, but it is better to use autocast.

@Linaqruf
Contributor

> @kohya-ss I added autocast TE caching to sdxl_train.py, and also added separate LR settings for the TE in sdxl_train.py, which was requested by @Linaqruf.

Nice 😆

@kohya-ss
Owner

When I apply this PR and train with sdxl_train.py including the Text Encoder, it seems that neither the U-Net nor the Text Encoder is trained. When training only the U-Net, there is no problem.

I will investigate further, but I think that perhaps multiple models may not work when specified like `{"params": ..., "lr": ...}`...

@KohakuBlueleaf
Contributor Author

> When I apply this PR and train with sdxl_train.py including the Text Encoder, it seems that neither the U-Net nor the Text Encoder is trained. When training only the U-Net, there is no problem.
>
> I will investigate further, but I think that perhaps multiple models may not work when specified like `{"params": ..., "lr": ...}`...

OK, that makes sense.
I will check that as well!

@KohakuBlueleaf
Contributor Author

@kohya-ss It is weird.
I checked my last training run with this fork, and it is definitely learning things.

However, it cannot run in full fp16 (full fp16 does not work, full bf16 works); it fails with:

`AssertionError: No inf checks were recorded for this optimizer.`

@araleza

araleza commented Oct 25, 2023

Hi, I think you may need to apply the fix to the Text Encoder learning-rate code in the `--block_lr` section that I found here:

#890 (comment)

@KohakuBlueleaf
Contributor Author

@araleza @kohya-ss This is weird to me.
In my implementation, both the TE and the U-Net only put their parameter generators into the params dict.
But when I tried to train it, it definitely learns things. (Expected: it should learn nothing, since the parameter generators get exhausted once they are consumed.)
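
A toy example of why handing a raw `parameters()` generator around is fragile (purely illustrative, not the PR code):

```python
import torch.nn as nn

model = nn.Linear(4, 4)
gen = model.parameters()                # a generator, not a list

n_params = sum(p.numel() for p in gen)  # iterating it once (e.g. to count/log params)...
leftover = list(gen)                    # ...leaves nothing for the optimizer: []

# Materializing the parameters avoids the problem:
param_group = {"params": list(model.parameters()), "lr": 1e-4}
```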

@KohakuBlueleaf
Contributor Author

But I made a fix for it anyway.

@KohakuBlueleaf
Contributor Author

@kohya-ss I also added a manual timeout setting for DDP, because with multi-GPU training on some large datasets it is very likely to exceed the default timeout (30 min) while caching the latents or the Text Encoder outputs.
(For example, the dataset I'm using needs literally 3 days to cache the latents.)
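
A sketch of raising the timeout through accelerate, assuming a hypothetical `--ddp_timeout` argument given in minutes:

```python
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the process-group timeout above the 30-minute default so that long
# latent/TE caching on large datasets does not kill the multi-GPU run.
# args.ddp_timeout is a placeholder flag name for this sketch.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(minutes=args.ddp_timeout))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])
```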

@FurkanGozukara

When can we expect this to be merged? Thank you so much.

@kohya-ss
Owner

@KohakuBlueleaf
Thanks for the various updates! The script seems to be working fine.

I would like to add some features after the merge, such as specifying independent learning rates for Text Encoder 1 and 2, excluding models with a learning rate of 0 from the optimizer parameters, etc.

kohya-ss merged commit 1cefb2a into kohya-ss:dev on Oct 28, 2023
1 check passed
@FurkanGozukara

> @KohakuBlueleaf Thanks for the various updates! The script seems to be working fine.
>
> I would like to add some features after the merge, such as specifying independent learning rates for Text Encoder 1 and 2, excluding models with a learning rate of 0 from the optimizer parameters, etc.

So this will support both SD 1.5 and SDXL?

wkpark pushed a commit to wkpark/sd-scripts that referenced this pull request Feb 27, 2024