
[Experimental Feature] FP8 weight dtype for base model when running train_network (or sdxl_train_network) #1057

Merged: 31 commits into kohya-ss:dev on Jan 20, 2024

Conversation

KohakuBlueleaf
Contributor

Based on the sd-webui PR on utilizing FP8, we can assume that FP8 can also be applied to the base model in train_network, since we never update its weights and only need it for the forward computation.

So I implemented a first version of FP8 support in your framework, and it works well!

I actually uploaded an experimental model for FP8 training quite early; it consumes only about 6.x GB of VRAM when training SDXL with LyCORIS/LoRA.
If we also cache the latents and the text encoder outputs, we can train everything with as little as 4.4 GB of VRAM, which is incredible.
(All of the above experiments were done in a 1024x1024, batch size 1 setup.)

I think this is good for the SDXL community.

BTW, my implementation relies on autocast right now, which may be good news for old GPU users or IPEX users. (Although I think IPEX actually has autocast support, it is just slower than a manual cast.)

If you think it is a good idea, I can also try to make a PR for manual casting. I have already verified that it can be used for training, but it may need some modification.
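
To make the idea concrete, here is a minimal sketch (not the code in this PR; `load_sdxl_unet`, `latents`, `timesteps`, and `text_embeds` are placeholders, and it assumes a PyTorch build with the float8 dtypes):

```python
import torch

# Frozen base model: store its weights in FP8; they are never updated.
unet = load_sdxl_unet()                      # placeholder loader for any frozen nn.Module
unet.requires_grad_(False)
unet.to(device="cuda", dtype=torch.float8_e4m3fn)

# Autocast upcasts the FP8 weights to the compute dtype for matmul/conv ops,
# which is what this approach relies on.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    noise_pred = unet(latents, timesteps, text_embeds)
```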

@FurkanGozukara

Amazing work. Have you compared results with full BF16 training?

@KohakuBlueleaf
Contributor Author

Amazing work. Have you compared results with full BF16 training?

I did some comparisons a few months ago.

They do have some subtle differences, but it is hard to say whether those are quality differences or performance differences.

They are just... different.

@FurkanGozukara

Thanks for the reply. So it will train in FP8 and then save as FP16 as usual?

@KohakuBlueleaf
Contributor Author

Thanks for the reply. So it will train in FP8 and then save as FP16 as usual?

The trainable part will not be converted to FP8.

@FurkanGozukara

Thanks for the reply. So it will train in FP8 and then save as FP16 as usual?

The trainable part will not be converted to FP8.

Can you elaborate more? For example, when training SDXL with DreamBooth we train both the UNet and the text encoder, all parts I think? Or am I missing something? Thank you.

@kohya-ss
Owner

Thank you for this PR! The changes are fewer than I expected. I will check as soon as possible.

@KohakuBlueleaf
Contributor Author

Thank you for this PR! The changes are fewer than I expected. I will check as soon as possible.

I tested it a lot, and basically what we did in the past is the same idea as FP8:
in the past we had an fp16 base + fp32 network; now we just change to an fp8 base.

The caveat is also similar: you need autocast.

So any part of the computation that doesn't run under autocast may have problems, but they can be solved easily.
(For example caching the TE outputs, though I think that should be fine. If you find a problem with it, maybe we can consider letting the user enable autocast for the TE caching procedure?)
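
As a rough sketch of that TE caching point (a hypothetical helper, not the sd-scripts code): if the text encoder is also stored in FP8, the caching pass can simply be wrapped in autocast so the cached outputs come out in a normal compute dtype.

```python
import torch

@torch.no_grad()
def cache_text_encoder_outputs(text_encoder, token_batches, device="cuda"):
    # text_encoder may hold FP8 weights; autocast upcasts them during the forward pass.
    cached = []
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        for tokens in token_batches:
            hidden = text_encoder(tokens.to(device))[0]   # e.g. last_hidden_state
            cached.append(hidden.float().cpu())           # keep the cache in fp32 on CPU
    return cached
```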

@KohakuBlueleaf
Contributor Author

Thanks for the reply. So it will train in FP8 and then save as FP16 as usual?

The trainable part will not be converted to FP8.

Can you elaborate more? For example, when training SDXL with DreamBooth we train both the UNet and the text encoder, all parts I think? Or am I missing something? Thank you.

This PR is for LoRA/LyCORIS/hypernetwork (losalina) training,
which means the base model (UNet/TE) is frozen. We only train the additional network, and that trainable part should stay in higher precision like fp16/bf16/fp32. But since these trainable parts are small compared to the original UNet/TE, it doesn't matter.
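
As a toy illustration of that split (stand-in modules, not the real training loop): the frozen base is only stored in FP8, while the small trainable adapter stays in full precision and is the only thing handed to the optimizer.

```python
import torch
import torch.nn as nn

# Stand-ins: "base" plays the role of the frozen UNet/TE, "adapter" the LoRA/LyCORIS network.
base = nn.Linear(16, 16)
adapter = nn.Linear(16, 16)

base.requires_grad_(False)
base.to(torch.float8_e4m3fn)      # FP8 storage only; no gradients or updates for the base

adapter.requires_grad_(True)      # trainable part stays in fp32 (or fp16/bf16)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```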

@laksjdjf
Contributor

Hi, I have a question about this PR.
Is float8_e4m3fn better than float8_e5m2?

@KohakuBlueleaf
Contributor Author

Hi, I have a question about this PR. Is float8_e4m3fn better than float8_e5m2?

Yes.
Some papers even claim e3m4 or e2m5 are better.

I chose e4m3 based on my experiments.

If we had a better scaling method, maybe we could consider e5m2, but since we don't use FP8 for the computation here, I think the better precision is more important.
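
For reference, the numeric trade-off between the two formats can be inspected directly (assumes a PyTorch build with the float8 dtypes): e5m2 has more range, e4m3 has finer precision, and precision is what matters when the weights are only stored in FP8 rather than multiplied in FP8.

```python
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    # float8_e4m3fn: max = 448, eps = 0.125; float8_e5m2: max = 57344, eps = 0.25
    print(dtype, "max:", info.max, "eps:", info.eps)
```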

@laksjdjf
Contributor

Thanks!

@sdbds
Contributor

sdbds commented Jan 18, 2024

good job

kohya-ss merged commit 9cfa68c into kohya-ss:dev on Jan 20, 2024
1 check passed
@kohya-ss
Owner

Thank you again for the great work!

kohya-ss added a commit that referenced this pull request Jan 20, 2024
Disty0 pushed a commit to Disty0/sd-scripts that referenced this pull request Jan 28, 2024
[Experimental Feature] FP8 weight dtype for base model when running train_network (or sdxl_train_network) (kohya-ss#1057)

* Add fp8 support

* remove some debug prints

* Better implementation for te

* Fix some misunderstanding

* as same as unet, add explicit convert

* better impl for convert TE to fp8

* fp8 for not only unet

* Better cache TE and TE lr

* match arg name

* Fix with list

* Add timeout settings

* Fix arg style

* Add custom seperator

* Fix typo

* Fix typo again

* Fix dtype error

* Fix gradient problem

* Fix req grad

* fix merge

* Fix merge

* Resolve merge

* arrangement and document

* Resolve merge error

* Add assert for mixed precision
Disty0 pushed a commit to Disty0/sd-scripts that referenced this pull request Jan 28, 2024
wkpark pushed a commit to wkpark/sd-scripts that referenced this pull request Feb 27, 2024