
Add support for torch.compile #1024

Merged
merged 2 commits into kohya-ss:dev on Jan 4, 2024

Conversation

p1atdev
Contributor

@p1atdev p1atdev commented Dec 26, 2023

Added:

  • New options, --torch_compile and --dynamo_backend
    • --torch_compile: Enables torch.compile. Default is False.
      • This option is currently incompatible with --xformers. Please use the --sdpa option instead.
    • --dynamo_backend: The backend used with torch.compile. Default is "inductor". "eager", "aot_eager", "inductor", "aot_ts_nvfuser", "nvprims_nvfuser", "cudagraphs", "ofi", "fx2trt", "onnxrt" are available, but most are not tested.
      • inductor and eager have been confirmed to work (see the sketch after this list).
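
For readers new to the feature, the two flags correspond conceptually to wrapping the model with torch.compile and picking a TorchDynamo backend. A minimal, illustrative sketch only (toy module and shapes; this is not the actual wiring inside the training scripts):

import torch
import torch.nn as nn

# Toy module standing in for the real network; purely illustrative.
model = nn.Sequential(nn.Linear(320, 320), nn.GELU(), nn.Linear(320, 320))

# Conceptually, --torch_compile wraps the model like this, and
# --dynamo_backend selects the backend passed to torch.compile.
model = torch.compile(model, backend="inductor")  # or "eager", "aot_eager", ...

x = torch.randn(4, 77, 320)
y = model(x)  # the first call triggers compilation; later calls reuse the compiled graph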

Changed:

  • Bumped the einops version from 0.6.0 to 0.6.1 for compatibility with torch.compile. (more information)

Related:

@p1atdev p1atdev changed the title from "Add support torch.compile" to "Add support for torch.compile" Dec 26, 2023
@FurkanGozukara

torch.compile is not available on Windows, right?

Also, what improvements/changes does it bring?

@p1atdev
Contributor Author

p1atdev commented Dec 26, 2023

Yes, torch.compile does not work on Windows, but it works on WSL.

In my small experiment, training with options --sdpa, --torch_compile and --dynamo_backend eager was faster than --xformers only. (RTX 3070Ti)

wandb: https://wandb.ai/p1atdev/sd-scripts-torch_compile/workspace?workspace=user-p1atdev

Training is very slow for the first few steps after torch.compile kicks in, but then it gets faster. Also, torch.compile is expected to perform well on modern NVIDIA GPUs (like the H100, A100, or V100).

https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

Therefore, using torch.compile may be faster than just using xformers for large training runs.
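
To illustrate the warm-up effect in isolation, here is a hypothetical micro-benchmark (toy model and made-up sizes, CUDA assumed; actual numbers depend heavily on the GPU and backend):

import time
import torch
import torch.nn as nn

# Hypothetical micro-benchmark; the model and sizes are made up.
model = torch.compile(nn.Linear(1024, 1024).cuda(), backend="inductor")
x = torch.randn(64, 1024, device="cuda")

for step in range(5):
    start = time.perf_counter()
    model(x)
    torch.cuda.synchronize()
    # step 0 includes graph capture and kernel compilation, so it is much slower;
    # later steps reuse the cached compiled code.
    print(f"step {step}: {time.perf_counter() - start:.3f}s")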

@FurkanGozukara

> Yes, torch.compile does not work on Windows, but it works on WSL.
>
> In my small experiment, training with options --sdpa, --torch_compile and --dynamo_backend eager was faster than --xformers only. (RTX 3070Ti)
>
> wandb: https://wandb.ai/p1atdev/sd-scripts-torch_compile/workspace?workspace=user-p1atdev
>
> Training is very slow for the first few steps after torch.compile kicks in, but then it gets faster. Also, torch.compile is expected to perform well on modern NVIDIA GPUs (like the H100, A100, or V100).
>
> https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
>
> Therefore, using torch.compile may be faster than just using xformers for large training runs.

Thanks. Can you tell me the it/s difference?

By the way, the best-looking example is xformers.

@p1atdev
Contributor Author

p1atdev commented Dec 27, 2023

The following are screenshots taken during longer trainings:

  • First run with --sdpa, --torch_compile and --dynamo_backend eager
    [screenshot]

  • First run with --xformers
    [screenshot]

  • Second run with --sdpa, --torch_compile and --dynamo_backend eager
    [screenshot]

wandb: https://wandb.ai/p1atdev/pvc-torch_compile

I'm not familiar with torch.compile so I don't know the exact details, but I think torch.compile requires a certain number of warm-up steps, so the second training run with torch.compile is faster than xformers.

@kohya-ss kohya-ss changed the base branch from main to dev January 4, 2024 01:49
@kohya-ss kohya-ss merged commit 07bf2a2 into kohya-ss:dev Jan 4, 2024
1 check passed
@kohya-ss
Owner

kohya-ss commented Jan 4, 2024

Sorry for the delay. Thank you so much for this great PR! I don't use Linux/WSL personally, but this is really nice!

@sdbds
Contributor

sdbds commented Jan 4, 2024

> Yes, torch.compile does not work on Windows, but it works on WSL.
>
> In my small experiment, training with options --sdpa, --torch_compile and --dynamo_backend eager was faster than --xformers only. (RTX 3070Ti)
>
> wandb: https://wandb.ai/p1atdev/sd-scripts-torch_compile/workspace?workspace=user-p1atdev
>
> Training is very slow for the first few steps after torch.compile kicks in, but then it gets faster. Also, torch.compile is expected to perform well on modern NVIDIA GPUs (like the H100, A100, or V100).
>
> https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
>
> Therefore, using torch.compile may be faster than just using xformers for large training runs.

I noticed that this requires accelerate 0.0.25, not the 0.0.23 that sd-scripts currently pins.
We should upgrade the dependency version, otherwise a lot of options won't work.
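
For context, recent accelerate releases expose torch.compile through the dynamo_backend argument of Accelerator, which is presumably why the pinned version matters here. A minimal sketch, assuming that is roughly how the option is wired (the PR's actual integration may differ):

import torch
from accelerate import Accelerator

model = torch.nn.Linear(320, 320)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Newer accelerate versions accept a TorchDynamo backend name directly;
# an older pinned version may not expose this, hence the concern above.
accelerator = Accelerator(dynamo_backend="inductor")
model, optimizer = accelerator.prepare(model, optimizer)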

kohya-ss added a commit that referenced this pull request Jan 4, 2024
@kohya-ss
Owner

kohya-ss commented Jan 4, 2024

I updated accelerate to 0.0.25. I hope this makes this PR work.

@FurkanGozukara

@p1atdev

What does "second time of training" mean?

@dill-shower

> The following are screenshots taken during longer trainings:

Can you please test the speed with the inductor backend?

@kohya-ss
Owner

@p1atdev What version of PyTorch do you recommend, 2.1 or does 2.0 work fine? I would like to mention it in the documentation when updating.

@p1atdev
Contributor Author

p1atdev commented Jan 13, 2024

I tested with PyTorch version 2.1.2+cu118 and it worked. Also according to the PyTorch release notes, torch.compile is more stable in version 2.1 or later.

https://github.com/pytorch/pytorch/releases/tag/v2.1.0

@kohya-ss
Owner

Thank you for the clarification!

@feffy380
Contributor

feffy380 commented Jan 18, 2024

@p1atdev What training script did you test with? With train_network.py (SD1.x lora) it crashes in the sdpa forward function:

torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function rearrange at 0x775e95df74c0>(*(FakeTensor(..., device='cuda:0', size=(4, s1, 320), dtype=torch.float16,
           grad_fn=<CloneBackward0>), 'b n (h d) -> b h n d'), **{'h': 8}):
unhashable type: non-singleton SymInt

from user code:
   File "/home/hope/src/sd/sd-scripts/library/original_unet.py", line 741, in resume_in_forward_sdpa_at_739
    q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q_in, k_in, v_in))
  File "/home/hope/src/sd/sd-scripts/library/original_unet.py", line 741, in <lambda>
    q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q_in, k_in, v_in))

(However, I am using pytorch-rocm 2.3 nightly and don't know if torch.compile is fully supported on AMD in the first place)
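
For what it's worth, the rearrange call in that traceback can also be expressed with native tensor ops, which keeps einops out of the compiled graph. A hedged workaround sketch, illustrative only (split_heads is a hypothetical helper, not part of sd-scripts):

import torch

def split_heads(t: torch.Tensor, h: int) -> torch.Tensor:
    # Same result as einops: rearrange(t, "b n (h d) -> b h n d", h=h)
    b, n, hd = t.shape
    return t.reshape(b, n, h, hd // h).permute(0, 2, 1, 3)

q_in = torch.randn(4, 77, 320)
q = split_heads(q_in, h=8)  # shape: (4, 8, 77, 40)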

Disty0 pushed a commit to Disty0/sd-scripts that referenced this pull request Jan 28, 2024
@jdack41

jdack41 commented Feb 9, 2024

I've got the same error on Mac, and also the same error on WSL with CUDA. The Mac's torch version is 2.2.0, and WSL has 2.1.2.

@jdack41

jdack41 commented Feb 10, 2024

@p1atdev @kohya-ss
It seems there is an issue when training with sdxl_train.py; sdxl_train_network.py could run.

@ultranationalism

> I've got the same error on Mac, and also the same error on WSL with CUDA. The Mac's torch version is 2.2.0, and WSL has 2.1.2.

Same error on torch 2.2.0+cu118.

@ultranationalism

> @p1atdev @kohya-ss It seems there is an issue when training with sdxl_train.py; sdxl_train_network.py could run.

Try upgrading einops to the latest version on torch 2.1.2+cu118.

@iamargentum

I updated my torch to 2.1.1 and tried training with the --torch_compile flag, but it keeps failing with this error: "LayerNormKernelImpl" not implemented for 'Half'.
Any idea what this is and how it could be fixed?

@jdack41

jdack41 commented Feb 13, 2024

> @p1atdev @kohya-ss It seems there is an issue when training with sdxl_train.py; sdxl_train_network.py could run.

> Try upgrading einops to the latest version on torch 2.1.2+cu118.

Thank you for the reply. This solved the unhashable error on both Mac and WSL (updated einops to 0.7.0).
But the saved weights cause a NansException in Automatic1111 (not LoRA; fine-tuned weights).

@jdack41

jdack41 commented Feb 15, 2024

https://discuss.pytorch.org/t/how-to-save-load-a-model-with-torch-compile/179739/2
According to this thread, torch.compile adds the prefix '_orig_mod.' to the keys of the model's state_dict().
So removing '_orig_mod.' when saving the weights solved the NansException, like this:

    def update_sd(prefix, sd):
        for k, v in sd.items():
            # strip the '_orig_mod.' prefix that torch.compile adds to parameter names
            key = prefix + k.replace('_orig_mod.', '')
            if save_dtype is not None:
                v = v.detach().clone().to("cpu").to(save_dtype)
            state_dict[key] = v
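
A related detail: torch.compile returns an OptimizedModule that keeps the original module under ._orig_mod, so the prefix can also be avoided by reading the state dict from the unwrapped module. A small standalone sketch (not the sd-scripts code; illustrative only):

import torch

model = torch.nn.Linear(320, 320)
compiled = torch.compile(model)

# Option 1: strip the "_orig_mod." prefix that torch.compile adds to the keys.
clean_sd = {k.replace("_orig_mod.", ""): v for k, v in compiled.state_dict().items()}

# Option 2: read the state dict from the original, unwrapped module instead.
clean_sd = compiled._orig_mod.state_dict()

torch.save(clean_sd, "model.pt")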

@jdack41

jdack41 commented Feb 15, 2024

Also, saving LoRA probably needs to do the same, or the saved file will be structurally broken.

@jdack41 jdack41 mentioned this pull request Feb 15, 2024