
Dreambooth doesn't train on 8GB #807

Closed
devilismyfriend opened this issue Oct 12, 2022 · 50 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

@devilismyfriend

Describe the bug

Per the example featured in the repo, it goes OOM when DeepSpeed is loading the optimizer, tested on a 3080 10GB + 64GB RAM in WSL2 and native Linux.

Reproduction

Follow the pastebin for setup purposes (on WSL2), or just try it yourself https://pastebin.com/0NHA5YTP

Logs

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `8` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2022-10-11 17:16:38,700] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2022-10-11 17:16:48,338] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.3, git-hash=unknown, git-branch=unknown
[2022-10-11 17:16:50,220] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-10-11 17:16:50,221] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-10-11 17:16:50,221] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-10-11 17:16:50,271] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = {basic_optimizer.__class__.__name__}
[2022-10-11 17:16:50,272] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-10-11 17:16:50,272] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:134:__init__] Reduce bucket size 500000000
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:135:__init__] Allgather bucket size 500000000
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:136:__init__] CPU Offload: True
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:137:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.15820693969726562 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)]
[2022-10-11 17:16:52,613] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-11 17:16:52,614] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 3.27 GB         Max_CA 3 GB
[2022-10-11 17:16:52,614] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 7.68 GB, percent = 16.3%
Traceback (most recent call last):
  File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 598, in <module>
    main()
  File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 478, in main
    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 679, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 890, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/__init__.py", line 124, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 320, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1144, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1395, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 512, in __init__
    self.initialize_optimizer_states()
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 599, in initialize_optimizer_states
    i].grad = single_grad_partition.pin_memory(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59) of binary: /root/anaconda3/envs/diffusers-ttl/bin/python

System Info

3080 10GB + 64GB RAM, WSL2 and Linux

@devilismyfriend devilismyfriend added the bug Something isn't working label Oct 12, 2022
@Ttl
Contributor

Ttl commented Oct 12, 2022

It seems that it's failing at pinning the allocated CPU memory. I'm not sure what the issue is but this doesn't seem to be an issue that could be fixed in diffusers. Do you have multiple GPUs? That could limit the maximum pinned memory available: https://forums.developer.nvidia.com/t/max-amount-of-host-pinned-memory-available-for-allocation/56053/7

This pytorch code should do the same thing. It allocates 16 GB of CPU memory and tries to pin it:

import torch

cpu = torch.device('cpu')
alloc_size = 16e9  # target size in bytes

print('Allocating')
# float32 elements are 4 bytes each, so divide the byte count by 4
x = torch.zeros(int(alloc_size / 4),
                dtype=torch.float32,
                device=cpu)
print('Pinning')
x = x.pin_memory()

print('Accessing')
m = torch.mean(x)
assert m == 0
print('Done')

@Thomas-MMJ

Thomas-MMJ commented Oct 12, 2022

Check your WSL config file, it might have your max memory at a much lower value.

They are .wslconfig and wsl.conf

See docs here, https://learn.microsoft.com/en-us/windows/wsl/wsl-config

In powershell you can do

notepad "$env:USERPROFILE/.wslconfig"
you can then check if memory is set, here is mine, but I've made some changes.

[wsl2]
export DISPLAY=:0.0
memory=14GB
swap=14GB

you can change the memory to allow whatever you want but I think WSL2 might cap it to some percentage of system ram.

Edit:

Hmm

your output is,

[2022-10-11 17:16:52,613] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-11 17:16:52,614] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 3.27 GB         Max_CA 3 GB
[2022-10-11 17:16:52,614] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 7.68 GB, percent = 16.3%

So 7.68GB/16.3% = 47 GB of CPU VM total, thus 39GB should be available.

The MA, Max_MA, CA, Max_CA are

     f"MA {round(torch.cuda.memory_allocated() / (1024 * 1024 * 1024),2 )} GB \ 
     Max_MA {round(torch.cuda.max_memory_allocated() / (1024 * 1024 * 1024),2)} GB \ 
     CA {round(torch_memory_reserved() / (1024 * 1024 * 1024),2)} GB \ 
     Max_CA {round(torch_max_memory_reserved() / (1024 * 1024 * 1024))} GB ") 

some memory profiling suggestions here

microsoft/DeepSpeed#1437

@devilismyfriend
Author

> Check your WSL config file, it might have your max memory at a much lower value. […]

Sorry, forgot to mention I already edited the WSL config to allow 48GB of RAM; keep in mind this also didn't work on native Linux.

@Thomas-MMJ

Yeah was editing my reply after I realized your logs suggested you had 48 GB available.

@devilismyfriend
Author

devilismyfriend commented Oct 12, 2022

> It seems that it's failing at pinning the allocated CPU memory. […] This pytorch code should do the same thing. It allocates 16 GB of CPU memory and tries to pin it: […]

Yep, this also fails. I do have an integrated Intel GPU that's turned off in the BIOS.

@Thomas-MMJ

Thomas-MMJ commented Oct 12, 2022

I'd try pinning various sizes, say in 1 GB increments, or, slightly faster but more complicated, a binary search (keep halving till it succeeds, then move by half the gap between the last success and the last failure, repeating up/down until you get close enough); see the sketch below.

That said, this seems like a PyTorch-related bug perhaps, so you might post it in their tracker.

Hmm might also relate to WSL,

microsoft/WSL#8447
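
A minimal sketch of that binary search, using PyTorch's pin_memory() as the probe (the bounds and tolerance here are arbitrary assumptions, not anything measured in this thread):

import torch

def max_pinnable_bytes(hi=64e9, tol=64 * 1024**2):
    # Binary-search the largest single buffer that pin_memory() accepts.
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        x = None
        try:
            # float32 elements are 4 bytes, so ~mid bytes needs mid / 4 elements
            x = torch.empty(int(mid / 4), dtype=torch.float32)
            x = x.pin_memory()
            lo = mid   # pinning succeeded, search higher
        except RuntimeError:
            hi = mid   # allocation or pinning failed, search lower
        finally:
            del x      # release the buffer before the next attempt
    return lo

print(f"largest pinnable buffer: ~{max_pinnable_bytes() / 1e9:.2f} GB")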

@devilismyfriend
Author

> I'd try pinning various sizes, say in 1 GB increments, or, slightly faster but more complicated, a binary search […]

Looks like it fails after 2.147481e9 bytes (~2 GiB).

@Thomas-MMJ

Thomas-MMJ commented Oct 12, 2022

Ah here we go, it is (probably) an NVIDIA driver related limitation under WSL.

Pinned system memory (example: System memory that an application makes resident for GPU accesses) availability for applications is limited. For example, some deep learning training workloads, depending on the framework, model and dataset size used, can exceed this limit and may not work.

https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps

@devilismyfriend
Author

devilismyfriend commented Oct 12, 2022

> Ah here we go, it is (probably) an NVIDIA driver related limitation under WSL.
>
> https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps

Yes, but unfortunately it happens under native Linux as well (Ubuntu), although I did see reports of users having it working sometimes under WSL.

@Thomas-MMJ

Could you try running it under native linux and give the debug output? Perhaps there is a different error/cause for it failing there?

@devilismyfriend
Author

> Could you try running it under native linux and give the debug output? Perhaps there is a different error/cause for it failing there?

I'll have to reinstall it and report back

@Thomas-MMJ

Thomas-MMJ commented Oct 12, 2022

You could also try wsl --update; if your WSL is older it might have different pinning behaviour than current. I'll check my WSL and see what it fails at for pinning.

Edit

Tried it here; confirmed it fails after more than 2.147481e9 bytes in WSL. Also, it is the total pinned memory that matters: I tried allocating various smaller pins, and once the limit was exceeded it failed.
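
A small sketch of that cumulative check (the 256 MB chunk size and the 64 GB safety bound are arbitrary assumptions):

import torch

chunk_bytes = 256 * 1024**2        # 256 MB per chunk -- arbitrary choice
pinned = []                        # keep references so the pins stay resident
total = 0
while total < 64e9:                # safety bound so this stops on an unlimited system
    try:
        # float32 = 4 bytes per element, so chunk_bytes // 4 elements per chunk
        t = torch.empty(chunk_bytes // 4, dtype=torch.float32).pin_memory()
    except RuntimeError:
        break                      # hit the pinned-memory cap
    pinned.append(t)
    total += chunk_bytes

print(f"total pinned before failure: {total / 1e9:.2f} GB")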

@DonStroganotti

DonStroganotti commented Oct 13, 2022

Having the exact same problem on a 2070 SUPER 8GB, 64GB RAM, 3950x, Windows 10 WSL2

The OP of this guide apparently has it working under WSL on a 2070 non-Super: https://www.reddit.com/r/StableDiffusion/comments/xzbc2h/guide_for_dreambooth_with_8gb_vram_under_windows/?sort=new

@Thomas-MMJ

Note you can pass a DeepSpeed config file and set pin_memory to false; this will slow things down but might allow it to work.

See this config that uses NVMe offload; you can just replace NVMe with CPU.

microsoft/DeepSpeed#1130
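
For illustration, a minimal ZeRO stage 2 CPU-offload config with pinning disabled could look like the following (a sketch in line with the config that gets printed later in this thread, not the actual attached file; adjust batch sizes to your run):

{
  "train_batch_size": 1,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": false },
    "offload_param": { "device": "cpu", "pin_memory": false }
  },
  "gradient_clipping": 1.0,
  "fp16": { "enabled": true },
  "zero_allow_untested_optimizer": true
}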

@leszekhanusz
Contributor

For reference, I can report that with my RTX3080 10GB and 32GB of RAM it works correctly on Linux.
I tested on wsl on windows and I had the same problem reported here, with the test code by @Ttl failing.

Note:

  • you have to make sure to have 200 class images in the classes folder or it will complain about not enough VRAM when it tries to generate them automatically. A warning in that instance might be useful to help users.

@devilismyfriend
Author

devilismyfriend commented Oct 13, 2022

> For reference, I can report that with my RTX3080 10GB and 32GB of RAM it works correctly on Linux. […]

I'll try it again later tonight; from my previous tests it had the same issues, but we'll see. Seeing as some folks have it working in WSL, it's bizarre.

@Thomas-MMJ

Thomas-MMJ commented Oct 13, 2022

The config I linked earlier wasn't parsable by Python's json module.

Here is one that parses; I don't know if it is set up correctly (most of you will want to change from "nvme" to "cpu" - and probably delete most of the other parameters).

Note: I had to make it .txt since .json isn't allowed for attachments.

ds_config.txt

@A-Polyana

A-Polyana commented Oct 13, 2022

This issue is very interesting.
I also get OOM (native Ubuntu) and RuntimeError: CUDA error: unknown error (WSL2),
with a 5800X, a 3070 Ti 8GB, and 48GB RAM.
Maybe that error happens when usage goes over 8GB of VRAM.

@Thomas-MMJ

Thomas-MMJ commented Oct 13, 2022

> This issue is very interesting. I also get OOM (native Ubuntu) […]

Pretty sure it is that memory pinning is limited to 2 GB.

please do

accelerate config

then configure it with the following

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 0
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file: ds_config.json
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:1

and use the following config file (change the .txt to .json) with the file in the local directory

ds_config.txt

then run things in the standard way and I think you guys will be sorted out.

@A-Polyana

Thanks! I will test it when I get home.

@devilismyfriend
Author

devilismyfriend commented Oct 13, 2022

> Pretty sure it is that memory pinning is limited to 2 GB. Please do accelerate config […] and use the following config file (change the .txt to .json) with the file in the local directory […] then run things in the standard way and I think you guys will be sorted out.

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `8` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2022-10-12 22:09:39,350] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 598, in <module>
    main()
  File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 479, in main
    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 679, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 763, in _prepare_deepspeed
    if deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"] == "auto":
KeyError: 'train_micro_batch_size_per_gpu'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22677) of binary: /root/anaconda3/envs/diffusers-ttl/bin/python
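
That KeyError is accelerate indexing deepspeed_config["train_micro_batch_size_per_gpu"], so the JSON passed to it apparently needs to declare the batch-size keys itself; the config that eventually runs later in this thread includes, for example:

{
  "train_batch_size": 1,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1
}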

@A-Polyana

A-Polyana commented Oct 13, 2022

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `4` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:342: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
  warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
[2022-10-13 14:49:36,828] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 49344.75it/s]
Traceback (most recent call last):
  File "/home/poly/github/diffusers/examples/dreambooth/train_dreambooth.py", line 592, in <module>
    main()
  File "/home/poly/github/diffusers/examples/dreambooth/train_dreambooth.py", line 345, in main
    sample_dataloader = accelerator.prepare(sample_dataloader)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py", line 679, in prepare    result = self._prepare_deepspeed(*args)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py", line 763, in _prepare_deepspeed
    if deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"] == "auto":
KeyError: 'train_micro_batch_size_per_gpu'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 136) of binary: /home/poly/anaconda3/envs/diffusers/bin/python

Command used:


export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="training"
export CLASS_DIR="classes"
export OUTPUT_DIR="model"
 
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks person" \
  --class_prompt="a photo of person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=20 \
  --max_train_steps=800 \
  --mixed_precision=fp16

Did I make a mistake?

System info
- Platform: 5.10.16.3-microsoft-standard-WSL2 2021 x86_64 x86_64 x86_64 GNU/Linux
- CUDA Toolkit Version: 11.7.1(WSL2, Windows Both)
- Using GPU in script?: 1x RTX 3070TI

@Thomas-MMJ

Thomas-MMJ commented Oct 13, 2022

also make sure you generate your 200 class images before you try and run accelerate.

you probably didn't make a mistake - my hope was just optimistic :)

@A-Polyana

> also make sure you generate your 200 class images before you try and run accelerate.
>
> you probably didn't make a mistake - my hope was just optimistic :)

My WSL doesn't even get to making the class images.
I think something goes wrong before we even reach 8GB of VRAM.
But thanks, that's helpful!

@Squolly

Squolly commented Oct 13, 2022

I had exactly the same issue with my RTX 3070 and 32GB RAM, Ubuntu 22.04 running on WSL2 in Windows 11.

However I just solved it by updating to Windows 22H2 (I had 21H2 previously) and afterwards updating WSL with
wsl --update

Maybe it helps you too.

@A-Polyana

> I had exactly the same issue with my RTX 3070 and 32GB RAM, Ubuntu 22.04 running on WSL2 in Windows 11. However I just solved it by updating to Windows 22H2 (I had 21H2 previously) and afterwards updating WSL with wsl --update […]

OK! Let's try this.

@A-Polyana

A-Polyana commented Oct 13, 2022

> I had exactly the same issue with my RTX 3070 and 32GB RAM […] updating WSL with wsl --update
>
> OK! Let's try this.

I upgraded my Windows 10 21H2 WSL2 to Windows 11 22H2 WSL2.

Now after this it goes OOM (without ds_config) at:
Generating class images: 0%|

If I use ds_config I get the same error.

Code

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `4` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:342: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
  warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
[2022-10-13 22:38:48,180] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 53303.31it/s]
Generating class images:   0%|                                                                   | 0/50 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/poly/github/diffusers/examples/dreambooth/train_dreambooth.py", line 592, in <module>
    main()
  File "/home/poly/github/diffusers/examples/dreambooth/train_dreambooth.py", line 351, in main
    images = pipeline(example["prompt"]).images
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 316, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/models/unet_2d_condition.py", line 322, in forward
    sample = upsample_block(
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/models/unet_blocks.py", line 1149, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/models/attention.py", line 162, in forward
    hidden_states = block(hidden_states, context=context)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/models/attention.py", line 211, in forward
    hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/models/attention.py", line 283, in forward
    hidden_states = self._attention(query, key, value)
  File "/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/models/attention.py", line 291, in _attention
    attention_scores = torch.matmul(query, key.transpose(-1, -2)) * self.scale
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 8.00 GiB total capacity; 4.81 GiB already allocated; 0 bytes free; 6.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 89) of binary: /home/poly/anaconda3/envs/diffusers/bin/python
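
As that last error message itself suggests, one knob worth trying (untested in this thread; 128 is just an example value) is capping the allocator split size before launching:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128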

@Squolly

Squolly commented Oct 13, 2022

Apparently I had already generated class images on earlier tries. When I try to generate new ones I get the same problem as you.

If I generate the class images with some other stable diffusion instance and put them into the class folder everything works (as there are no more class images to be generated). Maybe this could be a workaround.

I currently haven't checked it any further.

Edit:
You can try adding --sample_batch_size=1 to your call to train_dreambooth.py. This helped me with generating the class images. The default seems to be 4, which is probably too high for our graphics cards.

@pink-red
Contributor

To reduce VRAM usage while generating class images, try to use --sample_batch_size=1 (the default is 4). Or generate them on the CPU by using accelerate launch --cpu train_dreambooth.py ..., then stop the script and restart the training on the GPU again.
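
A sketch of that two-step flow (both flags are from the comment above; all other arguments omitted):

# Option A: keep the sampling batch small while generating class images on the GPU
accelerate launch train_dreambooth.py --sample_batch_size=1 ...

# Option B: generate the class images on the CPU first, stop the script, then relaunch on the GPU
accelerate launch --cpu train_dreambooth.py ...
accelerate launch train_dreambooth.py ...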

@A-Polyana

A-Polyana commented Oct 13, 2022

> Apparently I had already generated class images on earlier tries. […] You can try adding --sample_batch_size=1 to your call to train_dreambooth.py. […]

Perfect! Generating class images works now, using 6GB VRAM and 11GB RAM on Win11 WSL2.

Training now uses 7.7GB VRAM and 4.6GB shared memory and runs at about 5~6 s/it.

But:

[2022-10-14 06:49:17,793] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0

I don't know if this [deepspeed] OVERFLOW! message is okay.

Code Out

        `--num_cpu_threads_per_process` was set to `4` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:342: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
  warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
[2022-10-14 06:45:16,567] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Fetching 16 files: 100%|██████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 8007.26it/s]
Generating class images: 100%|██████████████████████████████████████████████████████████| 20/20 [01:48<00:00,  5.43s/it]
Downloading: 100%|█████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 1.11MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 544kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 472/472 [00:00<00:00, 468kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 806/806 [00:00<00:00, 759kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 592/592 [00:00<00:00, 575kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████| 492M/492M [00:47<00:00, 10.4MB/s]
[2022-10-14 06:48:36,774] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.3, git-hash=unknown, git-branch=unknown
[2022-10-14 06:48:37,719] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-10-14 06:48:37,720] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-10-14 06:48:37,720] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-10-14 06:48:37,765] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = {basic_optimizer.__class__.__name__}
[2022-10-14 06:48:37,765] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-10-14 06:48:37,765] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2022-10-14 06:48:37,765] [INFO] [stage_1_and_2.py:134:__init__] Reduce bucket size 500000000
[2022-10-14 06:48:37,765] [INFO] [stage_1_and_2.py:135:__init__] Allgather bucket size 500000000
[2022-10-14 06:48:37,765] [INFO] [stage_1_and_2.py:136:__init__] CPU Offload: True
[2022-10-14 06:48:37,765] [INFO] [stage_1_and_2.py:137:__init__] Round robin gradient partitioning: False
Using /home/poly/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Creating extension directory /home/poly/.cache/torch_extensions/py39_cu116/utils...
Emitting ninja build file /home/poly/.cache/torch_extensions/py39_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/include -isystem /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/include/TH -isystem /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/include/THC -isystem /home/poly/anaconda3/envs/diffusers/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 10.706848859786987 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)]
[2022-10-14 06:48:50,989] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-14 06:48:50,989] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 3.65 GB         CA 3.27 GB         Max_CA 4 GB
[2022-10-14 06:48:50,989] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.12 GB, percent = 35.5%
[2022-10-14 06:48:57,201] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2022-10-14 06:48:57,217] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 3.27 GB         Max_CA 3 GB
[2022-10-14 06:48:57,217] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 21.36 GB, percent = 68.1%
[2022-10-14 06:48:57,217] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2022-10-14 06:48:57,259] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-10-14 06:48:57,259] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 3.27 GB         Max_CA 3 GB
[2022-10-14 06:48:57,259] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 21.36 GB, percent = 68.1%
[2022-10-14 06:48:57,260] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2022-10-14 06:48:57,260] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-10-14 06:48:57,260] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-10-14 06:48:57,260] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[(0.9, 0.999)]
[2022-10-14 06:48:57,261] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
[2022-10-14 06:48:57,261] [INFO] [config.py:991:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   amp_enabled .................. False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   amp_params ................... False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": null,
    "exps_dir": null,
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   bfloat16_enabled ............. False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   checkpoint_tag_validation_enabled  True
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   checkpoint_tag_validation_fail  False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f523e61db20>
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   communication_data_type ...... None
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   curriculum_enabled ........... False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   curriculum_params ............ False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   dataloader_drop_last ......... False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   disable_allgather ............ False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   dump_state ................... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   dynamic_loss_scale_args ...... None
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_enabled ........... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_gas_boundary_resolution  1
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_layer_num ......... 0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_max_iter .......... 100
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_stability ......... 1e-06
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_tol ............... 0.01
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_verbose ........... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   elasticity_enabled ........... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   fp16_auto_cast ............... True
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   fp16_enabled ................. True
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   fp16_master_weights_and_gradients  False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   global_rank .................. 0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   gradient_accumulation_steps .. 1
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   gradient_clipping ............ 1.0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   gradient_predivide_factor .... 1.0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   initial_dynamic_scale ........ 4294967296
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   load_universal_checkpoint .... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   loss_scale ................... 0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   memory_breakdown ............. False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7f523e61da60>
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   optimizer_legacy_fusion ...... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   optimizer_name ............... None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   optimizer_params ............. None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   pld_enabled .................. False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   pld_params ................... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   prescale_gradients ........... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   scheduler_name ............... None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   scheduler_params ............. None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   sparse_attention ............. None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   sparse_gradients_enabled ..... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   steps_per_print .............. inf
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   train_batch_size ............. 1
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   train_micro_batch_size_per_gpu  1
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   wall_clock_breakdown ......... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   world_size ................... 1
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   zero_allow_untested_optimizer  True
[2022-10-14 06:48:57,266] [INFO] [config.py:991:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2022-10-14 06:48:57,266] [INFO] [config.py:991:print]   zero_enabled ................. True
[2022-10-14 06:48:57,266] [INFO] [config.py:991:print]   zero_optimization_stage ...... 2
[2022-10-14 06:48:57,266] [INFO] [config.py:976:print_user_config]   json = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        },
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "fp16": {
        "enabled": true,
        "auto_cast": true
    },
    "zero_allow_untested_optimizer": true
}
Using /home/poly/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0026106834411621094 seconds
Steps:   0%|                                                                                     | 0/50 [00:00<?, ?it/s][2022-10-14 06:49:01,093] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
Steps:   2%|█                                                       | 1/50 [00:03<02:54,  3.56s/it, loss=0.331, lr=5e-6][2022-10-14 06:49:02,128] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
Steps:   4%|██▏                                                     | 2/50 [00:04<01:39,  2.08s/it, loss=0.297, lr=5e-6][2022-10-14 06:49:03,178] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
Steps:   6%|███▎                                                    | 3/50 [00:05<01:15,  1.61s/it, loss=0.169, lr=5e-6][2022-10-14 06:49:04,194] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
Steps:   8%|████▍                                                   | 4/50 [00:06<01:03,  1.37s/it, loss=0.622, lr=5e-6][2022-10-14 06:49:05,312] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
Steps:  10%|█████▌                                                  | 5/50 [00:07<00:57,  1.28s/it, loss=0.183, lr=5e-6][2022-10-14 06:49:06,584] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
Steps:  12%|██████▌                                                | 6/50 [00:09<00:56,  1.28s/it, loss=0.0678, lr=5e-6][2022-10-14 06:49:07,609] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
Steps:  14%|███████▋                                               | 7/50 [00:10<00:51,  1.20s/it, loss=0.0284, lr=5e-6][2022-10-14 06:49:08,677] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
Steps:  16%|████████▉                                               | 8/50 [00:11<00:48,  1.15s/it, loss=0.428, lr=5e-6][2022-10-14 06:49:09,683] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
Steps:  18%|██████████                                              | 9/50 [00:12<00:45,  1.11s/it, loss=0.209, lr=5e-6][2022-10-14 06:49:10,757] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
Steps:  20%|███████████                                            | 10/50 [00:13<00:43,  1.10s/it, loss=0.447, lr=5e-6][2022-10-14 06:49:11,746] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
Steps:  22%|████████████                                           | 11/50 [00:14<00:41,  1.06s/it, loss=0.217, lr=5e-6][2022-10-14 06:49:12,739] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
Steps:  24%|█████████████▏                                         | 12/50 [00:15<00:39,  1.04s/it, loss=0.799, lr=5e-6][2022-10-14 06:49:13,718] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
Steps:  26%|██████████████▎                                        | 13/50 [00:16<00:37,  1.02s/it, loss=0.114, lr=5e-6][2022-10-14 06:49:14,754] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
Steps:  28%|███████████████▍                                       | 14/50 [00:17<00:36,  1.03s/it, loss=0.291, lr=5e-6][2022-10-14 06:49:15,812] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
Steps:  30%|████████████████▌                                      | 15/50 [00:18<00:36,  1.04s/it, loss=0.143, lr=5e-6][2022-10-14 06:49:16,804] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
Steps:  32%|█████████████████▌                                     | 16/50 [00:19<00:34,  1.02s/it, loss=0.688, lr=5e-6][2022-10-14 06:49:17,793] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
Fetching 16 files: 100%|██████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 4808.60it/s]
Steps: 100%|███████████████████████████████████████████████████████| 50/50 [03:34<00:00,  4.29s/it, loss=0.089, lr=5e-6]

@devilismyfriend
Author

> Apparently I had already generated class images on earlier tries. […] You can try adding --sample_batch_size=1 to your call to train_dreambooth.py. […]

> Perfect! Generating class images works now, using 6GB VRAM and 11GB RAM on Win11 WSL2. Training uses 7.7GB VRAM and 4.6GB shared memory at about 5~6 s/it, but I don't know if the [deepspeed] OVERFLOW! message is okay. […]

Code Out

        `--num_cpu_threads_per_process` was set to `4` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:342: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
  warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
[2022-10-14 06:45:16,567] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Fetching 16 files: 100%|██████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 8007.26it/s]
Generating class images: 100%|██████████████████████████████████████████████████████████| 20/20 [01:48<00:00,  5.43s/it]
Downloading: 100%|█████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 1.11MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 544kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 472/472 [00:00<00:00, 468kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 806/806 [00:00<00:00, 759kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 592/592 [00:00<00:00, 575kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████| 492M/492M [00:47<00:00, 10.4MB/s]
[2022-10-14 06:48:36,774] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.3, git-hash=unknown, git-branch=unknown
[2022-10-14 06:48:37,719] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-10-14 06:48:37,720] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-10-14 06:48:37,720] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-10-14 06:48:37,765] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = {basic_optimizer.__class__.__name__}
[2022-10-14 06:48:37,765] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-10-14 06:48:37,765] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2022-10-14 06:48:37,765] [INFO] [stage_1_and_2.py:134:__init__] Reduce bucket size 500000000
[2022-10-14 06:48:37,765] [INFO] [stage_1_and_2.py:135:__init__] Allgather bucket size 500000000
[2022-10-14 06:48:37,765] [INFO] [stage_1_and_2.py:136:__init__] CPU Offload: True
[2022-10-14 06:48:37,765] [INFO] [stage_1_and_2.py:137:__init__] Round robin gradient partitioning: False
Using /home/poly/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Creating extension directory /home/poly/.cache/torch_extensions/py39_cu116/utils...
Emitting ninja build file /home/poly/.cache/torch_extensions/py39_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/include -isystem /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/include/TH -isystem /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/include/THC -isystem /home/poly/anaconda3/envs/diffusers/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/home/poly/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 10.706848859786987 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)]
[2022-10-14 06:48:50,989] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-14 06:48:50,989] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 3.65 GB         CA 3.27 GB         Max_CA 4 GB
[2022-10-14 06:48:50,989] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.12 GB, percent = 35.5%
[2022-10-14 06:48:57,201] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2022-10-14 06:48:57,217] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 3.27 GB         Max_CA 3 GB
[2022-10-14 06:48:57,217] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 21.36 GB, percent = 68.1%
[2022-10-14 06:48:57,217] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2022-10-14 06:48:57,259] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-10-14 06:48:57,259] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 3.27 GB         Max_CA 3 GB
[2022-10-14 06:48:57,259] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 21.36 GB, percent = 68.1%
[2022-10-14 06:48:57,260] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2022-10-14 06:48:57,260] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-10-14 06:48:57,260] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-10-14 06:48:57,260] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[(0.9, 0.999)]
[2022-10-14 06:48:57,261] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
[2022-10-14 06:48:57,261] [INFO] [config.py:991:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   amp_enabled .................. False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   amp_params ................... False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": null,
    "exps_dir": null,
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   bfloat16_enabled ............. False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   checkpoint_tag_validation_enabled  True
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   checkpoint_tag_validation_fail  False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f523e61db20>
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   communication_data_type ...... None
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   curriculum_enabled ........... False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   curriculum_params ............ False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   dataloader_drop_last ......... False
[2022-10-14 06:48:57,262] [INFO] [config.py:991:print]   disable_allgather ............ False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   dump_state ................... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   dynamic_loss_scale_args ...... None
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_enabled ........... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_gas_boundary_resolution  1
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_layer_num ......... 0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_max_iter .......... 100
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_stability ......... 1e-06
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_tol ............... 0.01
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   eigenvalue_verbose ........... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   elasticity_enabled ........... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   fp16_auto_cast ............... True
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   fp16_enabled ................. True
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   fp16_master_weights_and_gradients  False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   global_rank .................. 0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   gradient_accumulation_steps .. 1
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   gradient_clipping ............ 1.0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   gradient_predivide_factor .... 1.0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   initial_dynamic_scale ........ 4294967296
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   load_universal_checkpoint .... False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   loss_scale ................... 0
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   memory_breakdown ............. False
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7f523e61da60>
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2022-10-14 06:48:57,263] [INFO] [config.py:991:print]   optimizer_legacy_fusion ...... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   optimizer_name ............... None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   optimizer_params ............. None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   pld_enabled .................. False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   pld_params ................... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   prescale_gradients ........... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   scheduler_name ............... None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   scheduler_params ............. None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   sparse_attention ............. None
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   sparse_gradients_enabled ..... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   steps_per_print .............. inf
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   train_batch_size ............. 1
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   train_micro_batch_size_per_gpu  1
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   wall_clock_breakdown ......... False
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   world_size ................... 1
[2022-10-14 06:48:57,264] [INFO] [config.py:991:print]   zero_allow_untested_optimizer  True
[2022-10-14 06:48:57,266] [INFO] [config.py:991:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2022-10-14 06:48:57,266] [INFO] [config.py:991:print]   zero_enabled ................. True
[2022-10-14 06:48:57,266] [INFO] [config.py:991:print]   zero_optimization_stage ...... 2
[2022-10-14 06:48:57,266] [INFO] [config.py:976:print_user_config]   json = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        },
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "fp16": {
        "enabled": true,
        "auto_cast": true
    },
    "zero_allow_untested_optimizer": true
}
Using /home/poly/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0026106834411621094 seconds
Steps:   0%|                                                                                     | 0/50 [00:00<?, ?it/s][2022-10-14 06:49:01,093] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
Steps:   2%|█                                                       | 1/50 [00:03<02:54,  3.56s/it, loss=0.331, lr=5e-6][2022-10-14 06:49:02,128] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
Steps:   4%|██▏                                                     | 2/50 [00:04<01:39,  2.08s/it, loss=0.297, lr=5e-6][2022-10-14 06:49:03,178] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
Steps:   6%|███▎                                                    | 3/50 [00:05<01:15,  1.61s/it, loss=0.169, lr=5e-6][2022-10-14 06:49:04,194] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
Steps:   8%|████▍                                                   | 4/50 [00:06<01:03,  1.37s/it, loss=0.622, lr=5e-6][2022-10-14 06:49:05,312] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
Steps:  10%|█████▌                                                  | 5/50 [00:07<00:57,  1.28s/it, loss=0.183, lr=5e-6][2022-10-14 06:49:06,584] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
Steps:  12%|██████▌                                                | 6/50 [00:09<00:56,  1.28s/it, loss=0.0678, lr=5e-6][2022-10-14 06:49:07,609] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
Steps:  14%|███████▋                                               | 7/50 [00:10<00:51,  1.20s/it, loss=0.0284, lr=5e-6][2022-10-14 06:49:08,677] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
Steps:  16%|████████▉                                               | 8/50 [00:11<00:48,  1.15s/it, loss=0.428, lr=5e-6][2022-10-14 06:49:09,683] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
Steps:  18%|██████████                                              | 9/50 [00:12<00:45,  1.11s/it, loss=0.209, lr=5e-6][2022-10-14 06:49:10,757] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
Steps:  20%|███████████                                            | 10/50 [00:13<00:43,  1.10s/it, loss=0.447, lr=5e-6][2022-10-14 06:49:11,746] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
Steps:  22%|████████████                                           | 11/50 [00:14<00:41,  1.06s/it, loss=0.217, lr=5e-6][2022-10-14 06:49:12,739] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
Steps:  24%|█████████████▏                                         | 12/50 [00:15<00:39,  1.04s/it, loss=0.799, lr=5e-6][2022-10-14 06:49:13,718] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
Steps:  26%|██████████████▎                                        | 13/50 [00:16<00:37,  1.02s/it, loss=0.114, lr=5e-6][2022-10-14 06:49:14,754] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
Steps:  28%|███████████████▍                                       | 14/50 [00:17<00:36,  1.03s/it, loss=0.291, lr=5e-6][2022-10-14 06:49:15,812] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
Steps:  30%|████████████████▌                                      | 15/50 [00:18<00:36,  1.04s/it, loss=0.143, lr=5e-6][2022-10-14 06:49:16,804] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
Steps:  32%|█████████████████▌                                     | 16/50 [00:19<00:34,  1.02s/it, loss=0.688, lr=5e-6][2022-10-14 06:49:17,793] [INFO] [stage_1_and_2.py:1720:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
Fetching 16 files: 100%|██████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 4808.60it/s]
Steps: 100%|███████████████████████████████████████████████████████| 50/50 [03:34<00:00,  4.29s/it, loss=0.089, lr=5e-6]

What did you change exactly? I already have class images made and I'm using the ds_config, and it still won't work.

@devilismyfriend
Copy link
Author

I had exactly the same issue with my RTX 3070 and 32GB RAM, Ubuntu 22.04 running on WSL2 in Windows 11.

However, I just solved it by updating to Windows 22H2 (I had 21H2 previously) and afterwards updating WSL with wsl --update.

Maybe it helps you too.

When you do wsl --update, what kernel version does it say you have?
Mine says Kernel version: 5.10.102.1

@pink-red
Copy link
Contributor

pink-red commented Oct 13, 2022

I don't know if the [deepspeed] OVERFLOW! message is okay

Yes, it's okay. For me, it repeats about 10 times at the start and then it might rarely occur again later. I think that this message means that DeepSpeed is adjusting settings to fit the model into currently available VRAM.
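
For reference, the "OVERFLOW! ... Attempted loss scale: X, reducing to Y" lines come from DeepSpeed's dynamic fp16 loss scaling: it starts from the huge initial_dynamic_scale shown in the config above (4294967296) and halves it, skipping that optimizer step, whenever the scaled gradients overflow, until training settles on a workable scale. A rough Python sketch of that behaviour (an illustration only, not DeepSpeed's actual implementation):

import torch

# Toy dynamic loss scaling: halve the scale and skip the step on overflow,
# mirroring the "Attempted loss scale: X, reducing to X/2" log lines.
loss_scale = 2.0 ** 32  # 4294967296, the initial_dynamic_scale above

def step_with_dynamic_scaling(loss, params, optimizer):
    global loss_scale
    (loss * loss_scale).backward()
    grads = [p.grad for p in params if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        loss_scale /= 2        # reduce the scale...
        optimizer.zero_grad()  # ...and skip this optimizer step
        return False
    for g in grads:
        g.div_(loss_scale)     # unscale gradients before the real update
    optimizer.step()
    optimizer.zero_grad()
    return True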

@A-Polyana
Copy link

A-Polyana commented Oct 14, 2022

What did you change exactly? I already have class images made and I'm using the ds_config, and it still won't work.

  1. I upgraded from Windows 10 21H2 (WSL2) to Windows 11 22H2 (WSL2), and after installing 22H2 I ran wsl --update from cmd
  2. Re-installed Ubuntu 22.04 in WSL
  3. In train_dreambooth.py, changed sample_batch_size to 1
  4. I used this config:
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 0
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:1

I hope it helps you

@devilismyfriend
Copy link
Author

OK, I can confirm that updating to the preview Windows release (22H2) and running wsl --update got me past the memory-pinning failure; only the overflow message keeps popping up.

@A-Polyana
Copy link

A-Polyana commented Oct 14, 2022

OK, I can confirm that updating to the preview Windows release (22H2) and running wsl --update got me past the memory-pinning failure; only the overflow message keeps popping up.

Were you on Windows 10 21H2 WSL2 before?

P.S.
My Windows 11 22H2 kernel version is:
Linux 5.15.68.1-microsoft-standard-WSL2 x86_64

@Thomas-MMJ
Copy link

Thomas-MMJ commented Oct 14, 2022

Updated to 22H2 and can now pin large amounts of memory as well.

@patrickvonplaten
Copy link
Contributor

cc'ing @patil-suraj here for dreambooth

@karlmacklin
Copy link

Did any of you ever get this working on Windows 10?

I can't pin large amounts of RAM, even though I have all Windows updates applied and have run wsl --update.

I'm reluctant to update to Windows 11 unless necessary, so I'm just curious whether this is the current solution.

@Thomas-MMJ
Copy link

It won't work on Windows 10. It didn't work on Windows 11 either until the latest update about a week ago, and Windows 10 is no longer getting those updates. Getting it to work apparently requires support in the OS to enable larger memory pinning in WSL.

@A-Polyana
Copy link

Did any of you ever get this working on Windows 10?

I can't pin large amounts of RAM, even though I have all Windows updates applied and have run wsl --update.

I'm reluctant to update to Windows 11 unless necessary, so I'm just curious whether this is the current solution.

Like you, I didn't want to upgrade my Windows 10, but I did upgrade to Windows 11 22H2 to get this working.
Maybe it's very different from Windows 10 WSL2.

@patrickvonplaten
Copy link
Contributor

@patil-suraj re-pinging you here - could you take a look? :-)

@patil-suraj
Copy link
Contributor

patil-suraj commented Oct 20, 2022

Thanks for the issue

Looks like some of you already got it working!

I'm not an expert on DeepSpeed, but I can run it fine on Linux machines, so I'm not sure what the issue is.
Last I had seen, DeepSpeed didn't have good support for Windows.

Also note that, when using prior preservation, you can generate the prior (class) images beforehand or grab them from the internet and just specify the path to that directory. That way, you don't have to mess with sample_batch_size.
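
If you want to go that route, here is a minimal sketch of pre-generating the class images with diffusers before launching training (the model id, prompt, image count, and output folder below are placeholders; match them to your own --class_prompt, --num_class_images, and --class_data_dir):

from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

# Pre-generate prior-preservation ("class") images once, then point
# --class_data_dir at this folder so the training script can skip its
# own sampling pass.
class_dir = Path("class")
class_dir.mkdir(parents=True, exist_ok=True)

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

num_class_images = 12  # match --num_class_images
for i in range(num_class_images):
    image = pipe("a photo of a person").images[0]
    image.save(class_dir / f"{i:04d}.png")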

@karlmacklin
Copy link

Windows 10 22H2 update has been released (yesterday I think).

However, this sadly does not solve the issue of pinning more than 2 GB of memory in WSL2. I guess Windows 11 is the only way to go for this to work.

@camenduru
Copy link
Contributor

- Windows 11 22H2
- Ubuntu 22.04
- Linux PC 5.15.68.1-microsoft-standard-WSL2

- sudo apt-get install bzip2
- curl micro.mamba.pm/install.sh | bash
- micromamba create -n sd python=3.10 -c conda-forge
- micromamba activate sd
- micromamba install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
- micromamba install -c "nvidia/label/cuda-11.6.0" cuda
- git clone https://github.com/huggingface/diffusers
- cd diffusers/examples/dreambooth
- pip install -r requirements.txt
- pip install deepspeed
- git clone -b fp16 https://huggingface.co/CompVis/stable-diffusion-v1-4.git

- accelerate config
  -  In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
  -  Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 0
  -  Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
  -  Do you want to use DeepSpeed? [yes/NO]: yes
  -  Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
  -  Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: NO
  -  How many GPU(s) should be used for distributed training? [1]:1

- export MODEL_NAME="stable-diffusion-v1-4"
- export INSTANCE_DIR="instance"
- export CLASS_DIR="class"
- export OUTPUT_DIR="output"

- accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="scarlettjohansson" \
  --class_prompt="person" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=12 \
  --max_train_steps=650 \
  --mixed_precision=fp16

Thanks everyone 🎉

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale (Issues that haven't received updates) label on Nov 18, 2022
@patrickvonplaten
Copy link
Contributor

Sorry, just to be sure: is this issue closed, or is there still an open problem/bug?

@WakizashiYoshikawa
Copy link

Sorry, just to be sure: is this issue closed, or is there still an open problem/bug?

I still cannot run it on Windows 10 22H2, so I presume the only way would be Windows 11, although Reddit users on Windows 11 also report issues...

@patrickvonplaten
Copy link
Contributor

We really need to get better support for Windows!
Also cc @anton-l FYI

@Thomas-MMJ
Copy link

I still cannot run it on Windows 10 22H2, so I presume the only way would be Windows 11, although Reddit users on Windows 11 also report issues...

Did you test whether the 22H2 update on Windows 10 increased the amount of memory that can be pinned? If the update didn't, then it still won't work on Windows 10.

Try the test mentioned above:

import torch

# Allocate a large CPU tensor, page-lock ("pin") it, then touch it.
# 16e9 / 8 float32 elements is roughly 8 GB to pin, well past the
# limit that the failing WSL2 setups hit.
cpu = torch.device('cpu')
alloc_size = 16e9

print('Allocating')
x = torch.zeros(int(alloc_size / 8),
                dtype=torch.float32,
                device=cpu)
print('Pinning')
x = x.pin_memory()

print('Accessing')
m = torch.mean(x)  # read the pinned buffer to make sure it is usable
assert m == 0
print('Done')
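
If pinning works, the test prints all four messages through Done. On setups that still hit the old WSL2 limit, it typically fails at the Pinning step with a CUDA/pinned-memory error once the buffer exceeds the roughly 2 GB mentioned earlier in this thread.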
