
fix: don't save state if no --save-state arg given #521

Merged 1 commit into kohya-ss:dev on May 20, 2023

Conversation

@akshaal (Contributor) commented on May 18, 2023

Documentation states (translated from Japanese):

If you specify the save_state option at the same time, the training state, including the state of the optimizer and so on, will be saved as well (you can also resume training from a saved model, but compared to that, resuming from a saved state can be expected to give better accuracy and shorter training time). The save destination will be a folder.
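
For illustration (flags as in this repo's training scripts, paths hypothetical), the documented behavior would be:

```
# Saves a model checkpoint every epoch, plus an optimizer/training state
# folder because --save_state is given; omitting --save_state should save
# only the model. (Hypothetical invocation, mirroring the one further below.)
accelerate launch train_db.py --dataset_config=/data/dataset.toml \
  --output_dir=/data/output --save_every_n_epochs=1 --save_state
```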

This likely means that if --save_state is NOT given, no state should be saved. That is not how it works now: state is saved regardless of whether --save_state is given, as long as --save_every_n_epochs is specified. The given patch fixes the issue by requiring --save_state in addition to --save_every_n_epochs, as sketched below.
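
A minimal sketch of the intended guard (function names taken from the traceback below; the actual patch in library/train_util.py may be structured differently):

```python
# Sketch only: gate state saving on args.save_state in addition to the
# --save_every_n_epochs trigger. Function names follow the traceback
# below; this is not the exact diff.
def save_sd_model_on_epoch_end_or_stepwise(args, accelerator, epoch_no):
    # ... model checkpoint saving driven by --save_every_n_epochs ...
    if args.save_state:  # previously, state was saved unconditionally here
        save_and_remove_state_on_epoch_end(args, accelerator, epoch_no)
```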


As a side note: I don't really mind it saving state, but on low VRAM it crashes with an OOM while trying to save the state:

saving state at epoch 1
    train(args)
  File "/home/user/ksd/train_db.py", line 399, in train
    train_util.save_sd_model_on_epoch_end_or_stepwise(
  File "/home/user/ksd/library/train_util.py", line 3131, in save_sd_model_on_epoch_end_or_stepwise
    save_and_remove_state_on_epoch_end(args, accelerator, epoch_no)
  File "/home/user/ksd/library/train_util.py", line 3143, in save_and_remove_state_on_epoch_end
    accelerator.save_state(state_dir)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1634, in save_state
    weights.append(self.get_state_dict(model, unwrap=False))
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1811, in get_state_dict
    state_dict[k] = state_dict[k].float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 7.79 GiB total capacity; 5.31 GiB already allocated; 55.00 MiB free; 5.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
        | 2/1600 [00:06<1:31:30,  3.44s/it, loss=0.03]
Traceback (most recent call last):
  File "/home/user/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', 'train_db.py', '--pretrained_model_name_or_path=/data/xxxx.safetensors', '--dataset_config=/data/dataset.toml', '--output_dir=/data/output', '--output_name=my', '--save_model_as=safetensors', '--prior_loss_weight=1.0', '--max_train_steps=1600', '--learning_rate=1e-6', '--optimizer_type=AdamW8bit', '--xformers', '--save_every_n_epochs=1', '--save_precision=fp16', '--full_fp16', '--mixed_precision=fp16', '--gradient_checkpointing']' returned non-zero exit status 1.

@kohya-ss (Owner) commented:

Thank you for this! I accidentally made a change that caused the --save_state option to be ignored.

The OOM seems to happen because accelerate converts the weights to float32 when saving the state. I have looked into it, but it seems this behavior cannot be changed via options. I think we could patch accelerate to save as float16, but I am not certain that it would not cause other side effects.
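
For illustration only, a rough monkeypatch along those lines (an untested assumption; the real Accelerator.get_state_dict also has DeepSpeed/FSDP code paths that this sketch ignores):

```python
# Hypothetical sketch: return weights in their current dtype instead of
# upcasting to float32 (the state_dict[k].float() line that OOMs in the
# traceback above). Untested; side effects are unknown, as noted.
from accelerate import Accelerator

def get_state_dict_keep_dtype(self, model, unwrap=True):
    if unwrap:
        model = self.unwrap_model(model)  # strip DDP and similar wrappers
    return model.state_dict()  # fp16 tensors stay fp16

Accelerator.get_state_dict = get_state_dict_keep_dtype
```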

kohya-ss changed the base branch from main to dev on May 20, 2023
kohya-ss merged commit bc909e8 into kohya-ss:dev on May 20, 2023