
fix: don't save state if no --save-state arg given #521

Merged 1 commit into kohya-ss:dev on May 20, 2023

Conversation

@akshaal (Contributor) commented on May 18, 2023

Documentation states (translated from Japanese):

If you specify the save_state option at the same time, the training state, including the state of the optimizer and so on, will be saved as well (you can also resume training from a saved model, but compared to that, resuming from a saved state can be expected to give better accuracy and shorter training time). The save destination will be a folder.
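
For illustration (flags as in this repo's training scripts, paths hypothetical), the documented behavior would be:

```
# Saves a model checkpoint every epoch, plus an optimizer/training state
# folder because --save_state is given; omitting --save_state should save
# only the model. (Hypothetical invocation, mirroring the one further below.)
accelerate launch train_db.py --dataset_config=/data/dataset.toml \
  --output_dir=/data/output --save_every_n_epochs=1 --save_state
```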

This likely means that if --save_state is NOT given, no state should be saved. That is not how it works now: state is saved regardless of whether --save_state is given, as long as --save_every_n_epochs is specified. The given patch fixes the issue by requiring --save_state in addition to --save_every_n_epochs, as sketched below.
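
A minimal sketch of the intended guard (function names taken from the traceback below; the actual patch in library/train_util.py may be structured differently):

```python
# Sketch only: gate state saving on args.save_state in addition to the
# --save_every_n_epochs trigger. Function names follow the traceback
# below; this is not the exact diff.
def save_sd_model_on_epoch_end_or_stepwise(args, accelerator, epoch_no):
    # ... model checkpoint saving driven by --save_every_n_epochs ...
    if args.save_state:  # previously, state was saved unconditionally here
        save_and_remove_state_on_epoch_end(args, accelerator, epoch_no)
```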


As a side note: I don't really mind it saving state, but on low VRAM it crashes with an OOM while trying to save the state:

saving state at epoch 1
    train(args)
  File "/home/user/ksd/train_db.py", line 399, in train
    train_util.save_sd_model_on_epoch_end_or_stepwise(
  File "/home/user/ksd/library/train_util.py", line 3131, in save_sd_model_on_epoch_end_or_stepwise
    save_and_remove_state_on_epoch_end(args, accelerator, epoch_no)
  File "/home/user/ksd/library/train_util.py", line 3143, in save_and_remove_state_on_epoch_end
    accelerator.save_state(state_dir)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1634, in save_state
    weights.append(self.get_state_dict(model, unwrap=False))
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1811, in get_state_dict
    state_dict[k] = state_dict[k].float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 7.79 GiB total capacity; 5.31 GiB already allocated; 55.00 MiB free; 5.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
        | 2/1600 [00:06<1:31:30,  3.44s/it, loss=0.03]
Traceback (most recent call last):
  File "/home/user/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', 'train_db.py', '--pretrained_model_name_or_path=/data/xxxx.safetensors', '--dataset_config=/data/dataset.toml', '--output_dir=/data/output', '--output_name=my', '--save_model_as=safetensors', '--prior_loss_weight=1.0', '--max_train_steps=1600', '--learning_rate=1e-6', '--optimizer_type=AdamW8bit', '--xformers', '--save_every_n_epochs=1', '--save_precision=fp16', '--full_fp16', '--mixed_precision=fp16', '--gradient_checkpointing']' returned non-zero exit status 1.

@kohya-ss (Owner) commented:

Thank you for this! I accidentally made a change that caused the --save_state option to be ignored.

The OOM seems to happen because accelerate converts the weights to float32 when saving the state. I have looked into it, but it seems this behavior cannot be changed via options. I think we could patch accelerate to save as float16, but I am not certain that it would not cause other side effects.
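
For illustration only, a rough monkeypatch along those lines (an untested assumption; the real Accelerator.get_state_dict also has DeepSpeed/FSDP code paths that this sketch ignores):

```python
# Hypothetical sketch: return weights in their current dtype instead of
# upcasting to float32 (the state_dict[k].float() line that OOMs in the
# traceback above). Untested; side effects are unknown, as noted.
from accelerate import Accelerator

def get_state_dict_keep_dtype(self, model, unwrap=True):
    if unwrap:
        model = self.unwrap_model(model)  # strip DDP and similar wrappers
    return model.state_dict()  # fp16 tensors stay fp16

Accelerator.get_state_dict = get_state_dict_keep_dtype
```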

kohya-ss changed the base branch from main to dev on May 20, 2023
kohya-ss merged commit bc909e8 into kohya-ss:dev on May 20, 2023