DreamBooth DeepSpeed support for under 8 GB VRAM training #735
Conversation
The documentation is not available anymore as the PR was closed or merged.
Just tried this version, and it works pretty well! DreamBooth finally fits onto a 3060, although it used 10743 MiB for me. Does DeepSpeed adjust the training to fit into VRAM, or was it something else? Here's a time comparison for different cases (800 steps in each case, not including the time to generate class images):

*ShivamShrirao's fork caches the latents, which speeds up the training but takes some time itself. For me, it was 02:34 for latent caching (for 103 instance images) and 14:36 for the training itself, resulting in 17:10 total.

**I've noticed that with DeepSpeed enabled my CPU was under pretty heavy load, so I'm including it in the table too.

By the way, if you don't want to run accelerate config, you can pass the DeepSpeed options directly on the command line:

accelerate launch --use_deepspeed --zero_stage=2 --gradient_accumulation_steps=1 --offload_param_device=cpu --offload_optimizer_device=cpu train_dreambooth.py *training_arguments*

Full command I've used for training:

accelerate launch --use_deepspeed --zero_stage=2 --gradient_accumulation_steps=1 --offload_param_device=cpu --offload_optimizer_device=cpu train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of sks dog" \
--class_prompt="a photo of dog" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800 \
--sample_batch_size=2 \
  --mixed_precision=fp16
Looks clean to me!
Unfortunately, this did not work for me; tested on a 3080 10GB and 64GB of RAM.
Due to recent commits, some casts to half precision are no longer necessary. Mention that DeepSpeed's version of Adam is about 2x faster.
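For reference, a minimal sketch of what switching to that optimizer looks like (assuming a DeepSpeed installation with the CPU Adam op available; it JIT-compiles on first use):

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

model = torch.nn.Linear(4, 4)
# Drop-in replacement for torch.optim.AdamW when the optimizer state is
# offloaded to CPU; about 2x faster per the note above.
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=5e-6)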
There is a patch in the DeepSpeed repo to allow 8bit_adam to work with DeepSpeed. I haven't tested it, so I'm not sure if it works, since there hasn't been any progress/comments on it since December.
This is very cool! Thanks a lot for working on this. I just left a couple of nits and a question about loss computation.
    # Add the prior loss to the instance loss.
    loss = loss + args.prior_loss_weight * prior_loss
else:
-   loss = F.mse_loss(noise_pred, noise, reduction="none").mean([1, 2, 3]).mean()
+   loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")
Why cast to float32 here? Do we always want to compute the loss in full precision?
cc @patrickvonplaten
It's mixed-precision best practice to calculate large reductions in higher precision. This calculates the mean of batch_size * 4 * 64 * 64 halfs, and mse_loss is one of the operations that would be automatically cast to fp32 when running fp16 with autocast. I'm not sure if it's necessary at low batch sizes, as it does seem to work without it, but it doesn't really affect memory consumption since it's only one operation, and it should give some safety.
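A minimal sketch of the point above (standalone fp16 tensors stand in for the UNet prediction and the sampled noise; the shape assumes batch_size=1 at 512x512 resolution):

import torch
import torch.nn.functional as F

# Stand-ins for the fp16 UNet output and target noise
# (batch_size x 4 x 64 x 64 latents for 512x512 images).
noise_pred = torch.randn(1, 4, 64, 64, dtype=torch.float16)
noise = torch.randn(1, 4, 64, 64, dtype=torch.float16)

# Upcast before the reduction: the mean runs over
# batch_size * 4 * 64 * 64 half-precision values.
loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")
print(loss.dtype)  # torch.float32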
I see, makes sense!
Good to merge for me if @patil-suraj is happy with it!
Thanks a lot for the amazing contribution!
It appears that mixed precision is broken now
…e#735)

* Support deepspeed
* Dreambooth DeepSpeed documentation
* Remove unnecessary casts, documentation
  Due to recent commits, some casts to half precision are no longer necessary. Mention that DeepSpeed's version of Adam is about 2x faster.
* Review comments
Add instructions on how to enable DeepSpeed in the DreamBooth example to allow training on under 8 GB of VRAM. I was able to train a working network using this on an 8 GB VRAM GPU.

It did not work out of the box with fp16 mixed precision, and I had to add some explicit casts to make it run. Without them it raises:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
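For illustration, a minimal, self-contained reproduction of that error and the kind of explicit cast that fixes it (a stand-in nn.Conv2d rather than the actual UNet; needs a CUDA device):

import torch
import torch.nn as nn

# fp16 weights (as under fp16 mixed precision), fp32 input.
layer = nn.Conv2d(3, 8, kernel_size=3).cuda().half()
x = torch.randn(1, 3, 64, 64).cuda()  # fp32

# layer(x) raises: RuntimeError: Input type (torch.cuda.FloatTensor)
# and weight type (torch.cuda.HalfTensor) should be the same
out = layer(x.to(layer.weight.dtype))  # explicit cast fixes it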