v-prediction training support #1455
Conversation
The documentation is not available anymore as the PR was closed or merged.
Amazing!

It looks like the DDIM scheduler is never used for training, is that correct? Do we need a `get_velocity` function for it in that case?
```diff
         # Add the prior loss to the instance loss.
         loss = loss + args.prior_loss_weight * prior_loss
     else:
-        loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")
+        loss = F.mse_loss(noise_pred.float(), target.float(), reduction="mean")
```
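For context, the `target` in the diff above is chosen from the scheduler's prediction type. A minimal, self-contained sketch (scalar stand-ins for tensors; the helper name `get_target` is hypothetical — in diffusers the real calls are `noise_scheduler.config.prediction_type` and `noise_scheduler.get_velocity(...)`):

```python
import math

def get_target(prediction_type, sample, noise, alpha_bar_t):
    # "epsilon" models predict the sampled noise directly.
    if prediction_type == "epsilon":
        return noise
    # "v_prediction" models predict the velocity
    # v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0.
    if prediction_type == "v_prediction":
        sqrt_a = math.sqrt(alpha_bar_t)
        sqrt_1ma = math.sqrt(1 - alpha_bar_t)
        return sqrt_a * noise - sqrt_1ma * sample
    raise ValueError(f"Unknown prediction_type: {prediction_type}")
```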
Shouldn't we rename `noise_pred` too? (Same comment for the other scripts.)
Good catch, we could call it `model_output` instead of `noise_pred`, wdyt?
`model_output` or `pred` both sound fine to me, whichever you think is clearer. I think we use `model_output` in more places though, so that would be better.
```python
parser.add_argument(
    "--revision",
    type=str,
    default=None,
    required=False,
    help="Revision of pretrained model identifier from huggingface.co/models.",
)
```
👍
```python
def get_velocity(
    self, sample: torch.FloatTensor, noise: torch.FloatTensor, timesteps: torch.IntTensor
) -> torch.FloatTensor:
    # Make sure alphas_cumprod and timestep have same device and dtype as sample
    self.alphas_cumprod = self.alphas_cumprod.to(device=sample.device, dtype=sample.dtype)
    timesteps = timesteps.to(sample.device)

    sqrt_alpha_prod = self.alphas_cumprod[timesteps] ** 0.5
    sqrt_alpha_prod = sqrt_alpha_prod.flatten()
    while len(sqrt_alpha_prod.shape) < len(sample.shape):
        sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)

    sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[timesteps]) ** 0.5
    sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
    while len(sqrt_one_minus_alpha_prod.shape) < len(sample.shape):
        sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)

    velocity = sqrt_alpha_prod * noise - sqrt_one_minus_alpha_prod * sample
    return velocity
```
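To see why this target is well-behaved, here is a toy scalar version of the method above (with `alpha_bar_t` standing in for `alphas_cumprod[t]`; the real code broadcasts these over image tensors). It also demonstrates a useful identity: from the noised sample and the velocity you can recover both the clean sample and the noise.

```python
import math

def add_noise_scalar(sample, noise, alpha_bar_t):
    # noisy = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
    return math.sqrt(alpha_bar_t) * sample + math.sqrt(1 - alpha_bar_t) * noise

def get_velocity_scalar(sample, noise, alpha_bar_t):
    # v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0
    return math.sqrt(alpha_bar_t) * noise - math.sqrt(1 - alpha_bar_t) * sample

# Recovery identities (easy to verify algebraically):
#   x0  = sqrt(alpha_bar) * noisy - sqrt(1 - alpha_bar) * v
#   eps = sqrt(1 - alpha_bar) * noisy + sqrt(alpha_bar) * v
```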
This looks very similar to `add_noise`, doesn't it? Would it make sense to make both implementations rely on a common function?
Yes, we could refactor it in a follow-up PR, wdyt @patrickvonplaten?
Since we use both `add_noise` and `get_velocity` in the same training step for v-prediction, I'm ok with keeping them separate. But longer-term we might benefit from factoring out or condensing the `alpha_prod` code in both functions (everything above `velocity =`).
Agree! I think it's cleaner to add it directly to `add_noise` and then make use of `self.config.prediction_type` - could we do this in this PR maybe?
Scratch that, it doesn't work as expected.
In favour of keeping `get_velocity`: for v-prediction we need both the velocity and the noised image, so if we modify `add_noise` we'll need to return a tuple as output, which will complicate it a bit. `get_velocity` is also easier to understand, as the name makes it clear that it's different from `add_noise`.
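To illustrate the tuple-return concern, here is a hypothetical sketch (scalar stand-ins for tensors; `add_noise_combined` is an invented name, not diffusers API) of what a merged `add_noise` would look like:

```python
import math

def add_noise_combined(sample, noise, alpha_bar_t, prediction_type):
    # Shared broadcasting/coefficient code would live here in the real version.
    sqrt_a = math.sqrt(alpha_bar_t)
    sqrt_1ma = math.sqrt(1 - alpha_bar_t)
    noisy = sqrt_a * sample + sqrt_1ma * noise
    if prediction_type == "v_prediction":
        velocity = sqrt_a * noise - sqrt_1ma * sample
        return noisy, velocity  # callers now have to unpack a tuple...
    return noisy  # ...while epsilon callers get a bare tensor
```

The inconsistent return type is exactly what the comment above argues against.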
```python
def get_velocity(
    self, sample: torch.FloatTensor, noise: torch.FloatTensor, timesteps: torch.IntTensor
) -> torch.FloatTensor:
    # Make sure alphas_cumprod and timestep have same device and dtype as sample
    self.alphas_cumprod = self.alphas_cumprod.to(device=sample.device, dtype=sample.dtype)
    timesteps = timesteps.to(sample.device)

    sqrt_alpha_prod = self.alphas_cumprod[timesteps] ** 0.5
    sqrt_alpha_prod = sqrt_alpha_prod.flatten()
    while len(sqrt_alpha_prod.shape) < len(sample.shape):
        sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)

    sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[timesteps]) ** 0.5
    sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
    while len(sqrt_one_minus_alpha_prod.shape) < len(sample.shape):
        sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)

    velocity = sqrt_alpha_prod * noise - sqrt_one_minus_alpha_prod * sample
    return velocity
```
Same comment as in DDIM.
It's not really used for training, but I think it can be. I'm adding the method just for consistency.

Cool! We can add support for DPM-Solver too in a future PR :)
LGTM!
```python
def get_velocity(
    self, sample: torch.FloatTensor, noise: torch.FloatTensor, timesteps: torch.IntTensor
) -> torch.FloatTensor:
    # Make sure alphas_cumprod and timestep have same device and dtype as sample
    self.alphas_cumprod = self.alphas_cumprod.to(device=sample.device, dtype=sample.dtype)
    timesteps = timesteps.to(sample.device)

    sqrt_alpha_prod = self.alphas_cumprod[timesteps] ** 0.5
    sqrt_alpha_prod = sqrt_alpha_prod.flatten()
    while len(sqrt_alpha_prod.shape) < len(sample.shape):
        sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)

    sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[timesteps]) ** 0.5
    sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
    while len(sqrt_one_minus_alpha_prod.shape) < len(sample.shape):
        sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)

    velocity = sqrt_alpha_prod * noise - sqrt_one_minus_alpha_prod * sample
    return velocity
```
Hmmm, did this break the script for older stable diffusion models? When I try to run with
because the
Hey @nlml,
* add get_velocity
* add v prediction for training
* fix saving
* add revision arg
* fix saving
* save checkpoints dreambooth
* fix saving embeds
* add instruction in readme
* quality
* noise_pred -> model_pred
Hello, I tried training using stable-diffusion-2-1 with resolution 768, but I get NaN loss in the very first iteration with `accelerate launch train_dreambooth.py`. I checked all the model parts; the problem is with the UNet, which outputs an all-NaN tensor. The inputs to the UNet don't contain any NaNs or infs.
Hey @pkurz3nd, this is a known issue with the
This PR adds support for v-prediction training in the training scripts. This allows fine-tuning the SD2 768x768 model with these scripts.

To enable this, it adds a `get_velocity` method to the DDPM and DDIM schedulers to get the target during training. The type of training is automatically detected inside the script using the `noise_scheduler.config.prediction_type` argument.

Users will just have to set the right `resolution` in the command: 512 for all models except the 768 one, and 768 for the `stable-diffusion-2` model.
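For illustration, a launch command for the 768 case might look like the following. All paths and hyperparameter values here are placeholders I'm assuming for the sketch, not taken from this PR; the flags are the usual `train_dreambooth.py` arguments.

```shell
# Illustrative only: fine-tuning the 768x768 SD2 model.
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2" \
  --instance_data_dir="./my_images" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="./output" \
  --resolution=768 \
  --train_batch_size=1
```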