
what is the input of conditional DDIM decoder? #66

Open
YixuannnL opened this issue Oct 30, 2023 · 6 comments

@YixuannnL

I really appreciate your work!
But I'm confused about the input of your conditional DDIM decoder.
In your paper, you said 'For training, the stochastic subcode x_T is not needed.' So is the input the original image with a certain amount of noise added? And is the loss computed between the added noise and the predicted noise?
Thanks for your patience!

@phizaz
Owner

phizaz commented Oct 30, 2023

The word "input" is a bit ambiguous because there are actually two inputs:

  1. Input to the CNN encoder (semantic encoder): the input here is the clean, original image, with no noise.
  2. Input to the UNET (diffusion model): the input here is noisy, depending on t.

There is only one loss function, and it's for the UNET (the CNN encoder is learned end-to-end as part of the UNET):

loss = ||UNET(x_t) - x_clean|| or ||UNET(x_t) - noise||

In the paper, we used the latter version, but both should be equally valid. A rough sketch of this training step is below.
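A minimal sketch of that step (module names like semantic_encoder and unet are illustrative, not the exact diffae code):

```python
import torch
import torch.nn.functional as F

def training_step(semantic_encoder, unet, x_clean, alpha_bar):
    """One training step. alpha_bar is the (T,) cumulative noise schedule."""
    B = x_clean.shape[0]
    # 1) The semantic encoder sees the clean image (no noise).
    z_sem = semantic_encoder(x_clean)
    # 2) The UNET sees a noisy version of the image at a random timestep t.
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x_clean.device)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x_clean)
    x_t = a.sqrt() * x_clean + (1 - a).sqrt() * noise
    # Single loss (the noise-prediction version); the encoder is trained
    # end-to-end because gradients flow back through z_sem.
    eps_pred = unet(x_t, t, z_sem)
    return F.mse_loss(eps_pred, noise)
```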

@YixuannnL
Author

Many thanks for your reply!
Then I have a small question. Does the diffusion model itself have the ability to reconstruct? Have you ever tried training an unconditional DDIM to reconstruct images? I've used img2img in the stable diffusion webui to try to reconstruct the original image, but failed. What is your intuition: if I use deterministic DDIM to reverse an image to the latent x_T (as your paper mentions) and then feed this x_T to an unconditional DDIM, is this a feasible way to reconstruct the original image?

@YixuannnL
Author

Moreover, another question: why don't you need a reconstruction loss to constrain the output of your DDIM decoder and ensure reconstruction of the original image? Why is the denoising loss alone sufficient? I can't follow the logic.
Much appreciation for your patience!

@YixuannnL
Author

Another question: I noticed in your code that in your DDIM decoder, the condition z_sem is only used in the ResBlocks, together with the timestep. But in the attention modules you only use self-attention instead of cross-attention. Can I ask the reason for that?

@phizaz
Owner

phizaz commented Oct 31, 2023

Diffusion autoencoders, or even a plain DDIM, can definitely reconstruct the image.
I have heard (though not from first-hand experience) that classifier-free guidance models (like Stable Diffusion) have problems with inversion, but I don't have any further intuition beyond that.

why don't you need a reconstruction loss to constrain the output of your DDIM decoder and ensure reconstruction of the original image?

This is a property of DDIM itself. An intuition is that DDIM is an ODE, and an ODE can be thought of as an invertible mapping (so you have a way to go from the input to the latent and back, hence reconstruction). I refer you to the DDIM paper (Song et al., 2020) for more. The deterministic update is sketched below.
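Concretely, a minimal sketch of the deterministic (eta = 0) DDIM update; running it with decreasing t reconstructs an image from x_T, and running the same formula with increasing t inverts an image to x_T (the model signature here is an assumption):

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t_from, t_to, alpha_bar, z_sem=None):
    """One deterministic DDIM step from timestep t_from to t_to.
    t_to < t_from denoises (reconstruction); t_to > t_from inverts toward x_T.
    alpha_bar is the (T,) cumulative noise schedule; z_sem is optional conditioning."""
    a_from, a_to = alpha_bar[t_from], alpha_bar[t_to]
    t_batch = torch.full((x_t.shape[0],), t_from, device=x_t.device)
    eps = model(x_t, t_batch, z_sem)  # predicted noise at the current step
    # Predicted clean image x0 implied by (x_t, eps).
    x0 = (x_t - (1 - a_from).sqrt() * eps) / a_from.sqrt()
    # Move along the same ODE trajectory to the target noise level.
    return a_to.sqrt() * x0 + (1 - a_to).sqrt() * eps
```

Because the same map is run in both directions, no extra reconstruction loss is needed: the denoising loss alone trains the model that defines this (approximately) invertible ODE.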

I noticed in your code that in your DDIM decoder, the condition z_sem is only used in the ResBlocks, together with the timestep. But in the attention modules you only use self-attention instead of cross-attention. Can I ask the reason for that?

You can definitely add conditioning signals to the attention modules as well. Is it worth it or not? It's hard to tell without experiments. But the high-level motivation is: attention "works with the inputs", it doesn't really add something new to them. Convolution, in this case, is a better choice for a layer that "adds something new". A sketch of the ResBlock conditioning is below.
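For illustration, a minimal sketch of a ResBlock that injects z_sem together with the timestep embedding as a per-channel scale/shift (the layer names and shapes are assumptions, not the exact diffae code):

```python
import torch
import torch.nn as nn

class ConditionedResBlock(nn.Module):
    """ResBlock conditioned on (t_emb, z_sem) via scale/shift modulation."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        # Assumes channels is divisible by 32 for GroupNorm.
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(32, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # cond_dim = dim(t_emb) + dim(z_sem); maps the condition to scale and shift.
        self.cond = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, t_emb, z_sem):
        h = self.conv1(torch.relu(self.norm1(x)))
        cond = torch.cat([t_emb, z_sem], dim=1)
        scale, shift = self.cond(cond).chunk(2, dim=1)
        h = self.norm2(h) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        h = self.conv2(torch.relu(h))
        return x + h  # residual connection; the conditioning "adds something new"
```

The self-attention layers then operate on these already-conditioned features, which is why the attention itself doesn't need cross-attention to z_sem.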

@YixuannnL
Author


Thank you so much for your reply!
