
How to calculate the loss #60

Open
orydatadudes opened this issue Feb 15, 2023 · 9 comments

@orydatadudes

I didn't fully understand how the loss is calculated.

Regular diffusion models take an input image, add noise to it, and during the reverse stage the model learns how to undo that noise. The loss is computed based on how well the model predicts that noise at every step t in T.

So what I don't understand is: what is the role of the target image? (We add the noise to the input image.)

This is what the paper says:

[image: the loss equation from the paper, $\mathcal{L} = \mathbb{E}_{z_0, t, c_t, c_f, \epsilon \sim \mathcal{N}(0,1)}\left[\|\epsilon - \epsilon_\theta(z_t, t, c_t, c_f)\|_2^2\right]$]

I'm also not sure what the "task-specific conditions c_f" in the loss are. Is that the target image?

Thanks

@xiankgx

xiankgx commented Feb 16, 2023

This paper is about adding an additional condition to an existing text-conditioned generative model. The paper mentions many tasks (e.g., controlling generation with Canny edges, with Hough lines, with user scribbles, etc.). For the Canny edge task, c_f is the Canny edge map extracted from the ground-truth image.
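
For concreteness, here is a minimal sketch of how one (c_f, ground-truth) training pair could be built for the Canny edge task. It assumes OpenCV; `make_canny_pair` and the thresholds are mine for illustration, not from the paper:

```python
import cv2
import numpy as np

def make_canny_pair(image_path: str):
    target = cv2.imread(image_path)                  # ground-truth image = training target
    gray = cv2.cvtColor(target, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                # task-specific condition c_f (thresholds illustrative)
    c_f = np.stack([edges] * 3, axis=-1)             # replicate to 3 channels for the ControlNet input
    return c_f, target
```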

@orydatadudes
Author

So c_f is derived from the ground truth (the Canny edge map for that task), and the model learns how to make predictions with that input. How can you then use that model on another/test image?

@xiankgx

xiankgx commented Feb 20, 2023

> So c_f is derived from the ground truth (the Canny edge map for that task), and the model learns how to make predictions with that input. How can you then use that model on another/test image?

During training, there is a set of such inputs and targets ((canny_edge_1, target_image_1), (canny_edge_2, target_image_2), ..., (canny_edge_N, target_image_N)), and we minimize the loss on this dataset. The ultimate goal of all neural-net training is for the model to generalize (meaning it also works reasonably well on test images it may not have seen during training). Hope this helps.
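
A minimal sketch of what minimizing that loss looks like, using hypothetical placeholders (`encode_latents`, `encode_text`, `add_noise`, `unet_with_controlnet`) for the real SD components:

```python
import torch
import torch.nn.functional as F

for canny_edge, target_image, prompt in dataloader:
    z0 = encode_latents(target_image)       # VAE latents of the *target* image
    c_t = encode_text(prompt)               # CLIP text conditioning
    t = torch.randint(0, num_train_timesteps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)              # the noise the model must predict
    z_t = add_noise(z0, eps, t)             # forward diffusion q(z_t | z_0)

    eps_pred = unet_with_controlnet(z_t, t, c_t, c_f=canny_edge)
    loss = F.mse_loss(eps_pred, eps)        # ||eps - eps_theta(z_t, t, c_t, c_f)||^2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```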

@orydatadudes
Author

Thanks for the answer, but it's still not fully clear to me.

For the Canny edge task, during training the input to the model is [(canny_edge_1, target_image_1, text_prompt_1), ...], where:
canny_edge_1 - an image with only black & white edges
target_image_1 - the image that the Canny edge map was extracted from
text_prompt_1 - a prompt that describes canny_edge_1 (which is also the description of target_image_1)

Now target_image + text are used as a "hint" for the backward process.
Does that mean the diffusion process adds noise to canny_edge_1?
That would also mean that for a new image I could generate random noise, add these "hints" (target_image + text), and get a new Canny edge map for my new image & prompt.

If that is true, it also means that the function that learns how to reverse the noise should NOT receive the canny_edge_1 image but the target_image, which is different from the formula in the paper.

I'm really confused and need some help figuring this out.

@xiankgx

xiankgx commented Feb 22, 2023

The diffusion process does NOT add noise to the Canny edge input; it adds noise to the target image. Just like the diffusion process also does NOT add noise to the text input. Compared to standard SD, there is no change in how the forward and reverse diffusion processes work.

What has changed is that a new input (the Canny edge input) is created to influence the denoising process via the SD UNet.

The time variable, the text conditioning input, and the Canny edge input are fed into the ControlNet, which is used to control/modify the SD UNet's behavior.
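
To make that concrete, here is a sketch of the forward (noising) step under the standard DDPM schedule, assuming a precomputed `alphas_cumprod` tensor; note the noise lands on the target image's latents only, never on the Canny edge or text:

```python
import torch

# The canny edge map and text embedding reach the model unchanged, purely as
# conditioning; only the target image's latents z0 are noised.
def add_noise(z0, eps, t, alphas_cumprod):
    # z0: latents of the target image; eps: Gaussian noise; t: timestep indices
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
```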

@xiankgx

xiankgx commented Feb 22, 2023

To make this easier, just imagine we replace the UNet in SD with a UNet that can take in an additional input (the Canny edge input).

@xiankgx

xiankgx commented Feb 22, 2023

[image: diffusion process diagram showing the step from x_t to x_{t-1}]

Consider this image. To go from x_t to x_{t-1} in normal SD, a UNet is used, and the time variable t and the text conditioning data from the CLIP text encoder are input to this UNet.

Now, with ControlNet, this UNet is modified: in addition to the time variable t and the text conditioning data from the CLIP text encoder, it also takes in the Canny edge map data.
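
In code, one such step looks roughly like this (loosely following the diffusers ControlNet API; assume `controlnet`, `unet`, `scheduler`, `z_t`, `t`, `c_t`, and `canny_edge` are already set up):

```python
# ControlNet consumes (z_t, t, text, canny) and emits residual features...
down_res, mid_res = controlnet(
    z_t, t,
    encoder_hidden_states=c_t,      # CLIP text conditioning
    controlnet_cond=canny_edge,     # the new canny edge input
    return_dict=False,
)
# ...which are added into the SD UNet's skip connections.
eps_pred = unet(
    z_t, t,
    encoder_hidden_states=c_t,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
z_t = scheduler.step(eps_pred, t, z_t).prev_sample  # x_t -> x_{t-1}
```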

@orydatadudes
Author

Thank you for your detailed response. For the Canny edge task, when you say "it adds noise to the target image", is the target image this kind of image?

[first example image]

or this kind?

[second example image]

@xiankgx

xiankgx commented Feb 23, 2023

[image]

The target or label image is what you want the model to predict.
