
Q: Are there any specifications on the image pixel values for the decoder? #33

Closed
xiankgx opened this issue Apr 29, 2022 · 8 comments

xiankgx commented Apr 29, 2022

Hi, do you know if there are any strict restrictions on the image input to the decoder? I remember it being mentioned somewhere to be [-1, +1], but can we also use other ranges like [0, 1]? Although, since we are adding random normal noise with mean 0, I guess not?
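(For context, a rough sketch of the standard DDPM forward noising step, not code from this repo, just to show where the zero-mean noise comes in:)

```python
import torch

def q_sample(x_start, t, alphas_cumprod):
    # standard DDPM forward noising: zero-mean Gaussian noise is mixed in,
    # which is why a symmetric [-1, 1] image range is the usual convention
    noise = torch.randn_like(x_start)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x_start + (1 - a).sqrt() * noise
```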

lucidrains (Owner)

@xiankgx yea, this repository does it from -1 to +1, but i've seen some generative models out there (back when i trained a lot of GANs) that do it from 0 to 1. i don't think i've ever read a paper that did a proper comparison between the two

xiankgx (Author) commented Apr 29, 2022

We should be very careful when passing images to various places: different CLIP implementations expect different things, and the image is also used as the prediction target (x_start) for the decoder.

lucidrains (Owner)

@xiankgx yes! we definitely need to keep an eye on normalization

rom1504 (Collaborator) commented Apr 29, 2022

All CLIP implementations I know of provide a preprocess function; we should use it to convert from image to tensor.

However, I wouldn't recommend doing any CLIP forward passes for training, and would instead use precomputed CLIP embeddings.

The prior training takes as input the CLIP text and CLIP image embeddings.
The generator training takes as input the CLIP image embedding.
So for training we don't need to do any CLIP forward pass.

The only time we may want to do a CLIP forward pass is at inference time.
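For example, precomputing embeddings with the openai CLIP package could look roughly like this (a sketch; the file name and model choice are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# use CLIP's own preprocess (resize, crop, and CLIP's normalization)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat"]).to(device)

with torch.no_grad():
    image_embed = model.encode_image(image)
    text_embed = model.encode_text(text)

# save these and feed them to the prior / decoder during training,
# so no CLIP forward pass happens inside the training loop
torch.save({"image_embed": image_embed.cpu(), "text_embed": text_embed.cpu()}, "embeds.pt")
```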

lucidrains (Owner) commented Apr 29, 2022

@rom1504 yea, i think the issue is that the decoder will be trained on images that are simply normalized to -1 to 1, but CLIP uses https://github.com/openai/CLIP/blob/main/clip/clip.py#L85 (but we can always just do this within the embed_image forward function for the CLIP adapter)

i think what will have to happen is that on the CLIP image embedding forward, we unnormalize the image (back to 0 to 1) and then run the CLIP normalization
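Roughly something like this (a sketch, not necessarily the exact adapter code; the mean/std are the constants from the linked clip.py line):

```python
import torch
from torchvision import transforms as T

# normalization constants from openai/CLIP (clip/clip.py)
clip_normalize = T.Normalize(
    mean=(0.48145466, 0.4578275, 0.40821073),
    std=(0.26862954, 0.26130258, 0.27577711),
)

def prepare_for_clip(image_neg_one_to_one):
    # decoder images live in [-1, 1]; first map back to [0, 1]
    image = (image_neg_one_to_one + 1) * 0.5
    # then apply the normalization CLIP was trained with
    return clip_normalize(image)
```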

rom1504 (Collaborator) commented Apr 29, 2022

I don't understand.
Doesn't the decoder take as input (image, clip embedding)?
If so, how is the normalization used in the CLIP forward related to the decoder training process?

lucidrains (Owner)

@rom1504

basically, images usually start off normalized from a range of 0 to 1 (shrunk from 0 to 255)

for DDPMs, we normalize them to -1 to 1 using 2 * image - 1

for CLIP, if we do all the embedding processing externally, then there is no problem - however, the decoder currently takes in the image and does both the DDPM training and the CLIP image embedding derivation. so I just have to make sure to unnormalize the image before renormalizing it with what CLIP was trained on, before passing it into the attention net. you can double check my work here! https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L228
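Putting the whole flow together as a sketch (function and argument names here are illustrative, not necessarily the ones in dalle2_pytorch.py):

```python
import torch

def normalize_neg_one_to_one(img):   # [0, 1] -> [-1, 1], for the DDPM
    return img * 2 - 1

def unnormalize_zero_to_one(img):    # [-1, 1] -> [0, 1], before CLIP's normalization
    return (img + 1) * 0.5

def decoder_training_step(image, clip_adapter, ddpm_loss):
    # `image` starts in [0, 1] (pixels shrunk from 0..255)
    x_start = normalize_neg_one_to_one(image)

    # the CLIP adapter's embed_image is expected to unnormalize back to
    # [0, 1] and apply CLIP's own mean/std before encoding
    with torch.no_grad():
        image_embed = clip_adapter.embed_image(x_start)

    return ddpm_loss(x_start, image_embed=image_embed)
```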

rom1504 (Collaborator) commented Apr 30, 2022

ah I see, makes sense
