Regarding learned image embedding and text embedding in Unet #38

Closed
xiankgx opened this issue Apr 30, 2022 · 6 comments

xiankgx commented Apr 30, 2022

According to Section 2.1 (Decoder) of the paper:

We enable classifier-free guidance by randomly setting CLIP embeddings to zero (or a learned embedding) 10% of the time, and randomly dropping the text caption 50% of the time during training.

It seems that we are replacing the embeddings after turning them into conditioning sequences.

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1216-L1222
https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1229-L1234
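
For context, here is a minimal sketch of what that kind of conditioning dropout can look like in isolation. This is my own illustration, not the repository's exact code: the module, the names `null_image_embed` / `null_text_embed`, and the drop probabilities are assumptions.

```python
import torch
import torch.nn as nn

def prob_mask_like(shape, keep_prob, device):
    # boolean mask, True where the conditioning is kept
    return torch.rand(shape, device=device) < keep_prob

class ConditioningDropout(nn.Module):
    # hypothetical module illustrating classifier-free guidance dropout
    def __init__(self, dim, image_drop_prob=0.1, text_drop_prob=0.5):
        super().__init__()
        self.image_drop_prob = image_drop_prob
        self.text_drop_prob = text_drop_prob
        # learned replacements used when the conditioning is dropped
        self.null_image_embed = nn.Parameter(torch.randn(dim))
        self.null_text_embed = nn.Parameter(torch.randn(dim))

    def forward(self, image_embed, text_encodings):
        # image_embed: (batch, dim), text_encodings: (batch, seq_len, dim)
        b, device = image_embed.shape[0], image_embed.device

        keep_image = prob_mask_like((b, 1), 1. - self.image_drop_prob, device)
        keep_text = prob_mask_like((b, 1, 1), 1. - self.text_drop_prob, device)

        # replace dropped image embeddings with the learned null image embedding
        image_embed = torch.where(keep_image, image_embed, self.null_image_embed)
        # replace dropped text encodings (all positions of a sample) with the null token
        text_encodings = torch.where(keep_text, text_encodings, self.null_text_embed)
        return image_embed, text_encodings
```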

And from the following, it seems that the null text embeddings can vary according to their sequence position. For image embeddings I feel this is fine, but what about text encodings?

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1104

Also, it seems we may need separate cond_drop_prob values: one for the image embedding and one for the text encodings.
If we do that, how do we modify forward_with_cond_scale()?

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1166-L1178
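
One possible shape for that change, with separate image and text drop probabilities, is sketched below. The argument names (`image_cond_drop_prob`, `text_cond_drop_prob`) are assumptions for illustration, not necessarily what the library uses; the idea is just to force a fully conditioned and a fully unconditioned pass.

```python
def forward_with_cond_scale(self, *args, cond_scale=1., **kwargs):
    # fully conditioned pass: keep both the image embedding and the text encodings
    logits = self.forward(*args, image_cond_drop_prob=0., text_cond_drop_prob=0., **kwargs)

    if cond_scale == 1:
        return logits

    # fully unconditioned pass: drop both conditioning signals
    null_logits = self.forward(*args, image_cond_drop_prob=1., text_cond_drop_prob=1., **kwargs)

    # classifier-free guidance extrapolation
    return null_logits + (logits - null_logits) * cond_scale
```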

lucidrains (Owner) commented

@xiankgx 🙏 🙏 🙏 thank you for the Q&A, I missed this detail!

fixed in the latest version :)

xiankgx commented Apr 30, 2022

How about the time dependence of null_text_embed?

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1104

xiankgx commented Apr 30, 2022

Suppose we have a sequence of text embeddings: v1, v2, v3, v4, v5.

And we need to replace some of them with null vectors, say v2 and v4.

Then we have v1, v_null, v3, v_null, v5.

But can the v_null at positions 2 and 4 have different values?
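
In code terms, the question is whether the null embedding is a single shared vector or a learned parameter with one vector per position. A toy sketch of the two options (the names are made up for illustration):

```python
import torch
import torch.nn as nn

seq_len, dim = 5, 16
text_encodings = torch.randn(1, seq_len, dim)               # v1 ... v5
drop = torch.tensor([[False, True, False, True, False]])     # drop v2 and v4
drop = drop.unsqueeze(-1)                                     # (1, seq_len, 1)

# option A: one learned null vector per position -> positions 2 and 4 can differ
null_per_position = nn.Parameter(torch.randn(1, seq_len, dim))
out_a = torch.where(drop, null_per_position, text_encodings)

# option B: a single shared null vector -> every dropped position gets the same value
null_shared = nn.Parameter(torch.randn(dim))
out_b = torch.where(drop, null_shared, text_encodings)
```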

lucidrains (Owner) commented

@xiankgx so that is actually some improvisation on my part, based on an observation from DALL-E v1 that text conditioning works better if one does not mask, but instead provides padding tokens in the empty positions: https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/dalle_pytorch.py#L372. It has been borne out in many follow-up works, in my mind.

lucidrains (Owner) commented

@xiankgx but yes, there is one problem: I believe the text encodings should always be padded to the maximum text length, so the conditioning is consistent across batches with variable-length text encodings.

let me fix that now 🙏
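
A minimal sketch of the idea, padding (or truncating) the text encodings to a fixed maximum length so the conditioning shape is identical across batches; the function name and the zero padding value here are assumptions:

```python
import torch.nn.functional as F

def pad_text_encodings(text_encodings, max_text_len):
    # text_encodings: (batch, seq_len, dim)
    seq_len = text_encodings.shape[1]
    if seq_len >= max_text_len:
        return text_encodings[:, :max_text_len]
    # pad the sequence dimension on the right so every batch has max_text_len positions
    return F.pad(text_encodings, (0, 0, 0, max_text_len - seq_len), value=0.)
```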

lucidrains (Owner) commented

@xiankgx haha, actually there was another issue with the null padding tokens, only uncovered because of your issues. 1c1e508 should be ok now.

keep it coming! 🙏
