Regarding learned image embedding and text embedding in Unet #38

Closed
xiankgx opened this issue Apr 30, 2022 · 6 comments

xiankgx commented Apr 30, 2022

According to Section 2.1 (Decoder) of the paper:

We enable classifier-free guidance by randomly setting CLIP embeddings to zero (or a learned embedding) 10% of the time, and randomly dropping the text caption 50% of the time during training.

It seems that we are replacing the embeddings after turning them into conditioning sequences.

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1216-L1222
https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1229-L1234
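
For context, here is a minimal sketch of what that kind of conditioning dropout can look like in isolation. This is my own illustration, not the repository's exact code: the module, the names `null_image_embed` / `null_text_embed`, and the drop probabilities are assumptions.

```python
import torch
import torch.nn as nn

def prob_mask_like(shape, keep_prob, device):
    # boolean mask, True where the conditioning is kept
    return torch.rand(shape, device=device) < keep_prob

class ConditioningDropout(nn.Module):
    # hypothetical module illustrating classifier-free guidance dropout
    def __init__(self, dim, image_drop_prob=0.1, text_drop_prob=0.5):
        super().__init__()
        self.image_drop_prob = image_drop_prob
        self.text_drop_prob = text_drop_prob
        # learned replacements used when the conditioning is dropped
        self.null_image_embed = nn.Parameter(torch.randn(dim))
        self.null_text_embed = nn.Parameter(torch.randn(dim))

    def forward(self, image_embed, text_encodings):
        # image_embed: (batch, dim), text_encodings: (batch, seq_len, dim)
        b, device = image_embed.shape[0], image_embed.device

        keep_image = prob_mask_like((b, 1), 1. - self.image_drop_prob, device)
        keep_text = prob_mask_like((b, 1, 1), 1. - self.text_drop_prob, device)

        # replace dropped image embeddings with the learned null image embedding
        image_embed = torch.where(keep_image, image_embed, self.null_image_embed)
        # replace dropped text encodings (all positions of a sample) with the null token
        text_encodings = torch.where(keep_text, text_encodings, self.null_text_embed)
        return image_embed, text_encodings
```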

And from the following, it seems that the null text embeddings can vary according to their sequence position. For image embeddings I feel this is fine, but what about text encodings?

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1104

Also, it seems we may need separate cond_drop_prob values: one for the image embedding and one for the text encodings.
If we do that, how do we modify forward_with_cond_scale()?

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1166-L1178
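
One possible shape for that change, with separate image and text drop probabilities, is sketched below. The argument names (`image_cond_drop_prob`, `text_cond_drop_prob`) are assumptions for illustration, not necessarily what the library uses; the idea is just to force a fully conditioned and a fully unconditioned pass.

```python
def forward_with_cond_scale(self, *args, cond_scale=1., **kwargs):
    # fully conditioned pass: keep both the image embedding and the text encodings
    logits = self.forward(*args, image_cond_drop_prob=0., text_cond_drop_prob=0., **kwargs)

    if cond_scale == 1:
        return logits

    # fully unconditioned pass: drop both conditioning signals
    null_logits = self.forward(*args, image_cond_drop_prob=1., text_cond_drop_prob=1., **kwargs)

    # classifier-free guidance extrapolation
    return null_logits + (logits - null_logits) * cond_scale
```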

lucidrains (Owner) commented

@xiankgx 🙏 🙏 🙏 thank you for the Q&A, I missed this detail!

fixed in the latest version :)

xiankgx commented Apr 30, 2022

How about the time dependence of null_text_embed?

https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1104

xiankgx commented Apr 30, 2022

Suppose we have a sequence of text embeddings: v1, v2, v3, v4, v5.

And we need to replace some of them with null vectors, say v2 and v4.

Then we have v1, v_null, v3, v_null, v5.

But can the v_null at positions 2 and 4 have different values?
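
In code terms, the question is whether the null embedding is a single shared vector or a learned parameter with one vector per position. A toy sketch of the two options (the names are made up for illustration):

```python
import torch
import torch.nn as nn

seq_len, dim = 5, 16
text_encodings = torch.randn(1, seq_len, dim)               # v1 ... v5
drop = torch.tensor([[False, True, False, True, False]])     # drop v2 and v4
drop = drop.unsqueeze(-1)                                     # (1, seq_len, 1)

# option A: one learned null vector per position -> positions 2 and 4 can differ
null_per_position = nn.Parameter(torch.randn(1, seq_len, dim))
out_a = torch.where(drop, null_per_position, text_encodings)

# option B: a single shared null vector -> every dropped position gets the same value
null_shared = nn.Parameter(torch.randn(dim))
out_b = torch.where(drop, null_shared, text_encodings)
```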

lucidrains (Owner) commented

@xiankgx so that is actually some improvisation on my part, based on an observation from DALL-E v1 that text conditioning works better if one does not mask, but instead provides padding tokens in the empty positions: https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/dalle_pytorch.py#L372. It has been borne out in many follow-up works, in my mind.

lucidrains (Owner) commented

@xiankgx but yes, there is one problem: I believe the text encodings should always be padded to the maximum text length, so the conditioning is consistent across batches with variable-length text encodings.

let me fix that now 🙏
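
A minimal sketch of the idea, padding (or truncating) the text encodings to a fixed maximum length so the conditioning shape is identical across batches; the function name and the zero padding value here are assumptions:

```python
import torch.nn.functional as F

def pad_text_encodings(text_encodings, max_text_len):
    # text_encodings: (batch, seq_len, dim)
    seq_len = text_encodings.shape[1]
    if seq_len >= max_text_len:
        return text_encodings[:, :max_text_len]
    # pad the sequence dimension on the right so every batch has max_text_len positions
    return F.pad(text_encodings, (0, 0, 0, max_text_len - seq_len), value=0.)
```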

lucidrains (Owner) commented

@xiankgx haha, actually there was another issue with the null padding tokens, only uncovered because of your issues. 1c1e508 should be ok now.

keep it coming! 🙏
