[Core] refactor embeddings
#8722
Conversation
@@ -0,0 +1,34 @@
from .combined import (
This way nothing should break in terms of imports.
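For context, this is roughly the kind of re-export shim being described: the old `embeddings` module re-imports everything from the new submodules, so existing code such as `from diffusers.models.embeddings import ...` keeps working. The submodule and class names below are illustrative, not necessarily the exact contents of the PR.

```python
# Hypothetical sketch of a backward-compatible re-export shim in embeddings.py.
# The submodules (.combined, .others) and the classes listed are illustrative.
from .combined import (
    HunyuanCombinedTimestepTextSizeStyleEmbedding,
)
from .others import (
    LabelEmbedding,
)

# Re-exported names stay importable from the original module path.
__all__ = [
    "HunyuanCombinedTimestepTextSizeStyleEmbedding",
    "LabelEmbedding",
]
```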
return (latent + pos_embed).to(latent.dtype)
...
class ImagePositionalEmbeddings(nn.Module):
Even though it says ImagePositionalEmbeddings it really isn't about positions. See VQDiffusion for details.
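For readers unfamiliar with the class, a rough sketch of what it does (simplified, not the verbatim diffusers implementation): it embeds the discrete VQ codebook indices themselves and then adds learned row/column embeddings, so the "content" embedding is as much a part of it as the positional part.

```python
import torch
import torch.nn as nn


class ImagePositionalEmbeddingsSketch(nn.Module):
    """Simplified sketch: embeds discrete VQ latent indices and adds
    learned row/column embeddings (as used by VQ-Diffusion)."""

    def __init__(self, num_embed: int, height: int, width: int, embed_dim: int):
        super().__init__()
        self.emb = nn.Embedding(num_embed, embed_dim)      # content (codebook) tokens
        self.height_emb = nn.Embedding(height, embed_dim)  # row positions
        self.width_emb = nn.Embedding(width, embed_dim)    # column positions
        self.height = height
        self.width = width

    def forward(self, index: torch.LongTensor) -> torch.Tensor:
        # index: (batch, height * width) of discrete codebook indices
        emb = self.emb(index)
        rows = self.height_emb(torch.arange(self.height, device=index.device))
        cols = self.width_emb(torch.arange(self.width, device=index.device))
        # broadcast row/column embeddings over the flattened grid
        pos = (rows[:, None, :] + cols[None, :, :]).reshape(1, self.height * self.width, -1)
        return emb + pos[:, : emb.shape[1], :]
```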
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: Yiyi Xu <yixu310@gmail.com>
return embeddings
...
class HunyuanDiTAttentionPool(nn.Module):
This and the other attention pool are both used by some other embedding (I assume it's one of the combined ones); we can move it to where it is used.
Also, I think they are both "text", so it's also OK to move them to text_image.
Thanks!
The other AttentionPool class is used for TextTimeEmbedding (which, in theory, is also a kind of combined embedding class, IMO).
However, HunyuanDiTAttentionPool is used in HunyuanCombinedTimestepTextSizeStyleEmbedding, which combines timesteps, text embeddings, and additional things. So it's clearly not just text, and IMO it's better to keep it in combined.py.
WDYT?
> However HunyuanDiTAttentionPool is used in HunyuanCombinedTimestepTextSizeStyleEmbedding that combines timesteps, text embeddings, and additional things. So, it's clearly not just text. So, IMO, it's better to keep it in combined.py
HunyuanCombinedTimestepTextSizeStyleEmbedding takes a combination of inputs, but HunyuanDiTAttentionPool is only used to project the text inputs - but I'm fine with putting it in combined.py since it probably won't be used on its own. Same with the other AttentionPool for TextTimeEmbedding.
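For context, this is roughly what such an attention-pool layer does (a generic sketch, not the exact HunyuanDiTAttentionPool code): the mean of the sequence is prepended as a query token that attends over the tokens, producing a single pooled vector.

```python
import torch
import torch.nn as nn


class AttentionPoolSketch(nn.Module):
    """Generic sketch of an attention-pooling layer: the sequence mean acts as
    a query over the (positionally embedded) tokens, yielding one pooled vector."""

    def __init__(self, seq_len: int, dim: int, num_heads: int = 8, output_dim=None):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(seq_len + 1, dim) / dim**0.5)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, output_dim or dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. text encoder hidden states
        x = torch.cat([x.mean(dim=1, keepdim=True), x], dim=1)  # prepend the mean token
        x = x + self.pos_embed[None, : x.shape[1], :]
        pooled, _ = self.attn(x[:, :1], x, x)                   # mean token queries all tokens
        return self.proj(pooled.squeeze(1))                     # (batch, output_dim)
```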
Sorry about the iteration here. TextTimeEmbedding is really just about projecting hidden_states. So, no combination. And also, after taking into consideration what was said above, I thought it would make sense to keep HunyuanCombinedTimestepTextSizeStyleEmbedding in image_text, actually. So, your original suggestion.
I hope it makes sense now.
> So, no combination. And also, after taking into consideration what was said above, I thought it would make sense to keep HunyuanCombinedTimestepTextSizeStyleEmbedding in image_text, actually. So, your original suggestion.
Sorry, when did I suggest putting it in image_text? It is clearly combined, no?
> HunyuanCombinedTimestepTextSizeStyleEmbedding takes a combination of inputs, but HunyuanDiTAttentionPool is only used to project the text inputs - but I'm fine with putting it in combined.py since it probably won't be used on its own. Same with the other AttentionPool for TextTimeEmbedding.
I thought this meant putting it in "image_text", but that you were okay with putting it in combined as well. Sorry if I misunderstood.
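To make the "combined vs. projection" distinction concrete, here is a very rough conceptual sketch of what a combined conditioning embedding like HunyuanCombinedTimestepTextSizeStyleEmbedding does. This is not the actual diffusers code: the real class uses HunyuanDiTAttentionPool instead of mean pooling and sinusoidal embeddings for the size inputs, and the dimensions below are illustrative.

```python
import torch
import torch.nn as nn


class CombinedConditioningSketch(nn.Module):
    """Conceptual sketch only: pool the text hidden states, embed the
    timestep/size/style inputs, and fuse them into one conditioning vector."""

    def __init__(self, dim: int, text_dim: int, num_styles: int = 1):
        super().__init__()
        self.timestep_mlp = nn.Sequential(nn.Linear(256, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.text_proj = nn.Linear(text_dim, 1024)  # stand-in for the attention pool
        self.style_embed = nn.Embedding(num_styles, dim)
        # pooled text (1024) + 6 raw size values + style features -> conditioning dim
        self.extra_mlp = nn.Sequential(nn.Linear(1024 + 6 + dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, timestep_freq, text_states, image_meta_size, style):
        # timestep_freq: (B, 256) sinusoidal features, text_states: (B, seq_len, text_dim)
        # image_meta_size: (B, 6) floats, style: (B,) integer ids
        t_emb = self.timestep_mlp(timestep_freq)
        pooled_text = self.text_proj(text_states.mean(dim=1))
        extra = torch.cat([pooled_text, image_meta_size, self.style_embed(style)], dim=1)
        return t_emb + self.extra_mlp(extra)  # single conditioning vector, (B, dim)
```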
import torch.nn.functional as F
...
class LabelEmbedding(nn.Module):
why is this one here now?
Because it's only used by a single class below. I followed the philosophy behind placing the attention pooling layers.
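For context, LabelEmbedding is a small class: it embeds class labels and can randomly drop them for classifier-free guidance. A rough sketch (close to, but not necessarily identical to, the diffusers implementation):

```python
import torch
import torch.nn as nn


class LabelEmbeddingSketch(nn.Module):
    """Rough sketch: embeds class labels into vectors and randomly replaces
    labels with a learned "null" class to enable classifier-free guidance."""

    def __init__(self, num_classes: int, hidden_size: int, dropout_prob: float):
        super().__init__()
        use_cfg_embedding = dropout_prob > 0
        self.embedding_table = nn.Embedding(num_classes + use_cfg_embedding, hidden_size)
        self.num_classes = num_classes
        self.dropout_prob = dropout_prob

    def token_drop(self, labels: torch.LongTensor) -> torch.LongTensor:
        # Replace a random subset of labels with the extra "null" class id.
        drop_ids = torch.rand(labels.shape, device=labels.device) < self.dropout_prob
        return torch.where(drop_ids, torch.full_like(labels, self.num_classes), labels)

    def forward(self, labels: torch.LongTensor) -> torch.Tensor:
        if self.training and self.dropout_prob > 0:
            labels = self.token_drop(labels)
        return self.embedding_table(labels)
```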
Got it! Let's put it into others and put the attention pooling layers into text_image.
I thought these attention pooling layers have been here for a long time and no one else has used them, so it is OK to just put them next to the class that uses them (same for LabelEmbedding). But if we want one rule that applies to all such situations, I think it is better to always put them under the respective files so that they are more likely to be reused.
return emb
...
class TextImageProjection(nn.Module):
This feels like it should be in combined, no?
return x.squeeze(0)
...
class HunyuanCombinedTimestepTextSizeStyleEmbedding(nn.Module):
Perhaps this is better suited to combined?
return self.norm(x)
...
class PixArtAlphaTextProjection(nn.Module):
Didn't we want to rename this to a generic name because it's used in 3 models?
I wanted us to agree on the separation of classes first. Renaming, etc., can be dealt with later.
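For reference, the class in question is essentially a small MLP over text-encoder features, which is why a generic name was suggested. A rough sketch, not the verbatim implementation:

```python
import torch.nn as nn


class TextProjectionSketch(nn.Module):
    """Rough sketch of a caption/text projection: two linear layers with a
    GELU in between, mapping text-encoder features to the model's hidden size."""

    def __init__(self, in_features: int, hidden_size: int, out_features=None):
        super().__init__()
        out_features = out_features or hidden_size
        self.linear_1 = nn.Linear(in_features, hidden_size)
        self.act_1 = nn.GELU(approximate="tanh")
        self.linear_2 = nn.Linear(hidden_size, out_features)

    def forward(self, caption):
        return self.linear_2(self.act_1(self.linear_1(caption)))
```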
@DN6 this is a point to be worked out. What criterion to follow for placing a class in ...
    LabelEmbedding,
    PixArtAlphaCombinedTimestepSizeEmbeddings,
)
from .image_text import (
Can't we split these into image and text?
I was trying to follow this:

> IMO, it's perhaps better to have all the embedding classes in ...

Sorry about the back and forth here.
This makes sense to me.
Alright. Will wait for @yiyixuxu to comment as well before making changes.
combined does not need "combined" in names, of course
For this, I am not sure what you mean. Do you want to put all the timestep embeddings into ...?
ok with this
And in addition, I made a comment here: #8722 (comment) - we can put these attention pool layers into ...
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
Extension of #7995.
Some comments are inline.
Todos