
Conversation

@sayakpaul
Member

@sayakpaul sayakpaul commented Jun 27, 2024

What does this PR do?

Extension of #7995.

Some comments are inline.

Todos

  • Add licensing

@@ -0,0 +1,34 @@
from .combined import (
Member Author

This way nothing should break in terms of imports.
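The backward-compatible re-export pattern mentioned here can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual contents: the package root re-binds names from its submodules so that existing `from ...embeddings import X` call sites keep working after the split. The module and class names below are hypothetical.

```python
import sys
import types


class CombinedTimestepTextEmbedding:
    """Illustrative stand-in for a class moved into a `combined` submodule."""


# Build a stand-in `embeddings.combined` submodule holding the class.
combined = types.ModuleType("embeddings.combined")
combined.CombinedTimestepTextEmbedding = CombinedTimestepTextEmbedding

# The package __init__ would do `from .combined import ...`;
# here we emulate that by re-binding the name on the package root.
embeddings = types.ModuleType("embeddings")
embeddings.combined = combined
embeddings.CombinedTimestepTextEmbedding = combined.CombinedTimestepTextEmbedding

sys.modules["embeddings"] = embeddings
sys.modules["embeddings.combined"] = combined

# The old import path still resolves to the same class:
from embeddings import CombinedTimestepTextEmbedding as old_path

assert old_path is combined.CombinedTimestepTextEmbedding
```

Because the root re-exports every moved name, callers never need to know which submodule a class landed in.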

return (latent + pos_embed).to(latent.dtype)


class ImagePositionalEmbeddings(nn.Module):
Member Author

Even though it's named ImagePositionalEmbeddings, it really isn't about positions. See VQDiffusion for details.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul sayakpaul requested review from DN6 and yiyixuxu June 27, 2024 11:43
@sayakpaul sayakpaul marked this pull request as ready for review June 27, 2024 11:43
return embeddings


class HunyuanDiTAttentionPool(nn.Module):
Collaborator

This and the other attention pool are both used by some other embedding (I assume it's one of the combined ones), so we can move them to where they are used.

Collaborator

Also, I think they are both "text", so it's also OK to move them to text_image.

Member Author

Thanks!

The other AttentionPool class is used for TextTimeEmbedding (which, in theory, is also a kind of combined embedding class, IMO).

However, HunyuanDiTAttentionPool is used in HunyuanCombinedTimestepTextSizeStyleEmbedding, which combines timesteps, text embeddings, and additional things. So it's clearly not just text, and IMO it's better to keep it in combined.py.

WDYT?

Collaborator

> However HunyuanDiTAttentionPool is used in HunyuanCombinedTimestepTextSizeStyleEmbedding that combines timesteps, text embeddings, and additional things. So, it's clearly not just text. So, IMO, it's better to keep it in combined.py

HunyuanCombinedTimestepTextSizeStyleEmbedding takes a combination of inputs, but HunyuanDiTAttentionPool is only used to project the text inputs. Still, I'm fine putting it in combined.py since it probably won't be used on its own. Same with the other AttentionPool for TextTimeEmbedding.
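For context, the general mechanism behind these attention pool layers can be sketched as follows. This is not the diffusers implementation, just an illustrative NumPy sketch of single-query attention pooling: one learnable query attends over the token sequence, and the attention weights mix the values into a single pooled vector. All names and shapes here are assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(tokens, query, w_k, w_v):
    """Pool a (seq, dim) token sequence into one (dim,) vector.

    A single learnable query attends over the projected keys; the
    resulting attention weights mix the values into one embedding.
    """
    keys = tokens @ w_k                                        # (seq, dim)
    values = tokens @ w_v                                      # (seq, dim)
    scores = softmax(keys @ query / np.sqrt(query.shape[0]))   # (seq,)
    return scores @ values                                     # (dim,)


rng = np.random.default_rng(0)
seq_len, dim = 5, 8
tokens = rng.normal(size=(seq_len, dim))
pooled = attention_pool(
    tokens,
    query=rng.normal(size=dim),
    w_k=rng.normal(size=(dim, dim)),
    w_v=rng.normal(size=(dim, dim)),
)
assert pooled.shape == (dim,)
```

This is why the layer is "only used to project the text inputs": pooling collapses the text sequence to a fixed-size vector that a combined embedding can then concatenate with timestep and size conditioning.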

Member Author

Sorry about the iteration here. TextTimeEmbedding is really just about projecting hidden_states, so there is no combination. And also, after taking into consideration what was said above, I thought it would make sense to keep HunyuanCombinedTimestepTextSizeStyleEmbedding in image_text, actually. So, your original suggestion.

I hope it makes sense now.

Collaborator

@yiyixuxu yiyixuxu Jul 3, 2024

> So, no combination. And also, after taking into consideration what was said above, I thought it would make sense to keep HunyuanCombinedTimestepTextSizeStyleEmbedding in image_text, actually. So, your original suggestion.

Sorry, when did I suggest putting it in image_text? It is clearly combined, no?

Member Author

> HunyuanCombinedTimestepTextSizeStyleEmbedding takes a combination of inputs, but HunyuanDiTAttentionPool is only used to project the text inputs. Still, I'm fine putting it in combined.py since it probably won't be used on its own. Same with the other AttentionPool for TextTimeEmbedding.

I thought this meant putting it in "image_text", but that you were okay with combined as well. Sorry if I misunderstood.

@sayakpaul sayakpaul requested review from DN6 and yiyixuxu July 3, 2024 02:54
import torch.nn.functional as F


class LabelEmbedding(nn.Module):
Collaborator

why is this one here now?

Member Author

Because it's only used by a single class below. I followed the philosophy behind placing the attention pooling layers.

Collaborator

Got it! Let's put it into others and put the attention pooling layers into text_image.

I thought these attention pooling layers had been here for a long time without anyone else using them, so it was OK to just put them next to the class that uses them (same for LabelEmbedding). But if we want one rule that applies to every such situation, I think it is better to always put them in their respective files so that they are more likely to be reused.

return emb


class TextImageProjection(nn.Module):
Collaborator

This feels like it should be in combined, no?

return x.squeeze(0)


class HunyuanCombinedTimestepTextSizeStyleEmbedding(nn.Module):
Collaborator

Perhaps this is better suited to combined?

return self.norm(x)


class PixArtAlphaTextProjection(nn.Module):
Collaborator

Didn't we want to rename this to a generic name because it's used in 3 models?

Member Author

Wanted us to agree on the separation of classes first. Renaming, etc. can be dealt with later.

@sayakpaul
Member Author

@DN6 this is a point to be worked out: what criterion to follow for placing a class in combined, and what to follow for image_text. At the moment, embedding classes dealing with images and texts are placed in image_text; combined is usually for classes that take either timesteps or other forms of conditioning into account, along with {image,text}.

LabelEmbedding,
PixArtAlphaCombinedTimestepSizeEmbeddings,
)
from .image_text import (
Collaborator

Can't we split these into image and text?

@sayakpaul
Member Author

sayakpaul commented Jul 3, 2024

> Can't we split these into image and text?

I was trying to follow this:

#7995 (comment)

IMO, it's perhaps better to have all the embedding classes in combined.py that deal with timestep or mix modalities (image and text, for example), regardless of whether they have Combined in their class names or not. And then split image_text into image and text separately w.r.t. their types. So, if a class is ONLY dealing with text, we put it in embeddings/text.py.

WDYT @DN6 @yiyixuxu ?

Sorry about the back and forth here.

@DN6
Collaborator

DN6 commented Jul 3, 2024

> IMO, it's perhaps better to have all the embedding classes in combined.py that deal with timestep or mix modalities (image and text, for example), regardless of whether they have Combined in their class names or not. And then split image_text into image and text separately w.r.t. their types. So, if a class is ONLY dealing with text, we put it in embeddings/text.py.

This makes sense to me.

@sayakpaul
Member Author

Alright. Will wait for @yiyixuxu to comment as well before making changes.

@yiyixuxu
Collaborator

yiyixuxu commented Jul 3, 2024

Combined does not need "combined" in the names, of course.

> regardless of whether they have Combined in their class names or not.

For this, I am not sure what you mean. Do you want to put all the timestep embeddings into combined.py? Timestep embeddings should have their own file; but a "combined" embedding that takes various inputs, including timestep, should of course go into combined.

> it's perhaps better to have all the embedding classes in combined.py that deal with timestep or mix modalities

OK with this.

> And then split image_text into image and text separately w.r.t. their types. So, if a class is ONLY dealing with text, we put it in embeddings/text.py.

And in addition, I made a comment here: #8722 (comment). We can put these attention pool layers into text.
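Taken together, the thread above points at a layout roughly like the following. This is only an illustrative sketch of where the discussion landed; the file names (and the existence of an `others.py`) are assumptions, not the PR's final contents.

```
models/embeddings/
    __init__.py   # re-exports everything, so existing imports keep working
    combined.py   # embeddings mixing timestep and/or multiple modalities
    image.py      # image-only embeddings
    text.py       # text-only embeddings, incl. the attention pool layers
    others.py     # e.g. LabelEmbedding
```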

@sayakpaul sayakpaul requested review from DN6 and yiyixuxu July 4, 2024 03:34
@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Sep 14, 2024
@yiyixuxu yiyixuxu closed this Nov 20, 2024
@yiyixuxu yiyixuxu deleted the embeddings-refactor branch November 20, 2024 22:29