
Unused arguments: n_queries and attn_pooler_heads in MultimodalCfg #434

fedshyvana opened this issue Feb 18, 2023 · 4 comments

fedshyvana commented Feb 18, 2023

Hi, thanks @gpucce for implementing CoCa! I am reading the code, and under MultimodalCfg (used by CoCa to construct the multimodal decoder) there are n_queries and attn_pooler_heads (here), but correct me if I am wrong, I don't see them used anywhere. It seems the cross-attention layers in the multimodal decoder currently attend to an arbitrarily long sequence of image embedding tokens, whereas I believe in lucidrains' implementation the image tokens are first reduced to a fixed-length sequence using a set number of queries (e.g. 256). Any clarification would be appreciated! Thanks.
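For readers following along, here is a minimal PyTorch sketch of the fixed-query attentional pooling being discussed. It is loosely modeled on open_clip's AttentionalPooler and lucidrains' coca-pytorch, not the exact code of either:

```python
import torch
import torch.nn as nn


class AttentionalPooler(nn.Module):
    """Pool a variable-length token sequence down to a fixed number of
    learned queries via cross-attention (a sketch, not open_clip's code)."""

    def __init__(self, d_model: int, n_queries: int = 256, n_heads: int = 8):
        super().__init__()
        # The queries are parameters, so their number and width are fixed
        # at model initialisation -- this is the point discussed below.
        self.query = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_q = nn.LayerNorm(d_model)
        self.ln_k = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) image tokens; seq_len may vary.
        x = self.ln_k(x)
        q = self.ln_q(self.query).unsqueeze(0).expand(x.shape[0], -1, -1)
        pooled, _ = self.attn(q, x, x, need_weights=False)
        return pooled  # (batch, n_queries, d_model): length fixed by n_queries
```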

gpucce commented Feb 18, 2023

Hi, you are right that the arguments are not passed down to the vision transformer; I will fix that.

As for the query shape in the attentional pooler, I don't think there is a difference: in lucidrains' implementation the queries passed to the attentional pooler are also 256, the same as here. In both cases their width is fixed at model initialisation (embed_dim here, dim in coca-pytorch), and they are used to compute attention against the output of the image encoder.

Does it make sense to you?
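To make the fixed-length point concrete with the sketch above: whatever the incoming token count, the pooled output always has n_queries tokens.

```python
pooler = AttentionalPooler(d_model=768, n_queries=256, n_heads=8)

short_seq = torch.randn(2, 50, 768)   # e.g. ViT-B/32 @ 224px: 49 patches + CLS
long_seq = torch.randn(2, 197, 768)   # e.g. ViT-B/16 @ 224px: 196 patches + CLS

print(pooler(short_seq).shape)  # torch.Size([2, 256, 768])
print(pooler(long_seq).shape)   # torch.Size([2, 256, 768]) -- same fixed length
```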

gpucce commented Feb 18, 2023

Also, to avoid further confusion: while those arguments in MultimodalCfg are outdated, the equivalent ones can be passed to the vision transformer and should be used correctly there. Thanks for pointing it out!
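As an illustration of where those equivalent arguments live, a vision_cfg along these lines should enable the pooler on the vision tower (the key names below follow open_clip's CLIPVisionCfg as of this thread; verify them against the repo's model_configs):

```python
# Assumed key names -- check open_clip's model_configs for the exact spelling.
vision_cfg = {
    "image_size": 224,
    "layers": 12,
    "width": 768,
    "patch_size": 32,
    "attentional_pool": True,  # attach an AttentionalPooler to the vision tower
    "n_queries": 256,          # number of learned queries -> fixed output length
    "attn_pooler_heads": 8,    # attention heads inside the pooler
}
```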

fedshyvana commented Feb 18, 2023

Hi @gpucce, thank you so much for the reply and clarification! Yes, the use of n_queries/attn_pooler_heads in the vision_cfg seems clear. For clarification though, this currently only applies to models instantiated from the VisionTransformer class, correct? I.e. if we use TimmModel, no AttentionalPooler / attention-pooling setup is added and the arguments are ignored. Would it make sense to add AttentionalPooler to TimmModel as well, so that the behavior of the two arguments is consistent across all visual encoder models? Thanks again.

gpucce commented Feb 18, 2023

> Would it make sense to add AttentionalPooler to TimmModel as well, so that the behavior of the two arguments is consistent across all visual encoder models?

Yeah indeed, I want to add support for the rest of the models, but unfortunately I don't think I will be able to do it any time soon :(

I will write here once I start!
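In the meantime, purely as a hypothetical sketch of what that support could look like (this is not open_clip's TimmModel), one could route a ViT-style timm backbone's patch tokens through the AttentionalPooler sketched earlier in this thread:

```python
import timm
import torch


class TimmWithAttentionalPool(torch.nn.Module):
    """Hypothetical wrapper: timm trunk + the AttentionalPooler defined above."""

    def __init__(self, model_name: str, n_queries: int = 256, n_heads: int = 8):
        super().__init__()
        # num_classes=0 and global_pool="" keep the per-patch token sequence.
        self.trunk = timm.create_model(
            model_name, pretrained=False, num_classes=0, global_pool=""
        )
        self.pooler = AttentionalPooler(self.trunk.num_features, n_queries, n_heads)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.trunk.forward_features(images)  # (B, seq_len, C) for ViTs
        return self.pooler(tokens)                    # (B, n_queries, C)
```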
