VisionTextDualEncoder #13511
Conversation
Force-pushed c131df7 → bcbd648
Review comments (resolved):
- src/transformers/models/vision_text_dual_encoder/configuration_vision_text_dual_encoder.py (outdated)
- src/transformers/models/vision_text_dual_encoder/modeling_flax_vision_text_dual_encoder.py (outdated)
- src/transformers/models/vision_text_dual_encoder/modeling_vision_text_dual_encoder.py
Force-pushed c171e6d → 3ac5e95
Review comments (resolved):
- src/transformers/models/vision_text_dual_encoder/configuration_vision_text_dual_encoder.py (3 comments, outdated)
- src/transformers/models/vision_text_dual_encoder/modeling_flax_vision_text_dual_encoder.py (3 comments)
- src/transformers/models/vision_text_dual_encoder/modeling_vision_text_dual_encoder.py (5 comments)
- src/transformers/models/vision_text_dual_encoder/processing_vision_text_dual_encoder.py (2 comments)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Force-pushed 94e01df → 09eca4e
@@ -511,6 +511,8 @@ Flax), PyTorch, and/or TensorFlow.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Vision Encoder decoder      | ❌             | ❌             | ✅              | ❌                 | ✅           |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| VisionTextDualEncoder       | ❌             | ❌             | ✅              | ❌                 | ✅           |
Let's also update the README no?
I believe @sgugger was against this.
Yes, we removed generic classes like the Vision Encoder Decoder, especially since they did not have a research article to go with them. This one has an article as an example, so you can add it if you really want to.
Since there is no official research implementation and pre-trained checkpoint, I would also prefer not to add it.
Ok for me
Looks great! Think the only thing left to do is to update the README to include the model class there as well :-)
* init vision_text_dual_encoder
* fix merge
* remove extra heads
* fix tests
* remove VISION_TEXT_DUAL_ENCODER_PRETRAINED_CONFIG_ARCHIVE_MAP
* remove archive map
* fix imports
* fix more imports
* fix init
* delete tokenizers
* fix imports
* clean
* support clip's vision model
* handle None config
* begin tests
* more test and few fixes
* warn about newly init weights
* more tests
* add loss to model
* remove extra classes from doc
* add processor
* doc and small fixes
* add start docstr
* update flax model
* flax tests
* more flax tests
* doc
* quality
* doc and quality
* fix doc
* doc
* remove comments
* update warning
* quality
* fix docs
* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* replace asserts, fix imports
* update imports
* fix import
* address some review comments
* fix check
* reduce tolerance
* fix test
* add flax integration test
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* address Sylvain's comments
* fix style
* add pt_flax_equivalence test in PT tests
* add pt integration test
* update test
* use pre-trained checkpoint in examples

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?

This PR adds a VisionTextDualEncoder model in PyTorch and Flax, which makes it possible to load any pre-trained vision model (ViT, DeiT, BEiT, CLIP's vision model) and any pre-trained text model (BERT, RoBERTa) from the library for vision-text tasks like CLIP. The model pairs a vision encoder and a text encoder and adds projection layers that map the embeddings of both modalities into a shared embedding space of the same dimension, which can then be used to align the two modalities.
The API to load the config and model is similar to the API of the EncoderDecoder and VisionEncoderDecoder models; for example, one can create a vit-bert model from the two sub-configs.

Since this is a multi-modal model, this PR also adds a generic
VisionTextDualEncoderProcessor, which wraps any feature extractor and tokenizer.
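The processor's role can be illustrated with a toy sketch. The classes below are hypothetical stand-ins, not the library's API; they only show the wrapping idea: text goes to the tokenizer, images go to the feature extractor, and the two result dicts are merged into one batch.

```python
class ToyTokenizer:
    def __call__(self, text):
        # Trivial per-word "tokenization" standing in for a real tokenizer.
        return {"input_ids": [[len(w) for w in t.split()] for t in text]}

class ToyFeatureExtractor:
    def __call__(self, images):
        # Stand-in for resizing/normalization; here we pass pixel lists through.
        return {"pixel_values": images}

class ToyVisionTextProcessor:
    # Mirrors the idea of VisionTextDualEncoderProcessor: wrap any pair of preprocessors.
    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None):
        out = {}
        if text is not None:
            out.update(self.tokenizer(text))
        if images is not None:
            out.update(self.feature_extractor(images))
        return out

proc = ToyVisionTextProcessor(ToyFeatureExtractor(), ToyTokenizer())
batch = proc(text=["a cat", "two dogs running"], images=[[0.1, 0.2]])
print(sorted(batch))  # ['input_ids', 'pixel_values']
```

The merged dict can then be fed directly to the dual-encoder model, since it carries both the text inputs and the pixel values in one batch.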