VisionTextDualEncoder #13511

Merged: 51 commits, Nov 30, 2021

@patil-suraj patil-suraj commented Sep 10, 2021

What does this PR do?

This PR adds the VisionTextDualEncoder model in PyTorch and Flax. It can load any pre-trained vision model (ViT, DeiT, BEiT, CLIP's vision model) and text model (BERT, RoBERTa) in the library for CLIP-like vision-text tasks.

The model pairs a vision encoder with a text encoder and adds projection layers that map both sets of embeddings into a shared embedding space of the same dimension (`projection_dim`), which can then be used to align the two modalities.
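To make the projection-and-alignment idea concrete, here is a minimal pure-Python sketch. It assumes a CLIP-style symmetric contrastive loss, which the description only implies by mentioning CLIP; the weights, toy hidden states, and helper names (`project`, `contrastive_loss`, `logit_scale` as a plain float) are hypothetical and are not taken from the PR's implementation.

```python
import math


def project(vec, weight):
    """Bias-free linear projection; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the true class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embeds, text_embeds, logit_scale=1.0):
    """Symmetric loss: the i-th image and the i-th text form the positive pair."""
    n = len(image_embeds)
    # Similarity matrix: rows index texts, columns index images.
    logits = [
        [logit_scale * sum(t * v for t, v in zip(text, image)) for image in image_embeds]
        for text in text_embeds
    ]
    text_loss = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    image_loss = sum(cross_entropy(columns[i], i) for i in range(n)) / n
    return (text_loss + image_loss) / 2


# Toy encoder outputs with different hidden sizes (3 for vision, 2 for text).
vision_hidden = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_hidden = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical projection weights mapping both modalities to a 2-d shared space.
visual_projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_projection = [[1.0, 0.0], [0.0, 1.0]]

image_embeds = [l2_normalize(project(h, visual_projection)) for h in vision_hidden]
text_embeds = [l2_normalize(project(h, text_projection)) for h in text_hidden]

# Matching pairs should score a much lower loss than mismatched pairs.
aligned_loss = contrastive_loss(image_embeds, text_embeds, logit_scale=10.0)
shuffled_loss = contrastive_loss(image_embeds, text_embeds[::-1], logit_scale=10.0)
```

Once both modalities live in the same space, a plain dot product between normalized embeddings serves as the similarity score, which is why the two encoders can have different hidden sizes.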

The API to load the config and model is similar to the API of EncoderDecoder and VisionEncoderDecoder models.

  • Load a ViT-BERT model from configs
from transformers import BertConfig, ViTConfig, VisionTextDualEncoderConfig, VisionTextDualEncoderModel

config_vision = ViTConfig()
config_text = BertConfig()

config = VisionTextDualEncoderConfig.from_vision_text_configs(config_vision, config_text, projection_dim=512)
# Initialize a ViT-BERT model with random weights from the combined config
model = VisionTextDualEncoderModel(config=config)
  • Load using pre-trained vision and text models
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "bert-base-uncased"
)

Since this is a multi-modal model, this PR also adds a generic VisionTextDualEncoderProcessor, which wraps any feature extractor and tokenizer into a single processor.

from transformers import BertTokenizer, ViTFeatureExtractor, VisionTextDualEncoderProcessor

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
processor = VisionTextDualEncoderProcessor(feature_extractor, tokenizer)
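The wrapping pattern itself can be illustrated with a toy sketch. Everything below (`ToyTokenizer`, `ToyFeatureExtractor`, `ToyProcessor`, and the `text`/`images` argument names) is a hypothetical stand-in written for illustration, not the library's implementation; it only shows how one call can dispatch to both components and merge their outputs into a single set of model inputs.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer: maps words to integer ids."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Unknown words fall back to id 0.
        ids = [self.vocab.get(word, 0) for word in text.lower().split()]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


class ToyFeatureExtractor:
    """Hypothetical stand-in for a feature extractor: rescales pixels to [0, 1]."""

    def __call__(self, image):
        return {"pixel_values": [[p / 255 for p in row] for row in image]}


class ToyProcessor:
    """Wraps one feature extractor and one tokenizer behind a single call."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None):
        if text is None and images is None:
            raise ValueError("Provide at least one of `text` or `images`.")
        encoding = {}
        if text is not None:
            encoding.update(self.tokenizer(text))
        if images is not None:
            encoding.update(self.feature_extractor(images))
        return encoding


toy_processor = ToyProcessor(ToyFeatureExtractor(), ToyTokenizer({"a": 1, "cat": 2}))
batch = toy_processor(text="A cat", images=[[0, 255], [51, 102]])
```

Because the wrapper only assumes that each component is callable and returns a dictionary, any tokenizer can be combined with any feature extractor, which is the point of a generic processor.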

@patil-suraj added the WIP label Oct 11, 2021
@huggingface deleted a comment from github-actions bot Oct 11, 2021
@patil-suraj changed the title from "[WIP] VisionTextDualEncoder" to "VisionTextDualEncoder" Nov 18, 2021
@patil-suraj removed the WIP label Nov 30, 2021
@@ -511,6 +511,8 @@ Flax), PyTorch, and/or TensorFlow.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Vision Encoder decoder | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ |
Contributor

Let's also update the README, no?

Contributor

I believe @sgugger was against this.

Collaborator

Yes, we removed generic classes like the Vision Encoder Decoder, especially since they did not have an accompanying research article. This one has an article as an example, so you can add it if you really want it.

Contributor Author

Since there is no official research implementation or pre-trained checkpoint, I would also prefer not to add it.

Contributor

Ok for me

Contributor

@patrickvonplaten left a comment

Looks great! Think the only thing left to do is to update the README to include the model class there as well :-)

@patil-suraj patil-suraj merged commit fc1d97f into huggingface:master Nov 30, 2021
@patil-suraj patil-suraj deleted the vision-text-clip branch November 30, 2021 16:51
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
* init vision_text_dual_encoder

* fix merge

* remove extra heads

* fix tests

* remove VISION_TEXT_DUAL_ENCODER_PRETRAINED_CONFIG_ARCHIVE_MAP

* remove archive map

* fix imports

* fix more imports

* fix init

* delete tokenizers

* fix imports

* clean

* support clip's vision model

* handle None config

* begin tests

* more test and few fixes

* warn about newly init weights

* more tests

* add loss to model

* remove extra classes from doc

* add processor

* doc and small fixes

* add start docstr

* update flax model

* flax tests

* more flax tests

* doc

* quality

* doc and quality

* fix doc

* doc

* remove comments

* update warning

* quality

* fix docs

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* replace asserts, fix imports

* update imports

* fix import

* address some review comments

* fix check

* reduce tolerance

* fix test

* add flax integration test

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* address Sylvain's comments

* fix style

* add pt_flax_equivalence test in PT tests

* add pt integration test

* update test

* use pre-trained checkpoint in examples

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
4 participants