VisionTextDualEncoder #13511

patil-suraj · 2021-09-10T06:06:48Z

What does this PR do?

This PR adds the VisionTextDualEncoder model in PyTorch and Flax. It can load any pre-trained vision model (ViT, DeiT, BEiT, CLIP's vision model) and text model (BERT, RoBERTa) in the library for vision-text tasks like CLIP.

This model pairs a vision encoder with a text encoder and adds projection layers that map both sets of embeddings into a shared embedding space with the same dimension, which can then be used to align the two modalities.
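To make the alignment idea concrete, here is a minimal framework-free sketch of projecting two sets of encoder outputs into a shared space and scoring them with a CLIP-style symmetric contrastive loss. It is an illustration under assumptions, not the code in this PR: the helper names (`project`, `l2_normalize`, `contrastive_loss`), the bias-free projections, the toy weights, and the `logit_scale` value are all made up for the example.

```python
import math


def project(vec, weight):
    """Bias-free linear projection; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embeds, text_embeds, logit_scale=1.0):
    """Symmetric loss: the i-th text should match the i-th image and vice versa."""
    n = len(image_embeds)
    # logits_per_text[i][j] = scaled similarity between text i and image j
    logits_per_text = [
        [logit_scale * sum(t * v for t, v in zip(text, image)) for image in image_embeds]
        for text in text_embeds
    ]
    logits_per_image = [list(col) for col in zip(*logits_per_text)]
    text_loss = sum(cross_entropy(row, i) for i, row in enumerate(logits_per_text)) / n
    image_loss = sum(cross_entropy(row, i) for i, row in enumerate(logits_per_image)) / n
    return (text_loss + image_loss) / 2


# Toy encoder outputs: vision hidden size 3, text hidden size 2, projection dim 2.
vision_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical weights, shape (2, 3)
text_proj = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights, shape (2, 2)
vision_outputs = [[1.0, 0.0, 5.0], [0.0, 1.0, 5.0]]
text_outputs = [[2.0, 0.0], [0.0, 3.0]]

image_embeds = [l2_normalize(project(v, vision_proj)) for v in vision_outputs]
text_embeds = [l2_normalize(project(t, text_proj)) for t in text_outputs]

aligned = contrastive_loss(image_embeds, text_embeds, logit_scale=10.0)
swapped = contrastive_loss(image_embeds, text_embeds[::-1], logit_scale=10.0)
print(aligned < swapped)  # matching pairs give the lower loss
```

Both encoders can have different hidden sizes; only the projected embeddings need to share a dimension for the similarity matrix to be defined.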

The API to load the config and model is similar to the API of EncoderDecoder and VisionEncoderDecoder models.

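Load a ViT-BERT model from a config:

from transformers import BertConfig, ViTConfig, VisionTextDualEncoderConfig, VisionTextDualEncoderModel

config_vision = ViTConfig()
config_text = BertConfig()

# Combine the two sub-configs into a single dual-encoder config
config = VisionTextDualEncoderConfig.from_vision_text_configs(config_vision, config_text, projection_dim=512)

# Initialize a ViT-BERT model with random weights from the config
model = VisionTextDualEncoderModel(config=config)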
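
The composite-config pattern used above can be sketched with plain dataclasses. This is only an illustration of the idea (one config object that nests a vision config and a text config plus a `projection_dim`); the `Toy*` classes, their fields, and their default values are hypothetical and are not the classes added in this PR.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ToyVisionConfig:
    # Illustrative fields only
    hidden_size: int = 768
    patch_size: int = 16


@dataclass
class ToyTextConfig:
    # Illustrative fields only
    hidden_size: int = 768
    vocab_size: int = 30522


@dataclass
class ToyDualEncoderConfig:
    vision_config: ToyVisionConfig = field(default_factory=ToyVisionConfig)
    text_config: ToyTextConfig = field(default_factory=ToyTextConfig)
    projection_dim: int = 512

    @classmethod
    def from_vision_text_configs(cls, vision_config, text_config, **kwargs):
        """Bundle two independent sub-configs into one composite config."""
        return cls(vision_config=vision_config, text_config=text_config, **kwargs)

    def to_dict(self):
        """Serialize the composite config, including both nested sub-configs."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        """Rebuild the composite config from its serialized form."""
        return cls(
            vision_config=ToyVisionConfig(**data["vision_config"]),
            text_config=ToyTextConfig(**data["text_config"]),
            projection_dim=data["projection_dim"],
        )


toy_config = ToyDualEncoderConfig.from_vision_text_configs(
    ToyVisionConfig(), ToyTextConfig(hidden_size=1024), projection_dim=256
)
restored = ToyDualEncoderConfig.from_dict(toy_config.to_dict())
print(restored == toy_config)
```

Keeping the sub-configs nested means each encoder can be configured and serialized independently, while `projection_dim` belongs to the pairing itself.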

Load using pre-trained vision and text models:

model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "bert-base-uncased"
)

Since this is a multi-modal model, this PR also adds a generic VisionTextDualEncoderProcessor, which wraps any feature extractor and tokenizer:

from transformers import BertTokenizer, ViTFeatureExtractor, VisionTextDualEncoderProcessor

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
processor = VisionTextDualEncoderProcessor(feature_extractor, tokenizer)
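
The wrapping idea can be sketched without any downloads. In the toy below, text goes to the tokenizer, images go to the feature extractor, and the two outputs are merged into one batch. All three `Toy*` classes are hypothetical stand-ins written for this example; the real processor's call signature and return type may differ.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer: maps whitespace tokens to ids."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Unknown tokens map to id 0
        ids = [self.vocab.get(tok, 0) for tok in text.lower().split()]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


class ToyFeatureExtractor:
    """Hypothetical stand-in for a feature extractor: rescales 0-255 pixels to [0, 1]."""

    def __call__(self, image):
        return {"pixel_values": [[p / 255.0 for p in row] for row in image]}


class ToyDualProcessor:
    """Wraps one feature extractor and one tokenizer behind a single call."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None):
        if text is None and images is None:
            raise ValueError("You have to specify either text or images.")
        encoding = {}
        if text is not None:
            encoding.update(self.tokenizer(text))
        if images is not None:
            encoding.update(self.feature_extractor(images))
        return encoding


toy_processor = ToyDualProcessor(ToyFeatureExtractor(), ToyTokenizer({"a": 1, "cat": 2}))
batch = toy_processor(text="A cat", images=[[0, 255], [255, 0]])
print(sorted(batch))
```

Because the wrapper only delegates, any tokenizer and feature extractor pair with compatible outputs can be combined this way.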

src/transformers/models/vision_text_dual_encoder/configuration_vision_text_dual_encoder.py

src/transformers/models/vision_text_dual_encoder/modeling_flax_vision_text_dual_encoder.py

src/transformers/models/vision_text_dual_encoder/modeling_vision_text_dual_encoder.py

docs/source/model_doc/vision_text_dual_encoder.rst

src/transformers/models/vision_text_dual_encoder/__init__.py

src/transformers/models/vision_text_dual_encoder/processing_vision_text_dual_encoder.py

tests/test_modeling_flax_vision_text_dual_encoder.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

patrickvonplaten · 2021-11-30T15:22:22Z

docs/source/index.rst

@@ -511,6 +511,8 @@ Flax), PyTorch, and/or TensorFlow.
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |   Vision Encoder decoder    |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
+|    VisionTextDualEncoder    |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |


Let's also update the README, no?

I believe @sgugger was against this.

Yes, we removed generic classes like the Vision Encoder Decoder, especially since they did not have a research article coming with them. This one has an article as an example, so you can add it if you really want it.

Since there is no official research implementation or pre-trained checkpoint, I would also prefer not to add it.

patrickvonplaten

Looks great! Think the only thing left to do is to update the README to include the model class there as well :-)

* init vision_text_dual_encoder
* fix merge
* remove extra heads
* fix tests
* remove VISION_TEXT_DUAL_ENCODER_PRETRAINED_CONFIG_ARCHIVE_MAP
* remove archive map
* fix imports
* fix more imports
* fix init
* delete tokenizers
* fix imports
* clean
* support clip's vision model
* handle None config
* begin tests
* more test and few fixes
* warn about newly init weights
* more tests
* add loss to model
* remove extra classes from doc
* add processor
* doc and small fixes
* add start docstr
* update flax model
* flax tests
* more flax tests
* doc
* quality
* doc and quality
* fix doc
* doc
* remove comments
* update warning
* quality
* fix docs
* Apply suggestions from code review
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* replace asserts, fix imports
* update imports
* fix import
* address some review comments
* fix check
* reduce tolerance
* fix test
* add flax integration test
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* address Sylvain's comments
* fix style
* add pt_flax_equivalence test in PT tests
* add pt integration test
* update test
* use pre-trained checkpoint in examples

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

patil-suraj added the WIP label Oct 11, 2021

huggingface deleted a comment from github-actions bot Oct 11, 2021

patil-suraj force-pushed the vision-text-clip branch from c131df7 to bcbd648 November 16, 2021 12:23

patil-suraj mentioned this pull request Nov 16, 2021

Errors while importing FlaxHybridCLIP checkpoints to FlaxCLIPModel or CLIPModel #14417

Closed

patil-suraj commented Nov 18, 2021

patil-suraj changed the title ~~[WIP] VisionTextDualEncoder~~ VisionTextDualEncoder Nov 18, 2021

patil-suraj requested review from sgugger, LysandreJik and patrickvonplaten November 18, 2021 12:14

patil-suraj commented Nov 18, 2021

docs/source/model_doc/vision_text_dual_encoder.rst (outdated)

src/transformers/models/vision_text_dual_encoder/__init__.py (outdated)

patil-suraj force-pushed the vision-text-clip branch from c171e6d to 3ac5e95 November 18, 2021 13:45