Add VATT model #19865
Comments
@johko have you started implementing it?
@fcakyon yes I have started, but progress is still rather slow, as this is my first model contribution and I have to figure out some stuff.
@johko I totally understand. Interested in your implementation since I will be using VATT in my research next year :) Are you working on a TF implementation?
Sorry for the late reply (again 🙈). Yes, I'm working on a TF implementation. As the original repo uses it, I'm doing that first and will then see about PyTorch.
@johko, thanks for the response! I may also help with the PyTorch part once you finalize the TF implementation 👍
@fcakyon that would be great, as my expertise is more in TF 🙂
Hey @NielsRogge, I'm sorry, but I think I have to stop working on this for good. I'd love to finish it, but every time I think I finally have some time for it, something else comes up 😞 I just can't provide a big contribution like this atm and will rather focus on smaller things. But maybe @fcakyon wants to pick it up. Sorry for blocking this for so long.
Any news about the VATT PyTorch implementation?
Model description
Hey,
as discussed with @NielsRogge a few weeks back, I'd like to work on adding the "VATT: Transformers for Multimodal
Self-Supervised Learning from Raw Video, Audio and Text" model from Google.
It is basically three transformers (video/audio/text) that are trained jointly in an unsupervised manner using contrastive loss functions. For downstream tasks they fine-tune the transformers separately, but the paper also explores a version that shares the weights across all modalities.
For pre-training they use text-video-audio triplets from HowTo100M and video-audio pairs from AudioSet. The authors describe how to fine-tune VATT for vision and audio classification tasks and provide weights for the fine-tuned versions.
The backbone for vision is ViT, for audio a WaveFormTransformer, and for text they use BERT/T5.
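To make the joint-training objective concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style contrastive loss, where row i of each batch is a positive pair and the other rows act as in-batch negatives. This is a simplification: the actual VATT paper uses NCE for video-audio pairs and MIL-NCE for video-text pairs, with separate projection heads into common spaces, and all names and dimensions below are illustrative, not from the original implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_nce_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings.

    Row i of `a` and row i of `b` form a positive pair; every other
    row in the batch serves as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature        # (batch, batch) similarity matrix
    idx = np.arange(len(a))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)              # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()                # diagonal = positives

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy stand-ins for two modality towers: in the real model these would be
# transformer encoders producing embeddings of co-occurring clips.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = emb + 0.05 * rng.normal(size=emb.shape)   # slightly noisy positives

loss_aligned = symmetric_nce_loss(emb, aligned)       # small: pairs match
loss_shuffled = symmetric_nce_loss(emb, aligned[::-1])  # large: pairs broken
```

The same loss would be applied once per modality pair (video-audio, video-text), and the encoders are updated so that co-occurring clips land close together in the common space.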
Open source status
Provide useful links for the implementation
Paper: https://arxiv.org/pdf/2104.11178.pdf
GitHub: https://github.com/google-research/google-research/tree/master/vatt