EVA

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Abstract

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset with over a thousand categories and the COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.
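The pretext task described in the abstract can be illustrated with a toy example. The sketch below is not the official implementation: it assumes the reconstruction objective is a negative cosine similarity between predicted and target features, averaged over masked patch positions only. The function names `cosine_similarity` and `masked_feature_loss` are hypothetical, and the tiny 2-dimensional feature vectors stand in for real image-text aligned features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity over masked positions only.

    predicted, target: one feature vector per image patch.
    mask: one boolean per patch, True where the patch was masked out.
    Assumption: visible patches contribute nothing to the loss.
    """
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        raise ValueError("mask selects no patches")
    total = sum(cosine_similarity(predicted[i], target[i]) for i in masked)
    return -total / len(masked)


# Toy data: three patches with 2-dimensional features.
target = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0]]
predicted = [[6.0, 8.0], [1.0, 0.0], [2.0, 0.0]]
mask = [True, False, True]  # patch 1 is visible, so its mismatch is ignored

loss = masked_feature_loss(predicted, target, mask)
print(loss)
```

In this toy case the predictions at both masked positions point in the same direction as their targets, so the loss reaches its minimum of about -1, even though the visible patch is predicted badly.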

Results and models

merged-30M

The models pre-trained on merged-30M are intended only as starting points for fine-tuning, so they have no evaluation results.

| Model | Patch size | Resolution | Download |
| --- | --- | --- | --- |
| EVA-G (eva-g-p14_3rdparty_30m)* | 14 | 224x224 | model |
| EVA-G (eva-g-p16_3rdparty_30m)* | 14 to 16 | 224x224 | model |

Models with * are converted from the official repo.

ImageNet-21k

The models pre-trained on ImageNet-21k are intended only as starting points for fine-tuning, so they have no evaluation results.

| Model | Pretrain | Resolution | Download |
| --- | --- | --- | --- |
| EVA-G (eva-g-p14_30m-pre_3rdparty_in21k)* | merged-30M | 224x224 | model |
| EVA-L (eva-l-p14_3rdparty-mim_in21k)* | From scratch with MIM | 224x224 | model |
| EVA-L (eva-l-p14_mim-pre_3rdparty_in21k)* | MIM | 224x224 | model |

Models with * are converted from the official repo.

ImageNet-1k

| Model | Pretrain | Resolution | Params (M) | FLOPs (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EVA-G (eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px)* | merged-30M & ImageNet-21k | 336x336 | 1013.01 | 620.64 | 89.61 | 98.93 | config | model |
| EVA-G (eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px)* | merged-30M & ImageNet-21k | 560x560 | 1014.45 | 1906.76 | 89.71 | 98.96 | config | model |
| EVA-L (eva-l-p14_mim-pre_3rdparty_in1k-336px)* | MIM | 336x336 | 304.53 | 191.10 | 88.66 | 98.75 | config | model |
| EVA-L (eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px)* | MIM & ImageNet-21k | 336x336 | 304.53 | 191.10 | 89.17 | 98.86 | config | model |
| EVA-L (eva-l-p14_mim-pre_3rdparty_in1k-196px)* | MIM | 196x196 | 304.14 | 61.57 | 87.94 | 98.50 | config | model |
| EVA-L (eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px)* | MIM & ImageNet-21k | 196x196 | 304.14 | 61.57 | 88.58 | 98.65 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference.
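The ImageNet-1k variants of each model differ only in input resolution, which changes how many patch tokens the ViT processes. The short calculation below shows that relationship for the patch size of 14 used by these models. The helper name `patch_tokens` is hypothetical, the class token is ignored, and the embedding widths (1024 for EVA-L, 1408 for EVA-G) are assumptions not stated in this README.

```python
def patch_tokens(resolution, patch_size=14):
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


for resolution in (196, 224, 336, 560):
    print(resolution, patch_tokens(resolution))
# 196 196
# 224 256
# 336 576
# 560 1600

# Assumed embedding widths (not stated in this README).
EVA_L_WIDTH = 1024
EVA_G_WIDTH = 1408

# Extra position-embedding parameters when moving to the larger resolution.
extra_l = (patch_tokens(336) - patch_tokens(196)) * EVA_L_WIDTH
extra_g = (patch_tokens(560) - patch_tokens(336)) * EVA_G_WIDTH
print(extra_l, extra_g)
# 389120 1441792
```

Under these assumptions, the extra position embeddings come to roughly 0.39M and 1.44M parameters, which is consistent with the parameter gaps in the table (304.53 vs. 304.14 and 1014.45 vs. 1013.01). The token count also grows by about 2.9x from 196px to 336px, while the listed FLOPs grow by about 3.1x. Growth slightly faster than linear in the token count is what self-attention would be expected to produce.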

Citation

@article{EVA,
  title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
  author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
  journal={arXiv preprint arXiv:2211.07636},
  year={2022}
}