Masked Autoencoding with dBOT

[arXiv] [BibTex]

This is the official PyTorch implementation of Exploring Target Representations for Masked Autoencoders.

News 🎉

January 2024 - The paper was accepted to ICLR 2024.
November 2022 - Release the code and pre-trained models.
September 2022 - Release the pre-print on arXiv.

Installation

For installation and preparation, please follow the instructions of MAE and iBOT. This repo is built upon python==3.6, timm==0.4.12, and pytorch==1.9.0.
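A quick way to confirm that an environment matches the versions listed above is sketched below. This is not part of the repository: the helper names `version_mismatches` and `installed_versions` are hypothetical, and the check assumes each package exposes a `__version__` attribute.

```python
import importlib
import sys

# Versions pinned in the sentence above; "torch" is the import name of PyTorch.
PINNED = {"timm": "0.4.12", "torch": "1.9.0"}
PINNED_PYTHON = (3, 6)


def version_mismatches(installed, pinned=PINNED):
    """Return {package: (installed, pinned)} for every package that differs.

    A local build suffix such as "+cu111" is ignored, and a missing
    package is reported with an installed version of None.
    """
    mismatches = {}
    for name, want in pinned.items():
        have = installed.get(name)
        base = str(have).split("+")[0] if have is not None else None
        if base != want:
            mismatches[name] = (have, want)
    return mismatches


def installed_versions(names):
    """Import each package and read its __version__ (None if not importable)."""
    found = {}
    for name in names:
        try:
            found[name] = getattr(importlib.import_module(name), "__version__", None)
        except ImportError:
            found[name] = None
    return found


if __name__ == "__main__":
    if sys.version_info[:2] != PINNED_PYTHON:
        print("python: running %d.%d, repo pins 3.6" % sys.version_info[:2])
    found = installed_versions(PINNED)
    for name, (have, want) in version_mismatches(found).items():
        print("%s: installed %s, repo pins %s" % (name, have, want))
```

Running the script prints one line per mismatch and nothing when the environment matches the pins.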

Pre-Training

See the pre-training instructions for details.

Downstream Tasks

See the downstream instructions for details.

Pre-Trained and Fine-Tuned Models

We provide the pre-trained model (pt. model) and the fine-tuned model (ft. model) of dBOT for each experimental setup. The pre-trained models can be downloaded for use in downstream tasks. In the asym. enc-dec column, ✓ denotes that a decoder is appended after the encoder, with a fixed delayed mask and sin-cos position embedding; ✘ denotes that a vanilla ViT is used, with no delayed mask and with relative position embedding.

Arch.	Teacher	asym. enc-dec	cls.	det.	seg.	download
ViT-B	ViT-B	✓	84.5%	52.7	49.5	pt. model	ft. model	pt. log
	ViT-L	✓	84.6%	53.1	50.1	pt. model	ft. model	pt. log
	ViT-H	✓	84.6%	53.5	50.8	pt. model	ft. model	pt. log
	CLIP-B/16	✘	85.7%	53.6	52.9	pt. model	ft. model	pt. log
ViT-L	ViT-L	✓	86.6%	56.0	54.5	pt. model	ft. model	pt. log
	ViT-H	✓	86.8%	56.1	55.2	pt. model	ft. model	pt. log
	CLIP-L/14	✘	87.8%	56.8	56.2	pt. model	ft. model	pt. log
ViT-H	ViT-H	✓	87.4%	-	-	pt. model	ft. model	pt. log
ViT-H	CLIP-L/14	✘	88.5%	-	-	pt. model	ft. model	pt. log
ViT-H₄₄₈	ViT-H	✓	88.0%	-	-	pt. model	ft. model	pt. log
ViT-H₄₄₈	CLIP-L/14	✘	89.1%	-	-	pt. model	ft. model	pt. log

🎯 This branch implements dBOT with the default asymmetric encoder-decoder architecture. For the symmetric architecture, in which CLIP is used as the pre-trained teacher, please see the beit branch.

Property Analysis

To demonstrate how the models differ in their weights and outputs, we conduct a property analysis using averaged attention distance and singular value decomposition. We first compute the averaged attention distance for each attention head in the different Transformer blocks. The results are averaged over the IN1K validation set.
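As an illustration of the quantity described above, the following is a minimal sketch of the averaged attention distance for a single head, under a common definition: each query patch's attention weights are used to average its spatial distance to every key patch, and the result is averaged over queries. The function names, the choice of pixel units, and the exclusion of the class token are assumptions for illustration; the repository's analysis code may differ.

```python
import math


def patch_distances(grid_size, patch_size):
    """Pairwise Euclidean distance, in pixels, between patches on a square grid.

    Patches are indexed row by row; the class token is assumed to be excluded.
    """
    coords = [(r * patch_size, c * patch_size)
              for r in range(grid_size) for c in range(grid_size)]
    return [[math.hypot(r1 - r2, c1 - c2) for (r2, c2) in coords]
            for (r1, c1) in coords]


def mean_attention_distance(attn, distances):
    """Average attention distance of one head.

    attn[q][k] is the attention weight from query patch q to key patch k,
    with each row summing to 1. Each query's distance is the attention-weighted
    mean of its distances to all keys; the result is the mean over queries.
    """
    per_query = [sum(w * d for w, d in zip(weights, dists))
                 for weights, dists in zip(attn, distances)]
    return sum(per_query) / len(per_query)


# Usage on a toy 2x2 grid of 16-pixel patches.
dist = patch_distances(grid_size=2, patch_size=16)
local_head = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
global_head = [[0.25] * 4 for _ in range(4)]
print(mean_attention_distance(local_head, dist))   # attends only to itself
print(mean_attention_distance(global_head, dist))  # attends uniformly
```

A head that attends only to its own patch has distance zero, while a head with uniform attention has a distance equal to the mean pairwise patch distance. Averaging this value over images gives one number per head per block.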

We also compute the percentage of the top-k (k varying from 1 to 5) singular values of the embedding for each layer.
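One plausible reading of this statistic is the share of the total singular-value mass carried by the k largest singular values. The sketch below assumes that reading and assumes the singular values of a layer's embedding matrix have already been computed (for example with `numpy.linalg.svd(x, compute_uv=False)`). The function name is hypothetical and the repository's analysis code may define the quantity differently.

```python
def topk_singular_value_share(singular_values, max_k=5):
    """Percentage of the singular-value sum held by the top-k values, for k = 1..max_k.

    singular_values: non-negative singular values of one layer's embedding
    matrix, in any order.
    """
    ordered = sorted(singular_values, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    return [100.0 * sum(ordered[:k]) / total for k in range(1, max_k + 1)]


# Usage: a spectrum dominated by one direction versus a flat spectrum.
print(topk_singular_value_share([10.0, 1.0, 1.0, 1.0, 1.0, 1.0]))
print(topk_singular_value_share([1.0] * 10))
```

A spectrum concentrated in a few directions yields a high top-k share, while a flat spectrum yields a share that grows linearly in k.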

The student networks distilled from differently initialized teachers exhibit similar behaviors, which indicates that the choice of teacher network does not matter when teachers are bootstrapped.

Acknowledgement

This repository is built upon the MAE and iBOT repositories.

License

This project is released under the Apache 2.0 license, as found in the LICENSE file.

Citing dBOT

Please consider citing dBOT and giving a star if dBOT helps your research:

@article{liu2022exploring,
  title={Exploring target representations for masked autoencoders},
  author={Liu, Xingbin and Zhou, Jinghao and Kong, Tao and Lin, Xianming and Ji, Rongrong},
  journal={arXiv preprint arXiv:2209.03917},
  year={2022}
}
