
Self-Distilled Vision Transformer for Domain Generalization (ACCV'22 -- Oral)

Maryam Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, and Fahad Shahbaz Khan

Paper | arXiv | Poster | Slides | Video

Abstract: In the recent past, several domain generalization (DG) methods have been proposed, showing encouraging performance; however, almost all of them build on convolutional neural networks (CNNs). There is little to no progress on studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks that are often built on the i.i.d. assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we attempt to explore ViTs towards addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios, and the main culprit is overfitting to source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined self-distillation for ViTs. It reduces overfitting to source domains by easing the learning of the input-output mapping through curating non-zero entropy supervisory signals for intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code along with pre-trained models is made publicly available.

State-of-the-Art Vision Transformers for Domain Generalization

(Accuracy in %, reported at 224×224 input resolution.)

PACS

  • CvT-21 - 88.9 ± 0.5 @ 224
  • DeiT-Small - 86.7 ± 0.2 @ 224
  • T2T-ViT-14 - 87.8 ± 0.6 @ 224

VLCS

  • CvT-21 - 81.9 ± 0.4 @ 224
  • DeiT-Small - 81.6 ± 0.1 @ 224
  • T2T-ViT-14 - 81.2 ± 0.3 @ 224

OfficeHome

  • CvT-21 - 77.0 ± 0.2 @ 224
  • DeiT-Small - 72.5 ± 0.3 @ 224
  • T2T-ViT-14 - 75.5 ± 0.2 @ 224

TerraIncognita

  • CvT-21 - 51.4 ± 0.7 @ 224
  • DeiT-Small - 44.9 ± 0.4 @ 224
  • T2T-ViT-14 - 50.5 ± 0.6 @ 224

DomainNet

  • CvT-21 - 52.0 ± 0.0 @ 224
  • DeiT-Small - 47.4 ± 0.1 @ 224
  • T2T-ViT-14 - 50.2 ± 0.1 @ 224

Citation

If you find our work useful, please consider giving a star ⭐ and a citation.

@InProceedings{Sultana_2022_ACCV,
    author    = {Sultana, Maryam and Naseer, Muzammal and Khan, Muhammad Haris and Khan, Salman and Khan, Fahad Shahbaz},
    title     = {Self-Distilled Vision Transformer for Domain Generalization},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {3068-3085}
}

Contents

  1. Highlights
  2. Installation
  3. Datasets
  4. Training Self-Distilled Vision Transformer
  5. Pretrained Models
  6. Evaluating for Domain Generalization
  7. Attention Visualizations

Highlights

  • Inspired by the modular architecture of ViTs, we propose a lightweight plug-and-play DG approach for ViTs, namely self-distillation for ViTs (SDViT). It explicitly encourages the model towards learning generalizable, comprehensive features.
  • We show that improving the intermediate blocks, which are essentially multiple feature pathways, through soft supervision from the final classifier facilitates the model in learning cross-domain generalizable features. Our approach naturally fits into the modular and compositional architecture of different ViTs and does not introduce any new parameters, so it adds only a minimal training overhead over the baseline.

In the figure above, we plot the block-wise accuracy of the baseline (ERM-ViT) and our method (ERM-SDViT). Random sub-model distillation improves the accuracy of all blocks, and the improvement is most pronounced for the earlier blocks. Besides the later blocks, it also encourages the earlier blocks to rely on representations that are transferable yet discriminative. Since these earlier blocks provide multiple discriminative feature pathways, we believe they better facilitate the overall model in capturing the semantics of the object class.
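The sketch below illustrates the random sub-model distillation idea in PyTorch. It is a minimal, hypothetical rendering only, assuming a timm/DeiT-style backbone that exposes `patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`, and `head`; the loss weight `lambda_kd` and temperature `tau` are illustrative, and the repository's exact implementation may differ.

```python
import random

import torch
import torch.nn.functional as F


def erm_sdvit_loss(vit, images, labels, lambda_kd=1.0, tau=3.0):
    """Cross-entropy on the full ViT plus soft distillation from the final
    classifier to one randomly sampled intermediate block (the random sub-model)."""
    # Embed patches and prepend the class token, as in a standard ViT forward pass.
    x = vit.patch_embed(images)
    cls = vit.cls_token.expand(x.shape[0], -1, -1)
    x = torch.cat((cls, x), dim=1) + vit.pos_embed

    # Sample one intermediate block uniformly at random (excluding the last block).
    rand_idx = random.randrange(len(vit.blocks) - 1)

    inter_cls = None
    for i, blk in enumerate(vit.blocks):
        x = blk(x)
        if i == rand_idx:
            inter_cls = x[:, 0]  # class token of the sampled sub-model

    final_cls = vit.norm(x)[:, 0]
    final_logits = vit.head(final_cls)            # full model's prediction
    inter_logits = vit.head(vit.norm(inter_cls))  # same shared head: no new parameters

    # Hard-label loss for the full model ...
    ce = F.cross_entropy(final_logits, labels)
    # ... plus non-zero-entropy (soft) supervision for the intermediate block,
    # taken from the final classifier's softened output.
    kd = F.kl_div(
        F.log_softmax(inter_logits / tau, dim=1),
        F.softmax(final_logits.detach() / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau
    return ce + lambda_kd * kd
```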

Installation

To create the conda environment, run the following command in your terminal:

conda env create -n ViT_DGbed --file ViT_DGbed.yml

Activate the conda environment:

conda activate ViT_DGbed
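Optionally, a quick sanity check that the environment is usable; this assumes (based on the acknowledgments) that PyTorch and timm are provided by ViT_DGbed.yml:

```python
# Minimal environment sanity check; torch and timm are assumed to be installed
# by ViT_DGbed.yml.
import torch
import timm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("timm:", timm.__version__)
```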

Datasets

To download a dataset (e.g., PACS), run:

python3 -m domainbed.scripts.download \
       --data_dir=./domainbed/data --dataset pacs

Note: to download other datasets, replace --dataset pacs with the corresponding name (e.g., vlcs, office_home, terra_incognita, domainnet).
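To fetch all five benchmarks in one go, a small loop over the same script can be used. This is only a convenience sketch, assuming the dataset identifiers in the note above match the script's expected spelling:

```python
# Download every benchmark by repeatedly invoking the download script with the
# flags shown above; dataset names follow the note and may need adjusting to
# the script's exact identifiers.
import subprocess

for name in ["pacs", "vlcs", "office_home", "terra_incognita", "domainnet"]:
    subprocess.run(
        ["python3", "-m", "domainbed.scripts.download",
         "--data_dir=./domainbed/data", "--dataset", name],
        check=True,
    )
```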

Model selection criteria

We computed results using the following model-selection criterion:

  • IIDAccuracySelectionMethod: accuracy on a random held-out subset of the training source domains' data (a minimal sketch is given below).
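The sketch below illustrates what this criterion amounts to, under the assumption (standard in DomainBed) that a random split of each training source domain is held out for validation and the checkpoint with the best average held-out accuracy is reported; the function names are hypothetical.

```python
# Hypothetical illustration of IID (training-domain validation) model selection.
import random


def split_source_domain(indices, val_fraction=0.2, seed=0):
    """Hold out a random subset of one training source domain for validation."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_val = int(len(idx) * val_fraction)
    return idx[n_val:], idx[:n_val]  # (train split, validation split)


def select_best_checkpoint(val_accuracies):
    """val_accuracies: {checkpoint: [held-out accuracy per source domain]}.
    Pick the checkpoint with the highest average in-distribution accuracy."""
    return max(val_accuracies,
               key=lambda c: sum(val_accuracies[c]) / len(val_accuracies[c]))
```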

Training Self-Distilled Vision Transformer

  • Step 1: Download the ImageNet pre-trained backbones, such as CvT-21 and T2T-ViT-14.
  • Step 2: Place the models under the path ./domainbed/pretrained_models/Model_name/
  • Step 3: Run the following commands:

Launching a sweep on ViT Baselines:

./Baseline_sweep.sh

Launching a sweep on SDViT Model:

./Grid_Search_sweep.sh

Note: For all the above commands, change --dataset PACS to train on other datasets such as OfficeHome, VLCS, TerraIncognita, and DomainNet, and change the backbone to CVTSmall or T2T14.

Pretrained Models

Pretrained ViT models:

Dataset          Baseline (ERM-ViT)   Ours (ERM-SDViT)
PACS             Link                 Link
VLCS             Link                 Link
OfficeHome       Link                 Link
TerraIncognita   Link                 Link
DomainNet        Link                 Link


Evaluating for Domain Generalization

To view the results using our pre-trained models:

  • Step 1: Download the pretrained models using the links in the table above and place them dataset-wise under the folder `Results`.
  • Step 2: Run the following command to get the outputs:

python -m domainbed.scripts.collect_results \
       --input_dir=/Results/Dataset/Model/Backbone/ --get_recursively True

Note: Replace the path components with the dataset, model, and backbone names (e.g., Results/PACS/ERM-ViT/DeiT-Small/) to view results for the various models. Test-Time Classifier Adjuster (T3A) is exploited in our proposed method as a complementary approach; for details, please refer to the T3A instructions.
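For readers unfamiliar with T3A, the snippet below is a heavily simplified sketch of its core idea: replacing the linear classifier at test time with class prototypes averaged from a pseudo-labelled support set (initialised from the trained classifier's weights). It is only illustrative; the actual procedure, including entropy-based filtering of the support set, is described in the linked T3A instructions.

```python
# Simplified illustration of the T3A idea; entropy-based support filtering and
# online support-set updates are omitted for brevity.
import torch
import torch.nn.functional as F


@torch.no_grad()
def t3a_predict(feats, support_feats, support_labels, num_classes):
    """feats: (B, D) test features; support_feats: (N, D) support features,
    initialised from the classifier weights; support_labels: (N,) pseudo-labels."""
    protos = torch.stack([
        support_feats[support_labels == c].mean(0) for c in range(num_classes)
    ])  # (C, D) class prototypes
    return F.normalize(feats, dim=1) @ F.normalize(protos, dim=1).t()
```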

Results:

  1. Accuracy of three backbone networks on the PACS dataset: results
  2. Accuracy of three backbone networks on five benchmark datasets, compared with DG state of the art: results

Attention Visualizations

Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on the four target domains of the PACS dataset.

Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on the four target domains of the VLCS and OfficeHome datasets.

Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on the six target domains of the DomainNet dataset.

Acknowledgment

The code is built on top of DomainBed, a PyTorch suite containing benchmark datasets and algorithms for domain generalization, introduced in In Search of Lost Domain Generalization. The ViT code is based on the T2T, CvT, and DeiT repositories and the TIMM library. We thank the authors for releasing their code.

License

This source code is released under the MIT license, included here.
