DCAL addresses the challenge of learning subtle feature embeddings for fine-grained recognition tasks. The method introduces two complementary cross-attention mechanisms:
- Global-Local Cross-Attention (GLCA): uses attention rollout to identify high-response regions and computes cross-attention between the selected local queries and the global key-value pairs, reinforcing spatially discriminative cues.
- Pair-Wise Cross-Attention (PWCA): a training-only regularization that introduces confusion by computing cross-attention between the queries of one image and the combined key-value pairs of both images in a pair, helping the model discover more discriminative regions and reducing overfitting.
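The following is a minimal, single-head sketch of the two modules in plain PyTorch. Function names, the per-layer attention format, and the absence of multi-head projections are simplifications for illustration; the repository's actual code lives in `src/models/vit_dcal.py` and `src/attention/rollout.py` and may differ in details.

```python
import torch
import torch.nn.functional as F


def attention_rollout(attn_maps):
    """Accumulate attention across layers: rollout = prod_l 0.5 * (A_l + I)."""
    rollout = None
    for attn in attn_maps:                      # each attn: (B, heads, N, N)
        a = attn.mean(dim=1)                    # average heads -> (B, N, N)
        a = 0.5 * (a + torch.eye(a.size(-1), device=a.device))
        a = a / a.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout                              # (B, N, N)


def glca(x, attn_maps, wq, wk, wv, ratio=0.1):
    """GLCA sketch: top-R local queries attend to all global keys/values."""
    B, N, D = x.shape
    cls_scores = attention_rollout(attn_maps)[:, 0, 1:]   # CLS attention to patch tokens
    k = max(1, int(ratio * (N - 1)))
    top_idx = cls_scores.topk(k, dim=-1).indices + 1       # +1 to skip the CLS token
    local = torch.gather(x, 1, top_idx.unsqueeze(-1).expand(-1, -1, D))
    q, key, v = wq(local), wk(x), wv(x)
    attn = F.softmax(q @ key.transpose(-2, -1) / D ** 0.5, dim=-1)
    return attn @ v                              # (B, k, D)


def pwca(x1, x2, wq, wk, wv):
    """PWCA sketch (training only): queries of x1 attend over keys/values of both images."""
    q = wq(x1)
    key = wk(torch.cat([x1, x2], dim=1))
    v = wv(torch.cat([x1, x2], dim=1))
    attn = F.softmax(q @ key.transpose(-2, -1) / x1.size(-1) ** 0.5, dim=-1)
    return attn @ v                              # same shape as x1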
Architecture diagrams of GLCA and PWCA are provided in the figures/ directory.
The architecture consists of:

- L = 12 Self-Attention (SA) blocks
- M = 1 GLCA block
- T = 12 PWCA blocks (training only)
During inference, only SA and GLCA modules are used, with no additional computational cost from PWCA.
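For orientation, the block counts above can be summarized in a small configuration object. The names here are hypothetical and only mirror the numbers described in this section; the repository's actual hyper-parameter handling may look different.

```python
from dataclasses import dataclass


@dataclass
class DCALConfig:
    num_sa_blocks: int = 12        # L: self-attention blocks (used at train and test time)
    num_glca_blocks: int = 1       # M: global-local cross-attention blocks
    num_pwca_blocks: int = 12      # T: pair-wise cross-attention blocks (training only)
    glca_query_ratio: float = 0.1  # R: fraction of local queries kept by GLCA
```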
Installation:

```bash
# Clone the repository
git clone https://github.com/mossbee/dcal.git
cd dcal

# Install requirements
pip install -r requirements.txt
```

Download the pre-trained weights (e.g., ViT-B_16) from Google Cloud Storage (ViT Weights) and place them in the weights/ directory.
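As an optional sanity check that the downloaded checkpoint is readable, you can inspect it with plain NumPy (no repository code involved; the filename matches the training commands below):

```python
import numpy as np

# ViT-B_16.npz is a standard NumPy archive of named weight arrays.
weights = np.load("weights/ViT-B_16.npz")
print(f"{len(weights.files)} arrays, e.g. {weights.files[:3]}")
```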
CUB-200-2011:

- Download the dataset from the official website
- Extract and organize the dataset as follows:

```
data/CUB_200_2011/
├── images/
│   ├── 001.Black_footed_Albatross/
│   ├── 002.Laysan_Albatross/
│   └── ...
├── images.txt
├── train_test_split.txt
├── classes.txt
└── image_class_labels.txt
```
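A minimal, standalone sketch of how these metadata files can be parsed into (image path, label) pairs; the repository's actual loader is `src/datasets/cub.py` and may differ.

```python
import os


def load_cub_split(root, train=True):
    """Return a list of (image_path, zero-based class_id) for the train or test split."""
    def read_pairs(name):
        with open(os.path.join(root, name)) as f:
            return dict(line.split() for line in f if line.strip())

    paths = read_pairs("images.txt")               # image_id -> relative image path
    labels = read_pairs("image_class_labels.txt")  # image_id -> class id (1-based)
    splits = read_pairs("train_test_split.txt")    # image_id -> 1 (train) / 0 (test)

    wanted = "1" if train else "0"
    return [
        (os.path.join(root, "images", paths[i]), int(labels[i]) - 1)
        for i in paths
        if splits[i] == wanted
    ]


train_samples = load_cub_split("data/CUB_200_2011", train=True)
print(len(train_samples), "training images")
```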
VeRi-776:

- Download the dataset from the official source
- Organize the dataset as follows:

```
data/VeRi_776/
├── image_train/
├── image_query/
├── image_test/
├── name_train.txt
├── name_query.txt
├── name_test.txt
├── train_label.xml
└── test_label.xml
```
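A minimal sketch of reading the training identities from name_train.txt. It assumes the standard VeRi-776 filename layout `<vehicleID>_<cameraID>_<frame>_<n>.jpg`; the repository's actual loader is `src/datasets/veri.py` and may parse the XML labels instead.

```python
import os


def load_veri_train(root):
    """Return a list of (image_path, vehicle_id, camera_id) tuples."""
    samples = []
    with open(os.path.join(root, "name_train.txt")) as f:
        for line in f:
            name = line.strip()
            if not name:
                continue
            vid, cam = name.split("_")[:2]          # e.g. "0001", "c001"
            samples.append((os.path.join(root, "image_train", name),
                            int(vid), int(cam.lstrip("c"))))
    return samples


samples = load_veri_train("data/VeRi_776")
print(len(samples), "training images,", len({v for _, v, _ in samples}), "identities")
```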
Training on CUB-200-2011:

```bash
PYTHONPATH=src python3 -m tasks.fgvc_cub \
    --data-root data/CUB_200_2011 \
    --weights weights/ViT-B_16.npz \
    --output runs/fgvc_cub \
    --log-interval 25 \
    --val-interval 1 \
    --wandb --wandb-project dcal --wandb-run-name cub-run
```

Training Configuration:
- Input size: 448×448 (resized from 550×550, then randomly cropped)
- Batch size: 16
- Optimizer: AdamW (weight decay: 0.05)
- Learning rate: 5e-4 / 512 × batch_size, with cosine decay (see the sketch after this list)
- Epochs: 100
- Local query ratio (R): 10%
- Stochastic depth: Enabled
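A small sketch of the learning-rate rule and schedule above using standard PyTorch APIs; the model here is a placeholder, not the DCAL network, and the variable names are illustrative.

```python
import torch

batch_size, epochs = 16, 100
base_lr = 5e-4 / 512 * batch_size   # = 1.5625e-5 for batch size 16

model = torch.nn.Linear(768, 200)   # placeholder for the actual DCAL model
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```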
Training on VeRi-776:

```bash
PYTHONPATH=src python3 -m tasks.reid_veri \
    --data-root data/VeRi_776 \
    --weights weights/ViT-B_16.npz \
    --output runs/reid_veri \
    --log-interval 50 \
    --val-interval 1 \
    --wandb --wandb-project dcal --wandb-run-name veri-run
```

Training Configuration:
- Input size: 256×256
- Batch size: 64 (4 images per identity)
- Optimizer: SGD (momentum: 0.9, weight decay: 1e-4)
- Learning rate: 0.008 with cosine decay
- Epochs: 120
- Local query ratio (R): 30%
- Loss: Cross-entropy + Triplet loss
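A minimal sketch of the combined re-ID objective using standard PyTorch losses. The margin value, the equal weighting of the two terms, and the assumption of pre-mined triplets are illustrative choices, not taken from the repository (`src/losses/` may differ).

```python
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)   # margin is an assumed value


def reid_loss(logits, embeddings, labels, anchor_idx, pos_idx, neg_idx):
    """ID classification loss plus a metric-learning loss on pre-mined triplets."""
    id_loss = ce_loss(logits, labels)
    metric_loss = triplet_loss(embeddings[anchor_idx],
                               embeddings[pos_idx],
                               embeddings[neg_idx])
    return id_loss + metric_loss
```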
Project Structure:

```
plan_dcal/
├── src/
│   ├── models/              # ViT backbone and DCAL implementation
│   │   ├── vit_backbone.py
│   │   └── vit_dcal.py
│   ├── attention/           # Attention mechanisms
│   │   ├── rollout.py
│   │   └── stochastic_depth.py
│   ├── datasets/            # Dataset loaders
│   │   ├── cub.py
│   │   └── veri.py
│   ├── tasks/               # Training entrypoints
│   │   ├── fgvc_cub.py
│   │   └── reid_veri.py
│   ├── losses/              # Loss functions
│   │   └── uncertainty.py
│   └── utils/               # Utilities
│       ├── data.py
│       ├── metrics.py
│       └── wandb_logging.py
├── weights/                 # Pre-trained model weights
├── figures/                 # Architecture diagrams
└── refs/                    # Reference implementations
```
If you use this code in your research, please cite the original paper:
```bibtex
@inproceedings{zhu2022dual,
  title={Dual cross-attention learning for fine-grained visual categorization and object re-identification},
  author={Zhu, Haowei and Ke, Wenjing and Li, Dong and Liu, Ji and Tian, Lu and Shan, Yi},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={4692--4702},
  year={2022}
}
```

This implementation is built upon the following excellent open-source projects:
- ViT-pytorch by @jeonsworld - Vision Transformer implementation in PyTorch
- pytorch-stochastic-depth by @tasptz - Stochastic depth implementation for PyTorch
- vit-explain by @jacobgil - Attention rollout visualization for Vision Transformers
I am grateful to the authors for making their code publicly available, which greatly facilitated this implementation.
This project is released for research purposes. Please refer to the original paper and dataset licenses for usage terms.
Note: This is an unofficial implementation. The original authors have not released their official code. This implementation has been verified against the paper's methodology (see DCAL_VERIFICATION_REPORT.md for details).

