DCAL addresses the challenge of learning subtle feature embeddings for fine-grained recognition tasks. The method introduces two complementary cross-attention mechanisms:
- Global-Local Cross-Attention (GLCA): uses attention rollout to identify high-response regions and computes cross-attention between the selected local queries and the global key-value pairs, reinforcing spatially discriminative cues.
- Pair-Wise Cross-Attention (PWCA): a training-only regularization that introduces confusion by computing cross-attention between the queries of one image and the combined key-value pairs of both images in a pair, helping the model discover more discriminative regions and reducing overfitting.
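The following is a minimal, single-head sketch of the two modules in plain PyTorch. Function names, the per-layer attention format, and the absence of multi-head projections are simplifications for illustration; the repository's actual code lives in `src/models/vit_dcal.py` and `src/attention/rollout.py` and may differ in details.

```python
import torch
import torch.nn.functional as F


def attention_rollout(attn_maps):
    """Accumulate attention across layers: rollout = prod_l 0.5 * (A_l + I)."""
    rollout = None
    for attn in attn_maps:                      # each attn: (B, heads, N, N)
        a = attn.mean(dim=1)                    # average heads -> (B, N, N)
        a = 0.5 * (a + torch.eye(a.size(-1), device=a.device))
        a = a / a.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout                              # (B, N, N)


def glca(x, attn_maps, wq, wk, wv, ratio=0.1):
    """GLCA sketch: top-R local queries attend to all global keys/values."""
    B, N, D = x.shape
    cls_scores = attention_rollout(attn_maps)[:, 0, 1:]   # CLS attention to patch tokens
    k = max(1, int(ratio * (N - 1)))
    top_idx = cls_scores.topk(k, dim=-1).indices + 1       # +1 to skip the CLS token
    local = torch.gather(x, 1, top_idx.unsqueeze(-1).expand(-1, -1, D))
    q, key, v = wq(local), wk(x), wv(x)
    attn = F.softmax(q @ key.transpose(-2, -1) / D ** 0.5, dim=-1)
    return attn @ v                              # (B, k, D)


def pwca(x1, x2, wq, wk, wv):
    """PWCA sketch (training only): queries of x1 attend over keys/values of both images."""
    q = wq(x1)
    key = wk(torch.cat([x1, x2], dim=1))
    v = wv(torch.cat([x1, x2], dim=1))
    attn = F.softmax(q @ key.transpose(-2, -1) / x1.size(-1) ** 0.5, dim=-1)
    return attn @ v                              # same shape as x1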
Architecture diagrams of GLCA and PWCA are provided in the figures/ directory.
The architecture consists of:

- L = 12 Self-Attention (SA) blocks
- M = 1 GLCA block
- T = 12 PWCA blocks (training only)
During inference, only SA and GLCA modules are used, with no additional computational cost from PWCA.
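For orientation, the block counts above can be summarized in a small configuration object. The names here are hypothetical and only mirror the numbers described in this section; the repository's actual hyper-parameter handling may look different.

```python
from dataclasses import dataclass


@dataclass
class DCALConfig:
    num_sa_blocks: int = 12        # L: self-attention blocks (used at train and test time)
    num_glca_blocks: int = 1       # M: global-local cross-attention blocks
    num_pwca_blocks: int = 12      # T: pair-wise cross-attention blocks (training only)
    glca_query_ratio: float = 0.1  # R: fraction of local queries kept by GLCA
```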
Installation:

```bash
# Clone the repository
git clone https://github.com/mossbee/dcal.git
cd dcal

# Install requirements
pip install -r requirements.txt
```

Download the pre-trained weights (e.g., ViT-B_16) from Google Cloud Storage (ViT Weights) and place them in the weights/ directory.
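As an optional sanity check that the downloaded checkpoint is readable, you can inspect it with plain NumPy (no repository code involved; the filename matches the training commands below):

```python
import numpy as np

# ViT-B_16.npz is a standard NumPy archive of named weight arrays.
weights = np.load("weights/ViT-B_16.npz")
print(f"{len(weights.files)} arrays, e.g. {weights.files[:3]}")
```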
CUB-200-2011:

- Download the dataset from the official website
- Extract and organize the dataset as follows:

```
data/CUB_200_2011/
├── images/
│   ├── 001.Black_footed_Albatross/
│   ├── 002.Laysan_Albatross/
│   └── ...
├── images.txt
├── train_test_split.txt
├── classes.txt
└── image_class_labels.txt
```
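A minimal, standalone sketch of how these metadata files can be parsed into (image path, label) pairs; the repository's actual loader is `src/datasets/cub.py` and may differ.

```python
import os


def load_cub_split(root, train=True):
    """Return a list of (image_path, zero-based class_id) for the train or test split."""
    def read_pairs(name):
        with open(os.path.join(root, name)) as f:
            return dict(line.split() for line in f if line.strip())

    paths = read_pairs("images.txt")               # image_id -> relative image path
    labels = read_pairs("image_class_labels.txt")  # image_id -> class id (1-based)
    splits = read_pairs("train_test_split.txt")    # image_id -> 1 (train) / 0 (test)

    wanted = "1" if train else "0"
    return [
        (os.path.join(root, "images", paths[i]), int(labels[i]) - 1)
        for i in paths
        if splits[i] == wanted
    ]


train_samples = load_cub_split("data/CUB_200_2011", train=True)
print(len(train_samples), "training images")
```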
VeRi-776:

- Download the dataset from the official source
- Organize the dataset as follows:

```
data/VeRi_776/
├── image_train/
├── image_query/
├── image_test/
├── name_train.txt
├── name_query.txt
├── name_test.txt
├── train_label.xml
└── test_label.xml
```
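A minimal sketch of reading the training identities from name_train.txt. It assumes the standard VeRi-776 filename layout `<vehicleID>_<cameraID>_<frame>_<n>.jpg`; the repository's actual loader is `src/datasets/veri.py` and may parse the XML labels instead.

```python
import os


def load_veri_train(root):
    """Return a list of (image_path, vehicle_id, camera_id) tuples."""
    samples = []
    with open(os.path.join(root, "name_train.txt")) as f:
        for line in f:
            name = line.strip()
            if not name:
                continue
            vid, cam = name.split("_")[:2]          # e.g. "0001", "c001"
            samples.append((os.path.join(root, "image_train", name),
                            int(vid), int(cam.lstrip("c"))))
    return samples


samples = load_veri_train("data/VeRi_776")
print(len(samples), "training images,", len({v for _, v, _ in samples}), "identities")
```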
Training on CUB-200-2011:

```bash
PYTHONPATH=src python3 -m tasks.fgvc_cub \
    --data-root data/CUB_200_2011 \
    --weights weights/ViT-B_16.npz \
    --output runs/fgvc_cub \
    --log-interval 25 \
    --val-interval 1 \
    --wandb --wandb-project dcal --wandb-run-name cub-run
```

Training Configuration:
- Input size: 448×448 (resized from 550×550, then randomly cropped)
- Batch size: 16
- Optimizer: AdamW (weight decay: 0.05)
- Learning rate: 5e-4 / 512 × batch_size, with cosine decay (see the sketch after this list)
- Epochs: 100
- Local query ratio (R): 10%
- Stochastic depth: Enabled
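A small sketch of the learning-rate rule and schedule above using standard PyTorch APIs; the model here is a placeholder, not the DCAL network, and the variable names are illustrative.

```python
import torch

batch_size, epochs = 16, 100
base_lr = 5e-4 / 512 * batch_size   # = 1.5625e-5 for batch size 16

model = torch.nn.Linear(768, 200)   # placeholder for the actual DCAL model
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```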
Training on VeRi-776:

```bash
PYTHONPATH=src python3 -m tasks.reid_veri \
    --data-root data/VeRi_776 \
    --weights weights/ViT-B_16.npz \
    --output runs/reid_veri \
    --log-interval 50 \
    --val-interval 1 \
    --wandb --wandb-project dcal --wandb-run-name veri-run
```

Training Configuration:
- Input size: 256×256
- Batch size: 64 (4 images per identity)
- Optimizer: SGD (momentum: 0.9, weight decay: 1e-4)
- Learning rate: 0.008 with cosine decay
- Epochs: 120
- Local query ratio (R): 30%
- Loss: Cross-entropy + Triplet loss
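A minimal sketch of the combined re-ID objective using standard PyTorch losses. The margin value, the equal weighting of the two terms, and the assumption of pre-mined triplets are illustrative choices, not taken from the repository (`src/losses/` may differ).

```python
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)   # margin is an assumed value


def reid_loss(logits, embeddings, labels, anchor_idx, pos_idx, neg_idx):
    """ID classification loss plus a metric-learning loss on pre-mined triplets."""
    id_loss = ce_loss(logits, labels)
    metric_loss = triplet_loss(embeddings[anchor_idx],
                               embeddings[pos_idx],
                               embeddings[neg_idx])
    return id_loss + metric_loss
```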
Project Structure:

```
plan_dcal/
├── src/
│   ├── models/              # ViT backbone and DCAL implementation
│   │   ├── vit_backbone.py
│   │   └── vit_dcal.py
│   ├── attention/           # Attention mechanisms
│   │   ├── rollout.py
│   │   └── stochastic_depth.py
│   ├── datasets/            # Dataset loaders
│   │   ├── cub.py
│   │   └── veri.py
│   ├── tasks/               # Training entrypoints
│   │   ├── fgvc_cub.py
│   │   └── reid_veri.py
│   ├── losses/              # Loss functions
│   │   └── uncertainty.py
│   └── utils/               # Utilities
│       ├── data.py
│       ├── metrics.py
│       └── wandb_logging.py
├── weights/                 # Pre-trained model weights
├── figures/                 # Architecture diagrams
└── refs/                    # Reference implementations
```
If you use this code in your research, please cite the original paper:
```bibtex
@inproceedings{zhu2022dual,
  title={Dual cross-attention learning for fine-grained visual categorization and object re-identification},
  author={Zhu, Haowei and Ke, Wenjing and Li, Dong and Liu, Ji and Tian, Lu and Shan, Yi},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={4692--4702},
  year={2022}
}
```

This implementation is built upon the following excellent open-source projects:
- ViT-pytorch by @jeonsworld - Vision Transformer implementation in PyTorch
- pytorch-stochastic-depth by @tasptz - Stochastic depth implementation for PyTorch
- vit-explain by @jacobgil - Attention rollout visualization for Vision Transformers
I am grateful to the authors for making their code publicly available, which greatly facilitated this implementation.
This project is released for research purposes. Please refer to the original paper and dataset licenses for usage terms.
Note: This is an unofficial implementation. The original authors have not released their official code. This implementation has been verified against the paper's methodology (see DCAL_VERIFICATION_REPORT.md for details).

