This is the official implementation of the paper ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers.
It is a plug-in module that clusters the input of Detection Transformers based on its entropy, which is learnable. In its current state, it can only be plugged into Detection Transformers that have a Multi-Head Self-Attention module in their encoder.
In this repository, we plug ENACT into three such models: DETR, Conditional DETR, and Anchor DETR.
We provide comparisons of GPU memory usage and of training and inference times (in seconds per image) between the detection transformer models with and without ENACT.
model | backbone | epochs | batch size | GPU memory (GB) | train time (s/img) | inference time (s/img) |
---|---|---|---|---|---|---|
DETR-C5 | R50 | 300 | 8 | 36.5 | 0.0541 | 0.0482 |
DETR-C5 + ENACT | R50 | 300 | 8 | 23.5 | 0.0488 | 0.0472 |
Conditional DETR-C5 | R101 | 50 | 8 | 46.6 | 0.0826 | 0.0637 |
Conditional DETR-C5 + ENACT | R101 | 50 | 8 | 36.7 | 0.0779 | 0.0605 |
Anchor DETR-DC5 | R50 | 50 | 4 | 29.7 | 0.0999 | 0.0712 |
Anchor DETR-DC5 + ENACT | R50 | 50 | 4 | 17.7 | 0.0845 | 0.0608 |
model | AP | AP50 | APS | APM | APL | url |
---|---|---|---|---|---|---|
DETR-C5 | 40.6 | 61.6 | 19.9 | 44.3 | 60.2 | - |
DETR-C5 + ENACT | 39.0 | 59.1 | 18.3 | 42.2 | 57.0 | model / log |
Conditional DETR-C5 | 42.8 | 63.7 | 21.7 | 46.6 | 60.9 | - |
Conditional DETR-C5 + ENACT | 41.5 | 62.2 | 21.3 | 45.5 | 59.3 | model / log |
Anchor DETR-DC5 | 44.3 | 64.9 | 25.1 | 48.1 | 61.1 | - |
Anchor DETR-DC5 + ENACT | 42.9 | 63.5 | 25.0 | 46.8 | 58.5 | model / log |
First, clone the repository.
git clone https://github.com/GSavathrakis/ENACT.git
cd ENACT
Next, download the MS COCO dataset; the module was trained on COCO 2017. The structure of the downloaded files should be the following:
path_to_coco/
├── train2017/
├── val2017/
└── annotations/
├── instances_train2017.json
└── instances_val2017.json
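For reference, a minimal download sketch (assuming wget and unzip are available, and with path_to_coco standing in for your chosen dataset directory) is:
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip train2017.zip -d path_to_coco
unzip val2017.zip -d path_to_coco
unzip annotations_trainval2017.zip -d path_to_coco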
Subsequently, set up an Anaconda environment. This repo was tested on Python 3.10 with CUDA 11.7.
conda create -n "env name" python="3.10 or above"
conda activate "env name"
Next, you need to install CUDA in your conda environment, along with the additional packages:
conda install nvidia/label/cuda-11.7.0::cuda
pip install torch==2.0.0 torchvision cython scipy pycocotools tqdm numpy==1.23 opencv-python
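As an optional sanity check (not required by the repo), you can verify that the installed PyTorch build sees the GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"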
Alternatively, you can create a Docker container using the provided Dockerfile and .yml files.
docker compose build
docker compose up
To train one of the detection transformers with the ENACT module, run:
python "Path to one of the DETR variants models"/main.py --coco_path "Path to COCO dataset" --output_dir "Path to the directory where you want to save checkpoints"
For example, to train the Anchor DETR model with ENACT, run:
python Anchor-DETR-ENACT/main.py --coco_path "Path to COCO dataset" --output_dir "Path to the directory where you want to save checkpoints"
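The DETR variants that ENACT builds on typically support multi-GPU training through torch.distributed.launch. Assuming this launcher is retained in the corresponding main.py (an assumption; check its argument parser), a run on, e.g., 8 GPUs could look like:
python -m torch.distributed.launch --nproc_per_node=8 --use_env Anchor-DETR-ENACT/main.py --coco_path "Path to COCO dataset" --output_dir "Path to the directory where you want to save checkpoints"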
You can also evaluate the ENACT module on the three models, using the pretrained weights that can be downloaded from the links in the second table.
For example, to evaluate the DETR model with ENACT, run:
python DETR-ENACT/main.py --coco_path "Path to COCO dataset" --output_dir "Path to the directory where you want to save checkpoints" --resume "Path to DETR-ENACT checkpoint" --eval
If you find this work useful for your research, please cite:
@misc{savathrakis2024enactentropybasedclusteringattention,
title={ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers},
author={Giorgos Savathrakis and Antonis Argyros},
year={2024},
eprint={2409.07541},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.07541},
}
as well as the works introducing the transformer models used:
@InProceedings{10.1007/978-3-030-58452-8_13,
author={Carion, Nicolas and Massa, Francisco and Synnaeve, Gabriel and Usunier, Nicolas and Kirillov, Alexander and Zagoruyko, Sergey},
title={End-to-End Object Detection with Transformers},
booktitle={Computer Vision -- ECCV 2020},
publisher={Springer International Publishing},
pages={213--229},
year={2020}
}
@inproceedings{wang2022anchor,
title={Anchor detr: Query design for transformer-based detector},
author={Wang, Yingming and Zhang, Xiangyu and Yang, Tong and Sun, Jian},
booktitle={Proceedings of the AAAI conference on artificial intelligence},
volume={36},
pages={2567--2575},
year={2022}
}
@inproceedings{meng2021conditional,
title={Conditional detr for fast training convergence},
author={Meng, Depu and Chen, Xiaokang and Fan, Zejia and Zeng, Gang and Li, Houqiang and Yuan, Yuhui and Sun, Lei and Wang, Jingdong},
booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
pages={3651--3660},
year={2021}
}