Implementation of some popular Recurrent Image Annotation papers on the Corel-5k dataset with the PyTorch library
(for more information see CNN_Image_Annotation_dataset)
Operation of the LSTM over one time step:
(more information can be found at CNN_Image_Annotation_convolutional_models)
RIA is an encoder-decoder model that uses a CNN as the encoder and an LSTM as the decoder. In the training phase, it is trained on training images and their human annotations. Before the annotations can be used as input to the LSTM, each label set must be sorted into a label sequence. A rare-first order is used, which puts rarer labels before more frequent ones (based on label frequency in the dataset). During the test phase, the RIA model receives the input image, is triggered by the start signal, and predicts the first output label. Feeding each output back in as the input for the next time step, it predicts the tag sequence recursively; the loop continues until the stop signal is predicted. Its structure for the test phase is shown in the images below:
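The rare-first ordering can be sketched as follows. This is a minimal illustration, assuming annotations are given as lists of label strings; the function name and toy labels are hypothetical, not taken from the repo:

```python
from collections import Counter

def rare_first_order(annotations):
    """Sort each image's label set into a rare-first sequence.

    annotations: list of label lists, one per training image.
    Rarer labels (lower corpus frequency) are placed first, so the
    LSTM predicts them before the frequent, easier ones.
    """
    freq = Counter(label for labels in annotations for label in labels)
    return [sorted(labels, key=lambda l: freq[l]) for labels in annotations]

# toy example: 'sky' appears in both images, so it is pushed to the end
annotations = [["sky", "jet", "plane"], ["sky", "grass", "cow"]]
ordered = rare_first_order(annotations)
# -> [['jet', 'plane', 'sky'], ['grass', 'cow', 'sky']]
```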
SR-CNN-RNN is another encoder-decoder model with an architecture similar to RIA. The difference is that semantic concept learning is now handled by the CNN, which generates a probabilistic estimate of the semantic concepts from the input image. To generate the label sequence, the RNN takes these concept probability estimates and models their correlations.
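The idea can be sketched as a toy decoder in which the CNN's concept probabilities initialize the LSTM state and the LSTM models label correlations. All class names, sizes, and token ids below are illustrative assumptions, not the paper's or the repo's settings:

```python
import torch
import torch.nn as nn

class ConceptRNN(nn.Module):
    """Toy decoder in the spirit of SR-CNN-RNN: probabilistic concept
    estimates (a random tensor stands in for the CNN output) initialize
    the LSTM state, and the LSTM emits the label sequence."""

    def __init__(self, num_labels=10, hidden=16):
        super().__init__()
        self.start = num_labels                            # start-token id
        self.init_h = nn.Linear(num_labels, hidden)        # concept probs -> initial state
        self.embed = nn.Embedding(num_labels + 2, hidden)  # labels + start/stop tokens
        self.cell = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, num_labels + 2)

    def forward(self, concept_probs, max_len=5):
        b = concept_probs.size(0)
        h = torch.tanh(self.init_h(concept_probs))         # condition on concepts
        c = torch.zeros_like(h)
        token = torch.full((b,), self.start, dtype=torch.long)
        logits = []
        for _ in range(max_len):
            h, c = self.cell(self.embed(token), (h, c))
            step = self.out(h)
            logits.append(step)
            token = step.argmax(dim=1)                     # feed prediction back in
        return torch.stack(logits, dim=1)                  # (batch, max_len, vocab)

concept_probs = torch.rand(2, 10)                          # stand-in for CNN output
seq_logits = ConceptRNN()(concept_probs)
```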
Attention networks are widely used in deep learning; they let a model determine which parts of an encoding are relevant to the task at hand. In image annotation, the attention mechanism can highlight the most important pixels. In most cases, however, labels are conceptual and cannot be tied to specific objects appearing in the image, so the attention mechanism does not improve results significantly. Its structure for the test phase is shown in the images below:
Inspired by Image Captioning
To reduce the problems caused by imposing a fixed order on the labels, it has been proposed to align the labels to the network's predictions before computing the loss. Since this alignment is an assignment problem, it can be solved with the Hungarian algorithm. So, while preserving the attention architecture, we use minimal loss alignment (MLA) as the loss function instead of cross-entropy loss. (Furthermore, the frequency of a label in a dataset is independent of the size of the corresponding object. Less frequent but larger objects can dominate the image, rank high in the prediction step, and thereby cause the LSTM to predict the stop signal too early.)
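The alignment step can be sketched with SciPy's Hungarian-algorithm solver. This is an illustrative NumPy version of the idea, not the repo's implementation; the function name and shapes are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(pred_logprobs, target_labels):
    """Minimal-loss alignment sketch: instead of a fixed label order,
    assign each ground-truth label to the prediction step where its
    negative log-probability (its loss) is smallest, solving the
    assignment problem with the Hungarian algorithm.

    pred_logprobs: (T, V) log-probabilities over V labels at T steps
    target_labels: list of ground-truth label ids (len <= T)
    returns: order[i] = time step assigned to target_labels[i]
    """
    # cost[t, i] = loss incurred if label i is placed at time step t
    cost = -pred_logprobs[:, target_labels]       # (T, len(targets))
    rows, cols = linear_sum_assignment(cost)
    order = [None] * len(target_labels)
    for t, i in zip(rows, cols):
        order[i] = t
    return order

logp = np.log(np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.1, 0.7]]))
# labels 2 and 0 are most confident at steps 2 and 0 respectively
order = align_labels(logp, [2, 0])   # -> [2, 0]
```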
Beam search can be used instead of greedy search to mitigate this issue. However, experiments have shown that the RNN model cannot properly learn the complicated relationships between labels, so using beam search has no effect on the result.
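For reference, a generic beam search over label sequences looks like this. The sketch is library-agnostic: `step_fn` stands in for the trained decoder and can be any function mapping a partial sequence to next-label probabilities; the toy model and token names are hypothetical:

```python
import math

def beam_search(step_fn, start, stop, beam_width=3, max_len=5):
    """Keeps the `beam_width` best partial sequences by cumulative
    log-probability instead of committing greedily to the single best
    label at each step."""
    beams = [([start], 0.0)]                      # (sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == stop:                   # finished sequences carry over
                candidates.append((seq, score))
                continue
            for label, p in step_fn(seq).items():
                candidates.append((seq + [label], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# toy decoder: after <s> prefer 'sky'; after 'sky' prefer <stop>
toy = {("<s>",): {"sky": 0.6, "jet": 0.4},
       ("<s>", "sky"): {"<stop>": 0.9, "jet": 0.1},
       ("<s>", "jet"): {"sky": 0.8, "<stop>": 0.2}}
best = beam_search(lambda s: toy.get(tuple(s), {"<stop>": 1.0}),
                   "<s>", "<stop>")   # -> ['<s>', 'sky', '<stop>']
```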
(check out CNN_Image_Annotation_evaluation_metrics for more information)
To train and evaluate the models in the Spyder IDE, use the commands below:

```shell
run main.py --method RIA --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method RIA --max-seq-len 5 --order-free None --is_glove --evaluate
run main.py --method SR-CNN-RNN --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method SR-CNN-RNN --max-seq-len 5 --order-free None --is_glove --evaluate
run main.py --method Attention --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method Attention --max-seq-len 5 --order-free None --is_glove --evaluate
run main.py --method Attention --max-seq-len 5 --order-free MLA --is_glove
run main.py --method Attention --max-seq-len 5 --order-free MLA --is_glove --evaluate
```
batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | label embedding dim |
---|---|---|---|---|---|---|
32 | 4500 | 448 × 448 | 136s | True | 2048 | 300 |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.647 | 0.606 | 0.626 |
testset per-class metrics | 0.409 | 0.421 | 0.415 |
data | N+ |
---|---|
testset | 156 |
batch-size | num of training images | image-size | epoch time | GloVe weights | predicted labels embedding dim | label embedding dim |
---|---|---|---|---|---|---|
32 | 4500 | 448 × 448 | 135s | True | 2048 | 300 |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.680 | 0.616 | 0.646 |
testset per-class metrics | 0.405 | 0.391 | 0.398 |
data | N+ |
---|---|
testset | 145 |
batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | attention dim | label embedding dim |
---|---|---|---|---|---|---|---|
32 | 4500 | 448 × 448 | 150s | True | 2048 | 1024 | 300 |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.663 | 0.616 | 0.638 |
testset per-class metrics | 0.438 | 0.429 | 0.434 |
data | N+ |
---|---|
testset | 160 |
batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | attention dim | label embedding dim |
---|---|---|---|---|---|---|---|
32 | 4500 | 448 × 448 | 158s | True | 2048 | 1024 | 300 |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.656 | 0.608 | 0.632 |
testset per-class metrics | 0.449 | 0.413 | 0.431 |
data | N+ |
---|---|
testset | 155 |
J. Jin, and H. Nakayama.
"Recurrent Image Annotator for Arbitrary Length Image Tagging" (ICPR-2016)
F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun.
"Semantic Regularisation for Recurrent Image Annotation" (CVPR-2017)
V. O. Yazici, A. Gonzalez-Garcia, A. Ramisa, B. Twardowski, and J. van de Weijer.
"Orderless Recurrent Models for Multi-label Classification" (CVPR-2020)