Recurrent_Image_Annotation

Implementation of some popular Recurrent Image Annotation papers on Corel-5k dataset with PyTorch library

Dataset

The 'Corel-5k' folder contains the Corel-5k dataset: 5,000 images annotated with a vocabulary of 260 labels.

(for more information see CNN_Image_Annotation_dataset)

Long short-term memory (LSTM)


The operation of an LSTM cell in one time step:

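For reference, the standard LSTM update equations for a single time step (as given in the PyTorch nn.LSTM documentation) are:

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$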

Convolutional neural network (CNN)

Compared to the other CNNs in my experiments, TResNet produced the best results for extracting image features, so it has been chosen as the feature extractor.

(more information can be found at CNN_Image_Annotation_convolutional_models)

CNN+LSTM models

1) RIA:
RIA is an encoder-decoder model that uses a CNN as the encoder and an LSTM as the decoder. In the training phase, it is trained on training images and their human annotations. The label set must be sorted into a label sequence before the annotations can be used as input for the LSTM; a rare-first order is used, which puts rarer labels before more frequent ones (based on label frequency in the dataset). During the test phase, the RIA model receives the input image and, triggered by the start signal, predicts the first output label. Using the previous output as the input for the next time step, it predicts the label sequence recursively, and the loop continues until the stop signal is predicted. Its structure for the test phase is shown in the image below:

(figure: RIA test-phase architecture)
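A minimal sketch of this test-phase loop in PyTorch. The module names (`cnn`, `img_to_hidden`, `embedding`, `lstm`, `classifier`) are illustrative placeholders, not this repository's exact identifiers:

```python
import torch

@torch.no_grad()
def ria_greedy_decode(cnn, img_to_hidden, embedding, lstm, classifier,
                      image, start_idx, stop_idx, max_seq_len=5):
    # Encode the image; the features initialize the LSTM hidden state
    # (img_to_hidden is a hypothetical projection layer).
    feats = cnn(image)                          # (1, feat_dim)
    h0 = img_to_hidden(feats).unsqueeze(0)      # (1, 1, hidden_dim)
    state = (h0, torch.zeros_like(h0))          # (h_0, c_0)
    token = torch.tensor([start_idx])           # start signal triggers decoding
    labels = []
    for _ in range(max_seq_len):
        emb = embedding(token).unsqueeze(1)     # (1, 1, embed_dim)
        out, state = lstm(emb, state)           # one recurrent step
        token = classifier(out.squeeze(1)).argmax(dim=-1)
        if token.item() == stop_idx:            # stop signal ends the loop
            break
        labels.append(token.item())             # prediction is fed back next step
    return labels
```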

Labels are mapped to embedding vectors via a lookup table rather than one-hot vectors. The lookup table is trainable and can learn what kind of representation to generate; however, experiments have shown that pre-trained weights such as the GloVe embeddings provide better results.
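A minimal sketch of such a lookup table initialized from pre-trained GloVe vectors, assuming a vocabulary of the 260 labels plus start/stop tokens; the `glove_vectors` tensor here is a random stand-in for weights loaded from a real GloVe file:

```python
import torch
import torch.nn as nn

num_labels, embed_dim = 260 + 2, 300   # 260 labels plus start/stop tokens (assumed)
# Stand-in for a (num_labels, 300) matrix of real GloVe vectors:
glove_vectors = torch.randn(num_labels, embed_dim)

embedding = nn.Embedding(num_labels, embed_dim)
embedding.weight.data.copy_(glove_vectors)   # initialize from GloVe
embedding.weight.requires_grad = False       # optionally freeze the pre-trained weights

vec = embedding(torch.tensor([3]))           # maps label index 3 to its 300-d vector
```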

2) SR-CNN-RNN:
SR-CNN-RNN is another encoder-decoder model with an architecture similar to RIA's. The difference is that semantic concept learning is now done by the CNN model, which uses the input image to generate probabilistic estimates of semantic concepts. To generate label sequences, the RNN model takes these concept probability estimates and models their correlations.
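Sketched in PyTorch, the key structural difference is that the label sequence is started from the CNN's concept-probability vector rather than from raw image features; all names and dimensions here are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn

class SRCNNRNN(nn.Module):
    """Illustrative sketch: the CNN predicts concept probabilities,
    which the RNN consumes to model label correlations."""
    def __init__(self, cnn, num_labels=262, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.cnn = cnn                                    # pre-trained concept predictor
        self.prob_to_embed = nn.Linear(num_labels, embed_dim)
        self.embedding = nn.Embedding(num_labels, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, image, label_seq):
        probs = torch.sigmoid(self.cnn(image))            # probabilistic concept estimates
        first = self.prob_to_embed(probs).unsqueeze(1)    # probabilities start the sequence
        rest = self.embedding(label_seq)                  # ground-truth labels (teacher forcing)
        out, _ = self.lstm(torch.cat([first, rest], dim=1))
        return self.classifier(out)                       # per-step label scores
```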

3) CNN-RNN + Attention:
Attention networks are widely used in deep learning. They let a model determine which parts of the encoding are relevant to the task at hand. In image annotation, the attention mechanism can highlight the pixels that matter most. In many cases, however, labels are conceptual and cannot be tied to specific objects appearing in the image, so the attention mechanism is not able to improve results significantly. Its structure for the test phase is shown in the image below:

(figure: CNN-RNN + Attention test-phase architecture; inspired by Image Captioning)

(figures: some examples of attention)
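A sketch of additive attention in the style of Show, Attend and Tell, which this kind of architecture follows. The dimensions are taken from the results tables below, but the class itself is illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores each spatial location of the CNN feature map against
    the current LSTM hidden state, then returns a weighted context."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=1024):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, num_pixels, feat_dim), hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                         # (B, num_pixels) attention scores
        alpha = torch.softmax(e, dim=1)        # weights over spatial locations
        context = (feats * alpha.unsqueeze(-1)).sum(dim=1)  # weighted image context
        return context, alpha
```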

4) CNN-RNN + Attention + MLA:
To reduce the problems caused by imposing a fixed order on the labels, it has been proposed to align the labels with the network's predictions before computing the loss. Since this alignment is an assignment problem, it can be solved with the Hungarian algorithm. So, while preserving the attention architecture, we use minimal loss alignment (MLA) as the loss function instead of cross-entropy loss. (Furthermore, the frequency of a label in a dataset is independent of the size of the corresponding object: less frequent but larger objects can dominate the image, rank high in the prediction step, and thus cause the LSTM to stop predicting too early.)
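A sketch of MLA for a single image, using SciPy's Hungarian-algorithm solver (`linear_sum_assignment`); the tensor shapes and padding convention are assumptions for illustration:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def mla_loss(logits, target_labels):
    # logits: (T, num_labels) per-step scores for one image.
    # target_labels: (T,) LongTensor of unordered ground-truth label
    # indices, padded to length T (padding convention assumed).
    log_probs = F.log_softmax(logits, dim=-1)        # (T, num_labels)
    cost = -log_probs[:, target_labels]              # cost[t, j]: loss of label j at step t
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    aligned = target_labels[torch.as_tensor(cols)]   # label order chosen by the alignment
    return F.cross_entropy(logits[torch.as_tensor(rows)], aligned)
```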

Beam Search

Choosing the label with the highest score at each step and then predicting the next one is the greedy approach. However, this is not optimal, since the rest of the sequence depends on the first label; indeed, every label in the sequence has consequences for the ones that follow it. (The best sequence might, for example, use the third-best label at the first step, the second-best at the second step, and so on.)
Beam search can be used instead of greedy search to resolve this issue; a sketch is given below. However, experiments have shown that the RNN model cannot properly learn the complicated relationships between the labels, so beam search has no noticeable effect on the results.
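A compact sketch of beam search over per-step label distributions. The `step` callback, which maps the previous token and recurrent state to log-probabilities and a new state, is an assumed interface rather than a function from this repository:

```python
def beam_search(step, start_idx, stop_idx, beam_width=3, max_seq_len=5):
    # step(token, state) -> (log_probs, new_state); log_probs is a
    # (num_labels,) tensor of log-probabilities for the next label.
    beams = [([start_idx], 0.0, None)]              # (sequence, score, state)
    finished = []
    for _ in range(max_seq_len):
        candidates = []
        for seq, score, state in beams:
            log_probs, new_state = step(seq[-1], state)
            top_lp, top_idx = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((seq + [idx], score + lp, new_state))
        # Keep the best partial sequences; retire those that predicted STOP.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score, state in candidates:
            if seq[-1] == stop_idx:
                finished.append((seq, score))
            elif len(beams) < beam_width:
                beams.append((seq, score, state))
        if not beams:
            break
    finished.extend((seq, score) for seq, score, _ in beams)
    best_seq, _ = max(finished, key=lambda f: f[1])
    return [t for t in best_seq if t not in (start_idx, stop_idx)]
```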

Evaluation Metrics

Precision, recall, F1-score, and N+ are the most popular metrics for evaluating image annotation models. I have used per-class (per-label) and per-image (overall) precision, recall, and F1-score, as well as N+, all of which are common in image annotation papers.

(check out CNN_Image_Annotation_evaluation_metrics for more information)
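A compact sketch of how these metrics are commonly computed from binary prediction and ground-truth matrices; conventions such as giving never-predicted labels zero precision are assumptions, so see the linked repository for the exact definitions used here:

```python
import numpy as np

def annotation_metrics(pred, true):
    # pred, true: binary (num_images, num_labels) indicator arrays.
    tp = (pred * true).sum(axis=0).astype(float)       # true positives per label
    per_class_p = (tp / np.maximum(pred.sum(axis=0), 1)).mean()
    per_class_r = (tp / np.maximum(true.sum(axis=0), 1)).mean()
    overall_p = tp.sum() / max(pred.sum(), 1)          # counts pooled over all images
    overall_r = tp.sum() / max(true.sum(), 1)
    f1 = lambda p, r: 2 * p * r / (p + r) if p + r > 0 else 0.0
    n_plus = int((tp > 0).sum())                       # labels recalled at least once
    return {"per-class": (per_class_p, per_class_r, f1(per_class_p, per_class_r)),
            "per-image": (overall_p, overall_r, f1(overall_p, overall_r)),
            "N+": n_plus}
```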

Train and Evaluation

To train and evaluate the models in the Spyder IDE, use the commands below:

1) RIA:
run main.py --method RIA --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method RIA --max-seq-len 5 --order-free None --is_glove --evaluate
2) SR-CNN-RNN:
run main.py --method SR-CNN-RNN --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method SR-CNN-RNN --max-seq-len 5 --order-free None --is_glove --evaluate
3) CNN-RNN + Attention:
run main.py --method Attention --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method Attention --max-seq-len 5 --order-free None --is_glove --evaluate
4) CNN-RNN + Attention + MLA:
run main.py --method Attention --max-seq-len 5 --order-free MLA --is_glove
run main.py --method Attention --max-seq-len 5 --order-free MLA --is_glove --evaluate

Results

1) RIA:
| batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | label embedding dim |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | 4500 | 448 * 448 | 136s | True | 2048 | 300 |

| data | precision | recall | f1-score |
| :--- | :---: | :---: | :---: |
| testset per-image metrics | 0.647 | 0.606 | 0.626 |
| testset per-class metrics | 0.409 | 0.421 | 0.415 |

| data | N+ |
| :--- | :---: |
| testset | 156 |
2) SR-CNN-RNN: As mentioned in the paper, the CNN and LSTM models were pre-trained separately with ground-truth labels.
| batch-size | num of training images | image-size | epoch time | GloVe weights | predicted labels embedding dim | label embedding dim |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | 4500 | 448 * 448 | 135s | True | 2048 | 300 |

| data | precision | recall | f1-score |
| :--- | :---: | :---: | :---: |
| testset per-image metrics | 0.680 | 0.616 | 0.646 |
| testset per-class metrics | 0.405 | 0.391 | 0.398 |

| data | N+ |
| :--- | :---: |
| testset | 145 |
3) CNN-RNN + Attention:
| batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | attention dim | label embedding dim |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | 4500 | 448 * 448 | 150s | True | 2048 | 1024 | 300 |

| data | precision | recall | f1-score |
| :--- | :---: | :---: | :---: |
| testset per-image metrics | 0.663 | 0.616 | 0.638 |
| testset per-class metrics | 0.438 | 0.429 | 0.434 |

| data | N+ |
| :--- | :---: |
| testset | 160 |
4) CNN-RNN + Attention + MLA:
| batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | attention dim | label embedding dim |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | 4500 | 448 * 448 | 158s | True | 2048 | 1024 | 300 |

| data | precision | recall | f1-score |
| :--- | :---: | :---: | :---: |
| testset per-image metrics | 0.656 | 0.608 | 0.632 |
| testset per-class metrics | 0.449 | 0.413 | 0.431 |

| data | N+ |
| :--- | :---: |
| testset | 155 |

References

J. Jin and H. Nakayama,
"Recurrent Image Annotator for Arbitrary Length Image Tagging" (ICPR-2016)

F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun,
"Semantic Regularisation for Recurrent Image Annotation" (CVPR-2017)

V. O. Yazici, A. Gonzalez-Garcia, A. Ramisa, B. Twardowski, and J. van de Weijer,
"Orderless Recurrent Models for Multi-label Classification" (CVPR-2020)