Recurrent_Image_Annotation

Implementation of some popular Recurrent Image Annotation papers on Corel-5k dataset with PyTorch library

Dataset

The 'Corel-5k' folder contains the Corel-5k dataset: 5,000 images annotated with a vocabulary of 260 labels.

(for more information see CNN_Image_Annotation_dataset)

Long short-term memory (LSTM)


The operation of an LSTM cell in one time step:

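For reference, the standard LSTM update equations for a single time step (as given in the PyTorch nn.LSTM documentation) are:

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$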

Convolutional neural network (CNN)

Compared to the other CNNs in my experiments, TResNet produced the best results for extracting image features, so it has been chosen as the feature extractor.

(more information can be found at CNN_Image_Annotation_convolutional_models)

CNN+LSTM models

1) RIA:
RIA is an encoder-decoder model that uses a CNN as the encoder and an LSTM as the decoder. In the training phase, it is trained on training images and their human annotations. The label set must be sorted into a label sequence before the annotations can be used as input for the LSTM; a rare-first order is used, which puts rarer labels before more frequent ones (based on label frequency in the dataset). During the test phase, the RIA model receives the input image and, triggered by the start signal, predicts the first output label. Using the previous output as the input for the next time step, it predicts the label sequence recursively, and the loop continues until the stop signal is predicted. Its structure for the test phase is shown in the image below:

(figure: RIA test-phase architecture)
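A minimal sketch of this test-phase loop in PyTorch. The module names (`cnn`, `img_to_hidden`, `embedding`, `lstm`, `classifier`) are illustrative placeholders, not this repository's exact identifiers:

```python
import torch

@torch.no_grad()
def ria_greedy_decode(cnn, img_to_hidden, embedding, lstm, classifier,
                      image, start_idx, stop_idx, max_seq_len=5):
    # Encode the image; the features initialize the LSTM hidden state
    # (img_to_hidden is a hypothetical projection layer).
    feats = cnn(image)                          # (1, feat_dim)
    h0 = img_to_hidden(feats).unsqueeze(0)      # (1, 1, hidden_dim)
    state = (h0, torch.zeros_like(h0))          # (h_0, c_0)
    token = torch.tensor([start_idx])           # start signal triggers decoding
    labels = []
    for _ in range(max_seq_len):
        emb = embedding(token).unsqueeze(1)     # (1, 1, embed_dim)
        out, state = lstm(emb, state)           # one recurrent step
        token = classifier(out.squeeze(1)).argmax(dim=-1)
        if token.item() == stop_idx:            # stop signal ends the loop
            break
        labels.append(token.item())             # prediction is fed back next step
    return labels
```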

Labels are mapped to embedding vectors via a lookup table rather than one-hot vectors. The lookup table is trainable and can learn what kind of representation to generate; however, experiments have shown that pre-trained weights such as the GloVe embeddings provide better results.
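A minimal sketch of such a lookup table initialized from pre-trained GloVe vectors, assuming a vocabulary of the 260 labels plus start/stop tokens; the `glove_vectors` tensor here is a random stand-in for weights loaded from a real GloVe file:

```python
import torch
import torch.nn as nn

num_labels, embed_dim = 260 + 2, 300   # 260 labels plus start/stop tokens (assumed)
# Stand-in for a (num_labels, 300) matrix of real GloVe vectors:
glove_vectors = torch.randn(num_labels, embed_dim)

embedding = nn.Embedding(num_labels, embed_dim)
embedding.weight.data.copy_(glove_vectors)   # initialize from GloVe
embedding.weight.requires_grad = False       # optionally freeze the pre-trained weights

vec = embedding(torch.tensor([3]))           # maps label index 3 to its 300-d vector
```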

2) SR-CNN-RNN:
SR-CNN-RNN is another encoder-decoder model with an architecture similar to RIA's. The difference is that semantic concept learning is now done by the CNN model, which uses the input image to generate probabilistic estimates of semantic concepts. To generate label sequences, the RNN model takes these concept probability estimates and models their correlations.
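Sketched in PyTorch, the key structural difference is that the label sequence is started from the CNN's concept-probability vector rather than from raw image features; all names and dimensions here are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn

class SRCNNRNN(nn.Module):
    """Illustrative sketch: the CNN predicts concept probabilities,
    which the RNN consumes to model label correlations."""
    def __init__(self, cnn, num_labels=262, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.cnn = cnn                                    # pre-trained concept predictor
        self.prob_to_embed = nn.Linear(num_labels, embed_dim)
        self.embedding = nn.Embedding(num_labels, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, image, label_seq):
        probs = torch.sigmoid(self.cnn(image))            # probabilistic concept estimates
        first = self.prob_to_embed(probs).unsqueeze(1)    # probabilities start the sequence
        rest = self.embedding(label_seq)                  # ground-truth labels (teacher forcing)
        out, _ = self.lstm(torch.cat([first, rest], dim=1))
        return self.classifier(out)                       # per-step label scores
```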

3) CNN-RNN + Attention:
Attention networks are widely used in deep learning. They let a model determine which parts of the encoding are relevant to the task at hand. In image annotation, the attention mechanism can highlight the pixels that matter most. In many cases, however, labels are conceptual and cannot be tied to specific objects appearing in the image, so the attention mechanism is not able to improve results significantly. Its structure for the test phase is shown in the image below:

(figure: CNN-RNN + Attention test-phase architecture; inspired by Image Captioning)

(figures: some examples of attention)
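A sketch of additive attention in the style of Show, Attend and Tell, which this kind of architecture follows. The dimensions are taken from the results tables below, but the class itself is illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores each spatial location of the CNN feature map against
    the current LSTM hidden state, then returns a weighted context."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=1024):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, num_pixels, feat_dim), hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                         # (B, num_pixels) attention scores
        alpha = torch.softmax(e, dim=1)        # weights over spatial locations
        context = (feats * alpha.unsqueeze(-1)).sum(dim=1)  # weighted image context
        return context, alpha
```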

4) CNN-RNN + Attention + MLA:
To reduce the problems caused by imposing a fixed order on the labels, it has been proposed to align the labels with the network's predictions before computing the loss. Since this alignment is an assignment problem, it can be solved with the Hungarian algorithm. So, while preserving the attention architecture, we use minimal loss alignment (MLA) as the loss function instead of cross-entropy loss. (Furthermore, the frequency of a label in a dataset is independent of the size of the corresponding object: less frequent but larger objects can dominate the image, rank high in the prediction step, and thus cause the LSTM to stop predicting too early.)
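A sketch of MLA for a single image, using SciPy's Hungarian-algorithm solver (`linear_sum_assignment`); the tensor shapes and padding convention are assumptions for illustration:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def mla_loss(logits, target_labels):
    # logits: (T, num_labels) per-step scores for one image.
    # target_labels: (T,) LongTensor of unordered ground-truth label
    # indices, padded to length T (padding convention assumed).
    log_probs = F.log_softmax(logits, dim=-1)        # (T, num_labels)
    cost = -log_probs[:, target_labels]              # cost[t, j]: loss of label j at step t
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    aligned = target_labels[torch.as_tensor(cols)]   # label order chosen by the alignment
    return F.cross_entropy(logits[torch.as_tensor(rows)], aligned)
```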

Beam Search

Choosing the label with the highest score at each step and then predicting the next one is the greedy approach. However, this is not optimal, since the rest of the sequence depends on the first label; indeed, every label in the sequence has consequences for the ones that follow it. (The best sequence might, for example, use the third-best label at the first step, the second-best at the second step, and so on.)
Beam search can be used instead of greedy search to resolve this issue; a sketch is given below. However, experiments have shown that the RNN model cannot properly learn the complicated relationships between the labels, so beam search has no noticeable effect on the results.
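A compact sketch of beam search over per-step label distributions. The `step` callback, which maps the previous token and recurrent state to log-probabilities and a new state, is an assumed interface rather than a function from this repository:

```python
def beam_search(step, start_idx, stop_idx, beam_width=3, max_seq_len=5):
    # step(token, state) -> (log_probs, new_state); log_probs is a
    # (num_labels,) tensor of log-probabilities for the next label.
    beams = [([start_idx], 0.0, None)]              # (sequence, score, state)
    finished = []
    for _ in range(max_seq_len):
        candidates = []
        for seq, score, state in beams:
            log_probs, new_state = step(seq[-1], state)
            top_lp, top_idx = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((seq + [idx], score + lp, new_state))
        # Keep the best partial sequences; retire those that predicted STOP.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score, state in candidates:
            if seq[-1] == stop_idx:
                finished.append((seq, score))
            elif len(beams) < beam_width:
                beams.append((seq, score, state))
        if not beams:
            break
    finished.extend((seq, score) for seq, score, _ in beams)
    best_seq, _ = max(finished, key=lambda f: f[1])
    return [t for t in best_seq if t not in (start_idx, stop_idx)]
```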

Evaluation Metrics

Precision, recall, F1-score, and N+ are the most popular metrics for evaluating image annotation models. I have used per-class (per-label) and per-image (overall) precision, recall, and F1-score, as well as N+, all of which are common in image annotation papers.

(check out CNN_Image_Annotation_evaluation_metrics for more information)
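A compact sketch of how these metrics are commonly computed from binary prediction and ground-truth matrices; conventions such as giving never-predicted labels zero precision are assumptions, so see the linked repository for the exact definitions used here:

```python
import numpy as np

def annotation_metrics(pred, true):
    # pred, true: binary (num_images, num_labels) indicator arrays.
    tp = (pred * true).sum(axis=0).astype(float)       # true positives per label
    per_class_p = (tp / np.maximum(pred.sum(axis=0), 1)).mean()
    per_class_r = (tp / np.maximum(true.sum(axis=0), 1)).mean()
    overall_p = tp.sum() / max(pred.sum(), 1)          # counts pooled over all images
    overall_r = tp.sum() / max(true.sum(), 1)
    f1 = lambda p, r: 2 * p * r / (p + r) if p + r > 0 else 0.0
    n_plus = int((tp > 0).sum())                       # labels recalled at least once
    return {"per-class": (per_class_p, per_class_r, f1(per_class_p, per_class_r)),
            "per-image": (overall_p, overall_r, f1(overall_p, overall_r)),
            "N+": n_plus}
```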

Train and Evaluation

To train and evaluate the models in the Spyder IDE, use the commands below:

1) RIA:
run main.py --method RIA --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method RIA --max-seq-len 5 --order-free None --is_glove --evaluate
2) SR-CNN-RNN:
run main.py --method SR-CNN-RNN --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method SR-CNN-RNN --max-seq-len 5 --order-free None --is_glove --evaluate
3) CNN-RNN + Attention:
run main.py --method Attention --max-seq-len 5 --order-free None --is_glove --sort
run main.py --method Attention --max-seq-len 5 --order-free None --is_glove --evaluate
4) CNN-RNN + Attention + MLA:
run main.py --method Attention --max-seq-len 5 --order-free MLA --is_glove
run main.py --method Attention --max-seq-len 5 --order-free MLA --is_glove --evaluate

Results

1) RIA:
| batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | label embedding dim |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | 4500 | 448 * 448 | 136s | True | 2048 | 300 |

| data | precision | recall | f1-score |
| :--- | :---: | :---: | :---: |
| testset per-image metrics | 0.647 | 0.606 | 0.626 |
| testset per-class metrics | 0.409 | 0.421 | 0.415 |

| data | N+ |
| :--- | :---: |
| testset | 156 |
2) SR-CNN-RNN: As mentioned in the paper, the CNN and LSTM models were pre-trained separately with ground-truth labels.
| batch-size | num of training images | image-size | epoch time | GloVe weights | predicted labels embedding dim | label embedding dim |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | 4500 | 448 * 448 | 135s | True | 2048 | 300 |

| data | precision | recall | f1-score |
| :--- | :---: | :---: | :---: |
| testset per-image metrics | 0.680 | 0.616 | 0.646 |
| testset per-class metrics | 0.405 | 0.391 | 0.398 |

| data | N+ |
| :--- | :---: |
| testset | 145 |
3) CNN-RNN + Attention:
| batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | attention dim | label embedding dim |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | 4500 | 448 * 448 | 150s | True | 2048 | 1024 | 300 |

| data | precision | recall | f1-score |
| :--- | :---: | :---: | :---: |
| testset per-image metrics | 0.663 | 0.616 | 0.638 |
| testset per-class metrics | 0.438 | 0.429 | 0.434 |

| data | N+ |
| :--- | :---: |
| testset | 160 |
4) CNN-RNN + Attention + MLA:
| batch-size | num of training images | image-size | epoch time | GloVe weights | features embedding dim | attention dim | label embedding dim |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | 4500 | 448 * 448 | 158s | True | 2048 | 1024 | 300 |

| data | precision | recall | f1-score |
| :--- | :---: | :---: | :---: |
| testset per-image metrics | 0.656 | 0.608 | 0.632 |
| testset per-class metrics | 0.449 | 0.413 | 0.431 |

| data | N+ |
| :--- | :---: |
| testset | 155 |

References

J. Jin and H. Nakayama,
"Recurrent Image Annotator for Arbitrary Length Image Tagging" (ICPR-2016)

F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun,
"Semantic Regularisation for Recurrent Image Annotation" (CVPR-2017)

V. O. Yazici, A. Gonzalez-Garcia, A. Ramisa, B. Twardowski, and J. van de Weijer,
"Orderless Recurrent Models for Multi-label Classification" (CVPR-2020)