This is an academic research project focused on image captioning and machine translation, approached in a fundamental and accessible manner. For the image captioning task, we use seven pretrained models for image feature extraction: VGG16, VGG19, InceptionV3, ResNet50, EfficientNetV2L, DenseNet201, and InceptionResNetV2. Each extractor is combined with one of two sequence models, LSTM and GRU, yielding a total of 14 model combinations whose effectiveness we compare.
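The pairing of feature extractors with sequence models can be sketched as a simple cross product; the model names below match those listed above, but the enumeration itself is only an illustration of how the 14 combinations arise:

```python
from itertools import product

# The seven pretrained CNN feature extractors and two sequence models
# named in the project description.
cnn_extractors = ["VGG16", "VGG19", "InceptionV3", "ResNet50",
                  "EfficientNetV2L", "DenseNet201", "InceptionResNetV2"]
sequence_models = ["LSTM", "GRU"]

# Every (extractor, decoder) pairing: 7 x 2 = 14 combinations to compare.
combinations = list(product(cnn_extractors, sequence_models))
print(len(combinations))  # 14
```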
For the machine translation task, we trained LSTM and GRU models on a dataset of parallel Vietnamese and English sentences and compared their performance with the Google Translate API. The purpose of our machine translation system is to translate the captions generated by the image captioning system, ultimately producing a bilingual image captioning system that seamlessly translates captions between Vietnamese and English.
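The bilingual pipeline chains the two systems: a caption is generated first and then passed to the translator. A minimal sketch of that composition, with stub functions standing in for the trained models (the function names, interfaces, and example sentences are assumptions, not the project's actual API):

```python
def bilingual_caption(image, caption_model, translate, src="en", tgt="vi"):
    """Generate a caption for an image, then translate it.

    Hypothetical interface: `caption_model` is the CNN encoder + LSTM/GRU
    decoder, `translate` is the trained seq2seq translator.
    """
    caption = caption_model(image)
    return caption, translate(caption, src=src, tgt=tgt)

# Stubs standing in for the trained models (illustrative only):
caption_model = lambda img: "a man kicks a soccer ball"
translate = lambda s, src, tgt: {
    "a man kicks a soccer ball": "một người đàn ông đá quả bóng đá"
}[s]

en, vi = bilingual_caption(None, caption_model, translate)
```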
Image Captioning:
Our evaluation was conducted on the UIT-ViIC dataset, which comprises 3,850 images of various ball sports. The dataset was curated from the 2017 version of the Microsoft COCO dataset, and UIT-ViIC accordingly provides five Vietnamese captions for each image, for a total of 19,250 captions.
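Since UIT-ViIC is derived from MS-COCO 2017, a COCO-style captions file (an `images` list plus an `annotations` list keyed by `image_id`) is a plausible way to group the five captions per image; the exact file layout is an assumption here, and the sample record is illustrative:

```python
from collections import defaultdict

# Assumed COCO-style annotation structure (illustrative sample record).
sample = {
    "images": [{"id": 1, "file_name": "000000000001.jpg"}],
    "annotations": [
        {"image_id": 1, "caption": f"Vietnamese caption {i}"} for i in range(5)
    ],
}

# Group captions by image id, as a captioning data loader typically does.
captions_by_image = defaultdict(list)
for ann in sample["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

# Five captions per image over 3,850 images gives the stated 19,250 total.
assert 3850 * 5 == 19250
```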
Example:
- Image:
- Annotation:
Machine Translation:
We leveraged caption sentences in both Vietnamese and English that describe images from the Flickr8k dataset. We then adjusted the data structure to optimize it for the training process, resulting in a corpus of 4,000 sentences for each language.
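The data preparation step can be sketched as pairing the aligned English and Vietnamese sentences and adding the start/end tokens a seq2seq decoder typically needs; the example sentences and the line-by-line alignment are assumptions for illustration:

```python
# Illustrative aligned caption lists (the real corpus holds 4,000 per language).
en_sentences = ["a dog runs on the grass", "two children play football"]
vi_sentences = ["một con chó chạy trên cỏ", "hai đứa trẻ chơi bóng đá"]

# Normalize and pair each English sentence with its Vietnamese counterpart,
# wrapping the target side in the decoder's start/end tokens.
pairs = [(en.strip().lower(), f"<start> {vi.strip().lower()} <end>")
         for en, vi in zip(en_sentences, vi_sentences)]
```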
Example:
Image Captioning:
Machine Translation:
Chi Thanh Dang, Thuy Hong Thi Dang and Tien Duong Pham
Faculty of Information Science and Engineering, University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam.
20520761@gm.uit.edu.vn, 20520523@gm.uit.edu.vn, 20521222@gm.uit.edu.vn