This is the project for the course DS310 - Natural Language Processing For Data Science at the University of Information Technology - Vietnam National University, Ho Chi Minh City.

Cross-Modal Integration: Image Captioning with Multimodal Pretrained Models and Machine Translation

Abstract

This is an academic research project focused on the image captioning and machine translation problems, approached in a fundamental and accessible manner. For the image captioning task, we utilize 7 pretrained models for image feature extraction: VGG16, VGG19, InceptionV3, ResNet50, EfficientNetV2L, DenseNet201, and InceptionResNetV2. These models are then combined with 2 sequence models, LSTM and GRU, resulting in a total of 14 model combinations whose effectiveness we compare.
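The 7 × 2 grid of combinations described above can be enumerated as a simple cross product; a minimal sketch (model names only, not the actual training code):

```python
from itertools import product

# Pretrained CNN encoders used for image feature extraction
encoders = [
    "VGG16", "VGG19", "InceptionV3", "ResNet50",
    "EfficientNetV2L", "DenseNet201", "InceptionResNetV2",
]
# Recurrent decoders used for caption generation
decoders = ["LSTM", "GRU"]

# Every encoder is paired with every decoder: 7 x 2 = 14 combinations
combinations = [f"{enc}+{dec}" for enc, dec in product(encoders, decoders)]
print(len(combinations))  # 14
```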

For the machine translation task, we also utilized LSTM and GRU. Both models are trained on a dataset containing parallel Vietnamese and English sentences, and their performance is compared with the Google Translate API. The purpose of our machine translation system is to translate the captions generated by the image captioning system, ultimately yielding a bilingual image captioning system that seamlessly translates captions between Vietnamese and English.
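Comparisons like the one above between trained models and the Google Translate API are typically scored with BLEU; as an illustration, here is a minimal unsmoothed sentence-level BLEU sketch against a single reference (the project's actual metric setup may differ):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference string."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped n-gram matches: each candidate n-gram counts at most
        # as many times as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        precisions.append(overlap / total)
    # Brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("a man kicks a ball", "a man kicks a ball"), 2))  # 1.0
```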

Dataset

Image Captioning:

Our evaluation was conducted on the UIT-ViIC dataset, comprising 3,850 images related to various ball sports. This dataset was curated from the 2017 version of the Microsoft COCO dataset and therefore, UIT-ViIC also provides five Vietnamese captions for each image, resulting in a total of 19,250 captions.

Example: an image from UIT-ViIC together with its five Vietnamese annotation sentences (screenshots in the repository).

Machine Translation:

We leveraged caption sentences in both Vietnamese and English that were used to describe images from the Flickr8k dataset. We then adjusted the data structure to optimize it for the training process, resulting in a corpus comprising 4,000 sentences for each language.
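The restructuring described above amounts to pairing the Vietnamese and English captions that describe the same image; a hypothetical sketch, assuming captions are keyed by image ID (the dict layout and example sentences are illustrative, not the project's actual data format):

```python
def build_parallel_corpus(en_captions, vi_captions):
    """Pair English and Vietnamese captions describing the same image.

    en_captions / vi_captions: dicts mapping image ID -> caption string.
    Only images present in both languages yield a training pair.
    """
    shared_ids = sorted(en_captions.keys() & vi_captions.keys())
    return [(en_captions[i], vi_captions[i]) for i in shared_ids]

# Tiny illustrative example (the real corpus has 4,000 sentences per language)
en = {"img1": "a dog runs on the grass", "img2": "two boys play soccer"}
vi = {"img1": "một con chó chạy trên cỏ"}
pairs = build_parallel_corpus(en, vi)
print(pairs)  # [('a dog runs on the grass', 'một con chó chạy trên cỏ')]
```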

Example: parallel Vietnamese-English caption pairs (screenshot in the repository).

Model Evaluation

Image Captioning:

(Evaluation results for the 14 image captioning model combinations; screenshot in the repository.)

Machine Translation:

(Evaluation results for the machine translation models; screenshot in the repository.)

Contact

Chi Thanh Dang, Thuy Hong Thi Dang and Tien Duong Pham

Faculty of Information Science and Engineering, University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam.

20520761@gm.uit.edu.vn, 20520523@gm.uit.edu.vn, 20521222@gm.uit.edu.vn
