This is the project for the course DS310 - Natural Language Processing For Data Science at the University of Information Technology - Vietnam National University, Ho Chi Minh City.

Cross-Modal Integration: Image Captioning with Multimodal Pretrained Models and Machine Translation

Abstract

This is an academic research project focused on the image captioning and machine translation problems, approached in a fundamental and accessible manner. For the image captioning task, we utilize 7 pretrained models for image feature extraction: VGG16, VGG19, InceptionV3, ResNet50, EfficientNetV2L, DenseNet201, and InceptionResNetV2. These models are then combined with 2 sequence models, LSTM and GRU, resulting in a total of 14 model combinations whose effectiveness we compare.
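The 7 × 2 grid of combinations described above can be enumerated as a simple cross product; a minimal sketch (model names only, not the actual training code):

```python
from itertools import product

# Pretrained CNN encoders used for image feature extraction
encoders = [
    "VGG16", "VGG19", "InceptionV3", "ResNet50",
    "EfficientNetV2L", "DenseNet201", "InceptionResNetV2",
]
# Recurrent decoders used for caption generation
decoders = ["LSTM", "GRU"]

# Every encoder is paired with every decoder: 7 x 2 = 14 combinations
combinations = [f"{enc}+{dec}" for enc, dec in product(encoders, decoders)]
print(len(combinations))  # 14
```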

For the machine translation task, we also utilized LSTM and GRU. Both models are trained on a dataset containing parallel Vietnamese and English sentences, and their performance is compared with the Google Translate API. The purpose of our machine translation system is to translate the captions generated by the image captioning system, ultimately yielding a bilingual image captioning system that seamlessly translates captions between Vietnamese and English.
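Comparisons like the one above between trained models and the Google Translate API are typically scored with BLEU; as an illustration, here is a minimal unsmoothed sentence-level BLEU sketch against a single reference (the project's actual metric setup may differ):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference string."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped n-gram matches: each candidate n-gram counts at most
        # as many times as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        precisions.append(overlap / total)
    # Brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("a man kicks a ball", "a man kicks a ball"), 2))  # 1.0
```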

Dataset

Image Captioning:

Our evaluation was conducted on the UIT-ViIC dataset, comprising 3,850 images related to various ball sports. This dataset was curated from the 2017 version of the Microsoft COCO dataset and therefore, UIT-ViIC also provides five Vietnamese captions for each image, resulting in a total of 19,250 captions.

Example: an image from UIT-ViIC together with its five Vietnamese annotation sentences (screenshots in the repository).

Machine Translation:

We leveraged caption sentences in both Vietnamese and English that were used to describe images from the Flickr8k dataset. We then adjusted the data structure to optimize it for the training process, resulting in a corpus comprising 4,000 sentences for each language.
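The restructuring described above amounts to pairing the Vietnamese and English captions that describe the same image; a hypothetical sketch, assuming captions are keyed by image ID (the dict layout and example sentences are illustrative, not the project's actual data format):

```python
def build_parallel_corpus(en_captions, vi_captions):
    """Pair English and Vietnamese captions describing the same image.

    en_captions / vi_captions: dicts mapping image ID -> caption string.
    Only images present in both languages yield a training pair.
    """
    shared_ids = sorted(en_captions.keys() & vi_captions.keys())
    return [(en_captions[i], vi_captions[i]) for i in shared_ids]

# Tiny illustrative example (the real corpus has 4,000 sentences per language)
en = {"img1": "a dog runs on the grass", "img2": "two boys play soccer"}
vi = {"img1": "một con chó chạy trên cỏ"}
pairs = build_parallel_corpus(en, vi)
print(pairs)  # [('a dog runs on the grass', 'một con chó chạy trên cỏ')]
```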

Example: parallel Vietnamese-English caption pairs (screenshot in the repository).

Model Evaluation

Image Captioning:

(Evaluation results for the 14 image captioning model combinations; screenshot in the repository.)

Machine Translation:

(Evaluation results for the machine translation models; screenshot in the repository.)

Contact

Chi Thanh Dang, Thuy Hong Thi Dang and Tien Duong Pham

Faculty of Information Science and Engineering, University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam.

20520761@gm.uit.edu.vn, 20520523@gm.uit.edu.vn, 20521222@gm.uit.edu.vn
