Image Captioning with Transformer

This project applies a Transformer-based model to the image captioning task. In this study project, most of the work is reimplemented and some parts are adapted with substantial modification. The purpose of the project is to test the performance of the Transformer architecture together with Bottom-Up features: I conduct experiments comparing two different ways to extract features from the visual input (an image) and encode them as a sequence.

Notebook: Notebook

The following figure gives an overview of the baseline model architectures.

[figure: baseline model architecture overview]

There are 2 ways to embed visual inputs:

  • In the patch-based architecture, image features are extracted either by splitting the image into 16x16 patches and flattening them (the same method as in the Vision Transformer), or by taking a fixed 8x8 grid of tiles from an Inception V3 feature map. A minimal sketch of the patch splitting is shown after this list.
Patch-based Encoders
[figures: patch-based encoder variants]
  • In the architecture that uses bottom-up attention, a Faster R-CNN is used to extract a feature vector for each object detected in the image. This method captures visual meaning with object-aware semantics and generates some very good captions (in my opinion, at least).
Bottom-Up Encoder
[figure: bottom-up encoder]
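
The patch splitting itself is only a few lines of PyTorch. The snippet below is a minimal sketch of the idea rather than the exact code in this repository; the 224x224 input size and the layout of the reshape are assumptions.

```python
import torch

# Toy 224x224 RGB image; a real pipeline would load and normalise an actual image.
image = torch.randn(1, 3, 224, 224)                  # (batch, channels, H, W)

# Cut the image into non-overlapping 16x16 patches: 14 x 14 = 196 patches.
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)  # (1, 3, 14, 14, 16, 16)

# Flatten each patch into a 3*16*16 = 768-dim vector, i.e. one "token" per patch.
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * 16 * 16)
print(tokens.shape)                                  # torch.Size([1, 196, 768])

# In the model, these tokens then pass through a learned linear projection and
# positional embeddings before entering the Transformer encoder.
```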

Vocabulary can be built in two ways:

  • Use AutoTokenizer from the Hugging Face Transformers library (a usage sketch follows this list)
  • Build it from scratch (suggested when using a small dataset)
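
With the Hugging Face route, building the vocabulary reduces to loading a pretrained tokenizer and encoding captions with it. A minimal sketch, assuming the bert-base-uncased tokenizer and a maximum caption length of 32 (both assumptions, not necessarily what this repository uses):

```python
from transformers import AutoTokenizer

# Any pretrained tokenizer name works here; "bert-base-uncased" is an assumption.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

caption = "A man sits on a bench with a newspaper"
encoded = tokenizer(caption, padding="max_length", max_length=32, truncation=True)

print(encoded["input_ids"])  # token ids, padded to length 32
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))
```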

Extract features and save as numpy arrays

  • To extract features with InceptionV3, use preprocess/grid/cnn/preprocess.py (a standalone sketch of the idea follows this list)
  • To extract bottom-up features, I provide a Colab Notebook that adapts code from the Detectron model
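
The grid-feature extraction corresponds roughly to the sketch below (not the repository's preprocess.py): take the 8x8x2048 activation map that InceptionV3 produces for a 299x299 input and save it as a numpy array. Hooking torchvision's Mixed_7c block and the file paths are assumptions.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained InceptionV3; the classifier head is ignored, we only want the feature map.
model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
model.eval()

# Capture the 8x8x2048 activation of the last mixed block with a forward hook.
features = {}
model.Mixed_7c.register_forward_hook(lambda m, inp, out: features.update(grid=out))

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(img)

# (1, 2048, 8, 8) -> 64 grid tiles of dimension 2048, saved for the captioning model.
grid = features["grid"].squeeze(0).permute(1, 2, 0).reshape(64, 2048)
np.save("example_inception_features.npy", grid.numpy())
```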

Datasets

I train both the patch-based and bottom-up models on the Flickr30k dataset, which contains 31,000 images collected from Flickr, each paired with 5 reference sentences written by human annotators. Download COCO-format Flickr30k

For the COCO captioning data format, see COCO format

Results

The results shown here are recorded on the validation split after training for 100 epochs. Captions are generated using beam search with a beam width of 3 (a generic sketch of the decoding loop follows the example captions below).

| Model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer (deit_tiny_distilled_patch16_224) | 0.61111 | 0.432 | 0.30164 | 0.21026 | 0.18603 | 0.44001 | 0.39589 | 0.1213 |
| Transformer (frcnn_bottomup_attention) | 0.61693 | 0.44336 | 0.31383 | 0.22263 | 0.2128 | 0.46285 | 0.4904 | 0.15042 |
Example captions generated with beam size = 3:

| Image | Bottom-up | Patch-based (flatten) |
| --- | --- | --- |
| [image] | A man sits on a bench with a newspaper | A man in a hat and a hat is sitting on a bench |
| [image] | A snow boarder in a red jacket is jumping in the air | A snow boarder in a yellow shirt is jumping over a snowy hill |
| [image] | A man is sitting on a chair with a basket full of bread in front of him | A woman is selling fruit at a market |
| [image] | A group of people are playing music in a dark room | A man in a black shirt is standing in front of a large crowd of people |
| [image] | A man in a red uniform is riding a white horse | A man in a red shirt and white pants is riding a white horse |
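
For reference, beam search with width 3 works roughly as sketched below. This is a generic, minimal version rather than the decoder in this repository; step_log_probs, bos_id and eos_id are hypothetical stand-ins for the trained model's next-token log-probabilities and its special-token ids.

```python
def beam_search(step_log_probs, bos_id, eos_id, beam_width=3, max_len=20):
    """Return the best token sequence found by a simple beam search."""
    # Each hypothesis is (token_list, cumulative_log_prob).
    beams = [([bos_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = step_log_probs(tokens)  # sequence indexed by token id
            # Expand the hypothesis with its `beam_width` most likely next tokens.
            top = sorted(range(len(log_probs)), key=lambda t: log_probs[t], reverse=True)
            for t in top[:beam_width]:
                candidates.append((tokens + [t], score + log_probs[t]))
        # Keep the best `beam_width` open hypotheses; ones ending in <eos> are set aside.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            elif len(beams) < beam_width:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    # Pick the hypothesis with the best length-normalised score.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```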

Usage

  • To train the patch-based / bottom-up architecture:
python train.py (--bottom-up)
  • To evaluate a trained model:
python evaluate.py --weight=<checkpoint path> (--bottom-up)
