jmrf/im2txt-demo

Show and Tell: A Neural Image Caption Generator

A TensorFlow implementation of the image-to-text model described in the paper:

"Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge."

Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.

IEEE transactions on pattern analysis and machine intelligence (2016).

Full text available at: http://arxiv.org/abs/1609.06647

Important note

The full version of this model, together with many others, is available in the TensorFlow models GitHub repository.

This is an adapted and simplified version of the project for demo purposes. It shows only the results at inference time, using the pre-trained models described later in this README.

Run

  1. Download the pre-trained checkpoints and place them in model/train

  2. Fix the checkpoints to match your TF version (recommended version: 1.2):

    python fix_ckpoints.py

    This will generate the checkpoint files needed, pointing to the TF checkpoint files.

  3. Fix the Python code if needed. See the Fixes section.

  4. Run inference on an image:

    ./inference.sh

    Modify the paths in the script as needed.

Contents

Model Overview

Introduction

The Show and Tell model is a deep neural network that learns how to describe the content of images. For example:

Example captions

Architecture

The Show and Tell model is an example of an encoder-decoder neural network. It works by first "encoding" an image into a fixed-length vector representation, and then "decoding" the representation into a natural language description.

The image encoder is a deep convolutional neural network. This type of network is widely used for image tasks and is currently state-of-the-art for object recognition and detection. Our particular choice of network is the Inception v3 image recognition model pretrained on the ILSVRC-2012-CLS image classification dataset.

The decoder is a long short-term memory (LSTM) network. This type of network is commonly used for sequence modeling tasks such as language modeling and machine translation. In the Show and Tell model, the LSTM network is trained as a language model conditioned on the image encoding.

Words in the captions are represented with an embedding model. Each word in the vocabulary is associated with a fixed-length vector representation that is learned during training.
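The embedding lookup described above can be illustrated with a minimal, pure-Python sketch. The toy vocabulary, the special tokens `<S>`, `</S>` and `<UNK>`, the vector size, and the helper names are all assumptions made for illustration. In the real model the vectors are learned during training rather than drawn at random.

```python
import random


def make_embedding_table(vocab, dim, seed=0):
    """Build a word -> vector table with small random values.

    In the real model these vectors are trainable parameters; random
    initialization here only stands in for the learned values.
    """
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.08, 0.08) for _ in range(dim)] for word in vocab}


def embed(caption, table, unk="<UNK>"):
    """Map each word of a caption to its fixed-length vector."""
    return [table.get(word, table[unk]) for word in caption.split()]


# Hypothetical toy vocabulary.
vocab = ["<S>", "</S>", "<UNK>", "a", "man", "riding", "wave"]
table = make_embedding_table(vocab, dim=4)
vectors = embed("a man riding a wave", table)
print(len(vectors), len(vectors[0]))  # 5 4
```

Every occurrence of a word maps to the same vector, and words outside the vocabulary fall back to the `<UNK>` entry in this sketch.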

The following diagram illustrates the model architecture.

Show and Tell Architecture

In this diagram, {s_0, s_1, ..., s_{N-1}} are the words of the caption and {w_e s_0, w_e s_1, ..., w_e s_{N-1}} are their corresponding word embedding vectors. The outputs {p_1, p_2, ..., p_N} of the LSTM are probability distributions generated by the model for the next word in the sentence. The terms {log p_1(s_1), log p_2(s_2), ..., log p_N(s_N)} are the log-likelihoods of the correct word at each step; the negated sum of these terms is the minimization objective of the model.
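As a rough illustration of the objective, the sketch below computes the negated sum of the per-step log-likelihoods for one caption. The probability values and the function name `caption_loss` are made up for the example; in the model the distributions come from the LSTM outputs.

```python
import math


def caption_loss(step_distributions, target_words):
    """Negated sum of log p_t(s_t) over the steps of one caption."""
    total = 0.0
    for dist, word in zip(step_distributions, target_words):
        total += math.log(dist[word])
    return -total


# Hypothetical next-word distributions for a three-step caption.
dists = [
    {"a": 0.5, "man": 0.25, "</S>": 0.25},
    {"a": 0.25, "man": 0.5, "</S>": 0.25},
    {"a": 0.25, "man": 0.25, "</S>": 0.5},
]
loss = caption_loss(dists, ["a", "man", "</S>"])
print(round(loss, 3))  # 2.079, i.e. 3 * ln(2)
```

The loss shrinks toward zero as the model assigns probabilities closer to 1 to the correct words.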

During the first phase of training the parameters of the Inception v3 model are kept fixed: it is simply a static image encoder function. A single trainable layer is added on top of the Inception v3 model to transform the image embedding into the word embedding vector space. The model is trained with respect to the parameters of the word embeddings, the parameters of the layer on top of Inception v3 and the parameters of the LSTM. In the second phase of training, all parameters - including the parameters of Inception v3 - are trained to jointly fine-tune the image encoder and the LSTM.

Given a trained model and an image we use beam search to generate captions for that image. Captions are generated word-by-word, where at each step t we use the set of sentences already generated with length t - 1 to generate a new set of sentences with length t. We keep only the top k candidates at each step, where the hyperparameter k is called the beam size. We have found the best performance with k = 3.
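The beam search procedure can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used in this repository: the bigram table `BIGRAMS`, the `toy_model` function and the `<S>` / `</S>` tokens are assumptions standing in for the trained LSTM and its vocabulary.

```python
import math


def beam_search(next_word_probs, beam_size=3, max_len=10, start="<S>", end="</S>"):
    """Return up to beam_size (log probability, words) captions, best first."""
    beams = [(0.0, [start])]  # partial captions still being extended
    complete = []             # captions that have produced the end token
    for _ in range(max_len):
        candidates = []
        for logprob, words in beams:
            for word, p in next_word_probs(words).items():
                if p > 0.0:
                    candidates.append((logprob + math.log(p), words + [word]))
        # Keep only the top k candidates at this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logprob, words in candidates[:beam_size]:
            if words[-1] == end:
                complete.append((logprob, words))
            else:
                beams.append((logprob, words))
        if not beams:
            break
    if not complete:
        complete = beams  # fall back to partial captions if none finished
    complete.sort(key=lambda c: c[0], reverse=True)
    return complete[:beam_size]


# Hypothetical toy "model": the next word depends only on the previous word.
BIGRAMS = {
    "<S>": {"a": 1.0},
    "a": {"man": 0.6, "person": 0.4},
    "man": {"surfing": 0.7, "</S>": 0.3},
    "person": {"surfing": 1.0},
    "surfing": {"</S>": 1.0},
}


def toy_model(words):
    return BIGRAMS[words[-1]]


results = beam_search(toy_model, beam_size=3)
for i, (logprob, words) in enumerate(results):
    print("  %d) %s (p=%f)" % (i, " ".join(words[1:-1]), math.exp(logprob)))
```

With this toy table the three captions come out in the order "a man surfing", "a person surfing", "a man", with probabilities 0.42, 0.40 and 0.18.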

Getting Started

Install Required Packages

First ensure that you have installed the following required packages:

Generating Captions

Your trained Show and Tell model can generate captions for any JPEG image! The following command line will generate captions for an image from the test set.

NOTE: A version of this script, already configured for this project, is available as `inference.sh` in the root directory.

    # Path to checkpoint file or a directory containing checkpoint files. Passing
    # a directory will only work if there is also a file named 'checkpoint' which
    # lists the available checkpoints in the directory. It will not work if you
    # point to a directory with just a copy of a model checkpoint: in that case,
    # you will need to pass the checkpoint path explicitly.
    CHECKPOINT_PATH="${HOME}/im2txt/model/train"

    # Vocabulary file generated by the preprocessing script.
    VOCAB_FILE="${HOME}/im2txt/data/mscoco/word_counts.txt"

    # JPEG image file to caption.
    IMAGE_FILE="${HOME}/im2txt/data/mscoco/raw-data/val2014/COCO_val2014_000000224477.jpg"

    # Build the inference binary.
    cd tensorflow-models/im2txt
    bazel build -c opt //im2txt:run_inference

    # Ignore GPU devices (only necessary if your GPU is currently memory
    # constrained, for example, by running the training script).
    export CUDA_VISIBLE_DEVICES=""

    # Run inference to generate captions.
    bazel-bin/im2txt/run_inference \
      --checkpoint_path="${CHECKPOINT_PATH}" \
      --vocab_file="${VOCAB_FILE}" \
      --input_files="${IMAGE_FILE}"
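The rule in the comments about passing a directory versus an explicit checkpoint path can be illustrated with a small pure-Python sketch. It assumes the `checkpoint` index file is a text file containing a line of the form `model_checkpoint_path: "..."`; the function name `resolve_checkpoint` and the file name `model.ckpt-2000000` are hypothetical. In TensorFlow 1.x, `tf.train.latest_checkpoint` performs this kind of lookup.

```python
import os
import re
import tempfile


def resolve_checkpoint(path):
    """Return an explicit checkpoint path for a file prefix or a directory."""
    if not os.path.isdir(path):
        return path  # already an explicit checkpoint path
    index_file = os.path.join(path, "checkpoint")
    if not os.path.isfile(index_file):
        raise ValueError("No 'checkpoint' file in directory: %s" % path)
    with open(index_file) as f:
        match = re.search(r'^model_checkpoint_path:\s*"(.+)"', f.read(), re.MULTILINE)
    if match is None:
        raise ValueError("No model_checkpoint_path entry in: %s" % index_file)
    latest = match.group(1)
    return latest if os.path.isabs(latest) else os.path.join(path, latest)


# Demo with a temporary directory standing in for model/train.
with tempfile.TemporaryDirectory() as model_dir:
    with open(os.path.join(model_dir, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "model.ckpt-2000000"\n')
    resolved = resolve_checkpoint(model_dir)
    print(os.path.basename(resolved))  # model.ckpt-2000000
```

A directory without the `checkpoint` index file raises an error in this sketch, which mirrors the case where the checkpoint path has to be passed explicitly.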

Example output:

Captions for image COCO_val2014_000000224477.jpg:
  0) a man riding a wave on top of a surfboard . (p=0.040413)
  1) a person riding a surf board on a wave (p=0.017452)
  2) a man riding a wave on a surfboard in the ocean . (p=0.005743)

Note: you may get different results. Some variation between different models is expected.

Here is the image:

Surfer

Fixes:

Depending on the TensorFlow version, the tensor key names in the checkpoint might need to be renamed. This can be done with `fix_ckpoints.py`.
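The idea of renaming tensor keys can be sketched as follows. This is not the content of `fix_ckpoints.py`: a plain dictionary stands in for the checkpoint, and the mapping in `RENAMES` is an assumption based on the commonly reported LSTM variable rename between TF 1.0 and 1.2. Check the actual script for the keys it handles.

```python
# Hypothetical old-name -> new-name mapping (an assumption, see above).
RENAMES = {
    "lstm/basic_lstm_cell/weights": "lstm/basic_lstm_cell/kernel",
    "lstm/basic_lstm_cell/biases": "lstm/basic_lstm_cell/bias",
}


def rename_keys(variables, renames=RENAMES):
    """Return a copy of the variable dict with old key names replaced."""
    return {renames.get(name, name): value for name, value in variables.items()}


# A plain dict stands in for the tensors stored in a checkpoint.
old_checkpoint = {
    "lstm/basic_lstm_cell/weights": "W",
    "lstm/basic_lstm_cell/biases": "b",
    "seq_embedding/map": "E",
}
new_checkpoint = rename_keys(old_checkpoint)
print(sorted(new_checkpoint))
```

Keys that are not listed in the mapping are carried over unchanged.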

More information on transitioning from TF 1.0 to 1.2 can be found here:

In addition, when running on Python 3.x, `tf.gfile.GFile` needs to read files in `rb` mode (see `im2txt/run_inference.py`, line 74).
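The reason for the binary mode can be shown with a short sketch that uses Python's built-in `open` as a stand-in for `tf.gfile.GFile`, on the assumption that the two behave analogously here. JPEG data is not valid UTF-8 text, so reading it in text mode under Python 3 fails while binary mode returns the raw bytes. The four-byte header below is only a stand-in for a real image file.

```python
import os
import tempfile

# The first bytes of a typical JPEG file; 0xFF is never valid in UTF-8.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])

with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
    f.write(jpeg_header)
    path = f.name

try:
    # Binary mode returns the bytes unchanged.
    with open(path, "rb") as f:
        data = f.read()

    # Text mode tries to decode the bytes and fails.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True
finally:
    os.remove(path)

print(data == jpeg_header, text_mode_failed)  # True True
```

In the TF 1.x code the corresponding change would be opening the image with `tf.gfile.GFile(filename, "rb")` instead of `"r"`.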

About

TF im2txt model for demoing inference on NLP + CV task
