Image-captioning-with-visual-attention

To build networks capable of perceiving contextual subtleties in images, relating observations to both the scene and the real world, and producing succinct and accurate image descriptions: tasks that we as people can do almost effortlessly.

Deep learning is a rapidly growing field, with new applications appearing every day. In this case study, I have built an image captioning model. Image captioning refers to the process of generating a textual description of an image, based on the objects and actions in the image. For example: image

Problem Statement

Image captioning is an interesting problem where we can apply both computer vision and natural language processing techniques. In this case study, I have followed Show, Attend and Tell: Neural Image Caption Generation with Visual Attention and created an image caption generation model using the Flickr 8K dataset. The model takes a single image as input, outputs a caption for that image, and reads the predicted caption aloud.
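
The read-aloud step is handled by gTTS. The sketch below is a minimal illustration of that final step only; `predicted_caption` is a hypothetical placeholder for the string produced by the captioning model.

```python
# Minimal sketch of the read-aloud step, assuming gTTS is installed.
# `predicted_caption` stands in for the string produced by the captioning model.
from gtts import gTTS

predicted_caption = "a dog is running through the grass"  # placeholder model output
tts = gTTS(text=predicted_caption, lang="en")
tts.save("caption.mp3")  # the saved audio file can then be played back
```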

Dependencies

  • Python 3
  • Tensorflow 2.0
  • gtts

Business Objectives and Constraints

  • Predict a correct caption as per the input image.
  • An incorrect caption could leave a negative impression on the user.
  • No strict latency constraints.

Data Overview

Flickr8k contains 8,000 images that are each paired with five different captions, which provide clear descriptions of the salient entities and events. The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.
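
A minimal sketch of grouping the captions by image, assuming the standard Flickr8k token-file layout where each line is `<image_name>#<caption_index>` followed by a tab and the caption text (the caption file shipped in this repo may use a slightly different name or layout).

```python
# Sketch of parsing Flickr8k captions, assuming the standard layout:
# "<image_name>#<caption_index>\t<caption text>" on each line.
from collections import defaultdict

def load_captions(path):
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if "\t" not in line:
                continue  # skip blank or malformed lines
            image_id, caption = line.split("\t", 1)
            image_name = image_id.split("#")[0]
            # Wrap with start/end tokens so the decoder knows caption boundaries.
            captions[image_name].append("<start> " + caption.lower() + " <end>")
    return captions

# Example usage (path is hypothetical):
# captions = load_captions("Flickr8k_text/Flickr8k.token.txt")
# print(len(captions))  # ~8,000 images, 5 captions each
```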

Sources:

Flickr8k_Dataset: Contains all the images

Flickr8k_text: Contains all the captions

Mapping the real-world problem to a Deep Learning Problem

To accomplish this, we'll use an attention-based model, which lets us see which parts of the image the model focuses on as it generates each word of the caption. We follow “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” by Xu et al. (2015), the first paper, to our knowledge, to introduce the concept of attention into image captioning. The work takes inspiration from attention’s application in other sequence and image recognition problems. image
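
The attention mechanism from the paper can be sketched in TensorFlow 2.0 as an additive (Bahdanau-style) attention layer. This is a simplified sketch, not the exact layer used in the notebooks; shapes and the number of units are illustrative.

```python
# Simplified additive (Bahdanau-style) attention layer in TensorFlow 2.0,
# in the spirit of "Show, Attend and Tell"; dimensions are illustrative.
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # produces an unnormalized score

    def call(self, features, hidden):
        # features: (batch, num_regions, embedding_dim) CNN feature map
        # hidden:   (batch, hidden_size) decoder state at the current step
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)  # where the model "looks"
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights
```

The attention weights returned by the layer are what allow us to visualize which image regions contributed to each generated word.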

Key Performance Indicator (KPI)

As per the BLEU paper:

The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is.

BLEU is a well-acknowledged metric to measure the similarity of one hypothesis sentence to multiple reference sentences. Given a single hypothesis sentence and multiple reference sentences, it returns a value between 0 and 1.

A value close to 1 means that the two are very similar. The metric was introduced in 2002 in BLEU: a Method for Automatic Evaluation of Machine Translation. Although the metric has shortcomings, for example grammatical correctness is not taken into account, BLEU is widely accepted, partly because it is easy to calculate.

  • The higher the score, the better the quality of the caption (see the sketch below).
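
A small sketch of scoring a predicted caption against its reference captions using NLTK's sentence_bleu. The choice of NLTK is an assumption for illustration; the captions shown are placeholder examples, not dataset entries.

```python
# Sketch of computing BLEU for one predicted caption with NLTK
# (tooling choice is an assumption; any BLEU implementation works).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a child in a pink dress is climbing up a set of stairs".split(),
    "a little girl climbing the stairs to her playhouse".split(),
]
candidate = "a girl is climbing up the stairs".split()

smooth = SmoothingFunction().method1  # avoids zero scores for short captions
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the references
```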

References:

  • Xu et al. (2015), Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
  • Papineni et al. (2002), BLEU: a Method for Automatic Evaluation of Machine Translation
