Describer

Describer is an image captioning system that generates textual captions describing the images fed to it. The system is trained on the Flickr8k dataset.

It uses the merge architecture: the pre-trained InceptionV3 convolutional neural network produces image embeddings, while an Embedding layer initialized with GloVe word vectors (the 200-dimensional vectors from the 6B-token GloVe release) produces embeddings for the caption words. The image embeddings pass through a dense layer that compresses them to 256 dimensions, and the word embeddings pass through an LSTM recurrent neural network that also outputs a 256-dimensional representation. These two representations are then added together and fed to a feed-forward network, which predicts the next word of the caption.
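
Below is a minimal Keras sketch of this merge architecture. The 2048-dimensional InceptionV3 feature size, the vocabulary size, the maximum caption length, and the layer names are illustrative assumptions, not the exact values used in training.ipynb.

```python
# Sketch of the merge architecture (assumed shapes; the real values come
# from the Flickr8k vocabulary and caption statistics).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000      # assumed vocabulary size
max_length = 34        # assumed maximum caption length (in tokens)
embedding_dim = 200    # GloVe 200-d vectors, as described above

# Image branch: 2048-d InceptionV3 feature vector -> 256-d representation.
image_input = Input(shape=(2048,))
image_branch = Dense(256, activation='relu')(Dropout(0.5)(image_input))

# Text branch: Embedding layer (weights initialized from GloVe in the project) -> LSTM.
caption_input = Input(shape=(max_length,))
caption_branch = Embedding(vocab_size, embedding_dim, mask_zero=True)(caption_input)
caption_branch = LSTM(256)(Dropout(0.5)(caption_branch))

# Merge: add the two 256-d representations and predict the next word.
merged = Dense(256, activation='relu')(add([image_branch, caption_branch]))
output = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```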

Got the following BLEU scores during model evaluation:

|               | BLEU-1 score | BLEU-2 score | BLEU-3 score | BLEU-4 score |
|---------------|--------------|--------------|--------------|--------------|
| Greedy Search | 0.79         | 0.66         | 0.58         | 0.39         |
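
For reference, corpus-level BLEU-1 to BLEU-4 scores like these can be computed with NLTK's corpus_bleu. The tokenised captions below are made-up examples; the actual evaluation code lives in evaluating.ipynb.

```python
# Sketch of corpus-level BLEU-1..4 scoring with NLTK (illustrative data only).
from nltk.translate.bleu_score import corpus_bleu

# Each image has several reference captions and one generated (greedy) caption.
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass'],
               ['a', 'brown', 'dog', 'is', 'running', 'outside']]]
candidates = [['a', 'dog', 'is', 'running', 'on', 'grass']]

print('BLEU-1: %.2f' % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %.2f' % corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3: %.2f' % corpus_bleu(references, candidates, weights=(1/3, 1/3, 1/3, 0)))
print('BLEU-4: %.2f' % corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))
```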

The objective of this project is to take the first step towards developing a solution to help visually impaired people understand visual information around them.


Get your image captioned!! (aka Inference)

Check out the easy-to-use inference.ipynb script (Open In Colab). Upload your image, add its path in the notebook, and get your captions!
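
Under the hood, the caption is built one word at a time by greedy decoding, roughly as in the sketch below. The `model`, `tokenizer`, `photo` feature vector and `max_length` are assumed to come from the notebook, and the `startseq`/`endseq` tokens are the usual convention rather than values confirmed from the code.

```python
# Sketch of greedy caption decoding (assumes a trained `model`, a fitted Keras
# `tokenizer`, a (1, 2048) InceptionV3 feature vector `photo`, and `max_length`).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, tokenizer, photo, max_length):
    caption = 'startseq'                        # assumed start-of-caption token
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':    # assumed end-of-caption token
            break
        caption += ' ' + word
    return caption.replace('startseq', '').strip()
```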

Re-training - on your own dataset, your own hyperparameters, or both

Check out the easy-to-train training.ipynb script (Open In Colab). Run each code block of this script to train with the default hyperparameter settings on the Flickr8K dataset, or choose your own dataset and hyperparameter values within the script.


To re-train on the Flickr8K dataset with your own set of hyperparameter values:

  • Create a directory in your Google Drive with the name Describer (skip this step if you have already cloned this Github repo in your Google Drive during the inference/default training/evaluating process).
  • Request the Flickr8k dataset from this link https://illinois.edu/fb/sec/1713398. Download and place it inside the Describer/dataset folder in your Google Drive.
  • Now rename a few files in the ./dataset folder to the following filenames (see the sketch after this list):
    • Flickr8k.token.txt to captions.txt.
    • Flickr_8k.trainImages.txt to TrainImagesName.txt.
    • Flickr_8k.devImages.txt to DevImagesName.txt.
    • Flickr_8k.testImages.txt to TestImagesName.txt.
    • The Flickr8k_Dataset folder to All_images.
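
The renames can be done in one go with a small Python sketch like the one below, run from inside the Describer/dataset folder in Google Drive (the paths are assumptions based on the list above).

```python
# Sketch: rename the downloaded Flickr8k files/folders as listed above.
from pathlib import Path

dataset_dir = Path('.')  # assumes the current directory is Describer/dataset
renames = {
    'Flickr8k.token.txt': 'captions.txt',
    'Flickr_8k.trainImages.txt': 'TrainImagesName.txt',
    'Flickr_8k.devImages.txt': 'DevImagesName.txt',
    'Flickr_8k.testImages.txt': 'TestImagesName.txt',
    'Flickr8k_Dataset': 'All_images',
}
for old, new in renames.items():
    src = dataset_dir / old
    if src.exists():
        src.rename(dataset_dir / new)
```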

A quick hack: since the Flickr8K dataset can't be redistributed, everyone has to request access to it individually. In my case, once I had access and had downloaded it to my Google Drive, I simply copy the GDrive link of the ./dataset folder from my main account whenever I need the dataset elsewhere. In the other GDrive account, I open the copied link and create a shortcut to the ./dataset folder inside the directory of my cloned GitHub repo. This way I don't need to re-upload the dataset every time I work from another GDrive account.

Evaluating - default trained model or your own trained model

Check out the easy-to-evaluate evaluating.ipynb script (Open In Colab). You can use it to evaluate either the default trained model or a model re-trained by you. In both cases, the ./dataset folder must be present in your working directory (if evaluating on Flickr8K, follow the steps above).

End-note

Thank you for patiently reading till here. I am pretty sure that, just like me, you have learnt something new about combining the capabilities of CNN and RNN models to build a real-world application that helps visually impaired people understand visual data around them. Using these concepts, I will keep pushing to scale this project further and improve its captioning capabilities, and I encourage you to do the same!!

Contributing

You are welcome to contribute to the repository with your PRs. In case of queries or feedback, please write to me at 13.malayjoshi13@gmail.com or https://www.linkedin.com/in/malayjoshi13/.

License

License: MIT
