P2_Image_Captioning

Project instructions are found in the Udacity README.

Method:

The Dataset notebook initializes the COCO API (the "pycocotools" library), which is used to access data from the MS COCO (Common Objects in Context) dataset, "commonly used to train and benchmark object detection, segmentation, and captioning algorithms."
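As a minimal sketch, that initialization looks like the following (the annotation path is an assumption about where the MS COCO annotations were downloaded):

```python
from pycocotools.coco import COCO

# Instantiate the COCO API on the caption annotations for the training split
# (the path is an assumption, not the notebook's exact value).
coco_caps = COCO('annotations/captions_train2014.json')

# Each image carries several human-written captions; print those for one image.
img_id = coco_caps.getImgIds()[0]
ann_ids = coco_caps.getAnnIds(imgIds=img_id)
for ann in coco_caps.loadAnns(ann_ids):
    print(ann['caption'])
```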

The notebook also depicts the processing pipeline in the following diagram:

[diagram: the EncoderCNN produces a feature vector that feeds the unrolled DecoderRNN]

The left half of the diagram depicts the "EncoderCNN", which encodes the critical information in a regular image file into a "feature vector" of a fixed size. That feature vector is fed into the "DecoderRNN" on the right half of the diagram, which is "unfolded" in time: each box labeled "LSTM" represents the same cell at a different time step. Each word appearing as output at the top is fed back into the network as input (at the bottom) at the next time step, until the entire caption is generated. The arrow pointing right that connects the LSTM boxes carries the hidden state, the network's "memory", which is also fed back to the LSTM at each time step.
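The following sketch shows that decode loop in PyTorch. It is illustrative only, not the project's graded code; the layer sizes and the 20-word cap are assumptions:

```python
import torch

embed_size, hidden_size, vocab_size = 256, 512, 9955  # illustrative sizes

lstm = torch.nn.LSTM(embed_size, hidden_size, batch_first=True)
fc = torch.nn.Linear(hidden_size, vocab_size)       # hidden state -> word scores
embed = torch.nn.Embedding(vocab_size, embed_size)  # word id -> input vector

inputs = torch.randn(1, 1, embed_size)  # stand-in for the CNN feature vector
states = None                           # the "memory" carried between time steps
caption = []
for _ in range(20):                     # a real loop would stop at the <end> token
    out, states = lstm(inputs, states)      # one time step of the unrolled LSTM
    word_id = fc(out.squeeze(1)).argmax(1)  # most likely next word
    caption.append(word_id.item())
    inputs = embed(word_id).unsqueeze(1)    # feed the word back in as the next input
```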

The Preliminaries notebook uses pycocotools, torchvision transforms, and NLTK to preprocess the images and captions for network training. It also explores the EncoderCNN, a pretrained ResNet50 taken from torchvision.models.
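A sketch of both preprocessing steps, using the standard ImageNet normalization statistics (the notebook's exact transform may differ):

```python
import nltk
from torchvision import transforms

# Image side: crop to the 224x224 input ResNet50 expects, convert to a
# tensor, and normalize with the ImageNet channel statistics.
transform_train = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),
                         (0.229, 0.224, 0.225)),
])

# Caption side: lowercase and tokenize, so each word can later be mapped
# to an integer id in the vocabulary.
nltk.download('punkt', quiet=True)
tokens = nltk.tokenize.word_tokenize('A cow stands in a field.'.lower())
# -> ['a', 'cow', 'stands', 'in', 'a', 'field', '.']
```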

The implementations of the EncoderCNN, which is supplied, and the DecoderRNN, which is left to the student, are found in the model.py file.
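One common way to complete the decoder, shown here as a sketch rather than the graded solution, is to embed the caption tokens, prepend the image features as the first LSTM input, and project each hidden state onto the vocabulary:

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the <end> token, prepend the image features as time step 0,
        # and score every next word in one pass (teacher forcing).
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)
```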

The Training notebook covers the selection of hyperparameter values and the training of the DecoderRNN. The hyperparameter selections are explained there.
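A sketch of that setup and one training step, with placeholder hyperparameter values and assuming a data loader that yields batched images and tokenized captions (encoder.embed naming the encoder's trainable final layer is also an assumption about model.py):

```python
import torch
import torch.nn as nn
from model import EncoderCNN, DecoderRNN

# Placeholder hyperparameter values; the notebook explains its actual choices.
embed_size, hidden_size, vocab_size = 256, 512, 9955

encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()

# Only the decoder and the encoder's new final layer are trained;
# the pretrained ResNet weights stay frozen.
params = list(decoder.parameters()) + list(encoder.embed.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)

# One training step on a batch (images, captions) from the COCO data loader:
features = encoder(images)              # (batch, embed_size)
outputs = decoder(features, captions)   # (batch, seq_len, vocab_size)
loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```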

The Inference notebook tests the trained networks by generating captions for additional images. No rigorous validation or accuracy measurement was performed; only sample captions were generated (see the Results below).
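A sketch of captioning a single image at inference time, assuming a decoder.sample method that implements the greedy decode loop shown earlier, a transform_test matching the training transform, and a vocab object with an idx2word lookup (all of these names are assumptions about the notebook's code):

```python
import torch
from PIL import Image

# Preprocess one image and add a batch dimension.
image = transform_test(Image.open('images/cows.png').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    features = encoder(image).unsqueeze(1)  # (1, 1, embed_size)
    word_ids = decoder.sample(features)     # list of predicted word ids

# Map ids back to words and drop the special tokens.
words = [vocab.idx2word[i] for i in word_ids]
print(' '.join(w for w in words if w not in ('<start>', '<end>')))
```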

Results:

Four sample images were captioned: two for which the caption matches the image well,

images/cows.png images/tennis.png

...and two for which the caption doesn't match quite so well:

images/kids.png images/bathroom.png

Possible next steps include completing the optional validation task at the end of the Training notebook and training the networks further.