
Image-Captioning:

Generate a caption that describes the content of an image.

Network:

  • Encoder: A pretrained ResNet-50 is used to extract the image features. The final fully connected layer of the ResNet is replaced with a trainable linear layer.

    | # | Layer     | Input/Output                        | Pretrained |
    |---|-----------|-------------------------------------|------------|
    | 1 | ResNet-50 | (3,224,224) / (1,2048) (flattened)  | yes        |
    | 2 | Linear    | (1,2048) / (1,256)                  | no         |
  • Decoder: An embedding layer, an LSTM, and a linear layer are responsible for generating a sequence of words that describes the image; a code sketch of the full model follows the tables below.

    | # | Layer     | Input/Output        | Pretrained |
    |---|-----------|---------------------|------------|
    | 1 | Embedding | (1) / (1,256)       | no         |
    | 2 | LSTM      | (1,256) / (1,512)   | no         |
    | 3 | Linear    | (1,512) / (1,9955)  | no         |
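
A minimal PyTorch sketch of this architecture is shown below. The class names, the frozen backbone, and the teacher-forcing forward pass are illustrative assumptions based on the tables above, not the exact project code:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        for param in resnet.parameters():
            param.requires_grad_(False)        # keep the pretrained backbone frozen
        modules = list(resnet.children())[:-1]  # drop the final FC layer
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)  # 2048 -> 256

    def forward(self, images):
        features = self.resnet(images)                   # (B, 2048, 1, 1)
        features = features.view(features.size(0), -1)   # flatten to (B, 2048)
        return self.embed(features)                      # (B, 256)

class DecoderRNN(nn.Module):
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=9955):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: embed the ground-truth tokens (minus the last one)
        # and prepend the image features as the first input step.
        embeds = self.embedding(captions[:, :-1])           # (B, T-1, 256)
        inputs = torch.cat([features.unsqueeze(1), embeds], dim=1)
        out, _ = self.lstm(inputs)                           # (B, T, 512)
        return self.linear(out)                              # (B, T, vocab)
```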

Preprocessing:

  • Image Preprocessing:
    • The images undergo transformations such as resizing and normalization to match the input specification of the pretrained ResNet.
    • In the training phase, random flipping and cropping are added to the transformations to reduce overfitting.
  • Caption Preprocessing:
    • Captions need to be tokenized; therefore, a vocabulary dictionary is created from the captions associated with the training images.
    • Each word is represented by a unique token (see the sketch after this list).
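
A minimal sketch of both preprocessing steps, assuming torchvision transforms with the usual ImageNet statistics and NLTK tokenization; the exact values and the helper name `build_vocab` are assumptions, not taken from the project code:

```python
import torchvision.transforms as transforms
from collections import Counter
import nltk

# Image preprocessing: resize/crop/flip for training, plus the
# ImageNet normalization expected by the pretrained ResNet.
train_transform = transforms.Compose([
    transforms.Resize(256),                       # resize the shorter side
    transforms.RandomCrop(224),                   # augmentation: random crop
    transforms.RandomHorizontalFlip(),            # augmentation: random flip
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),   # ImageNet channel means
                         (0.229, 0.224, 0.225)),  # ImageNet channel stds
])

# Caption preprocessing: build a word -> token-id dictionary from the
# training captions, keeping words that appear at least `threshold` times
# (threshold=4 matches the hyperparameter table below).
def build_vocab(captions, threshold=4):
    counter = Counter()
    for caption in captions:
        counter.update(nltk.tokenize.word_tokenize(caption.lower()))
    word2idx = {'<start>': 0, '<end>': 1, '<unk>': 2}
    for word, count in counter.items():
        if count >= threshold:
            word2idx[word] = len(word2idx)
    return word2idx
```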

Training:

  • The model receives an image and generates a score for the next token at each step of the sequence.
  • Each generated word distribution is compared to the corresponding ground-truth caption word to compute the loss.
  • At every step, the LSTM layer is fed the ground-truth word (teacher forcing) until the caption is finished; a sketch of one training iteration follows.
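
A minimal sketch of one training iteration under this scheme, reusing the `EncoderCNN`/`DecoderRNN` classes sketched above; the optimizer choice, learning rate, and `data_loader` are assumptions:

```python
import torch.nn as nn
import torch.optim as optim

encoder, decoder = EncoderCNN(), DecoderRNN()
criterion = nn.CrossEntropyLoss()
# Only the new linear projection and the decoder are trained;
# the ResNet backbone stays frozen.
params = list(decoder.parameters()) + list(encoder.embed.parameters())
optimizer = optim.Adam(params, lr=1e-3)   # learning rate is an assumption

for images, captions in data_loader:       # captions: (B, T) token ids
    features = encoder(images)             # (B, 256) image features
    outputs = decoder(features, captions)  # (B, T, vocab) word scores
    loss = criterion(outputs.reshape(-1, outputs.size(-1)),
                     captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```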

Inference:

  • The model predicts the next token in the sequence based on the previous tokens and the image features.
  • Each predicted word is fed back into the LSTM layer until the end of the caption.
  • At each step, the model outputs the probability of every word in the dictionary being the next word in the sequence. Always taking the word with the maximum probability (greedy search) may not lead to the best possible caption.
  • Two decoding methods were adopted in this project (a greedy-decoding sketch follows this list):
    1. Greedy Search.
      • The word with the maximum probability is selected at each step.
    2. Beam Search.
      • At each step, a window of size b is used to keep the b best partial sequences and extend them.
      • The best possible sequence is the one that maximizes the sum of the logarithmic probabilities of its words.
      • Currently, the beam search implementation runs on the CPU, not the GPU.
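
A minimal sketch of the greedy decoding loop, reusing the `DecoderRNN` sketched above; `max_len` and the `<end>` token id are assumptions:

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, features, max_len=20, end_idx=1):
    """Build a caption by always picking the most probable next word."""
    inputs = features.unsqueeze(1)   # (1, 1, embed_size) image features
    states = None
    tokens = []
    for _ in range(max_len):
        out, states = decoder.lstm(inputs, states)  # (1, 1, hidden_size)
        scores = decoder.linear(out.squeeze(1))     # (1, vocab_size)
        predicted = scores.argmax(dim=1)            # greedy choice
        tokens.append(predicted.item())
        if predicted.item() == end_idx:             # stop at <end>
            break
        inputs = decoder.embedding(predicted).unsqueeze(1)
    return tokens
```

Beam search replaces the single argmax with the top-b candidates at each step and ranks partial captions by the sum of their log-probabilities.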

Hyperparameters:

| # | Parameter        | Value |
|---|------------------|-------|
| 1 | Batch size       | 128   |
| 2 | Vocab threshold  | 4     |
| 3 | Embedding size   | 256   |
| 4 | LSTM hidden size | 512   |
| 5 | Number of epochs | 20    |

Results:

(Sample captioned images; see the repository for the output figures.)

Note:

This project is one of three projects completed in the Udacity Computer Vision Nanodegree program.