## LSTM Captioning

This is a very basic model: 

*  Take the featurized images (2048d), and tokenised captions
*  Add a (trainable) features -> 50d dense layer
*  Use a 50d GloVe embedding (for the LSTM inputs, non-trainable)
   *  #stop-words ~ 150 (say)
*  50d of hidden units for the LSTM
*  But have a 'pluggable' output transform, which concatenates :
   *   A 256-way one-hot (including '0'=mask, '1'={UNK}, '2'={START}, '3'={STOP}, '4'={UseOther}), with one of :
   *   (a) UseOther + (8192-250 more one-hot entries)
   *   (b) UseOther + (50d of the same GloVe embedding, for nearest-neighbour lookup)
   *   (c) UseOther + (log2(8192)==13 bits for the index of the word, plus error correction)
*  Want to monitor some kind of score over time for test cases   
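
As an illustration of option (c), here is a minimal sketch of how a word index below 8192 could be written as 13 bits and read back from a (possibly noisy) prediction. The helper names `index_to_bits` and `bits_to_index` are hypothetical, the bit order is an arbitrary choice, and the error-correction part is left out entirely.

```python
import numpy as np

VOCAB_BITS = 13  # log2(8192) == 13

def index_to_bits(idx, n_bits=VOCAB_BITS):
    """Binary expansion of idx as a float32 vector, least-significant bit first."""
    if not 0 <= idx < (1 << n_bits):
        raise ValueError("index does not fit in %d bits" % n_bits)
    shifts = np.arange(n_bits, dtype=np.int64)
    return ((np.int64(idx) >> shifts) & 1).astype('float32')

def bits_to_index(bits):
    """Threshold each (possibly soft) bit at 0.5 and rebuild the integer index."""
    hard = (np.asarray(bits) > 0.5).astype(np.int64)
    shifts = np.arange(len(hard), dtype=np.int64)
    return int((hard << shifts).sum())

# Round trip for an arbitrary index (no error correction applied here)
assert bits_to_index(index_to_bits(1234)) == 1234
```

Thresholding at 0.5 assumes the network emits one sigmoid-like value per bit; any error-correcting code would be layered on top of this.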

In [None]:
import os

import numpy as np

import random
import pickle

TRAIN_PCT=0.9

In [None]:
# Load in the captions/corpus/embedding
with open('./data/cache/CAPTIONS_data_Flickr30k_2017-06-07_23-15.pkl', 'rb') as f:
    text_data=pickle.load(f, encoding='iso-8859-1')

"""
text_data ~ dict(
    img_to_captions = img_to_valid_captions,
    
    action_words = action_words, 
    stop_words = stop_words_sorted,
    
    embedding = embedding,
    embedding_word_arr = embedding_word_arr,
    
    img_arr = img_arr_save,
    train_test = np.random.random( (len(img_arr_save),) ),
)"""

dictionary = { w:i for i,w in enumerate(text_data['embedding_word_arr']) }

img_arr_train = [ img for i, img in enumerate(text_data['img_arr']) if text_data['train_test'][i]<TRAIN_PCT ]

print("Loaded captions, corpus and embedding")

In [None]:
# Load in the features
with open('./data/cache/FEATURES_data_Flickr30k_flickr30k-images_2017-06-06_18-07.pkl', 'rb') as f:
    image_data=pickle.load(f, encoding='iso-8859-1')

"""
image_data ~ dict(
    features = features,
    img_arr = img_arr,
)
"""
image_feature_idx = { img:idx for idx, img in enumerate(image_data['img_arr']) }

print("Loaded image features for all images")

In [None]:
CAPTION_LEN = 32

In [None]:
def caption_to_idx_arr(caption):
    # Returns CAPTION_LEN+1 entries (1 longer than max) - need to shift about a bit later
    ret = np.zeros( (CAPTION_LEN+1,), dtype='int32')  # {MASK}.idx===0
    ret[0] = dictionary['{START}']
    words = caption.lower().split()[:CAPTION_LEN-1]  # Truncate : leave room for {START} and {STOP}
    for i, w in enumerate( words ):
        ret[i+1] = dictionary.get(w, dictionary['{UNK}'])
    ret[len(words)+1] = dictionary['{STOP}']  # Also valid for an empty caption
    return ret

In [None]:
#for j in range(0,10):
#    print(j)
#print(j)    

In [None]:
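def caption_training_example():
    # Endless generator of ( image feature index, caption index array ) pairs
    img_arr = list(img_arr_train)  # Copy, so that the shuffle leaves img_arr_train untouched
    while True:
        random.shuffle( img_arr )
        for img in img_arr:
            captions = text_data['img_to_captions'][img]
            caption = random.choice(captions)  # One randomly chosen caption per image per pass
            print(caption)  # Debug output
            yield image_feature_idx[ img ], caption_to_idx_arr( caption )
        print("Captions : Looping")
caption_training_example_gen = caption_training_example()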

In [None]:
next(caption_training_example_gen)
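
The index array is one entry longer than `CAPTION_LEN` so that it can be shifted later. One common way to use that extra entry (an assumption here, not something the notebook has defined yet) is to feed the LSTM every token except the last and train it to predict every token except the first. The helper `split_input_target` and the token ids below are hypothetical.

```python
import numpy as np

CAPTION_LEN = 32  # same value as in the notebook

def split_input_target(idx_arr):
    """Split a (CAPTION_LEN+1,) index array into LSTM inputs and next-word targets."""
    idx_arr = np.asarray(idx_arr)
    return idx_arr[:-1], idx_arr[1:]

# Hypothetical ids : 2={START}, 3={STOP}, 0=mask, 10..12 = ordinary words
example = np.zeros((CAPTION_LEN + 1,), dtype='int32')
example[:5] = [2, 10, 11, 12, 3]

x, y = split_input_target(example)
print(x[:5], y[:5])
```

With this split, both `x` and `y` have length `CAPTION_LEN`, and position `t` of `y` is the word that follows position `t` of `x`.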