# LSTM with Keras (Predict next word given x previous words)

##### keras version 2.1.2
##### tensorflow version 1.3.0

## RNN:

1. The input sequence has a finite length, so we unroll the network for that many time steps

<img src="img/rnn_unrolled.png">

2. Each word in the above image is embedded into a vector space (typically word2vec-style embeddings)

3. RNNs can be configured as many-to-many, many-to-one, or one-to-many, e.g.:

<img src="img/rnn_configs.png">

4. Weights are shared across steps

5. RNNs suffer from the vanishing (or exploding) gradient problem: backpropagating through time multiplies many gradients together, and if each factor is very small or very large, the product vanishes or explodes.
With small gradients, almost nothing flows all the way back, so the input at time step t-10 has almost no effect at time t
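
To make the vanishing and exploding cases concrete, here is a minimal sketch that multiplies one constant per-step factor over several time steps. The factors 0.5 and 1.5 and the helper name `backprop_factor` are illustrative assumptions, not values taken from a real network.

```python
def backprop_factor(per_step_gradient, num_steps):
    """Product of `num_steps` identical per-step gradient factors."""
    product = 1.0
    for _ in range(num_steps):
        product *= per_step_gradient
    return product

vanishing = backprop_factor(0.5, 10)  # 0.5 ** 10
exploding = backprop_factor(1.5, 10)  # 1.5 ** 10

print(vanishing)  # 0.0009765625
print(exploding)  # 57.6650390625
```

After only 10 steps the small factor has shrunk the signal by about a thousand, which is why an input 10 steps back barely influences the update.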

## LSTM:

Math behind LSTM: 
https://www.youtube.com/watch?v=9zhrxE5PQgY

Attention is all you need -

- https://www.youtube.com/watch?v=iDulhoQ2pro&t=1305s
- https://www.youtube.com/watch?v=rBCqOTEfxvg

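For this example, we have -
- each word embedded in 650 dimensions
- a sequence length of 35
- a batch size of 20

(The code further down uses `num_steps = 30` and `hidden_size = 500` instead; the shapes change accordingly.)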

Hence, the input tensor will be 20 X 35 X 650, reordered into 35 X 20 X 650 to put it in "time major" format.

The (20, 35, 650) output of the LSTM layers is then flattened and fed into a softmax layer for classification, i.e.
(20, 35, 650) => (700, 650), which is then fed into softmax
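
The two reshapes above can be sketched with NumPy. The array of zeros is only a placeholder for real embeddings, and the variable names are assumptions for illustration.

```python
import numpy as np

batch_size, seq_len, embed_dim = 20, 35, 650

# Placeholder batch of embedded words in "batch major" layout.
batch_major = np.zeros((batch_size, seq_len, embed_dim))

# Swap the first two axes to get the "time major" layout.
time_major = np.transpose(batch_major, (1, 0, 2))

# Merge the batch and time axes so every word position becomes one row.
flattened = batch_major.reshape(-1, embed_dim)

print(time_major.shape)  # (35, 20, 650)
print(flattened.shape)   # (700, 650)
```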


<img src="img/target_state_to_achieve.png">




Objective: Predict next word, given previous x words 

Sentence: The cat ate the mouse
X -> The cat ate the
y -> cat ate the mouse

predict "mouse", given "the cat ate the"
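
The X/y pairing is just the same word list shifted by one position, which a short sketch makes explicit:

```python
sentence = "The cat ate the mouse"
words = sentence.split()

x = words[:-1]  # every word except the last
y = words[1:]   # the same words shifted one position ahead

print(x)  # ['The', 'cat', 'ate', 'the']
print(y)  # ['cat', 'ate', 'the', 'mouse']

# The word to predict after seeing all of x is the last entry of y.
print(y[-1])  # mouse
```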

In [1]:
import collections
import os
import tensorflow as tf
from keras.models import Sequential, load_model
from keras.layers import Dense, Activation, Embedding, Flatten, Dropout, TimeDistributed, Reshape, Lambda
from keras.layers import LSTM
from keras.optimizers import RMSprop, Adam, SGD
from keras import backend as K
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
import numpy as np
import argparse
import pdb

Using TensorFlow backend.


In [2]:
# read a text file and split it into words, marking the end of each line (sentence) with <eos>
def read_words(filename):
    with tf.gfile.GFile(filename, "rb") as f:
        return f.read().decode("utf-8").replace("\n", "<eos>").split()

#build vocab so that each word is converted to an integer -> word to word_id
def build_vocab(filename):
    data = read_words(filename)

    counter = collections.Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))

    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))

    return word_to_id

#convert files to seq of word_ids instead of seq of words
def file_to_word_ids(filename, word_to_id):
    data = read_words(filename)
    return [word_to_id[word] for word in data if word in word_to_id]
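
As a quick illustration of how `build_vocab` assigns ids, the sketch below applies the same counting and sorting to a tiny in-memory word list instead of a file. The sample text is made up for illustration.

```python
import collections

words = "the cat ate the mouse <eos> the mouse ran <eos>".split()

counter = collections.Counter(words)
# Most frequent word first; ties are broken alphabetically.
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
vocab_words, _ = zip(*count_pairs)
word_to_id = dict(zip(vocab_words, range(len(vocab_words))))

ids = [word_to_id[w] for w in words if w in word_to_id]

print(word_to_id["the"])  # 0, because "the" is the most frequent word
print(ids)                # [0, 4, 3, 0, 2, 1, 0, 2, 5, 1]
```

Frequent words get small ids, so the most common word in the training file always maps to 0.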

In [3]:
#download data from here - http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

In [4]:
def load_data(data_path="", verbose=True):
    # get the data paths
    train_path = os.path.join(data_path, "ptb.train.txt")
    valid_path = os.path.join(data_path, "ptb.valid.txt")
    test_path = os.path.join(data_path, "ptb.test.txt")

    # build the complete vocabulary, then convert text data to list of integers
    word_to_id = build_vocab(train_path)
    train_data = file_to_word_ids(train_path, word_to_id)
    valid_data = file_to_word_ids(valid_path, word_to_id)
    test_data = file_to_word_ids(test_path, word_to_id)
    vocabulary = len(word_to_id)
    reversed_dictionary = dict(zip(word_to_id.values(), word_to_id.keys()))
    
    if verbose:
        print(train_data[:5])
        print(word_to_id)
        print(vocabulary)
        print(" ".join([reversed_dictionary[x] for x in train_data[:10]]))
        
    return train_data, valid_data, test_data, vocabulary, reversed_dictionary

In [5]:
data_path = "/Users/nikhildharap/ml_nlp_experiments/deep_experiments/LSTM_with_TF_Keras/simple-examples/data"
train_data, valid_data, test_data, vocabulary, reversed_dictionary = load_data(data_path=data_path, verbose=False)

### Batch generator: generates x and y that can then be fed to Keras

#### skip_step moves the pointer of the current position forward by that many words after each sample
e.g. 
sentence: "the cat ate the mouse and ran away into the alley"
iter_1
x => "the cat ate the"
y => "cat ate the mouse"

if skip_step == 4, 
iter_2 will look like 
x => "mouse and ran away"
y => "and ran away into"
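
The same walk-through can be reproduced in a few lines. The helper name `windows` is hypothetical and only mirrors the pointer arithmetic described above.

```python
def windows(words, num_steps, skip_step, count):
    """Return `count` (x, y) pairs, moving the start by `skip_step` each time."""
    pairs = []
    start = 0
    for _ in range(count):
        x = words[start:start + num_steps]
        y = words[start + 1:start + num_steps + 1]
        pairs.append((" ".join(x), " ".join(y)))
        start += skip_step
    return pairs

words = "the cat ate the mouse and ran away into the alley".split()
pairs = windows(words, num_steps=4, skip_step=4, count=2)
for x, y in pairs:
    print("x =>", x, "| y =>", y)
```

With `skip_step=1` the second window would instead start at "cat", so consecutive samples overlap heavily.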

In [6]:
class KerasBatchGenerator(object):

    def __init__(self, data, num_steps, batch_size, vocabulary, skip_step=5):
        self.data = data
        self.num_steps = num_steps
        self.batch_size = batch_size
        self.vocabulary = vocabulary
        # this will track the progress of the batches sequentially through the
        # data set - once the data reaches the end of the data set it will reset
        # back to zero
        self.current_idx = 0
        # skip_step is the number of words which will be skipped before the next
        # batch is skimmed from the data set
        self.skip_step = skip_step
        
    def generate(self):
        # x has shape (batch_size, num_steps) and holds word ids
        x = np.zeros((self.batch_size, self.num_steps))

        # y has shape (batch_size, num_steps, vocabulary): the one-hot encoded
        # next-word targets, e.g. (100, 1) word ids => (100, 10000) one-hot rows
        y = np.zeros((self.batch_size, self.num_steps, self.vocabulary))

        while True:
            for i in range(self.batch_size):
                if self.current_idx + self.num_steps >= len(self.data):
                    # reset the pointer to the start of the data set
                    self.current_idx = 0

                x[i, :] = self.data[self.current_idx:self.current_idx + self.num_steps]
                y_temp = self.data[self.current_idx + 1:self.current_idx + self.num_steps + 1]

                # convert y_temp to its one-hot encoded representation
                y[i, :, :] = to_categorical(y_temp, num_classes=self.vocabulary)
                self.current_idx += self.skip_step
            # hand the filled batch to Keras; the loop then refills the same arrays
            yield x, y
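
The one-hot step can be sketched with plain NumPy. This mimics what `to_categorical` produces for a 1-D list of ids; it is not the Keras implementation itself, and the ids and the vocabulary size of 5 are made up.

```python
import numpy as np

vocabulary = 5
y_temp = [2, 0, 4]  # hypothetical word ids for three time steps

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(vocabulary)[y_temp]

print(one_hot.shape)  # (3, 5)
print(one_hot)
```

Each row has a single 1 at the position of the target word id, which is the format `categorical_crossentropy` expects.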

In [7]:
num_steps = 30
batch_size = 20
train_data_generator = KerasBatchGenerator(train_data, num_steps, batch_size, vocabulary,
                                           skip_step=num_steps)
valid_data_generator = KerasBatchGenerator(valid_data, num_steps, batch_size, vocabulary,
                                           skip_step=num_steps)

In [8]:
hidden_size = 500
use_dropout=True

The Embedding() layer takes -
- the size of the vocabulary as its first argument
- then the size of the resultant embedding vector that you want as the next argument
- Finally, because this layer is the first layer in the network, we must specify the “length” of the input i.e. the number of steps/words in each sample.
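
Conceptually, an embedding layer is a lookup table indexed by word id. The NumPy sketch below shows the shape change from `(batch_size, num_steps)` to `(batch_size, num_steps, embedding_dim)`; the tiny sizes and the fixed table values are assumptions for illustration, whereas in Keras the table is learned during training.

```python
import numpy as np

vocabulary, embedding_dim = 4, 3  # hypothetical sizes

# One row of weights per word id.
table = np.arange(vocabulary * embedding_dim).reshape(vocabulary, embedding_dim)

# Two samples (batch_size=2), each with three word ids (num_steps=3).
word_ids = np.array([[2, 0, 2],
                     [1, 3, 3]])

embedded = table[word_ids]  # look up one row per word id

print(embedded.shape)  # (2, 3, 3)
print(embedded[0, 0])  # [6 7 8], the row for word id 2
```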

The LSTM layer takes -
- the number of units as the first argument, which is `hidden_size` for us. This is the size of the hidden state (and of the output at each step), i.e. __the number of nodes in the hidden layers within the LSTM cell, e.g. the number of cells in the forget gate layer, the tanh squashing input layer and so on.__
- `return_sequences=True` makes the layer return the output at every time step rather than only the last one
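
The parameter counts implied by these layer arguments can be checked by hand. The sketch below uses the standard LSTM parameter formula, 4 * ((input_dim + units) * units + units), and assumes a vocabulary of 10000 words; that number is an assumption here, since the real value comes from `load_data`.

```python
def embedding_params(vocabulary, embedding_dim):
    # One embedding vector per word id.
    return vocabulary * embedding_dim

def lstm_params(input_dim, units):
    # Three gates plus the candidate state, each with input weights,
    # recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, output_dim):
    return input_dim * output_dim + output_dim

vocabulary = 10000  # assumed vocabulary size
hidden_size = 500

print(embedding_params(vocabulary, hidden_size))  # 5000000
print(lstm_params(hidden_size, hidden_size))      # 2002000
print(dense_params(hidden_size, vocabulary))      # 5010000
```

Under these assumptions the embedding and the output projection dominate the model size, while each LSTM layer adds roughly two million weights.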


<img src="img/LSTM_Cell_Return_Seq.png">

In [9]:
model = Sequential()
#the input to the embedding layer is (batch_size, num_steps) and the output is (batch_size, num_steps, hidden_size)
model.add(Embedding(vocabulary, hidden_size, input_length=num_steps))
#add LSTM cells - two layers
#output shape of LSTM layer is (batch_size, num_steps, hidden_size) because we want return_seq = True
model.add(LSTM(hidden_size, return_sequences=True))
model.add(LSTM(hidden_size, return_sequences=True))
#check for drop out
if use_dropout:
    model.add(Dropout(0.5))
    
    
# Keras provides a wrapper layer for recurrent models called TimeDistributed.
# It applies the wrapped layer to every time step of its input.
# So, for instance, if we have 10 time steps in a model,
# TimeDistributed(Dense(...)) applies the same Dense layer (shared weights)
# to each of the 10 steps, producing one output per step.

model.add(TimeDistributed(Dense(vocabulary)))
model.add(Activation('softmax'))
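
A minimal NumPy sketch of what `TimeDistributed(Dense(...))` computes: one weight matrix and one bias applied at every time step. The tiny sizes and random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, num_steps, hidden_size, vocabulary = 2, 3, 4, 5

lstm_output = rng.rand(batch_size, num_steps, hidden_size)
weights = rng.rand(hidden_size, vocabulary)  # one shared weight matrix
bias = rng.rand(vocabulary)

# Apply the same dense transform to all time steps at once.
all_steps = lstm_output @ weights + bias

# Apply it step by step and stack the results along the time axis.
per_step = np.stack(
    [lstm_output[:, t, :] @ weights + bias for t in range(num_steps)], axis=1)

print(all_steps.shape)                   # (2, 3, 5)
print(np.allclose(all_steps, per_step))  # True
```

Both routes give the same result, which is why the output of the model has shape `(batch_size, num_steps, vocabulary)`.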

In [10]:
optimizer = Adam()
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['categorical_accuracy'])
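
To show what the loss and metric measure, here is a hand-rolled sketch for a single time step. The function names and the predicted distribution are hypothetical; Keras computes the same quantities and averages them over all time steps and samples.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy for one time step: -sum(true * log(predicted))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def is_correct(y_true, y_pred):
    """Categorical accuracy for one step: do the argmax positions match?"""
    return y_true.index(max(y_true)) == y_pred.index(max(y_pred))

y_true = [0.0, 1.0, 0.0, 0.0]  # one-hot target: the true word has id 1
y_pred = [0.1, 0.7, 0.1, 0.1]  # softmax output of a hypothetical model

loss = categorical_crossentropy(y_true, y_pred)
print(loss)                        # -log(0.7), about 0.357
print(is_correct(y_true, y_pred))  # True
```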

In [11]:
checkpointer = ModelCheckpoint(filepath=data_path + '/model-{epoch:02d}.hdf5', verbose=1)

In [None]:
# number of training epochs (assumed value; the next cell loads the epoch-40 checkpoint)
num_epochs = 40
model.fit_generator(train_data_generator.generate(), len(train_data)//(batch_size*num_steps), num_epochs,
                        validation_data=valid_data_generator.generate(),
                        validation_steps=len(valid_data)//(batch_size*num_steps), callbacks=[checkpointer])
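
The steps-per-epoch argument follows from the generator settings: with `skip_step=num_steps`, each batch consumes `batch_size * num_steps` words, so one pass over the data takes the word count divided by that product. The sketch below uses a hypothetical data length of 900000 words; the helper name is also an assumption.

```python
def steps_per_epoch(num_words, batch_size, num_steps):
    """Number of batches needed to walk through the data once."""
    return num_words // (batch_size * num_steps)

# 900000 is a hypothetical data set length used only for illustration.
print(steps_per_epoch(900000, batch_size=20, num_steps=30))  # 1500
```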

In [None]:
# load the checkpoint saved after epoch 40
model = load_model(data_path + "/model-40.hdf5")
dummy_iters = 40
example_training_generator = KerasBatchGenerator(train_data, num_steps, 1, vocabulary,
                                                     skip_step=1)
print("Training data:")
for i in range(dummy_iters):
    dummy = next(example_training_generator.generate())
num_predict = 10
true_print_out = "Actual words: "
pred_print_out = "Predicted words: "
for i in range(num_predict):
    data = next(example_training_generator.generate())
    prediction = model.predict(data[0])
    predict_word = np.argmax(prediction[:, num_steps-1, :])
    true_print_out += reversed_dictionary[train_data[num_steps + dummy_iters + i]] + " "
    pred_print_out += reversed_dictionary[predict_word] + " "
print(true_print_out)
print(pred_print_out)

Reference: http://adventuresinmachinelearning.com/keras-lstm-tutorial/