## Recurrent Neural Networks with Keras
***

In this notebook we will work through an example of a recurrent neural network (RNN) built with Keras, the high-level API that runs on top of TensorFlow.


Recurrent neural networks use an architecture that gives them <em>persistence</em>: information they have seen in the past is retained. This *memory* allows these networks to learn from sequences even though they see only one element of the sequence at a time.

RNNs achieve this through a recurrent connection in their hidden layer, which lets information from one time step flow not only to the output layer but also back into the network at the next time step.
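The recurrent connection can be sketched in a few lines of plain Python. This is a minimal illustration, not the Keras implementation: the helper names `rnn_step` and `run_rnn` and the weights `w_x`, `w_h`, and `b` are hypothetical, with hand-picked values, and the hidden state has a single unit so the arithmetic stays easy to follow.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a single-unit recurrent cell: h_t = tanh(w_x*x_t + w_h*h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, h0=0.0):
    """Feed a sequence one element at a time, carrying the hidden state forward."""
    h = h0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h)
        states.append(h)
    return states

# The second input is 0.0 in both sequences, yet the final hidden states
# differ because they depend on what came before: this is the "memory".
states_a = run_rnn([1.0, 0.0])
states_b = run_rnn([-1.0, 0.0])
print(states_a[-1], states_b[-1])
```

The hidden state returned at each step is the only channel through which earlier inputs can influence later outputs.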

RNNs suffer from the *vanishing gradient* problem. In short, this keeps them from learning very long-term dependencies, causing them to *forget* information from early in the sequence.
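To get a feel for why gradients vanish, consider a single-unit recurrent cell with h_t = tanh(w * h_{t-1}). By the chain rule, the derivative of the final state with respect to the initial state is a product of one factor w * (1 - h_t^2) per time step. The sketch below multiplies these factors out; the function name and the weight value are made up for illustration. When every factor is smaller than 1 in magnitude, the product shrinks roughly exponentially with sequence length.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_steps / d h_0 for the recurrence h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Derivative of tanh(w * h_prev) with respect to h_prev.
        grad *= w * (1.0 - h ** 2)
    return grad

# The influence of the initial state fades as the sequence gets longer.
for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.9, h0=0.5, steps=steps))
```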


**Some Resources**:

- [Christopher Olah's blog post on LSTMs and GRUs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Jason Brownlee's detailed post on LSTMs](https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/)


Make sure you have Keras and TensorFlow installed. Run the cell below to import the required libraries.

In [None]:
from __future__ import print_function
from keras.models import Sequential
from keras import layers
import numpy as np
from six.moves import range

#### RNN learns to perform addition on two numbers: Sequence to sequence learning

This is a simple example [available on the Keras website](https://keras.io/examples/addition_rnn/), but it offers insight into how an RNN/LSTM leverages sequential information.

In this example you will take a sequence of characters denoting the addition of two numbers, e.g., '301+4', and train a network to return an output sequence of characters representing the answer, e.g., '305'. The input sequences will be one-hot encoded before they are sent to the network.
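As a rough illustration of what one-hot encoding does to a string such as '301+4', the sketch below maps each character to a row containing a single 1 at that character's index. The 12-character alphabet matches the one used later in this notebook; the helper `one_hot` is a hypothetical stand-in written in plain Python, not the `CharacterTable` class defined next.

```python
ALPHABET = sorted('0123456789+ ')  # digits, plus sign, and space for padding

def one_hot(text, alphabet=ALPHABET):
    """Return one row per character, with a single 1 at that character's index."""
    index = {c: i for i, c in enumerate(alphabet)}
    rows = []
    for c in text:
        row = [0] * len(alphabet)
        row[index[c]] = 1
        rows.append(row)
    return rows

encoded = one_hot('301+4')
for char, row in zip('301+4', encoded):
    print(repr(char), row)
```

Each row has as many entries as there are characters in the alphabet, so a 5-character string becomes a 5 x 12 grid of zeros and ones.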


The cell below defines the methods for encoding character strings to one-hot vectors and decoding them back. Please run this cell so that you can use them.

In [None]:
class CharacterTable(object):
    """Given a set of characters:
    + Encode them to a one-hot integer representation
    + Decode the one-hot or integer representation to their character output
    + Decode a vector of probabilities to their character output
    """
    def __init__(self, chars):
        """Initialize character table.

        # Arguments
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One-hot encode given string C.

        # Arguments
            C: string, to be encoded.
            num_rows: Number of rows in the returned one-hot encoding. This is
                used to keep the # of rows for each data the same.
        """
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        """Decode the given vector or 2D array to their character output.

        # Arguments
            x: A vector or a 2D array of probabilities or one-hot representations;
                or a vector of character indices (used with `calc_argmax=False`).
            calc_argmax: Whether to find the character index with maximum
                probability, defaults to `True`.
        """
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[i] for i in x)


class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'



**Step 1:** In the cell below we will generate a number of example sequences and the corresponding expected results for the addition. `TRAINING_SIZE` determines the total number of examples generated. `DIGITS` is the maximum number of digits in each number in the addition.

In [None]:
# Parameters for the model and dataset.
TRAINING_SIZE = 50000
DIGITS = 3
REVERSE = True

# Maximum length of input is 'int + int' (e.g., '345+678'). Maximum length of
# int is DIGITS.
MAXLEN = DIGITS + 1 + DIGITS

# All the numbers, plus sign and space for padding.
chars = '0123456789+ '
ctable = CharacterTable(chars)

questions = []
expected = []
seen = set()
print('Generating data...')
while len(questions) < TRAINING_SIZE:
    f = lambda: int(''.join(np.random.choice(list('0123456789'))
                    for i in range(np.random.randint(1, DIGITS + 1))))
    a, b = f(), f()
    # Skip any addition questions we've already seen
    # Also skip any such that x+Y == Y+x (hence the sorting).
    key = tuple(sorted((a, b)))
    if key in seen:
        continue
    seen.add(key)
    # Pad the data with spaces such that it is always MAXLEN.
    q = '{}+{}'.format(a, b)
    query = q + ' ' * (MAXLEN - len(q))
    ans = str(a + b)
    # Answers can be of maximum size DIGITS + 1.
    ans += ' ' * (DIGITS + 1 - len(ans))
    if REVERSE:
        # Reverse the query, e.g., '12+345 ' becomes ' 543+21'. (Note the
        # space used for padding.)
        query = query[::-1]
    questions.append(query)
    expected.append(ans)
print('Total addition questions:', len(questions))



In [None]:
# TODO: use this cell to figure out what's going on in the data: look at the form of questions and expected

**Step 2:** In this next step you will encode the examples above and create training and validation data to train the recurrent neural network on.

Remember, if a question is shorter than the maximum length, it will be padded with spaces.
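The padding (and optional reversal) applied to each question and answer can be sketched for a single pair of numbers as follows. The helpers `pad_query` and `pad_answer` are hypothetical names used only for illustration; they mirror the logic of the data-generation cell in Step 1.

```python
DIGITS = 3
MAXLEN = DIGITS + 1 + DIGITS  # e.g. '345+678' has 7 characters

def pad_query(a, b, reverse=True):
    """Format 'a+b', right-pad with spaces to MAXLEN, and optionally reverse."""
    query = '{}+{}'.format(a, b).ljust(MAXLEN)
    return query[::-1] if reverse else query

def pad_answer(a, b):
    """Right-pad the sum with spaces to DIGITS + 1 characters."""
    return str(a + b).ljust(DIGITS + 1)

print(repr(pad_query(12, 345)))                 # ' 543+21'
print(repr(pad_query(12, 345, reverse=False)))  # '12+345 '
print(repr(pad_answer(12, 345)))                # '357 '
```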

In [None]:
print('Vectorization...')
# Use the builtin bool: the np.bool alias was removed in NumPy 1.24.
x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=bool)
y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=bool)

# TODO: in x and y, generate and store the encoded forms of the questions and
# expected answers above, using the encode method of the CharacterTable class.


# Shuffle (x, y) in unison as the later parts of x will almost all be larger
# digits.
indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]

# TODO: Explicitly set apart 10% for validation data that we never train over.


# Check the shape of your data
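One common way to hold out validation data is to slice off the last 10% of the already shuffled examples. The sketch below shows the idea on toy Python lists. The names `split_at`, `x_train`, `y_train`, `x_val`, and `y_val` are assumptions (the training cell in Step 4 expects `x_val` and `y_val` to exist), and the same slicing should carry over to the NumPy arrays `x` and `y`.

```python
import random

# Toy stand-ins for the encoded questions and answers.
toy_x = list(range(20))
toy_y = [v * 2 for v in toy_x]

# Shuffle both lists with the same permutation so the pairs stay aligned.
indices = list(range(len(toy_x)))
random.Random(0).shuffle(indices)
toy_x = [toy_x[i] for i in indices]
toy_y = [toy_y[i] for i in indices]

# Keep the first 90% for training and the last 10% for validation.
split_at = len(toy_x) - len(toy_x) // 10
x_train, x_val = toy_x[:split_at], toy_x[split_at:]
y_train, y_val = toy_y[:split_at], toy_y[split_at:]

print(len(x_train), len(x_val))
```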

**Step 3:** Now the model will be created to test a recurrent neural network architecture.
***

The architecture has two main components, the encoder and the decoder, both of which exploit the ability of recurrent units to learn sequential relations in data.

Encoder:
Here we will use a recurrent unit (`SimpleRNN` or `LSTM`) as the encoder. The goal of the encoder is to take in the clunky one-hot encoded input, e.g., '53+21  ', character by character and encode it into a dense representation.

Decoder: 
For this we will use another recurrent unit of our choice. At each time step of the unrolled recurrent connection, the decoder receives the encoded representation produced by the encoder. The required output has a maximum length of `DIGITS+1`, so we feed in the dense representation that many times using the `RepeatVector` layer.
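What `RepeatVector` does to a single encoded example can be mimicked in plain Python: the encoder's output vector is copied once per output time step. The values and the helper `repeat_vector` below are made up for illustration; the Keras layer applies the same repetition to every example in a batch.

```python
DIGITS = 3

def repeat_vector(vector, n):
    """Copy a 1D vector n times, giving a nested list of shape (n, len(vector))."""
    return [list(vector) for _ in range(n)]

# Stand-in for the encoder output (HIDDEN_SIZE values in the real model).
encoded = [0.2, -0.7, 0.5]
decoder_input = repeat_vector(encoded, DIGITS + 1)

# One copy of the encoding per output character.
print(len(decoder_input), len(decoder_input[0]))
```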

The loss function used is categorical crossentropy, since each output character is a choice among a fixed set of classes.

Finally, the output of the decoder is a temporal sequence of probability distributions, one per output character, each giving the probability that the character is a particular digit (or the plus sign or a space). To produce it, we apply a dense layer to each temporal slice of the decoder's output, i.e., for each character of the sequence we generate a probability for every possible character. This is done using the `TimeDistributed` layer.
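The per-character output and its loss can be sketched without Keras. For each time step, a softmax turns raw scores into a probability distribution over the 12 characters, and the categorical crossentropy for that step is the negative log of the probability assigned to the correct character. The scores below are made up, `softmax` and `cross_entropy` are hypothetical helper names, and the loss is simply averaged over the steps of this one example.

```python
import math

CHARS = sorted('0123456789+ ')

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Categorical crossentropy against a one-hot target."""
    return -math.log(probs[target_index])

# Made-up scores for a 2-step output; each step strongly favours one character.
target = '35'
scores = [[5.0 if c == t else 0.0 for c in CHARS] for t in target]

probs = [softmax(step) for step in scores]
guess = ''.join(CHARS[p.index(max(p))] for p in probs)
loss = sum(cross_entropy(p, CHARS.index(t)) for p, t in zip(probs, target)) / len(target)
print(guess, loss)
```

Picking the most probable character at each step, as done for `guess`, is the same argmax decoding that `CharacterTable.decode` performs.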

In [None]:
# Try swapping in layers.LSTM or layers.GRU in place of SimpleRNN.
# RNN = layers.LSTM
RNN = layers.SimpleRNN
HIDDEN_SIZE = 128
BATCH_SIZE = 128
LAYERS = 1

print('Build model...')
model = Sequential()
# TODO: 
# "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE.
# Note: In a situation where your input sequences have a variable length,
# use input_shape=(None, num_feature).



# As the decoder RNN's input, repeatedly provide the last output of the
# encoder RNN, once per time step. Repeat 'DIGITS + 1' times as that's the
# maximum length of output, e.g., when DIGITS=3, max output is 999+999=1998.
model.add(layers.RepeatVector(DIGITS + 1))

# Build the Decoder : 
# TODO: 
# The decoder RNN could be multiple layers stacked or a single layer.
# Set return_sequences to True so that the layer returns not only the last
# output but the output at every time step, in the form (num_samples,
# timesteps, output_dim). This is necessary because the TimeDistributed layer
# below expects the first dimension after the batch to be the timesteps.




# Apply a dense layer to every temporal slice of the input. For each step
# of the output sequence, decide which character should be chosen.
model.add(layers.TimeDistributed(layers.Dense(len(chars), activation='softmax')))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()



**Step 4:** Here we actually train the network using the `fit()` method. Then the code se

In [None]:
# Train the model each generation and show predictions against the validation
# dataset.
for iteration in range(1, 200):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    # TODO: run the model fit method for one epoch at a time

    # Select 10 samples from the validation set at random so we can visualize
    # errors.
    for i in range(10):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
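        # predict_classes() is unavailable in newer Keras releases, so take
        # the argmax over the character axis of the predicted probabilities.
        preds = model.predict(rowx, verbose=0).argmax(axis=-1)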
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=False)
        print('Q', q[::-1] if REVERSE else q, end=' ')
        print('T', correct, end=' ')
        if correct == guess:
            print(colors.ok + '☑' + colors.close, end=' ')
        else:
            print(colors.fail + '☒' + colors.close, end=' ')
        print(guess)
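Note that the loop above counts a prediction as correct only if every character matches. The sketch below, using made-up targets and guesses and hypothetical helper names, contrasts this exact-match accuracy with a per-character accuracy, which is usually the more forgiving of the two.

```python
def exact_match_accuracy(targets, guesses):
    """Fraction of sequences predicted correctly in full."""
    return sum(t == g for t, g in zip(targets, guesses)) / len(targets)

def per_character_accuracy(targets, guesses):
    """Fraction of individual characters predicted correctly."""
    pairs = [(tc, gc) for t, g in zip(targets, guesses) for tc, gc in zip(t, g)]
    return sum(tc == gc for tc, gc in pairs) / len(pairs)

# Made-up targets and guesses, padded to DIGITS + 1 = 4 characters.
targets = ['305 ', '1998', '42  ', '700 ']
guesses = ['305 ', '1988', '42  ', '710 ']

print(exact_match_accuracy(targets, guesses))    # 2 of 4 sequences match
print(per_character_accuracy(targets, guesses))  # 14 of 16 characters match
```

A single wrong digit makes the whole answer wrong, so it is worth tracking the exact-match figure alongside any per-character metric when judging how well the network has learned addition.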