# Learning LSTMs
The point of this notebook is just to mess with LSTMs more (in particular in the seq2seq setting) to get more comfortable with them.

### Imports

In [69]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, LSTM, Input

import numpy as np

### Defining a model using Sequential

In [25]:
def my_model(num_features, num_timesteps, num_units=10):
    model = Sequential()
    model.add(LSTM(units=num_units, input_shape=(num_timesteps, num_features)))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

### Defining a model using the Functional API
This roughly follows the approach described [here](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html).

In [97]:
def my_model_func(num_timesteps, num_features, num_units=10, return_state=False, return_sequences=False):
    lstm_input = Input(shape=(num_timesteps, num_features))
    lstm = LSTM(units=num_units, return_state=return_state, return_sequences=return_sequences)
    lstm_output = lstm(lstm_input)
    model = Model(lstm_input, lstm_output)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model, lstm_output

### Generating data for the model
We'll one-hot encode the digits 0 through 9 (inclusive).

In [7]:
mapping = dict()

for i in range(10):
    one_hot = np.zeros(10)
    one_hot[i] = 1
    mapping[i] = one_hot

def reverse_mapping(one_hot):
    # The index of the single 1 entry is the encoded digit.
    return int(np.argmax(one_hot))

# Take a (possibly) multidigit integer and one-hot it. Return a matrix of one-hot vectors representing each digit.
def one_hot(s, mapping):
    if not isinstance(s, str):
        s = str(s)
    one_hot_s = np.zeros((len(s), 10))
    for i, c in enumerate(s):
        one_hot_s[i] = mapping[int(c)]
    return one_hot_s
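As a quick sanity check on the encoding, here is a minimal round-trip sketch. The helper names `encode` and `decode` are hypothetical stand-ins for the functions above, and the lookup table is rebuilt with `np.eye` so the block runs on its own.

```python
import numpy as np

# One-hot lookup table for the digits 0-9: row i of the identity matrix encodes digit i.
mapping = {i: np.eye(10)[i] for i in range(10)}

def encode(number):
    """One-hot a (possibly multi-digit) integer, one row per digit."""
    return np.stack([mapping[int(c)] for c in str(number)])

def decode(one_hot_matrix):
    """Invert encode(): the argmax of each row recovers the digit."""
    return int("".join(str(d) for d in one_hot_matrix.argmax(axis=1)))

encoded = encode(407)
print(encoded.shape)    # (3, 10)
print(decode(encoded))  # 407
```

Using `argmax` for decoding avoids depending on exact equality with 1, which also makes it usable later on softmax-style model outputs.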

Next we'll use that to generate 100 training samples for the model. Each input is an integer with `digit_length` digits, and the target is the sum of its digits mod 10.

In [115]:
def generate_sample(digit_length):
    # Draw a random integer with exactly `digit_length` digits (no leading zero).
    x = str(np.random.randint(10**(digit_length-1), 10**digit_length))
    # The target is the sum of its digits, mod 10.
    y = sum(int(c) for c in x) % 10
    return x, y
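To make the target concrete, here is a small worked example of the label rule on fixed inputs. The helper name `digit_sum_label` is hypothetical and only isolates the "sum of digits mod 10" step.

```python
def digit_sum_label(x):
    """Target for the toy task: sum of the decimal digits of x, mod 10."""
    return sum(int(c) for c in str(x)) % 10

print(digit_sum_label(123))  # 1 + 2 + 3 = 6, so the label is 6
print(digit_sum_label(987))  # 9 + 8 + 7 = 24, so the label is 4
```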

In [62]:
NUM_SAMPLES = 100
DIGIT_LENGTH = 3
NUM_CHARS = 10

X = np.zeros((NUM_SAMPLES, DIGIT_LENGTH, NUM_CHARS))
Y = np.zeros((NUM_SAMPLES, NUM_CHARS))

for i in range(NUM_SAMPLES):
    x, y = generate_sample(DIGIT_LENGTH)
    X[i] = one_hot(x, mapping)
    Y[i] = one_hot(y, mapping)
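For comparison, the same kind of dataset can be built without a Python loop. This is a sketch of an alternative, not what the cell above does: it assumes drawing each digit independently (first digit from 1 to 9) is an acceptable replacement for drawing whole integers, and the seed is there only for reproducibility.

```python
import numpy as np

NUM_SAMPLES, DIGIT_LENGTH, NUM_CHARS = 100, 3, 10
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Draw every digit independently; the first digit is 1-9 so that
# each number has exactly DIGIT_LENGTH digits.
digits = rng.integers(0, 10, size=(NUM_SAMPLES, DIGIT_LENGTH))
digits[:, 0] = rng.integers(1, 10, size=NUM_SAMPLES)
labels = digits.sum(axis=1) % 10

# Indexing the identity matrix with an integer array one-hots every entry at once.
X = np.eye(NUM_CHARS)[digits]  # shape (100, 3, 10)
Y = np.eye(NUM_CHARS)[labels]  # shape (100, 10)

print(X.shape, Y.shape)

# Decoding with argmax recovers the original digits and labels.
assert (X.argmax(axis=-1) == digits).all()
assert (Y.argmax(axis=-1) == labels).all()
```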

### Examining the model
So what exactly does an LSTM layer return? Not the compiled `model` that the helper hands back, but the output tensor of the LSTM layer itself. Let's examine it step by step.

In [99]:
model_func, lstm_output = my_model_func(num_timesteps=DIGIT_LENGTH,
                                        num_features=NUM_CHARS,
                                        return_state=False,
                                        return_sequences=False)

In [100]:
lstm_output.shape

TensorShape([None, 10])

By default, a Keras LSTM layer returns only its final hidden state, that is, the hidden state at the *last timestep* (see [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) for background on how LSTMs work). Since each hidden state is determined (in part) by the previous hidden and cell states, this final hidden state can be thought of as summarizing the information from all previous hidden states, cell states, and inputs.

Recall that the full sequence of hidden states, $\vec{h}=(h_1, h_2, \ldots, h_T)$ where $T$ is `num_timesteps`, is actually a matrix, not a vector. Each $h_t$ is itself a vector of dimension `num_units`, so $\vec{h}$ has shape `(num_timesteps, num_units)`. That is why even the final hidden state alone, $h_T$, still has dimension `num_units` (10 here). The first entry of the shape (`None` in this case) is the batch size.
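The shapes above can be checked with a hand-rolled forward pass. This is a minimal NumPy sketch of the standard LSTM recurrence for a single sequence, not Keras's implementation: the weights are random, the initial states are assumed to be zero, and the order in which the gates are split out of the weight matrices is an arbitrary choice made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, W, U, b):
    """Run one sequence x of shape (num_timesteps, num_features) through an LSTM.

    W: (num_features, 4 * num_units), U: (num_units, 4 * num_units), b: (4 * num_units,)
    Returns every hidden state, stacked to (num_timesteps, num_units),
    together with the final cell state.
    """
    num_units = U.shape[0]
    h = np.zeros(num_units)  # initial hidden state (assumed zero)
    c = np.zeros(num_units)  # initial cell state (assumed zero)
    hidden_states = []
    for x_t in x:
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)  # input gate, forget gate, candidate, output gate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g            # new cell state
        h = o * np.tanh(c)           # new hidden state
        hidden_states.append(h)
    return np.stack(hidden_states), c

rng = np.random.default_rng(0)
num_timesteps, num_features, num_units = 3, 10, 10

x = np.eye(num_features)[[4, 0, 7]]  # the number 407, one-hot encoded
W = rng.normal(scale=0.1, size=(num_features, 4 * num_units))
U = rng.normal(scale=0.1, size=(num_units, 4 * num_units))
b = np.zeros(4 * num_units)

H, c_final = lstm_forward(x, W, U, b)
print(H.shape)      # (3, 10): one hidden vector per timestep
print(H[-1].shape)  # (10,): the final hidden state alone
```

The last row `H[-1]` plays the role of the default layer output for one sample; Keras adds the leading batch dimension on top of this.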

Next, let's specify `return_state=True`.

In [101]:
model_func, lstm_output = my_model_func(num_timesteps=DIGIT_LENGTH,
                                        num_features=NUM_CHARS,
                                        return_state=True,
                                        return_sequences=False)

In [102]:
lstm_output

[<tf.Tensor 'lstm_16/Identity:0' shape=(None, 10) dtype=float32>,
 <tf.Tensor 'lstm_16/Identity_1:0' shape=(None, 10) dtype=float32>,
 <tf.Tensor 'lstm_16/Identity_2:0' shape=(None, 10) dtype=float32>]

So a list is returned. It consists of the triple (LSTM output, final hidden state, final cell state), as discussed [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM). With `return_sequences=False`, the LSTM output *is* the final hidden state, so the first two entries hold the same values. It might seem strange to return the final hidden state twice, but the LSTM output is only the final hidden state by default, and this can be changed. By passing `return_sequences=True`, the LSTM output will instead be the hidden state for *all* timesteps. Returning the final hidden state separately means you can always use it directly, rather than needing to pull it out of the LSTM output.

In [112]:
model_func, lstm_output = my_model_func(num_timesteps=DIGIT_LENGTH,
                                        num_features=NUM_CHARS,
                                        return_state=True,
                                        return_sequences=True)

In [113]:
lstm_output

[<tf.Tensor 'lstm_17/Identity:0' shape=(None, 3, 10) dtype=float32>,
 <tf.Tensor 'lstm_17/Identity_1:0' shape=(None, 10) dtype=float32>,
 <tf.Tensor 'lstm_17/Identity_2:0' shape=(None, 10) dtype=float32>]
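The first entry now has shape `(None, 3, 10)`: one hidden state per timestep. One way to confirm how the three outputs relate is to run actual data through an untrained layer and compare the last timestep of the sequence with the separately returned hidden state. The sketch below assumes a TensorFlow 2.x install; the batch of 4 random inputs is arbitrary and is used only to check shapes and equality.

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

num_timesteps, num_features, num_units = 3, 10, 10

inputs = Input(shape=(num_timesteps, num_features))
sequence, state_h, state_c = LSTM(num_units,
                                  return_sequences=True,
                                  return_state=True)(inputs)
model = Model(inputs, [sequence, state_h, state_c])

# A batch of 4 random sequences; the values do not matter for this check.
batch = np.random.rand(4, num_timesteps, num_features).astype("float32")
seq_out, h_out, c_out = model.predict(batch, verbose=0)

print(seq_out.shape, h_out.shape, c_out.shape)  # (4, 3, 10) (4, 10) (4, 10)

# The last timestep of the full sequence is the final hidden state.
print(np.allclose(seq_out[:, -1, :], h_out))    # True
```

The cell state `c_out` has the same shape as the hidden state but holds different values; it is only available through `return_state=True`.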