
In [1]:
# Resource: https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/

# Long Short-Term Memory networks, or LSTMs, are a popular and powerful type of Recurrent Neural Network, or RNN.
# They can be quite difficult to configure and apply to arbitrary sequence prediction problems, even with well defined
# and “easy to use” interfaces like those provided in the Keras deep learning library in Python.

# One reason for this difficulty in Keras is the use of the TimeDistributed wrapper layer and the need for 
# some LSTM layers to return sequences rather than single values.

<h1>Sequence Learning Problem</h1>

In [2]:
# We will use a simple sequence learning problem to demonstrate the TimeDistributed layer.

# In this problem, the sequence [0.0, 0.2, 0.4, 0.6, 0.8] will be given as input, one item at a time 
# and must be in turn returned as output, one item at a time.

# Think of it as learning a simple echo program: we give 0.0 as input and expect to see 0.0 as output, 
# and likewise for each item in the sequence.

In [3]:
from numpy import array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

In [4]:
length = 5
seq = array([i/float(length) for i in range(length)])
print(seq)

[0.  0.2 0.4 0.6 0.8]


In [5]:
# The example is configurable and you can play with longer/shorter sequences yourself later if you like.

<h1>One-to-One LSTM for Sequence Prediction</h1>

In [6]:
# Before we dive in, it is important to show that this sequence learning problem can be learned piecewise.

# That is, we can reframe the problem into a dataset of input-output pairs for each item in the sequence. 
# Given 0, the network should output 0, given 0.2, the network must output 0.2, and so on.

# This is the simplest formulation of the problem and requires the sequence to be split into input-output pairs 
# and for the sequence to be predicted one step at a time.

# The input-output pairs are as follows:
# X, 	y
# 0.0,	0.0
# 0.2,	0.2
# 0.4,	0.4
# 0.6,	0.6
# 0.8,	0.8

# The input for LSTMs must be three dimensional. We can reshape the 1D sequence into a 3D array with 
# 5 samples, 1 time step, and 1 feature. We will define the output as 5 samples with 1 feature.

X = seq.reshape(5, 1, 1)
y = seq.reshape(5, 1)
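
As a quick check on this framing, the sketch below rebuilds `seq` as defined earlier and confirms that each sample's single input value is paired with an identical target. The name `pairs` is introduced here for illustration only.

```python
from numpy import array

length = 5
seq = array([i / float(length) for i in range(length)])

# One-to-one framing: 5 samples, each with 1 time step and 1 feature.
X = seq.reshape(length, 1, 1)
# Targets: 5 samples, each with 1 output value.
y = seq.reshape(length, 1)

print(X.shape)  # (5, 1, 1)
print(y.shape)  # (5, 1)

# Collect (input, target) pairs; for the echo problem they are identical.
pairs = [(float(X[i, 0, 0]), float(y[i, 0])) for i in range(length)]
for x_val, y_val in pairs:
    print(x_val, y_val)
```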

In [7]:
# We will define the network model as having 1 input with 1 time step. 
# The first hidden layer will be an LSTM with 5 units. The output layer will be a fully-connected layer with 1 output.

# define LSTM configuration
n_neurons = length  # 5
n_batch = length  # 5
n_epoch = 1000

# create LSTM
model = Sequential()
model.add(LSTM(n_neurons, input_shape=(1, 1)))
model.add(Dense(1))

model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 5)                 140       
_________________________________________________________________
dense (Dense)                (None, 1)                 6         
Total params: 146
Trainable params: 146
Non-trainable params: 0
_________________________________________________________________
None


In [8]:
# We can see that the LSTM layer has 140 parameters. This is calculated from the number of inputs (1, i.e. n_features)
# and the number of outputs (5, for the 5 units in the hidden layer), as follows. The "+ 1" accounts for the bias terms.
# n = 4 * ((inputs + 1) * outputs + outputs^2)
# n = 4 * ((1 + 1) * 5 + 5^2)
# n = 4 * 35
# n = 140

# We can also see that the fully connected layer only has 6 parameters for the number of inputs 
# (5 for the 5 inputs from the previous layer), number of outputs (1 for the 1 neuron in the layer), and the bias.
# n = inputs * outputs + outputs
# n = 5 * 1 + 1
# n = 6
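
The two formulas above can be wrapped in small helper functions to check the counts reported by `model.summary()`. This is a minimal sketch; the function names `lstm_param_count` and `dense_param_count` are hypothetical and are not part of Keras.

```python
def lstm_param_count(n_inputs, n_units):
    """Parameters in an LSTM layer, following the formula in the text.

    The factor of 4 covers the four weight sets of an LSTM unit
    (three gates plus the candidate cell state); the "+ 1" is the bias.
    """
    return 4 * ((n_inputs + 1) * n_units + n_units ** 2)


def dense_param_count(n_inputs, n_outputs):
    """Parameters in a fully connected layer: weights plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


print(lstm_param_count(1, 5))   # 140
print(dense_param_count(5, 1))  # 6
```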


In [9]:
# train LSTM
model.fit(X, y, epochs=n_epoch, batch_size=n_batch, verbose=2)

Epoch 1/1000
1/1 - 1s - loss: 0.3208
Epoch 2/1000
1/1 - 0s - loss: 0.3184
Epoch 3/1000
1/1 - 0s - loss: 0.3160
Epoch 4/1000
1/1 - 0s - loss: 0.3137
Epoch 5/1000
1/1 - 0s - loss: 0.3114
Epoch 6/1000
1/1 - 0s - loss: 0.3091
Epoch 7/1000
1/1 - 0s - loss: 0.3068
Epoch 8/1000
1/1 - 0s - loss: 0.3045
Epoch 9/1000
1/1 - 0s - loss: 0.3022
Epoch 10/1000
1/1 - 0s - loss: 0.3000
Epoch 11/1000
1/1 - 0s - loss: 0.2977
Epoch 12/1000
1/1 - 0s - loss: 0.2955
Epoch 13/1000
1/1 - 0s - loss: 0.2933
Epoch 14/1000
1/1 - 0s - loss: 0.2912
Epoch 15/1000
1/1 - 0s - loss: 0.2890
Epoch 16/1000
1/1 - 0s - loss: 0.2869
Epoch 17/1000
1/1 - 0s - loss: 0.2847
Epoch 18/1000
1/1 - 0s - loss: 0.2826
Epoch 19/1000
1/1 - 0s - loss: 0.2805
Epoch 20/1000
1/1 - 0s - loss: 0.2785
Epoch 21/1000
1/1 - 0s - loss: 0.2764
Epoch 22/1000
1/1 - 0s - loss: 0.2744
Epoch 23/1000
1/1 - 0s - loss: 0.2724
Epoch 24/1000
1/1 - 0s - loss: 0.2704
Epoch 25/1000
1/1 - 0s - loss: 0.2684
Epoch 26/1000
1/1 - 0s - loss: 0.2664
Epoch 27/1000
1/1 - 0

<tensorflow.python.keras.callbacks.History at 0x7f82c495ba10>

In [10]:
# evaluate
result = model.predict(X, batch_size=n_batch, verbose=0)

# result has shape (5, 1); iterate over the single output column to get scalar values
for value in result[:, 0]:
    print('%.1f' % value)

0.1
0.2
0.4
0.6
0.8


In [11]:
# The network learns the prediction problem closely. In this run only the first value is slightly off (0.1 instead of 0.0).

<h1>Many-to-One LSTM for Sequence Prediction (without TimeDistributed)</h1>

In [12]:
# In this section, we develop an LSTM to output the sequence all at once, although without the TimeDistributed wrapper layer.

# The input for LSTMs must be three dimensional. We can reshape the 1D sequence into a 3D array 
# with 1 sample, 5 time steps, and 1 feature. We will define the output as 1 sample with 5 features.

X = seq.reshape(1, 5, 1)
y = seq.reshape(1, 5)

In [13]:
# Immediately, you can see that the problem definition must be adjusted slightly to support sequence prediction 
# without a TimeDistributed wrapper. Specifically, the network outputs the whole sequence as one vector instead of 
# building the output sequence one step at a time. The difference may sound subtle, but it is important for 
# understanding the role of the TimeDistributed wrapper.
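
The difference between the two framings is easiest to see in the array shapes. The sketch below places them side by side; the variable names with `_one_to_one` and `_many_to_one` suffixes are chosen here for illustration and do not appear in the notebook.

```python
from numpy import array

length = 5
seq = array([i / float(length) for i in range(length)])

# One-to-one: each sequence item is its own sample with one time step,
# and each target is a single value.
X_one_to_one = seq.reshape(length, 1, 1)
y_one_to_one = seq.reshape(length, 1)

# Many-to-one: the whole sequence is a single sample with 5 time steps,
# and the target is one vector of 5 values.
X_many_to_one = seq.reshape(1, length, 1)
y_many_to_one = seq.reshape(1, length)

print(X_one_to_one.shape, y_one_to_one.shape)    # (5, 1, 1) (5, 1)
print(X_many_to_one.shape, y_many_to_one.shape)  # (1, 5, 1) (1, 5)
```

In both cases the underlying values are the same; only the grouping into samples, time steps, and output features changes.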

In [14]:
# create LSTM
model = Sequential()
model.add(LSTM(5, input_shape=(5, 1)))
model.add(Dense(length))

model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 5)                 140       
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 30        
Total params: 170
Trainable params: 170
Non-trainable params: 0
_________________________________________________________________
None


In [15]:
# train LSTM
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
1/1 - 1s - loss: 0.2410
Epoch 2/500
1/1 - 0s - loss: 0.2389
Epoch 3/500
1/1 - 0s - loss: 0.2368
Epoch 4/500
1/1 - 0s - loss: 0.2347
Epoch 5/500
1/1 - 0s - loss: 0.2326
Epoch 6/500
1/1 - 0s - loss: 0.2305
Epoch 7/500
1/1 - 0s - loss: 0.2285
Epoch 8/500
1/1 - 0s - loss: 0.2265
Epoch 9/500
1/1 - 0s - loss: 0.2244
Epoch 10/500
1/1 - 0s - loss: 0.2224
Epoch 11/500
1/1 - 0s - loss: 0.2204
Epoch 12/500
1/1 - 0s - loss: 0.2184
Epoch 13/500
1/1 - 0s - loss: 0.2165
Epoch 14/500
1/1 - 0s - loss: 0.2145
Epoch 15/500
1/1 - 0s - loss: 0.2126
Epoch 16/500
1/1 - 0s - loss: 0.2106
Epoch 17/500
1/1 - 0s - loss: 0.2087
Epoch 18/500
1/1 - 0s - loss: 0.2068
Epoch 19/500
1/1 - 0s - loss: 0.2049
Epoch 20/500
1/1 - 0s - loss: 0.2030
Epoch 21/500
1/1 - 0s - loss: 0.2012
Epoch 22/500
1/1 - 0s - loss: 0.1993
Epoch 23/500
1/1 - 0s - loss: 0.1975
Epoch 24/500
1/1 - 0s - loss: 0.1956
Epoch 25/500
1/1 - 0s - loss: 0.1938
Epoch 26/500
1/1 - 0s - loss: 0.1920
Epoch 27/500
1/1 - 0s - loss: 0.1902
Epoch 28/5

<tensorflow.python.keras.callbacks.History at 0x7f82c1645250>

In [16]:
# evaluate
result = model.predict(X, batch_size=n_batch, verbose=0)

for value in result[0,:]:
    print('%.1f' % value)

0.0
0.2
0.4
0.6
0.8


In [17]:
# Output is a vector and not a scalar
print(result[0])

[1.0412186e-06 2.0006351e-01 4.0003258e-01 5.9973603e-01 8.0006385e-01]


In [18]:
# The sequence is reproduced correctly, but as a single piece rather than stepwise through the input data. 
# We could just as well have used a Dense layer as the first hidden layer instead of an LSTM, since this usage does 
# not take much advantage of the LSTM's full capability for sequence learning and processing.
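
To illustrate why a Dense layer would be enough here: a linear fully connected layer with 5 inputs and 5 outputs echoes its input exactly when its weight matrix is the identity and its bias is zero. The sketch below sets those values by hand with NumPy rather than training a Keras model; `weights` and `bias` are illustrative assumptions, not parameters taken from the notebook.

```python
from numpy import array, eye, zeros, allclose

length = 5
seq = array([i / float(length) for i in range(length)])

# Treat the whole sequence as one sample with 5 input features.
X = seq.reshape(1, length)

# Hand-set parameters of a single linear dense layer (assumed, not trained).
weights = eye(length)  # 5 x 5 identity matrix
bias = zeros(length)

# Forward pass of a dense layer with no activation: X @ W + b.
output = X @ weights + bias

print(output[0])
print(allclose(output[0], seq))  # True
```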