# Naive LSTM for a Three-Char Feature Window to One-Char Mapping

A popular way to give a multilayer perceptron more context is the window method.

This is where previous steps in the sequence are provided as additional input features to the network. We can try the same trick to provide more context to the LSTM network.

Here, we increase the sequence length from 1 to 3, for example:

In [4]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils

# fix random seed for reproducibility
numpy.random.seed(7)

# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

Using TensorFlow backend.


ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


Each element in the sequence is then provided as a new input feature to the network. This requires a change to how the input sequences are reshaped in the data preparation step:

In [19]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), 1, seq_length))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [26]:
samples, time_steps, features = X.shape
print("Samples: ", samples)
print("Time Steps: ", time_steps)
print("Features: ", features)

Samples:  23
Time Steps:  1
Features:  3
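
To make the shapes above concrete, the following sketch rebuilds the same arrays with plain NumPy and inspects the first sample. The names `X_demo` and `y_demo` are illustrative, and `np.eye` is used here as an assumed stand-in for `np_utils.to_categorical`; it should give the same one-hot rows for this dataset, since the largest target index is 25.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
seq_length = 3
n_patterns = len(alphabet) - seq_length

# Integer-encode every three-letter window and the letter that follows it.
windows = [[alphabet.index(c) for c in alphabet[i:i + seq_length]]
           for i in range(n_patterns)]
targets = [alphabet.index(alphabet[i + seq_length]) for i in range(n_patterns)]

# One time step holding three features, scaled into the range [0, 1).
X_demo = np.reshape(windows, (n_patterns, 1, seq_length)) / float(len(alphabet))

# Plain-NumPy one-hot encoding (assumed equivalent to to_categorical here).
y_demo = np.eye(len(alphabet))[targets]

print(X_demo.shape)        # samples, time steps, features
print(X_demo[0])           # "ABC" encoded as 0/26, 1/26, 2/26
print(y_demo.shape)        # samples, classes
print(y_demo[0].argmax())  # index of the target letter "D"
```

Each sample is a single row of three scaled integers, so the LSTM sees the whole window at once rather than one letter at a time.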


In [37]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(time_steps, features)))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
 - 1s - loss: 3.2683 - acc: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.2545 - acc: 0.0435
Epoch 3/500
 - 0s - loss: 3.2467 - acc: 0.0435
Epoch 4/500
 - 0s - loss: 3.2393 - acc: 0.0870
Epoch 5/500
 - 0s - loss: 3.2320 - acc: 0.0000e+00
Epoch 6/500
 - 0s - loss: 3.2240 - acc: 0.0000e+00
Epoch 7/500
 - 0s - loss: 3.2162 - acc: 0.0435
Epoch 8/500
 - 0s - loss: 3.2076 - acc: 0.0435
Epoch 9/500
 - 0s - loss: 3.1977 - acc: 0.0000e+00
Epoch 10/500
 - 0s - loss: 3.1874 - acc: 0.0000e+00
Epoch 11/500
 - 0s - loss: 3.1775 - acc: 0.0435
Epoch 12/500
 - 0s - loss: 3.1667 - acc: 0.0435
Epoch 13/500
 - 0s - loss: 3.1554 - acc: 0.0000e+00
Epoch 14/500
 - 0s - loss: 3.1440 - acc: 0.0435
Epoch 15/500
 - 0s - loss: 3.1335 - acc: 0.0000e+00
Epoch 16/500
 - 0s - loss: 3.1212 - acc: 0.0000e+00
Epoch 17/500
 - 0s - loss: 3.1103 - acc: 0.0435
Epoch 18/500
 - 0s - loss: 3.0982 - acc: 0.0435
Epoch 19/500
 - 0s - loss: 3.0882 - acc: 0.0435
Epoch 20/500
 - 0s - loss: 3.0784 - acc: 0.0435
...

Epoch 170/500
 - 0s - loss: 2.0581 - acc: 0.3043
Epoch 171/500
 - 0s - loss: 2.0581 - acc: 0.2609
Epoch 172/500
 - 0s - loss: 2.0516 - acc: 0.3043
Epoch 173/500
 - 0s - loss: 2.0525 - acc: 0.4348
Epoch 174/500
 - 0s - loss: 2.0450 - acc: 0.3913
Epoch 175/500
 - 0s - loss: 2.0431 - acc: 0.3043
Epoch 176/500
 - 0s - loss: 2.0400 - acc: 0.4348
Epoch 177/500
 - 0s - loss: 2.0374 - acc: 0.3913
Epoch 178/500
 - 0s - loss: 2.0366 - acc: 0.3913
Epoch 179/500
 - 0s - loss: 2.0302 - acc: 0.4348
Epoch 180/500
 - 0s - loss: 2.0288 - acc: 0.3913
Epoch 181/500
 - 0s - loss: 2.0259 - acc: 0.3913
Epoch 182/500
 - 0s - loss: 2.0251 - acc: 0.3913
Epoch 183/500
 - 0s - loss: 2.0227 - acc: 0.3478
Epoch 184/500
 - 0s - loss: 2.0191 - acc: 0.3478
Epoch 185/500
 - 0s - loss: 2.0166 - acc: 0.3913
Epoch 186/500
 - 0s - loss: 2.0130 - acc: 0.3913
Epoch 187/500
 - 0s - loss: 2.0101 - acc: 0.4348
Epoch 188/500
 - 0s - loss: 2.0095 - acc: 0.3478
Epoch 189/500
 - 0s - loss: 2.0069 - acc: 0.4783
...

 - 0s - loss: 1.7209 - acc: 0.6522
Epoch 338/500
 - 0s - loss: 1.7172 - acc: 0.6522
Epoch 339/500
 - 0s - loss: 1.7174 - acc: 0.7391
Epoch 340/500
 - 0s - loss: 1.7156 - acc: 0.7391
Epoch 341/500
 - 0s - loss: 1.7137 - acc: 0.6522
Epoch 342/500
 - 0s - loss: 1.7111 - acc: 0.6087
Epoch 343/500
 - 0s - loss: 1.7081 - acc: 0.5217
Epoch 344/500
 - 0s - loss: 1.7113 - acc: 0.6957
Epoch 345/500
 - 0s - loss: 1.7044 - acc: 0.6957
Epoch 346/500
 - 0s - loss: 1.7049 - acc: 0.6087
Epoch 347/500
 - 0s - loss: 1.7055 - acc: 0.6957
Epoch 348/500
 - 0s - loss: 1.7027 - acc: 0.6087
Epoch 349/500
 - 0s - loss: 1.7000 - acc: 0.6087
Epoch 350/500
 - 0s - loss: 1.7015 - acc: 0.6522
Epoch 351/500
 - 0s - loss: 1.6956 - acc: 0.6957
Epoch 352/500
 - 0s - loss: 1.6960 - acc: 0.7391
Epoch 353/500
 - 0s - loss: 1.6956 - acc: 0.6957
Epoch 354/500
 - 0s - loss: 1.6963 - acc: 0.6522
Epoch 355/500
 - 0s - loss: 1.6915 - acc: 0.7391
Epoch 356/500
 - 0s - loss: 1.6927 - acc: 0.6522
Epoch 357/500
 - 0s - loss: 1.6901
...

<keras.callbacks.History at 0x11ff93f28>

In [38]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 86.96%


In [39]:
# demonstrate some model predictions
for pattern in dataX:
    x = numpy.reshape(pattern, (1, 1, len(pattern)))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = "".join([int_to_char[value] for value in pattern])
    print(seq_in, "->", result)

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> W
TUV -> W
UVW -> Z
VWX -> Z
WXY -> Z
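
As a cross-check on the reported accuracy, the predictions printed above can be compared with the true next letter. This is a small sketch that does not need the trained model: the `predicted` string is transcribed by hand from the output above, and the other variable names are illustrative.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
seq_length = 3
n_patterns = len(alphabet) - seq_length

# Predicted letters transcribed from the printed output, in order.
predicted = "DEFGHIJKLMNOPQRSTUWWZZZ"

inputs = [alphabet[i:i + seq_length] for i in range(n_patterns)]
expected = [alphabet[i + seq_length] for i in range(n_patterns)]

# Collect every pattern whose prediction differs from the true next letter.
misses = [(seq, exp, got)
          for seq, exp, got in zip(inputs, expected, predicted)
          if exp != got]
accuracy = 1 - len(misses) / len(inputs)

for seq, exp, got in misses:
    print(seq, "-> predicted", got, "expected", exp)
print("Accuracy: %.2f%%" % (accuracy * 100))
```

The three misses (STU, UVW and VWX) all fall at the end of the alphabet, and 20 correct patterns out of 23 matches the 86.96% reported by `model.evaluate`.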


The model classifies 20 of the 23 patterns correctly (86.96%), a small lift in performance that may or may not be real. This is a simple problem, yet the LSTM still fails to learn it fully, even with the window method.

Again, this is a misuse of the LSTM network caused by a poor framing of the problem. The letters in each window are really three time steps of one feature, not one time step of three separate features. We have given the network more context, but not more sequence, which is what it expects.
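
The difference between the two framings comes down to the shape of the input array. The sketch below uses a single hand-written window (`window` is an illustrative name) to show both layouts side by side; it only demonstrates the reshaping and makes no claim about how either layout trains.

```python
import numpy as np

window = [0, 1, 2]  # "ABC" as integers

# Framing used in this section: one time step carrying three features.
as_features = np.reshape(window, (1, 1, 3))

# Alternative framing: three time steps carrying one feature each.
as_time_steps = np.reshape(window, (1, 3, 1))

print(as_features.shape, as_features.tolist())
print(as_time_steps.shape, as_time_steps.tolist())
```

In the first layout the recurrent layer is stepped only once per sample, so its memory across time steps is never exercised. In the second it would be stepped three times, once per letter.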

In the next section, we will give more context to the network in the form of time steps.