# LSTM with Variable-Length Input to One-Char Output

In the previous section, we found that the Keras “stateful” LSTM was really only a shortcut for replaying the first n sequences; it did not help us learn a generic model of the alphabet.

In this section we explore a variation of the “stateless” LSTM that learns random subsequences of the alphabet, in an effort to build a model that can be given arbitrary letters or subsequences of letters and predict the next letter in the alphabet.

In [3]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
from keras.preprocessing.sequence import pad_sequences

# fix random seed for reproducibility
numpy.random.seed(7)

# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

First, we change the framing of the problem. To keep things simple, we define a maximum input sequence length and set it to a small value, 5, to speed up training. This is the maximum length of the alphabet subsequences drawn for training. As an extension, it could just as well be set to the full alphabet (26), or longer if we allow looping back to the start of the sequence.

We also need to define the number of random sequences to create, in this case 1000. This too could be more or fewer; I expect fewer patterns are actually required.

In [4]:
# prepare the dataset of input to output pairs encoded as integers
num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
    start = numpy.random.randint(len(alphabet)-2)
    end = numpy.random.randint(start, min(start+max_len,len(alphabet)-1))
    sequence_in = alphabet[start:end+1]
    sequence_out = alphabet[end + 1]
    dataX.append([char_to_int[char] for char in sequence_in])
    dataY.append(char_to_int[sequence_out])
    print(sequence_in, '->', sequence_out)


PQRST -> U
W -> X
O -> P
OPQ -> R
IJKLM -> N
QRSTU -> V
ABCD -> E
X -> Y
GHIJ -> K
M -> N
XY -> Z
QRST -> U
ABC -> D
JKLMN -> O
OP -> Q
XY -> Z
D -> E
T -> U
B -> C
QRSTU -> V
HIJ -> K
JKLM -> N
ABCDE -> F
X -> Y
V -> W
DE -> F
DEFG -> H
BCDE -> F
EFGH -> I
BCDE -> F
FG -> H
RST -> U
TUV -> W
STUV -> W
LMN -> O
P -> Q
MNOP -> Q
JK -> L
MNOP -> Q
OPQRS -> T
UVWXY -> Z
PQRS -> T
D -> E
EFGH -> I
IJK -> L
WX -> Y
STUV -> W
MNOPQ -> R
P -> Q
WXY -> Z
VWX -> Y
V -> W
HI -> J
KLMNO -> P
UV -> W
JKL -> M
ABCDE -> F
WXY -> Z
M -> N
CDEF -> G
KLMNO -> P
RST -> U
RS -> T
W -> X
J -> K
WX -> Y
JKLMN -> O
MN -> O
L -> M
BCDE -> F
TU -> V
MNOPQ -> R
NOPQR -> S
HIJ -> K
JKLM -> N
STUVW -> X
QRST -> U
N -> O
VWXY -> Z
B -> C
UVWX -> Y
OP -> Q
K -> L
C -> D
X -> Y
ST -> U
JKLM -> N
B -> C
QR -> S
RS -> T
VWXY -> Z
S -> T
NOP -> Q
KLMNO -> P
IJ -> K
EF -> G
MNOP -> Q
WXY -> Z
HI -> J
P -> Q
STUVW -> X
Q -> R
MN -> O
O -> P
C -> D
L -> M
JKLM -> N
K -> L
IJKLM -> N
FGHIJ -> K
LM -> N
OPQ -> R
U -> V
HIJ

K -> L
VW -> X
GHI -> J
CD -> E
XY -> Z
HI -> J
C -> D
IJK -> L
DEFG -> H
UV -> W
LM -> N
X -> Y
UV -> W
I -> J
NO -> P
ABCD -> E
K -> L
IJK -> L
JKL -> M
EFGHI -> J
JK -> L
TU -> V
IJ -> K
MNOPQ -> R
C -> D
IJKLM -> N
VW -> X
CDE -> F
E -> F
NOP -> Q
OPQRS -> T
FGHI -> J
STUV -> W
IJKLM -> N
STUV -> W
TUVWX -> Y
RSTU -> V
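
The remark that fewer patterns are probably required can be checked by enumerating every pair the sampling loop is able to draw. The following is a minimal sketch that reuses the same bounds as the loop above, with plain `range()` standing in for `numpy.random.randint()` (both exclude the upper bound); the name `distinct_pairs` is introduced here for illustration only.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
max_len = 5

# Enumerate every (start, end) combination the random sampling can produce.
# range() mirrors the half-open bounds of numpy.random.randint().
distinct_pairs = set()
for start in range(len(alphabet) - 2):
    for end in range(start, min(start + max_len, len(alphabet) - 1)):
        distinct_pairs.add((alphabet[start:end + 1], alphabet[end + 1]))

print(len(distinct_pairs))  # prints 114
```

With only 114 distinct input/output pairs available, the 1000 random draws necessarily contain many repeats, which is consistent with the repeated lines in the output above.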


The input sequences vary in length between 1 and **max_len** and therefore require zero padding. Here, we use left-hand-side (prefix) padding with the Keras built-in **pad_sequences()** function.

In [5]:
# convert list of lists to array and pad sequences if needed
X = pad_sequences(dataX, maxlen=max_len, dtype='float32')
# reshape X to be [samples, time steps, features]
X = numpy.reshape(X, (X.shape[0], max_len, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
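
To make the effect of the padding, reshaping, and normalization steps concrete, here is a rough NumPy-only sketch of the same transformation on two short inputs. The helper `left_pad` is a hypothetical stand-in written for this illustration, not the Keras implementation; it assumes the default prefix padding with zeros and that over-long inputs keep their last `length` items.

```python
import numpy

max_len = 5
alphabet_size = 26

def left_pad(sequence, length, value=0.0):
    """Prefix `sequence` with `value` until it has `length` entries."""
    # Keep only the last `length` items if the sequence is too long.
    trimmed = list(sequence)[-length:]
    return [value] * (length - len(trimmed)) + trimmed

# "XY" and "ABC" as integer codes (A=0, ..., Z=25).
examples = [[23, 24], [0, 1, 2]]
padded = numpy.array([left_pad(seq, max_len) for seq in examples], dtype="float32")

# Reshape to [samples, time steps, features] and scale into the range 0-1.
padded = padded.reshape((len(examples), max_len, 1)) / float(alphabet_size)
print(padded[:, :, 0])
```

One detail worth noting: "A" is encoded as 0, the same value used for padding, so after padding "ABC" is indistinguishable from "BC". For this task both inputs share the same target ("D"), so the overlap does not create conflicting training examples.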

In [7]:
samples, time_steps, features = X.shape
print("Samples: ", samples)
print("Time Steps: ", time_steps)
print("Features: ", features)

Samples:  1000
Time Steps:  5
Features:  1


The trained model is evaluated on randomly selected input patterns. These could just as easily be newly generated random sequences of characters. I also believe the input could be a linear sequence seeded with “A”, with each output fed back in as a single-character input.
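
The idea of seeding with “A” and feeding outputs back in can be sketched independently of the network. In the sketch below, `predict_next` is a hypothetical callable that takes a string and returns one predicted character; for the real model it would wrap the pad, reshape, normalize, `model.predict()`, and `argmax` steps used in this section. A perfect lookup stands in for the trained model so the loop can be run on its own.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def generate_from_seed(predict_next, seed="A", steps=5):
    """Feed each predicted character back in as the next one-character input."""
    output = [seed]
    current = seed
    for _ in range(steps):
        current = predict_next(current)
        output.append(current)
    return "".join(output)

# Stand-in for the trained network (an assumption for this sketch):
# always returns the letter that follows the last input character.
def perfect_predictor(sequence):
    return alphabet[(alphabet.index(sequence[-1]) + 1) % len(alphabet)]

print(generate_from_seed(perfect_predictor, seed="A", steps=5))  # prints ABCDEF
```

With a real model in place of `perfect_predictor`, any single wrong prediction would be carried forward into every later step, so this is a stricter test than scoring independent random patterns.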

In [16]:
# create and fit the model
#epochs = 500
epochs = 50
batch_size = 1
model = Sequential()
model.add(LSTM(32, input_shape=(time_steps, 1)))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=epochs, batch_size=batch_size, verbose=2)

Epoch 1/50
 - 10s - loss: 3.0820 - acc: 0.0810
Epoch 2/50
 - 9s - loss: 2.7909 - acc: 0.1180
Epoch 3/50
 - 9s - loss: 2.4745 - acc: 0.1970
Epoch 4/50
 - 9s - loss: 2.2567 - acc: 0.2370
Epoch 5/50
 - 9s - loss: 2.1017 - acc: 0.2870
Epoch 6/50
 - 9s - loss: 1.9748 - acc: 0.3140
Epoch 7/50
 - 9s - loss: 1.8688 - acc: 0.3420
Epoch 8/50
 - 9s - loss: 1.7894 - acc: 0.3650
Epoch 9/50
 - 9s - loss: 1.7093 - acc: 0.4030
Epoch 10/50
 - 9s - loss: 1.6318 - acc: 0.4280
Epoch 11/50
 - 9s - loss: 1.5699 - acc: 0.4590
Epoch 12/50
 - 9s - loss: 1.5047 - acc: 0.4530
Epoch 13/50
 - 9s - loss: 1.4486 - acc: 0.4970
Epoch 14/50
 - 9s - loss: 1.3931 - acc: 0.5160
Epoch 15/50
 - 9s - loss: 1.3370 - acc: 0.5510
Epoch 16/50
 - 9s - loss: 1.3044 - acc: 0.5510
Epoch 17/50
 - 9s - loss: 1.2552 - acc: 0.5710
Epoch 18/50
 - 9s - loss: 1.2039 - acc: 0.6140
Epoch 19/50
 - 9s - loss: 1.1722 - acc: 0.6240
Epoch 20/50
 - 11s - loss: 1.1366 - acc: 0.6370
Epoch 21/50
 - 10s - loss: 1.1077 - acc: 0.6650
Epoch 22/50
 - 10s 

<keras.callbacks.History at 0x12844a898>

In [17]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 76.20%


In [18]:
# demonstrate some model predictions
for i in range(20):
    pattern_index = numpy.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = pad_sequences([pattern], maxlen=max_len, dtype='float32')
    x = numpy.reshape(x, (1, max_len, 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = "".join([int_to_char[value] for value in pattern])
    print(seq_in, "->", result)

C -> D
QRS -> T
TUVW -> X
LM -> N
HI -> J
B -> D
HI -> J
W -> Y
UVWX -> Z
HIJ -> K
ST -> U
KLM -> N
MNOP -> Q
STUV -> W
VWXY -> Z
QRSTU -> V
X -> Y
GHIJK -> L
O -> R
X -> Y
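
Because the correct answer is always the next letter, the sample predictions can be scored mechanically. The sketch below simply transcribes the 20 pairs printed above and counts the misses; `expected_next` is a small helper introduced here for illustration.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# The 20 (input, predicted) pairs printed above.
printed = [
    ("C", "D"), ("QRS", "T"), ("TUVW", "X"), ("LM", "N"), ("HI", "J"),
    ("B", "D"), ("HI", "J"), ("W", "Y"), ("UVWX", "Z"), ("HIJ", "K"),
    ("ST", "U"), ("KLM", "N"), ("MNOP", "Q"), ("STUV", "W"), ("VWXY", "Z"),
    ("QRSTU", "V"), ("X", "Y"), ("GHIJK", "L"), ("O", "R"), ("X", "Y"),
]

def expected_next(seq_in):
    """Return the letter that follows the last character of seq_in."""
    return alphabet[alphabet.index(seq_in[-1]) + 1]

misses = [(s, p) for s, p in printed if p != expected_next(s)]
print(len(printed) - len(misses), "of", len(printed), "correct")  # 16 of 20 correct
print(misses)
```

The four misses are B -> D, W -> Y, UVWX -> Z, and O -> R, which is roughly in line with the reported accuracy for a sample this small.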


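We can see that although the model did not learn the alphabet perfectly from the randomly generated subsequences, it did reasonably well: 76.20% accuracy on the training patterns with 50 epochs configured, and the sampled predictions that are wrong miss by only one or two letters (for example, B -> D and O -> R). The model was not tuned and may require more training, a larger network, or both (an exercise for the reader).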

This is a natural extension of the “all sequential input examples in each batch” alphabet model learned above: it can still handle ad hoc queries, but this time of arbitrary sequence length (up to the maximum length).