# Problem Description: Learn the Alphabet
![](https://imgur.com/jdhQR4m.png)

In this tutorial we are going to develop and contrast a number of different LSTM recurrent neural network models.

The context of these comparisons will be a simple sequence prediction problem of learning the alphabet. That is, given a letter of the alphabet, predict the next letter of the alphabet.

This is a simple sequence prediction problem that once understood can be generalized to other sequence prediction problems like time series prediction and sequence classification.

In [1]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


Next, we can seed the random number generator to ensure that the results are the same each time the code is executed.

In [2]:
# fix random seed for reproducibility
numpy.random.seed(7)

We can now define our dataset, the alphabet. We define the alphabet in uppercase characters for readability.

Neural networks model numbers, so we need to map the letters of the alphabet to integer values. We can do this easily by creating a dictionary (map) from each character to its index in the alphabet. We can also create a reverse lookup for converting predictions back into characters later.

In [3]:
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

In [4]:
print("The letters correspond to the numbers: \n", char_to_int)
print("\n")

print("The numbers correspond to the letters: \n", int_to_char)

The letters correspond to the numbers: 
 {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9, 'K': 10, 'L': 11, 'M': 12, 'N': 13, 'O': 14, 'P': 15, 'Q': 16, 'R': 17, 'S': 18, 'T': 19, 'U': 20, 'V': 21, 'W': 22, 'X': 23, 'Y': 24, 'Z': 25}


The numbers correspond to the letters: 
 {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J', 10: 'K', 11: 'L', 12: 'M', 13: 'N', 14: 'O', 15: 'P', 16: 'Q', 17: 'R', 18: 'S', 19: 'T', 20: 'U', 21: 'V', 22: 'W', 23: 'X', 24: 'Y', 25: 'Z'}
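
As a quick sanity check on the two lookups, the sketch below round-trips a short string through them. The helper names `encode` and `decode` are illustrative assumptions and are not part of the original notebook.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

def encode(text):
    """Map each uppercase letter to its integer index (hypothetical helper)."""
    return [char_to_int[c] for c in text]

def decode(indices):
    """Map integer indices back to letters (hypothetical helper)."""
    return "".join(int_to_char[i] for i in indices)

codes = encode("CAB")
print(codes)          # [2, 0, 1]
print(decode(codes))  # CAB
```

Any string made only of uppercase letters should survive the round trip unchanged; other characters would raise a `KeyError` with these dictionaries.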


Now we need to create our input and output pairs on which to train our neural network. We can do this by defining an input sequence length, then reading sequences from the input alphabet sequence.

For example, we use an input length of 1. Starting at the beginning of the raw input data, we read off the first letter “A” as the input and the next letter “B” as the prediction. We move along one character and repeat until we reach a prediction of “Z”.

In [5]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z
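
The sliding-window procedure above can be wrapped in a small function, which also makes the sample count explicit: a window of length `n` over the 26 letters yields `26 - n` pairs. The name `make_pairs` is an assumption made for this sketch, not something the notebook defines.

```python
def make_pairs(sequence, seq_length):
    """Return (input_window, next_item) pairs; hypothetical helper."""
    pairs = []
    for i in range(len(sequence) - seq_length):
        pairs.append((sequence[i:i + seq_length], sequence[i + seq_length]))
    return pairs

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for length in (1, 3):
    pairs = make_pairs(alphabet, length)
    # number of pairs, first pair, last pair
    print(length, len(pairs), pairs[0], pairs[-1])
```

With a window of 1 this gives the 25 pairs printed above; a longer window such as 3 gives 23 pairs, ending with `WXY -> Z`.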


We need to reshape the NumPy array into a format expected by the LSTM networks, that is `[samples, time steps, features]`.

Once reshaped, we can then normalize the input integers to the range 0-to-1, the range of the sigmoid activation functions used by the LSTM network.

Finally, we can think of this problem as a sequence classification task, where each of the 26 letters represents a different class. As such, we can convert the output (y) to a one hot encoding, using the Keras built-in function `to_categorical()`.

In [6]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))

# normalize
X = X / float(len(alphabet))

# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [7]:
print("X shape: ", X.shape) 
print("y shape: ", y.shape)

X shape:  (25, 1, 1)
y shape:  (25, 26)
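
To make the one hot step concrete, here is a NumPy-only sketch of the encoding for the same 25 targets. It assumes the number of classes is taken as the largest label plus one, which is why `y` has 26 columns even though only 25 letters (B to Z) ever appear as targets. It is an illustration of the idea, not the Keras implementation.

```python
import numpy

labels = list(range(1, 26))       # integer targets for B..Z, as in dataY
num_classes = max(labels) + 1     # 26; assumption: inferred from the largest label
one_hot = numpy.eye(num_classes)[labels]

print(one_hot.shape)              # (25, 26)
print(one_hot[0].argmax())        # 1, the index of 'B'
print(one_hot[:, 0].sum())        # 0.0, 'A' is never a target
```

Each row contains a single 1 at the position of the target letter, and `argmax` recovers the integer label.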


## Naive LSTM for Learning One-Char to One-Char Mapping

Let’s start off by designing a simple LSTM to learn how to predict the next character in the alphabet given the context of just one character.

We will frame the problem as a random collection of one-letter input to one-letter output pairs. As we will see this is a difficult framing of the problem for the LSTM to learn.

Let’s define an LSTM network with 32 units and an output layer with a softmax activation function for making predictions. Because this is a multi-class classification problem, we can use the log loss function (called `categorical_crossentropy` in Keras), and optimize the network using the Adam optimization algorithm.

The model is fit over 500 epochs with a batch size of 1.
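
As a rough reference point for reading the training log, the sketch below computes the log loss of a model that spreads its probability evenly over all 26 classes, and the smallest step by which accuracy can change with 25 samples. This is a back-of-the-envelope check, not a statement about how Keras initializes the network.

```python
import math

num_classes = 26
# Log loss when the softmax output is uniform over all classes
uniform_loss = -math.log(1.0 / num_classes)
print(round(uniform_loss, 4))  # 3.2581

# With 25 training samples, each correct prediction is worth 1/25
print(round(1 / 25, 4))        # 0.04
```

A loss near 3.26 in the first epochs therefore means the network is still close to guessing uniformly, and an accuracy of 0.0400 corresponds to exactly one correct sample.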


In [8]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 32)                4352      
_________________________________________________________________
dense_1 (Dense)              (None, 26)                858       
Total params: 5,210
Trainable params: 5,210
Non-trainable params: 0
_________________________________________________________________


In [9]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
 - 4s - loss: 3.2660 - acc: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.2582 - acc: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.2551 - acc: 0.0400
Epoch 4/500
 - 0s - loss: 3.2524 - acc: 0.0400
Epoch 5/500
 - 0s - loss: 3.2495 - acc: 0.0400
Epoch 6/500
 - 0s - loss: 3.2470 - acc: 0.0400
Epoch 7/500
 - 0s - loss: 3.2439 - acc: 0.0400
Epoch 8/500
 - 0s - loss: 3.2411 - acc: 0.0400
Epoch 9/500
 - 0s - loss: 3.2377 - acc: 0.0400
Epoch 10/500
 - 0s - loss: 3.2347 - acc: 0.0400
Epoch 11/500
 - 0s - loss: 3.2311 - acc: 0.0400
Epoch 12/500
 - 0s - loss: 3.2275 - acc: 0.0400
Epoch 13/500
 - 0s - loss: 3.2235 - acc: 0.0400
Epoch 14/500
 - 0s - loss: 3.2202 - acc: 0.0400
Epoch 15/500
 - 0s - loss: 3.2159 - acc: 0.0400
Epoch 16/500
 - 0s - loss: 3.2115 - acc: 0.0400
Epoch 17/500
 - 0s - loss: 3.2064 - acc: 0.0400
Epoch 18/500
 - 0s - loss: 3.2014 - acc: 0.0400
Epoch 19/500
 - 0s - loss: 3.1967 - acc: 0.0400
Epoch 20/500
 - 0s - loss: 3.1908 - acc: 0.0400
Epoch 21/500
 - 0s - loss: 3.1851 - acc: 

 - 0s - loss: 2.2023 - acc: 0.2400
Epoch 171/500
 - 0s - loss: 2.2004 - acc: 0.2800
Epoch 172/500
 - 0s - loss: 2.1963 - acc: 0.3200
Epoch 173/500
 - 0s - loss: 2.1940 - acc: 0.3600
Epoch 174/500
 - 0s - loss: 2.1906 - acc: 0.2800
Epoch 175/500
 - 0s - loss: 2.1885 - acc: 0.2800
Epoch 176/500
 - 0s - loss: 2.1848 - acc: 0.2800
Epoch 177/500
 - 0s - loss: 2.1820 - acc: 0.2800
Epoch 178/500
 - 0s - loss: 2.1793 - acc: 0.3200
Epoch 179/500
 - 0s - loss: 2.1767 - acc: 0.2800
Epoch 180/500
 - 0s - loss: 2.1743 - acc: 0.2800
Epoch 181/500
 - 0s - loss: 2.1703 - acc: 0.2400
Epoch 182/500
 - 0s - loss: 2.1693 - acc: 0.3200
Epoch 183/500
 - 0s - loss: 2.1664 - acc: 0.3600
Epoch 184/500
 - 0s - loss: 2.1629 - acc: 0.3200
Epoch 185/500
 - 0s - loss: 2.1594 - acc: 0.2800
Epoch 186/500
 - 0s - loss: 2.1582 - acc: 0.3600
Epoch 187/500
 - 0s - loss: 2.1555 - acc: 0.3200
Epoch 188/500
 - 0s - loss: 2.1513 - acc: 0.3600
Epoch 189/500
 - 0s - loss: 2.1500 - acc: 0.3600
Epoch 190/500
 - 0s - loss: 2.1481

Epoch 338/500
 - 0s - loss: 1.8776 - acc: 0.6000
Epoch 339/500
 - 0s - loss: 1.8770 - acc: 0.6400
Epoch 340/500
 - 0s - loss: 1.8722 - acc: 0.6000
Epoch 341/500
 - 0s - loss: 1.8734 - acc: 0.6800
Epoch 342/500
 - 0s - loss: 1.8717 - acc: 0.5600
Epoch 343/500
 - 0s - loss: 1.8705 - acc: 0.4800
Epoch 344/500
 - 0s - loss: 1.8673 - acc: 0.6000
Epoch 345/500
 - 0s - loss: 1.8666 - acc: 0.6000
Epoch 346/500
 - 0s - loss: 1.8647 - acc: 0.6400
Epoch 347/500
 - 0s - loss: 1.8634 - acc: 0.5600
Epoch 348/500
 - 0s - loss: 1.8631 - acc: 0.6000
Epoch 349/500
 - 0s - loss: 1.8616 - acc: 0.6000
Epoch 350/500
 - 0s - loss: 1.8604 - acc: 0.6400
Epoch 351/500
 - 0s - loss: 1.8585 - acc: 0.7200
Epoch 352/500
 - 0s - loss: 1.8562 - acc: 0.5200
Epoch 353/500
 - 0s - loss: 1.8578 - acc: 0.6400
Epoch 354/500
 - 0s - loss: 1.8547 - acc: 0.6800
Epoch 355/500
 - 0s - loss: 1.8528 - acc: 0.6400
Epoch 356/500
 - 0s - loss: 1.8507 - acc: 0.6000
Epoch 357/500
 - 0s - loss: 1.8514 - acc: 0.6400
Epoch 358/500
 - 0s 

<keras.callbacks.History at 0x27be8e58550>

After we fit the model we can evaluate and summarize the performance on the entire training dataset.

In [10]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 88.00%


We can then re-run the training data through the network and generate predictions, converting both the input and output pairs back into their original character format to get a visual idea of how well the network learned the problem.

In [11]:
# demonstrate some model predictions
for pattern in dataX:
    # Feed each of the 25 input letters (A to Y) to the model and predict the next letter
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction) # The most probable index
    result = int_to_char[index] # Look at what is predicted
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result) 

['A'] -> B
['B'] -> B
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> V
['V'] -> W
['W'] -> Y
['X'] -> Z
['Y'] -> Z
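
The printed predictions can be checked against the reported accuracy. The sketch below copies the predicted letters from the output above into a string and counts the mismatches; no model is needed to run it.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
inputs = alphabet[:-1]                    # A..Y
expected = alphabet[1:]                   # B..Z
predicted = "BBDEFGHIJKLMNOPQRSTUVWYZZ"   # copied from the output above

# Collect (input, predicted, expected) for every wrong prediction
misses = [(i, p, e) for i, p, e in zip(inputs, predicted, expected) if p != e]
accuracy = 1 - len(misses) / len(inputs)

print(misses)   # [('B', 'B', 'C'), ('W', 'Y', 'X'), ('X', 'Z', 'Y')]
print("{:.2f}%".format(accuracy * 100))  # 88.00%
```

Three of the 25 patterns are wrong, which matches the 88.00% reported by `model.evaluate()`.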


We can see that this problem is indeed difficult for the network to learn.

The reason is, the poor LSTM units do not have any context to work with. Each input-output pattern is shown to the network in a random order and the state of the network is reset after each pattern (each batch where each batch contains one pattern).

This is an abuse of the LSTM network architecture, treating it like a standard multilayer perceptron.

Next, let’s try a different framing of the problem in order to provide more sequence to the network from which to learn.

## Naive LSTM for a Three-Char Feature Window to One-Char Mapping
A popular approach to adding more context to data for multilayer perceptrons is to use the window method.

This is where previous steps in the sequence are provided as additional input features to the network. We can try the same trick to provide more context to the LSTM network.

Here, we increase the sequence length from 1 to 3, for example:

In [12]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 3 
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length] 
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


Each element in the sequence is then provided as a new input feature to the network. This requires a modification of how the input sequences are reshaped in the data preparation step:

Target training tensor structure: `(samples, time_steps, features)` -> `(n, 1, 3)`

The three characters become a single feature vector with 3 elements. Each training sample therefore has only 1 time step, and that time step holds a vector of 3 character features.

In [13]:
# reshape X to be [samples, time steps, features], here (n, 1, 3)
X = numpy.reshape(dataX, (len(dataX), 1, seq_length))  

X = X / float(len(alphabet))

# one hot encode the output variable
y = np_utils.to_categorical(dataY)

print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (23, 1, 3)
y shape:  (23, 26)


In [14]:
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2]))) # Note
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 32)                4608      
_________________________________________________________________
dense_2 (Dense)              (None, 26)                858       
Total params: 5,466
Trainable params: 5,466
Non-trainable params: 0
_________________________________________________________________


In [15]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
 - 1s - loss: 3.2752 - acc: 0.0435
Epoch 2/500
 - 0s - loss: 3.2629 - acc: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.2555 - acc: 0.0000e+00
Epoch 4/500
 - 0s - loss: 3.2486 - acc: 0.0435
Epoch 5/500
 - 0s - loss: 3.2418 - acc: 0.0435
Epoch 6/500
 - 0s - loss: 3.2353 - acc: 0.0435
Epoch 7/500
 - 0s - loss: 3.2291 - acc: 0.0435
Epoch 8/500
 - 0s - loss: 3.2211 - acc: 0.0435
Epoch 9/500
 - 0s - loss: 3.2139 - acc: 0.0435
Epoch 10/500
 - 0s - loss: 3.2053 - acc: 0.0435
Epoch 11/500
 - 0s - loss: 3.1961 - acc: 0.0435
Epoch 12/500
 - 0s - loss: 3.1883 - acc: 0.0435
Epoch 13/500
 - 0s - loss: 3.1775 - acc: 0.0435
Epoch 14/500
 - 0s - loss: 3.1689 - acc: 0.0435
Epoch 15/500
 - 0s - loss: 3.1576 - acc: 0.0435
Epoch 16/500
 - 0s - loss: 3.1482 - acc: 0.0435
Epoch 17/500
 - 0s - loss: 3.1364 - acc: 0.0000e+00
Epoch 18/500
 - 0s - loss: 3.1277 - acc: 0.0435
Epoch 19/500
 - 0s - loss: 3.1157 - acc: 0.0435
Epoch 20/500
 - 0s - loss: 3.1054 - acc: 0.0435
Epoch 21/500
 - 0s - loss: 3.0966 - a

 - 0s - loss: 2.1408 - acc: 0.2609
Epoch 171/500
 - 0s - loss: 2.1398 - acc: 0.2609
Epoch 172/500
 - 0s - loss: 2.1335 - acc: 0.2609
Epoch 173/500
 - 0s - loss: 2.1309 - acc: 0.2609
Epoch 174/500
 - 0s - loss: 2.1271 - acc: 0.3478
Epoch 175/500
 - 0s - loss: 2.1241 - acc: 0.3043
Epoch 176/500
 - 0s - loss: 2.1211 - acc: 0.1739
Epoch 177/500
 - 0s - loss: 2.1195 - acc: 0.3043
Epoch 178/500
 - 0s - loss: 2.1166 - acc: 0.2609
Epoch 179/500
 - 0s - loss: 2.1148 - acc: 0.3043
Epoch 180/500
 - 0s - loss: 2.1108 - acc: 0.2609
Epoch 181/500
 - 0s - loss: 2.1080 - acc: 0.2609
Epoch 182/500
 - 0s - loss: 2.1053 - acc: 0.2174
Epoch 183/500
 - 0s - loss: 2.1016 - acc: 0.3043
Epoch 184/500
 - 0s - loss: 2.0989 - acc: 0.3043
Epoch 185/500
 - 0s - loss: 2.0960 - acc: 0.3043
Epoch 186/500
 - 0s - loss: 2.0926 - acc: 0.2609
Epoch 187/500
 - 0s - loss: 2.0922 - acc: 0.2609
Epoch 188/500
 - 0s - loss: 2.0897 - acc: 0.3043
Epoch 189/500
 - 0s - loss: 2.0860 - acc: 0.3043
Epoch 190/500
 - 0s - loss: 2.0852

Epoch 338/500
 - 0s - loss: 1.7926 - acc: 0.7391
Epoch 339/500
 - 0s - loss: 1.7912 - acc: 0.6087
Epoch 340/500
 - 0s - loss: 1.7914 - acc: 0.6957
Epoch 341/500
 - 0s - loss: 1.7908 - acc: 0.6522
Epoch 342/500
 - 0s - loss: 1.7881 - acc: 0.7826
Epoch 343/500
 - 0s - loss: 1.7844 - acc: 0.6957
Epoch 344/500
 - 0s - loss: 1.7845 - acc: 0.6522
Epoch 345/500
 - 0s - loss: 1.7850 - acc: 0.6522
Epoch 346/500
 - 0s - loss: 1.7812 - acc: 0.7826
Epoch 347/500
 - 0s - loss: 1.7822 - acc: 0.7826
Epoch 348/500
 - 0s - loss: 1.7782 - acc: 0.6957
Epoch 349/500
 - 0s - loss: 1.7781 - acc: 0.6957
Epoch 350/500
 - 0s - loss: 1.7762 - acc: 0.6522
Epoch 351/500
 - 0s - loss: 1.7763 - acc: 0.7826
Epoch 352/500
 - 0s - loss: 1.7734 - acc: 0.6522
Epoch 353/500
 - 0s - loss: 1.7713 - acc: 0.6522
Epoch 354/500
 - 0s - loss: 1.7723 - acc: 0.6522
Epoch 355/500
 - 0s - loss: 1.7702 - acc: 0.6957
Epoch 356/500
 - 0s - loss: 1.7689 - acc: 0.6957
Epoch 357/500
 - 0s - loss: 1.7681 - acc: 0.6957
Epoch 358/500
 - 0s 

<keras.callbacks.History at 0x27c6e634470>

In [18]:
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: {:.2f}%".format(scores[1]*100))

Model Accuracy: 82.61%


In [19]:
# demonstrate some model predictions
for pattern in dataX:
    x = numpy.reshape(pattern, (1, 1, len(pattern)))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)

['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> W
['T', 'U', 'V'] -> X
['U', 'V', 'W'] -> Z
['V', 'W', 'X'] -> Z
['W', 'X', 'Y'] -> Z


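We can see that accuracy did not improve in this run: 82.61% here against 88.00% for the one-character model, a difference that may or may not be real on such a small dataset. This is a simple problem that we were still not able to learn with LSTMs even with the window method.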

Again, this is a misuse of the LSTM network caused by a poor framing of the problem. Indeed, the sequences of letters are time steps of one feature rather than one time step of separate features. We have given the network more context, but not more sequence, which is what it expects.

In the next section, we will give more context to the network in the form of time steps.

## Naive LSTM for a Three-Char Time Step Window to One-Char Mapping
In Keras, the intended use of LSTMs is to provide context in the form of time steps, rather than windowed features like with other network types.

We can take our first example and simply change the sequence length from 1 to 3.

In [20]:
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


The difference is that the reshaping of the input data takes the sequence as a time step sequence of one feature, rather than a single time step of multiple features.

Target training tensor structure: `(samples, time_steps, features)` -> `(n, 3, 1)`

When preparing the training data set, each sample should be shaped as 3 time steps, with each time step holding a feature vector of 1 character.
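
To see the difference between the two layouts side by side, the following sketch reshapes the same two windows both ways. The values are the integer codes for `ABC` and `BCD`; normalization is left out to keep the output readable.

```python
import numpy

windows = [[0, 1, 2], [1, 2, 3]]   # integer codes for ABC and BCD

# Feature window: 1 time step holding 3 features
as_features = numpy.reshape(windows, (len(windows), 1, 3))
# Time step window: 3 time steps holding 1 feature each
as_timesteps = numpy.reshape(windows, (len(windows), 3, 1))

print(as_features.shape, as_features[0].tolist())    # (2, 1, 3) [[0, 1, 2]]
print(as_timesteps.shape, as_timesteps[0].tolist())  # (2, 3, 1) [[0], [1], [2]]
```

The numbers are identical in both tensors; only the axis on which the three letters are placed changes, and that determines whether the LSTM sees them all at once or one after another.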

In [21]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [22]:
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2]))) 
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 32)                4352      
_________________________________________________________________
dense_3 (Dense)              (None, 26)                858       
Total params: 5,210
Trainable params: 5,210
Non-trainable params: 0
_________________________________________________________________


In [23]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
 - 1s - loss: 3.2630 - acc: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.2500 - acc: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.2424 - acc: 0.0435
Epoch 4/500
 - 0s - loss: 3.2356 - acc: 0.0000e+00
Epoch 5/500
 - 0s - loss: 3.2284 - acc: 0.0000e+00
Epoch 6/500
 - 0s - loss: 3.2205 - acc: 0.0435
Epoch 7/500
 - 0s - loss: 3.2123 - acc: 0.0435
Epoch 8/500
 - 0s - loss: 3.2032 - acc: 0.0435
Epoch 9/500
 - 0s - loss: 3.1914 - acc: 0.0435
Epoch 10/500
 - 0s - loss: 3.1815 - acc: 0.0435
Epoch 11/500
 - 0s - loss: 3.1684 - acc: 0.0435
Epoch 12/500
 - 0s - loss: 3.1542 - acc: 0.0435
Epoch 13/500
 - 0s - loss: 3.1394 - acc: 0.0435
Epoch 14/500
 - 0s - loss: 3.1261 - acc: 0.0435
Epoch 15/500
 - 0s - loss: 3.1105 - acc: 0.0435
Epoch 16/500
 - 0s - loss: 3.0982 - acc: 0.0435
Epoch 17/500
 - 0s - loss: 3.0882 - acc: 0.0435
Epoch 18/500
 - 0s - loss: 3.0695 - acc: 0.0435
Epoch 19/500
 - 0s - loss: 3.0563 - acc: 0.0435
Epoch 20/500
 - 0s - loss: 3.0433 - acc: 0.0435
Epoch 21/500
 - 0s - loss: 3.0309

Epoch 170/500
 - 0s - loss: 1.1901 - acc: 0.8261
Epoch 171/500
 - 0s - loss: 1.1745 - acc: 0.8261
Epoch 172/500
 - 0s - loss: 1.1721 - acc: 0.8696
Epoch 173/500
 - 0s - loss: 1.1610 - acc: 0.8261
Epoch 174/500
 - 0s - loss: 1.1635 - acc: 0.8261
Epoch 175/500
 - 0s - loss: 1.1548 - acc: 0.8261
Epoch 176/500
 - 0s - loss: 1.1492 - acc: 0.8261
Epoch 177/500
 - 0s - loss: 1.1409 - acc: 0.8261
Epoch 178/500
 - 0s - loss: 1.1467 - acc: 0.7826
Epoch 179/500
 - 0s - loss: 1.1373 - acc: 0.8261
Epoch 180/500
 - 0s - loss: 1.1295 - acc: 0.8696
Epoch 181/500
 - 0s - loss: 1.1267 - acc: 0.9130
Epoch 182/500
 - 0s - loss: 1.1195 - acc: 0.8261
Epoch 183/500
 - 0s - loss: 1.1110 - acc: 0.8261
Epoch 184/500
 - 0s - loss: 1.1069 - acc: 0.8696
Epoch 185/500
 - 0s - loss: 1.0994 - acc: 0.8261
Epoch 186/500
 - 0s - loss: 1.0945 - acc: 0.8696
Epoch 187/500
 - 0s - loss: 1.0961 - acc: 0.8261
Epoch 188/500
 - 0s - loss: 1.0832 - acc: 0.8696
Epoch 189/500
 - 0s - loss: 1.0840 - acc: 0.9130
Epoch 190/500
 - 0s 

Epoch 338/500
 - 0s - loss: 0.5096 - acc: 1.0000
Epoch 339/500
 - 0s - loss: 0.5039 - acc: 0.9565
Epoch 340/500
 - 0s - loss: 0.4983 - acc: 1.0000
Epoch 341/500
 - 0s - loss: 0.4988 - acc: 1.0000
Epoch 342/500
 - 0s - loss: 0.4979 - acc: 1.0000
Epoch 343/500
 - 0s - loss: 0.5022 - acc: 1.0000
Epoch 344/500
 - 0s - loss: 0.4964 - acc: 1.0000
Epoch 345/500
 - 0s - loss: 0.4925 - acc: 1.0000
Epoch 346/500
 - 0s - loss: 0.4877 - acc: 0.9565
Epoch 347/500
 - 0s - loss: 0.4831 - acc: 1.0000
Epoch 348/500
 - 0s - loss: 0.4793 - acc: 1.0000
Epoch 349/500
 - 0s - loss: 0.4749 - acc: 0.9565
Epoch 350/500
 - 0s - loss: 0.4702 - acc: 1.0000
Epoch 351/500
 - 0s - loss: 0.4739 - acc: 0.9565
Epoch 352/500
 - 0s - loss: 0.4687 - acc: 1.0000
Epoch 353/500
 - 0s - loss: 0.4691 - acc: 0.9565
Epoch 354/500
 - 0s - loss: 0.4753 - acc: 0.9565
Epoch 355/500
 - 0s - loss: 0.4633 - acc: 1.0000
Epoch 356/500
 - 0s - loss: 0.4618 - acc: 1.0000
Epoch 357/500
 - 0s - loss: 0.4573 - acc: 0.9565
Epoch 358/500
 - 0s 

<keras.callbacks.History at 0x27c865afda0>

In [24]:
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: {:.2f}%".format(scores[1]*100))

Model Accuracy: 100.00%


In [25]:
# Feed each 3-character pattern to the model as a tensor of shape (1, 3, 1) for inference
for pattern in dataX:
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)

['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> W
['U', 'V', 'W'] -> X
['V', 'W', 'X'] -> Y
['W', 'X', 'Y'] -> Z


We can see that the model learns the problem perfectly as evidenced by the model evaluation and the example predictions.

But it has learned a simpler problem. Specifically, it has learned to predict the next letter from a sequence of three letters in the alphabet. It can be shown any random sequence of three letters from the alphabet and predict the next letter.

It cannot actually enumerate the alphabet. I expect that a large enough multilayer perceptron network might be able to learn the same mapping using the window method.

The LSTM networks are stateful. They should be able to learn the whole alphabet sequence, but by default the Keras implementation resets the network state after each training batch.

## LSTM with Variable-Length Input to One-Char Output
In this section we explore a variation of the “stateless” LSTM that learns random subsequences of the alphabet, in an effort to build a model that can be given arbitrary letters or subsequences of letters and predict the next letter in the alphabet.

First, we change the framing of the problem. To simplify, we define a maximum input sequence length and set it to a small value like 5 to speed up training. This defines the maximum length of the subsequences of the alphabet that will be drawn for training. In extensions, this could just as well be set to the full alphabet (26) or longer if we allow looping back to the start of the sequence.

We also need to define the number of random sequences to create, in this case 1000. This too could be more or less. I expect fewer patterns are actually required.


In [30]:
# prepare the dataset of input to output pairs encoded as integers
num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
    start = numpy.random.randint(len(alphabet)-2)
    end = numpy.random.randint(start, min(start+max_len,len(alphabet)-1))
    sequence_in = alphabet[start:end+1]
    sequence_out = alphabet[end + 1]
    dataX.append([char_to_int[char] for char in sequence_in])
    dataY.append(char_to_int[sequence_out])
    if i<20:
        print(sequence_in, '->', sequence_out)

MNO -> P
KLM -> N
UVWXY -> Z
BC -> D
RS -> T
WXY -> Z
UV -> W
QR -> S
JKL -> M
J -> K
GH -> I
JKLM -> N
VW -> X
QRSTU -> V
W -> X
WX -> Y
DEF -> G
EF -> G
QRSTU -> V
XY -> Z
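
The sampling logic is easy to get wrong at the boundaries, so it can help to check its invariants separately. The sketch below mirrors the loop above using only the standard library (`random.randrange` excludes its upper bound, like `numpy.random.randint`). The function name `sample_pair` and the seed are assumptions for this example, and the random stream differs from NumPy's, so it will not reproduce the exact samples shown.

```python
import random

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
max_len = 5

def sample_pair(rng):
    """Draw one (subsequence, next_letter) pair; hypothetical helper."""
    start = rng.randrange(len(alphabet) - 2)                             # 0..23
    end = rng.randrange(start, min(start + max_len, len(alphabet) - 1))  # upper bound exclusive
    return alphabet[start:end + 1], alphabet[end + 1]

rng = random.Random(7)  # seed chosen arbitrarily for this sketch
pairs = [sample_pair(rng) for _ in range(1000)]

# Every input has between 1 and max_len letters
assert all(1 <= len(seq) <= max_len for seq, _ in pairs)
# Every target is the letter that follows the last input letter
assert all(alphabet.index(nxt) == alphabet.index(seq[-1]) + 1 for seq, nxt in pairs)
print(sorted({len(seq) for seq, _ in pairs}))
```

Both assertions should hold for any seed, since `end` never exceeds the index of `Y` and is never more than `max_len - 1` positions past `start`.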


The input sequences vary in length between 1 and `max_len` and therefore require zero padding. Here, we use left-hand-side (prefix) padding with the Keras built-in `pad_sequences()` function.

In [31]:
X = pad_sequences(dataX, maxlen=max_len, dtype='float32') # Note
X = numpy.reshape(X, (X.shape[0], max_len, 1))
X = X / float(len(alphabet))
y = np_utils.to_categorical(dataY)

print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (1000, 5, 1)
y shape:  (1000, 26)
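
The left-hand-side padding can be illustrated without Keras. The function `left_pad` below is a simplified stand-in written for this sketch, not the library implementation; it works on a single list of integer codes.

```python
def left_pad(seq, maxlen, value=0):
    """Prefix-pad a list to maxlen; a simplified stand-in for pad_sequences."""
    seq = seq[-maxlen:]   # keep the last maxlen items if the input is too long
    return [value] * (maxlen - len(seq)) + seq

print(left_pad([9, 10, 11], 5))   # [0, 0, 9, 10, 11]  (JKL)
print(left_pad([0], 5))           # [0, 0, 0, 0, 0]    (A)
print(left_pad([0, 1], 5))        # [0, 0, 0, 0, 1]    (AB)
print(left_pad([1], 5))           # [0, 0, 0, 0, 1]    (B)
```

One detail worth noticing: because `A` is encoded as 0, the same value used for padding, the inputs `AB` and `B` become identical after padding. That is harmless for this task, since the target depends only on the last letter, but a mapping that reserves 0 for padding would avoid the ambiguity.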


In [41]:
batch_size = 32
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], 1))) # Note
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_7 (LSTM)                (None, 32)                4352      
_________________________________________________________________
dense_7 (Dense)              (None, 26)                858       
Total params: 5,210
Trainable params: 5,210
Non-trainable params: 0
_________________________________________________________________


In [42]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)

Epoch 1/500
 - 2s - loss: 3.2469 - acc: 0.0550
Epoch 2/500
 - 0s - loss: 3.2127 - acc: 0.0570
Epoch 3/500
 - 0s - loss: 3.1585 - acc: 0.0500
Epoch 4/500
 - 0s - loss: 3.0961 - acc: 0.0540
Epoch 5/500
 - 0s - loss: 3.0472 - acc: 0.0630
Epoch 6/500
 - 0s - loss: 3.0044 - acc: 0.0770
Epoch 7/500
 - 0s - loss: 2.9652 - acc: 0.1010
Epoch 8/500
 - 0s - loss: 2.9190 - acc: 0.1120
Epoch 9/500
 - 0s - loss: 2.8719 - acc: 0.1220
Epoch 10/500
 - 0s - loss: 2.8156 - acc: 0.1410
Epoch 11/500
 - 0s - loss: 2.7420 - acc: 0.1510
Epoch 12/500
 - 0s - loss: 2.6656 - acc: 0.1410
Epoch 13/500
 - 0s - loss: 2.5895 - acc: 0.1700
Epoch 14/500
 - 0s - loss: 2.5240 - acc: 0.1840
Epoch 15/500
 - 0s - loss: 2.4591 - acc: 0.2250
Epoch 16/500
 - 0s - loss: 2.4009 - acc: 0.2470
Epoch 17/500
 - 0s - loss: 2.3577 - acc: 0.2430
Epoch 18/500
 - 0s - loss: 2.3123 - acc: 0.2340
Epoch 19/500
 - 0s - loss: 2.2725 - acc: 0.2900
Epoch 20/500
 - 1s - loss: 2.2382 - acc: 0.2520
Epoch 21/500
 - 0s - loss: 2.2008 - acc: 0.2930
E

Epoch 171/500
 - 0s - loss: 0.8279 - acc: 0.7590
Epoch 172/500
 - 1s - loss: 0.7992 - acc: 0.8010
Epoch 173/500
 - 0s - loss: 0.7977 - acc: 0.8170
Epoch 174/500
 - 0s - loss: 0.7898 - acc: 0.7960
Epoch 175/500
 - 0s - loss: 0.7864 - acc: 0.8080
Epoch 176/500
 - 0s - loss: 0.7857 - acc: 0.7950
Epoch 177/500
 - 0s - loss: 0.7813 - acc: 0.8000
Epoch 178/500
 - 1s - loss: 0.7861 - acc: 0.7880
Epoch 179/500
 - 1s - loss: 0.7787 - acc: 0.7980
Epoch 180/500
 - 0s - loss: 0.7699 - acc: 0.8060
Epoch 181/500
 - 0s - loss: 0.7633 - acc: 0.8020
Epoch 182/500
 - 0s - loss: 0.7615 - acc: 0.8040
Epoch 183/500
 - 0s - loss: 0.7594 - acc: 0.8070
Epoch 184/500
 - 0s - loss: 0.7589 - acc: 0.8090
Epoch 185/500
 - 0s - loss: 0.7577 - acc: 0.8100
Epoch 186/500
 - 0s - loss: 0.7489 - acc: 0.8150
Epoch 187/500
 - 0s - loss: 0.7469 - acc: 0.8110
Epoch 188/500
 - 0s - loss: 0.7449 - acc: 0.8080
Epoch 189/500
 - 0s - loss: 0.7390 - acc: 0.8230
Epoch 190/500
 - 0s - loss: 0.7507 - acc: 0.8010
Epoch 191/500
 - 0s 

 - 0s - loss: 0.4704 - acc: 0.8720
Epoch 339/500
 - 0s - loss: 0.4709 - acc: 0.8700
Epoch 340/500
 - 0s - loss: 0.4712 - acc: 0.8620
Epoch 341/500
 - 0s - loss: 0.4742 - acc: 0.8750
Epoch 342/500
 - 0s - loss: 0.4704 - acc: 0.8670
Epoch 343/500
 - 1s - loss: 0.4762 - acc: 0.8680
Epoch 344/500
 - 0s - loss: 0.4718 - acc: 0.8670
Epoch 345/500
 - 0s - loss: 0.4665 - acc: 0.8680
Epoch 346/500
 - 0s - loss: 0.4877 - acc: 0.8530
Epoch 347/500
 - 0s - loss: 0.4684 - acc: 0.8670
Epoch 348/500
 - 1s - loss: 0.4668 - acc: 0.8700
Epoch 349/500
 - 0s - loss: 0.4687 - acc: 0.8670
Epoch 350/500
 - 0s - loss: 0.4602 - acc: 0.8620
Epoch 351/500
 - 0s - loss: 0.4664 - acc: 0.8690
Epoch 352/500
 - 1s - loss: 0.4622 - acc: 0.8710
Epoch 353/500
 - 0s - loss: 0.4568 - acc: 0.8770
Epoch 354/500
 - 0s - loss: 0.4544 - acc: 0.8740
Epoch 355/500
 - 0s - loss: 0.4526 - acc: 0.8860
Epoch 356/500
 - 0s - loss: 0.4518 - acc: 0.8870
Epoch 357/500
 - 0s - loss: 0.4523 - acc: 0.8770
Epoch 358/500
 - 0s - loss: 0.4602

<keras.callbacks.History at 0x27c8fc568d0>

In [43]:
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: {:.2f}%".format(scores[1]*100))

Model Accuracy: 92.10%


In [44]:
for i in range(20):
    pattern_index = numpy.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = pad_sequences([pattern], maxlen=max_len, dtype='float32')
    x = numpy.reshape(x, (1, max_len, 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)

['O', 'P', 'Q'] -> R
['R', 'S', 'T'] -> U
['L'] -> M
['F', 'G', 'H'] -> I
['E', 'F', 'G', 'H'] -> I
['L', 'M', 'N'] -> O
['V'] -> X
['G', 'H'] -> I
['I'] -> K
['G'] -> G
['W'] -> X
['O', 'P'] -> Q
['D', 'E', 'F'] -> G
['M'] -> M
['J'] -> K
['C', 'D', 'E'] -> F
['N', 'O', 'P', 'Q', 'R'] -> S
['L', 'M', 'N', 'O', 'P'] -> Q
['L', 'M'] -> N
['F', 'G', 'H', 'I'] -> J
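
It can be informative to break the sampled predictions down by input length. The sketch below copies the 20 input and predicted pairs from the output above and counts the hits per length; it is only a tally of this one sample, not an evaluation of the model.

```python
from collections import defaultdict

# (input, predicted) pairs copied from the sample output above
samples = [("OPQ", "R"), ("RST", "U"), ("L", "M"), ("FGH", "I"), ("EFGH", "I"),
           ("LMN", "O"), ("V", "X"), ("GH", "I"), ("I", "K"), ("G", "G"),
           ("W", "X"), ("OP", "Q"), ("DEF", "G"), ("M", "M"), ("J", "K"),
           ("CDE", "F"), ("NOPQR", "S"), ("LMNOP", "Q"), ("LM", "N"), ("FGHI", "J")]

hits = defaultdict(int)
totals = defaultdict(int)
for seq, predicted in samples:
    expected = chr(ord(seq[-1]) + 1)   # next letter after the last input letter
    totals[len(seq)] += 1
    hits[len(seq)] += int(predicted == expected)

for length in sorted(totals):
    print(length, "{}/{}".format(hits[length], totals[length]))
```

In this sample, all four misses fall on one-letter inputs (3 of 7 correct), while every input of two or more letters is predicted correctly. This hints that the single-letter cases are the hardest for the model, although 20 draws are too few to be sure.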


We can see that although the model did not learn the alphabet perfectly from the randomly generated subsequences, it did very well. The model was not tuned and may require more training or a larger network, or both (an exercise for the reader).

This is a natural extension of the fixed-length window models above in that it can handle ad hoc queries, but this time of arbitrary sequence length (up to the max length).

# Reference:
[Understanding Stateful LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/)