# Problem Description: Learn the Alphabet
![](https://imgur.com/jdhQR4m.png)

In this tutorial we are going to develop and contrast a number of different LSTM recurrent neural network models.

The context of these comparisons will be a simple sequence prediction problem of learning the alphabet. That is, given a letter of the alphabet, predict the next letter of the alphabet.

This is a simple sequence prediction problem that, once understood, can be generalized to other sequence prediction problems such as time series prediction and sequence classification.

In [1]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


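Next, we seed NumPy's random number generator to make the results more repeatable each time the code is executed. Note that this alone may not make runs identical, because the TensorFlow backend has its own sources of randomness.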

In [2]:
# fix random seed for reproducibility
numpy.random.seed(7)

We can now define our dataset, the alphabet. We define the alphabet in uppercase characters for readability.

Neural networks model numbers, so we need to map the letters of the alphabet to integer values. We can do this easily by creating a dictionary (map) from each character to its index in the alphabet. We also create the reverse lookup, from index to character, so that predictions can be converted back into characters later.

In [3]:
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

In [4]:
print("The letters correspond to the numbers: \n", char_to_int)
print("\n")

print("The numbers correspond to the letters: \n", int_to_char)

The letters correspond to the numbers: 
 {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9, 'K': 10, 'L': 11, 'M': 12, 'N': 13, 'O': 14, 'P': 15, 'Q': 16, 'R': 17, 'S': 18, 'T': 19, 'U': 20, 'V': 21, 'W': 22, 'X': 23, 'Y': 24, 'Z': 25}


The numbers correspond to the letters: 
 {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J', 10: 'K', 11: 'L', 12: 'M', 13: 'N', 14: 'O', 15: 'P', 16: 'Q', 17: 'R', 18: 'S', 19: 'T', 20: 'U', 21: 'V', 22: 'W', 23: 'X', 24: 'Y', 25: 'Z'}
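
As a quick check that the two lookups are inverses of each other, here is a minimal sketch that encodes a word and decodes it again. The sample word `LSTM` is only an illustration and is not part of the original notebook.

```python
# Rebuild the two lookup tables exactly as defined in the tutorial.
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

# Hypothetical sample word, chosen only for illustration.
word = "LSTM"
encoded = [char_to_int[c] for c in word]
decoded = "".join(int_to_char[i] for i in encoded)

print(encoded)  # [11, 18, 19, 12]
print(decoded)  # LSTM
```

Decoding the encoded integers returns the original word, which is the same step we rely on later when turning predicted class indices back into letters.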


Now we need to create our input and output pairs on which to train our neural network. We can do this by defining an input sequence length, then reading sequences from the input alphabet sequence.

For example, we use an input length of 1. Starting at the beginning of the raw input data, we read off the first letter “A” as the input and the next letter “B” as the prediction. We move along one character and repeat until we reach a prediction of “Z”.

In [5]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z
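
The same sliding-window idea works for any input length. The sketch below wraps it in a small helper; the name `make_pairs` is hypothetical and is not used elsewhere in this tutorial.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def make_pairs(text, seq_length):
    """Return (input window, next character) pairs read from text."""
    pairs = []
    for i in range(len(text) - seq_length):
        pairs.append((text[i:i + seq_length], text[i + seq_length]))
    return pairs

pairs_1 = make_pairs(alphabet, 1)
pairs_3 = make_pairs(alphabet, 3)
print(len(pairs_1), pairs_1[0], pairs_1[-1])  # 25 ('A', 'B') ('Y', 'Z')
print(len(pairs_3), pairs_3[0], pairs_3[-1])  # 23 ('ABC', 'D') ('WXY', 'Z')
```

With an input length of 1 we get the 25 pairs printed above. A longer window gives fewer pairs, because the last window must still leave one letter to predict.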


We need to reshape the NumPy array into a format expected by the LSTM networks, that is `[samples, time steps, features]`.

Once reshaped, we can then normalize the input integers to the range 0-to-1, the range of the sigmoid activation functions used by the LSTM network.

Finally, we can think of this problem as a sequence classification task, where each of the 26 letters represents a different class. As such, we can convert the output (y) to a one-hot encoding, using the Keras built-in function `to_categorical()`.

In [8]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))

# normalize
X = X / float(len(alphabet))

# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [10]:
print("X shape: ", X.shape) 
print("y shape: ", y.shape)

X shape:  (25, 1, 1)
y shape:  (25, 26)
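
To make the three preparation steps concrete, here is a NumPy-only sketch that reproduces the same shapes. The one-hot step indexes into an identity matrix as a stand-in for `to_categorical()`; it is an illustration of the idea, not the Keras implementation.

```python
import numpy

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
dataX = [[i] for i in range(25)]     # inputs A..Y as integer codes
dataY = [i + 1 for i in range(25)]   # targets B..Z as integer codes

# Reshape to [samples, time steps, features], then scale by the alphabet size.
X = numpy.reshape(dataX, (len(dataX), 1, 1)) / float(len(alphabet))

# One-hot encode by selecting rows of an identity matrix
# (assumption: a stand-in for to_categorical, not the Keras code).
y = numpy.eye(max(dataY) + 1)[dataY]

print(X.shape, y.shape)      # (25, 1, 1) (25, 26)
print(int(y[0].argmax()))    # 1, i.e. the target for 'A' is 'B'
```

Each row of `y` contains a single 1 in the column of the target letter, which is why `y` has 26 columns even though only 25 samples exist.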


## Naive LSTM for Learning One-Char to One-Char Mapping

Let’s start off by designing a simple LSTM to learn how to predict the next character in the alphabet given the context of just one character.

We will frame the problem as a random collection of one-letter input to one-letter output pairs. As we will see, this is a difficult framing of the problem for the LSTM to learn.

Let’s define an LSTM network with 32 units and an output layer with a softmax activation function for making predictions. Because this is a multi-class classification problem, we can use the log loss function (called `categorical_crossentropy` in Keras) and optimize the network using the Adam optimization algorithm.

The model is fit over 500 epochs with a batch size of 1.


In [20]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 32)                4352      
_________________________________________________________________
dense_1 (Dense)              (None, 26)                858       
Total params: 5,210
Trainable params: 5,210
Non-trainable params: 0
_________________________________________________________________
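
The parameter counts in the summary can be checked by hand. The sketch below uses the standard counting for an LSTM layer (four sets of input weights, recurrent weights, and biases) and for a dense layer; the helper names are hypothetical.

```python
def lstm_param_count(units, input_dim):
    # Four gate/candidate blocks, each with input weights,
    # recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # One weight per input-output connection plus one bias per output.
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=32, input_dim=1)
dense_params = dense_param_count(inputs=32, outputs=26)
print(lstm_params, dense_params, lstm_params + dense_params)  # 4352 858 5210
```

These values match the `Param #` column and the total reported by `model.summary()`.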


In [21]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
 - 4s - loss: 3.2660 - acc: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.2582 - acc: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.2551 - acc: 0.0400
Epoch 4/500
 - 0s - loss: 3.2524 - acc: 0.0400
Epoch 5/500
 - 0s - loss: 3.2495 - acc: 0.0400
Epoch 6/500
 - 0s - loss: 3.2470 - acc: 0.0400
Epoch 7/500
 - 0s - loss: 3.2439 - acc: 0.0400
Epoch 8/500
 - 0s - loss: 3.2411 - acc: 0.0400
Epoch 9/500
 - 0s - loss: 3.2377 - acc: 0.0400
Epoch 10/500
 - 0s - loss: 3.2347 - acc: 0.0400
Epoch 11/500
 - 0s - loss: 3.2311 - acc: 0.0400
Epoch 12/500
 - 0s - loss: 3.2275 - acc: 0.0400
Epoch 13/500
 - 0s - loss: 3.2235 - acc: 0.0400
Epoch 14/500
 - 0s - loss: 3.2202 - acc: 0.0400
Epoch 15/500
 - 0s - loss: 3.2159 - acc: 0.0400
Epoch 16/500
 - 0s - loss: 3.2115 - acc: 0.0400
Epoch 17/500
 - 0s - loss: 3.2064 - acc: 0.0400
Epoch 18/500
 - 0s - loss: 3.2014 - acc: 0.0400
Epoch 19/500
 - 0s - loss: 3.1967 - acc: 0.0400
Epoch 20/500
 - 0s - loss: 3.1908 - acc: 0.0400
Epoch 21/500
 - 0s - loss: 3.1851 - acc: 
...

 - 0s - loss: 2.2016 - acc: 0.2400
Epoch 171/500
 - 0s - loss: 2.1997 - acc: 0.2800
Epoch 172/500
 - 0s - loss: 2.1956 - acc: 0.3200
Epoch 173/500
 - 0s - loss: 2.1933 - acc: 0.3600
Epoch 174/500
 - 0s - loss: 2.1899 - acc: 0.2800
Epoch 175/500
 - 0s - loss: 2.1878 - acc: 0.2800
Epoch 176/500
 - 0s - loss: 2.1841 - acc: 0.2800
Epoch 177/500
 - 0s - loss: 2.1813 - acc: 0.2800
Epoch 178/500
 - 0s - loss: 2.1787 - acc: 0.3200
Epoch 179/500
 - 0s - loss: 2.1760 - acc: 0.2800
Epoch 180/500
 - 0s - loss: 2.1736 - acc: 0.2800
Epoch 181/500
 - 0s - loss: 2.1696 - acc: 0.2400
Epoch 182/500
 - 0s - loss: 2.1686 - acc: 0.3200
Epoch 183/500
 - 0s - loss: 2.1656 - acc: 0.3600
Epoch 184/500
 - 0s - loss: 2.1622 - acc: 0.3200
Epoch 185/500
 - 0s - loss: 2.1586 - acc: 0.2800
Epoch 186/500
 - 0s - loss: 2.1575 - acc: 0.3600
Epoch 187/500
 - 0s - loss: 2.1547 - acc: 0.3200
Epoch 188/500
 - 0s - loss: 2.1506 - acc: 0.3600
Epoch 189/500
 - 0s - loss: 2.1493 - acc: 0.3600
Epoch 190/500
 - 0s - loss: 2.1474
...

Epoch 338/500
 - 0s - loss: 1.8776 - acc: 0.6000
Epoch 339/500
 - 0s - loss: 1.8770 - acc: 0.6400
Epoch 340/500
 - 0s - loss: 1.8721 - acc: 0.6000
Epoch 341/500
 - 0s - loss: 1.8734 - acc: 0.6800
Epoch 342/500
 - 0s - loss: 1.8717 - acc: 0.5600
Epoch 343/500
 - 0s - loss: 1.8705 - acc: 0.4800
Epoch 344/500
 - 0s - loss: 1.8673 - acc: 0.6000
Epoch 345/500
 - 0s - loss: 1.8666 - acc: 0.6000
Epoch 346/500
 - 0s - loss: 1.8647 - acc: 0.6400
Epoch 347/500
 - 0s - loss: 1.8634 - acc: 0.5600
Epoch 348/500
 - 0s - loss: 1.8631 - acc: 0.6000
Epoch 349/500
 - 0s - loss: 1.8616 - acc: 0.6000
Epoch 350/500
 - 0s - loss: 1.8605 - acc: 0.6400
Epoch 351/500
 - 0s - loss: 1.8586 - acc: 0.6800
Epoch 352/500
 - 0s - loss: 1.8563 - acc: 0.5200
Epoch 353/500
 - 0s - loss: 1.8579 - acc: 0.6400
Epoch 354/500
 - 0s - loss: 1.8548 - acc: 0.6800
Epoch 355/500
 - 0s - loss: 1.8529 - acc: 0.6400
Epoch 356/500
 - 0s - loss: 1.8509 - acc: 0.6000
Epoch 357/500
 - 0s - loss: 1.8516 - acc: 0.6400
Epoch 358/500
 - 0s 
...

<keras.callbacks.History at 0x1ccae0844e0>

After we fit the model we can evaluate and summarize the performance on the entire training dataset.

In [22]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 88.00%
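
Two numbers in the training log can be sanity-checked with simple arithmetic. This is a rough sketch, assuming the untrained network starts out close to a uniform guess over the 26 classes.

```python
import math

num_classes = 26
num_samples = 25

# Cross-entropy of a uniform prediction over 26 classes.
uniform_loss = -math.log(1.0 / num_classes)
print(round(uniform_loss, 4))  # 3.2581

# With 25 training pairs, each correct prediction adds 1/25 to the accuracy.
accuracy_step = 1.0 / num_samples
print(accuracy_step)  # 0.04
```

The first-epoch loss of 3.2660 in the log is close to this uniform baseline, and every accuracy value in the log is a multiple of 0.04. The final 88.00% therefore corresponds to 22 of the 25 pairs being predicted correctly.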


We can then re-run the training data through the network and generate predictions, converting both the input and output pairs back into their original character format to get a visual idea of how well the network learned the problem.

In [24]:
# demonstrate some model predictions
for pattern in dataX:
    # Feed each of the 25 input letters to the model, one at a time, and predict the next letter
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction) # The most probable index
    result = int_to_char[index] # Look at what is predicted
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result) 

['A'] -> B
['B'] -> B
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> V
['V'] -> W
['W'] -> Y
['X'] -> Z
['Y'] -> Z
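
We can tie this listing back to the reported accuracy by counting the mistakes. In the sketch below, the `predicted` string is transcribed by hand from the output above, so it reflects this particular run only.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
inputs = alphabet[:-1]     # 'A'..'Y'
expected = alphabet[1:]    # 'B'..'Z'

# Predictions transcribed from the printed output of this run.
predicted = "BBDEFGHIJKLMNOPQRSTUVWYZZ"

# Collect (input, expected, predicted) for every wrong prediction.
misses = [(i, e, p) for i, e, p in zip(inputs, expected, predicted) if e != p]
correct = len(expected) - len(misses)

print(misses)  # [('B', 'C', 'B'), ('W', 'X', 'Y'), ('X', 'Y', 'Z')]
print("%.2f%%" % (100.0 * correct / len(expected)))  # 88.00%
```

Three of the 25 pairs are wrong, which agrees with the 88.00% accuracy reported by `model.evaluate()`.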


We can see that this problem is indeed difficult for the network to learn: even after 500 epochs, it still gets 3 of the 25 training pairs wrong.

The reason is that the LSTM units do not have any context to work with. Each input-output pattern is shown to the network in a random order, and the state of the network is reset after each pattern (every batch contains exactly one pattern).

This is an abuse of the LSTM network architecture: it treats the LSTM like a standard multilayer perceptron.
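
To see why a single time step with a freshly reset state leaves the LSTM without memory, here is a minimal NumPy sketch of one LSTM step. The weights are random, and the gate layout is an assumption made for illustration; it is not taken from the Keras internals.

```python
import numpy

def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))

def lstm_single_step(x, W, U, b, units):
    """One LSTM step starting from a zero state (h0 = c0 = 0)."""
    h0 = numpy.zeros(units)
    c0 = numpy.zeros(units)
    z = x @ W + h0 @ U + b             # pre-activations for all four blocks
    i, f, g, o = numpy.split(z, 4)     # input, forget, candidate, output
    c = sigmoid(f) * c0 + sigmoid(i) * numpy.tanh(g)
    return sigmoid(o) * numpy.tanh(c)

rng = numpy.random.default_rng(0)
units = 4                                   # small hypothetical layer size
W = rng.normal(size=(1, 4 * units))         # input weights
b = rng.normal(size=4 * units)              # biases
U_a = rng.normal(size=(units, 4 * units))   # two unrelated recurrent matrices
U_b = rng.normal(size=(units, 4 * units))

x = numpy.array([3 / 26.0])                 # the letter 'D', normalized as above
h_a = lstm_single_step(x, W, U_a, b, units)
h_b = lstm_single_step(x, W, U_b, b, units)
print(numpy.allclose(h_a, h_b))             # True
```

Because the previous hidden state is all zeros, the recurrent weights never influence the output, so swapping them changes nothing. In this framing the layer acts as a feed-forward function of the current letter alone.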

Next, let’s try a different framing of the problem in order to provide more sequence to the network from which to learn.