# Text generation
The point of this notebook is to develop a (toy) application of LSTMs. We'll follow the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/text_generation) to generate Shakespeare-style text. However, I'll be writing it from scratch, referencing the tutorial only when I get stuck or to check that I'm heading in the right direction.

### Imports

In [37]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

import numpy as np

### Data loading

In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


In [4]:
path_to_file

'/Users/psavala/.keras/datasets/shakespeare.txt'

In [5]:
text = open(path_to_file, 'rb').read().decode('utf-8')

In [17]:
unique_chars = sorted(list(set(text)))

print(f'{len(text)} characters ({len(unique_chars)} unique characters)')

1115394 characters (65 unique characters)


In [18]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [19]:
print(f'Unique characters: {unique_chars}')

Unique characters: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


### Tokenize text

In [20]:
NUM_CHARS = len(unique_chars)

In [23]:
char_to_int = {c: i for i, c in enumerate(unique_chars)}
int_to_char = {i: c for i, c in enumerate(unique_chars)}
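
As a quick sanity check on the two mappings, here is a minimal round-trip sketch on a toy string. The names `toy_vocab`, `toy_char_to_int`, `toy_int_to_char`, `encode`, and `decode` are made up for this example and are not used elsewhere in the notebook.

```python
# Toy vocabulary built the same way as unique_chars above
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))

toy_char_to_int = {c: i for i, c in enumerate(toy_vocab)}
toy_int_to_char = {i: c for i, c in enumerate(toy_vocab)}


def encode(s, mapping):
    # Characters -> integer ids
    return [mapping[c] for c in s]


def decode(ids, mapping):
    # Integer ids -> characters
    return "".join(mapping[i] for i in ids)


ids = encode("hello", toy_char_to_int)
round_trip = decode(ids, toy_int_to_char)
print(ids, round_trip)
```

Because the vocabulary is sorted, the ids are stable across runs, which matters if we ever save a model and reload it later.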

### Training data
We'll create training data by choosing a number of input characters, say `seq_len`, and using the character that immediately follows them (the `seq_len+1`-th) as the target. The input length determines how much past context the LSTM can see, so we'll experiment with the results for different lengths. Note that the windows produced below do not overlap: each one starts where the previous one ended.

In [97]:
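def generate_samples(text, char_to_int_mapping, seq_length=5, n_samples=10**6):
    # Cap n_samples so the last target index (n_samples * seq_length) stays inside the text
    n_samples = min(n_samples, (len(text) - 1) // seq_length)
    X = []
    y = []
    # Non-overlapping windows: seq_length input characters, then the next character as the target
    for i in range(0, n_samples * seq_length, seq_length):
        x = text[i:i+seq_length]
        target = text[i+seq_length]
        
        x = np.array([char_to_int_mapping[c] for c in x])
        target = char_to_int_mapping[target]
        X.append(x)
        y.append(target)
        
    X = np.array(X).reshape(n_samples, seq_length)
    y = np.array(y).reshape(n_samples, 1)
    return X, y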

In [98]:
SEQ_LENGTH = 5

X, y = generate_samples(text, char_to_int, seq_length=SEQ_LENGTH, n_samples=10)

In [99]:
print(X.shape)
print(X)

(10, 5)
[[18 47 56 57 58]
 [ 1 15 47 58 47]
 [64 43 52 10  0]
 [14 43 44 53 56]
 [43  1 61 43  1]
 [54 56 53 41 43]
 [43 42  1 39 52]
 [63  1 44 59 56]
 [58 46 43 56  6]
 [ 1 46 43 39 56]]


In [100]:
print(y)

[[ 1]
 [64]
 [14]
 [43]
 [54]
 [43]
 [63]
 [58]
 [ 1]
 [ 1]]
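
To make the windowing concrete, here is a small sketch on a toy string. The helper `make_windows` and its `stride` parameter are hypothetical (they are not part of `generate_samples`); with `stride == seq_length` it mimics the non-overlapping windows used above, and with `stride=1` it shows the overlapping alternative, which yields many more samples from the same text.

```python
def make_windows(text, seq_length, stride):
    # Each pair is (input characters, the single character that follows them)
    pairs = []
    for start in range(0, len(text) - seq_length, stride):
        window = text[start:start + seq_length]
        target = text[start + seq_length]
        pairs.append((window, target))
    return pairs


toy = "First Citizen"

# Same scheme as generate_samples: windows do not overlap
non_overlapping = make_windows(toy, seq_length=5, stride=5)

# Alternative (an assumption, not what the notebook does): slide by one character
overlapping = make_windows(toy, seq_length=5, stride=1)

print(non_overlapping)
print(len(overlapping), overlapping[-1])
```

With non-overlapping windows, only every fifth character ever appears as a target, so roughly four fifths of the possible training pairs are never used.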


An obvious next step would be to one-hot encode our data. We do this below. However, it's also possible to replace one-hot encoding with a Keras `Embedding` layer. That is, rather than one-hot encoding the inputs, we ask the model to learn representations on its own. We'll do that as well and compare the results. One potential issue (maybe?) is that while the model learns a vector for each character through the embedding layer, the target values (which are also characters, and thus have a corresponding vector in the embedding layer) never pass through this embedding, so they are still predicted as raw character classes. It would be nice to convert them to their corresponding vectors and have *those* be predicted. Maybe it doesn't matter?

In [124]:
def one_hot(n, dim):
    # One-hot encodes an integer n with 0 <= n < dim
    one_hot_n = np.zeros(dim)
    one_hot_n[n] = 1
    return one_hot_n
    
def undo_one_hot(v):
    return np.argmax(v)

In [127]:
assert undo_one_hot(one_hot(5, 10)) == 5
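
One way to see how one-hot encoding and an embedding layer relate: an embedding is a lookup table, and looking up row `i` gives the same result as multiplying the one-hot vector for `i` by that table. The sketch below checks this in plain NumPy with a small random matrix `E` standing in for learned embedding weights (the sizes and values are made up).

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 6, 3
# Stand-in for a learned embedding table: one row per character id
E = rng.normal(size=(vocab_size, embedding_dim))

ids = np.array([4, 0, 2])

# Route 1: one-hot encode, then multiply by the table (a linear layer without bias)
one_hot_rows = np.eye(vocab_size)[ids]
via_matmul = one_hot_rows @ E

# Route 2: index straight into the table, which is what an embedding lookup does
via_lookup = E[ids]

print(np.allclose(via_matmul, via_lookup))
```

So the embedding model is not seeing different information, it just gets a dense, learned 256-dimensional input per character instead of a sparse 65-dimensional one.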

### Model
Now let's build our model. We will use an (optional) `Embedding` layer, followed by an `LSTM`. Finally, we will use a `Dense` layer to predict the next character.

In [188]:
def my_model(vocab_size, seq_length, embedding=False, embedding_dim=256, lstm_units=1024):
    model = Sequential()
    if embedding:
        model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))
        model.add(LSTM(units=lstm_units))
    else:
        model.add(LSTM(units=lstm_units, input_shape=(seq_length, vocab_size)))
    model.add(Dense(units=vocab_size, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [189]:
model = my_model(vocab_size=NUM_CHARS, seq_length=SEQ_LENGTH)

In [190]:
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_6 (LSTM)                (None, 1024)              4464640   
_________________________________________________________________
dense_6 (Dense)              (None, 65)                66625     
Total params: 4,531,265
Trainable params: 4,531,265
Non-trainable params: 0
_________________________________________________________________


In [191]:
embed_model = my_model(vocab_size=NUM_CHARS, seq_length=SEQ_LENGTH, embedding=True)

In [192]:
embed_model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 5, 256)            16640     
_________________________________________________________________
lstm_7 (LSTM)                (None, 1024)              5246976   
_________________________________________________________________
dense_7 (Dense)              (None, 65)                66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


For both models, the `None` in each output shape is the batch dimension. When an embedding layer is used, each character in a sequence (of length 5) is embedded in 256-dimensional space. An LSTM layer expects each sample to have shape `(timesteps, num_features)`. Without the embedding layer that is `(5, 65)`, i.e. five one-hot vectors. With the embedding layer, the 5 is still the number of timesteps and 256 is now the number of features (instead of the original `NUM_CHARS=65`). Because the LSTM returns only the hidden state from the last timestep, its output is flat (ignoring batch size): a vector of length 1024. That vector is passed to the dense layer, whose softmax turns it into a probability distribution over the 65 characters.
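
The parameter counts in the two summaries can be reproduced by hand, which is a nice check that the shapes are what we think they are. The helper names below are made up for this sketch. It assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and one bias vector.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units


def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized row per vocabulary entry
    return vocab_size * embedding_dim


vocab_size, embedding_dim, lstm_units = 65, 256, 1024

no_embed_total = (lstm_params(vocab_size, lstm_units)
                  + dense_params(lstm_units, vocab_size))

embed_total = (embedding_params(vocab_size, embedding_dim)
               + lstm_params(embedding_dim, lstm_units)
               + dense_params(lstm_units, vocab_size))

print(no_embed_total)  # 4531265
print(embed_total)     # 5330241
```

The extra parameters in the embedding model come mostly from the LSTM's input kernel, which grows because each timestep now has 256 features instead of 65.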

### Test out the model *(no embedding layer)*

In [209]:
N_SAMPLES = 1
X, y = generate_samples(text, char_to_int, seq_length=SEQ_LENGTH, n_samples=N_SAMPLES)

One-hot `X`:

In [227]:
def one_hot_matrix(M, vocab_size):
    n_samples, seq_length = M.shape
    M_oh = np.array([one_hot(r, vocab_size) for r in np.array(M).flatten()]).reshape((n_samples, seq_length, vocab_size))
    # For a target vector (seq_length == 1), drop only the length-1 sequence axis.
    # A bare np.squeeze would also drop the batch axis when n_samples == 1.
    return M_oh[:, 0, :] if seq_length == 1 else M_oh

In [228]:
X_oh = one_hot_matrix(X, NUM_CHARS)
y_oh = one_hot_matrix(y, NUM_CHARS)
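
One-hot encoding `y` is only needed because the model is compiled with `categorical_crossentropy`. Keras also provides `sparse_categorical_crossentropy`, which takes integer class targets directly. The NumPy sketch below (with made-up probabilities) shows why the two are interchangeable for a single sample: with a one-hot target, the categorical cross-entropy sum collapses to the negative log-probability of the true class.

```python
import numpy as np

# Made-up softmax output over a 3-character vocabulary
probs = np.array([0.1, 0.7, 0.2])
target = 1  # integer id of the true next character

# Categorical form: needs a one-hot target vector
one_hot_target = np.eye(len(probs))[target]
categorical = -np.sum(one_hot_target * np.log(probs))

# Sparse form: just pick out the probability of the true class
sparse = -np.log(probs[target])

print(categorical, sparse)
```

Switching the loss would let us pass the integer `y` to `fit` as-is, at the cost of changing the model definition, so it is only an option to keep in mind here.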

Let's do a quick test to see what's happening. We won't train the model yet; we'll just use the initial random weights to make a prediction on the first training sample.

In [234]:
def example_prediction(model, X):
    example_pred = model.predict(X[0].reshape(1, *X[0].shape), batch_size=1)[0]
    return example_pred

In [235]:
example_pred = example_prediction(model, X_oh)

In [236]:
example_pred

array([3.6236871e-02, 6.3006467e-01, 6.1237547e-03, 9.7289749e-06,
       7.6074944e-06, 1.0224881e-02, 4.7651798e-02, 3.8483760e-03,
       1.6810777e-02, 7.5542125e-06, 7.7468310e-03, 8.5015539e-03,
       5.8913282e-03, 7.6052966e-04, 2.5205009e-04, 2.2003289e-04,
       1.4028796e-04, 3.4211713e-04, 1.4300564e-04, 1.8496141e-04,
       7.3084095e-04, 7.0533471e-04, 6.0472354e-05, 1.6821121e-04,
       1.9451063e-04, 1.9055483e-04, 3.6480208e-04, 2.6066133e-04,
       8.7988432e-05, 3.3014010e-05, 2.1228372e-04, 4.4457213e-04,
       3.3739515e-04, 2.9324333e-04, 1.3954485e-04, 3.6241428e-04,
       9.8123146e-06, 2.1711526e-04, 4.0131850e-05, 1.5786275e-02,
       8.6014549e-04, 1.3584991e-03, 1.8380266e-03, 3.8588770e-02,
       1.1158638e-03, 4.3081207e-04, 3.2587126e-02, 3.1348344e-02,
       6.6493521e-05, 2.2204348e-04, 4.4703628e-03, 1.7170900e-03,
       2.9643015e-03, 3.0356895e-02, 1.1907129e-03, 7.0920505e-05,
       1.1149459e-02, 6.7667114e-03, 4.0005515e-03, 9.2774872e

Following the suggestion in the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/text_generation), we will *sample* from the softmax distribution (as opposed to just taking the `argmax`).

In [216]:
def sample_from_softmax(output):
    sampled_output = np.random.choice(np.arange(len(output)), p=output.reshape(len(output)))
    return sampled_output

In [217]:
sampled_output = sample_from_softmax(example_pred)

print(f'Input: {X[0]} (\'{"".join([int_to_char[c] for c in X[0]])}\')\nOutput: {int_to_char[sampled_output]}')

Input: [18 47 56 57 58] ('First')
Output: v
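
A common extension of plain sampling is a *temperature* that sharpens or flattens the distribution before drawing from it. The sketch below is one possible way to do this in NumPy. The function `sample_with_temperature` and its parameters are assumptions for illustration, not something defined earlier in this notebook. It also casts to `float64` and renormalizes, which may help if `np.random.choice` complains that `float32` probabilities do not sum to 1.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    # Returns (sampled index, rescaled distribution)
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64).reshape(-1)

    # Go back to log space, rescale, and re-apply a numerically stable softmax
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    scaled /= scaled.sum()

    return int(rng.choice(len(scaled), p=scaled)), scaled


toy_probs = np.array([0.7, 0.2, 0.1])
rng = np.random.default_rng(0)

# temperature=1.0 leaves the distribution (essentially) unchanged
idx_same, same = sample_with_temperature(toy_probs, temperature=1.0, rng=rng)

# A low temperature pushes almost all the mass onto the most likely class
idx_sharp, sharp = sample_with_temperature(toy_probs, temperature=0.1, rng=rng)

print(same, sharp)
```

As the temperature goes to zero this approaches `argmax`, while temperatures above 1 make unlikely characters more probable.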


### Train the model *(no embedding layer)*
We're finally ready to train the model.

In [231]:
N_SAMPLES = 10**5
X, y = generate_samples(text, char_to_int, seq_length=SEQ_LENGTH, n_samples=N_SAMPLES)

In [232]:
X_oh = one_hot_matrix(X, NUM_CHARS)
y_oh = one_hot_matrix(y, NUM_CHARS)

In [253]:
model.fit(X_oh, y_oh, batch_size=64, epochs=5, validation_split=0.4)

Train on 60000 samples, validate on 40000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x66818d210>

### Generate some predictions and compare to ground truth *(no embedding layer)*

In [244]:
total = 20
correct = 0

for i in range(total):
    pred = sample_from_softmax(example_prediction(model, X_oh[i:]))
    actual = undo_one_hot(y_oh[i])
    correct += int(pred == actual)
    print(f'Predicted: {pred}, Actual: {actual}')
    
print(f'\nAccuracy: {correct}/{total} = {correct/total:.2f}')

Predicted: 63, Actual: 1
Predicted: 64, Actual: 64
Predicted: 20, Actual: 14
Predicted: 43, Actual: 43
Predicted: 61, Actual: 54
Predicted: 8, Actual: 43
Predicted: 42, Actual: 63
Predicted: 51, Actual: 58
Predicted: 1, Actual: 1
Predicted: 17, Actual: 1
Predicted: 43, Actual: 54
Predicted: 0, Actual: 0
Predicted: 42, Actual: 10
Predicted: 57, Actual: 39
Predicted: 59, Actual: 54
Predicted: 0, Actual: 0
Predicted: 57, Actual: 57
Predicted: 41, Actual: 58
Predicted: 1, Actual: 10
Predicted: 50, Actual: 1

Accuracy: 6/20 = 0.30


**In summary:** For the model *without* an embedding layer we trained on 10^5 examples (*not* shuffled, with 40% held out for validation) for five epochs and achieved ~30% accuracy on the first 20 samples of the training set.
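
Because the loop above *samples* each prediction, the measured accuracy changes from run to run. A deterministic alternative is top-1 accuracy: take the `argmax` of each predicted distribution and compare it to the integer target. The sketch below uses a small made-up `probs` array standing in for the output of `model.predict` on a batch; `top1_accuracy` is a hypothetical helper.

```python
import numpy as np


def top1_accuracy(probs, targets):
    # probs: (n_samples, vocab_size) predicted distributions
    # targets: integer class ids, shape (n_samples,) or (n_samples, 1)
    probs = np.asarray(probs)
    targets = np.asarray(targets).reshape(-1)
    predictions = np.argmax(probs, axis=1)
    return float(np.mean(predictions == targets))


# Stand-in for model.predict(...) on four samples with a two-character vocabulary
probs = np.array([[0.1, 0.9],
                  [0.8, 0.2],
                  [0.3, 0.7],
                  [0.6, 0.4]])
targets = np.array([[1], [0], [0], [0]])

acc = top1_accuracy(probs, targets)
print(acc)  # 0.75
```

Predicting a whole batch at once and scoring it this way should also be much faster than calling `predict` once per sample.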

### Train the model *(with embedding layer)*

In [254]:
embed_model.fit(X, y_oh, batch_size=64, epochs=5, validation_split=0.4)

Train on 60000 samples, validate on 40000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x667529b10>

In [251]:
total = 20
correct = 0

for i in range(total):
    pred = sample_from_softmax(example_prediction(embed_model, X[i:]))
    actual = undo_one_hot(y_oh[i])
    correct += int(pred == actual)
    print(f'Predicted: {pred}, Actual: {actual}')
    
print(f'\nAccuracy: {correct}/{total} = {correct/total:.2f}')

Predicted: 0, Actual: 1
Predicted: 64, Actual: 64
Predicted: 32, Actual: 14
Predicted: 43, Actual: 43
Predicted: 39, Actual: 54
Predicted: 57, Actual: 43
Predicted: 42, Actual: 63
Predicted: 59, Actual: 58
Predicted: 1, Actual: 1
Predicted: 58, Actual: 1
Predicted: 61, Actual: 54
Predicted: 0, Actual: 0
Predicted: 1, Actual: 10
Predicted: 57, Actual: 39
Predicted: 46, Actual: 54
Predicted: 0, Actual: 0
Predicted: 57, Actual: 57
Predicted: 58, Actual: 58
Predicted: 2, Actual: 10
Predicted: 1, Actual: 1

Accuracy: 8/20 = 0.40


So the model *with* an embedding layer had a nice increase in accuracy (from 30% to 40%, i.e. 10 percentage points, or ~33% in relative terms). Of course, huge grain of salt here since we're testing on just 20 examples from the training set, the predictions are themselves randomly sampled, and training was brief and used relatively few samples. But still, interesting to see.

Finally, let's shuffle the samples and train again to see if there is any change.

### Shuffling and retraining

In [None]:
# np.random.shuffle does not return anything, it shuffles in-place
shuffled_idx = np.arange(X.shape[0])
np.random.shuffle(shuffled_idx)

X_oh_shuffled = X_oh[shuffled_idx]
X_shuffled = X[shuffled_idx]
y_oh_shuffled = y_oh[shuffled_idx]

#### Model *without* embedding layer
Note that we are still working with the same (partially trained) model, so this is not a fair comparison of how shuffling affects training. For that we would need to save the model weights *before* any training, and then reload them before each training run.
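
A minimal sketch of that save-and-reload idea is below. It relies only on the `get_weights()` and `set_weights()` methods that Keras models provide. To keep the example runnable without TensorFlow, it uses a hypothetical `StandInModel` class that mimics those two methods, so the helpers `snapshot_weights` and `restore_weights` are illustrative rather than something used elsewhere in this notebook.

```python
import numpy as np


def snapshot_weights(model):
    # Copy the arrays so later training cannot modify the snapshot
    return [np.copy(w) for w in model.get_weights()]


def restore_weights(model, snapshot):
    model.set_weights([np.copy(w) for w in snapshot])


class StandInModel:
    """Hypothetical stand-in exposing the same two methods as a Keras model."""

    def __init__(self):
        self._weights = [np.zeros((2, 2)), np.zeros(2)]

    def get_weights(self):
        return [w.copy() for w in self._weights]

    def set_weights(self, weights):
        self._weights = [np.asarray(w).copy() for w in weights]

    def fake_train_step(self):
        # Pretend training changed every weight
        self._weights = [w + 1.0 for w in self._weights]


m = StandInModel()
initial = snapshot_weights(m)   # take this BEFORE any training

m.fake_train_step()
changed = m.get_weights()

restore_weights(m, initial)     # back to the untrained state
restored = m.get_weights()

print(changed[0].sum(), restored[0].sum())
```

With a real Keras model, restoring the weights this way probably does not reset the optimizer's internal state (e.g. Adam's moment estimates), so rebuilding and recompiling the model with `my_model(...)` may be the cleaner way to get a truly fresh start.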

In [269]:
model.fit(X_oh_shuffled, y_oh_shuffled, batch_size=64, epochs=5, validation_split=0.4)

Train on 60000 samples, validate on 40000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
 1024/60000 [..............................] - ETA: 3:18 - loss: 1.1173 - accuracy: 0.6448

KeyboardInterrupt: 

#### Model *with* embedding layer

In [None]:
embed_model.fit(X_shuffled, y_oh_shuffled, batch_size=64, epochs=5, validation_split=0.4)