# Recurrent neural network for text generation 

- The Keras embedding layer uses a word embedding based on integer encoded inputs, i.e. each word in the vocabulary is represented by an integer, and then this is processed. So it at least initially differs to that of word2vec, since word2vec uses one hot encoding where each word is represented by a vocabulary sized vector of all zeros expect a single 1.
- Intro: https://machinelearningmastery.com/an-introduction-to-recurrent-neural-networks-and-the-math-that-powers-them/
- Based on setup from https://www.tensorflow.org/text/tutorials/text_generation


In [1]:
import tensorflow as tf
import numpy as np
import os
import time
import pandas as pd
import io
import requests

In [2]:
url='https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
path_to_file = tf.keras.utils.get_file('shakespeare.txt', url)
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print(f'Length of text: {len(text)} characters')

Length of text: 1115394 characters


In [3]:
# alternatively, read it in through pandas
s=requests.get(url).content
c=pd.read_table(io.StringIO(s.decode('utf-8')))

In [4]:
print(c.head())
#print(c.describe())

                                      First Citizen:
0      Before we proceed any further, hear me speak.
1                                               All:
2                                      Speak, speak.
3                                     First Citizen:
4  You are all resolved rather to die than to fam...


In [6]:
# Take a look at the first 250 characters in text
print(text[:80])
# The unique characters in the file
vocab = sorted(set(text))
print("\n")
print(f'{len(vocab)} unique characters')

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


65 unique characters


Embedding the strings so that each character has a numerical value 

In [7]:
example_texts = [' abcqrsed efg,', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

<tf.RaggedTensor [[b' ', b'a', b'b', b'c', b'q', b'r', b's', b'e', b'd', b' ', b'e', b'f',
  b'g', b',']                                                            ,
 [b'x', b'y', b'z']]>

In [8]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)

# ids will print out the string in terms of its integer. Remember there are only 65 unique characters in the Shakespeare piece, including 
# punctuation and other symbols like spaces
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[2, 40, 41, 42, 56, 57, 58, 44, 43, 2, 44, 45, 46, 7], [63, 64, 65]]>

Note our network is going to reutrn numerical values, so naturally we want to convert these back to their corpus values (the 65 unqiue characters) and we can do so using tf.keras.layers.StringLookup(..., invert=True). E.g.

In [9]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)
chars = chars_from_ids(ids)
chars
print(tf.strings.reduce_join(chars, axis=-1).numpy())

[b' abcqrsed efg,' b'xyz']


Provided with an input string, the model is trained to predict the next character of the string. We have to create some training data. This is a supervised task since we have an input string which is our x input data, and we also have the character (or string) directly after it which is our y, output data.

Each input "sequence" we break the text into will contain `seq_length` lots of characters. The target or output data, will contain `seq_length` lots of characters also but it will take the last 4 characters of the sequence. E.g. break text into `seq_length+1 = 5`, example "Hello", the input would be "Hell", the output/target would be "ello".

In [10]:
# converting full text into its numerical format.
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
print(all_ids)

ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
[chars_from_ids(ids).numpy().decode('utf-8') for ids in ids_dataset.take(10)]

tf.Tensor([19 48 57 ... 46  9  1], shape=(1115394,), dtype=int64)


['F', 'i', 'r', 's', 't', ' ', 'C', 'i', 't', 'i']

In [11]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

In [12]:
seq_length = 100
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))

tf.Tensor(
[b'F' b'i' b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':'
 b'\n' b'B' b'e' b'f' b'o' b'r' b'e' b' ' b'w' b'e' b' ' b'p' b'r' b'o'
 b'c' b'e' b'e' b'd' b' ' b'a' b'n' b'y' b' ' b'f' b'u' b'r' b't' b'h'
 b'e' b'r' b',' b' ' b'h' b'e' b'a' b'r' b' ' b'm' b'e' b' ' b's' b'p'
 b'e' b'a' b'k' b'.' b'\n' b'\n' b'A' b'l' b'l' b':' b'\n' b'S' b'p' b'e'
 b'a' b'k' b',' b' ' b's' b'p' b'e' b'a' b'k' b'.' b'\n' b'\n' b'F' b'i'
 b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':' b'\n' b'Y'
 b'o' b'u' b' '], shape=(101,), dtype=string)


In [15]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text


split_input_target(seq)
split_input_target(list("Tensorflow"))

(['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'],
 ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])

In [16]:
dataset = sequences.map(split_input_target)
print(dataset.take(1))

<_TakeDataset element_spec=(TensorSpec(shape=(100,), dtype=tf.int64, name=None), TensorSpec(shape=(100,), dtype=tf.int64, name=None))>


In [17]:
for input_example, target_example in dataset.take(3):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
Input : b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you '
Target: b're all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
Input : b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us k"
Target: b"ow Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"


In [18]:
# Batch size
BATCH_SIZE = 32

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 3000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)

In [19]:
print(dataset)

<_PrefetchDataset element_spec=(TensorSpec(shape=(32, 100), dtype=tf.int64, name=None), TensorSpec(shape=(32, 100), dtype=tf.int64, name=None))>


In [20]:
#Randomly selects parts of the text
for input_example, target_example in dataset.take(1):
    print(text_from_ids(input_example).numpy().shape)
    print("Input :", text_from_ids(input_example).numpy()[0]) # Note we only take one from the batch
    print("Target:", text_from_ids(target_example).numpy()[0])

(32,)
Input : b"g'st sweet music. Hark, come hither, Tyrrel\nGo, by this token: rise, and lend thine ear:\nThere is no"
Target: b"'st sweet music. Hark, come hither, Tyrrel\nGo, by this token: rise, and lend thine ear:\nThere is no "


In [21]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())
# The embedding dimension. Embedding layer creates a vector representation of each member of the vocabulary. The vector's size is size embedding_dim
embedding_dim = 128
# Number of RNN units
rnn_units = 512

GRU is Gated Recurrent Unit. Each GRU takes two inputs: previous GRU state value and character embedding input. Each GRU output feeds the consequent GRU and also passes output to the dense layer, after which it will be converted into the logits for each vocab member.

Having set up the input data, we employ the RNN model.
 - super() allows us to initialize the attributes of the parent class.
 - we also define the three core layers: embedding, gru and dense for output. Note the dense layer has `vocab_size` number of nodes, ready to return a `vocab_size` set of logits, so that when we convert to probabilities, we can pick out which term in the vocabulary is the most likely in the sequence.

In [22]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__(self)
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True) # GRU has output vector dim = rnn_units
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [23]:
# Here we inistantiate model
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

In [24]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch) #passing the input data through the model.
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(32, 100, 66) # (batch_size, sequence_length, vocab_size)


In [25]:
example_out =example_batch_predictions.numpy()[1]
print(example_out.shape) # Each character in the sequence, has a 66 length vector attached to it, these are the logits predictions for what the next 
    # character should be in the sequence.
#print(np.argmax(example_out))
print(example_out)
logits_preds=np.argmax(example_out, axis=0)
print(logits_preds)
chars_from_ids(logits_preds) # Looks nonsensical as output, but after all the model hasn't been trained, so the weights aren't tuned
text_from_ids(logits_preds)

(100, 66)
[[-0.00244161 -0.00189861  0.01936328 ... -0.01602551  0.00797732
  -0.00967796]
 [ 0.00205034 -0.00417172  0.00668585 ... -0.00462392 -0.00467548
  -0.01352196]
 [ 0.01252944  0.00064899 -0.00723644 ... -0.01087652  0.00205605
  -0.00039377]
 ...
 [ 0.00885282 -0.01293257  0.00142927 ... -0.01265633  0.02160918
   0.00436883]
 [-0.00017103 -0.0147817   0.00864164 ... -0.01568041  0.00127978
  -0.00010806]
 [ 0.00431343 -0.01233269  0.00233574 ... -0.00498666 -0.00767525
  -0.00856074]]
[54 15 44 44 13 13 31 49 64 45 91 23 67 35 87 92 70 86 49 85 37 46 77  6
 43 70 34 44 71 72 11 85 23 12 49  9 16 73 34 47 91 47  8 10 39 54 71 28
 41 91 70 29 70 48 75 97 72 77 66 98 67 28 92 70 97 74]


<tf.Tensor: shape=(), dtype=string, numpy=b"oBee??Rjyf[UNK]J[UNK]V[UNK][UNK][UNK][UNK]j[UNK]Xg[UNK]'d[UNK]Ue[UNK][UNK]:[UNK]J;j.C[UNK]Uh[UNK]h-3Zo[UNK]Ob[UNK][UNK]P[UNK]i[UNK][UNK][UNK][UNK][UNK][UNK][UNK]O[UNK][UNK][UNK][UNK]">

In [26]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b"lt die, by God's just ordinance,\nEre from this war thou turn a conqueror,\nOr I with grief and extrem"

Next Char Predictions:
 b"pc3.hHoLcesCoSjzfZ[UNK].xzY'wceUndHpmR?Tjj,Nh?b!DaK&Efwc&hjv[UNK]Blfpkamsm:DCTrHEVHtyBbIg J;[UNK]OEmYiCHOTcCYD.!"


 - Now define the loss function so that we can start training the model. The ouput of the model, as seen in the class MyModel definition, is a dense layer of nodes. Since there is no activation function on that layer, the output is just logits. The loss function to compare the output layer to the true/target value is then sparse categorical crossentropy.
 - Finally we compile the network using the adam optimiser to tune the weights.

In [27]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

model.compile(optimizer='adam', loss=loss)

Prediction shape:  (32, 100, 66)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(4.1886606, shape=(), dtype=float32)


In [28]:
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Miniature code after running epoch.

In [29]:
input_chars = tf.strings.unicode_split(["ROMEO: Hath tho"], 'UTF-8')
input_ids=ids_from_chars(input_chars).to_tensor()
predicted_logits, states = model(inputs=input_ids,return_state=True)
predicted_logits = predicted_logits[:, -1, :]
print("Logits for each given vocabulary token: ", predicted_logits)
#print(predicted_logits.numpy()[0][60])
predicted_ids = tf.random.categorical(predicted_logits, num_samples=1) # equivalent to calling argmax in some sense, it is more like randomly sampling
predicted_ids_2 = np.argmax(predicted_logits) 
print([predicted_ids.numpy()[0][0],predicted_ids_2]) # Note: ouput logits can be positive or negative, and we take the largest positive.
#print(predicted_ids.numpy()[0])
predicted_ids = tf.squeeze(predicted_ids, axis=-1)
predicted_chars = chars_from_ids(predicted_ids)
print("Predicted characters: ", predicted_chars.numpy())

Logits for each given vocabulary token:  tf.Tensor(
[[-10.76817      4.124313     6.5988517   -0.55192953  -7.743239
   -7.96808      1.1384192    0.7626041   -1.5175424   -1.8090538
   -7.461964    -1.9752158   -2.0393474   -0.16406521  -6.644651
   -2.7755587   -1.2671316  -12.325125   -16.959589    -5.678003
   -4.1277566   -6.0240493  -10.284635    -5.0300655   -5.7403293
   -4.445474    -1.9379029   -5.581869    -5.4057302   -2.1415179
   -6.220854    -1.4235003   -2.9025276   -1.5184338   -9.805831
   -8.0391      -4.2209473   -8.74412     -7.2839494   -7.2051206
   -2.3716993   -0.9391699   -1.0404209   -0.88366956   4.481169
   -3.888344    -4.516986    -8.12911      2.6696954   -6.8614583
   -1.3988788    0.56288576   5.4198403    3.2455704    5.134699
    1.849685    -5.566955     5.5726066   10.199316     2.081392
   13.586778    -3.8270469    2.454413    -5.724052    -0.34534848
   -4.9702806 ]], shape=(1, 66), dtype=float32)
[60, 60]
Predicted characters:  [b'u']


The tensor flow example also provides an iterable process to generate a set of text as seen below.

In [30]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask
    
    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)
      
    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states

In [31]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

In [34]:
start = time.time()
states = None
next_char = tf.constant(['JULIET: would thou'])
result = [next_char]

for n in range(200):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

JULIET: would thou better proud i' the stern?
Is all my tribunes. Or with this?

ARIEL:
It is my marriage, that he they be?
Then let the honour perchange your state wheels are patience!
And, Trubt, Katharina, with her  

________________________________________________________________________________

Run time: 0.2006397247314453


- Interestingly, even after 2 epochs, it works well enough to capture capitalisation for role name, as well as sentence structure. The meaning and spelling is flawed as would be expected. A greater number of epochs would lead to a higher class of generated outputs.
- After 10 epochs the grammar starts to improve too.