<a href="https://colab.research.google.com/github/nv-hiep/text_generation_with_LSTM/blob/main/text_generation_with_LSTM_using_Keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation With LSTM Recurrent Neural Networks using Keras

# Import libraries

In [None]:
import tensorflow as tf
print(tf.__version__)

print("GPU Available:", tf.config.list_physical_devices('GPU') )

if tf.config.list_physical_devices('GPU'):
  device_name = tf.test.gpu_device_name()
else:
  device_name = '/CPU:0'
print(device_name)

2.4.1
GPU Available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
/device:GPU:0


In [None]:
import tensorflow as tf
import numpy as np
from IPython.display import Image
%matplotlib inline

# Preprocessing the dataset

In [None]:
! curl -O http://www.gutenberg.org/files/1268/1268-0.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1144k  100 1144k    0     0   598k      0  0:00:01  0:00:01 --:--:--  598k


In [None]:
with open('1268-0.txt', encoding='utf-8') as datfile:
  text = datfile.read()

In [None]:
text[:100]

'\ufeffThe Project Gutenberg EBook of The Mysterious Island, by Jules Verne\n\nThis eBook is for the use of '

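Trim the Project Gutenberg boilerplate: keep only the text between the first occurrence of 'THE MYSTERIOUS ISLAND' and the closing 'End of the Project Gutenberg' marker.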

In [None]:
start_idx = text.find('THE MYSTERIOUS ISLAND')
end_idx   = text.find('End of the Project Gutenberg')

print('Text starts at index {}: '.format(start_idx))
print('Text ends at index {}: '.format(end_idx))

Text starts at index 567: 
Text ends at index 1112917: 


In [None]:
text     = text[start_idx : end_idx]
char_set = set(text)

In [None]:
print(char_set)

{'1', ';', 'w', 'S', 'm', 'j', 'n', '!', 'Z', 'd', 'F', ' ', '*', 'N', '4', 'v', 'I', 'C', 'U', '‘', 'b', 'r', 's', '’', '“', 'B', 'E', '\n', 't', '-', 'a', 'y', 'M', '7', 'g', 'K', '”', '3', '9', 'Q', '(', '=', 'q', 'x', 'T', 'u', 'i', 'R', 'W', 'P', 'O', 'p', 'z', 'e', 'f', ',', 'D', 'J', '.', 'V', ':', '8', '2', 'L', 'o', '?', '5', '/', 'c', ')', 'H', 'l', '6', 'k', '&', '0', 'Y', 'G', 'h', 'A'}


In [None]:
print('Total length: {}'.format(len(text)))
print('Unique Characters: {}'.format(len(char_set)) )

Total length: 1112350
Unique Characters: 80


In [None]:
chars_sorted = sorted(char_set)
print(chars_sorted)

['\n', ' ', '!', '&', '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’', '“', '”']


In [None]:
char2int = {char:k for k,char in enumerate(chars_sorted)}
print(char2int)

{'\n': 0, ' ': 1, '!': 2, '&': 3, '(': 4, ')': 5, '*': 6, ',': 7, '-': 8, '.': 9, '/': 10, '0': 11, '1': 12, '2': 13, '3': 14, '4': 15, '5': 16, '6': 17, '7': 18, '8': 19, '9': 20, ':': 21, ';': 22, '=': 23, '?': 24, 'A': 25, 'B': 26, 'C': 27, 'D': 28, 'E': 29, 'F': 30, 'G': 31, 'H': 32, 'I': 33, 'J': 34, 'K': 35, 'L': 36, 'M': 37, 'N': 38, 'O': 39, 'P': 40, 'Q': 41, 'R': 42, 'S': 43, 'T': 44, 'U': 45, 'V': 46, 'W': 47, 'Y': 48, 'Z': 49, 'a': 50, 'b': 51, 'c': 52, 'd': 53, 'e': 54, 'f': 55, 'g': 56, 'h': 57, 'i': 58, 'j': 59, 'k': 60, 'l': 61, 'm': 62, 'n': 63, 'o': 64, 'p': 65, 'q': 66, 'r': 67, 's': 68, 't': 69, 'u': 70, 'v': 71, 'w': 72, 'x': 73, 'y': 74, 'z': 75, '‘': 76, '’': 77, '“': 78, '”': 79}


In [None]:
# Convert the sorted list into a NumPy array so that integer codes can index back into characters
char_array = np.array(chars_sorted)
print(char_array)
print(char_array.shape)

['\n' ' ' '!' '&' '(' ')' '*' ',' '-' '.' '/' '0' '1' '2' '3' '4' '5' '6'
 '7' '8' '9' ':' ';' '=' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K'
 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'Y' 'Z' 'a' 'b' 'c' 'd'
 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v'
 'w' 'x' 'y' 'z' '‘' '’' '“' '”']
(80,)


In [None]:
text_encoded = [ char2int[char] for char in text]
text_encoded = np.array( text_encoded, dtype=np.int32 )
print(text_encoded[:20])
print(text_encoded.shape)

[44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38]
(1112350,)


In [None]:
# Show some examples
print( '1. {} ---> {}'.format(text[:20], text_encoded[:20]) )
print( '2. {} <--- {}'.format(text_encoded[15:21], text[15:21]) )
print( '3. {} <--- {}'.format(text_encoded[15:21], char_array[text_encoded[15:21]] ) )
print( '3\'. {} <--- {}'.format(text_encoded[15:21], ''.join(char_array[text_encoded[15:21]]) ) )

1. THE MYSTERIOUS ISLAN ---> [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38]
2. [33 43 36 25 38 28] <--- ISLAND
3. [33 43 36 25 38 28] <--- ['I' 'S' 'L' 'A' 'N' 'D']
3'. [33 43 36 25 38 28] <--- ISLAND


The NumPy array text_encoded contains the encoded values for all the characters in the text. Now, we will create a TensorFlow dataset from this array:

In [None]:
# Create a source dataset from input data
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)


In [None]:
ds_text_encoded = tf.data.Dataset.from_tensor_slices(text_encoded)
for element in ds_text_encoded.take(5):
  print(element)
  print('{} -> {} \n'.format( element.numpy(), char_array[element.numpy()] ) )

tf.Tensor(44, shape=(), dtype=int32)
44 -> T 

tf.Tensor(32, shape=(), dtype=int32)
32 -> H 

tf.Tensor(29, shape=(), dtype=int32)
29 -> E 

tf.Tensor(1, shape=(), dtype=int32)
1 ->   

tf.Tensor(37, shape=(), dtype=int32)
37 -> M 



To set up the text generation task in TensorFlow, we first fix the sequence length at 40, so the input tensor, x, consists of 40 tokens. The sequence length affects the quality of the generated text. With short sequences, the model may learn to spell individual words correctly while largely ignoring context. Longer sequences give the model more context and can produce more meaningful sentences, but RNNs have difficulty capturing long-term dependencies over very long sequences. The sequence length is therefore a hyperparameter that has to be tuned empirically. Here, we choose 40 as a reasonable tradeoff.

In order to predict the next character, the inputs, x, and targets, y, are offset by one character. Hence, we split the text into chunks of 41 characters: the first 40 characters form the input sequence, x, and the last 40 characters form the target sequence, y.
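As a minimal plain-Python sketch of the same idea, each chunk of seq_length + 1 characters yields an input and a target shifted by one position. The toy string and the short seq_length of 5 are illustrative assumptions chosen for readability, not values used in this notebook.

```python
# Toy illustration of the chunk-and-shift scheme; the string and the
# seq_length of 5 are made up for readability (the notebook uses 40).
toy_text = "THE MYSTERIOUS ISLAND"
seq_length = 5
chunk_size = seq_length + 1

# Keep only full chunks, mirroring batch(..., drop_remainder=True).
n_chunks = len(toy_text) // chunk_size
chunks = [toy_text[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]

# The input drops the last character; the target drops the first one.
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]

for x, y in pairs:
    print(repr(x), '->', repr(y))
```

At every position, the target character is the one that follows the corresponding input character, which is exactly what the model is trained to predict.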

In [None]:
# Text chunks of 41 characters each
seq_length = 40
chunk_size = seq_length + 1

ds_chunks  = ds_text_encoded.batch(chunk_size, drop_remainder=True) #  get rid of the last batch if it is shorter than 41 characters

In [None]:
for chunk in ds_chunks.take(5):
  print('{} - {} \n'.format(chunk.shape, chunk.numpy()) )

(41,) - [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38 28  1  6  6
  6  0  0  0  0  0 40 67 64 53 70 52 54 53  1 51 74] 

(41,) - [ 1 25 63 69 57 64 63 74  1 37 50 69 64 63 50 60  7  1 50 63 53  1 44 67
 54 71 64 67  1 27 50 67 61 68 64 63  0  0  0  0  0] 

(41,) - [ 0 44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38 28  0  0
 51 74  1 34 70 61 54 68  1 46 54 67 63 54  0  0 12] 

(41,) - [19 18 15  0  0  0  0  0 40 25 42 44  1 12  8  8 28 42 39 40 40 29 28  1
 30 42 39 37  1 44 32 29  1 27 36 39 45 28 43  0  0] 

(41,) - [ 0  0 27 57 50 65 69 54 67  1 12  0  0 78 25 67 54  1 72 54  1 67 58 68
 58 63 56  1 50 56 50 58 63 24 79  1 78 38 64  9  1] 



In [None]:
def split_input_target(chunk):
  input_seq  = chunk[:-1]
  target_seq = chunk[1:]
  return input_seq, target_seq

In [None]:
ds_sequences = ds_chunks.map(split_input_target)

In [None]:
for elem_input, elem_target in ds_sequences.take(2):
  print('Input - X: {} - {}'.format( elem_input.numpy(), repr(''.join(char_array[elem_input])) ) )
  print('Target - Y: {} - {}'.format( elem_target.numpy(), repr( ''.join(char_array[elem_target] )) ) )
  print()

Input - X: [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38 28  1  6  6
  6  0  0  0  0  0 40 67 64 53 70 52 54 53  1 51] - 'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'
Target - Y: [32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38 28  1  6  6  6
  0  0  0  0  0 40 67 64 53 70 52 54 53  1 51 74] - 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'

Input - X: [ 1 25 63 69 57 64 63 74  1 37 50 69 64 63 50 60  7  1 50 63 53  1 44 67
 54 71 64 67  1 27 50 67 61 68 64 63  0  0  0  0] - ' Anthony Matonak, and Trevor Carlson\n\n\n\n'
Target - Y: [25 63 69 57 64 63 74  1 37 50 69 64 63 50 60  7  1 50 63 53  1 44 67 54
 71 64 67  1 27 50 67 61 68 64 63  0  0  0  0  0] - 'Anthony Matonak, and Trevor Carlson\n\n\n\n\n'



In the first preprocessing step, we used batch() to split the encoded text into fixed-length chunks of 41 characters. These chunks are not sentences; each one is simply a contiguous slice of the text that corresponds to one training example. Now, we will shuffle the training examples and group them into mini-batches; this time, each batch contains multiple training examples:

In [None]:
BATCH_SIZE  = 64
BUFFER_SIZE = 10_000
ds          = ds_sequences.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
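As a quick sanity check on the sizes involved, the arithmetic below derives how many training examples and mini-batches one epoch contains. It assumes the text length of 1,112,350 printed earlier, and that the second batch() call keeps the final, smaller batch (its default, drop_remainder=False).

```python
import math

text_length = 1_112_350   # length of the trimmed text, as printed earlier
chunk_size  = 41          # seq_length + 1
batch_size  = 64

# batch(chunk_size, drop_remainder=True) discards the leftover characters.
n_examples, n_dropped_chars = divmod(text_length, chunk_size)

# The second batch() call keeps the final partial batch by default.
n_batches = math.ceil(n_examples / batch_size)
last_batch_size = n_examples - (n_batches - 1) * batch_size

print('Training examples:     ', n_examples)
print('Dropped characters:    ', n_dropped_chars)
print('Mini-batches per epoch:', n_batches)
print('Size of the last batch:', last_batch_size)
```

Note that BUFFER_SIZE (10,000) is smaller than the number of examples, so shuffle() mixes examples within a sliding buffer rather than performing a full uniform shuffle of the dataset.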

# Building a character-level RNN model

In [None]:
def build_model(vocab_size, embedding_dim, rnn_units):
  model = tf.keras.Sequential([
                               tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=None),
                               tf.keras.layers.LSTM(units=rnn_units, return_sequences=True),
                               tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [None]:
## Setting the training parameters
charset_size  = len(char_array) # 80 unique characters
embedding_dim = 256
rnn_units     = 512

tf.random.set_seed(1)

model = build_model(vocab_size=charset_size, embedding_dim=embedding_dim, rnn_units=rnn_units)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 256)         20480     
_________________________________________________________________
lstm (LSTM)                  (None, None, 512)         1574912   
_________________________________________________________________
dense (Dense)                (None, None, 80)          41040     
Total params: 1,636,432
Trainable params: 1,636,432
Non-trainable params: 0
_________________________________________________________________


Embedding layer: 80 * 256 = 20,480 parameters (one 256-dimensional vector for each of the 80 characters)
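The other entries of the summary can be checked the same way. The sketch below assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel, and a bias) and a Dense layer with a bias term; the variable names are only for illustration.

```python
vocab_size, embedding_dim, rnn_units = 80, 256, 512

# Embedding: one embedding_dim-sized vector per character.
embedding_params = vocab_size * embedding_dim

# LSTM: four gates, each with an input kernel, a recurrent kernel, and a bias.
lstm_params = 4 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)

# Dense: weight matrix plus bias, mapping rnn_units to vocab_size logits.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The four numbers agree with the Param # column and the total reported by model.summary().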

Note that we did not specify an activation for the final fully connected layer, so it uses the default, activation=None. The reason for this is that we need the logits as outputs of the model so that we can sample from the model predictions in order to generate new text. We will get to this sampling part later. For now, let's train the model:

In [None]:
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
)

model.fit(ds, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f0ae009f350>
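For a single time step, the loss configured with from_logits=True amounts to the negative log of the softmax probability assigned to the target character; by default, Keras averages this quantity over all positions and examples. A minimal plain-Python sketch with made-up logits for a three-character vocabulary follows (the helper name sparse_ce_from_logits is hypothetical, not part of Keras):

```python
import math

def sparse_ce_from_logits(logits, target_idx):
    """Negative log-softmax probability of the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_idx]

toy_logits = [1.0, 1.0, 3.0]  # made-up values, not model outputs

loss_likely   = sparse_ce_from_logits(toy_logits, 2)  # target has the largest logit
loss_unlikely = sparse_ce_from_logits(toy_logits, 0)  # target has a small logit

print('Loss when the target is the most likely class:', loss_likely)
print('Loss when the target is an unlikely class:    ', loss_unlikely)
```

Because the softmax is folded into the loss, the Dense layer can output raw logits, which is also what the sampling code later in the notebook expects.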

# Evaluation phase – generating new text passages

For each position in the input sequence, the RNN model we trained in the previous section returns a vector of 80 logits, one for each unique character. The softmax function converts these logits into the probabilities that each character will be the next one. To predict the next character, we could simply select the element with the maximum logit value, which is equivalent to selecting the character with the highest probability. However, always selecting the most likely character would make the model produce the same text every time for a given starting string, so we instead sample randomly from the output distribution. TensorFlow provides a function for this, tf.random.categorical(), which draws random samples from a categorical distribution defined by its logits. To see how it works, let's generate some random samples from three categories [0, 1, 2], first with equal input logits [1, 1, 1] and then with logits [1, 1, 3], which favor the third category.

In [None]:
tf.random.set_seed(1)

logits = [[1.0, 1.0, 1.0]]
print('Probabilities:', tf.math.softmax(logits).numpy()[0])

samples = tf.random.categorical(logits=logits, num_samples=10)
tf.print(samples.numpy())

Probabilities: [0.33333334 0.33333334 0.33333334]
array([[1, 2, 0, 1, 0, 1, 1, 2, 1, 1]])


In [None]:
tf.random.set_seed(1)

logits = [[1.0, 1.0, 3.0]]
print('Probabilities:', tf.math.softmax(logits).numpy()[0])

samples = tf.random.categorical(logits=logits, num_samples=11)
tf.print(samples.numpy())

Probabilities: [0.10650698 0.10650698 0.78698605]
array([[2, 2, 0, 2, 2, 2, 2, 2, 1, 2, 0]])
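With only ten or so draws, it is hard to see that the samples follow the softmax probabilities. The sketch below is a plain-Python stand-in for tf.random.categorical(), using random.choices from the standard library; it is not how TensorFlow implements sampling, and the seed and number of draws are arbitrary assumptions. It draws many samples and compares the empirical frequencies with the probabilities.

```python
import math
import random
from collections import Counter

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(1)  # arbitrary seed, for repeatability only
logits = [1.0, 1.0, 3.0]
probs = softmax(logits)

# Draw many samples and compare empirical frequencies with the probabilities.
n_draws = 10_000
draws = random.choices(range(len(logits)), weights=probs, k=n_draws)
counts = Counter(draws)
freqs = [counts[i] / n_draws for i in range(len(logits))]

print('Probabilities:', [round(p, 3) for p in probs])
print('Frequencies:  ', [round(f, 3) for f in freqs])
```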


In [None]:
def sample(model, starting_str, len_generated_text=500, max_input_length=40, scale_factor=1.):
  encoded_input = [char2int[x] for x in starting_str]
  encoded_input = tf.reshape( encoded_input, (1, -1) ) # 1 row, n cols

  generated_str = starting_str

  model.reset_states() # reset the states of all layers in the model

  for i in range(len_generated_text):
    logits = model(encoded_input)
    logits = tf.squeeze(logits, axis=0) # drop the batch axis: (1, seq_len, 80) -> (seq_len, 80)
    
    scaled_logits = scale_factor * logits

    new_char_idx = tf.random.categorical(scaled_logits, num_samples=1)
    new_char_idx = new_char_idx[-1, 0].numpy() # sample for the last position; also works for a single-character input
    
    # Add new character to the end of the string
    generated_str += str( char_array[new_char_idx] )

    # Add the new character_index to the end of the encoded_input
    new_char_idx = tf.expand_dims([new_char_idx], 0) # shape (1,) -> (1, 1), matching the (1, n) layout of encoded_input

    encoded_input = tf.concat([encoded_input, new_char_idx], axis=1)
    encoded_input = encoded_input[:, -max_input_length : ] # keep only the last max_input_length characters

  return generated_str


In [None]:
tf.random.set_seed(1)
print(sample(model, starting_str='The island'))

The island was explored their winding
colony. The “Nautilus” making the equarte with those oceand.

“Who can certainly no venture or Patalunation of the wild beasts. The well worth inable met himself prolonged
coulls, were save
you intoly the convicts be a subterranean solour’s border! The wind continued felt from him long in Llacking but
the colonists felt.

“But do you reach Pencroft’s gining len.”

“Inabletted all the gulleys must any event from each penewer at the wire towards the last, some sign of i


## Predictability vs. randomness

By scaling the logits with a factor < 1, the probabilities computed by the softmax function become more uniform, as shown in the following code:

In [None]:
logits = np.array([[1.0, 1.0, 3.0]])

print('Probabilities before scaling:        ', tf.math.softmax(logits).numpy()[0])

print('Probabilities after scaling with 0.5:', tf.math.softmax(0.5*logits).numpy()[0])

print('Probabilities after scaling with 0.1:', tf.math.softmax(0.1*logits).numpy()[0])

Probabilities before scaling:         [0.10650698 0.10650698 0.78698604]
Probabilities after scaling with 0.5: [0.21194156 0.21194156 0.57611688]
Probabilities after scaling with 0.1: [0.31042377 0.31042377 0.37915245]


Scaling the logits by alpha=0.1 results in near-uniform probabilities [0.31, 0.31, 0.38]. Now, we can compare the text generated with alpha=2.0 (more predictable) and alpha=0.5 (more random), where alpha is passed to sample() as scale_factor:

In [None]:
tf.random.set_seed(1)
print(sample(model, starting_str='The island', scale_factor=2.0))

The island was extended to the faithful animals of a second coast. The most recent latitude was advice, and the sudden was still provisions and soon set out of the state of the Prince Dakkar Grotto, and his companions were so completely considered it to be the corral. The colonists had a second contingent power of the lake was lighted by the colonists still being so strict in the river which he had destroyed their companions to extinguish the pirates were thrown upon the stranger at the same time at the s


In [None]:
tf.random.set_seed(1)
print(sample(model, starting_str='The island', scale_factor=0.5))

The island was egely place to kinny wh float! Well, that?”
-Chackyan
24me, Harding,--tho “and that is go favoces.” observed Evidepth Pacolur. Rown, Spilett, repained any myn inder-siea.” Ame of Migro’!

“We can, atlispher 8
she
you “yell speak,” said he.
 Top, this inglittod and push, making. Twen their voints next to lort. Carlingullock bodne of
Elunis, smeling, whnee
war asily from
the pirate’c dwelbing!

To:
illy ourselvely yeck,” anseed,
lift; vercy, linrers-fill, with under Latienly frisonmen
to puri


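The results show that scaling the logits with alpha=0.5 (equivalent to a softmax temperature of 1/alpha = 2) generates more random text, with many misspelled or invented words, whereas alpha=2.0 yields mostly correctly spelled words. There is a tradeoff between the novelty of the generated text and its correctness.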
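One way to put a number on "more random" is the entropy of the scaled softmax distribution: the closer it is to the maximum, log(3) for three categories, the more uniform the distribution. The plain-Python sketch below reuses the toy logits [1, 1, 3] from above; measuring entropy is an illustration added here, not something this notebook computes.

```python
import math

def softmax(logits, alpha=1.0):
    """Softmax of alpha * logits (alpha acts as an inverse temperature)."""
    scaled = [alpha * z for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [1.0, 1.0, 3.0]
alphas = (0.1, 0.5, 1.0, 2.0)
entropies = [entropy(softmax(logits, a)) for a in alphas]

for a, h in zip(alphas, entropies):
    print('alpha={}: probabilities={}, entropy={:.3f}'.format(
        a, [round(p, 3) for p in softmax(logits, a)], h))
```

The entropy shrinks as alpha grows, which matches the generated samples: larger scale factors concentrate the probability mass on the most likely characters.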