<a href="https://colab.research.google.com/github/klaudia-nazarko/nlg-text-generation/blob/main/lstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word-Level Text Generation with LSTM

In addition to making predictions, RNNs may also be used as generative models (can learn the sequences and then generate entirely new seqences). One of RNN variant, LSTM neural network has been recognized as a very successful tool when working with sequences of letters or words.

Let's examine performance of basic LSTM model on generating text of fairy tales.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%cd 'drive/MyDrive/Colab Notebooks/nlg_tales_generation'

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks/nlg_tales_generation


In [2]:
import functions as f

from Text import *
from LSTM_class import *

from keras import layers, models, optimizers

## Text preprocessing

The loaded text file contains the content of tales scraped from websites. By creating the instance of Text object, the text is quickly preprocessed and tokenized; by creating the instance of Sequence object the text is prepared for use in LSTM model.

In [3]:
path_train, path_test = 'data/train.txt', 'data/test.txt'

input_train = f.read_txt(path_train)

In [4]:
max_len = 4
step = 3

text_train = Text(input_train)
text_train.tokens_info()

seq_train = Sequences(text_train, max_len, step)
seq_train.sequences_info()

total tokens: 890750, distinct tokens: 25165
number of sequences of length 4: 296916


The text is split into sequences of length 4 (max_len parameter) with step 3. We can see that the first sequence of 4 words starts with the first (0-index) word and the second sequence starts after 3 words, so from the 4th word (3-index).

In [5]:
print(text_train.tokens[:10])
print(text_train.tokens_ind[:10], '\n')

np.array(seq_train.sequences[:2])

['Once', 'upon', 'a', 'time', 'there', 'lived', 'a', 'sultan', 'who', 'loved']
[10701, 17952, 19552, 289, 10967, 9397, 19552, 21301, 6393, 1702] 



array([[10701, 17952, 19552,   289],
       [  289, 10967,  9397, 19552]])

TextDataGenerator is a Python generator that outputs batches of data (sequences and corresponding next words). Since the vocabulary size is over 25k, it's impossible to fit all data to the memory and that's why batch generator is extremely useful.

In [6]:
batch_size = 512

params = {
  'sequence_length': max_len,
  'vocab_size': len(text_train),
  'batch_size': batch_size,
  'shuffle': True
}

train_generator = TextDataGenerator(seq_train.sequences, seq_train.next_words, **params)

## Training the LSTM model

We'll build a simple model with one LSTM layer, dropout and dense layer with softmax activation (to return word probabilities).

In [7]:
def lstm_model(sequence_length, vocab_size, layer_size, embedding=False):
  model = models.Sequential()
  if embedding:
    model.add(layers.Embedding(vocab_size, layer_size))
    model.add(layers.LSTM(layer_size))    
  else:
    model.add(layers.LSTM(layer_size, input_shape=(sequence_length, vocab_size)))
  model.add(layers.Dropout(0.3))
  model.add(layers.Dense(vocab_size, activation='softmax'))
  return model

In [8]:
model = lstm_model(max_len, len(text_train), 512)

optimizer = optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

In [9]:
model.fit(train_generator,
          steps_per_epoch=len(train_generator),
          epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f0964180dd8>

In [10]:
model.save('data/out/lstm_model')



INFO:tensorflow:Assets written to: data/out/lstm_model/assets


INFO:tensorflow:Assets written to: data/out/lstm_model/assets


## Text generation with LSTM model

Generating text with LSTM model requires building the prediction loop which starts with choosing a prefix and setting the number of words to generate. Then we need to predict the next word using our LSTM model and use this word as part of the prefix for the next model input. The loop is executed until the expected number of words is generated.

In [None]:
#model = models.load_model('data/out/lstm_model')

In [11]:
token2ind, ind2token = text_train.token2ind, text_train.ind2token

input_prefix = 'Once upon a time'
text_prefix = Text(input_prefix, token2ind, ind2token)

In [12]:
pred = ModelPredict(model, text_prefix, token2ind, ind2token, max_len)

In [13]:
temperatures = [1, 0.7, 0.4, 0.1]

for temperature in temperatures:
  print('temperature:', temperature)
  print(pred.generate_sequence(100, temperature=0.7))
  print('\n')

temperature: 1
Once upon a time there was a son. This lived she couple of so stand up the horse, and his brother and it was not long before he fell to dost thou come, There is the one of the lion, saying : days is still in his great life. I went back to the old woman. The girl then said, Prince Ivan, and was not a only one where his mother and her own child, she put on the right, he would show you the way to you; and I return to it


temperature: 0.7
Once upon a time there was a child, and they made their children have the life in him. If you do what you only to do him, and what had because she had been passed by the cat, which had such a most beautiful woman, a front of them cried : This is all my daughter. And she got a letter, and the poor man was not long son of the king and queen and her mother, he came back, and his wife and the children and rage between them all. She had a


temperature: 0.4
Once upon a time there was a long time ago, and on they went, and they asked her mother 

## Text generation with LSTM model with Embedding layer

In [13]:
batch_size_emb = 256

params_emb = params.copy()
params_emb['embedding'] = True

train_generator_emb = TextDataGenerator(seq_train.sequences, seq_train.next_words, **params_emb)

In [14]:
model_emb = lstm_model(max_len, len(text_train), 256, embedding=True)
model_emb.compile(loss='categorical_crossentropy', optimizer=optimizer)

In [16]:
model_emb.fit(train_generator_emb,
              steps_per_epoch=len(train_generator_emb),
              epochs=2)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7efe6e587c18>

In [17]:
model_emb.save('data/out/lstm_model_emb')



INFO:tensorflow:Assets written to: data/out/lstm_model_emb/assets


INFO:tensorflow:Assets written to: data/out/lstm_model_emb/assets


In [None]:
#model_emb = models.load_model('data/out/lstm_model_emb')

In [18]:
pred_emb = ModelPredict(model_emb, text_prefix, token2ind, ind2token, max_len, embedding=True)

In [19]:
temperatures = [1, 0.7, 0.4, 0.1]

for temperature in temperatures:
  print('temperature:', temperature)
  print(pred_emb.generate_sequence(100, temperature=0.7))
  print('\n')

temperature: 1
Once upon a time she was to, and fantastical his love for the, the old man, and' avenging he. you to a a yoked the the children, and the King stithy to a little privilege, and foods to a'God to. It Desert I that thou smoky it. And he, she was a the man, he they shall abdicated reindeer a were pleads. So the I was so he and the and attentive. And he the his time to the king, he she had Wife to thou hast a.


temperature: 0.7
Once upon a time the Increase the.' waggle he he told him to the little Christendom, until he he he fell to the. After the he from the basing the- a hashish it this they had to. One of the he was fodder the they had a this to the a. The the it in the defend it at the he we have a it.' rankled in to the Cut that they had lifestyle to of her Free to. The nane it, for the they. They were flits. And he the


temperature: 0.4
Once upon a time there was as if all anymore he, the [Virgil it out with the to, but the he were, the I. He was. The man was in the 