# Deep Learning Text Generation with Keras


Text generation is one of the state-of-the-art applications of NLP. Deep learning techniques are used for a variety of text generation tasks, such as writing poetry, generating movie scripts, and even composing music. In this article, however, we will work through a very simple example of text generation: given an input sequence of words, we will predict the next word. We will use the raw text of Shakespeare's famous play "Macbeth" as the training data.

# Library

In [26]:
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout
from tensorflow.keras.utils import to_categorical
from random import randint
import re
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [27]:
import nltk
nltk.download('gutenberg')
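# Raw strings keep the backslashes literal; otherwise "\n" and "\t" turn into a newline and a tab
nltk.data.path.append(r"E:\programs\nltk_data")
nltk.data.path.append(r"E:\programs\nltk_data\corpora")
nltk.data.path.append(r"E:\programs\nltk_data\tokenizers")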
from nltk.corpus import gutenberg as gut
print(gut.fileids())

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\imanursar\AppData\Roaming\nltk_data...


['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


[nltk_data]   Package gutenberg is already up-to-date!


In [45]:
macbeth_text = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')
print(macbeth_text[:500])

[The Tragedie of Macbeth by William Shakespeare 1603]


Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin

   All. Padock calls anon: faire is foule, and foule is faire,
Houer through 


# Data Preprocessing

In [46]:
lemmatizer = WordNetLemmatizer()

def preprocess_text(sentence):
    # Replace punctuation and digits with spaces
    sentence = re.sub(r'[^a-zA-Z]', ' ', sentence)
    # Remove single characters surrounded by whitespace
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse multiple spaces into one
    sentence = re.sub(r'\s+', ' ', sentence)
    # Remove trailing whitespace
    sentence = re.sub(r"\s+$", "", sentence)
    # Convert to lowercase
    sentence = sentence.lower()

    # Optional lemmatization (reduce each word to its dictionary root form), disabled here
    # sentence = ' '.join(lemmatizer.lemmatize(word) for word in sentence.split())

    # Remove stop words
    sentence = remove_stopwords(sentence)
    return sentence

In [47]:
print(macbeth_text[:1000])
print("\n")
macbeth_text = preprocess_text(macbeth_text)
macbeth_text[:500]

[The Tragedie of Macbeth by William Shakespeare 1603]


Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin

   All. Padock calls anon: faire is foule, and foule is faire,
Houer through the fogge and filthie ayre.

Exeunt.


Scena Secunda.

Alarum within. Enter King Malcome, Donalbaine, Lenox, with
attendants,
meeting a bleeding Captaine.

  King. What bloody man is that? he can report,
As seemeth by his plight, of the Reuolt
The newest state

   Mal. This is the Serieant,
Who like a good and hardie Souldier fought
'Gainst my Captiuitie: Haile braue friend;
Say to the King, the knowledge of the Broyle,
As thou didst leaue it

   Cap. Doubtfull it stood,
As two spent Swimmers, t

'tragedie macbeth william shakespeare actus primus scoena prima thunder lightning enter witches shall meet againe thunder lightning raine hurley burley battaile lost wonne ere set sunne place vpon heath meet macbeth come gray malkin padock calls anon faire foule foule faire houer fogge filthie ayre exeunt scena secunda alarum enter king malcome donalbaine lenox attendants meeting bleeding captaine king bloody man report seemeth plight reuolt newest state mal serieant like good hardie souldier fou'
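
To make the individual cleaning steps easier to follow, here is a minimal sketch that applies the same regular expressions to a single line of the play. It uses only the standard library: `clean_line` and `TOY_STOPWORDS` are hypothetical names, and the three-word stop word set is only a stand-in for gensim's much longer list.

```python
import re

# Hypothetical stand-in for gensim's stop word list (the real list is much longer)
TOY_STOPWORDS = {"when", "the", "done"}

def clean_line(sentence):
    sentence = re.sub(r"[^a-zA-Z]", " ", sentence)       # punctuation and digits -> spaces
    sentence = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)  # drop stray single letters such as the "s" of "burley's"
    sentence = re.sub(r"\s+", " ", sentence)             # collapse runs of whitespace
    sentence = sentence.strip().lower()
    words = [w for w in sentence.split() if w not in TOY_STOPWORDS]
    return " ".join(words)

print(clean_line("2. When the Hurley-burley's done,"))  # hurley burley
```

The result matches the "hurley burley" fragment in the preprocessed text above, where the numbering, the apostrophe, and the stop words have all disappeared.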

# Convert Words to Numbers 

First, tokenize the text, i.e. split it into a list of words:

In [48]:
macbeth_text_words = (word_tokenize(macbeth_text))
n_words = len(macbeth_text_words)
unique_words = len(set(macbeth_text_words))
print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)

Total Words: 9062
Unique Words: 3231


To convert the tokenized words to numbers, we can use the Tokenizer class from the keras.preprocessing.text module. Fitting it on the words creates a dictionary in which the keys are the words and the values are the corresponding integers.

In [49]:
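# Import from tensorflow.keras to stay consistent with the other Keras imports
from tensorflow.keras.preprocessing.text import Tokenizer
# num_words only limits texts_to_sequences(); word_index still contains every word
tokenizer = Tokenizer(num_words=3437)
tokenizer.fit_on_texts(macbeth_text_words)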

To access the dictionary that contains words and their corresponding indexes, the word_index attribute of the tokenizer object can be used: 

In [50]:
vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index

In [51]:
print(macbeth_text_words[500])
print(word_2_index[macbeth_text_words[500]])

like
20
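
As a rough illustration of what such a word-to-index dictionary looks like, the sketch below builds one by hand with `collections.Counter`. It assumes the behaviour of the Keras Tokenizer that matters here: more frequent words get smaller indexes, and index 0 is never assigned to a word. `build_word_index` and the toy word list are hypothetical.

```python
from collections import Counter

def build_word_index(words):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(words)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_words = ["thane", "cawdor", "thane", "macbeth", "thane", "cawdor"]
toy_index = build_word_index(toy_words)
toy_vocab_size = len(toy_index) + 1  # +1 because index 0 is not used for any word

print(toy_index)       # {'thane': 1, 'cawdor': 2, 'macbeth': 3}
print(toy_vocab_size)  # 4
```

This is why vocab_size is computed as `len(tokenizer.word_index) + 1` in the cell above.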


# Modifying the Shape of the Data

Text generation falls in the category of many-to-one sequence problems, since the input is a sequence of words and the output is a single word. We will use a Long Short-Term Memory network (LSTM), which is a type of recurrent neural network, to create our text generation model. An LSTM accepts data in a 3-dimensional format (number of samples, number of time steps, features per time step). Since the output is a single word, the output is 2-dimensional (number of samples, vocab_size), where vocab_size is the number of unique words in the corpus plus one.

In [53]:
input_sequence = []
output_words = []
input_seq_length = 100

for i in range(0, n_words - input_seq_length , 1):
    in_seq = macbeth_text_words[i:i + input_seq_length]
    out_seq = macbeth_text_words[i + input_seq_length]
    input_sequence.append([word_2_index[word] for word in in_seq])
    output_words.append(word_2_index[out_seq])

The input_seq_length is set to 100, which means that our input sequence will consist of 100 words. 

Next, we execute a loop. In the first iteration, the integer values for the first 100 words of the text are appended to the input_sequence list, and the 101st word is appended to the output_words list. In the second iteration, the sequence that starts at the 2nd word and ends at the 101st word is stored in input_sequence, and the 102nd word is stored in output_words, and so on.

A total of 8962 input sequences will be generated, since there are 9062 words in the dataset (the number of sequences is 100 less than the total number of words).
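
The sliding window described above is easier to see on a tiny example. The following sketch uses a window of 3 over 8 made-up token ids; `make_windows` is a hypothetical helper that mirrors the loop in the cell above.

```python
def make_windows(token_ids, seq_length):
    """Pair every run of seq_length tokens with the token that follows it."""
    inputs, targets = [], []
    for i in range(len(token_ids) - seq_length):
        inputs.append(token_ids[i:i + seq_length])
        targets.append(token_ids[i + seq_length])
    return inputs, targets

toy_ids = [11, 12, 13, 14, 15, 16, 17, 18]
toy_inputs, toy_targets = make_windows(toy_ids, seq_length=3)

print(len(toy_inputs))                  # 5, i.e. 8 tokens minus a window of 3
print(toy_inputs[0], toy_targets[0])    # [11, 12, 13] 14
print(toy_inputs[-1], toy_targets[-1])  # [15, 16, 17] 18
```

The number of samples is always the number of tokens minus the window length, because the last window still needs one following token as its target.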

In [54]:
print(input_sequence[0])

[698, 6, 1179, 1180, 273, 1181, 1182, 274, 151, 699, 4, 172, 5, 200, 51, 151, 699, 1183, 1184, 1185, 1186, 226, 700, 78, 173, 494, 115, 7, 701, 200, 6, 14, 1187, 1188, 1189, 702, 201, 227, 275, 275, 227, 1190, 1191, 703, 79, 35, 50, 276, 228, 4, 13, 1192, 202, 38, 229, 277, 704, 1193, 13, 80, 31, 152, 1194, 1195, 705, 495, 203, 41, 1196, 20, 16, 1197, 357, 706, 358, 1198, 87, 707, 278, 13, 496, 1199, 3, 497, 88, 498, 708, 359, 709, 1200, 24, 710, 1201, 59, 1202, 1203, 1204, 711, 1205, 1206]


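Let's normalize the input sequences by dividing the integers in them by vocab_size (the largest index plus one), so that every value lies between 0 and 1. The following script also reshapes the input into the 3-dimensional format and one-hot encodes the output into a 2-dimensional array.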

In [55]:
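# Reshape to (samples, time steps, features), the layout expected by the LSTM
X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))
X = X / float(vocab_size)
# One-hot encode the targets; num_classes fixes the width at vocab_size
y = to_categorical(output_words, num_classes=vocab_size)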

In [59]:
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (8962, 100, 1)
y shape: (8962, 3232)


# Training the Model 

In [60]:
model = Sequential()

model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))

model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy', 
              optimizer='adam')

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 100, 800)          2566400   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100, 800)          5123200   
_________________________________________________________________
lstm_2 (LSTM)                (None, 800)               5123200   
_________________________________________________________________
dense (Dense)                (None, 3232)              2588832   
Total params: 15,401,632
Trainable params: 15,401,632
Non-trainable params: 0
_________________________________________________________________
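
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and a fully connected output layer; `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights and one bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    # One weight per input/output pair plus one bias per output unit
    return input_dim * units + units

layer_params = [
    lstm_params(800, 1),      # first LSTM sees 1 feature per time step
    lstm_params(800, 800),    # second LSTM sees the 800 outputs of the first
    lstm_params(800, 800),    # third LSTM
    dense_params(3232, 800),  # softmax layer over the vocabulary
]

print(layer_params)       # [2566400, 5123200, 5123200, 2588832]
print(sum(layer_params))  # 15401632
```

The numbers agree with the summary, which also shows that most of the 15.4 million parameters sit in the two stacked 800-unit LSTM layers.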


In [62]:
history = model.fit(X, y, 
          batch_size=64, 
          epochs=10, 
          verbose=1)

Train on 8962 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Making Predictions 

To make predictions, we will randomly select a sequence from the input_sequence list, convert it into a 3-dimensional shape, and then pass it to the predict() method of the trained model.

The model returns an array of probabilities, one per word index, produced by the softmax output layer. The index with the highest probability is taken as the index of the next word. That index is then used as a key in the index_2_word dictionary, which returns the word that belongs to it.
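
As a small illustration of this lookup step, the sketch below picks the most likely index from a made-up output vector and maps it back to a word. The probabilities and the three-word dictionary are hypothetical; in the notebook the vector comes from model.predict() and the position is found with np.argmax.

```python
def argmax(values):
    """Return the position of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

toy_index_2_word = {1: "thane", 2: "cawdor", 3: "macbeth"}
# Hypothetical model output; position 0 is never assigned to a word
toy_probs = [0.05, 0.15, 0.70, 0.10]

predicted_id = argmax(toy_probs)
print(predicted_id, toy_index_2_word[predicted_id])  # 2 cawdor
```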

## index to word

In [63]:
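random_seq_index = np.random.randint(0, len(input_sequence)-1)
# Copy the list so that appending predictions later does not alter input_sequence
random_seq = list(input_sequence[random_seq_index])
# Invert the word -> index dictionary
index_2_word = {index: word for word, index in word_2_index.items()}
word_sequence = [index_2_word[value] for value in random_seq]
print(' '.join(word_sequence))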

bad thee thane cawdor addition haile worthy thane thine banq deuill speake true macb thane cawdor liues doe dresse borrowed robes ang thane liues vnder heauie iudgement beares life deserues loose combin norway lyne rebell hidden helpe vantage labour countreyes wracke know treasons capitall confess prou haue ouerthrowne macb glamys thane cawdor greatest behinde thankes paines doe hope children shall kings gaue thane cawdor promis lesse banq trusted home enkindle vnto crowne thane cawdor tis strange oftentimes winne vs harme instruments darknesse tell vs truths winne vs honest trifles betray deepest consequence cousins word pray macb truths told happy prologues


### Generate the next 100 words that follow the above sequence

In [65]:
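for i in range(100):
    # Reshape the current window to (1, time steps, 1) and scale it like the training data
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))
    int_sample = int_sample / float(vocab_size)

    # The model returns one probability per word index; keep the most likely one
    predicted_probs = model.predict(int_sample, verbose=0)
    predicted_word_id = int(np.argmax(predicted_probs))

    word_sequence.append(index_2_word[predicted_word_id])

    # Slide the window: drop the oldest word and add the prediction (builds a new list)
    random_seq = random_seq[1:] + [predicted_word_id]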

The word_sequence variable now contains the input sequence of words followed by the next 100 predicted words, stored as a list. We can simply join the words in the list to get the final output sequence.
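
The way the window slides during generation can be sketched without a trained model. In the example below, `toy_predict` is a hypothetical stand-in for model.predict() followed by np.argmax; it simply returns the id after the last one in the window, wrapping around within 1 to 5.

```python
def toy_predict(window):
    # Hypothetical stand-in for the model: next id = last id + 1, wrapping within 1..5
    return window[-1] % 5 + 1

def generate(seed, n_steps):
    window = list(seed)  # copy, so the seed sequence itself is not modified
    generated = []
    for _ in range(n_steps):
        next_id = toy_predict(window)
        generated.append(next_id)
        window = window[1:] + [next_id]  # drop the oldest id, add the prediction
    return generated, window

seed = [3, 4, 5]
generated, final_window = generate(seed, n_steps=4)

print(generated)     # [1, 2, 3, 4]
print(final_window)  # [2, 3, 4]
print(seed)          # [3, 4, 5]
```

The window keeps its length at every step, so each prediction is based on the most recent words, including words the model generated itself.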

In [66]:
final_output = ""
for word in word_sequence:
    final_output = final_output + " " + word
print(final_output)

 bad thee thane cawdor addition haile worthy thane thine banq deuill speake true macb thane cawdor liues doe dresse borrowed robes ang thane liues vnder heauie iudgement beares life deserues loose combin norway lyne rebell hidden helpe vantage labour countreyes wracke know treasons capitall confess prou haue ouerthrowne macb glamys thane cawdor greatest behinde thankes paines doe hope children shall kings gaue thane cawdor promis lesse banq trusted home enkindle vnto crowne thane cawdor tis strange oftentimes winne vs harme instruments darknesse tell vs truths winne vs honest trifles betray deepest consequence cousins word pray macb truths told happy prologues macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb macb m

The generated text simply repeats the word "macb", so the results are not good yet. To improve them, I have the following recommendations:

1. Change the hyperparameters, including the size and number of LSTM layers and the number of epochs, to see if you get better results.

2. Experiment with stop word removal. This notebook already removes stop words such as is, am, and are from the training text with remove_stopwords, so the model can only generate words other than stop words; whether that is desirable depends on the type of application.

3. Create a character-level text generation model that predicts the next N characters. 
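
As a starting point for the third recommendation, the data preparation for a character-level model could look roughly like the sketch below. It is only an illustration under simple assumptions: `char_windows` is a hypothetical helper, the characters are indexed in sorted order starting at 0, and the model itself is left out.

```python
def char_windows(text, seq_length):
    """Build (input window, next character) pairs over character ids."""
    chars = sorted(set(text))
    char_2_index = {ch: i for i, ch in enumerate(chars)}
    inputs, targets = [], []
    for i in range(len(text) - seq_length):
        inputs.append([char_2_index[ch] for ch in text[i:i + seq_length]])
        targets.append(char_2_index[text[i + seq_length]])
    return char_2_index, inputs, targets

char_2_index, char_inputs, char_targets = char_windows("faire is foule", seq_length=5)

print(len(char_2_index))  # 10 distinct characters, including the space
print(len(char_inputs))   # 9 windows from 14 characters
print(char_inputs[0], char_targets[0])
```

The vocabulary of a character-level model is much smaller than the 3231 unique words used above, which makes the output layer far cheaper, although many more time steps are needed to cover the same amount of text.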