# Text Generation with LSTMs

This notebook contains the definition, training, and inference of an LSTM model that generates text. The model is trained on the text of Moby Dick and is designed to predict the next token given a sequence of tokens; i.e., we input `N` preprocessed tokens and the model outputs token `N+1`, the most probable one given the input sequence. All the necessary text pre-processing is shown: tokenization, vocabulary generation, encoding, sequence generation, mapping into a compact embedding, etc.

For more information on Deep Learning concepts and Recurrent Neural Networks (RNNs), check:

- `~/Dropbox/Documentation/howtos/keras_tensorflow_guide.txt`
- `~/git_repositories/data_science_python_tools/19_NeuralNetworks_Keras`

The latter is a section in my repository [data_science_python_tools](https://github.com/mxagar/data_science_python_tools). It contains specific notebooks relevant to NLP, covering Keras basics, RNNs, and NLP.

Overview of contents:
1. Load the Training Text
2. Text Processing
    - 2.1 Tokenize and Clean the Text
    - 2.2 Create Sequences of Tokens
    - 2.3 Convert Token Sequences to Integer Sequences
    - 2.4 Separate the Input and the Target: X, y
3. Neural Network - Model
    - 3.1 Define the Model
    - 3.2 Train the Model
    - 3.3 Save the Model and the Tokenizer
4. Generate New Text
    - 4.1 Text Generation Function
    - 4.2 Grab a Random Seed Text Sequence and Predict the Next 50 Words
    - 4.3 Testing a Big Model Trained with the Full Moby Dick Text

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided via a download link; I haven't found a repository to fork from.*

## 1. Load the Training Text

In [1]:
def read_file(filepath):
    with open(filepath) as f:
        str_text = f.read()
    return str_text

In [2]:
# Complete Moby Dick
read_file('../data/melville-moby_dick.txt');
# First 4 chapters
read_file('../data/moby_dick_four_chapters.txt');

## 2. Text Processing

### 2.1 Tokenize and Clean the Text

In [3]:
import spacy
# We want only tokenization
# thus, we disable parsing, tagging and named-entity recognition, to make it faster
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])

In [4]:
# Spacy sometimes complains if the text is very large,
# thus we extend the max_length
nlp.max_length = 1198623

In [5]:
# Convenience function which
# (1) tokenizes and (2) drops tokens that appear in the punctuation string
# The string is taken from Keras
# Note that the text used (Moby Dick) has many such characters,
# and we want to avoid learning them - i.e., we don't want to overfit & predict them!
def separate_punc(doc_text):
    return [token.text.lower() for token in nlp(doc_text) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
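
One subtlety in `separate_punc`: `token.text not in '...'` is a substring test on a string, not a membership test on a set of characters. The sketch below uses plain Python and a few hand-picked example tokens (my own, not taken from the text) to illustrate the consequence: multi-character tokens such as `--` are filtered too, as long as they occur contiguously in the string.

```python
# Same filter string as in separate_punc above
PUNCT = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '

def is_filtered(token_text):
    # `in` on two strings checks for a contiguous substring
    return token_text in PUNCT

# Hand-picked example tokens (illustrative only, not from the notebook)
examples = ['call', '.', '--', '\n\n', "'s", ',-', '-,']
results = {tok: is_filtered(tok) for tok in examples}
for tok, flag in results.items():
    print(repr(tok), '-> filtered' if flag else '-> kept')
```

Note that the apostrophe is not part of the string, which is why tokens like `'s` and `n't` survive the filter and show up in the vocabulary.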

In [6]:
# Load and tokenize text
d = read_file('../data/moby_dick_four_chapters.txt')
tokens = separate_punc(d)



In [7]:
len(tokens)

11338

In [8]:
tokens[:10]

['call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long']

### 2.2 Create Sequences of Tokens

We are going to group the tokens in sequences of `N`. We feed the neural network with these `N` tokens and let it predict token `N + 1`. For that reason, `N` must be long enough for the model to learn the structure of a sentence. We choose `N = 25`; i.e., we build the model so that it predicts token number 26. Other texts or authors might require a different length, e.g., Shakespeare 50, haikus 15, etc.

In [9]:
# Organize into sequences of tokens
train_len = 25+1 # 25 training words , then one target word

# Empty list of sequences
text_sequences = []

for i in range(train_len, len(tokens)):
    
    # Grab train_len tokens (25 inputs + 1 target)
    seq = tokens[i-train_len:i]
    
    # Add to list of sequences
    text_sequences.append(seq)

In [10]:
# We can create a text string of each sequence as follows
i = 0
' '.join(text_sequences[i])

'call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on'

In [11]:
# The next sequence starts with the next token
i = 1
' '.join(text_sequences[i])

'me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore'
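
A quick way to check the windowing logic is to run it on a toy token list. The sketch below uses made-up tokens and a hypothetical helper `make_windows`; it mirrors the loop in cell 9 and shows that `range(train_len, len(tokens))` yields `len(tokens) - train_len` windows, so the very last possible window is left out.

```python
def make_windows(tokens, train_len, include_last=False):
    # Mirrors the loop above: each window holds train_len consecutive tokens.
    # include_last is a hypothetical switch that also emits the final window.
    stop = len(tokens) + 1 if include_last else len(tokens)
    return [tokens[i - train_len:i] for i in range(train_len, stop)]

toy_tokens = ['call', 'me', 'ishmael', 'some', 'years', 'ago']
train_len = 3 + 1  # 3 input tokens, then one target token

windows = make_windows(toy_tokens, train_len)
windows_all = make_windows(toy_tokens, train_len, include_last=True)

# Show each window as inputs -> target
for w in windows_all:
    print(w[:-1], '->', w[-1])
```

For a long text, losing one window is irrelevant in practice; the point is only to make the count of sequences easy to predict.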

### 2.3 Convert Token Sequences to Integer Sequences

We have sequences of `25 + 1` words as token strings. To feed them to the neural network, we need to encode them as numbers, i.e., integers. We can do that with the Keras `Tokenizer`.
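
Conceptually, the `Tokenizer` counts the words, ranks them by frequency, and assigns index 1 to the most frequent one; index 0 is never assigned to a word. The following is a simplified pure-Python sketch of that idea, not the Keras implementation; `build_word_index` and `encode` are hypothetical helper names and the toy sequences are made up.

```python
from collections import Counter

def build_word_index(token_sequences):
    # Count every token over all sequences
    counts = Counter(tok for seq in token_sequences for tok in seq)
    # most_common() sorts by decreasing count; indices start at 1, 0 stays unused
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(token_sequences, word_index):
    # Replace each token by its integer index
    return [[word_index[tok] for tok in seq] for seq in token_sequences]

# Toy data (illustrative only)
toy_sequences = [['the', 'whale', 'and', 'the'], ['whale', 'and', 'the', 'sea']]
word_index = build_word_index(toy_sequences)
encoded = encode(toy_sequences, word_index)
print(word_index)
print(encoded)
```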

In [12]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [13]:
# Integer-encoded sequences of words
tokenizer = Tokenizer()
# Pass all text sequences: array which contains arrays of 25+1 token strings
tokenizer.fit_on_texts(text_sequences)
# Convert token-string sequences into integers sequences
sequences = tokenizer.texts_to_sequences(text_sequences)

In [14]:
# Sequence i = 1
i = 1
sequences[i]

[14,
 263,
 51,
 261,
 408,
 87,
 219,
 129,
 111,
 954,
 260,
 50,
 43,
 38,
 314,
 7,
 23,
 546,
 3,
 150,
 259,
 6,
 2713,
 14,
 24,
 957]

In [15]:
# Dictionary with index-string pairs
tokenizer.index_word

{1: 'the',
 2: 'a',
 3: 'and',
 4: 'of',
 5: 'i',
 6: 'to',
 7: 'in',
 8: 'it',
 9: 'that',
 10: 'he',
 11: 'his',
 12: 'was',
 13: 'but',
 14: 'me',
 15: 'with',
 16: 'as',
 17: 'at',
 18: 'this',
 19: 'you',
 20: 'is',
 21: 'all',
 22: 'for',
 23: 'my',
 24: 'on',
 25: 'be',
 26: "'s",
 27: 'not',
 28: 'from',
 29: 'there',
 30: 'one',
 31: 'up',
 32: 'what',
 33: 'him',
 34: 'so',
 35: 'bed',
 36: 'now',
 37: 'about',
 38: 'no',
 39: 'into',
 40: 'by',
 41: 'were',
 42: 'out',
 43: 'or',
 44: 'harpooneer',
 45: 'had',
 46: 'then',
 47: 'have',
 48: 'an',
 49: 'upon',
 50: 'little',
 51: 'some',
 52: 'old',
 53: 'like',
 54: 'if',
 55: 'they',
 56: 'would',
 57: 'do',
 58: 'over',
 59: 'landlord',
 60: 'thought',
 61: 'room',
 62: 'when',
 63: 'could',
 64: "n't",
 65: 'night',
 66: 'here',
 67: 'head',
 68: 'such',
 69: 'which',
 70: 'man',
 71: 'did',
 72: 'sea',
 73: 'time',
 74: 'other',
 75: 'very',
 76: 'go',
 77: 'these',
 78: 'more',
 79: 'though',
 80: 'first',
 81: 'sort',


In [16]:
# We can see the token-string of each integer/index using the dictionary
for i in sequences[0]:
    print(f'{i} : {tokenizer.index_word[i]}')

956 : call
14 : me
263 : ishmael
51 : some
261 : years
408 : ago
87 : never
219 : mind
129 : how
111 : long
954 : precisely
260 : having
50 : little
43 : or
38 : no
314 : money
7 : in
23 : my
546 : purse
3 : and
150 : nothing
259 : particular
6 : to
2713 : interest
14 : me
24 : on


In [17]:
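# How many times does each token appear?
# Note: the tokenizer was fit on overlapping sequences,
# so each token is counted many times (roughly 26x its real frequency)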
tokenizer.word_counts

OrderedDict([('call', 27),
             ('me', 2471),
             ('ishmael', 133),
             ('some', 758),
             ('years', 135),
             ('ago', 84),
             ('never', 449),
             ('mind', 164),
             ('how', 321),
             ('long', 374),
             ('precisely', 37),
             ('having', 142),
             ('little', 767),
             ('or', 950),
             ('no', 1003),
             ('money', 120),
             ('in', 5647),
             ('my', 1786),
             ('purse', 71),
             ('and', 9646),
             ('nothing', 281),
             ('particular', 152),
             ('to', 6497),
             ('interest', 24),
             ('on', 1716),
             ('shore', 26),
             ('i', 7150),
             ('thought', 676),
             ('would', 702),
             ('sail', 104),
             ('about', 1014),
             ('a', 10377),
             ('see', 416),
             ('the', 15540),
             ('watery', 26),
  

In [18]:
# How many unique words do we have?
vocabulary_size = len(tokenizer.word_counts)
print(vocabulary_size)

2718


In [19]:
# The sequence of integers needs to be converted into a 2D numpy array
import numpy as np

In [20]:
sequences = np.array(sequences)

In [21]:
# Each row is a sequence of 25+1 tokens: 25 are the input, 1 is the target
# These tokens are represented as integers
sequences.shape

(11312, 26)

In [22]:
sequences

array([[ 956,   14,  263, ..., 2713,   14,   24],
       [  14,  263,   51, ...,   14,   24,  957],
       [ 263,   51,  261, ...,   24,  957,    5],
       ...,
       [ 952,   12,  166, ...,  262,   53,    2],
       [  12,  166, 2712, ...,   53,    2, 2718],
       [ 166, 2712,    3, ...,    2, 2718,   26]])

### 2.4 Separate the Input and the Target: X, y

In [23]:
from keras.utils import to_categorical

In [24]:
# All columns except the last: inputs, shape m x N (m sequences, N = 25 tokens)
X = sequences[:,:-1]

In [25]:
# Last word: Target: m
y = sequences[:,-1]

In [26]:
# One-hot encoding
# We have computed the size of the vocabulary above: number of unique words in text
# We add 1 because the Tokenizer indices start at 1; index 0 is reserved (Keras uses it for padding)
y = to_categorical(y, num_classes=vocabulary_size+1)

In [27]:
# Sequence length: N
seq_len = X.shape[1]

In [28]:
seq_len

25

In [31]:
# Number of sequences x N tokens
X.shape

(11312, 25)

In [34]:
# Number of sequences x number of vocabulary items (binary)
y.shape

(11312, 2719)
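
`to_categorical` turns each integer target into a one-hot row. Since the tokenizer indices run from 1 to `vocabulary_size`, the rows need `vocabulary_size + 1` columns, and column 0 is never set. A minimal pure-Python sketch of the idea follows; the helper `one_hot` and the toy values are my own, and this is not the Keras code.

```python
def one_hot(indices, num_classes):
    # One row per index, with a single 1 at the position given by the index
    rows = []
    for idx in indices:
        row = [0] * num_classes
        row[idx] = 1
        rows.append(row)
    return rows

toy_vocabulary_size = 4      # word indices are 1..4 (toy assumption)
toy_targets = [1, 4, 2]      # integer-encoded target words (made up)

# Index 4 only fits if we allocate vocabulary_size + 1 columns
y_toy = one_hot(toy_targets, num_classes=toy_vocabulary_size + 1)
for row in y_toy:
    print(row)
```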

## 3. Neural Network - Model

In [41]:
import keras
from keras.models import Sequential
from keras.layers import Dense,LSTM,Embedding

### 3.1 Define the Model

In [38]:
def create_model(vocabulary_size, seq_len):
    model = Sequential()
    # Embedding: integer word indices are mapped to dense word vectors (here of size seq_len = 25)
    model.add(Embedding(vocabulary_size, seq_len, input_length=seq_len))
    # LSTM units: It is common to make it a multiple of the sequence length
    # The more units, the longer the training takes
    # In the videos, 50 = 25*2 was chosen
    multiple = 6 # 25*6 = 150
    model.add(LSTM(seq_len*multiple, return_sequences=True))
    model.add(LSTM(seq_len*multiple))
    model.add(Dense(seq_len*multiple, activation='relu'))
    # The last layer needs to be the size of our vocabulary: a word is predicted
    model.add(Dense(vocabulary_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

In [40]:
# Define model
# We pass the vocabulary size + 1 due to keras padding
model = create_model(vocabulary_size+1, seq_len)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 25, 25)            67975     
_________________________________________________________________
lstm_5 (LSTM)                (None, 25, 150)           105600    
_________________________________________________________________
lstm_6 (LSTM)                (None, 150)               180600    
_________________________________________________________________
dense_5 (Dense)              (None, 150)               22650     
_________________________________________________________________
dense_6 (Dense)              (None, 2719)              410569    
Total params: 787,394
Trainable params: 787,394
Non-trainable params: 0
_________________________________________________________________
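
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check of the architecture. The sketch below uses the standard formulas for embedding, LSTM (4 gates, each with input weights, recurrent weights and a bias), and dense layers; the helper names are my own and the layer sizes are those of `create_model`.

```python
def embedding_params(vocab, dim):
    # One dense vector of size dim per vocabulary index
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units

def total_params(vocab, seq_len=25, multiple=6):
    units = seq_len * multiple
    layers = [
        embedding_params(vocab, seq_len),   # Embedding(vocab, seq_len)
        lstm_params(seq_len, units),        # first LSTM
        lstm_params(units, units),          # second LSTM
        dense_params(units, units),         # Dense with relu
        dense_params(units, vocab),         # Dense with softmax
    ]
    return layers, sum(layers)

layers, total = total_params(vocab=2719)
print(layers, total)
```

With `vocab=2719` this gives the per-layer counts 67975, 105600, 180600, 22650, and 410569, i.e., 787,394 parameters in total, as in the summary above.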


### 3.2 Train the Model

In [43]:
# Train for at least 200 epochs!
# Otherwise, just the most common words are predicted...
# Here we run only 5 epochs as a quick test
model.fit(X, y, batch_size=128, epochs=5, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x7f794050b6d0>

### 3.3 Save the Model and the Tokenizer

We need to save the tokenizer together with the model, because the model receives the text processed with it!

In [47]:
# Save the model
model.save('lstm_moby_dick_5_epochs.h5')

In [48]:
# Save the tokenizer
from pickle import dump, load
# The with-statement makes sure the file is closed after writing
with open('lstm_moby_dick_tokenizer', 'wb') as f:
    dump(tokenizer, f)
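
To double-check that a pickled object comes back unchanged, a save/load round trip can be tried on a small stand-in first. The sketch below uses a plain dictionary instead of the real tokenizer and a temporary file path; both are assumptions for illustration only.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer (any picklable object works the same way)
stand_in_tokenizer = {'word_index': {'the': 1, 'a': 2}}

# Temporary location, so that no file of the notebook is overwritten
path = os.path.join(tempfile.mkdtemp(), 'tokenizer_demo.pkl')

with open(path, 'wb') as f:
    pickle.dump(stand_in_tokenizer, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == stand_in_tokenizer)
```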

## 4. Generate New Text

Here we create a function that uses the model and the tokenizer to generate new text.

In [83]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

### 4.1 Text Generation Function

In [54]:
def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    '''
    INPUTS:
    model : model that was trained on text data
    tokenizer : tokenizer that was fit on text data
    seq_len : length of training sequence
    seed_text : raw string text to serve as the seed
    num_gen_words : number of words to be generated by model
    '''
    
    # Final Output
    output_text = []
    
    # Initial Seed Sequence
    input_text = seed_text
    
    # Create num_gen_words
    for i in range(num_gen_words):
        
        # Take the input text string and encode it to a sequence
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
        # Pad sequences to the length used in training (seq_len=25 words in the video)
        # pad_sequences always returns sequences of exactly maxlen tokens:
        # truncating='pre': if too long, remove tokens at the beginning
        # padding='pre' (default): if too short, fill at the beginning with value (=0.0)
        # Note that we are extending the input_text with predicted words below,
        # thus, truncation is happening all the time!
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        
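        # Predict the class (word index) with the highest probability
        # This is the next word
        # Note: predict_classes exists only in older Keras versions; in newer ones,
        # np.argmax(model.predict(pad_encoded), axis=-1)[0] gives the same index
        pred_word_ind = model.predict_classes(pad_encoded, verbose=0)[0]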
        
        # Grab (next) word
        pred_word = tokenizer.index_word[pred_word_ind] 
        
        # Update the sequence of input text (shifting one over with the new word)
        # Note that we are extending our input text with the new predicted words
        # and pad_sequences truncates/removes the beginning part every iteration!
        input_text += ' ' + pred_word
        
        # Concatenate all predicted words
        output_text.append(pred_word)
        
    # Make it look like a sentence
    return ' '.join(output_text)
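
The sliding-window effect in `generate_text` comes entirely from `pad_sequences` with `truncating='pre'`. Below is a minimal pure-Python sketch that mimics this behavior for a single sequence; `pad_pre` is a hypothetical helper, it assumes `maxlen > 0`, and it is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    # Keep only the last maxlen items (like truncating='pre') ...
    seq = list(seq)[-maxlen:]
    # ... and left-fill with `value` if the sequence is too short (like padding='pre')
    return [value] * (maxlen - len(seq)) + seq

too_long = pad_pre([5, 6, 7, 8, 9], maxlen=3)   # oldest items are dropped
too_short = pad_pre([5, 6], maxlen=3)           # zeros are added at the front
print(too_long)
print(too_short)
```

Since every predicted word is appended to `input_text`, the encoded sequence grows by one each iteration and the oldest token falls out of the window.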

### 4.2 Grab a Random Seed Text Sequence and Predict the Next 50 Words

In [66]:
import random
random.seed(101)
# randint includes both endpoints, so the upper bound must be len - 1
random_pick = random.randint(0, len(text_sequences) - 1)

In [67]:
random_seed_text = text_sequences[random_pick]

In [68]:
seed_text = ' '.join(random_seed_text)

In [69]:
seed_text

"thought i to myself the man 's a human being just as i am he has just as much reason to fear me as i have"

In [73]:
# We feed a seed sequence of words
# and for that sequence, we generate the next 50 words.
# If the model is simplistic and was trained with few epochs,
# just the most common words will be picked.
# For realistic results, we need to have a complex model (at least 150 LSTM units in 3 layers)
# and we need to train it long enough (at least 300 epochs)
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=50)

'the little and of the little and of the little and of the little and of the little and of the little and of the little and of the little and of the little and of the little and of the little and of the little and of the little'

### 4.3 Testing a Big Model Trained with the Full Moby Dick Text

This model was provided by the course.

In [77]:
#model = load_model('epoch250.h5')
model = load_model('epochBIG.h5')

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [79]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 25)            431400    
_________________________________________________________________
lstm_1 (LSTM)                (None, 25, 150)           105600    
_________________________________________________________________
lstm_2 (LSTM)                (None, 150)               180600    
_________________________________________________________________
dense_1 (Dense)              (None, 150)               22650     
_________________________________________________________________
dense_2 (Dense)              (None, 17256)             2605656   
Total params: 3,345,906
Trainable params: 3,345,906
Non-trainable params: 0
_________________________________________________________________


In [81]:
tokenizer = load(open('epochBIG','rb'))

In [82]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=50)

"to be seen there was no bad olfactories my own letter was cheerily listening over his hearers who 's more can go have a wearing answer to accumulate a vow and do him not they to think of his lances stubb sperm by the vast man 's sign by the"