<a href="https://colab.research.google.com/github/gnitnaw/LDL/blob/main/tf_framework/c12e1_autocomplete_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""



This code example is similar to c11e1_autocomplete, but it works on words (encoded with an embedding layer) instead of characters, and it does not do beam search. More context for this code example can be found in the section "Programming Example: Neural Language Model and Resulting Embeddings" in Chapter 12 of the book Learning Deep Learning by Magnus Ekman (ISBN: 9780137470358).


The initialization code below contains a couple of additional imports compared to c11e1_autocomplete and defines two new constants, MAX_WORDS and EMBEDDING_WIDTH, which set the maximum size of our vocabulary and the dimensionality of the word vectors.


# Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')
path_head = '/content/drive/MyDrive/Colab Notebooks/'  # Change this to your own Drive folder.

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text \
    import text_to_word_sequence
import tensorflow as tf
import logging
tf.get_logger().setLevel(logging.ERROR)

EPOCHS = 32
BATCH_SIZE = 256
# INPUT_FILE_NAME = '../data/frankenstein.txt'
INPUT_FILE_NAME = path_head+'data/frankenstein.txt'
WINDOW_LENGTH = 40
WINDOW_STEP = 3
PREDICT_LENGTH = 3
MAX_WORDS = 10000
EMBEDDING_WIDTH = 100


The next code snippet first reads the input file and splits the text into a list of individual words. The latter is done by using the imported function text_to_word_sequence(), which also removes punctuation and converts the text to lowercase. We then create input fragments and associated target words just as in the character-based example.


In [4]:
# Open and read file (the with block closes the file automatically).
with open(INPUT_FILE_NAME, 'r', encoding='utf-8-sig') as file:
    text = file.read()

# Make lower case and split into individual words.
text = text_to_word_sequence(text)

# Create training examples.
fragments = []
targets = []
for i in range(0, len(text) - WINDOW_LENGTH, WINDOW_STEP):
    fragments.append(text[i: i + WINDOW_LENGTH])
    targets.append(text[i + WINDOW_LENGTH])
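
To make the windowing concrete, here is a small sketch that applies the same loop to a short sentence. The sentence and the reduced window sizes are made up for illustration; the notebook itself uses WINDOW_LENGTH = 40 and WINDOW_STEP = 3.

```python
# Toy illustration of the sliding-window logic. The sentence and the small
# window sizes are assumptions chosen to keep the output readable.
words = 'the quick brown fox jumps over the lazy dog'.split()
window_length = 4  # the notebook uses WINDOW_LENGTH = 40
window_step = 2    # the notebook uses WINDOW_STEP = 3

fragments = []
targets = []
for i in range(0, len(words) - window_length, window_step):
    # Each fragment is window_length consecutive words...
    fragments.append(words[i: i + window_length])
    # ...and its target is the word that immediately follows the fragment.
    targets.append(words[i + window_length])

for fragment, target in zip(fragments, targets):
    print(fragment, '->', target)
```

Each printed line pairs a four-word fragment with the single word the network should learn to predict after it.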


The next step is to convert the training examples into the correct format. Each input word needs to be encoded as a corresponding word index (an integer). The Embedding layer then converts this index into an embedding. The target (output) word should still be one-hot encoded. To make the output easy to interpret, we want the one-hot encoding to be done in such a way that bit N is hot when the network outputs the word corresponding to index N in the input encoding.

We make use of the Keras Tokenizer class. When we construct our tokenizer, we provide an argument num_words = MAX_WORDS that caps the size of the vocabulary. The tokenizer object reserves index 0 to use as a special padding value and index 1 for unknown words. The remaining 9,998 indices (MAX_WORDS was set to 10,000) are used to represent words in the vocabulary.

The padding value (index 0) can be used to make all training examples within the same batch have the same length. The Embedding layer can be instructed to ignore this value, so the network does not train on the padding values.

Index 1 is reserved for UNKnown (UNK) words because we have declared UNK as an out-of-vocabulary (oov) token. When using the tokenizer to convert text to tokens, any word that is not in the vocabulary will be replaced by the word UNK. Similarly, if we try to convert an index that is not assigned to a word, the tokenizer will return UNK. If we do not set the oov_token parameter, it will simply ignore such words/indices.

After instantiating our tokenizer, we call fit_on_texts() with our entire text corpus, which results in the tokenizer assigning indices to words. We can then use the function texts_to_sequences() to convert a text string into a list of indices, where unknown words are assigned the index 1.
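
The index scheme described above can be imitated in a few lines of plain Python. This is only a simplified stand-in for illustration, not the Keras implementation: the helper names build_word_index and to_indices are hypothetical, and the assumption is that words are ranked by descending frequency, starting at index 2, with anything at or above num_words mapped to the UNK index.

```python
from collections import Counter

def build_word_index(words, oov_token='UNK'):
    """Assign index 1 to the OOV token and 2, 3, ... by descending frequency."""
    counts = Counter(words)
    word_index = {oov_token: 1}  # index 0 is left unused (padding)
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_indices(words, word_index, num_words, oov_token='UNK'):
    """Map words to indices; unseen or too-rare words map to the OOV index."""
    oov_index = word_index[oov_token]
    result = []
    for word in words:
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = oov_index
        result.append(index)
    return result

corpus = 'the cat saw the dog and the cat ran'.split()
word_index = build_word_index(corpus)
# With num_words=4 only indices below 4 are usable: UNK, 'the' and 'cat'.
print(to_indices(['the', 'cat', 'saw', 'bird'], word_index, num_words=4))
```

Here 'saw' is in the corpus but ranks too low for the capped vocabulary, and 'bird' was never seen, so both come back as index 1.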


In [5]:
# Convert to indices.
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token='UNK')
tokenizer.fit_on_texts(text)
fragments_indexed = tokenizer.texts_to_sequences(fragments)
targets_indexed = tokenizer.texts_to_sequences(targets)

# Convert to appropriate input and output formats.
X = np.array(fragments_indexed, dtype=np.int64)
y = np.zeros((len(targets_indexed), MAX_WORDS))
for i, target_index in enumerate(targets_indexed):
    y[i, target_index] = 1
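
As a quick check of the "bit N is hot for index N" property, the following sketch one-hot encodes a few made-up indices with a tiny vocabulary size and recovers them with argmax. The values are assumptions for illustration only.

```python
import numpy as np

vocab_size = 6              # tiny stand-in for MAX_WORDS
target_indices = [2, 5, 1]  # made-up word indices; 1 would be UNK

# One row per training example, one column per word index.
y = np.zeros((len(target_indices), vocab_size))
for row, index in enumerate(target_indices):
    y[row, index] = 1

print(y)
# The position of the hot bit in each row is the original word index.
recovered = np.argmax(y, axis=1)
print(recovered)
```

Because the column position equals the word index, the index of the largest network output can be passed straight back to the tokenizer to obtain the predicted word.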


The next code snippet creates a model with an Embedding layer followed by two long short-term memory (LSTM) layers, followed by one fully connected layer with ReLU activation, and finally a fully connected layer with softmax as output. When we declare the Embedding layer, we provide it with its input dimensions (vocabulary size) and output dimensions (embedding width) and tell it to mask inputs using index 0. This masking is not necessary for our programming example given that we created the training input such that all input examples have the same length, but we do it for good practice. We state input_length=None so that we can feed training examples of any length to the network.


In [6]:
# Build and train model.
training_model = Sequential()
training_model.add(Embedding(
    output_dim=EMBEDDING_WIDTH, input_dim=MAX_WORDS,
    mask_zero=True, input_length=None))
training_model.add(LSTM(128, return_sequences=True,
                        dropout=0.2, recurrent_dropout=0.2))
training_model.add(LSTM(128, dropout=0.2,
                        recurrent_dropout=0.2))
training_model.add(Dense(128, activation='relu'))
training_model.add(Dense(MAX_WORDS, activation='softmax'))
training_model.compile(loss='categorical_crossentropy',
                       optimizer='adam')
training_model.summary()
history = training_model.fit(X, y, validation_split=0.05,
                             batch_size=BATCH_SIZE, 
                             epochs=EPOCHS, verbose=2, 
                             shuffle=True)


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         1000000   
                                                                 
 lstm (LSTM)                 (None, None, 128)         117248    
                                                                 
 lstm_1 (LSTM)               (None, 128)               131584    
                                                                 
 dense (Dense)               (None, 128)               16512     
                                                                 
 dense_1 (Dense)             (None, 10000)             1290000   
                                                                 
Total params: 2,555,344
Trainable params: 2,555,344
Non-trainable params: 0
_________________________________________________________________
Epoch 1/32
98/98 - 62s - loss: 7.2116 - val_lo
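
The parameter counts in the model summary can be reproduced by hand. The sketch below assumes the usual layout of a Keras LSTM layer with biases (three gates plus the candidate cell update, each with input weights, recurrent weights, and a bias vector); the helper names are hypothetical.

```python
vocab_size = 10000     # MAX_WORDS
embedding_width = 100  # EMBEDDING_WIDTH
units = 128            # width of both LSTM layers and the hidden Dense layer

def lstm_params(input_width, units):
    # Three gates plus the candidate update: four weight sets, each with
    # input weights, recurrent weights, and a bias vector.
    return 4 * ((input_width + units) * units + units)

def dense_params(input_width, units):
    # One weight per input/output pair plus one bias per output.
    return input_width * units + units

counts = {
    'embedding': vocab_size * embedding_width,
    'lstm': lstm_params(embedding_width, units),
    'lstm_1': lstm_params(units, units),
    'dense': dense_params(units, units),
    'dense_1': dense_params(units, vocab_size),
}
for name, count in counts.items():
    print(f'{name}: {count:,}')
print(f'total: {sum(counts.values()):,}')
```

The embedding table and the softmax output layer account for most of the parameters, because both scale with the vocabulary size.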

After training the model, we are ready to use it to do predictions. We do this a little bit differently than in the previous chapters. Instead of feeding a string of symbols as input to the model, we feed it only a single symbol at a time. This is an alternative to the implementation in c11e1_autocomplete, where we repeatedly fed the model a growing sequence of characters (see the book for a more detailed description).

The scheme used in this chapter has a subtle implication, which has to do with dependencies between multiple consecutive calls to model.predict(). We want the LSTM layers to retain their c and h states from one call to another so that the outputs of subsequent calls to predict() will depend on the prior calls to predict(). This can be done by giving the parameter stateful=True to the LSTM layers. A side effect of this is that we manually need to call reset_states() on the model before our first prediction.

The code snippet below creates a model that is identical to the training model except that we declare the LSTM layers with stateful=True and specify a fixed batch size of 1 (a fixed batch size is required when declaring an LSTM layer as stateful) using the batch_input_shape argument. We will transfer the weights from the trained model to this new model and use it for inference. This is done in the last two lines of the code snippet. There, we first read out the weights from the trained model and then load them into our inference model. For this to work, the models must have identical topology.


In [7]:
# Build stateful model used for prediction.
inference_model = Sequential()
inference_model.add(Embedding(
    output_dim=EMBEDDING_WIDTH, input_dim=MAX_WORDS,
    mask_zero=True, batch_input_shape=(1, 1)))
inference_model.add(LSTM(128, return_sequences=True,
                         dropout=0.2, recurrent_dropout=0.2,
                         stateful=True))
inference_model.add(LSTM(128, dropout=0.2,
                         recurrent_dropout=0.2, stateful=True))
inference_model.add(Dense(128, activation='relu'))
inference_model.add(Dense(MAX_WORDS, activation='softmax'))
weights = training_model.get_weights()
inference_model.set_weights(weights)
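
To see why retained state makes one-symbol-at-a-time prediction equivalent to feeding the whole sequence, consider the toy sketch below. It uses a plain tanh recurrent cell in NumPy rather than a Keras LSTM, and the class StatefulCell with its predict() and reset_states() methods is a hypothetical stand-in that only mirrors the naming used above.

```python
import numpy as np

rng = np.random.default_rng(0)
input_width, units = 3, 4
W_x = rng.normal(size=(input_width, units))  # input weights
W_h = rng.normal(size=(units, units))        # recurrent weights

def rnn_step(x, h):
    """One timestep of a plain tanh recurrent cell."""
    return np.tanh(x @ W_x + h @ W_h)

sequence = rng.normal(size=(5, input_width))

# Variant 1: process the whole sequence in one go, starting from zero state.
h_full = np.zeros(units)
for x in sequence:
    h_full = rnn_step(x, h_full)

# Variant 2: one timestep per call, with the state kept between calls.
class StatefulCell:
    def __init__(self):
        self.reset_states()

    def reset_states(self):
        self.h = np.zeros(units)

    def predict(self, x):
        self.h = rnn_step(x, self.h)
        return self.h

cell = StatefulCell()
for x in sequence:
    h_stepwise = cell.predict(x)

# Both variants end in the same state.
print(np.allclose(h_full, h_stepwise))
```

If reset_states() were not called before starting a new sentence, the state left over from the previous sentence would leak into the new prediction.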


The next code snippet implements the logic of presenting a word to the model and retrieving the word with the highest probability from the output. This word is then fed back as input to the model in the next timestep. The resulting autocompleted text sequence is printed.


In [8]:
# Provide beginning of sentence and
# predict next words in a greedy manner
first_words = ['i', 'saw']
first_words_indexed = tokenizer.texts_to_sequences(
    first_words)
inference_model.reset_states()
predicted_string = ''
# Feed initial words to the model.
for i, word_index in enumerate(first_words_indexed):
    x = np.zeros((1, 1), dtype=np.int64)
    x[0][0] = word_index[0]
    predicted_string += first_words[i]
    predicted_string += ' '
    y_predict = inference_model.predict(x, verbose=0)[0]
# Predict PREDICT_LENGTH words.
for i in range(PREDICT_LENGTH):
    new_word_index = np.argmax(y_predict)
    word = tokenizer.sequences_to_texts(
        [[new_word_index]])
    x[0][0] = new_word_index
    predicted_string += word[0]
    predicted_string += ' '
    y_predict = inference_model.predict(x, verbose=0)[0]
print(predicted_string)


i saw and the air 
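
The greedy feedback loop can be isolated from the network with a toy example. The probability table below is entirely made up, and unlike the LSTM, this stand-in looks only at the most recent word; it is meant to show the argmax-and-feed-back pattern, not to model language.

```python
# Hypothetical next-word probabilities, invented for this illustration.
next_word_probs = {
    'i': {'saw': 0.6, 'am': 0.4},
    'saw': {'the': 0.7, 'a': 0.3},
    'the': {'monster': 0.5, 'lake': 0.3, 'air': 0.2},
    'monster': {'and': 0.9, 'the': 0.1},
    'and': {'the': 1.0},
}

def greedy_complete(first_words, predict_length):
    """Repeatedly append the most probable next word (greedy decoding)."""
    words = list(first_words)
    for _ in range(predict_length):
        probs = next_word_probs[words[-1]]
        # Pick the word with the highest probability and feed it back in.
        words.append(max(probs, key=probs.get))
    return ' '.join(words)

print(greedy_complete(['i', 'saw'], 3))
```

Greedy selection always commits to the single most likely word at each step, which is the simplification this example makes compared to the beam search used in the character-based example.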


All of the preceding code had to do with building and using a language model. The next code snippet adds some functionality to explore the learned embeddings. We first read out the word embeddings from the Embedding layer by calling get_weights() on layer 0, which represents the Embedding layer. We then declare a list of arbitrary lookup words. This is followed by a loop that does one iteration per lookup word. The loop uses the Tokenizer to convert the lookup word to a word index, which is then used to retrieve the corresponding word embedding. The Tokenizer functions are generally assumed to work on lists. Therefore, although we work with a single word at a time, we need to provide it as a list of size 1, and then we need to retrieve element zero ([0]) from the output.

Once we have retrieved the corresponding word embedding, we loop through all the embeddings (including that of the lookup word itself, which is why each result list starts with a distance of 0.0) and calculate the Euclidean distance to the embedding for the lookup word using the NumPy function norm(). We store each word index in the dictionary word_indices, keyed by its distance. Once we have calculated the distance to each word, we simply sort the distances and retrieve the five word indices that correspond to the word embeddings that are closest in vector space. We use the Tokenizer to convert these indices back to words and print them and their corresponding distances.


In [9]:
# Explore embedding similarities.
embeddings = training_model.layers[0].get_weights()[0]
lookup_words = ['the', 'saw', 'see', 'of', 'and',
                'monster', 'frankenstein', 'read', 'eat']
for lookup_word in lookup_words:
    lookup_word_indexed = tokenizer.texts_to_sequences(
        [lookup_word])
    print('words close to:', lookup_word)
    lookup_embedding = embeddings[lookup_word_indexed[0]]
    word_indices = {}
    # Calculate distances.
    for i, embedding in enumerate(embeddings):
        distance = np.linalg.norm(
            embedding - lookup_embedding)
        word_indices[distance] = i
    # Print sorted by distance.
    for distance in sorted(word_indices.keys())[:5]:
        word_index = word_indices[distance]
        word = tokenizer.sequences_to_texts([[word_index]])[0]
        print(word + ': ', distance)
    print('')


words close to: the
the:  0.0
a:  1.0806463
“man:  1.3769337
perpendicular:  1.38155
heavily:  1.3912796

words close to: saw
saw:  0.0
as:  0.48732606
which:  0.49339467
that:  0.5051552
UNK:  0.5101035

words close to: see
see:  0.0
made:  0.55155224
saw:  0.5805882
waves:  0.5845164
lake:  0.58588165

words close to: of
of:  0.0
by:  0.427217
in:  0.43739766
declared:  0.47325063
unwholesome:  0.47516632

words close to: and
and:  0.0
now:  0.37749964
am:  0.41460738
UNK:  0.4224915
lieu:  0.42396092

words close to: monster
monster:  0.0
alone:  0.49328128
prison:  0.51289016
UNK:  0.5157103
UNK:  0.51599073

words close to: frankenstein
frankenstein:  0.0
UNK:  0.44750696
UNK:  0.45352083
UNK:  0.45699644
UNK:  0.45789152

words close to: read
read:  0.0
me:  0.54325795
UNK:  0.5451037
UNK:  0.56608945
decline:  0.56610894

words close to: eat
eat:  0.0
extremely:  0.54712933
stopped:  0.56699574
whirlwind:  0.5690579
impetuous:  0.57299787
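
The same nearest-neighbor lookup can also be written without an inner Python loop. The sketch below is an alternative formulation, not the notebook's code: it uses made-up 2-D embeddings and a hypothetical helper closest_words(), computes all distances in one call, and sorts with argsort.

```python
import numpy as np

# Made-up 2-D embeddings for illustration; the real ones are 100-dimensional.
vocab = ['the', 'a', 'saw', 'see', 'monster']
embeddings = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [2.0, 2.0],
    [2.0, 2.5],
    [5.0, 1.0],
])

def closest_words(lookup_word, k=3):
    """Return the k words whose embeddings are nearest to lookup_word."""
    lookup_embedding = embeddings[vocab.index(lookup_word)]
    # Euclidean distance from every embedding to the lookup embedding.
    distances = np.linalg.norm(embeddings - lookup_embedding, axis=1)
    nearest = np.argsort(distances, kind='stable')[:k]
    return [(vocab[i], float(distances[i])) for i in nearest]

for word, distance in closest_words('saw'):
    print(word + ': ', distance)
```

Keeping the distances in an array instead of a dictionary keyed by distance also means that two words at exactly the same distance are both retained rather than one overwriting the other.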

