# Predict Next Word using LSTM

We start by importing the libraries and modules needed to train a character-level LSTM (Long Short-Term Memory) language model with Keras, that is, a model that predicts the next character of a text. Each import does the following:

- from __future__ import print_function: makes print behave as a function under Python 2, so the code runs on both Python 2 and 3. Under Python 3 it has no effect.
- from keras.callbacks import LambdaCallback: LambdaCallback builds a callback from plain functions that Keras calls at fixed points during training, such as the end of each epoch.
- from keras.models import Sequential: Sequential is a model built as a linear stack of layers, added one at a time.
- from keras.layers import Dense: Dense is a fully connected layer in which every unit receives input from every output of the previous layer.
- from keras.layers import LSTM: LSTM is a recurrent layer whose gated cell state lets it retain information across many time steps, which makes it suitable for sequence data such as text.
- from keras.optimizers import RMSprop: RMSprop is an optimizer that scales each weight update by a running average of recent squared gradients.
- from keras.utils.data_utils import get_file: get_file downloads a file from a URL, caches it locally, and returns the local path. Recent Keras releases expose the same function as keras.utils.get_file.

The cell also imports NumPy (as np) and the standard-library modules random, sys, and io.

In [1]:
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io



Next, we download the publicly available dataset nietzsche.txt, read it as UTF-8, and convert it to lowercase.

In [2]:
path = get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

corpus length: 600893


The next code cell performs the following tasks:

- It collects the unique characters in the variable 'text', sorts them, and prints how many there are.
- It creates two dictionaries, 'char_indices' and 'indices_char', that map each character to an integer index and back.
- It cuts the text into overlapping (semi-redundant) sequences of 'maxlen' = 40 characters, starting a new sequence every 'step' = 3 characters.
- It stores each sequence in the list 'sentences' and the character that immediately follows it in the list 'next_chars'.
- Finally, it prints the total number of sequences created.
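
To make the windowing concrete, here is a small sketch of the same loop on a toy string. The names toy_text, toy_maxlen, and toy_step and their values are illustrative assumptions, not the notebook's settings.

```python
# Toy illustration of cutting text into overlapping windows.
# toy_text, toy_maxlen and toy_step are made-up values for this sketch.
toy_text = "hello world"
toy_maxlen = 5
toy_step = 2

toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    # Window of toy_maxlen characters starting at position i
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    # The character right after the window is the prediction target
    toy_next_chars.append(toy_text[i + toy_maxlen])

for s, c in zip(toy_sentences, toy_next_chars):
    print(repr(s), '->', repr(c))
```

With a step smaller than the window length, consecutive windows share most of their characters, which is why the sequences are called semi-redundant.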

In [3]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

total chars: 57
nb sequences: 200285


This cell vectorizes the text: it turns each sequence into a one-hot encoded NumPy array that the network can consume. It builds an input array x and a target array y for the character-level language modeling task. The x array holds the input sequences, and the y array holds the corresponding target characters.

x is initialized as an all-zero array of shape (number of sequences, maxlen, number of unique characters), here (200285, 40, 57).

y is initialized as an all-zero array of shape (number of sequences, number of unique characters), here (200285, 57).

The code then iterates over each sequence and each character in it. For the character at position t of sequence i, it sets x[i, t, char_indices[char]] to 1 and leaves every other entry at 0. Likewise, it sets y[i, char_indices[next_chars[i]]] to 1 for the character that follows sequence i.

As a result, every time step of x and every row of y contains exactly one 1. For y, this is the target format that the softmax output layer and the categorical cross-entropy loss used later expect.
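
The encoding is easier to see on a tiny example. The following sketch applies the same loop to a three-character alphabet; all toy_ names and values are assumptions made up for illustration.

```python
import numpy as np

# Hypothetical toy data: a 3-character alphabet and two 2-character sequences.
toy_chars = sorted(set("abc"))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_sentences = ["ab", "bc"]
toy_next_chars = ["c", "a"]
toy_maxlen = 2

# Same layout as x and y: one one-hot vector per time step / per target.
toy_x = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_char_indices[char]] = 1
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = 1

print(toy_x.astype(int))  # shape (2, 2, 3)
print(toy_y.astype(int))  # shape (2, 3)
```

Storing the arrays as booleans keeps memory use low, which matters for the full-size x with its 200285 x 40 x 57 entries.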

In [4]:
print('Vectorization...')
# np.bool was deprecated in NumPy 1.20 and later removed; the built-in bool is equivalent
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Vectorization...


Now we can build the model: a single LSTM layer with 128 units, followed by a Dense softmax layer with one output per character.

In [6]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

# learning_rate replaces the deprecated lr argument
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

Build model...
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_1 (LSTM)               (None, 128)               95232     
                                                                 
 dense_1 (Dense)             (None, 57)                7353      
                                                                 
Total params: 102,585
Trainable params: 102,585
Non-trainable params: 0
_________________________________________________________________
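
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard Keras LSTM layout with biases enabled: four weight sets (input, forget, and output gates plus the candidate cell state), each with an input kernel, a recurrent kernel, and a bias vector. The variable names are chosen for illustration.

```python
# Worked parameter count for the model summary above.
units = 128      # LSTM units
n_chars = 57     # input features per time step, and Dense outputs
n_gates = 4      # input gate, forget gate, output gate, candidate cell state

# Each gate: (n_chars + units) weights per unit, plus one bias per unit
lstm_params = n_gates * (units * (n_chars + units) + units)

# Dense layer: one weight per (LSTM unit, output) pair, plus one bias per output
dense_params = units * n_chars + n_chars

print(lstm_params, dense_params, lstm_params + dense_params)
```

The three printed values match the 95232, 7353, and 102,585 reported by model.summary(). Note that maxlen does not appear: the same weights are reused at every time step.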



In [7]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

This helper function samples an index from a probability array. It takes two parameters:

- preds: the input probability array, here the model's softmax output.
- temperature: a parameter controlling the randomness of the sampling. Values above 1 flatten the distribution and make the sampling more random, while values below 1 sharpen it and make the sampling more deterministic.

The function performs the following steps:

- Converts the input probability array (preds) to a NumPy array of 64-bit floats.
- Takes the logarithm of the probabilities and divides it by the temperature.
- Calculates the exponentials of the transformed values.
- Normalizes the exponentials so that they sum to 1, which gives a new probability array.
- Draws one sample from the multinomial distribution defined by that array, which yields a one-hot vector.
- Returns the position of the 1 in that vector via np.argmax. This is the sampled index, which is not necessarily the index with the highest probability.
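
To see what the temperature does, the following sketch applies the same rescaling to a small, made-up distribution. The helper name rescale and the values in toy_preds are assumptions for illustration; rescale repeats the first steps of sample without the random draw.

```python
import numpy as np

def rescale(preds, temperature=1.0):
    # Hypothetical helper: the deterministic part of sample()
    preds = np.asarray(preds).astype('float64')
    logits = np.log(preds) / temperature
    exp_preds = np.exp(logits)
    return exp_preds / np.sum(exp_preds)

toy_preds = [0.5, 0.3, 0.2]  # made-up probabilities
for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(rescale(toy_preds, temperature), 3))

# The multinomial draw returns a one-hot row; argmax recovers the drawn index.
draw = np.random.multinomial(1, rescale(toy_preds, 1.0), 1)
sampled_index = int(np.argmax(draw))
print(draw, sampled_index)
```

At temperature 1.0 the distribution is unchanged. At 0.2 almost all of the mass moves to the most likely entry, and at 1.2 the distribution becomes slightly flatter.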

In [8]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

This cell defines the function on_epoch_end(epoch, _), a callback that Keras invokes at the end of each training epoch. It generates text with the current model and prints the output for several diversity values, which are passed to sample as the temperature.

The key steps of the function are as follows:

- It prints a header line with the current epoch number, counted from 0.
- It randomly selects a start index in the text of at most len(text) - maxlen - 1, so that a complete seed sequence of maxlen characters fits.
- It iterates over a predefined set of diversity values (0.2, 0.5, 1.0, and 1.2) that control the randomness of the text generation process.
- For each diversity value, it one-hot encodes the current window of maxlen characters, asks the model for the next-character distribution, and samples a character from it.
- It appends the sampled character to the output, slides the window forward by one character, and repeats this for 400 characters.

The function relies on names defined in the earlier cells (text, maxlen, chars, char_indices, indices_char, model, and sample) and on the imported modules random, np, and sys. The last line wraps on_epoch_end in a LambdaCallback named print_callback so that Keras calls it at the end of each training epoch.
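
The sliding-window loop can be tried without a trained network. The sketch below replaces model.predict with a hypothetical stand-in, fake_predict, and swaps the random sample call for a greedy argmax so that the result is deterministic. All names and values here are assumptions for illustration.

```python
import numpy as np

toy_chars = ['a', 'b', 'c']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}
toy_maxlen = 4

def fake_predict(x_pred):
    # Hypothetical stand-in for model.predict: favors the character that
    # follows the last one in the window (a -> b -> c -> a).
    last = int(np.argmax(x_pred[0, -1]))
    probs = np.full(len(toy_chars), 0.05)
    probs[(last + 1) % len(toy_chars)] = 0.9
    return probs[np.newaxis, :]

sentence = "abca"        # seed of exactly toy_maxlen characters
generated = sentence
for _ in range(6):
    # One-hot encode the current window
    x_pred = np.zeros((1, toy_maxlen, len(toy_chars)))
    for t, char in enumerate(sentence):
        x_pred[0, t, toy_char_indices[char]] = 1.0

    preds = fake_predict(x_pred)[0]
    next_index = int(np.argmax(preds))  # greedy pick instead of sample()
    next_char = toy_indices_char[next_index]

    generated += next_char
    sentence = sentence[1:] + next_char  # drop the oldest char, append the new one

print(generated)
```

The window always stays toy_maxlen characters long, so the input shape never changes while the generated string grows by one character per iteration.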

Now we can train the model and look at the text it generates at the end of each epoch.

In [9]:
model.fit(x, y,
          batch_size=128,
          epochs=5,
          callbacks=[print_callback])

Epoch 1/5



----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "ten mixed and entangled together, a
magn"
ten mixed and entangled together, a
magner of the means of the something to the are to the are and the sensity and a streng the present to the superity of the something to the self the were the most all the become to the says and the something of the something to the something to the mean of the sensists to a stand to the something to the most and the something of the sensity and the more and the most and the stand the world and the sen
----- diversity: 0.5
----- Generating with seed: "ten mixed and entangled together, a
magn"
ten mixed and entangled together, a
magner belongs intrigituces of the heres when the has freence, a mentance and are and the somether compresty to be the sensity of the stace lost the good many deligions and the stolly about the senserse of the mest have a stands to the is a science, and conacting it a standers to may of the grates to he have are what the here of the as the are be the freence of the are and it and an as the ment to far
----- diversity: 1.0
----- Generating with seed: "ten mixed and entang

<keras.callbacks.History at 0x7f9638185e50>