In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.4'

# Text generation with LSTM

This notebook contains the code samples found in Chapter 8, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

[...]

## Implementing character-level LSTM text generation


Let's put these ideas into practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a 
language model. You could use any sufficiently large text file or set of text files -- Wikipedia, the Lord of the Rings, etc. In this 
example we will use some of the writings of Nietzsche, the late-19th century German philosopher (translated to English). The language model 
we will learn will thus be specifically a model of Nietzsche's writing style and topics of choice, rather than a more generic model of the 
English language.

## Preparing the data

Let's start by downloading the corpus and converting it to lowercase:

In [2]:
import keras
import numpy as np

path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))

Corpus length: 600893


Next, we will extract partially overlapping sequences of length `maxlen`, together with the character that immediately follows each one. 
Further down, the sequences are converted to an integer array of shape `(sequences, maxlen)`, which the model's `Embedding` layer consumes, and the targets to a one-hot array of shape `(sequences, unique_characters)`.

In [3]:
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the character that follows each sequence)
next_chars = []

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: i for i, char in enumerate(chars)}

for i in range(0, len(text) - maxlen - step, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

Unique characters: 57
Number of sequences: 200277
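
To make the windowing concrete, here is a minimal sketch that applies the same loop bounds to a toy string. The helper name `extract_windows` and the toy values are illustrative assumptions, not part of the original notebook.

```python
def extract_windows(text, maxlen, step):
    """Return (windows, targets) using the same loop bounds as the cell above."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen - step, step):
        windows.append(text[i: i + maxlen])   # input sequence
        targets.append(text[i + maxlen])      # character right after it
    return windows, targets

toy_windows, toy_targets = extract_windows("hello world", maxlen=4, step=3)
for window, target in zip(toy_windows, toy_targets):
    print(repr(window), '->', repr(target))
```

With a corpus length of 600893, `maxlen = 60` and `step = 3`, the same bounds yield 200277 windows, which matches the count printed above.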


In [8]:
# Instead of sampling at a fixed step, keep each position with a probability that is
# inversely proportional to the frequency of its target character, so that rare
# target characters are not swamped by common ones.
from collections import Counter

seed = 1000
np.random.seed(seed)

# Number of occurrences of each character in the corpus
count = Counter(text)

# Aim for roughly 400000 / len(count) examples per target character. The value can
# exceed 1 for rare characters, in which case every occurrence is kept.
def selection_chance_fn(occ):
    return 400000.0 / (occ * len(count))

char_selection_chance = {c: selection_chance_fn(occ) for c, occ in count.items()}

sentences, next_chars = [], []
for i, chance in zip(range(maxlen, len(text)), np.random.rand(len(text) - maxlen)):
    c = text[i]
    if chance <= char_selection_chance[c]:
        # Keep this position: the preceding `maxlen` characters predict `c`
        sentences.append(text[i - maxlen:i])
        next_chars.append(c)
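
The effect of this frequency-based selection is easier to see on a toy string. The sketch below is an illustration under assumed values: `selection_chances` is a hypothetical helper and `budget` plays the role of the constant 400000. Rather than sampling, it computes the expected number of kept positions per target character.

```python
from collections import Counter

def selection_chances(text, budget):
    """Per-character keep probability, mirroring the formula in the cell above."""
    count = Counter(text)
    return {c: budget / (occ * len(count)) for c, occ in count.items()}

toy_text = "aaaaaaaabbc"          # 'a' x8, 'b' x2, 'c' x1
toy_count = Counter(toy_text)
chances = selection_chances(toy_text, budget=6)

# A uniform draw in [0, 1) passes `chance <= p` with probability min(1, p),
# so the expected number of kept positions for a character is occ * min(1, p).
expected_kept = {c: occ * min(1.0, chances[c]) for c, occ in toy_count.items()}
print(expected_kept)
```

Frequent characters are downsampled toward the per-character target (`budget` divided by the number of distinct characters), while rare characters are capped at their occurrence count. This would explain why the total number of examples can fall well short of the budget.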

## Building the network

Our network embeds each character index with an `Embedding` layer, feeds the result to a bidirectional `LSTM` layer, and ends with a `Dense` 
classifier and softmax over all possible characters. Note that recurrent neural networks are not the only way to generate sequence data; 
1D convnets have also proven very successful at it in recent times.

In [10]:
import tensorflow as tf
# Enforce dynamic session for Keras. Run to avoid GPU ram hogging
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
config.log_device_placement = True  # to log device placement (on which device the operation ran)
                                    # (nothing gets printed in Jupyter, only if you run it standalone)
sess = tf.Session(config=config)
set_session(sess)  # set this TensorFlow session as the default session for Keras

In [11]:
# Functional-API model: an embedding layer followed by a bidirectional LSTM
from keras import layers, Input
from keras.models import Model

vocab_size = len(chars)
# Each input is a sequence of `maxlen` integer character indices
inputs = Input(shape=(maxlen,))
embedded = layers.Embedding(vocab_size, 128)(inputs)
bidir_encoder = layers.Bidirectional(layers.LSTM(128, unroll=True))(embedded)
output_logits = layers.Dense(vocab_size)(bidir_encoder)
outputs = layers.Softmax()(output_logits)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 60)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 60, 128)           7296      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               263168    
_________________________________________________________________
dense_1 (Dense)              (None, 57)                14649     
_________________________________________________________________
softmax_1 (Softmax)          (None, 57)                0         
Total params: 285,113
Trainable params: 285,113
Non-trainable params: 0
_________________________________________________________________
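
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and a bias) and plugs in the sizes from the model definition; the variable names are illustrative.

```python
vocab_size = 57    # unique characters reported earlier
embed_dim = 128    # Embedding output size
units = 128        # LSTM units per direction

embedding_params = vocab_size * embed_dim
# Four gates, each with (embed_dim + units) * units weights plus `units` biases
lstm_params = 4 * (units * (embed_dim + units) + units)
# The Bidirectional wrapper holds one forward and one backward LSTM
bidirectional_params = 2 * lstm_params
# The Dense layer sees the concatenated forward/backward states (2 * units)
dense_params = (2 * units) * vocab_size + vocab_size
total_params = embedding_params + bidirectional_params + dense_params

print(embedding_params, bidirectional_params, dense_params, total_params)
# prints: 7296 263168 14649 285113
```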


Since our targets are one-hot encoded, we will use `categorical_crossentropy` as the loss to train the model:

In [12]:
optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
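
For intuition: with a one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the correct character. Here is a minimal pure-Python sketch of that idea. The function name, the `eps` clipping value and the toy numbers are illustrative assumptions, and Keras computes the loss over batches internally.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    # Clip to avoid log(0) for classes the model considers impossible
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_pred]
    return -sum(t * math.log(p) for t, p in zip(y_true, clipped))

y_true = [0.0, 1.0, 0.0]      # the correct character is class 1
y_pred = [0.1, 0.7, 0.2]      # model output after softmax
loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))         # equals -log(0.7)
```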

## Training the language model and sampling from it


Given a trained model and a seed text snippet, we generate new text by repeatedly:

* 1) Drawing from the model a probability distribution over the next character given the text available so far
* 2) Reweighting the distribution to a certain "temperature"
* 3) Sampling the next character at random according to the reweighted distribution
* 4) Adding the new character at the end of the available text

This is the code we use to reweight the original probability distribution coming out of the model, 
and draw a character index from it (the "sampling function"):

In [13]:
from scipy.special import softmax

def sample(preds, temperature=1.0):
    # Reweight one probability vector by `temperature` and draw a character index
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def mass_sample(preds, temperature=0.0):
    # Batched variant of `sample` for `preds` of shape (batch, vocab_size).
    # A temperature of 0 means greedy decoding: take the most likely index per row.
    if temperature == 0.0:
        return np.argmax(preds, axis=1)
    # Work in float64 so that each row sums to 1 within multinomial's tolerance,
    # which removes the need to shrink the probabilities by hand
    log_preds = np.log(np.asarray(preds).astype('float64')) / temperature
    softmaxed_preds = softmax(log_preds, axis=1)
    # Draw exactly one sample per row; argmax recovers the sampled index
    return np.array([np.argmax(np.random.multinomial(1, p)) for p in softmaxed_preds])
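
To see what the reweighting does on its own, the following sketch applies it to a toy distribution without any sampling. Dividing the log-probabilities by the temperature and renormalizing is the same as raising each probability to the power `1 / temperature`. The helper name `reweight` and the toy values are illustrative assumptions.

```python
import math

def reweight(probs, temperature):
    """Return probs ** (1 / temperature), renormalized to sum to 1."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

original = [0.5, 0.3, 0.2]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in reweight(original, t)])
```

Temperatures below 1 sharpen the distribution toward the most likely character, a temperature of 1 leaves it unchanged, and temperatures above 1 flatten it toward uniform.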


Finally, this is the loop where we repeatedly train and generate text. We start generating text using a range of different temperatures 
after every epoch. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of 
temperature in the sampling strategy.

In [9]:
# Convert the extracted text into arrays the model can consume
from keras.utils import to_categorical
# Features become integer character indices; labels become one-hot vectors
features = [[char_indices[c] for c in sent] for sent in sentences]
labels = to_categorical([char_indices[c] for c in next_chars], num_classes=len(chars))
# Expected shapes: (data_size, maxlen) and (data_size, vocab_size)
features = np.array(features)
labels = np.array(labels)
print(features.shape, labels.shape)

(183110, 60) (183110, 57)
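
The same conversion can be sketched on a toy vocabulary to make the two shapes explicit. This is an illustration only: the `toy_*` names are made up, and `np.eye` is used as a stand-in for `to_categorical` so that the snippet runs without Keras.

```python
import numpy as np

toy_chars = sorted(set("abc"))                          # ['a', 'b', 'c']
toy_indices = {c: i for i, c in enumerate(toy_chars)}

toy_sentences = ["abca", "bcab"]                        # two windows of length 4
toy_next = ["b", "c"]                                   # character following each window

# Integer indices, shape (data_size, window_length)
toy_features = np.array([[toy_indices[c] for c in s] for s in toy_sentences])
# Indexing the identity matrix by class index yields one-hot rows,
# shape (data_size, vocab_size)
toy_labels = np.eye(len(toy_chars))[[toy_indices[c] for c in toy_next]]

print(toy_features.shape, toy_labels.shape)
```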


In [14]:
import sys

def output_constant_temperature(seed_text, model, num_gen=400, temperature=1.0):
    # Generate `num_gen` characters, sampling at a constant temperature throughout
    for i in range(num_gen):
        # Encode the current window as a (1, len(seed_text)) array of character indices
        sampled = np.array([char_indices[c] for c in seed_text]).reshape([1, len(seed_text)])

        preds = model.predict(sampled, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = chars[next_index]
        # Slide the window: drop the oldest character and append the new one
        seed_text = seed_text[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()
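
The sliding-window update can be examined in isolation by swapping the trained model for a stand-in predictor. The sketch below is an assumption-laden illustration: `generate` and `uniform_predict` are hypothetical names, and the predictor ignores its input and returns uniform weights, so the output is random characters.

```python
import random

def generate(seed_text, predict_fn, vocab, num_gen, rng):
    """Sliding-window generation with a pluggable predictor.

    `predict_fn(window)` must return one weight per entry of `vocab`.
    """
    window, generated = seed_text, []
    for _ in range(num_gen):
        weights = predict_fn(window)
        next_char = rng.choices(vocab, weights=weights, k=1)[0]
        window = window[1:] + next_char      # keep the window length constant
        generated.append(next_char)
    return "".join(generated), window

toy_vocab = ["a", "b", "c"]

def uniform_predict(window):
    # Hypothetical stand-in for model.predict: ignores the window entirely
    return [1.0, 1.0, 1.0]

new_text, final_window = generate(
    "abcab", uniform_predict, toy_vocab, num_gen=8, rng=random.Random(0))
print(new_text, final_window)
```

Because the window always drops one character and gains one, the model keeps receiving inputs of the length it was trained on, and the final window is simply the tail of the seed plus the generated text.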

In [15]:
import random
import sys
from IPython.display import clear_output

histories = []
for epoch in range(1, 10):
    print('Epoch: ', epoch)
    # Fit the model for 1 epoch on the available training data
    current_history = model.fit(features, labels,
              batch_size=128,
              epochs=1)
    histories.append(current_history)

    # clear the output and prevent cluttering
    clear_output()
    
    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ constant temperature:', temperature)
        sys.stdout.write(generated_text)
        # Generate 400 characters, starting from the `maxlen`-character seed
        output_constant_temperature(generated_text, model, num_gen=400, temperature=temperature)
        
        print()

--- Generating with seed: "of these
things he attained the extraordinary with which he "
------ constant temperature: 0.2
of these
things he attained the extraordinary with which he which which
which which which which which which which which which purposes,
wherever,                                                            which
wherever,                                                            is
which much which which which which which which which which which
which which which which may believed which which wherever,                                                      

------ constant temperature: 0.5
of these
things he attained the extraordinary with which he whenciling
things, and provilved wherever,                                   
. but in many, by markness,"--which power, but indurous, by--where
may capent", problecip, for the whereventing plict, which domich,
which powers," which is any believed of sympathy.




11111


1



3

=progling, which which philosophy, in more

In [15]:
temperature = 0.5
start_index = random.randint(0, len(text) - maxlen - 1)
generated_text = text[start_index: start_index + maxlen]
sys.stdout.write(generated_text)
# Generate 400 characters, starting from the `maxlen`-character seed
output_constant_temperature(generated_text, model, num_gen=400, temperature=temperature)

re, than the happily preserved petit
fait that under the pile be is have the self--it is not self-and long as he who it is the should the self-more is been even with a neverty, of a incertain possessions of the freagod of therefore the would never of self-interent, in there is the finally
self--the say in the different of ever the precessal been have as do back, more is the delive of been the reason in a men must be more as it morant a make and the necesso



As you can see, a low temperature results in extremely repetitive and predictable text, but where local structure is highly realistic: in 
particular, all words (a word being a local pattern of characters) are real English words. With higher temperatures, the generated text 
becomes more interesting, surprising, even creative; it may sometimes invent completely new words that sound somewhat plausible (such as 
"markness" or "indurous" in the 0.5 sample above). With a very high temperature, the local structure starts breaking down and most words 
look like semi-random strings of characters. In this specific setup, 0.5 looks like the most interesting temperature for text generation. 
Always experiment with multiple sampling strategies! A clever balance between learned structure and randomness is what makes generation interesting.

Note that by training a bigger model, longer, on more data, you can achieve generated samples that will look much more coherent and 
realistic than ours. But of course, don't expect to ever generate any meaningful text, other than by random chance: all we are doing is 
sampling data from a statistical model of which characters come after which characters. Language is a communication channel, and there is 
a distinction between what communications are about, and the statistical structure of the messages in which communications are encoded. To 
evidence this distinction, here is a thought experiment: what if human language did a better job at compressing communications, much like 
our computers do with most of our digital communications? Then language would be no less meaningful, yet it would lack any intrinsic 
statistical structure, thus making it impossible to learn a language model like we just did.
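
A rough way to see this thought experiment in action is to compare byte statistics before and after compression. The sketch below uses `zlib` from the standard library and measures the empirical entropy of single bytes, which is only a crude proxy for "statistical structure". The helper name `bits_per_byte` and the sample sentence are illustrative choices.

```python
import math
import zlib
from collections import Counter

def bits_per_byte(data):
    """Empirical Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample_text = (
    "language is a communication channel, and there is a distinction between "
    "what communications are about, and the statistical structure of the "
    "messages in which communications are encoded. what if human language did "
    "a better job at compressing communications, much like our computers do "
    "with most of our digital communications? then language would be no less "
    "meaningful, yet it would lack any intrinsic statistical structure."
).encode("ascii")

compressed = zlib.compress(sample_text, 9)

print(len(sample_text), round(bits_per_byte(sample_text), 2))
print(len(compressed), round(bits_per_byte(compressed), 2))
```

The compressed bytes carry the same message in less space, yet their distribution is much closer to uniform. That leaves a simple next-symbol predictor far less regularity to learn from.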


## Takeaways

* We can generate discrete sequence data by training a model to predict the next token(s) given previous tokens.
* In the case of text, such a model is called a "language model" and could be based on either words or characters.
* Sampling the next token requires balance between adhering to what the model judges likely, and introducing randomness.
* One way to handle this is the notion of _softmax temperature_. Always experiment with different temperatures to find the "right" one.