In [1]:
import torch
torch.__version__

'1.7.1'

To start using zoo.orca, we first need to initialize the Orca context. This is where we specify local or distributed mode; in this example, we choose local mode.

In [None]:
from zoo.orca import init_orca_context, stop_orca_context
from zoo.orca import OrcaContext

# recommended to set it to True when running Analytics Zoo in Jupyter notebook. 
OrcaContext.log_output = True # (this will display terminal's stdout and stderr in the Jupyter notebook).

cluster_mode = "local"

if cluster_mode == "local":
    init_orca_context(cores=1, memory="2g")   # run in local mode
elif cluster_mode == "k8s":
    init_orca_context(cluster_mode="k8s", num_nodes=2, cores=4) # run on K8s cluster
elif cluster_mode == "yarn":
    init_orca_context(
        cluster_mode="yarn-client", cores=4, num_nodes=2, memory="2g",
        driver_memory="10g", driver_cores=1,
        conf={"spark.rpc.message.maxSize": "1024",
              "spark.task.maxFailures": "1",
              "spark.driver.extraJavaOptions": "-Dbigdl.failure.retryTimes=1"})   # run on Hadoop YARN cluster

# Text generation with LSTM
This notebook contains the code samples found in Chapter 8, Section 1 of Deep Learning with Python. Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

-----------------------------------------

[...]

## Implementing character-level LSTM text generation
Let's put these ideas into practice in a PyTorch implementation trained with Analytics Zoo Orca. The first thing we need is a lot of text data that we can use to learn a language model. You could use any sufficiently large text file or set of text files -- Wikipedia, the Lord of the Rings, etc. In this example we will use some of the writings of Nietzsche, the late-19th-century German philosopher (translated to English). The language model we will learn will thus be specifically a model of Nietzsche's writing style and topics of choice, rather than a more generic model of the English language.

## Preparing the data
Let's start by downloading the corpus and converting it to lowercase:

In [3]:
from torchvision.datasets.utils import download_url

download_url("https://s3.amazonaws.com/text-datasets/nietzsche.txt", '.')
text = open('./nietzsche.txt').read().lower()
print('Corpus length:', len(text))

Using downloaded and verified file: ./nietzsche.txt
Corpus length: 600893


Next, we will extract partially overlapping sequences of length `maxlen`, encode each character as its integer index, and pack the result into a 2D NumPy array `x` of shape `(sequences, maxlen)`. At the same time, we prepare an array `y` of shape `(sequences,)` containing the corresponding targets: the index of the character that comes right after each extracted sequence.

In [4]:
import numpy as np

# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: i for i, char in enumerate(chars)}

# Next, encode the characters as integer indices
# (the model's embedding layer takes indices rather than one-hot vectors).
print('Vectorization...')
x = np.zeros((len(sentences), maxlen), dtype=np.int32)
y = np.zeros((len(sentences)), dtype=np.int32)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t] = char_indices[char]
    y[i] = char_indices[next_chars[i]]

Number of sequences: 200278
Unique characters: 57
Vectorization...
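
To make the windowing concrete, the following minimal sketch applies the same loop to a short toy string. The toy text and the smaller window length are illustrative assumptions, not values from the notebook. The last lines check that the same arithmetic reproduces the sequence count reported above for a corpus of 600893 characters.

```python
# Toy illustration of the sliding-window extraction (values are illustrative).
toy_text = "hello world"
toy_maxlen = 4   # window length (the notebook uses 60)
toy_step = 3     # stride between windows (same as the notebook)

windows = []
targets = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

# Map each character to an integer index, as done for the real corpus
toy_chars = sorted(set(toy_text))
toy_indices = {c: i for i, c in enumerate(toy_chars)}
encoded = [[toy_indices[c] for c in w] for w in windows]

for w, t, e in zip(windows, targets, encoded):
    print(repr(w), '->', repr(t), e)

# Number of windows for the real corpus: 600893 characters, maxlen=60, step=3
n_sequences = len(range(0, 600893 - 60, 3))
print('Number of sequences:', n_sequences)
```

Each window overlaps the previous one by `maxlen - step` characters, which is why the number of training sequences is roughly the corpus length divided by `step`.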



## Building the network
Our network is an embedding layer followed by a single LSTM layer and a linear classifier with a (log-)softmax over all possible characters. Note that recurrent neural networks are not the only way to generate sequence data; 1D convnets have also proven extremely successful at it in recent times. Since our targets are integer character indices, we will use `nn.CrossEntropyLoss` as the loss to train the model:

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMNet(nn.Module):
    def __init__(self, maxlen, chars):
        super(LSTMNet, self).__init__()
        self.embedding = nn.Embedding(len(chars), maxlen)
        self.lstm = nn.LSTM(maxlen, 128)
        self.fc = nn.Linear(128, len(chars))
        
        self.maxlen = maxlen
        self.chars = chars

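    def forward(self, x):
        # x holds character indices with shape (batch_size, seq_len)
        x = self.embedding(x)  # (batch_size, seq_len, embedding_dim)
        # nn.LSTM expects (seq_len, batch_size, input_size) by default
        x = x.transpose(0, 1)
        # With no initial state given, h0 and c0 default to zeros
        x, _ = self.lstm(x)  # (seq_len, batch_size, 128)
        # Classify from the LSTM output at the last time step
        return F.log_softmax(self.fc(x[-1]), dim=-1)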
        
model = LSTMNet(maxlen, chars)
model.train()
criterion = nn.CrossEntropyLoss()
opt = torch.optim.RMSprop(model.parameters(), lr=0.01)

In [6]:
print(model)

LSTMNet(
  (embedding): Embedding(57, 60)
  (lstm): LSTM(60, 128)
  (fc): Linear(in_features=128, out_features=57, bias=True)
)
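
The model above returns log-probabilities (via `F.log_softmax`), while `nn.CrossEntropyLoss` itself applies a log-softmax before taking the negative log-likelihood of the target index. The pure-Python sketch below, using made-up scores, illustrates why this combination is harmless: log-softmax is idempotent, so applying it a second time leaves the values unchanged. The helper names are hypothetical and are not part of PyTorch.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

def cross_entropy(scores, target_index):
    """Log-softmax followed by the negative log-likelihood of an integer target."""
    return -log_softmax(scores)[target_index]

logits = [2.0, 0.5, -1.0]            # made-up raw scores for 3 characters
log_probs = log_softmax(logits)      # what the model's forward pass returns

loss_from_logits = cross_entropy(logits, 0)
loss_from_log_probs = cross_entropy(log_probs, 0)
print(loss_from_logits, loss_from_log_probs)
```

Both losses agree up to floating-point error, so feeding log-probabilities to a cross-entropy loss does not change the training objective.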


We also create a `data_loader_creator` function that helps us feed the data into the Orca estimator.

In [7]:
import torch
import numpy as np
from torch.utils.data import TensorDataset, DataLoader

def data_loader_creator(data, targets, batch_size, shuffle=True):
    """
    Transforms data and targets from np.array to torch.utils.data.DataLoader,
    and shuffle the dataset if `shuffle` is set to true.
    """
    
    data_tensor = torch.LongTensor(data)
    targets_tensor = torch.LongTensor(targets)
    dataset = TensorDataset(data_tensor, targets_tensor)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

In [8]:
train_loader = data_loader_creator(x, y, 128)

## Training the language model and sampling from it
Given a trained model and a seed text snippet, we generate new text by repeatedly:

* 1) Drawing from the model a probability distribution over the next character given the text available so far
* 2) Reweighting the distribution to a certain "temperature"
* 3) Sampling the next character at random according to the reweighted distribution
* 4) Adding the new character at the end of the available text
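
The four steps above can be sketched as a small loop. Here `toy_model` is a hypothetical stand-in that returns a hand-written distribution over a three-character vocabulary; in the notebook this role is played by the trained LSTM, and the vocabulary is the set of characters in the corpus.

```python
import math
import random

vocab = ['a', 'b', ' ']

def toy_model(context):
    # Hypothetical stand-in for a trained model: prefers to alternate 'a' and 'b'
    last = context[-1]
    if last == 'a':
        return [0.1, 0.7, 0.2]
    if last == 'b':
        return [0.7, 0.1, 0.2]
    return [0.45, 0.45, 0.1]

def reweight(probs, temperature):
    # Rescale log-probabilities by 1/temperature, then renormalize
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def generate(seed, length, temperature, rng):
    text = seed
    for _ in range(length):
        probs = toy_model(text)                            # step 1: predict
        probs = reweight(probs, temperature)               # step 2: reweight
        next_char = rng.choices(vocab, weights=probs)[0]   # step 3: sample
        text += next_char                                  # step 4: append
    return text

rng = random.Random(0)
sample_text = generate("ab", 20, temperature=0.5, rng=rng)
print(repr(sample_text))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible; the notebook itself samples with NumPy's global random state.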

This is the code we use to reweight the original probability distribution coming out of the model, and draw a character index from it (the "sampling function"):

In [9]:
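def sample(preds, temperature=1.0):
    # `preds` holds the log-probabilities returned by the model
    preds = np.asarray(preds).astype('float64')
    # Rescale by the temperature, then renormalize into a distribution
    preds = preds / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one character index from the reweighted distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)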
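
To see what the temperature does, the sketch below reweights a made-up three-way distribution at the temperatures used later in this notebook and reports the entropy of each result. The distribution and the helper names are illustrative assumptions. The input is given as log-probabilities, since that is what the model returns.

```python
import math

def reweight(log_probs, temperature):
    """Divide log-probabilities by the temperature and renormalize."""
    scaled = [lp / temperature for lp in log_probs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means a more random distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Made-up distribution over three characters, as log-probabilities
log_probs = [math.log(0.6), math.log(0.3), math.log(0.1)]

results = {}
for temperature in [0.2, 0.5, 1.0, 1.2]:
    probs = reweight(log_probs, temperature)
    results[temperature] = (probs, entropy_bits(probs))
    print(temperature, [round(p, 3) for p in probs], round(results[temperature][1], 3))
```

A temperature below 1 concentrates probability on the most likely character, a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it.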

Finally, this is the loop where we repeatedly train the model and generate text. After every epoch, we generate text using a range of different temperatures. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of temperature on the sampling strategy.

In [10]:
from zoo.orca.learn.pytorch import Estimator 
from zoo.orca.learn.metrics import Accuracy

est = Estimator.from_torch(model=model, optimizer=opt, loss=criterion, metrics=[Accuracy()])

creating: createZooKerasAccuracy
creating: createTorchLoss
creating: createTorchOptim
creating: createTorchModel
creating: createEstimator


In [None]:
from zoo.orca.data import XShards
import random

batch_size=128
output = ""
for epoch in range(1, 10):
    #print('epoch', epoch)
    output += "epoch " + str(epoch) + '\n'
    # Fit the model for 1 epoch on the available training data
    est.fit(data=train_loader, epochs=epoch)
    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    #print('--- Generating with seed: "' + generated_text + '"')
    output += '--- Generating with seed: "' + str(generated_text) + '"\n'
    trained_model = est.get_model()
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        #print('------ temperature:', temperature)
        output += '------ temperature: ' + str(temperature) + '\n'
        #sys.stdout.write(generated_text)
        output += generated_text

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen))
            for t, char in enumerate(generated_text):
                sampled[0, t] = char_indices[char]
                
            #sample_shards = XShards.partition({"x": sampled}, num_shards=1)
            #preds = est.predict(sample_shards).collect()
            #next_index = sample(preds[0]['prediction'][0], temperature)
            
            
            next_index = sample(trained_model(torch.LongTensor(sampled)).detach().numpy()[0], temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            #sys.stdout.write(next_char)
            output += next_char
            #sys.stdout.flush()
        #print()
        output += '\n'
    

In [12]:
print(output)

epoch 1
--- Generating with seed: " the
origin of ideas.

21. the causa sui is the best self-co"
------ temperature: 0.2
 the
origin of ideas.

21. the causa sui is the best self-co ind ane the we the athis and sthe al an t thin woure t in the ind an at thinceres thin and and ind as athind thr in an thorere in in and the icof as an the anthen and the the he athe ond in the there and wed f ce ind ind athe man an ande onde and hin ou=; the t the an me and thand t inen t is itin the win on the ind and an t thin the ind s an han alind in ind an ind t wisthen thend ce icon ind on
------ temperature: 0.5
 s an han alind in ind an ind t wisthen thend ce icon ind ond anern ourondstere arendourond the is ange t itheind nd thof theres fucildt aceerene ine k, man icesuswerd t mes ond ind thin and anasilond aner anino inindinof ind cofat ar whar is d oue ithire aredere hea ice ntlerer hais an theshere ble t me thers therthin s a ald is tit oun hatayer land ces thour at is in d mo menther o at n an

As you can see, a low temperature results in extremely repetitive and predictable text, but where local structure is highly realistic: in particular, all words (a word being a local pattern of characters) are real English words. With higher temperatures, the generated text becomes more interesting, surprising, even creative; it may sometimes invent completely new words that sound somewhat plausible (such as "eterned" or "troveration"). With a high temperature, the local structure starts breaking down and most words look like semi-random strings of characters. Without a doubt, here 0.5 is the most interesting temperature for text generation in this specific setup. Always experiment with multiple sampling strategies! A clever balance between learned structure and randomness is what makes generation interesting.

Note that by training a bigger model, longer, on more data, you can achieve generated samples that will look much more coherent and realistic than ours. But of course, don't expect to ever generate any meaningful text, other than by random chance: all we are doing is sampling data from a statistical model of which characters come after which characters. Language is a communication channel, and there is a distinction between what communications are about, and the statistical structure of the messages in which communications are encoded. To evidence this distinction, here is a thought experiment: what if human language did a better job at compressing communications, much like our computers do with most of our digital communications? Then language would be no less meaningful, yet it would lack any intrinsic statistical structure, thus making it impossible to learn a language model like we just did.
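
The thought experiment can be illustrated, very roughly, with a general-purpose compressor. The sketch below compares the empirical per-byte entropy of a short English passage with that of its `zlib`-compressed form. This is only a crude proxy under stated assumptions: order-0 byte entropy ignores context, the passage is short, and `zlib` is far from an ideal compressor.

```python
import math
import zlib
from collections import Counter

def entropy_bits_per_byte(data):
    """Empirical (order-0) entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

passage = (
    "language is a communication channel, and there is a distinction "
    "between what communications are about, and the statistical structure "
    "of the messages in which communications are encoded. to evidence this "
    "distinction, here is a thought experiment: what if human language did "
    "a better job at compressing communications, much like our computers do "
    "with most of our digital communications? then language would be no less "
    "meaningful, yet it would lack any intrinsic statistical structure."
)

raw = passage.encode("utf-8")
compressed = zlib.compress(raw, 9)

raw_entropy = entropy_bits_per_byte(raw)
compressed_entropy = entropy_bits_per_byte(compressed)

print("raw bytes:", len(raw), "entropy per byte:", round(raw_entropy, 2))
print("compressed bytes:", len(compressed), "entropy per byte:", round(compressed_entropy, 2))
```

The compressed bytes carry the same message but look much closer to uniform noise, which leaves a next-character model far less statistical structure to exploit.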

## Takeaways
* We can generate discrete sequence data by training a model to predict the next token(s) given previous tokens.
* In the case of text, such a model is called a "language model" and could be based on either words or characters.
* Sampling the next token requires balance between adhering to what the model judges likely, and introducing randomness.
* One way to handle this is the notion of softmax temperature. Always experiment with different temperatures to find the "right" one.

In [13]:
stop_orca_context()

Stopping orca context
