# Generating text

Today we will train a simple text generation model based [on the Cyberspace Solarium Commission's final report](https://www.solarium.gov/). 

This notebook will also show you how to take advantage of a GPU cluster, like Aquamentus. At the time this notebook was written, each epoch of training took 3-4 minutes on a personal laptop compared to 15 seconds on a single Aquamentus GPU. This notebook will show you how to use all four GPUs. However, there is quite a bit of overhead in using multiple GPUs, so I recommend sticking to a single GPU unless your model is really large. 

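**IMPORTANT:** Make sure that your TensorFlow installation has GPU support. On older releases that means installing `tensorflow-gpu`, not `tensorflow`; since TensorFlow 2.1, the standard `tensorflow` package includes GPU support. 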



## Download

Let's start by downloading the corpus and using Keras's text preprocessing library to normalize and create an array of words:

In [0]:
import requests
import tensorflow

from tensorflow import keras
from collections import Counter
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence

url = 'https://drive.google.com/uc?export=download&id=1Ke0h_WWOsndoQLBFutNBfZg0Qkksi-DD'
response = requests.get(url)

# tokenize while preserving periods 
filters = Tokenizer().filters.replace('.','')
words = text_to_word_sequence(response.text, filters=filters)

# filter out words that only appear once
counts = Counter(words)
words = [word for word in words if counts[word] > 1]

print('The final report is', len(words), "words")

## Vectorize

Next, we will generate our inputs and labels. Each input is a sequence of words from the corpus, and its label is the word that should follow that sequence. For example, if we were using sequences of 4 words, then one input would be: 

```python
sample_input = ['you', 'spend', 'your', 'whole']
sample_label = 'career'
```

Of course, we will use numbers instead of words and our labels will be probability distributions over all possible words. For the example above, the label will be an array of zeros except for the entry that corresponds to the word `career`, which would be a `1`. 
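
To make the windowing and one-hot labels concrete, here is a minimal sketch on a toy six-word corpus. The names `toy_words`, `toy_index`, `toy_x`, and `toy_y` are made up for illustration and the vocabulary is built by hand; the notebook itself uses the Keras `Tokenizer` on the full report.

```python
import numpy as np

# Toy corpus and vocabulary (hypothetical; the notebook builds these from the report).
toy_words = ['you', 'spend', 'your', 'whole', 'career', '.']
toy_index = {word: i + 1 for i, word in enumerate(toy_words)}  # index 0 is reserved
toy_nums = [toy_index[word] for word in toy_words]

window = 4                        # sequence length used in the example above
vocab_size = len(toy_index) + 1   # +1 for the reserved index 0

# Every run of `window` consecutive words is an input ...
toy_x = np.array([toy_nums[i:i + window] for i in range(len(toy_nums) - window)])

# ... and the word that follows it is the label, stored as a one-hot row.
toy_y = np.zeros((len(toy_x), vocab_size), dtype=bool)
toy_y[np.arange(len(toy_x)), toy_nums[window:]] = True

print(toy_x[0])              # indices for 'you spend your whole'
print(toy_y[0].astype(int))  # a single 1 at the index of 'career'
```

Each row of `toy_y` contains exactly one `1`, which is what `categorical_crossentropy` expects as a target.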

In [0]:
import numpy as np

# create a tokenizer for our corpus
tokenizer = Tokenizer(filters=filters)
tokenizer.fit_on_texts(words)

# use the average sentence length (words per period) as the sequence length
sequence_length = int(len(words) / counts['.'])

# add one for index zero, which is a reserved index
num_unique_words = len(tokenizer.word_index) + 1

# convert words to numbers 
nums = tokenizer.texts_to_sequences([' '.join(words)])[0]

# grab all sequences of `sequence_length`
# drop the last few so we don't have to worry about padding. 
num_sequences = len(nums) - sequence_length
x = np.array([nums[i:i+sequence_length] for i in range(num_sequences)])

# we actually need y to be a prob dist over all possible words ...
# (`np.bool` is missing from some NumPy releases; the builtin `bool` always works)
y = np.zeros((num_sequences, num_unique_words), dtype=bool)

# set the correct index in y to 1
for i in range(num_sequences):
  y[i][nums[i+sequence_length]] = 1

## Model

To utilize multiple GPUs, you need a [distribution strategy](https://www.tensorflow.org/tutorials/distribute/keras). We'll use `MirroredStrategy`, which splits each batch across the GPUs, trains on each slice, then combines and syncs the results across all GPUs. 
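
As a rough illustration of what "divides each batch by the number of GPUs" means for the numbers involved, here is an arithmetic sketch in plain NumPy. It does not call TensorFlow, and the four-replica count is an assumption based on the four GPUs mentioned earlier; it only shows how many examples each GPU would see per step.

```python
import numpy as np

global_batch_size = 256   # the batch size passed to model.fit
num_replicas = 4          # assumed: one replica per GPU on a four-GPU machine

# Stand-in for one batch of training examples (just row indices here).
batch = np.arange(global_batch_size)

# Each replica receives an equal slice of the global batch.
shards = np.array_split(batch, num_replicas)
per_replica_batch_size = [len(shard) for shard in shards]

print(per_replica_batch_size)
```

With a batch of 256 and four replicas, each GPU only processes 64 examples per step, which is one reason a larger global batch size helps keep every GPU busy.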

It is important that you declare and fit your model in the distributed scope. Note that you **cannot save your model** in the distributed scope.

The [TensorFlow team recommends](https://www.tensorflow.org/tutorials/distribute/keras#setup_input_pipeline) using the largest batch size that will fit in GPU memory since moving data between system memory and GPU memory is expensive. 

In [0]:
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense

# a big batch is better for GPUs, but you should probably 
# add a learning rate schedule to compensate ...
batch_size = 256
embedding_dims = 64

strategy = tensorflow.distribute.MirroredStrategy()
num_gpus = strategy.num_replicas_in_sync

print('Number of GPUs: {}'.format(num_gpus))

with strategy.scope():
  # named `inputs` to avoid shadowing Python's builtin `input`
  inputs = Input(shape=(sequence_length,))
  embed = Embedding(num_unique_words, embedding_dims, input_length=sequence_length)(inputs)
  
  # I recommend using LSTM if you need recurrent layers and plan to export to tfjs:
  # https://github.com/tensorflow/tfjs/issues/2442
  recurrent = Bidirectional(LSTM(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.5))(embed)
  recurrent = Bidirectional(LSTM(128, dropout=0.1, recurrent_dropout=0.5))(recurrent)

  output = Dense(num_unique_words, activation='softmax')(recurrent)

  model = Model(inputs=inputs, outputs=output)
  model.compile(loss='categorical_crossentropy', optimizer='RMSProp', metrics=['accuracy']) 

model.summary()

## Sampling

Since we don't want to generate the same text sequences every time, we will introduce some randomness into the sampling process. How much randomness? We'll use a softmax temperature algorithm in which higher temperatures produce more randomness in the sample. 
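
To get a feel for what temperature does, here is a small sketch that rescales a made-up three-word distribution at a few temperatures. The helper name `apply_temperature` and the probabilities are hypothetical; the math is the same log, divide, exponentiate, and renormalize sequence used for sampling in this notebook, without the random draw.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability distribution; lower temperature sharpens it."""
    logits = np.log(np.asarray(probs, dtype='float64')) / temperature
    exp = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exp / np.sum(exp)

# Hypothetical next-word probabilities for a three-word vocabulary.
probs = [0.6, 0.3, 0.1]

for temperature in [0.2, 1.0, 2.0]:
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

A temperature of `1.0` leaves the distribution unchanged, values below `1.0` push almost all of the probability onto the most likely word, and values above `1.0` flatten the distribution so unlikely words get picked more often.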

A very low temperature will always produce the same next word - the word that has the highest probability. If that's what you're after, then I recommend using [beam search](https://github.com/dabasajay/Image-Caption-Generator/blob/master/utils/model.py) instead. Beam search looks for the *sequence* with the highest overall probability, not just the next *word* with the highest probability. 
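
Here is a minimal sketch of beam search on a toy problem, to show why it can beat picking the most likely word at every step. Everything in it is an assumption for illustration: `TOY_MODEL` is a hand-written lookup table that only conditions on the previous word, and `beam_search` is not taken from the linked repository or from this notebook's model.

```python
import math

# Hypothetical next-word probabilities keyed by the previous word.
# A real model would condition on the whole sequence instead.
TOY_MODEL = {
    '<s>':    {'the': 0.6, 'a': 0.4},
    'the':    {'report': 0.5, 'nation': 0.5},
    'a':      {'report': 0.9, 'nation': 0.1},
    'report': {'.': 1.0},
    'nation': {'.': 1.0},
}

def beam_search(start, beam_width=2, max_length=5):
    """Return the highest-scoring sequence found, with its log-probability."""
    beams = [([start], 0.0)]  # (sequence, log-probability)
    for _ in range(max_length):
        candidates = []
        for sequence, score in beams:
            if sequence[-1] == '.':
                candidates.append((sequence, score))  # finished; carry forward
                continue
            for word, prob in TOY_MODEL[sequence[-1]].items():
                candidates.append((sequence + [word], score + math.log(prob)))
        # keep only the `beam_width` most probable partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(sequence[-1] == '.' for sequence, _ in beams):
            break
    return beams[0]

best_sequence, best_score = beam_search('<s>')
print(' '.join(best_sequence[1:]), round(math.exp(best_score), 2))  # a report . 0.36
```

With `beam_width=1` this reduces to greedy decoding, which starts with `the` (probability 0.6) and ends up with a sequence probability of 0.3. Keeping two candidates lets the search recover `a report .` at 0.36. Beam search is still a heuristic, so a wider beam costs more compute but is less likely to miss the best sequence.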

In [0]:
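def sample(preds, temperature=1.0):
    # rescale the predicted distribution by the temperature
    preds = np.asarray(preds).astype('float64')
    # the tiny offset (my addition) avoids log(0) for words with zero probability
    preds = np.log(preds + 1e-12) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw one word index from the rescaled distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)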

# given a phrase, complete it using the given model and tokenizer
def complete_phrase(model, tokenizer, phrase, temperature=0.5, max_length=25):
    print("[", phrase, "]", end=" ")
    
    # left-pad the phrase with zeros, keeping at most the last `sequence_length` words
    vector = tokenizer.texts_to_sequences([phrase])[0][-sequence_length:]
    sequence = np.zeros(sequence_length)
    if vector:
      sequence[-len(vector):] = vector

    for i in range(max_length):
      preds = model(np.expand_dims(sequence, axis=0), training=False)[0]
      
      index = sample(preds, temperature)
      # index 0 is reserved for padding and has no word
      word = tokenizer.index_word.get(index, '')
      
      # slide the window forward and append the sampled word
      sequence = np.roll(sequence, -1)
      sequence[-1] = index      
      
      print(word, end=" ")
      if word == '.':
        break

## Train

We're going to pause training every 10 epochs so we can see results. This is terrible for performance, but it's really hard to say how well the model is doing without generating some text. We'll train for up to 1,000 epochs, saving the model weights every 10 epochs. 
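
One detail worth knowing about "every 10 epochs": Keras passes zero-based epoch indices to `on_epoch_end`, so a check like `epoch % 10 == 0` fires after the very first epoch. The sketch below just lists which epochs would trigger under two different checks; the 30-epoch run length is made up to keep the output short.

```python
# Keras passes zero-based epoch indices to on_epoch_end.
total_epochs = 30  # hypothetical run length, shortened for illustration

# The check used in this notebook: fires after the 1st, 11th, 21st, ... epoch.
zero_based_hits = [epoch for epoch in range(total_epochs) if epoch % 10 == 0]

# An alternative that fires after the 10th, 20th, 30th, ... epoch.
tenth_epoch_hits = [epoch for epoch in range(total_epochs) if (epoch + 1) % 10 == 0]

print(zero_based_hits)   # [0, 10, 20]
print(tenth_epoch_hits)  # [9, 19, 29]
```

Firing after the first epoch is handy here because you get a sample of generated text right away, but switch to the second form if you only want checkpoints after full blocks of 10 epochs.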

There appears to be [some issue](https://github.com/keras-team/keras/issues/13861) calling `predict` on models that were generated using a distributed strategy. You *should* be able to use the `predict` method to get results:

```python
results = model.predict(inputs)
```

But that produces a shape error. Instead, treat the model as a function: 

```python
results = model(inputs, training=False)
```


In [0]:
import os.path
from random import choice
from tensorflow.keras.callbacks import Callback

checkpoint_path = 'csc-weights.h5'

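# create a callback that saves the weights and generates some text every 10 epochs
class CheckpointAndEvaluate(Callback):
  def on_epoch_end(self, epoch, logs=None):
    # `epoch` is zero-based, so this runs after the 1st, 11th, 21st, ... epoch
    if epoch % 10 == 0: 
      print(f"Saving {checkpoint_path} ...")
      # `self.model` is the model being trained, set by Keras
      self.model.save_weights(checkpoint_path)

      print("Evaluating model ...")
      for temperature in [0.2, 0.5, 1.2]:
        print('temperature:', temperature, end=" ")
        # seed the generator with the first 5 words of a random training sequence
        phrase = choice(x)[0:5]
        complete_phrase(
          self.model,
          tokenizer, 
          tokenizer.sequences_to_texts([phrase])[0],
          temperature
        )
        print()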

if os.path.isfile(checkpoint_path):
  print(F"Loading saved weights from {checkpoint_path} ...")
  model.load_weights(checkpoint_path) 

with strategy.scope():
  model.fit(x, y, epochs=1000, batch_size=batch_size, callbacks=[CheckpointAndEvaluate()])