# Shakespeare Text Generator

This notebook uses a recurrent neural network (RNN) to generate Shakespeare-like text. It is adapted from the [RNN example in Hands-On Machine Learning](https://github.com/ageron/handson-ml2/blob/master/16_nlp_with_rnns_and_attention.ipynb).

We first import the libraries needed for training. In this notebook, we use TensorFlow to build our model. We also open the input file containing the Shakespeare text used for training, taken from [karpathy's repository](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt).

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

with open("input.txt", "r") as f:
  text = f.read()

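We then tokenize the text at the character level, representing each character with a single integer ID. The tokenizer assigns IDs starting at 1, so we subtract 1 from the encoded sequence to make the minimum ID 0 rather than 1. After that, we build a TensorFlow dataset for training from the first 90% of the data. Because ```fit_on_texts``` receives a single string, it treats every character as one document, so ```tokenizer.document_count``` equals the total number of characters.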

In [2]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level = True)
tokenizer.fit_on_texts(text)

[token] = np.array(tokenizer.texts_to_sequences([text])) - 1

train_ds = tf.data.Dataset.from_tensor_slices(
    token[:int(tokenizer.document_count * .9)])
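
To make the ID mapping concrete, here is a minimal pure-Python sketch of character-level tokenization followed by a 90% split. It does not use the Keras ```Tokenizer```; the helper names ```build_vocab``` and ```encode``` are hypothetical, and the choice to lowercase the text and assign IDs by descending frequency starting at 0 is an assumption meant to approximate the result of the cell above, not a reproduction of it.

```python
from collections import Counter

def build_vocab(text):
    """Map each lowercase character to an ID, most frequent first, starting at 0."""
    counts = Counter(text.lower())
    # most_common() sorts by descending count; ties keep first-seen order
    return {ch: i for i, (ch, _) in enumerate(counts.most_common())}

def encode(text, vocab):
    """Turn a string into a list of character IDs."""
    return [vocab[ch] for ch in text.lower()]

sample = "To be, or not to be"
vocab = build_vocab(sample)
ids = encode(sample, vocab)

# Keep the first 90% of the IDs for training, as in the cell above
split = int(len(ids) * 0.9)
train_ids, held_out_ids = ids[:split], ids[split:]

print(len(vocab), min(ids), len(train_ids), len(held_out_ids))
```

Starting the IDs at 0 matters later: the one-hot encoding uses the ID directly as an index, so the IDs must cover exactly the range 0 to ```char_size - 1```.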

Here we define several constants for training. We set ```steps``` to 100, so the model receives input sequences of 100 characters. ```w_len``` is the window length: it is one character longer than ```steps``` so that each window also holds the target for the last input character. ```char_size``` is the number of distinct character IDs. We use a relatively large batch size because the dataset yields on the order of a million windows, and larger batches reduce training time.

We then apply a series of transformations to the dataset. First, ```window``` slices the dataset into overlapping windows of length ```w_len```, and ```flat_map``` converts each window (itself a small dataset) into a single tensor. After shuffling and batching, we split each window into inputs (all but the last character) and labels (the same sequence shifted one character ahead), and one-hot encode the inputs.

In [3]:
steps = 100
w_len = steps + 1
batch_size = 1024
char_size = len(tokenizer.word_index)

ds = train_ds.window(w_len, shift = 1, drop_remainder = True)
ds = ds.flat_map(lambda w : w.batch(w_len))
ds = ds.shuffle(steps * 100).batch(batch_size)
ds = ds.map(lambda w: (w[:, :-1], w[:, 1:]))
ds = ds.map(lambda x, y: (tf.one_hot(x, depth = char_size), y)).prefetch(1)
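
The windowing and the input/label split can be illustrated without TensorFlow. The following is a small sketch on a toy sequence, assuming a window shift of 1 and dropping incomplete windows as in the cell above; ```make_windows``` is a hypothetical helper, not part of ```tf.data```.

```python
def make_windows(ids, steps):
    """Return (inputs, targets) pairs from every window of steps + 1 IDs."""
    w_len = steps + 1
    pairs = []
    # shift = 1: a new window starts at every position; incomplete windows are dropped
    for start in range(len(ids) - w_len + 1):
        window = ids[start:start + w_len]
        # targets are the inputs shifted one position ahead
        pairs.append((window[:-1], window[1:]))
    return pairs

pairs = make_windows(list(range(6)), steps=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target sequence has the same length as its input, so the model is trained to predict the next character at every time step, not only after the last one.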

We can then build our model. It is very simple: two GRU layers followed by a softmax output layer applied at every time step. We set ```recurrent_dropout``` to 0 so that the GRU layers can use the cuDNN-accelerated implementation on the GPU. We compile the model with the Adam optimizer and train it on the dataset for 20 epochs.

In [4]:
model = tf.keras.Sequential([
    tf.keras.layers.GRU(128, return_sequences = True,
                        dropout = .2,
                        recurrent_dropout = 0),
    tf.keras.layers.GRU(128, return_sequences = True,
                        dropout = .2,
                        recurrent_dropout = 0),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(char_size, activation = "softmax")
    )
])
model.compile(loss = "sparse_categorical_crossentropy", optimizer = "adam")
history = model.fit(ds, epochs = 20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
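
For intuition about the loss named in ```model.compile```, here is a minimal sketch of sparse categorical cross-entropy for a single prediction. It is a simplified illustration, not the Keras implementation: it assumes every probability is strictly positive and handles one time step, whereas Keras averages the value over all time steps and sequences in a batch.

```python
import math

def sparse_categorical_crossentropy(probs, target_id):
    """Negative log-probability that the model assigned to the true class."""
    return -math.log(probs[target_id])

probs = [0.1, 0.7, 0.2]  # predicted distribution over three character IDs
low_loss = sparse_categorical_crossentropy(probs, 1)   # true ID got high probability
high_loss = sparse_categorical_crossentropy(probs, 0)  # true ID got low probability
print(low_loss, high_loss)
```

The "sparse" variant takes the target as an integer ID, which is why only the inputs, and not the labels, are one-hot encoded in the dataset pipeline.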


Here we define a function to preprocess the text so that it can be fed into the model we just trained.

In [5]:
def preprocess(text):
  x = np.array(tokenizer.texts_to_sequences(text)) - 1

  return tf.one_hot(x, char_size)
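
As a reminder of what ```tf.one_hot``` produces, the sketch below builds the same kind of encoding with plain Python lists. The ```one_hot``` helper here is hypothetical and only handles a flat list of IDs.

```python
def one_hot(ids, depth):
    """Return one row per ID, with 1.0 at that ID's index and 0.0 elsewhere."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]

rows = one_hot([2, 0, 1], depth=4)
for row in rows:
    print(row)
```

Each row has ```depth``` entries, so ```depth``` must equal ```char_size``` for the encoded text to match the input shape the model was trained on.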

We test the model on a short prompt. The predicted next character should be 'u'!

In [6]:
sample_x = preprocess(["How are yo"])
sample_pred = model.predict(sample_x)
print(tokenizer.sequences_to_texts(np.argmax(sample_pred, axis = -1) + 1)[0][-1])

u


To generate a sequence of characters, we define two functions. ```next_char``` takes the text so far, predicts the probabilities of the next character ID, and converts the sampled ID back into a character. It uses two techniques to avoid repetitive output. First, instead of always picking the most likely character, it samples a character with ```tf.random.categorical``` according to the predicted probabilities. Second, it takes a temperature parameter ```temp``` that controls how strongly the generator favors high-probability characters: a temperature close to 0 favors them heavily, while a high temperature makes all characters closer to equally likely. ```complete``` simply calls ```next_char``` repeatedly and appends each generated character to the text.

In [7]:
def next_char(text, temp = 1):
  x = preprocess([text])
  y = model.predict(x)[0, -1:, :]
  logits = tf.math.log(y) / temp
  char = tf.random.categorical(logits, num_samples = 1) + 1
  return tokenizer.sequences_to_texts(char.numpy())[0]

def complete(text, n_char = 50, temp = 1):
  for _ in range(n_char):
    text += next_char(text, temp)

  return text
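
The effect of the temperature can be seen on a small, made-up distribution. The following sketch uses only the standard library and assumes all probabilities are strictly positive; ```apply_temperature``` and ```sample_id``` are hypothetical helpers that mirror the log, divide, and sample steps of ```next_char``` rather than reuse its code.

```python
import math
import random

def apply_temperature(probs, temp):
    """Rescale a probability distribution: softmax(log(p) / temp)."""
    scaled = [math.log(p) / temp for p in probs]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_id(probs, temp, rng=random):
    """Draw one index at random according to the rescaled distribution."""
    weights = apply_temperature(probs, temp)
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.6, 0.3, 0.1]
for temp in (0.2, 1.0, 2.0):
    print(temp, [round(p, 3) for p in apply_temperature(probs, temp)])
```

With a temperature of 1 the distribution is unchanged, a temperature below 1 concentrates probability on the most likely ID, and a temperature above 1 flattens the distribution.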

In our last step, we can finally generate text in the style of Shakespeare. Starting with the letter T, what does our model generate?

In [8]:
print(complete("T", temp = 1))

T sir.

arswers:
percupio, shr she down as iron, go
