# Text generation by RNN

### Andrej Karpathy 
"The Unreasonable Effectiveness of Recurrent Neural Networks"

QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m

In [1]:
import tensorflow as tf
print(tf.__version__)
import numpy as np
import os 
import time

2.0.0


### load dataset

In [2]:
path_to_file = tf.keras.utils.get_file("shakespeare.txt", 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


### load data

In [3]:
# read the file as bytes and decode it as UTF-8 (this also works on Python 2)
text = open(path_to_file, "rb").read().decode(encoding="utf-8")
# len(text) is the number of characters in the corpus
print("Length of text: {} characters".format(len(text)))

Length of text: 1115394 characters


In [4]:
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [5]:
# the set of unique characters in the text, and the same characters as a sorted list
factor_vocab = set(text)
vocab = sorted(factor_vocab)

print(factor_vocab)
print(vocab)
print("{} unique characters".format(len(vocab)))

{' ', 'c', 'F', 'E', 'I', 'R', 'U', 'T', '\n', 'i', 't', 'X', ';', 'd', 'h', 'Q', 'y', 'z', 'm', 'g', 'B', 'j', 'a', 'l', 'w', 'P', 'u', 'r', 'n', 'A', 'S', 'W', 'K', '3', 'D', 'M', "'", 'O', 'o', 'G', 'e', 'k', '&', '$', 'Z', 's', 'x', ':', 'v', '?', 'H', '.', 'b', 'q', '-', 'f', 'L', 'J', 'p', 'Y', '!', 'C', 'V', 'N', ','}
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
65 unique characters


## Process the text
### vectorize the text

In [12]:
# build a lookup table from each character to its index, and the reverse mapping
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

print("***************** char2idx *****************")
print(char2idx)
print("***************** idx2char *****************")
print(idx2char)
# print(text)
print("***************** text_as_int *****************")
print(text_as_int)

***************** char2idx *****************
{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
***************** idx2char *****************
['\n' ' ' '!' '$' '&' "'" ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E'
 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W'
 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']
***************** text_as_int *****************
[18 47 56 ... 45  8

In [14]:
print("{")
for char, _ in zip(char2idx, range(20)):
    print(" {:4s}: {:3d},".format(repr(char), char2idx[char]))

print(" ...\n")

{
 '\n':   0,
 ' ' :   1,
 '!' :   2,
 '$' :   3,
 '&' :   4,
 "'" :   5,
 ',' :   6,
 '-' :   7,
 '.' :   8,
 '3' :   9,
 ':' :  10,
 ';' :  11,
 '?' :  12,
 'A' :  13,
 'B' :  14,
 'C' :  15,
 'D' :  16,
 'E' :  17,
 'F' :  18,
 'G' :  19,
 ...



In [15]:
print("{} ----- characters mapped to int -----> {}".format(repr(text[:13]), text_as_int[:13]))

'First Citizen' ----- characters mapped to int -----> [18 47 56 57 58  1 15 47 58 47 64 43 52]
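
As a sanity check, the character-to-index mapping should be invertible. The following is a minimal sketch in plain Python, assuming a toy vocabulary built only from the string `'First Citizen'` (so the indices differ from the full-corpus `char2idx` above); the names `sample`, `encoded` and `decoded` are illustrative only.

```python
# Toy round trip: characters -> integer ids -> characters.
sample = "First Citizen"

# Vocabulary built from the sample only (assumption: not the full corpus).
vocab = sorted(set(sample))
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = vocab  # a list already maps index -> character

encoded = [char2idx[ch] for ch in sample]
decoded = "".join(idx2char[i] for i in encoded)

print(encoded)
print(repr(decoded))
```

If `decoded` equals `sample`, no information is lost by vectorizing the text.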


## Prediction task
Given a character, or a sequence of characters, what is the most probable next character?

This is the task we want to train the model to perform. The input to the model is a sequence of characters, and the model is trained to predict the output: the next character at each time step.

## Create a training sample and target
We then split the text into sample sequences. Each input sequence contains seq_length characters from the original text.

For each input sequence, the corresponding target contains the same length of text, but shifted to the right by one character.

Therefore, the text is split into chunks of seq_length+1 characters. For example, suppose seq_length is 4 and the text is "Hello". The input sequence would be "Hell" and the target sequence would be "ello".

To do this, we first convert the text vector into a stream of character indices using the tf.data.Dataset.from_tensor_slices function.
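
The input/target split can be illustrated on the "Hello" example with plain Python strings. This is only a sketch of the idea, assuming seq_length is 4; the notebook itself applies the same slicing to tensors of character indices.

```python
def split_input_target(chunk):
    """Return (input, target), where target is the input shifted by one position."""
    return chunk[:-1], chunk[1:]

seq_length = 4
text = "Hello"

# One chunk holds seq_length + 1 characters.
chunk = text[:seq_length + 1]
input_text, target_text = split_input_target(chunk)

# At each step the model sees input_text[t] and should predict target_text[t].
pairs = list(zip(input_text, target_text))
print(repr(input_text), repr(target_text))
print(pairs)
```

Each pair in `pairs` is one training signal: after "H" the expected character is "e", after "e" it is "l", and so on.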

In [24]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
print(examples_per_epoch)

# create training sample and target
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
print(char_dataset)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

print("***********************************")
for i in char_dataset.take(5):
    print(i.numpy())
    print(i)

11043
<TensorSliceDataset shapes: (), types: tf.int64>
F
i
r
s
t
***********************************
18
tf.Tensor(18, shape=(), dtype=int64)
47
tf.Tensor(47, shape=(), dtype=int64)
56
tf.Tensor(56, shape=(), dtype=int64)
57
tf.Tensor(57, shape=(), dtype=int64)
58
tf.Tensor(58, shape=(), dtype=int64)


In [30]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
# Dataset.as_numpy_iterator() is not available in TF 2.0.0 (calling it raises AttributeError),
# so iterate over the dataset directly and convert each tensor with .numpy()
for element in dataset:
  print(element.numpy())

1
2
3

In [32]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
    print(repr("".join(idx2char[item.numpy()])))

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


In [33]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

### output the first sample input and target

In [34]:
for input_example, target_example in dataset.take(1):
    print("Input data: ", repr("".join(idx2char[input_example.numpy()])))
    print("Target data: ", repr("".join(idx2char[target_example.numpy()])))

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data:  'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


In [37]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print(" input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print(" expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
 input: 18 ('F')
 expected output: 47 ('i')
Step    1
 input: 47 ('i')
 expected output: 56 ('r')
Step    2
 input: 56 ('r')
 expected output: 57 ('s')
Step    3
 input: 57 ('s')
 expected output: 58 ('t')
Step    4
 input: 58 ('t')
 expected output: 1 (' ')


## Create a training batch

In [38]:
BATCH_SIZE = 64

# shuffle buffer size: tf.data shuffles elements within a buffer of this many sequences
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
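
The number of training steps per epoch follows from the figures printed so far. The sketch below is plain arithmetic, assuming the text length of 1115394 characters reported earlier; `steps_per_epoch` and `dropped` are names introduced here for illustration.

```python
text_length = 1115394  # "Length of text" printed when the file was loaded
seq_length = 100
batch_size = 64

# Each example consumes seq_length + 1 characters (input plus one shifted target).
examples_per_epoch = text_length // (seq_length + 1)

# drop_remainder=True discards the final incomplete batch.
steps_per_epoch = examples_per_epoch // batch_size
dropped = examples_per_epoch % batch_size

print(examples_per_epoch, steps_per_epoch, dropped)
```

So one epoch covers 172 full batches, and 35 sequences are left out of each epoch because of `drop_remainder=True`.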

## Build the model

In [40]:
vocab[:10]

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3']

In [41]:
vocab_size = len(vocab)
embedding_dim = 256
rnn_units= 1024

In [47]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True, recurrent_initializer="glorot_uniform"),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [48]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

## Try the model

In [49]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [51]:
print(type(dataset.take(1)))

<class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>


In [52]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (64, None, 256)           16640     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________
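
The parameter counts in the summary can be reproduced by hand. This is a sketch under one assumption: that the GRU layer uses the TF 2.x Keras default `reset_after=True`, which gives each of the three gates two bias vectors.

```python
vocab_size, embedding_dim, rnn_units = 65, 256, 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with an input kernel, a recurrent kernel,
# and two bias vectors (assumes reset_after=True).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output unit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

The four numbers agree with the `Param #` column and the `Total params` line of the summary.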


In [53]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

In [54]:
sampled_indices

array([46, 61, 45, 11, 41, 46, 62, 41, 47, 41, 14, 53,  5, 61, 24, 27, 27,
       50, 48, 38, 54,  1, 60, 36, 52, 41, 51, 60,  3, 38,  7, 44,  8, 14,
       25, 53,  9, 56, 31, 21, 31, 10, 57,  2, 26, 18, 37, 12, 10, 43, 25,
       55, 37,  5, 50, 61, 42, 45, 58, 41, 51, 59, 56, 25, 21, 27, 15, 21,
       12,  4, 42, 47, 48, 58, 12,  2, 49, 56, 21, 24,  5, 20, 50, 36, 38,
       53, 56, 39, 21, 51, 64, 49, 31,  5, 64, 47, 10, 15, 16, 19])

In [55]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices])))

Input: 
 'ar this: mistake me not; no life,\nI prize it not a straw, but for mine honour,\nWhich I would free, i'

Next Char Predictions: 
 "hwg;chxcicBo'wLOOljZp vXncmv$Z-f.BMo3rSIS:s!NFY?:eMqY'lwdgtcmurMIOCI?&dijt?!krIL'HlXZoraImzkS'zi:CDG"


In [59]:
print(idx2char[input_example_batch[0]])
print(idx2char[sampled_indices])

['a' 'r' ' ' 't' 'h' 'i' 's' ':' ' ' 'm' 'i' 's' 't' 'a' 'k' 'e' ' ' 'm'
 'e' ' ' 'n' 'o' 't' ';' ' ' 'n' 'o' ' ' 'l' 'i' 'f' 'e' ',' '\n' 'I' ' '
 'p' 'r' 'i' 'z' 'e' ' ' 'i' 't' ' ' 'n' 'o' 't' ' ' 'a' ' ' 's' 't' 'r'
 'a' 'w' ',' ' ' 'b' 'u' 't' ' ' 'f' 'o' 'r' ' ' 'm' 'i' 'n' 'e' ' ' 'h'
 'o' 'n' 'o' 'u' 'r' ',' '\n' 'W' 'h' 'i' 'c' 'h' ' ' 'I' ' ' 'w' 'o' 'u'
 'l' 'd' ' ' 'f' 'r' 'e' 'e' ',' ' ' 'i']
['h' 'w' 'g' ';' 'c' 'h' 'x' 'c' 'i' 'c' 'B' 'o' "'" 'w' 'L' 'O' 'O' 'l'
 'j' 'Z' 'p' ' ' 'v' 'X' 'n' 'c' 'm' 'v' '$' 'Z' '-' 'f' '.' 'B' 'M' 'o'
 '3' 'r' 'S' 'I' 'S' ':' 's' '!' 'N' 'F' 'Y' '?' ':' 'e' 'M' 'q' 'Y' "'"
 'l' 'w' 'd' 'g' 't' 'c' 'm' 'u' 'r' 'M' 'I' 'O' 'C' 'I' '?' '&' 'd' 'i'
 'j' 't' '?' '!' 'k' 'r' 'I' 'L' "'" 'H' 'l' 'X' 'Z' 'o' 'r' 'a' 'I' 'm'
 'z' 'k' 'S' "'" 'z' 'i' ':' 'C' 'D' 'G']


## Train the model

In [56]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 65)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.1732717
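
The initial loss is close to what a uniform guess over the vocabulary would give. As a rough check, assuming the untrained model assigns about equal probability to each of the 65 characters, the expected cross-entropy is ln(65):

```python
import math

vocab_size = 65

# Cross-entropy of a uniform prediction: -log(1 / vocab_size) = log(vocab_size).
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 4))  # 4.1744
```

A value near 4.17 before training therefore suggests the model and loss are wired up correctly; training should push the loss well below this baseline.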


In [57]:
model.compile(optimizer="adam", loss=loss)

## Create checkpoints

In [58]:
checkpoint_dir = "./training_checkpoints"
# name of checkpoint file
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

### do training

In [61]:
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback], verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Text generation

In [62]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_10'

In [63]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

In [64]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (1, None, 256)            16640     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 65)             66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________


In [65]:
ls {checkpoint_dir}

checkpoint                   ckpt_5.data-00000-of-00001
ckpt_1.data-00000-of-00001   ckpt_5.index
ckpt_1.index                 ckpt_6.data-00000-of-00001
ckpt_10.data-00000-of-00001  ckpt_6.index
ckpt_10.index                ckpt_7.data-00000-of-00001
ckpt_2.data-00000-of-00001   ckpt_7.index
ckpt_2.index                 ckpt_8.data-00000-of-00001
ckpt_3.data-00000-of-00001   ckpt_8.index
ckpt_3.index                 ckpt_9.data-00000-of-00001
ckpt_4.data-00000-of-00001   ckpt_9.index
ckpt_4.index


In [66]:
latest = tf.train.latest_checkpoint(checkpoint_dir)
latest

'./training_checkpoints/ckpt_10'

In [67]:
model.save_weights("./checkpoints/my_checkpoint")

## Prediction loop
* Choose a start string, reset the RNN state, and set the number of characters to generate.
* Get the prediction distribution for the next character from the start string and the RNN state.
* Sample from that categorical distribution to obtain the index of the predicted character, and use the predicted character as the next input to the model.
* The RNN state returned by the model is carried over to the next step, so each prediction draws on more context than a single character.
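
The sampling step in this loop can be sketched without TensorFlow. The code below is an illustration only: `sample_with_temperature` is a hypothetical helper, the logits are made up, and Python's `random.choices` stands in for `tf.random.categorical`. It divides the logits by a temperature before sampling, so that low temperatures concentrate the choice on the largest logit and high temperatures flatten the distribution.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    top = max(scaled)
    weights = [math.exp(value - top) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)       # fixed seed so the run is repeatable
logits = [2.0, 1.0, 0.1]     # made-up scores for a 3-character vocabulary

cold = [sample_with_temperature(logits, 0.05, rng) for _ in range(200)]
hot = [sample_with_temperature(logits, 5.0, rng) for _ in range(200)]

print(set(cold))  # low temperature: almost always the largest logit
print(set(hot))   # high temperature: all indices show up
```

In the notebook's generation function the same idea appears as dividing the predictions by the temperature before sampling from them.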

In [74]:
def generate_text(model, start_string):
    # evaluation step (generate text with the trained model)
    # number of characters to generate
    num_generate = 1000

    # convert the starting characters into integers(vectorize)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []

    # a low temperature produces more predictable text
    # a high temperature produces more surprising text
    # experiment to find a setting that works well
    temperature = 1.0

    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # sample the id of the next character from a categorical distribution over the scaled logits
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # feed the predicted character back in as the next input; the stateful GRU keeps its hidden state between calls
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])
    
    return (start_string + "".join(text_generated))

In [75]:
print(generate_text(model, start_string=u"ROMEO: "))

ROMEO: these have us banished in
being an intelligent for your
Petruchio of their causes by the charity.

POLIXENES:
Trie's good, and am
I think, she's donat oup the darkless croud,
And there they aspect some other-heart.

CALIABENH:
Upriest winding on him--as this so divers a heavenly right.

ANGELO:
You make your gracious bowels--look'd for my i'nt,
Courts up it daughter'd to thee.

JONTES:
Bring these another, his departered
hath tripp'd not foundance, madam,
Thou hast not break our scorn his fatach, nay, my lord,
Your lifes of Follow.

Sereft.

Seeve have the bear the man.
Shase not the both of me, and more de with the morning,
And sure I shore nothing heavens that late not have.

CATEBBY:
Why, low abward master?

GREMIO:
Take not by notle last.

COMINIUS:
You have better from the first
And met all the sound hot proud tond men open when you bid Goor labe
their terrant with me with thy looks a
joy, but as
you have not chedities, that whose advise
by humbishness traniou-daring frown: