<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/tf/c3_w4_tf_text_generation_with_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text generation with an RNN

In [1]:
import os
import time
import numpy as np
import tensorflow as tf

from tensorflow.keras.utils import get_file 


In [2]:
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
path_to_file = get_file('shakespeare.txt', url)
path_to_file

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


'/root/.keras/datasets/shakespeare.txt'

In [3]:
with open(path_to_file, 'rb') as file:
    text = file.read().decode(encoding="utf-8")
    print(f'Length of text: {len(text):,}')
    vocab = sorted(set(text))

Length of text: 1,115,394


Take a look at the dataset text.

In [4]:
print(text[:70])
print("...")
print(text[-70:])

First Citizen:
Before we proceed any further, hear me speak.

All:
Spe
...
et'st thy fortune sleep--die, rather; wink'st
Whiles thou art waking.



In [5]:
print(f"{len(vocab)} unique characters")

65 unique characters


### Process the text

In [6]:
char2idx = {char: index for index, char in enumerate(vocab)}
idx2char = np.array(vocab)

In [7]:
text_as_int = np.array([char2idx[char] for char in text])

In [8]:
print("{")
for char,_ in zip(char2idx, range(5)):
    print("  {:4s}: {:3d},".format(repr(char), char2idx[char]))
print("  ...\n}")

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  ...
}


In [9]:
# Show how the first 13 characters from the text are mapped to integers
print(f"{text[:13]} <-- mapped to int --> {text_as_int[:13]}")

First Citizen <-- mapped to int --> [18 47 56 57 58  1 15 47 58 47 64 43 52]
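
The mapping above should be reversible: indexing `idx2char` with the integer array should give back the original characters. The following is a minimal sketch of that round trip. It uses a short hypothetical sample string and `sample_`-prefixed names (neither is in the notebook) so that it runs without downloading the corpus.

```python
import numpy as np

# Hypothetical sample text, used here instead of the Shakespeare corpus
sample = "First Citizen"

# Same construction as char2idx / idx2char above, applied to the sample
sample_vocab = sorted(set(sample))
sample_char2idx = {char: index for index, char in enumerate(sample_vocab)}
sample_idx2char = np.array(sample_vocab)

# Encode: characters -> integer ids
encoded = np.array([sample_char2idx[char] for char in sample])

# Decode: integer ids -> characters, via NumPy integer-array indexing
decoded = "".join(sample_idx2char[encoded])

print(encoded)
print(decoded)
```

Because the ids come from this small sample's own vocabulary, they differ from the ids printed for the full corpus; only the round-trip property carries over.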


In [10]:
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)
examples_per_epoch

11043
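
The count above is plain integer division of the text length by the chunk size. As a quick check, this sketch redoes the arithmetic with `divmod`, which also shows how many trailing characters do not fill a complete chunk. The extra `+ 1` is presumably there so that each chunk can later be split into an input and a target shifted by one character; that is an assumption here, since this section does not say so. The names `chunk_size`, `full_chunks` and `leftover` are illustrative only.

```python
text_length = 1_115_394      # length of the text reported above
seq_length = 100
chunk_size = seq_length + 1  # characters consumed per example

# Number of complete chunks, and characters left over at the end
full_chunks, leftover = divmod(text_length, chunk_size)

print(full_chunks)  # 11043
print(leftover)     # 51
```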

In [11]:
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()], end='')

First

The `batch` method lets us easily convert these individual characters to sequences of the desired size.

In [12]:
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'
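
To see what `drop_remainder=True` does to the tail of the data, here is a small NumPy stand-in for the chunking step. It is an illustration of the behaviour, not how `tf.data` implements `batch`; the helper name `chunk_drop_remainder` and the toy input are hypothetical.

```python
import numpy as np

def chunk_drop_remainder(values, size):
    """Split a 1-D array into rows of `size`, discarding any incomplete tail."""
    values = np.asarray(values)
    n_full = len(values) // size  # number of complete chunks
    return values[: n_full * size].reshape(n_full, size)

# Ten elements in chunks of four: two full chunks, the last two elements are dropped
toy = np.arange(10)
chunks = chunk_drop_remainder(toy, 4)
print(chunks)
```

With the real data, the chunk size is `seq_length + 1`, so every sequence has exactly 101 characters and the short tail of the corpus is discarded.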
