# Generating Shakespearean Text Using a Character RNN

In a famous [2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) titled “The Unreasonable Effectiveness of Recurrent Neural Networks,” Andrej Karpathy showed how to train an RNN to predict the next character in a sentence.

This **Char-RNN** can then be used to generate novel text, one character at a time.

## Creating the Training Dataset

First, let’s download all of Shakespeare’s work, using Keras’s handy get_file() function to fetch the data from Andrej Karpathy’s [Char-RNN project](https://github.com/karpathy/char-rnn):

In [20]:
from tensorflow import keras

In [21]:
shakespeare_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
filepath = keras.utils.get_file('shakespeare.txt', shakespeare_url)
with open(filepath) as f:
  shakespeare_text = f.read()

Next, we must encode every character as an integer. One option is to create a custom preprocessing layer, as we did in Chapter 13.

But in this case, it will be simpler to use Keras’s Tokenizer class. First, we need to fit a **tokenizer** to the text: **it will find all the characters used in the text and map each of them to a different character ID, from 1 to the number of distinct characters** (it does not start at 0, so we can use that value for masking, as we will see later in this chapter):

In [22]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

We set **char_level=True** to get **character-level encoding rather than the default word-level encoding**. Note that this **tokenizer converts the text to lowercase by default** (but you can set lower=False if you do not want that).
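To make the mapping concrete, here is a minimal pure-Python sketch of roughly what a character-level tokenizer does when it is fitted. It is not the Keras implementation: the helper name `fit_char_vocab` and the sample sentence are hypothetical, and the sketch assumes IDs are handed out in order of decreasing character frequency, starting at 1 so that 0 stays free for masking.

```python
from collections import Counter

def fit_char_vocab(text, lower=True):
    """Map each distinct character to an ID in 1..N (hypothetical helper)."""
    if lower:
        text = text.lower()  # mirrors the tokenizer's default lowercasing
    counts = Counter(text)
    # most_common() sorts by decreasing count; IDs start at 1, never 0
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

sample = "To be, or not to be"  # illustrative text, not the real corpus
char_to_id = fit_char_vocab(sample)
encoded = [char_to_id[ch] for ch in sample.lower()]

# With lower=False, "T" and "t" would get separate IDs
cased_vocab = fit_char_vocab(sample, lower=False)
```

Lowercasing shrinks the vocabulary, which makes the prediction task a little easier, at the cost of losing capitalization in the generated text.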

Now the tokenizer can encode a sentence (or a list of sentences) to a list of character IDs and back, and it tells us how many distinct characters there are and the total number of characters in the text:

In [23]:
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

In [27]:
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']
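Note that the decoded string has a space between every character. Stripping all spaces afterwards would also delete the real spaces in the text, so it is safer to join the characters yourself. The sketch below uses a toy `index_word` dictionary as a stand-in for the fitted tokenizer's `index_word` attribute: the IDs for f, i, r, s, and t are copied from the output above, while the ID for the space character is an assumption made for illustration.

```python
# Toy stand-in for tokenizer.index_word (the space ID is assumed)
index_word = {20: "f", 6: "i", 9: "r", 8: "s", 3: "t", 1: " "}

def decode_spaced(ids, index_word):
    """Join characters with spaces, like the output shown above."""
    return " ".join(index_word[i] for i in ids)

def decode(ids, index_word):
    """Join characters directly to recover contiguous text."""
    return "".join(index_word[i] for i in ids)

ids = [20, 6, 9, 8, 3, 1, 20, 6, 9, 8, 3]
spaced = decode_spaced(ids, index_word)  # a real space becomes three spaces
plain = decode(ids, index_word)
```

With the real tokenizer, the same idea would be `"".join(tokenizer.index_word[i] for i in ids)`.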

In [25]:
max_id = len(tokenizer.word_index) # number of distinct characters
max_id

39

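In [29]:
# tokenizer.document_count would be 1 here, because fit_on_texts() received a
# one-item list, so we measure the text directly instead
dataset_size = len(shakespeare_text) # total number of characters
dataset_size

1115394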