# Generating Shakespeare

## Setup

We're going to download the collected plays of Shakespeare to use as our data.

Source: http://www.gutenberg.org/cache/epub/100/pg100.txt

The original source was preprocessed to remove the sonnets and the non-Shakespearean text added by Project Gutenberg.

In [1]:
import os

BASE_DIR = os.getcwd()
data = BASE_DIR + '/gutenberg_shakespeare_modified.txt' # preprocessed

In [2]:
with open(data, 'r') as f:
    text = f.read()
print('corpus length:', len(text))

('corpus length:', 5291227)


In [3]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1 # +1 for the padding char "\0" inserted below
print('total chars:', vocab_size)

('total chars:', 88)


Sometimes it's useful to have a zero value in the dataset, e.g. for padding, so we reserve index 0 for a null character.

In [4]:
chars.insert(0, "\0")

In [5]:
''.join(chars)

'\x00\n\r !"&\'(),-.0123456789:;<?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_`abcdefghijklmnopqrstuvwxyz|}\xbb\xbf\xef'
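
To illustrate why reserving index 0 helps, here is a minimal sketch that pads variable-length index sequences to a common length. The `pad_left` helper and the toy vocabulary are hypothetical and not part of this notebook; the only assumption is that index 0 is never assigned to a real character.

```python
def pad_left(seqs, pad_value=0):
    """Left-pad each list of indices with pad_value so all have equal length."""
    max_len = max(len(s) for s in seqs)
    return [[pad_value] * (max_len - len(s)) + list(s) for s in seqs]

# Toy vocabulary (hypothetical): index 0 is reserved for the padding char "\0".
toy_chars = ["\0"] + sorted(set("to be"))
toy_indices = {c: i for i, c in enumerate(toy_chars)}

# Encode three strings of different lengths, then pad them to the same length.
encoded = [[toy_indices[c] for c in word] for word in ["to", "be", "t"]]
padded = pad_left(encoded)
print(padded)
```

Because 0 never stands for a real character, a padded position cannot be confused with text.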

Map chars to indices and vice versa

In [6]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [7]:
print(char_indices)

{'\x00': 0, ' ': 3, '(': 8, ',': 10, '0': 13, '4': 17, '8': 21, '\xbb': 85, '<': 25, '\xbf': 86, 'D': 30, 'H': 34, 'L': 38, 'P': 42, 'T': 46, 'X': 50, '`': 56, 'd': 60, 'h': 64, 'l': 68, '\xef': 87, 'p': 72, 't': 76, 'x': 80, '|': 83, "'": 7, '3': 16, '7': 20, ';': 24, '?': 26, 'C': 29, 'G': 33, 'K': 37, 'O': 41, 'S': 45, 'W': 49, '[': 53, '_': 55, 'c': 59, 'g': 63, 'k': 67, 'o': 71, 's': 75, 'w': 79, '\n': 1, '"': 5, '&': 6, '.': 12, '2': 15, '6': 19, ':': 23, 'B': 28, 'F': 32, 'J': 36, 'N': 40, 'R': 44, 'V': 48, 'Z': 52, 'b': 58, 'f': 62, 'j': 66, 'n': 70, 'r': 74, 'v': 78, 'z': 82, '\r': 2, '!': 4, ')': 9, '-': 11, '1': 14, '5': 18, '9': 22, 'A': 27, 'E': 31, 'I': 35, 'M': 39, 'Q': 43, 'U': 47, 'Y': 51, ']': 54, 'a': 57, 'e': 61, 'i': 65, 'm': 69, 'q': 73, 'u': 77, 'y': 81, '}': 84}


*idx* converts the Shakespearean text to character indices (based on the *char_indices* mapping above)

In [8]:
idx = [char_indices[c] for c in text]

In [9]:
print(idx[:70])

[87, 85, 86, 45, 29, 31, 40, 31, 23, 2, 1, 44, 71, 77, 75, 65, 68, 68, 71, 70, 24, 3, 42, 57, 74, 65, 75, 24, 3, 32, 68, 71, 74, 61, 70, 59, 61, 24, 3, 39, 57, 74, 75, 61, 65, 68, 68, 61, 75, 2, 1, 2, 1, 2, 1, 27, 29, 46, 3, 35, 12, 3, 45, 29, 31, 40, 31, 3, 14, 12]


In [10]:
''.join(indices_char[i] for i in idx[:70])

'\xef\xbb\xbfSCENE:\r\nRousillon; Paris; Florence; Marseilles\r\n\r\n\r\nACT I. SCENE 1.'
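
The first three bytes, `\xef\xbb\xbf`, are the UTF-8 byte order mark (BOM) at the start of the file, which is why they appear as three extra "characters" in the vocabulary. The sketch below shows one way to drop the BOM when reading, using a small temporary file as a stand-in for the corpus. This is an optional variation rather than what this notebook does, and applying it would change the vocabulary size and the indices shown here.

```python
import codecs
import io
import os
import tempfile

# Write a small sample file that starts with the UTF-8 byte order mark.
sample = codecs.BOM_UTF8 + b"SCENE:\r\nRousillon"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample)

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
# newline="" keeps the original "\r\n" line endings untouched.
with io.open(path, "r", encoding="utf-8-sig", newline="") as f:
    clean_text = f.read()
os.remove(path)

print(repr(clean_text[:6]))
```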

## 3 char model

### Create inputs

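Create four lists, each taking every 3rd character (step `nc`), starting at the 0th, 1st, 2nd, and 3rd characters respectively. The first three lists are the inputs and the fourth is the character to predict.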

In [11]:
nc=3 # num chars
c1_dat = [idx[i] for i in xrange(0, len(idx)-1-nc, nc)]
c2_dat = [idx[i+1] for i in xrange(0, len(idx)-1-nc, nc)]
c3_dat = [idx[i+2] for i in xrange(0, len(idx)-1-nc, nc)]
c4_dat = [idx[i+3] for i in xrange(0, len(idx)-1-nc, nc)]

In [12]:
0, len(idx)-1-nc, nc

(0, 5291223, 3)

In [13]:
len(c1_dat), len(c4_dat)

(1763741, 1763741)
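
To make the slicing concrete, here is the same indexing applied to a ten-letter toy sequence. The `toy` list is a hypothetical stand-in for `idx`, using letters instead of integers so the windows are easy to read.

```python
toy = list("abcdefghij")  # stand-in for idx
nc = 3                    # num chars per input window

# Same start positions as in the cell above: step by nc, stop early enough
# that i+3 is always a valid index.
starts = range(0, len(toy) - 1 - nc, nc)
c1 = [toy[i] for i in starts]
c2 = [toy[i + 1] for i in starts]
c3 = [toy[i + 2] for i in starts]
c4 = [toy[i + 3] for i in starts]

for a, b, c, d in zip(c1, c2, c3, c4):
    print("%s%s%s -> %s" % (a, b, c, d))
```

Each window of three input characters starts where the previous one ended, so the target of one example is also the first input of the next.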

Our inputs

In [14]:
import numpy as np

x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)

Our output

In [15]:
y = np.stack(c4_dat)

In [16]:
x1.shape, y.shape

((1763741,), (1763741,))

In [17]:
n_fac = 42 # number of latent factors (size of embedding matrix)

Create inputs and embedding outputs for each of our 3 character inputs

In [18]:
from keras.layers import Input, Embedding
from keras.layers.core import Flatten

def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [19]:
c1_in, c1_emb = embedding_input('c1', vocab_size, n_fac)
c2_in, c2_emb = embedding_input('c2', vocab_size, n_fac)
c3_in, c3_emb = embedding_input('c3', vocab_size, n_fac)
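
Conceptually, an embedding layer is a lookup table with one row of `n_fac` numbers per character index. The following is a minimal pure-Python sketch of that idea, with random values standing in for learned weights. The names `emb_matrix` and `embed` are hypothetical, and this is not how Keras implements the layer internally.

```python
import random

vocab_size, n_fac = 88, 42  # values used in this notebook

random.seed(0)
# One row of n_fac numbers per character index (random stand-in for learned weights).
emb_matrix = [[random.uniform(-0.05, 0.05) for _ in range(n_fac)]
              for _ in range(vocab_size)]

def embed(char_index):
    """Return the embedding vector for a character index (a plain row lookup)."""
    return emb_matrix[char_index]

vec = embed(45)  # 45 is the index of 'S' in char_indices above
n_params = vocab_size * n_fac
print(len(vec), n_params)
```

The table holds `vocab_size * n_fac = 88 * 42 = 3696` numbers, which matches the parameter count that `model.summary()` reports for each embedding layer.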

### Create and train model

In [20]:
n_hidden = 256 # hyperparameter: size of hidden state

![3char](./3char.png)

`dense_in` is the 'green arrow' in the diagram - the layer operation from input to hidden

In [22]:
from keras.layers.core import Dense

dense_in = Dense(n_hidden, activation='relu')

Our first hidden activation is simply this function applied to the result of the embedding of the first character.

In [23]:
c1_hidden = dense_in(c1_emb)

`dense_hidden` is the 'orange arrow' from our diagram - the layer operation from hidden to hidden

_Note:_ unsure why the activation for this is `tanh`

In [24]:
dense_hidden = Dense(n_hidden, activation='tanh')

Our second and third hidden activations add the previous hidden state (after applying `dense_hidden`) to the new input state (after applying `dense_in`).

In [27]:
from keras.layers import merge

# merge([new input state, orange arrow from previous hidden state])
c2_hidden = merge([dense_in(c2_emb), dense_hidden(c1_hidden)])
c3_hidden = merge([dense_in(c3_emb), dense_hidden(c2_hidden)])
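
The recurrence built above can be sketched in plain Python. This is an illustrative simplification under stated assumptions: biases are omitted, the sizes are tiny, and the weights and embeddings are random stand-ins. The names `W_in`, `W_hidden`, `matvec`, and `add` are hypothetical; the functions `dense_in` and `dense_hidden` mirror the roles of the Keras layers of the same name.

```python
import math
import random

random.seed(0)
n_fac, n_hidden = 4, 5  # tiny sizes for illustration (the notebook uses 42 and 256)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W_in = rand_matrix(n_hidden, n_fac)         # input-to-hidden weights ('green arrow')
W_hidden = rand_matrix(n_hidden, n_hidden)  # hidden-to-hidden weights ('orange arrow')

def dense_in(e):
    return [max(0.0, z) for z in matvec(W_in, e)]       # relu activation

def dense_hidden(h):
    return [math.tanh(z) for z in matvec(W_hidden, h)]  # tanh activation

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

# Random stand-ins for the three character embeddings.
e1, e2, e3 = ([random.uniform(-1, 1) for _ in range(n_fac)] for _ in range(3))

h1 = dense_in(e1)
h2 = add(dense_in(e2), dense_hidden(h1))
h3 = add(dense_in(e3), dense_hidden(h2))
print(len(h3))
```

The same two weight matrices are reused at every step, which is what makes this a recurrent structure rather than three independent layers.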

`dense_out` is the 'blue arrow' from our diagram - the layer operation from hidden to output

In [28]:
dense_out = Dense(vocab_size, activation='softmax')

The third hidden state is the input to our output layer

In [29]:
c4_out = dense_out(c3_hidden)
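
The `softmax` activation turns the output layer's raw scores into a probability for each of the `vocab_size` characters. Here is a minimal sketch of that function on three made-up scores; the helper and the input values are illustrative only.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(sum(probs), 6))
```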

In [31]:
from keras.models import Model
from keras.optimizers import Adam

model = Model([c1_in, c2_in, c3_in], c4_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
model.optimizer.lr=0.000001
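
With `sparse_categorical_crossentropy`, the targets are integer character indices rather than one-hot vectors, and the loss for one example is the negative log of the probability assigned to the correct index. The sketch below uses a hypothetical helper, `sparse_xent`, to compute the loss of a model that guesses uniformly over all 88 characters, which gives a rough baseline to compare training loss against.

```python
import math

def sparse_xent(probs, target_index):
    """Cross-entropy for one example: -log of the probability of the true class."""
    return -math.log(probs[target_index])

vocab_size = 88
uniform = [1.0 / vocab_size] * vocab_size  # a model that knows nothing
baseline = sparse_xent(uniform, 45)        # the target index is arbitrary here
print(round(baseline, 3))  # 4.477, i.e. ln(88)
```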

In [32]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
c1 (InputLayer)                  (None, 1)             0                                            
____________________________________________________________________________________________________
c2 (InputLayer)                  (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 1, 42)         3696        c1[0][0]                         
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 1, 42)         3696        c2[0][0]                         
___________________________________________________________________________________________

In [33]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f618b7db250>

In [34]:
model.optimizer.lr=0.01

In [35]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f618c479e50>

In [36]:
model.optimizer.lr=0.000001

In [37]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f618b76e2d0>

In [38]:
model.optimizer.lr=0.01

In [39]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f61bc0290d0>

In [40]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f6189a44c90>

In [41]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f618b8bd190>

In [42]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f618b682910>

In [43]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f618b8bd2d0>