## Lesson 6

[lesson 6 wiki](http://wiki.fast.ai/index.php/Lesson_6)

In [1]:
%matplotlib inline
import utils
import imp
imp.reload(utils)
from utils import *

Using TensorFlow backend.


## Setup

We're going to download the collected works of Nietzsche to use a sout data for this class.

In [2]:
path = get_file('nietzsche.txt', origin='http://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read()
print('corpus length:', len(text))

corpus length: 600893


In [3]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1
print('total chars', vocab_size)

total chars 85


Sometimes it's useful to have zero value in the dataset, e.g. for padding

In [4]:
chars.insert(0, '\0')

In [5]:
''.join(chars[1:-6])

'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxy'

Map from chars to indices and back again

In [6]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

*idx* will be the data we use from now own - it simply converts all the characters to their index (based on the mapping above)

In [7]:
idx = [char_indices[c] for c in text]

In [8]:
idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [9]:
''.join(indices_char[i] for i in idx[:70])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'

## 3 char model

### create inputs

Create a list of every 4th character, starting at the 0th, 1st, 2nd, then 3rd characters

In [10]:
cs = 3
c1_dat = [idx[i] for i in range(0, len(idx) - 1 - cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx) - 1 - cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx) - 1 - cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx) - 1 - cs, cs)]

Our inputs

In [11]:
x1 = np.stack(c1_dat[:-2])
x2 = np.stack(c2_dat[:-2])
x3 = np.stack(c3_dat[:-2])

Our output

In [12]:
y = np.stack(c4_dat[:-2])

The first 4 inputs and outputs

In [13]:
x1[:4], x2[:4], x3[:4]

(array([40, 30, 29,  1]), array([42, 25,  1, 43]), array([29, 27,  1, 45]))

In [14]:
y[:4]

array([30, 29,  1, 40])

In [15]:
x1.shape, y.shape

((200295,), (200295,))

The number of latent factors to create (i.e. the size of the embedding matrix)

In [16]:
n_fac = 42

Create inpus and embedding outputs for each our 3 character inputs

In [17]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape = (1,), dtype = 'int64', name = name)
    emb = Embedding(n_in, n_out, input_length = 1)(inp)
    return inp, Flatten()(emb)

In [18]:
c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)

### Create and train model

Pick a size for out hidden state

In [19]:
n_hidden = 256

This is the 'green arrow' from our diagram - the layer operation from input to hidden

In [20]:
dense_in = Dense(n_hidden, activation = 'relu')

Our first hidden activation is simply this function applied to the result of the embedding of the first character.

In [21]:
c1_hidden = dense_in(c1)

This is the 'orange arrow' from our diagram - the layer operation from hidden to hidden

In [22]:
dense_hidden = Dense(n_hidden, activation = 'tanh')

Our second and third hidden activations sum up the previous hidden state (agter applying dense_hidden) to the new input state.

In [23]:
c2_dense = dense_in(c2)
hidden_2 = dense_hidden(c1_hidden)
c2_hidden = merge([c2_dense, hidden_2])

  This is separate from the ipykernel package so we can avoid doing imports until


In [24]:
c3_dense = dense_in(c3)
hidden_3 = dense_hidden(c2_hidden)
c3_hidden = merge([c3_dense, hidden_3])

  This is separate from the ipykernel package so we can avoid doing imports until


This is the 'blue arrow' from out diagram - the layer operation from hidden to output.

In [25]:
dense_out = Dense(vocab_size, activation = 'softmax')

The third hidden state is the inupt to our output layer.

In [26]:
c4_out = dense_out(c3_hidden)

In [27]:
model = Model([c1_in, c2_in, c3_in], c4_out)

In [28]:
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = Adam())

In [29]:
model.optimizer.lr = 1e-6

In [30]:
model.fit([x1, x2, x3], y, batch_size = 64, epochs = 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1361ad68>

In [31]:
model.optimizer.kr = 0.01

In [32]:
model.fit([x1, x2, x3], y, batch_size = 64, epochs = 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x5e5e048>

In [33]:
model.optimizer.lr = 1e-6

In [34]:
model.fit([x1, x2, x3], y, batch_size = 64, epochs = 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1361a9e8>

In [35]:
model.optimizer.lr = 0.01

In [36]:
model.fit([x1, x2, x3], y, batch_size = 64, epochs = 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x116b5e80>

### Test model

In [37]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict(arrs)
    i = np.argmax(p)
    return chars[i]

In [38]:
get_next('phi')

' '

In [39]:
get_next(' th')

' '

In [40]:
get_next(' an')

' '

## Out first RNN!

### Create inputs

This is the size of out unrolled RNN.

In [41]:
cs = 8

For each of 0 through 7, create a list of every 8th character with taht starting point. These will be the 8 inputs to out model.

In [42]:
c_in_dat = [[idx[i+n] for i in range(0, len(idx)-1-cs, cs)] for n in range(cs)]

Then create a list of the next character in each of these series. This will be the labels for our model.

In [43]:
c_out_dat = [idx[i+cs] for i in range(0, len(idx)-1-cs, cs)]

In [44]:
xs = [np.stack(c[:-2]) for c in c_in_dat]

In [45]:
len(xs), xs[0].shape

(8, (75109,))

In [46]:
y = np.stack(c_out_dat[:-2])

So each column below is one series of 8 characters from the text.

In [47]:
[xs[n][:cs] for n in range(cs)]

[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67])]

..and this is the next character after each sequence.

In [48]:
y[:cs]

array([ 1, 33,  2, 72, 67, 73,  2, 68])

In [49]:
n_fac = 42

### Create and train model

In [50]:
def embedding_input(name, n_in, n_out) :
    inp = Input(shape = (1,), dtype = 'int64', name = name + '_in')
    emb = Embedding(n_in, n_out, input_length = 1, name = name + '_emb')(inp)
    return inp, Flatten()(emb)

In [51]:
c_ins = [embedding_input('c' + str(n), vocab_size, n_fac) for n in range(cs)]

In [52]:
n_hidden = 256

In [53]:
dense_in = Dense(n_hidden, activation = 'relu')
dense_hidden = Dense(n_hidden, activation = 'relu', kernel_initializer = 'identity')
dense_out = Dense(vocab_size, activation = 'softmax')

The first character of each sequence goes through dense_in(), to create out first hidden activations.

In [54]:
hidden = dense_in(c_ins[0][1])

Then for each successive layer we combine the output of dense_in() on the next character with the output of dense_hidden() on the current hidden state, to create new hidden state.

In [55]:
for i in range(1, cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden])

  after removing the cwd from sys.path.


Putting the final hidden state through desnse_out() gives us our output.

In [56]:
c_out = dense_out(hidden)

So now we can create out model.

In [57]:
model = Model([c[0] for c in c_ins], c_out)
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = Adam())

In [58]:
model.fit(xs, y, batch_size = 64, epochs = 12)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x12750fd0>

### Test model

In [59]:
def get_next(inp):
    idxs = [np.array(char_indices[c])[np.newaxis] for c in inp]
    p = model.predict(idxs)
    return chars[np.argmax(p)]

In [60]:
get_next('for thos')

'e'

In [61]:
get_next('part of ')

't'

In [62]:
get_next('queens a')

'n'

## Our first RNN with keras!

In [63]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 86)

This is nearly exactly equivalent to the RNN we built ourselves in the previous section.

In [64]:
model = Sequential([
    Embedding(vocab_size, n_fac, input_length = cs),
    SimpleRNN(n_hidden, activation = 'relu', inner_init = 'identity'),
    Dense(vocab_size, activation = 'softmax')
])

  This is separate from the ipykernel package so we can avoid doing imports until


In [65]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 8, 42)             3612      
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 256)               76544     
_________________________________________________________________
dense_7 (Dense)              (None, 86)                22102     
Total params: 102,258
Trainable params: 102,258
Non-trainable params: 0
_________________________________________________________________


In [66]:
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = Adam())

In [67]:
model.fit(np.concatenate(xs, axis = 1), y, batch_size = 64, epochs = 8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x168f8b70>

In [68]:
def get_next_keras(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = np.array(idxs)[np.newaxis,:]
    p = model.predict(arrs)[0]
    return chars[np.argmax(p)]

In [69]:
get_next_keras('this is ')

'a'

In [70]:
get_next_keras('part of ')

't'

In [71]:
get_next_keras('queens a')

'n'

## Returning sequeces

### Create inputs

To use a sequence model, we can leave out input unchanged - but we have to change out output to a sequence (of course!)

Here, c_out_dat is identical to c_in_dat, but moved across 1 character,

In [72]:
c_out_dat = [[idx[i+n] for i in range(1, len(idx)-cs, cs)] for n in range(cs)]

In [73]:
ys = [np.stack(c[:-2]) for c in c_out_dat]

Reading down each column shows one set of inputs and outputs.

In [74]:
[xs[n][:cs] for n in range(cs)]

[array([[40],
        [ 1],
        [33],
        [ 2],
        [72],
        [67],
        [73],
        [ 2]]), array([[42],
        [ 1],
        [38],
        [44],
        [ 2],
        [ 9],
        [61],
        [73]]), array([[29],
        [43],
        [31],
        [71],
        [54],
        [ 9],
        [58],
        [61]]), array([[30],
        [45],
        [ 2],
        [74],
        [ 2],
        [76],
        [67],
        [58]]), array([[25],
        [40],
        [73],
        [73],
        [76],
        [61],
        [24],
        [71]]), array([[27],
        [40],
        [61],
        [61],
        [68],
        [54],
        [ 2],
        [58]]), array([[29],
        [39],
        [54],
        [ 2],
        [66],
        [73],
        [33],
        [ 2]]), array([[ 1],
        [43],
        [73],
        [62],
        [54],
        [ 2],
        [72],
        [67]])]

In [75]:
[ys[n][:cs] for n in range(cs)]

[array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67]),
 array([ 1, 33,  2, 72, 67, 73,  2, 68])]

### Create and train model

In [76]:
dense_in = Dense(n_hidden, activation = 'relu')
dense_hidden = Dense(n_hidden, activation = 'relu', init = 'identity')
dense_out = Dense(vocab_size, activation = 'softmax', name = 'output')

  


In [77]:
inp1 = Input(shape = (n_fac,), name = 'zero')
hidden = dense_in(inp1)

In [78]:
outs = []

for i in range(cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden], mode = 'sum')
    
    # every Layer new has an output
    outs.append(dense_out(hidden))

  


In [79]:
model = Model([inp1] + [c[0] for c in c_ins], outs)
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = Adam())

In [80]:
zeros = np.tile(np.zeros(n_fac), (len(xs[0]), 1))
zeros.shape

(75109, 42)

In [81]:
model.fit([zeros] + xs, ys, batch_size = 64, epochs = 12)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x18423400>

### Test model

In [82]:
def get_nexts(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict([np.zeros(n_fac)[np.newaxis,:]] + arrs)
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [83]:
get_nexts(' this is')

[' ', 't', 'h', 'i', 's', ' ', 'i', 's']


['t', 'h', 'e', 't', ' ', 'i', 'n', ' ']

In [84]:
get_nexts(' part of')

[' ', 'p', 'a', 'r', 't', ' ', 'o', 'f']


['t', 'o', 'r', 't', 'i', 'o', 'f', ' ']

### Sequence model with keras

In [85]:
n_hidden, n_fac, cs, vocab_size

(256, 42, 8, 86)

To convert out previous keras model into a sequence model, simply add the 'return_sequences = True' parameter, and add TimeDistributed() around out dense layer.

In [86]:
model = Sequential([
    Embedding(vocab_size, n_fac, input_length = cs),
    SimpleRNN(n_hidden, return_sequences = True, activation = 'relu', inner_init = 'identity'),
    TimeDistributed(Dense(vocab_size, activation = 'softmax'))
])

  This is separate from the ipykernel package so we can avoid doing imports until


In [87]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 8, 42)             3612      
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 8, 256)            76544     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 8, 86)             22102     
Total params: 102,258
Trainable params: 102,258
Non-trainable params: 0
_________________________________________________________________


In [88]:
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = Adam())

In [89]:
xs[0].shape

(75109, 1)

In [90]:
x_rnn = np.stack(np.squeeze(xs), axis = 1)
y_rnn = np.atleast_3d(np.stack(ys, axis = 1))

In [91]:
x_rnn.shape, y_rnn.shape

((75109, 8), (75109, 8, 1))

In [92]:
model.fit(x_rnn, y_rnn, batch_size = 64, epochs = 8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x1c5d7588>

In [93]:
def get_nexts_keras(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = np.array(idxs)[np.newaxis, :]
    p = model.predict(arrs)[0]
    print(list(p))
    return [chars[np.argmax(o)] for o in p]

In [94]:
get_nexts_keras(' this is')

[array([  1.3582e-09,   5.9093e-04,   6.5847e-03,   1.1993e-05,   8.8671e-03,   1.3096e-04,
         2.9389e-03,   1.2866e-06,   5.9853e-06,   2.6198e-05,   1.6628e-05,   5.1936e-11,
         2.0068e-04,   1.9883e-05,   9.5108e-06,   1.4398e-05,   1.2478e-05,   2.3554e-05,
         7.1654e-06,   2.8285e-05,   4.9636e-06,   3.8419e-06,   1.6101e-06,   2.8257e-05,
         1.5803e-05,   4.5987e-03,   2.7638e-03,   1.8396e-03,   1.0799e-03,   2.8969e-03,
         3.0138e-03,   3.6639e-03,   2.3399e-03,   6.5076e-03,   3.0201e-04,   3.5064e-04,
         1.7882e-03,   2.2801e-03,   1.7953e-03,   2.3678e-03,   3.0387e-03,   2.3288e-04,
         1.2397e-03,   3.0916e-03,   6.0466e-03,   6.8427e-04,   4.7758e-04,   5.4191e-03,
         7.1926e-07,   4.7049e-04,   3.5523e-05,   2.9255e-04,   3.7240e-08,   4.4278e-04,
         1.4772e-01,   3.9829e-02,   3.3048e-02,   3.0528e-02,   2.5180e-02,   3.2138e-02,
         9.9425e-03,   4.6592e-02,   8.0851e-02,   1.5418e-03,   4.7150e-03,   1.8890e-02

['a', 'h', 'e', 'n', ' ', 's', 'n', ' ']

### one-hot sequence model with keras

This is the keras version of th theano model taht we're about to create.

In [95]:
model = Sequential([
    SimpleRNN(n_hidden, return_sequences = True, input_shape = (cs, vocab_size), activation = 'relu', inner_init = 'identity'),
    TimeDistributed(Dense(vocab_size, activation = 'softmax'))
])
model.compile(loss = 'categorical_crossentropy', optimizer = Adam())

  


In [96]:
oh_ys = [to_categorical(o, vocab_size) for o in ys]
oh_y_rnn = np.stack(oh_ys, axis = 1)

oh_xs = [to_categorical(o, vocab_size) for o in xs]
oh_x_rnn = np.stack(oh_xs, axis = 1)

oh_x_rnn.shape, oh_y_rnn.shape

((75109, 8, 86), (75109, 8, 86))

In [97]:
model.fit(oh_x_rnn, oh_y_rnn, batch_size = 64, epochs = 8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x1cc5b470>

In [98]:
def get_nexts_oh(inp):
    idxs = np.array([char_indices[c] for c in inp])
    arr = to_categorical(idxs, vocab_size)
    
    p = model.predict(arr[np.newaxis, :])[0]
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [99]:
get_nexts_oh(' this is')

[' ', 't', 'h', 'i', 's', ' ', 'i', 's']


['t', 'h', 'e', 's', ' ', 'c', 's', ' ']

## Stateful model with keras

In [100]:
bs = 64

A stateful model is easy to create (just add "stateful=True") but harder to train. We had to add batchnorm and use LSTM to get resonable results. When Using stateful in keras, you have to also add 'batch_input_shape' to the first layer, and the catch size there.

In [104]:
model = Sequential([
    Embedding(vocab_size , n_fac, input_length = cs, batch_input_shape = (bs, 8)),
    BatchNormalization(),
    LSTM(n_hidden, return_sequences = True, stateful = True),
    TimeDistributed(Dense(vocab_size, activation = 'softmax'))
])

In [105]:
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = Adam())

Since we're using a fixed batch shape, we have to ensure our inputs and outputs are even multiple of the batch size.

In [106]:
mx = len(x_rnn) // bs * bs

In [107]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size = bs, epochs = 4, shuffle = False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x525c9940>

In [108]:
model.optimizer.lr = 1e-4

In [109]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size = bs, epochs = 4, shuffle = False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1c6ebcc0>

In [110]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size = bs, epochs = 4, shuffle = False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x51e1d7b8>

## Theano RNN

In [111]:
n_input = vocab_size
n_output = vocab_size

Using raw theano, we have to create out wieght matrices and bias vectors ourselves - here are the functions we'll use to do so (using florot initialiation). The return values are wrapped in shared(), whicih is how we thell theano that it ca manage this data (copying it to and from the GPU as necessary).

In [304]:
def init_wgts(rows, cols):
    scale = math.sqrt(2. / rows)
    return shared(normal(scale = scale, size = (rows, cols)).astype(np.float32))

def init_bias(rows):
    return shared(np.zeros(rows, dtype = np.float32))

We return the weights and biases together as a tuple. For the hidden weights, we'll use an identity intialization (as recommended by [Hinton]().)

In [305]:
def wgts_and_bias(n_in, n_out):
    return init_wgts(n_in, n_out), init_bias(n_out)

def id_and_bias(n):
    return shared(np.eye(n, dtype = np.float32)), init_bias(n)

Theano doesn;t actually do any computations until we explicitly compile and evaluate the function (at which point it'll be turned into CUDA code and sent off to the GPU). SO or job is to describe the computations that we;ll wnat theano to do - the first step is to tell theano what inputs we'll be providing to our computation:

In [129]:
t_inp = T.matrix('inp')
t_outp = T.matrix('outp')
t_h0 = T.vector('h0')
lr = T.scalar('lr')

all_args = [t_h0, t_inp, t_outp, lr]

Now we're ready to create out initial weight matrices.

In [131]:
W_h = id_and_bias(n_hidden)
W_x = wgts_and_bias(n_input, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
w_all = list(chain.from_iterable([W_h, W_x, W_y]))

Theano handles looping by using the [GPU scan]() operation. We have to tell theano what to do at each step through the scan - this is the function we'll use, which does a single forward pass for one character:

In [136]:
def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    # Calculate the hidden activations
    h = nnet.relu(T.dot(x, W_x) + b_x + T.dot(h, W_h) + b_h)
    # Calculate the output activations
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    # Return both (the 'Flatten()' is to work around a theano bug)
    return h, T.flatten(y, 1)

Now we can provide everything necessary for the scan operation, so we can setup that up - we have to pass inthe function to call at each step, the sequence to step through, the intital values of the outputs, and any other arguments to pass to the step function.

In [137]:
[v_h, v_y], _ = theano.scan(step, sequences = t_inp, outputs_info = [t_h0, None], non_sequences = w_all)

We can now calculate our loss function, and *all* of our gradients, with just a couple of lines of code!

In [140]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

We even have to show theano show to how to do SGD - so we setup this dictionary of updates to complete after every forward pass, which apply to standard SGD update rule to every weight.

In [142]:
def upd_dict(wgts, grads, lr):
    return OrderedDict({w: w-g*lr for (w, g) in zip(wgts, grads) })

upd = upd_dict(w_all, g_all, lr)

We're finally ready to compile the function!

In [144]:
fn = theano.function(all_args, error, updates = upd, allow_input_downcast = True)

TypeError: ('An update must have the same type as the original shared variable (shared_var=<TensorType(float32, matrix)>, shared_var.type=TensorType(float32, matrix), update_val=Elemwise{sub,no_inplace}.0, update_val.type=TensorType(float64, matrix)).', 'If the difference is related to the broadcast pattern, you can call the tensor.unbroadcast(var, axis_to_unbroadcast[, ...]) function to remove broadcastable dimensions.')

## Pure python RNN

### SEtup basic functions

Now we're going to try to repeat the above theano RNN, using just python (and numpy). Which means, we have to do everything ourselves, including defining the basic functions of a neural net! Below are all of the definitions, along with tests to check tat they give the same answers as theano. The functions ending in \_d are the defivatives of each function.

In [145]:
def sigmoid(x) : return 1 / (1 + np.exp(-x))
def sigmoid_d(x):
    output = sigmoid(x)
    return output * (1 - output)

In [146]:
def relu(x): return np.maximum(0., x)
def relu_d(x): return (x > 0.) * 1.

In [157]:
relu(np.array([3.,-3.])), relu_d(np.array([3., -3.]))

(array([ 3.,  0.]), array([ 1.,  0.]))

In [158]:
def dist(a, b): return pow(a-b,2)
def dist_d(a,b): return 2*(a-b)

In [159]:
import pdb

In [170]:
esp = 1e-7
def x_entropy(pred, actual):
    return -np.sum(actual * np.log(np.clip(pred, esp, 1-esp)))
def x_entropy_d(pred, actual): return -actual / pred

In [181]:
def softmax(x): return np.exp(x) / np.exp(x).sum()

In [190]:
def softmax_d(x):
    sm = softmax(x)
    res = np.expand_dims(-sm, -1) * sm
    res[np.diag_indices_from(res)] = sm*(1-sm)
    return res

In [191]:
test_preds = np.array([0.2, 0.7, 0.1])
test_actuals = np.array([0., 1., 0.])
nnet.categorical_crossentropy(test_preds, test_actuals).eval()

array(0.35667494393873245)

In [192]:
x_entropy(test_preds, test_actuals)

0.35667494393873245

In [193]:
test_inp = T.dvector()
test_out = nnet.categorical_crossentropy(test_inp, test_actuals)
test_grad = theano.function([test_inp], T.grad(test_out, test_inp))

In [194]:
test_grad(test_preds)

array([-0.    , -1.4286, -0.    ])

In [195]:
x_entropy_d(test_preds, test_actuals)

array([-0.    , -1.4286, -0.    ])

In [196]:
pre_pred = random(oh_x_rnn[0][0].shape)
preds = softmax(pre_pred)
actual = oh_x_rnn[0][0]

In [197]:
np.allclose(softmax_d(pre_pred).dot(x_entropy_d(preds, actual)), preds - actual)

True

In [198]:
softmax(test_preds)

array([ 0.2814,  0.464 ,  0.2546])

In [200]:
nnet.softmax(test_preds).eval()

array([[ 0.2814,  0.464 ,  0.2546]])

In [201]:
test_out = T.flatten(nnet.softmax(test_inp))

In [202]:
test_grad = theano.function([test_inp], theano.gradient.jacobian(test_out, test_inp))

In [203]:
test_grad(test_preds)

array([[ 0.2022, -0.1306, -0.0717],
       [-0.1306,  0.2487, -0.1181],
       [-0.0717, -0.1181,  0.1898]])

In [204]:
softmax_d(test_preds)

array([[ 0.2022, -0.1306, -0.0717],
       [-0.1306,  0.2487, -0.1181],
       [-0.0717, -0.1181,  0.1898]])

In [205]:
act = relu
act_d = relu_d

In [206]:
loss = x_entropy
loss_d = x_entropy_d

We also have to define out own scan function. Since we're not worrying about running things in parallel. Its's very simple to implement:

In [207]:
def scan(fn, start, seq):
    res = []
    prev = start
    for s in seq:
        app = fn(prev, s)
        res.append(app)
        prev = app
    return res

For instance, scan on + is the cumulative sum.

In [209]:
scan(lambda prev, curr: prev+curr, 0, range(5))

[0, 1, 3, 6, 10]

### Set up training

Let's now build the functions to do the forward and backward passes of out RNN, First, define our data and shape.

In [252]:
inp = oh_x_rnn
outp = oh_y_rnn
n_input = vocab_size
n_output = vocab_size

In [253]:
inp.shape, outp.shape

((75109, 8, 86), (75109, 8, 86))

Here's the functioni to do a single forward pass of an RNN, for a single character.

In [264]:
def one_char(prev, item):
    # previous state
    tot_loss, pre_hidden, pre_pred, hidden, ypred = prev
    # current tinputs and output
    x, y = item
    pre_hidden = np.dot(x, w_x) + np.dot(hidden, w_h)
    hidden = act(pre_hidden)
    pre_pred = np.dot(hidden, w_y)
    ypred = softmax(pre_pred)
    return (
    # keep track of loss so we can report it
    tot_loss + loss(ypred, y),
    # used in backprop
    pre_hidden, pre_pred,
    # used in next iteration
    hidden,
    # to provide predictions
    ypred)

We use scan to apply the above to a whole sequence of characters.

In [265]:
def get_chars(n): return zip(inp[n], outp[n])
def one_fwd(n): return scan(one_char, (0,0,0,np.zeros(n_hidden), 0), get_chars(n))

Now we can define the cbackward step. We use a loop to fo through every element of the sequence. The derivatives are applying the chain rule to each step, and accumulating the gradients across the sequence.

In [266]:
# columnify a vector
def col(x): return x[:, newaxis]

def one_bkwd(args, n):
    global w_x, w_y, w_h
    
    i = inp[n]  # 8x86
    o = outp[n]  # 8x86
    d_pre_hidden = np.zeros(n_hidden) # 256
    
    for p in reversed(range(len(i))):
        totloss, pre_hidden, pre_pred, hidden, ypred = args[p]
        x = i[p] # 86
        y = o[p] # 86
        d_pre_pred = softmax_d(pre_pred).dot(loss_d(ypred, y)) # 86
        d_pre_hidden = (np.dot(d_pre_hidden, w_h.T) + np.dot(d_pre_pred, w_y.T)) * act_d(pre_hidden) # 256
        
        # d(loss)/d(w_y) = d(loss)/d(pre_pred) * d(pre_pred)/d(w_y)
        w_y -= col(hidden) * d_pre_pred * alpha
        # d(loss)/d(w_h) = d(loss)/d(pre_hidden[p-1]) * d(pre_hidden[p-1])/d(w_h)
        if (p > 0): w_h -= args[p-1][3].dot(d_pre_hidden) * alpha
        w_x -= col(x) * d_pre_hidden * alpha
    return d_pre_hidden

Now we can set up our initial weight matrices. Note that we're not using bisa at all in this example, in order to keep things simpler.

In [267]:
scale = math.sqrt(2. / n_input)
w_x = normal(scale = scale, size = (n_input, n_hidden))
w_y = normal(scale = scale, size = (n_hidden, n_output))
w_h = np.eye(n_hidden, dtype = np.float32)

Out loop looks much like the theano loop in the previous section, except hat we have to call the backwards step ourselves.

In [268]:
overallError = 0
alpha = 1e-5
for n in range(10000):
    res = one_fwd(n)
    overallError += res[-1][0]
    deriv = one_bkwd(res, n)
    if (n % 1000 == 999):
        print("Error:{:.4f}; Gradient:{:.5f}".format(overallError/1000, np.linalg.norm(deriv)))
        overallError = 0

Error:35.8657; Gradient:2.68377
Error:35.6886; Gradient:2.13522
Error:35.6671; Gradient:2.22855
Error:35.6236; Gradient:2.20996
Error:35.5871; Gradient:1.96299
Error:35.5815; Gradient:1.98897
Error:35.5323; Gradient:1.75088
Error:35.5286; Gradient:1.93306
Error:35.5256; Gradient:1.85797
Error:35.4930; Gradient:2.06328


## Keras GRU

Identical to the last keras mn, but a GRU!

In [271]:
model = Sequential([
    GRU(n_hidden, return_sequences = True, input_shape = (cs, vocab_size), activation = 'relu'),
    TimeDistributed(Dense(vocab_size, activation = 'softmax'))
])
model.compile(loss = 'categorical_crossentropy', optimizer = Adam())

In [273]:
model.fit(oh_x_rnn, oh_y_rnn, batch_size = 64, epochs = 8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x5de99e80>

In [276]:
get_nexts_oh(' this is')

[' ', 't', 'h', 'i', 's', ' ', 'i', 's']


['t', 'h', 'e', 's', ' ', 'c', 'n', ' ']

## Theano GRU

### Separate weights

The theano GRU looks just like the simple theano RNN, except for the use of the reset and update gates. Each of these gates requires its own hidden and input weights, so we add those to our weight matrices.

In [306]:
W_h = id_and_bias(n_hidden)
W_x = init_wgts(n_input, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
rW_h = init_wgts(n_hidden, n_hidden)
rW_x = wgts_and_bias(n_input, n_hidden)
uW_h = init_wgts(n_hidden, n_hidden)
uW_x = wgts_and_bias(n_input, n_hidden)
w_all = list(chain.from_iterable([W_h, W_y, uW_x, rW_x]))
w_all.extend([W_x, uW_h, rW_h])

Here's the definition of a gate - it's just a sigmoid applied to the addition of the dot prodicts of th einput vectors.

In [307]:
def gate(x, h, W_h, W_x, b_x):
    return nnet.sigmoid(T.dot(x, W_x) + b_x + T.dot(h, W_h))

Our step is nearly identical to before, except that we multiply our hidden state by our reset gate, adn we update our hiden state based on the update gate.

In [308]:
def step(x, h, W_h, b_h, W_y, b_y, uW_x, ub_x, rW_x, rb_x, W_x, uW_h, rW_h):
    reset = gate(x, h, rW_h, rW_x, rb_x)
    update = gate(x, h, uW_h, uW_x, ub_x)
    h_new = gate(x, h * reset, W_h, W_x, b_h)
    h = update * h + (1.0 - update) * h_new
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    return h, T.flatten(y, 1)

Everything from here on is identical to our simple RNN in thano.

In [309]:
[v_h, v_y,], _ = theano.scan(step, sequences = t_inp, outputs_info = [t_h0, None], non_sequences = w_all)

In [310]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

In [311]:
upd = upd_dict(w_all, g_all, lr)
fn = theano.function(all_args, error, updates = upd, allow_input_downcast = True)

TypeError: ('An update must have the same type as the original shared variable (shared_var=<TensorType(float32, matrix)>, shared_var.type=TensorType(float32, matrix), update_val=Elemwise{sub,no_inplace}.0, update_val.type=TensorType(float64, matrix)).', 'If the difference is related to the broadcast pattern, you can call the tensor.unbroadcast(var, axis_to_unbroadcast[, ...]) function to remove broadcastable dimensions.')

### Combined weights

We can make the previous section simpler and gaster by concatenating the hidden and input matrices and inuts together. We're not going to step through this cell by cell - you'll see it's identical to the previous section except for this concatenation.

In [313]:
W = (shared(np.concatenate([np.eye(n_hidden), normal(size = (n_input, n_hidden))]).astype(np.float32)),  init_bias(n_hidden))

rW = wgts_and_bias(n_input + n_hidden, n_hidden)
uW = wgts_and_bias(n_input + n_hidden, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
w_all = list(chain.from_iterable([W, W_y, uW, rW]))

In [319]:
def gate(m, W, b): return nnet.sigmoid(T.dot(m,W) + b)

In [322]:
def step(x, h, W, b, W_y, b_y, uW, ub, rW, rb):
    m = T.concatenate([h, x])
    reset = gate(m, rW, rb)
    update = gate(m, uW, ub)
    m = T.concatenate([h * reset, x])
    h_new = gate(m, W, b)
    h = update * h + (1.0 - update) * h_new
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    return h, T.flatten(y, 1)

In [323]:
[v_h, v_y], _ = theano.scan(step, sequences = t_inp, outputs_info = [t_h0, None], non_sequences = w_all)

In [328]:
def upd_dict(wgts, grads, lr):
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts, grads)})

In [329]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

In [331]:
upd = upd_dict(w_all, g_all, lr)
fn = theano.function(all_args, error, updates = upd, allow_input_downcast = True)

TypeError: ('An update must have the same type as the original shared variable (shared_var=<TensorType(float32, matrix)>, shared_var.type=TensorType(float32, matrix), update_val=Elemwise{sub,no_inplace}.0, update_val.type=TensorType(float64, matrix)).', 'If the difference is related to the broadcast pattern, you can call the tensor.unbroadcast(var, axis_to_unbroadcast[, ...]) function to remove broadcastable dimensions.')

In [332]:
err = 0.
l_rate = 0.01
for i in range(len(X)):
    err += fn(zp.zeros(n_hidden), X[i], Y[i], l_rate)
    if i % 1000 == 999:
        print('Error:{.2f}'.fomat(err / 1000))
        err = 0.

NameError: name 'X' is not defined