# Lesson 6 - RNN - redux 1
By me.

In [1]:
from theano.sandbox import cuda
cuda.use('gpu1')

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [2]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

Using Theano backend.


## Setup
We'll work on a Nietzsche text corpus.

In [3]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()
print('corpus length:', len(text))

corpus length: 600901


In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 86


In [5]:
# we're adding a zero for padding (sometimes it's useful to have a meaningless token)
chars.insert(0, "\0")

In [6]:
print(' '.join(chars))

  
   ! " ' ( ) , - . 0 1 2 3 4 5 6 7 8 9 : ; = ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ ] _ a b c d e f g h i j k l m n o p q r s t u v w x y z � � � � � �


In [7]:
# we want to work with numbers so we need to turn these chars (our vocabulary) into indices
char_indices = {c:i for i, c in enumerate(chars)}
indices_char = {i:c for i,c in enumerate(chars)}

In [8]:
# and now we change the entire corpus into numbers
idx = [char_indices[c] for c in text]

In [9]:
idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [10]:
''.join([indices_char[i] for i in idx[:70]])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'
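
To convince myself the two dictionaries really are inverses of each other, here is a minimal sketch of the same recipe on a toy string. The names `toy_text`, `toy_chars` and friends are made up for this example and are not part of the notebook.

```python
# Toy corpus standing in for the Nietzsche text (hypothetical example data).
toy_text = "hello world"

# Same recipe as above: sorted unique chars, with "\0" reserved at index 0 for padding.
toy_chars = sorted(set(toy_text))
toy_chars.insert(0, "\0")

toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text to integers, then decode it back to check the mapping is lossless.
toy_idx = [toy_char_indices[c] for c in toy_text]
decoded = ''.join(toy_indices_char[i] for i in toy_idx)

print(toy_idx)   # [4, 3, 5, 5, 6, 1, 8, 6, 7, 5, 2]
print(decoded)   # hello world
```

Index 0 is never produced by encoding because "\0" does not occur in the text; it is only there as a padding token.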

## Models
### 3 Char Model
Start with the simplest.

#### Create input
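For this model we build four lists by stepping through the corpus three characters at a time (step cs=3), starting at the 0th, 1st, 2nd and 3rd char. The first three lists are the inputs and the fourth is the target.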

In [11]:
# what is cs? It's the number of preceding chars from which we'll be trying to predict the next one (3 inputs, the 4th is the target)
cs = 3

# ok, so we're going to step through idx with a step of cs=3 and grab the 1st char of each 4-char window,
# then the 2nd char, and so on.
c1_dat = [idx[i] for i in xrange(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in xrange(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in xrange(0, len(idx)-1-cs, cs)]
c4_dat = [idx[i+3] for i in xrange(0, len(idx)-1-cs, cs)]

# so c1_dat holds the 0th, 3rd, 6th char of idx (that's how a step of cs = 3 works)
# c2_dat holds the 1st, 4th, 7th etc.
# and c4_dat is our y, what we're trying to predict (with a step of 3, each target is also the first input of the next window)

In [12]:
c1_dat[:10]

[40, 30, 29, 1, 40, 43, 31, 61, 2, 74]
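
A quick sanity check of the stepping logic, as a sketch on a toy list whose values equal their own positions (so the output shows directly which positions get picked). `toy_idx` and `toy_cs` are names invented for this example; I use `range` instead of `xrange` so it also runs on Python 3.

```python
toy_idx = list(range(13))   # stand-in for idx: the value at position i is i itself
toy_cs = 3

# Same bounds as in the cell above: start at 0, step by toy_cs.
starts = list(range(0, len(toy_idx) - 1 - toy_cs, toy_cs))

c1 = [toy_idx[i] for i in starts]
c2 = [toy_idx[i + 1] for i in starts]
c3 = [toy_idx[i + 2] for i in starts]
c4 = [toy_idx[i + 3] for i in starts]   # the target

print(c1)  # [0, 3, 6]
print(c2)  # [1, 4, 7]
print(c3)  # [2, 5, 8]
print(c4)  # [3, 6, 9]
```

Because the step is 3 but each window spans 4 chars, the target of one window is the first input of the next one.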

In [13]:
# turn them into inputs (np.ndarrays, using np.stack); the [:-2] just drops the last two samples
# (all four lists already have the same length, so it isn't needed to keep x and y aligned)
x1 = np.stack(c1_dat[:-2])
x2 = np.stack(c2_dat[:-2])
x3 = np.stack(c3_dat[:-2])

In [14]:
x1

array([40, 30, 29, ..., 62, 72, 59])

In [15]:
# and outputs (y)
y = np.stack(c4_dat[:-2])

In [16]:
x1.shape, y.shape

((200297,), (200297,))

Let's define the number of latent factors:

In [17]:
n_fac = 42

Define a function that creates an input and its flattened embedding output, then call it for each of our 3 inputs.

In [18]:
from keras.layers import Input, Embedding, Flatten

def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

In [19]:
# n_in is our vocab size, n_out is the number of latent factors we've defined
c1_in, c1 = embedding_input("c1", vocab_size, n_fac)
c2_in, c2 = embedding_input("c2", vocab_size, n_fac)
c3_in, c3 = embedding_input("c3", vocab_size, n_fac)
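
The way I picture an embedding layer: a trainable lookup table with one row of `n_fac` numbers per character index. The sketch below only mimics the lookup in plain Python, not the training, and the names (`make_table`, `tables`, the toy sizes) are assumptions for illustration rather than anything Keras exposes.

```python
import random

random.seed(0)
toy_vocab_size = 5   # hypothetical tiny vocabulary
toy_n_fac = 3        # hypothetical number of latent factors

def make_table():
    # One row per character index, toy_n_fac small random numbers per row.
    return [[random.uniform(-0.05, 0.05) for _ in range(toy_n_fac)]
            for _ in range(toy_vocab_size)]

# embedding_input is called once per input, so each input gets its own table.
tables = {name: make_table() for name in ("c1", "c2", "c3")}

char_index = 2
vec_c1 = tables["c1"][char_index]   # the "embedding" is just row char_index
vec_c2 = tables["c2"][char_index]

print(len(vec_c1))        # 3, one value per latent factor
print(vec_c1 != vec_c2)   # True: same char, different table, different vector
```

So the same character can end up with a different vector depending on which input position it arrives in.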

#### Create and train model
We've got the first 2 layers already done.

In [20]:
# pick the number of activations in our hidden fully connected layer:
n_hidden = 256

The green arrow from our diagram (from every input to hidden layer):

In [21]:
from keras.layers import Dense

dense_in = Dense(n_hidden, activation="relu")

For our first input (the first character of each 4-char sequence) we just use this green arrow to turn it into our first hidden activations.

In [22]:
c1_hidden = dense_in(c1)  # this is the functional notation, passing something to the layer

These are the orange arrows: passing info from one hidden state to the next.

In [23]:
dense_hidden = Dense(n_hidden, activation="tanh")  # the lesson doesn't say why tanh; it does keep this layer's output bounded in [-1, 1]

Remember from the diagram that the 2nd and 3rd characters come in after the previous ones have already been turned
via the green arrow into hidden activations.

In [24]:
c2_dense = dense_in(c2)  # (green) this is just the green arrow for the c2 input
hidden_2 = dense_hidden(c1_hidden)  # (orange) the hidden state from c1 passed through the hidden-to-hidden layer
c2_hidden = merge([c2_dense, hidden_2])  # the full c2_hidden layer: a SUM of c2_dense and the transformed hidden from c1

In [25]:
c2_hidden.shape

Shape.0

In [26]:
# repeat for the c3
c3_dense = dense_in(c3) # green arrow for c3
hidden_3 = dense_hidden(c2_hidden) # orange arrow between 2 hidden dense layers
c3_hidden = merge([c3_dense, hidden_3]) # merge (default=sum) of the c3 input and the previous hidden state

Now for the blue arrow, going from last hidden to output.

In [27]:
dense_out = Dense(vocab_size, activation="softmax")
# we want it to output a char, hence vocab_size

In [28]:
# the last hidden state is the input to this last layer
c4_out = dense_out(c3_hidden)

The model is defined by 3 inputs in a list and the c4_out holds all the operations (we've chained them functionally).

In [29]:
model = Model([c1_in, c2_in, c3_in], c4_out)

In [30]:
model.compile(loss="sparse_categorical_crossentropy", optimizer=Adam())
# we use sparse categorical crossentropy because we didn't one-hot encode our output:
# it takes integer targets directly and gives the same loss as categorical crossentropy on one-hot targets.
# Really useful: we don't need to build target arrays with one column per class (vocab_size columns here),
# so we can skip one-hot encoding the targets in Keras.
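
To see why the integer targets are enough, here is a small sketch of the math in plain Python. It illustrates the two loss formulas for a single sample and is not how Keras implements them; the function names just mirror the Keras loss names.

```python
import math

def categorical_crossentropy(probs, one_hot):
    # Full version: sum over all classes of -target * log(prob).
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot))

def sparse_categorical_crossentropy(probs, target_index):
    # Sparse version: just pick out the log-prob of the correct class.
    return -math.log(probs[target_index])

probs = [0.1, 0.7, 0.2]    # a made-up softmax output over 3 classes
target_index = 1           # integer target, as in our y
one_hot = [0.0, 1.0, 0.0]  # the same target, one-hot encoded

dense_loss = categorical_crossentropy(probs, one_hot)
sparse_loss = sparse_categorical_crossentropy(probs, target_index)

print(abs(dense_loss - sparse_loss) < 1e-12)  # True
```

All the zero entries of the one-hot vector contribute nothing to the sum, so only the entry at the target index matters.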

In [31]:
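# note: Keras 1 compiles the training function on the first fit() call, so a plain assignment like this
# is only picked up if it happens before that call; K.set_value(model.optimizer.lr, ...) on the
# original shared variable is the safer way to change the learning rate later on
model.optimizer.lr = 0.000001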

In [35]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9eea008f10>

In [41]:
model.optimizer.lr = 0.01

In [42]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9eea008e50>

In [74]:
model.optimizer.lr = 0.000001

In [75]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9ee99a6910>

In [76]:
model.optimizer.lr = 0.01

In [77]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9ee99a6f50>

### Test the model
We want a way for our model to output the next char given 3 previous ones.

In [55]:
# this is my way of testing the np.newaxis
l = [1, 2, 3]
res = [np.array(i)[np.newaxis] for i in l]
type(res[0])

numpy.ndarray

In [60]:
def get_next(inp):
    # first turn input into numbers
    idxs = [char_indices[c] for c in inp]
    
    # every element of idxs becomes a 1-element np array: np.array(i) alone is a 0-d array, array(i),
    # and with [np.newaxis] it becomes array([i]) of shape (1,), i.e. a batch of one sample.
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    
    p = model.predict(arrs)  # the model has one input per char, so predict takes a list with one array per input
    i = np.argmax(p)  # find the softmaxed, most likely index of the char
    
    # turn index into char
    return chars[i]
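
A small sketch of the shapes involved in `get_next`, using made-up character indices and a made-up prediction (`toy_idxs` and `toy_p` are hypothetical; nothing here touches the model).

```python
import numpy as np

toy_idxs = [5, 9, 2]   # hypothetical indices for a 3-char input

scalars = [np.array(i) for i in toy_idxs]              # 0-d arrays
batched = [np.array(i)[np.newaxis] for i in toy_idxs]  # 1-element arrays

print(scalars[0].shape)  # ()
print(batched[0].shape)  # (1,)

# A prediction for a batch of one sample has shape (1, number_of_classes);
# np.argmax without an axis works on the flattened array, so it returns the class index.
toy_p = np.array([[0.1, 0.7, 0.2]])
print(np.argmax(toy_p))  # 1
```

The leading dimension of size 1 is the batch dimension: one sample per input array.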

In [78]:
get_next("phi")

't'

In [79]:
get_next("thi")

' '

In [80]:
# yep, just like the original only predicts a space.

In [81]:
get_next("is ")

' '

### Our first RNN
Let's make one!

In [82]:
# cs is the number of input chars, i.e. the size of our unrolled RNN (as the original notebook puts it)
cs = 8

# it's not a count of layers, just how many chars we feed in before predicting the next one
# (it used to be 3, now it's going to be 8)

In [83]:
# create the cs (so eight) inputs - we need a list of every eighth character starting at 0, then at 1, 2 and so on.
c_in_dat = [[idx[i + n] for i in xrange(0, len(idx) - 1 - cs, cs)] for n in range(cs)]

In [84]:
# then we need outputs - the char that follows each 8-char sequence (the one we're trying to predict)
c_out_dat = [idx[i+cs] for i in xrange(0, len(idx)-1-cs, cs)]

In [85]:
# now turn that into numpy.ndarray (the [:-2] just drops the last two samples, it isn't needed for alignment)
xs = [np.stack(c[:-2]) for c in c_in_dat]

In [87]:
# so we've got eight 1d arrays within xs, which have 75110 elems.
len(xs), xs[0].shape

(8, (75110,))

In [88]:
y = np.stack(c_out_dat[:-2])
len(y), y.shape

(75110, (75110,))

In [89]:
# when we show them like this, each COLUMN becomes a series of 8 consecutive chars
[xs[n][:cs] for n in range(cs)]

[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67])]

In [90]:
chars[40], chars[42], chars[29], chars[30], chars[25], chars[27], chars[29], chars[1]

('P', 'R', 'E', 'F', 'A', 'C', 'E', '\n')

In [93]:
# and y holds the next char (the one following the eight inputs) for each of those sequences
y[:cs]

array([ 1, 33,  2, 72, 67, 73,  2, 68])

In [94]:
# we only care about the first column now (that's the one that spells PREFACE) and we can see in the text
# that the character that follows is another newline
[chars[c] for c in y[:cs]]

['\n', 'I', ' ', 's', 'n', 't', ' ', 'o']

In [97]:
text[:9]

'PREFACE\n\n'
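
The "read it column by column" idea is easier to see on a toy string. This is a sketch with a shorter window (`toy_cs = 4`) on the alphabet; `toy_text`, `toy_xs` and `toy_y` are names invented here, and `zip(*...)` is just my way of turning the columns back into sequences.

```python
toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_cs = 4

starts = range(0, len(toy_text) - 1 - toy_cs, toy_cs)

# One list per input position, like c_in_dat: toy_xs[n] holds the n-th char of every window.
toy_xs = [[toy_text[i + n] for i in starts] for n in range(toy_cs)]
# The target is the char right after each window, like c_out_dat.
toy_y = [toy_text[i + toy_cs] for i in starts]

# Reading "columns" across the toy_cs lists gives back the consecutive sequences.
sequences = [''.join(column) for column in zip(*toy_xs)]

print(sequences)  # ['abcd', 'efgh', 'ijkl', 'mnop', 'qrst', 'uvwx']
print(toy_y)      # ['e', 'i', 'm', 'q', 'u', 'y']
```

Here the step equals the window length, so the input windows do not overlap; each target is simply the first char of the next window.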

In [98]:
# number of latent factors (same value as before; the cells below use the name n_fac)
n_fac = 42

### Create and train model
This time as an RNN

In [101]:
# almost the same, except we use a more clever naming convention

def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1, ), dtype='int64', name=name + "_in")
    emb = Embedding(n_in, n_out, input_length=1, name=name + "_emb")(inp)
    return inp, Flatten()(emb)

In [102]:
# note to self: every call to embedding_input creates a new Embedding layer, so each of the cs positions
# gets its own embedding weights (they are not shared between positions)
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]

In [105]:
type(c_ins[0][1])

theano.tensor.var.TensorVariable

In [106]:
n_hidden = 256

In [108]:
# here we define the dense layers; notice that the hidden-to-hidden layer is not initialized with
# small random values (the Glorot way) but with an identity matrix, to avoid exploding gradients
# (more aptly called exploding activations) when the same layer is applied over and over
dense_in = Dense(n_hidden, activation="relu")
# pressing Shift+Tab inside Dense() shows that by default it uses init="glorot_uniform"
dense_hidden = Dense(n_hidden, activation="tanh", init="identity")
dense_out = Dense(vocab_size, activation="softmax")

The first character of each 8-char sequence goes through the dense_in to create our first layer of hidden activations.
Each embedding layer has its own weights, so the same character can be embedded differently depending on its position: a space as the first char of a sequence need not look like a space in the middle.

In [109]:
hidden = dense_in(c_ins[0][1])

Now for each layer we combine the output of dense_in on the next character in the sequence with the dense_hidden on the current state (via merge) to create the new hidden state.

In [110]:
for i in range(1, cs):
    # the final [1] picks the flattened embedding output from the (input, flattened embedding) tuple
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden])
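
What the loop computes, written out as a NumPy sketch for a single sequence. This is my own illustration under several assumptions: tiny made-up sizes, random weights, biases left out, and hypothetical names (`emb_tables`, `W_in`, `W_hidden`, `forward`). It is not the Keras internals.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab, toy_fac, toy_hidden, toy_cs = 10, 4, 6, 8

# One embedding table per input position, but a single shared input matrix
# and a single shared hidden-to-hidden matrix (like dense_in / dense_hidden).
emb_tables = [rng.normal(scale=0.05, size=(toy_vocab, toy_fac)) for _ in range(toy_cs)]
W_in = rng.normal(scale=0.1, size=(toy_fac, toy_hidden))
W_hidden = np.eye(toy_hidden)   # identity initialization

def relu(x):
    return np.maximum(x, 0)

def forward(char_idxs):
    # First char: embedding lookup, then the "green arrow" (dense_in with relu).
    hidden = relu(emb_tables[0][char_idxs[0]].dot(W_in))
    for i in range(1, toy_cs):
        c_dense = relu(emb_tables[i][char_idxs[i]].dot(W_in))  # green arrow
        hidden = np.tanh(hidden.dot(W_hidden))                 # orange arrow
        hidden = c_dense + hidden                              # merge, default mode is sum
    return hidden

h = forward([1, 2, 3, 4, 5, 6, 7, 8])
print(h.shape)  # (6,)
```

The same `W_in` and `W_hidden` are reused at every step, which is what makes this a recurrent network rather than eight independent layers.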

In [112]:
# and for the output
c_out = dense_out(hidden)

In [114]:
# I think it's essentially the same thing we've done before, just with more chars.
model = Model([c[0] for c in c_ins], c_out)

# since we didn't one-hot encode our targets we can use sparse_categorical_crossentropy and skip that step
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [121]:
model.fit(xs, y,  batch_size=64, nb_epoch=12)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x7f9edede6950>

#### Test the model

In [122]:
# we can use the same get_next function to test (it reads the global model, which now points to the RNN)

In [123]:
get_next('for thos')

'e'

In [124]:
get_next('part of ')

't'

In [125]:
get_next('queens a')

'n'

## RNN with Keras
This time it's personal.