## RNN Language Model

Below is a diagram of the RNN computation that we will implement below. We're plugging characters into the RNN with a 1-hot encoding and expecting it to predict the next character. In this example the training data is the string "hello", so there are 4 letters in the vocabulary: [h,e,l,o].

<img src="rnnlm.jpeg">

In [1]:
import numpy as np
np.random.seed(1337)

In [28]:
# data I/O
# get shakespeare from http://cs.stanford.edu/people/karpathy/shakespeare.txt
data = open('shakespeare.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d characters, %d unique.' % (data_size, vocab_size)

data has 4573338 characters, 67 unique.


In [29]:
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

In [30]:
char_to_ix['a']

41

In [31]:
# lets sample a batch of data
seq_length = 25 # number of characters in the batch
p = 220000 # point in the book to sample from
print data[p:p+seq_length] # print a chunk of data

 thing when he was young,


In [32]:
inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
print inputs
print targets

[2, 61, 49, 48, 55, 46, 2, 62, 49, 44, 55, 2, 49, 44, 2, 62, 41, 58, 2, 64, 54, 60, 55, 46, 7]
[61, 49, 48, 55, 46, 2, 62, 49, 44, 55, 2, 49, 44, 2, 62, 41, 58, 2, 64, 54, 60, 55, 46, 7, 0]


In [33]:
# lets plug the first character into the RNN
ix_input = inputs[0]
ix_target = targets[0]
# encode the input character with a 1-hot representation
x = np.zeros((vocab_size,1))
x[ix_input] = 1
print x.ravel()

[ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


In [34]:
# create random starting parameters
hidden_size = 10
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

In [35]:
# compute the hidden state
h_prev = np.zeros((hidden_size, 1))
h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h_prev + bh))
print h.ravel()

[-0.02566928 -0.00711926 -0.00851462  0.01228545 -0.00241891  0.00636176
 -0.00171284 -0.01129739 -0.0069362   0.00932362]


In [36]:
# compute the scores for next character
y = np.dot(Why, h) + by
print y.ravel()

[  6.29612155e-04   4.09079294e-04  -3.29899872e-04   7.98200509e-04
  -2.62905161e-05   4.94626771e-04   1.97138889e-04  -1.01600591e-04
   7.72757316e-04   2.84376903e-04  -7.29973921e-04   1.56005304e-05
  -1.11927240e-04   1.35442172e-04  -3.89815428e-05  -4.86357178e-05
   1.32336208e-04   3.15738595e-04  -3.87247490e-04   7.28991890e-04
  -5.30632950e-05   4.20179198e-04   4.42242144e-04   2.83823246e-04
  -3.58363287e-05   6.98975802e-05   4.84398003e-04  -2.81941909e-04
   5.07592676e-04  -2.68109997e-04  -6.98104505e-05   3.21717382e-04
   5.08520176e-05   6.37695233e-04  -1.02395859e-04   1.63546016e-04
  -5.80853510e-04   1.19142485e-04  -3.79932371e-04  -3.94374025e-04
   7.46960859e-04   4.68737825e-04  -3.62337202e-04  -7.06302136e-06
   4.24622028e-04   9.21261371e-04   1.02755871e-04   2.95008636e-04
   1.41569258e-04  -7.44459004e-04   3.24625094e-04  -3.00690740e-05
   4.47332341e-04   2.14832415e-05  -4.31132112e-04   4.42862442e-04
   1.87427875e-06   7.64699298e-05

In [37]:
# the scores are unnormalized log probabilities. compute the probabilities
p = np.exp(y) / np.sum(np.exp(y))
print p.ravel()
print 'probabilities sum to ', p.sum()

[ 0.01493319  0.0149299   0.01491887  0.01493571  0.0149234   0.01493117
  0.01492673  0.01492227  0.01493533  0.01492803  0.0149129   0.01492402
  0.01492212  0.01492581  0.01492321  0.01492306  0.01492576  0.0149285
  0.01491801  0.01493467  0.014923    0.01493006  0.01493039  0.01492803
  0.01492325  0.01492483  0.01493102  0.01491958  0.01493137  0.01491979
  0.01492275  0.01492859  0.01492455  0.01493331  0.01492226  0.01492623
  0.01491512  0.01492557  0.01491812  0.0149179   0.01493494  0.01493079
  0.01491838  0.01492368  0.01493013  0.01493754  0.01492532  0.01492819
  0.0149259   0.01491268  0.01492863  0.01492334  0.01493047  0.01492411
  0.01491736  0.0149304   0.01492382  0.01492493  0.01492532  0.0149277
  0.01491917  0.01491625  0.01492354  0.01492264  0.01492744  0.01492071
  0.01492822]
probabilities sum to  1.0


In [38]:
print 'probability assigned to the correct next character is right now: ', p[ix_target,0]

probability assigned to the correct next character is right now:  0.0149162482677


In [39]:
loss = -np.log(p[ix_target,0])
print 'the cross-entropy (softmax) loss is ', loss

the cross-entropy (softmax) loss is  4.20530417242


In [40]:
# compute the gradient on y
dy = np.copy(p)
dy[ix_target] -= 1
print dy.ravel()
print 'sum of dy is ', dy.sum()
print 'the gradient for the correct character (%s) is: %s' % (ix_to_char[ix_target], dy[ix_target,0])
print 'the gradient for the character (a) is: ', dy[char_to_ix['a'],0]

[ 0.01493319  0.0149299   0.01491887  0.01493571  0.0149234   0.01493117
  0.01492673  0.01492227  0.01493533  0.01492803  0.0149129   0.01492402
  0.01492212  0.01492581  0.01492321  0.01492306  0.01492576  0.0149285
  0.01491801  0.01493467  0.014923    0.01493006  0.01493039  0.01492803
  0.01492325  0.01492483  0.01493102  0.01491958  0.01493137  0.01491979
  0.01492275  0.01492859  0.01492455  0.01493331  0.01492226  0.01492623
  0.01491512  0.01492557  0.01491812  0.0149179   0.01493494  0.01493079
  0.01491838  0.01492368  0.01493013  0.01493754  0.01492532  0.01492819
  0.0149259   0.01491268  0.01492863  0.01492334  0.01493047  0.01492411
  0.01491736  0.0149304   0.01492382  0.01492493  0.01492532  0.0149277
  0.01491917 -0.98508375  0.01492354  0.01492264  0.01492744  0.01492071
  0.01492822]
sum of dy is  2.77555756156e-17
the gradient for the correct character (t) is: -0.985083751732
the gradient for the character (a) is:  0.0149307863167


In [41]:
# we computed [y = np.dot(Why, h) + by]; Backpropagate to Why, h, and by
dWhy = np.dot(dy, h.T)
dh = np.dot(Why.T, dy)
dby = np.copy(dy)
print 'the hidden vector activations were:'
print h.ravel()
print 'the gradients are:'
print dh.ravel()
print 'the gradients dWhy have size: ', dWhy.shape
print 'a small sample is:'
print dWhy[:4,:4]

the hidden vector activations were:
[-0.02566928 -0.00711926 -0.00851462  0.01228545 -0.00241891  0.00636176
 -0.00171284 -0.01129739 -0.0069362   0.00932362]
the gradients are:
[-0.00824375 -0.00696831 -0.00844694  0.01640971 -0.0017776   0.00293419
  0.01496486  0.0062472  -0.00851701  0.009765  ]
the gradients dWhy have size:  (67, 10)
a small sample is:
[[-0.00038332 -0.00010631 -0.00012715  0.00018346]
 [-0.00038324 -0.00010629 -0.00012712  0.00018342]
 [-0.00038296 -0.00010621 -0.00012703  0.00018328]
 [-0.00038339 -0.00010633 -0.00012717  0.00018349]]


In [42]:
# we computed [h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h_prev + bh))]; 
# Backprop into Wxh, x, Whh, h_prev, bh:
dh_before_tanh = (1-h**2)*dh
dbh = np.copy(dh_before_tanh)
dWxh = np.dot(dh_before_tanh, x.T)
dWhh = np.dot(dh_before_tanh, h.T)
dh_prev = np.dot(Whh.T, dh_before_tanh)
print 'small sample of Whh:'
print Whh[:4,:4]

small sample of Whh:
[[-0.01152775  0.00881821 -0.00906459  0.00349652]
 [ 0.00261533  0.0120227  -0.00259614  0.01621284]
 [ 0.00182372  0.00073918 -0.00662722  0.02817786]
 [-0.01495566  0.00292029 -0.00142797  0.00315272]]


In [43]:
# we now have the gradients for all parameters! (Wxh, Whh, Why, bh, by)
# lets do a parameter update
learning_rate = 0.1
Wxh2 = Wxh - learning_rate * dWxh
Whh2 = Whh - learning_rate * dWhh
Why2 = Why - learning_rate * dWhy
bh2 = bh - learning_rate * dbh
by2 = by - learning_rate * dby

In [44]:
# these parameters should be much better! lets try it out:
h2 = np.tanh(np.dot(Wxh2, x) + np.dot(Whh2, h_prev + bh2))
y2 = np.dot(Why2, h2) + by2
p2 = np.exp(y2) / np.sum(np.exp(y2))
print 'probability assigned to the correct next character was: ', p[ix_target,0]
print 'probability assigned to the correct next character is now: ', p2[ix_target,0]
loss2 = -np.log(p2[ix_target,0])
print 'the cross-entropy (softmax) loss was ', loss
print 'the loss is now ', loss2

probability assigned to the correct next character was:  0.0149162482677
probability assigned to the correct next character is now:  0.0164625966368
the cross-entropy (softmax) loss was  4.20530417242
the loss is now  4.10666434182


In [45]:
# note: the probability for the correct character went up! (and the loss went down)

In [46]:
# putting it together with loops
def lossFun(inputs, targets, hprev):
    """
    inputs,targets are both list of integers.
    hprev is Hx1 array of initial hidden state
    returns the loss, gradients on model parameters, and last hidden state
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0
    
    # forward pass
    for t in xrange(len(inputs)):
        xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
        ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
        loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
    
    # backward pass: compute gradients going backwards
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(xrange(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1 # backprop into y
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t-1].T)
        dhnext = np.dot(Whh.T, dhraw)
        
    # clip to mitigate exploding gradients
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)
    
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

In [47]:
loss, dWxh, dWhh, dWhy, dbh, dby, hnew = lossFun(inputs, targets, h_prev)
print loss

105.12080312


In [48]:
# TODO: write the sampling code
def sample(h, seed_ix, n):
    """ 
    sample a sequence of integers from the model 
    h is initial memory state, seed_ix is seed letter for first time step
    n is the number of time steps to sample for
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = [] # sampled indices
    for t in xrange(n):
        pass # TODO: run the RNN for one time step, sample from distribution
    return ixes


In [49]:
# Stochastic Gradient Descent
n, p = 0, 0
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
learning_rate = 1e-3
while n < 10000:
    # prepare inputs (we're sweeping from left to right in steps seq_length long)
    if p+seq_length+1 >= len(data) or n == 0: 
        hprev = np.zeros((hidden_size,1)) # reset RNN memory
        p = 0 # go from start of data
    inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
    targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

    # forward seq_length characters through the net and fetch gradient
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    if n % 100 == 0: print 'iter %d, loss: %f' % (n, smooth_loss) # print progress

    # perform parameter update with Adagrad
    for param, dparam in zip([Wxh, Whh, Why, bh, by], 
                                [dWxh, dWhh, dWhy, dbh, dby]):
        param += -learning_rate * dparam

    p += seq_length # move data pointer
    n += 1 # iteration counter 

iter 0, loss: 105.117317
iter 100, loss: 104.983575
iter 200, loss: 104.623957
iter 300, loss: 104.095161
iter 400, loss: 103.411069
iter 500, loss: 102.748750
iter 600, loss: 101.900376
iter 700, loss: 101.043855
iter 800, loss: 100.269985
iter 900, loss: 99.357290
iter 1000, loss: 98.447726
iter 1100, loss: 97.485653
iter 1200, loss: 96.434239
iter 1300, loss: 95.316229
iter 1400, loss: 94.361564
iter 1500, loss: 93.489495
iter 1600, loss: 92.453112
iter 1700, loss: 91.702083
iter 1800, loss: 91.107364
iter 1900, loss: 90.257844
iter 2000, loss: 89.215814
iter 2100, loss: 88.522362
iter 2200, loss: 88.013419
iter 2300, loss: 87.378470
iter 2400, loss: 87.002066
iter 2500, loss: 86.754673
iter 2600, loss: 86.210309
iter 2700, loss: 85.948958
iter 2800, loss: 86.013457
iter 2900, loss: 85.739568
iter 3000, loss: 85.272399
iter 3100, loss: 85.384693
iter 3200, loss: 85.350449
iter 3300, loss: 85.119789
iter 3400, loss: 84.972730
iter 3500, loss: 84.656232
iter 3600, loss: 84.490833
iter

In [50]:
def sample(h, seed_ix, n):
    """ 
    sample a sequence of integers from the model 
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for t in xrange(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes

In [51]:
sample_ix = sample(hprev, char_to_ix['a'], 1000)
txt = ''.join(ix_to_char[ix] for ix in sample_ix)
print txt

noe a L:odmlt hd ts
olu hed og hoe. i
nrarro uCws ' Zlmky ynwrss[ mmo
sS hCInAr
ib htneayHseiCf uueo
H'Il s,dou
usKi., t?fr
i whhfgdre AtitTe
,mer
ll 
G tetdjcqsrrtfs soin,rnseiIdnIshos mn iasgh
idghfMIaoeM;,oaat,wv
d  , rnthoikfrlh rsqa at ;ilr wdeat tawl
atran iter yahnneesevd rls Soide TswD haPsiu no sphrcnhyGrsd mbRyos ,g sdt widotLn ohh 
 tcavny svag
bee rr nlh dnl h olCn  . terTrfo o t d nhoeewisn riwgono maeAahe CrS
tystet i!BWehi-'Nh:syd cv mr
iceawtesbthnhul eaec ,eeddage

mtdsu : mioeuT,ayin aharwoouh,uh alrhofnwW Tdeo CDnupee tntaehd
lee hGtc
oneE hs oh otil'rr ftO
t hoe nie yeidoqotnlt h s! ai.e,s c? eTwbr r ,etaa t y
Sar
ieeeio
oUoyehm UIGndoLeaeae h eaFfoaH
etnmris Paib
h r,iiEb
haemlkecySa,nlfTe
o hssRThokiwo R
r mso pmdsnbn weu ameUe l
tldal -enr otguO, ue 
nY olKh oh  Iboewornr 
yhh o:a,e pntU edvhclsndH hs mEQoeoi lhnub !e n,&iirEr sdiSdWl saases tsstgitsg urhdaleeweareit,e n!hitra u
Bl pr .ih 't  oit.es aoy atyt, enit fntae hEt
stuhaoIr  uKis
tUikn otbareu oefudz'uay