In [1]:
%pylab inline
from IPython.display import Image, display

import tensorflow as tf
sess = tf.InteractiveSession()

Populating the interactive namespace from numpy and matplotlib


## Language Modeling Using TensorFlow

* Task: given a sequence of words, predict the next word
  - Models the probability of sentences in a language
* Data available at http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
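
As background on why next-word prediction is enough to model whole sentences, the standard chain rule of probability (general background, not something specific to this notebook) factors a sentence $w_1, \dots, w_T$ into a product of next-word probabilities:

$$ P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) $$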

In [2]:
!head -n3 data/ptb_word/ptb.train.txt

 pierre <unk> N years old will join the board as a nonexecutive director nov. N 
 mr. <unk> is chairman of <unk> n.v. the dutch publishing group 
 rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate 


* The vocabulary has exactly 10000 distinct words, including `<unk>` (unknown word) and `<eos>` (end of sentence)

* Read the data

In [3]:
words = open('data/ptb_word/ptb.train.txt').read().replace('\n', '<eos>').split()
words_as_set = set(words)
print('Number of words %d' % len(words_as_set))
word_to_id = {w: i for i, w in enumerate(words_as_set)}
id_to_word = {i: w for i, w in enumerate(words_as_set)}
data = [word_to_id[w] for w in words]

Number of words 10000
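
Enumerating a `set` assigns ids in an arbitrary order that may differ between Python processes (string hashing is randomized by default in Python 3), so ids learned in one session might not line up with a checkpoint restored in another. Below is a minimal sketch of a reproducible alternative, assuming an alphabetical ordering of the vocabulary is acceptable. `build_vocab` and the toy corpus are illustrative names, not part of the original notebook.

```python
def build_vocab(tokens):
    """Map each distinct token to an id, in sorted order so ids are reproducible."""
    vocab = sorted(set(tokens))
    word_to_id = {w: i for i, w in enumerate(vocab)}
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word

# Toy corpus standing in for ptb.train.txt (hypothetical data).
text = " the cat sat \n the dog sat \n"
tokens = text.replace('\n', '<eos>').split()
word_to_id, id_to_word = build_vocab(tokens)
data = [word_to_id[w] for w in tokens]
print(tokens)
print(data)
```

Sorting costs little for a 10000-word vocabulary and makes the id assignment depend only on the data, not on the interpreter session.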


* Let's build the following model
  - A recurrent neural network, unrolled in time
  - Long short-term memory (LSTM) cells

<img src='data/lstm.png' />

* LSTM Cell
  - Takes the current input, the previous output and the previous cell state, and produces the new output and the new cell state.
  
$$
h_t, C_t = \mathrm{lstm}(x_t, h_{t-1}, C_{t-1})
$$

<img src='data/lstm_cell.png' width='40%'>

* Full set of equations ($[]$ is vector concatenation, $\times$ is matrix multiply, $*$ is element-wise multiply)

$$ X = [h_{t-1}, x_t] $$
$$ f_t = \sigma(W_f \times X + b_f) $$
$$ i_t = \sigma(W_i \times X + b_i) $$
$$ o_t = \sigma(W_o \times X + b_o) $$
$$ \tilde{C}_t = \tanh(W_C \times X + b_C) $$
$$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$
$$ h_t = o_t * \tanh(C_t) $$
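
To make the equations concrete, here is a dependency-free sketch of a single LSTM step for one example, using plain Python lists. It follows the column-vector convention of the equations, so each weight matrix is stored as `state_size` rows of length `state_size + input_size`. A batched implementation that multiplies `X` on the left would store the transpose. The helper names (`sigmoid`, `affine`, `lstm_step`) and the toy sizes are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, X, b):
    """Return W x X + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, X)) + b_j for row, b_j in zip(W, b)]

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step for a single example; p holds the four weights and biases."""
    X = h_prev + x_t  # list concatenation implements [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(p['W_f'], X, p['b_f'])]
    i_t = [sigmoid(z) for z in affine(p['W_i'], X, p['b_i'])]
    o_t = [sigmoid(z) for z in affine(p['W_o'], X, p['b_o'])]
    C_tilde = [math.tanh(z) for z in affine(p['W_C'], X, p['b_C'])]
    C_t = [f * c + i * ct for f, c, i, ct in zip(f_t, C_prev, i_t, C_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t

# Toy sizes (hypothetical): state_size = 2, input size = 2, all-zero parameters.
state_size, input_size = 2, 2
zeros_W = [[0.0] * (state_size + input_size) for _ in range(state_size)]
params = {name: zeros_W for name in ('W_f', 'W_i', 'W_o', 'W_C')}
params.update({name: [0.0] * state_size for name in ('b_f', 'b_i', 'b_o', 'b_C')})

h_t, C_t = lstm_step([1.0, -1.0], [0.0, 0.0], [2.0, -4.0], params)
print(h_t, C_t)
```

With all-zero parameters every gate equals 0.5 and the candidate state is 0, so the new cell state is simply half of the previous one. This is a handy sanity check on the gating arithmetic.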

### Parameters of the model
* We need to pick embedding dimensions and the dimensions of the state vector.
  - For convenience, let's pick `embedding_dims = state_size = 128`
* Embedding vectors
  - `[10000, embedding_dims]`.
* The 4 weight matrices in the equations ($W_f, W_i, W_o, W_C$)
  - `[2 * state_size, state_size]`
* 4 biases ($b_f, b_i, b_o, b_C$)
  - `[state_size]`
* Softmax classifier logit layer weights and biases
  - `[state_size, 10000], [10000]`
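
As a quick check on the shapes listed above, the total number of trainable parameters can be tallied with a few lines of arithmetic. This is a sketch that assumes `embedding_dims == state_size` (as chosen above) and a stack of 4 LSTM layers, which is the number of cells the notebook instantiates later. `count_parameters` is an illustrative helper, not part of the notebook.

```python
def count_parameters(vocab_size, state_size, num_layers):
    """Count parameters assuming embedding_dims == state_size."""
    embeddings = vocab_size * state_size
    # Each cell: 4 weight matrices [2 * state_size, state_size] and 4 biases [state_size].
    per_cell = 4 * (2 * state_size * state_size + state_size)
    softmax = state_size * vocab_size + vocab_size
    return embeddings + num_layers * per_cell + softmax

total = count_parameters(vocab_size=10000, state_size=128, num_layers=4)
print(total)  # 3096336
```

Most of the roughly 3.1 million parameters sit in the embedding table and the softmax layer. The four LSTM cells together account for about 0.5 million.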

* Implement an LSTM cell as a class, so we can instantiate many layers

In [16]:
class LSTMCell(object):
  def __init__(self, state_size):
    self.state_size = state_size
    self.W_f = tf.Variable(self.initializer())
    self.W_i = tf.Variable(self.initializer())
    self.W_o = tf.Variable(self.initializer())
    self.W_C = tf.Variable(self.initializer())
    self.b_f = tf.Variable(tf.zeros([state_size]))
    self.b_i = tf.Variable(tf.zeros([state_size]))
    self.b_o = tf.Variable(tf.zeros([state_size]))
    self.b_C = tf.Variable(tf.zeros([state_size]))
  def __call__(self, x_t, h_t1, C_t1):
    X = tf.concat(1, [h_t1, x_t])
    f_t = tf.sigmoid(tf.matmul(X, self.W_f) + self.b_f)
    i_t = tf.sigmoid(tf.matmul(X, self.W_i) + self.b_i)
    o_t = tf.sigmoid(tf.matmul(X, self.W_o) + self.b_o)
    Ctilde_t = tf.tanh(tf.matmul(X, self.W_C) + self.b_C)
    C_t = f_t * C_t1 + i_t * Ctilde_t
    h_t = o_t * tf.tanh(C_t)
    return h_t, C_t
  def initializer(self):
    return tf.random_uniform([2*self.state_size, self.state_size],
                             -0.1, 0.1)

* Declare embedding vectors, LSTM cells, and logit layer params

In [17]:
state_size = 128

embedding_params = tf.Variable(tf.random_uniform([10000, state_size],
                                                 -0.02, 0.02))

lstm = []
for _ in range(4):
  lstm.append(LSTMCell(state_size))

sm_w = tf.Variable(tf.random_uniform([state_size, 10000], -0.1, 0.1))
sm_b = tf.Variable(tf.zeros([10000]))

* Let's build the model!

In [6]:
# words and targets are placeholders for [batch_size, num_steps]
# tensor of word and target ids
words = tf.placeholder(tf.int64, name='words')
targets = tf.placeholder(tf.int64, name='targets')

def model(batch_size, num_steps):
  output = [tf.zeros([batch_size, state_size])] * 4
  state = [tf.zeros([batch_size, state_size])] * 4
  preds = []
  cost = 0.0
  for i in range(num_steps):
    # Get the embedding for words
    embedding = tf.nn.embedding_lookup(embedding_params, words[:, i])
    # Run the LSTM cells
    output[0], state[0] = lstm[0](embedding, output[0], state[0])
    for d in range(1, 4):
      output[d], state[d] = lstm[d](output[d-1], output[d], state[d])
    # Get the logits
    logits = tf.matmul(output[-1], sm_w) + sm_b
    # Get the softmax predictions
    preds.append(tf.nn.softmax(logits))
    # Cost per step
    cost = cost + tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(logits,
                                                     targets[:, i]))
  # Average cost across time steps
  cost = cost / np.float32(num_steps)
  return preds, cost

* Some boring routines to get a mini-batch of examples

In [20]:
def get_one_example(num_steps):
  offset = np.random.randint(len(data) - num_steps - 1)
  return (data[offset:offset + num_steps],
          data[offset+1:offset+1+num_steps])
w, t = get_one_example(4)
print([id_to_word[x] for x in w],[id_to_word[x] for x in t])

(['the', 'company', "'s", 'egg'], ['company', "'s", 'egg', 'product'])


In [26]:
def get_mini_batch(batch_size, num_steps):
  words, targets = [], []
  for _ in range(batch_size):
    w, t = get_one_example(num_steps)
    words.append(w)
    targets.append(t)
  return np.array(words), np.array(targets)

w, t = get_mini_batch(2, 4)
for i in range(2):
  print([id_to_word[x] for x in w[i]], [id_to_word[x] for x in t[i]])

(['baker', 'is', 'willing', 'to'], ['is', 'willing', 'to', 'accept'])
(['N', 'N', 'to', 'yield'], ['N', 'to', 'yield', 'N'])


* Everything in working order?
* Try to get the predictions for a random example

In [9]:
preds, cost = model(1, 8)
tf.initialize_all_variables().run()
w, t = get_mini_batch(1, 8)
p = preds[0].eval(feed_dict={words: w, targets: t})
np.set_printoptions(formatter={'float': lambda x: '%.04f'%x}, threshold=10000)
print(p[0][:100])

[0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001]


* The untrained model predicts a nearly uniform distribution over the 10000 words, so the perplexity $e^{cost}$ should be approximately 10000

In [10]:
c = cost.eval(feed_dict={words: w, targets: t})
print(c, np.exp(c))

(9.2103415, 10000.011)
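
The value printed above is what a uniform prediction over the vocabulary gives: the cross-entropy of assigning probability 1/10000 to the correct word is $\ln 10000$, and exponentiating it recovers the vocabulary size. A small sketch of that arithmetic follows. The `perplexity` helper is an illustrative name.

```python
import math

def perplexity(cost):
    """Convert an average cross-entropy (in nats) into a perplexity."""
    return math.exp(cost)

vocab_size = 10000
# Cross-entropy of a uniform prediction: -log(1 / vocab_size).
uniform_cost = -math.log(1.0 / vocab_size)
print('%.4f %.1f' % (uniform_cost, perplexity(uniform_cost)))
```

Any cost below about 9.21 therefore means the model is doing better than guessing uniformly.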


* Let's train the model
* Let's get fancy
  - Clip gradients before applying to parameters
  - Use `tf.train.GradientDescentOptimizer` to reduce some boilerplate
  - Use exponential decay on the learning rate
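
Clipping by global norm rescales all gradients by one shared factor, so their directions are preserved while their combined L2 norm is capped. The plain-Python sketch below mirrors the documented formula `clip_norm / max(global_norm, clip_norm)`. `clip_gradients` and the toy gradients are illustrative names and values, not the TensorFlow implementation.

```python
import math

def clip_gradients(grads, clip_norm):
    """Scale all gradients jointly so their combined L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The factor is 1 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

grads = [[3.0, 0.0], [4.0]]          # two toy gradient tensors, flattened
clipped, norm = clip_gradients(grads, clip_norm=2.5)
print(norm, clipped)
```

Here the global norm is 5, twice the limit, so every component is halved. Gradients whose norm is already below the limit pass through unchanged.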

In [11]:
# Create a variable to hold the step number, but mark it as not trainable 
global_step = tf.Variable(0, trainable=False)

In [12]:
def train(learning_rate, batch_size, num_steps):
  _, cost_value = model(batch_size, num_steps)
  all_vars = tf.trainable_variables()
  grads = tf.gradients(cost_value, all_vars)
  grads, _ = tf.clip_by_global_norm(grads, 5.0)
  # Multiply the learning rate by 0.8 for every 1000 steps
  # (a smooth decay, since staircase defaults to False)
  learning_rate = tf.train.exponential_decay(
    learning_rate, global_step, 1000, 0.8)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  # apply_gradients increments the global_step
  train_op = optimizer.apply_gradients(zip(grads, all_vars),
                                       global_step=global_step)
  return cost_value, train_op
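
The schedule used in `train` follows the formula `learning_rate * decay_rate ** (global_step / decay_steps)`. A small sketch of that arithmetic shows how the rate evolves. `decayed_learning_rate` is an illustrative helper that reimplements the formula in plain Python. It is not a TensorFlow function.

```python
def decayed_learning_rate(initial_rate, step, decay_steps, decay_rate,
                          staircase=False):
    """Exponential decay; with staircase=True the rate drops in discrete jumps."""
    if staircase:
        exponent = step // decay_steps
    else:
        exponent = step / float(decay_steps)
    return initial_rate * decay_rate ** exponent

for step in (0, 100, 1000, 5000):
    print(step, round(decayed_learning_rate(1.0, step, 1000, 0.8), 4))
```

With the settings used here, the rate after the 100 training steps run in this notebook is still close to its starting value. It reaches 0.8 only at step 1000.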

* And we are off to the races!

In [27]:
batch_size = 32
num_timesteps = 16
cost_value, train_op = train(1.0, batch_size, num_timesteps)
tf.initialize_all_variables().run()
for step_number in range(100):
  w, t = get_mini_batch(batch_size, num_timesteps)
  c, _ = sess.run([cost_value, train_op], feed_dict={words: w, targets: t})
  if step_number % 10 == 0:
    print('step %d: %.3f' % (step_number, c))

step 0: 9.210
step 10: 9.067
step 20: 8.898
step 30: 8.349
step 40: 7.647
step 50: 7.557
step 60: 7.284
step 70: 7.101
step 80: 7.090
step 90: 7.010


In [28]:
saver = tf.train.Saver(tf.all_variables())
saver.save(sess, './ptb_params', global_step=global_step.eval())

'./ptb_params-100'

* Let's ask the model to generate sentences
  - Start off with a few words
  - Sample from the probability distribution to get the next word
  - Remember to feed the previous output and cell state of every layer back into the model
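
The "sample from the probability distribution" step can be done with inverse-CDF sampling: draw a uniform number in [0, 1) and find which cumulative-probability interval it falls into. A minimal standard-library sketch follows. `sample_index` and the toy distribution are illustrative names, and the clamp guards against a cumulative sum that rounding leaves slightly below 1.

```python
import bisect
import itertools
import random

def sample_index(probs, u):
    """Pick the index whose cumulative-probability interval contains u in [0, 1)."""
    cdf = list(itertools.accumulate(probs))
    # bisect_right counts the cdf entries <= u, i.e. the intervals u has passed.
    index = bisect.bisect_right(cdf, u)
    # Clamp to the last valid index in case cdf[-1] ends up below u.
    return min(index, len(probs) - 1)

probs = [0.25, 0.25, 0.5]   # toy next-word distribution (hypothetical)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_index(probs, random.random())] += 1
print(counts)
```

Over many draws the counts should be roughly proportional to `probs`. Sampling instead of always taking the most likely word is what lets repeated calls produce different sentences from the same prompt.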

In [15]:
saver.restore(sess, './ptb_params-5000')

embedding = tf.nn.embedding_lookup(embedding_params, words[:, 0])
output_in = [tf.zeros([1, state_size])] * 4
state_in = [tf.zeros([1, state_size])] * 4
output = [0] * 4
state = [0] * 4
# Run the LSTM cells
output[0], state[0] = lstm[0](embedding, output_in[0], state_in[0])
for d in range(1, 4):
  output[d], state[d] = lstm[d](output[d-1], output_in[d], state_in[d])
# Get the logits
logits = tf.matmul(output[-1], sm_w) + sm_b
# Get the softmax predictions
preds = tf.nn.softmax(logits)

def get_sentence(start_words, length):
  start_words = start_words.split()
  w = np.array([[word_to_id[start_words[0]]]])
  t = sess.run([preds] + output + state,
               feed_dict={words: w})
  sentence = [start_words[0]]
  for i in range(length):
    if i + 1 < len(start_words):
      w[0, 0] = word_to_id[start_words[i+1]]
    else:
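      # Inverse-CDF sampling; clamp to the last valid id (9999) in case
      # rounding leaves the cumulative sum slightly below 1
      w[0, 0] = min(9999, np.sum(np.cumsum(t[0]) < np.random.rand()))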
    sentence.append(id_to_word[w[0, 0]])
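    # Feed each layer's previous output and cell state back in
    # (d indexes layers, so the outer loop variable i is not shadowed)
    feed_dict = dict([(output_in[d], t[1+d]) for d in range(4)] +
                     [(state_in[d], t[5+d]) for d in range(4)] +
                     [(words, w)])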
    t = sess.run([preds] + output + state,
                 feed_dict=feed_dict)
  return ' '.join(sentence)

will donald trump casualty by tension major <eos> already
what is not trust how slid davis in purchase
the greatest soviet june N in asked the u.s.
meaning of life reactions are but $ soon still


In [29]:
print(get_sentence('will donald trump', 8))
print(get_sentence('what is', 8))
print(get_sentence('the greatest', 8))
print(get_sentence('meaning of life', 8))

will donald trump midnight esso present knowing resulting brady
what is boat our requested aspects sweden robust reluctance
the greatest yet lesser debris cnbc join jonathan idle
meaning of life 14-year-old republics criticisms somebody inspector child
