Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    data = tf.compat.as_str(f.read(name))
  return data
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 5000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99995000  of one eight four eight in france proudhon s philosophy of prop
5000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [5]:
# [a-z] + ' ' + '^' (<go>) + '~' (<eos>); see https://www.tensorflow.org/tutorials/seq2seq/
vocabulary_size = len(string.ascii_lowercase) + 3
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 3
  elif char == ' ':
    return 0
  elif char == '^': # <go>
    return 1
  elif char == '~': # <eos>
    return 2
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 2:
    return chr(dictid + first_letter - 3)
  elif dictid == 1:
    return '<go>'
  elif dictid == 2:
    return '<eos>'
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'), char2id('^'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
3 28 0 0 1
<go> x  
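
The ID layout used above (0 for space, 1 for `<go>`, 2 for `<eos>`, 3 to 28 for `a` to `z`) can be checked with a round trip over a whole string. The sketch below is a minimal, self-contained illustration; `encode` and `decode` are hypothetical helper names that are not part of the notebook. Unlike `char2id`, this `encode` raises on unknown characters instead of mapping them to 0, and this `decode` drops the control tokens rather than printing them.

```python
import string

# Same layout as char2id/id2char: 0 = ' ', 1 = <go>, 2 = <eos>, 3..28 = a..z.
FIRST_LETTER = ord('a')


def encode(text):
    """Map a lowercase string to a list of vocabulary IDs (hypothetical helper)."""
    ids = []
    for ch in text:
        if ch in string.ascii_lowercase:
            ids.append(ord(ch) - FIRST_LETTER + 3)
        elif ch == ' ':
            ids.append(0)
        else:
            # Assumption: fail loudly instead of silently mapping to space.
            raise ValueError('unexpected character: %r' % ch)
    return ids


def decode(ids):
    """Map IDs back to text, skipping the <go> and <eos> control tokens."""
    chars = []
    for i in ids:
        if i > 2:
            chars.append(chr(i + FIRST_LETTER - 3))
        elif i == 0:
            chars.append(' ')
    return ''.join(chars)


ids = encode('the fox')
print(ids)          # [22, 10, 7, 0, 8, 17, 26]
print(decode(ids))  # the fox
```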


In [6]:
def mirror_words(sentence):
    return ' '.join([word[::-1] for word in sentence.split(' ')])
mirror_words('the quick brown fox')

'eht kciuq nworb xof'
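
For a sequence-to-sequence model, each mirrored string is used twice: shifted right behind a `<go>` marker as the decoder input, and followed by an `<eos>` marker as the target. The sketch below shows that pairing on plain strings, using `^` and `~` as the marker characters (the same ones `char2id` maps to IDs 1 and 2). `make_example` is a hypothetical helper introduced only for illustration.

```python
GO, EOS = '^', '~'  # marker characters for <go> and <eos>


def mirror_words(sentence):
    """Reverse every space-separated word, keeping word order."""
    return ' '.join(word[::-1] for word in sentence.split(' '))


def make_example(source):
    """Return (encoder_input, decoder_input, target) for one sequence."""
    mirrored = mirror_words(source)
    decoder_input = GO + mirrored  # decoder sees <go>, then the target shifted by one
    target = mirrored + EOS        # target is the mirrored text, closed by <eos>
    return source, decoder_input, target


enc, dec, tgt = make_example('the quick brown')
print(enc)  # the quick brown
print(dec)  # ^eht kciuq nworb
print(tgt)  # eht kciuq nworb~
```

The decoder input and the target are each one symbol longer than the encoder input, and `dec[1:] == tgt[:-1]`: at every step the decoder is asked to predict the symbol that follows the one it was just given.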

Function to generate a training batch for the LSTM model.

In [7]:
batch_size = 64
seq_length = 15

class BatchGenerator(object):
  def __init__(self, text, batch_size, seq_length):
    segment_length = seq_length * batch_size
    len_text = len(text)//segment_length*segment_length
    self._text = text[:len_text]
    self._text_size = len_text
    self._batch_size = batch_size
    self._seq_length = seq_length
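    # One cursor per batch row, spaced segment_length characters apart.
    self._cursor = [offset * segment_length for offset in range(batch_size)]

  def next(self):
    """Return the next batch: a list of batch_size strings, each seq_length
    characters long. Every cursor then advances by seq_length and wraps
    around at the end of the text."""
    batches = []
    for step in range(self._batch_size):
      batches.append(self._text[self._cursor[step]:self._cursor[step]+self._seq_length])
      self._cursor[step] = (self._cursor[step] + self._seq_length) % self._text_size
    return batches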

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]


def prediction_to_str(predictions):
    return ''.join([characters(pred)[0] for pred in predictions])



train_batches = BatchGenerator(train_text, batch_size, seq_length)
valid_batches = BatchGenerator(valid_text, 1, seq_length)

print(train_batches.next())
print(valid_batches.next())

[' of one eight f', 'self an anarchi', 'december one ni', 'eceived the ful', 'kunin as their ', 'n the one eight', 'clude emma gold', 'the assassinati', 'archism see bel', ' the cgt saw li', 'on federations ', ' s two zero s a', 'ion of one nine', 'ontrol for them', 'rm continues to', 'ian fascism was', 'ers supported b', 'nti fascist act', ' freedom christ', 'ro s strands of', 'story thus the ', 'ed the state ha', 't anarchism tak', 'me anarcho capi', 'itting another ', ' throughout the', 'r than speaking', 'sm rejects the ', ' in magazines s', 'means let alone', 'avors a non hie', ' such as the wo', 'ist cause both ', 'ed parliamentar', 'iticisms of ana', 's book european', 'anarchism has b', ' fascists he al', ' in most of wes', 'more extensive ', 't the list of a', 'ations hundreds', 'tal health give', ' autism for rea', 'ddle of the twe', 'y read until on', 'st terminology ', ' for autism eit', 'd that although', 'ldhood that man', 'nderreactivity ', 'd that sensory ', 'utism are 
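
A batch comes back as a list of `batch_size` strings, whereas the list-of-placeholders style of seq2seq graph expects one list of IDs per timestep. The sketch below shows that transposition on a toy batch. `to_time_major` and `toy_char2id` are hypothetical names; `toy_char2id` is a simplified stand-in for `char2id` that handles only spaces and lowercase letters.

```python
def to_time_major(batch, encode_char):
    """Transpose equal-length strings into one list of IDs per timestep."""
    seq_length = len(batch[0])
    if any(len(seq) != seq_length for seq in batch):
        raise ValueError('all sequences in a batch must have the same length')
    return [[encode_char(seq[t]) for seq in batch] for t in range(seq_length)]


def toy_char2id(ch):
    # Simplified stand-in for char2id: ' ' -> 0, 'a'..'z' -> 3..28.
    return 0 if ch == ' ' else ord(ch) - ord('a') + 3


batch = ['abc', 'xyz']
steps = to_time_major(batch, toy_char2id)
print(steps)  # [[3, 26], [4, 27], [5, 28]]
```

Element `t` of the result holds character `t` of every sequence, so a batch of 64 strings of length 15 becomes 15 lists of 64 IDs.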

In [8]:
num_units = 64
embedding_size = 30
graph = tf.Graph()
with graph.as_default():
    
    # Time-major inputs: one int32 placeholder of shape (batch,) per timestep.
    # The decoder side has one extra step for the <go> / <eos> marker.
    train_encoder_inputs = [tf.placeholder(tf.int32, shape=(None,)) for _ in range(seq_length)]
    train_decoder_inputs = [tf.placeholder(tf.int32, shape=(None,)) for _ in range(seq_length+1)]
    train_targets = [tf.placeholder(tf.int32, shape=(None,)) for _ in range(seq_length+1)]

    # Every target position contributes equally to the loss.
    train_weights = [tf.constant(1.0) for _ in range(seq_length+1)]

    cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

    with tf.variable_scope("s2s"):
        outputs, states = tf.nn.seq2seq.embedding_rnn_seq2seq(
            train_encoder_inputs, train_decoder_inputs, cell,
            vocabulary_size, vocabulary_size, embedding_size)

    train_loss = tf.nn.seq2seq.sequence_loss(outputs, train_targets, train_weights)
    train_step = tf.train.GradientDescentOptimizer(0.1).minimize(train_loss)


    # Validation graph: batch size 1, sharing the training variables.
    validation_encoder_inputs = [tf.placeholder(tf.int32, shape=(1,)) for _ in range(seq_length)]
    validation_decoder_inputs = [tf.placeholder(tf.int32, shape=(1,)) for _ in range(seq_length+1)]
    validation_targets = [tf.placeholder(tf.int32, shape=(1,)) for _ in range(seq_length+1)]

    validation_weights = [tf.constant(1.0) for _ in range(seq_length+1)]

    # feed_previous=True: only the first decoder input (<go>) is read; each later
    # step is fed the decoder's own previous prediction.
    with tf.variable_scope("s2s", reuse=True):
        validation_outputs, validation_states = tf.nn.seq2seq.embedding_rnn_seq2seq(
            validation_encoder_inputs, validation_decoder_inputs, cell,
            vocabulary_size, vocabulary_size, embedding_size,
            feed_previous=True)
    # Stack the per-step logits into one tensor of shape (seq_length+1, 1, vocabulary_size).
    validation_predictions = tf.pack(validation_outputs)
    validation_loss = tf.nn.seq2seq.sequence_loss(validation_outputs, validation_targets, validation_weights)
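
With `feed_previous=True` the decoder no longer reads the true previous symbol. It starts from `<go>` and feeds its own most likely output back in at every step, which is greedy decoding. The sketch below illustrates that loop in plain NumPy under stated assumptions: `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a hand-written stand-in for one decoder step, not the LSTM from the graph above.

```python
import numpy as np


def greedy_decode(step_fn, state, go_id, num_steps):
    """Feed each step's argmax back in as the next decoder input."""
    token = go_id
    decoded = []
    for _ in range(num_steps):
        logits, state = step_fn(token, state)
        token = int(np.argmax(logits))  # most likely symbol becomes the next input
        decoded.append(token)
    return decoded


def toy_step(token, state):
    # Hypothetical decoder step over a 5-symbol vocabulary:
    # it always puts the highest score on (token + 1) % 5.
    vocab = 5
    logits = np.zeros(vocab)
    logits[(token + 1) % vocab] = 1.0
    return logits, state


print(greedy_decode(toy_step, state=None, go_id=1, num_steps=4))  # [2, 3, 4, 0]
```

Because every prediction becomes the next input, an early mistake changes everything the decoder sees afterwards, which is one reason validation output can look much worse than the training loss suggests.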

In [9]:


with tf.Session(graph=graph) as sess:
    sess.run(tf.initialize_all_variables())

    for step in range(60001):
        input_batches = train_batches.next()
        mirror_batches = [mirror_words(seq) for seq in input_batches]
        # Time-major ID lists: element i holds character i of every sequence.
        train_encoder_inputs_data = [[char2id(seq[i]) for seq in input_batches] for i in range(seq_length)]
        # Decoder inputs start with <go> (ID 1); targets end with <eos> (ID 2).
        train_decoder_inputs_data = [[1] * batch_size] + [[char2id(seq[i]) for seq in mirror_batches] for i in range(seq_length)]
        train_targets_data = [[char2id(seq[i]) for seq in mirror_batches] for i in range(seq_length)] + [[2] * batch_size]
        feed_dict = dict(zip(train_encoder_inputs+train_decoder_inputs+train_targets,
                             train_encoder_inputs_data+train_decoder_inputs_data+train_targets_data))

        sess.run(train_step, feed_dict=feed_dict)

        if step % 1000 == 0:
            validation_input_batches = valid_batches.next()
            validation_mirror_batches = [mirror_words(seq) for seq in validation_input_batches]

            validation_encoder_inputs_data = [[char2id(validation_input_batches[0][i])] for i in range(seq_length)]
            validation_targets_data = [[char2id(validation_mirror_batches[0][i])] for i in range(seq_length)] + [[2]]
            feed_dict = dict(zip(validation_encoder_inputs+validation_targets,
                                 validation_encoder_inputs_data+validation_targets_data))
            # With feed_previous=True only the first decoder input is needed.
            feed_dict[validation_decoder_inputs[0]] = [1] # <go>
            loss, predictions = sess.run([validation_loss, validation_predictions], feed_dict=feed_dict)
            print("="*80)
            print('step:', step)
            print('loss: ', loss)
            print('input: ', validation_input_batches[0])
            print('mirror input: ', validation_mirror_batches[0])
            # Drop the final step, which corresponds to the <eos> target.
            print('prediction:   ', prediction_to_str(predictions[:-1]))

step: 0
loss:  3.34634
input:  inated as a ter
mirror input:  detani sa a ret
prediction:    sssssssssnntsss
step: 1000
loss:  2.73117
input:  m of abuse firs
mirror input:  m fo esuba srif
prediction:     e e ea aa aa <eos>
step: 2000
loss:  2.86431
input:  t used against 
mirror input:  t desu tsniaga 
prediction:     ets sita erna 
step: 3000
loss:  3.05266
input:  early working c
mirror input:  ylrae gnikrow c
prediction:    rerani rotna o<eos>
step: 4000
loss:  2.71172
input:  lass radicals i
mirror input:  ssal slacidar i
prediction:    ssa lalidsna da
step: 5000
loss:  3.49797
input:  ncluding the di
mirror input:  gnidulcn eht id
prediction:    naditno dihl ht
step: 6000
loss:  4.86864
input:  ggers of the en
mirror input:  sregg fo eht ne
prediction:    sret no eht eco
step: 7000
loss:  4.4078
input:  glish revolutio
mirror input:  hsilg oitulover
prediction:    silo ehtilugnor
step: 8000
loss:  1.06272
input:  n and the sans 
mirror input:  n dna eht snas 
prediction:    n dn
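
The step-0 loss of about 3.35 is close to ln 29 ≈ 3.37, the value a uniform guess over the 29 vocabulary symbols would score. The NumPy sketch below reproduces that baseline. It is an approximation of what a weighted sequence loss computes with default averaging (cross-entropy averaged over timesteps, then over the batch), written from scratch for illustration; the function name and the array shapes are assumptions, not the TensorFlow implementation.

```python
import numpy as np


def sequence_loss(logits, targets, weights):
    """Weighted average cross-entropy.

    logits:  (T, B, V) floats, targets: (T, B) ints, weights: (T,) floats.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    weights = np.asarray(weights, dtype=np.float64)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Cross-entropy of the target symbol at every (timestep, example) pair.
    t_idx, b_idx = np.indices(targets.shape)
    cross_entropy = -log_probs[t_idx, b_idx, targets]
    # Weighted mean over timesteps, then plain mean over the batch.
    per_example = (cross_entropy * weights[:, None]).sum(axis=0) / weights.sum()
    return per_example.mean()


T, B, V = 16, 1, 29  # seq_length + 1 steps, batch of 1, 29 symbols
uniform_logits = np.zeros((T, B, V))
targets = np.zeros((T, B), dtype=int)
loss = sequence_loss(uniform_logits, targets, np.ones(T))
print(loss, np.log(V))
```

A loss well above this baseline, as at steps 6000 and 7000, means the decoder assigned very low probability to some target symbols on that particular validation sequence. Each validation loss here comes from a single 15-character sequence, so the values are noisy.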