Deep Learning
=============

Assignment 6 - PROBLEM 3
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [66]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
import collections
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [67]:
import time
def how_long(f, *args):
    # Measure how long f takes to run
    t1 = time.time()
    res = f(*args)
    t2 = time.time()
    print ("tiempo utilizado = ",t2-t1)
    #return res, t2-t1
    return res

In [68]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.
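
Before building any model, it helps to pin down the target function exactly. The following is a minimal sketch in plain Python; the helper name `mirror_sentence` is illustrative and not part of the notebook.

```python
def mirror_sentence(sentence):
    """Reverse the characters of each word while keeping the word order."""
    return ' '.join(word[::-1] for word in sentence.split())

print(mirror_sentence('the quick brown fox'))  # eht kciuq nworb xof
```

The seq2seq model below is asked to learn this mapping from examples rather than compute it directly.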

---

In [75]:
def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data
  
words = read_data(filename)
print('Data size %d' % len(words))

Data size 17005207


Build the dictionary and replace rare words with UNK token.

In [76]:
vocabulary_size = 50000

def build_dataset(words):
  # Reserved word IDs:
  # ID=0 for the special word GO
  # ID=1 for the special word EOS
  # ID=2 for the rare words grouped under "UNK"

  # Initialize the list of word counts (frequencies).
  count = [['GO', 0],['EOS', 0],['UNK', -1]]
  # Count the most common words (50000 minus the three reserved entries).
  count.extend(collections.Counter(words).most_common(vocabulary_size - 3))
  # Build the vocabulary dictionary of word -> ID pairs so we can work with numbers;
  # IDs are assigned in order of decreasing frequency.
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  # Build a list of IDs equivalent to the "words" dataset, replacing each word with its ID,
  # and count the UNK occurrences along the way.
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 2  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  # Record the final number of rare words (the UNK frequency).
  count[2][1] = unk_count
  # Build the reverse dictionary, where the ID is the key and the word is the value.
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = how_long(build_dataset,words)
print('Most common words (+ GO,EOS,UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.

tiempo utilizado =  61.1043949127
Most common words (+ GO,EOS,UNK) [['GO', 0], ['EOS', 0], ['UNK', 418409], ('the', 1061396), ('of', 593677)]
Sample data [5241, 3086, 14, 8, 197, 4, 3139, 48, 61, 158]
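
To see the ID assignment on something small enough to check by hand, here is a compact sketch of the same idea on a toy corpus. The function name `build_toy_dataset` and the seven-word corpus are made up for illustration; the reserved IDs (GO=0, EOS=1, UNK=2) follow the cell above.

```python
import collections

def build_toy_dataset(words, vocabulary_size):
    """Map words to IDs; IDs 0, 1 and 2 are reserved for GO, EOS and UNK."""
    count = [['GO', 0], ['EOS', 0], ['UNK', -1]]
    # Keep only the most frequent words that fit after the three reserved slots.
    count.extend(collections.Counter(words).most_common(vocabulary_size - 3))
    dictionary = {word: idx for idx, (word, _) in enumerate(count)}
    # Words outside the vocabulary fall back to the UNK ID.
    data = [dictionary.get(word, dictionary['UNK']) for word in words]
    count[2][1] = data.count(dictionary['UNK'])
    return data, count, dictionary

toy_words = 'a b a c a b d'.split()
toy_data, toy_count, toy_dictionary = build_toy_dataset(toy_words, vocabulary_size=5)
print(toy_data)      # [3, 4, 3, 2, 3, 4, 2]
print(toy_count[2])  # ['UNK', 2]
```

With a vocabulary of five entries only `a` and `b` get their own IDs, so `c` and `d` both map to UNK.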


Create a small validation set.

In [77]:
valid_size = 1000 # the first thousand words are used for validation
valid_text = data[:valid_size]
train_text = data[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print('train:',[reverse_dictionary[i] for i in train_text[:64]])
print(valid_size, valid_text[:64])
print('valid:',[reverse_dictionary[i] for i in valid_text[:64]])

17004207 [66, 10288, 5241, 3246, 9633, 7, 6, 15, 12, 18, 10881, 5866, 51, 5444, 7, 8, 2, 2413, 3089, 21, 620, 7420, 91, 60, 5107, 36, 1339, 7, 8, 291, 83, 17367, 151, 1186, 1219, 5866, 7518, 3, 442, 17, 1895, 27, 8, 1133, 4, 628, 3419, 5, 8, 1133, 4, 820, 920, 5866, 7453, 9, 18863, 2, 4788, 1509, 36, 5774, 156, 38]
train: ['american', 'individualist', 'anarchism', 'benjamin', 'tucker', 'in', 'one', 'eight', 'two', 'five', 'josiah', 'warren', 'had', 'participated', 'in', 'a', 'UNK', 'experiment', 'headed', 'by', 'robert', 'owen', 'called', 'new', 'harmony', 'which', 'failed', 'in', 'a', 'few', 'years', 'amidst', 'much', 'internal', 'conflict', 'warren', 'blamed', 'the', 'community', 's', 'failure', 'on', 'a', 'lack', 'of', 'individual', 'sovereignty', 'and', 'a', 'lack', 'of', 'private', 'property', 'warren', 'proceeded', 'to', 'organise', 'UNK', 'anarchist', 'communities', 'which', 'respected', 'what', 'he']
1000 [5241, 3086, 14, 8, 197, 4, 3139, 48, 61, 158, 130, 744, 479, 10574, 136,

Utility functions

In [78]:
"""
def word2id(word):
  if word in dictionary:
    return dictionary[word]
  else:
    print('Unexpected word: %s' % word)
    return 0
  
def id2word(id):
  if id < vocabulary_size:
    return reverse_dictionary[id]
  else:
    print('Unexpected word id: ', id)
    return 'UNK'

print(word2id('what'), word2id('woman'), word2id('supercalifragilisticoexpialidoso'))
print(id2word(1), id2word(268), id2word(0), id2word(vocabulary_size+1))
"""
def reverseword(word):
    return word[::-1]

print(reverseword('what'))

tahw


###### The language to translate to: reversed words

In [79]:
vocabulary_size_2 = vocabulary_size #both languages have the same number of words

reserved = [0,1]
i = vocabulary_size - 1
dictionary_2 = dict()
reverse_dictionary_2 = dict()
for w in dictionary:
    if(dictionary[w] not in reserved):
        # assign the IDs in descending order as well, so the mapping is less obvious
        w2 = reverseword(w)
        dictionary_2[w2] = i
        reverse_dictionary_2[i] = w2
        i =  i - 1
dictionary_2['GO'] = 0
dictionary_2['EOS'] = 1
reverse_dictionary_2[0] = 'GO'
reverse_dictionary_2[1] = 'EOS'
        
for w in ['the','quick','brown','fox']:
    print('id = ',dictionary[w],'\tword = ',w,
          '\tid2 = ',dictionary_2[reverseword(w)],'\tword2 = ',reverse_dictionary_2[dictionary_2[reverseword(w)]])


id =  3 	word =  the 	id2 =  33988 	word2 =  eht
id =  3905 	word =  quick 	id2 =  2342 	word2 =  kciuq
id =  1201 	word =  brown 	id2 =  8646 	word2 =  nworb
id =  2528 	word =  fox 	id2 =  37023 	word2 =  xof


In [80]:
print(dictionary['GO'])
print(dictionary['EOS'])
print(dictionary['UNK'])
print(dictionary['the'])
print(dictionary_2['GO'])
print(dictionary_2['EOS'])
print(dictionary_2['KNU'])
print(dictionary_2['eht'])

0
1
2
3
0
1
9887
33988
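
The construction of the target vocabulary can be sketched on a five-entry dictionary. This is an illustrative variant, not the notebook's code: the name `build_target_dictionary` is hypothetical, it filters reserved entries by token rather than by ID, and it assumes Python 3.7+, where dictionaries iterate in insertion order (on older interpreters the resulting IDs depend on an arbitrary iteration order).

```python
def build_target_dictionary(dictionary, reserved=('GO', 'EOS')):
    """Build a vocabulary of reversed words whose IDs count down from the top."""
    target = {token: dictionary[token] for token in reserved}
    next_id = len(dictionary) - 1
    for word in dictionary:
        if word not in reserved:
            target[word[::-1]] = next_id  # reversed spelling, descending ID
            next_id -= 1
    return target

toy_dictionary = {'GO': 0, 'EOS': 1, 'UNK': 2, 'the': 3, 'fox': 4}
toy_target = build_target_dictionary(toy_dictionary)
print(toy_target)  # {'GO': 0, 'EOS': 1, 'KNU': 4, 'eht': 3, 'xof': 2}
```

Because reversing a string is a one-to-one operation, two different source words can never collide in the target vocabulary.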


Function to generate a training batch and labels for the Seq2Seq model.

In [111]:
class BatchGeneratorS2S(object): #return batch and labels
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings #"phrase" size
    segment = self._text_size // batch_size # split the text into N parallel streams (N = batch_size)
    self._cursor = [ offset * segment for offset in range(batch_size)] # one cursor per stream
    self._last_encoder, self._last_decoder, self._last_target = self._next_batch()
    
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    # np.int and np.float were removed from recent NumPy releases; the builtins are equivalent.
    encoder = np.zeros(shape=(self._batch_size), dtype=int)
    decoder = np.zeros(shape=(self._batch_size), dtype=int)
    target = np.zeros(shape=(self._batch_size,vocabulary_size), dtype=float)
    for b in range(self._batch_size):
      encoder[b] = self._text[self._cursor[b]]
      decoder[b] = dictionary_2[reverseword(reverse_dictionary[self._text[self._cursor[b]]])]
      target[b, dictionary_2[reverseword(reverse_dictionary[self._text[self._cursor[b]]])]] = 1.0 #1-hot-encoded array
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size # advance the cursor, wrapping around the text
    return encoder,decoder,target
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings - 1 new ones.
    """
    encoders = [self._last_encoder]
    decoders = [self._last_decoder]
    targets = [self._last_target]
    for step in range(self._num_unrollings-1): 
      e,d,t = self._next_batch()
      encoders.append(e)
      decoders.append(d)
      targets.append(t)
    self._last_encoder = encoders[-1]
    self._last_decoder = decoders[-1]
    self._last_target = targets[-1]
    return encoders,decoders,targets

def words(reverse_dictionary, probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  words back into its (most likely) word representation."""
  return [reverse_dictionary[w] for w in np.argmax(probabilities, 1)]

def batches2string(reverse_dictionary, batches):
  """Convert a sequence of batches back into their string representation."""
  s = [''] * batches[0].shape[0]
  for i,b in enumerate(batches):
    for j,e in enumerate(b):
        s[j] = s[j] +' '+ reverse_dictionary[e]
  return s

def targets2string(reverse_dictionary, targets):
  """Convert a sequence of 1-hot-encoded targets back into their (most likely) string representation."""
  s = [''] * targets[0].shape[0]
  for i,b in enumerate(targets):
    s = [''.join(x) for x in zip(s,[' ']*targets[0].shape[0],words(reverse_dictionary,b))] 
  return s

train_batches = BatchGeneratorS2S(train_text, 3, 5) #batch_size=3 num_unrollings=5 for function testing
valid_batches = BatchGeneratorS2S(valid_text, 1, 1)

print("train")
e1,d1,t1 = train_batches.next()
print(len(e1),len(d1),len(t1))
print(e1[0].shape)
print(e1[0])
print(d1[0].shape)
print(d1[0])
print(t1[0].shape)
print(batches2string(reverse_dictionary,e1))
print(batches2string(reverse_dictionary_2,d1))
print(targets2string(reverse_dictionary_2,t1))

e2,d2,t2 = train_batches.next()
print(batches2string(reverse_dictionary,e2))
print(batches2string(reverse_dictionary_2,d2))
print(targets2string(reverse_dictionary_2,t2))

print("\nvalid")
ve1,vd1,vt1 = valid_batches.next()
print(batches2string(reverse_dictionary,ve1))
print(batches2string(reverse_dictionary_2,vd1))
print(targets2string(reverse_dictionary_2,vt1))

train
5 5 5
(3,)
[ 66   3 578]
(3,)
[45057 33988 18302]
(3, 50000)
[' american individualist anarchism benjamin tucker', ' the table top wargame warhammer', ' subject to much debate recent']
[' nacirema tsilaudividni msihcrana nimajneb rekcut', ' eht elbat pot emagraw remmahraw', ' tcejbus ot hcum etabed tnecer']
[' nacirema tsilaudividni msihcrana nimajneb rekcut', ' eht elbat pot emagraw remmahraw', ' tcejbus ot hcum etabed tnecer']
[' tucker in one eight two', ' warhammer four zero zero zero', ' recent archaeological finds suggest multiple']
[' rekcut ni eno thgie owt', ' remmahraw ruof orez orez orez', ' tnecer lacigoloeahcra sdnif tseggus elpitlum']
[' rekcut ni eno thgie owt', ' remmahraw ruof orez orez orez', ' tnecer lacigoloeahcra sdnif tseggus elpitlum']

valid
[[ 0.  0.  0. ...,  0.  0.  0.]]
[' anarchism']
[' msihcrana']
[' msihcrana']
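
The cursor arithmetic behind the generator can be checked on a toy text. The sketch below is a simplification with hypothetical helper names (`make_cursors`, `read_steps`); it reproduces the splitting of the text into `batch_size` parallel streams and the circular advance, but leaves out the one-step overlap between consecutive calls and the translation into the target language.

```python
def make_cursors(text_size, batch_size):
    """Starting position of each of the batch_size parallel streams."""
    segment = text_size // batch_size
    return [offset * segment for offset in range(batch_size)]

def read_steps(text, cursors, num_steps):
    """Read num_steps consecutive items from every stream, advancing the cursors circularly."""
    steps = []
    for _ in range(num_steps):
        steps.append([text[c] for c in cursors])
        cursors[:] = [(c + 1) % len(text) for c in cursors]
    return steps

toy_text = list(range(12))
cursors = make_cursors(len(toy_text), batch_size=3)
print(cursors)                           # [0, 4, 8]
print(read_steps(toy_text, cursors, 2))  # [[0, 4, 8], [1, 5, 9]]
```

Each inner list is one time step across the batch, which is why the three training sentences printed above start at widely separated positions in the corpus.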


In [82]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized probabilities."""
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
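
Perplexity, as used in the training loop later, is the exponential of the mean negative log-probability assigned to the true labels. A minimal sketch in plain Python follows; the function name `perplexity` is illustrative, and it takes the probabilities of the true labels directly instead of a full prediction matrix.

```python
import math

def perplexity(true_label_probabilities):
    """exp of the mean negative log-probability assigned to the true labels."""
    # Clip tiny probabilities the same way logprob() does, to avoid log(0).
    clipped = [max(p, 1e-10) for p in true_label_probabilities]
    mean_nll = -sum(math.log(p) for p in clipped) / len(clipped)
    return math.exp(mean_nll)

vocabulary_size = 50000
uniform = perplexity([1.0 / vocabulary_size] * 4)
print(round(uniform))  # 50000
```

A model that guesses uniformly over the vocabulary therefore has a perplexity equal to the vocabulary size, which gives a baseline for judging an untrained network.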

Seq2seq LSTM Model.

[paper](https://arxiv.org/pdf/1409.3215v3.pdf)

<img src="basic_seq2seq.png">
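
In this kind of model the decoder is fed the target sequence shifted by one position: it starts from GO and is trained to emit each target word in turn, ending with EOS. The sketch below illustrates that alignment in plain Python; `decoder_io` is a hypothetical helper, and the IDs are the ones printed earlier for `eht kciuq nworb xof`.

```python
GO_ID, EOS_ID = 0, 1  # reserved IDs, matching the dictionaries above

def decoder_io(target_ids):
    """Return (decoder inputs, decoder labels) for one target sequence."""
    decoder_inputs = [GO_ID] + list(target_ids)   # GO, w1, w2, ...
    decoder_labels = list(target_ids) + [EOS_ID]  # w1, w2, ..., EOS
    return decoder_inputs, decoder_labels

inputs, labels = decoder_io([33988, 2342, 8646, 37023])  # eht kciuq nworb xof
for x, y in zip(inputs, labels):
    print(x, '->', y)
```

Both lists have `num_unrollings + 1` entries, which is why the decoder loop in the graph below runs one step longer than the encoder loop.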

In [87]:
##################
# LSTM3 seq2seq DECLARATION
##################

batch_size=64 # phrases per batch
num_unrollings=5 # words per phrase (consecutive phrases overlap by one word)

embedding_size = 1024 # Dimension of the embedding vector: the features that distinguish each word.

num_nodes = 64 # Number of LSTM memory cells; it equals batch_size only by coincidence
starter_learning_rate = 10.0
learning_decay_steps = 5000
learning_decay_rate = 0.1
clip_limit = 1.25

graphLSTM3 = tf.Graph()
with graphLSTM3.as_default():
  
  # Special words for beginning and end
  GO = tf.Variable(np.zeros([batch_size], dtype=np.int32), trainable=False)
  e = np.zeros([batch_size,vocabulary_size_2], dtype=np.float32)
  e[:,1] = 1.0 #ohe for labels, EOS id is "1"
  EOS = tf.Variable(e, trainable=False)
    
  # Input data, get input and labels.
  train_encoder = list()
  for _ in range(num_unrollings):
    train_encoder.append(tf.placeholder(tf.int32, shape=[batch_size])) #word ids for input language
  train_decoder = list()
  for _ in range(num_unrollings):
    train_decoder.append(tf.placeholder(tf.int32, shape=[batch_size])) #word ids for output language "inversish" :-)
  train_targets = list()
  for _ in range(num_unrollings):
    train_targets.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size_2])) #ohe for output language

  # LSTM Gates: 0=memory cell, 1=input, 2=forget, 3=output
  num_gates = 4  

  # Definition of the cell computation.
  def lstm_cell(i, o, last_state, gx, gm, gb):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the previous state and the gates."""
    I = tf.stack([i,i,i,i])
    O = tf.stack([o,o,o,o])
    gates = tf.matmul(I, gx) + tf.matmul(O, gm) + gb
    update = gates[0,:,:]
    input_gate = tf.sigmoid(gates[1,:,:])
    forget_gate = tf.sigmoid(gates[2,:,:])
    output_gate = tf.sigmoid(gates[3,:,:])
    next_state = forget_gate * last_state + input_gate * tf.tanh(update)
    outputmem = output_gate * tf.tanh(next_state)
    return outputmem, next_state

  # Variables saving state across unrollings.
  saved_omem = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)

  #################################
  # ENCODER variables:
  #################################
  embeddings_E = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)) 

  #Gates parameters:
  gxE = tf.Variable(tf.truncated_normal([num_gates, embedding_size, num_nodes], -0.1, 0.1))
  gmE = tf.Variable(tf.truncated_normal([num_gates, num_nodes, num_nodes], -0.1, 0.1))
  gbE = tf.Variable(tf.zeros([num_gates, 1, num_nodes]))

  #################################
  # DECODER variables:
  #################################
  embeddings_D = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)) 

  #Gates parameters:
  gxD = tf.Variable(tf.truncated_normal([num_gates, embedding_size, num_nodes], -0.1, 0.1))
  gmD = tf.Variable(tf.truncated_normal([num_gates, num_nodes, num_nodes], -0.1, 0.1))
  gbD = tf.Variable(tf.zeros([num_gates, 1, num_nodes]))

  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))

  #################################
  # ENCODER + DECODER processing:
  #################################
  
  #Inputs for encoder
  train_encoder_X = list()
  for i in range(num_unrollings):
    train_encoder_X.append(tf.nn.embedding_lookup(embeddings_E, train_encoder[i])) #input encoder embeddings

  # Unrolled encoder LSTM loop.
  omem = saved_omem
  state = saved_state
  for i in train_encoder_X:
    omem, state = lstm_cell(i, omem, state, gxE, gmE, gbE)

  #Inputs for decoder
  train_decoder_X = list() #GO + translated_words_embeddings: num_unrollings + 1
  train_decoder_X.append(tf.nn.embedding_lookup(embeddings_D, GO)) #GO at the beginning
  for i in range(num_unrollings):
    train_decoder_X.append(tf.nn.embedding_lookup(embeddings_D, train_decoder[i])) # convert IDs to embeddings

  #Labels
  train_decoder_y = list() #translated_words_ohe + EOS: num_unrollings + 1
  for i in range(num_unrollings):
    train_decoder_y.append(train_targets[i])
  train_decoder_y.append(EOS) #EOS at the end

  # Unrolled decoder LSTM loop. IMPORTANT: omem and state are NOT reset; they are inherited from the ENCODER.
  omemoriesD = list()
  for i in train_decoder_X:
    omem, state = lstm_cell(i, omem, state, gxD, gmD, gbD)
    omemoriesD.append(omem) # kept for the output layer and the loss computation

  # State saving across unrollings.
  with tf.control_dependencies([saved_omem.assign(omem),saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat_v2(omemoriesD, 0), w, b)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf.concat_v2(train_decoder_y, 0)))
                # Evaluate the output of each step with an ordinary softmax classifier

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, 
                                             learning_decay_steps, learning_decay_rate, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate) # TODO: use Adagrad?
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, clip_limit) #avoid exploding gradients
  optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  #################################
  # SAMPLE and VALIDATION:
  #################################

  # Sampling and validation eval: batch size = 1, no unrolling; tested on a single word at a time.
  sample_input = tf.placeholder(tf.int32, shape=[1])
  saved_sample_omem = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(saved_sample_omem.assign(tf.zeros([1, num_nodes])),
                                saved_sample_state.assign(tf.zeros([1, num_nodes])))
  
  # Feed the test word to the encoder,
  # then feed GO to the decoder and expect the translated word as output.
  # The encoder step uses the encoder gate parameters (gxE, gmE, gbE), as in training.
  sample_omem, sample_state = lstm_cell(tf.nn.embedding_lookup(embeddings_E, sample_input), 
                                         saved_sample_omem, saved_sample_state, gxE, gmE, gbE) #encoder

  sample_GO = tf.Variable(np.zeros([1], dtype=np.int32), trainable=False)
  sample_omem, sample_state = lstm_cell(tf.nn.embedding_lookup(embeddings_D, sample_GO), 
                                        sample_omem, sample_state, gxD, gmD, gbD) #decoder
    
  with tf.control_dependencies([saved_sample_omem.assign(sample_omem),saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_omem, w, b))

In [None]:
##################
# LSTM3 seq2seq EXECUTION
##################
t1 = time.time()

train_batches = BatchGeneratorS2S(train_text, batch_size, num_unrollings)
valid_batches = BatchGeneratorS2S(valid_text, 1, 1)

num_steps = 7001
summary_frequency = 100

eos = np.zeros([batch_size,vocabulary_size_2], dtype=np.float32)
eos[:,1] = 1.0 #ohe for labels, EOS id is "1"

with tf.Session(graph=graphLSTM3) as session: 
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    encoders,decoders,targets = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings):
      feed_dict[train_encoder[i]] = encoders[i]
      feed_dict[train_decoder[i]] = decoders[i]
      feed_dict[train_targets[i]] = targets[i]
    _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l

    # Periodic report
    if step % summary_frequency == 0:
      t2 = time.time()
      # The mean loss is an estimate of the loss over the last few batches.
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # Perplexity depends on how well the predictions match the labels
      targets.append(eos)
      labels = np.concatenate(list(targets))
      mbpx = float(np.exp(logprob(predictions, labels)))
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        ve,vd,vt = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: ve[0]})
        valid_logprob = valid_logprob + logprob(predictions, vt[0])
      vspx = float(np.exp(valid_logprob / valid_size))
      #Report
      print('step\t%d\t%ds:\tAvgLoss %f\tLRate %f\tMBperplex %.2f\tVSperplex %.2f' 
            % (step, t2-t1, mean_loss, lr, mbpx, vspx))
      mean_loss = 0
    
      # More complete report with samples
      # TODO: do this in one pass, sampling a complete phrase at once
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        test_phrase = ['the','quick','brown','fox']
        translated_sentence = ''
        reset_sample_state.run()
        for w in test_phrase:
          feed = [dictionary[w]]
          prediction = sample_prediction.eval({sample_input: feed})
          s = words(reverse_dictionary_2,sample(prediction))
          translated_sentence = translated_sentence + ' ' + s[0]
        print('the quick brown fox')
        print(translated_sentence)
        print('=' * 80)
  print("End.")

Initialized
step	0	10s:	AvgLoss 10.844849	LRate 10.000000	MBperplex 51269.32	VSperplex 40173.87
the quick brown fox
 satsinidnas remitrom degdirb tcnujnoc
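
The `LRate` column follows a staircase exponential decay: the rate is multiplied by the decay factor once every `learning_decay_steps` steps. A plain-Python sketch of that schedule with the notebook's settings (start 10.0, factor 0.1, every 5000 steps) is shown below; `staircase_decay` is an illustrative name, not a TensorFlow function.

```python
def staircase_decay(start_rate, step, decay_steps, decay_rate):
    """Learning rate multiplied by decay_rate once every decay_steps steps."""
    return start_rate * decay_rate ** (step // decay_steps)

for step in (0, 4999, 5000, 7000):
    print(step, staircase_decay(10.0, step, 5000, 0.1))
```

With `num_steps = 7001`, the run therefore spends its first 5000 steps at a rate of 10.0 and the remainder at roughly 1.0.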
