# Recurrent Neural Networks

***

### Contents
1. [Introduction](#Recurrent-Neural-Networks)
2. [Simple LSTM](#Simple-LSTM)
3. [Stacked LSTM](#Stacked-LSTM)
4. [LSTMs for word prediction](#LSTMs-for-word-prediction)

***

Recurrent neural networks take earlier data into account when processing the current input, which makes them well suited to sequential data such as time series or audio recognition.

RNNs store states, which work as memory. The hidden output, as it is known, is fed back into the network as a secondary input and is updated with every new data point. This cycle repeats for each element of the sequence until training ends.

![rnn](img/rnn.png)
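
The feedback loop described above can be sketched with a scalar "vanilla" RNN in plain Python. This is only an illustration: the function names and the weights `w_x`, `w_h` and `b` are hypothetical values chosen for the example, not something taken from this notebook.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0  # the "memory" starts empty
    history = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)  # the previous output re-enters as a second input
        history.append(h)
    return history

# Only the first input is non-zero, yet later states still carry a trace of it
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The hidden state stays positive after the input drops to zero, which is the memory effect the text refers to.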

Two types of RNN:
- One-to-many: one input produces several outputs (e.g. generating a descriptive sentence for an image)
- Many-to-one: several inputs produce one output (e.g. financial market forecasting or sentiment analysis)

However, this model has some problems that can hinder or even prevent training on large amounts of data. Examples are the high computational cost of keeping the states, the "vanishing gradient", where the gradient shrinks toward 0, and the "exploding gradient", where the gradient grows without bound.
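
A rough way to see both gradient problems: backpropagation through time multiplies one factor per time step, so a factor slightly below or above 1 compounds quickly. The sketch below uses a single constant factor as a stand-in for the real per-step Jacobians, which is a simplifying assumption.

```python
def gradient_after(steps, factor):
    """Product of `steps` identical per-step factors, a stand-in for the chain rule through time."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

vanishing = gradient_after(50, 0.9)  # per-step factor below 1 shrinks toward 0
exploding = gradient_after(50, 1.1)  # per-step factor above 1 keeps growing
print(vanishing, exploding)
```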

***

# Long Short-Term Memory (LSTM)
LSTMs were developed to solve problems intrinsic to ordinary RNNs. They consist of 4 basic logistic elements, each with its own weights and biases:
- The memory cell
- Read gate, which reads information from the memory cell and sends it back to the RNN
- Write gate, which writes information to the memory cell
- Forget (keep) gate, which decides which old information should be erased from the memory cell

![elements](img/lstm-elements.png)

The data flow starts at the forget gate, which decides whether the previous information should be kept or forgotten. It receives both the current input and the previous output and passes them through the [sigmoid activation function](https://en.wikipedia.org/wiki/Sigmoid_function). If the previously stored information is to be kept, the stored value is scaled by the gate's output, producing a *candidate value* to be kept in the memory cell.

Because the gates are logistic, they are easy to backpropagate through, so the network can learn how each gate should behave in each case. The storage problem is addressed by selecting which information is stored. The gradient problems are mitigated because the weights can be updated over time through an easily differentiable function. LSTMs are therefore a very good answer to the two problems that hindered the use of RNNs.
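
To make the gates concrete, here is a minimal scalar sketch of one LSTM step using the common textbook formulation. The forget, write (input) and read (output) gates are sigmoids and the candidate is a `tanh`. The parameter values and the names `lstm_step` and `params` are hypothetical, and real implementations such as TensorFlow's work on vectors and matrices instead of scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps each gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre("forget"))          # how much of the old memory to keep
    write = sigmoid(pre("write"))            # how much of the candidate to store
    read = sigmoid(pre("read"))              # how much of the memory to expose
    candidate = math.tanh(pre("candidate"))  # new value proposed for the memory

    c = forget * c_prev + write * candidate  # updated memory cell
    h = read * math.tanh(c)                  # output fed back into the network
    return h, c

# Hypothetical parameters, chosen only for illustration
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "write", "read", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, p=params)
print(h, c)
```

Driving the forget and write gates toward 0 (for example with a large negative bias) wipes the memory cell, which is the "erase" behavior described in the list above.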

***


## Simple LSTM

In [1]:
import numpy as np
import tensorflow as tf
sess = tf.Session()

TensorFlow ships an RNN implementation in the ```tensorflow.contrib.rnn``` module. The cell works with two state values, ```prv_output``` (also called ```h```) and ```prv_state``` (known as ```c```). We must also initialize a state ```state```, here a tuple of two zero tensors.

In [2]:
LSTM_CELL_SIZE = 4  # output size (dimension), which is same as hidden size in the cell

lstm_cell = tf.contrib.rnn.BasicLSTMCell(LSTM_CELL_SIZE, state_is_tuple=True)
state = (tf.zeros([1,LSTM_CELL_SIZE]),)*2
state

(<tf.Tensor 'zeros:0' shape=(1, 4) dtype=float32>,
 <tf.Tensor 'zeros:0' shape=(1, 4) dtype=float32>)

Let's create a sample input:

In [3]:
sample_input = tf.constant([[3,2,2,2,2,2]],dtype=tf.float32)
print (sess.run(sample_input))

[[3. 2. 2. 2. 2. 2.]]


Let's pass this input to the LSTM:

In [4]:
with tf.variable_scope("LSTM_sample1"):
    output, state_new = lstm_cell(sample_input, state)
sess.run(tf.global_variables_initializer())
print (sess.run(state_new))

LSTMStateTuple(c=array([[ 0.45861578, -0.30320457,  0.12372471, -0.00261922]],
      dtype=float32), h=array([[ 0.39968607, -0.23159988,  0.03959474, -0.00215346]],
      dtype=float32))


We see that the state has two parts: the cell state ```c``` and the output ```h```. The output is shown below:

In [5]:
print (sess.run(output))

[[ 0.39968607 -0.23159988  0.03959474 -0.00215346]]


***

## Stacked LSTM
Another way to work with LSTMs is to stack them, with each unit called a "cell". The output of one cell is the input of the next, working like the layers of a deep neural network, with a different degree of abstraction and complexity at each layer.

As before, we start a new session:

In [6]:
sess = tf.Session()

Let's define the input size and then, cell by cell, the number of hidden nodes.

In [7]:
input_dim = 6

cells = []

First cell:

In [8]:
LSTM_CELL_SIZE_1 = 4 #4 hidden nodes
cell1 = tf.contrib.rnn.LSTMCell(LSTM_CELL_SIZE_1)
cells.append(cell1)

Second cell:

In [9]:
LSTM_CELL_SIZE_2 = 5 #5 hidden nodes
cell2 = tf.contrib.rnn.LSTMCell(LSTM_CELL_SIZE_2)
cells.append(cell2)

The cells can be stacked with the ```tf.contrib.rnn.MultiRNNCell``` function:

In [10]:
stacked_lstm = tf.contrib.rnn.MultiRNNCell(cells)

Next, an RNN is created from ```stacked_lstm```:

In [11]:
# Batch size x time steps x features.
data = tf.placeholder(tf.float32, [None, None, input_dim])
output, state = tf.nn.dynamic_rnn(stacked_lstm, data, dtype=tf.float32)

The RNN input is a tensor of shape [batch_size, max_time, dimension]. As an analogy, 'dimension' holds the values obtained from a single observation; 'max_time' is the time window over which related information is considered; 'batch_size' is the number of independent sequences, observed over different periods.

In [12]:
#Batch size x time steps x features.
sample_input = [[[1,2,3,4,3,2], [1,2,1,1,1,2],[1,2,2,2,2,2]],[[1,2,3,4,3,2],[3,2,2,1,1,2],[0,0,0,0,3,2]]]
sample_input

[[[1, 2, 3, 4, 3, 2], [1, 2, 1, 1, 1, 2], [1, 2, 2, 2, 2, 2]],
 [[1, 2, 3, 4, 3, 2], [3, 2, 2, 1, 1, 2], [0, 0, 0, 0, 3, 2]]]

In [13]:
output

<tf.Tensor 'rnn/transpose_1:0' shape=(?, ?, 5) dtype=float32>

Because the second layer was defined with 5 hidden nodes, the last dimension of the output (5) differs from that of the input (6).

In [14]:
sess.run(tf.global_variables_initializer())
sess.run(output, feed_dict={data: sample_input})

array([[[-0.02508579, -0.0288842 , -0.032556  ,  0.01631174,
          0.02888456],
        [-0.05613008, -0.09874412, -0.10730629,  0.02930988,
          0.0627039 ],
        [-0.06494982, -0.14582326, -0.16224556,  0.03179003,
          0.0591735 ]],

       [[-0.02508579, -0.0288842 , -0.032556  ,  0.01631174,
          0.02888456],
        [-0.05069217, -0.09139941, -0.10482069,  0.02718367,
          0.0638642 ],
        [-0.04845953, -0.11996509, -0.15019654,  0.02321863,
          0.05263352]]], dtype=float32)

***

## LSTMs for word prediction
Now let's apply LSTMs to a real-world problem. Like the keyboard on your phone, we will try to predict the next word that fits the context. This task appears in many kinds of problems, such as speech recognition, translation, captioning and text correction.

![language-modelling](img/language-modelling.png)

To do so, we will use [word embeddings](https://www.tensorflow.org/tutorials/representation/word2vec), n-dimensional vectors capable of representing words and sentences. Initially, each word is assigned random values. During training the values adapt and help us predict the next word. Embeddings place words used in similar contexts close together in the vector space, so words indicating quantity, places or feelings end up in different groups.

We will use "The Penn Treebank", a large and widely trusted dataset manually annotated at the University of Pennsylvania. We can feed the model with this content, which ranges from texts of the US Department of Energy to texts from the Library of America.

TensorFlow even provides a [helper module](https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/reader.py) specific to reading this dataset, ```tensorflow.models.rnn.ptb.reader```.

In [1]:
import time
import numpy as np
import tensorflow as tf
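import os
# os._exit(0) kills the kernel process. Run it on its own only to force a restart;
# left enabled here it would discard the imports above.
# os._exit(0)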

In [2]:
# !wget -q -O datasets/ptb.zip https://ibm.box.com/shared/static/z2yvmhbskc45xd2a9a4kkn6hg4g4kj5r.zip
# !unzip -o datasets/ptb.zip -d datasets
# !cp datasets/ptb/reader.py .

import reader

Now we can download the dataset contents.

In [3]:
# !wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz 
# !tar xzf simple-examples.tgz -C datasets/

Let's define the hyperparameter values and the model structure:

In [4]:
#Initial weight scale
init_scale = 0.1
#Initial learning rate
learning_rate = 1.0
#Maximum permissible norm for the gradient (For gradient clipping -- another measure against Exploding Gradients)
max_grad_norm = 5
#The number of layers in our model
num_layers = 2
#The total number of recurrence steps, also known as the number of layers when our RNN is "unfolded"
num_steps = 20
#The number of processing units (neurons) in the hidden layers
hidden_size_l1 = 256
hidden_size_l2 = 128
#The maximum number of epochs trained with the initial learning rate
max_epoch_decay_lr = 4
#The total number of epochs in training
max_epoch = 15
#The probability for keeping data in the Dropout Layer (This is an optimization, but is outside our scope for this notebook!)
#At 1, we ignore the Dropout Layer wrapping.
keep_prob = 1
#The decay for the learning rate
decay = 0.5
#The size for each batch of data
batch_size = 60
#The size of our vocabulary
vocab_size = 10000
#The size of each word embedding vector
embeding_vector_size = 200
#Training flag to separate training from testing
is_training = 1
#Data directory for our dataset
data_dir = "datasets/simple-examples/data/"

#### Structure:
- We will use two LSTM cells, one with 256 hidden units and the other with 128 hidden units
- 20 recurrence steps
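
The `learning_rate`, `decay`, `max_epoch_decay_lr` and `max_epoch` hyperparameters suggest a step-wise decay schedule. The sketch below shows one plausible reading, modeled on the recipe commonly used for PTB language models. The exact formula and the helper name `lr_for_epoch` are assumptions, since the training loop is not shown in this section.

```python
learning_rate = 1.0
decay = 0.5
max_epoch_decay_lr = 4
max_epoch = 15

def lr_for_epoch(epoch):
    """Learning rate for a 0-based epoch index under the assumed schedule."""
    # Keep the initial rate for the first max_epoch_decay_lr epochs, then halve it every epoch
    return learning_rate * decay ** max(epoch + 1 - max_epoch_decay_lr, 0)

schedule = [lr_for_epoch(e) for e in range(max_epoch)]
print(schedule[:7])
```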

Let's start a new session:

In [5]:
# Close the session from the previous sections if it still exists
try:
    sess.close()
except NameError:
    pass
session = tf.InteractiveSession(config=tf.ConfigProto(log_device_placement=True))

In [6]:
# Reads the data and separates it into training data, validation data and testing data
raw_data = reader.ptb_raw_data(data_dir)
train_data, valid_data, test_data, vocab, word_to_id = raw_data

In [7]:
len(train_data)

929589

Our training set has 929,589 words. Now let's look at some of them:

In [8]:
def id_to_word(id_list):
    # Invert the word -> id mapping once instead of scanning it for every id
    id_to_word_map = {wid: word for word, wid in word_to_id.items()}
    return [id_to_word_map[w] for w in id_list if w in id_to_word_map]
                

print(id_to_word(train_data[0:100]))

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', '<eos>', 'pierre', '<unk>', 'N', 'years', 'old', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'nov.', 'N', '<eos>', 'mr.', '<unk>', 'is', 'chairman', 'of', '<unk>', 'n.v.', 'the', 'dutch', 'publishing', 'group', '<eos>', 'rudolph', '<unk>', 'N', 'years', 'old', 'and', 'former', 'chairman', 'of', 'consolidated', 'gold', 'fields', 'plc', 'was', 'named', 'a', 'nonexecutive', 'director', 'of', 'this', 'british', 'industrial', 'conglomerate', '<eos>', 'a', 'form', 'of', 'asbestos', 'once', 'used', 'to', 'make', 'kent', 'cigarette', 'filters', 'has', 'caused', 'a', 'high', 'percentage', 'of', 'cancer', 'deaths', 'among', 'a', 'group', 'of']


Using the reader module, we create an iterator to read the sentences in batches.

In [9]:
itera = reader.ptb_iterator(train_data, batch_size, num_steps)
first_tuple = next(itera)
x = first_tuple[0]
y = first_tuple[1]

x.shape

(60, 20)

We see that our dataset was split into batches of 60 "sentences" of 20 words each. Below we inspect the sentences, encoded here as unique IDs.

In [10]:
x[0:3]

array([[9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984,
        9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995],
       [ 901,   33, 3361,    8, 1279,  437,  597,    6,  261, 4276, 1089,
           8, 2836,    2,  269,    4, 5526,  241,   13, 2420],
       [2654,    6,  334, 2886,    4,    1,  233,  711,  834,   11,  130,
         123,    7,  514,    2,   63,   10,  514,    8,  605]],
      dtype=int32)
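
The way such an iterator typically produces inputs and targets can be sketched in plain Python: the token stream is cut into `batch_size` contiguous rows, and each target window is the input window shifted one position ahead. The function `batch_iterator` below is a hypothetical stand-in written for illustration, not the actual `reader.ptb_iterator` code.

```python
def batch_iterator(data, batch_size, num_steps):
    """Yield (inputs, targets) batches where targets are inputs shifted by one token."""
    batch_len = len(data) // batch_size
    # Split the token stream into batch_size contiguous rows
    rows = [data[i * batch_len:(i + 1) * batch_len] for i in range(batch_size)]
    # The last token of a row has no successor, hence batch_len - 1
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        start = i * num_steps
        inputs = [row[start:start + num_steps] for row in rows]
        targets = [row[start + 1:start + num_steps + 1] for row in rows]
        yield inputs, targets

tokens = list(range(20))  # toy stand-in for the stream of word IDs
batches = list(batch_iterator(tokens, batch_size=2, num_steps=3))
toy_x, toy_y = batches[0]
print(toy_x)
print(toy_y)
```

Each target row starts one token after the matching input row, which is the same relationship visible between `x[0]` and `y[0]` in this notebook.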

We define two placeholders: one for the input and one for the target (the next word).

In [11]:
_input_data = tf.placeholder(tf.int32, [batch_size, num_steps])  # [60 x 20]
_targets = tf.placeholder(tf.int32, [batch_size, num_steps])  # [60 x 20]

We define a dictionary holding the input and the targets. It can be used to feed the placeholders.

In [12]:
feed_dict = {_input_data:x, _targets:y}
session.run(_input_data, feed_dict)

array([[9970, 9971, 9972, ..., 9993, 9994, 9995],
       [ 901,   33, 3361, ...,  241,   13, 2420],
       [2654,    6,  334, ...,  514,    8,  605],
       ...,
       [7831,   36, 1678, ...,    4, 4558,  157],
       [  59, 2070, 2433, ...,  400,    1, 1173],
       [2097,    3,    2, ..., 2043,   23,    1]], dtype=int32)

Only then do we stack the two LSTM cells:

In [13]:
lstm_cell_l1 = tf.contrib.rnn.BasicLSTMCell(hidden_size_l1, forget_bias=0.0)
lstm_cell_l2 = tf.contrib.rnn.BasicLSTMCell(hidden_size_l2, forget_bias=0.0)
stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm_cell_l1, lstm_cell_l2])

Each LSTM layer has two state matrices: ```c_state```, the cell state, and ```m_state```, the hidden state (shown as ```h``` in the output below). Since the first layer has 256 hidden units and the input is a batch of 60 sentences, it has two [60x256] matrices. The second layer has two [60x128] matrices.

In [14]:
_initial_state = stacked_lstm.zero_state(batch_size, tf.float32)
_initial_state

(LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState/zeros:0' shape=(60, 256) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState/zeros_1:0' shape=(60, 256) dtype=float32>),
 LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState_1/zeros:0' shape=(60, 128) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState_1/zeros_1:0' shape=(60, 128) dtype=float32>))

The state matrices are shown below (all zeros for now):

In [15]:
session.run(_initial_state, feed_dict)

(LSTMStateTuple(c=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), h=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)),
 LSTMStateTuple(c=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), h=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
  

### Embeddings
Words must be converted to numeric vectors. One-hot encoding is not a good choice here: with 10000 unique words there would be 10000 categories, which would be extremely costly. Instead, we create a layer in our network that converts the input IDs into a dense representation, in the form of a tensor.

In [16]:
embedding_vocab = tf.get_variable("embedding_vocab", [vocab_size, embeding_vector_size])  #[10000x200]

In [17]:
session.run(tf.global_variables_initializer())
session.run(embedding_vocab)

array([[-0.02222355, -0.02239299,  0.02285221, ..., -0.0135777 ,
        -0.00966106, -0.01049087],
       [-0.02212248,  0.01584486,  0.01178148, ...,  0.00932343,
         0.0138042 , -0.0190709 ],
       [ 0.01354723,  0.01081765,  0.01197524, ...,  0.01789277,
         0.01237494,  0.01284679],
       ...,
       [-0.00505498, -0.02215985,  0.01811605, ..., -0.00453311,
        -0.01991387, -0.00718697],
       [-0.00291013, -0.00580085,  0.0231658 , ...,  0.01885387,
         0.01027678, -0.01407855],
       [-0.01132059,  0.02221517,  0.02410362, ..., -0.01758648,
         0.00971607, -0.0231338 ]], dtype=float32)

The ```embedding_lookup``` function converts each input ID into its corresponding embedding vector.

In [18]:
# Define where to get the data for our embeddings from
inputs = tf.nn.embedding_lookup(embedding_vocab, _input_data)  # shape=(60, 20, 200)
inputs

<tf.Tensor 'embedding_lookup:0' shape=(60, 20, 200) dtype=float32>

In [19]:
session.run(inputs[0], feed_dict)

array([[ 2.08730474e-02, -1.24567943e-02, -1.92565564e-02, ...,
         1.39541328e-02,  2.42200568e-02,  1.56214535e-02],
       [ 8.36488232e-03,  6.21220097e-04, -1.22633232e-02, ...,
         7.27566704e-03,  1.49087235e-02, -2.36732066e-02],
       [-1.45274960e-03, -9.85317770e-03,  6.40131906e-03, ...,
         1.22594312e-02,  1.95758156e-02,  1.97419338e-02],
       ...,
       [ 9.10164788e-04,  2.15193368e-02, -1.73129663e-02, ...,
         2.92157941e-03,  2.29546204e-02,  8.31496343e-03],
       [-5.76356798e-03, -1.86453406e-02, -1.01010324e-02, ...,
         5.18731214e-03, -1.49471778e-02,  1.81617029e-02],
       [-1.33857066e-02,  5.48977219e-03,  9.89437103e-05, ...,
         2.34806277e-02, -9.60426405e-03,  2.08830759e-02]], dtype=float32)
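
Conceptually, an embedding lookup is just row selection from a table, and it gives the same result as multiplying a one-hot vector by that table, only without the wasted work. The toy sizes and helper names below are illustrative assumptions. The real layer uses a [10000 x 200] trainable matrix.

```python
import random

random.seed(0)
vocab_size, embedding_size = 5, 3  # toy sizes; the notebook uses 10000 and 200

# One small random row per word ID, like a freshly initialized embedding matrix
embedding = [[random.uniform(-0.05, 0.05) for _ in range(embedding_size)]
             for _ in range(vocab_size)]

def embedding_lookup(table, ids):
    """Replace each ID in a [batch][step] grid with its embedding row."""
    return [[table[word_id] for word_id in sentence] for sentence in ids]

def one_hot_matmul(table, word_id):
    """Same row obtained the expensive way: one-hot vector times the table."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(len(table))]
    return [sum(one_hot[i] * table[i][j] for i in range(len(table)))
            for j in range(len(table[0]))]

batch = [[4, 0], [2, 2]]  # 2 sentences of 2 word IDs each
looked_up = embedding_lookup(embedding, batch)
print(len(looked_up), len(looked_up[0]), len(looked_up[0][0]))
```

The result has one extra dimension compared with the ID grid, which is why a [60 x 20] input becomes a [60 x 20 x 200] tensor.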

### Building the RNN
The ```dynamic_rnn``` function builds the recurrent neural network from the two LSTM cells.

Since the second LSTM layer has 128 hidden units, its output has shape [60x20x128], where 60 is the batch size, 20 is the number of time steps and 128 is the depth of the output.

In [20]:
outputs, new_state =  tf.nn.dynamic_rnn(stacked_lstm, inputs, initial_state=_initial_state)
outputs

<tf.Tensor 'rnn/transpose_1:0' shape=(60, 20, 128) dtype=float32>

In [21]:
session.run(tf.global_variables_initializer())
session.run(outputs[0], feed_dict)

array([[-3.0827976e-04, -1.2370784e-04,  4.1377652e-04, ...,
        -3.6192688e-04, -3.1070056e-04, -5.7222578e-04],
       [-4.6243705e-04, -2.3241562e-05, -2.0648286e-04, ...,
        -2.8773278e-04, -4.0090283e-05, -6.5303245e-04],
       [-3.5284500e-04,  3.5875296e-04, -4.4528861e-04, ...,
        -1.3894135e-04,  7.7257596e-04, -5.3864077e-04],
       ...,
       [-3.1978011e-04,  5.9345010e-04,  7.9479569e-04, ...,
         4.0902427e-04, -4.5945882e-04,  6.3383649e-04],
       [ 4.6206434e-04,  3.5707920e-04,  1.2664311e-03, ...,
         8.3880633e-04, -3.3646365e-04,  1.0801820e-03],
       [ 7.5855060e-05,  2.3736736e-04,  7.6443620e-04, ...,
         1.2507838e-03, -5.9488160e-04,  7.7890075e-04]], dtype=float32)

Before feeding the softmax layer, we flatten the output into a 2-D matrix with one row per word (60 x 20 = 1200 rows):

In [22]:
output = tf.reshape(outputs, [-1, hidden_size_l2])
output

<tf.Tensor 'Reshape:0' shape=(1200, 128) dtype=float32>

Now we create the logistic layer, which returns the probability of each of the 10000 unique words in our vocabulary being the next word.

In [23]:
softmax_w = tf.get_variable("softmax_w", [hidden_size_l2, vocab_size])  # [128 x 10000]
softmax_b = tf.get_variable("softmax_b", [vocab_size])  # [10000]
logits = tf.matmul(output, softmax_w) + softmax_b
prob = tf.nn.softmax(logits)

In [24]:
session.run(tf.global_variables_initializer())
output_words_prob = session.run(prob, feed_dict)
print("shape of the output: ", output_words_prob.shape)
print("The probability of observing words in t=0 to t=20", output_words_prob[0:20])

shape of the output:  (1200, 10000)
The probability of observing words in t=0 to t=20 [[1.00774181e-04 9.86271843e-05 9.88556567e-05 ... 9.97972238e-05
  1.00062149e-04 9.97601164e-05]
 [1.00778823e-04 9.86246014e-05 9.88633110e-05 ... 9.97858588e-05
  1.00062622e-04 9.97575044e-05]
 [1.00786878e-04 9.86279338e-05 9.88588872e-05 ... 9.97834140e-05
  1.00057558e-04 9.97627285e-05]
 ...
 [1.00780358e-04 9.86357336e-05 9.88559914e-05 ... 9.97836978e-05
  1.00047888e-04 9.97665484e-05]
 [1.00791272e-04 9.86357918e-05 9.88568645e-05 ... 9.97784518e-05
  1.00052290e-04 9.97813331e-05]
 [1.00791935e-04 9.86377563e-05 9.88520187e-05 ... 9.97755269e-05
  1.00050129e-04 9.97867319e-05]]
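
The probabilities above all sit close to 1e-04 because a softmax over nearly equal logits is nearly uniform, and 1/10000 = 1e-04. A minimal pure-Python sketch, with made-up near-zero logits standing in for the output of the freshly initialized network:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

n_words = 10000
# Hypothetical near-zero logits, as small random initial weights would produce
logits = [0.001 * ((i % 7) - 3) for i in range(n_words)]
probs = softmax(logits)
print(min(probs), max(probs), sum(probs))
```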


To get the most probable words, we use the ```argmax``` function:

In [25]:
np.argmax(output_words_prob[0:20], axis=1)

array([1493, 9833, 9833, 7173, 7173, 8935, 8935, 9506, 8935, 9046, 9046,
       9046, 9046, 2868, 2868, 5555, 2868, 8935, 8935, 8935])

The correct words would be:

In [26]:
y[0]

array([9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984, 9986,
       9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995, 9996], dtype=int32)

Or, equivalently, by evaluating the ```_targets``` placeholder:

In [27]:
targ = session.run(_targets, feed_dict) 
targ[0]

array([9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984, 9986,
       9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995, 9996], dtype=int32)

Our model has not been trained yet, so its weights are still random and none of the predictions is correct. To improve it, we first define our loss function, here ```sequence_loss_by_example```.

In [28]:
loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [tf.reshape(_targets, [-1])],[tf.ones([batch_size * num_steps])])

Its output is the cross-entropy of each target word, that is, the logarithm of the [perplexity](https://pt.wikipedia.org/wiki/Perplexidade). The lower the perplexity, the better the model predicts the desired result.

In [29]:
session.run(loss, feed_dict)[:10]

array([9.193686, 9.223942, 9.206563, 9.201277, 9.200038, 9.216866,
       9.223826, 9.201516, 9.21173 , 9.219634], dtype=float32)
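
To relate these numbers to perplexity, exponentiate the average loss. The sketch below reuses the ten values printed above and compares them with the loss of a uniform guess over 10000 words, ln(10000).

```python
import math

# Per-word losses copied from the output above
losses = [9.193686, 9.223942, 9.206563, 9.201277, 9.200038,
          9.216866, 9.223826, 9.201516, 9.21173, 9.219634]

mean_loss = sum(losses) / len(losses)
perplexity = math.exp(mean_loss)  # perplexity = exp(average cross-entropy)
uniform_loss = math.log(10000)    # loss of a uniform guess over 10000 words

print(mean_loss, perplexity, uniform_loss)
```

The mean loss is almost exactly ln(10000), so the untrained model is about as uncertain as picking a word uniformly at random.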

From the vector above we obtain our cost function:

In [30]:
cost = tf.reduce_sum(loss) / batch_size
session.run(tf.global_variables_initializer())
session.run(cost, feed_dict)

184.22096

### Training
Using gradient descent (repeatedly moving the parameters against the gradient of the cost function), we reduce the error and look for the best weights and biases.

In [31]:
# Create a variable for the learning rate
lr = tf.Variable(0.0, trainable=False)
# Create the gradient descent optimizer with our learning rate
optimizer = tf.train.GradientDescentOptimizer(lr)

In [32]:
# Get all TensorFlow variables marked as "trainable" (i.e. all of them except lr, which we just created)
tvars = tf.trainable_variables()
tvars

[<tf.Variable 'embedding_vocab:0' shape=(10000, 200) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0' shape=(456, 1024) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0' shape=(1024,) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0' shape=(384, 512) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0' shape=(512,) dtype=float32_ref>,
 <tf.Variable 'softmax_w:0' shape=(128, 10000) dtype=float32_ref>,
 <tf.Variable 'softmax_b:0' shape=(10000,) dtype=float32_ref>]

These are all the trainable variables declared so far in the code.

In [33]:
[v.name for v in tvars]

['embedding_vocab:0',
 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0',
 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0',
 'rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0',
 'rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0',
 'softmax_w:0',
 'softmax_b:0']

We compute the gradient of the cost with respect to each trainable variable:

In [34]:
tf.gradients(cost, tvars)

[<tensorflow.python.framework.ops.IndexedSlices at 0x7f121c679f98>,
 <tf.Tensor 'gradients/rnn/while/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/MatMul/Enter_grad/b_acc_3:0' shape=(456, 1024) dtype=float32>,
 <tf.Tensor 'gradients/rnn/while/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/BiasAdd/Enter_grad/b_acc_3:0' shape=(1024,) dtype=float32>,
 <tf.Tensor 'gradients/rnn/while/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/MatMul/Enter_grad/b_acc_3:0' shape=(384, 512) dtype=float32>,
 <tf.Tensor 'gradients/rnn/while/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/BiasAdd/Enter_grad/b_acc_3:0' shape=(512,) dtype=float32>,
 <tf.Tensor 'gradients/MatMul_grad/MatMul_1:0' shape=(128, 10000) dtype=float32>,
 <tf.Tensor 'gradients/add_grad/Reshape_1:0' shape=(10000,) dtype=float32>]

In [35]:
grad_t_list = tf.gradients(cost, tvars)
# Define the gradient clipping threshold
grads, _ = tf.clip_by_global_norm(grad_t_list, max_grad_norm)
grads

[<tensorflow.python.framework.ops.IndexedSlices at 0x7f121c693390>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_1:0' shape=(456, 1024) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_2:0' shape=(1024,) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_3:0' shape=(384, 512) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_4:0' shape=(512,) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_5:0' shape=(128, 10000) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_6:0' shape=(10000,) dtype=float32>]
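
Clipping by global norm treats all gradients as one long vector: if its L2 norm exceeds the threshold, every gradient is scaled by the same factor so that the combined norm equals the threshold. Otherwise nothing changes. A pure-Python sketch of that rule with toy gradients (the function here is a simplified stand-in working on lists, not the TensorFlow op itself):

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by clip_norm / max(global_norm, clip_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

toy_grads = [[3.0, 4.0], [12.0]]  # toy gradients for two "variables"
clipped, norm = clip_by_global_norm(toy_grads, clip_norm=5)
print(norm, clipped)
```

Because every gradient shares one scale factor, the direction of the overall update is preserved and only its length is limited.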

In [36]:
session.run(grads, feed_dict)

[IndexedSlicesValue(values=array([[ 1.1066734e-06,  1.6076833e-06, -6.0458410e-06, ...,
          1.8483920e-06,  9.2165701e-06,  3.6859747e-06],
        [ 3.4974846e-06,  5.6411591e-07, -2.2276440e-06, ...,
         -1.6159240e-06,  1.3331493e-05,  6.9155358e-06],
        [ 4.4597073e-08,  3.4838777e-06, -2.9736254e-06, ...,
         -3.2425794e-06,  5.3656540e-06,  8.7939925e-06],
        ...,
        [ 2.2553311e-06, -3.5619773e-06,  2.9856112e-06, ...,
          8.3469286e-06, -1.5622375e-06,  3.2724295e-06],
        [ 4.6432278e-06, -4.1974363e-06,  3.6143392e-06, ...,
          9.4145171e-06, -1.8534065e-06,  1.7166608e-06],
        [ 2.1976766e-06, -5.1447801e-06,  2.4829185e-06, ...,
          3.7043799e-06, -8.2279371e-08, -2.4241028e-06]], dtype=float32), indices=array([9970, 9971, 9972, ..., 2043,   23,    1], dtype=int32), dense_shape=array([10000,   200], dtype=int32)),
 array([[ 3.4677434e-08,  2.9318450e-08,  4.5990255e-08, ...,
          4.3552966e-08, -1.3607192e-08, -

Finally, we can train the model:

In [37]:
# Create the training TensorFlow Operation through our optimizer
train_op = optimizer.apply_gradients(zip(grads, tvars))

In [38]:
session.run(tf.global_variables_initializer())
session.run(train_op, feed_dict)
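
For plain gradient descent, applying the gradients amounts to `w = w - lr * g` for every variable. The sketch below repeats that update on a toy quadratic cost. The cost function and the starting weights are made up for illustration.

```python
def sgd_step(weights, grads, lr):
    """Plain gradient descent: move each weight against its gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]

def cost(weights):
    """Toy quadratic cost with its minimum at the origin."""
    return sum(w * w for w in weights)

def cost_grad(weights):
    """Gradient of the toy cost: d/dw (w^2) = 2w."""
    return [2.0 * w for w in weights]

weights = [1.0, -2.0]
initial_cost = cost(weights)
for _ in range(20):
    weights = sgd_step(weights, cost_grad(weights), lr=0.1)
print(initial_cost, cost(weights))
```

Note that with a learning rate of 0 the update leaves the weights untouched, so a learning-rate variable created as `0.0` needs a real value assigned before training steps have any effect.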

## LSTM class
We will use object-oriented programming to create a class that implements the functions defined above.

In [43]:
class PTBModel(object):

    def __init__(self, action_type):
        ######################################
        # Setting parameters for ease of use #
        ######################################
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.hidden_size_l1 = hidden_size_l1
        self.hidden_size_l2 = hidden_size_l2
        self.vocab_size = vocab_size
        self.embeding_vector_size = embeding_vector_size
        ###############################################################################
        # Creating placeholders for our input data and expected outputs (target data) #
        ###############################################################################
        self._input_data = tf.placeholder(tf.int32, [batch_size, num_steps])  # [batch_size x num_steps] = [60 x 20]
        self._targets = tf.placeholder(tf.int32, [batch_size, num_steps])  # [batch_size x num_steps] = [60 x 20]

        ##########################################################################
        # Creating the LSTM cell structure and connect it with the RNN structure #
        ##########################################################################
        # Create the LSTM units.
        # This creates only the structure for the LSTM cells, which still has to be associated with an RNN.
        # The first argument of BasicLSTMCell is the number of hidden units of the cell (256 and 128 here).
        # forget_bias=0.0 means no bias is added to the forget gate.
        # The LSTM cell processes one word at a time and computes probabilities of the possible continuations of the sentence.
        lstm_cell_l1 = tf.contrib.rnn.BasicLSTMCell(self.hidden_size_l1, forget_bias=0.0)
        lstm_cell_l2 = tf.contrib.rnn.BasicLSTMCell(self.hidden_size_l2, forget_bias=0.0)
        
        # Unless you changed keep_prob, this won't actually execute -- this is a dropout wrapper for our LSTM unit
        # This is an optimization of the LSTM output, but is not needed at all
        if action_type == "is_training" and keep_prob < 1:
            lstm_cell_l1 = tf.contrib.rnn.DropoutWrapper(lstm_cell_l1, output_keep_prob=keep_prob)
            lstm_cell_l2 = tf.contrib.rnn.DropoutWrapper(lstm_cell_l2, output_keep_prob=keep_prob)
        
        # By taking in the LSTM cells as parameters, the MultiRNNCell function junctions the LSTM units to the RNN units.
        # RNN cell composed sequentially of multiple simple cells.
        stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm_cell_l1, lstm_cell_l2])

        # Define the initial state, i.e., the model state for the very first data point
        # It initialize the state of the LSTM memory. The memory state of the network is initialized with a vector of zeros and gets updated after reading each word.
        self._initial_state = stacked_lstm.zero_state(batch_size, tf.float32)

        ####################################################################
        # Creating the word embeddings and pointing them to the input data #
        ####################################################################
        with tf.device("/gpu:0"):
            # Create the embeddings for our input data. Shape is [vocab_size, embeding_vector_size].
            embedding = tf.get_variable("embedding", [vocab_size, self.embeding_vector_size])  #[10000x200]
            # Define where to get the data for our embeddings from
            inputs = tf.nn.embedding_lookup(embedding, self._input_data)

        # Unless you changed keep_prob, this won't actually execute -- this is a dropout addition for our inputs
        # This is an optimization of the input processing and is not needed at all
        if action_type == "is_training" and keep_prob < 1:
            inputs = tf.nn.dropout(inputs, keep_prob)

        ############################################
        # Creating the input structure for our RNN #
        ############################################
        # Input structure is [batch_size x num_steps x embeding_vector_size] = [60 x 20 x 200]
        # Each word is represented by a 200-dimensional vector, and each batch holds 60 sentences of 20 words
        # (Kept for reference, not used with dynamic_rnn:)
        # inputs = [tf.squeeze(input_, [1]) for input_ in tf.split(1, num_steps, inputs)]
        # The input structure is fed from the embeddings, which are filled in by the input data
        # Feeding a batch of b sentences to a RNN:
        # In step 1,  first word of each of the b sentences (in a batch) is input in parallel.  
        # In step 2,  second word of each of the b sentences is input in parallel. 
        # The parallelism is only for efficiency.  
        # Each sentence in a batch is handled in parallel, but the network sees one word of a sentence at a time and does the computations accordingly. 
        # All the computations involving the words of all sentences in a batch at a given time step are done in parallel. 

        ####################################################################################################
        # Instantiating our RNN model and retrieving the structure for returning the outputs and the state #
        ####################################################################################################
        
        outputs, state = tf.nn.dynamic_rnn(stacked_lstm, inputs, initial_state=self._initial_state)
        #########################################################################
        # Creating a logistic unit to return the probability of the output word #
        #########################################################################
        output = tf.reshape(outputs, [-1, self.hidden_size_l2])
        softmax_w = tf.get_variable("softmax_w", [self.hidden_size_l2, vocab_size]) # [hidden_size_l2 x 10000]
        softmax_b = tf.get_variable("softmax_b", [vocab_size]) # [1 x 10000]
        logits = tf.matmul(output, softmax_w) + softmax_b
        logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
        prob = tf.nn.softmax(logits)
        out_words = tf.argmax(prob, axis=2)
        self._output_words = out_words
        #########################################################################
        # Defining the loss and cost functions for the model's learning to work #
        #########################################################################
            

        # Use the contrib sequence loss and average over the batches
        loss = tf.contrib.seq2seq.sequence_loss(
            logits,
            self.targets,
            tf.ones([batch_size, num_steps], dtype=tf.float32),
            average_across_timesteps=False,
            average_across_batch=True)
    
#         loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [tf.reshape(self._targets, [-1])],
#                                                       [tf.ones([batch_size * num_steps])])
        self._cost = tf.reduce_sum(loss)

        # Store the final state
        self._final_state = state

        #Everything after this point is relevant only for training
        if action_type != "is_training":
            return

        #################################################
        # Creating the Training Operation for our Model #
        #################################################
        # Create a variable for the learning rate
        self._lr = tf.Variable(0.0, trainable=False)
        # Get all TensorFlow variables marked as "trainable" (i.e. all of them except _lr, which we just created)
        tvars = tf.trainable_variables()
        # Compute the gradients of the cost and clip them so that their global norm does not exceed max_grad_norm
        grads, _ = tf.clip_by_global_norm(tf.gradients(self._cost, tvars), max_grad_norm)
        # Create the gradient descent optimizer with our learning rate
        optimizer = tf.train.GradientDescentOptimizer(self.lr)
        # Create the training TensorFlow Operation through our optimizer
        self._train_op = optimizer.apply_gradients(zip(grads, tvars))

    # Helper functions for our LSTM RNN class

    # Assign the learning rate for this model
    def assign_lr(self, session, lr_value):
        session.run(tf.assign(self.lr, lr_value))

    # Returns the input data for this model at a point in time
    @property
    def input_data(self):
        return self._input_data

    
    # Returns the targets for this model at a point in time
    @property
    def targets(self):
        return self._targets
    
    # Returns the initial state for this model
    @property
    def initial_state(self):
        return self._initial_state

    # Returns the defined Cost
    @property
    def cost(self):
        return self._cost

    # Returns the final state for this model
    @property
    def final_state(self):
        return self._final_state
    
    # Returns the final output words for this model
    @property
    def final_output_words(self):
        return self._output_words
    
    # Returns the current learning rate for this model
    @property
    def lr(self):
        return self._lr

    # Returns the training operation defined for this model
    @property
    def train_op(self):
        return self._train_op
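
The training operation above clips the gradients with `tf.clip_by_global_norm` before applying them. As a rough illustration of what that clipping does, here is a minimal pure-Python sketch. The helper below is a hypothetical stand-in written for this note, not the TensorFlow implementation, and the gradient values are made up.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm.

    grads is a list of gradient vectors (one list of floats per variable).
    Returns the rescaled gradients and the global norm before clipping.
    """
    # Global norm: L2 norm over every component of every gradient
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is 1.0 when the norm is already within the threshold
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

# Made-up gradients for two variables; global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm)
print(clipped)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its magnitude is limited, which is what helps against exploding gradients.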


Now, let's create functions to report the training progress.

In [44]:
##########################################################################################################################
# run_one_epoch takes as parameters the current session, the model instance, the data to be fed, and the operation to be run #
##########################################################################################################################
def run_one_epoch(session, m, data, eval_op, verbose=False):

    #Define the epoch size based on the length of the data, batch size and the number of steps
    epoch_size = ((len(data) // m.batch_size) - 1) // m.num_steps
    start_time = time.time()
    costs = 0.0
    iters = 0

    state = session.run(m.initial_state)
    
    #For each step and data point
    for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size, m.num_steps)):
        
        # Run the cost, final state, output words and the operation passed as parameter; keep the cost, state and output words
        cost, state, out_words, _ = session.run([m.cost, m.final_state, m.final_output_words, eval_op],
                                     {m.input_data: x,
                                      m.targets: y,
                                      m.initial_state: state})

        #Add returned cost to costs (which keeps track of the total costs for this epoch)
        costs += cost
        
        #Add number of steps to iteration counter
        iters += m.num_steps

        if verbose and step % (epoch_size // 10) == 10:
            print("Itr %d of %d, perplexity: %.3f speed: %.0f wps" % (step , epoch_size, np.exp(costs / iters), iters * m.batch_size / (time.time() - start_time)))

    # Returns the Perplexity rating for us to keep track of how the model is evolving
    return np.exp(costs / iters)
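
The value returned by `run_one_epoch` is the exponential of the average per-word cost. The sketch below reproduces only that arithmetic in pure Python, assuming (as the loss definition in the model suggests) that each cost is a sum over `num_steps` of batch-averaged cross-entropies. The function name and the sample values are hypothetical.

```python
import math

def perplexity(step_costs, num_steps):
    """Exponential of the average per-word cost, mirroring the costs/iters bookkeeping."""
    total_cost = 0.0
    iters = 0
    for cost in step_costs:
        total_cost += cost   # cost summed over num_steps time steps
        iters += num_steps   # number of words seen per sequence so far
    return math.exp(total_cost / iters)

# Sanity check: a model that assigns uniform probability to every word in a
# 10000-word vocabulary pays log(10000) per word.
vocab_size = 10000
num_steps = 20
uniform_cost = num_steps * math.log(vocab_size)
ppl = perplexity([uniform_cost] * 5, num_steps)
print(ppl)
```

Under this reading, a perplexity close to the vocabulary size means the model is doing no better than guessing uniformly, and lower values mean better predictions.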


In [45]:
# Reads the data and separates it into training data, validation data and testing data
raw_data = reader.ptb_raw_data(data_dir)
train_data, valid_data, test_data, _, _ = raw_data
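
To make the batching concrete, the sketch below shows how a flat list of word ids can be cut into `(x, y)` pairs in which `y` is `x` shifted by one word. It assumes `reader.ptb_iterator` behaves like the reader from the TensorFlow PTB tutorial. `batch_iterator` is a hypothetical pure-Python stand-in, and the toy data is made up.

```python
def batch_iterator(data, batch_size, num_steps):
    """Yield (x, y) batches; y holds the next word for every position in x."""
    batch_len = len(data) // batch_size
    # Split the data into batch_size contiguous rows of batch_len words each
    rows = [data[i * batch_len:(i + 1) * batch_len] for i in range(batch_size)]
    # Same formula as epoch_size in run_one_epoch
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size == 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        start = i * num_steps
        x = [row[start:start + num_steps] for row in rows]
        y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield x, y

# Toy "corpus" of 20 word ids, 2 sequences per batch, 3 words per step window
data = list(range(20))
batches = list(batch_iterator(data, batch_size=2, num_steps=3))
for x, y in batches:
    print(x, y)
```

Each row continues where it left off in the previous batch, which is why the training loop can pass the final state of one batch in as the initial state of the next.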

In [46]:
# Initializes the Execution Graph and the Session
with tf.Graph().as_default(), tf.Session() as session:
    initializer = tf.random_uniform_initializer(-init_scale, init_scale)
    
    # Instantiates the model for training
    # tf.variable_scope adds a prefix to the variables created with tf.get_variable
    with tf.variable_scope("model", reuse=None, initializer=initializer):
        m = PTBModel("is_training")
        
    # Reuses the trained parameters for the validation and testing models
    # They are different instances but share the same weight and bias variables, which are not updated when data is fed to these models
    with tf.variable_scope("model", reuse=True, initializer=initializer):
        mvalid = PTBModel("is_validating")
        mtest = PTBModel("is_testing")

    #Initialize all variables
    tf.global_variables_initializer().run()

    for i in range(max_epoch):
        # Define the decay for this epoch
        lr_decay = decay ** max(i - max_epoch_decay_lr, 0.0)
        
        # Set the decayed learning rate as the learning rate for this epoch
        m.assign_lr(session, learning_rate * lr_decay)

        print("Epoch %d : Learning rate: %.3f" % (i + 1, session.run(m.lr)))
        
        # Run the loop for this epoch in the training model
        train_perplexity = run_one_epoch(session, m, train_data, m.train_op, verbose=True)
        print("Epoch %d : Train Perplexity: %.3f" % (i + 1, train_perplexity))
        
        # Run the loop for this epoch in the validation model
        valid_perplexity = run_one_epoch(session, mvalid, valid_data, tf.no_op())
        print("Epoch %d : Valid Perplexity: %.3f" % (i + 1, valid_perplexity))
    
    # Run the loop in the testing model to see how effective our training was
    test_perplexity = run_one_epoch(session, mtest, test_data, tf.no_op())
    
    print("Test Perplexity: %.3f" % test_perplexity)

Epoch 1 : Learning rate: 1.000
Itr 10 of 774, perplexity: 3935.783 speed: 15986 wps
Itr 87 of 774, perplexity: 1262.493 speed: 18209 wps
Itr 164 of 774, perplexity: 970.080 speed: 18048 wps
Itr 241 of 774, perplexity: 806.905 speed: 16529 wps
Itr 318 of 774, perplexity: 713.001 speed: 16258 wps


KeyboardInterrupt: 
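
The learning rate schedule used in the training loop can be inspected on its own. The sketch below applies the same `decay ** max(i - max_epoch_decay_lr, 0.0)` rule in plain Python. The hyperparameter values are hypothetical, chosen only to make the pattern visible, and are not the ones defined earlier in the notebook.

```python
def learning_rate_schedule(learning_rate, decay, max_epoch_decay_lr, max_epoch):
    """Return the learning rate used in each epoch of the training loop."""
    rates = []
    for i in range(max_epoch):
        # No decay until epoch index max_epoch_decay_lr, then exponential decay
        lr_decay = decay ** max(i - max_epoch_decay_lr, 0.0)
        rates.append(learning_rate * lr_decay)
    return rates

# Hypothetical values, for illustration only
rates = learning_rate_schedule(learning_rate=1.0, decay=0.5,
                               max_epoch_decay_lr=4, max_epoch=8)
for epoch, rate in enumerate(rates, start=1):
    print("Epoch %d : Learning rate: %.3f" % (epoch, rate))
```

With these values the rate stays at its initial value for the first five epochs and is then halved every epoch, so early training takes large steps and later epochs fine-tune.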