# Sequence to Sequence with Attention

Encoder-decoder models have a problem. As you may have noticed (and if not, see the figure below), the context produced by the encoder was the last state of our recurrent network. In other words, we expect a sentence, whether short or long, to fit into a fixed-size vector, be it 64 or 1024 dimensions. A 1024-dimensional vector obviously holds more than a 64-dimensional one, but we still depend on a fixed-size vector.

![](https://pytorch.org/tutorials/_images/seq2seq.png)

This is why the so-called attention modules started to appear in 2015, once again inspired by Computer Vision. These modules let our networks learn where the relevant information is in the encoder. Just as importantly, they let us see where the network is looking.

![](https://i.imgur.com/JaSRu42.png =700x)

These modules are now used in practically every encoder-decoder system; the Google Translate system described in the links we saw earlier, like other such systems, already uses them. Here is an example applied to the French-English translation task shown above.

![](https://camo.githubusercontent.com/c54ad54bebb12b5b585ab666664e6d0e6002d894/68747470733a2f2f692e696d6775722e636f6d2f357936534376552e706e67 =900x)

## Attention Network Explained

Basically, the change we make to this architecture is the following. Instead of using only the last hidden_state of our encoder, we use all of them and let the network learn where the relevant information for producing the output is. That is, instead of looking only at the last state, or at everything equally, we help the architecture focus on what is relevant.

See the example in [Attentional interfaces](https://distill.pub/2016/augmented-rnns/#attentional-interfaces).

There are several ways to compute this new context; we will implement one in Keras (Luong's attention, from the second paper recommended below).

The way to achieve this is no different from what we have done so far: dot products, perhaps a fully connected layer, and a softmax to produce a distribution that says which parts are the most important.
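
As a sketch of how these ingredients fit together, the dot-product variant from Luong's paper (one of several scoring functions it describes) can be written as follows. Here $h_t$ is the decoder hidden state at step $t$, $\bar{h}_s$ is the encoder hidden state at source position $s$, and $W_c$ is a learned weight matrix; the notation follows that paper rather than the code in this notebook.

$$
\begin{aligned}
\mathrm{score}(h_t, \bar{h}_s) &= h_t^\top \bar{h}_s \\
a_t(s) &= \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)} \\
c_t &= \sum_s a_t(s)\, \bar{h}_s \\
\tilde{h}_t &= \tanh\big(W_c\,[c_t; h_t]\big)
\end{aligned}
$$

The first line is the alignment score, the second the alignment vector (a softmax over source positions), the third the context vector, and the last the attentional hidden state that feeds the output layer.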

You can read about how attention modules work in more detail in these two papers.

*   [Neural Machine Translation by jointly learning to align and translate](https://arxiv.org/pdf/1409.0473.pdf)
*   [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/pdf/1508.04025.pdf)

These papers are somewhat difficult, but very influential in current NLP tasks.

Next, we will write a simple implementation of an attention module in numpy, and then implement it in Keras, modifying our encoder-decoder (Sequence 2 Sequence) architecture accordingly.



## Numpy Implementation

In [1]:
import numpy as np

In [2]:
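def softmax(x, axis=0):
    """Compute softmax values for each set of scores in x along `axis`."""
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum(axis=axis, keepdims=True)  # keepdims so the division broadcasts for any axis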


#### Alignment score

![](https://i.imgur.com/pjJrL24.png =400x)

#### Alignment vector

![](https://i.imgur.com/QAOe4Qk.png =400x)

#### Context vector

![](https://i.imgur.com/4fjOKja.png =650x)
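
The three quantities named above can be sketched in numpy as follows. This is a minimal sketch that assumes the dot-product score from Luong's paper; `dot_attention`, the toy shapes, and the random states are illustrative assumptions, not code from the original notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def dot_attention(decoder_states, encoder_states):
    """Dot-product attention for a single sequence (no batch axis).

    decoder_states: (dec_len, hidden)
    encoder_states: (enc_len, hidden)
    """
    # Alignment score: one dot product per (decoder step, encoder step) pair.
    scores = decoder_states @ encoder_states.T       # (dec_len, enc_len)
    # Alignment vector: softmax over the encoder positions.
    weights = softmax(scores, axis=-1)               # each row sums to 1
    # Context vector: weighted average of the encoder states.
    context = weights @ encoder_states               # (dec_len, hidden)
    return scores, weights, context

rng = np.random.default_rng(0)                       # toy data, assumed sizes
encoder_states = rng.normal(size=(6, 4))
decoder_states = rng.normal(size=(3, 4))

scores, weights, context = dot_attention(decoder_states, encoder_states)
print(scores.shape, weights.shape, context.shape)    # (3, 6) (3, 6) (3, 4)
```

Each row of `weights` is a distribution over the encoder positions, which is exactly what gets plotted in attention heatmaps like the one shown earlier.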



## Imports

In [3]:
from keras.models import Model
from keras.layers import Input, CuDNNLSTM, Dense, LSTM
from keras.layers import Bidirectional
from keras.layers import Embedding
from keras.layers import Merge, Dot, Concatenate, Flatten, Permute, Multiply, dot, concatenate, Average
from keras.layers import TimeDistributed
from keras.layers import Activation
from keras.activations import softmax as keras_softmax  # aliased so it does not shadow the numpy softmax defined above
from keras.preprocessing import sequence
from keras.callbacks import Callback
from keras.optimizers import SGD
from keras.models import load_model
import tensorflow as tf
import keras.backend as K

from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint


from random import shuffle, choice, sample
import time

import pprint as pp
import pickle

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import pylab as pl
from IPython import display

sns.set(color_codes=True)

import warnings
warnings.filterwarnings('ignore')


%matplotlib inline
%load_ext autoreload
%autoreload 2
%matplotlib notebook

Using TensorFlow backend.


## Data

In [4]:
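USE_EMBEDDINGS = True
SAMPLE_EVERY = 3
PLOT_EVERY = SAMPLE_EVERY  # used by HistoryDisplay; it was never defined, so the same interval is assumed here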


In [5]:
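def generate_dummy_data():
    """Build (source, target) pairs of consecutive integers taken from 0..99."""
    x = list(range(100))
    data = []
    for ix_source in range(3, 5):
        for ix_target in range(3, 5):
            for ix in range(len(x)):
                # source: ix_source numbers starting at ix
                # target: the numbers that follow, up to index ix + 2 * ix_target
                data.append((x[ix:ix+ix_source], x[ix+ix_source:ix+ix_target*2]))
    return data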
    
dummy_data = generate_dummy_data()
shuffle(dummy_data)
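
To make the shape of these pairs concrete, here is a small sketch that applies the same slicing to a shorter range. `make_pairs` is a hypothetical helper written for this illustration; it fixes one `(source_len, target_len)` combination instead of looping over several.

```python
def make_pairs(values, source_len, target_len):
    """Same slicing as generate_dummy_data, for one (source_len, target_len)."""
    return [
        (values[i:i + source_len], values[i + source_len:i + target_len * 2])
        for i in range(len(values))
    ]

pairs = make_pairs(list(range(10)), source_len=3, target_len=3)
print(pairs[0])   # ([0, 1, 2], [3, 4, 5])
print(pairs[-1])  # ([9], [])
```

Note that pairs built near the end of the range are truncated, and some targets are empty, because the slices run past the end of the list.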


In [6]:
# sentence preparation
# NOTE: insert/append mutate the lists in place, so dummy_data also ends up with the <SOS>/<EOS> markers
data_tr = []
for inp, out in dummy_data:
    inp.insert(0, '<SOS>')
    out.insert(0, '<SOS>')

    inp.append('<EOS>')
    out.append('<EOS>')

    data_tr.append((inp, out))

In [7]:
maxlen_source = max([len(x) for x, _ in dummy_data])
maxlen_source

6

## Data Preparation

In [8]:
#vocabulary preparation
vocab = []
for inp, out in data_tr:
    vocab+=[w for w in inp]
    vocab+=[w for w in out]
vocab = list(set(vocab))
vocab.insert(0, '<PAD>')
vocab.append('<UNK>')
print(vocab)

['<PAD>', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, '<EOS>', '<SOS>', '<UNK>']


In [9]:
w2id = {w: i for i, w in enumerate(vocab)}  # token -> index
id2w = {i: w for w, i in w2id.items()}      # index -> token
#pp.pprint(id2w)

In [10]:
data_train = []
for inp, out in data_tr:
    ind_enc_in = [w2id[w] if w in w2id else w2id['<UNK>'] for w in inp]
    ind_dec_in = [w2id[w] if w in w2id else w2id['<UNK>'] for w in out]
    ind_dec_out = [w2id[w] if w in w2id else w2id['<UNK>'] for w in out[1:]]
    data_train.append((ind_enc_in, ind_dec_in, ind_dec_out))
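
The relationship between the decoder input and the decoder target is easier to see on a toy vocabulary. The sketch below assumes a made-up seven-token vocabulary; `toy_vocab`, `toy_w2id`, and `encode` are illustrative names, not part of the notebook.

```python
toy_vocab = ['<PAD>', 0, 1, 2, '<EOS>', '<SOS>', '<UNK>']
toy_w2id = {w: i for i, w in enumerate(toy_vocab)}

def encode(tokens, w2id):
    """Map tokens to indices, falling back to <UNK> for unseen tokens."""
    return [w2id.get(w, w2id['<UNK>']) for w in tokens]

target = ['<SOS>', 1, 2, '<EOS>']
dec_in = encode(target, toy_w2id)        # what the decoder reads
dec_out = encode(target[1:], toy_w2id)   # what the decoder must predict

print(dec_in)                   # [5, 2, 3, 4]
print(dec_out)                  # [2, 3, 4]
print(encode([42], toy_w2id))   # [6]  (unknown token)
```

The target is the decoder input shifted left by one position, so the prediction at position `j` corresponds to the decoder input at position `j + 1`.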

In [11]:
from sklearn.model_selection import StratifiedShuffleSplit

def split_train_val_test(dataset, split=0.2):
    # NOTE: stratification assumes y holds class labels; for sequence targets
    # a plain ShuffleSplit may be more appropriate.
    xe, xd, y = zip(*dataset)
    xe = np.array(list(xe))  # fixed: this was built from xd by mistake
    xd = np.array(list(xd))
    y = np.array(list(y))
    sss = StratifiedShuffleSplit(n_splits=1, test_size=split, random_state=1337) #l33t seed
    for train_index, test_index in sss.split(xe, y):
        xe_train, xe_val = xe[train_index], xe[test_index]
        xd_train, xd_val = xd[train_index], xd[test_index]
        y_train, y_val = y[train_index], y[test_index]
    splits = {'train':(xe_train, xd_train, y_train), 'test':(xe_val, xd_val, y_val)}
    return splits

## Auxiliary functions

In [12]:
def sample_pred(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    # print('sample pred: ',probas.shape) # apply here LM
    return np.argmax(probas)
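
To see what the `temperature` argument of `sample_pred` does before the random draw, here is a pure-Python sketch of the same rescaling (take the log, divide by the temperature, renormalize). `apply_temperature` and the toy distribution are illustrative assumptions; the maximum is subtracted only for numerical stability and does not change the result.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)                              # stability shift
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.7, 0.2, 0.1]
sharp = apply_temperature(probs, 0.1)   # low temperature: close to argmax
same = apply_temperature(probs, 1.0)    # temperature 1: unchanged
flat = apply_temperature(probs, 1.5)    # high temperature: closer to uniform

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Low temperatures make sampling nearly deterministic, while higher temperatures give the less likely tokens more of a chance.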

In [13]:
class Sampletest(Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 5 == 0  and epoch>0:
            nb_samples = 1
            data_t = sample(data_tr, nb_samples)
            data_test = []
            for inp, out in data_t:
                ind_enc_in = [w2id[w] if w in w2id else w2id['<UNK>'] for w in inp]
                ind_dec_in = [w2id['<SOS>']] # w2id[w] if w in w2id else w2id['<UNK>'] for w in out.split(' ')
                data_test.append((ind_enc_in,ind_dec_in))

            params = {
                'max_encoder_len': maxlen_source+2,
                'max_decoder_len': maxlen_source+2,
                'target_len': len(vocab),
                'use_embeddings': True
                }
            if 'use_embeddings' in params and params['use_embeddings']:
                # use the local params dict (this previously relied on the global train_params)
                encoder_input_data = np.zeros(shape=(nb_samples, params['max_encoder_len']))
                decoder_input_data = np.zeros(shape=(nb_samples, params['max_decoder_len']))
                for i, (ei, di) in enumerate(data_test):
                    for j, idx in enumerate(ei):
                        encoder_input_data[i, j] = idx
                    for j, idx_di in enumerate(di):
                        decoder_input_data[i, j] = idx_di        
            else:
                encoder_input_data = np.zeros(shape=(nb_samples, params['max_encoder_len'], params['target_len']))    
                decoder_input_data = np.zeros(shape=(nb_samples, params['max_decoder_len'], params['target_len']))

                for i, (ei, di) in enumerate(data_test):
                    for j, idx in enumerate(ei):
                        encoder_input_data[i, j, idx] = 1
                    for j, idx_di in enumerate(di):
                        decoder_input_data[i, j, idx_di] = 1
            temperature = choice([0.1, 0.3, 1.5])
            for i in range(1, params['max_decoder_len']):
                output = self.model.predict([encoder_input_data, decoder_input_data]) #.argmax(axis=2)
                output_t = np.apply_along_axis(sample_pred, 2, output, temperature=temperature)
                # the output at position i-1 is the prediction for decoder position i
                # (training targets are the decoder inputs shifted left by one)
                decoder_input_data[:,i] = output_t[:,i-1]
                
            result = decoder_input_data
            for r, original in zip(result, data_t):
                
                sentence = original
                repr_out = []
                for ix in r:
                    token = id2w[ix]
                    if token == '<EOS>':
                        break
                    else:
                        repr_out.append(token)
            
                print('Test Sample epoch({}): {} ====> {}'.format(epoch, sentence, repr_out[1:]))

In [14]:
class HistoryDisplay(Callback):
    
    def on_train_begin(self, logs={}):
        self.losses = []
        self.accs = []
        self.epochs = []
        self.fig, self.ax = plt.subplots()
        #plt.show()
        
        plt.ion()
        self.fig.show()
        self.fig.canvas.draw()
    
    def on_epoch_end(self, epoch, logs):
        self.epochs.append(epoch)
        self.losses.append(logs['loss'])
        self.accs.append(logs['acc'])
        if epoch % PLOT_EVERY == 0:
            
            self.ax.clear()
            self.ax.plot(self.epochs, self.accs, 'g', label='acc')
            self.ax.plot(self.epochs, self.losses, 'b', label='loss')
            legend = self.ax.legend(loc='upper right', shadow=True, fontsize='x-large')
            #display.clear_output(wait=True)
            #display.display(pl.gcf())
            self.fig.canvas.draw()
            
            #plt.draw()
        

## Architecture Definition

In [15]:
from keras.engine.topology import Layer

In [16]:
class Seq2Seq:
    def __init__(self, **kwargs):
        self.params = kwargs.pop('params', None)
    
    def compile_attention_seq2seq(self, params={}):
        
        # embeddings
        
        embedding_layer_encoder = Embedding(input_dim=params['vocab'], output_dim=params['emb_feats'], input_length=params['max_encoder_len'], name='embedding_layer_encoder')
        
        embedding_layer_decoder = Embedding(input_dim=params['vocab'], output_dim=params['emb_feats'], input_length=params['max_decoder_len'], name='embedding_layer_decoder')
        
        
        # ENCODER
        encoder_inputs = Input(shape=(params['max_encoder_len'], ), name='encoder_input')       
        
        encoder_embedding = embedding_layer_encoder(encoder_inputs)
        
        encoder = LSTM(params['hidden_size'], return_state=True, return_sequences=True, name='encoder')
        
        encoder_outputs, state_h, state_c = encoder(encoder_embedding)

        # initialize the decoder with the last state of our encoder
        encoder_states = [state_h, state_c]
        
        # DECODER
               
        decoder_inputs = Input(shape=(params['max_decoder_len'], ), name='decoder_input')       
        
        decoder_embedding = embedding_layer_decoder(decoder_inputs)
        
        decoder = LSTM(params['hidden_size'], return_sequences=True, name='decoder')
        decoder_outputs = decoder(decoder_embedding, encoder_states)
        
        # the fun starts here // Attention Layer
        
        sc = dot([decoder_outputs, encoder_outputs], axes=[2, 2], name='partial_scores')
        print(sc)
        scores = Dense(params['hidden_size'], activation='softmax', name='scores')(sc)
        print(scores)
        
        attention_weights = Activation('softmax', name='attention_weights')(dot([scores, encoder_outputs], axes=[2, 2], name='alignment_vector'))
        
        print('attention_weights', attention_weights)
        print('decoder_outputs', decoder_outputs)
        
        context = dot([attention_weights, encoder_outputs], axes=[2,1], name='context_vector')
        decoder_combined = concatenate([context, decoder_outputs])
        
        # h_hat
        h_hat = TimeDistributed(Dense(64, activation='tanh'))(decoder_combined)
        decoder_outputs = TimeDistributed(Dense(params['target_size'], activation='softmax'))(h_hat)
        
        model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
        
        model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
        model.summary()
        return model
        
    def train(self, model, data, params={}):
        
        callbacks = self._get_callbacks()
        print(callbacks)
        if 'shuffle' in params and params['shuffle']:
            shuffle(data)
        encoder_input_data = np.zeros(shape=(len(data), params['max_encoder_len']))    
        decoder_input_data = np.zeros(shape=(len(data), params['max_decoder_len']))
        decoder_target_data = np.zeros(shape=(len(data), params['max_decoder_len'], params['target_len']))
        for i, (ei, di,dt) in enumerate(data):
            for j, idx in enumerate(ei):
                encoder_input_data[i, j] = idx
            for j, idx_di in enumerate(di):
                decoder_input_data[i, j] = idx_di
            for j in range(params['max_decoder_len']):      
                if j<len(dt):
                    decoder_target_data[i, j, dt[j]] = 1
                else:
                    decoder_target_data[i, j, 0] = 1 
        model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=params['batch_size'], epochs=params['epochs'],  callbacks=callbacks, verbose=1)
            
    def predict(self, model, data, params={}):
        # NOTE: `data` only determines how many samples are drawn from data_tr.
        # The decoder is fed the reference target (teacher forcing), so this is
        # a sanity check rather than free-running decoding.
        nb_samples = len(data)
        data_t = sample(data_tr, nb_samples)
        data_test = []
        for inp, out in data_t:
            ind_enc_in = [w2id[w] if w in w2id else w2id['<UNK>'] for w in inp]
            ind_dec_in = [w2id[w] if w in w2id else w2id['<UNK>'] for w in out]
            data_test.append((ind_enc_in, ind_dec_in))

        # one row per sample (a fixed batch of 1 overflowed for nb_samples > 1)
        encoder_input_data = np.zeros(shape=(nb_samples, params['max_encoder_len']))
        decoder_input_data = np.zeros(shape=(nb_samples, params['max_decoder_len']))

        for i, (ei, di) in enumerate(data_test):
            for j, idx in enumerate(ei):
                encoder_input_data[i, j] = idx
            for j, idx_di in enumerate(di):
                decoder_input_data[i, j] = idx_di

        # use the model passed in; the class never sets self.model
        result = model.predict([encoder_input_data, decoder_input_data])
        predicted_out = []
        for r in result:
            idx = np.argmax(r, axis=1)
            repr_out = []
            for ix in idx:
                token = id2w[ix]
                if token == '<EOS>':
                    break
                repr_out.append(token)
            predicted_out.append(repr_out)
        return predicted_out
    
    def load(self, model_path='seq2seq_attn.h5'):
        return load_model(model_path)
    
    def _get_callbacks(self, model_path='seq2seq_attn.h5'):
        es = EarlyStopping(monitor='loss', patience=9, mode='auto', verbose=1)
        save_best = ModelCheckpoint(model_path, monitor='loss', verbose = 1, save_best_only=True, save_weights_only=False, period=2)
        st = Sampletest()
        # hd = HistoryDisplay()
        rlr = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.0001, verbose=1)
        return [rlr, es, st]#st, save_best,  hd,

## Compile model 

In [17]:
compile_params = {
    'vocab': len(vocab),
    'emb_feats': 50,
    'hidden_size': 128,
    'target_size': len(vocab),
    'input_size': len(vocab),
    'max_encoder_len': maxlen_source+2,
    'max_decoder_len': maxlen_source+2,
    'use_embeddings': True,
}

pp.pprint(compile_params)


{'emb_feats': 50,
 'hidden_size': 128,
 'input_size': 104,
 'max_decoder_len': 8,
 'max_encoder_len': 8,
 'target_size': 104,
 'use_embeddings': True,
 'vocab': 104}


In [18]:
s2s = Seq2Seq()

s2s_model = s2s.compile_attention_seq2seq(params=compile_params)    

Tensor("partial_scores/MatMul:0", shape=(?, ?, ?), dtype=float32)
Tensor("scores/truediv:0", shape=(?, 8, 128), dtype=float32)
attention_weights Tensor("attention_weights/truediv:0", shape=(?, 8, ?), dtype=float32)
decoder_outputs Tensor("decoder/transpose_1:0", shape=(?, ?, 128), dtype=float32)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      (None, 8)            0                                            
__________________________________________________________________________________________________
decoder_input (InputLayer)      (None, 8)            0                                            
__________________________________________________________________________________________________
embedding_layer_encoder (Embedd (None, 8, 50)        5200        encoder_input[0][0]              
__________

Check whether the shapes in Keras match the ones in numpy!
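
One way to run that check is to reproduce the batched `dot` calls with `np.einsum` on dummy arrays. This is a sketch under assumptions: the batch size of 2 is arbitrary, the uniform weights stand in for the real attention weights, and the sizes 8 and 128 are taken from `compile_params` above.

```python
import numpy as np

batch, enc_len, dec_len, hidden = 2, 8, 8, 128    # batch size is an arbitrary choice

encoder_outputs = np.zeros((batch, enc_len, hidden))
decoder_outputs = np.zeros((batch, dec_len, hidden))

# dot([decoder_outputs, encoder_outputs], axes=[2, 2]) contracts the hidden axis.
partial_scores = np.einsum('bdh,beh->bde', decoder_outputs, encoder_outputs)

# Stand-in attention weights: uniform over the encoder positions.
weights = np.full((batch, dec_len, enc_len), 1.0 / enc_len)

# dot([attention_weights, encoder_outputs], axes=[2, 1]) contracts the encoder time axis.
context = np.einsum('bde,beh->bdh', weights, encoder_outputs)

# concatenate([context, decoder_outputs]) joins along the last axis.
decoder_combined = np.concatenate([context, decoder_outputs], axis=-1)

print(partial_scores.shape)    # (2, 8, 8)
print(context.shape)           # (2, 8, 128)
print(decoder_combined.shape)  # (2, 8, 256)
```

Apart from the batch dimension, which Keras reports as `None` or `?`, these shapes can be compared against the tensors printed by `compile_attention_seq2seq` and the rows of `model.summary()`.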

## Train

In [19]:
# data_train

In [20]:
train_params = {
    'epochs': 500,
    'batch_size': 16,
    'shuffle': True,
    'target_len': len(vocab),
    'max_encoder_len': maxlen_source+2,
    'max_decoder_len': maxlen_source+2,
    'use_embeddings': True
}
pp.pprint(train_params)

s2s.train(model=s2s_model, data=data_train, params=train_params)

{'batch_size': 16,
 'epochs': 500,
 'max_decoder_len': 8,
 'max_encoder_len': 8,
 'shuffle': True,
 'target_len': 104,
 'use_embeddings': True}
[<keras.callbacks.ReduceLROnPlateau object at 0x7fee0f476550>, <keras.callbacks.EarlyStopping object at 0x7fee0f476208>, <__main__.Sampletest object at 0x7fee0f4765f8>]
Epoch 1/500


InternalError: ignored

## Predict

In [0]:
predict_params = {
            'max_encoder_len': maxlen_source + 2,  # fixed: `maxlen` was never defined
            'max_decoder_len': maxlen_source + 2,
            'target_len': len(vocab)
            }

s2s.predict(model=s2s_model, data=['', ''], params=predict_params)

# data must be a list of the predictions you want to make. See the Sampletest implementation above.
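
For free-running prediction, where the decoder only sees its own previous outputs, a greedy decoding loop could look like the sketch below. It assumes the model was trained as above, so the output at position `i - 1` predicts the token for decoder position `i`. `greedy_decode` and `fake_predict` are hypothetical names: `fake_predict` is a stand-in for `s2s_model.predict` that always predicts "previous token + 1", so the example runs without Keras.

```python
import numpy as np

def greedy_decode(predict_fn, encoder_input, max_decoder_len, sos_id, eos_id):
    """Decode one sequence token by token, always taking the argmax."""
    decoder_input = np.zeros((1, max_decoder_len))
    decoder_input[0, 0] = sos_id
    decoded = []
    for i in range(1, max_decoder_len):
        # probs has shape (1, max_decoder_len, vocab_size)
        probs = predict_fn([encoder_input, decoder_input])
        next_id = int(np.argmax(probs[0, i - 1]))
        if next_id == eos_id:
            break
        decoder_input[0, i] = next_id   # feed the prediction back in
        decoded.append(next_id)
    return decoded

def fake_predict(inputs, vocab_size=10):
    """Stand-in model: predicts (previous token + 1) at every position."""
    _, decoder_input = inputs
    probs = np.zeros((1, decoder_input.shape[1], vocab_size))
    for j in range(decoder_input.shape[1]):
        probs[0, j, (int(decoder_input[0, j]) + 1) % vocab_size] = 1.0
    return probs

decoded = greedy_decode(fake_predict, np.zeros((1, 8)), max_decoder_len=8,
                        sos_id=3, eos_id=7)
print(decoded)  # [4, 5, 6]
```

With the real model, `predict_fn` would be `s2s_model.predict`, the ids would come from `w2id['<SOS>']` and `w2id['<EOS>']`, and the decoded indices could be mapped back to tokens with `id2w`.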