# Transformer

Reference: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)

# Theory
------

## Introduction

### Classic RNNs for NLP

- Words are encoded as vectors
- Each new state is computed from the previous state
- Decoding starts from the encoder's final state
![Rnns_classica](images/rnn_classica.png)
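
As a rough illustration of the recurrence, the sketch below updates a scalar hidden state one input at a time. The weights `w_h` and `w_x`, the `tanh` activation, and the toy inputs are assumptions chosen for readability, not values taken from these notes.

```python
import math

def rnn_step(state, x, w_h=0.5, w_x=1.0):
    """One recurrent step: the new state depends on the previous state and the current input."""
    return math.tanh(w_h * state + w_x * x)

def encode(inputs, initial_state=0.0):
    """Run the recurrence over a sequence and return every intermediate state."""
    states = []
    state = initial_state
    for x in inputs:
        state = rnn_step(state, x)
        states.append(state)
    return states

# Toy "word vectors" reduced to single numbers for readability.
encoder_states = encode([0.2, -0.1, 0.7])
final_state = encoder_states[-1]  # a decoder would start from this state
```

Because only `final_state` is handed to the decoder, everything the encoder has read must fit into that single value.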

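### Attention mechanism

- The final encoder state is overloaded, because it has to summarize the entire input text
- Attention lets the decoder look at every encoder state instead of only the last one
- Encoder states that are more relevant to the previous decoder state (the context) receive larger weights

![Attention mechanism](images/attention.png)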
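
One way to picture the weighting is the minimal sketch below. It assumes dot-product scoring between a decoder state and each encoder state; the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(decoder_state, encoder_states):
    """Weight every encoder state by its similarity to the decoder state."""
    weights = softmax([dot(decoder_state, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 1.0], encoder_states)
```

The third encoder state is the most similar to the decoder state, so it receives the largest weight in the context vector.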

## Architecture

This architecture is built around attention mechanisms. Instead of processing one word at a time, a Transformer applies attention to the whole sentence at once.

![arquitetura](images/transformer.png)
![arquitetura](images/self-attention.png)


## Scaled dot-product

**Main idea:**
- Take two sequences, A and B (identical in the case of self-attention)
- Compute how each element of A relates to each element of B
- Then recombine A according to that relation

**Mathematically**, the dot product indicates the similarity between two vectors.

![scale_dot-product](images/scale_dot-product.png)

![scale_dot-product](images/scale_dot-product2.png)
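
The computation can be sketched in plain Python as `softmax(Q K^T / sqrt(d_k)) V`, the formula from the referenced paper. The list-based representation and the toy sequence are assumptions made to keep the example dependency-free.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Self-attention: queries, keys and values all come from the same sequence.
sequence = [[1.0, 0.0], [0.0, 1.0]]
attended = scaled_dot_product_attention(sequence, sequence, sequence)
```

Each output row is a mixture of the value vectors, dominated by the value whose key best matches the query.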

## Look-ahead Mask

![look-ahead](images/look-ahead.png)
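
A small sketch of how such a mask could be built and applied, assuming the convention used later in these notes that `1` marks a position to hide and that hidden scores receive `-1e9` before the softmax. The helper names are hypothetical.

```python
import math

def look_ahead_mask(seq_len):
    """1 where a query position would look at a later key position, 0 elsewhere."""
    return [[1 if k > q else 0 for k in range(seq_len)] for q in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Add a large negative number to masked scores so their softmax weight vanishes."""
    masked = [s + m * -1e9 for s, m in zip(scores, mask_row)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = look_ahead_mask(3)
# Equal raw scores: each position spreads its attention over itself and earlier positions only.
weights = [masked_softmax([0.0, 0.0, 0.0], row) for row in mask]
```

With equal scores, the first position attends only to itself, the second splits its attention evenly over the first two positions, and the last one over all three.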

## Attention Layer

![attention layer](images/attention-layer.png)

## Multi-head attention Layer

![multi-head](images/multi-head.png)
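
The splitting step behind multi-head attention can be sketched as follows, assuming each head receives an equal slice of the model dimension. The names `nb_proj` and `d_proj` mirror the implementation later in these notes; the toy vector is hypothetical.

```python
def split_heads(vector, nb_proj):
    """Split a d_model-sized vector into nb_proj equal slices, one per head."""
    d_model = len(vector)
    assert d_model % nb_proj == 0, "d_model must be divisible by the number of heads"
    d_proj = d_model // nb_proj
    return [vector[h * d_proj:(h + 1) * d_proj] for h in range(nb_proj)]

def concat_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [x for head in heads for x in head]

token = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # d_model = 8
heads = split_heads(token, nb_proj=4)              # 4 heads of size 2
restored = concat_heads(heads)
```

Each head runs attention on its own slice, and the slices are concatenated again before the final linear layer.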

## Positional Encoding

![positional-encod](images/positional-encod.png)
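
The sinusoidal encoding can be sketched without any dependencies, assuming the same angle formula as the implementation later in these notes: `pos / 10000^(2 * (i // 2) / d_model)`, with sine on even dimensions and cosine on odd ones.

```python
import math

def positional_encoding(seq_length, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for pos in range(seq_length):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding

pe = positional_encoding(seq_length=3, d_model=4)
```

Position 0 encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1; later positions rotate through the sine and cosine waves at different frequencies.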

## Feed-forward layers (dense layers)

- Composed of two linear transformations with a ReLU activation in between

$$
FFN(x) = \max(0, x W_1 + b_1)W_2 + b_2
$$
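
The formula can be traced by hand with a small sketch. The weights below are made-up toy values with `d_model = 2` and a hidden size of 3; they are not taken from any trained model.

```python
def matvec(x, weights, bias):
    """Compute x W + b for a row vector x and a matrix W given as a list of rows."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def ffn(x, w1, b1, w2, b2):
    hidden = [max(0.0, h) for h in matvec(x, w1, b1)]  # ReLU: max(0, x W_1 + b_1)
    return matvec(hidden, w2, b2)                      # second linear map back to d_model

# Toy weights: d_model = 2, hidden size = 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
y = ffn([1.0, 2.0], w1, b1, w2, b2)
```

The output has the same size as the input, which is what allows the residual connection to add the two together.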

## Residual connections:

- **Add & Norm:** carries the information from the previous step forward so it is not lost, which helps learning during *backpropagation*.
- **Last linear:** the decoder output passes through a dense layer sized to the vocabulary, followed by a softmax that produces a probability for each word.
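
Both steps can be sketched in a few lines. This is a simplified illustration: the layer normalization below has no learned scale or shift, and the logits are hypothetical. The `epsilon` value matches the one used in the implementation later in these notes.

```python
import math

def layer_norm(vector, epsilon=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in vector]

def add_and_norm(inputs, sublayer_outputs):
    """Residual connection followed by layer normalization."""
    return layer_norm([x + y for x, y in zip(inputs, sublayer_outputs)])

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

normalized = add_and_norm([1.0, 2.0, 3.0], [0.5, -0.5, 0.0])
word_probabilities = softmax([2.0, 1.0, 0.1])  # one probability per vocabulary entry
```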

# Practice
----------

## Imports

In [2]:
import pandas as pd
import numpy as np

import math
import re
import time
import zipfile
import random

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

- Dataset: https://www.statmt.org/europarl/

In [3]:
file_path = '../../_IAExpert_private/7. Processamento de Linguagem Natural com Deep LEarning/pt-en'

with open(f"{file_path}/europarl-v7.pt-en.en", mode='r', encoding='utf-8') as f:
    europarl_en = f.read()

with open(f"{file_path}/europarl-v7.pt-en.pt", mode='r', encoding='utf-8') as f:
    europarl_pt = f.read()

In [4]:
europarl_en[:100]

'Resumption of the session\nI declare resumed the session of the European Parliament adjourned on Frid'

In [5]:
en = europarl_en.split('\n')
pt = europarl_pt.split('\n')

len(en), len(pt)

(1960408, 1960408)

In [6]:
i = random.randint(0, len(en)-1)
en[i], pt[i]

('Since the South Korean Constitutional Court itself recognised that the death penalty could be subject to errors and abuse, our concerns brought forward today might strengthen the democratic institutions of the Republic of Korea in the idea that this method of punishment should be abolished for good.',
 'Uma vez que o próprio Tribunal Constitucional sul-coreano reconheceu que a pena de morte pode estar sujeita a erros e abusos, as nossas preocupações hoje aqui expostas poderão ajudar a reforçar nas instituições democráticas da República da Coreia a ideia de que esse método de punição deve ser abolido de uma vez por todas.')

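### Data cleaning

```python
corpus_en = europarl_en
# Mark periods that are directly followed by a digit or a letter (not sentence ends).
corpus_en = re.sub(r"\.(?=[0-9]|[a-z]|[A-Z])", '.$$$', corpus_en)
# Remove the marked periods together with their markers.
corpus_en = re.sub(r"\.\$\$\$", '', corpus_en)
# Collapse runs of spaces into a single space.
corpus_en = re.sub(r" +", ' ', corpus_en)
corpus_en = corpus_en.split('\n')
```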

```python
corpus_pt = europarl_pt
# Same cleaning steps as for the English corpus.
corpus_pt = re.sub(r"\.(?=[0-9]|[a-z]|[A-Z])", '.$$$', corpus_pt)
corpus_pt = re.sub(r"\.\$\$\$", '', corpus_pt)
corpus_pt = re.sub(r" +", ' ', corpus_pt)
corpus_pt = corpus_pt.split('\n')
```
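
To see what the cleaning does on a small input, here is a sketch on a made-up sample. It assumes the intent is to drop periods that are immediately followed by a letter or digit, so that only sentence-final periods remain. The `clean` helper and the sample text are hypothetical.

```python
import re

def clean(text):
    # Mark periods directly followed by a digit or letter (assumed not to end a sentence).
    text = re.sub(r"\.(?=[0-9]|[a-z]|[A-Z])", ".$$$", text)
    # Drop the marked periods together with their markers.
    text = re.sub(r"\.\$\$\$", "", text)
    # Collapse repeated spaces.
    text = re.sub(r" +", " ", text)
    return text.split("\n")

sample = "The U.S.A. grew by 3.5 percent.\nPrices  rose."
cleaned = clean(sample)
# cleaned == ['The USA. grew by 35 percent.', 'Prices rose.']
```

Note the side effect on decimals: `3.5` becomes `35`, because the period inside a number is treated like any other period that does not end a sentence.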

```python
import pickle

# Cache the cleaned corpora so the cleaning step does not have to be repeated.
with open("corpus.pkl", "wb") as f:
    pickle.dump([corpus_en, corpus_pt], f)
```

In [7]:
import pickle
with open("corpus.pkl", 'rb') as f:
    corpus_en, corpus_pt = pickle.load(f)

In [8]:
len(corpus_en), len(corpus_pt)

(1960408, 1960408)

### Tokenization

- Converts text into integer token IDs

In [9]:
tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(corpus_en, target_vocab_size=2**13)

In [10]:
tokenizer_pt = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(corpus_pt, target_vocab_size=2**13)

In [11]:
tokenizer_en.vocab_size, tokenizer_pt.vocab_size

(8188, 8116)

In [12]:
vocab_size_en = tokenizer_en.vocab_size + 2
vocab_size_pt = tokenizer_pt.vocab_size + 2

In [13]:
inputs = [[vocab_size_en - 2] + tokenizer_en.encode(sentence) + [vocab_size_en - 1] for sentence in corpus_en]

In [14]:
outputs = [[vocab_size_pt - 2] + tokenizer_pt.encode(sentence) + [vocab_size_pt - 1] for sentence in corpus_pt]

In [15]:
inputs[0]

[8188, 2456, 972, 2106, 3, 1, 2569, 8189]

In [16]:
outputs[0]

[8116, 834, 705, 7, 3561, 8117]
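
The two extra IDs added to each vocabulary act as start and end markers. The sketch below reproduces the wrapping for the first English sentence, using the vocabulary size and token IDs printed above; the helper name `add_boundary_tokens` is hypothetical.

```python
def add_boundary_tokens(token_ids, base_vocab_size):
    """Wrap a sentence with start and end IDs placed just past the tokenizer's vocabulary."""
    start_token = base_vocab_size       # first unused ID
    end_token = base_vocab_size + 1     # second unused ID
    return [start_token] + token_ids + [end_token]

base_vocab_size = 8188                  # tokenizer_en.vocab_size in the run above
sentence_ids = [2456, 972, 2106, 3, 1, 2569]
wrapped = add_boundary_tokens(sentence_ids, base_vocab_size)
vocab_size_with_specials = base_vocab_size + 2
# wrapped == [8188, 2456, 972, 2106, 3, 1, 2569, 8189]
```

Because the tokenizer only produces IDs below `vocab_size`, the two markers cannot collide with a real subword.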

### Removing overly long sentences

#### Based on the inputs

In [18]:
max_length = 15
idx_to_remove = [count for count, sent in enumerate(inputs) if len(sent) > max_length]

In [19]:
len(idx_to_remove)

1686352

In [21]:
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]

#### Based on the outputs

In [22]:
max_length = 15
idx_to_remove = [count for count, sent in enumerate(outputs) if len(sent) > max_length]

In [23]:
len(idx_to_remove)

65914

In [24]:
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]

In [25]:
len(inputs), len(outputs)

(208142, 208142)
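
The two passes above keep `inputs` and `outputs` aligned by deleting the same indices from both lists. An alternative sketch, not the approach used in these notes, filters both sides in a single pass with `zip`; the helper name and toy data are hypothetical.

```python
def filter_pairs(inputs, outputs, max_length):
    """Keep only the pairs where both the source and the target fit in max_length tokens."""
    kept = [(src, tgt) for src, tgt in zip(inputs, outputs)
            if len(src) <= max_length and len(tgt) <= max_length]
    kept_inputs = [src for src, _ in kept]
    kept_outputs = [tgt for _, tgt in kept]
    return kept_inputs, kept_outputs

toy_inputs = [[1, 2, 3], [1, 2, 3, 4, 5], [1, 2]]
toy_outputs = [[9, 8], [9, 8, 7], [9, 8, 7, 6, 5]]
short_inputs, short_outputs = filter_pairs(toy_inputs, toy_outputs, max_length=4)
# short_inputs == [[1, 2, 3]], short_outputs == [[9, 8]]
```

Building new lists avoids the repeated `del` calls, each of which shifts the remainder of a large list.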

### Padding and batches

In [33]:
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, value=0, padding='post', maxlen=max_length)
outputs = tf.keras.preprocessing.sequence.pad_sequences(outputs, value=0, padding='post', maxlen=max_length)

In [44]:
i = random.randint(0, len(inputs) - 1)  # randint is inclusive on both ends
print(i, len(inputs[i]), len(outputs[i]))
inputs[i], outputs[i]

19689 15 15


(array([8188, 7972,   25,  265, 4587,    1,  182,  297, 7049, 7973, 8189,
           0,    0,    0,    0]),
 array([8116, 7900,   36,  150, 1460, 7892, 3144,    3,  237,  645, 7901,
        8117,    0,    0,    0]))
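
What `padding='post'` does can be sketched in plain Python. This is a simplified stand-in for `pad_sequences`, not its implementation; like the Keras default, it truncates from the front when a sequence is longer than `maxlen`, which cannot happen here because longer sentences were already removed.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq[-maxlen:])  # keep the last maxlen tokens if the sequence is too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

batch = pad_post([[8188, 5, 8189], [8188, 7, 9, 11, 8189]], maxlen=5)
# batch == [[8188, 5, 8189, 0, 0], [8188, 7, 9, 11, 8189]]
```

Padding with `0` is safe because the start and end markers use the two highest IDs, and the padding mask later treats `0` as "no token".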

Now let's convert the data to a TensorFlow dataset.

In [45]:
batch_size = 64
buffer_size = 20000

dataset = tf.data.Dataset.from_tensor_slices((inputs, outputs))
dataset = dataset.cache()
dataset = dataset.shuffle(buffer_size).batch(batch_size)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

In [46]:
dataset

<PrefetchDataset shapes: ((None, 15), (None, 15)), types: (tf.int32, tf.int32)>

## Embedding

In [49]:
class PositionalEncoding(layers.Layer):

    def __init__(self):
        super(PositionalEncoding, self).__init__()

    def get_angles(self, pos, i, d_model):
        # pos: (seq_length, 1), i: (1, d_model) -> angles: (seq_length, d_model)
        angles = 1 / np.power(10000., (2*(i // 2)) / np.float32(d_model))
        return pos * angles

    def call(self, inputs):
        seq_length = inputs.shape.as_list()[-2]
        d_model = inputs.shape.as_list()[-1]
        angles = self.get_angles(np.arange(seq_length)[:, np.newaxis],
                                 np.arange(d_model)[np.newaxis, :],
                                 d_model)

        # Sine on even dimensions, cosine on odd dimensions.
        angles[:, 0::2] = np.sin(angles[:, 0::2])
        angles[:, 1::2] = np.cos(angles[:, 1::2])
        pos_encoding = angles[np.newaxis, ...]

        return inputs + tf.cast(pos_encoding, tf.float32)

## Attention mechanism

In [48]:
def scaled_dot_product_attention(queries, keys, values, mask):
    product = tf.matmul(queries, keys, transpose_b=True)
    keys_dim = tf.cast(tf.shape(keys)[-1], tf.float32)
    scaled_product = product / tf.math.sqrt(keys_dim)
    
    if mask is not None:
        scaled_product += (mask * -1e9)
        
    attention = tf.matmul(tf.nn.softmax(scaled_product, axis=-1), values)
    
    return attention

In [54]:
class MultiHeadAttention(layers.Layer):
    
    def __init__(self, nb_proj):
        super(MultiHeadAttention, self).__init__()
        self.nb_proj = nb_proj
        
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        assert self.d_model % self.nb_proj == 0
        
        self.d_proj = self.d_model // self.nb_proj
        
        self.query_lin = layers.Dense(units=self.d_model)
        self.key_lin = layers.Dense(units=self.d_model)
        self.value_lin = layers.Dense(units=self.d_model)
        
        self.final_lin = layers.Dense(units=self.d_model)
        
    def split_proj(self, inputs, batch_size):
        shape = (batch_size, -1, self.nb_proj, self.d_proj)
        splited_inputs = tf.reshape(inputs, shape=shape)
        return tf.transpose(splited_inputs, perm=[0,2,1,3])
        
    def call(self, queries, keys, values, mask):
        batch_size = tf.shape(queries)[0]
        
        queries = self.query_lin(queries)
        keys = self.key_lin(keys)
        values = self.value_lin(values)
        
        queries = self.split_proj(queries, batch_size)
        keys = self.split_proj(keys, batch_size)
        values = self.split_proj(values, batch_size)
        
        attention = scaled_dot_product_attention(queries, keys, values, mask)
        
        attention = tf.transpose(attention, perm=[0,2,1,3])
        
        concat_attention = tf.reshape(attention, shape=(batch_size, -1, self.d_model))
        
        outputs = self.final_lin(concat_attention)
        
        return outputs

## Encoder

In [57]:
class EncoderLayer(layers.Layer):
    
    def __init__(self, FFN_units, nb_proj, dropout_rate):
        super(EncoderLayer, self).__init__()
        self.FFN_units = FFN_units
        self.nb_proj = nb_proj
        self.dropout_rate = dropout_rate
        
    def build(self, input_shape):
        self.d_model = input_shape[-1]

        self.multi_head_attention = MultiHeadAttention(self.nb_proj)
        self.dropout_1 = layers.Dropout(rate=self.dropout_rate)
        self.norm_1 = layers.LayerNormalization(epsilon=1e-6)

        self.dense_1 = layers.Dense(units=self.FFN_units, activation='relu')
        # The second linear map has no activation and projects back to d_model,
        # so its output can be added to the residual branch.
        self.dense_2 = layers.Dense(units=self.d_model)
        self.dropout_2 = layers.Dropout(rate=self.dropout_rate)
        
        self.norm_2 = layers.LayerNormalization(epsilon=1e-6)
        
    def call(self, inputs, mask, training):
        attention = self.multi_head_attention(inputs, inputs, inputs, mask)        
        attention = self.dropout_1(attention, training=training)
        attention = self.norm_1(attention + inputs)
        
        outputs = self.dense_1(attention)
        outputs = self.dense_2(outputs)
        outputs = self.dropout_2(outputs, training=training)
        outputs = self.norm_2(outputs + attention)
        
        return outputs

In [56]:
class Encoder(layers.Layer):
    
    def __init__(self, 
               nb_layers, 
               FFN_units, 
               nb_proj, 
               dropout_rate, 
               vocab_size, 
               d_model, 
               name='encoder'):
        super(Encoder, self).__init__(name=name)
        self.nb_layers = nb_layers
        self.d_model = d_model
        
        self.embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = layers.Dropout(rate=dropout_rate)
        self.enc_layers = [EncoderLayer(FFN_units, nb_proj, dropout_rate) for _ in range(nb_layers)]
    
    def call(self, inputs, mask, training):
        outputs = self.embedding(inputs)
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        outputs = self.pos_encoding(outputs)
        outputs = self.dropout(outputs, training)
        
        for i in range(self.nb_layers):
            outputs = self.enc_layers[i](outputs, mask, training)
            
        return outputs

## Decoder

In [58]:
class DecoderLayer(layers.Layer):
    
    def __init__(self, FFN_units, nb_proj, dropout_rate):
        super(DecoderLayer, self).__init__()
        self.FFN_units = FFN_units
        self.nb_proj = nb_proj
        self.dropout_rate = dropout_rate
        
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        
        self.multi_head_attention_1 = MultiHeadAttention(self.nb_proj)
        self.dropout_1 = layers.Dropout(rate=self.dropout_rate)
        self.norm_1 = layers.LayerNormalization(epsilon=1e-6)
        
        self.multi_head_attention_2 = MultiHeadAttention(self.nb_proj)
        self.dropout_2 = layers.Dropout(rate=self.dropout_rate)
        self.norm_2 = layers.LayerNormalization(epsilon=1e-6)
        
        self.dense_1 = layers.Dense(units=self.FFN_units, activation='relu')
        # The second linear map has no activation and projects back to d_model.
        self.dense_2 = layers.Dense(units=self.d_model)
        self.dropout_3 = layers.Dropout(rate=self.dropout_rate)
        self.norm_3 = layers.LayerNormalization(epsilon=1e-6)
        
    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        attention = self.multi_head_attention_1(inputs, inputs, inputs, mask_1)
        attention = self.dropout_1(attention, training)
        attention = self.norm_1(attention + inputs)
        
        attention_2 = self.multi_head_attention_2(attention, enc_outputs, enc_outputs, mask_2)
        attention_2 = self.dropout_2(attention_2, training)
        attention_2 = self.norm_2(attention_2 + attention)
        
        outputs = self.dense_1(attention_2)
        outputs = self.dense_2(outputs)
        outputs = self.dropout_3(outputs, training)
        outputs = self.norm_3(outputs + attention_2)
        
        return outputs

In [60]:
class Decoder(layers.Layer):
    
    def __init__(self, 
                 nb_layers, 
                 FFN_units, 
                 nb_proj, 
                 dropout_rate, 
                 vocab_size, 
                 d_model, 
                 name='decoder'):
        super(Decoder, self).__init__(name=name)
        self.d_model = d_model
        self.nb_layers = nb_layers
        
        self.embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = layers.Dropout(rate=dropout_rate)
        
        self.dec_layers = [DecoderLayer(FFN_units, nb_proj, dropout_rate) for _ in range(nb_layers)]
        
    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        outputs = self.embedding(inputs)
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        outputs = self.pos_encoding(outputs)
        outputs = self.dropout(outputs, training)
        
        for i in range(self.nb_layers):
            outputs = self.dec_layers[i](outputs, enc_outputs, mask_1, mask_2, training)
            
        return outputs

## Transformer

In [None]:
class Transformer(tf.keras.Model):

    def __init__(self,
                 vocab_size_enc,
                 vocab_size_dec,
                 d_model,
                 nb_layers,
                 FFN_units,
                 nb_proj,
                 dropout_rate,
                 name='transformer'):
        super(Transformer, self).__init__(name=name)

    # The methods below are unfinished stubs in these notes.
    def create_padding_mask(self, seq):
        raise NotImplementedError

    def create_look_ahead_mask(self, seq):
        raise NotImplementedError

    def call(self, enc_inputs, dec_inputs, training):
        raise NotImplementedError
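
The mask methods are left unfinished above. As a hedged sketch of what they might compute, the plain-Python functions below follow the convention of `scaled_dot_product_attention`, where `1` marks a position to hide. The function names, the `pad_id` parameter, and the way the two masks are combined are assumptions, not the author's implementation.

```python
def create_padding_mask(seq, pad_id=0):
    """1 where the token is padding, 0 elsewhere."""
    return [1 if token == pad_id else 0 for token in seq]

def create_look_ahead_mask(seq_len):
    """1 where a query position would look at a later key position."""
    return [[1 if k > q else 0 for k in range(seq_len)] for q in range(seq_len)]

def create_decoder_mask(seq, pad_id=0):
    """Hide a key position if it is padding or lies in the future."""
    padding = create_padding_mask(seq, pad_id)
    look_ahead = create_look_ahead_mask(len(seq))
    return [[max(padding[k], look_ahead[q][k]) for k in range(len(seq))]
            for q in range(len(seq))]

# A target sentence of three tokens followed by one padding position.
decoder_mask = create_decoder_mask([8116, 834, 705, 0])
for row in decoder_mask:
    print(row)
```

In the resulting matrix the last column is always `1`, because the padding position is hidden from every query, while the upper triangle hides future tokens.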