# Transformer

Referência: [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)

# Teoria
------

## Introdução

### RNNs clássicas para NLP

- Palavras são codificadas em vetores
- Cada novo estado é baseado no estado anterior
- Decodificação começa no estado final do codificador
![Rnns_classica](images/rnn_classica.png)

### Mecanismo de atenção

- ultima camada do codificador sobrecarregada com todos os textos
- atenção maior para as camadas
- maiores pesos para o contexto estado anterior

![Mecanismo de atenção](images/attention.png)

## Arquitetura 

Essa arquitetura tem como foco os mecanismos de atenção. Ao em vez de cada palavra receber a atenção, nos Transformers cada frase terá esse processo.

![arquitetura](images/transformer.png)
![arquitetura](images/self-attention.png)


## Scale-dot product

**Ideia principal:**
- 2 sequências (iguais no caso de self-attention), A e B
- calcular como cada elemento de A está relacionado a cada elmento de B
- depois recombinamos A de acordo com essa relação

**Matematicamente**, dot-product indica a similaridade entre dois vetores.

![scale_dot-product](images/scale_dot-product.png)

![scale_dot-product](images/scale_dot-product2.png)

## Look-ahead Mask

![look-ahead](images/look-ahead.png)

## Attention Layer

![attentio layer](images/attention-layer.png)

## Multi-head attention Layer

![multi-head](images/multi-head.png)

## Positional Encoding

![positional-encod](images/positional-encod.png)

## Feed-forward layers (camadas densas)

- composta de 2 transformações lineares

$$
FFN(x) = \max(0, x W_1 + b_1)W_2 + b_2
$$

## Residual connections:

- **Add & Norm:** não esquecer a informação da etapa anterior, ajudando a aprendizagem durante o *backpropagation*.
- **Last linear:** a saída do decodificador passa por uma camada densa de acordo com o tamanho do vocabulário e com a aplicação da função softmax, gerando probabilidades para cada palavra.

# Prática
----------

## Importação

In [282]:
import pandas as pd
import numpy as np

import math
import re
import time
import zipfile
import random

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

# Preparação dos dados

- Bases de dados: https://www.statmt.org/europarl/

In [283]:
file_path = '../../_IAExpert_private/7. Processamento de Linguagem Natural com Deep LEarning/pt-en'

with open(f"{file_path}/europarl-v7.pt-en.en", mode='r', encoding='utf-8') as f:
    europarl_en = f.read()

with open(f"{file_path}/europarl-v7.pt-en.pt", mode='r', encoding='utf-8') as f:
    europarl_pt = f.read()

KeyboardInterrupt: 

In [None]:
europarl_en[:100]

In [None]:
en = europarl_en.split('\n')
pt = europarl_pt.split('\n')

len(en), len(pt)

In [None]:
i = random.randint(0, len(en)-1)
en[i], pt[i]

### Limpeza dos dados

```python
corpus_en = europarl_en
corpus_en = re.sub(r"\.(?=[0-9][a-z][A-Z])", '.$$$', corpus_en)
corpus_en = re.sub(r".\$\$\$", '', corpus_en)
corpus_en = re.sub(r" +", ' ', corpus_en)
corpus_en = corpus_en.split('\n')

```

```python
corpus_pt = europarl_pt
corpus_pt = re.sub(r"\.(?=[0-9][a-z][A-Z])", '.$$$', corpus_pt)
corpus_pt = re.sub(r".\$\$\$", '', corpus_pt)
corpus_pt = re.sub(r" +", ' ', corpus_pt)
corpus_pt = corpus_pt.split('\n')
```

```python
with open("corpus.pkl", "wb") as f:
    pickle.dump([corpus_en, corpus_pt], f)
```

In [None]:
import pickle
with open("corpus.pkl", 'rb') as f:
    corpus_en, corpus_pt = pickle.load(f)

In [None]:
len(corpus_en), len(corpus_pt)

### Tokenização

- texto para número

In [None]:
tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(corpus_en, target_vocab_size=2**13)

In [None]:
tokenizer_pt = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(corpus_pt, target_vocab_size=2**13)

In [None]:
tokenizer_en.vocab_size, tokenizer_pt.vocab_size

In [None]:
vocab_size_en = tokenizer_en.vocab_size + 2
vocab_size_pt = tokenizer_pt.vocab_size + 2

In [None]:
inputs = [[vocab_size_en - 2] + tokenizer_en.encode(sentence) + [vocab_size_en - 1] for sentence in corpus_en]

In [None]:
outputs = [[vocab_size_pt - 2] + tokenizer_pt.encode(sentence) + [vocab_size_pt - 1] for sentence in corpus_pt]

In [None]:
inputs[0]

In [None]:
outputs[0]

### Remoção de sentenças muito longas

#### A partir dos inputs

In [None]:
max_length = 15
idx_to_remove = [count for count, sent in enumerate(inputs) if len(sent) > max_length]

In [None]:
len(idx_to_remove)

In [None]:
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]

#### A partir dos outputs

In [None]:
max_length = 15
idx_to_remove = [count for count, sent in enumerate(outputs) if len(sent) > max_length]

In [None]:
len(idx_to_remove)

In [None]:
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]

In [None]:
len(inputs), len(outputs)

### Paddings e batches

In [None]:
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, value=0, padding='post', maxlen=max_length)
outputs = tf.keras.preprocessing.sequence.pad_sequences(outputs, value=0, padding='post', maxlen=max_length)

In [None]:
i = random.randint(0, len(inputs))
print(i, len(inputs[i]), len(outputs[i]))
inputs[i], outputs[i]

Vamos mudar para o formato do tensor flow.

In [None]:
batch_size = 64
buffer_size = 20000

dataset = tf.data.Dataset.from_tensor_slices((inputs, outputs))
dataset = dataset.cache()
dataset = dataset.shuffle(buffer_size).batch(batch_size)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

In [None]:
dataset

# Modelo

## Embedding

In [284]:
class PositionalEncoding(layers.Layer):
    
    def __init__(self):
        super(PositionalEncoding, self).__init__()
    
    def get_angles(self, pos, i, d_model):
        angles = 1 / np.power(10000., (2*(i // 2)) / np.float32(d_model))
        return pos * angles
    
    def call(self, inputs):
        seq_length = inputs.shape.as_list()[-2]
        d_model = inputs.shape.as_list()[-1]
        angles = self.get_angles(np.arange(seq_length)[:, np.newaxis], 
                                 np.arange(d_model)[np.newaxis,:], 
                                 d_model)
        
        angles[:, 0::2] = np.sin(angles[:, 0::2])
        angles[:, 1::2] = np.cos(angles[:, 1::2])
        pos_encoding = angles[np.newaxis, ...]
        
        return inputs + tf.cast(pos_encoding, tf.float32)        

## Mecanismo de atenção

In [285]:
def scaled_dot_product_attention(queries, keys, values, mask):
    product = tf.matmul(queries, keys, transpose_b=True)
    keys_dim = tf.cast(tf.shape(keys)[-1], tf.float32)
    scaled_product = product / tf.math.sqrt(keys_dim)

    if mask is not None:
        scaled_product += (mask * -1e9) # 0.0000000001

    attention = tf.matmul(tf.nn.softmax(scaled_product, axis=-1), values)
    return attention

In [286]:
class MultiHeadAttention(layers.Layer):
    
    def __init__(self, nb_proj):
        super(MultiHeadAttention, self).__init__()
        self.nb_proj = nb_proj
        
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        assert self.d_model % self.nb_proj == 0

        self.d_proj = self.d_model // self.nb_proj

        self.query_lin = layers.Dense(units = self.d_model)
        self.key_lin = layers.Dense(units = self.d_model)
        self.value_lin = layers.Dense(units = self.d_model)

        self.final_lin = layers.Dense(units = self.d_model)
    
    def split_proj(self, inputs, batch_size): 
        shape = (batch_size, -1, self.nb_proj, self.d_proj)
        splited_inputs = tf.reshape(inputs, shape = shape) 
        return tf.transpose(splited_inputs, perm=[0, 2, 1, 3]) 
   
    def call(self, queries, keys, values, mask):
        batch_size = tf.shape(queries)[0]

        queries = self.query_lin(queries)
        keys = self.key_lin(keys)
        values = self.value_lin(values)

        queries = self.split_proj(queries, batch_size)
        keys = self.split_proj(keys, batch_size)
        values = self.split_proj(values, batch_size)

        attention = scaled_dot_product_attention(queries, keys, values, mask)

        attention = tf.transpose(attention, perm=[0, 2, 1, 3])

        concat_attention = tf.reshape(attention, shape=(batch_size, -1, self.d_model))

        outputs = self.final_lin(concat_attention)
      
        return outputs    

## Encoder

In [287]:
class EncoderLayer(layers.Layer):
    
    def __init__(self, FFN_units, nb_proj, dropout_rate):
        super(EncoderLayer, self).__init__()
        self.FFN_units = FFN_units
        self.nb_proj = nb_proj
        self.dropout_rate = dropout_rate
        
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        
        self.multi_head_attention = MultiHeadAttention(self.nb_proj)
        self.dropout_1 = layers.Dropout(rate=self.dropout_rate)
        self.norm_1 = layers.LayerNormalization(epsilon=1e-6)
        
        self.dense_1 = layers.Dense(units=self.FFN_units, activation='relu')
        self.dense_2 = layers.Dense(units=self.d_model, activation='relu')
        self.dropout_2 = layers.Dropout(rate=self.dropout_rate)
        
        self.norm_2 = layers.LayerNormalization(epsilon=1e-6)
        
    def call(self, inputs, mask, training):
        attention = self.multi_head_attention(inputs, inputs, inputs, mask)        
        attention = self.dropout_1(attention, training=training)
        attention = self.norm_1(attention + inputs)
        
        outputs = self.dense_1(attention)
        outputs = self.dense_2(outputs)
        outputs = self.dropout_2(outputs, training=training)
        outputs = self.norm_2(outputs + attention)
        
        return outputs

In [288]:
class Encoder(layers.Layer):
    
    def __init__(self, 
               nb_layers, 
               FFN_units, 
               nb_proj, 
               dropout_rate, 
               vocab_size, 
               d_model, 
               name='encoder'):
        super(Encoder, self).__init__(name=name)
        self.nb_layers = nb_layers
        self.d_model = d_model
        
        self.embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = layers.Dropout(rate=dropout_rate)
        self.enc_layers = [EncoderLayer(FFN_units, nb_proj, dropout_rate) for _ in range(nb_layers)]
    
    def call(self, inputs, mask, training):
        outputs = self.embedding(inputs)
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        outputs = self.pos_encoding(outputs)
        outputs = self.dropout(outputs, training)
        
        for i in range(self.nb_layers):
            outputs = self.enc_layers[i](outputs, mask, training)
            
        return outputs

## Decoder

In [289]:
class DecoderLayer(layers.Layer):
    
    def __init__(self, FFN_units, nb_proj, dropout_rate):
        super(DecoderLayer, self).__init__()
        self.FFN_units = FFN_units
        self.nb_proj = nb_proj
        self.dropout_rate = dropout_rate
        
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        
        self.multi_head_attention_1 = MultiHeadAttention(self.nb_proj)
        self.dropout_1 = layers.Dropout(rate=self.dropout_rate)
        self.norm_1 = layers.LayerNormalization(epsilon=1e-6)
        
        self.multi_head_attention_2 = MultiHeadAttention(self.nb_proj)
        self.dropout_2 = layers.Dropout(rate=self.dropout_rate)
        self.norm_2 = layers.LayerNormalization(epsilon=1e-6)
        
        self.dense_1 = layers.Dense(units=self.FFN_units, activation='relu')
        self.dense_2 = layers.Dense(units=self.d_model, activation='relu')
        self.dropout_3 = layers.Dropout(rate=self.dropout_rate)
        self.norm_3 = layers.LayerNormalization(epsilon=1e-6)
        
    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        attention = self.multi_head_attention_1(inputs, inputs, inputs, mask_1)
        attention = self.dropout_1(attention, training)
        attention = self.norm_1(attention + inputs)
        
        attention_2 = self.multi_head_attention_2(attention, enc_outputs, enc_outputs, mask_2)
        attention_2 = self.dropout_2(attention_2, training)
        attention_2 = self.norm_2(attention_2 + attention)
        
        outputs = self.dense_1(attention_2)
        outputs = self.dense_2(outputs)
        outputs = self.dropout_3(outputs, training)
        outputs = self.norm_3(outputs + attention_2)
        
        return outputs

In [290]:
class Decoder(layers.Layer):
    
    def __init__(self, 
                 nb_layers, 
                 FFN_units, 
                 nb_proj, 
                 dropout_rate, 
                 vocab_size, 
                 d_model, 
                 name='encoder'):
        super(Decoder, self).__init__(name=name)
        self.d_model = d_model
        self.nb_layers = nb_layers
        
        self.embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = layers.Dropout(rate=dropout_rate)
        
        self.dec_layers = [DecoderLayer(FFN_units, nb_proj, dropout_rate) for i in range(nb_layers)]
        
    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        outputs = self.embedding(inputs)
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        outputs = self.pos_encoding(outputs)
        outputs = self.dropout(outputs, training)
        
        for i in range(self.nb_layers):
            outputs = self.dec_layers[i](outputs, enc_outputs, mask_1, mask_2, training)
            
        return outputs

## Transformer

In [291]:
class Transformer(tf.keras.Model):
    
    def __init__(self,
                   vocab_size_enc,
                   vocab_size_dec,
                   d_model,
                 nb_layers,
                   FFN_units,
                   nb_proj,
                   dropout_rate,
                   name="transformer"):
        super(Transformer, self).__init__(name=name)

        self.encoder = Encoder(nb_layers, FFN_units, nb_proj, dropout_rate, 
                               vocab_size_enc, d_model)
        self.decoder = Decoder(nb_layers, FFN_units, nb_proj, dropout_rate,
                               vocab_size_dec, d_model)
        self.last_linear = layers.Dense(units=vocab_size_dec, name='lin_output')
                
    def create_padding_mask(self, seq): 
        mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
        
        return mask[:, tf.newaxis, tf.newaxis, :]
    
    def create_look_ahead_mask(self, seq):
        seq_len = tf.shape(seq)[1]
        look_ahed_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
        return look_ahed_mask
    
    def call(self, enc_inputs, dec_inputs, training):
        enc_mask = self.create_padding_mask(enc_inputs)
        dec_mask_1 = tf.maximum(self.create_padding_mask(dec_inputs), self.create_look_ahead_mask(dec_inputs))
        dec_mask_2 = self.create_padding_mask(enc_inputs)

        enc_outputs = self.encoder(enc_inputs, enc_mask, training)
        dec_outputs = self.decoder(dec_inputs, enc_outputs, dec_mask_1, dec_mask_2, training)

        outputs = self.last_linear(dec_outputs)

        return outputs

# Treinamento

In [292]:
tf.keras.backend.clear_session()

d_model = 128
nb_layers = 4
ffn_units = 512
nb_proj = 8
dropout_rate = 0.1

In [293]:
transformer = Transformer(vocab_size_enc=vocab_size_en, 
                          vocab_size_dec=vocab_size_pt, 
                          d_model=d_model, 
                          nb_layers=nb_layers,
                          FFN_units=ffn_units, 
                          nb_proj=nb_proj, 
                          dropout_rate=dropout_rate)

In [294]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

In [301]:
def loss_function(target, pred):
    mask = tf.math.logical_not(tf.math.equal(target, 0))
    loss_ = loss_object(target, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

In [296]:
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

In [297]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
        
    def __call__(self, step):
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

In [298]:
learning_rate = CustomSchedule(d_model)

In [299]:
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

In [302]:
epochs = 10
for epoch in range(epochs):
    print('Start or epoch {}'.format(epoch + 1))
    start = time.time()

    train_loss.reset_states()
    train_accuracy.reset_states()

    for (batch, (enc_inputs, targets)) in enumerate(dataset):
        dec_inputs = targets[:, :-1]
        dec_outputs_real = targets[:, 1:]
        with tf.GradientTape() as tape:
            predictions = transformer(enc_inputs, dec_inputs, True)
            loss = loss_function(dec_outputs_real, predictions)
    
        gradients = tape.gradient(loss, transformer.trainable_variables)
        optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

        train_loss(loss)
        train_accuracy(dec_outputs_real, predictions)
   
        if batch % 50 == 0:
            print('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch+1, batch, train_loss.result(), train_accuracy.result()))
            
    print('Time taken for 1 epoch {} secs\n'.format(time.time() - start))

Start or epoch 1
Epoch 1 Batch 0 Loss 6.2424 Accuracy 0.0000
Epoch 1 Batch 50 Loss 6.3664 Accuracy 0.0204
Epoch 1 Batch 100 Loss 6.2987 Accuracy 0.0457
Epoch 1 Batch 150 Loss 6.2138 Accuracy 0.0542
Epoch 1 Batch 200 Loss 6.1287 Accuracy 0.0585
Epoch 1 Batch 250 Loss 6.0355 Accuracy 0.0611
Epoch 1 Batch 300 Loss 5.9247 Accuracy 0.0628
Epoch 1 Batch 350 Loss 5.7906 Accuracy 0.0652
Epoch 1 Batch 400 Loss 5.6493 Accuracy 0.0709
Epoch 1 Batch 450 Loss 5.5287 Accuracy 0.0769
Epoch 1 Batch 500 Loss 5.4141 Accuracy 0.0820
Epoch 1 Batch 550 Loss 5.3062 Accuracy 0.0868
Epoch 1 Batch 600 Loss 5.2063 Accuracy 0.0920
Epoch 1 Batch 650 Loss 5.1108 Accuracy 0.0972
Epoch 1 Batch 700 Loss 5.0182 Accuracy 0.1023
Epoch 1 Batch 750 Loss 4.9381 Accuracy 0.1076
Epoch 1 Batch 800 Loss 4.8591 Accuracy 0.1127
Epoch 1 Batch 850 Loss 4.7847 Accuracy 0.1175
Epoch 1 Batch 900 Loss 4.7125 Accuracy 0.1223
Epoch 1 Batch 950 Loss 4.6452 Accuracy 0.1269
Epoch 1 Batch 1000 Loss 4.5791 Accuracy 0.1312
Epoch 1 Batch 1050 

Epoch 3 Batch 2050 Loss 1.3948 Accuracy 0.4395
Epoch 3 Batch 2100 Loss 1.3903 Accuracy 0.4404
Epoch 3 Batch 2150 Loss 1.3867 Accuracy 0.4411
Epoch 3 Batch 2200 Loss 1.3835 Accuracy 0.4420
Epoch 3 Batch 2250 Loss 1.3813 Accuracy 0.4425
Epoch 3 Batch 2300 Loss 1.3796 Accuracy 0.4429
Epoch 3 Batch 2350 Loss 1.3790 Accuracy 0.4432
Epoch 3 Batch 2400 Loss 1.3790 Accuracy 0.4434
Epoch 3 Batch 2450 Loss 1.3793 Accuracy 0.4435
Epoch 3 Batch 2500 Loss 1.3802 Accuracy 0.4435
Epoch 3 Batch 2550 Loss 1.3819 Accuracy 0.4434
Epoch 3 Batch 2600 Loss 1.3831 Accuracy 0.4434
Epoch 3 Batch 2650 Loss 1.3846 Accuracy 0.4434
Epoch 3 Batch 2700 Loss 1.3860 Accuracy 0.4434
Epoch 3 Batch 2750 Loss 1.3870 Accuracy 0.4434
Epoch 3 Batch 2800 Loss 1.3890 Accuracy 0.4433
Epoch 3 Batch 2850 Loss 1.3900 Accuracy 0.4432
Epoch 3 Batch 2900 Loss 1.3908 Accuracy 0.4431
Epoch 3 Batch 2950 Loss 1.3918 Accuracy 0.4431
Epoch 3 Batch 3000 Loss 1.3931 Accuracy 0.4430
Epoch 3 Batch 3050 Loss 1.3940 Accuracy 0.4429
Epoch 3 Batch

Epoch 6 Batch 750 Loss 1.2025 Accuracy 0.4709
Epoch 6 Batch 800 Loss 1.1982 Accuracy 0.4713
Epoch 6 Batch 850 Loss 1.1924 Accuracy 0.4717
Epoch 6 Batch 900 Loss 1.1881 Accuracy 0.4720
Epoch 6 Batch 950 Loss 1.1815 Accuracy 0.4722
Epoch 6 Batch 1000 Loss 1.1755 Accuracy 0.4727
Epoch 6 Batch 1050 Loss 1.1691 Accuracy 0.4730
Epoch 6 Batch 1100 Loss 1.1639 Accuracy 0.4733
Epoch 6 Batch 1150 Loss 1.1567 Accuracy 0.4738
Epoch 6 Batch 1200 Loss 1.1499 Accuracy 0.4743
Epoch 6 Batch 1250 Loss 1.1444 Accuracy 0.4746
Epoch 6 Batch 1300 Loss 1.1387 Accuracy 0.4750
Epoch 6 Batch 1350 Loss 1.1335 Accuracy 0.4755
Epoch 6 Batch 1400 Loss 1.1282 Accuracy 0.4761
Epoch 6 Batch 1450 Loss 1.1240 Accuracy 0.4766
Epoch 6 Batch 1500 Loss 1.1193 Accuracy 0.4772
Epoch 6 Batch 1550 Loss 1.1154 Accuracy 0.4779
Epoch 6 Batch 1600 Loss 1.1105 Accuracy 0.4786
Epoch 6 Batch 1650 Loss 1.1072 Accuracy 0.4794
Epoch 6 Batch 1700 Loss 1.1023 Accuracy 0.4800
Epoch 6 Batch 1750 Loss 1.0985 Accuracy 0.4808
Epoch 6 Batch 1800

Epoch 8 Batch 2800 Loss 1.0121 Accuracy 0.4979
Epoch 8 Batch 2850 Loss 1.0149 Accuracy 0.4977
Epoch 8 Batch 2900 Loss 1.0173 Accuracy 0.4974
Epoch 8 Batch 2950 Loss 1.0196 Accuracy 0.4970
Epoch 8 Batch 3000 Loss 1.0220 Accuracy 0.4967
Epoch 8 Batch 3050 Loss 1.0242 Accuracy 0.4964
Epoch 8 Batch 3100 Loss 1.0264 Accuracy 0.4963
Epoch 8 Batch 3150 Loss 1.0289 Accuracy 0.4959
Epoch 8 Batch 3200 Loss 1.0309 Accuracy 0.4957
Epoch 8 Batch 3250 Loss 1.0332 Accuracy 0.4955
Time taken for 1 epoch 3226.426320552826 secs

Start or epoch 9
Epoch 9 Batch 0 Loss 1.1805 Accuracy 0.4900
Epoch 9 Batch 50 Loss 1.2201 Accuracy 0.4770
Epoch 9 Batch 100 Loss 1.1712 Accuracy 0.4806
Epoch 9 Batch 150 Loss 1.1619 Accuracy 0.4807
Epoch 9 Batch 200 Loss 1.1608 Accuracy 0.4818
Epoch 9 Batch 250 Loss 1.1503 Accuracy 0.4837
Epoch 9 Batch 300 Loss 1.1450 Accuracy 0.4841
Epoch 9 Batch 350 Loss 1.1404 Accuracy 0.4841
Epoch 9 Batch 400 Loss 1.1312 Accuracy 0.4848
Epoch 9 Batch 450 Loss 1.1254 Accuracy 0.4855
Epoch 9 B

# Avaliação

In [303]:
text = 'you are smart'
text = [vocab_size_en - 2] + tokenizer_en.encode(text) + [vocab_size_en - 1]
text

[8188, 55, 17, 2201, 4093, 8189]

In [304]:
text = tf.expand_dims(text, axis=0)
text.shape

TensorShape([1, 6])

In [305]:
output = tf.expand_dims([vocab_size_pt - 2], axis = 0)
output.shape

TensorShape([1, 1])

In [306]:
def evaluate(inp_sentence):
  inp_sentence = [vocab_size_en - 2] + tokenizer_en.encode(inp_sentence) + [vocab_size_en - 1]
  enc_input = tf.expand_dims(inp_sentence, axis=0)

  output = tf.expand_dims([vocab_size_pt - 2], axis = 0)

  # i am -> am happy

  for _ in range(max_length):
    # (1, seq_length, vocab_size)
    predictions = transformer(enc_input, output, False)
    prediction = predictions[:, -1:, :]

    predicted_id = tf.cast(tf.argmax(prediction, axis=-1), tf.int32)

    if predicted_id == vocab_size_pt - 1:
      return tf.squeeze(output, axis=0)

    output = tf.concat([output, predicted_id], axis=1)
  
  return tf.squeeze(output, axis = 0)

In [307]:
def translate(sentence):
  output = evaluate(sentence).numpy()

  predicted_sentence = tokenizer_pt.decode([i for i in output if i < vocab_size_pt - 2])
  
  print('Input: {}'.format(sentence))
  print('Predicted translation: {}'.format(predicted_sentence))


In [308]:
translate("this is a really powerful tool")

Input: this is a really powerful tool
Predicted translation: Também isto é um instrumento verdadeiramente poderoso.
