**Human Language Technologies Project**

**Authors:** Dalla Noce Niko, Ristori Alessandro

#HLT Project

This work is largely based on the TensorFlow tutorial https://www.tensorflow.org/text/tutorials/transformer. Our aim was to introduce BERT as the encoder of the model and to try combinations with different architectures (both RNNs and transformers).

##Setup

We need to install the transformers package to use the models and tokenizers from HuggingFace.

In [None]:
!pip install transformers



Import the libraries needed for the project to work.

In [None]:
import logging
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

The model is trained on TPUs, since they are optimized for tensor operations. To do so, we need Colab to assign us as many TPUs as possible.

In [None]:
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Initializing the TPU system: grpc://10.3.163.74:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)


In [None]:
logging.getLogger('tensorflow').setLevel(logging.ERROR)  # suppress warnings

##Preprocess the dataset

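Let's define the function that reads the Anki dataset, a tab-separated file of sentence pairs, into a list of source sentences and a list of target sentences.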

In [None]:
def create_dataset_anki(name: str, preprocessed:bool) -> (list, list):
    with open(name, encoding="UTF-8") as datafile:
        src_set = list()
        dst_set = list()
        for sentence in datafile:
            sentence = sentence.split("\t")
            src_set.append(sentence[0])
            if preprocessed:
                dst_set.append(sentence[1].split("\n")[0])
            else:
                dst_set.append(sentence[1])

    return src_set, dst_set

We assume that the dataset was uploaded to Colab as a zip file. We extract it and then build our lists with the function defined above.

In [None]:
import zipfile
with zipfile.ZipFile("dataset_anki_it.zip", 'r') as zip_ref:
    zip_ref.extractall("")

en_set, it_set = create_dataset_anki("ita_preprocessed.txt", True)
print("Corpus size: {0}".format(len(en_set)))

Corpus size: 352040


##Build the dataset

Before we create the dataset from our lists, we have to tokenize each sentence of the corpus, using the BERT tokenizer for English and the one for Italian. We also read the vocabulary size of both the source and the target tokenizer.

In [None]:
from transformers import BertTokenizer

# Create the tokenizers and get the number of tokens
tokenizer_en = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer_it = BertTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")
v_size_en = tokenizer_en.vocab_size
v_size_it = tokenizer_it.vocab_size

In [None]:
# Tokenize the dataset
tokens_en = tokenizer_en(en_set[:15000], add_special_tokens=True,
                          truncation=True, padding="max_length", return_attention_mask=True,
                          return_tensors="tf", max_length=30).data["input_ids"]
tokens_it = tokenizer_it(it_set[:15000], add_special_tokens=True,
                          truncation=True, padding="max_length", return_attention_mask=True,
                          return_tensors="tf", max_length=30).data["input_ids"]

Then we build the tf dataset and split it into training, validation and test sets.

In [None]:
def split_set(dataset: tf.data.Dataset,
              tr: float = 0.8,
              val: float = 0.1,
              ts: float = 0.1,
              shuffle: bool = True) -> (tf.data.Dataset, tf.data.Dataset, tf.data.Dataset):
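    # Compare with a tolerance: a sum of floating-point fractions is not always exactly 1
    if not np.isclose(tr + val + ts, 1.0):
        raise ValueError("Train, validation and test fractions must sum to 1")

    dataset_size = dataset.cardinality().numpy()
    if shuffle:
        # Shuffle only once: reshuffling at every iteration would move examples
        # between the take/skip partitions from one epoch to the next
        dataset = dataset.shuffle(dataset_size, reshuffle_each_iteration=False)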

    tr_size = int(tr * dataset_size)
    val_size = int(val * dataset_size)

    tr_set = dataset.take(tr_size)
    val_set = dataset.skip(tr_size).take(val_size)
    ts_set = dataset.skip(tr_size).skip(val_size)
    return tr_set, val_set, ts_set
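
The sizes produced by the `take`/`skip` chain can be checked without TensorFlow. The sketch below only reproduces the arithmetic of `split_set`; the helper name `split_sizes` is our own and is not part of the notebook. Note that the test set receives whatever is left after the two truncated sizes, so on small datasets it can end up larger than its nominal fraction.

```python
def split_sizes(n: int, tr: float = 0.8, val: float = 0.1) -> tuple:
    """Return (train, validation, test) sizes as the take/skip chain would produce them."""
    tr_size = int(tr * n)              # kept by take(tr_size)
    val_size = int(val * n)            # kept by skip(tr_size).take(val_size)
    ts_size = n - tr_size - val_size   # everything left after the two skips
    return tr_size, val_size, ts_size


print(split_sizes(15000))
print(split_sizes(7))
```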

In [None]:
# Build the dataset and split it in train, validation and test
dataset = tf.data.Dataset.from_tensor_slices((tokens_en, tokens_it))  # build the tf dataset
tr_set, val_set, ts_set = split_set(dataset, 0.8, 0.1, 0.1)  # split the tf dataset
print("Training set size: {0}".format(len(tr_set)))
print("Validation set size: {0}".format(len(val_set)))
print("Test set size: {0}".format(len(ts_set)))

Training set size: 12000
Validation set size: 1500
Test set size: 1500


Once the development and test sets are built, we split the development set (both training and validation) into batches.

In [None]:
def make_batches(dataset_src_dst: tf.data.Dataset, batch_size: int):
    return dataset_src_dst.cache().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

In [None]:
with strategy.scope():
  tr_batches = make_batches(tr_set, 128)
  val_batches = make_batches(val_set, 128)

In [None]:
for en, it in tr_batches.take(1):
  print(en.shape)

(128, 30)


##Positional encoding

Attention layers see their input as a set of vectors, with no sequential order. This model also doesn't contain any recurrent or convolutional layers. Because of this a "positional encoding" is added to give the model some information about the relative position of the tokens in the sentence.

The positional encoding vector is added to the embedding vector. Embeddings represent a token in a d-dimensional space where tokens with similar meaning will be closer to each other. But the embeddings do not encode the relative position of tokens in a sentence. So after adding the positional encoding, tokens will be closer to each other based on the similarity of their meaning and their position in the sentence, in the d-dimensional space.

In [None]:
def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  pos_encoding = angle_rads[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)
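
As a quick sanity check, the same sinusoidal scheme can be reproduced with NumPy alone. This is a minimal sketch that mirrors `positional_encoding` above without the batch axis and the TensorFlow cast; the function name `sinusoidal_encoding` is our own. Position 0 must give 0 on the even (sine) columns and 1 on the odd (cosine) columns, and every value must lie in [-1, 1].

```python
import numpy as np


def sinusoidal_encoding(length: int, d_model: int) -> np.ndarray:
    """NumPy-only version of the sinusoidal positional encoding, shape (length, d_model)."""
    pos = np.arange(length)[:, np.newaxis]      # (length, 1)
    i = np.arange(d_model)[np.newaxis, :]       # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])   # even columns: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])   # odd columns: cosine
    return angles


pe = sinusoidal_encoding(30, 512)
print(pe.shape)
print(pe[0, :4])
```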

##Masking

Mask all the pad tokens in the batch of sequences. This ensures that the model does not treat padding as input. The mask indicates where the pad value 0 is present: it outputs a 1 at those locations, and a 0 otherwise.

In [None]:
def create_padding_mask(seq):
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)  # 1.0 where the token is padding (id 0), 0.0 elsewhere

  # add extra dimensions to add the padding
  # to the attention logits.
  return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)
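
To make the convention concrete, here is a NumPy sketch of the same mask on a toy batch. The helper `padding_mask_np` and the token ids are made up for illustration; only the pad id 0 matters.

```python
import numpy as np


def padding_mask_np(seq: np.ndarray) -> np.ndarray:
    """NumPy counterpart of create_padding_mask: 1.0 on pad tokens (id 0), 0.0 elsewhere."""
    mask = (seq == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]   # (batch_size, 1, 1, seq_len)


# Two toy sentences padded to length 5 (ids other than 0 are arbitrary)
batch = np.array([[101, 7592, 102, 0, 0],
                  [101, 2088, 999, 102, 0]])
mask = padding_mask_np(batch)
print(mask.shape)
print(mask[:, 0, 0, :])
```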

The look-ahead mask is used to mask the future tokens in a sequence. In other words, the mask indicates which entries should not be used.

This means that to predict the third token, only the first and second token will be used. Similarly to predict the fourth token, only the first, second and the third tokens will be used and so on.

In [None]:
def create_look_ahead_mask(size):
  mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  return mask  # (seq_len, seq_len)
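
The following NumPy sketch shows what the look-ahead mask looks like for a short sequence and how it combines with a padding mask through an element-wise maximum, which is what the model does later with `tf.maximum`. The helper name `look_ahead_mask_np` and the toy padding pattern are our own. As above, a 1 marks a position that must be hidden.

```python
import numpy as np


def look_ahead_mask_np(size: int) -> np.ndarray:
    """1.0 above the diagonal (future tokens), 0.0 on and below it."""
    return 1 - np.tril(np.ones((size, size), dtype=np.float32))


look_ahead = look_ahead_mask_np(4)                                   # (4, 4)
# Toy target sentence of length 4 whose last token is padding
pad = np.array([[0, 0, 0, 1]], dtype=np.float32)[:, np.newaxis, np.newaxis, :]
combined = np.maximum(pad, look_ahead)                               # broadcasts to (1, 1, 4, 4)
print(look_ahead)
print(combined[0, 0])
```

Row `i` of the combined mask tells which positions the token at position `i` may not attend to: every future position, plus the padded one.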

##Encoder and decoder

As we know from the HLT course, NMT models follow the encoder-decoder paradigm, so we have to build both. We based the architecture of these layers on the paper "Attention Is All You Need" by Vaswani et al.

###Encoder

The single layer of the encoder transformer.

In [None]:
class EncoderLayer(tf.keras.layers.Layer):

    def __init__(self,
                 layers_size: int,
                 num_heads: int,
                 dff: int,
                 dropout: float = 0.1) -> None:
        super(EncoderLayer, self).__init__()

        self.mha = tf.keras.layers.MultiHeadAttention(num_heads, layers_size)

        self.ffn = tf.keras.Sequential([
              tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
              tf.keras.layers.Dense(layers_size)  # (batch_size, seq_len, d_model)
            ])

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(dropout)
        self.dropout2 = tf.keras.layers.Dropout(dropout)

    def call(self, src_tokens: tf.Tensor, training: bool, mask: tf.Tensor) -> tf.Tensor:
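        # tf.keras.layers.MultiHeadAttention expects 1 where attention is allowed,
        # while our masks hold 1 at the positions to hide, so we invert the mask here.
        attention_mask = None if mask is None else 1.0 - mask
        attn_output = self.mha(src_tokens, src_tokens, src_tokens,
                               attention_mask=attention_mask)  # (batch_size, input_seq_len, layers_size)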
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(src_tokens + attn_output)  # (batch_size, input_seq_len, layers_size)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, layers_size)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, layers_size)
        return out2

In [None]:
class EncoderTransformer(tf.keras.layers.Layer):

    def __init__(self, num_layers: int,
                 layers_size: int,
                 num_heads: int,
                 dff: int,
                 src_vocab_size: int,
                 maximum_position_encoding: int,
                 dropout: float = 0.1) -> None:
        super(EncoderTransformer, self).__init__()

        self.layers_size = layers_size
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(src_vocab_size, layers_size)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.layers_size)

        self.enc_layers = [EncoderLayer(layers_size, num_heads, dff, dropout) for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(dropout)

    def call(self, src_tokens: tf.Tensor, training: bool, mask: tf.Tensor) -> tf.Tensor:
        seq_len = tf.shape(src_tokens)[1]

        # adding embedding and position encoding.
        x = self.embedding(src_tokens)  # (batch_size, input_seq_len, layers_size)
        x *= tf.math.sqrt(tf.cast(self.layers_size, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
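            # Call the layer itself rather than its call() method, so that Keras runs its own bookkeeping
            x = self.enc_layers[i](x, training=training, mask=mask)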

        return x  # (batch_size, input_seq_len, layers_size)

###Decoder

The single layer of the decoder transformer.

In [None]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self,
                 layers_size: int,
                 num_heads: int,
                 dff: int,
                 dropout=0.1) -> None:
        super(DecoderLayer, self).__init__()

        self.mha1 = tf.keras.layers.MultiHeadAttention(num_heads, layers_size)
        self.mha2 = tf.keras.layers.MultiHeadAttention(num_heads, layers_size)

        self.ffn = tf.keras.Sequential([
              tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
              tf.keras.layers.Dense(layers_size)  # (batch_size, seq_len, d_model)
            ])

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(dropout)
        self.dropout2 = tf.keras.layers.Dropout(dropout)
        self.dropout3 = tf.keras.layers.Dropout(dropout)

    @tf.function()
    def call(self,
             dst_tokens: tf.Tensor,
             enc_output: tf.Tensor,
             training: bool,
             look_ahead_mask: tf.Tensor,
             padding_mask: tf.Tensor):
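        # enc_output.shape == (batch_size, input_seq_len, layers_size)
        # tf.keras.layers.MultiHeadAttention expects 1 where attention is allowed,
        # while our masks hold 1 at the positions to hide, so we invert the masks here.
        self_attn_mask = None if look_ahead_mask is None else 1.0 - look_ahead_mask
        cross_attn_mask = None if padding_mask is None else 1.0 - padding_mask

        attn1, attn_weights_block1 = self.mha1(dst_tokens, dst_tokens, dst_tokens,
                                               attention_mask=self_attn_mask,
                                               return_attention_scores=True)  # (batch_size, target_seq_len, layers_size)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + dst_tokens)

        attn2, attn_weights_block2 = self.mha2(out1, enc_output, enc_output,
                                               attention_mask=cross_attn_mask,
                                               return_attention_scores=True)  # (batch_size, target_seq_len, layers_size)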
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, layers_size)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, layers_size)

        return out3, attn_weights_block1, attn_weights_block2

In [None]:
class DecoderTransformer(tf.keras.layers.Layer):

    def __init__(self,
                 num_layers: int,
                 layers_size: int,
                 num_heads: int,
                 dff: int,
                 target_vocab_size: int,
                 maximum_position_encoding: int, dropout=0.1) -> None:
        super(DecoderTransformer, self).__init__()

        self.layers_size = layers_size
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, layers_size)
        self.pos_encoding = positional_encoding(maximum_position_encoding, layers_size)

        self.dec_layers = [DecoderLayer(layers_size, num_heads, dff, dropout) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout)

    def call(self,
             dst_tokens: tf.Tensor,
             enc_output: tf.Tensor,
             training: bool,
             look_ahead_mask: tf.Tensor,
             padding_mask: tf.Tensor) -> (tf.Tensor, tf.Tensor):
        seq_len = tf.shape(dst_tokens)[1]
        attention_weights = {}

        x = self.embedding(dst_tokens)  # (batch_size, target_seq_len, layers_size)
        x *= tf.math.sqrt(tf.cast(self.layers_size, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)
            # x = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)

            attention_weights[f'decoder_layer{i + 1}_block1'] = block1
            attention_weights[f'decoder_layer{i + 1}_block2'] = block2

        return x, attention_weights

In [None]:
sample_encoder = EncoderTransformer(num_layers=8, layers_size=512, num_heads=8,
                         dff=2048, src_vocab_size=8500,
                         maximum_position_encoding=10000)
temp_input = tf.random.uniform((64, 62), dtype=tf.int64, minval=0, maxval=200)

sample_encoder_output = sample_encoder(temp_input, training=False, mask=None)

print(sample_encoder_output.shape)  # (batch_size, input_seq_len, d_model)

sample_decoder = DecoderTransformer(num_layers=8, layers_size=512, num_heads=8,
                         dff=2048, target_vocab_size=8000,
                         maximum_position_encoding=5000)
temp_input = tf.random.uniform((64, 26), dtype=tf.int64, minval=0, maxval=200)

output, attn = sample_decoder(temp_input,
                              enc_output=sample_encoder_output,
                              training=False,
                              look_ahead_mask=None,
                              padding_mask=None)

output.shape, attn['decoder_layer2_block2'].shape, sample_encoder_output.shape

(64, 62, 512)


(TensorShape([64, 26, 512]),
 TensorShape([64, 8, 26, 62]),
 TensorShape([64, 62, 512]))

##Build the model

The model consists of the encoder, decoder and a final linear layer. The output of the decoder is the input to the linear layer and its output is returned.

In [50]:
class TransformerNMT(tf.keras.Model):

    def __init__(self,
                 encoder: tf.keras.layers.Layer,
                 decoder: tf.keras.layers.Layer,
                 dst_v_size: int,
                 lan_src: str = "english",
                 lan_dst: str = "italian") -> None:
        super(TransformerNMT, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.lan_src = lan_src
        self.lan_dst = lan_dst
        self.final_layer = tf.keras.layers.Dense(dst_v_size)

    def __create_masks(self, src: tf.Tensor, dst: tf.Tensor) -> (tf.Tensor, tf.Tensor):
        # Encoder padding mask
        enc_padding_mask = create_padding_mask(src)

        # Used in the 2nd attention block in the decoder.
        # This padding mask is used to mask the encoder outputs.
        dec_padding_mask = create_padding_mask(src)

        # Used in the 1st attention block in the decoder.
        # It is used to pad and mask future tokens in the input received by
        # the decoder.
        look_ahead_mask = create_look_ahead_mask(tf.shape(dst)[1])
        dec_target_padding_mask = create_padding_mask(dst)
        look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

        return enc_padding_mask, look_ahead_mask, dec_padding_mask

    def call(self, inputs: list, training: bool) -> (tf.Tensor, tf.Tensor):
        # Keras models prefer if you pass all your inputs in the first argument
        src, dst = inputs

        enc_padding_mask, look_ahead_mask, dec_padding_mask = self.__create_masks(src, dst)

        enc_output = self.encoder(src, training, enc_padding_mask)  # (batch_size, inp_seq_len, layers_size)

        dec_output, attention_weights = self.decoder(dst, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights

##Optimizer

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, layers_size, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.layers_size = layers_size
        self.warmup_steps = warmup_steps

    def get_config(self):
        # Return the constructor arguments so that the schedule can be serialized
        return {"layers_size": self.layers_size, "warmup_steps": self.warmup_steps}

    def __call__(self, step):
        # The optimizer may pass an integer step, while rsqrt needs a float
        step = tf.cast(step, tf.float32)
        layers_size = tf.cast(self.layers_size, tf.float32)

        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(layers_size) * tf.math.minimum(arg1, arg2)
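
The schedule above grows linearly during the warm-up phase and then decays with the inverse square root of the step. A plain Python sketch of the same formula makes this easy to verify; the function name `warmup_learning_rate` is our own, and the defaults simply repeat the values used in this notebook (layer size 512, 4000 warm-up steps).

```python
import math


def warmup_learning_rate(step: int, layers_size: int = 512, warmup_steps: int = 4000) -> float:
    """layers_size^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return layers_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 2000, 4000, 8000, 40000):
    print(step, warmup_learning_rate(step))
```

The two arguments of `min` coincide at `step == warmup_steps`, which is where the learning rate reaches its peak.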

##Loss and metrics

In [None]:
def loss_function(real, pred):
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)


def accuracy_function(real, pred):
    accuracies = tf.equal(real, tf.argmax(pred, axis=2, output_type=tf.int32))

    mask = tf.math.logical_not(tf.math.equal(real, 0))
    accuracies = tf.math.logical_and(mask, accuracies)

    accuracies = tf.cast(accuracies, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    return tf.reduce_sum(accuracies) / tf.reduce_sum(mask)
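
Both functions average only over the real tokens, so padding affects neither the loss nor the accuracy. The NumPy sketch below reproduces the masking step on hand-written numbers; the helper names and the toy values are our own, and the per-token losses are given directly instead of being computed from logits.

```python
import numpy as np


def masked_mean(per_token_loss: np.ndarray, labels: np.ndarray, pad_id: int = 0) -> float:
    """Average the per-token loss over the non-pad positions only."""
    mask = (labels != pad_id).astype(np.float32)
    return float((per_token_loss * mask).sum() / mask.sum())


def masked_accuracy(labels: np.ndarray, predicted_ids: np.ndarray, pad_id: int = 0) -> float:
    """Fraction of non-pad positions where the predicted id equals the label."""
    mask = labels != pad_id
    hits = np.logical_and(mask, labels == predicted_ids)
    return float(hits.sum() / mask.sum())


labels = np.array([[5, 7, 0, 0]])                  # two real tokens, two pads
per_token_loss = np.array([[1.0, 2.0, 3.0, 4.0]])  # the losses on the pads must be ignored
predicted_ids = np.array([[5, 9, 0, 0]])           # first token right, second wrong

print(masked_mean(per_token_loss, labels))
print(masked_accuracy(labels, predicted_ids))
```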

In [None]:
with strategy.scope():
  train_loss = tf.keras.metrics.Mean(name='train_loss')
  train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

##Trainer

In [48]:
class Trainer:

    def __init__(self, transformer: TransformerNMT):
        learning_rate = CustomSchedule(transformer.decoder.layers_size)
        self.optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
        self.transformer = transformer

    @tf.function()
    def __train_step(self, src: tf.Tensor, dst: tf.Tensor) -> None:
        dst_inp = dst[:, :-1]
        dst_real = dst[:, 1:]

        @tf.function()
        def step_fn(dst_inp, dst_real):
            with tf.GradientTape() as tape:
                predictions, _ = self.transformer([src, dst_inp], training=True)
                loss = loss_function(dst_real, predictions)

            gradients = tape.gradient(loss, self.transformer.trainable_variables)
            self.optimizer.apply_gradients(zip(gradients, self.transformer.trainable_variables))

            train_loss(loss)
            train_accuracy(accuracy_function(dst_real, predictions))

        strategy.run(step_fn, args=(dst_inp, dst_real))

    def train(self, epochs: int, tr_batches) -> None:
        for epoch in range(epochs):
            start = time.time()

            train_loss.reset_states()
            train_accuracy.reset_states()

            for (batch, (src, dst)) in enumerate(tr_batches):
                self.__train_step(src, dst)

                if batch % 50 == 0:
                    print(
                        f'Epoch {epoch + 1} Batch {batch} '
                        f'Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

            if (epoch + 1) % 5 == 0:
                pass
                # print("save")

            print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

            print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')
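
The training step relies on teacher forcing: the decoder receives the target sentence without its last token and must predict the same sentence shifted one position to the left. A small NumPy sketch of the two slices used in `__train_step` (the token ids are made up, with 101 and 102 standing in for the start and end tokens and 0 for padding):

```python
import numpy as np

# One toy target sentence: [start, w1, w2, end, pad, pad]
dst = np.array([[101, 11, 12, 102, 0, 0]])

dst_inp = dst[:, :-1]   # what the decoder sees
dst_real = dst[:, 1:]   # what the decoder has to predict at each position

print(dst_inp)
print(dst_real)
```

At position `i` the decoder sees `dst_inp[:, :i + 1]` (thanks to the look-ahead mask) and is scored against `dst_real[:, i]`.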

##Train the model

In [None]:
# Setup the hyperparameters
num_layers = 6
layers_size = 512
dff = 2048
num_heads = 8

# Build the model
encoder = EncoderTransformer(num_layers, layers_size, num_heads, dff, v_size_en, 30)
decoder = DecoderTransformer(num_layers, layers_size, num_heads, dff, v_size_it, 30)
model = TransformerNMT(encoder, decoder, v_size_it)

# Build the trainer and train the model
trainer = Trainer(model)
trainer.train(3, tr_batches)

Epoch 1 Batch 0 Loss 10.3691 Accuracy 0.0000
Epoch 1 Batch 50 Loss 9.7687 Accuracy 0.1460
Epoch 1 Loss 9.3667 Accuracy 0.2218
Time taken for 1 epoch: 222.84 secs

Epoch 2 Batch 0 Loss 8.5318 Accuracy 0.3605
Epoch 2 Batch 50 Loss 8.1160 Accuracy 0.3697
Epoch 2 Loss 7.6993 Accuracy 0.3895
Time taken for 1 epoch: 39.28 secs

Epoch 3 Batch 0 Loss 6.7043 Accuracy 0.4573
Epoch 3 Batch 50 Loss 6.0910 Accuracy 0.4972
Epoch 3 Loss 5.5951 Accuracy 0.5302
Time taken for 1 epoch: 39.29 secs



In [None]:
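model.save_weights("nmt_transformer_transformer.h5")

transformer_model = TransformerNMT(encoder, decoder, v_size_it)

# Run a dummy batch so that the variables exist before loading the weights.
# Sequences must not be longer than the maximum position encoding (30) used above.
temp_input = tf.random.uniform((64, 30), dtype=tf.int64, minval=0, maxval=200)
temp_target = tf.random.uniform((64, 30), dtype=tf.int64, minval=0, maxval=200)

fn_out, _ = transformer_model([temp_input, temp_target], training=False)

# Load from the same file name that was passed to save_weights
transformer_model.load_weights("nmt_transformer_transformer.h5")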

##Translator

In [None]:
class Translator(tf.Module):
    def __init__(self, src_tokenizer: BertTokenizer, targ_tokenizer: BertTokenizer, transformer: TransformerNMT) -> None:
        super(Translator, self).__init__()
        self.src_tokenizer = src_tokenizer
        self.targ_tokenizer = targ_tokenizer
        self.transformer = transformer

    def __call__(self, sentence, max_length=30) -> str:
        sentence_tok = self.src_tokenizer(sentence, padding='max_length', return_tensors='tf', max_length=max_length,
                                          add_special_tokens=True).data['input_ids']

        encoder_input = tf.reshape(sentence_tok, [1, max_length])

        start_end = self.targ_tokenizer("", padding='max_length', return_tensors='np',
                                        max_length=3, add_special_tokens=True).data['input_ids']

        start = tf.convert_to_tensor([start_end[0, 0]], dtype=tf.int32)
        end = tf.convert_to_tensor([start_end[0, 1]], dtype=tf.int32)

        # `tf.TensorArray` is required here (instead of a python list) so that the
        # dynamic-loop can be traced by `tf.function`.
        output_array = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)
        output_array = output_array.write(0, start)

        for i in tf.range(max_length):
            output = tf.transpose(output_array.stack())
            predictions, _ = self.transformer([encoder_input, output], training=False)

            # select the last token from the seq_len dimension
            predictions = predictions[:, -1:, :]  # (batch_size, 1, vocab_size)

            predicted_id = tf.argmax(predictions, axis=-1, output_type=tf.int32)

            # concatentate the predicted_id to the output which is given to the decoder
            # as its input.
            output_array = output_array.write(i + 1, predicted_id[0])

            if predicted_id == end:
                break

        output = tf.transpose(output_array.stack())
        output = output.numpy().tolist()[0]
        # out = list()
        # for r in output:
        #    for e in r:
        #        out.append(e)
        
        # text = self.targ_tokenizer.convert_ids_to_tokens(out)
        text = self.targ_tokenizer.convert_ids_to_tokens(output)

        return self.targ_tokenizer.convert_tokens_to_string(text)
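
The loop in `__call__` is a greedy decoder: at each step it feeds the tokens produced so far to the model, keeps the most probable next token, and stops at the end token or after `max_length` steps. The sketch below isolates that control flow in plain Python. The model is replaced by a hypothetical `toy_next_token` function, so nothing here reflects the behaviour of the trained transformer.

```python
from typing import Callable, List


def greedy_decode(next_token: Callable[[List[int]], int],
                  start_id: int, end_id: int, max_length: int) -> List[int]:
    """Append the most likely next token until end_id is produced or max_length steps are done."""
    output = [start_id]
    for _ in range(max_length):
        token = next_token(output)   # stands in for argmax over the model's last-position logits
        output.append(token)
        if token == end_id:
            break
    return output


def toy_next_token(prefix: List[int]) -> int:
    # Hypothetical "model": always predicts the previous id plus one
    return prefix[-1] + 1


print(greedy_decode(toy_next_token, start_id=2, end_id=5, max_length=30))
print(greedy_decode(toy_next_token, start_id=2, end_id=5, max_length=2))
```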

Let's try the translator.

In [None]:
# Build the translator
translator = Translator(tokenizer_en, tokenizer_it, model)

# Translate some examples
out = translator("I want to beat you hard.")
print(out)

From here on, we plug in BERT as the encoder.

##BERT

In [45]:
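from transformers import TFBertModel


class EncoderBERT(tf.keras.layers.Layer):

    def __init__(self, bert: TFBertModel) -> None:
        super(EncoderBERT, self).__init__()  # must run before sublayers are assigned
        self.bert = bert

    def call(self, src_tokens: tf.Tensor, training: bool, mask: tf.Tensor) -> tf.Tensor:
        # mask has shape (batch_size, 1, 1, seq_len) with 1 on padding, while
        # BERT expects shape (batch_size, seq_len) with 1 on real tokens.
        attention_mask = tf.cast(1.0 - tf.squeeze(mask, axis=[1, 2]), tf.int32)
        output = self.bert(input_ids=src_tokens, attention_mask=attention_mask,
                           training=training)[0]  # last_hidden_state
        return output  # (batch_size, input_seq_len, BERT hidden size)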
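
The padding mask built by `create_padding_mask` and the attention mask expected by BERT follow opposite conventions: ours has shape `(batch_size, 1, 1, seq_len)` with 1 on padding, while BERT takes a `(batch_size, seq_len)` mask with 1 on real tokens. The conversion between the two is sketched here in NumPy on a toy mask; the helper name `to_bert_attention_mask` is our own.

```python
import numpy as np


def to_bert_attention_mask(padding_mask: np.ndarray) -> np.ndarray:
    """(batch, 1, 1, seq_len) with 1 = pad  ->  (batch, seq_len) with 1 = real token."""
    return (1 - np.squeeze(padding_mask, axis=(1, 2))).astype(np.int32)


# Toy padding mask for one sentence of length 5 whose last two tokens are padding
padding_mask = np.array([[0, 0, 0, 1, 1]], dtype=np.float32)[:, np.newaxis, np.newaxis, :]
attention_mask = to_bert_attention_mask(padding_mask)
print(attention_mask.shape)
print(attention_mask)
```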

In [51]:
encoder_bert = EncoderBERT(TFBertModel.from_pretrained("bert-base-uncased", trainable=False))
decoder = DecoderTransformer(6, 512, 8, 2048, v_size_it, 10000)
model_bert = TransformerNMT(encoder_bert, decoder, v_size_it)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [1]:
trainer = Trainer(model_bert)
trainer.train(5, tr_batches)

NameError: ignored