**Human Language Technologies Project**

**Authors:** Dalla Noce Niko, Ristori Alessandro

#HLT Project

This work is largely based on the TensorFlow tutorial https://www.tensorflow.org/text/tutorials/transformer. Our aim was to introduce BERT as an encoder in the model and to try combinations with different architectures (both RNNs and transformers).

##Setup

We need to install the transformers package to use the models and tokenizers from HuggingFace.

In [None]:
!pip install transformers
!pip install sentencepiece



Import the libraries needed for the project to work.

In [None]:
import logging
import time
import numpy as np
import matplotlib.pyplot as plt
import random
import tensorflow as tf
from tensorflow.keras import layers

The model is trained on TPUs, since they are optimized for tensor operations. To do so, we need Colab to assign us as many TPUs as possible.

In [None]:
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

In [None]:
logging.getLogger('tensorflow').setLevel(logging.ERROR)  # suppress warnings

##Preprocess the dataset

Let's define the function that loads the Anki dataset. Each line of the file holds a source sentence and its translation, separated by a tab.

In [None]:
def create_dataset_anki(name: str, preprocessed:bool) -> (list, list):
    with open(name, encoding="UTF-8") as datafile:
        src_set = list()
        dst_set = list()
        for sentence in datafile:
            sentence = sentence.split("\t")
            src_set.append(sentence[0])
            if preprocessed:
                dst_set.append(sentence[1].split("\n")[0])
            else:
                dst_set.append(sentence[1])

    return src_set, dst_set

We assume that the dataset was uploaded to Colab as a zip file. We extract it and then build our lists using the previous function.

In [None]:
import zipfile
with zipfile.ZipFile("dataset_anki_it.zip", 'r') as zip_ref:
    zip_ref.extractall("")

en_set, it_set = create_dataset_anki("ita_preprocessed.txt", True)
print("The corpus' size is: {0}".format(len(en_set)))

The corpus' size is: 352040


##Build the dataset

Before we create the dataset from our lists, we have to tokenize each sentence in the corpus, using the BERT tokenizer for English and the one for Italian. We also read the vocabulary size of each tokenizer, for both source and target.

In [None]:
from transformers import BertTokenizer

logging.getLogger("transformers").setLevel(logging.ERROR)  # suppress warning for transformers

# Create the tokenizers and get the number of tokens
tokenizer_en = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer_it = BertTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
v_size_en = tokenizer_en.vocab_size
v_size_it = tokenizer_it.vocab_size

print("Number of tokens for the english dataset: {0}".format(v_size_en))
print("Number of tokens for the italian dataset: {0}".format(v_size_it))



Number of tokens for the english dataset: 30522
Number of tokens for the italian dataset: 31102


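Let's calculate the maximum sentence length allowed. It is set to the mean length plus three standard deviations, so that roughly 99% of the sentences in the dataset fit without truncation. Note that the length is counted here in whitespace-separated words, not in BERT tokens.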

In [None]:
def set_max_tokens(dataset: list, language: str = "en") -> int:
    len_sentences = [len(sentence.split()) for sentence in dataset[:351000]]
    # plt.boxplot(len_sentences)
    mean_len_sentences = np.mean(len_sentences)
    print("{0} dataset average sentence length: {1}".format(language, mean_len_sentences))
    max_length = int(mean_len_sentences + 3 * np.std(len_sentences))
    print("{0} dataset max length allowed: {1}".format(language, max_length))
    return max_length

max_length_en = set_max_tokens(en_set, "en")
max_length_it = set_max_tokens(it_set, "it")

en dataset average sentence length: 5.492988603988604
en dataset max length allowed: 11
it dataset average sentence length: 5.35585754985755
it dataset max length allowed: 11
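
How much of the data a mean-plus-three-standard-deviations threshold really covers can be checked directly. The following is a minimal sketch on a made-up list of sentence lengths (the values in `lengths` and the helper names `length_threshold` and `coverage` are illustrative assumptions, not taken from the corpus). It computes the threshold the same way as `set_max_tokens` and then measures which fraction of sentences fits.

```python
import statistics

def length_threshold(lengths):
    """Return int(mean + 3 * population std), mirroring set_max_tokens."""
    mean = statistics.fmean(lengths)
    std = statistics.pstdev(lengths)  # population std, like np.std's default
    return int(mean + 3 * std)

def coverage(lengths, threshold):
    """Fraction of sentences whose length does not exceed the threshold."""
    return sum(1 for n in lengths if n <= threshold) / len(lengths)

# Hypothetical word counts per sentence (illustrative only).
lengths = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 25]
threshold = length_threshold(lengths)
print(threshold, coverage(lengths, threshold))
```

On skewed length distributions the covered fraction can differ from the value expected under a normal distribution, so it is worth printing it for the real corpus as well.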


Tokenize the source and target dataset.

In [None]:
# Tokenize the dataset
tokens_en = tokenizer_en(en_set[:351000], add_special_tokens=True,
                          truncation=True, padding="max_length", return_attention_mask=True,
                          return_tensors="tf", max_length=30).data["input_ids"]
tokens_it = tokenizer_it(it_set[:351000], add_special_tokens=True,
                          truncation=True, padding="max_length", return_attention_mask=True,
                          return_tensors="tf", max_length=31).data["input_ids"]
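
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a small pure-Python sketch of what happens to a list of token ids. The helper `pad_or_truncate` is hypothetical and only approximates the tokenizer's behaviour; the ids are those printed for "You've been selected." in the sample output of this notebook.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Right-pad ids with pad_id, or shorten them while keeping the closing [SEP]."""
    if len(ids) > max_length:
        ids = ids[:max_length - 1] + ids[-1:]  # keep the final special token
    return ids + [pad_id] * (max_length - len(ids))

# 101 = [CLS], 102 = [SEP] in the English vocabulary
ids = [101, 2017, 1005, 2310, 2042, 3479, 1012, 102]

print(pad_or_truncate(ids, 12))  # [101, 2017, 1005, 2310, 2042, 3479, 1012, 102, 0, 0, 0, 0]
print(pad_or_truncate(ids, 5))   # [101, 2017, 1005, 2310, 102]
```

Every sentence therefore ends up with exactly `max_length` ids, which is what allows the whole corpus to be stacked into one tensor.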

In [None]:
for _ in range(5):
  i = np.random.randint(len(tokens_en))
  print("En sentence: {0}\nTokenized sentence: {1}".format(en_set[i], tokens_en[i]))
  print("It sentence: {0}\nTokenized sentence: {1}\n".format(it_set[i], tokens_it[i]))

En sentence: You've been selected.
Tokenized sentence: [ 101 2017 1005 2310 2042 3479 1012  102    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]
It sentence: È stata selezionata.
Tokenized sentence: [  102   696   749 27389   697   103     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0]

En sentence: You're too young to die.
Tokenized sentence: [ 101 2017 1005 2128 2205 2402 2000 3280 1012  102    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]
It sentence: Siete troppo giovani per morire.
Tokenized sentence: [ 102 5464 1740 2420  156 4167  697  103    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0]

En sentence: Do you know those people?
Tokenized sentence: [ 101 2079 2017 2113 2216 2111 1029  102    0    

Then we build the tf dataset and split it into training, validation and test sets.

In [None]:
def split_set(dataset: tf.data.Dataset,
              tr: float = 0.8,
              val: float = 0.1,
              ts: float = 0.1,
              shuffle: bool = True) -> (tf.data.Dataset, tf.data.Dataset, tf.data.Dataset):
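    if abs(tr + val + ts - 1.0) > 1e-9:  # tolerate floating-point rounding
        raise ValueError("The train, validation and test fractions must sum to 1")

    dataset_size = dataset.cardinality().numpy()
    if shuffle:
        # Keep one fixed order: reshuffling on every iteration would let the splits overlap.
        dataset = dataset.shuffle(dataset_size, reshuffle_each_iteration=False)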

    tr_size = int(tr * dataset_size)
    val_size = int(val * dataset_size)

    tr_set = dataset.take(tr_size)
    val_set = dataset.skip(tr_size).take(val_size)
    ts_set = dataset.skip(tr_size).skip(val_size)
    return tr_set, val_set, ts_set
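
The arithmetic of the split can be sanity-checked without TensorFlow. The sketch below uses plain Python lists and a hypothetical helper `split_indices`. It shuffles the indices once and then slices them, so the three partitions are disjoint by construction. The sizes are computed exactly as in `split_set`.

```python
import random

def split_indices(n, tr=0.8, val=0.1, seed=0):
    """Shuffle indices 0..n-1 once, then slice into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # one fixed permutation
    tr_size = int(tr * n)
    val_size = int(val * n)
    train = indices[:tr_size]
    valid = indices[tr_size:tr_size + val_size]
    test = indices[tr_size + val_size:]   # whatever remains
    return train, valid, test

train, valid, test = split_indices(351000)
print(len(train), len(valid), len(test))  # 280800 35100 35100
```

The test partition takes whatever is left after the first two, so rounding in `int(...)` never drops an example.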

In [None]:
# Build the dataset and split it in train, validation and test
dataset = tf.data.Dataset.from_tensor_slices((tokens_en, tokens_it))  # build the tf dataset
tr_set, val_set, ts_set = split_set(dataset, 0.8, 0.1, 0.1)  # split the tf dataset
print("Training set size: {0}".format(len(tr_set)))
print("Validation set size: {0}".format(len(val_set)))
print("Test set size: {0}".format(len(ts_set)))

Training set size: 280800
Validation set size: 35100
Test set size: 35100


After building the development set (training and validation) and the test set, we need to split the training and validation sets into batches.

In [None]:
def __format_dataset(eng, ita):
    return ({"encoder_inputs": eng, "decoder_inputs": ita[:, :-1],}, ita[:, 1:])

def make_batches(dataset_src_dst: tf.data.Dataset, batch_size: int) -> tf.data.Dataset:
    dataset = dataset_src_dst.batch(batch_size)
    dataset = dataset.map(__format_dataset)
    # Cache the mapped batches first, then prefetch so input preparation overlaps with training.
    return dataset.cache().prefetch(tf.data.experimental.AUTOTUNE)

In [None]:
with strategy.scope():
  tr_batches = make_batches(tr_set, 128)
  val_batches = make_batches(val_set, 128)

In [None]:
for src, dst in tr_batches.take(1):
  print("encoder inputs shape: {0}".format(src["encoder_inputs"].shape))
  print("decoder inputs shape: {0}".format(src["decoder_inputs"].shape))
  print("targets shape: {0}".format(dst.shape))

encoder inputs shape: (128, 30)
decoder inputs shape: (128, 30)
targets shape: (128, 30)
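
The two slices in `__format_dataset` implement teacher forcing: the decoder sees the sentence without its last position and must predict the same sentence shifted one step to the left. This is also why the Italian side is tokenized to 31 positions while the shapes above show 30. A minimal NumPy sketch, using the first ids of one Italian sentence from the sample output (cut to 9 positions for brevity):

```python
import numpy as np

# 102 = [CLS], 103 = [SEP], 0 = [PAD] in the Italian vocabulary
ita = np.array([[102, 5464, 1740, 2420, 156, 4167, 697, 103, 0]])

decoder_inputs = ita[:, :-1]  # what the decoder reads
targets = ita[:, 1:]          # what it must predict, one step ahead

for seen, expected in zip(decoder_inputs[0].tolist(), targets[0].tolist()):
    print(f"{seen:>5} -> {expected}")
```

Each row of the printout pairs the token the decoder has just read with the token it is trained to emit next.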


##Positional embeddings layer

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, v_size, embed_dim, **kwargs):
        super(PositionalEmbedding, self).__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=v_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.v_size = v_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)
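
What the layer computes can be reproduced with two lookup tables. The following NumPy sketch uses toy sizes and random tables (all values are illustrative assumptions, nothing is learned here). It adds a row per token id to a row per position and builds the same padding mask as `compute_mask`.

```python
import numpy as np

rng = np.random.default_rng(0)
v_size, sequence_length, embed_dim = 10, 6, 4  # toy sizes, not the notebook's

token_table = rng.normal(size=(v_size, embed_dim))              # one row per token id
position_table = rng.normal(size=(sequence_length, embed_dim))  # one row per position

inputs = np.array([[5, 2, 7, 0, 0, 0]])  # 0 is the padding id
positions = np.arange(inputs.shape[-1])  # 0, 1, ..., length - 1

embedded = token_table[inputs] + position_table[positions]  # broadcast over the batch
mask = inputs != 0                                          # True for real tokens

print(embedded.shape)  # (1, 6, 4)
print(mask)
```

Note that the position table needs at least as many rows as the longest input sequence, otherwise the lookup for the last positions is out of range.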

##Encoder and decoder

As we know from the HLT course, NMT models are based on the encoder-decoder paradigm, so we have to build both. We based the architecture of these layers on the paper "Attention Is All You Need" by Vaswani et al.

###Encoder

A single layer of the transformer encoder.

In [None]:
class EncoderLayer(layers.Layer):

    def __init__(self, layers_size: int, dense_size: int, num_heads: int, dropout=0.1, **kwargs) -> None:
        super(EncoderLayer, self).__init__(**kwargs)
        
        self.layers_size = layers_size
        self.dense_size = dense_size
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(num_heads, layers_size)
        self.dense_proj = tf.keras.Sequential(
            [layers.Dense(dense_size, activation="relu"), layers.Dense(layers_size)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.dropout_1 = layers.Dropout(dropout)
        self.dropout_2 = layers.Dropout(dropout)
        self.supports_masking = True

    def call(self, inputs: tf.Tensor, mask=None) -> tf.Tensor:
        if mask is None:
            raise ValueError("EncoderLayer needs a padding mask, but none was propagated")
        padding_mask = tf.cast(mask[:, tf.newaxis, tf.newaxis, :], dtype="int32")
        
        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        attention_output = self.dropout_1(attention_output)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        proj_output = self.dropout_2(proj_output)
        return self.layernorm_2(proj_input + proj_output)

In [None]:
class EncoderTransformer(layers.Layer):

    def __init__(self,
                 num_layers: int,
                 layers_size: int,
                 dense_size: int,
                 num_heads: int,
                 max_length: int,
                 v_size_src: int,
                 dropout: float = 0.1) -> None:
        super(EncoderTransformer, self).__init__()

        self.layers_size = layers_size
        self.num_layers = num_layers
        self.pos_embedding = PositionalEmbedding(max_length, v_size_src, layers_size)
        self.enc_layers = [EncoderLayer(layers_size, dense_size, num_heads, dropout) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout)
        self.supports_masking = True

    def call(self, inputs: tf.Tensor, mask=None) -> tf.Tensor:
        src_embeddings = self.pos_embedding(inputs)
        enc_out = self.dropout(src_embeddings)
        for i in range(self.num_layers):
            enc_out = self.enc_layers[i](enc_out)

        return enc_out  # (batch_size, input_seq_len, layers_size)
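
The padding mask passed to `MultiHeadAttention` tells each query which key positions it may attend to. As a rough illustration, here is a single-head NumPy sketch of scaled dot-product attention with such a mask. It omits the learned projections and the multiple heads of the Keras layer, and the function name `masked_attention` is an assumption made for this example.

```python
import numpy as np

def masked_attention(query, key, value, mask):
    """Scaled dot-product attention; mask is true where attention is allowed."""
    scores = query @ key.swapaxes(-1, -2) / np.sqrt(query.shape[-1])
    scores = np.where(mask.astype(bool), scores, -1e9)    # block masked positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8))                     # (batch, seq_len, depth)
token_ids = np.array([[101, 2017, 102, 0]])        # last position is padding
padding_mask = (token_ids != 0)[:, np.newaxis, :]  # (batch, 1, seq_len)

output, weights = masked_attention(x, x, x, padding_mask)
print(weights[0].round(3))  # the last column is all zeros
```

Because the mask is broadcast over the query axis, no position spends attention weight on the padding token.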

###Decoder

A single layer of the transformer decoder.

In [None]:
class DecoderLayer(layers.Layer):

    def __init__(self, layers_size: int, dense_size: int, num_heads: int, dropout=0.1, **kwargs) -> None:
        super(DecoderLayer, self).__init__(**kwargs)
        
        self.layers_size = layers_size
        self.dense_size = dense_size
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(num_heads, layers_size)
        self.attention_2 = layers.MultiHeadAttention(num_heads, layers_size)
        self.dense_proj = tf.keras.Sequential(
            [layers.Dense(dense_size, activation="relu"), layers.Dense(layers_size)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.dropout_1 = layers.Dropout(dropout)
        self.dropout_2 = layers.Dropout(dropout)
        self.dropout_3 = layers.Dropout(dropout)
        self.supports_masking = True

    def call(self, inputs: tf.Tensor, encoder_outputs: tf.Tensor, mask=None) -> tf.Tensor:
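        causal_mask = self.get_causal_attention_mask(inputs)
        # Default to no mask, so the second attention call never reads an undefined name.
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)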

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        attention_output_1 = self.dropout_1(attention_output_1)
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.dropout_2(attention_output_2)
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        proj_output = self.dropout_3(proj_output)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)
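
The causal mask built by `get_causal_attention_mask` is a lower-triangular matrix repeated for every element of the batch: position `i` may attend to position `j` only if `j` is not in the future. A NumPy equivalent, given here only as an illustrative sketch:

```python
import numpy as np

def causal_mask(batch_size, sequence_length):
    i = np.arange(sequence_length)[:, np.newaxis]  # query positions (rows)
    j = np.arange(sequence_length)                 # key positions (columns)
    mask = (i >= j).astype("int32")                # 1 where the key is not in the future
    return np.tile(mask[np.newaxis, :, :], (batch_size, 1, 1))

print(causal_mask(2, 4)[0])
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```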

In [None]:
class DecoderTransformer(layers.Layer):

    def __init__(self,
                 num_layers: int,
                 layers_size: int,
                 dense_size: int,
                 num_heads: int,
                 max_length: int,
                 v_size_dst: int,
                 dropout=0.1) -> None:
        super(DecoderTransformer, self).__init__()

        self.layers_size = layers_size
        self.num_layers = num_layers
        self.pos_embedding = PositionalEmbedding(max_length, v_size_dst, layers_size)
        self.dec_layers = [DecoderLayer(layers_size, dense_size, num_heads, dropout) for _ in range(num_layers)]
        self.dropout = layers.Dropout(dropout)
        self.supports_masking = True

    def call(self, inputs: tf.Tensor, enc_output: tf.Tensor, mask=None) -> tf.Tensor:
        dst_embeddings = self.pos_embedding(inputs)
        dec_output = self.dropout(dst_embeddings)
        for i in range(self.num_layers):
            dec_output = self.dec_layers[i](dec_output, enc_output)

        return dec_output

##Build the model

In [None]:
layers_size = 512
num_layers = 6
dense_size = 2048
num_heads = 8
sequence_length = 30  # padded length of the tokenized inputs

def create_model():
    # Encoder
    # The position table must cover every padded position, so it is sized with
    # sequence_length rather than max_length_en.
    encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="encoder_inputs")
    encoder_outputs = EncoderTransformer(num_layers, layers_size, dense_size, num_heads, sequence_length, v_size_en)(encoder_inputs)

    # Decoder
    decoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="decoder_inputs")
    encoded_seq_inputs = tf.keras.Input(shape=(None, layers_size), name="decoder_state_inputs")
    decoder_outputs = DecoderTransformer(num_layers, layers_size, dense_size, num_heads, sequence_length, v_size_it)(decoder_inputs, encoded_seq_inputs)
    decoder_outputs = layers.Dropout(0.4)(decoder_outputs)
    decoder_outputs = layers.Dense(v_size_it, activation="softmax")(decoder_outputs)
    decoder = tf.keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs, name="decoder_transformer")
    # decoder.summary()

    decoder_outputs = decoder([decoder_inputs, encoder_outputs])
    transformer = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs, name="transformer")
    # transformer.summary()
    return transformer

In [None]:
with strategy.scope():
    opt = tf.keras.optimizers.Nadam(learning_rate = 0.0001)
    transformer = create_model()
    transformer.summary()
    transformer.compile(opt, loss = "sparse_categorical_crossentropy", metrics=["accuracy"])

Model: "transformer"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(None, None)]       0                                            
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     [(None, None)]       0                                            
__________________________________________________________________________________________________
encoder_transformer (EncoderTra (None, None, 512)    78651904    encoder_inputs[0][0]             
__________________________________________________________________________________________________
decoder_transformer (Functional (None, None, 31102)  145318782   decoder_inputs[0][0]             
                                                                 encoder_transformer[0][

##Train and evaluate the model

In [None]:
transformer.fit(tr_batches, epochs=5, validation_data = val_batches)

Epoch 1/5




Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f76c6e686d0>

In [None]:
ts_loss, ts_accuracy = transformer.evaluate(make_batches(ts_set, 128))
print("Test loss: {0}\nTest accuracy: {1}".format(ts_loss, ts_accuracy))

Test loss: 0.133095845580101
Test accuracy: 0.8693563342094421


In [None]:
transformer.save_weights('./translator.h5', overwrite=True)

##Translator

In [None]:
max_decoded_sentence_length = 30
sequence_length = 30

def translate(input_sentence):
    # Encode the source sentence once, padded to the length the model was trained on.
    tokenized_input_sentence = tokenizer_en(input_sentence, return_tensors='tf', add_special_tokens=True,
                                            max_length=sequence_length, padding='max_length',
                                            truncation=True).data["input_ids"]
    list_tokens = ["[CLS]"]
    for i in range(max_decoded_sentence_length):
        # Re-tokenize what has been generated so far and ask the model for the next token.
        decoded_sentence = tokenizer_it.convert_tokens_to_string(list_tokens)
        tokenized_target_sentence = tokenizer_it(decoded_sentence, return_tensors='tf', add_special_tokens=False,
                                                 max_length=sequence_length, padding='max_length',
                                                 truncation=True).data['input_ids']
        predictions = transformer.predict([tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])  # greedy choice at position i
        sampled_token = tokenizer_it.ids_to_tokens[sampled_token_index]

        if sampled_token == "[SEP]":
            break
        list_tokens.append(sampled_token)

    # Drop the leading [CLS], whether or not [SEP] was produced.
    decoded_sentence = tokenizer_it.convert_tokens_to_string(list_tokens[1:])
    return list_tokens, decoded_sentence

In [None]:
for encoder, decoder_target in tr_batches.take(1):
  print(encoder["encoder_inputs"].shape)

tokens, translated = translate("Are you sure it's safe?")
print(translated)

(128, 30)




È sicuro che sia al sicuro ?
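
The loop in `translate` is a greedy decoder: at each step it keeps only the most probable next token and stops at `[SEP]`. The idea can be isolated from the model with a small pure-Python sketch. Here `toy_scores` is a hypothetical stand-in for the trained transformer that always continues a fixed sentence, and `greedy_decode` is a name chosen for this example.

```python
def greedy_decode(next_token_scores, start_token, end_token, max_steps):
    """Append the highest-scoring next token until end_token or max_steps."""
    tokens = [start_token]
    for _ in range(max_steps):
        scores = next_token_scores(tokens)  # dict: token -> score
        best = max(scores, key=scores.get)  # argmax over the vocabulary
        if best == end_token:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token

# Hypothetical stand-in for the model: it always continues this fixed sentence.
REFERENCE = ["È", "sicuro", "?", "[SEP]"]

def toy_scores(tokens):
    expected = REFERENCE[len(tokens) - 1]
    return {tok: (1.0 if tok == expected else 0.0) for tok in set(REFERENCE)}

print(greedy_decode(toy_scores, "[CLS]", "[SEP]", max_steps=10))
# ['È', 'sicuro', '?']
```

Greedy decoding is the simplest strategy. It never revisits an earlier choice, so a beam search could yield better translations at a higher cost.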


##BERT

In [None]:
from transformers import BertTokenizerFast, TFBertModel, TFMT5EncoderModel, T5TokenizerFast

encoder_bert = {
    "tokenizer": BertTokenizerFast.from_pretrained("bert-base-uncased"),
    "model": TFBertModel.from_pretrained("bert-base-uncased"),
}

encoder_T5 = {
    "tokenizer": T5TokenizerFast.from_pretrained("t5-small"),
    "model": TFMT5EncoderModel.from_pretrained("t5-small"),
}

class EncoderBERT(layers.Layer):

    def __init__(self, bert, **kwargs) -> None:
        super(EncoderBERT, self).__init__(**kwargs)

        self.bert = bert

    def call(self, inputs) -> tf.Tensor:
        # Attention mask: 1 for real tokens, 0 for padding (token id 0).
        mask = tf.cast(tf.math.not_equal(inputs, 0), tf.int32)
        # The first output is the last hidden state: (batch_size, seq_len, hidden_size).
        bert_output = self.bert(input_ids=inputs, attention_mask=mask)[0]
        return bert_output