# English-to-Spanish translation with a sequence-to-sequence Transformer

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2021/05/26<br>
**Last modified:** 2021/05/26<br>
**Description:** Implementing a sequence-to-sequene Transformer and training it on a machine translation task.

## Introduction

In this example, we'll build a sequence-to-sequence Transformer model, which
we'll train on an English-to-Spanish machine translation task.

You'll learn how to:

- Vectorize text using the Keras `TextVectorization` layer.
- Implement a `TransformerEncoder` layer, a `TransformerDecoder` layer,
and a `PositionalEmbedding` layer.
- Prepare data for training a sequence-to-sequence model.
- Use the trained model to generate translations of never-seen-before
input sentences (sequence-to-sequence inference).

The code featured here is adapted from the book
[Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition)
(chapter 11: Deep learning for text).
The present example is fairly barebones, so for detailed explanations of
how each building block works, as well as the theory behind Transformers,
I recommend reading the book.

## Setup

In [1]:
from google.colab import drive
drive.mount('/content/drive')  # mount the drive

Mounted at /content/drive


In [2]:
!pip install transformers
!pip install sentencepiece
!pip install datasets

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
[K     |████████████████████████████████| 3.1 MB 5.5 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 45.1 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
[K     |████████████████████████████████| 59 kB 6.6 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 38.3 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 45.9 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attem

In [3]:
import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from datasets import load_dataset

In [4]:
# Detect hardware
try:
  tpu_resolver = tf.distribute.cluster_resolver.TPUClusterResolver() # TPU detection
except ValueError:
  tpu_resolver = None
  gpus = tf.config.experimental.list_logical_devices("GPU")

# Select appropriate distribution strategy
if tpu_resolver:
  tf.config.experimental_connect_to_cluster(tpu_resolver)
  tf.tpu.experimental.initialize_tpu_system(tpu_resolver)
  strategy = tf.distribute.experimental.TPUStrategy(tpu_resolver)
  print('Running on TPU ', tpu_resolver.cluster_spec().as_dict()['worker'])
elif len(gpus) > 1:
  strategy = tf.distribute.MirroredStrategy([gpu.name for gpu in gpus])
  print('Running on multiple GPUs ', [gpu.name for gpu in gpus])
elif len(gpus) == 1:
  strategy = tf.distribute.get_strategy() # default strategy that works on CPU and single GPU
  print('Running on single GPU ', gpus[0].name)
else:
  strategy = tf.distribute.get_strategy() # default strategy that works on CPU and single GPU
  print('Running on CPU')
  
print("Number of accelerators: ", strategy.num_replicas_in_sync)

INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.


INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.


INFO:tensorflow:Initializing the TPU system: grpc://10.80.4.122:8470


INFO:tensorflow:Initializing the TPU system: grpc://10.80.4.122:8470


INFO:tensorflow:Finished initializing TPU system.


INFO:tensorflow:Finished initializing TPU system.


INFO:tensorflow:Found TPU system:


INFO:tensorflow:Found TPU system:


INFO:tensorflow:*** Num TPU Cores: 8


INFO:tensorflow:*** Num TPU Cores: 8


INFO:tensorflow:*** Num TPU Workers: 1


INFO:tensorflow:*** Num TPU Workers: 1


INFO:tensorflow:*** Num TPU Cores Per Worker: 8


INFO:tensorflow:*** Num TPU Cores Per Worker: 8


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)


INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)


Running on TPU  ['10.80.4.122:8470']
Number of accelerators:  8


In [5]:
!git clone 'https://github.com/nikodallanoce/HLT/'

Cloning into 'HLT'...
remote: Enumerating objects: 467, done.[K
remote: Counting objects: 100% (467/467), done.[K
remote: Compressing objects: 100% (380/380), done.[K
remote: Total 467 (delta 249), reused 190 (delta 76), pack-reused 0[K
Receiving objects: 100% (467/467), 32.94 MiB | 14.89 MiB/s, done.
Resolving deltas: 100% (249/249), done.


In [6]:
from transformers import BertTokenizer, BertTokenizerFast, TFMT5EncoderModel, T5TokenizerFast, XLMTokenizer, TFT5EncoderModel, DistilBertTokenizerFast, TFDistilBertModel

import zipfile
with zipfile.ZipFile("/content/HLT/dataset/ita-eng.zip", 'r') as zip_ref:
    zip_ref.extractall("")

In [7]:
from tensorflow.python.ops.array_ops import boolean_mask
# test
# from tokenizer import *
#import tensorflow as tf


def create_patterns(name: str):
    with open(name, encoding='UTF-8') as datafile:
        sentences = datafile.readlines()
    return sentences

def extract_europarl():
    import tarfile
    fname = "/content/drive/Shareddrives/HLT/datasets/it-en.tar"
    if fname.endswith("tar.gz"):
        tar = tarfile.open(fname, "r:gz")
        tar.extractall()
        tar.close()
    elif fname.endswith("tar"):
        tar = tarfile.open(fname, "r:")
        tar.extractall()
        tar.close()


def create_dataset_euparl(name: str, start : float, end : float, src: str = "en", dst: str = "it") -> (list, list):
    extract_europarl()
    with open(name+".{0}".format(src), encoding="UTF-8") as datafile:
        src_set = datafile.readlines()

    with open(name+".{0}".format(dst), encoding="UTF-8") as datafile:
        dst_set = datafile.readlines()

    
    datasets_to_shuffle = list((zip(src_set, dst_set)))
    #np.random.shuffle(datasets_to_shuffle)
    start, end = int(len(src_set)*start) , int(len(src_set)*end)
    src_set, dst_set = zip(*datasets_to_shuffle)
    src_set = list(src_set[start:end])
    dst_set = list(dst_set[start:end])

    return src_set, dst_set


def split_set(dataset: tf.data.Dataset,
              tr: float = 0.8,
              val: float = 0.1,
              ts: float = 0.1,
              shuffle: bool = True) -> (tf.data.Dataset, tf.data.Dataset, tf.data.Dataset):
    if tr+val+ts != 1:
        raise ValueError("Train, validation and test partition not allowed with such splits")

    dataset_size = dataset.cardinality().numpy()
    if shuffle:
        dataset = dataset.shuffle(dataset_size)

    tr_size = int(tr * dataset_size)
    val_size = int(val * dataset_size)

    tr_set = dataset.take(tr_size)
    val_set = dataset.skip(tr_size).take(val_size)
    ts_set = dataset.skip(tr_size).skip(val_size)
    return tr_set, val_set, ts_set


def make_batches(dataset_src_dst: tf.data.Dataset, batch_size: int):
    return dataset_src_dst.cache().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)


def create_dataset_anki(name: str, preprocessed:bool):
    with open(name, encoding="UTF-8") as datafile:
        src_set = list()
        dst_set = list()
        for sentence in datafile:
            sentence = sentence.split("\t")
            src_set.append(sentence[0])
            if preprocessed:
                dst_set.append(sentence[1].split("\n")[0])
            else:
                dst_set.append(sentence[1])

    return src_set, dst_set

def dataset_merged(ds1, ds2):
    src_ds1, dst_ds1 = ds1
    src_ds2, dst_ds2 = ds2
    src = src_ds1 + src_ds2
    dst = dst_ds1 + dst_ds2

    return src, dst    

In [8]:
BUFFER_SIZE = 20000
DS_SIZE= 2**17
sl = 90
def format_dataset(src, trg):
    return ({"encoder_inputs": src, "decoder_inputs": trg[:, :-1]}, trg[:, 1:])

def make_dataset(dataset, batch_size):
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.prefetch(tf.data.experimental.AUTOTUNE).cache()

In [9]:
from transformers import XLNetTokenizerFast, ByT5Tokenizer, GPT2TokenizerFast, TFMT5EncoderModel, RobertaTokenizerFast
source_src= "google/t5-v1_1-small" #"DeepESP/gpt2-spanish"
target_src = "dbmdz/bert-base-italian-cased"
tokenizer_source = T5TokenizerFast.from_pretrained(source_src)
tokenizer_target = BertTokenizerFast.from_pretrained(target_src)

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.81k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/537 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/230k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/433 [00:00<?, ?B/s]

In [10]:
# test
#source_set, target_set = dataset_merged(create_dataset_anki("ita.txt", False), create_dataset_euparl("europarl-v7.it-en", size=0.22))
#source_set, target_set = create_dataset_euparl("europarl-v7.it-en", size=0.22)

sequence_length = sl
# Create the tokenizers and get the number of tokens
#logging.set_verbosity_error()  # suppress warnings for transformers
#source_set, target_set = create_dataset_anki("ita.txt", False)
tokenizer_source = T5TokenizerFast.from_pretrained(source_src)
tokenizer_target = BertTokenizerFast.from_pretrained(target_src)
  
v_size_src = tokenizer_source.vocab_size

v_size_trg = tokenizer_target.vocab_size

# Tokenize the dataset
def tokenize(entire_set):

  source_set, target_set = entire_set
  tokens_source = tokenizer_source(source_set, truncation=True, padding="max_length",
                              return_tensors="tf", max_length=sequence_length).data["input_ids"]
  tokens_source = tf.cast(tokens_source, dtype=tf.int32)                            
  tokens_target = tokenizer_target(target_set, add_special_tokens=True,
                              truncation=True, padding="max_length",
                              return_tensors="tf", max_length=sequence_length + 1).data["input_ids"]
  tokens_target = tf.cast(tokens_target, dtype=tf.int32)
  return tokens_source, tokens_target                   

In [11]:
dataset = tf.data.Dataset.from_tensor_slices(tokenize(create_dataset_anki("ita.txt", False)))
dataset = dataset.concatenate(tf.data.Dataset.from_tensor_slices(tokenize(create_dataset_euparl("europarl-v7.it-en", start=0, end=0.3))))
dataset = dataset.concatenate(tf.data.Dataset.from_tensor_slices(tokenize(create_dataset_euparl("europarl-v7.it-en", start=0.3, end=0.7))))
dataset = dataset.concatenate(tf.data.Dataset.from_tensor_slices(tokenize(create_dataset_euparl("europarl-v7.it-en", start=0.7, end=1))))

In [12]:
#dataset = tf.data.Dataset.from_tensor_slices((tokens_source, tokens_target))
#dataset = dataset.concatenate(dataset)  # build the tf dataset
tr_set, val_set, ts_set = split_set(dataset, 0.9, 0.05, 0.05)  # split the tf dataset
print(len(dataset))

2261155


In [13]:
tokenizer_source.vocab_size

32100

In [14]:
batch_size =  16 * strategy.num_replicas_in_sync
with strategy.scope():
  train_ds = make_dataset(tr_set, batch_size)
  val_ds = make_dataset(val_set, batch_size)
train_ds  

<CacheDataset shapes: ({encoder_inputs: (None, 90), decoder_inputs: (None, 90)}, (None, 90)), types: ({encoder_inputs: tf.int32, decoder_inputs: tf.int32}, tf.int32)>

In [15]:
#pe = PositionalEmbedding(30, 100, 128)

for inputs, targets in train_ds:
    print(inputs)
    print(targets)
    break;

{'encoder_inputs': <tf.Tensor: shape=(128, 90), dtype=int32, numpy=
array([[1615,  410,   34, ...,    0,    0,    0],
       [ 101,  398, 2459, ...,    0,    0,    0],
       [  86,  455,   12, ...,    0,    0,    0],
       ...,
       [  86,   48, 1445, ...,    0,    0,    0],
       [ 148,   31,  195, ...,    0,    0,    0],
       [ 101,   33, 2508, ...,    0,    0,    0]], dtype=int32)>, 'decoder_inputs': <tf.Tensor: shape=(128, 90), dtype=int32, numpy=
array([[  102,  1529,  1307, ...,     0,     0,     0],
       [  102,  2689,  3691, ...,     0,     0,     0],
       [  102,   435,  1027, ...,     0,     0,     0],
       ...,
       [  102,   369,  4130, ...,     0,     0,     0],
       [  102,   313,  7894, ...,     0,     0,     0],
       [  102,  4231, 24603, ...,     0,     0,     0]], dtype=int32)>}
tf.Tensor(
[[ 1529  1307   527 ...     0     0     0]
 [ 2689  3691   406 ...     0     0     0]
 [  435  1027   120 ...     0     0     0]
 ...
 [  369  4130  1307 ...     

## Building the model

Our sequence-to-sequence Transformer consists of a `TransformerEncoder`
and a `TransformerDecoder` chained together. To make the model aware of word order,
we also use a `PositionalEmbedding` layer.

The source sequence will be pass to the `TransformerEncoder`,
which will produce a new representation of it.
This new representation will then be passed
to the `TransformerDecoder`, together with the target sequence so far (target words 0 to N).
The `TransformerDecoder` will then seek to predict the next words in the target sequence (N+1 and beyond).

A key detail that makes this possible is causal masking
(see method `get_causal_attention_mask()` on the `TransformerDecoder`).
The `TransformerDecoder` sees the entire sequences at once, and thus we must make
sure that it only uses information from target tokens 0 to N when predicting token N+1
(otherwise, it could use information from the future, which would
result in a model that cannot be used at inference time).

In [16]:
dropout_rate= 0.2

In [17]:
from transformers import TFXLNetModel, TFGPT2Model, TFBertModel, TFRobertaModel
with strategy.scope():
  encoder = TFT5EncoderModel.from_pretrained(source_src)
print(encoder.config)  

Downloading:   0%|          | 0.00/294M [00:00<?, ?B/s]

Some layers from the model checkpoint at google/t5-v1_1-small were not used when initializing TFT5EncoderModel: ['decoder', 'lm_head']
- This IS expected if you are initializing TFT5EncoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFT5EncoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFT5EncoderModel were initialized from the model checkpoint at google/t5-v1_1-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5EncoderModel for predictions without further training.


T5Config {
  "_name_or_path": "google/t5-v1_1-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 1024,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 8,
  "num_heads": 6,
  "num_layers": 8,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.12.5",
  "use_cache": true,
  "vocab_size": 32128
}



In [18]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads

        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="elu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:  
            padding_mask = tf.cast(mask[:, tf.newaxis, tf.newaxis, :], dtype="int32")
        else:
            assert False
        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)


class PositionalEmbedding(layers.Layer):
  def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
      super(PositionalEmbedding, self).__init__(**kwargs)
      self.token_embeddings = layers.Embedding(
          input_dim= vocab_size, output_dim=embed_dim
      )
      self.position_embeddings = layers.Embedding(
          input_dim=sequence_length, output_dim=embed_dim
      )
      self.sequence_length = sequence_length
      self.vocab_size = vocab_size
      self.embed_dim = embed_dim

  def call(self, inputs):
      length = tf.shape(inputs)[-1] 
      positions = tf.range(start=0, limit=length, delta=1)
      embedded_tokens = self.token_embeddings(inputs)
      embedded_positions = self.position_embeddings(positions)
      return embedded_tokens + embedded_positions

  def compute_mask(self, inputs, mask=None):
      return tf.math.not_equal(inputs, 0)


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=dropout_rate
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=dropout_rate
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(latent_dim, activation="elu"), layers.Dropout(dropout_rate), layers.Dense(embed_dim)])
        #self.dense_proj_f = keras.Sequential(
        #    [layers.Dense(latent_dim, activation="elu"), layers.Dropout(dropout_rate), layers.Dense(embed_dim)])
        self.layernorm_1 = layers.LayerNormalization()
        #self.layernorm_1_f = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()

        #self.dropout_1 = layers.Dropout(dropout_rate)
        #self.dropout_2 = layers.Dropout(dropout_rate)

        self.resid1=layers.Add()
        self.resid2=layers.Add()
        self.resid3=layers.Add()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        
        #attention_output_1 = self.dropout_1(attention_output_1)
        out_1 = self.layernorm_1(self.resid1([inputs, attention_output_1]))
        #proj_output_f = self.dense_proj_f(out_1)
        #out_1 = self.layernorm_1_f(layers.Add()([out_1, proj_output_f]))

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask
        )
        #attention_output_2 = self.dropout_2(attention_output_2)
        out_2 = self.layernorm_2(self.resid2([out_1, attention_output_2]))

        proj_output = self.dense_proj(out_2)
        #proj_output = self.dropout_1(proj_output)
        return self.layernorm_3(self.resid3([out_2, proj_output]))

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)

Next, we assemble the end-to-end model.

In [19]:
latent_dim = 2048
num_heads = 8

def create_model(encoder, embed_dim, v_size_trg):
  encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="encoder_inputs")
  outputs = encoder(encoder_inputs)
  encoder_outputs = outputs.last_hidden_state

  decoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="decoder_inputs")
  encoded_seq_inputs = tf.keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
  x = PositionalEmbedding(sequence_length, v_size_trg, embed_dim)(decoder_inputs)
  x = layers.Dropout(dropout_rate)(x)
  for i in range(8):
    x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)

  decoder_outputs = layers.Dense(v_size_trg, activation="softmax")(x)
  decoder = tf.keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

  decoder_outputs = decoder([decoder_inputs, encoder_outputs])
  transformer = tf.keras.Model(
      [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
  )
  return transformer

## Training our model

We'll use accuracy as a quick way to monitor training progress on the validation data.
Note that machine translation typically uses BLEU scores as well as other metrics, rather than accuracy.

Here we only train for 1 epoch, but to get the model to actually converge
you should train for at least 30 epochs.

In [20]:
with strategy.scope():

  model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='/content/drive/Shareddrives/HLT/t5v11small_giga_tot.h5',
    monitor="val_loss",
    verbose=0,
    save_weights_only=True,
    mode="auto",
    save_freq="epoch",
    overwrite=True)

In [29]:
class CustomSchedule(tf.keras.callbacks.LearningRateScheduler):
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

In [27]:
d_model = 512
with strategy.scope():
  transformer = create_model(encoder, d_model, v_size_trg)
  learning_rate = CustomSchedule(d_model)

In [30]:
epochs = 8  # This should be at least 30 for convergence

with strategy.scope():
  opt = tf.keras.optimizers.Adam()
  train_ds = train_ds.shuffle(10**6)
  transformer.summary()
  transformer.compile(opt, loss = "sparse_categorical_crossentropy", metrics=["accuracy"])
transformer.fit(train_ds, epochs=1, validation_data = val_ds, callbacks=[learning_rate_scheduler , model_checkpoint_callback], shuffle=True)
#♦transformer.save_weights('/content/drive/Shareddrives/HLT/t5v11small_giga_tot.h5', overwrite=True)

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 encoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 decoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 tft5_encoder_model (TFT5Encode  TFBaseModelOutput(l  35332800   ['encoder_inputs[0][0]']         
 rModel)                        ast_hidden_state=(N                                               
                                one, None, 512),                                                  
                                 hidden_states=None                                     

AttributeError: ignored

In [None]:
tf.keras.utils.plot_model(
    transformer, to_file='model.png', show_shapes=True, dpi=90
)

## Decoding test sentences

Finally, let's demonstrate how to translate brand new English sentences.
We simply feed into the model the vectorized English sentence
as well as the target token `"[start]"`, then we repeatedly generated the next token, until
we hit the token `"[end]"`.

In [None]:
import shutil
shutil.move("/content/translator.h5", "/content/drive/Shareddrives/HLT/t5v11small_giga.h5")

In [None]:
transformer.save_weights('/content/drive/Shareddrives/HLT/t5v11small_giga.h5', overwrite=True)

In [22]:
#tokenizer_it = BertTokenizer.from_pretrained(ita_src)
with strategy.scope():
  transformer = create_model(encoder, 512, v_size_trg)
  transformer.load_weights('/content/drive/Shareddrives/HLT/t5v11small_giga_tot.h5')

In [None]:
def decode_sequence(input_sentence, tokenizer_source, tokenizer_target, transformer):
    #tokenized_input_sentence=input_sentence
    tokenized_input_sentence = tokenizer_source(input_sentence, return_tensors='tf', add_special_tokens=True, max_length = sequence_length, padding='max_length', truncation=True).data["input_ids"]
    decoded_sentence = "[CLS]"
    list_tokens=[decoded_sentence]
    for i in range(sequence_length):

        decoded_sentence = tokenizer_target.convert_tokens_to_string(list_tokens)
        tokenized_target_sentence = tokenizer_target(decoded_sentence, return_tensors='tf', add_special_tokens=False, max_length = sequence_length, padding='max_length').data['input_ids']
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = tokenizer_target.ids_to_tokens[sampled_token_index] #spa_index_lookup[sampled_token_index]
        
        #decoded_sentence += sampled_token
       
        if sampled_token == "[SEP]":
          decoded_sentence = tokenizer_target.convert_tokens_to_string(list_tokens[1:])
          break
        list_tokens.append(sampled_token)
    
    return list_tokens, decoded_sentence


In [None]:
from transformers import GPT2Tokenizer, TFGPT2Model, T5Tokenizer, BertTokenizer, XLNetTokenizer

tokenizer_source1 = T5TokenizerFast.from_pretrained(source_src)
tokenizer_target1 = BertTokenizer.from_pretrained(target_src)

In [None]:
to_translate = "I often go to the sea on the weekend, but this week I'm busy and I can't go there."
with strategy.scope():
    tokens, translated = decode_sequence(to_translate, tokenizer_source1, tokenizer_target1, transformer)
print([to_translate, translated])

In [None]:
from nltk.translate.bleu_score import sentence_bleu
from tqdm.notebook import tqdm_notebook
references = []
start, end= 349950, 350000

for sent, frase in zip(en_set[start:end], it_set[start:end]):
    
    frase = tokenizer_it(frase, add_special_tokens=False).data['input_ids']
    frase = tokenizer_it.convert_ids_to_tokens(frase)
    frase = tokenizer_it.convert_tokens_to_string(frase)
    references.append((sent, frase.split()))

score = 0
with strategy.scope():
  for en, it in tqdm_notebook(references):
      _, translated = decode_sequence(en)
      #print([translated, it])
      score += sentence_bleu([it], translated.split())
      print((it, translated.split()))
score = score / len(references)
print(score)

In [None]:
score

In [None]:
epochs = 5  # This should be at least 30 for convergence

with strategy.scope():
  opt = tf.keras.optimizers.Adam(learning_rate = 0.00005)
  #opt = tf.keras.optimizers.RMSprop(learning_rate=0.0001)
#  opt = tf.keras.optimizers.SGD(learning_rate = 0.05, momentum = 0.65)
  transformer.compile(opt, loss = "sparse_categorical_crossentropy", metrics=["accuracy"])
transformer.fit(train_ds, epochs=epochs, validation_data = val_ds, callbacks=[model_checkpoint_callback])