<a href="https://colab.research.google.com/github/kyle-gao/ML_ipynb/blob/master/TF_Transformer_LinearTransformerAreRNNPaperExperiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright 2020 Yi Lin(Kyle) Gao


##### Copyright 2019 The TensorFlow Authors.

In this notebook, I will attempt to implement the attention mechanism from the paper
**Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention by Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret.** https://arxiv.org/abs/2006.16236

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [4]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import time

In [5]:
def positional_encoding(pos, d_model):
    """
    :param pos: int max position
    :param d_model: dimension of the model
    :return: (1,pos,d_model) array of sinusoidal positional encoding
    """
    pos_enc = np.zeros((1, pos, d_model))
    for p in range(pos):
        for i in range(d_model // 2):
            angles = p / np.power(10000, (2 * i) / np.float32(d_model))
            pos_enc[:, p, 2 * i] = np.sin(angles)
            pos_enc[:, p, 2 * i + 1] = np.cos(angles)
        if d_model % 2 == 1:
            # if d_model is odd loop doesn't hit last even index
            angles = p / np.power(10000, (2 * d_model) / np.float32(d_model))
            pos_enc[:, p, d_model - 1] = np.sin(angles)
    return tf.cast(pos_enc, tf.float32)

#Masks
The 5d array masks are for causal attention

In [6]:

def padding_mask5(seq):
    # Returns (batch, seq_len, 1, 1, 1 ) tensor with 0's where the sequence is padded, 1 where it is not

    mask = tf.cast(tf.math.not_equal(seq,0), tf.float32)
    #we apply mask to (m, j, h, d) <- to mask j
    return mask[:, :, tf.newaxis, tf.newaxis,tf.newaxis]  # (batch, seq_len, 1, 1) 

def forward_mask5(seq):
    """
    Calculates a combined forward mask and padding mask for a batch of sequences
    :param seq: (batch,seq_len) a batch of sequences
    :return:  a combined look_ahead_mask (lower triangular 1s)
    and padding mask (batch, seq_len, seq_len, 1, 1)
    """
    seq_len = tf.shape(seq)[1]

    look_ahead_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    look_ahead_mask = look_ahead_mask[tf.newaxis, :, :, tf.newaxis,tf.newaxis]  # (batch, seq_len, seq_len, 1)

    padded_mask = padding_mask5(seq)
    #forward mask is applied to (m, l, j ,h) 
    #return tf.minimum(padded_mask, look_ahead_mask)
    return padded_mask * look_ahead_mask


def padding_mask(seq):
    # Returns (batch, seq_len, 1, 1) tensor with 0's where the sequence is padded, 1 where it is not

    mask = tf.cast(tf.math.not_equal(seq,0), tf.float32)
    #we apply mask to (m, j, h, d) <- to mask j
    return mask[:, :, tf.newaxis, tf.newaxis]  # (batch, seq_len, 1, 1) 

def forward_mask(seq):
    """
    Calculates a combined forward mask and padding mask for a batch of sequences
    :param seq: (batch,seq_len) a batch of sequences
    :return:  a combined look_ahead_mask (lower triangular 1s)
    and padding mask (batch, seq_len, seq_len, 1)
    """
    seq_len = tf.shape(seq)[1]

    look_ahead_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    look_ahead_mask = look_ahead_mask[tf.newaxis, :, :, tf.newaxis]  # (batch, seq_len, seq_len, 1)

    padded_mask = padding_mask(seq)
    #forward mask is applied to (m, l, j ,h) 
    #return tf.minimum(padded_mask, look_ahead_mask)
    return padded_mask * look_ahead_mask

#Linear attention mechanism from Katharopoulos et al. 

Practice implementation: causal and non causal.

In [7]:
def elu(z):
  """elu feature map used by Katharopoulos et al."""
  return tf.nn.elu(z)+1
  
class MultiHeadAttentionCausalMasked(tf.keras.layers.Layer):
  """LinearAttention Mechanism from Transformers are RNNs: Fast Autoregressive Transformers 
  with Linear Attention by Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret.

  Uses linear feature maps f, to replace the softmax by a kernel k(x,y)->R+.
  so that f(x)*f(y) = k(x,y)
  
  The authors use the elu feature map.
  
  This version has causal i.e. forward masking. This cannot be implemented in the usual way due to the 
  Q*K term not existing in isolation in this Linear Attention.
  I have implemented it in clumsy way which makes this slower than the usual softmax attention by quite a bit.

  Tne authors of the paper implemented causal attention via a triangular tensor product (and its back prop) in c++.

  I have implemented it in clumsy way which makes this slower than the usual softmax attention by quite a bit
  by introducing an intermediate step with the dimensions of the Q*K product. 
  """

  def __init__(self,d_model,num_heads):
    super().__init__()
    self.d_model=d_model
    self.num_heads=num_heads
    
    assert d_model%self.num_heads==0
    
    
    self.depth=d_model//self.num_heads

    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)

    self.dense = tf.keras.layers.Dense(d_model)

  def split_heads(self,x, batch_size):
    """Split the last dimension into (num_heads,depth)
    Arguments:
    x -- A tokenized sequence (batch_size,seq_len,d_model)
    
    Returns:
    A tokenized sequence with dimensions (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth)) 
    return x

  def call(self, q, k, v, mask=None, eps=1e-8):

    batch_size = tf.shape(q)[0]

    q = self.wq(q) #(batch_size,len_q, dim_q) 
    k = self.wk(k) #(batch_size,len_v, dim_q) 
    v = self.wv(v) #(batch_size,len_v, dim_v) 

    q = elu(self.split_heads(q, batch_size))  # (batch_size, seq_len_q, num_heads, depth_q) (m,l,h,d)
    k = elu(self.split_heads(k, batch_size))  # (batch_size,  seq_len_v, num_heads, depth_q) (m,j,h,d)
    v = self.split_heads(v, batch_size)  # (batch_size,  seq_len_v, num_heads, depth_v) (m,j,h,e)

    k_reduced = tf.math.reduce_sum(k,axis=1) + 1e-8

    z = 1/(tf.einsum("mlhd,mhd->mlh", q, k_reduced)) #(batch_size, num_heads, seq_len_q)

    output = tf.einsum("mjhd,mjhe->mjehd",k,v) #(batch_size, len_v, depth_q, num_heads, depth_v)

    output = tf.einsum("mlhd,mjehd,mlh->mjlhe",q,output,z) #(batch_size, len_q, len_v, num_heads, depth_v)

    if mask is not None:
      output = output * mask #Mask must broadcast to j and l axis correctly
    
    output = tf.einsum("mjlhe->mlhe",output) 
  

    output = tf.reshape(output,(batch_size,-1,self.num_heads*self.depth)) #(batch_size,len_q, d_model)
    return output #(m,l,h*e)


class MultiHeadAttention(tf.keras.layers.Layer):
  """Linear Attention Mechanism from Transformers are RNNs: Fast Autoregressive Transformers 
  with Linear Attention by Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret.

  Uses linear feature maps f, to replace the softmax by a kernel k(x,y)->R+.
  so that f(x)*f(y) = k(x,y)
  
  This version lacks causal masking, is quite fast."""

  def __init__(self,d_model,num_heads):
    super().__init__()
    self.d_model=d_model
    self.num_heads=num_heads
    
    assert d_model%self.num_heads==0
    
    
    self.depth=d_model//self.num_heads

    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)

    self.dense = tf.keras.layers.Dense(d_model)

  def split_heads(self,x, batch_size):
    """Split the last dimension into (num_heads,depth)
    Arguments:
    x -- A tokenized sequence (batch_size,seq_len,d_model)
    
    Returns:
    A tokenized sequence with dimensions (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth)) 
    return x
  def __init__(self,d_model,num_heads):
    super().__init__()
    self.d_model=d_model
    self.num_heads=num_heads
    
    assert d_model%self.num_heads==0
    
    
    self.depth=d_model//self.num_heads

    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)

    self.dense = tf.keras.layers.Dense(d_model)

  def split_heads(self,x, batch_size):
    """Split the last dimension into (num_heads,depth)
    Arguments:
    x -- A tokenized sequence (batch_size,seq_len,d_model)
    
    Returns:
    A tokenized sequence with dimensions (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth)) 
    return x

  def call(self,q,k,v,mask=None):
    batch_size = tf.shape(q)[0]

    q = self.wq(q) #(batch_size,len_q, dim_q) 
    k = self.wk(k) #(batch_size,len_v, dim_q) 
    v = self.wv(v) #(batch_size,len_v, dim_v)

    q = elu(self.split_heads(q, batch_size))  # (batch_size, seq_len_q, num_heads, depth_q) (m,l,h,d)
    k = elu(self.split_heads(k, batch_size))  # (batch_size,  seq_len_k, num_heads, depth_q) (m,j,h,d)
    v = self.split_heads(v, batch_size)  # (batch_size,  seq_len_v, num_heads, depth_v) (m,j,h,e)


    kv = tf.einsum("mjhd,mjhe->mdeh",k,v) #(batch_size, depth_k, depth_v, seq_len_v)

    if mask is not None: #padding mask is (m,j,1,1)
                          #causal mask is (m,j,j,1) cannot be broadcast here
       k = k*mask

    #we contract k over the j axis and add an epsilon numerical stability.
    k_reduced = tf.math.reduce_sum(k,axis=1) + 1e-8

    z = 1/(tf.einsum("mlhd,mhd->mlh", q, k_reduced)) #(batch_size, num_heads, seq_len_q)

    output = tf.einsum("mlhd,mdeh,mlh->mlhe",q,kv,z) #(batch_size,len_q, heads, depth_v)
    output = tf.reshape(output,(batch_size,-1,self.num_heads*self.depth)) #(batch_size,len_q, d_model)

    return output

In [8]:
class EncoderLayer(tf.keras.layers.Layer):
    """The EncoderLayer consists of one MultiHeadAttention layer connected to a FeedForward layer,
    each of these 2 layers have a residual connection."""

    def __init__(self, num_heads, d_model, dense_dim, dropout=0.1, causal = False):
        super().__init__()
        if causal: 
          self.attention = MultiHeadAttentionCausalMasked(d_model, num_heads)         
        else:
          self.attention = MultiHeadAttention(d_model, num_heads)
        self.dense = tf.keras.Sequential([tf.keras.layers.Dense(dense_dim, activation='relu'),
                                          tf.keras.layers.Dense(d_model)])

        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

        self.dropout1 = tf.keras.layers.Dropout(dropout)
        self.dropout2 = tf.keras.layers.Dropout(dropout)

    def call(self, x, training, mask):
        out_attention = self.attention(x, x, x, mask)  # (batch_size,seq_len,d_model)
        out_attention = self.dropout1(out_attention, training=training)
        out1 = self.norm1(x + out_attention)  # residual connection (batch_size,seq_len,d_model)

        out_dense = self.dense(out1)  # (batch_size,seq_len,d_model)
        out2 = self.norm2(out1 + out_dense)  # residual conenction (batch_size,seq_len,d_model)
        return out2


class Encoder(tf.keras.layers.Layer):
    """The Encoder consists of EncoderLayer"""

    def __init__(self, num_layers, num_heads, d_model, dense_dim,
                 vocab_size, max_encoding_position, dropout=0.1, causal = False):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.num_layers = num_layers
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.positional_encoding = positional_encoding(max_encoding_position, d_model)
        self.encoding_layers = [EncoderLayer(num_heads, d_model, dense_dim, dropout, causal= causal) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout)

    def call(self, x, training, mask=None):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)  # (batch_size,input_len,d_model)
        x = x * tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x = x + self.positional_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            x = self.encoding_layers[i](x, training, mask)  # (batch_size, input_seq_len, d_model)

        return x
class DecoderLayer(tf.keras.layers.Layer):
    """A decoder layers consists of two MultiHeadAttention, one for the Decoder input, one from Encoder output"""
    def __init__(self, num_heads, d_model, dense_dim, dropout=0.1,causal=False):
        super().__init__()
        if causal:
          self.attention1 = MultiHeadAttentionCausalMasked(d_model, num_heads)
          self.attention2 = MultiHeadAttentionCausalMasked(d_model, num_heads)         
        else:
          self.attention1 = MultiHeadAttention(d_model, num_heads)
          self.attention2 = MultiHeadAttention(d_model, num_heads)

        self.dense = tf.keras.Sequential([tf.keras.layers.Dense(dense_dim, activation='relu'),
                                          tf.keras.layers.Dense(d_model)])

        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

        self.dropout1 = tf.keras.layers.Dropout(dropout)
        self.dropout2 = tf.keras.layers.Dropout(dropout)
        self.dropout3 = tf.keras.layers.Dropout(dropout)

    def call(self, encoder_out, x, training, forward_mask, padding_mask, ):

        out_attention1 = self.attention1(x, x, x,
                                         mask = forward_mask)  # (batch_size, seq_len_answer, d_model) -> The return seq_len is the same as that of the first argument of the call.
        out_attention1 = self.dropout1(out_attention1, training=training)
        out1 = self.norm1(x + out_attention1)  # residual connection (batch_size, seq_len_answer, d_model)

        out_attention2 = self.attention2(out1, encoder_out, encoder_out,
                                         padding_mask)  # (batch_size, seq_len_answer, d_model)
        out_attention2 = self.dropout2(out_attention2, training=training)
        out2 = self.norm2(out1 + out_attention2)

        out_dense = self.dense(out2)
        out_dense = self.dropout3(out_dense + out2)

        return out_dense


class Decoder(tf.keras.layers.Layer):
    """The Decoder consists of multiple DecoderLayer"""
    def __init__(self, num_layers, num_heads, d_model, dense_dim,
                 vocab_size, max_encoding_position, dropout=0.1, causal = False):
        super().__init__()

        self.num_heads = num_heads
        self.d_model = d_model
        self.num_layers = num_layers
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.positional_encoding = positional_encoding(max_encoding_position, d_model)
        self.decoder_layers = [DecoderLayer(num_heads, d_model, dense_dim, dropout, causal = causal) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout)

    def call(self, encoder_out, x, training, forward_mask=None, padding_mask=None):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)  # (batch_size,input_len,d_model)
        x = x * tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x = x + self.positional_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            x = self.decoder_layers[i](encoder_out, x, training, forward_mask,
                                       padding_mask)  # (batch_size, input_seq_len, d_model)
        return x

class Transformer(tf.keras.Model):

    def __init__(self, num_layers, num_heads, d_model, dense_dim, in_vocab_size, tar_vocab_size,
                 input_max_position, target_max_position, rate=0.1, causal = False):
        super().__init__()

        self.encoder = Encoder(num_layers, num_heads, d_model, dense_dim,
                               in_vocab_size, max_encoding_position=input_max_position, dropout=0.1, causal=causal)

        self.decoder = Decoder(num_layers, num_heads, d_model, dense_dim,
                               tar_vocab_size, max_encoding_position=target_max_position, dropout=0.1,causal=causal)

        self.dense = tf.keras.layers.Dense(tar_vocab_size)

    def call(self, input, target, training=False, enc_mask=None, dec_forward_mask=None, dec_padding_mask=None):
        out_encoder = self.encoder(input, training=training, mask=enc_mask)

        out_decoder = self.decoder(out_encoder, target, training=training, forward_mask=dec_forward_mask,
                                   padding_mask=dec_padding_mask)

        out = self.dense(out_decoder)

        return out

#Training and Evaluation

Data and preprocessing from Trung Tran
https://machinetalk.org/2019/04/29/create-the-transformer-with-tensorflow-2-0/

In [9]:
raw_data = (
    ('What a ridiculous concept!', 'Quel concept ridicule !'),
    ('Your idea is not entirely crazy.', "Votre idée n'est pas complètement folle."),
    ("A man's worth lies in what he is.", "La valeur d'un homme réside dans ce qu'il est."),
    ('What he did is very wrong.', "Ce qu'il a fait est très mal."),
    ("All three of you need to do that.", "Vous avez besoin de faire cela, tous les trois."),
    ("Are you giving me another chance?", "Me donnez-vous une autre chance ?"),
    ("Both Tom and Mary work as models.", "Tom et Mary travaillent tous les deux comme mannequins."),
    ("Can I have a few minutes, please?", "Puis-je avoir quelques minutes, je vous prie ?"),
    ("Could you close the door, please?", "Pourriez-vous fermer la porte, s'il vous plaît ?"),
    ("Did you plant pumpkins this year?", "Cette année, avez-vous planté des citrouilles ?"),
    ("Do you ever study in the library?", "Est-ce que vous étudiez à la bibliothèque des fois ?"),
    ("Don't be deceived by appearances.", "Ne vous laissez pas abuser par les apparences."),
    ("Excuse me. Can you speak English?", "Je vous prie de m'excuser ! Savez-vous parler anglais ?"),
    ("Few people know the true meaning.", "Peu de gens savent ce que cela veut réellement dire."),
    ("Germany produced many scientists.", "L'Allemagne a produit beaucoup de scientifiques."),
    ("Guess whose birthday it is today.", "Devine de qui c'est l'anniversaire, aujourd'hui !"),
    ("He acted like he owned the place.", "Il s'est comporté comme s'il possédait l'endroit."),
    ("Honesty will pay in the long run.", "L'honnêteté paye à la longue."),
    ("How do we know this isn't a trap?", "Comment savez-vous qu'il ne s'agit pas d'un piège ?"),
    ("I can't believe you're giving up.", "Je n'arrive pas à croire que vous abandonniez."),
)

In [10]:
import unicodedata
import re

def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def normalize_string(s):
    s = unicode_to_ascii(s)
    s = re.sub(r'([!.?])', r' \1', s)
    s = re.sub(r'[^a-zA-Z.!?]+', r' ', s)
    s = re.sub(r'\s+', r' ', s)
    return s

raw_data_en, raw_data_fr = list(zip(*raw_data))
raw_data_en, raw_data_fr = list(raw_data_en), list(raw_data_fr)
raw_data_en = ['<start> ' + normalize_string(data) + ' <end>' for data in raw_data_en]
raw_data_fr = ['<start> ' + normalize_string(data) + ' <end>' for data in raw_data_fr]

In [11]:
raw_data_en=[s.lower() for s in raw_data_en]
raw_data_fr=[s.lower() for s in raw_data_fr]

tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n',lower=False)
tokenizer_en.fit_on_texts(raw_data_en)
data_en=tokenizer_en.texts_to_sequences(raw_data_en)
data_en=tf.keras.preprocessing.sequence.pad_sequences(data_en,padding="post")

tokenizer_fr = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n',lower=False)
tokenizer_fr.fit_on_texts(raw_data_fr)
data_fr_in = tokenizer_fr.texts_to_sequences(raw_data_fr)
data_fr_in = tf.keras.preprocessing.sequence.pad_sequences(data_fr_in,
                                                           padding='post')

BATCH_SIZE = 20
dataset = tf.data.Dataset.from_tensor_slices(
    (data_fr_in, data_en))
small_dataset = dataset.shuffle(20).batch(BATCH_SIZE).cache()

Hyperparameters

In [12]:
num_layers = 2
d_model = 128
dense_dim = 512
num_heads = 2
max_len_fr = 35
max_len_en = 35
en_vocab_size = len(tokenizer_en.word_index) +2
fr_vocab_size = len(tokenizer_fr.word_index) + 2 
dropout_rate = 0.1
causal = False

Evaluation

In [13]:
def evaluate(question):

    start_token = [1]
    end_token = [2]
    question = tokenizer_fr.texts_to_sequences([question])[0]
    question = tf.expand_dims(question, 0)
    answer_in = start_token
    answer_in = tf.expand_dims(answer_in, 0)

    for i in range(max_len_fr):
        if causal:
          enc_padding_mask = padding_mask5(question)
          dec_padding_mask = padding_mask5(question)
          dec_forward_mask = forward_mask5(answer_in)
        else:
          enc_padding_mask = padding_mask(question)
          dec_padding_mask = padding_mask(question)
          dec_forward_mask = padding_mask(answer_in)

        predictions = transformer(question, answer_in, training=False, enc_mask=enc_padding_mask,
                                  dec_forward_mask=dec_forward_mask, dec_padding_mask=dec_padding_mask)
        prediction = predictions[:, -1, :]  # select the last word to add to the outputs

        predicted_id = tf.cast(tf.argmax(prediction, axis=-1), tf.int32)

        if predicted_id == 2:
            return tf.squeeze(answer_in, axis=0)
        predicted_id = tf.expand_dims(predicted_id, 0)
        answer_in = tf.concat([answer_in, predicted_id], axis=-1)

    return tf.squeeze(answer_in, axis=0)

def translate(sentence):
  result = [np.array(evaluate(sentence))]
  
  predicted_sentence = tokenizer_en.sequences_to_texts(result)
  print('Input: {}'.format(sentence))
  print('Predicted translation: {}'.format(predicted_sentence))

Loss function and Optimzer

In [14]:
optimizer = tf.keras.optimizers.Adam(0.0003, beta_1=0.9, beta_2=0.98,
                                         epsilon=1e-9)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')

def masked_loss_fn(answer, prediction):
        mask = tf.math.logical_not(tf.math.equal(answer, 0))  # 0 at zeroes, 1 at non-zeroes since seq is padded
        # mask = tf.math.equal(answer, 0)
        mask = tf.cast(mask, tf.int32)
        loss_value = loss_fn(answer, prediction,
                             sample_weight=mask)  # set the zeros to zero weight, other values have weight of 1.

        return loss_value

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='train_accuracy')

Training

In [18]:
causal = False
transformer = Transformer(num_layers=num_layers, num_heads=num_heads, d_model=d_model, dense_dim=dense_dim,
                          in_vocab_size=fr_vocab_size, tar_vocab_size=en_vocab_size,
                          input_max_position=max_len_fr, target_max_position=max_len_en, rate=0.1, causal = causal)

In [19]:
    signature = [tf.TensorSpec(shape=(None,None), dtype=tf.int32),
                 tf.TensorSpec(shape=(None,None),
                               dtype=tf.int32), ]  # a bit faster if we specify the signature
    EPOCHS = 200
    @tf.function(input_signature=signature)
    def train_step(question, answer):
        answer_in = answer[:, :-1]
        answer_tar = answer[:, 1:]

        if causal:
          enc_padding_mask = padding_mask5(question)
          dec_padding_mask = padding_mask5(question)
          dec_forward_mask = forward_mask5(answer_in)
        else:
          enc_padding_mask = padding_mask(question)
          dec_padding_mask = padding_mask(question)
          dec_forward_mask = padding_mask(answer_in)

        with tf.GradientTape() as tape:
            predictions = transformer(question, answer_in, training=True, enc_mask=enc_padding_mask,
                                      dec_forward_mask=dec_forward_mask, dec_padding_mask=dec_padding_mask)
            loss = masked_loss_fn(answer_tar, predictions)

        gradients = tape.gradient(loss, transformer.trainable_variables)
        optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

        train_loss(loss)
        train_accuracy(answer_tar, predictions)

    for epoch in range(EPOCHS):
        start = time.time()

        train_loss.reset_states()
        train_accuracy.reset_states()

        for (batch, (question, answer)) in enumerate(small_dataset):
            train_step(question, answer)
        if (epoch +1) % 20 == 0:
          print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1,
                                                            train_loss.result(),
                                                            train_accuracy.result()))
        
          print('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))
          id = np.random.randint(0,20)
          translate(raw_data_fr[id])

Epoch 20 Loss 3.0401 Accuracy 0.1000
Time taken for 1 epoch: 0.018494844436645508 secs

Input: <start> est ce que vous etudiez a la bibliotheque des fois ? <end>
Predicted translation: ['<start>']
Epoch 40 Loss 1.6015 Accuracy 0.3500
Time taken for 1 epoch: 0.017268896102905273 secs

Input: <start> ne vous laissez pas abuser par les apparences . <end>
Predicted translation: ['<start> do do do do do do do']
Epoch 60 Loss 0.5577 Accuracy 0.6700
Time taken for 1 epoch: 0.01831960678100586 secs

Input: <start> tom et mary travaillent tous les deux comme mannequins . <end>
Predicted translation: ['<start> all three of all three of all three of']
Epoch 80 Loss 0.1778 Accuracy 0.7400
Time taken for 1 epoch: 0.021340370178222656 secs

Input: <start> me donnez vous une autre chance ? <end>
Predicted translation: ['<start> all three of you ever another chance']
Epoch 100 Loss 0.1047 Accuracy 0.7500
Time taken for 1 epoch: 0.018093585968017578 secs

Input: <start> l honnetete paye a la longue . <

In [20]:
for fr in raw_data_fr:
  translate(fr)
  print("\n")

Input: <start> quel concept ridicule ! <end>
Predicted translation: ['<start> a few minutes please']


Input: <start> votre idee n est pas completement folle . <end>
Predicted translation: ['<start> all three of you need to do that']


Input: <start> la valeur d un homme reside dans ce qu il est . <end>
Predicted translation: ['<start> a man s worth lies in what he is']


Input: <start> ce qu il a fait est tres mal . <end>
Predicted translation: ['<start> a man s worth lies in what he is']


Input: <start> vous avez besoin de faire cela tous les trois . <end>
Predicted translation: ['<start> all three of you need to do that']


Input: <start> me donnez vous une autre chance ? <end>
Predicted translation: ['<start> all three of you need to do that']


Input: <start> tom et mary travaillent tous les deux comme mannequins . <end>
Predicted translation: ['<start> both tom and mary work as models']


Input: <start> puis je avoir quelques minutes je vous prie ? <end>
Predicted translation: [

#With causal masking

In [21]:
causal = True
transformer = Transformer(num_layers=num_layers, num_heads=num_heads, d_model=d_model, dense_dim=dense_dim,
                          in_vocab_size=fr_vocab_size, tar_vocab_size=en_vocab_size,
                          input_max_position=max_len_fr, target_max_position=max_len_en, rate=0.1, causal = causal)

In [22]:
    signature = [tf.TensorSpec(shape=(None,None), dtype=tf.int32),
                 tf.TensorSpec(shape=(None,None),
                               dtype=tf.int32), ]  # a bit faster if we specify the signature
    EPOCHS = 200
    @tf.function(input_signature=signature)
    def train_step(question, answer):
        answer_in = answer[:, :-1]
        answer_tar = answer[:, 1:]
        if causal:
          enc_padding_mask = padding_mask5(question)
          dec_padding_mask = padding_mask5(question)
          dec_forward_mask = forward_mask5(answer_in)
        else:
          enc_padding_mask = padding_mask(question)
          dec_padding_mask = padding_mask(question)
          dec_forward_mask = padding_mask(answer_in)

        with tf.GradientTape() as tape:
            predictions = transformer(question, answer_in, training=True, enc_mask=enc_padding_mask,
                                      dec_forward_mask=dec_forward_mask, dec_padding_mask=dec_padding_mask)
            loss = masked_loss_fn(answer_tar, predictions)

        gradients = tape.gradient(loss, transformer.trainable_variables)
        optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

        train_loss(loss)
        train_accuracy(answer_tar, predictions)

    for epoch in range(EPOCHS):
        start = time.time()

        train_loss.reset_states()
        train_accuracy.reset_states()

        for (batch, (question, answer)) in enumerate(small_dataset):
            train_step(question, answer)
        if (1+ epoch) % 20 == 0:
          print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1,
                                                            train_loss.result(),
                                                            train_accuracy.result()))
        
          print('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))
          id = np.random.randint(0,20)
          translate(raw_data_fr[id])

Epoch 20 Loss 1.9159 Accuracy 0.3000
Time taken for 1 epoch: 0.0251312255859375 secs

Input: <start> je vous prie de m excuser ! savez vous parler anglais ? <end>
Predicted translation: ['<start> can t a few the english']
Epoch 40 Loss 0.5980 Accuracy 0.6800
Time taken for 1 epoch: 0.025606632232666016 secs

Input: <start> la valeur d un homme reside dans ce qu il est . <end>
Predicted translation: ['<start> a a man s worth lies in']
Epoch 60 Loss 0.1684 Accuracy 0.7350
Time taken for 1 epoch: 0.028249025344848633 secs

Input: <start> quel concept ridicule ! <end>
Predicted translation: ['<start> what a ridiculous concept']
Epoch 80 Loss 0.0713 Accuracy 0.7550
Time taken for 1 epoch: 0.0248415470123291 secs

Input: <start> je vous prie de m excuser ! savez vous parler anglais ? <end>
Predicted translation: ['<start> excuse me can you speak english']
Epoch 100 Loss 0.0427 Accuracy 0.7550
Time taken for 1 epoch: 0.025027990341186523 secs

Input: <start> devine de qui c est l anniversaire

In [23]:
for fr in raw_data_fr:
  translate(fr)
  print("\n")

Input: <start> quel concept ridicule ! <end>
Predicted translation: ['<start> what a ridiculous concept']


Input: <start> votre idee n est pas completement folle . <end>
Predicted translation: ['<start> your idea is not entirely crazy']


Input: <start> la valeur d un homme reside dans ce qu il est . <end>
Predicted translation: ['<start> a man s worth lies in what he is']


Input: <start> ce qu il a fait est tres mal . <end>
Predicted translation: ['<start> what he did is very wrong']


Input: <start> vous avez besoin de faire cela tous les trois . <end>
Predicted translation: ['<start> all three of you need to do that']


Input: <start> me donnez vous une autre chance ? <end>
Predicted translation: ['<start> are you giving me another chance']


Input: <start> tom et mary travaillent tous les deux comme mannequins . <end>
Predicted translation: ['<start> how do we know the library']


Input: <start> puis je avoir quelques minutes je vous prie ? <end>
Predicted translation: ['<start> 

From a quick inspection we see that the use of causal masking improves the model's performance.

**Input: <start> quel concept ridicule ! <end>**

Predicted translation: ['<start> a few minutes please'] #no causal


Predicted translation: ['<start> what a ridiculous concept'] #causal (correct)

**Input: <start> puis je avoir quelques minutes je vous prie ? <end>**


Predicted translation: ['<start> can i can i have a few minutes please']# no causal


Predicted translation: ['<start> can i have a few minutes please'] #causal (correct)




Both model managed to fit(overfit) the training data.

The training speed are similar for this small dataset. However they are vastly different for larger datasets, with my implementation of causal masking being significantely slower (to the point of being untrainable on my machine).

A future test could be to verify the linearity wrt. sequence length;

Is the non causal attention mechanism linear wtr. sequence length?
