# Transformer Network

## Packages

In [None]:
import tensorflow as tf
import time
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization

<a name='1'></a>
## 1 - Positional Encoding

In sequence to sequence tasks, the relative order of your data is extremely important to its meaning. When you were training sequential neural networks such as RNNs, you fed your inputs into the network in order. Information about the order of your data was automatically fed into your model.  However, when you train a Transformer network using multi-head attention, you feed your data into the model all at once. While this dramatically reduces training time, there is no information about the order of your data. This is where positional encoding is useful - you can specifically encode the positions of your inputs and pass them into the network using these sine and cosine formulas:
    
$$
PE_{(pos, 2i)}= sin\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
\tag{1}$$
<br>
$$
PE_{(pos, 2i+1)}= cos\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
\tag{2}$$

* $d$ is the dimension of the word embedding and positional encoding
* $pos$ is the position of the word.
* $k$ indexes the dimensions of the positional encoding ($0 \le k < d$), with $i$ equal to $k$ $//$ $2$, so each sine/cosine pair of columns shares the same $i$.

To develop some intuition about positional encodings, we can think of them broadly as a feature that contains the information about the relative positions of words. The sum of the positional encoding and word embedding is ultimately what is fed into the model. If we just hard code the positions by adding a matrix of 1's or whole numbers to the word embedding (for example), the semantic meaning is distorted. Conversely, the values of the sine and cosine equations are small enough (between -1 and 1) that when we add the positional encoding to a word embedding, the word embedding is not significantly distorted, and is instead enriched with positional information. Using a combination of these two equations helps the Transformer network attend to the relative positions of the input data.

<a name='1-1'></a>
### 1.1 - Sine and Cosine Angles

Notice that even though the sine and cosine positional encoding equations take in different arguments (`2i` versus `2i+1`, or even versus odd numbers) the inner terms for both equations are the same: $$\theta(pos, i, d) = \frac{pos}{10000^{\frac{2i}{d}}} \tag{3}$$

Consider the inner term as you calculate the positional encoding for a word in a sequence.<br> 

$PE_{(pos, 0)}= sin\left(\frac{pos}{{10000}^{\frac{0}{d}}}\right)$, since solving `2i = 0` gives `i = 0`
<br>

$PE_{(pos, 1)}= cos\left(\frac{pos}{{10000}^{\frac{0}{d}}}\right)$, since solving `2i + 1 = 1` gives `i = 0`

The angle is the same for both! The angles for $PE_{(pos, 2)}$ and $PE_{(pos, 3)}$ are the same as well, since for both, `i = 1` and therefore the inner term is $\left(\frac{pos}{{10000}^{\frac{2}{d}}}\right)$. This relationship holds true for all paired sine and cosine curves.
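
The pairing described above can be checked numerically. The following is a minimal NumPy sketch; the values chosen for `d` and `pos` are illustrative assumptions, not values taken from the assignment.

```python
import numpy as np

d = 8                 # illustrative encoding size (assumption)
pos = 3               # illustrative position (assumption)
k = np.arange(d)      # dimension indices 0 .. d-1
i = k // 2            # pair index shared by column 2i and column 2i+1

# Inner term of equation (3) for every dimension k
theta = pos / (10000 ** (2 * i / d))

# Even columns (sine) and odd columns (cosine) use the same angles
pairs_match = np.allclose(theta[0::2], theta[1::2])

print(i)            # [0 0 1 1 2 2 3 3]
print(pairs_match)  # True
```

Only the function applied to the angle (sine or cosine) differs between the two columns of a pair.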

In [None]:
def get_angles(pos, k, d):
    """
    Get the angles for the positional encoding
    
    Arguments:
        pos -- Column vector containing the positions [[0], [1], ...,[N-1]]
        k --   Row vector containing the dimension span [[0, 1, 2, ..., d-1]]
        d(integer) -- Encoding size
    
    Returns:
        angles -- (pos, d) numpy array 
    """
    
    i = k//2
    angles = pos/(10000**(2*i/d)) # (sequence_length, encoding_size)
    
    return angles

<a name='1-2'></a>
### 1.2 - Sine and Cosine Positional Encodings

Now you can use the angles you computed to calculate the sine and cosine positional encodings.

$$
PE_{(pos, 2i)}= sin\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
$$
<br>
$$
PE_{(pos, 2i+1)}= cos\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
$$

In [None]:
def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings 
    
    Arguments:
        positions (int) -- Maximum number of positions to be encoded 
        d (int) -- Encoding size 
    
    Returns:
        pos_encoding -- (1, positions, d) float32 tensor with the positional encodings
    """

    angle_rads = get_angles(np.reshape(np.arange(positions), (positions, 1)), np.reshape(np.arange(d), (1, d)), d)
  
    # applying sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  
    # applying cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...] # (1, sequence_length, encoding_size)
    
    return tf.cast(pos_encoding, dtype=tf.float32)
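
To sanity-check the encoding without TensorFlow, here is a NumPy-only sketch of the same computation. The helper name `positional_encoding_np` and the sizes used are assumptions made for illustration; unlike the function above, it returns a plain `(positions, d)` array with no leading batch axis.

```python
import numpy as np

def positional_encoding_np(positions, d):
    """NumPy-only sketch of the sinusoidal encoding; returns (positions, d)."""
    pos = np.arange(positions)[:, np.newaxis]       # (positions, 1)
    k = np.arange(d)[np.newaxis, :]                 # (1, d)
    angles = pos / (10000 ** (2 * (k // 2) / d))    # (positions, d)
    pe = np.zeros_like(angles)
    pe[:, 0::2] = np.sin(angles[:, 0::2])           # even columns: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])           # odd columns: cosine
    return pe

pe = positional_encoding_np(positions=50, d=16)
print(pe.shape)                  # (50, 16)
print(pe[0, :4])                 # position 0 alternates sin(0)=0 and cos(0)=1
print(np.abs(pe).max() <= 1.0)   # True: every value lies in [-1, 1]
```

The bounded range is what lets the encoding be added to a word embedding without overwhelming it.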

<a name='2'></a>
## 2 - Masking

There are two types of masks that are useful when building your Transformer network: the *padding mask* and the *look-ahead mask*. Both help the softmax computation give the appropriate weights to the words in your input sentence. 

<a name='2-1'></a>
### 2.1 - Padding Mask

Oftentimes your input sequence will exceed the maximum length of a sequence your network can process. Suppose the maximum length of your model is five and it is fed the following sequences:

    [["Do", "you", "know", "when", "Jane", "is", "going", "to", "visit", "Africa"], 
     ["Jane", "visits", "Africa", "in", "September" ],
     ["Exciting", "!"]]

which might get vectorized as:

    [[ 71, 121, 4, 56, 99, 2344, 345, 1284, 15],
     [ 56, 1285, 15, 181, 545],
     [ 87, 600]]
    
When passing sequences into a transformer model, it is important that they are of uniform length. You can achieve this by padding the sequence with zeros, and truncating sentences that exceed the maximum length of your model:

    [[ 71, 121, 4, 56, 99],
     [ 2344, 345, 1284, 15, 0],
     [ 56, 1285, 15, 181, 545],
     [ 87, 600, 0, 0, 0]]
    
Sequences longer than the maximum length of five are truncated; in the example above, the overflow of the first sequence is carried into a second row, which is then padded with a zero. Sequences shorter than the maximum length are likewise padded with zeros. However, these zeros will affect the softmax calculation - this is when a padding mask comes in handy! You will need to define a boolean mask that specifies which elements you must attend to (1) and which elements you must ignore (0). Later you will use that mask to push the attention logits at the padded positions to a value close to negative infinity (-1e9), so they receive almost no weight after the softmax.
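
As a rough illustration of padding and truncation, the sketch below uses a hypothetical helper, `pad_or_truncate`, which is not part of the assignment. It assumes the simplest policy: cut each sequence at the maximum length and right-pad with zeros, without carrying the overflow into a new row.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut a sequence to max_len, then right-pad it with pad_id."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))

sequences = [
    [71, 121, 4, 56, 99, 2344, 345, 1284, 15],
    [56, 1285, 15, 181, 545],
    [87, 600],
]

padded = [pad_or_truncate(seq, max_len=5) for seq in sequences]
for row in padded:
    print(row)
```

Every row now has length five, so the batch can be stacked into a single tensor.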

**Note:** The below function only creates the mask of an _already padded sequence_.

In [None]:
def create_padding_mask(decoder_token_ids):
    """
    Creates a matrix mask for the padding cells
    
    Arguments:
        decoder_token_ids -- (n, m) matrix
    
    Returns:
        mask -- (n, 1, m) binary tensor
    """    
    seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)
  
    # adding extra dimensions to add the padding to the attention logits. 
    # this will allow for broadcasting later when comparing sequences
    
    return seq[:, tf.newaxis, :] 
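
To see why the mask matters for the softmax, here is a small NumPy sketch. The uniform logits are made-up values, and the local `softmax` helper is an assumption added so the example runs without TensorFlow.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

token_ids = np.array([[87, 600, 0, 0, 0]])           # one padded sequence
mask = 1.0 - (token_ids == 0).astype(np.float32)     # 1 = attend, 0 = ignore

logits = np.ones_like(mask)                          # made-up uniform logits
masked_logits = logits + (1.0 - mask) * -1.0e9       # push padding toward -infinity

unmasked_weights = softmax(logits)
masked_weights = softmax(masked_logits)
print(unmasked_weights)  # 0.2 everywhere, padding included
print(masked_weights)    # 0.5 for the two real tokens, 0 for padding
```

Without the mask, three fifths of the attention weight would be spent on padding.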

<a name='2-2'></a>
### 2.2 - Look-ahead Mask

The look-ahead mask follows similar intuition. In training, you will have access to the complete correct output of your training example. The look-ahead mask helps your model pretend that it correctly predicted a part of the output and see if, *without looking ahead*, it can correctly predict the next output. 

For example, if the expected correct output is `[1, 2, 3]` and you wanted to see if given that the model correctly predicted the first value it could predict the second value, you would mask out the second and third values. So you would input the masked sequence `[1, -1e9, -1e9]` and see if it could generate `[1, 2, -1e9]`.

In [None]:
def create_look_ahead_mask(sequence_length):
    """
    Returns a lower triangular matrix filled with ones
    
    Arguments:
        sequence_length -- matrix size
    
    Returns:
        mask -- (1, sequence_length, sequence_length) tensor
    """
    
    mask = tf.linalg.band_part(tf.ones((1, sequence_length, sequence_length)), -1, 0)
    
    return mask 
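
The effect of the lower triangular mask can be sketched in NumPy, using `np.tril` as a stand-in for `tf.linalg.band_part`. The sequence length and the all-zero logits are illustrative assumptions.

```python
import numpy as np

sequence_length = 4
look_ahead = np.tril(np.ones((sequence_length, sequence_length)))  # 1 = may attend

logits = np.zeros((sequence_length, sequence_length))     # made-up uniform logits
masked_logits = logits + (1.0 - look_ahead) * -1.0e9      # block future positions

# Softmax over the last axis
weights = np.exp(masked_logits - masked_logits.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

print(look_ahead)
print(weights.round(2))
```

Row `t` spreads its attention only over positions `0..t`, so a position never receives information from later tokens.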

<a name='3'></a>
## 3 - Self-Attention

As the authors of the Transformers paper state, "Attention is All You Need". 

<center><img src="images/self-attention.png" alt="Encoder" width="50%" heigth="50%"></center>
<caption><center><b>Figure 1:</b> Self-Attention calculation visualization</center></caption>
<br>
    
The use of self-attention paired with position-wise feed forward networks, rather than recurrent layers, allows for parallelization, which speeds up training. You will implement **scaled dot product attention**, which takes in a query, key, value, and a mask as inputs to return rich attention-based vector representations of the words in your sequence. This type of self-attention can be mathematically expressed as:
$$
\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}+{M}\right) V\tag{4}
$$

* $Q$ is the matrix of queries 
* $K$ is the matrix of keys
* $V$ is the matrix of values
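* $M$ is the optional mask you choose to apply, written here as an additive term: 0 where attention is allowed and a large negative number (about -1e9) where it is blocked
* ${d_k}$ is the dimension of the keys, which is used to scale the dot products down so the softmax does not saturate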

In [None]:
def scaled_dot_product_attention(q, k, v, mask):
    """
    Calculate the attention weights.
      q, k, v must have matching leading dimensions.
      k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
      The mask has different shapes depending on its type (padding or look ahead) 
      but it must be broadcastable for addition.

    Arguments:
        q -- query shape == (..., seq_len_q, depth)
        k -- key shape == (..., seq_len_k, depth)
        v -- value shape == (..., seq_len_v, depth_v)
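        mask -- Float tensor with shape broadcastable 
                to (..., seq_len_q, seq_len_k), with 1 for positions to attend to
                and 0 for positions to ignore. Pass None to skip masking.

    Returns:
        output -- attention output of shape (..., seq_len_q, depth_v)
        attention_weights -- softmax weights of shape (..., seq_len_q, seq_len_k)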
    """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scaling matmul_qk by the square root of the key depth (last axis of k)
    dk = tf.cast(tf.shape(k)[-1], matmul_qk.dtype)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # adding the mask to the scaled tensor
    if mask is not None:
        scaled_attention_logits += (1. - mask)*-1.0e9

    # applying softmax to the last axis    
    attention_weights = tf.keras.activations.softmax(scaled_attention_logits) # (..., seq_len_q, seq_len_k)

    # calculating v
    output = tf.matmul(attention_weights, v) # (..., seq_len_q, depth_v)

    return output, attention_weights
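
For a quick shape check outside TensorFlow, equation (4) can be sketched in NumPy. The function name `attention_np`, the random inputs, and the tensor sizes are assumptions for illustration; the mask follows the same convention as above (1 = attend, 0 = ignore).

```python
import numpy as np

def attention_np(q, k, v, mask=None):
    """NumPy sketch of scaled dot product attention."""
    d_k = k.shape[-1]                                     # key depth (last axis)
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)    # (..., seq_len_q, seq_len_k)
    if mask is not None:
        logits = logits + (1.0 - mask) * -1.0e9           # block masked keys
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 4))    # (batch, seq_len_q, depth)
k = rng.normal(size=(2, 5, 4))    # (batch, seq_len_k, depth)
v = rng.normal(size=(2, 5, 6))    # (batch, seq_len_v, depth_v)
mask = np.array([[[1.0, 1.0, 1.0, 0.0, 0.0]]])   # last two keys are padding

output, weights = attention_np(q, k, v, mask)
print(output.shape)    # (2, 3, 6)
print(weights.shape)   # (2, 3, 5)
```

Each row of `weights` sums to one, and the two masked key positions receive zero weight.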

<a name='4'></a>
## 4 - Encoder

The Transformer Encoder layer processes all positions of the input in parallel, which improves the speed of training, and passes K and V matrices to the Decoder, which you'll build later in the assignment. In this section of the assignment, you will implement the Encoder by pairing multi-head attention and a feed forward neural network (Figure 2a). 


<center><img src="images/encoder_layer.png" alt="Encoder" width="25%"></center>
<caption><center><b>Figure 2a:</b> Transformer encoder layer</center></caption>

* You can think of `MultiHeadAttention` as computing the self-attention several times to detect different features. 
* The feed forward neural network contains two Dense layers, which we'll implement as the function `FullyConnected`.

Your input sentence first passes through a *multi-head attention layer*, where the encoder looks at other words in the input sentence as it encodes a specific word. The outputs of the multi-head attention layer are then fed to a *feed forward neural network*. The exact same feed forward network is independently applied to each position.
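
The phrase "independently applied to each position" can be illustrated with a small NumPy sketch. The weights are random stand-ins and all sizes are assumptions; the point is only that running the two dense layers on the whole sequence gives the same result as running them one position at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embedding_dim, fully_connected_dim = 4, 8, 16   # illustrative sizes

# Random stand-ins for the two Dense layers' parameters
W1 = rng.normal(size=(embedding_dim, fully_connected_dim))
b1 = np.zeros(fully_connected_dim)
W2 = rng.normal(size=(fully_connected_dim, embedding_dim))
b2 = np.zeros(embedding_dim)

def feed_forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # first Dense layer with ReLU
    return hidden @ W2 + b2                 # second Dense layer, back to embedding_dim

x = rng.normal(size=(seq_len, embedding_dim))
whole = feed_forward(x)                                           # all positions at once
per_position = np.stack([feed_forward(x[t]) for t in range(seq_len)])

print(np.allclose(whole, per_position))   # True
```

No information flows between positions in this sub-layer; mixing across positions happens only in the attention sub-layer.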

In [None]:
def FullyConnected(fully_connected_dim, embedding_dim):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(embedding_dim)  # (batch_size, seq_len, d_model)
    ])

<a name='4-1'></a>
### 4.1 - Encoder Layer

Now you can pair multi-head attention and feed forward neural network together in an encoder layer! You will also use residual connections and layer normalization to help speed up training (Figure 2a).
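
The "add and normalize" step can be sketched in NumPy as follows. This simplified `layer_norm` is an assumption for illustration: it omits the learned scale and shift parameters that Keras `LayerNormalization` includes, and the sub-layer output is a random stand-in.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position over its feature axis (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))              # (batch, seq_len, embedding_dim)
sublayer_out = rng.normal(size=(2, 4, 8))   # stand-in for attention or FFN output

out = layer_norm(x + sublayer_out)          # residual connection, then normalization
print(out.shape)                            # (2, 4, 8)
print(np.allclose(out.mean(axis=-1), 0.0))  # True: each position has zero mean
```

The residual sum requires the sub-layer output to have the same shape as its input, which is why both sub-layers return `embedding_dim` features.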

In [None]:
class EncoderLayer(tf.keras.layers.Layer):
    """
    The encoder layer is composed of a multi-head self-attention mechanism,
    followed by a simple, positionwise fully connected feed-forward network. 
    This architecture includes a residual connection around each of the two 
    sub-layers, followed by layer normalization.
    """
    
    def __init__(self, 
                 embedding_dim, 
                 num_heads, 
                 fully_connected_dim,
                 dropout_rate=0.1, 
                 layernorm_eps=1e-6):
        
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(num_heads = num_heads,
                                      key_dim = embedding_dim,
                                      dropout = dropout_rate)

        self.ffn = FullyConnected(fully_connected_dim = fully_connected_dim,
                                  embedding_dim = embedding_dim)

        self.layernorm1 = LayerNormalization(epsilon = layernorm_eps)
        self.layernorm2 = LayerNormalization(epsilon = layernorm_eps)

        self.dropout_ffn = Dropout(dropout_rate)
    
    def call(self, x, training, mask):
        """
        Forward pass for the Encoder Layer
        
        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
            training -- Boolean, set to true to activate the training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not treated as part of the input
                    
        Returns:
            encoder_layer_out -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
        """
        
        # calculating self-attention using mha
        self_mha_output = self.mha(x, x, attention_mask=mask, training=training)  # (batch_size, input_seq_len, embedding_dim)
        
        # applying skip connection and layer normalization 
        skip_x_attention = self.layernorm1(self_mha_output + x)  # (batch_size, input_seq_len, embedding_dim)

        # passing the output of the mha through a ffn
        ffn_output = self.ffn(skip_x_attention)  # (batch_size, input_seq_len, embedding_dim)
        
        # applying dropout to ffn output during training
        ffn_output = self.dropout_ffn(ffn_output, training=training) # (batch_size, input_seq_len, embedding_dim)
        
        # applying skip connection and layer normalization
        encoder_layer_out = self.layernorm2(ffn_output + skip_x_attention)  # (batch_size, input_seq_len, embedding_dim)
        
        return encoder_layer_out
    

<a name='4-2'></a>
### 4.2 - Full Encoder
 

<center><img src="images/encoder.png" alt="Encoder" width="30%"></center>
<caption><center><b>Figure 2b:</b> Transformer Encoder</center></caption>

In [None]:
class Encoder(tf.keras.layers.Layer):
    """
    The entire Encoder starts by passing the input to an embedding layer 
    and using positional encoding to then pass the output through a stack of
    encoder Layers
    """  
    
    def __init__(self, 
                 num_layers, 
                 embedding_dim, 
                 num_heads, 
                 fully_connected_dim, 
                 input_vocab_size,
                 maximum_position_encoding, 
                 dropout_rate=0.1, 
                 layernorm_eps=1e-6):
        
        super(Encoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = Embedding(input_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.embedding_dim)

        self.enc_layers = [EncoderLayer(embedding_dim = self.embedding_dim,
                                        num_heads = num_heads,
                                        fully_connected_dim = fully_connected_dim,
                                        dropout_rate = dropout_rate,
                                        layernorm_eps = layernorm_eps) 
                           for _ in range(self.num_layers)]

        self.dropout = Dropout(dropout_rate)
        
    def call(self, x, training, mask):
        """
        Forward pass for the Encoder
        
        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len)
            training -- Boolean, set to true to activate the training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not treated as part of the input
                    
        Returns:
            x -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
        """
        
        seq_len = tf.shape(x)[1]

        # passing input through the Embedding layer
        x = self.embedding(x)  # (batch_size, input_seq_len, embedding_dim)
        
        # scaling embedding by sqrt(embedding_dim)
        x = tf.cast(x, tf.float32)*self.embedding_dim**(1/2)
        
        # adding the position encoding
        x += self.pos_encoding[:,:seq_len,:]
        
        # passing the encoded embedding through a dropout layer
        x = self.dropout(x, training=training)
        
        # passing the output through the stack of encoding layers 
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, embedding_dim)

<a name='5'></a>
## 5 - Decoder

The Decoder layer takes the K and V matrices generated by the Encoder and computes the second multi-head attention layer with the Q matrix from the output (Figure 3a).
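
The shapes involved in this second attention block can be sketched in NumPy. This is a simplified illustration under stated assumptions: it uses a single head, random stand-in tensors, and no learned projections, all of which `MultiHeadAttention` would add.

```python
import numpy as np

rng = np.random.default_rng(0)
input_seq_len, target_seq_len, depth = 5, 3, 4        # illustrative sizes

enc_output = rng.normal(size=(input_seq_len, depth))  # supplies K and V
dec_state = rng.normal(size=(target_seq_len, depth))  # supplies Q

# Scaled dot products between decoder queries and encoder keys
logits = dec_state @ enc_output.T / np.sqrt(depth)    # (target_seq_len, input_seq_len)
logits = logits - logits.max(axis=-1, keepdims=True)
weights = np.exp(logits)
weights = weights / weights.sum(axis=-1, keepdims=True)

context = weights @ enc_output                        # (target_seq_len, depth)
print(weights.shape)   # (3, 5)
print(context.shape)   # (3, 4)
```

Each target position gets one row of weights over the input positions, which is why these attention weights have shape `(target_seq_len, input_seq_len)`.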

<center><img src="images/decoder_layer.png" alt="Encoder" width="20%"/></center>
<caption><center><b>Figure 3a:</b> Transformer Decoder layer</center></caption>

<a name='5-1'></a>    
### 5.1 - Decoder Layer

In [None]:
class DecoderLayer(tf.keras.layers.Layer):
    """
    The decoder layer is composed of two multi-head attention blocks, 
    one that takes the new input and uses self-attention, and the other 
    one that combines it with the output of the encoder, followed by a
    fully connected block. 
    """
    
    def __init__(self, 
                 embedding_dim, 
                 num_heads, 
                 fully_connected_dim, 
                 dropout_rate=0.1, 
                 layernorm_eps=1e-6):
        
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(num_heads = num_heads,
                                       key_dim = embedding_dim,
                                       dropout = dropout_rate)

        self.mha2 = MultiHeadAttention(num_heads = num_heads,
                                       key_dim = embedding_dim,
                                       dropout = dropout_rate)

        self.ffn = FullyConnected(fully_connected_dim = fully_connected_dim,
                                  embedding_dim = embedding_dim)

        self.layernorm1 = LayerNormalization(epsilon = layernorm_eps)
        self.layernorm2 = LayerNormalization(epsilon = layernorm_eps)
        self.layernorm3 = LayerNormalization(epsilon = layernorm_eps)

        self.dropout_ffn = Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder Layer
        
        Arguments:
            x -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
            enc_output --  Tensor of shape(batch_size, input_seq_len, embedding_dim)
            training -- Boolean, set to true to activate the training mode for dropout layers
            look_ahead_mask -- Boolean mask for the target_input
            padding_mask -- Boolean mask for the second multihead attention layer
            
        Returns:
            out3 -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
            attn_weights_block1 -- Tensor of shape (batch_size, num_heads, target_seq_len, target_seq_len)
            attn_weights_block2 -- Tensor of shape (batch_size, num_heads, target_seq_len, input_seq_len)
        """

        # calculating self-attention
        mult_attn_out1, attn_weights_block1 = self.mha1(x, 
                                                        x, 
                                                        attention_mask=look_ahead_mask, 
                                                        training=training, 
                                                        return_attention_scores=True)  # (batch_size, target_seq_len, embedding_dim)
        
        # applying skip connection and layer normalization 
        Q1 = self.layernorm1(mult_attn_out1 + x)

        # calculating attention with Q1 as the query and the encoder output as key and value
        mult_attn_out2, attn_weights_block2 = self.mha2(Q1, 
                                                        enc_output, 
                                                        attention_mask=padding_mask, 
                                                        training=training, 
                                                        return_attention_scores=True)  # (batch_size, target_seq_len, embedding_dim)
        
        # applying skip connection and layer normalization 
        mult_attn_out2 = self.layernorm2(mult_attn_out2 + Q1)  # (batch_size, target_seq_len, embedding_dim)
                
        # passing the output through a ffn
        ffn_output = self.ffn(mult_attn_out2)  # (batch_size, target_seq_len, embedding_dim)
        
        # applying a dropout to the ffn output during training
        ffn_output = self.dropout_ffn(ffn_output, training=training)
        
        # applying skip connection and layer normalization 
        out3 = self.layernorm3(ffn_output + mult_attn_out2)  # (batch_size, target_seq_len, embedding_dim)

        return out3, attn_weights_block1, attn_weights_block2 

<a name='5-2'></a> 
### 5.2 - Full Decoder


<center><img src="images/decoder.png" alt="Encoder" width="30%"/></center>
<caption><center><b>Figure 3b:</b> Transformer Decoder</center></caption>

<a name='ex-7'></a>     

In [None]:
class Decoder(tf.keras.layers.Layer):
    """
    The entire Decoder starts by passing the target input to an embedding layer 
    and using positional encoding to then pass the output through a stack of
    decoder Layers
        
    """ 
    
    def __init__(self, 
                 num_layers, 
                 embedding_dim, 
                 num_heads, 
                 fully_connected_dim, 
                 target_vocab_size,
                 maximum_position_encoding, 
                 dropout_rate=0.1, 
                 layernorm_eps=1e-6):
        
        super(Decoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = Embedding(target_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.embedding_dim)

        self.dec_layers = [DecoderLayer(embedding_dim = self.embedding_dim,
                                        num_heads = num_heads,
                                        fully_connected_dim = fully_connected_dim,
                                        dropout_rate = dropout_rate,
                                        layernorm_eps = layernorm_eps) 
                           for _ in range(self.num_layers)]
        
        self.dropout = Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder
        
        Arguments:
            x -- Tensor of shape (batch_size, target_seq_len) containing target token ids
            enc_output --  Tensor of shape(batch_size, input_seq_len, embedding_dim)
            training -- Boolean, set to true to activate the training mode for dropout layers
            look_ahead_mask -- Boolean mask for the target_input
            padding_mask -- Boolean mask for the second multihead attention layer
            
        Returns:
            x -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
            attention_weights - Dictionary of tensors containing all the attention weights.
                                Self-attention entries have shape (batch_size, num_heads, target_seq_len, target_seq_len);
                                decoder-encoder entries have shape (batch_size, num_heads, target_seq_len, input_seq_len)
        """

        seq_len = tf.shape(x)[1]
        attention_weights = {}

        # creating word embeddings 
        x = self.embedding(x)  # (batch_size, target_seq_len, embedding_dim)
        
        # scaling word embeddings
        x = tf.cast(x, tf.float32)*self.embedding_dim**(1/2)
        
        # calculating positional encodings and adding to word embeddings
        x += self.pos_encoding[:,:seq_len,:]

        # applying a dropout to x during training
        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            # stacking multiple decoder layers
            x, block1, block2 = self.dec_layers[i](x, 
                                                   enc_output, 
                                                   training,
                                                   look_ahead_mask, 
                                                   padding_mask)
            
            attention_weights['decoder_layer{}_block1_self_att'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2_decenc_att'.format(i+1)] = block2
            
        return x, attention_weights

<a name='6'></a> 
## 6 - Transformer 

<center><img src="images/transformer.png" alt="Transformer" width="50%"/></center>
<caption><center><b>Figure 4:</b> Transformer</center></caption>

    
The flow of data through the Transformer Architecture is as follows:
* First your input passes through an Encoder, which is just repeated Encoder layers that you implemented:
    - embedding and positional encoding of your input
    - multi-head attention on your input
    - feed forward neural network to help detect features <br><br>
    
* Then the predicted output passes through a Decoder, consisting of the decoder layers that you implemented:
    - embedding and positional encoding of the output
    - multi-head attention on your generated output
    - multi-head attention with the Q from the first multi-head attention layer and the K and V from the Encoder
    - a feed forward neural network to help detect features <br><br>
    
* Finally, after the Nth Decoder layer, one dense layer and a softmax are applied to generate a prediction for the next output in your sequence.
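
The final step in the list above can be sketched in NumPy. The decoder output and dense-layer weights below are random stand-ins, the sizes are assumptions, and picking the most probable token (greedy decoding) is just one possible way to turn the probabilities into a prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
target_seq_len, embedding_dim, target_vocab_size = 3, 8, 10   # illustrative sizes

dec_output = rng.normal(size=(target_seq_len, embedding_dim))  # stand-in decoder output
W = rng.normal(size=(embedding_dim, target_vocab_size))        # stand-in dense weights
b = np.zeros(target_vocab_size)

# Dense layer followed by a softmax over the vocabulary
logits = dec_output @ W + b
logits = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(logits)
probs = probs / probs.sum(axis=-1, keepdims=True)              # (target_seq_len, vocab)

# Greedy choice: the most probable token at the last position
next_token_id = int(np.argmax(probs[-1]))
print(probs.shape)      # (3, 10)
print(next_token_id)
```

Each row of `probs` is a distribution over the target vocabulary; the last row is the one used to predict the next token.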

<a name='ex-8'></a> 

In [None]:
class Transformer(tf.keras.Model):
    """
    Complete transformer with an Encoder and a Decoder
    """
    
    def __init__(self, 
                 num_layers, 
                 embedding_dim, 
                 num_heads, 
                 fully_connected_dim,
                 input_vocab_size,
                 target_vocab_size, 
                 max_positional_encoding_input, 
                 max_positional_encoding_target, 
                 dropout_rate=0.1, 
                 layernorm_eps=1e-6):
        
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers = num_layers,
                               embedding_dim = embedding_dim,
                               num_heads = num_heads,
                               fully_connected_dim = fully_connected_dim,
                               input_vocab_size = input_vocab_size,
                               maximum_position_encoding = max_positional_encoding_input,
                               dropout_rate = dropout_rate,
                               layernorm_eps = layernorm_eps)

        self.decoder = Decoder(num_layers = num_layers, 
                               embedding_dim = embedding_dim,
                               num_heads = num_heads,
                               fully_connected_dim = fully_connected_dim,
                               target_vocab_size = target_vocab_size, 
                               maximum_position_encoding = max_positional_encoding_target,
                               dropout_rate = dropout_rate,
                               layernorm_eps = layernorm_eps)

        self.final_layer = Dense(target_vocab_size, activation='softmax')
    
    def call(self, 
             input_sentence, 
             output_sentence, 
             training, 
             enc_padding_mask, 
             look_ahead_mask, 
             dec_padding_mask):
        """
        Forward pass for the entire Transformer
        Arguments:
            input_sentence -- Tensor of shape (batch_size, input_seq_len)
                              An array of the indexes of the words in the input sentence
            output_sentence -- Tensor of shape (batch_size, target_seq_len)
                              An array of the indexes of the words in the output sentence
            training -- Boolean, set to true to activate the training mode for dropout layers
            enc_padding_mask -- Boolean mask to ensure that the padding is not treated as part of the input
            look_ahead_mask -- Boolean mask for the target_input
            dec_padding_mask -- Boolean mask for the second multihead attention layer
            
        Returns:
            final_output -- the full sentence prediction
            attention_weights - Dictionary of tensors containing all the attention weights for the decoder.
                                Self-attention entries have shape (batch_size, num_heads, target_seq_len, target_seq_len);
                                decoder-encoder entries have shape (batch_size, num_heads, target_seq_len, input_seq_len)
        """
        
        # calling the encoder 
        enc_output = self.encoder(input_sentence, 
                                  training,
                                  enc_padding_mask)  # (batch_size, inp_seq_len, embedding_dim)

        # calling the decoder 
        dec_output, attention_weights = self.decoder(output_sentence, 
                                                     enc_output, 
                                                     training, 
                                                     look_ahead_mask, 
                                                     dec_padding_mask) # (batch_size, tar_seq_len, embedding_dim)
        
        # passing the decoder output through a linear layer and softmax
        final_output = self.final_layer(dec_output) # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights