<a href="https://colab.research.google.com/github/pavansai26/TRANSFORMER-NLP-MODELS/blob/main/Block_diagonal_attention_mechanism_in_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **What is the "block diagonal" attention pattern in Transformers?**

# **The Transformer network uses an attention mechanism to weigh the importance of different parts of the input sequence when processing it.**

# **In standard Transformer networks, the attention mechanism computes an attention score between each token in the input sequence and every other token.**

# **As a result, both the computation and the memory needed for attention grow quadratically with the length of the sequence. For very long sequences, this can make training slow and memory-intensive.**
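To make the quadratic growth concrete, the sketch below estimates the memory taken by a single attention score matrix. It assumes float32 scores, one attention head, and one sequence; `score_matrix_bytes` is a hypothetical helper written for this illustration, not a library function.

```python
BYTES_PER_FLOAT32 = 4  # assumption: scores are stored as float32


def score_matrix_bytes(seq_length):
    """Bytes needed for one seq_length x seq_length float32 score matrix."""
    return seq_length * seq_length * BYTES_PER_FLOAT32


# Doubling the sequence length quadruples the memory for the scores.
for n in (512, 1024, 2048, 4096):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n:>5} tokens -> {mib:>5.0f} MiB")
```

A real model multiplies this figure by the number of heads, layers, and sequences in the batch.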

# **To address this issue, researchers have proposed several variants of the attention mechanism that limit the number of tokens that can attend to each other. One such variant is the "block diagonal" attention pattern.**

# **The "block diagonal" attention pattern in Transformer layers refers to an attention mechanism in which the input sequence is divided into smaller blocks and attention is only allowed within each block.**

# **In other words, attention is only allowed between tokens that are in the same block, and tokens from different blocks cannot attend to each other.**
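The same-block rule can be sketched as a one-line check on token positions. The example below assumes equally sized, non-overlapping blocks; `can_attend` and `count_allowed_pairs` are hypothetical names chosen for this illustration. It also counts how many scores remain compared with full attention.

```python
def can_attend(i, j, block_size):
    """Return True if positions i and j belong to the same block."""
    return i // block_size == j // block_size


def count_allowed_pairs(seq_length, block_size):
    """Count the (query, key) pairs that block diagonal attention scores."""
    return sum(
        can_attend(i, j, block_size)
        for i in range(seq_length)
        for j in range(seq_length)
    )


seq_length, block_size = 1024, 64
full = seq_length * seq_length
blocked = count_allowed_pairs(seq_length, block_size)
print(f"full attention: {full:,} scores")
print(f"block diagonal: {blocked:,} scores")
```

When the sequence length is a multiple of the block size, the count is `seq_length * block_size`, so the cost grows linearly with sequence length for a fixed block size.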

# **The figure below illustrates the "block diagonal" attention pattern for a sequence with four tokens and two blocks:**

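Input sequence: [A, B, C, D]
Blocks: [A, B], [C, D]

Attention mask (rows = queries, columns = keys, 1 = attention allowed):

+---+---+---+---+---+
|   | A | B | C | D |
+---+---+---+---+---+
| A | 1 | 1 | 0 | 0 |
+---+---+---+---+---+
| B | 1 | 1 | 0 | 0 |
+---+---+---+---+---+
| C | 0 | 0 | 1 | 1 |
+---+---+---+---+---+
| D | 0 | 0 | 1 | 1 |
+---+---+---+---+---+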


# **In this example, the input sequence is divided into two blocks, [A, B] and [C, D].**

# **The attention mechanism only allows attention within each block, so tokens A and B can attend to each other, and tokens C and D can attend to each other. However, no token in [A, B] can attend to a token in [C, D], or vice versa.**
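Enumerating every pair of distinct tokens in this example shows which pairs fall inside a block and which cross a block boundary. This is a minimal sketch; the variable names are illustrative only.

```python
from itertools import combinations

tokens = ["A", "B", "C", "D"]
block_size = 2

allowed, blocked = [], []
for (i, a), (j, b) in combinations(enumerate(tokens), 2):
    # Two positions share a block when integer division gives the same index.
    same_block = i // block_size == j // block_size
    (allowed if same_block else blocked).append(a + "-" + b)

print("allowed:", allowed)  # ['A-B', 'C-D']
print("blocked:", blocked)  # ['A-C', 'A-D', 'B-C', 'B-D']
```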

# **The "block diagonal" attention pattern can be implemented in the Transformer network by modifying the attention mask, a binary matrix that indicates which tokens are allowed to attend to each other. In a standard Transformer encoder (ignoring padding), the attention mask is an all-ones matrix that allows all tokens to attend to each other.**

# **In contrast, in the "block diagonal" attention pattern, the attention mask is a block diagonal matrix that only allows attention within each block.**
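The mask-based approach can be sketched without any framework. The code below is a dependency-free illustration, not an efficient implementation: it builds a block diagonal mask, sets disallowed scores to a large negative value before the softmax, and computes scaled dot-product attention on plain Python lists. The function names and the `-1e9` fill value are assumptions made for this example.

```python
import math


def block_diagonal_mask(seq_length, block_size):
    """mask[i][j] is True when query i may attend to key j."""
    return [
        [i // block_size == j // block_size for j in range(seq_length)]
        for i in range(seq_length)
    ]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    masked = [s if ok else -1e9 for s, ok in zip(scores, allowed)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted by a boolean mask."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q, allowed in zip(queries, mask):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = masked_softmax(scores, allowed)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Four tokens with two features each, split into two blocks of two.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
mask = block_diagonal_mask(seq_length=4, block_size=2)
for row in masked_attention(x, x, x, mask):
    print([round(v, 3) for v in row])
```

Because the weights for masked positions are zero, the outputs for the first block do not change when the values of the second block change. Note that masking alone still computes the full score matrix; the savings come from computing scores only within each block.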

In [None]:
import tensorflow as tf


class BlockDiagonalAttention(tf.keras.layers.Layer):
    """Scaled dot-product attention restricted to non-overlapping blocks."""

    def __init__(self, block_size, dropout_rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.block_size = block_size
        # Create the dropout layer once instead of on every call
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, mask=None, training=None):
        query, key, value = inputs

        # Divide input sequence into blocks.
        # seq_length must be a multiple of block_size, and the last
        # (hidden) dimension must be statically known.
        batch_size = tf.shape(query)[0]
        seq_length = tf.shape(query)[1]
        hidden_size = query.shape[-1]
        num_blocks = seq_length // self.block_size
        blocked_shape = [batch_size, num_blocks, self.block_size, hidden_size]
        query = tf.reshape(query, blocked_shape)
        key = tf.reshape(key, blocked_shape)
        value = tf.reshape(value, blocked_shape)

        # Compute attention within each block. The batched matmul yields
        # scores of shape [batch_size, num_blocks, block_size, block_size].
        attention_scores = tf.matmul(query, key, transpose_b=True)
        attention_scores /= tf.math.sqrt(tf.cast(hidden_size, attention_scores.dtype))

        # Apply attention mask. Here the mask is assumed to be a boolean
        # tensor of shape [batch_size, seq_length], True for real tokens.
        if mask is not None:
            key_mask = tf.reshape(tf.cast(mask, tf.bool), [batch_size, num_blocks, 1, self.block_size])
            attention_scores = tf.where(key_mask, attention_scores, tf.constant(-1e9, dtype=attention_scores.dtype))

        attention_probs = tf.nn.softmax(attention_scores, axis=-1)
        attention_probs = self.dropout(attention_probs, training=training)

        # Compute weighted sum of value vectors within each block,
        # then merge the blocks back into one sequence
        context_vector = tf.matmul(attention_probs, value)
        context_vector = tf.reshape(context_vector, [batch_size, seq_length, hidden_size])
        return context_vector


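# Note that this implementation assumes that the query, key, and value tensors all have the same shape [batch_size, seq_length, hidden_size] and that seq_length is a multiple of block_size. If this is not the case, you may need to pad the inputs or adjust the reshape operations accordingly.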
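Because the layer computes the number of blocks with integer division (`seq_length // self.block_size`), a sequence whose length is not a multiple of the block size would need padding before it is split into blocks. The sketch below shows one way to do that on a plain list of token ids. `pad_to_block_multiple` is a hypothetical helper, and the pad id of 0 is an assumption; the returned boolean list marks real tokens and could serve as the basis for an attention mask.

```python
def pad_to_block_multiple(token_ids, block_size, pad_id=0):
    """Pad token_ids so its length is a multiple of block_size.

    Returns the padded ids and a boolean list that is True for real
    tokens and False for padding.
    """
    remainder = len(token_ids) % block_size
    num_pad = 0 if remainder == 0 else block_size - remainder
    padded = list(token_ids) + [pad_id] * num_pad
    is_real = [True] * len(token_ids) + [False] * num_pad
    return padded, is_real


padded, is_real = pad_to_block_multiple([5, 6, 7, 8, 9], block_size=4)
print(padded)   # [5, 6, 7, 8, 9, 0, 0, 0]
print(is_real)  # [True, True, True, True, True, False, False, False]
```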