
#Attention mechanism: A technique in deep learning that allows models to focus on specific parts of the input, improving performance for tasks like machine translation, text summarization, and question answering.


#Query, key, and value: The three essential components of attention

#Query:

#It represents the information the model is currently looking for.

#Think of each query as a mini-question the model asks about the input.

#Key:

#Each key labels one piece of information in the input sequence.

#The keys tell the model where to find the information its queries are seeking: the better a key matches a query, the more attention that position receives.

#Value:

#It stores the actual information associated with each key.

#The values are ultimately what the model attends to based on the query-key matching.
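To make the three roles concrete, here is a minimal NumPy sketch with one query and three key/value pairs. The vectors are made-up toy numbers, not taken from any model. The query is chosen to match the second key, so the output should be pulled toward the second value.

```python
import numpy as np

# Toy data (hypothetical numbers chosen for illustration only)
query = np.array([1.0, 0.0])       # what we are looking for
keys = np.array([[0.0, 1.0],       # key 0: orthogonal to the query
                 [1.0, 0.0],       # key 1: matches the query
                 [-1.0, 0.0]])     # key 2: opposite of the query
values = np.array([[10.0, 0.0],
                   [0.0, 20.0],
                   [30.0, 30.0]])

# Query-key matching: one dot-product score per key
scores = keys @ query              # shape (3,)

# Softmax turns the scores into weights that sum to 1
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

# The output is the weighted average of the values
output = weights @ values          # shape (2,)

print("scores: ", scores)
print("weights:", weights.round(3))
print("output: ", output.round(3))
```

Key 1 receives the largest weight, so the second value dominates the output, while the other two values still contribute a little.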


#Advantages of Dot Product Attention:


#Simple and efficient: Dot product attention reduces to two matrix multiplications and a softmax, and its scoring function has no learned parameters of its own, so it is cheap to compute, which helps in latency-sensitive applications.

#Interpretability: The attention weights provide some interpretability, allowing you to see which parts of the input the model is focusing on.

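#Scalability: Model capacity can be scaled up by increasing the number of heads and the hidden dimensions, and the computation parallelizes well because it consists of batched matrix multiplications. Sequence length is a separate matter; see the quadratic-complexity point under the disadvantages.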

#Multiple heads: Supports multi-head attention, which allows the model to learn different representations from the input data simultaneously.
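As a rough illustration of the multi-head point, the sketch below splits a model dimension into several heads by reshaping, so that each head attends over its own slice of the features. The helper names `split_heads` and `merge_heads` and the sizes are assumptions made for this example. A full multi-head layer would also apply learned projections before and after, which are omitted here.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_model // num_heads)."""
    batch_size, seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    depth = d_model // num_heads
    x = x.reshape(batch_size, seq_len, num_heads, depth)
    return x.transpose(0, 2, 1, 3)

def merge_heads(x):
    """Inverse of split_heads: (batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)."""
    batch_size, num_heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, num_heads * depth)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))        # batch=2, seq_len=5, d_model=8 (illustrative sizes)

heads = split_heads(x, num_heads=4)
print(heads.shape)                    # (2, 4, 5, 2)

restored = merge_heads(heads)
print(np.allclose(restored, x))       # splitting and merging loses no information
```

The `(batch, num_heads, seq_len, depth)` layout produced by `split_heads` is the four-dimensional shape that attention functions written for batched, multi-head inputs typically expect.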

#Disadvantages of Dot Product Attention:

#Quadratic complexity with sequence length: The computational cost grows quadratically with the sequence length, which can be problematic for very long sequences.

#Limited expressiveness: The dot product similarity measure may not capture complex relationships between queries and keys effectively.

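#Sensitivity to input scaling: The attention scores grow with the magnitude of the input vectors. Large scores push the softmax toward a near one-hot distribution with very small gradients, which can hurt training. Dividing the scores by sqrt(d_k), as in scaled dot product attention, mitigates this.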

#No inherent positional encoding: Dot product attention treats its input as an unordered set, so the model cannot capture the order of elements in the input sequence without additional positional encoding techniques.
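Two of the points above are easy to check numerically. The sketch below uses a small self-contained helper (`attend`, a hypothetical name; single head, no batch dimension) and random toy inputs. It shows that (1) multiplying the queries and keys by a constant sharpens the attention weights, and (2) shuffling the input positions only shuffles the outputs in the same way, so the mechanism itself carries no order information.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention for 2-D arrays of shape (seq_len, d)."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # 4 positions, 8 features (toy sizes)

# (1) Sensitivity to input scale: larger inputs give peakier weights
_, w_small = attend(x, x, x)
_, w_large = attend(10 * x, 10 * x, x)
print("max weight per row, original scale:", w_small.max(axis=-1).round(3))
print("max weight per row, inputs x10:    ", w_large.max(axis=-1).round(3))

# (2) No inherent notion of order: permuting the inputs permutes the outputs
perm = np.array([2, 0, 3, 1])
out, _ = attend(x, x, x)
out_perm, _ = attend(x[perm], x[perm], x[perm])
print("same outputs, just reordered:", np.allclose(out_perm, out[perm]))
```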

In [1]:
import numpy as np

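def dot_product_attention(query, key, value, mask=None):
    """Scaled dot product attention.

    Args:
        query: Query array of shape (batch_size, num_heads, seq_len_q, d_k).
        key: Key array of shape (batch_size, num_heads, seq_len_k, d_k).
        value: Value array of shape (batch_size, num_heads, seq_len_k, d_v).
        mask: Optional boolean array broadcastable to
            (batch_size, num_heads, seq_len_q, seq_len_k). Positions where the
            mask is False are excluded from attention (assumed convention).

    Returns:
        output: Attention output of shape (batch_size, num_heads, seq_len_q, d_v).
        attention_weights: Attention weights of shape
            (batch_size, num_heads, seq_len_q, seq_len_k).
    """
    d_k = query.shape[-1]

    # Scaled dot product scores: Q K^T / sqrt(d_k)
    scores = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(d_k)

    # Masked positions get a large negative score, so softmax gives them ~0 weight
    if mask is not None:
        scores = np.where(mask, scores, -1e9)

    # Normalize scores with a numerically stable softmax (NumPy has no np.softmax)
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Compute weighted sum of values
    output = np.matmul(attention_weights, value)

    return output, attention_weights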
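
The docstring mentions an optional mask. As a sketch of one common use, the snippet below builds a causal (lower-triangular) mask so that each position can attend only to itself and to earlier positions. The helper name `causal_mask` and the convention that `True` means "may attend" are assumptions of this example. The scores are random toy numbers standing in for the scaled query-key products.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) mask; True where attention is allowed (assumed convention)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(1, 1, seq_len, seq_len))   # (batch, heads, seq_len, seq_len)

mask = causal_mask(seq_len)                  # broadcasts over the batch and head axes
masked_scores = np.where(mask, scores, -1e9)

# Stable softmax over the key axis
shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

print(weights[0, 0].round(3))                # entries above the diagonal are 0
```

Each row still sums to 1, but all of the weight is distributed over the current and earlier positions. The first position can attend only to itself.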
