Task: Implement the Self-Attention Mechanism
Your task is to implement the self-attention mechanism, which is a fundamental component of transformer models, widely used in natural language processing and computer vision tasks. The self-attention mechanism allows a model to dynamically focus on different parts of the input sequence when generating a contextualized representation.

Your function should return the self-attention output as a NumPy array.

Example:
Input:
import numpy as np

X = np.array([[1, 0], [0, 1]])
W_q = np.array([[1, 0], [0, 1]])
W_k = np.array([[1, 0], [0, 1]])
W_v = np.array([[1, 2], [3, 4]])

Q, K, V = compute_qkv(X, W_q, W_k, W_v)
output = self_attention(Q, K, V)

print(output)
Output:
[[1.660477 2.660477]
 [2.339523 3.339523]]
Reasoning:
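The self-attention mechanism calculates an attention score for each pair of inputs, determining how much focus to put on the other inputs when generating a contextualized representation. The output is the weighted sum of the values, with weights given by the softmax of the scores. In this example Q and K are both the identity matrix, so the scaled scores are 1/sqrt(2) ≈ 0.707 on the diagonal and 0 elsewhere. Softmax turns each row into weights of roughly 0.670 and 0.330, so the first output row is 0.670 * [1, 2] + 0.330 * [3, 4] ≈ [1.660, 2.660].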
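The numbers in the example can be traced step by step. The following is a minimal sketch that recomputes the intermediate scores and softmax weights inline, using the @ operator instead of the compute_qkv and self_attention functions; the names exp_scores and weights are chosen here for illustration only.

```python
import numpy as np

# Same inputs as the example above.
X = np.array([[1, 0], [0, 1]])
W_q = np.array([[1, 0], [0, 1]])
W_k = np.array([[1, 0], [0, 1]])
W_v = np.array([[1, 2], [3, 4]])

# X, W_q and W_k are identity matrices, so Q = K = identity and V = W_v.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

d_k = K.shape[-1]                      # 2
scores = Q @ K.T / np.sqrt(d_k)        # 1/sqrt(2) on the diagonal, 0 elsewhere

# Row-wise softmax: each row of weights sums to 1.
exp_scores = np.exp(scores)
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

output = weights @ V                   # weighted sum of the rows of V

print(np.round(scores, 4))
print(np.round(weights, 4))            # rows are about [0.6698, 0.3302] and [0.3302, 0.6698]
print(np.round(output, 6))
```

Each token puts about two thirds of its weight on itself because its query matches its own key, and the rest on the other token.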

The provided code implements scaled dot-product self-attention, the mechanism in transformer models that lets the network focus on different parts of the input sequence when computing contextualized representations.

	•	Why do we need self-attention?
		•	Instead of treating tokens independently, self-attention allows each token to “attend” to all other tokens and extract useful contextual information.
	•	Why is softmax used?
		•	It turns each row of scores into non-negative weights that sum to 1, so attention is spread across tokens in proportion to their scores rather than placed entirely on one token.
	•	Why multiply Q and K.T?
		•	Each entry of the product is the dot product of one token’s query (what it wants) with another token’s key (what it has), which measures how similar the two are.
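The last two points can be seen on a small example. The sketch below uses hypothetical query and key vectors for three tokens (the values are made up for illustration, and softmax_rows is a helper defined here, not part of the task) to show that Q times K.T holds pairwise dot products and that softmax turns each row into weights summing to 1.

```python
import numpy as np

# Hypothetical queries and keys for three tokens (d_k = 2); illustrative values only.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# Entry (i, j) is the dot product of token i's query with token j's key.
# Row 0 is [1, 0, -1]: query 0 aligns with key 0, is orthogonal to key 1,
# and points opposite to key 2.
scores = Q @ K.T

def softmax_rows(s):
    # Subtracting the row maximum leaves the result unchanged but avoids overflow.
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

weights = softmax_rows(scores / np.sqrt(K.shape[-1]))

print(scores)
print(weights.sum(axis=1))  # each row sums to 1
```

The largest score in a row always receives the largest weight, but every token keeps a strictly positive share.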

import numpy as np
import math

def compute_qkv(X, W_q, W_k, W_v):
    """
    Compute the query (Q), key (K), and value (V) matrices.

    :param X: Input matrix (sequence_length, embedding_size)
    :param W_q: Query weight matrix (embedding_size, hidden_size)
    :param W_k: Key weight matrix (embedding_size, hidden_size)
    :param W_v: Value weight matrix (embedding_size, hidden_size)
    :return: Q, K, V matrices
    """
    Q = np.dot(X, W_q)
    K = np.dot(X, W_k)
    V = np.dot(X, W_v)
    return Q, K, V

def self_attention(Q, K, V):
    """
    Compute scaled dot-product self-attention.

    :param Q: Query matrix (sequence_length, hidden_size)
    :param K: Key matrix (sequence_length, hidden_size)
    :param V: Value matrix (sequence_length, hidden_size)
    :return: Attention output (sequence_length, hidden_size)
    """
    # shape[-1] selects the last dimension, which is the size of each key
    # vector (hidden_size), not the sequence length.
    d_k = K.shape[-1]

    scores = np.dot(Q, K.T) / math.sqrt(d_k)  # Scaled dot-product attention

    # Softmax over each row. Subtracting the row maximum does not change the
    # result but keeps np.exp from overflowing on large scores.
    # axis=1 sums along each row (across columns), so each row of
    # attention_weights sums to 1 and is a valid probability distribution.
    # keepdims=True keeps the reduced axis with size 1 so the (n, 1) result
    # broadcasts against the (n, n) scores; without it np.sum() would drop it.
    exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    output = np.dot(attention_weights, V)  # Weighted sum of values
    return output

# Example usage
X = np.array([[1, 0], [0, 1]])
W_q = np.array([[1, 0], [0, 1]])
W_k = np.array([[1, 0], [0, 1]])
W_v = np.array([[1, 2], [3, 4]])

Q, K, V = compute_qkv(X, W_q, W_k, W_v)
output = self_attention(Q, K, V)

print(output)

[[1.6604769 2.6604769]
 [2.3395231 3.3395231]]
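A note on the softmax step: exponentiating raw scores works for small values like those in this example, but np.exp overflows once scores get large. A common remedy is to subtract each row's maximum before exponentiating, which leaves the softmax unchanged mathematically. The sketch below compares the two forms; softmax_naive, softmax_stable and the large score matrix are illustrative names and values, not part of the task.

```python
import numpy as np

def softmax_naive(scores):
    # Direct row-wise softmax; np.exp overflows to inf for large inputs.
    e = np.exp(scores)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_stable(scores):
    # Shift each row so its maximum is 0 before exponentiating.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

small = np.array([[0.7071, 0.0], [0.0, 0.7071]])   # scores like those in the example
large = np.array([[1000.0, 0.0], [0.0, 1000.0]])   # hypothetical large scores

# On small scores both forms agree.
print(np.allclose(softmax_naive(small), softmax_stable(small)))  # True

# On large scores the naive form produces nan (inf / inf); warnings are silenced here.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)
print(np.isnan(naive_large).any())   # True
print(softmax_stable(large))         # effectively the identity matrix
```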
