# Scaled Dot-Product Attention in PyTorch

The basic idea is to compute the attention scores between a set of **queries (Q)** and **keys (K)** and use these scores to weight a set of **values (V)**. 


In [1]:
import torch
import torch.nn.functional as F
import torch.nn as nn

### Scaled Dot-Product Attention

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$

Where:
-  *Q*  is the query matrix
-  *K*  is the key matrix
-  *V*  is the value matrix
-  *d_k* is the dimension of the key vectors (used for scaling)

The softmax function normalizes the attention scores so that each query's weights over the keys sum to 1, and the output is a weighted sum of the values *V*.
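To see why the scores are divided by $\sqrt{d_k}$, the sketch below measures the spread of raw and scaled dot products for random vectors. It assumes query and key components are independent with unit variance, which is an idealization; the helper name `score_variance` and the sample size are made up for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so repeated runs give the same numbers

def score_variance(d_k, n=2000):
    """Empirical variance of raw and scaled dot products (hypothetical helper)."""
    q = torch.randn(n, d_k)          # n random query vectors
    k = torch.randn(n, d_k)          # n random key vectors
    raw = (q * k).sum(dim=-1)        # one dot product per row
    scaled = raw / d_k ** 0.5        # divide by sqrt(d_k)
    return raw.var().item(), scaled.var().item()

for d_k in (16, 64, 256):
    raw_var, scaled_var = score_variance(d_k)
    print(f"d_k={d_k:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:5.2f}")
```

Under these assumptions the raw variance grows roughly in proportion to *d_k*, while the scaled variance stays near 1, which keeps the softmax inputs in a range where it does not saturate.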

### Steps to Implement
1. **Compute the dot product of queries and keys**: This gives us raw attention scores.
2. **Scale the scores**: Divide the raw scores by $\sqrt{d_k}$.
3. **Apply softmax**: Convert the scaled scores into probabilities (attention weights).
4. **Compute the weighted sum of the values**: Multiply the attention weights with the values *V*.
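The four steps above can be traced on a toy example small enough to check by hand. This is only a sketch: the tensors are made-up values with one query, two keys, and no batch dimension.

```python
import math
import torch
import torch.nn.functional as F

# Toy inputs (illustrative values): 1 query, 2 keys, d_k = d_v = 2
Q = torch.tensor([[1.0, 0.0]])
K = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
V = torch.tensor([[10.0, 0.0], [0.0, 10.0]])
d_k = Q.size(-1)

scores = Q @ K.T                      # Step 1: raw scores, shape (1, 2)
scaled = scores / math.sqrt(d_k)      # Step 2: scale by sqrt(d_k)
weights = F.softmax(scaled, dim=-1)   # Step 3: probabilities over the keys
output = weights @ V                  # Step 4: weighted sum of values, shape (1, 2)

print("weights:", weights)
print("output: ", output)
```

The query matches the first key more closely, so the first value receives the larger weight.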


#### Requirements

**batch_size**:
   - **Definition**: The number of sequences (or examples) that are processed together in one forward pass through the model.
   - Sequences are batched to take advantage of vectorization (parallel processing) and to make training more efficient.

**seq_len** (Sequence Length):
   - **Definition**: The length of each input sequence (i.e., the number of tokens or words in each sequence).

**d_k** (Dimension of the Key and Query vectors):
   - **Definition**: The dimensionality of the **key** and **query** vectors in the attention mechanism.

**d_v** (Dimension of the Value vectors):
   - **Definition**: The dimensionality of the **value** vectors. If **d_v = 128**, each value vector has 128 dimensions.


---



##### Input:
- **Q**: $(\text{batch\_size}, \text{seq\_len}, d_k)$  
- **K**: $(\text{batch\_size}, \text{seq\_len}, d_k)$  
- **V**: $(\text{batch\_size}, \text{seq\_len}, d_v)$  
- **Mask** (optional): $(\text{batch\_size}, \text{seq\_len}, \text{seq\_len})$

##### Output:
- **Attention Output**: $(\text{batch\_size}, \text{seq\_len}, d_v)$  
- **Attention Weights**: $(\text{batch\_size}, \text{seq\_len}, \text{seq\_len})$
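The shapes listed above can be checked with random tensors. The sketch below uses arbitrary example sizes and assumes a mask convention that the text does not specify: 1 marks positions that may be attended to and 0 marks blocked positions. The lower-triangular (causal) mask is just one possible choice.

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, d_k, d_v = 2, 4, 8, 16  # arbitrary example sizes

Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_v)

# Assumed convention: 1 = may attend, 0 = blocked (causal mask as an example)
mask = torch.tril(torch.ones(seq_len, seq_len)).expand(batch_size, seq_len, seq_len)

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5           # (batch_size, seq_len, seq_len)
scores = scores.masked_fill(mask == 0, float("-inf"))   # blocked positions get zero weight
weights = F.softmax(scores, dim=-1)                     # (batch_size, seq_len, seq_len)
output = weights @ V                                    # (batch_size, seq_len, d_v)

print(output.shape)   # torch.Size([2, 4, 16])
print(weights.shape)  # torch.Size([2, 4, 4])
```

With this mask, every weight above the diagonal is exactly zero, so each position attends only to itself and earlier positions.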



### Performance
##### Time Complexity:
$$
O(\text{batch\_size} \times \text{seq\_len}^2 \times (d_k + d_v))
$$
##### Space Complexity:
$$
O(\text{batch\_size} \times \text{seq\_len}^2 + \text{batch\_size} \times \text{seq\_len} \times d_v)
$$
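To make the quadratic dependence on sequence length concrete, here is a rough estimate based on the two expressions above. The helper `attention_cost` is hypothetical; it counts one multiply-add per inner-product term, assumes 4-byte (float32) elements, and ignores the inputs and the softmax itself.

```python
def attention_cost(batch_size, seq_len, d_k, d_v, bytes_per_element=4):
    """Rough multiply-add and memory estimate (hypothetical helper)."""
    qk_macs = batch_size * seq_len * seq_len * d_k    # Q K^T
    av_macs = batch_size * seq_len * seq_len * d_v    # weights times V
    weight_elems = batch_size * seq_len * seq_len     # attention weights
    output_elems = batch_size * seq_len * d_v         # attention output
    return {
        "multiply_adds": qk_macs + av_macs,
        "memory_bytes": (weight_elems + output_elems) * bytes_per_element,
    }

for seq_len in (512, 1024, 2048):
    cost = attention_cost(batch_size=8, seq_len=seq_len, d_k=64, d_v=64)
    mib = cost["memory_bytes"] / 2**20
    print(f"seq_len={seq_len:5d}  multiply-adds={cost['multiply_adds']:,}  memory={mib:.1f} MiB")
```

Doubling `seq_len` quadruples the multiply-add count, and the attention weight matrix soon dominates memory use.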






In [None]:
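class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super().__init__()
        # The dimension of the key vectors, used for scaling
        self.d_k = d_k

    def forward(self, Q, K, V, mask=None):
        # Step 1: Calculate dot product of Q and K -> (batch_size, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1))
        # Step 2: Scale the scores by sqrt(d_k)
        scores = scores / (self.d_k ** 0.5)
        # Step 3: Apply mask (optional) to prevent attending to certain positions.
        # Assumed convention: positions where mask == 0 are blocked.
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        # Step 4: Apply softmax over the key dimension to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        # Step 5: Compute the weighted sum of values (V) -> (batch_size, seq_len, d_v)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights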