### Self-Attention in Transformers

Self-attention is the core mechanism of transformer models: when encoding a token, it lets the model weigh every token in the sequence by how relevant it is to that token. Here's a breakdown of how self-attention is computed in transformers:

1\. **Input Representation**

Each input word (or token) is first converted into a <ins>vector representation</ins>, typically using embeddings. For a sequence of words, this results in a matrix $X$ of shape $(n, d)$, where $n$ is the number of tokens and $d$ is the dimensionality of the embeddings.
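
As a minimal sketch of this step, an embedding table can map each token id to one row of $X$. The vocabulary size, embedding dimension, and token ids below are hypothetical values chosen only for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 100   # hypothetical vocabulary size
d = 16             # embedding dimensionality (illustrative)

# Learned lookup table: one d-dimensional vector per vocabulary entry
embedding = nn.Embedding(vocab_size, d)

# A sequence of n = 5 hypothetical token ids (id 7 appears twice)
token_ids = torch.tensor([12, 7, 55, 7, 3])

X = embedding(token_ids)   # one row per token
print(X.shape)             # torch.Size([5, 16])
```

Repeated tokens map to identical rows here, since a plain embedding lookup carries no information about position.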

2\. **Linear Transformations**

The input matrix $X$ is transformed into three different matrices: **Queries (Q)**, **Keys (K)**, and **Values (V)**. This is done using <ins>learned weight matrices</ins>:

- **Queries**: $Q = XW_Q$
- **Keys**: $K = XW_K$
- **Values**: $V = XW_V$

Here, $W_Q$, $W_K$, and $W_V$ are weight matrices of shape $(d, d_k)$, $(d, d_k)$, and $(d, d_v)$ respectively, where $d_k$ and $d_v$ are the dimensions of the keys and values.
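
The projections can be sketched with bias-free `nn.Linear` layers, which is one common way to hold the learned weight matrices. The sizes below are illustrative assumptions, and `X` is random data standing in for real embeddings:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, d_k, d_v = 5, 16, 8, 8   # illustrative sizes
X = torch.randn(n, d)          # stand-in for the embedded sequence

# nn.Linear stores its weight with shape (out_features, in_features)
# and computes X @ weight.T, so weight.T plays the role of W_Q, W_K, W_V.
W_Q = nn.Linear(d, d_k, bias=False)
W_K = nn.Linear(d, d_k, bias=False)
W_V = nn.Linear(d, d_v, bias=False)

Q, K, V = W_Q(X), W_K(X), W_V(X)
print(Q.shape, K.shape, V.shape)   # (n, d_k), (n, d_k), (n, d_v)
```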

3\. **Compute Attention Scores**

The attention scores are computed by taking the <ins>dot product of the queries with the keys</ins>:

$$
\text{Attention Scores} = QK^T
$$

This results in a matrix of shape $(n, n)$, where element $(i, j)$ is the dot product of the $i$-th token's query with the $j$-th token's key: an unnormalized measure of how relevant token $j$ is to token $i$.
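
A small sketch, using random stand-ins for $Q$ and $K$ with illustrative sizes, shows that one matrix product yields every pairwise query-key dot product at once:

```python
import torch

torch.manual_seed(0)

n, d_k = 4, 8                  # illustrative sizes
Q = torch.randn(n, d_k)        # stand-in queries
K = torch.randn(n, d_k)        # stand-in keys

scores = Q @ K.T               # shape (n, n)

# Entry (i, j) equals the dot product of query i with key j
i, j = 2, 1
single = torch.dot(Q[i], K[j])
print(scores.shape)
print(torch.allclose(scores[i, j], single, atol=1e-5))
```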

4\. **Scale the Scores**

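For large $d_k$, the dot products can grow large in magnitude, which pushes the softmax in the next step into saturated regions where its gradients are extremely small. <ins>To prevent the dot products from growing too large</ins>, the scores are <ins>scaled</ins>: they are divided by the <ins>square root of the dimension of the keys</ins>: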

$$
\text{Scaled Scores} = \frac{QK^T}{\sqrt{d_k}}
$$
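
The choice of $\sqrt{d_k}$ can be checked numerically. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not a property of trained models). Under that assumption a dot product of two $d_k$-dimensional vectors has variance $d_k$, so dividing by $\sqrt{d_k}$ brings the variance back to about 1:

```python
import torch

torch.manual_seed(0)

d_k = 64
num_pairs = 10_000             # number of random query/key pairs to sample

q = torch.randn(num_pairs, d_k)
k = torch.randn(num_pairs, d_k)

raw = (q * k).sum(dim=1)       # one dot product per pair
scaled = raw / d_k ** 0.5

print(raw.var().item())        # close to d_k
print(scaled.var().item())     # close to 1
```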

5\. **Apply Softmax**

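The scaled scores are then passed through a <ins>softmax</ins> function to obtain the <ins>attention weights</ins>. Softmax is applied independently to each row, so row $i$ becomes a probability distribution over the tokens that token $i$ attends to: its entries are non-negative and sum to 1.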

$$
\text{Attention Weights} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
$$
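
As a small illustration with hand-picked scores (hypothetical values, not produced by any model), applying softmax along the last dimension normalizes each row separately:

```python
import torch
import torch.nn.functional as F

# Hypothetical scaled scores for a 3-token sequence
scaled_scores = torch.tensor([[2.0, 1.0, 0.0],
                              [0.0, 0.0, 0.0],
                              [1.0, 3.0, 1.0]])

# dim=-1 normalizes over keys, i.e. within each row
weights = F.softmax(scaled_scores, dim=-1)

print(weights)
print(weights.sum(dim=-1))     # each row sums to 1, up to rounding
```

The second row has equal scores, so its weights are uniform; the other rows put the most weight on their largest score.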

6\. **Compute the Output**

Finally, the output of the self-attention layer is computed by <ins>multiplying the attention weights with the values</ins>:

$$
\text{Output} = \text{Attention Weights} \cdot V
$$

This results in a new matrix of shape $(n, d_v)$ whose $i$-th row is a weighted sum of the value vectors, with the weights taken from row $i$ of the attention weights.
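
Putting the steps together, a single-head, unbatched version can be sketched in a few lines. The helper name `single_head_attention` and the tensor sizes are assumptions made for this example:

```python
import torch
import torch.nn.functional as F


def single_head_attention(Q, K, V):
    """Attention for unbatched Q (n, d_k), K (n, d_k), V (n, d_v)."""
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / (d_k ** 0.5)      # scaled scores, shape (n, n)
    weights = F.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ V, weights            # output has shape (n, d_v)


torch.manual_seed(0)
n, d_k, d_v = 4, 8, 6                      # illustrative sizes
Q = torch.randn(n, d_k)
K = torch.randn(n, d_k)
V = torch.randn(n, d_v)

output, weights = single_head_attention(Q, K, V)
print(output.shape)                        # torch.Size([4, 6])
```

A quick sanity check on the weighted-sum reading: if all queries are zero, every score is zero, the weights are uniform, and each output row is simply the mean of the value vectors.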

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert self.head_dim * heads == embed_size, "Embed size must be divisible by number of heads"

        # Per-head projections (head_dim -> head_dim); the same weights are shared by every head
        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        # Batch size
        N = query.shape[0]

        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split the embedding into self.heads different pieces
        values = values.view(N, value_len, self.heads, self.head_dim)
        keys = keys.view(N, key_len, self.heads, self.head_dim)
        queries = query.view(N, query_len, self.heads, self.head_dim)

        # nn.Linear acts on the last dimension, so each head is projected separately
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Attention scores: dot product of every query with every key,
        # shape (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

        if mask is not None:
            # Positions where mask == 0 get a very negative score, so softmax gives them ~0 weight
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(d_k), where d_k is the per-head key dimension, then
        # apply softmax over the key axis to get attention weights
        attention = F.softmax(energy / (self.head_dim ** 0.5), dim=3)

        # Weighted sum of the values, then concatenate the heads
        out = torch.einsum("nhql,nlhd->nqhd", attention, values).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out


# Example usage
if __name__ == "__main__":
    embed_size = 256
    heads = 8
    seq_len = 10
    batch_size = 32

    # Create random input tensors
    values = torch.rand((batch_size, seq_len, embed_size))
    keys = torch.rand((batch_size, seq_len, embed_size))
    query = torch.rand((batch_size, seq_len, embed_size))
    mask = None  # You can define a mask if needed

    # Initialize the self-attention layer
    self_attention = SelfAttention(embed_size, heads)

    # Forward pass
    output = self_attention(values, keys, query, mask)
    print(output.shape)  # torch.Size([32, 10, 256]), i.e. (batch_size, seq_len, embed_size)
```

### Explanation:
1. **Initialization**:
   - `embed_size`: The size of the embeddings.
   - `heads`: The number of attention heads.
   - `head_dim`: The dimension of each head, which is `embed_size // heads`.

2. **Forward Pass**:
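   - The input tensors (`values`, `keys`, `query`) are reshaped from `(N, len, embed_size)` to `(N, len, heads, head_dim)` to split them into multiple heads.
   - Linear transformations are applied to `values`, `keys`, and `queries`; each layer acts on the last (`head_dim`) dimension, so the same weights are used for every head.
   - The attention scores are computed as the dot product of `queries` and `keys`, giving a tensor of shape `(N, heads, query_len, key_len)`.
   - If a mask is provided, positions where the mask is 0 are filled with a very large negative value, so they receive near-zero weight after softmax.
   - The scores are scaled and softmax is applied along the key dimension to get the attention weights.
   - The weighted sum of the `values` is computed using the attention weights.
   - The heads are concatenated back into shape `(N, query_len, embed_size)` and passed through a final linear layer to get the final output.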
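
The masking step can be illustrated in isolation. The sketch below assumes a causal (lower-triangular) mask, which is one common choice and not something the text above prescribes, and applies it to random stand-in scores. It fills blocked positions with `-inf` rather than a large finite value, which is safe here because every row keeps at least one unmasked entry:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len = 4
scores = torch.randn(seq_len, seq_len)     # stand-in attention scores

# 1 = may attend, 0 = blocked; token i may only attend to tokens j <= i
mask = torch.tril(torch.ones(seq_len, seq_len))

# Blocked positions get -inf, so softmax assigns them exactly zero weight
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)                             # upper triangle is all zeros
```

For scores of shape `(N, heads, query_len, key_len)`, a mask like this would need to broadcast to that shape, for example as `mask[None, None]` with shape `(1, 1, seq_len, seq_len)`.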