**Seminar plan** "Creating custom PyTorch-Python operators. Part 2 - Attention Mechanisms"

1) Vanilla Attention [Bahdanau et al., 2015](https://arxiv.org/abs/1409.0473), [Luong et al., 2015](https://arxiv.org/abs/1508.04025)

2) Self-Attention, Simplified Self-Attention [Attention Is All You Need 2017](https://arxiv.org/abs/1706.03762)

3) External Attention [paper 2021](https://arxiv.org/abs/2105.02358)

4) Attention Mechanisms in Computer Vision: A Survey [paper 2021](https://arxiv.org/pdf/2111.07624.pdf)

# Vanilla Attention

Useful links:

1) [General overview + data representation + loss function. Classic RNN Encoder-Decoder](https://medium.com/analytics-vidhya/encoder-decoder-seq2seq-models-clearly-explained-c34186fbf49b)

2) [Description of the paper that the image below comes from](https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/)

**Read the paper:** sections 2.1, 3.1, 3.2

![alt text](https://drive.google.com/uc?export=view&id=1fWp4x3LFZbWXq77VQoBUBNWbvmd7qH_3)

Image taken from [paper](https://arxiv.org/pdf/1703.03906.pdf), [informal description](https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/)


Questions:

1) What is the classic RNN Encoder-Decoder?

2) What is the vector c equal to in the classic RNN Encoder-Decoder?


![alt text](https://drive.google.com/uc?export=view&id=1V3QsG5nCVZI8tVVyqi4B8aEFG5qwQ6tw)


attention query - Si (decoder state)

attention key - Hj (encoder state)

From [this paper, 4.5](https://arxiv.org/pdf/1703.03906.pdf)
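
For reference, the computation in the figure can be written out in the notation above, with $s_i$ the decoder state (query) and $h_j$ the encoder state (key). This is a sketch of the standard formulation from the papers linked in the plan. The parameter names $W_1$, $W_2$ and $v$ are notation chosen here, and Bahdanau et al. use the previous decoder state $s_{i-1}$ as the query.

```latex
\begin{aligned}
e_{ij} &= \operatorname{score}(s_i, h_j), \\
a_{ij} &= \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \\
c_i &= \sum_{j} a_{ij}\, h_j, \\
\text{additive:}\quad \operatorname{score}(s_i, h_j) &= v^\top \tanh(W_1 h_j + W_2 s_i), \\
\text{multiplicative:}\quad \operatorname{score}(s_i, h_j) &= (W_1 h_j)^\top (W_2 s_i).
\end{aligned}
```

Here $a_{ij}$ are the attention weights over encoder positions and $c_i$ is the context vector for decoder step $i$.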

In [None]:
from typing import Optional
import torch
from torch import nn
import numpy as np

In [None]:
class VanillaAttention(nn.Module):
    """
    Implementation of the attention network proposed in [1] and [2].
    Parameters
    ----------
    dim : int
        Size of the input tensor
    References
    ----------
    1. "`Neural Machine Translation by Jointly Learning to Align and Translate. \
            <https://arxiv.org/abs/1409.0473>`_" Dzmitry Bahdanau, et al. ICLR 2015.
    2. "`Effective Approaches to Attention-based Neural Machine Translation. \
            <https://arxiv.org/abs/1508.04025>`_" Minh-Thang Luong, et al. EMNLP 2015.
    """
    def __init__(
        self,
        dim: int,
    ) -> None:
        super(VanillaAttention, self).__init__()

        self.fc_align = nn.Linear(...)

        self.fc_query = nn.Linear(...)
        self.fc_value = nn.Linear(...)

        self.softmax = nn.Softmax(...)
        self.tanh = nn.Tanh()


    def forward(
        self, query: torch.Tensor, key: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Parameters
        ----------
        query : torch.Tensor (batch_size, dim)
            Query
        key : torch.Tensor (batch_size, length, dim)
            Key
        Returns
        -------
        out : torch.Tensor (batch_size, dim)
            Output tensor
        att: torch.Tensor (batch_size, length)
            Attention weights
        """

        # alignment scores
        score = self.fc_align(...)
        score = ...  

        # attention weights
        att = ...

        # context vector (weighted key)
        context = ...

        # attention result
        out = self.tanh(...)

        return out, att

In [None]:
attention = VanillaAttention(2)
query = torch.rand(20, 2)
key = torch.rand(20, 8, 2)

In [None]:
res = attention(query, key)
print(res[0].shape)
print(res[1].shape)

torch.Size([20, 2])
torch.Size([20, 8])
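
One possible way to fill in the blanks is sketched below. It is not the reference solution for the exercise. It assumes a Luong-style "general" score (`query^T W key_j`) and an output of the form `tanh(W_q query + W_c context)`. The class name `LuongStyleAttentionSketch` and the layer sizes are assumptions made for this illustration.

```python
import torch
from torch import nn


class LuongStyleAttentionSketch(nn.Module):
    """Hypothetical completion of the skeleton; layer shapes are assumptions."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.fc_align = nn.Linear(dim, dim, bias=False)  # W in score = q^T W k_j
        self.fc_query = nn.Linear(dim, dim, bias=False)
        self.fc_value = nn.Linear(dim, dim, bias=False)
        self.softmax = nn.Softmax(dim=-1)
        self.tanh = nn.Tanh()

    def forward(self, query: torch.Tensor, key: torch.Tensor):
        # alignment scores: (batch, length, dim) x (batch, dim, 1) -> (batch, length)
        score = self.fc_align(key)
        score = torch.bmm(score, query.unsqueeze(-1)).squeeze(-1)

        # attention weights over the `length` axis
        att = self.softmax(score)

        # context vector: weighted sum of keys, (batch, dim)
        context = torch.bmm(att.unsqueeze(1), key).squeeze(1)

        # attention result combines the query and the context
        out = self.tanh(self.fc_query(query) + self.fc_value(context))
        return out, att


attention = LuongStyleAttentionSketch(2)
out, att = attention(torch.rand(20, 2), torch.rand(20, 8, 2))
print(out.shape, att.shape)
```

With these inputs the shapes match the expected output above: `(20, 2)` for the result and `(20, 8)` for the weights.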


# Attention is all you need

**Read**:

**Questions**:

1) How does an RNN differ from a Transformer? How is the data fed in, and how does this relate to backprop?

2) Why is positional encoding needed?

3) What is the practical purpose of skip connections? What useful information exactly do we pass on and preserve this way?

4) Is the practical meaning of V, K, Q in Multi-Head Attention clear where the Encoder and Decoder are connected?

5) The attention formula: a geometric explanation of the formula.
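
Question 2 can be explored numerically. The sketch below uses a deliberately stripped-down attention with no learned projections and no positional encoding. The function name `plain_self_attention` is hypothetical. It checks that shuffling the input tokens only shuffles the output rows in the same way, which means the layer by itself carries no information about token order.

```python
import torch


def plain_self_attention(x: torch.Tensor) -> torch.Tensor:
    """softmax(x x^T / sqrt(d)) x, without projections or positional encoding."""
    d = x.size(-1)
    score = x @ x.transpose(-2, -1) / d ** 0.5  # (length, length)
    return torch.softmax(score, dim=-1) @ x     # (length, d)


torch.manual_seed(0)
x = torch.rand(5, 4)      # 5 tokens with 4 features each
perm = torch.randperm(5)  # a random reordering of the tokens

out = plain_self_attention(x)
out_shuffled = plain_self_attention(x[perm])

# Permuting the input permutes the output rows identically (up to float error)
print(torch.allclose(out[perm], out_shuffled, atol=1e-5))
```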

In [None]:
from typing import Tuple, Optional
import torch
from torch import nn
import numpy as np

In [None]:
def split_heads(x: torch.Tensor, n_heads: int) -> torch.Tensor:
    """
    Parameters
    ----------
    x : torch.Tensor (batch_size, length, dim)
        Input tensor.
    n_heads : int
        Number of attention heads.
    """
    batch_size, dim = x.size(0), x.size(-1)
    x = x.view(batch_size, -1, n_heads, dim // n_heads)  # (batch_size, length, n_heads, d_head)
    x = x.transpose(1, 2)  # (batch_size, n_heads, length, d_head)
    return x

def combine_heads(x: torch.Tensor) -> torch.Tensor:
    """
    Parameters
    ----------
    x : torch.Tensor (batch_size, n_heads, length, d_head)
        Input tensor.
    """
    # transpose() returns a non-contiguous view, so contiguous() is required before view():
    # https://stackoverflow.com/questions/48915810/what-does-contiguous-do-in-pytorch
    batch_size, n_heads, d_head = x.size(0), x.size(1), x.size(3)
    x = x.transpose(1, 2).contiguous().view(batch_size, -1, d_head * n_heads)  # (batch_size, length, n_heads * d_head)
    return x

def add_mask(x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    Mask out positions by setting their scores to negative infinity, so that they evaluate
    to 0 under the softmax.
    Parameters
    ----------
    x : torch.Tensor (batch_size, n_heads, *, length) or (batch_size, length)
        Input tensor.
    mask : torch.Tensor, optional (batch_size, length)
        Mask matrix (nonzero entries mark positions to hide), ``None`` if it is not needed.
    """
    if mask is not None:
        expanded_mask = mask  # (batch_size, length) already matches a 2-D input
        if x.dim() == 4:
            expanded_mask = mask.unsqueeze(1).unsqueeze(1)  # (batch_size, 1, 1, length)
        x = x.masked_fill(expanded_mask.bool(), -np.inf)
    return x
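
The effect of masking can be checked on a tiny example. This is a minimal sketch with made-up numbers, assuming the convention that a nonzero mask entry marks a position to hide (for example padding).

```python
import torch

score = torch.tensor([[1.0, 2.0, 3.0, 4.0]])  # (batch_size=1, length=4)
mask = torch.tensor([[0, 0, 1, 1]])           # the last two positions are hidden

# Hidden positions get -inf, so exp(-inf) = 0 inside the softmax
masked = score.masked_fill(mask.bool(), float("-inf"))
att = torch.softmax(masked, dim=-1)
print(att)  # the last two weights are exactly 0, the first two sum to 1
```

Note that if every position in a row is masked, the softmax of that row is NaN, so at least one position per row should stay visible.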

In [None]:
class ScaledDotProductAttention(nn.Module):
    """
    Scaled Dot-Product Attention
    Parameters
    ----------
    scale : float
        Scale factor (``sqrt(d_head)``).
    dropout : float, optional
        Dropout, ``None`` if no dropout layer.
    """
    def __init__(self, scale: float, dropout: Optional[float] = 0.5) -> None:
        super(ScaledDotProductAttention, self).__init__()

        self.scale = scale
        self.softmax = nn.Softmax(dim=-1)
        self.dropout = None if dropout is None else nn.Dropout(dropout)

    def forward(
        self,
        Q: torch.Tensor,
        K: torch.Tensor,
        V: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Parameters
        ----------
        Q : torch.Tensor (batch_size, n_heads, length, d_head)
            Query
        K : torch.Tensor (batch_size, n_heads, length, d_head)
            Key 
        V : torch.Tensor (batch_size, n_heads, length, d_head)
            Value
        mask : torch.Tensor, optional (batch_size, length)
            Mask matrix, ``None`` if it is not needed; ``add_mask`` expands it to
            (batch_size, 1, 1, length)
        Returns
        -------
        context : torch.Tensor (batch_size, n_heads, length, d_head)
            Context vector.
        att : torch.Tensor (batch_size, n_heads, length, length)
            Attention weights.
        """
        # Q·K^T / sqrt(d_head)
        score = ...
        score = ... # add_mask

        # eq.1: Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_head))·V
        att = ...
        att = ... # dropout
        context = ...

        return context, att
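
The equation in the comments can be sketched as a standalone function. This is one possible implementation, not the reference solution. The function name `scaled_dot_product_attention_sketch` is hypothetical, dropout is left out, and the mask is assumed to have shape `(batch_size, length)` with nonzero entries marking hidden positions.

```python
import math
from typing import Optional, Tuple

import torch


def scaled_dot_product_attention_sketch(
    Q: torch.Tensor,
    K: torch.Tensor,
    V: torch.Tensor,
    mask: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    d_head = Q.size(-1)
    # Q·K^T / sqrt(d_head): (batch, n_heads, length, length)
    score = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_head)
    if mask is not None:
        # broadcast (batch, length) -> (batch, 1, 1, length) over heads and queries
        score = score.masked_fill(mask[:, None, None, :].bool(), float("-inf"))
    att = torch.softmax(score, dim=-1)  # each query row sums to 1
    context = torch.matmul(att, V)      # (batch, n_heads, length, d_head)
    return context, att


Q = K = V = torch.rand(2, 4, 6, 8)  # batch=2, n_heads=4, length=6, d_head=8
mask = torch.zeros(2, 6)
mask[:, -1] = 1                     # hide the last key position
context, att = scaled_dot_product_attention_sketch(Q, K, V, mask)
print(context.shape, att.shape)
```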

In [None]:
class SelfAttention(nn.Module):
    """
    Implementation of Multi-Head Self-Attention proposed in [1].
    Parameters
    ----------
    dim : int
        Dimension of the input features.
    n_heads : int
        Number of attention heads.
    dropout : float, optional
        Dropout, ``None`` if no dropout layer.
    References
    ----------
    1. "`Attention Is All You Need. <https://arxiv.org/abs/1706.03762>`_" Ashish Vaswani, et al. NIPS 2017.
    """
    def __init__(
        self,
        dim: int,
        n_heads: int = 8,
        dropout: Optional[float] = None
    ) -> None:
        super(SelfAttention, self).__init__()

        assert dim % n_heads == 0

        self.n_heads = n_heads
        self.d_head = dim // n_heads
        
        # linear projections
        self.W_Q = nn.Linear(...)
        self.W_K = nn.Linear(...)
        self.W_V = nn.Linear(...)

        # scaled dot-product attention
        scale = self.d_head ** 0.5  # scale factor
        self.attention = ScaledDotProductAttention(scale=scale, dropout=dropout)

        self.layer_norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(...)

        self.dropout = None if dropout is None else nn.Dropout(dropout)

    def forward(
        self, x: torch.Tensor, mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Parameters
        ----------
        x : torch.Tensor (batch_size, length, dim)
            Input data, where ``length`` is the sequence length (number of feature vectors) and
            ``dim`` is the dimension of each feature vector.
        mask : torch.Tensor, optional (batch_size, length)
            Mask matrix, ``None`` if it is not needed.
        Returns
        -------
        out : torch.Tensor (batch_size, length, dim)
            Output of multi-head self-attention network.
        """
        Q = ...
        K = ...
        V = ...

        Q, K, V = split_heads(Q, self.n_heads), split_heads(K, self.n_heads), split_heads(V, self.n_heads)  # (batch_size, n_heads, length, d_head)

        context, _ = ...
        context = combine_heads(context)  # (batch_size, length, n_heads * d_head)

        out = ...
        out = out if self.dropout is None else self.dropout(out)

        out = ... # residual connection
        out = self.layer_norm(out)  # LayerNorm

        return out
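
A compact, self-contained sketch of the whole block follows: projections, head splitting, attention, output projection, residual connection and LayerNorm. It is one possible completion rather than the reference solution. The class name `MultiHeadSelfAttentionSketch`, the `dim -> dim` layer sizes and the use of `nn.Identity` in place of a `None` dropout are assumptions of this illustration.

```python
import math
from typing import Optional

import torch
from torch import nn


class MultiHeadSelfAttentionSketch(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8, dropout: Optional[float] = None) -> None:
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.d_head = n_heads, dim // n_heads
        self.W_Q = nn.Linear(dim, dim)
        self.W_K = nn.Linear(dim, dim)
        self.W_V = nn.Linear(dim, dim)
        self.fc = nn.Linear(dim, dim)
        self.layer_norm = nn.LayerNorm(dim)
        self.dropout = nn.Identity() if dropout is None else nn.Dropout(dropout)

    def _split(self, t: torch.Tensor) -> torch.Tensor:
        b, l, _ = t.shape  # -> (batch, n_heads, length, d_head)
        return t.view(b, l, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        b, l, dim = x.shape
        Q, K, V = self._split(self.W_Q(x)), self._split(self.W_K(x)), self._split(self.W_V(x))

        score = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            score = score.masked_fill(mask[:, None, None, :].bool(), float("-inf"))
        att = self.dropout(torch.softmax(score, dim=-1))

        # merge heads back: (batch, length, n_heads * d_head)
        context = (att @ V).transpose(1, 2).contiguous().view(b, l, dim)

        out = self.dropout(self.fc(context))
        return self.layer_norm(out + x)  # residual connection, then LayerNorm


layer = MultiHeadSelfAttentionSketch(dim=16, n_heads=4)
y = layer(torch.rand(2, 5, 16))
print(y.shape)
```

The output keeps the input shape `(batch_size, length, dim)`, which is what lets such blocks be stacked.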

In [None]:
class SimplifiedSelfAttention(SelfAttention):
    """
    Implementation of a common simplified version of Multi-Head Self-Attention, which drops the
    linear projection layers and directly calculates an attention map from the input feature to
    reduce the computational complexity.
    Parameters
    ----------
    dim : int
        Dimension of the input features.
    n_heads : int
        Number of attention heads.
    dropout : float, optional
        Dropout, ``None`` if no dropout layer.
    """
    def __init__(
        self,
        dim: int,
        n_heads: int = 8,
        dropout: Optional[float] = None
    ) -> None:
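        # Skip SelfAttention.__init__ and call nn.Module.__init__ directly,
        # so the parent's linear projection layers are not created here.
        super(SelfAttention, self).__init__()

        ...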

    def forward(
        self, x: torch.Tensor, mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Parameters
        ----------
        x : torch.Tensor (batch_size, length, dim)
            Input data, where ``length`` is the sequence length (number of feature vectors) and
            ``dim`` is the dimension of each feature vector.
        mask : torch.Tensor, optional (batch_size, length)
            Mask matrix, ``None`` if it is not needed.
        Returns
        -------
        out : torch.Tensor (batch_size, length, dim)
            Output of multi-head self-attention network.
        """
        ...

        return out
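
A sketch of the simplified variant, where the input itself plays the role of Q, K and V, could look as follows. Which layers the intended solution keeps is not specified here. This version assumes no output projection and keeps the residual connection and LayerNorm. The class name `SimplifiedSelfAttentionSketch` is hypothetical.

```python
import math
from typing import Optional

import torch
from torch import nn


class SimplifiedSelfAttentionSketch(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8, dropout: Optional[float] = None) -> None:
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.d_head = n_heads, dim // n_heads
        self.layer_norm = nn.LayerNorm(dim)
        self.dropout = nn.Identity() if dropout is None else nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        b, l, dim = x.shape
        # no W_Q / W_K / W_V: the split input is used as query, key and value
        heads = x.reshape(b, l, self.n_heads, self.d_head).transpose(1, 2)

        score = heads @ heads.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            score = score.masked_fill(mask[:, None, None, :].bool(), float("-inf"))
        att = self.dropout(torch.softmax(score, dim=-1))

        context = (att @ heads).transpose(1, 2).reshape(b, l, dim)
        return self.layer_norm(context + x)  # residual connection, then LayerNorm


layer = SimplifiedSelfAttentionSketch(dim=16, n_heads=4)
y = layer(torch.rand(2, 5, 16))
print(y.shape)
```

Apart from LayerNorm, this module has no trainable parameters, which is where the reduction in computation comes from.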
