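Welcome to Shanghai Jiao Tong University CS7353 (Spring 2025), [Designing and Understanding Deep Neural Networks](https://cs7353.netlify.app/)!

This is the third course assignment; see the [course website](https://cs7353.netlify.app/) for the schedule. Submit the assignment on Canvas and mind the deadlines. Upload a single ipynb file only, and be sure to keep the output of every cell.

If you have any questions, please contact the [teaching assistants](https://cs7353.netlify.app/staff/).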

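# 1 Introduction

In this assignment you will practice writing Transformer code. No GPU is needed for this assignment.

The goals of this assignment are:

- Understand the basic concepts and structure of the Transformer model.
- Implement the Transformer's self-attention mechanism.
- Implement the Transformer's multi-head attention mechanism.
- Write code for the Transformer's encoder and decoder structures.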
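
For reference, the attention operations named in the goals above are defined as follows in the Attention Is All You Need paper (https://arxiv.org/pdf/1706.03762.pdf). Here $Q$, $K$, and $V$ are the query, key, and value matrices, $d_k$ is the key dimension, and $h$ is the number of heads. The helper code in this notebook may organize the projections differently (for example, applying one linear layer before splitting into heads), so read this as the paper's formulation rather than a description of the expected implementation.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
$$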

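Note: follow the rules below strictly. Any violation results in a score of zero for this assignment:
- Write code only at the marked TODO locations; do not change any other code;
- Do not import any other Python packages (addendum: write the code by hand; do not take a shortcut by calling prepackaged functions);
- Be sure to keep the output of every cell.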

# 2 Setup

## 2.1 Mounting Google Drive

Mount Google Drive. We recommend signing in with the same Google account you use for Colab.

In [1]:
import os
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Note: change root_folder to the folder in your Google Drive that contains this notebook.

In [2]:
root_folder = "/content/drive/MyDrive/cs7353_hw3_colab/"
os.makedirs(root_folder, exist_ok=True)
os.chdir(root_folder)

## 2.2 Downloading the data

In [3]:
!pip install gdown
!gdown --fuzzy -O '/content/transformer_encoder_block' 'https://drive.google.com/file/d/1bz5Z2JEhoi68e2QUeJNgRqnAJ0GBAY__/view?usp=sharing'
!gdown --fuzzy -O '/content/transformer_decoder_block' 'https://drive.google.com/file/d/1V3KUEaH6AcGpG9Iafv-0d6s8Rw6iEmSX/view?usp=sharing'
!gdown --fuzzy -O '/content/transformer' 'https://drive.google.com/file/d/1Dws1at4W6jJpVHW0W-RtUmgZVqmfUg0z/view?usp=sharing'

Downloading...
From: https://drive.google.com/uc?id=1bz5Z2JEhoi68e2QUeJNgRqnAJ0GBAY__
To: /content/transformer_encoder_block
100% 7.10k/7.10k [00:00<00:00, 30.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1V3KUEaH6AcGpG9Iafv-0d6s8Rw6iEmSX
To: /content/transformer_decoder_block
100% 11.8k/11.8k [00:00<00:00, 43.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Dws1at4W6jJpVHW0W-RtUmgZVqmfUg0z
To: /content/transformer
100% 71.3k/71.3k [00:00<00:00, 113MB/s]


## 2.3 Helper functions

These are helper functions. **You do not need to write or modify any code here**; just run the cell.
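
The masking helpers in the following cell share one convention: a mask entry of `True` means "may attend", and `False` entries are suppressed by adding a large negative bias before the softmax. The sketch below illustrates that idea in plain Python so it runs without PyTorch. The function names (`lengths_to_padding_mask`, `causal_mask`, `masked_softmax`) are hypothetical and are not part of the notebook's helper code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lengths_to_padding_mask(lengths, max_len):
    """One row per sequence: True marks a real token, False marks padding."""
    return [[pos < n for pos in range(max_len)] for n in lengths]

def causal_mask(seq_len):
    """Row i may attend to column j only when j <= i (no looking ahead)."""
    return [[col <= row for col in range(seq_len)] for row in range(seq_len)]

def masked_softmax(scores, mask, neg=-1e9):
    """Add a large negative bias where mask is False, then normalize."""
    biased = [s + (0.0 if keep else neg) for s, keep in zip(scores, mask)]
    return softmax(biased)

# Two sequences of true length 2 and 4, padded to length 4.
padding = lengths_to_padding_mask([2, 4], max_len=4)
# Attention weights for one query over the first (padded) sequence.
weights = masked_softmax([0.5, 1.0, 2.0, 3.0], padding[0])
# Causal mask for a target sequence of length 3.
future = causal_mask(3)
```

Masked positions end up with a weight that is numerically zero, while the remaining weights still sum to one.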

In [4]:
import numpy as np
import pickle
import collections
from typing import Optional, Sequence, Any, Union, Callable

import torch as th
from torch import nn
from torch.nn import functional as F
from torch.nn.utils import weight_norm

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2


device = th.device('cpu')
def get_device():
    return device
def set_device(new_device):
    global device
    device = new_device

class Stack(nn.Module):
    def __init__(self, layers, *args, **kwargs) -> None:
        super().__init__()
        self._layers = []
        if layers is not None:
            for layer in layers:
                self.add(layer)

    def add(self, layer):
        self._layers.append(layer)

    def forward(self, inputs, **kwargs):
        output = inputs
        for layer in self._layers:
            output = layer(output, **kwargs)
        return output

class DenseStack(Stack):
    """
    A stack of fully connected layers with ReLU activations between them.
    """
    def __init__(self,
                 layers: Sequence[Union[tuple, int]],
                 **kwargs) -> None:
        # Stack.__init__ requires a `layers` argument; pass None and add the layers below
        super().__init__(None)
        if layers is None:
            layers = []
        self.add(nn.Linear(*layers[0:2], **kwargs))
        self.add(nn.ReLU())
        for i in range(1,len(layers)-1):
            layer = layers[i:i+2]
            self.add(nn.Linear(*layer, **kwargs))
            self.add(nn.ReLU())

        out_layer = layers[-2:]
        self.add(nn.Linear(*out_layer, **kwargs))

class WeightNormDense(nn.Linear):

    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features,out_features,bias=bias)
        self.scale = th.ones(1, out_features, requires_grad=True, device=device)

    def forward(self, inputs):
        outputs = inputs.matmul(self.weight.t())
        scale = self.scale / (th.norm(self.weight, dim=0) + 1e-8)
        outputs = outputs * scale
        if self.bias is not None:
            outputs += self.bias

        return outputs

class EmbeddingTranspose(nn.Module):
    """Multiply by the transpose of an embedding layer
    """
    def __init__(self, embedding_layer, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.embedding = embedding_layer

    def forward(self, inputs):
        embed_mat = self.embedding.weight.detach()
        return th.matmul(inputs, embed_mat.T)

class ApplyAttentionMask(nn.Module):
    """
    Applies a mask to the attention similarities.
    """
    def __init__(self):
        super().__init__()

    def forward(self, similarity, mask=None):
        """
            Args:
                  similarity: a Tensor with shape [batch_size, heads (optional), q/k_length, q/k_length]
                  mask: a Tensor with shape [batch_size, q/k_length, q/k_length]

            Returns:
                masked_similarity: a Tensor with shape [batch_size, heads (optional), q/k_length, q/k_length]
        """
        if mask is None:
            return similarity

        # There are many different reasons a mask might be constructed in a particular manner.
        # Because of this we don't want to infer a particular construction.
        assert len(similarity.shape) in (3, 4)
        assert len(mask.shape) == 3

        # If shapes don't match, then similarity has been split for multi-headed attention
        if len(mask.shape) != len(similarity.shape):
            assert similarity[:, 0].shape == mask.shape
            mask = mask.unsqueeze(dim=1)
        else:
            assert similarity.shape == mask.shape

        # We know that we're passing this through a softmax later, thus just add a relatively large negative
        # value to mask the output avoids a hadamard product (though I think that technically it's not
        # any more efficient to do it this way operations wise)
        bias = -1e9 * th.logical_not(mask).float()
        masked_similarity = similarity + bias
        return masked_similarity

# Utility padding functions

def convert_padding_mask_to_attention_mask(sequence, padding_mask):
    """Given a padded input tensor of sequences and a boolean mask for each position
    in the sequence, returns a 3D boolean mask for use in attention.

    Args:
        sequence (th.Tensor): Tensor of shape [batch_size, sequence_length_1, ndim]
        padding_mask (th.Tensor[bool]): Tensor of shape [batch_size, sequence_length_2]

    Returns:
        th.Tensor[bool]: Tensor of shape [batch_size, sequence_length_1, sequence_length_2]
    """
    # Use `assert cond, msg` so the message is actually reported on failure
    assert padding_mask.shape[0] == sequence.shape[0], \
        'batch size mismatch between input sequence and padding_mask'
    assert len(padding_mask.shape) == 2, \
        'Can only convert 2D position mask to 3D attention mask'

    attention_mask = padding_mask[:, None, :].repeat(*(1, sequence.shape[1], 1))
    return attention_mask


def convert_sequence_length_to_sequence_mask(sequence, sequence_lengths):
    """Given a padded input tensor of sequences and a tensor of lengths, returns
    a boolean mask for each position in the sequence indicating whether or not
    that position is padding.

    Args:
        sequence (th.Tensor): Tensor of shape [batch_size, sequence_length, ndim]
        sequence_lengths (th.Tensor[int]): Tensor of shape [batch_size]

    Returns:
        th.Tensor[bool]: Tensor of shape [batch_size, sequence_length]
    """
    # Use `assert cond, msg` so the message is actually reported on failure
    assert sequence_lengths.shape[0] == sequence.shape[0], \
        'batch size mismatch between input sequence and sequence_lengths'
    assert len(sequence_lengths.shape) == 1, \
        'Can only convert 1D sequence_lengths to 2D mask'

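    # th.range is deprecated and includes its endpoint; th.arange yields exactly sequence_length indices
    indices = th.arange(sequence.shape[1], device=sequence_lengths.device)[None, :].repeat(sequence_lengths.shape[0], 1)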
    mask = indices < sequence_lengths[:, None]
    return mask


def convert_to_attention_mask(sequence, mask):
    """Automatically convert from None/1D/2D/3D mask to a boolean 3D attention mask.
    Note this does NOT allow for varying the input mask during training. We could replace
    the python if statements with tensorflow conditionals to allow this, but for the
    moment this is really a helper function and assumes that the type of mask
    passed in is fixed.

    Args:
        sequence (th.Tensor): Tensor of shape [batch_size, sequence_length, ndim]
        mask: Optional[Tensor] of shape [batch_size]
                                     or [batch_size, sequence_length]
                                     or [batch_size, sequence_length, sequence_length]

    Returns:
        Optional[th.Tensor[bool]]: Tensor of shape [batch_size, sequence_length, sequence_length]
    """
    if mask is None:
        return None
    if len(mask.shape) == 1:
        mask = convert_sequence_length_to_sequence_mask(
            sequence, mask)
    if len(mask.shape) == 2:
        mask = convert_padding_mask_to_attention_mask(
            sequence, mask)
    if mask.dtype != th.bool:
        mask = mask.bool()
    return mask


class MultiHeadAttention(nn.Module):
    """
    Fast multi-head attention. Based on the Attention is All You Need paper.

    https://arxiv.org/pdf/1706.03762.pdf
    """

    def __init__(self, n_heads, input_shapes):
        super().__init__()

        self.qa_channels, self.ma_channels = input_shapes

        self.n_heads = n_heads
        self.attention_layer = MultiHeadProjection(n_heads, (self.qa_channels,self.ma_channels))

        assert self.qa_channels % self.n_heads == 0 and self.ma_channels % self.n_heads == 0, \
            'Feature size must be divisible by n_heads'
        assert self.qa_channels == self.ma_channels, 'Cannot combine tensors with different shapes'

        self.query_layer = weight_norm(nn.Linear(self.qa_channels, self.qa_channels, bias=False))
        self.key_layer = weight_norm(nn.Linear(self.qa_channels, self.qa_channels, bias=False))
        self.value_layer = weight_norm(nn.Linear(self.ma_channels, self.ma_channels, bias=False))

        self.output_layer = weight_norm(nn.Linear(self.qa_channels, self.qa_channels, bias=False))

        def weights_init(m):
            # if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight.data)
        self.query_layer.apply(weights_init)
        self.key_layer.apply(weights_init)
        self.value_layer.apply(weights_init)
        self.output_layer.apply(weights_init)


    def forward(self, inputs, mask=None):
        """Fast multi-head self attention.

            :param inputs: tuple of (query_antecedent, memory_antecedent)
                query_antecedent -> tensor w/ shape [batch_size, n_queries, channels]
                memory_antecedent -> tensor w/ shape [batch_size, n_keyval, channels]
        """

        assert isinstance(inputs, (tuple, list)) and len(inputs) == 2, \
            'Must pass query and memory'
        query_antecedent, memory_antecedent = inputs
        q = self.query_layer(query_antecedent)
        k = self.key_layer(memory_antecedent)
        v = self.value_layer(memory_antecedent)

        attention_output = self.attention_layer((q, k, v), mask=mask)
        output = self.output_layer(attention_output)
        return output


class TransformerEncoder(nn.Module):
    """
    Stack of TransformerEncoderBlocks. Performs repeated self-attention.
    """

    def __init__(self,
                 embedding_layer, n_layers, n_heads, d_model, d_filter, dropout=None):
        super().__init__()

        self.embedding_layer = embedding_layer
        embed_size = self.embedding_layer.embed_size
        # The encoding stack is a stack of transformer encoder blocks
        self.encoding_stack = []
        for i in range(n_layers):
            encoder = TransformerEncoderBlock(embed_size, n_heads, d_filter, d_model, dropout)
            setattr(self,f"encoder{i}",encoder)
            self.encoding_stack.append(encoder)

    def forward(self, inputs, encoder_mask=None):
        """
            Args:
                inputs: Either a float32 or int32 Tensor with shape [batch_size, sequence_length, ndim]
                encoder_mask: a boolean Tensor with shape [batch_size, sequence_length, sequence_length]
            Returns:
                output: a Tensor with shape [batch_size, sequence_length, d_model]
        """

        inputs = self.embedding_layer(inputs)
        output = inputs
        for encoder in self.encoding_stack:
            output = encoder(output, self_attention_mask=encoder_mask)

        return output


class TransformerDecoder(nn.Module):
    """
        Stack of TransformerDecoderBlocks. Performs initial embedding to d_model dimensions, then repeated self-attention
        followed by attention on source sequence. The number of layers is set by n_layers.
    """

    def __init__(self,
                 embedding_layer,
                 output_layer,
                 n_layers,
                 n_heads,
                 d_model,
                 d_filter,
                 dropout = None) -> None:
        super().__init__()
        self.embedding_layer = embedding_layer
        embed_size = self.embedding_layer.embed_size
        self.decoding_stack = []
        for i in range(n_layers):
            decoder = TransformerDecoderBlock(embed_size, n_heads, d_filter, d_model, dropout)
            setattr(self,f"decoder{i}",decoder)
            self.decoding_stack.append(decoder)
        self.output_layer = output_layer

    # The self-attention mask combines a causal mask, which blocks the upper-triangular (future) positions,
    # with a padding mask; the cross-attention mask is just the padding mask
    def forward(self, target_input, encoder_output, encoder_mask=None, decoder_mask=None, mask_future=False,
        shift_target_sequence_right=False):
        """
            Args:
                target_input: either an int32 or float32 Tensor with shape [batch_size, target_length, ndims]
                encoder_output: a float32 Tensor with shape [batch_size, sequence_length, d_model]
                encoder_mask: optional boolean attention mask for the encoder sequence
                decoder_mask: optional boolean attention mask for the target sequence
                mask_future: a boolean for whether to mask future states in target self attention
                shift_target_sequence_right: a boolean for whether to shift the target right by one position

            Returns:
                output: the result of applying output_layer to the decoder states,
                    a Tensor whose leading dimensions are [batch_size, target_length]
        """
        if shift_target_sequence_right:
            target_input = self.shift_target_sequence_right(target_input)

        target_embedding = self.embedding_layer(target_input)

        # Build the future-mask if necessary. This is an upper-triangular mask
        # which is used to prevent the network from attending to later timesteps
        # in the target embedding
        batch_size = target_embedding.shape[0]
        sequence_length = target_embedding.shape[1]
        self_attention_mask = self.get_self_attention_mask(batch_size, sequence_length, decoder_mask, mask_future)
        # Build the cross-attention mask. This is an upper-left block matrix which takes care of the masking
        # of the output shapes
        cross_attention_mask = self.get_cross_attention_mask(
            encoder_output, target_input, encoder_mask, decoder_mask)

        # Now actually do the decoding which should take us to the right dimension
        decoder_output = target_embedding
        for decoder in self.decoding_stack:
            decoder_output = decoder(decoder_output, encoder_outputs=encoder_output, self_attention_mask=self_attention_mask, cross_attention_mask=cross_attention_mask)

        # Use the output layer for the final output. For example, this will map to the vocabulary
        output = self.output_layer(decoder_output)
        return output

    def shift_target_sequence_right(self, target_sequence):
        constant_values = 0 if target_sequence.dtype in [th.int32, th.int64] else 1e-10
        pad_array = [1,0,0,0]
        target_sequence = F.pad(target_sequence, pad_array, value=constant_values)[:, :-1]
        return target_sequence

    def get_future_mask(self, batch_size, sequence_length):
        """Mask future targets

            :param batch_size: a Tensor dimension
            :param sequence_length: a Tensor dimension

            :return boolean mask Tensor with shape [batch_size, sequence_length, sequence_length];
                    entry [b, i, j] is True when position i may attend to position j (j <= i)
        """

        xind = th.arange(sequence_length)[None,:].repeat(*(sequence_length, 1))
        yind = th.arange(sequence_length)[:,None].repeat(*(1, sequence_length))
        mask = yind >= xind
        mask = mask[None,...].repeat(*(batch_size, 1, 1))

        return mask.to(get_device())

    def get_self_attention_mask(self, batch_size, sequence_length, decoder_mask, mask_future):
        if not mask_future:
            return decoder_mask
        elif decoder_mask is None:
            return self.get_future_mask(batch_size, sequence_length)
        else:
            return decoder_mask & self.get_future_mask(batch_size, sequence_length)

    # This is an upper left block matrix which masks the attention for things that don't
    # exist within the internals.
    def get_cross_attention_mask(self, encoder_output, decoder_input, encoder_mask, decoder_mask):
        if encoder_mask is None and decoder_mask is None:
            cross_attention_mask = None
        elif encoder_mask is None:
            # We need to not mask the encoding, but mask the decoding
            # The decoding mask should have shape [batch_size x target_len x target_len]
            # meaning all we have to do is pad the mask out properly
            cross_attention_mask = decoder_mask[:, 1, :][:, None, :].repeat(
                                    *(1, encoder_output.shape[1], 1)).permute((0, 2, 1))
        elif decoder_mask is None:
            cross_attention_mask = encoder_mask[:, 1, :][:, :, None].repeat(
                                    *(1, 1, decoder_input.shape[1])).permute((0, 2, 1))
        else:
            dec_attention_mask = decoder_mask[:, 1, :][:, None, :].repeat(
                                    *(1, encoder_output.shape[1], 1)).permute((0, 2, 1))
            enc_attention_mask = encoder_mask[:, 1, :][:, :, None].repeat(
                                    *(1, 1, decoder_input.shape[1])).permute((0, 2, 1))
            cross_attention_mask = th.logical_and(enc_attention_mask, dec_attention_mask)

        return cross_attention_mask


class TransformerInputEmbedding(nn.Module):

    def __init__(self,
                 embed_size,
                 vocab_size = None,
                 dropout = None,
                 batch_norm = False,
                 embedding_initializer=None) -> None:
        super().__init__()
        self.embed_size = embed_size
        self.embedding = nn.Embedding(vocab_size, embed_size) # , weights=[embedding_initializer]

        self.position_encoding = PositionEmbedding(embed_size)
        self.dropout = nn.Dropout(0 if dropout is None else dropout)
        self.batch_norm = None if batch_norm is False else nn.BatchNorm1d(embed_size)

    def forward(self, inputs, start=1):

        # Compute the actual embedding of the inputs by using the embedding layer
        embedding = self.embedding(inputs)
        embedding = self.dropout(embedding)

        if self.batch_norm:
            embedding = self.batch_norm(embedding.permute((0,2,1))).permute((0,2,1))

        embedding = self.position_encoding(embedding, start=start)
        return embedding


attention_qkv_io = {
    "queries": [[[0.7568145140318818, -0.9406817732635092], [1.769451374038473, -0.9116550623898421], [1.8661835681128536, 0.6025814140562978]], [[0.5866493841835547, -0.36339359142428757], [0.23569717049541247, -0.5597841174551036], [0.5483185961131775, -0.3892917419686401]]],
    "keys": [[[-2.2538105167525955, -1.0645904286661103], [-0.08064574158219763, 0.4023086428338018], [-0.892587829736262, -1.9999977918470002], [0.30474216446101005, 0.4019244601536174], [-0.614541368331392, -1.2751128338790234]], [[-1.4598134161928722, -0.14640484084367686], [-1.0005408238685973, 0.2842668872897253], [-0.7026407760372666, -0.8726019165702644], [-0.5294123009127832, 0.9647810947668103], [-0.9869794793673864, 0.4493669724667603]]],
    "values": [[[-1.7519703208940531, -1.603842155680452], [0.5942462537801816, 1.3134188227132357], [0.6391814987323269, 1.0032888068286077], [0.37281050565872187, -0.94684340748494], [-0.2595529386750853, -0.35218536234632414]], [[-0.6545685051249167, -0.242370667235822], [0.6454307054616428, 1.2828652307246584], [0.27862681928250793, -1.0398590010570803], [0.359522015078788, 1.6662751034593328], [-1.793122278766435, -0.48623766032391236]]],
    "output": [[[0.1228136271238327, 0.14300334453582764], [0.26598671078681946, 0.11389785259962082], [0.38478583097457886, -0.08872542530298233]], [[-0.16494245827198029, 0.13024377822875977], [-0.18048660457134247, 0.037906914949417114], [-0.16648535430431366, 0.11781253665685654]]],
    "weights": [[[0.09695513546466827, 0.11691981554031372, 0.37424197793006897, 0.14373677968978882, 0.268146276473999], [0.02826736867427826, 0.16652311384677887, 0.283682644367218, 0.26977160573005676, 0.2517552673816681], [0.00994707178324461, 0.32701191306114197, 0.04024427756667137, 0.5436944365501404, 0.07910224795341492]], [[0.16999198496341705, 0.1841229647397995, 0.28046438097953796, 0.18794991075992584, 0.17747071385383606], [0.19882552325725555, 0.1810005158185959, 0.30068671703338623, 0.14955341815948486, 0.16993381083011627], [0.17315343022346497, 0.1837719976902008, 0.2836241126060486, 0.18291765451431274, 0.1765327900648117]]]
}

multihead_io = {
    "output": [[[-0.5246317386627197, -0.10775024443864822, 0.033127959817647934, -0.5490248799324036, -0.16862450540065765, -0.16868071258068085, -0.2916557192802429, -1.3725045919418335], [-0.03218190371990204, 0.022915983572602272, -1.2075181007385254, 0.6732829213142395, -0.45378199219703674, -0.08632758259773254, -0.41393741965293884, -0.919076144695282], [0.27988940477371216, 0.058166682720184326, -0.5138406157493591, 0.03870147094130516, 0.5612972974777222, -0.0705941841006279, -0.2779857814311981, -1.162642240524292]], [[0.8770015239715576, 0.5850421786308289, -0.1495259553194046, 0.4185110926628113, 0.1175808310508728, 0.3416708707809448, -0.34781304001808167, 0.10453856736421585], [0.2036600559949875, -0.15048746764659882, -0.01727881468832493, 0.1312207728624344, -0.509326159954071, -0.07360519468784332, -0.148659810423851, -0.25658491253852844], [-0.0785459578037262, 0.19411581754684448, 0.20153000950813293, 0.1901312619447708, -0.3952476382255554, -0.09269294142723083, 0.008046015165746212, -0.6706938743591309]]],
    "queries": [[[0.3194274270052471, -1.9223442515664235, 0.13904186090425594, 0.8371305264898589, 0.26401779963192845, -0.4117918851319261, 1.6012595404844467, -2.2479409650447604], [0.2840072595068634, 0.8109092983736957, 2.1954746340254587, -1.2257152698177909, -0.08742599098496921, 0.512182590036912, 0.11973257523499023, -0.3678221283548413], [-1.9344069149491436, 0.4214911545153706, 0.9556532166835882, 0.16317201388513125, -1.4336452792247703, -1.4018675181281455, 0.11520895752868934, -0.998397163741954]], [[2.2777223384038074, 0.3603290221954585, -0.5029542532879165, 1.4180490199685696, -1.6580942009574349, 0.3386894843553686, -0.33204546321853057, 0.22268187807211853], [-1.0949294712578133, -0.5619910633705743, 0.22451875262476312, 1.0043500081435404, 2.2710065381131903, 0.9530366629108166, -0.07781000131900821, -0.6529063960363948], [0.25387920609539383, -0.5611232519239844, -0.15575252729029782, 0.26182678004096377, 0.16017957265543506, 0.6840119713808136, 1.3642227040612827, -0.18800920955890274]]],
    "keys": [[[-0.5898589630874801, 0.7889801053785874, 2.0923819353056836, -0.6393760556642023, -0.053192114864584264, -1.184443238827274, 2.1921885399202, -0.04766041705469036], [1.4517480875342899, 0.17683159748682584, -0.97347685232994, 0.20814248974948632, 0.8215674621794792, 0.13599259903879354, 1.2159406479425043, 2.0299302024758865], [-0.5803237678688972, 0.7740658049440353, -0.7209397448236692, 0.5308770147654643, 0.9324823726381326, -0.6360869416484872, 0.5349322735714923, 0.772719502162348], [-0.39818941036740974, 0.5645370536805914, 0.32033435946303923, 1.91869729051839, 0.5622533436776598, 0.24485565393796552, 0.7060886580479199, -0.33563854885852923], [-0.10184558541295038, 0.9044345182512911, 0.31160489640858013, 1.3990705593902717, 0.512349799696173, 0.3397200986189287, 0.36882081287123303, 0.24859297359863602]], [[-1.5646836897735286, 1.8646979987032584, 1.0226512373664294, -1.2337124747993822, -1.3097235723365321, -0.14175656254391852, -0.7247675488122722, -0.24858938084779458], [-0.08419956264039889, 1.0189703681623181, 0.21904284641756963, -0.565632402544569, 1.068328160114735, -0.22030633310160888, -2.1334492515879337, -2.0770974848695536], [-0.24991223623681327, 0.5209072032347402, -0.03634149412932788, -1.4945061435913691, -1.764390877394693, 0.052975662394942585, -0.03857393725293331, -0.6041617183500434], [0.42013554549113974, -0.4426627926466797, 1.056090755674784, -0.2153023211021123, -0.7976045280757305, -1.1625230635953778, -0.09417464971719552, 0.485285199110894], [1.0549656044534956, 1.3993288979265377, -0.26163411986883045, -0.17477661263182775, 0.43052961696990283, 2.529900705666042, -1.1059117536108698, 2.3844123852304406]]],
    "values": [[[1.2958050537781218, -0.5397587938464922, -1.2411593972411352, 0.6960576609402678, 1.2094702511994904, 0.06738603627771655, -0.425421714414427, -1.3406127949861915], [-1.6261160458796935, -0.47666751254662454, -0.23505388760804094, 2.192872391541756, -0.4013741407993602, 0.30211744189388134, -1.7144404146136416, 1.4473381116148234], [-1.1548467053912241, 0.040684608634628194, -0.5290236503345951, -0.4066565457136003, 0.10154709826225888, -0.8989960885800464, -0.14063859612110702, -1.2025679448505773], [0.6679410047465706, 0.6794994568033029, 0.6674072570030148, -2.6360821373574788, -1.2434530045653551, -1.2844944062635058, 0.2839416433795298, -1.8124305900471425], [0.6791872848266586, 0.4260006937454436, 0.04447698707660998, 0.5773129121458931, -1.11266142994651, 1.1782645414652884, -0.598128895593226, -0.6664804261109919]], [[1.188090664178718, -1.3402074582418317, 1.0954029706889596, 0.07752124708908698, 1.4085165092121372, 0.5448914399838257, 0.1408543478228272, 1.2361222579263214], [-0.6854780489257932, -0.5531703757377959, -0.2924416780090112, 0.9435528803795352, 1.2110011700805712, 1.5346556149205293, -0.32955101490283495, 0.03278081418991892], [0.027543951483365036, 2.155672721277574, 1.7634360768396038, -0.6050090409746256, -0.3465915169875052, 0.5733717614716713, -0.09158177035801474, -2.7347970904266337], [-1.1082059043051353, -0.5526554855544211, -0.46628172358862824, -1.245169957953069, -0.3085131775759757, -0.6191207731760496, 0.49009184615471635, 0.12301677150380383], [1.62791171323798, 0.8843042678528971, -0.6205652735038195, 1.4454372781101044, -1.2827056981047396, -0.7790133281118232, -1.2690439365615782, 0.8358518061648079]]]
}

position_embedding_io = {
    "inputs": [[[0.2841879809955566, -0.8139630319930933, -2.3236730386353375, 0.011535520549498945], [0.12866399877504436, -0.9089874605551281, -0.4294283117668086, 0.12445062548996579], [-0.6892968930700646, -0.22740660528720644, 0.23810752791612652, 1.437984695452255]], [[-0.6974291363980137, 0.39867415935994155, -0.998269271379269, -0.2923702105681359], [0.9980797251527788, -0.18676984608682437, -0.505404494723029, 0.9143668591728924], [-0.24278077989487257, 0.1159503089961709, 0.11467259507054212, 0.8258232459785467]]],
    "output": [[[1.1256589889526367, -0.2736607789993286, -2.313673257827759, 1.0114854574203491], [1.0379613637924194, -1.32513427734375, -0.4094296395778656, 1.1242506504058838], [-0.5481768846511841, -1.2173991203308105, 0.2681030333042145, 2.437534809112549]], [[0.14404189586639404, 0.9389764070510864, -0.9882694482803345, 0.7075797915458679], [1.9073771238327026, -0.6029166579246521, -0.4854058027267456, 1.9141669273376465], [-0.10166077315807343, -0.8740422129631042, 0.1446680873632431, 1.8253732919692993]]]
}

transformer_encoder_block_io = {
    "inputs": [[[-1.3702332115082376, -2.4429925540314095, 0.45380062106110974, -0.41528812653811825, -1.6342437104559, 0.2645424788221865], [0.570188492961689, -1.0934129003976676, -1.0483373257041018, 1.8927986240716832, 1.167973115687023, 1.2565275197983876], [0.22407564895486154, 0.9668081694416605, -0.1674696917487785, -1.1751763300224924, 1.5125989836190825, -1.628791337566757], [-2.3279301755804447, -0.6743820252643327, 0.30823283535961854, 0.5148064462273634, -0.08880789161422287, -0.47917197527327005], [-1.512583026929165, -1.4572451360293381, 1.3505836300461604, -0.7582760993446085, -0.5892077785724741, -0.08871983345871762]], [[-0.0097422411659953, -2.2501728740430886, -0.37250463043420246, 0.1437519662586601, 1.1775478246523967, 0.12540659967678774], [1.1445095267264966, 0.6097394436734916, -0.09180907865747658, 0.2058717827432977, 2.56622270010743, 0.19243672358584651], [-0.8328783363376664, 1.0410539224662043, -1.113083741008717, 0.16627878235958518, -1.413526575834892, 0.03735555411425047], [-1.0499874057121528, 0.10438609112258646, -0.2647144876534182, 0.5877625270325011, -1.9440115572208956, 0.2535885170856571], [-1.0570019691414627, 0.5391254224542747, 0.8863348531267699, 0.37317736897351567, 0.9305629566158702, 1.4900248728211396]]],
    "output": [[[-0.6797174215316772, -2.93060040473938, 1.275446891784668, 0.06107048690319061, -1.2274653911590576, -0.355381041765213], [0.30895307660102844, -1.9238835573196411, -0.40434974431991577, 2.592137575149536, 0.483573853969574, 1.7405859231948853], [0.3060923218727112, 0.8169943690299988, -0.11065448820590973, -0.39384526014328003, 0.5795166492462158, -1.2048671245574951], [-2.200212240219116, -0.9227977991104126, 0.8830193281173706, 1.643011212348938, 0.3457893431186676, -0.34716910123825073], [-0.7731752395629883, -1.623112440109253, 2.2154321670532227, -0.07486292719841003, -0.047842323780059814, -0.43003156781196594]], [[-0.2522561550140381, -2.100017547607422, 0.5644172430038452, 1.195959448814392, 1.1103918552398682, 0.13462112843990326], [1.1970289945602417, 1.140316367149353, 0.6136670708656311, 1.010125994682312, 1.5822440385818481, 0.8279627561569214], [-1.1332038640975952, -0.5114755034446716, 0.014642000198364258, 0.13465869426727295, -3.0932869911193848, 0.9682630300521851], [-1.0926356315612793, -1.0310351848602295, 1.104590892791748, 0.7447429299354553, -2.8887722492218018, 0.4102741479873657], [-0.41485530138015747, 0.15573455393314362, 1.2887852191925049, 0.456840455532074, 0.6938028335571289, 1.8357515335083008]]]
}

transformer_decoder_block_io = {
     "decoder_inputs": [[[-0.43241077184304977, 0.1586999565358803, -0.8761111598215587, 0.763131907749561, -0.8459321772302711, -0.3794276487805159], [-0.9330533050119365, 0.38698499191100905, 0.7174819897383943, 1.3829225874482074, 0.05521407231914396, 1.10968259933925], [0.8669818683948396, -0.3917775325584208, -0.49527488169475026, -0.5626042135185644, -0.4941303097334601, 0.053999525244551504]], [[-0.9346873016466931, -0.04555302873924536, -0.605005037117585, -0.9124328351446505, -0.6354425766891972, 0.4406684188180485], [-0.27633403523351824, 0.013679253879663327, 1.6589034180556606, 0.9049777231404725, -1.2138127259675409, 1.641737935003157], [1.4504650258456235, 0.7554108575731721, -2.4129144061371957, -2.6110537453193454, -0.5274040753253559, 0.3101229955543841]]],
     "encoder_output": [[[0.030647175401534853, 0.025248663724636605, 0.6218956343060061, -0.013216818982989494, 1.8478120235403617, 1.2624123720006315], [0.35698679562398533, 1.5824263628849822, 0.2958278569394237, 0.007192849084785858, 1.1386555321231835, -0.012475029828279426], [0.4530808477543726, -0.03830202239162352, -0.7519525136409332, 0.3517732166332095, 0.9448588116683903, 0.020725787841954196], [-0.5520574571566881, -0.9762824806876904, -1.049578575052563, -0.9125523338185944, -0.07615178022929953, 0.7773840619324337], [0.7056367029529699, 0.24318868046715852, 1.3528945925803582, -0.007392694737610054, -0.6087532366376494, -0.703553125859638]], [[0.17640865358859784, -0.4072012591146658, -0.9668398658167662, -0.33216027319105096, 0.14625919874305857, 0.28202119124427194], [-0.616769533200768, 0.4143287641883316, 0.07969769166455234, -0.4537257102740741, 0.05312382545423559, 1.3200198520831765], [-0.8593366717092451, 0.9087822456648903, 0.027672761133063106, 0.7491907848652445, 1.0673209535292174, 0.7876506278369828], [-0.6831496984851552, 1.1619692218740107, -1.014904381581491, -0.17896784408594288, 0.7461339374088843, -0.45459812432551977], [-0.4508837358548398, -0.5090466192595416, 1.9325287587989863, 0.847295962654667, -0.19723034794094177, -0.5719689510572019]]],
     "expected_output": [[[-0.927000105381012, 0.97175133228302, -2.19303297996521, 0.5960643887519836, -0.49442535638809204, -1.42160964012146], [0.20568913221359253, 1.2681128978729248, 2.456645965576172, 0.8511170744895935, -0.22437334060668945, 1.2418251037597656], [0.7246302962303162, 0.8084559440612793, -0.18514907360076904, -1.4579432010650635, 0.17570900917053223, 0.6669087409973145]], [[-2.1856942176818848, 0.4272521734237671, 0.24339498579502106, -1.8680362701416016, -0.20493602752685547, 0.2234593629837036], [-0.6958361268043518, 0.5447189807891846, 2.227477550506592, -0.3304630219936371, -0.7139310836791992, 1.6215535402297974], [-0.002445518970489502, 1.4482625722885132, -0.49207383394241333, -3.812607765197754, 0.23146450519561768, 1.2503995895385742]]],
     "output": [[[0.6384041905403137, -1.5404860973358154, -0.23241078853607178, 0.452423632144928, -2.4411089420318604, 0.44798165559768677], [-0.365153431892395, -1.3721791505813599, 0.921729564666748, 2.1242012977600098, -1.670034646987915, 1.5885952711105347], [1.1369218826293945, -1.2890963554382324, -0.4290195405483246, -0.5922385454177856, -0.33912041783332825, -0.144917830824852]], [[-0.807811975479126, -1.7920645475387573, -0.9247331619262695, 0.7748958468437195, -1.4148739576339722, 0.44907641410827637], [-0.0914144366979599, -0.9467222690582275, 2.04579496383667, 1.8824766874313354, -2.235422134399414, 1.7229859828948975], [1.5000035762786865, -0.050038933753967285, -2.8955862522125244, -2.2007715702056885, -0.4797360599040985, 0.5970586538314819]]]
}

transformer_io_new = {
    "enc_input": [[2, 3, 0, 5, 7], [0, 10, 8, 10, 1]],
    "dec_input": [[2, 3, 10], [1, 10, 8]],
    "enc_mask": [[2, 3, 0, 5, 7], [0, 10, 8, 10, 1]],
    "dec_mask": [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
    "output": [[[10.396791458129883, 0.4293022155761719, 2.6353344917297363, 5.403274059295654, 6.131463527679443, -2.1769134998321533, 1.9645226001739502, 3.287153720855713, 1.088327169418335, 11.174448013305664, 4.823846817016602], [3.3507421016693115, -3.4542150497436523, 6.391385078430176, 9.999515533447266, 0.9765976667404175, -0.5912466049194336, 4.690258026123047, 5.5837883949279785, 5.312085151672363, 11.330268859863281, 6.580809593200684], [4.593360900878906, -4.211770057678223, 8.075637817382812, 25.52280616760254, -2.3522841930389404, 4.335842132568359, 3.3377373218536377, 3.1725640296936035, 8.276178359985352, 18.223094940185547, 13.579854011535645]], [[10.42356014251709, -0.5592164993286133, 2.6844584941864014, 4.758581161499023, 5.531741619110107, -2.9369068145751953, 1.835939645767212, 3.612109661102295, 1.7104812860488892, 11.429039001464844, 3.5416059494018555], [2.321578025817871, 7.467000484466553, -0.6572381258010864, -3.329051971435547, 8.874455451965332, 0.3954874277114868, 2.824288845062256, 3.426391363143921, -6.490586757659912, -1.7989182472229004, 3.9746737480163574], [6.66019868850708, 1.523187279701233, 8.950340270996094, 11.724065780639648, 3.9338464736938477, 2.0031378269195557, 3.379256248474121, 5.81256628036499, 3.9903314113616943, 10.994983673095703, 16.228452682495117]]]
}

# 3 Query-Key-Value Attention (AttentionQKV)

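Attention mechanisms are a family of neural network layers that have attracted a great deal of interest in recent years, especially for sequence tasks. The literature offers many different definitions of "attention"; the one we use here is: an attention mechanism computes a weighted average of (sequence) elements, with weights computed dynamically from an input Query and the elements' Keys. What does that mean in practice? The goal is to average the features of several elements, but instead of weighting every element equally, we weight each one according to its actual value. In other words, we decide dynamically which inputs deserve more "attention" than others. An attention mechanism usually has four parts that must be specified: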

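- Query: a feature vector describing what we are looking for in the sequence, i.e. what we might want to attend to.
- Key: each input element has a Key, which is also a feature vector. It roughly describes what the element "offers", or when it might be important. Keys should be designed so that, given a Query, we can identify the elements we want to attend to.
- Value: each input element also has a Value vector. This is the feature vector that we average over.
- Score function: to rate which elements we want to attend to, we need to specify a score function. It takes a Query and a Key as input and outputs a score for the Query-Key pair, which is then normalized (typically with a softmax) into an attention weight. It is usually implemented by a simple similarity measure such as the dot product, or by a small MLP.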
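
To make the four parts concrete, here is a minimal sketch in plain Python for a single query, assuming a scaled dot product as the score function. The helper names `softmax` and `attend` and the toy numbers are illustrative and are not part of the assignment code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return the attention-weighted average of `values` and the weights used."""
    d_k = len(query)
    # Score function: dot product of the query with each key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    # Weighted average of the value vectors, one output feature at a time.
    depth_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(depth_v)]
    return output, weights

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
output, weights = attend(query, keys, values)
print(weights)  # the key most similar to the query gets the largest weight
print(output)
```

The first key points in the same direction as the query, so it receives the largest weight and the output is pulled toward the first value vector.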


## 3.1 Code

This part covers the `AttentionQKV` class. Implement its `forward` method.

Follow the mathematical definition of QKV attention given in [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf).
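
For reference, the scaled dot-product attention defined in Section 3.2.1 of that paper is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Here $Q$, $K$, and $V$ stack the queries, keys, and values as rows, $d_k$ is the key depth (`depth_k` in the code below), and the softmax is taken over the key axis.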

In [5]:
class AttentionQKV(nn.Module):
    """
    Computes attention based on provided similarity metric.
    """

    def __init__(self):
        super().__init__()
        self.apply_mask = ApplyAttentionMask()

    def forward(self, queries, keys, values, mask=None):
        """Fast scaled dot product attention.

            :param queries: Tensor with shape [batch_size, heads (optional), n_queries, depth_k]
            :param keys:    Tensor with shape [batch_size, heads (optional), n_keyval, depth_k]
            :param values:  Tensor with shape [batch_size, heads (optional), n_keyval, depth_v]
            :param mask:    Tensor with shape [batch_size, n_queries, n_keyval]

            :return: output: Tensor with shape [batch_size, heads (optional), n_queries, depth_v]
        """
        ####################################  YOUR CODE HERE  ####################################
        # n_queries corresponds to the sequence length on the query side
        # n_keyval corresponds to the sequence length on the key side (and value, as they are one and the same)
        # depth_k is the size of the projection that the key / query comparison is performed on.
        # depth_v is the size of the projection of the value projection. In a setting with one head, it is usually the dimension (dim) of the Transformer.
        # heads corresponds to the number of heads the attention is performed on.
        # If you are unfamiliar with attention heads, read section 3.2.2 of the Attention is all you need paper

        # PART 1: Implement Attention QKV
        # Use queries, keys and values to compute the output of the QKV attention

        # As defined in the Attention is all you need paper: https://arxiv.org/pdf/1706.03762.pdf
        key_dim = th.tensor(keys.shape[-1],dtype=th.float32)

        similarity =  th.matmul(queries, keys.transpose(-2, -1))/th.sqrt(key_dim)

        masked_similarity = self.apply_mask(similarity, mask=mask) # We give you the mask to apply so that it is correct, you do not need to modify this.
        weights = F.softmax(masked_similarity, dim=-1) # softmax over the key axis, so each query's weights sum to 1
        output =  th.matmul(weights, values)
        ####################################  END OF YOUR CODE  ##################################

        return output, weights

## 3.2 Tests

When you are done, run the following code to check your implementation. The error should be below ``1e-6``.

In [6]:
batch_size = 2
n_queries = 3
n_keyval = 5
depth_k = 2
depth_v = 2

io = attention_qkv_io
queries = th.tensor(io['queries'])
keys = th.tensor(io['keys'])
values = th.tensor(io['values'])
expected_output  = th.tensor(io['output'])
expected_weights = th.tensor(io['weights'])

attn_qkv = AttentionQKV()
output, weights = attn_qkv(queries, keys, values)
print("Total output error: ",th.sum(th.abs(expected_output-output)).item())
print("Total weights error: ",th.sum(th.abs(expected_weights-weights)).item())

Total output error:  2.8312206268310547e-07
Total weights error:  2.849847078323364e-07
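
The test above calls `AttentionQKV` without a mask. As a rough illustration of what a mask does, the sketch below builds a causal (lower-triangular) mask and applies it before the softmax, so that each query position gets zero weight on later key positions. The helper names `causal_mask` and `masked_softmax` are hypothetical, and the convention that 1 means "may attend" is an assumption; the notebook's own `ApplyAttentionMask` may be implemented differently.

```python
import math

def causal_mask(n):
    """Lower-triangular n x n mask: 1.0 where query i may attend to key j (j <= i)."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, giving zero weight where mask_row is 0."""
    # Assumed convention: masked-out scores are replaced by -inf before the softmax.
    masked = [s if m else float("-inf") for s, m in zip(scores, mask_row)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.5, 0.0]]
mask = causal_mask(3)
weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, mask)]
for row in weights:
    print(row)
```

The first query can only see the first key, so its weight row is `[1.0, 0.0, 0.0]`; the last query sees all three keys.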


# 4 Multi-head attention


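Multi-head attention is a mechanism for processing sequence data in deep learning, and it has been particularly successful in natural language processing. It extends the attention mechanism and was introduced for the Transformer by Vaswani et al. in [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf). The input sequence is processed by several attention heads; each head computes attention independently and can learn to focus on something different. The outputs of the heads are then concatenated and passed through a linear transformation to produce the final attention output.

The advantage of multi-head attention is that the model can attend to different parts of the input sequence at the same time and so capture richer information, which is especially useful for tasks involving long-range dependencies or complex relationships. The heads are also independent of one another, so they can be computed in parallel, which speeds up training.

In short, multi-head attention gives deep learning models more powerful and flexible modeling capacity for sequence tasks, and it has become a core building block in natural language processing and related fields.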


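## 4.1 Code

In the `MultiHeadProjection` class, implement the `forward`, `_split_heads`, and `_combine_heads` methods.


- The goal is to reuse the `AttentionQKV` class you have already written.
- Your inputs are Query, Key, and Value, each a 3-D tensor of shape (batch_size, sequence_length, feature_size).
- Split each of them into a 4-D tensor of shape (batch_size, n_heads, sequence_length, new_feature_size), where:
$$feature\_size = n\_heads \times new\_feature\_size.$$

- You can then feed the split Q, K, and V into your `AttentionQKV`, which treats each head as an independent attention function.
- Your output must be merged back into a 3-D tensor.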
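
The split and merge steps listed above are pure index bookkeeping. The sketch below shows them on nested Python lists, assuming head `h` owns the contiguous feature slice `[h * new_feature_size, (h + 1) * new_feature_size)`. The functions `split_heads` and `combine_heads` are illustrative stand-ins, not the tensor-based methods you are asked to write.

```python
def split_heads(tensor, n_heads):
    """[batch, seq, feature] -> [batch, n_heads, seq, feature // n_heads]."""
    feature_size = len(tensor[0][0])
    assert feature_size % n_heads == 0, "feature_size must be divisible by n_heads"
    head_size = feature_size // n_heads
    return [
        [[token[h * head_size:(h + 1) * head_size] for token in sequence]
         for h in range(n_heads)]
        for sequence in tensor
    ]

def combine_heads(tensor):
    """[batch, n_heads, seq, head_size] -> [batch, seq, n_heads * head_size]."""
    return [
        # For each position t, concatenate that position's slice from every head.
        [[x for head in heads for x in head[t]] for t in range(len(heads[0]))]
        for heads in tensor
    ]

# One batch element, 3 positions, 4 features, split across 2 heads.
x = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]]
split = split_heads(x, n_heads=2)
print(split[0][0])  # head 0 sees the first two features of every position
print(split[0][1])  # head 1 sees the last two features of every position
print(combine_heads(split) == x)  # merging undoes the split
```

Merging must be the exact inverse of splitting, which is a useful sanity check for the tensor version as well.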

In [8]:
class MultiHeadProjection(nn.Module):

    def __init__(self, n_heads, feature_sizes):
        """Map the multi-headed attention across the map

        Arguments:
            n_heads {int} -- The number of heads in the attention map
            feature_sizes {iterable of int} -- The sizes of the feature dimensions for key, query, and value

        """

        super().__init__()
        self.attention_map = AttentionQKV()
        self.n_heads = n_heads

        for size in feature_sizes:
            assert size % self.n_heads == 0, 'Shape of feature input must be divisible by n_heads'

    def forward(self, inputs, mask=None):
        """Fast multi-head attention.

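        :param inputs: Tuple (queries, keys, values), where
            queries: Tensor with shape [batch_size, n_queries, depth_k]
            keys:    Tensor with shape [batch_size, n_keyval, depth_k]
            values:  Tensor with shape [batch_size, n_keyval, depth_v]
        :param mask: Optional attention mask, passed through to AttentionQKV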

        :return: output: Tensor with shape [batch_size, n_queries, depth_v]
        """
        queries, keys, values = inputs

        # Split each of the projection into its heads, by adding a new dimension
        # You must implement _split_heads, and _combine_heads
        queries_split = self._split_heads(queries)
        keys_split = self._split_heads(keys)
        values_split = self._split_heads(values)

        # Apply the attention map
        attention_output_split, _ = self.attention_map(queries_split, keys_split, values_split, mask=mask)

        # Re-combine the heads together, and return the output.
        output = self._combine_heads(attention_output_split)
        return output

    def _split_heads(self, tensor):
        assert len(tensor.shape) == 3
        ####################################  YOUR CODE HERE  ####################################
        # PART 2: Implement the Multi-head attention.
        # You are given a Tensor which is one of the projections (K, Q or V)
        # and you must "split it" in self.n_heads. This splitting should add a dimension to the tensor,
        # so that each head acts independently

        batch_size, tensorlen = tensor.shape[0], tensor.shape[1]
        feature_size = tensor.shape[2]

        new_feature_size =  feature_size // self.n_heads # Compute what the feature size per head is.
        # Reshape this projection tensor so that it has n_heads, each of new_feature_size
        tensor = tensor.view(batch_size, tensorlen, self.n_heads, new_feature_size)
        # Transpose the matrix so the outer-dimensions are the batch-size and the number of heads
        tensor = tensor.transpose(1,2)
        return tensor
        ##########################################################################################

    def _combine_heads(self, tensor):
        assert len(tensor.shape) == 4
        ####################################  YOUR CODE HERE  ####################################
        # PART 2: Implement the Multi-head attention.
        # You are given the output from all the heads, and you must combine them back into a single rank-3 tensor

        # Transpose back compared to the split, so that the outer dimensions are batch_size and sequence_length again
        tensor = tensor.transpose(1,2)
        batch_size, tensorlen = tensor.shape[0], tensor.shape[1]
        feature_size = tensor.shape[-1]

        new_feature_size =  self.n_heads * feature_size # What is the new feature size, if we combine all the heads
        tensor = tensor.reshape(batch_size, tensorlen, new_feature_size) # Reshape the Tensor to remove the heads dimension and come back to a Rank-3 tensor
        return tensor
        ##########################################################################################

## 4.2 Tests

When you are done, run the following code to check your implementation. The error should be below ``1e-5``.

In [9]:
batch_size = 2
n_queries = 3
n_heads = 4
n_keyval = 5
depth_k = 8
depth_v = 8

io = multihead_io
queries = th.tensor(io['queries'])
keys = th.tensor(io['keys'])
values = th.tensor(io['values'])
expected_output  = th.tensor(io['output'])

mhp = MultiHeadProjection(n_heads, (depth_k,depth_v))
multihead_output = mhp((queries, keys, values))
print("Total output error: ",th.sum(th.abs(expected_output-multihead_output)).item())

Total output error:  1.5934929251670837e-06


# 5 Position Embedding

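Position embedding is a common technique for sequence data, especially in natural language processing, where word order is often essential to the meaning of a text. Recurrent networks process tokens one at a time and so see their order implicitly, but self-attention on its own is insensitive to the order of its inputs: permuting the input elements simply permutes the outputs. Position embeddings are therefore added to give the model information about where each word or token sits in the sequence.

Position information is commonly encoded with sine and cosine functions, or with a **learnable embedding matrix**. The Transformer, for example, uses a technique called positional encoding: a position-dependent vector is added to the word embedding at each position, so that the sum represents both the meaning of a word and its position in the sequence.

Position embeddings give the model the position of every element, which lets it make use of the order and structure of the sequence. This matters for many tasks, particularly natural language processing tasks such as machine translation, text generation, and language modeling: without position information, an attention-based model could not distinguish two sentences that contain the same words in a different order.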
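
As a small worked example of the sinusoidal variant, the sketch below computes the encoding of a single position using the formulas from the paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The function name `positional_encoding` is illustrative, and the code works on one position at a time rather than on batched tensors.

```python
import math

def positional_encoding(position, hidden_size):
    """Sinusoidal encoding of one position, with sin and cos interleaved."""
    assert hidden_size % 2 == 0, "hidden_size must be even"
    encoding = []
    for i in range(0, hidden_size, 2):
        # Each sin/cos pair shares one frequency; i plays the role of 2i in the paper.
        angle = position / (10000 ** (i / hidden_size))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)  # position 0: every sin term is 0 and every cos term is 1
print(pe1)
```

Low feature indices oscillate quickly with position and high indices slowly, so each position gets a distinct pattern. Note that the `forward` method in this notebook starts counting positions at `start=1` by default.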

## 5.1 Code

Implement the `PositionEmbedding` class, following [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf).

In [22]:
class PositionEmbedding(nn.Module):
    """
    Adds positional embedding to an input embedding.

    Based on https://arxiv.org/pdf/1706.03762.pdf.
    """
    def __init__(self, hidden_size):
        super(PositionEmbedding, self).__init__()

        assert hidden_size % 2 == 0, 'Model vector size must be even for sinusoidal encoding'
        power = th.arange(0, hidden_size, step=2, dtype=th.float32)[:] / hidden_size
        divisor = 10000 ** power
        self.divisor = divisor
        self.hidden_size = hidden_size

    def forward(self, inputs, start=1):
        """
            Args:
                inputs: a float32 Tensor with shape [batch_size, sequence_length, hidden_size]

            Returns:
                embedding: a float32 Tensor with shape [batch_size, sequence_length, hidden_size]
        """
        ####################################  YOUR CODE HERE  ####################################
        # PART 3: Implement the Position Embedding.
        # As stated in section 3.5 of the paper, attention does not naturally embed position information
        # To incorporate that, the authors use a variable frequency sin embedding.
        # Note that we use zero-indexing here while the authors use one-indexing

        assert inputs.shape[-1] == self.hidden_size, 'Input final dim must match model hidden size'

        batch_size = inputs.shape[0]
        sequence_length = inputs.shape[1]

        # obtain a sequence that starts at `start` and increments for `sequence_length `
        seq_pos = th.arange(start, sequence_length + start, dtype=th.float32)
        seq_pos_expanded = seq_pos[None,:,None]
        index = seq_pos_expanded.repeat(*[1,1,self.hidden_size//2])

        # create the position embedding as described in the paper
        # use the `divisor` attribute instantiated in __init__
        sin_embedding = th.sin(index/self.divisor) # [1, sequence_length, self.hidden_size//2]
        cos_embedding = th.cos(index/self.divisor)

        # interleave the sin and cos. For more info see:
        # https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/3
        position_shape = (1, sequence_length, self.hidden_size) # fill in the other two dimensions
        position_embedding = th.stack((sin_embedding,cos_embedding), dim=3).view(position_shape)

        pos_embed_deviced = position_embedding.to(get_device())
        return inputs + pos_embed_deviced # add the embedding to the input
        ####################################  END OF YOUR CODE  ##################################

## 5.2 Tests

Run the following code to check your implementation. The error should be below `1e-6`.

In [23]:
batch_size = 2
sequence_length = 3
dim = 4

io = position_embedding_io
inputs = th.tensor(io['inputs'])
expected_output  = th.tensor(io['output'])

pos_emb = PositionEmbedding(dim)
output_t = pos_emb(inputs)
print("Total output error: ",th.sum(th.abs(expected_output-output_t)).item())

Total output error:  2.980232238769531e-07


# 6 Transformer Encoder / Transformer Decoder

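The Transformer consists of an encoder and a decoder, each built from a stack of layers.

1. **Transformer Encoder**:
   The encoder turns the input sequence into a sequence of context-aware representations. It is a stack of identical layers.

2. **Transformer Decoder**:
   The decoder generates the target sequence from the encoder's output.

With this encoder-decoder design, the Transformer performs well on sequence-to-sequence (Seq2Seq) tasks such as machine translation, text summarization, and dialogue generation. It avoids the difficulties that recurrent neural networks (RNNs) have with long sequences, and because it processes all positions of a sequence in parallel it trains much more efficiently, which has made it a milestone in natural language processing.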
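
Every sub-layer in the blocks of section 6.1 is wrapped in the same pattern: normalize the input, apply the sub-layer, then add the result back to the unnormalized input. The paper itself normalizes after the residual, `LayerNorm(x + Sublayer(x))`, so the ordering used in this notebook is a variation on it. The sketch below shows the notebook's ordering on a single feature vector. `layer_norm`, `pre_norm_residual`, and `toy_sublayer` are hypothetical helpers, and the normalization omits the learned scale and shift of `nn.LayerNorm`.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_residual(x, sublayer):
    """Return x + sublayer(layer_norm(x)): normalize, transform, add the residual."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Stand-in for attention or the feed-forward network: halves its input."""
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y = pre_norm_residual(x, toy_sublayer)
print(y)
```

Because the residual path carries `x` through unchanged, the sub-layer only has to learn a correction to its input.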

## 6.1 Code

You now have all the modules needed to implement the Transformer.
In this part, you need to fill in `TransformerFeedForward`, `TransformerEncoderBlock`, and `TransformerDecoderBlock`.

In [40]:
class TransformerFeedForward(nn.Module):
    def __init__(self, input_size,
                 filter_size,
                 hidden_size,
                 dropout):
        super(TransformerFeedForward, self).__init__()
        self.norm = nn.LayerNorm(input_size)
        self.feed_forward = nn.Sequential(
                                nn.Linear(input_size,filter_size),
                                nn.ReLU(),
                                nn.Linear(filter_size,hidden_size)
                            )
        def weights_init(m):
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight.data)
        self.feed_forward.apply(weights_init)
        self.dropout = nn.Dropout(0 if dropout is None else dropout)

    def forward(self, inputs):
        ####################################  YOUR CODE HERE  ####################################
        # PART 4.1: Implement the FeedForward Layer.
        # As seen in fig1, the feedforward layer includes a normalization and residual
        norm_input = self.norm(inputs)
        dense_out = self.feed_forward(norm_input)
        dense_drop =  self.dropout(dense_out)# Add the dropout here
        return inputs + dense_drop # Add the residual here
        ####################################  END OF YOUR CODE  ##################################


class TransformerEncoderBlock(nn.Module):
    """An encoding block from the paper Attention Is All You Need (https://arxiv.org/pdf/1706.03762.pdf).

    :param inputs: Tensor with shape [batch_size, sequence_length, channels]

    :return: output: Tensor with same shape as input
    """

    def __init__(self,
                 input_size,
                 n_heads,
                 filter_size,
                 hidden_size,
                 dropout = None) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(input_size)
        self.self_attention = MultiHeadAttention(n_heads,[input_size,input_size])
        self.feed_forward = TransformerFeedForward(input_size, filter_size, hidden_size, dropout)

    def forward(self, inputs, self_attention_mask=None):

        ####################################  YOUR CODE HERE  ####################################
        # PART 4.2: Implement the Transformer Encoder according to section 3.1 of the paper.
        # Perform a multi-headed self-attention across the inputs.

        # First normalize the input with the LayerNorm initialized in the __init__ function (self.norm)
        norm_inputs = self.norm(inputs)

        # Apply the self-attention with the normalized input, use the self_attention mask as the optional mask parameter.
        attn = self.self_attention((norm_inputs, norm_inputs), self_attention_mask)

        # Apply the residual connection. res_attn should sum the attention output and the original, non-normalized inputs
        res_attn = inputs + attn # Residual connection of the attention block

        # output passes through a feed_forward network
        output = self.feed_forward(res_attn)
        return output


class TransformerDecoderBlock(nn.Module):
    """A decoding block from the paper Attention Is All You Need (https://arxiv.org/pdf/1706.03762.pdf).

    :param inputs: two Tensors encoder_outputs, decoder_inputs
                    encoder_outputs -> a Tensor with shape [batch_size, sequence_length, channels]
                    decoder_inputs -> a Tensor with shape [batch_size, decoding_sequence_length, channels]

    :return: output: Tensor with same shape as decoder_inputs
    """

    def __init__(self,
                 input_size,
                 n_heads,
                 filter_size,
                 hidden_size,
                 dropout = None) -> None:
        super().__init__()
        self.self_norm = nn.LayerNorm(input_size)
        self.self_attention = MultiHeadAttention(n_heads,[input_size,input_size])

        self.cross_attention = MultiHeadAttention(n_heads,[input_size,input_size])
        self.cross_norm_source = nn.LayerNorm(input_size)
        self.cross_norm_target = nn.LayerNorm(input_size)
        self.feed_forward = TransformerFeedForward(input_size, filter_size, hidden_size, dropout)

    def forward(self, decoder_inputs, encoder_outputs, self_attention_mask=None, cross_attention_mask=None):
        # The cross-attention mask should have shape [batch_size x target_len x input_len]

        ####################################  YOUR CODE HERE  ####################################
        # PART 4.3: Implement the Transformer Decoder according to section 3.1 of the paper.
        # The cross-attention mask should have shape [batch_size x target_len x input_len]

        # Compute the self-attention over the decoder inputs. This uses the self-attention
        # mask to control for the future outputs.
        # This generates a tensor of size [batch_size x target_len x d_model]

        norm_decoder_inputs = self.self_norm(decoder_inputs)

        target_selfattn = self.self_attention((norm_decoder_inputs, norm_decoder_inputs), self_attention_mask)

        # Take the residual between the output and the unnormalized input of the self-attention
        res_target_self_attn = decoder_inputs + target_selfattn

        # Compute the attention using the keys/values from the encoder, and the query from the
        # decoder. This takes the encoder output of size [batch_size x source_len x d_model] and the
        # target self-attention layer of size [batch_size x target_len x d_model] and then computes
        # a multi-headed attention across them, giving an output of [batch_size x target_len x d_model]
        # using the encoder as the keys and values and the target as the queries

        norm_target_selfattn = self.cross_norm_target(res_target_self_attn)
        norm_encoder_outputs = self.cross_norm_source(encoder_outputs)
        encdec_attention = self.cross_attention((norm_target_selfattn, norm_encoder_outputs), cross_attention_mask)
        # Take the residual between the output and the unnormalized target input of the cross-attention
        res_encdec_attention = res_target_self_attn + encdec_attention

        output = self.feed_forward(res_encdec_attention)

        return output

## 6.2 Tests

Run the following code to test your implementation. The error should be below `1e-5`.

In [41]:
# Encoder
batch_size = 2
sequence_length = 5
hidden_size = 6
filter_size = 12
n_heads = 2

io = transformer_encoder_block_io
inputs = th.tensor(io['inputs'])
expected_output = th.tensor(io['output'])
enc_block = TransformerEncoderBlock(input_size=6, n_heads=n_heads, filter_size=filter_size, hidden_size=hidden_size)
enc_block.load_state_dict(th.load("/content/transformer_encoder_block"))
output_t = enc_block(inputs)
print("Total output error: ",th.sum(th.abs(expected_output-output_t)).item())


# Decoder
batch_size = 2
encoder_length = 5
decoder_length = 3
hidden_size = 6
filter_size = 12
n_heads = 2

io = transformer_decoder_block_io
decoder_inputs = th.tensor(io['decoder_inputs'])
encoder_output = th.tensor(io['encoder_output'])
expected_output = th.tensor(io['output'])

dec_block = TransformerDecoderBlock(input_size=6, n_heads=n_heads, filter_size=filter_size, hidden_size=hidden_size)
dec_block.load_state_dict(th.load("/content/transformer_decoder_block"))
output_t = dec_block(decoder_inputs, encoder_output)
print("Total output error: ",th.sum(th.abs(expected_output-output_t)).item())

Total output error:  5.02169132232666e-06
Total output error:  2.950429916381836e-06


# 7 Conclusion

Congratulations! You have completed the third assignment. It was hard work, but you now have a deeper understanding of the Transformer!



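> Assignment lead: 郜今 (teaching assistant), gaojin@sjtu.edu.cn.
One last reminder: submit the assignment on Canvas as a single ipynb file, keep the output of every cell, and mind the deadline. If you have any questions, please contact the [teaching assistants](https://cs7353.netlify.app/staff/).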