# Channel Attention Implementations

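- Implementation in SENet
- The attention map has shape torch.Size([16, 64, 1, 1]), i.e. one weight is assigned to each channel.
- https://github.com/moskomule/senet.pytorch/blob/master/senet/se_module.py
- Its purpose is to explore the relationships between feature-map channels: it explicitly models the interdependencies and correlations between channels to strengthen the model's representational power.
- The network thereby learns to use global information to emphasise useful features and suppress less useful ones.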
- In this paper, we investigate a different aspect of network
design—the relationship between channels. We introduce a
new architectural unit, which we term the Squeeze-and-Excitation
(SE) block, with the goal of improving the quality of
representations produced by a network by explicitly modelling
the interdependencies between the channels of its convolutional
features. To this end, we propose a mechanism
that allows the network to perform feature recalibration,
through which it can learn to use global information to
selectively emphasise informative features and suppress
less useful ones.

In this paper we proposed the SE block, an architectural unit
designed to improve the representational power of a network
by enabling it to perform dynamic channel-wise feature recalibration.
A wide range of experiments show the effectiveness
of SENets, which achieve state-of-the-art performance across
multiple datasets and tasks. In addition, SE blocks shed some
light on the inability of previous architectures to adequately
model channel-wise feature dependencies. We hope this
insight may prove useful for other tasks requiring strong discriminative
features.

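In short, the paper proposes the SE block, an architectural unit that improves a model's representational power by performing dynamic channel-wise feature recalibration.
SE blocks also shed light on why earlier architectures failed to model channel-wise feature dependencies adequately.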

In [8]:
from torch import nn


class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        print(b, c)
        y = self.avg_pool(x).view(b, c) #squeeze produces a channel descriptor by aggregating feature maps across their spatial dimensions
        y = self.fc(y).view(b, c, 1, 1)
        print('attention map shape:', y.shape) #torch.Size([16, 64, 1, 1])
        return x * y.expand_as(x)  # * is the Hadamard (element-wise) product: scale the input x by the attention map


- Testing the SE layer

In [9]:
import torch
features = torch.randn(16, 64, 64,48)

selayer = SELayer(64)

feature_out = selayer(features)  # the SE layer leaves the shape unchanged: torch.Size([16, 64, 64, 48])
print(feature_out.shape)

16 64
attention map shape: torch.Size([16, 64, 1, 1])
torch.Size([16, 64, 64, 48])
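
The expand_as call in SELayer.forward is not strictly required, because a (b, c, 1, 1) attention map broadcasts against a (b, c, h, w) feature map. The sketch below checks this with a random stand-in for the attention map (an assumption for illustration, not the output of a trained SE layer):

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, 64, 64, 48)                 # feature map (b, c, h, w)
y = torch.sigmoid(torch.randn(16, 64, 1, 1))    # stand-in attention map: one weight per channel

out_expand = x * y.expand_as(x)   # explicit expansion, as in SELayer.forward
out_broadcast = x * y             # implicit broadcasting

same = torch.equal(out_expand, out_broadcast)
# Every spatial position of a channel is scaled by that channel's single weight.
per_channel = torch.allclose(out_broadcast[3, 5], x[3, 5] * y[3, 5, 0, 0])
print(same, per_channel)
```

Both forms give the same tensor, so the choice between them is a matter of readability.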


- How the SELayer above is embedded in a residual block
- This largely follows the standard SE block design from the paper


In [2]:
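import torch.nn as nn


def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding (same helper as in the ECA section below)"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)
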

class SEBasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None,
                 *, reduction=16):
        super(SEBasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes, 1)
        self.bn2 = nn.BatchNorm2d(planes)
        self.se = SELayer(planes, reduction)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.se(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out
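
When a block changes the stride or the number of channels, the shortcut has to be projected to the same shape before the addition. The sketch below is one possible way to wire this up. TinySE and SEResidualBlock are hypothetical names for this illustration, and the 1x1 convolution plus BatchNorm projection is an assumption borrowed from common ResNet practice rather than something stated in these notes:

```python
import torch
from torch import nn


class TinySE(nn.Module):
    """Compact SE layer with the same structure as SELayer above."""

    def __init__(self, channel, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze: global average pool over h, w
        return x * w.view(b, c, 1, 1)        # excite: per-channel rescale


class SEResidualBlock(nn.Module):
    def __init__(self, inplanes, planes, stride=1, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(inplanes, planes, 3, stride, 1, bias=False),
            nn.BatchNorm2d(planes),
            nn.ReLU(inplace=True),
            nn.Conv2d(planes, planes, 3, 1, 1, bias=False),
            nn.BatchNorm2d(planes),
            TinySE(planes, reduction),       # SE is applied before the skip addition
        )
        # Project the shortcut only when the output shape differs from the input shape.
        self.downsample = None
        if stride != 1 or inplanes != planes:
            self.downsample = nn.Sequential(
                nn.Conv2d(inplanes, planes, 1, stride, bias=False),
                nn.BatchNorm2d(planes),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x if self.downsample is None else self.downsample(x)
        return self.relu(self.body(x) + residual)


block = SEResidualBlock(64, 128, stride=2)
out = block(torch.randn(2, 64, 32, 32))
print(out.shape)
```

With stride 2 the spatial size halves from 32 to 16 while the channel count doubles, and the projected shortcut matches that shape.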

- Implementation in DANet
- The attention map has shape torch.Size([16, 64, 64]), i.e. a matrix that expresses the relationship between every pair of channels

In [21]:
import torch
from torch import nn
from mmcv.cnn import Scale
import torch.nn.functional as F

class CAM(nn.Module):
    """Channel Attention Module (CAM)"""

    def __init__(self):
        super(CAM, self).__init__()
        self.gamma = Scale(0)  # a learnable scale parameter, initialised to 0 (beta in the paper)

    def forward(self, x):
        """Forward function."""
        batch_size, channels, height, width = x.size()
        proj_query = x.view(batch_size, channels, -1)
        proj_key = x.view(batch_size, channels, -1).permute(0, 2, 1)
        energy = torch.bmm(proj_query, proj_key) #matrix multiplication
        print('energy shape: ', energy.shape)
        energy_new = torch.max(
            energy, -1, keepdim=True)[0].expand_as(energy) - energy
        print('energy_new shape: ', energy_new.shape)
        
        attention = F.softmax(energy_new, dim=-1)
        print('attention map shape: ', attention.shape) # torch.Size([16, 64, 64])
        proj_value = x.view(batch_size, channels, -1)
        print('proj_value shape:', proj_value.shape)
        out = torch.bmm(attention, proj_value)
        print('out shape: ',out.shape)
        out = out.view(batch_size, channels, height, width)

        out = self.gamma(out) + x
        return out


In [22]:
import torch
features = torch.randn(16, 64, 64,48)

camlayer = CAM()

feature_out = camlayer(features)
print(feature_out.shape)

energy shape:  torch.Size([16, 64, 64])
energy_new shape:  torch.Size([16, 64, 64])
attention map shape:  torch.Size([16, 64, 64])
proj_value shape: torch.Size([16, 64, 3072])
out shape:  torch.Size([16, 64, 3072])
torch.Size([16, 64, 64, 48])
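
The same module can be sketched without the mmcv dependency by replacing Scale(0) with a plain nn.Parameter. SimpleCAM and its channel_attention helper are hypothetical names for this illustration. The sketch also checks two properties of the code above: each row of the attention map sums to 1, and the block is an identity mapping at initialisation because the scale starts at 0.

```python
import torch
from torch import nn
import torch.nn.functional as F


class SimpleCAM(nn.Module):
    """Channel attention in the style of the CAM above, without mmcv."""

    def __init__(self):
        super().__init__()
        # Learnable residual scale initialised to 0 (plays the role of Scale(0)).
        self.gamma = nn.Parameter(torch.zeros(1))

    def channel_attention(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, -1)                              # (b, c, h*w)
        energy = torch.bmm(flat, flat.transpose(1, 2))       # (b, c, c) channel affinities
        energy = energy.max(dim=-1, keepdim=True)[0] - energy
        return F.softmax(energy, dim=-1)                     # each row sums to 1

    def forward(self, x):
        b, c, h, w = x.shape
        attention = self.channel_attention(x)
        out = torch.bmm(attention, x.view(b, c, -1)).view(b, c, h, w)
        return self.gamma * out + x


torch.manual_seed(0)
x = torch.randn(4, 8, 6, 5)
cam = SimpleCAM()
attn = cam.channel_attention(x)
rows_sum_to_one = torch.allclose(attn.sum(-1), torch.ones(4, 8))
identity_at_init = torch.equal(cam(x), x)
print(attn.shape, rows_sum_to_one, identity_at_init)
```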


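- Implementation in ENCAM
- Difference from SENet: SENet implements the squeeze step with global average pooling (AdaptiveAvgPool2d in PyTorch), whereas ENCAM uses both global average pooling and global max pooling (AdaptiveAvgPool2d and AdaptiveMaxPool2d), passes each pooled descriptor through the same mapping, and sums the two results.
- The attention map has shape torch.Size([16, 64, 1, 1]), i.e. one weight is assigned to each channel.

- The mapping in ENCAM uses 2D convolutions with 1x1 kernels, whereas SENet uses Linear layers.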
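
On a pooled (b, c, 1, 1) descriptor, a 1x1 Conv2d and a Linear layer compute the same thing, so the difference between the two designs is mostly one of tensor layout. The sketch below illustrates this by copying the Linear weights into the convolution kernel; the sizes 64 and 8 are chosen only for the example:

```python
import torch
from torch import nn

torch.manual_seed(0)
channels, hidden = 64, 8

linear = nn.Linear(channels, hidden, bias=False)
conv = nn.Conv2d(channels, hidden, kernel_size=1, bias=False)

# Copy the Linear weights (hidden, channels) into the conv kernel (hidden, channels, 1, 1).
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(hidden, channels, 1, 1))

pooled = torch.randn(16, channels, 1, 1)     # a squeezed descriptor, as produced by global pooling
out_conv = conv(pooled).flatten(1)           # (16, hidden)
out_linear = linear(pooled.flatten(1))       # (16, hidden)

equivalent = torch.allclose(out_conv, out_linear, atol=1e-5)
print(equivalent)
```

The convolutional form avoids the view calls that the Linear form needs before and after the mapping.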

In [23]:
class ChannelAttention(nn.Module):
    def __init__(self, in_planes, ):
        super(ChannelAttention, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

        self.fc1 = nn.Conv2d(in_planes, in_planes // 8, 1, bias=False)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Conv2d(in_planes // 8, in_planes, 1, bias=False)

        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_out = self.fc2(self.relu1(self.fc1(self.avg_pool(x))))
        max_out = self.fc2(self.relu1(self.fc1(self.max_pool(x))))
        out = avg_out + max_out
        return self.sigmoid(out)

In [25]:
import torch
features = torch.randn(16, 64, 64,48)

channel_attention = ChannelAttention(64)
feature_out = channel_attention(features)
print('attention map shape: ', feature_out.shape)

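# After obtaining the attention map, multiply the original features by it (features * feature_out); this reweights the original features along the channel dimension.
# See Fig. 2, "The architecture of the channel attention model", in the original paper.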

attention map shape:  torch.Size([16, 64, 1, 1])


# 3D Squeeze-and-Excitation
https://github.com/ai-med/squeeze_and_excitation

In [6]:
from torch import nn


class SELayer3D(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer3D, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d(1)  # reduces each of the last three axes to size 1
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _,_ = x.size()
        print(b, c)
        y = self.avg_pool(x).view(b, c) #squeeze produces a channel descriptor by aggregating feature maps across their spatial dimensions
        y = self.fc(y).view(b, c, 1, 1, 1)
        print('attention map shape:', y.shape) #torch.Size([16, 64, 1, 1, 1])
        return x * y.expand_as(x)  # * is the Hadamard (element-wise) product: scale the input x by the attention map

- Testing SELayer3D

In [7]:
import torch
features = torch.randn(16, 64, 64,48, 32)

selayer = SELayer3D(64)

feature_out = selayer(features)  # the SE layer leaves the shape unchanged: torch.Size([16, 64, 64, 48, 32])
print(feature_out.shape)

16 64
attention map shape: torch.Size([16, 64, 1, 1, 1])
torch.Size([16, 64, 64, 48, 32])


- Define a new 3D attention that attends only to the original channel dimension and leaves the band dimension intact

In [30]:
class SELayer3DNew(nn.Module):
    def __init__(self, channel, reduction=4):
        super(SELayer3DNew, self).__init__()
        # Pool only H and W; None keeps the band axis at its input size.
        self.avg_pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, band_num, _, _ = x.size()
        # Squeeze: one descriptor per (channel, band), shape (b, c, band_num)
        y = self.avg_pool(x).view(b, c, band_num)
        # nn.Linear acts on the last axis, so move channels last: (b, band_num, c).
        # Without this transpose the layers would mix bands and only run when band_num == channel.
        y = self.fc(y.transpose(1, 2)).transpose(1, 2)
        y = y.unsqueeze(-1).unsqueeze(-1)  # attention map: (b, c, band_num, 1, 1)
        return x * y.expand_as(x)  # * is the Hadamard (element-wise) product: scale x by the attention map

- Testing the new 3D attention

In [32]:
import torch
features = torch.randn(16, 64, 64,48, 32)

selayer = SELayer3DNew(64)

feature_out = selayer(features)  # the layer leaves the shape unchanged: torch.Size([16, 64, 64, 48, 32])
print(feature_out.shape)

torch.Size([16, 64, 64, 48, 32])
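
Because nn.Linear acts on the last axis of its input, a band-preserving SE layer has to place the channel axis last before the fully connected layers. A test input with 64 channels and 64 bands cannot reveal which axis the layers act on, so the sketch below uses an input whose band count differs from its channel count. BandwiseSE is a hypothetical name for this illustration:

```python
import torch
from torch import nn


class BandwiseSE(nn.Module):
    """Hypothetical variant: one channel-attention vector per band."""

    def __init__(self, channel, reduction=4):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # pool h and w, keep the band axis
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, bands, _, _ = x.shape
        y = self.avg_pool(x).view(b, c, bands)       # (b, c, bands)
        y = self.fc(y.transpose(1, 2))               # Linear acts on channels: (b, bands, c)
        y = y.transpose(1, 2).reshape(b, c, bands, 1, 1)
        return x * y                                 # broadcast over h and w


x = torch.randn(2, 64, 10, 12, 8)    # 64 channels but only 10 bands
out = BandwiseSE(64)(x)
print(out.shape)
```

Each band keeps its own set of channel weights, and the output shape matches the input shape.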


## ECA-Net
- Improvements over SE-Net
- Our ECA module aims at capturing local cross-channel
interaction, which shares some similarities with channel local
convolutions [35] and channel-wise convolutions [8];
different from them, our method investigates a 1D convolution
with adaptive kernel size to replace FC layers in channel
attention module.
- While dimensionality reduction in SE-Net can reduce
model complexity, it destroys the direct correspondence between
channel and its weight. For example, one single FC
layer predicts weight of each channel using a linear combination
of all channels. But Eq. (2) first projects channel
features into a low-dimensional space and then maps
them back, making correspondence between channel and
its weight be indirect. (In short: the reduce-then-restore bottleneck in SE-Net lowers model complexity, but each channel's weight is no longer predicted directly from the channel descriptors.)
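
The "adaptive kernel size" mentioned in the quote can be sketched as follows. The rule k = |log2(C)/gamma + b/gamma|_odd with gamma = 2 and b = 1 is taken from the ECA paper and is not stated in these notes, so treat it as an assumption here. The function names eca_kernel_size and n_params are hypothetical. The sketch also compares the parameter count of the SE bottleneck with that of the ECA convolution:

```python
import math
from torch import nn


def eca_kernel_size(channels, gamma=2, b=1):
    """Truncate (log2(C) + b) / gamma and bump even values up to the next odd number."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def n_params(module):
    return sum(p.numel() for p in module.parameters())


channels, reduction = 64, 16
# SE-style bottleneck: two bias-free fully connected layers.
se_fc = nn.Sequential(
    nn.Linear(channels, channels // reduction, bias=False),
    nn.ReLU(),
    nn.Linear(channels // reduction, channels, bias=False),
)
# ECA-style mapping: a single 1D convolution whose kernel slides across channels.
k = eca_kernel_size(channels)
eca_conv = nn.Conv1d(1, 1, kernel_size=k, padding=(k - 1) // 2, bias=False)

print(k, n_params(se_fc), n_params(eca_conv))
```

For 64 channels the rule gives a kernel size of 3, so the ECA mapping has 3 parameters against 512 for the SE bottleneck with reduction 16.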

In [34]:
import torch
import torch.nn as nn
class eca_layer(nn.Module):
    """Constructs a ECA module.
    Args:
        channel: Number of channels of the input feature map
        k_size: Adaptive selection of kernel size
    """
    def __init__(self, channel, k_size=3):
        super(eca_layer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False) 
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: input features with shape [b, c, h, w]
        b, c, h, w = x.size()
        print(b, c, h, w)
        # feature descriptor on the global spatial information
        y = self.avg_pool(x)  # collapse the two spatial dims: (b, c, 1, 1)

        print('y.shape =', y.shape) #(b, c, 1, 1)
        print('y.squeeze = ',y.squeeze(-1).shape)  # squeeze(-1) drops the last dim
        print('transpose =', y.squeeze(-1).transpose(-1, -2).shape)  # transpose(-1, -2) swaps the last two dims
        print('atter conv1d shape =', self.conv(y.squeeze(-1).transpose(-1, -2)).shape)
        print('after transpose shape =', self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).shape)
        # 1D convolution along the channel axis: (b, c, 1) -> (b, 1, c) -> conv -> (b, c, 1) -> (b, c, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)  # unsqueeze(-1) appends a trailing dim
        print('after y shape =', y.shape)
        # Squash to (0, 1) to obtain the per-channel weights
        y = self.sigmoid(y)

        return x * y.expand_as(x)  # finally, reweight the original input with the attention

- ECA test code

In [35]:
input = torch.randn(1, 32, 64, 64)

efficient_channel_attention = eca_layer(32)

output = efficient_channel_attention(input)

print(output.shape)  # torch.Size([1, 32, 64, 64]), i.e. the output shape equals the input shape

1 32 64 64
y.shape = torch.Size([1, 32, 1, 1])
y.squeeze =  torch.Size([1, 32, 1])
transpose = torch.Size([1, 1, 32])
atter conv1d shape = torch.Size([1, 1, 32])
after transpose shape = torch.Size([1, 32, 1])
after y shape = torch.Size([1, 32, 1, 1])
torch.Size([1, 32, 64, 64])
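
The "local cross-channel interaction" of ECA can be made concrete: with kernel size 3, changing the pooled descriptor of one channel only affects the pre-sigmoid weights of that channel and its two neighbours. The sketch below fixes the kernel to ones so the result is deterministic (an assumption for illustration; a trained kernel would have learned values):

```python
import torch
from torch import nn

channels, k = 32, 3
conv = nn.Conv1d(1, 1, kernel_size=k, padding=(k - 1) // 2, bias=False)
with torch.no_grad():
    conv.weight.fill_(1.0)            # fixed non-zero kernel so the effect is deterministic

desc = torch.zeros(1, 1, channels)    # pooled channel descriptor laid out as a length-C sequence
bumped = desc.clone()
bumped[0, 0, 10] = 1.0                # perturb channel 10 only

with torch.no_grad():
    delta = conv(bumped) - conv(desc)

# Indices of the channels whose pre-sigmoid weight changed.
affected = delta.flatten().nonzero().flatten().tolist()
print(affected)
```

Only channels 9, 10 and 11 change, whereas a fully connected layer would let the perturbation reach every channel weight.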


- Using the ECA module in a residual network

In [14]:
def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


class ECABasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, k_size=3):
        super(ECABasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes, 1)
        self.bn2 = nn.BatchNorm2d(planes)
        self.eca = eca_layer(planes, k_size)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.eca(out)

        if self.downsample is not None:
            residual = self.downsample(x)
        
        #print('out.shape: ', out.shape)
        #print('residual.shape: ', residual.shape)
        out += residual  # add the original residual to the attention-weighted result
        out = self.relu(out)

        return out

- Testing ECABasicBlock

In [16]:
input = torch.randn(1, 32, 64, 64)

eca_block = ECABasicBlock(32,32)

output = eca_block(input)
print(output.shape) #torch.Size([1, 32, 64, 64])


down_sample = conv3x3(32, 16)  # projects the shortcut from 32 to 16 channels (no spatial downsampling)
eca_block_with_down_sample = ECABasicBlock(32,16, 1, down_sample)

output = eca_block_with_down_sample(input)
print(output.shape) #torch.Size([1, 16, 64, 64])

torch.Size([1, 32, 64, 64])
torch.Size([1, 16, 64, 64])


## Defining an ECA-based 3D attention layer

In [48]:
import torch
import torch.nn as nn
class eca_layer3d(nn.Module):
    """Constructs a ECA module.
    Args:
        channel: Number of channels of the input feature map
        k_size: Adaptive selection of kernel size
    """
    def __init__(self, channel, k_size=3):
        super(eca_layer3d, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        self.conv = nn.Conv1d(channel, channel, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False) 
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: input features with shape [b, c, depth, h, w]
        b, c, depth, h, w = x.size()
        print(b, c, depth, h, w)
        # feature descriptor on the global spatial information
        y = self.avg_pool(x)  # collapse the two spatial dims, keep depth

        print('y.shape =', y.shape) #(b, c, depth, 1, 1)
        print('y.squeeze = ',y.squeeze(-1).shape)  # squeeze(-1) drops the last dim
        print('y.squeeze.squeeze shape =', y.squeeze(-1).squeeze(-1).shape)
        # Conv1d(channel, channel, k) slides along the depth axis and mixes all channels:
        # (b, c, depth) -> (b, c, depth) -> (b, c, depth, 1, 1)
        y = self.conv(y.squeeze(-1).squeeze(-1)).unsqueeze(-1).unsqueeze(-1)  # unsqueeze(-1) appends a trailing dim
        print('after y shape =', y.shape)
        # Squash to (0, 1) to obtain one weight per (channel, depth) position
        y = self.sigmoid(y)

        return x * y.expand_as(x)  # finally, reweight the original input with the attention

- Testing the ECA-based 3D attention

In [49]:
import torch
features = torch.randn(16, 64, 64,48, 32)

selayer = eca_layer3d(64)

feature_out = selayer(features)  # the layer leaves the shape unchanged: torch.Size([16, 64, 64, 48, 32])
print(feature_out.shape)

16 64 64 48 32
y.shape = torch.Size([16, 64, 64, 1, 1])
y.squeeze =  torch.Size([16, 64, 64, 1])
y.squeeze.squeeze shape = torch.Size([16, 64, 64])
after y shape = torch.Size([16, 64, 64, 1, 1])
torch.Size([16, 64, 64, 48, 32])
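
The Conv1d(channel, channel, k) in eca_layer3d slides along the depth axis and mixes all channels densely. If the goal is instead ECA-style local interaction across channels at every depth index, one possible alternative is to fold the depth axis into the batch axis and reuse the single-channel convolution from the 2D layer. This is only a sketch under that assumption, and ECAPerDepth and n_params are hypothetical names:

```python
import torch
from torch import nn


class ECAPerDepth(nn.Module):
    """Hypothetical variant: ECA across channels, computed separately for every depth index."""

    def __init__(self, k_size=3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False)

    def forward(self, x):
        b, c, d, _, _ = x.shape
        y = self.avg_pool(x).view(b, c, d)                 # (b, c, d)
        y = y.permute(0, 2, 1).reshape(b * d, 1, c)        # each depth slice becomes a length-c sequence
        y = self.conv(y)                                   # slide the kernel along the channel axis
        y = y.view(b, d, c).permute(0, 2, 1)               # back to (b, c, d)
        return x * torch.sigmoid(y).reshape(b, c, d, 1, 1)


def n_params(module):
    return sum(p.numel() for p in module.parameters())


x = torch.randn(2, 64, 10, 12, 8)
layer = ECAPerDepth()
out = layer(x)
dense_conv = nn.Conv1d(64, 64, kernel_size=3, padding=1, bias=False)
print(out.shape, n_params(layer), n_params(dense_conv))
```

The variant keeps the 3-parameter footprint of the 2D ECA layer, while the dense convolution over 64 channels has 64 * 64 * 3 = 12288 weights.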


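## 3D attention covering both the channel and depth dimensions

- First compute the attention obtained by pooling over the last two dimensions, then the attention obtained by pooling over the last three dimensions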

In [58]:
import torch
import torch.nn as nn
class eca_depth_spatial_layer3d(nn.Module):
    """Constructs a ECA module.
    Args:
        channel: Number of channels of the input feature map
        k_size: Adaptive selection of kernel size
    """
    def __init__(self, channel, k_size=3):
        super(eca_depth_spatial_layer3d, self).__init__()
        self.avg_pool_spatial = nn.AdaptiveAvgPool3d((None, 1, 1))
        self.avg_pool_depth = nn.AdaptiveAvgPool3d(1)
        self.conv_spatial = nn.Conv1d(channel, channel, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False) 
        self.conv_depth = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False) 
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: input features with shape [b, c, depth, h, w]
        b, c, depth, h, w = x.size()
        print(b, c, depth, h, w)
        # Branch 1: pool over h and w only, keeping the depth axis
        y = self.avg_pool_spatial(x)

        print('y.shape =', y.shape) #(b, c, depth, 1, 1)
        print('y.squeeze = ',y.squeeze(-1).shape)  # squeeze(-1) drops the last dim
        print('y.squeeze.squeeze shape =', y.squeeze(-1).squeeze(-1).shape)
        # Conv1d(channel, channel, k) slides along depth: (b, c, depth) -> (b, c, depth, 1, 1)
        spatial_y = self.conv_spatial(y.squeeze(-1).squeeze(-1)).unsqueeze(-1).unsqueeze(-1)
        print('after y shape =', spatial_y.shape)
        spatial_atten = self.sigmoid(spatial_y)  # one weight per (channel, depth) position

        # Branch 2: pool over depth, h and w, then an ECA-style conv along the channel axis
        depth_y = self.avg_pool_depth(x)  # (b, c, 1, 1, 1)

        # (b, c, 1) -> (b, 1, c) -> conv -> (b, c, 1) -> (b, c, 1, 1, 1)
        depth_y = self.conv_depth(depth_y.squeeze(-1).squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1).unsqueeze(-1)
        depth_atten = self.sigmoid(depth_y)  # one weight per channel

        return (x * spatial_atten.expand_as(x)) * depth_atten.expand_as(x)  # reweight the input with both attention maps

- Testing eca_depth_spatial_layer3d

In [59]:
import torch
features = torch.randn(16, 64, 64,48, 32)

selayer = eca_depth_spatial_layer3d(64)

feature_out = selayer(features)  # the layer leaves the shape unchanged: torch.Size([16, 64, 64, 48, 32])
print(feature_out.shape)

16 64 64 48 32
y.shape = torch.Size([16, 64, 64, 1, 1])
y.squeeze =  torch.Size([16, 64, 64, 1])
y.squeeze.squeeze shape = torch.Size([16, 64, 64])
after y shape = torch.Size([16, 64, 64, 1, 1])
torch.Size([16, 64, 64, 48, 32])
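
Applying the two attention maps one after the other is equivalent to multiplying the input by their broadcast product, which is a single (b, c, depth, 1, 1) map. The sketch below checks this with random stand-ins for the two maps (an assumption for illustration, not the outputs of the layer above):

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 8, 6, 5, 4)                              # (b, c, depth, h, w)
spatial_atten = torch.sigmoid(torch.randn(2, 8, 6, 1, 1))   # stand-in: one weight per (channel, depth)
depth_atten = torch.sigmoid(torch.randn(2, 8, 1, 1, 1))     # stand-in: one weight per channel

sequential = (x * spatial_atten) * depth_atten              # as in the forward pass above
joint_map = spatial_atten * depth_atten                     # broadcasts to (b, c, depth, 1, 1)
combined = x * joint_map

matches = torch.allclose(sequential, combined, atol=1e-6)
print(joint_map.shape, matches)
```

The order of the two multiplications therefore does not matter, and the per-channel map simply rescales every depth position of its channel by the same factor.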
