
question about backbone truncation #1

Closed
Senwang98 opened this issue Nov 1, 2021 · 13 comments

@Senwang98

Hi, @prakharg24
I am confused about the backbone truncation in your paper. For the MobileNetV2 architecture (image of the block table attached): you say the last two blocks are not used. Can you tell me which two blocks?
Do you mean the last two bottlenecks? (I assume you don't mean the FC layer; nobody keeps the FC layer for a detection task.)
Looking forward to your reply!

@Senwang98
Author

So, does truncation here mean random weight init instead of ImageNet pre-trained weights?

@prakharg24
Owner

Hi @Senwang98

Yes, the truncation of the last two blocks does not include the fully connected layers. It refers to the two MBConv blocks (or 'bottlenecks', as you mentioned) with 1280 and 320 channels.

In the paper, we try to motivate backbone truncation in two steps. First, we show that the last CNN blocks have no transfer learning importance. In this step there is no truncation; we only change the weight initialization of those blocks from ImageNet to random (Figure 2 in our paper). Next, once we have shown that these weights have no transfer learning importance, we claim that removing them (i.e. truncating) is a better way to make the model lightweight than reducing the width (i.e. the scaling factor).

So, truncation does not mean random weight init; it means actually removing (truncating) the blocks. It's just that truncation was motivated by these other experiments, where randomly initializing the last layers instead of using their transfer learning weights helped improve performance.
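
To make the two steps concrete, here is a rough PyTorch sketch of my own (not the repository's code), assuming torchvision's mobilenet_v2, whose last two feature blocks are exactly the 320-channel inverted residual and the final 1x1 conv that expands to 1280 channels:

import torch.nn as nn
from torchvision.models import mobilenet_v2

# ImageNet pre-trained backbone (the exact weights argument depends on the torchvision version).
model = mobilenet_v2(pretrained=True)

# Step 1 (the Figure 2 experiment): keep the architecture, but re-initialize the
# last two blocks randomly instead of using their ImageNet weights.
def reset_params(module):
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()

for block in list(model.features)[-2:]:
    block.apply(reset_params)

# Step 2 (backbone truncation): simply drop those two blocks, so the backbone
# now ends at 160 output channels (stride 32) instead of 1280. No FC layer is
# kept for detection anyway.
truncated = nn.Sequential(*list(model.features)[:-2])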

I hope that clarifies things.

@Senwang98
Author

Okay, thanks for your quick reply!
So, I just remove the last MBConv blocks (320 and 1280 channels)?
By the way, do manually designed classification networks generally have more FLOPs in the last few layers? I notice that the FLOPs of the last few layers of networks obtained by NAS are small.

@prakharg24
Owner

That was definitely a common behavior for a long time. Since ImageNet has 1000 classes, most pre-trained models increase the number of channels in the last layers to a similar order. Some recent models don't follow this to the letter, but the overall pattern of significantly increasing the channel count for the last 2-3 layers is still there. However, as I mentioned, the fact that these layers are very heavy is only one part of the issue; the other is that the last layers do not seem to contain any relevant transfer learning features.

By the way, when working on, say, object detection, using a NAS backbone with transfer learning might not be the optimal choice, since NAS architectures are usually not easy to generalize. There is also a lot of work on creating object-detection-specific backbones using NAS, which would be a better fit if someone wants to explore that direction.

@Senwang98
Author

@prakharg24
Thanks for your detailed explanation, I understand your paper better now, and I will use your idea to help improve lightweight detection.
Thanks again, good luck!

@Senwang98
Author

Senwang98 commented Nov 1, 2021

@prakharg24
This is my pytorch style of RFCR module:

from .conv import DepthwiseConvModule, ConvModule
import torch
import torch.nn as nn
import torch.nn.functional as F


class MobilenetSeparableConv2D(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(MobilenetSeparableConv2D, self).__init__()
        self.depthwiseconv = DepthwiseConvModule(in_channels=in_channels,
                                                 out_channels=in_channels,
                                                 kernel_size=5,
                                                 stride=1,
                                                 padding=2,
                                                 dilation=1,
                                                 bias="auto",
                                                 norm_cfg=dict(type="BN"),
                                                 activation="ReLU6")
        self.conv = ConvModule(in_channels=in_channels,
                               out_channels=out_channels,
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               dilation=1,
                               bias="auto",
                               norm_cfg=dict(type="BN"),
                               activation="ReLU6")

    def forward(self, x):
        x = self.depthwiseconv(x)
        x = self.conv(x)
        return x


class RFCR_module(nn.Module):
    def __init__(self, in_channel, mid_channel=48, out_channel=96):
        super(RFCR_module, self).__init__()
        self.scale = nn.ParameterList(nn.Parameter(torch.tensor(
            [1.]), requires_grad=True) for _ in range(len(in_channel)))
        self.pwconv = nn.ModuleList(nn.Conv2d(
            in_channel[i], mid_channel, kernel_size=1, stride=1, padding=0, bias=False) for i in range(len(in_channel)))
        self.MB_conv = MobilenetSeparableConv2D(mid_channel, out_channel)

    def forward(self, model_outputs):
        # Bring the three feature maps (strides 8 / 16 / 32) to the middle resolution.
        fuse_out = []
        # fuse_out.append(F.max_pool2d(F.max_pool2d(
        #     model_outputs[0], 1, stride=2), 1, stride=2))
        fuse_out.append(F.max_pool2d(model_outputs[0], 2))
        fuse_out.append(model_outputs[1])
        fuse_out.append(F.interpolate(
            model_outputs[2], scale_factor=2, mode="bilinear"))
        # Project every level to mid_channel with a pointwise conv.
        for i in range(len(fuse_out)):
            fuse_out[i] = self.pwconv[i](fuse_out[i])

        # Weighted sum of the aligned levels with learnable scalar weights.
        sum_feat = self.scale[0] * fuse_out[0] + self.scale[1] * fuse_out[1] \
            + self.scale[2] * fuse_out[2]  # + self.scale[3] * fuse_out[3]

        mb_feat = self.MB_conv(sum_feat)
        # Redistribute: resize the fused feature back to each level and concatenate.
        redist_feat = []
        redist_feat.append(torch.cat(
            [F.interpolate(mb_feat, scale_factor=2, mode="bilinear"), model_outputs[0]], dim=1))
        redist_feat.append(torch.cat([mb_feat, model_outputs[1]], dim=1))
        redist_feat.append(
            torch.cat([F.max_pool2d(mb_feat, 1, stride=2), model_outputs[2]], dim=1))
        return redist_feat

I ignore the feature map with 4x stride, so the model_outputs length is 3 instead.
I am not sure if I have reproduced your module correctly?
I added RFCR to the NanoDet repo to see if RFCR can improve detection performance!

Maybe the pointwise conv and depthwise conv are fast enough, but I found that the RFCR output adds 96 channels, which may increase GFLOPs dramatically. So, I set the pointwise conv output channels to 16 and the MBConv output channels to 32.
This way, NanoDet only increases from 0.30 GFLOPs to 0.32 GFLOPs!
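
As a sanity check, here is a hypothetical usage sketch of the RFCR_module above with these reduced channel settings. The backbone channel counts and spatial sizes are placeholders (not NanoDet's actual configuration), and it assumes the module definition above with NanoDet's DepthwiseConvModule and ConvModule importable:

import torch

# Placeholder feature maps for three backbone levels (strides 8 / 16 / 32).
feats = [
    torch.randn(1, 116, 40, 40),   # stride 8
    torch.randn(1, 232, 20, 20),   # stride 16
    torch.randn(1, 464, 10, 10),   # stride 32
]

rfcr = RFCR_module(in_channel=[116, 232, 464], mid_channel=16, out_channel=32)
for out in rfcr(feats):
    print(out.shape)  # each level comes back with its original channels + 32 fused channels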

@Senwang98 Senwang98 reopened this Nov 2, 2021
@prakharg24
Owner

Hi @Senwang98

Before anything else: while we are still working on adding more trained models to the repo and making the code easily adaptable, you can already find the model definition here.

As for your implementation, it looks correct to me as well. You should still cross-check with our TensorFlow implementation, as two heads are better than one :)

For the channel number conundrum, there are a few things I would like to point out. First, it's true that the RFCR module adds some computation; while it also improves accuracy, there is a trade-off overall. We were able to overcome this by combining RFCR with backbone truncation. Second, one of the things we emphasize in our work is that indirect metrics of comparison, like FLOPs or model size, are usually not the best measure of a model's execution requirements. So even though it might seem that RFCR significantly hurts the FLOP count, the network fragmentation introduced by RFCR is limited, and it can run well under proper parallelization. Finally, yes, the additional channels we used might not be the right choice for you; it depends on the backbone and the detection head being used. I would suggest, though, that even if you reduce the MBConv output channels to 32, you keep the pointwise conv output channels high rather than cutting them all the way down to 16, as that might hurt performance.
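
For instance, a hypothetical configuration along these lines (same placeholder backbone channels as in the earlier sketch) keeps the pointwise fusion width at the default 48 and only shrinks the MBConv output:

# Keep mid_channel (the pointwise-conv fusion width) at 48; reduce only the MBConv output.
rfcr = RFCR_module(in_channel=[116, 232, 464], mid_channel=48, out_channel=32)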

I hope this helps

@Senwang98
Author

@prakharg24
Ok, I got it! Thanks for your interesting work; I will modify the channel numbers to see the performance and inference speed.
Thanks again and I will close this issue.

@vaerdu

vaerdu commented Nov 23, 2021

(Quotes Senwang98's full comment above, "@prakharg24 This is my pytorch style of RFCR module: ...", including the RFCR module code and the GFLOPs note. In the quoted version, the first downsampling line still reads fuse_out.append(F.max_pool2d(model_outputs[0], 1, stride=2)).)

Hello, I have read the PyTorch code you wrote and would like to ask a few questions:
1. Does your code take the backbone's features at 3 scales as input?
2. For the max pooling used to downsample the feature map, F.max_pool2d(model_outputs[0], 1, stride=2), is the pooling kernel size really 1?
3. What is the groups parameter when you use the depthwise conv?
Looking forward to your reply.

@Senwang98
Author

@vaerdu
I tested it on the NanoDet detection framework, because the model is small and trains quickly.

  1. Yes, 3 scales.
  2. kernel size = 1
  3. The default NanoDet dwconv: https://github.com/RangiLyu/nanodet/blob/c931de553e0ded55e8811e51cf0b74ac3aa5e9de/nanodet/model/module/conv.py#L191

@vaerdu

vaerdu commented Nov 23, 2021

@vaerdu I tested it on the NanoDet detection framework, because the model is small and trains quickly.

  1. Yes, 3 scales.
  2. kernel size = 1
  3. The default NanoDet dwconv: https://github.com/RangiLyu/nanodet/blob/c931de553e0ded55e8811e51cf0b74ac3aa5e9de/nanodet/model/module/conv.py#L191

Sorry, I still need to ask you:
1. I don't quite understand the 1x1 pooling kernel; can it really downsample? Aren't pooling kernels usually of size 2, 5, or 7?
2. The RFCR module first applies a pointwise conv to the backbone's features at 3 scales. The backbone's 3 scales usually have channel counts like 128, 256, 512, or 1024, but after the pointwise conv they are all reduced to 48 channels. Won't this lose a lot of feature information?
Hope you can advise.

@Senwang98
Author

Senwang98 commented Nov 23, 2021

@vaerdu

  1. I probably wrote this incorrectly (my mistake); just use max_pool2d(feat, 2) directly.
  2. As I understand it, RFCR is mainly intended for small models, which generally don't reach 512 channels, and this parameter is configurable anyway.

@vaerdu

vaerdu commented Nov 24, 2021

@vaerdu

  1. I probably wrote this incorrectly (my mistake); just use max_pool2d(feat, 2) directly.
  2. As I understand it, RFCR is mainly intended for small models, which generally don't reach 512 channels, and this parameter is configurable anyway.

OK, understood.
self.scale = nn.ParameterList(nn.Parameter(torch.tensor(
    [1.]), requires_grad=True) for _ in range(len(in_channel)))
In the weighted sum, are the weight parameters obtained from the code above multiplied with the pointwise-conv feature maps and then summed?
