question about backbone truncation #1
Hi @prakharg24, I am confused about the backbone truncation in your paper. For the MobileNetV2 architecture, you say the last two blocks are not used. Can you tell me which two blocks you mean? Do you mean the last two bottlenecks? (I assume you don't mean the FC layer, since nobody keeps the FC layer for a detection task.)
Looking forward to your reply!
Hi @Senwang98. Yes, the truncation of the last two blocks does not include the fully connected layers. It refers to the two MBConv blocks (or 'bottlenecks', as you call them) with 1280 and 320 channels. In the paper, we motivate backbone truncation in two steps. First, we show that the last CNN blocks have no transfer-learning importance; in this step there is no truncation, we only change the weight initialization of those blocks from ImageNet-pretrained to random (Figure 2 in our paper). Next, once we have shown that these weights have no transfer-learning importance, we argue that removing (truncating) them is a better way to make the model lightweight than reducing the width (i.e. the scaling factor). I hope that clarifies things.
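Concretely, a minimal sketch of that truncation on torchvision's MobileNetV2 (just an illustration of the idea, not the code from this repository) could look like this:

```python
# Minimal sketch (illustrative, not this repo's code): drop the last two blocks
# of torchvision's MobileNetV2, i.e. the 320-channel inverted residual block and
# the final 1x1 conv that expands to 1280 channels.
import torch
import torchvision

# newer torchvision versions use weights="IMAGENET1K_V1" instead of pretrained=True
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
truncated = torch.nn.Sequential(*list(backbone.children())[:-2])

x = torch.randn(1, 3, 224, 224)
print(backbone(x).shape)   # torch.Size([1, 1280, 7, 7])
print(truncated(x).shape)  # torch.Size([1, 160, 7, 7]) -- stride-32 map, much lighter
```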
Okay, thanks for your quick reply!
That was definitely common behavior for a long time. Since ImageNet has 1000 classes, most pre-trained models increase the number of channels in the last layers to a similar order of magnitude. Some recent models don't follow this to the letter, but the overall pattern of significantly increasing the channel count in the last 2-3 layers is still there. However, as I mentioned, the fact that these layers are very heavy is only one part of the issue; the other is that the last layers do not seem to contain any relevant transfer-learning features. By the way, when working on, say, object detection, using a NAS backbone via transfer learning might not be the optimal choice, since NAS architectures usually do not generalize easily. There is also a lot of work on creating object-detection-specific backbones with NAS, which would be a better fit if someone wants to explore in that direction.
@prakharg24 |
@prakharg24
I ignore the feature map with 4x stride, so the model_output length is 3 instead. Maybe the pointwise conv and depthwise conv are fast enough, but I found that the RFCR output adds 96 channels, which may increase the GFLOPs dramatically. So, I set the pointwise conv output channels to 16 and the MBConv output channels to 32.
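For reference, here is a rough sketch of the kind of setup described above. This is my own reconstruction from the description in this thread only (not the actual RFCR module), and the backbone channel counts (32/96/160 at strides 8/16/32 for MobileNetV2) are assumptions:

```python
# Rough sketch, reconstructed only from this thread (not the actual RFCR code):
# compress three backbone feature maps with pointwise convs, fuse them at the
# stride-8 resolution, and produce a small number of extra channels with a
# depthwise-separable (MBConv-like) block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFusion(nn.Module):
    def __init__(self, in_channels=(32, 96, 160), mid_channels=16, out_channels=32):
        super().__init__()
        # 1x1 (pointwise) conv per level; mid_channels=16 as mentioned above
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, mid_channels, kernel_size=1, bias=False) for c in in_channels
        )
        # depthwise 3x3 + pointwise 1x1, producing out_channels=32 extra channels
        self.depthwise = nn.Conv2d(mid_channels, mid_channels, 3, padding=1,
                                   groups=mid_channels, bias=False)
        self.pointwise = nn.Conv2d(mid_channels, out_channels, 1, bias=False)

    def forward(self, feats):
        # resize every level to the largest (stride-8) map and sum
        target = feats[0].shape[-2:]
        fused = sum(F.interpolate(r(f), size=target, mode="bilinear",
                                  align_corners=False)
                    for r, f in zip(self.reduce, feats))
        return self.pointwise(self.depthwise(fused))

# dummy stride-8/16/32 features for a 512x512 input
feats = [torch.randn(1, 32, 64, 64),
         torch.randn(1, 96, 32, 32),
         torch.randn(1, 160, 16, 16)]
print(SimpleFusion()(feats).shape)  # torch.Size([1, 32, 64, 64])
```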
Hi @Senwang98. Before anything else: while we are still working on adding more trained models to the repo and making our code easier to adapt, you can already find the model definition here. As for your implementation, that also looks correct to me. You should still cross-check with our TensorFlow implementation, as two heads are better than one :)

For the channel-number conundrum, there are a few things I would like to point out. First, it's true that the RFCR module adds some computation; it also improves accuracy, so overall there is a trade-off. We were able to overcome this by combining RFCR with backbone truncation. Second, one of the points we emphasize in our work is that indirect metrics of comparison, like FLOPs or model size, are usually not the best measure of a model's execution requirements. So even though RFCR might seem to hurt the FLOP count significantly, the network fragmentation it introduces is limited, and it runs well under proper parallelization. Finally, yes, the additional channels we used might not be the right choice for you; it depends on the backbone and the detection head being used. I would suggest, though, that even if you reduce the MBConv output channels to 32, you keep the pointwise conv output channels high rather than cutting them all the way down to 16, as that might hurt performance. I hope this helps.
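As a side note on the indirect-metrics point: rather than relying on a FLOP count, it is usually more informative to time the added module directly, since parallel-friendly ops such as pointwise convs often cost less wall-clock time than their FLOP count suggests. A quick, purely illustrative check:

```python
# Illustrative only: compare a FLOP estimate for a pointwise conv with its
# measured wall-clock time on the current device (CPU here).
import time
import torch
import torch.nn as nn

conv = nn.Conv2d(160, 96, kernel_size=1, bias=False).eval()
x = torch.randn(1, 160, 64, 64)

# 1x1 conv: roughly 2 * Cin * Cout * H * W floating-point operations
flops = 2 * 160 * 96 * 64 * 64

with torch.no_grad():
    for _ in range(10):              # warm-up
        conv(x)
    start = time.perf_counter()
    for _ in range(100):
        conv(x)
    ms = (time.perf_counter() - start) / 100 * 1e3

print(f"~{flops / 1e6:.0f} MFLOPs, {ms:.2f} ms per forward pass")
```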
@prakharg24 |
Hello, I have read your PyTorch code and would like to ask you a few questions.
@vaerdu
Sorry to bother you again, but I still have some questions for you.
|
Okay, understood.