Tags: ``ConvNets``, ``Decreased GPU Throughput``, ``Increased Accuracy``, ``Method``, ``Capacity``
Adds a channel-wise attention operator to CNNs. Attention coefficients are produced by a small, trainable MLP that takes the channels' globally pooled activations as input.
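The mechanism can be sketched in a few lines of PyTorch. This is a minimal illustration of the squeeze (global pooling) and excitation (small MLP) steps, not Composer's actual implementation; the class and variable names here are our own:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise attention: squeeze (global pool), then excite (small MLP)."""

    def __init__(self, num_channels: int, latent_channels: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_channels, latent_channels),
            nn.ReLU(),
            nn.Linear(latent_channels, num_channels),
            nn.Sigmoid(),  # attention coefficients in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: global average pool over the spatial dimensions -> (N, C)
        pooled = x.mean(dim=(2, 3))
        # Excite: per-channel attention coefficients from the MLP -> (N, C)
        attn = self.mlp(pooled)
        # Rescale each channel of the original activation tensor.
        return x * attn[:, :, None, None]

x = torch.randn(8, 64, 32, 32)
out = SEBlock(num_channels=64, latent_channels=16)(x)
```

Because the coefficients pass through a sigmoid, each channel is scaled by a factor in (0, 1), so the block can only attenuate channels, never amplify them.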
*Squeeze-and-Excitation Networks* by Jie Hu, Li Shen, and Gang Sun (2018).
- ``latent_channels`` - Number of channels to use in the hidden layer of the MLP that computes channel attention coefficients.
- ``min_channels`` - The minimum number of output channels in a ``Conv2d`` for an SE module to be added afterward.
Applicable to convolutional neural networks. Currently only implemented for CNNs with 2d inputs (e.g., images).
0.5-1.5% accuracy gain, for roughly a 25% slowdown of the model. E.g., we've seen accuracy improve from 76.1% to 77.2% on ImageNet with ResNet-50, in exchange for a training throughput decrease from 4500 samples/sec to 3500 samples/sec on eight RTX 3080 GPUs.
Squeeze-Excitation blocks apply channel-wise attention to an activation tensor.
In order to be architecture-agnostic, our implementation applies the SE attention mechanism after individual conv2d modules, rather than at particular points in particular networks. This results in more SE modules being present than in the original paper.
Our implementation also allows applying the SE module after only certain conv2d modules, based on their channel count (see hyperparameter discussion).
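This surgery can be sketched as follows. The helper below is our own minimal illustration of appending an SE step after each qualifying ``Conv2d``; in practice, Composer's ``SqueezeExciteConv2d`` and ``apply_se`` handle this:

```python
import torch
import torch.nn as nn

def se_block(num_channels: int, latent_channels: int) -> nn.Module:
    # Minimal squeeze-excite: global pool -> MLP -> per-channel rescale.
    class SE(nn.Module):
        def __init__(self):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(num_channels, latent_channels), nn.ReLU(),
                nn.Linear(latent_channels, num_channels), nn.Sigmoid())

        def forward(self, x):
            return x * self.mlp(x.mean(dim=(2, 3)))[:, :, None, None]
    return SE()

def add_se_after_convs(model: nn.Module, latent_channels: int,
                       min_channels: int) -> int:
    """Wrap each qualifying Conv2d as Sequential(conv, SE); return the count."""
    # Collect targets first to avoid mutating the module tree while iterating.
    targets = []
    for parent in model.modules():
        for name, child in parent.named_children():
            if isinstance(child, nn.Conv2d) and child.out_channels >= min_channels:
                targets.append((parent, name, child))
    for parent, name, conv in targets:
        setattr(parent, name,
                nn.Sequential(conv, se_block(conv.out_channels, latent_channels)))
    return len(targets)

cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),    # 32 < min_channels: skipped
    nn.Conv2d(32, 128, 3, padding=1), nn.ReLU(),  # 128 >= min_channels: SE added
)
n = add_se_after_convs(cnn, latent_channels=64, min_channels=128)
```

Note how ``min_channels`` filters out the early, high-resolution convolution while the later, wider one gets an SE module.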
- ``latent_channels`` - 64 yielded the best speed-accuracy tradeoffs in our ResNet experiments. The original paper expressed this as a "reduction ratio" :math:`r` that makes the MLP latent channel count a fraction of the SE block's input channel count. We also support specifying ``latent_channels`` as a fraction of the input channel count, although we've found that it tends to yield a worse speed vs. accuracy tradeoff.
- ``min_channels`` - For typical CNNs that have lower channel counts at higher resolutions, this can be used to control where in the network to start applying SE blocks. Ops with higher channel counts take longer to compute relative to the time taken by the SE block. An appropriate value is architecture-dependent, but we weakly suggest setting this to 128 if the architecture in question has modules with at least this many channels.
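The relationship between an absolute ``latent_channels`` and the reduction-ratio interpretation is simple arithmetic. The helper below is hypothetical, written only to illustrate the two interpretations; Composer's actual resolution logic may differ:

```python
def resolve_latent_channels(latent_channels, in_channels: int) -> int:
    """Interpret latent_channels as an absolute count (int) or, if a float
    in (0, 1), as a fraction of the SE block's input channel count
    (i.e., a reduction ratio r corresponds to the fraction 1/r)."""
    if isinstance(latent_channels, float) and 0.0 < latent_channels < 1.0:
        return max(1, int(latent_channels * in_channels))
    return int(latent_channels)

resolve_latent_channels(64, 256)    # absolute: always 64 hidden channels
resolve_latent_channels(0.25, 256)  # fractional: r = 4, so 256 / 4 = 64
```

With the absolute form, every SE module gets the same MLP width regardless of where it sits in the network; with the fractional form, wider layers get proportionally wider MLPs, which is where the extra cost (and the worse tradeoff we observed) comes from.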
This method tends to improve the accuracy of CNNs consistently, both in absolute terms and when controlling for training and inference time. This may come at the cost of a roughly 20% increase in inference latency, depending on the architecture and inference hardware.
Because SE modules slow down the model, they compose well with methods that make the data loader slower (e.g., RandAugment) or that speed up each training step (e.g., Selective Backprop). In the former case, the slower model allows more time for the data loader to run. In the latter case, the initial slowdown allows techniques that accelerate the forward and backward passes to have a greater effect before they become limited by the data loader's speed.
.. autoclass:: composer.algorithms.squeeze_excite.SqueezeExcite
:members: match, apply
:noindex:
.. autoclass:: composer.algorithms.squeeze_excite.SqueezeExciteHparams
:noindex:
.. autoclass:: composer.algorithms.squeeze_excite.SqueezeExcite2d
:noindex:
.. autoclass:: composer.algorithms.squeeze_excite.SqueezeExciteConv2d
:noindex:
.. autofunction:: composer.algorithms.squeeze_excite.apply_se
:noindex: