# Modern Convolutional Neural Networks
:label:`chap_modern_cnn`

Now that we understand the basics of wiring together CNNs, let's take
a tour of modern CNN architectures. This tour is, by
necessity, incomplete, given the steady stream of new designs
being published. These architectures matter because they can be
used directly for vision tasks and also serve as basic
feature generators for more advanced tasks such as tracking
:cite:`Zhang.Sun.Jiang.ea.2021`, segmentation :cite:`Long.Shelhamer.Darrell.2015`, object
detection :cite:`Redmon.Farhadi.2018`, or style transfer
:cite:`Gatys.Ecker.Bethge.2016`. In this chapter, most sections
correspond to a significant CNN architecture that was at some point
(or still is) the base model upon which many research projects and
deployed systems were built. Each of these networks was briefly a
dominant architecture, and many were winners or runners-up in the
[ImageNet competition](https://www.image-net.org/challenges/LSVRC/),
which has served as a barometer of progress on supervised learning in
computer vision since 2010. Only recently have Transformers begun
to displace CNNs, starting with :citet:`Dosovitskiy.Beyer.Kolesnikov.ea.2021` and
followed by the Swin Transformer :cite:`liu2021swin`. We will cover this development later
in :numref:`chap_attention-and-transformers`.

While the idea of *deep* neural networks is quite simple (stack
together a bunch of layers), performance can vary wildly across
architectures and hyperparameter choices.  The neural networks
described in this chapter are the product of intuition, a few
mathematical insights, and a lot of trial and error.  We present these
models in chronological order, partly to convey a sense of the history
so that you can form your own intuitions about where the field is
heading and perhaps develop your own architectures. For instance,
batch normalization and residual connections, both described in this chapter,
are two popular ideas for training and designing deep models
that have since been applied to architectures beyond computer
vision as well.
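
To make these two ideas concrete before the dedicated sections, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the implementation used later in this chapter. The helper names `batch_norm` and `residual` are hypothetical, a batch is represented as a list of rows of floats, and the learnable scale and shift that batch normalization normally includes are omitted.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) to zero mean and unit variance.

    Assumption: `batch` is a list of equally long rows of floats.
    The learnable scale and shift parameters are left out for brevity.
    """
    n = len(batch)
    num_features = len(batch[0])
    # Per-feature statistics computed across the batch dimension.
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps)
         for j in range(num_features)]
        for row in batch
    ]


def residual(f, batch):
    """Apply a residual connection: return f(x) + x elementwise."""
    out = f(batch)
    return [
        [o + x for o, x in zip(out_row, in_row)]
        for out_row, in_row in zip(out, batch)
    ]


# Two examples with two features on very different scales.
batch = [[1.0, 10.0], [3.0, 30.0]]
print(batch_norm(batch))            # both columns now share a common scale
print(residual(batch_norm, batch))  # the input is added back to the output
```

The residual wrapper makes it easy for a block to fall back to the identity function: if `f` outputs zeros, `residual(f, batch)` returns the input unchanged.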

Our tour of modern CNNs begins with AlexNet :cite:`Krizhevsky.Sutskever.Hinton.2012`,
the first large-scale network deployed to beat conventional computer
vision methods on a large-scale vision challenge. It continues with the VGG network
:cite:`Simonyan.Zisserman.2014`, which makes use of a number of
repeating blocks of elements; the network in network (NiN), which
convolves whole neural networks patch-wise over inputs
:cite:`Lin.Chen.Yan.2013`; GoogLeNet, which uses networks with
multi-branch convolutions :cite:`Szegedy.Liu.Jia.ea.2015`; the residual
network (ResNet) :cite:`He.Zhang.Ren.ea.2016`, which remains one of
the most popular off-the-shelf architectures in computer vision;
ResNeXt blocks :cite:`Xie.Girshick.Dollar.ea.2017`
for sparser connections;
and DenseNet
:cite:`Huang.Liu.Van-Der-Maaten.ea.2017` for a generalization of the
residual architecture. Over time, many special optimizations for efficient
networks have been developed, such as coordinate shifts (ShiftNet) :cite:`wu2018shift`. This
culminated in the automatic search for efficient architectures such as
MobileNet v3 :cite:`Howard.Sandler.Chu.ea.2019`. A related line of work is the
semi-automatic design exploration of :citet:`Radosavovic.Kosaraju.Girshick.ea.2020`,
which led to the RegNetX/Y networks that we will discuss later in this chapter.
This work is instructive insofar as it offers a path for marrying brute-force computation with
the ingenuity of an experimenter in the search for efficient design spaces. Also of note
is the work of :citet:`liu2022convnet`, which shows that training techniques (e.g., optimizers, data augmentation, and regularization)
play a pivotal role in improving accuracy. It also shows that long-held assumptions, such as
the size of a convolution window, may need to be revisited given the increase in
computation and data. We will cover this and many more questions in due course throughout this chapter.
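
One reason the convolution window size was long kept small is cost: for a standard (dense) 2D convolution, the number of weights grows with the square of the window size. The sketch below counts parameters for a few window sizes. The function name `conv2d_params` and the choice of 64 input and output channels are assumptions made purely for illustration, and the formula does not cover depthwise or grouped convolutions.

```python
def conv2d_params(kernel_size, in_channels, out_channels, bias=True):
    """Count learnable parameters of a standard 2D convolution layer.

    Assumes a square kernel_size x kernel_size window and a dense
    connection between all input and output channels.
    """
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Compare window sizes at a fixed, arbitrarily chosen channel count.
for k in (3, 5, 7):
    print(f"{k}x{k} window: {conv2d_params(k, 64, 64)} parameters")
```

At the same channel count, moving from a 3x3 to a 7x7 window multiplies the weight count by 49/9, which is one way to see why larger windows only became attractive as computation and data grew.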

:begin_tab:toc
 - [alexnet](alexnet.ipynb)
 - [vgg](vgg.ipynb)
 - [nin](nin.ipynb)
 - [googlenet](googlenet.ipynb)
 - [batch-norm](batch-norm.ipynb)
 - [resnet](resnet.ipynb)
 - [densenet](densenet.ipynb)
 - [cnn-design](cnn-design.ipynb)
:end_tab:

