# Generalization in Deep Learning


In :numref:`chap_regression` and :numref:`chap_classification`,
we tackled regression and classification problems
by fitting linear models to training data.
In both cases, we provided practical algorithms
for finding the parameters that maximized
the likelihood of the observed training labels.
And then, towards the end of each chapter,
we recalled that fitting the training data
was only an intermediate goal.
Our real quest all along was to discover *general patterns*
on the basis of which we can make accurate predictions
even on new examples drawn from the same underlying population.
Machine learning researchers are *consumers* of optimization algorithms.
Sometimes, we must even develop new optimization algorithms.
But at the end of the day, optimization is merely a means to an end.
At its core, machine learning is a statistical discipline
and we wish to optimize training loss only insofar
as some statistical principle (known or unknown)
leads the resulting models to generalize beyond the training set.


On the bright side, it turns out that deep neural networks
trained by stochastic gradient descent generalize remarkably well
across myriad prediction problems, spanning computer vision;
natural language processing; time series data; recommender systems;
electronic health records; protein folding;
value function approximation in video games
and board games; and numerous other domains.
On the downside, if you were looking
for a straightforward account
of either the optimization story
(why we can fit them to training data)
or the generalization story
(why the resulting models generalize to unseen examples),
then you might want to pour yourself a drink.
While our procedures for optimizing linear models
and the statistical properties of the solutions
are both described well by a comprehensive body of theory,
our understanding of deep learning
still resembles the wild west on both fronts.

Both the theory and practice of deep learning
are rapidly evolving,
with theorists adopting new strategies
to explain what's going on,
even as practitioners continue
to innovate at a blistering pace,
building arsenals of heuristics for training deep networks
and a body of intuitions and folk knowledge
that provide guidance for deciding
which techniques to apply in which situations.

The summary of the present moment is that the theory of deep learning
has produced promising lines of attack and scattered fascinating results,
but still appears far from a comprehensive account
of both (i) why we are able to optimize neural networks
and (ii) how models learned by gradient descent
manage to generalize so well, even on high-dimensional tasks.
However, in practice, (i) is seldom a problem
(we can always find parameters that will fit all of our training data)
and thus understanding generalization is by far the bigger problem.
On the other hand, even absent the comfort of a coherent scientific theory,
practitioners have developed a large collection of techniques
that may help you to produce models that generalize well in practice.
While no pithy summary can possibly do justice
to the vast topic of generalization in deep learning,
and while the overall state of research is far from resolved,
we hope, in this section, to present a broad overview
of the state of research and practice.


## Revisiting Overfitting and Regularization

According to the "no free lunch" theorem of :citet:`wolpert1995no`,
any learning algorithm generalizes better on data with certain distributions, and worse with other distributions.
Thus, given a finite training set,
a model relies on certain assumptions: 
to achieve human-level performance
it may be useful to identify *inductive biases* 
that reflect how humans think about the world.
Such inductive biases show preferences 
for solutions with certain properties.
For example,
a deep MLP has an inductive bias
towards building up a complicated function by the composition of simpler functions.

With machine learning models encoding inductive biases,
our approach to training them
typically consists of two phases: (i) fit the training data;
and (ii) estimate the *generalization error*
(the true error on the underlying population)
by evaluating the model on holdout data.
The difference between our fit on the training data
and our fit on the test data is called the *generalization gap*,
and when this gap is large,
we say that our models *overfit* to the training data.
In extreme cases of overfitting,
we might exactly fit the training data,
even when the test error remains significant.
And in the classical view,
the interpretation is that our models are too complex,
requiring that we shrink the number of features,
the number of nonzero parameters learned,
or the size of the parameters (as quantified, e.g., by a norm).
Recall the plot of model complexity compared with loss
(:numref:`fig_capacity_vs_error`)
from :numref:`sec_generalization_basics`.


However, deep learning complicates this picture in counterintuitive ways.
First, for classification problems,
our models are typically expressive enough
to perfectly fit every training example,
even in datasets consisting of millions of examples
:cite:`zhang2021understanding`.
In the classical picture, we might think
that this setting lies on the far right extreme
of the model complexity axis,
and that any improvements in generalization error
must come by way of regularization,
either by reducing the complexity of the model class,
or by applying a penalty, severely constraining
the set of values that our parameters might take.
But that is where things start to get weird.

Strangely, for many deep learning tasks
(e.g., image recognition and text classification)
we are typically choosing among model architectures,
all of which can achieve arbitrarily low training loss
(and zero training error).
Because all models under consideration achieve zero training error,
*the only avenue for further gains is to reduce overfitting*.
Even stranger, it is often the case that
despite fitting the training data perfectly,
we can actually *reduce the generalization error*
further by making the model *even more expressive*,
e.g., adding layers, nodes, or training
for a larger number of epochs.
Stranger yet, the pattern relating the generalization gap
to the *complexity* of the model (as captured, for example, in the depth or width of the networks)
can be non-monotonic,
with greater complexity hurting at first
but subsequently helping in a so-called "double-descent" pattern
:cite:`nakkiran2021deep`.
Thus the deep learning practitioner possesses a bag of tricks,
some of which seemingly restrict the model in some fashion
and others that seemingly make it even more expressive,
and all of which, in some sense, are applied to mitigate overfitting.

Complicating things even further,
while the guarantees provided by classical learning theory
can be conservative even for classical models,
they appear powerless to explain why it is
that deep neural networks generalize in the first place.
Because deep neural networks are capable of fitting
arbitrary labels even for large datasets,
even when familiar methods such as $\ell_2$ regularization are applied,
traditional complexity-based generalization bounds,
e.g., those based on the VC dimension
or Rademacher complexity of a hypothesis class,
cannot explain why neural networks generalize.

## Inspiration from Nonparametrics

Approaching deep learning for the first time,
it is tempting to think of neural networks as parametric models.
After all, the models *do* have millions of parameters.
When we update the models, we update their parameters.
When we save the models, we write their parameters to disk.
However, mathematics and computer science are riddled
with counterintuitive changes of perspective,
and surprising isomorphisms between seemingly different problems.
While neural networks clearly *have* parameters,
in some ways it can be more fruitful
to think of them as behaving like nonparametric models.
So what precisely makes a model nonparametric?
While the name covers a diverse set of approaches,
one common theme is that nonparametric methods
tend to have a level of complexity that grows
as the amount of available data grows.

Perhaps the simplest example of a nonparametric model
is the $k$-nearest neighbor algorithm (we will cover more nonparametric models later, for example in :numref:`sec_attention-pooling`).
Here, at training time,
the learner simply memorizes the dataset.
Then, at prediction time,
when confronted with a new point $\mathbf{x}$,
the learner looks up the $k$ nearest neighbors
(the $k$ points $\mathbf{x}_i'$ that minimize
some distance $d(\mathbf{x}, \mathbf{x}_i')$).
When $k=1$, this algorithm is called $1$-nearest neighbors,
and the algorithm will always achieve a training error of zero.
That, however, does not mean that the algorithm will not generalize.
In fact, it turns out that under some mild conditions,
the 1-nearest neighbor algorithm is consistent
(eventually converging to the optimal predictor).
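To make the zero-training-error property concrete, here is a minimal NumPy sketch of $1$-nearest neighbor. The function name `one_nn_predict`, the squared Euclidean distance, and the synthetic data are all illustrative assumptions, not part of the original text. The labels are deliberately random, so there is no pattern to learn, yet the training error is still zero because each training point is its own nearest neighbor.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    """Copy the label of the closest training point (squared Euclidean distance)."""
    # Pairwise differences, shape (num_query, num_train, num_features)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    nearest = dists.argmin(axis=1)  # index of the closest training point
    return y_train[nearest]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))     # synthetic inputs (assumption)
y_train = rng.integers(0, 2, size=200)  # random labels: nothing to learn

# Predicting on the training set itself: every point matches itself
train_error = (one_nn_predict(X_train, y_train, X_train) != y_train).mean()
print(train_error)
```

Zero training error here says nothing about test error: on fresh points with random labels, this predictor would do no better than chance.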


Note that $1$-nearest neighbor requires that we specify
some distance function $d$, or equivalently,
that we specify some vector-valued basis function $\phi(\mathbf{x})$
for featurizing our data.
For any choice of the distance metric,
we will achieve zero training error
and eventually reach an optimal predictor,
but different distance metrics $d$
encode different inductive biases
and with a finite amount of available data
will yield different predictors.
Different choices of the distance metric $d$
represent different assumptions about the underlying patterns
and the performance of the different predictors
will depend on how compatible the assumptions
are with the observed data.
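As a small illustration of how the choice of $d$ changes the predictor, the following sketch compares $1$-nearest neighbor under the Euclidean ($\ell_2$) and Manhattan ($\ell_1$) distances. The two training points, the query, and the helper `one_nn` are made up for this example.

```python
import numpy as np

def one_nn(X_train, y_train, x, p):
    """1-nearest neighbor prediction for a single query x under the l_p distance."""
    dists = np.linalg.norm(X_train - x, ord=p, axis=1)
    return y_train[dists.argmin()]

# Two hypothetical training points with different labels
X_train = np.array([[2.0, 2.0], [0.0, 3.0]])
y_train = np.array([0, 1])
x = np.array([0.0, 0.0])  # query point

pred_l2 = one_nn(X_train, y_train, x, p=2)  # distances: sqrt(8) vs 3 -> label 0
pred_l1 = one_nn(X_train, y_train, x, p=1)  # distances: 4 vs 3       -> label 1
print(pred_l2, pred_l1)
```

Both predictors fit the two training points perfectly, but they disagree on the query, which is exactly the sense in which the metric encodes an inductive bias.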

In a sense, because neural networks are over-parametrized,
possessing many more parameters than are needed to fit the training data,
they tend to *interpolate* the training data (fitting it perfectly)
and thus behave, in some ways, more like nonparametric models.
More recent theoretical research has established
deep connections between large neural networks
and nonparametric methods, notably kernel methods.
In particular, :citet:`Jacot.Grabriel.Hongler.2018`
demonstrated that in the limit, as multilayer perceptrons
with randomly initialized weights grow infinitely wide,
they become equivalent to (nonparametric) kernel methods
for a specific choice of the kernel function
(essentially, a distance function),
which they call the neural tangent kernel.
While current neural tangent kernel models may not fully explain
the behavior of modern deep networks,
their success as an analytical tool
underscores the usefulness of nonparametric modeling
for understanding the behavior of over-parametrized deep networks.
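The interpolation behavior of over-parametrized models can be seen even in a toy setting. The sketch below is an assumption-laden stand-in: it uses a *linear* model with many more parameters than examples rather than a neural network, and fits purely random labels via the minimum-norm least-squares solution returned by `np.linalg.lstsq`. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # far more parameters (d) than examples (n)
X = rng.normal(size=(n, d))          # random features (assumption)
y = rng.choice([-1.0, 1.0], size=n)  # random labels: no structure to learn

# For an underdetermined system, lstsq returns the minimum-norm solution
w, *_ = np.linalg.lstsq(X, y, rcond=None)

max_residual = np.abs(X @ w - y).max()      # how far we are from exact fit
train_error = (np.sign(X @ w) != y).mean()  # classification error on training set
print(max_residual, train_error)
```

The model fits the random labels essentially exactly, which shows why zero training error on its own carries no information about generalization.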


## Early Stopping

While deep neural networks are capable of fitting arbitrary labels,
even when labels are assigned incorrectly or randomly
:cite:`zhang2021understanding`,
this capability only emerges over many iterations of training.
A new line of work :cite:`Rolnick.Veit.Belongie.Shavit.2017`
has revealed that in the setting of label noise,
neural networks tend to fit cleanly labeled data first
and only subsequently to interpolate the mislabeled data.
Moreover, it has been established that this phenomenon
translates directly into a guarantee on generalization:
whenever a model has fitted the cleanly labeled data
but not randomly labeled examples included in the training set,
it has in fact generalized :cite:`Garg.Balakrishnan.Kolter.Lipton.2021`.

Together these findings help to motivate *early stopping*,
a classic technique for regularizing deep neural networks.
Here, rather than directly constraining the values of the weights,
one constrains the number of epochs of training.
The most common way to determine the stopping criterion
is to monitor validation error throughout training
(typically by checking once after each epoch)
and to cut off training when the validation error
has not decreased by more than some small amount $\epsilon$
for some number of epochs.
This is sometimes called a *patience criterion*.
As well as the potential to lead to better generalization
in the setting of noisy labels,
another benefit of early stopping is the time saved.
Once the patience criterion is met, one can terminate training.
For large models that might require days of training
simultaneously across eight or more GPUs,
well-tuned early stopping can save researchers days of time
and can save their employers many thousands of dollars.
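The patience criterion described above can be sketched in a few lines of plain Python. The function name `early_stopping_epoch`, the default values of `patience` and `epsilon`, and the list of validation errors are hypothetical, chosen only for illustration. In a real training loop the validation error would be computed after each epoch rather than read from a list.

```python
def early_stopping_epoch(val_errors, patience=3, epsilon=1e-3):
    """Return the 0-based epoch at which training stops, or None if it never does.

    Training stops once the validation error has failed to improve on the
    best value seen so far by more than `epsilon` for `patience` epochs in a row.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, err in enumerate(val_errors):
        if err < best - epsilon:
            # Meaningful improvement: record it and reset the counter
            best = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation errors: improving at first, then drifting upward
val_errors = [0.50, 0.40, 0.35, 0.34, 0.345, 0.35, 0.36, 0.40]
stop_epoch = early_stopping_epoch(val_errors, patience=3, epsilon=1e-3)
print(stop_epoch)  # 6: the best error (0.34) occurred at epoch 3
```

In practice one would typically also keep a checkpoint of the parameters from the best epoch, since the model at the stopping epoch is by construction no better than the one saved `patience` epochs earlier.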

Notably, when there is no label noise and datasets are *realizable*
(the classes are truly separable, e.g., distinguishing cats from dogs),
early stopping tends not to lead to significant improvements in generalization.
On the other hand, when there is label noise,
or intrinsic variability in the label
(e.g., predicting mortality among patients),
early stopping is crucial.
Training models until they interpolate noisy data is typically a bad idea.


## Classical Regularization Methods for Deep Networks

In :numref:`chap_regression`, we described
several classical regularization techniques
for constraining the complexity of our models.
In particular, :numref:`sec_weight_decay`
introduced a method called weight decay,
which consists of adding a regularization term to the loss function
in order to penalize large values of the weights.
Depending on which weight norm is penalized,
this technique is known either as ridge regularization (for an $\ell_2$ penalty)
or lasso regularization (for an $\ell_1$ penalty).
In the classical analysis of these regularizers,
they are considered sufficiently restrictive on the values
that the weights can take to prevent the model from fitting arbitrary labels.
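For reference, the two penalized objectives can be written as follows. The notation is an assumption made here for illustration: $L(\mathbf{w})$ denotes the unregularized training loss and $\lambda \geq 0$ the regularization strength.

$$
\underbrace{L(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2}_{\textrm{ridge}}
\qquad \textrm{and} \qquad
\underbrace{L(\mathbf{w}) + \lambda \|\mathbf{w}\|_1}_{\textrm{lasso}}.
$$

Setting $\lambda = 0$ recovers the original loss, while larger values of $\lambda$ penalize large weights more heavily.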

In deep learning implementations,
weight decay remains a popular tool.
However, researchers have noted
that typical strengths of $\ell_2$ regularization
are insufficient to prevent the networks
from interpolating the data :cite:`zhang2021understanding`,
and thus the benefits, if interpreted as regularization,
might only make sense
in combination with the early stopping criterion.
Absent early stopping, it is possible
that just like the number of layers
or number of nodes (in deep learning)
or the distance metric (in 1-nearest neighbor),
these methods may lead to better generalization
not because they meaningfully constrain
the power of the neural network
but rather because they somehow encode inductive biases
that are more compatible with the patterns
found in datasets of interest.
Thus, classical regularizers remain popular
in deep learning implementations,
even if the theoretical rationale
for their efficacy may be radically different.

Notably, deep learning researchers have also built
on techniques first popularized
in classical regularization contexts,
such as adding noise to model inputs.
In the next section we will introduce
the famous dropout technique
(invented by :citet:`Srivastava.Hinton.Krizhevsky.ea.2014`),
which has become a mainstay of deep learning,
even as the theoretical basis for its efficacy
remains similarly mysterious.


## Summary

Unlike classical linear models,
which tend to have fewer parameters than examples,
deep networks tend to be over-parametrized,
and for most tasks are capable
of perfectly fitting the training set.
This *interpolation regime* challenges
many firmly held intuitions.
Functionally, neural networks look like parametric models.
But thinking of them as nonparametric models
can sometimes be a more reliable source of intuition.
Because it is often the case that all deep networks under consideration
are capable of fitting all of the training labels,
nearly all gains must come by mitigating overfitting
(closing the *generalization gap*).
Paradoxically, the interventions
that reduce the generalization gap
sometimes appear to increase model complexity
and at other times appear to decrease complexity.
However, these methods seldom decrease complexity
sufficiently for classical theory
to explain the generalization of deep networks,
and *why certain choices lead to improved generalization*
remains for the most part a massive open question
despite the concerted efforts of many brilliant researchers.


## Exercises

1. In what sense do traditional complexity-based measures fail to account for generalization of deep neural networks?
1. Why might *early stopping* be considered a regularization technique?
1. How do researchers typically determine the stopping criterion?
1. What important factor seems to differentiate cases when early stopping leads to big improvements in generalization?
1. Beyond generalization, describe another benefit of early stopping.

[Discussions](https://discuss.d2l.ai/t/7473)

