# Recurrent Neural Networks
:label:`chap_rnn`

Up until now, we have focused primarily on fixed-length data.
When introducing linear and logistic regression
in :numref:`chap_regression` and :numref:`chap_classification`
and multilayer perceptrons in :numref:`chap_perceptrons`,
we were happy to assume that each feature vector $\mathbf{x}_i$
consisted of a fixed number of components $x_1, \dots, x_d$,
where each numerical feature $x_j$
corresponded to a particular attribute.
These datasets are sometimes called *tabular*,
because they can be arranged in tables,
where each example $i$ gets its own row,
and each attribute gets its own column.
Crucially, with tabular data, we seldom
assume any particular structure over the columns.

Subsequently, in :numref:`chap_cnn`,
we moved on to image data, where inputs consist
of the raw pixel values at each coordinate in an image.
Image data hardly fit the bill
of a prototypical tabular dataset.
There, we needed to call upon convolutional neural networks (CNNs)
to handle the hierarchical structure and invariances.
However, our data were still of fixed length.
Every Fashion-MNIST image is represented
as a $28 \times 28$ grid of pixel values.
Moreover, our goal was to develop a model
that looked at just one image and then
produced a single prediction.
But what should we do when faced with a
sequence of images, as in a video,
or when tasked with producing
a sequentially structured prediction,
as in the case of image captioning?

A great many learning tasks require dealing with sequential data.
Image captioning, speech synthesis, and music generation
all require that models produce outputs consisting of sequences.
In other domains, such as time series prediction,
video analysis, and music information retrieval,
a model must learn from inputs that are sequences.
These demands often arise simultaneously:
tasks such as translating passages of text
from one natural language to another,
engaging in dialogue, or controlling a robot
demand that models both ingest and output
sequentially structured data.


Recurrent neural networks (RNNs) are deep learning models
that capture the dynamics of sequences via
*recurrent* connections, which can be thought of
as cycles in the network of nodes.
This might seem counterintuitive at first.
After all, it is the feedforward nature of neural networks
that makes the order of computation unambiguous.
However, recurrent edges are defined in a precise way
that ensures that no such ambiguity can arise.
Recurrent neural networks are *unrolled* across time steps (or sequence steps),
with the *same* underlying parameters applied at each step.
While the standard connections are applied *synchronously*
to propagate each layer's activations
to the subsequent layer *at the same time step*,
the recurrent connections are *dynamic*,
passing information across adjacent time steps.
As the unfolded view in :numref:`fig_unfolded-rnn` reveals,
RNNs can be thought of as feedforward neural networks
where each layer's parameters (both conventional and recurrent)
are shared across time steps.
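
To make the idea of unrolling concrete, here is a minimal sketch in plain Python. It assumes a $\tanh$ hidden-state update, zero initial state, and randomly drawn weights, none of which have been specified at this point in the chapter. The names `rnn_step`, `unroll`, `W_xh`, `W_hh`, and `b_h` are hypothetical and chosen only for illustration.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step (assumed update): h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    from_input = matvec(W_xh, x_t)        # conventional connection, same time step
    from_state = matvec(W_hh, h_prev)     # recurrent connection, previous time step
    return [math.tanh(a + r + b) for a, r, b in zip(from_input, from_state, b_h)]

def unroll(inputs, W_xh, W_hh, b_h):
    """Apply the *same* parameters at every time step of a sequence."""
    h = [0.0] * len(b_h)  # initial hidden state, assumed to be all zeros
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

random.seed(0)
num_inputs, num_hiddens = 3, 4
# Illustrative random parameters; a real model would learn these.
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(num_inputs)]
        for _ in range(num_hiddens)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(num_hiddens)]
        for _ in range(num_hiddens)]
b_h = [0.0] * num_hiddens

short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq + [[0.0, 0.0, 1.0]] * 3

short_states = unroll(short_seq, W_xh, W_hh, b_h)
long_states = unroll(long_seq, W_xh, W_hh, b_h)
print(len(short_states), len(long_states))  # one hidden state per time step
```

Because the loop reuses one set of parameters, the same function handles sequences of any length, and the parameter count does not grow with the number of time steps.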


![On the left, recurrent connections are depicted via cyclic edges. On the right, we unfold the RNN over time steps. Here, recurrent edges span adjacent time steps, while conventional connections are computed synchronously.](../img/unfolded-rnn.svg)
:label:`fig_unfolded-rnn`


Like neural networks more broadly,
RNNs have a long discipline-spanning history,
originating as models of the brain popularized
by cognitive scientists and subsequently adopted
as practical modeling tools employed
by the machine learning community.
As we do for deep learning more broadly,
in this book we adopt the machine learning perspective,
focusing on RNNs as practical tools that rose
to popularity in the 2010s owing to
breakthrough results on such diverse tasks
as handwriting recognition :cite:`graves2008novel`,
machine translation :cite:`Sutskever.Vinyals.Le.2014`,
and recognizing medical diagnoses :cite:`Lipton.Kale.2016`.
We point the reader interested in more
background material to a publicly available
comprehensive review :cite:`Lipton.Berkowitz.Elkan.2015`.
We also note that sequentiality is not unique to RNNs.
For example, the CNNs that we already introduced
can be adapted to handle data of varying length,
e.g., images of varying resolution.
Moreover, RNNs have recently ceded considerable
market share to Transformer models,
which will be covered in :numref:`chap_attention-and-transformers`.
However, RNNs rose to prominence as the default models
for handling complex sequential structure in deep learning,
and remain staple models for sequential modeling to this day.
The stories of RNNs and of sequence modeling
are inextricably linked, and this is as much
a chapter about the ABCs of sequence modeling problems
as it is a chapter about RNNs.


One key insight paved the way for a revolution in sequence modeling.
While the inputs and targets for many fundamental tasks in machine learning
cannot easily be represented as fixed-length vectors,
they can often nevertheless be represented as
varying-length sequences of fixed-length vectors.
For example, documents can be represented as sequences of words;
medical records can often be represented as sequences of events
(encounters, medications, procedures, lab tests, diagnoses);
videos can be represented as varying-length sequences of still images.
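
As a small illustration of this representation, the sketch below encodes two toy documents of different lengths as sequences of one-hot vectors. The documents, the whitespace tokenization, and the helper names `one_hot` and `encode` are assumptions made only for this example. One-hot encoding is just one simple way to obtain fixed-length vectors.

```python
def one_hot(index, size):
    """Return a fixed-length vector with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def encode(document, vocab):
    """Map a document to a sequence of fixed-length vectors, one per word."""
    return [one_hot(vocab[word], len(vocab)) for word in document.split()]

# Toy documents (illustrative only), split on whitespace.
docs = ["the cat sat", "the dog sat on the mat"]
words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: i for i, word in enumerate(words)}

encoded = [encode(doc, vocab) for doc in docs]
lengths = [len(seq) for seq in encoded]                  # varies per document
widths = {len(vec) for seq in encoded for vec in seq}    # identical for all vectors
print(lengths, widths)  # [3, 6] {6}
```

The sequence length differs from document to document, while every element of every sequence has the same dimensionality, which is exactly the shape of data that sequence models are built to consume.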


While sequence models have popped up in numerous application areas,
basic research in the area has been driven predominantly
by advances on core tasks in natural language processing.
Thus, throughout this chapter, we will focus
our exposition and examples on text data.
If you get the hang of these examples,
then applying the models to other data modalities
should be relatively straightforward.
In the next few sections, we introduce basic
notation for sequences and some evaluation measures
for assessing the quality of sequentially structured model outputs.
After that, we discuss basic concepts of a language model
and use this discussion to motivate our first RNN models.
Finally, we describe the method for calculating gradients
when backpropagating through RNNs and explore some challenges
that are often encountered when training such networks,
motivating the modern RNN architectures that will follow
in :numref:`chap_modern_rnn`.

:begin_tab:toc
 - [sequence](sequence.ipynb)
 - [text-sequence](text-sequence.ipynb)
 - [language-model](language-model.ipynb)
 - [rnn](rnn.ipynb)
 - [rnn-scratch](rnn-scratch.ipynb)
 - [rnn-concise](rnn-concise.ipynb)
 - [bptt](bptt.ipynb)
:end_tab:

