# Enriching Word Vectors with Subword Information
# 用Subword信息丰富词向量

### Abstract
### 摘要

* Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained quickly on large corpora, and it allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

* 在大型未标记语料库上训练的连续词表示对于许多自然语言处理任务都是有用的。学习这种表示法的流行模型忽略了单词的词形，它们为每个单词指定一个不同的向量。这是一个限制，特别是对于那些词汇量大、词汇量少的语言。本文在skipgram模型的基础上，提出了一种新的方法，将每个单词表示为一包字符n-grams。向量表示与每个字符n-gram相关联；单词表示为这些表示的总和。我们的方法是快速的，允许在大型语料库上快速训练模型，并允许我们计算未出现在训练数据中的单词的单词表示。我们在九种不同的语言中评估我们的词汇表达，包括词汇相似度和类比任务。通过与最近提出的形态词表示方法的比较，我们证明我们的向量在这些任务上达到了最先进的性能。

### 1 Introduction
### 1 简介

* Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). These representations are typically derived from large unlabeled corpora using co-occurrence statistics (Deerwester et al., 1990; Schütze, 1992; Lund and Burgess, 1996). A large body of work, known as distributional semantics, has studied the properties of these methods (Turney et al., 2010; Baroni and Lenci, 2010). In the neural network community, Collobert and Weston (2008) proposed to learn word embeddings using a feed-forward neural network, by predicting a word based on the two words on the left and two words on the right. More recently, Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.

* 学习单词的连续表示在自然语言处理中有着悠久的历史（Rumelhart等人，1988）。这些表示通常来自使用共现统计的大型未标记语料库（Deerwester等人，1990年；Schütze，1992年；Lund和Burgess，1996年）。大量的工作，被称为分布语义学，已经研究了这些方法的性质（Turney等人，2010；Baroni和Lenci，2010）。在神经网络社区中，Collobert和Weston（2008）提出使用前馈神经网络来学习单词嵌入，方法是根据左边的两个单词和右边的两个单词来预测单词。最近，Mikolov等人。（2013b）提出了简单的对数双线性模型来有效地学习非常大语料库上单词的连续表示。

* Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.

* 这些技术中的大多数都是通过一个不同的向量来表示词汇表中的每个单词，而没有参数共享。特别是，他们忽略了单词的内部结构，这是对形态丰富的语言，如土耳其语或芬兰语的一个重要限制。例如，在法语或西班牙语中，大多数动词有超过40种不同的屈折变化形式，而芬兰语有15种名词情况。这些语言包含许多在训练语料库中很少出现（或根本没有出现）的单词形式，这使得学习好的单词表示变得困难。由于许多构词法遵循一定的规则，因此利用字符级信息可以改进形态丰富语言的矢量表示。

* In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. Our main contribution is to introduce an extension of the continuous skip-gram model (Mikolov et al., 2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach.

* 在本文中，我们建议学习字符n-grams的表示，并将单词表示为n-gram向量的和。我们的主要贡献是引入了考虑subword信息的continuous skip-gram模型（Mikolov等人，2013b）的扩展。我们在九种呈现不同形态的语言上评估了这个模型，显示了我们方法的优点。

### 2 Related work
### 2 相关工作

* Morphological word representations. In recent years, many methods have been proposed to incorporate morphological information into word representations. To model rare words better, Alexandrescu and Kirchhoff (2006) introduced factored neural language models, where words are represented as sets of features. These features might include morphological information, and this technique was successfully applied to morphologically rich languages, such as Turkish (Sak et al., 2010). Recently, several works have proposed different composition functions to derive representations of words from morphemes (Lazaridou et al., 2013; Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014). These different approaches rely on a morphological decomposition of words, while ours does not. Similarly, Chen et al. (2015) introduced a method to jointly learn embeddings for Chinese words and characters. Cui et al. (2015) proposed to constrain morphologically similar words to have similar representations. Soricut and Och (2015) described a method to learn vector representations of morphological transformations, which makes it possible to obtain representations for unseen words by applying these rules. Word representations trained on morphologically annotated data were introduced by Cotterell and Schütze (2015). Closest to our approach, Schütze (1993) learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-gram representations. Very recently, Wieting et al. (2016) also proposed to represent words using character n-gram count vectors. However, the objective function used to learn these representations is based on paraphrase pairs, while our model can be trained on any text corpus.


* 形态词表征。近年来，人们提出了许多方法来将形态学信息融入到词的表示中。为了更好地对稀有单词进行建模，Alexandrescu和Kirchhoff（2006）引入了因子神经语言模型，其中单词被表示为一组特征。这些特征可能包括形态信息，并且这种技术成功地应用于形态丰富的语言，例如土耳其语（Sak等人，2010）。最近，有几部著作提出了不同的构词功能，以从语素中提取单词的表示（Lazaridou等人，2013年；Luong等人，2013年；Botha和Blunsom，2014年；邱等人，2014年）。这些不同的方法依赖于单词的形态分解，而我们的方法则不依赖。同样，Chen等人。（2015）介绍了一种联合学习汉字嵌入的方法。Cui等人。（2015）提议限制形态相似的词具有相似的表示。Soricut和Och（2015）描述了一种学习形态转换的矢量表示的方法，允许通过应用这些规则来获得不可见单词的表示。Cotterell和Schütze（2015）引入了基于形态学注释数据的单词表示。与我们的方法最接近的是，Schütze（1993）通过奇异值分解学习了字符四个g的表示，并通过对四个g的表示求和得到了单词的表示。最近，Wieting等人。（2016）还建议使用字符n-gram计数向量表示单词。然而，用于学习这些表示的目标函数是基于释义对的，而我们的模型可以在任何文本语料库上进行训练。

* Character level features for NLP. Another area of research closely related to our work is character-level modeling for natural language processing. These models discard the segmentation into words and aim at learning language representations directly from characters. A first class of such models consists of recurrent neural networks, applied to language modeling (Mikolov et al., 2012; Sutskever et al., 2011; Graves, 2013; Bojanowski et al., 2015), text normalization (Chrupała, 2014), part-of-speech tagging (Ling et al., 2015) and parsing (Ballesteros et al., 2015). Another family of models consists of convolutional neural networks trained on characters, which were applied to part-of-speech tagging (dos Santos and Zadrozny, 2014), sentiment analysis (dos Santos and Gatti, 2014), text classification (Zhang et al., 2015) and language modeling (Kim et al., 2016). Sperr et al. (2013) introduced a language model based on restricted Boltzmann machines, in which words are encoded as a set of character n-grams. Finally, recent works in machine translation have proposed using subword units to obtain representations of rare words (Sennrich et al., 2016; Luong and Manning, 2016).

* NLP的字符级功能。与我们工作密切相关的另一个研究领域是自然语言处理的字符级模型。这些模型摒弃了分词，旨在直接从字符中学习语言表示。第一类模型是递归神经网络，应用于语言建模（Mikolov等人，2012年；Sutskever等人，2011年；Graves，2013年；Bojanowski等人，2015年）、文本规范化（ChrupałA，2014年）、词性标注（Ling等人，2015年）和句法分析（Ballesteros等人，2015年）。另一类模型是基于字符训练的卷积神经网络，用于词性标注（dos Santos和Zadrozny，2014）、情感分析（dos Santos和Gatti，2014）、文本分类（Zhang等人，2015）和语言建模（Kim等人，2016）。Sperr等人。（2013）引入了一种基于受限Boltzmann机器的语言模型，其中单词被编码为一组字符n-grams。最后，最近在机器翻译方面的工作提出使用子词单位来获得稀有词的表示（Sennrich等人，2016；Luong和Manning，2016）。

### 3 Model
### 3 模型

* In this section, we propose our model to learn word representations while taking morphology into account. We model morphology by considering subword units and representing each word by the sum of its character n-grams. We begin by presenting the general framework that we use to train word vectors, then present our subword model, and finally describe how we handle the dictionary of character n-grams.


* 在这一部分中，我们提出了我们的模型来学习单词表示，同时考虑到形态学。我们通过考虑subword单位来建模形态学，并通过其字符n-gram的总和来表示词。我们将首先介绍用于训练单词向量的通用框架，然后介绍子单词模型，并最终描述如何处理字符n-grams字典。

#### 3.1 General model
#### 3.1 通用模型

* We start by briefly reviewing the continuous skip-gram model introduced by Mikolov et al. (2013b), from which our model is derived. Given a word vocabulary of size $W$, where a word is identified by its index $w \in \{1, \dots, W\}$, the goal is to learn a vectorial representation for each word $w$. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context. More formally, given a large training corpus represented as a sequence of words $w_1, \dots, w_T$, the objective of the skipgram model is to maximize the following log-likelihood:


* 我们首先简要回顾了Mikolov等人引入的continuous skip-gram模型。（2013b），我们的模型由此导出。给定一个大小为W的单词词汇表，其中一个单词由其索引W∈{1，…，W}识别，其目标是学习每个单词W的向量表示法。受分布假设（Harris，1954）的启发，单词表示法被训练来预测在其上下文中出现的单词。更正式地说，给定一个以单词w1，…，wT序列表示的大型训练语料库，skipgram模型的目标是最大化以下对数似然

![avatar](图片/1.png)

* where the context $C_t$ is the set of indices of words surrounding word $w_t$. The probability of observing a context word $w_c$ given $w_t$ will be parameterized using the aforementioned word vectors. For now, let us consider that we are given a scoring function $s$ which maps pairs of (word, context) to scores in $\mathbb{R}$.

* 其中上下文$C_t$是围绕单词$w_t$的一组单词索引。在给定$w_t$的情况下，观察上下文单词$w_c$的概率将使用上述单词向量参数化。现在，让我们考虑一个评分函数$s$，它将成对的（单词、上下文）映射到$R$中的分数。

* One possible choice to define the probability of a context word is the softmax:

* 定义上下文单词概率的一种可能选择是softmax：

![avatar](图片/2.png)

* However, such a model is not adapted to our case as it implies that, given a word $w_t$, we only predict one context word $w_c$.

* 然而，这样的模型并不适合我们的情况，因为它意味着，给定一个单词$w_t$，我们只预测一个上下文单词$w_c$。
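* As a rough illustration of the softmax choice discussed above, the sketch below assumes it has the usual form $p(w_c \mid w_t) = e^{s(w_t, w_c)} / \sum_{j=1}^{W} e^{s(w_t, j)}$. The function name `softmax_probs` and the example scores are hypothetical and are not taken from the paper.

```python
import math

def softmax_probs(scores):
    """Turn scores s(w_t, j), one per vocabulary word j, into probabilities."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores of one word w_t against a 4-word vocabulary.
probs = softmax_probs([2.0, 0.5, -1.0, 0.5])
print(probs)
```

* Because the denominator runs over the whole vocabulary, the probabilities of all candidate context words are coupled, which is one way to read the remark that this formulation predicts a single context word per target word.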

* The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:

* 上下文词的预测问题可以被构造成一组独立的二元分类任务。然后目标是独立预测上下文单词的存在（或不存在）。对于位置t处的单词，我们认为所有上下文单词都是积极的例子，并从字典中随机抽取否定词。对于选定的上下文位置c，使用二进制logistic损失，我们得到以下负对数似然：

![avatar](图片/3.png)

* where $N_{t,c}$ is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function $\ell : x \mapsto \log(1 + e^{-x})$, we can rewrite the objective as:

* 其中$N{t,c}$是从词汇表中抽取的一组否定示例。通过表示逻辑损失函数$l:x→log（1+e^{-x}）$，我们可以将目标重写为：

![avatar](图片/4.png)

* A natural parameterization for the scoring function $s$ between a word $w_t$ and a context word $w_c$ is to use word vectors. Let us define for each word $w$ in the vocabulary two vectors $u_w$ and $v_w$ in $\mathbb{R}^d$. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors $u_{w_t}$ and $v_{w_c}$, corresponding, respectively, to words $w_t$ and $w_c$. Then the score can be computed as the scalar product between word and context vectors as $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).

* 评分函数s在单词$w_t$和上下文单词$w_c$之间的自然参数化是使用单词向量。让我们为词汇表中的每个单词$w$定义两个向量$u_w$和$R^d$中的$u_w$。在文献中，这两个向量有时被称为输入和输出向量。特别是，向量$u_{wt}$和$v_{wc}$，分别对应于单词$w_{t}$和$w_{c}$。然后，分数可以作为单词和上下文向量之间的标量积来计算，即$s（w_t，w_c）=u^⊤_{wt} v_{wc}$。本节描述的模型是Mikolov等人引入的负采样skipgram模型。（2013年b）。
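* The negative-sampling objective described in this section can be sketched as follows. The sketch assumes the per-pair loss has the form $\ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, n))$ with $\ell(x) = \log(1 + e^{-x})$ and $s$ a dot product. The function names and the toy vectors are illustrative assumptions, not the authors' code.

```python
import math

def logistic_loss(x):
    """l(x) = log(1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(u_t, v_c, negatives):
    """Loss for one (word, context) pair plus its sampled negatives.

    u_t: input vector of the word at position t
    v_c: output vector of the observed context word (positive example)
    negatives: output vectors of the sampled negative words
    """
    loss = logistic_loss(dot(u_t, v_c))        # pull the positive pair together
    for v_n in negatives:
        loss += logistic_loss(-dot(u_t, v_n))  # push negative pairs apart
    return loss

# Toy 2-dimensional vectors (made up for illustration).
print(negative_sampling_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]]))
```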

#### 3.2 Subword model
#### 3.2 子词模型

* By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function $s$, in order to take into account this information.

* 通过对每个单词使用不同的向量表示，skipgram模型忽略了单词的内部结构。在这一节中，我们提出了一个不同的评分函数s，以考虑这一信息。

* Each word $w$ is represented as a bag of character n-grams. We add special boundary symbols `<` and `>` at the beginning and end of words, which makes it possible to distinguish prefixes and suffixes from other character sequences. We also include the word $w$ itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word *where* and $n = 3$ as an example, it will be represented by the character n-grams:

* 每个单词w都表示为一包字符n-gram。我们在单词的开头和结尾添加特殊的边界符号<和>，允许从其他字符序列中分离前缀和后缀。我们还将单词w本身包含在它的n-grams集合中，以学习每个单词的表示（除了字符n-grams）。以where和n=3为例，用n-grams表示：

![avatar](图片/5.png)

* and the special sequence

* 还有特殊的顺序

![avatar](图片/6.png)

* Note that the sequence `<her>`, corresponding to the word *her*, is different from the tri-gram `her` from the word *where*. In practice, we extract all the n-grams for $3 \le n \le 6$. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.

* 注意，与单词her相对应的序列 her，与单词where的三元语法her不同。在实际应用中，我们提取了n大于等于3，小于等于6的所有n-gram。这是一种非常简单的方法，并且可以考虑不同的n-grams集，例如取所有前缀和后缀。
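* The n-gram extraction described above can be sketched in a few lines. The function name `char_ngrams` and its parameters are hypothetical. As a simplification, the whole-word sequence is stored in the same set as the n-grams, so for short words it can coincide with an ordinary n-gram.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` plus the whole-word sequence."""
    wrapped = "<" + word + ">"  # add the boundary symbols
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)  # special sequence standing for the word itself
    return grams

# The example from the text: the word "where" with n = 3.
print(sorted(char_ngrams("where", min_n=3, max_n=3)))
```

* With the boundary symbols in place, the word *her* contributes the sequence `<her>`, while *where* only contributes the inner tri-gram `her`, so the two are kept apart.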

* Suppose that you are given a dictionary of n-grams of size $G$. Given a word $w$, let us denote by $\mathcal{G}_w \subset \{1, \dots, G\}$ the set of n-grams appearing in $w$. We associate a vector representation $z_g$ with each n-gram $g$. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

* 假设给你一本大小为$G$的词典。给定一个单词$w$，让我们用$Gw⊂\{1，…，G\}$表示出现在$w$中的n个G的集合。我们将向量表示$z_g$与每个n-gram$g$相关联。我们用一个词的n-gram表示的和来表示它。从而得到得分函数：

![avatar](图片/7.png)

* This simple model allows representations to be shared across words, which makes it possible to learn reliable representations for rare words.

* 这个简单的模型允许跨单词共享表示，从而允许学习稀有单词的可靠表示。
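* A minimal sketch of the subword scoring function, assuming it sums the n-gram vectors $z_g$ of a word and takes the dot product with the context vector $v_c$. The tables `Z` and `V`, the index lists and the function names are toy assumptions for illustration.

```python
def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def word_vector(ngram_ids, Z):
    """Sum the n-gram vectors z_g for every g in the word's n-gram set."""
    dim = len(Z[0])
    vec = [0.0] * dim
    for g in ngram_ids:
        for k in range(dim):
            vec[k] += Z[g][k]
    return vec

def score(ngram_ids, context_id, Z, V):
    """s(w, c): dot product of the summed n-gram vectors with v_c."""
    return dot(word_vector(ngram_ids, Z), V[context_id])

# Toy tables: 3 n-gram vectors and 2 context vectors of dimension 2.
Z = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
V = [[1.0, 1.0], [2.0, -1.0]]
print(score([0, 2], 1, Z, V))
```

* Two words that share an n-gram index read and update the same row of `Z`, which is where the sharing of representations across words comes from.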

* In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to $K$. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant). We set $K = 2 \cdot 10^6$ below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.

* 为了限制我们模型的内存需求，我们使用了一个哈希函数，将n-grams映射为1到K的整数。我们使用Fowler Noll Vo哈希函数（特别是FNV-1a变量）对字符序列进行哈希。我们在下面设置了$K=2.10^6$。最终，一个单词由它在单词词典中的索引和它所包含的一组散列n-grams来表示。
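* The hashing step can be illustrated with a plain 32-bit FNV-1a over UTF-8 bytes. This is a sketch under assumptions: the text does not say which bit width or byte handling is used, and the helper `ngram_bucket` with its 1-based range simply follows the "integers in 1 to $K$" wording above.

```python
FNV_OFFSET_BASIS = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619           # 32-bit FNV prime

def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 encoding of `text`."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                         # FNV-1a: XOR first ...
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # ... then multiply, keep 32 bits
    return h

def ngram_bucket(ngram, num_buckets=2_000_000):
    """Map an n-gram to an integer in 1..num_buckets (hypothetical helper)."""
    return fnv1a_32(ngram) % num_buckets + 1

print(ngram_bucket("<wh"), ngram_bucket("her"))
```

* Distinct n-grams can collide in the same bucket and then share one vector. That is the price paid for a table whose size is fixed in advance.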

### 4 Experimental setup
### 4 实验设置

#### 4.1 Baseline
#### 4.1 基线

* In most experiments (except in Sec. 5.3), we compare our model to the C implementation of the skipgram and cbow models from the word2vec package.

* 在大多数实验中（除了Sec. 5.3），我们将我们的模型与word2vec包中skipgram和cbow模型的C实现进行了比较。

#### 4.2 Optimization
#### 4.2 优化

* We solve our optimization problem by performing stochastic gradient descent on the negative log-likelihood presented before. As in the baseline skipgram model, we use a linear decay of the step size. Given a training set containing $T$ words and a number of passes over the data equal to $P$, the step size at time $t$ is equal to $\gamma_0 (1 - \frac{t}{TP})$, where $\gamma_0$ is a fixed parameter. We carry out the optimization in parallel, by resorting to Hogwild (Recht et al., 2011). All threads share parameters and update vectors in an asynchronous manner.

* 我们通过对前面提出的负对数似然进行随机梯度下降来解决我们的优化问题。在基线skipgram模型中，我们使用步长的线性衰减。给定一个包含T个单词的训练集，并且在数据上传递的次数等于P，则时间T的步长等于$γ0（1-\frac{T}{TP}）$，其中$γ0$是一个固定参数。我们通过使用Hogwild并行执行优化（Recht等人，2011）。所有线程以异步方式共享参数和更新向量。
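* The linear decay of the step size can be written directly from the formula above. The function and argument names are hypothetical, and the corpus size used in the example is made up.

```python
def step_size(t, total_words, passes, gamma0):
    """Step size gamma0 * (1 - t / (T * P)) after t processed words."""
    return gamma0 * (1.0 - t / (total_words * passes))

# Toy setting: T = 1000 words, P = 5 passes, gamma0 = 0.05.
for t in (0, 2500, 5000):
    print(t, step_size(t, total_words=1000, passes=5, gamma0=0.05))
```

* The step size starts at $\gamma_0$ and reaches zero exactly when the last word of the last pass has been processed.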

#### 4.3 Implementation details
#### 4.3 实施细节

* For both our model and the baseline experiments, we use the following parameters: the word vectors have dimension 300. For each positive example, we sample 5 negatives at random, with probability proportional to the square root of the unigram frequency. We use a context window of size $c$, and uniformly sample the size $c$ between 1 and 5. In order to subsample the most frequent words, we use a rejection threshold of $10^{−4}$ (for more details, see Mikolov et al., 2013b). When building the word dictionary, we keep the words that appear at least 5 times in the training set. The step size $γ_0$ is set to 0.025 for the skipgram baseline and to 0.05 for both our model and the cbow baseline. These are the default values in the word2vec package and work well for our model too.

* 对于我们的模型和基线实验，我们使用以下参数：单词向量的维数为300。对于每个正的例子，我们随机抽取5个负的样本，其概率与单位频率的平方根成正比。我们使用大小为c的上下文窗口，并在1到5之间对大小c进行统一采样。为了对最频繁的单词进行子抽样，我们使用$10^{-4}$的拒绝阈值（有关更多详细信息，请参见（Mikolov等人，2013b））。在构建单词词典时，我们会保留在训练集中出现至少5次的单词。skipgram基线的步长$γ_0$设置为0.025，模型和cbow基线的步长均设置为0.05。这些是word2vec包中的默认值，也适用于我们的模型。
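* Two of the sampling choices above, negatives drawn with probability proportional to the square root of the unigram frequency and a window size drawn uniformly between 1 and 5, can be sketched as follows. The word counts are invented, the function names are hypothetical, and the sketch does not exclude the positive word from the negatives.

```python
import math
import random

def negative_sampling_distribution(counts):
    """Probability of each word, proportional to sqrt(unigram count)."""
    weights = {w: math.sqrt(c) for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

def sample_negatives(dist, k=5, rng=random):
    """Draw k negative words (with replacement) from the distribution."""
    words = list(dist)
    probs = [dist[w] for w in words]
    return rng.choices(words, weights=probs, k=k)

def sample_window_size(rng=random, max_size=5):
    """Context window size c, uniform over 1..max_size."""
    return rng.randint(1, max_size)

counts = {"the": 100, "cat": 4, "sat": 1}  # made-up unigram counts
dist = negative_sampling_distribution(counts)
rng = random.Random(0)
print(dist)
print(sample_negatives(dist, k=5, rng=rng), sample_window_size(rng))
```

* The square root flattens the distribution: here "the" is 100 times more frequent than "sat" but only 10 times more likely to be drawn as a negative.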

* Using this setting on English data, our model with character n-grams is approximately 1.5× slower to train than the skipgram baseline. Indeed, we process 105k words/second/thread versus 145k words/second/thread for the baseline. Our model is implemented in C++, and is publicly available.

* 在英语数据上使用此设置，我们的字符n-grams模型的训练速度大约比skipgram基线慢1.5倍。实际上，我们处理105k字/秒/线程，而基线处理145k字/秒/线程。我们的模型是在C++中实现的，并且是公开的。

#### 4.4 Datasets
#### 4.4 数据集

* Except for the comparison to previous work (Sec. 5.3), we train our models on Wikipedia data. We downloaded Wikipedia dumps in nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian. We normalize the raw Wikipedia data using Matt Mahoney’s pre-processing Perl script. All the datasets are shuffled, and we train our models by doing five passes over them.

* 除了与先前工作的比较（Sec. 5.3），我们在维基百科数据上训练我们的模型。我们下载了九种语言的维基百科转储文件：阿拉伯语、捷克语、德语、英语、西班牙语、法语、意大利语、罗马尼亚语和俄语。我们使用Matt Mahoney的预处理perl脚本规范化原始Wikipedia数据。所有数据集都被洗牌，我们通过对它们进行五次传递来训练模型。

### 5 Results
### 5 结果

* We evaluate our model in five experiments: an evaluation of word similarity and word analogies, a comparison to state-of-the-art methods, an analysis of the effect of the size of training data and of the size of character n-grams that we consider. We will describe these experiments in detail in the following sections.


* 我们在五个实验中评估我们的模型：单词相似性和单词类比的评价，与最先进的方法的比较，分析训练数据的大小和我们所考虑的字符n-gram的大小的影响。我们将在以下部分详细描述这些实验。

#### 5.1 Human similarity judgement
#### 5.1 人类相似性判断

* We first evaluate the quality of our representations on the task of word similarity / relatedness. We do so by computing Spearman’s rank correlation coefficient (Spearman, 1904) between human judgement and the cosine similarity between the vector representations. For German, we compare the different models on three datasets: GUR65, GUR350 and ZG222 (Gurevych, 2005; Zesch and Gurevych, 2006). For English, we use the WS353 dataset introduced by Finkelstein et al. (2001) and the rare word dataset (RW), introduced by Luong et al. (2013). We evaluate the French word vectors on the translated dataset RG65 (Joubarne and Inkpen, 2011). Spanish, Arabic and Romanian word vectors are evaluated using the datasets described in (Hassan and Mihalcea, 2009). Russian word vectors are evaluated using the HJ dataset introduced by Panchenko et al. (2016).

* 我们首先评估我们在单词相似性/关联性任务中的表示质量。我们通过计算人类判断和向量表示之间的余弦相似性之间的Spearman秩相关系数（Spearman，1904）来实现。对于德语，我们比较了三个数据集上的不同模型：GUR65、GUR350和ZG222（Gurevych，2005；Zesch和Gurevych，2006）。对于英语，我们使用Finkelstein等人引入的WS353数据集。（2001）和Luong等人介绍的稀有词数据集（RW）。（2013年）。我们在翻译后的数据集RG65上评估法语单词向量（Joubarne and Inkpen，2011）。西班牙语、阿拉伯语和罗马尼亚语单词向量使用中描述的数据集进行评估（Hassan和Mihalcea，2009）。使用Panchenko等人引入的HJ数据集对俄语单词向量进行评估。（2016年）。
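* The evaluation protocol above compares model cosine similarities with human scores through Spearman's rank correlation. The stdlib-only sketch below computes it as the Pearson correlation of average ranks. All names and the toy numbers are assumptions, and returning 0 for a zero-norm vector is a convention chosen here, not something stated in the text.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data: human scores and model cosine similarities for 4 word pairs.
human = [9.0, 7.5, 3.0, 1.0]
model = [0.8, 0.6, 0.7, 0.1]
print(spearman(human, model))
```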

* We report results for our method and baselines for all datasets in Table 1. Some words from these datasets do not appear in our training data, and thus, we cannot obtain word representation for these words using the cbow and skipgram baselines. In order to provide comparable results, we propose by default to use null vectors for these words. Since our model exploits subword information, we can also compute valid representations for out-of-vocabulary words. We do so by taking the sum of its n-gram vectors. When OOV words are represented using null vectors we refer to our method as sisg- and sisg otherwise (Subword Information Skip Gram).

* 我们报告方法的结果和表1中所有数据集的基线。这些数据集中的一些单词没有出现在我们的训练数据中，因此，我们无法使用cbow和skipgram基线获取这些单词的单词表示。为了提供可比较的结果，我们建议在默认情况下对这些词使用空向量。由于我们的模型利用了子词信息，我们还可以计算出词汇表外单词的有效表示。我们通过求其n-gram向量的和来实现。当OOV单词使用空向量表示时，我们将我们的方法称为sisg-否则称为sisg（子单词信息skipGram）。
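* The two treatments of out-of-vocabulary words (null vector for sisg-, sum of n-gram vectors for sisg) can be sketched as below. The dictionary `ngram_vectors` keyed by n-gram string is a toy stand-in: in the hashed setting of Sec. 3.2 every n-gram would map to some bucket. All names and values are illustrative assumptions.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` with boundary symbols, plus the full sequence."""
    wrapped = "<" + word + ">"
    grams = {wrapped}
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

def oov_vector(word, ngram_vectors, dim, use_subwords=True):
    """Vector for an unseen word.

    use_subwords=False mimics sisg- (null vector);
    use_subwords=True mimics sisg (sum of the known n-gram vectors).
    """
    vec = [0.0] * dim
    if not use_subwords:
        return vec
    for g in char_ngrams(word):
        if g in ngram_vectors:
            for k in range(dim):
                vec[k] += ngram_vectors[g][k]
    return vec

# Toy n-gram table (made-up values), as if learned from other words.
ngram_vectors = {"<te": [1.0, 0.0], "ten": [0.0, 1.0], "nis": [0.5, 0.5]}
print(oov_vector("tennis", ngram_vectors, dim=2))
print(oov_vector("tennis", ngram_vectors, dim=2, use_subwords=False))
```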

![avatar](图片/8.png)

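<center>Table 1: Correlation between human judgement and similarity scores on word similarity datasets. We train both our model and the word2vec baseline on normalized Wikipedia dumps. Evaluation datasets contain words that are not part of the training set, so we represent them using null vectors (sisg-). With our model, we also compute vectors for unseen words by summing the n-gram vectors (sisg).</center>

<center>表1：人类判断与词相似度数据集相似度得分的相关性。我们在规范化的维基百科转储上训练我们的模型和word2vec基线。评估数据集包含不属于训练集的单词，因此我们使用空向量（sisg-）表示它们。在我们的模型中，我们还通过对n-gram向量（sisg）求和来计算不可见单词的向量。</center>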

* First, by looking at Table 1, we notice that the proposed model (sisg), which uses subword information, outperforms the baselines on all datasets except the English WS353 dataset. Moreover, computing vectors for out-of-vocabulary words (sisg) is always at least as good as not doing so (sisg-). This proves the advantage of using subword information in the form of character n-grams.

* 首先，通过查看表1，我们注意到所提出的模型（sisg）使用子词信息，在除英语WS353数据集之外的所有数据集上都优于基线。此外，计算词汇表外单词的向量（sisg）至少和不计算词汇表外单词的向量（sisg-）一样好。这证明了以字符n-grams的形式使用子词信息的优势。

* Second, we observe that the effect of using character n-grams is more important for Arabic, German and Russian than for English, French or Spanish. German and Russian exhibit grammatical declensions with four cases for German and six for Russian. Also, many German words are compound words; for instance the nominal phrase “table tennis” is written in a single word as “Tischtennis”. By exploiting the character-level similarities between “Tischtennis” and “Tennis”, our model does not represent the two words as completely different words.

* 其次，我们观察到，对于阿拉伯语、德语和俄语来说，使用字符n-grams的效果比英语、法语或西班牙语更为重要。德语和俄语有语法变化，德语有四个格，俄语有六个格。此外，许多德语单词是复合词；例如，名词短语“乒乓球”是用一个单词写成的“Tischtennis”。通过挖掘“Tischtennis”和“tensins”之间的字符级相似性，我们的模型不能将这两个词表示为完全不同的词。

* Finally, we observe that on the English Rare Words dataset (RW), our approach outperforms the baselines while it does not on the English WS353 dataset. This is due to the fact that words in the English WS353 dataset are common words for which good vectors can be obtained without exploiting subword information. When evaluating on less frequent words, we see that using similarities at the character level between words can help learning good word vectors.

* 最后，我们观察到，在英语稀有词数据集（RW）上，我们的方法优于基线，而在英语WS353数据集上则没有。这是因为英语WS353数据集中的词是常用词，不需要利用子词信息就可以获得好的向量。在对不太频繁的单词进行评估时，我们发现在单词之间的字符级使用相似度有助于学习好的单词向量。

![avatar](图片/9.png)

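<center>Table 2: Accuracy of our model and baselines on word analogy tasks for Czech, German, English and Italian. We report results for semantic and syntactic analogies separately.</center>

<center>表2：捷克语、德语、英语和意大利语单词类比任务模型和基线的准确性。我们分别报告语义类比和句法类比的结果。</center>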

#### 5.2 Word analogy tasks
#### 5.2 词汇类比任务

* We now evaluate our approach on word analogy questions, of the form A is to B as C is to D, where D must be predicted by the models. We use the datasets introduced by Mikolov et al. (2013a) for English, by Svoboda and Brychcin (2016) for Czech, by Köper et al. (2015) for German and by Berardi et al. (2015) for Italian. Some questions contain words that do not appear in our training corpus, and we thus excluded these questions from the evaluation.

* 我们现在评估我们在单词类比问题上的方法，A是B，C是D，其中D必须由模型预测。我们使用Mikolov等人介绍的数据集。（2013a）英语，斯沃博达和布莱希金（2016）捷克语，Kóper等人。（2015）德语版，Berardi等人。（2015）意大利语。有些问题包含了我们的训练语料库中没有出现的单词，因此我们将这些问题排除在评估之外。
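* This section does not spell out how $D$ is predicted. One common choice, assumed here purely for illustration, is to pick the vocabulary word whose vector is closest in cosine similarity to $v_B - v_A + v_C$, excluding the three question words. The toy vectors and function names are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by nearest cosine neighbour of b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    best_word, best_sim = None, -float("inf")
    for word, vec in vectors.items():
        if word in (a, b, c):  # the question words are not valid answers
            continue
        sim = cosine(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Toy 3-dimensional vectors (made up for illustration).
vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [1.0, 1.0, 0.0],
}
print(solve_analogy("man", "king", "woman", vectors))
```

* Accuracy on an analogy dataset is then the fraction of questions for which the returned word equals the expected $D$.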

* We report accuracy for the different models in Table 2. We observe that morphological information significantly improves the syntactic tasks; our approach outperforms the baselines. In contrast, it does not help for semantic questions, and even degrades the performance for German and Italian. Note that this is tightly related to the choice of the length of character n-grams that we consider. We show in Sec. 5.5 that when the size of the n-grams is chosen optimally, the semantic analogies degrade less. Another interesting observation is that, as expected, the improvement over the baselines is more important for morphologically rich languages, such as Czech and German.

* 我们在表2中报告了不同模型的准确性。我们观察到形态信息显著改善了句法任务；我们的方法优于基线。相反，它对语义问题没有帮助，甚至降低了德语和意大利语的性能。请注意，这与我们所考虑的字符n-gram长度的选择密切相关。我们马上就来。5.5当最佳选择n-gram的大小时，语义类比的退化程度较小。另一个有趣的观察是，正如预期的那样，基线的改进对于形态丰富的语言（如捷克语和德语）更为重要。

#### 5.3 Comparison with morphological representations
#### 5.3 与形态表征的比较

* We also compare our approach to previous work on word vectors incorporating subword information on word similarity tasks. The methods used are: the recursive neural network of Luong et al. (2013), the morpheme cbow of Qiu et al. (2014) and the morphological transformations of Soricut and Och (2015). In order to make the results comparable, we trained our model on the same datasets as the methods we are comparing to: the English Wikipedia data released by Shaoul and Westbury (2010), and the news crawl data from the 2013 WMT shared task for German, Spanish and French. We also compare our approach to the log-bilinear language model introduced by Botha and Blunsom (2014), which was trained on the Europarl and news commentary corpora. Again, we trained our model on the same data to make the results comparable. Using our model, we obtain representations of out-of-vocabulary words by summing the representations of character n-grams. We report results in Table 3. We observe that our simple approach performs well relative to techniques based on subword information obtained from morphological segmentors. We also observe that our approach outperforms the Soricut and Och (2015) method, which is based on prefix and suffix analysis. The large improvement for German is due to the fact that their approach does not model noun compounding, contrary to ours.

* 我们还将我们的方法与以前的关于词向量的工作进行了比较，这些工作包含了关于词相似性任务的子词信息。所采用的方法有：Luong等人的递归神经网络。（2013），邱等的语素cbow。（2014）和Soricut和Och的形态转变（2015）。为了使结果具有可比性，我们在与我们正在比较的方法相同的数据集上训练我们的模型：Shaoul和Westbury（2010）发布的英语维基百科数据，以及2013年WMT德语、西班牙语和法语共享任务的新闻爬网数据。我们还将我们的方法与Botha和Blunsom（2014）引入的对数双线性语言模型进行了比较，后者是在Europarl和新闻评论语料库上进行训练的。再次，我们在相同的数据上训练我们的模型，使结果具有可比性。利用我们的模型，我们通过对字符n-grams的表示进行求和来获得词汇表外单词的表示。结果见表3。我们观察到，相对于基于从形态学分词器获得的子词信息的技术，我们的简单方法表现良好。我们还观察到，我们的方法优于基于前缀和后缀分析的Soricut和Och（2015）方法。德语的巨大进步是因为他们的方法与我们的方法相反，并没有模拟名词复合。

![avatar](图片/10.png)

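<center>Table 3: Spearman's rank correlation coefficient between human judgement and model scores for different methods using morphology to learn word representations. We keep all the word pairs of the evaluation set and obtain representations for out-of-vocabulary words with our model by summing the vectors of character n-grams. Our model was trained on the same datasets as the methods we are comparing to (hence the two lines of results for our approach).</center>

<center>表3：使用形态学学习单词表示的不同方法的人类判断和模型分数之间的Spearman秩相关系数。我们保留了评估集的所有词对，并通过对字符n-grams的向量求和，用我们的模型获得了词汇表外词的表示。我们的模型是在与我们比较的方法相同的数据集上训练的（因此我们的方法有两行结果）。</center>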

#### 5.4 Effect of the size of the training data
#### 5.4 训练数据大小的影响

* Since we exploit character-level similarities between words, we are able to better model infrequent words. Therefore, we should also be more robust to the size of the training data that we use. In order to assess that, we propose to evaluate the performance of our word vectors on the similarity task as a function of the training data size. To this end, we train our model and the cbow baseline on portions of Wikipedia of increasing size. We use the Wikipedia corpus described above and isolate the first 1, 2, 5, 10, 20, and 50 percent of the data. Since we don’t reshuffle the dataset, they are all subsets of each other. We report results in Fig. 1.

* 因为我们利用了单词之间的字符级相似性，所以我们能够更好地为不常用的单词建模。因此，我们还应该对所使用的训练数据的大小更加稳健。为了评估这一点，我们建议评估我们的词向量在相似性任务上的表现，作为训练数据大小的函数。为此，我们在不断增长的维基百科上训练我们的模型和cbow基线。我们使用上述维基百科语料库，分离出前1、2、5、10、20和50%的数据。因为我们不重新整理数据集，所以它们都是彼此的子集。我们在图1中报告结果。

* As in the experiment presented in Sec. 5.1, not all words from the evaluation set are present in the Wikipedia data. Again, by default, we use a null vector for these words (sisg-) or compute a vector by summing the n-gram representations (sisg). The out-of-vocabulary rate is growing as the dataset shrinks, and therefore the performance of sisg- and cbow necessarily degrades. However, the proposed model (sisg) assigns non-trivial vectors to previously unseen words.

* 在第二节的实验中。5.1，维基百科数据中并不是所有来自评估集的单词都存在。同样，在默认情况下，我们对这些单词使用一个空向量（sisg-），或者通过对n-gram表示（sisg）求和来计算向量。随着数据集的缩小，词汇量不足的比率也在增加，因此sisg和cbow的性能必然会下降。然而，所提出的模型（sisg）将非平凡向量分配给先前未看到的词。

* First, we notice that for all datasets, and all sizes, the proposed approach (sisg) performs better than the baseline. However, the performance of the baseline cbow model gets better as more and more data is available. Our model, on the other hand, seems to quickly saturate and adding more data does not always lead to improved results.

* 首先，我们注意到，对于所有数据集和所有大小的数据集，所提出的方法（sisg）的性能都优于基线。然而，随着越来越多的数据可用，基线cbow模型的性能变得更好。另一方面，我们的模型似乎很快就饱和了，添加更多的数据并不总能带来更好的结果。

* Second, and most importantly, we notice that the proposed approach provides very good word vectors even when using very small training datasets. For instance, on the German GUR350 dataset, our model (sisg) trained on 5% of the data achieves better performance (66) than the cbow baseline trained on the full dataset (62). On the other hand, on the English RW dataset, using 1% of the Wikipedia corpus we achieve a correlation coefficient of 45 which is better than the performance of cbow trained on the full dataset (43). This has a very important practical implication: well performing word vectors can be computed on datasets of a restricted size and still work well on previously unseen words. In general, when using vectorial word representations in specific applications, it is recommended to retrain the model on textual data relevant for the application. However, this kind of relevant task-specific data is often very scarce and learning from a reduced amount of training data is a great advantage.

* 其次，也是最重要的一点，我们注意到，即使使用非常小的训练数据集，所提出的方法也能提供非常好的词向量。例如，在德国GUR350数据集上，我们在5%的数据上训练的模型（sisg）比在完整数据集上训练的cbow基线（62）获得更好的性能（66）。另一方面，在英文RW数据集上，使用1%的Wikipedia语料库，我们得到了45的相关系数，这比在完整数据集上训练的cbow的性能要好（43）。这有一个非常重要的实际意义：性能良好的词向量可以在大小受限的数据集上计算，并且仍然可以在以前看不见的词上很好地工作。通常，在特定应用程序中使用矢量词表示时，建议在与应用程序相关的文本数据上重新训练模型。然而，这种相关的特定任务数据往往非常稀缺，从减少的训练数据中学习是一个很大的优势。

#### 5.5 Effect of the size of n-grams
#### 5.5 n-gram大小的影响

* The proposed model relies on the use of character n-grams to represent words as vectors. As mentioned in Sec. 3.2, we decided to use n-grams ranging from 3 to 6 characters. This choice was arbitrary, motivated by the fact that n-grams of these lengths will cover a wide range of information. They would include short suffixes (corresponding to conjugations and declensions for instance) as well as longer roots. In this experiment, we empirically check for the influence of the range of n-grams that we use on performance. We report our results in Table 4 for English and German on word similarity and analogy datasets.

* 该模型依赖于字符n-grams作为向量来表示单词。如第Sec. 3.2所述，我们决定使用3到6个字符的n-grams。这种选择是任意的，其动机是这些长度的n-grams将覆盖广泛的信息。它们包括短后缀（例如对应于共轭和衰减）和长根。在这个实验中，我们用经验的方法来检验我们使用的n-grams的范围对性能的影响。我们在表4中报告了英语和德语词汇相似度和类比数据集的结果。

* We observe that for both English and German, our arbitrary choice of 3-6 was a reasonable decision, as it provides satisfactory performance across languages. The optimal choice of length ranges depends on the considered task and language and should be tuned appropriately. However, due to the scarcity of test data, we did not implement any proper validation procedure to automatically select the best parameters. Nonetheless, taking a large range such as 3 − 6 provides a reasonable amount of subword information.


* 我们观察到，对于英语和德语，我们任意选择3-6是一个合理的决定，因为它提供了跨语言的令人满意的性能。长度范围的最佳选择取决于所考虑的任务和语言，应适当调整。但是，由于测试数据的缺乏，我们没有实现任何正确的验证过程来自动选择最佳参数。尽管如此，取3-6这样的大范围可以提供合理数量的子字信息。

* This experiment also shows that it is important to include long n-grams, as columns corresponding to n ≤ 5 and n ≤ 6 work best. This is especially true for German, as many nouns are compounds made up from several units that can only be captured by longer character sequences. On analogy tasks, we observe that using larger n-grams helps for semantic analogies. However, results are always improved by taking n ≥ 3 rather than n ≥ 2, which shows that character 2-grams are not informative for that task. As described in Sec. 3.2, before computing character n-grams, we prepend and append special positional characters to represent the beginning and end of word. Therefore, 2-grams will not be enough to properly capture suffixes that correspond to conjugations or declensions, since they are composed of a single proper character and a positional one.

* 实验还表明，由于n≤5和n≤6所对应的列工作得最好，因此包含长n-gram也很重要。德语尤其如此，因为许多名词是由几个单位组成的复合词，只能由较长的字符序列捕获。在类比任务中，我们观察到使用更大的n-grams有助于语义类比。然而，采用n≥3而不是n≥2的方法总是能改善结果，这表明字符2-grams对该任务没有信息。如第。3.2在计算字符n-grams之前，我们会预先添加和附加特殊的位置字符来表示单词的开头和结尾。因此，2-grams不足以正确地捕获对应于共轭或变位的后缀，因为它们由单个正确字符和位置字符组成。

![avatar](图片/11.png)

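<center>Table 4: Study of the effect of the n-gram sizes considered on performance. We compute word vectors using character n-grams with n in {i, ..., j} and report performance for various values of i and j. We evaluate this effect on German and English, and represent out-of-vocabulary words using subword information.</center>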

#### 5.6 Language modeling
#### 5.6 语言建模

* In this section, we describe an evaluation of the word vectors obtained with our method on a language modeling task. We evaluate our language model on five languages (CS, DE, ES, FR, RU) using the datasets introduced by Botha and Blunsom (2014). Each dataset contains roughly one million training tokens, and we use the same preprocessing and data splits as Botha and Blunsom (2014).

* 在这一部分中，我们描述了在语言建模任务中使用我们的方法获得的词向量的评估。我们使用Botha和Blunsom（2014）引入的数据集，在五种语言（CS、DE、ES、FR、RU）上评估我们的语言模型。每个数据集包含大约一百万个训练令牌，我们使用与Botha和Blunsom（2014）相同的预处理和数据分割。

* Our model is a recurrent neural network with 650 LSTM units, regularized with dropout (with probability of 0.5) and weight decay (regularization parameter of $10^{-5}$). We learn the parameters using the Adagrad algorithm with a learning rate of 0.1, clipping the gradients which have a norm larger than 1.0. We initialize the weights of the network in the range [−0.05, 0.05], and use a batch size of 20. Two baselines are considered: we compare our approach to the log-bilinear language model of Botha and Blunsom (2014) and the character-aware language model of Kim et al. (2016). We trained word vectors with character n-grams on the training set of the language modeling task and use them to initialize the lookup table of our language model. We report the test perplexity of our model without using pre-trained word vectors (LSTM), with word vectors pre-trained without subword information (sg) and with our vectors (sisg). The results are presented in Table 5.

* 我们的模型是一个具有650个LSTM单位的递归神经网络，通过dropout（概率为0.5）和权值衰减（正则化参数为10-5）进行正则化。我们使用学习率为0.1的Adagrad算法来学习参数，剪裁范数大于1.0的梯度。我们初始化网络的权重，范围是–-0.05，0.05，使用的批量大小是20。考虑了两个基线：我们将我们的方法与Botha和Blunsom（2014）的对数双线性语言模型和Kim等人的字符感知语言模型进行了比较。（2016年）。在语言建模任务的训练集上训练具有n个字符的词向量，并用它们初始化语言模型的查找表。我们报告了不使用预训练词向量（LSTM）、不使用子词信息预训练词向量（sg）和使用向量（sisg）的模型的测试困惑。结果见表5。
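The optimization settings above (Adagrad with a learning rate of 0.1 and clipping at a gradient norm of 1.0) can be illustrated with a small pure-Python sketch. It assumes the norm is the L2 norm taken over one flat parameter list, which the text does not specify, and it leaves out dropout and weight decay. The helper names `clip_by_norm` and `adagrad_step` are hypothetical.

```python
import math


def clip_by_norm(grad, max_norm=1.0):
    """Rescale `grad` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def adagrad_step(params, grad, accum, lr=0.1, eps=1e-8):
    """Apply one Adagrad update in place and return the parameters.

    `accum` holds the running sum of squared gradients per parameter;
    `eps` is an assumed small constant that avoids division by zero.
    """
    for i, g in enumerate(grad):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params


params = [0.05, -0.05]   # values inside the initialization range [-0.05, 0.05]
accum = [0.0, 0.0]
raw_grad = [3.0, 4.0]    # L2 norm 5.0, above the clipping threshold of 1.0
grad = clip_by_norm(raw_grad)
adagrad_step(params, grad, accum)
print(grad, params)
```

Clipping happens before the Adagrad accumulator is updated in this sketch, so the accumulated statistics reflect the clipped gradients; other orderings are possible.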

* We observe that initializing the lookup table of the language model with pre-trained word representations improves the test perplexity over the baseline LSTM. The most important observation is that using word representations trained with subword information outperforms the plain skipgram model. We observe that this improvement is most significant for morphologically rich Slavic languages such as Czech (8% reduction of perplexity over sg) and Russian (13% reduction). The improvement is less significant for Romance languages such as Spanish (3% reduction) or French (2% reduction). This shows the importance of subword information on the language modeling task and exhibits the usefulness of the vectors that we propose for morphologically rich languages.

* 我们观察到，使用预先训练的单词表示初始化语言模型的查找表，比基线LSTM提高了测试困惑度。最重要的观察是，使用子词信息训练的词表示优于普通的skipgram模型。我们观察到，这一改进对于形态丰富的斯拉夫语如捷克语（比sg减少8%的困惑）和俄语（减少13%）最为显著。对于西班牙语（减少了3%）或法语（减少了2%）等罗马语言来说，这一改进没有那么显著。这表明了子词信息在语言建模任务中的重要性，并展示了我们为形态丰富的语言提出的向量的有用性。
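The percentages quoted above are relative perplexity reductions. As a brief sketch of the arithmetic, assuming perplexity is the exponential of the average negative log-likelihood per token (natural log), the two quantities can be computed as follows. The numbers in the example are made up for illustration and are not taken from Table 5.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_reduction(baseline_ppl, new_ppl):
    """Fractional reduction of `new_ppl` relative to `baseline_ppl`."""
    return (baseline_ppl - new_ppl) / baseline_ppl


# Hypothetical perplexities: a baseline of 200 and an improved model at 184.
print(f"{relative_reduction(200.0, 184.0):.0%}")  # 8%
```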

![avatar](图片/12.png)

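<center>Table 5: Test perplexity on the language modeling task, for 5 different languages. We compare to two state-of-the-art approaches: CLBL refers to the work of Botha and Blunsom (2014) and CANLM refers to the work of Kim et al. (2016).</center>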

### 6 Qualitative analysis
### 6 定性分析

#### 6.1 Nearest neighbors
#### 6.1 最近的邻居

* We report sample qualitative results in Table 7. For selected words, we show nearest neighbors according to cosine similarity for vectors trained using the proposed approach and for the skipgram baseline. As expected, the nearest neighbors for complex, technical and infrequent words using our approach are better than the ones obtained using the baseline model.


* 我们在表7中报告了样品定性结果。对于选定的词，我们根据余弦相似度来显示最近邻，对于使用该方法训练的向量和skipgram基线。如预期的那样，我们的方法对于复杂的、技术性的和不经常使用的单词的最近邻比使用基线模型得到的要好。

#### 6.2 Character n-grams and morphemes
#### 6.2 n-gram和语素

* We want to qualitatively evaluate whether or not the most important n-grams in a word correspond to morphemes. To this end, we take a word vector that we construct as the sum of n-grams. As described in Sec. 3.2, each word w is represented as the sum of its n-grams: $u_w = \sum_{g \in G_w} z_g$. For each n-gram g, we propose to compute the restricted representation $u_{w/g}$ obtained by omitting g:

* 我们想定性地评估一个单词中最重要的n个词是否对应于语素。为此，我们取一个我们构造的词向量作为n个g的和。如第Sec. 3.2，每个单词w都表示为其n个g的和：$u_w =\sum_{g∈Gw} z_g$。对于每个n-gram g，我们建议计算通过省略g而获得的限制表示$u_{w/g}$：

![avatar](图片/13.png)

* We then rank n-grams by increasing value of cosine between $u_w$ and $u_{w/g}$. We show ranked n-grams for selected words in three languages in Table 6.

* 然后，我们通过将余弦值增加到$u_w$和$u_{w/g}$之间来对n-grams进行排序。我们在表6中显示了三种语言中所选单词的排名n-grams。
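The ranking procedure described above can be sketched in a few lines. The sketch assumes every n-gram of the word has a known vector $z_g$ and that the word has at least two n-grams, so the restricted vector is never zero. The function name `rank_ngrams` and the toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_ngrams(ngram_vectors):
    """Rank the n-grams of one word from most to least important.

    `ngram_vectors` maps each n-gram g of the word to its vector z_g.
    An n-gram counts as more important when omitting it moves the word
    vector further away, i.e. when cos(u_w, u_{w/g}) is smaller.
    """
    dim = len(next(iter(ngram_vectors.values())))
    u_w = [sum(z[d] for z in ngram_vectors.values()) for d in range(dim)]
    scores = []
    for g, z_g in ngram_vectors.items():
        u_without_g = [u_w[d] - z_g[d] for d in range(dim)]
        scores.append((cosine(u_w, u_without_g), g))
    scores.sort()  # increasing cosine: most important n-gram first
    return [g for _, g in scores]


# Hand-picked 2-d toy vectors, not learned ones.
toy = {
    "<un": [0.0, 2.0],
    "luc": [1.0, 0.0],
    "ky>": [1.0, 0.1],
}
print(rank_ngrams(toy))  # ['<un', 'luc', 'ky>']
```

In the toy example, `<un` is the only n-gram that contributes strongly along the second dimension, so removing it changes the direction of the word vector the most and it is ranked first.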

![avatar](图片/14.png)

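<center>Table 6: Illustration of the most important character n-grams for selected words in three languages. For each word, we show the n-grams that, when removed, result in the most different representation.</center>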

* For German, which has a lot of compound nouns, we observe that the most important n-grams correspond to valid morphemes. Good examples include Autofahrer (car driver) whose most important n-grams are Auto (car) and Fahrer (driver). We also observe the separation of compound nouns into morphemes in English, with words such as lifetime or starfish. However, for English, we also observe that n-grams can correspond to affixes in words such as kindness or unlucky. Interestingly, for French we observe the inflections of verbs with endings such as ais>, ent> or ions>.


* 对于含有大量复合名词的德语来说，我们观察到最重要的n-grams对应于有效语素。很好的例子包括Auto Fahrer（汽车驾驶员），其最重要的n-grams是Auto（汽车）和Fahrer（驾驶员）。我们还观察了英语中复合名词与生命线或海星等词的词素分离。然而，对于英语来说，我们也注意到n-grams可以对应于单词中的词缀，比如善良或不幸。有趣的是，对于法语，我们可以观察到以ais>、ent>或ions>结尾的动词的屈折变化。

#### 6.3 Word similarity for OOV words
#### 6.3 OOV词的词汇相似度

* As described in Sec. 3.2, our model is capable of building word vectors for words that do not appear in the training set. For such words, we simply average the vector representations of their n-grams. In order to assess the quality of these representations, we analyze which of the n-grams match best for OOV words by selecting a few word pairs from the English RW similarity dataset. We select pairs such that one of the two words is not in the training vocabulary and is hence only represented by its n-grams. For each pair of words, we display the cosine similarity between each pair of n-grams that appear in the words. In order to simulate a setup with a larger number of OOV words, we use models trained on 1% of the Wikipedia data as in Sec. 5.4. The results are presented in Fig. 2.

* 如第。3.2，我们的模型能够为训练集中没有出现的词构建词向量。对于这类词，我们简单地求其n-图的向量表示的平均值。为了评估这些表示的质量，我们从英语RW相似性数据集中选择几个词对，分析哪些n-grams最适合OOV词。我们选择成对词，这样两个词中的一个不在训练词汇表中，因此只由它的n个字母表示。对于每一对单词，我们显示单词中出现的每一对n-grams之间的余弦相似性。为了模拟一个有大量OOV单词的设置，我们使用了在1%的Wikipedia数据上训练的模型，如Sec。5.4条。结果如图2所示。
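The two operations described above, averaging n-gram vectors for an OOV word and comparing the n-grams of two words pairwise, can be sketched as follows. The sketch assumes n-gram vectors are stored in a plain dictionary (`ngram_table`, a hypothetical name) and that n-grams missing from it are skipped. The toy vectors are made up.

```python
import math


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def oov_vector(word, ngram_table, dim):
    """Average the vectors of the word's n-grams found in `ngram_table`."""
    known = [ngram_table[g] for g in char_ngrams(word) if g in ngram_table]
    if not known:
        return [0.0] * dim  # no known n-gram: fall back to the zero vector
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def ngram_similarity_matrix(word_a, word_b, ngram_table):
    """Cosine similarity between every known n-gram of the two words."""
    grams_a = [g for g in char_ngrams(word_a) if g in ngram_table]
    grams_b = [g for g in char_ngrams(word_b) if g in ngram_table]
    matrix = [[cosine(ngram_table[a], ngram_table[b]) for b in grams_b]
              for a in grams_a]
    return grams_a, grams_b, matrix


# Toy table with three of the n-grams of "chip"; the vectors are invented.
table = {"<ch": [1.0, 0.0], "hip": [0.0, 1.0], "ip>": [1.0, 1.0]}
print(oov_vector("chip", table, dim=2))
rows, cols, sims = ngram_similarity_matrix("chip", "chip", table)
print(rows, sims[0])
```

A matrix like `sims`, computed for an in-vocabulary word against an OOV word, is the kind of grid that a figure such as Fig. 2 visualizes, with one axis per word.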

* We observe interesting patterns, showing that subwords match correctly. Indeed, for the word chip, we clearly see that there are two groups of n-grams in microcircuit that match well. These roughly correspond to micro and circuit, and n-grams in between don’t match well. Another interesting example is the pair rarity and scarceness. Indeed, scarce roughly matches rarity while the suffix -ness matches -ity very well. Finally, the word preadolescent matches young well thanks to the -adolesc- subword. This shows that we build robust word representations where prefixes and suffixes can be ignored if the grammatical form is not found in the dictionary.


* 我们观察有趣的模式，显示子词匹配正确。实际上，对于单词chip，我们清楚地看到微电路中有两组n-grams匹配得很好。这些基本上对应于微电路和电路，而两者之间的n-grams不太匹配。另一个有趣的例子是这对稀有和稀有。事实上，稀有性与稀有性大致匹配，而后缀性与稀有性匹配得非常好。最后，由于-adolesc-子词，青春期前这个词与young匹配得很好。这表明我们构建了健壮的单词表示，如果在字典中找不到语法形式，前缀和后缀可以忽略。

### 7 Conclusion
### 7 结论

* In this paper, we investigate a simple method to learn word representations by taking into account subword information. Our approach, which incorporates character n-grams into the skipgram model, is related to an idea that was introduced by Schütze (1993). Because of its simplicity, our model trains fast and does not require any preprocessing or supervision. We show that our model outperforms baselines that do not take into account subword information, as well as methods relying on morphological analysis. We will open source the implementation of our model, in order to facilitate comparison of future work on learning subword representations.

* 在本文中，我们研究了一个简单的方法来学习单词表示法，考虑到子单词信息。我们的方法将字符n-grams合并到skipgram模型中，这与Schütze（1993）提出的一个想法有关。由于它的简单性，我们的模型训练迅速，不需要任何预处理或监督。我们的模型比不考虑子词信息的基线以及依赖于形态学分析的方法有更好的性能。我们将开放源代码实现我们的模型，以便于比较今后学习子词表示的工作。

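# Key contributions
# 创新点

* 1. Morphology is modeled with subword units: each word is represented as the sum of its character n-gram vectors.
* 2. Subword information is combined with the skip-gram model.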



