## Abstract

* We present the Word Mover’s Distance (WMD), a novel distance function between text documents. Our work is based on recent results in word embeddings that learn semantically meaningful representations for words from local cooccurrences in sentences. The WMD distance measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to “travel” to reach the embedded words of another document. We show that this distance metric can be cast as an instance of the Earth Mover’s Distance, a well studied transportation problem for which several highly efficient solvers have been developed. Our metric has no hyperparameters and is straight-forward to implement. Further, we demonstrate on eight real world document classification data sets, in comparison with seven state-of-the-art baselines, that the WMD metric leads to unprecedented low k-nearest neighbor document classification error rates.


## 1. Introduction

* Accurately representing the distance between two documents has far-reaching applications in document retrieval (Salton & Buckley, 1988), news categorization and clustering (Ontrup & Ritter, 2001; Greene & Cunningham, 2006), song identification (Brochu & Freitas, 2002), and multilingual document matching (Quadrianto et al., 2009).


* The two most common ways of representing documents are a bag of words (BOW) and term frequency-inverse document frequency (TF-IDF). However, these features are often not suitable for document distances due to their frequent near-orthogonality (Schölkopf et al., 2002; Greene & Cunningham, 2006). Another significant drawback of these representations is that they do not capture the distance between individual words. Take for example the two sentences in different documents: “Obama speaks to the media in Illinois” and “The President greets the press in Chicago”. While these sentences have no words in common, they convey nearly the same information, a fact that cannot be represented by the BOW model. In this case, the closeness of the word pairs (Obama, President), (speaks, greets), (media, press), and (Illinois, Chicago) is not factored into the BOW-based distance.

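<center>Figure 1. An illustration of the word mover's distance. All non-stop words (bold) of both documents are embedded into a word2vec space. The distance between the two documents is the minimum cumulative distance that all words in document 1 need to travel to exactly match document 2. (Best viewed in color.)</center>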

* There have been numerous methods that attempt to circumvent this problem by learning a latent low-dimensional representation of documents. Latent Semantic Indexing (LSI) (Deerwester et al., 1990) eigendecomposes the BOW feature space, and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) probabilistically groups similar words into topics and represents documents as distributions over these topics. At the same time, there are many competing variants of BOW/TF-IDF (Salton & Buckley, 1988; Robertson & Walker, 1994). While these approaches produce a more coherent document representation than BOW, they often do not improve the empirical performance of BOW on distance-based tasks (e.g., nearest-neighbor classifiers) (Petterson et al., 2010; Mikolov et al., 2013c).

* In this paper we introduce a new metric for the distance between text documents. Our approach leverages recent results by Mikolov et al. (2013b) whose celebrated word2vec model generates word embeddings of unprecedented quality and scales naturally to very large data sets (e.g., we use a freely-available model trained on approximately 100 billion words). The authors demonstrate that semantic relationships are often preserved in vector operations on word vectors, e.g., vec(Berlin) - vec(Germany) + vec(France) is close to vec(Paris). This suggests that distances between embedded word vectors are to some degree semantically meaningful. Our metric, which we call the Word Mover’s Distance (WMD), utilizes this property of word2vec embeddings. We represent text documents as a weighted point cloud of embedded words. The distance between two text documents A and B is the minimum cumulative distance that words from document A need to travel to match exactly the point cloud of document B. Figure 1 shows a schematic illustration of our new metric.

![avatar](图片2/1.png)

* The optimization problem underlying WMD reduces to a special case of the well-studied Earth Mover’s Distance (Rubner et al., 1998) transportation problem and we can leverage existing literature on fast specialized solvers (Pele & Werman, 2009). We also compare several lower bounds and show that these can be used as approximations or to prune away documents that are provably not amongst the k-nearest neighbors of a query.



* The WMD distance has several intriguing properties: 1. it is hyper-parameter free and straight-forward to understand and use; 2. it is highly interpretable, as the distance between two documents can be broken down and explained as the sparse distances between a few individual words; 3. it naturally incorporates the knowledge encoded in the word2vec space and leads to high retrieval accuracy: it outperforms all 7 state-of-the-art alternative document distances in 6 of 8 real world classification tasks.

## 2. Related Work

* Constructing a distance between documents is closely tied with learning new document representations. One of the first works to systematically study different combinations of term frequency-based weightings, normalization terms, and corpus-based statistics is Salton & Buckley (1988). Another variation is the Okapi BM25 function (Robertson & Walker, 1994) which describes a score for each (word, document) pair and is designed for ranking applications. Aslam & Frost (2003) derive an information-theoretic similarity score between two documents, based on probability of word occurrence in a document corpus. Croft & Lafferty (2003) use a language model to describe the probability of generating a word from a document, similar to LDA (Blei et al., 2003). Most similar to our method is that of Wan (2007), which first decomposes each document into a set of subtopic units via TextTiling (Hearst, 1994), and then measures the effort required to transform a subtopic set into another via the EMD (Monge, 1781; Rubner et al., 1998).

* New approaches for learning document representations include Stacked Denoising Autoencoders (SDA) (Glorot et al., 2011), and the faster mSDA (Chen et al., 2012), which learn word correlations via dropout noise in stacked neural networks. Recently, the Componential Counting Grid (Perina et al., 2013) merges LDA (Blei et al., 2003) and Counting Grid (Jojic & Perina, 2011) models, allowing ‘topics’ to be mixtures of word distributions. As well, Le & Mikolov (2014) learn a dense representation for documents using a simplified neural language model, inspired by the word2vec model (Mikolov et al., 2013a).


* The use of the EMD has been pioneered in the computer vision literature (Rubner et al., 1998; Ren et al., 2011). Several publications investigate approximations of the EMD for image retrieval applications (Grauman & Darrell, 2004; Shirdhonkar & Jacobs, 2008; Levina & Bickel, 2001). As word embeddings improve in quality, document retrieval enters an analogous setup, where each word is associated with a highly informative feature vector. To our knowledge, our work is the first to make the connection between high quality word embeddings and EMD retrieval algorithms.

* Cuturi (2013) introduces an entropy penalty to the EMD objective, which allows the resulting approximation to be solved with very efficient iterative matrix updates. Further, the vectorization enables parallel computation via GPGPUs. However, their approach assumes that the number of dimensions per document is not too high, whereas in our setting it is extremely large (all possible words). This removes the main benefit (parallelization on GPGPUs) of their approach and so we develop a new EMD approximation that appears to be very effective for our problem domain.

## 3. Word2vec Embedding

* Recently Mikolov et al. (2013a;b) introduced word2vec, a novel word-embedding procedure. Their model learns a vector representation for each word using a (shallow) neural network language model. Specifically, they propose a neural network architecture (the skip-gram model) that consists of an input layer, a projection layer, and an output layer to predict nearby words. Each word vector is trained to maximize the log probability of neighboring words in a corpus, i.e., given a sequence of words $w_1, \ldots, w_T$,

![avatar](图片2/2.png)

* where $nb(t)$ is the set of neighboring words of word $w_t$ and $p(w_j | w_t)$ is the hierarchical softmax of the associated word vectors $v_{w_j}$ and $v_{w_t}$ (see Mikolov et al. (2013a) for more details). Due to its surprisingly simple architecture and the use of the hierarchical softmax, the skip-gram model can be trained on billions of words per hour using a single conventional desktop computer. The ability to train on very large data sets allows the model to learn complex word relationships such as vec(Japan) - vec(sushi) + vec(Germany) ≈ vec(bratwurst) and vec(Einstein) - vec(scientist) + vec(Picasso) ≈ vec(painter) (Mikolov et al., 2013a;b). Learning the word embedding is entirely unsupervised and it can be computed on the text corpus of interest or be pre-computed in advance. Although we use word2vec as our preferred embedding throughout, other embeddings are also plausible (Collobert & Weston, 2008; Mnih & Hinton, 2009; Turian et al., 2010).
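The training objective described above can be sketched numerically. This is only an illustration under stated assumptions: it uses a plain (full) softmax rather than the hierarchical softmax of the paper, and the names `skipgram_objective`, `V_in`, `V_out`, and `window` are hypothetical, chosen for this example.

```python
import numpy as np


def skipgram_objective(seq, V_in, V_out, window=1):
    """Average log probability (1/T) * sum_t sum_{j in nb(t)} log p(w_j | w_t).

    Assumption: p is a full softmax over the vocabulary, not the
    hierarchical softmax used by word2vec.
    """
    total = 0.0
    for t, w_t in enumerate(seq):
        scores = V_out @ V_in[w_t]                  # one score per vocabulary word
        shifted = scores - scores.max()             # numerically stable log-softmax
        log_p = shifted - np.log(np.exp(shifted).sum())
        lo, hi = max(0, t - window), min(len(seq), t + window + 1)
        for j in range(lo, hi):                     # nb(t): positions within the window
            if j != t:
                total += log_p[seq[j]]
    return total / len(seq)


vocab_size, dim = 4, 3
V_in = np.zeros((vocab_size, dim))    # untrained (all-zero) input vectors
V_out = np.zeros((vocab_size, dim))   # untrained (all-zero) output vectors
objective = skipgram_objective([0, 1, 2], V_in, V_out)
```

With all-zero vectors every word is equally likely, so each of the four (center, neighbor) pairs contributes $-\log 4$; training would move the vectors to raise this value.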

## 4. Word Mover's Distance

* Assume we are provided with a word2vec embedding matrix $X \in \mathbb{R}^{d \times n}$ for a finite-size vocabulary of $n$ words. The $i^{th}$ column, $x_i \in \mathbb{R}^d$, represents the embedding of the $i^{th}$ word in $d$-dimensional space. We assume text documents are represented as normalized bag-of-words (nBOW) vectors, $d \in \mathbb{R}^n$. To be precise, if word $i$ appears $c_i$ times in the document, we denote $d_i = \frac{c_i}{\sum_{j=1}^{n} c_j}$. An nBOW vector $d$ is naturally very sparse as most words will not appear in any given document. (We remove stop words, which are generally category independent.)
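The nBOW definition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the tiny `STOP_WORDS` set and the eight-word `vocab` are made up for the example (the paper uses the SMART stop word list and the word2vec vocabulary).

```python
from collections import Counter

import numpy as np

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "to", "in", "a", "of", "and"}


def nbow(tokens, vocab):
    """Return the nBOW vector d with d_i = c_i / sum_j c_j."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS and t in vocab)
    d = np.zeros(len(vocab))
    for word, c in counts.items():
        d[vocab[word]] = c
    total = d.sum()
    return d / total if total > 0 else d


words = ["obama", "speaks", "media", "illinois",
         "president", "greets", "press", "chicago"]
vocab = {w: i for i, w in enumerate(words)}
d1 = nbow("obama speaks to the media in illinois".split(), vocab)
d2 = nbow("the president greets the press in chicago".split(), vocab)
overlap = float(d1 @ d2)  # shared mass between the two documents
```

Each document puts weight 0.25 on four distinct words, and `overlap` is zero because the two sentences share no non-stop words.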

* nBOW representation. We can think of the vector d as a point on the n−1 dimensional simplex of word distributions. Two documents with different unique words will lie in different regions of this simplex. However, these documents may still be semantically close. Recall the earlier example of two similar, but word-different sentences in one document: “Obama speaks to the media in Illinois” and in another: “The President greets the press in Chicago”. After stop-word removal, the two corresponding nBOW vectors d and d′ have no common non-zero dimensions and therefore have close to maximum simplex distance, although their true distance is small.


* Word travel cost. Our goal is to incorporate the semantic similarity between individual word pairs (e.g. President and Obama) into the document distance metric. One such measure of word dissimilarity is naturally provided by their Euclidean distance in the word2vec embedding space. More precisely, the distance between word $i$ and word $j$ becomes $c(i, j) = \|x_i - x_j\|_2$. To avoid confusion between word and document distances, we will refer to $c(i, j)$ as the cost associated with “traveling” from one word to another.
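As a small sketch of the travel cost, the matrix of all pairwise costs $c(i, j)$ can be computed from the columns of $X$ with NumPy broadcasting. The function name `travel_costs` and the toy embedding values are assumptions made for illustration; they are not word2vec output.

```python
import numpy as np


def travel_costs(X):
    """Pairwise costs c(i, j) = ||x_i - x_j||_2 for the columns of X (shape d x n)."""
    diff = X[:, :, None] - X[:, None, :]       # shape (d, n, n)
    return np.sqrt((diff ** 2).sum(axis=0))    # shape (n, n)


# Toy 2-dimensional embedding of three words (made-up numbers).
X = np.array([[0.0, 3.0, 0.0],
              [0.0, 4.0, 1.0]])
C = travel_costs(X)
```

For a large vocabulary one would compute costs only between the words that actually occur in the two documents being compared, since the intermediate array here has size $d \times n \times n$.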

* Document distance. The “travel cost” between two words is a natural building block to create a distance between two documents. Let $d$ and $d'$ be the nBOW representations of two text documents in the $(n-1)$-simplex. First, we allow each word $i$ in $d$ to be transformed into any word in $d'$ in total or in parts. Let $T \in \mathbb{R}^{n \times n}$ be a (sparse) flow matrix where $T_{ij} \geq 0$ denotes how much of word $i$ in $d$ travels to word $j$ in $d'$. To transform $d$ entirely into $d'$ we ensure that the entire outgoing flow from word $i$ equals $d_i$, i.e. $\sum_j T_{ij} = d_i$. Further, the amount of incoming flow to word $j$ must match $d'_j$, i.e. $\sum_i T_{ij} = d'_j$. Finally, we can define the distance between the two documents as the minimum (weighted) cumulative cost required to move all words from $d$ to $d'$, i.e. $\sum_{i,j} T_{ij} c(i, j)$.

* Transportation problem. Formally, the minimum cumulative cost of moving d to d′ given the constraints is provided by the solution to the following linear program,


![avatar](图片2/3.png)

* The above optimization is a special case of the earth mover’s distance metric (EMD) (Monge, 1781; Rubner et al., 1998; Nemhauser & Wolsey, 1988), a well studied transportation problem for which specialized solvers have been developed (Ling & Okada, 2007; Pele & Werman, 2009). To highlight this connection we refer to the resulting metric as the word mover’s distance (WMD). As the cost c(i, j) is a metric, it can readily be shown that the WMD is also a metric (Rubner et al., 1998).

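The transportation problem can be sketched with a general-purpose LP solver. This is a hedged illustration, not the specialized EMD solver cited in the text: it uses `scipy.optimize.linprog`, and the function name `wmd` and the one-dimensional toy "embedding" are assumptions for this example.

```python
import numpy as np
from scipy.optimize import linprog


def wmd(d, d_prime, C):
    """Minimize sum_ij T_ij c(i, j) subject to
    sum_j T_ij = d_i, sum_i T_ij = d'_j, T >= 0."""
    n = len(d)
    # T is flattened row-major: variable i * n + j stands for T_ij.
    out_flow = np.kron(np.eye(n), np.ones((1, n)))  # row i sums T_ij over j
    in_flow = np.kron(np.ones((1, n)), np.eye(n))   # row j sums T_ij over i
    res = linprog(C.ravel(),
                  A_eq=np.vstack([out_flow, in_flow]),
                  b_eq=np.concatenate([d, d_prime]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, n)


# Four "words" placed on a line; costs are absolute differences.
positions = np.array([0.0, 1.0, 10.0, 11.0])
C = np.abs(positions[:, None] - positions[None, :])
d = np.array([0.5, 0.0, 0.5, 0.0])
d_prime = np.array([0.0, 0.5, 0.0, 0.5])
distance, T = wmd(d, d_prime, C)
```

In this toy case each word in `d` has a neighbor at cost 1 in `d_prime`, so the optimal flow sends each half unit of mass over a distance of 1. A dense $n \times n$ formulation like this is only practical for small vocabularies; in practice one would restrict the problem to the words present in the two documents.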

* Visualization. Figure 2 illustrates the WMD metric on two sentences $D_1$ and $D_2$ which we would like to compare to the query sentence $D_0$. First, stop-words are removed, leaving President, greets, press, Chicago in $D_0$, each with $d_i = 0.25$. The arrows from each word $i$ in sentences $D_1$ and $D_2$ to word $j$ in $D_0$ are labeled with their contribution to the distance $T_{ij} c(i, j)$. We note that the WMD agrees with our intuition, and “moves” words to semantically similar words. Transforming Illinois into Chicago is much cheaper than transforming Japan into Chicago. This is because the word2vec embedding places the vector vec(Illinois) closer to vec(Chicago) than vec(Japan). Consequently, the distance from $D_0$ to $D_1$ (1.07) is significantly smaller than to $D_2$ (1.63). Importantly however, both sentences $D_1$ and $D_2$ have the same bag-of-words/TF-IDF distance from $D_0$, as neither shares any words in common with $D_0$. An additional example $D_3$ highlights the flow when the number of words does not match. $D_3$ has term weights $d_j = 0.33$ and excess flow is sent to other similar words. This increases the distance, although the effect might be artificially magnified due to the short document lengths, as longer documents may contain several similar words.

![avatar](图片2/15.png)

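<center>Figure 2. (Top:) The components of the WMD metric between a query D0 and two sentences D1, D2. The arrows represent flow between two words and are labeled with their distance contribution. (Bottom:) The flow between two sentences D3 and D0 with different numbers of words. This mismatch causes the WMD to move words to multiple similar words.</center>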

### 4.1. Fast Distance Computation

* The best average time complexity of solving the WMD optimization problem scales as $O(p^3 \log p)$, where $p$ denotes the number of unique words in the documents (Pele & Werman, 2009). For datasets with many unique words (i.e., high-dimensional) and/or a large number of documents, solving the WMD optimal transport problem can become prohibitive. We can however introduce several cheap lower bounds of the WMD transportation problem that allow us to prune away the majority of the documents without ever computing the exact WMD distance.

* Word centroid distance. Following the work of Rubner et al. (1998) it is straight-forward to show (via the triangle inequality) that the ‘centroid’ distance $\|Xd - Xd'\|_2$ must lower bound the WMD between documents $d$, $d'$,

![avatar](图片2/4.png)

* We refer to this distance as the Word Centroid Distance (WCD) as each document is represented by its weighted average word vector. It is very fast to compute via a few matrix operations and scales as $O(dp)$. For nearest-neighbor applications we can use this centroid distance to inform our nearest neighbor search about promising candidates, which allows us to speed up the exact WMD search significantly. We can also use WCD to limit our k-nearest neighbor search to a small subset of the most promising candidates, resulting in an even faster approximate solution.
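A minimal sketch of the WCD, assuming the embedding matrix `X` stores one word per column as defined earlier; the function name `wcd` and the one-dimensional toy values are hypothetical.

```python
import numpy as np


def wcd(X, d, d_prime):
    """Word Centroid Distance ||Xd - Xd'||_2 between two nBOW vectors."""
    return float(np.linalg.norm(X @ d - X @ d_prime))


# One-dimensional toy embedding: four "words" on a line.
X = np.array([[0.0, 1.0, 10.0, 11.0]])
d = np.array([0.5, 0.0, 0.5, 0.0])
d_prime = np.array([0.0, 0.5, 0.0, 0.5])
bound = wcd(X, d, d_prime)
```

Here the two centroids sit at 5 and 6, so the bound is 1. Only two matrix-vector products and one norm are needed, which is why sorting a whole corpus by WCD is cheap.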

* Relaxed word moving distance. Although the WCD is fast to compute, it is not very tight (see section 5). We can obtain much tighter bounds by relaxing the WMD optimization problem and removing one of the two constraints respectively (removing both constraints results in the trivial lower bound T = 0.) If just the second constraint is removed, the optimization becomes,


![avatar](图片2/5.png)

* This relaxed problem must yield a lower-bound to the WMD distance, which is evident from the fact that every WMD solution (satisfying both constraints) must remain a feasible solution if one constraint is removed.
* The optimal solution is for each word in d to move all its probability mass to the most similar word in d′. Precisely, an optimal T∗ matrix is defined as


![avatar](图片2/6.png)

* The optimality of this solution is straight-forward to show. Let $T$ be any feasible matrix for the relaxed problem. The contribution to the objective value for any word $i$, with closest word $j^{*} = \arg\min_j c(i, j)$, cannot be smaller:

![avatar](图片2/7.png)

* Therefore, $T^{*}$ must yield a minimum objective value. Computing this solution requires only the identification of $j^{*} = \arg\min_j c(i, j)$, which is a nearest neighbor search in the Euclidean word2vec space. For each word vector $x_i$ in document $D$ we need to find the most similar word vector $x_j$ in document $D'$. The second setting, when the first constraint is removed, is almost identical except that the nearest neighbor search is reversed. Both lower bounds ultimately rely on pairwise distance computations between word vectors. These computations can be combined and reused to obtain both bounds jointly at little additional overhead. Let the two relaxed solutions be $l_1(d, d')$ and $l_2(d, d')$ respectively. We can obtain an even tighter bound by taking the maximum of the two, $l_r(d, d') = \max(l_1(d, d'), l_2(d, d'))$, which we refer to as the Relaxed WMD (RWMD). This bound is significantly tighter than WCD. The nearest neighbor search has a time complexity of $O(p^2)$, and it can be sped up further by leveraging out-of-the-box tools for fast (approximate or exact) nearest neighbor retrieval (Garcia et al., 2008; Yianilos, 1993; Andoni & Indyk, 2006).
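The two relaxed bounds can be sketched from a single cost sub-matrix. This is an illustration under assumptions: the function name `rwmd` is hypothetical, and the nearest-neighbor search is restricted to words that actually occur in the other document, matching the description "the most similar word in $d'$".

```python
import numpy as np


def rwmd(d, d_prime, C):
    """Relaxed WMD: max of the two one-constraint relaxations l1 and l2."""
    src = d > 0          # words present in the first document
    dst = d_prime > 0    # words present in the second document
    sub = C[np.ix_(src, dst)]                    # costs between present words only
    l1 = float(d[src] @ sub.min(axis=1))         # each word in d moves to its nearest word in d'
    l2 = float(d_prime[dst] @ sub.min(axis=0))   # reversed search: each word in d' to nearest in d
    return max(l1, l2)


positions = np.array([0.0, 1.0, 10.0, 11.0])
C = np.abs(positions[:, None] - positions[None, :])
d = np.array([1.0, 0.0, 0.0, 0.0])          # all mass on word 0
d_prime = np.array([0.0, 0.5, 0.0, 0.5])    # mass split between words 1 and 3
bound = rwmd(d, d_prime, C)
```

In this toy case `l1` is 1 (word 0 moves entirely to word 1) while `l2` is 6 (words 1 and 3 must both be served by word 0), so the maximum recovers the exact transport cost. Both bounds reuse the same `sub` matrix, which is the sharing of pairwise distances mentioned above.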

* Prefetch and prune. We can use the two lower bounds to drastically reduce the number of WMD distance computations we need to make in order to find the k nearest neighbors of a query document. We first sort all documents in increasing order of their (extremely cheap) WCD distance to the query document and compute the exact WMD distance to the first k of these documents. Subsequently, we traverse the remaining documents. For each, we first check if the RWMD lower bound exceeds the distance of the current kth closest document; if so, we can prune it. If not, we compute the WMD distance and update the k nearest neighbors if necessary. As the RWMD approximation is very tight, it allows us to prune up to 95% of all documents on some data sets. If the exact k nearest neighbors are not required, an additional speedup can be obtained if this traversal is limited to m < n documents. We refer to this algorithm as prefetch and prune. If m = k, this is equivalent to returning the k nearest neighbors of the WCD distance. If m = n it is exact, as only provably non-neighbors are pruned.
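The prefetch and prune procedure can be sketched as follows. This is a hedged outline rather than the authors' implementation: the three distance functions are passed in as callables, and the demo replaces documents with plain numbers and uses made-up stand-ins (`0.5 * |q - x|` and `0.9 * |q - x|`) for the WCD and RWMD lower bounds.

```python
import heapq


def prefetch_and_prune(query, docs, k, wcd, rwmd, wmd, m=None):
    """Return the k nearest (distance, index) pairs among the first m
    documents in WCD order; exact when m equals len(docs)."""
    order = sorted(range(len(docs)), key=lambda i: wcd(query, docs[i]))
    m = len(order) if m is None else m
    # Max-heap (negated distances) holding the current k nearest neighbors.
    heap = [(-wmd(query, docs[i]), i) for i in order[:k]]
    heapq.heapify(heap)
    for i in order[k:m]:
        kth = -heap[0][0]                      # distance of current kth neighbor
        if rwmd(query, docs[i]) > kth:
            continue                           # provably not among the k nearest
        dist = wmd(query, docs[i])
        if dist < kth:
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-neg, i) for neg, i in heap)


# Toy demo: documents are numbers, the exact distance is |q - x|.
calls = []


def exact(q, x):
    calls.append(x)                            # record every exact computation
    return abs(q - x)


docs = [5.0, 1.0, 9.0, 2.0, 7.0]
neighbors = prefetch_and_prune(
    0.0, docs, k=2,
    wcd=lambda q, x: 0.5 * abs(q - x),         # loose stand-in lower bound
    rwmd=lambda q, x: 0.9 * abs(q - x),        # tight stand-in lower bound
    wmd=exact)
```

In the demo only the two prefetched documents need an exact distance; the remaining three are pruned by the tighter bound, which mirrors the pruning behavior described above.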

## 5. Results

* We evaluate the word mover’s distance in the context of kNN classification on eight benchmark document categorization tasks. We first describe each dataset and a set of classic and state-of-the-art document representations and distances. We then compare the nearest-neighbor performance of WMD and the competing methods on these datasets. Finally, we examine how the fast lower bound distances can speed up nearest neighbor computation by prefetching and pruning neighbors. Matlab code for the WMD metric is available at http://matthewkusner.com

### 5.1. Dataset Description and Setup

* We evaluate all approaches on 8 supervised document datasets: BBCSPORT: BBC sports articles between 2004-2005; TWITTER: a set of tweets labeled with sentiments ‘positive’, ‘negative’, or ‘neutral’ (Sanders, 2011) (the set is reduced due to the unavailability of some tweets); RECIPE: a set of recipe procedure descriptions labeled by their region of origin; OHSUMED: a collection of medical abstracts categorized by different cardiovascular disease groups (for computational efficiency we subsample the dataset, using the first 10 classes); CLASSIC: sets of sentences from academic papers, labeled by publisher name; REUTERS: a classic news dataset labeled by news topics (we use the 8-class version with train/test split as described in Cardoso-Cachopo (2007)); AMAZON: a set of Amazon reviews which are labeled by product category in {books, dvd, electronics, kitchen} (as opposed to by sentiment); and 20NEWS: news articles classified into 20 different categories (we use the “bydate” train/test split from Cardoso-Cachopo (2007)). We preprocess all datasets by removing all words in the SMART stop word list (Salton & Buckley, 1971). For 20NEWS, we additionally remove all words that appear less than 5 times across all documents. Finally, to speed up the computation of WMD (and its lower bounds) we limit all 20NEWS documents to the most common 500 words (in each document) for WMD-based methods.

![avatar](图片2/8.png)

* We split each dataset into training and testing subsets (if not already done so). Table 1 shows relevant statistics for each of these training datasets including the number of inputs n, the bag-of-words dimensionality, the average number of unique words per document, and the number of classes |Y|. The word embedding used in our WMD implementation is the freely-available word2vec word embedding which has an embedding for 3 million words/phrases (from Google News), trained using the approach in Mikolov et al. (2013b). Words that are not present in the pre-computed word2vec model are dropped for the WMD metric (and its lower bounds), but kept for all baselines (thus giving the baselines a slight competitive advantage).



* We compare against 7 document representation baselines:
* BOW bag-of-words: a vector of word counts of dimensionality $d$, the size of the dictionary.
* TF-IDF term frequency-inverse document frequency (Salton & Buckley, 1988): the bag-of-words representation divided by each word’s document frequency.
* BM25 Okapi (Robertson et al., 1995): a ranking function that extends TF-IDF for each word $w$ in a document $D$:

![avatar](图片2/9.png)

* where $IDF(w)$ is the inverse document frequency of word $w$, $TF(w, D)$ is its term frequency in document $D$, $|D|$ is the number of words in the document, $D_{avg}$ is the average size of a document, and $k_1$ and $b$ are free parameters.
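As a sketch of how such a weight could be computed, the snippet below uses the standard Okapi BM25 form, $IDF(w) \cdot \frac{TF(w, D)(k_1 + 1)}{TF(w, D) + k_1 (1 - b + b\,|D| / D_{avg})}$. This is an assumption: the equation in these notes is only available as an image, and the defaults `k1=1.2` and `b=0.75` are common conventions rather than values taken from the paper. The function name `bm25_weight` is hypothetical.

```python
def bm25_weight(tf, idf, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard Okapi BM25 weight of one word in one document.

    tf: term frequency TF(w, D); idf: IDF(w);
    doc_len: |D|; avg_len: average document length D_avg.
    """
    length_norm = 1.0 - b + b * doc_len / avg_len   # penalizes long documents
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# A word that occurs once in a document of exactly average length.
weight = bm25_weight(tf=1, idf=2.0, doc_len=100, avg_len=100)
# The same word repeated 50 times: the weight saturates below idf * (k1 + 1).
saturated = bm25_weight(tf=50, idf=2.0, doc_len=100, avg_len=100)
```

The saturation in term frequency and the length normalization are the two ways in which this function departs from plain TF-IDF.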

![avatar](图片2/10.png)

<center>Figure 3. The kNN test error results on 8 document classification datasets, compared to canonical and state-of-the-art baseline methods.</center>

![avatar](图片2/11.png)

<center>Figure 4. The kNN test errors of various document metrics, averaged over all 8 datasets, relative to kNN with BOW.
</center>

* LSI Latent Semantic Indexing (Deerwester et al., 1990): uses singular value decomposition on the BOW representation to arrive at a semantic feature space.
* LDA Latent Dirichlet Allocation (Blei et al., 2003): a celebrated generative model for text documents that learns representations for documents as distributions over word topics. We use the Matlab Topic Modeling Toolbox (Steyvers & Griffiths, 2007), allow 100 iterations for burn-in, and run the chain for 1000 iterations afterwards. Importantly, for each dataset we train LDA transductively, i.e. we train on the union of the training and holdout sets.
* mSDA Marginalized Stacked Denoising Autoencoder (Chen et al., 2012): a representation learned from stacked denoising autoencoders (SDAs), marginalized for fast training. In general, SDAs have been shown to have state-of-the-art performance for document sentiment analysis tasks (Glorot et al., 2011). For high-dimensional datasets (i.e., all except BBCSPORT, TWITTER, and RECIPE) we use either the high-dimensional version of mSDA (Chen et al., 2012) or limit the features to the top 20% of the words (ordered by occurrence), whichever performs better.
* CCG Componential Counting Grid (Perina et al.,2013): a generative model that directly generalizes the Counting Grid (Jojic & Perina, 2011), which models documents as a mixture of word distributions, and LDA (Blei et al., 2003). We use the grid location admixture probability of each document as the new representation.
* For each baseline we use the Euclidean distance for kNN classification. All free hyperparameters were set with Bayesian optimization for all algorithms (Snoek et al., 2012). We use the open source MATLAB implementation “bayesopt.m” from Gardner et al. (2014).

* LSI潜语义索引(Deerwester et al.， 1990):对BOW表示进行奇异值分解，得到语义特征空间。
* LDA 潜在狄利克雷分配(Blei et al., 2003)：一种著名的文本文档生成模型，它将文档表示学习为单词主题上的分布。我们使用Matlab Topic Modeling Toolbox (Steyvers & Griffiths, 2007)，先进行100次迭代的预热(burn-in)，之后再运行1000次迭代。重要的是，对于每个数据集，我们都以直推(transductive)方式训练LDA，也就是说，在训练集和留出集的并集上进行训练。
* mSDA 边缘化堆叠去噪自动编码器(Chen et al., 2012)：从堆叠去噪自动编码器(SDA)中学习到的一种表示，通过边缘化实现快速训练。一般而言，SDA已被证明在文档情感分析任务上具有最先进的性能(Glorot et al., 2011)。对于高维数据集(即除BBCSPORT、TWITTER和RECIPE之外的所有数据集)，我们要么使用mSDA的高维版本(Chen et al., 2012)，要么将特征限制在前20%的单词(按出现次数排序)，以性能更好的为准。
* CCG 组分计数网格(Perina et al., 2013)：一种生成模型，直接推广了计数网格(Jojic & Perina, 2011，它将文档建模为单词分布的混合)和LDA (Blei et al., 2003)。我们使用每个文档的网格位置混合概率作为新的表示。
* 对于每个基线，我们使用欧几里得距离进行kNN分类。所有算法的自由超参数均通过贝叶斯优化设置(Snoek et al., 2012)。我们使用Gardner et al. (2014)的开源MATLAB实现“bayesopt.m”。

![avatar](图片2/12.png)

<center>表2。不同文本嵌入的测试错误百分比和标准偏差。NIPS, AMZ, News是在不同数据集上训练的word2vec (w2v)模型，而HLBL和Collo也是通过其他嵌入算法获得的。</center>

### 5.2. 文档分类

* Document similarities are particularly useful for classification given a supervised training dataset, via the kNN decision rule (Cover & Hart, 1967). Unlike other classification techniques, kNN provides an interpretable certificate (i.e., in the form of nearest neighbors) that allows practitioners to verify the prediction result. Moreover, such similarities can be used for ranking and recommendation. To assess the performance of our metric on classification, we compare the kNN results of the WMD with each of the aforementioned document representations/distances. For all algorithms we split the training set into an 80/20 train/validation split for hyperparameter tuning. It is worth emphasizing that BOW, TF-IDF, BM25 and WMD have no hyperparameters and thus we only optimize the neighborhood size (k ∈ {1, ..., 19}) of kNN.

* 给定有监督的训练数据集，文档相似性可以通过kNN决策规则(Cover & Hart, 1967)用于分类，这一点特别有用。与其他分类技术不同，kNN提供了一种可解释的凭证(即最近邻本身)，使从业者能够验证预测结果。此外，这种相似性还可以用于排序和推荐。为了评估我们的度量在分类上的性能，我们将WMD的kNN结果与前面提到的每种文档表示/距离进行比较。对于所有算法，我们将训练集按80/20划分为训练/验证集来进行超参数调优。值得强调的是，BOW、TF-IDF、BM25和WMD没有超参数，因此我们只优化kNN的邻域大小(k ∈ {1, …, 19})。
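
* The tuning protocol described above can be sketched as follows, assuming a precomputed pairwise distance matrix (from WMD or any baseline) over an already shuffled training set. The helper names `knn_predict` and `select_k`, and the choice to hold out the last 20% of the rows, are hypothetical and not taken from the paper.

```python
from collections import Counter


def knn_predict(dist_row, train_labels, k):
    """Majority vote over the k training items with the smallest distance."""
    order = sorted(range(len(dist_row)), key=lambda i: dist_row[i])
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def select_k(dist, labels, val_fraction=0.2, k_values=range(1, 20)):
    """Pick the neighborhood size k with the lowest validation error.

    dist[i][j] is the distance between training documents i and j.
    Assumption: rows are already shuffled, so the last 20% serve as validation.
    Returns (best_k, validation_error); ties go to the smallest k.
    """
    n = len(labels)
    n_val = max(1, int(round(n * val_fraction)))
    train_idx = list(range(n - n_val))
    val_idx = list(range(n - n_val, n))
    train_labels = [labels[t] for t in train_idx]
    best = None
    for k in k_values:
        errors = 0
        for v in val_idx:
            row = [dist[v][t] for t in train_idx]
            errors += knn_predict(row, train_labels, k) != labels[v]
        err = errors / len(val_idx)
        if best is None or err < best[1]:
            best = (k, err)
    return best
```

* Because the distance matrix is computed once and reused for every candidate k, trying all values in {1, ..., 19} adds little cost on top of the distance computations themselves.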

* Figure 3 shows the kNN test error of the 8 aforementioned algorithms on the 8 document classification datasets. For datasets without predefined train/test splits (BBCSPORT, TWITTER, RECIPE, CLASSIC, AMAZON) we averaged over 5 train/test splits and we report means and standard errors. We order the methods by their average performance. Perhaps surprisingly, LSI and LDA outperform the more recent approaches CCG and mSDA. For LDA this is likely because it is trained transductively. One explanation for why LSI performs so well may be the power of Bayesian optimization to tune the single LSI hyperparameter: the number of basis vectors to use in the representation. Fine-tuning the number of latent vectors may allow LSI to create a very accurate representation. On all datasets except two (BBCSPORT, OHSUMED), WMD achieves the lowest test error. Notably, WMD achieves almost a 10% reduction in (relative) error over the second best method on TWITTER (LSI). It reaches error levels as low as 2.8% on CLASSIC and 3.5% on REUTERS, outperforming even transductive LDA, which has direct access to the features of the test set. One possible explanation for the WMD performance on OHSUMED is that many of these documents contain technical medical terms which may not have a word embedding in our model. These words must be discarded, possibly harming the accuracy of the metric.

* 图3显示了上述8种算法在8个文档分类数据集上的kNN测试误差。对于没有预定义训练/测试划分的数据集(BBCSPORT, TWITTER, RECIPE, CLASSIC, AMAZON)，我们在5次训练/测试划分上取平均，并报告均值和标准误差。我们根据平均性能对方法进行排序。也许令人惊讶的是，LSI和LDA的表现超过了较新的方法CCG和mSDA。对于LDA来说，这可能是因为它是以直推方式训练的。LSI表现如此优异的一个解释可能是贝叶斯优化能够很好地调节LSI唯一的超参数：表示中使用的基向量数量。微调潜在向量的数量可能使LSI得到非常精确的表示。在除两个数据集(BBCSPORT, OHSUMED)之外的所有数据集上，WMD的测试误差最低。值得注意的是，在TWITTER上，WMD相对于第二好的方法(LSI)将(相对)误差降低了近10%。它在CLASSIC上的误差低至2.8%，在REUTERS上低至3.5%，甚至优于可以直接访问测试集特征的直推式LDA。WMD在OHSUMED上表现欠佳的一个可能解释是，其中许多文档包含专业医学术语，而这些术语在我们的模型中可能没有词嵌入。这些词必须被丢弃，这可能会损害度量的准确性。
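
* To make the point about discarded words concrete, here is a small sketch that builds normalized bag-of-words weights while dropping every token without an embedding, and reports how much of the document was lost. The function name `nbow_with_oov` and the example vocabulary are hypothetical.

```python
from collections import Counter


def nbow_with_oov(tokens, vocab):
    """Return (weights, oov_fraction) for one tokenized document.

    weights maps each in-vocabulary word to its normalized frequency
    (the weights sum to 1); oov_fraction is the share of tokens that
    had no embedding and were therefore discarded.
    """
    kept = [t for t in tokens if t in vocab]
    if not kept:
        return {}, 1.0 if tokens else 0.0
    counts = Counter(kept)
    total = len(kept)
    weights = {w: c / total for w, c in counts.items()}
    return weights, (len(tokens) - total) / len(tokens)
```

* A high discarded fraction on a dataset would be one way to check whether missing embeddings are a plausible cause of weaker results there.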

* Figure 4 shows the average improvement of each method, relative to BOW, across all datasets. On average, WMD results in only 0.42 of the BOW test-error and outperforms all other metrics that we compared against.

* 图4显示了每种方法相对于BOW在所有数据集上的平均改进。平均而言，WMD的测试误差仅为BOW的0.42倍，并且优于我们所比较的所有其他度量。
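
* The relative measure can be sketched as below, under the assumption that it is the mean of the per-dataset error ratios (the text does not say whether ratios are averaged or averages are divided). The function name is hypothetical, and any numbers passed in are the caller's own.

```python
def mean_relative_error(method_errors, bow_errors):
    """Average, over datasets, of (method test error / BOW test error).

    Both arguments map dataset name -> test error on the same scale.
    Assumption: ratios are computed per dataset and then averaged.
    """
    ratios = [method_errors[name] / bow_errors[name] for name in bow_errors]
    return sum(ratios) / len(ratios)
```

* A value below 1 means the method makes fewer errors than BOW on average; 0.42 would correspond to a little under half of the BOW error.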

![avatar](图片2/13.png)

### 5.3. 词嵌入

* As our technique is naturally dependent on a word embedding, we examine how different word embeddings affect the quality of k-nearest neighbor classification with the WMD. Apart from the aforementioned freely-available Google News word2vec model, we trained two other word2vec models, one on a corpus of papers (NIPS) and one on a product review corpus (AMZ). Specifically, we extracted text from all NIPS conference papers within the years 2004-2013 and trained a Skip-gram model on the dataset as per Mikolov et al. (2013b). We also used the 340,000 Amazon review dataset of Blitzer et al. (2006) (a superset of the AMAZON classification dataset used above) to train the review-based word2vec model. Prior to training we removed stop words for both models, resulting in 36,425,259 words for the NIPS dataset (50-dimensional) and 90,005,609 words for the Reviews dataset (100-dimensional) (compared to the 100 billion word dataset of Google News (NEWS)).

* 由于我们的技术自然依赖于词嵌入，我们考察了不同的词嵌入如何影响使用WMD的k近邻分类质量。除了上述可免费获取的Google News word2vec模型外，我们还在一个论文语料库(NIPS)和一个产品评论语料库(AMZ)上训练了另外两个word2vec模型。具体来说，我们从2004-2013年的所有NIPS会议论文中提取文本，并按照Mikolov et al. (2013b)的方法在该数据集上训练了Skip-gram模型。我们还使用了Blitzer et al. (2006)的包含34万条评论的Amazon评论数据集(上文使用的AMAZON分类数据集的超集)来训练基于评论的word2vec模型。在训练之前，我们为两个模型都去除了停用词，最终NIPS数据集(50维)有36,425,259个词，评论数据集(100维)有90,005,609个词(相比之下，Google News (NEWS)数据集有1000亿个词)。

* Additionally, we experimented with the pre-trained embeddings of the hierarchical log-bilinear model (HLBL) (Mnih & Hinton, 2009) and the model of Collobert & Weston (2008) (CW). The HLBL model contains 246,122 unique 50-dimensional word embeddings and the Collobert model has 268,810 unique word embeddings (also 50-dimensional). Table 2 shows classification results on all data sets except 20NEWS (which we dropped due to running time constraints). On the five larger data sets, the 3 million word Google NEWS model outperforms the smaller models. This result is in line with those of Mikolov et al. (2013a), that in general more data (as opposed to simply relevant data) creates better embeddings. Additionally, the three word2vec (w2v) models outperform the HLBL and Collobert models on all datasets. The classification error deteriorates when the underlying model is trained on very different vocabulary (e.g. NIPS papers vs cooking recipes), although the performance of the Google NEWS corpus is surprisingly competitive throughout.

* 此外，我们还实验了层次对数双线性模型(HLBL) (Mnih & Hinton, 2009)和Collobert & Weston (2008) (CW)模型的预训练嵌入。HLBL模型包含246,122个不同的50维词嵌入，Collobert模型包含268,810个不同的词嵌入(同样是50维)。表2显示了除20NEWS(由于运行时间限制而省略)以外所有数据集上的分类结果。在五个较大的数据集上，包含300万个词的Google NEWS模型优于较小的模型。这一结果与Mikolov et al. (2013a)的结论一致，即一般而言，更多的数据(而不仅仅是相关的数据)能产生更好的嵌入。此外，三个word2vec (w2v)模型在所有数据集上都优于HLBL和Collobert模型。当底层模型在差异很大的词汇上训练时(例如NIPS论文与烹饪食谱)，分类误差会变差，不过Google NEWS语料库的表现始终具有出人意料的竞争力。

### 5.4. 下界和修剪

* Although WMD yields by far the most accurate classification results, it is fair to say that it is also the slowest metric to compute. We can therefore use the lower bounds from section 4 to speed up the distance computations. Figure 6 shows the WMD distance of all training inputs to two randomly chosen test queries from TWITTER and AMAZON in increasing order. The graph also depicts the WCD and RWMD lower bounds. The RWMD is typically very close to the exact WMD distance, whereas the cheaper WCD approximation is rather loose. The tightness of RWMD makes it valuable to prune documents that are provably not amongst the k nearest neighbors. Although the WCD is too loose for pruning, its distances increase with the exact WMD distance, which makes it a useful heuristic to identify promising nearest neighbor candidates.

* 尽管WMD得到的分类结果是迄今为止最精确的，但公平地说，它也是计算最慢的度量。因此，我们可以使用第4节中的下界来加速距离计算。图6按递增顺序显示了所有训练输入到两个随机选择的测试查询(分别来自TWITTER和AMAZON)的WMD距离。该图还绘出了WCD和RWMD下界。RWMD通常非常接近精确的WMD距离，而代价更低的WCD近似则相当宽松。RWMD的紧致性使其可以用来剪除那些可证明不在k个最近邻之中的文档。虽然WCD过于宽松而无法用于剪枝，但它的距离随精确WMD距离的增加而增加，这使它成为识别有希望的最近邻候选的有用启发式方法。

![avatar](图片2/14.png)

* The tightness of each lower bound can be seen in the left image in Figure 5 (averaged across all test points). RWMDC1 and RWMDC2 correspond to WMD with only constraint #1 and only constraint #2 respectively, which result in comparable tightness. WCD is by far the loosest and RWMD the tightest bound. Interestingly, this tightness does not directly translate into retrieval accuracy. The right image shows the average kNN errors (relative to the BOW kNN error) if the lower bounds are used directly for nearest neighbor retrieval. The two leftmost columns represent the two individual lower bounds of the RWMD approximation. Both perform poorly (worse than WCD); however, their maximum (RWMD) is surprisingly accurate and yields kNN errors that are only slightly higher than those of the exact WMD. In fact, we would like to emphasize that the average kNN error with RWMD (0.45 relative to BOW) still outperforms all other baselines (see Figure 4).

* 每个下界的紧致程度可以在图5的左图中看到(在所有测试点上取平均)。RWMDC1和RWMDC2分别对应于只保留约束#1和只保留约束#2的WMD，二者的紧致程度相当。WCD是迄今最宽松的下界，RWMD是最紧的下界。有趣的是，这种紧致性并不能直接转化为检索准确性。右图显示了直接使用下界进行最近邻检索时的平均kNN误差(相对于BOW的kNN误差)。最左边的两列表示RWMD近似的两个单独下界。两者的表现都很差(比WCD差)，但它们的最大值(RWMD)却出人意料地准确，其kNN误差仅比精确WMD略高。事实上，我们想强调的是，RWMD的平均kNN误差(相对于BOW为0.45)仍然优于所有其他基线(参见图4)。
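
* The two lower bounds discussed here can be sketched in a few lines, assuming each document is a dict of normalized word weights and `emb` maps words to coordinate lists. The names `wcd`, `one_sided_relaxation` and `rwmd` are hypothetical, and the definitions follow the usual description of the bounds (distance between weighted centroids; each word moved entirely to its nearest word in the other document) rather than the authors' MATLAB code.

```python
import math


def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def wcd(d1, d2, emb):
    """Word Centroid Distance: distance between weighted mean embeddings."""
    dim = len(next(iter(emb.values())))

    def centroid(doc):
        return [sum(wt * emb[w][i] for w, wt in doc.items()) for i in range(dim)]

    return euclid(centroid(d1), centroid(d2))


def one_sided_relaxation(d_from, d_to, emb):
    """WMD with one constraint dropped: every word in d_from sends all of
    its weight to its closest word in d_to."""
    return sum(
        wt * min(euclid(emb[w], emb[v]) for v in d_to)
        for w, wt in d_from.items()
    )


def rwmd(d1, d2, emb):
    """Relaxed WMD: the maximum of the two one-sided relaxations."""
    return max(one_sided_relaxation(d1, d2, emb),
               one_sided_relaxation(d2, d1, emb))
```

* Taking the maximum is valid because each one-sided value is itself a lower bound on the exact WMD, so the larger of the two is the tighter bound.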

* Finally, we evaluate the speedup and accuracy of the exact and approximate versions of the prefetch and prune algorithm from Section 4 under various values of m (Figure 7). When m = k we use the WCD metric for classification (and drop all WMD computations, which are unnecessary). For all other results we prefetch m instances via WCD, use RWMD to check whether a document can be pruned, and compute the exact WMD distance only if it cannot. The last bar for each dataset shows the test error obtained with the RWMD metric (omitting all WMD computations). All speedups are reported relative to the time required for the exhaustive WMD metric (very top of the figure) and were run in parallel on 4 cores (8 cores for 20NEWS) of an Intel L5520 CPU with 2.27 GHz clock frequency.

* 最后，我们评估了第4节中预取和剪枝算法的精确版本和近似版本在不同m值下的加速比和准确性(图7)。当m = k时，我们使用WCD度量进行分类(并省去所有不必要的WMD计算)。对于所有其他结果，我们通过WCD预取m个实例，使用RWMD检查文档是否可以被剪除，只有在不能剪除时才计算精确的WMD距离。每个数据集的最后一个条形显示了使用RWMD度量(省略所有WMD计算)得到的测试误差。所有加速比都是相对于穷举WMD度量所需时间(图的最上方)报告的，并在时钟频率为2.27 GHz的Intel L5520 CPU的4个核心(20NEWS为8个核心)上并行运行。
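
* The prefetch and prune procedure can be sketched as follows, assuming the three distances are supplied as callables with `rwmd(q, d) <= wmd(q, d)` for every pair. The function name, its signature, and the returned count of exact computations are hypothetical. This sketch does not special-case m = k, where the text above skips the WMD computations entirely.

```python
import heapq


def prefetch_and_prune(query, docs, k, wcd, rwmd, wmd, m=None):
    """Return (neighbors, n_exact) for one query.

    neighbors is a list of (distance, doc_index) sorted by exact WMD.
    n_exact counts how many exact WMD computations were needed.
    m limits the search to the m documents closest in WCD;
    m=None means m = n, which gives the exact k nearest neighbors.
    """
    n = len(docs)
    m = n if m is None else min(m, n)
    # Prefetch: order candidates by the cheap WCD and keep the first m.
    order = sorted(range(n), key=lambda i: wcd(query, docs[i]))[:m]
    heap = []      # max-heap of the current k best, stored as (-distance, index)
    n_exact = 0
    for rank, i in enumerate(order):
        if rank < k:
            heapq.heappush(heap, (-wmd(query, docs[i]), i))
            n_exact += 1
            continue
        kth = -heap[0][0]  # distance of the current k-th nearest neighbor
        if rwmd(query, docs[i]) >= kth:
            continue       # prune: the lower bound already rules this document out
        d = wmd(query, docs[i])
        n_exact += 1
        if d < kth:
            heapq.heapreplace(heap, (-d, i))
    return sorted((-neg, i) for neg, i in heap), n_exact
```

* With m = n the pruning test never discards a true neighbor, since a document whose lower bound is at least the current k-th distance cannot enter the top k. Smaller m trades that guarantee for speed.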

* First, we notice that in all cases the increase in error through prefetching is relatively minor, whereas the speedup can be substantial. The exact method (m = n) typically results in a speedup between 2× and 5×, which becomes more pronounced with increasing document lengths (e.g. 20NEWS). It is interesting to observe that the error drops most between m = k and m = 2k, which might yield a sweet spot between accuracy and retrieval time for time-sensitive applications. As noted before, using RWMD directly leads to impressively low error rates and average retrieval times below 1s on all data sets. We believe the actual timing could be improved substantially with more sophisticated implementations (our code is in MATLAB) and parallelization.


* 首先，我们注意到，在所有情况下，预取带来的误差增加都相对较小，而加速却可以相当可观。精确方法(m = n)通常带来2×到5×的加速，并且随着文档长度的增加(例如20NEWS)而更加明显。有趣的是，误差在m = k和m = 2k之间下降最多，这对时间敏感的应用来说可能是准确性和检索时间之间的一个最佳折中点。如前所述，直接使用RWMD可以得到非常低的错误率，并且在所有数据集上的平均检索时间都低于1秒。我们相信，通过更复杂的实现(我们的代码是用MATLAB编写的)和并行化，实际耗时可以得到大幅改善。

## 6. 讨论和结论

* It is worthwhile considering why the WMD metric leads to such low error rates across all data sets. We attribute this to its ability to utilize the high quality of the word2vec embedding. Trained on billions of words, the word2vec embedding captures knowledge about text documents in the English language that may not be obtainable from the training set alone. As pointed out by Mikolov et al. (2013a), other algorithms (such as LDA or LSI) do not scale naturally to data sets of such scale without special approximations which often counteract the benefit of large-scale data (although it is a worthy area of future work). Surprisingly, this “latent” supervision benefits tasks that are different from the data used to learn the word embedding.

* 值得思考的是，为什么WMD度量在所有数据集上都能得到如此低的错误率。我们将此归因于它能够利用高质量的word2vec嵌入。word2vec嵌入在数十亿个单词上训练，捕获了关于英语文本文档的知识，而这些知识可能无法仅从训练集中获得。正如Mikolov et al. (2013a)所指出的，其他算法(如LDA或LSI)若不采用特殊的近似，就无法自然地扩展到这种规模的数据集，而这些近似往往会抵消大规模数据带来的好处(尽管这是一个值得未来研究的方向)。令人惊讶的是，这种“潜在的”监督对与用于学习词嵌入的数据不同的任务也有益处。

* One attractive feature of the WMD that we would like to explore in the future is its interpretability. Document distances can be dissected into sparse distances between words, which can be visualized and explained to humans. Another interesting direction will be to incorporate document structure into the distances between words by adding penalty terms if two words occur in different sections of similarly structured documents. If, for example, the WMD metric is used to measure the distance between academic papers, it might make sense to penalize word movements between the introduction and method section more than word movements from one introduction to another.

* WMD的一个吸引人的特点是它的可解释性，这也是我们将来想要探讨的。文档距离可以分解为单词之间的稀疏距离，这些距离可以被可视化并向人解释。另一个有趣的方向是：如果两个单词出现在结构相似的文档的不同部分中，则添加惩罚项，从而将文档结构纳入单词之间的距离。例如，如果使用WMD度量来衡量学术论文之间的距离，那么对引言和方法部分之间的单词移动施加比从一篇引言到另一篇引言的单词移动更大的惩罚可能是合理的。