## Document similarity calculation and nearest neighbor search

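Computing the similarity between two documents or two sentences is very useful. Anything that can be described in language (articles, products, and people alike) can be vectorized from that description, and once vectorized it can be classified, clustered, or used for related-item recommendation. All of these tasks rest on two foundations: vectorization and similarity computation.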
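As a minimal sketch of what "vectorize, then compare" means, the example below turns short texts into word-count vectors and compares them by cosine similarity. The whitespace tokenizer, the raw counts, and the helper names `bag_of_words` and `cosine_similarity` are illustrative assumptions; a real pipeline would typically add proper tokenization and weighting such as TF-IDF.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a text into a sparse word-count vector (assumes whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("the cat sat on the mat")
doc2 = bag_of_words("the cat lay on the rug")
doc3 = bag_of_words("stock prices fell sharply today")

print(cosine_similarity(doc1, doc2))  # shares "the", "cat", "on"
print(cosine_similarity(doc1, doc3))  # no words in common
```

The first pair shares several words and scores well above the second pair, which has no vocabulary in common and scores zero.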

[How to compute the similarity of two documents (full text)](http://www.52nlp.cn/%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97%E4%B8%A4%E4%B8%AA%E6%96%87%E6%A1%A3%E7%9A%84%E7%9B%B8%E4%BC%BC%E5%BA%A6%E5%85%A8%E6%96%87%E6%96%87%E6%A1%A3)

[How to compute the similarity of two documents (part 1)](http://www.52nlp.cn/%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97%E4%B8%A4%E4%B8%AA%E6%96%87%E6%A1%A3%E7%9A%84%E7%9B%B8%E4%BC%BC%E5%BA%A6%E4%B8%80)

[How to compute the similarity of two documents (part 2)](http://www.52nlp.cn/%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97%E4%B8%A4%E4%B8%AA%E6%96%87%E6%A1%A3%E7%9A%84%E7%9B%B8%E4%BC%BC%E5%BA%A6%E4%BA%8C)

[How to compute the similarity of two documents (part 3)](http://www.52nlp.cn/%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97%E4%B8%A4%E4%B8%AA%E6%96%87%E6%A1%A3%E7%9A%84%E7%9B%B8%E4%BC%BC%E5%BA%A6%E4%B8%89)


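Computing the similarity of two vectors is a common operation; it can be measured by the dot product or by the angle between the vectors. Finding, within a collection of vectors, the one most similar to a given vector is a different problem, because efficiency and algorithmic complexity come into play. A brute-force algorithm certainly works: compute the similarity between the query vector and every vector in the collection, then pick the one with the highest score.

Beyond this naive approach there are cleverer algorithms. If the vectors are one-dimensional, the problem is simple: sort the values once and use binary search. For high-dimensional vectors things get more complicated, and plain binary search no longer applies. The indexing step commonly used in search engines targets exactly this high-dimensional case; simple sorting is not enough there, but the underlying idea is the same. Whether it is a sorted list or an index, the structure can be kept in memory, or stored on disk or in a networked database and loaded when needed. Radim Řehůřek, founder of RARE Technologies (the company behind gensim), wrote three excellent articles on the problem and possible solutions: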

[Performance Shootout of Nearest Neighbours: Intro](https://rare-technologies.com/performance-shootout-of-nearest-neighbours-intro/)

[Performance Shootout of Nearest Neighbours: Contestants](https://rare-technologies.com/performance-shootout-of-nearest-neighbours-contestants/)

[Performance Shootout of Nearest Neighbours: Querying](https://rare-technologies.com/performance-shootout-of-nearest-neighbours-querying/#wikisim)
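The two baseline strategies described above, brute-force search over all vectors and binary search in the one-dimensional case, can be sketched as follows. The function names are hypothetical, and the one-dimensional version assumes that "most similar" means "closest in value", which is an interpretation rather than something the linked articles prescribe. Both functions assume a non-empty collection.

```python
import bisect
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def brute_force_nearest(query, vectors):
    """Index of the most similar vector; compares against every vector, O(n)."""
    return max(range(len(vectors)), key=lambda i: cosine(query, vectors[i]))


def nearest_1d(query, sorted_values):
    """Index of the value closest to query in a sorted list, O(log n)."""
    pos = bisect.bisect_left(sorted_values, query)
    # The closest value is either just before or just at the insertion point.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sorted_values)]
    return min(candidates, key=lambda i: abs(sorted_values[i] - query))


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(brute_force_nearest([0.9, 0.8], vectors))  # the diagonal vector points the same way

values = sorted([7.0, 1.0, 4.0, 10.0])  # sorting once plays the role of "building the index"
print(nearest_1d(5.0, values))
```

The sort is paid once up front and every later query costs only a logarithmic number of comparisons; that trade of preprocessing for query speed is what high-dimensional indexes also aim for.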

In addition, gensim provides classes for document similarity queries. The documentation is here:

[similarities.docsim – Document similarity queries](https://radimrehurek.com/gensim/similarities/docsim.html)

    >>> from gensim.similarities import Similarity
    >>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
    >>>
    >>> index_tmpfile = get_tmpfile("index")
    >>> batch_of_documents = common_corpus[:]  # only as example
    >>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary))  # build the index
    >>>
    >>> for similarities in index[batch_of_documents]:  # the batch is simply an iterable of documents, aka gensim corpus
    ...     pass
    >>>
    >>> query = [(1, 2), (6, 1), (7, 2)]  # one document in sparse bag-of-words form: (token_id, count) pairs
    >>> similarities = index[query]  # get similarities between the query and all index documents
    
There is also a special syntax for when you need the similarity of documents in the index to the index itself (i.e. the queries are the indexed documents themselves). This special syntax uses the faster, batch queries internally and is ideal for all-vs-all pairwise similarities:

    >>> from gensim.similarities import Similarity
    >>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
    >>>
    >>> index_tmpfile = get_tmpfile("index")
    >>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary)) # build the index
    >>>
    >>> for similarities in index: # yield similarities of the 1st indexed document, then 2nd...
    ...     pass
    

Compute cosine similarity of a dynamic query against a static corpus of documents (‘the index’).

Notes

Scalability is achieved by sharding the index into smaller pieces, each of which fits into core memory. The shards themselves are simply stored as files on disk and mmap’ed back as needed.
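To illustrate the sharding idea in the note above, here is a small pure-Python sketch that scores a query against one shard at a time and keeps only the best hits. It is not gensim's implementation: the helpers `make_shards` and `query_shards` are hypothetical, the shards stay in memory instead of being written to disk and mmap'ed, and the vectors are assumed to be unit-length so that the dot product equals the cosine similarity.

```python
import heapq


def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(u, v))


def make_shards(vectors, shard_size):
    """Split the indexed vectors into consecutive shards of at most shard_size."""
    return [vectors[i:i + shard_size] for i in range(0, len(vectors), shard_size)]


def query_shards(query, shards, num_best):
    """Score the query shard by shard and return the global top (doc_id, score) pairs."""
    hits = []
    offset = 0  # global document id of the first vector in the current shard
    for shard in shards:  # only one shard has to be processed at a time
        for local_id, vec in enumerate(shard):
            hits.append((dot(query, vec), offset + local_id))
        hits = heapq.nlargest(num_best, hits)  # trim to the best hits seen so far
        offset += len(shard)
    return [(doc_id, score) for score, doc_id in hits]


# Unit-length example vectors (an assumption of this sketch).
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
shards = make_shards(vectors, shard_size=2)
print(query_shards([1.0, 0.0], shards, num_best=2))
```

Because the running list of hits is trimmed after every shard, memory use depends on the shard size and on `num_best`, not on the total number of indexed documents.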

Examples

    >>> from gensim.corpora.textcorpus import TextCorpus
    >>> from gensim.test.utils import datapath, get_tmpfile
    >>> from gensim.similarities import Similarity
    >>>
    >>> corpus = TextCorpus(datapath('testcorpus.mm'))
    >>> index_temp = get_tmpfile("index")
    >>> index = Similarity(index_temp, corpus, num_features=400)  # create index
    >>>
    >>> query = next(iter(corpus))
    >>> result = index[query]  # search similar to `query` in index
    >>>
    >>> for sims in index[corpus]: # if you have more query documents, you can submit them all at once, in a batch
    ...     pass
    >>>
    >>> # There is also a special syntax for when you need similarity of documents in the index
    >>> # to the index itself (i.e. queries=indexed documents themselves). This special syntax
    >>> # uses the faster, batch queries internally and **is ideal for all-vs-all pairwise similarities**:
    >>> for similarities in index: # yield similarities of the 1st indexed document, then 2nd...
    ...     pass
    
