## Fast similarity queries with Annoy and Word2Vec

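The similarity method used so far finds the k nearest neighbors in the vector space by brute-force search. Its cost is linear in the number of stored vectors and its results are exact, which most tasks do not need. The sections below use Annoy to approximate the nearest neighbors instead, giving up some accuracy in exchange for faster queries.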
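
To make the linear cost concrete, here is a minimal pure-Python sketch of a brute-force cosine search. The function names and the 2-d toy vectors are made up for illustration; gensim's own implementation is vectorized and differs in detail.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_neighbors(query, vectors, k):
    """Score the query against every stored vector, then keep the top k.

    Every entry of `vectors` is visited once, so the cost of a single
    query grows linearly with the vocabulary size.
    """
    scored = ((cosine(query, vec), word) for word, vec in vectors.items())
    return [(word, score) for score, word in heapq.nlargest(k, scored)]

# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    "science": [1.0, 0.1],
    "sciences": [0.9, 0.2],
    "fiction": [0.6, 0.8],
    "banana": [-0.2, 1.0],
}
top2 = brute_force_neighbors(toy_vectors["science"], toy_vectors, k=2)
print(top2)
```

An approximate index avoids visiting every vector on each query, at the price of occasionally missing a true neighbor.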

### Using the Text8 corpus

Download the corpus:
```
set https_proxy=IP:PORT
python -m gensim.downloader --download text8
```
Use the corpus:

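```python
import gensim.downloader as api

# return_path=True returns the local path of the corpus file instead of loading it
text8_path = api.load('text8', return_path=True)
print(text8_path)
```

Output:

```
C:\Users\Administrator/gensim-data\text8\text8.gz
```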


### Training the Word2Vec model


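```python
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.word2vec import Text8Corpus

# Training hyperparameters
params = {
    'alpha': 0.05,       # initial learning rate
    'vector_size': 100,  # dimensionality of the word vectors
    'window': 5,         # context window size
    'epochs': 5,         # number of passes over the corpus
    'min_count': 3,      # ignore words that occur fewer than 3 times
    'sample': 1e-4,      # downsampling threshold for frequent words
    'sg': 1,             # 1 = skip-gram, 0 = CBOW
    'hs': 0,             # 0 = no hierarchical softmax
    'negative': 5,       # number of negative samples
}

model = Word2Vec(Text8Corpus(text8_path), **params)
wv = model.wv  # the trained word vectors (KeyedVectors)
print("Training Finished", wv)
```

Output:

```
Training Finished <gensim.models.keyedvectors.KeyedVectors object at 0x000001FC9978E940>
```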


### Building an Annoy index from the model and querying it

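To use Annoy with gensim, first create an `AnnoyIndexer` instance.
It takes two arguments:
- model: a Word2Vec or Doc2Vec model
- num_trees: a positive integer that affects build time and index size; larger values give more accurate results.

To run a similarity query, call the same method as before and additionally pass the **indexer** argument.

*Besides Annoy, gensim also supports an NMSLIB indexer; like Annoy, it provides fast, approximate lookups.*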
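
The effect of the number of trees can be illustrated with a toy random-projection forest. This is a simplified sketch of the general idea (split the points with random hyperplanes, look up one leaf per tree, then rank the union of candidates exactly), not Annoy's actual implementation. All names, the leaf size, the Euclidean distance, and the random Gaussian data are assumptions made for the example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(ids, vectors, rng, leaf_size=4):
    """Split ids recursively with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return ids  # a leaf is a plain list of point ids
    a, b = rng.sample(ids, 2)
    # Hyperplane equidistant from two randomly sampled points.
    normal = [x - y for x, y in zip(vectors[a], vectors[b])]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(vectors[a], vectors[b])])
    left = [i for i in ids if dot(normal, vectors[i]) < offset]
    right = [i for i in ids if dot(normal, vectors[i]) >= offset]
    if not left or not right:  # degenerate split: keep everything in one leaf
        return ids
    return (normal, offset,
            build_tree(left, vectors, rng, leaf_size),
            build_tree(right, vectors, rng, leaf_size))

def leaf_for(tree, query):
    """Follow the query down one tree to the leaf it falls into."""
    while isinstance(tree, tuple):
        normal, offset, left, right = tree
        tree = left if dot(normal, query) < offset else right
    return tree

def forest_query(forest, vectors, query, k):
    """Collect candidates from one leaf per tree, then rank them exactly."""
    candidates = set()
    for tree in forest:
        candidates.update(leaf_for(tree, query))
    def sq_dist(i):
        return sum((x - y) ** 2 for x, y in zip(vectors[i], query))
    return sorted(candidates, key=sq_dist)[:k]

rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
ids = list(range(len(points)))
forest = [build_tree(ids, points, rng) for _ in range(20)]

query = points[0]
# A "forest" made of one giant leaf is the same as brute-force search.
exact = forest_query([ids], points, query, k=10)
for n in (1, 5, 20):
    approx = forest_query(forest[:n], points, query, k=10)
    print(n, "trees, recall:", len(set(approx) & set(exact)) / 10)
```

In this sketch, adding trees can only enlarge the candidate set, so recall never decreases, while build time and memory grow with every extra tree. That is the same trade-off `num_trees` controls.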

```python
from gensim.similarities.annoy import AnnoyIndexer

# use num_trees = 100
annoy_index = AnnoyIndexer(model, 100)
vector = wv['science']
```

```python
# Approximate search: pass the Annoy index through the indexer argument
approximate_neighbors = wv.most_similar([vector], topn=11, indexer=annoy_index)
print("Approximate Neighbors:")
for neighbor in approximate_neighbors:
    print(neighbor)

# Exact (brute-force) search for comparison
normal_neighbors = wv.most_similar([vector], topn=11)
print("Exact Neighbors:")
for neighbor in normal_neighbors:
    print(neighbor)
```

Output:

```
Approximate Neighbors:
('science', 1.0)
('astronautics', 0.5984432399272919)
('sciences', 0.5957670509815216)
('astrobiology', 0.5933246612548828)
('geisteswissenschaften', 0.5929851830005646)
('integrative', 0.5911383032798767)
('castronova', 0.584848940372467)
('populariser', 0.5819923877716064)
('criminology', 0.5813790261745453)
('theorizing', 0.5798007845878601)
('psychometrics', 0.5797312259674072)
Exact Neighbors:
('science', 1.0000001192092896)
('fiction', 0.735144853591919)
('astronautics', 0.6775044202804565)
('sciences', 0.673191487789154)
('actuarial', 0.6727616786956787)
('multidisciplinary', 0.6696473956108093)
('astrobiology', 0.6692304611206055)
('geisteswissenschaften', 0.6686779260635376)
('integrative', 0.6656641960144043)
('castronova', 0.655299186706543)
('technology', 0.6517725586891174)
```
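
The two result lists differ both in membership and in the scores attached to shared words. The sketch below, using values copied from the output above (rounded), counts the overlap and checks one possible explanation for the score gap: that the indexer reports `1 - d / 2`, where `d = sqrt(2 * (1 - cos))` is the angular distance between unit vectors. That conversion is an assumption inferred from these printed numbers, not something the text states, and other gensim versions may convert distances differently.

```python
import math

# Scores copied from the printed output above (rounded to 7 decimals).
approximate = {
    "science": 1.0, "astronautics": 0.5984432, "sciences": 0.5957671,
    "astrobiology": 0.5933247, "geisteswissenschaften": 0.5929852,
    "integrative": 0.5911383, "castronova": 0.5848489,
    "populariser": 0.5819924, "criminology": 0.5813790,
    "theorizing": 0.5798008, "psychometrics": 0.5797312,
}
exact = {
    "science": 1.0000001, "fiction": 0.7351449, "astronautics": 0.6775044,
    "sciences": 0.6731915, "actuarial": 0.6727617,
    "multidisciplinary": 0.6696474, "astrobiology": 0.6692305,
    "geisteswissenschaften": 0.6686779, "integrative": 0.6656642,
    "castronova": 0.6552992, "technology": 0.6517726,
}

shared = [word for word in exact if word in approximate]
missed = [word for word in exact if word not in approximate]
print(f"overlap: {len(shared)}/{len(exact)}")
print("missed by the index:", missed)

def assumed_index_score(cos_sim):
    """Assumed conversion: 1 - d/2 with d = sqrt(2 * (1 - cos))."""
    d = math.sqrt(max(0.0, 2 * (1 - cos_sim)))
    return 1 - d / 2

# Compare the assumed conversion with the scores the index printed.
for word in shared:
    print(word, round(assumed_index_score(exact[word]), 4),
          round(approximate[word], 4))
```

If the assumption holds, the lower approximate scores come from a different scale rather than from a worse match, so only the ranking and membership are comparable across the two lists.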


### Persistence

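Saving the index lets you reuse it later without rebuilding it, which saves time.

Persistence stores two files on disk: fname and fname.d.

Before each load, create an empty AnnoyIndexer object.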

```python
# Note: do not reuse an existing file name, otherwise an OSError is raised
fname = 'annoy.indexer'
# save
annoy_index.save(fname)
```


```python
# load
load_index = AnnoyIndexer()
load_index.load(fname)
load_index.model = model

vector = wv['science']
approximate_neighbors2 = wv.most_similar([vector], topn=11, indexer=load_index)
for neighbor in approximate_neighbors2:
    print(neighbor)

# The loaded index should return the same results as the original one
assert approximate_neighbors2 == approximate_neighbors
```

Output:

```
('science', 1.0)
('astronautics', 0.5984432399272919)
('sciences', 0.5957670509815216)
('astrobiology', 0.5933246612548828)
('geisteswissenschaften', 0.5929851830005646)
('integrative', 0.5911383032798767)
('castronova', 0.584848940372467)
('populariser', 0.5819923877716064)
('criminology', 0.5813790261745453)
('theorizing', 0.5798007845878601)
('psychometrics', 0.5797312259674072)
```
