## Doc2Vec Model

The Doc2Vec model represents each document as a vector.

In [1]:
import os
import gensim

test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

Define a function that reads the train/test files:

In [2]:
import smart_open

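def read_corpus(fname, tokens_only=False):
    """Yield one tokenized document per line of the file."""
    with smart_open.open(fname, encoding='iso-8859-1') as f:
        for i, line in enumerate(f):
            # Lowercase and tokenize the line
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                # Test data: plain token lists, no tags
                yield tokens
            else:
                # Training data: tag each document with its line number
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])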
                
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

In [3]:
print(train_corpus[:2])

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', '

In [4]:
print(test_corpus[:2])

[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to'

### Training the Model

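Instantiate a Doc2Vec model with 50-dimensional vectors that iterates over the training corpus 40 times. Set the minimum word count to 2 so that words occurring fewer than two times are discarded: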

In [5]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_corpus)

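In effect, the vocabulary is a list of the unique words extracted from the training corpus. Additional attributes of each word can be retrieved with the model.wv.get_vecattr() method; for example, to get the number of times the word "penalty" occurs: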

In [6]:
print(f"Word 'penalty' appeared {model.wv.get_vecattr('penalty', 'count')} times")

Word 'penalty' appeared 4 times
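
As a rough illustration of how `min_count=2` shapes the vocabulary, the sketch below counts words in a small made-up corpus and drops those seen fewer than two times. It uses only the standard library; `toy_corpus` and the filtering expression are assumptions for illustration and simplify what `build_vocab` does internally.

```python
import collections

# Toy tokenized corpus (illustrative data, not taken from the Lee corpus).
toy_corpus = [
    ['fire', 'crews', 'battle', 'blaze'],
    ['fire', 'threatens', 'homes'],
    ['crews', 'protect', 'homes'],
]

# Count how often each word occurs across all documents.
word_counts = collections.Counter(word for doc in toy_corpus for word in doc)

# Keep only words that occur at least `min_count` times, mirroring the
# effect of the min_count parameter on the vocabulary.
min_count = 2
vocab = sorted(word for word, count in word_counts.items() if count >= min_count)

print(vocab)
print(word_counts['fire'])
```

Words such as 'battle' or 'protect', which occur only once here, never enter the vocabulary and therefore have no count to look up afterwards.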


Next, train the model on the corpus (using a BLAS library speeds up training).

In [7]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

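Pass a list of words to the trained model through model.infer_vector. The returned vector can then be compared with other vectors using cosine similarity.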

In [11]:
vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)

[-0.08966017 -0.23857424 -0.13767982  0.24228011  0.01501597 -0.07111948
  0.13412331  0.04861502 -0.13826714 -0.15789643  0.11965988 -0.08167095
 -0.07613041  0.02066883 -0.16935885 -0.16157469  0.02196263  0.15724814
  0.19331518 -0.14272724 -0.04133454 -0.00185587  0.17325863  0.05465156
 -0.01126674 -0.05047667 -0.30797902 -0.02384892 -0.06248168  0.01862221
  0.40152788 -0.09138237  0.16819322  0.08238731  0.18077222  0.14003135
 -0.13970795 -0.28252092 -0.22613388 -0.04293396 -0.02291233 -0.07351386
 -0.02236126 -0.04248099  0.1363744   0.07939965 -0.16482353 -0.13134898
  0.1511766   0.13974746]


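Because inference is an iterative approximation that involves randomness, inferring a vector for the same text repeatedly gives slightly different results.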
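
The cosine similarity used to compare inferred vectors can be sketched in plain Python. The helper `cosine_similarity` and the three short vectors below are illustrative assumptions, not part of gensim and not output of the model above.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for inferred document vectors.
v1 = [0.1, 0.3, -0.2]
v2 = [0.2, 0.6, -0.4]   # same direction as v1, twice the length
v3 = [0.3, -0.1, 0.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Cosine similarity depends only on direction, so scaling a vector does not change the score. Two slightly different inferences of the same text should still score close to 1.0 against each other.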

### Assessing the Model

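To assess the model, first infer a new vector for every document in the training corpus, then compare it with the vectors learned during training and record the rank at which the document itself appears in the similarity ordering.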

In [12]:
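ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    # Re-infer a vector for the document as if it were unseen text
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Rank every training document by similarity to the inferred vector
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    # Position of the document itself in that ranking (0 = most similar)
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    # Keep the runner-up (docid, similarity) pair for later inspection
    second_ranks.append(sims[1])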
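
The rank lookup in the loop above can be tried in isolation on a toy similarity list. The `sims` values and the helper name `self_rank` are made up for illustration; only the list-index idiom is taken from the cell above.

```python
# Toy stand-in for the output of most_similar: (document id, similarity)
# pairs sorted from most to least similar. The values are made up.
sims = [(2, 0.93), (0, 0.41), (3, 0.12), (1, -0.05)]

def self_rank(doc_id, sims):
    """Return the position of doc_id in the similarity ordering (0 = top)."""
    ranked_ids = [docid for docid, sim in sims]
    return ranked_ids.index(doc_id)

print(self_rank(2, sims))  # document 2 sits at the top of this ordering
print(self_rank(0, sims))  # document 0 is the runner-up
```

A rank of 0 means the inferred vector is closest to the document's own trained vector, which is the behaviour a well-fitted model should show for most training documents.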

In [13]:
import collections

counter = collections.Counter(ranks)
print(counter)

Counter({0: 293, 1: 7})
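
The counter shows that 293 of the 300 training documents are most similar to themselves, and the remaining 7 appear in second place. A small sketch of turning those counts into a self-match rate, with the numbers copied from the output above (the variable names are illustrative):

```python
import collections

# Rank counts copied from the output above.
counter = collections.Counter({0: 293, 1: 7})

total_docs = sum(counter.values())
self_match_rate = counter[0] / total_docs

print(f"{counter[0]} of {total_docs} documents rank themselves first "
      f"({self_match_rate:.1%})")
```

Because training and inference involve randomness, the exact counts are likely to differ a little from run to run.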


In [14]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not v

The most similar document has a similarity score close to 1.0, whereas the second-ranked document has a significantly lower score.

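Run the following code repeatedly to compare randomly chosen training documents with their second-most-similar documents: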

In [17]:
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (259): «israeli forces have launched attacks on some of the key palestinian symbols of autonomy including gaza international airport the strikes come as israeli authorities announced they were stepping up military operations against yasser arafat palestinian authority the palestinian leadership meanwhile appealed for intervention from the united nations security council after israeli air strikes yesterday and accused israeli prime minister ariel sharon of declaring war on the palestinians mr sharon government also placed force the armed group in charge of mr arafat protection and the tanzim military groups of his fatah faction on its list of terrorist organisations senior israeli official said the decisions were taken in five hour marathon late night session of the national unity government said the official who asked not to be named in series of incursions and air strikes the israeli military targeted mr arafat symbols of power after holding him to account for spate of 

### Testing the Model

Apply the same approach to a randomly chosen test document and compare it with the documents in the training corpus:

In [18]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Test Document (2): «the united states government has said it wants to see president robert mugabe removed from power and that it is working with the zimbabwean opposition to bring about change of administration as scores of white farmers went into hiding to escape round up by zimbabwean police senior bush administration official called mr mugabe rule illegitimate and irrational and said that his re election as president in march was won through fraud walter kansteiner the assistant secretary of state for african affairs went on to blame mr mugabe policies for contributing to the threat of famine in zimbabwe»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (94, 0.607802152633667): «foreign minister alexander downer says the commonwealth democracy watchdog should put zimbabwe formally on its agenda in the first step to possible suspension from the organisation mr downer says ministers from the commonwealth ministerial action group cmag should review whethe