### Training the model

First, train a model on the Lee background corpus bundled with gensim:

In [1]:
from gensim.test.utils import datapath
from gensim import utils 
import gensim.models

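class MyCorpus:
    """Stream the corpus one line at a time so it never has to fit in memory."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # One document per line; tokenize and lowercase it
                yield utils.simple_preprocess(line)

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)
# Training can also be done step by step:
# model = gensim.models.Word2Vec()
# model.build_vocab(sentences)
# model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)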

Once the model is trained, it can be used for similarity queries and other operations.

The main component of the model is model.wv, which holds the trained word vectors:

In [2]:
vec_king = model.wv['king']
print(vec_king)

[-0.01629661  0.04841239  0.0110352   0.01286649  0.01025715 -0.09067513
  0.03708515  0.09195175 -0.00444688 -0.01559921 -0.00516166 -0.05739029
  0.00757758  0.03019384  0.00534     0.01359623 -0.00281732 -0.00239061
 -0.01876056 -0.06646399  0.04076691  0.00967966  0.01266097 -0.00154255
 -0.02061245  0.02157506 -0.02023448 -0.01322702 -0.02953921  0.01450843
  0.03576827 -0.04518547  0.04006007 -0.0358248  -0.00813503  0.0512962
  0.0155316   0.00897516 -0.0186383  -0.03283753 -0.01695674  0.00471581
 -0.00954359  0.01626479  0.02876061 -0.02135929 -0.02859485 -0.00109092
  0.00788634  0.03376314  0.01782753 -0.02405947 -0.01787091  0.00067012
 -0.01479281  0.01859088 -0.00080463  0.00197593 -0.02462159  0.00408266
 -0.01587849  0.00064162  0.00855404 -0.00472066 -0.03219509  0.06520161
  0.01620102  0.03610145 -0.04194429  0.05237212 -0.01118278  0.00405617
  0.05388708 -0.00835613  0.03904381  0.02990142 -0.00118563 -0.02267906
 -0.04432076 -0.01639775 -0.03473365  0.00836176 -0.

Retrieve the vocabulary:

In [3]:
for index, word in enumerate(model.wv.index_to_key):
    if index==10:
        break
    print("word #{} is {}".format(index, word))

word #0 is the
word #1 is to
word #2 is of
word #3 is in
word #4 is and
word #5 is he
word #6 is is
word #7 is for
word #8 is on
word #9 is said


### Persisting the model

A trained model can be saved to disk and loaded directly the next time it is needed:

In [4]:
filepath = 'my_word2vec'
model.save(filepath)

In [5]:
# load
new_model = gensim.models.Word2Vec.load(filepath)

### Training parameters

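- min_count: ignore words that occur fewer than this many times in the corpus (default 5)
- vector_size: the dimensionality N of the vector space that words are mapped into
- workers: the number of worker threads; more threads speed up training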

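*Note: training may fail with the error
RuntimeError: you must first build vocabulary before training the model
This happens because min_count defaults to 5, so on a very small corpus every word falls below the threshold and the vocabulary ends up empty.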
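
The effect of min_count on a small corpus can be illustrated in plain Python. This is only a sketch of the idea of frequency filtering, not gensim's actual implementation; the helper `filter_by_min_count` and the tiny corpus are hypothetical.

```python
from collections import Counter

def filter_by_min_count(sentences, min_count=5):
    """Return the words (with counts) that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Hypothetical tiny corpus: no word appears five times
tiny_corpus = [
    ["hello", "world"],
    ["hello", "gensim"],
]

print(filter_by_min_count(tiny_corpus))               # {}
print(filter_by_min_count(tiny_corpus, min_count=1))  # {'hello': 2, 'world': 1, 'gensim': 1}
```

With the default threshold nothing survives, which matches the empty-vocabulary situation behind the error; lowering min_count or using more data avoids it.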

### Continued training (online training)

A model that has already been trained can be trained further on additional data.

In [6]:
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'continue', 'training', 'more', 'sentences']
]

new_model.build_vocab(more_sentences, update=True)
# Use the counts stored on the model being trained (new_model), not the original model
new_model.train(more_sentences, total_examples=new_model.corpus_count, epochs=new_model.epochs)

(20, 40)

In [7]:
print(len(new_model.wv.index_to_key))
print(new_model.wv.most_similar(positive=['high', 'good'], topn=5))

1750
[('for', 0.9996793866157532), ('at', 0.999624490737915), ('today', 0.9996228814125061), ('could', 0.9996103048324585), ('this', 0.9995994567871094)]
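
The scores returned by most_similar are cosine similarities between word vectors. A minimal plain-Python sketch of that measure follows; the function name `cosine_similarity` and the three short vectors are made up for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for word embeddings
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # approximately 1.0
print(cosine_similarity(vec_a, vec_c))  # 0.0
```

Because the measure depends only on direction, vectors that all point roughly the same way score close to 1 regardless of their length.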
