# gensim word2vec example

Global definitions. Note that the two `reload` lines below change how `print` output is displayed, so they must stay commented out.

In [1]:
#!/usr/bin/env python
#coding=utf8
# author: Andy Liu
#from __future__ import print_function
import argparse
import logging
import sys

from gensim.models.word2vec import Word2Vec
from gensim.models.word2vec import Text8Corpus

#reload(sys)
#sys.setdefaultencoding("utf8")

# Configure logging.
logging.basicConfig(
    level=logging.INFO,
    format=('[%(levelname)s] %(asctime)s %(filename)s:%(lineno)d :'
            '%(message)s'), datefmt='%a, %d %b %Y %H:%M:%S')

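Define a class that reads the training data from a file, and override `__iter__` so that the class is iterable. gensim's Word2Vec can take the whole training corpus at once, or it can read the corpus from an iterable, which saves memory. Note that `__iter__` uses `yield` to emit one sentence at a time, again to save memory.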

In [2]:
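class Sentences(object):
    """Iterable over a pre-tokenized corpus file: one sentence per line, tokens separated by tabs."""

    def __init__(self, corpus_file):
        self.corpus_file = corpus_file

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.corpus_file) as f:
            for line in f:
                line = line.strip().lower()
                if not isinstance(line, unicode):  # Python 2: file lines are byte strings.
                    try:
                        line = line.decode('utf8')
                    except UnicodeDecodeError:
                        # logging.error() accepts no `file` keyword; pass the line as a log argument.
                        logging.error('Failed to decode line %r', line)
                        continue  # Skip lines that are not valid UTF-8.
                # Yield one sentence at a time as a list of tokens.
                yield line.split('\t')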
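
For readers on Python 3, the same streaming idea can be sketched without the `unicode` checks. The class name `StreamingSentences` and the two-line demo file are hypothetical; this is an illustrative sketch, not the notebook's own code. It also shows why a class with `__iter__` is used rather than a bare generator: the object can be iterated more than once, which Word2Vec needs because it makes one pass to build the vocabulary and further passes to train.

```python
import io
import os
import tempfile


class StreamingSentences(object):
    """Hypothetical Python 3 counterpart of the Sentences class above."""

    def __init__(self, corpus_file):
        self.corpus_file = corpus_file

    def __iter__(self):
        # Open the file anew on each pass so every iteration starts at line one.
        with io.open(self.corpus_file, encoding='utf8') as f:
            for line in f:
                # One sentence per line, tokens separated by tabs.
                yield line.strip().lower().split('\t')


def write_demo_corpus():
    """Write a two-line, tab-separated demo corpus and return its path."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with io.open(fd, 'w', encoding='utf8') as f:
        f.write(u'电影\t预告片\n电视剧\t花絮\n')
    return path


demo_path = write_demo_corpus()
sentences = StreamingSentences(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)  # A bare generator would be exhausted here.
os.remove(demo_path)
print(first_pass == second_pass)
```

Only one line of the file is held in memory at a time, so the approach scales to corpora much larger than RAM.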

**Training function**

In [3]:
def train(corpus_file, dimension):
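    # Read the training data with the Sentences class defined above,
    # or alternatively with the Text8Corpus class provided by gensim.
    sentences = Sentences(corpus_file)
    #sentences = Text8Corpus(corpus_file)
    # min_count=5 ignores words seen fewer than 5 times; size is the vector dimension.
    model = Word2Vec(sentences, min_count=5, size=dimension)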
    logging.info('Finished training!')

    return model
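
The `min_count=5` argument tells Word2Vec to ignore words whose total count in the corpus is below 5. Its effect on the vocabulary can be sketched in plain Python as follows. `prune_vocab` and the toy sentences are hypothetical and written only for illustration; this is not gensim's internal implementation.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


toy_sentences = [
    ['a', 'b', 'a'],
    ['a', 'c'],
    ['b', 'a'],
]
vocab = prune_vocab(toy_sentences, min_count=2)
print(sorted(vocab.items()))
```

With a small corpus, a lower threshold keeps more rare words in the vocabulary, but their vectors are trained on very few examples.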

**Restore function**

In [4]:
def restore(args):
    if not args.model:
        logging.error('Model path needed!')
        sys.exit(1)

    return Word2Vec.load(args.model)
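
`restore` expects an `args` object with a `model` attribute, and the notebook imports `argparse` without showing a parser. A minimal sketch of such a parser might look like the one below. The flag names `--corpus_file`, `--model`, and `--dimension`, and the default of 200, are assumptions chosen to match the names and values used in this notebook, not the author's actual command line.

```python
import argparse


def build_parser():
    """Build a hypothetical command-line parser for the functions above."""
    parser = argparse.ArgumentParser(description='gensim word2vec example')
    parser.add_argument('--corpus_file', help='tab-separated training corpus')
    parser.add_argument('--model', help='path of a saved Word2Vec model')
    parser.add_argument('--dimension', type=int, default=200,
                        help='size of the word vectors')
    return parser


# Parse an explicit argument list so the sketch also runs inside a notebook.
args = build_parser().parse_args(['--model', './model/corpus_10w_200.model'])
print(args.model, args.dimension)
```

An `args` object built this way could be passed directly to `restore(args)`.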

**Evaluation function**

In [15]:
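def evaluate(model, word):
    # Python 2: make sure the query is a unicode string, like the vocabulary keys.
    if not isinstance(word, unicode):
        word = word.decode('utf8')

    if word not in model.wv.vocab:
        print('Word does not exist in vocabulary!')
        return

    #print('Embedding vector :')
    #print(model.wv[word])

    print('Top 10 most similar words of word %s:' % word)
    # Query through model.wv, consistent with the vocabulary check above.
    neighbors = model.wv.most_similar(positive=[word], topn=10)
    for (neighbor, score) in neighbors:
        print('%s\t%f' % (neighbor, score))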
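
The scores printed by `most_similar` are cosine similarities between word vectors. The ranking can be sketched in plain Python with toy 2-dimensional vectors. The `toy_vectors` values and the `top_n_similar` helper are made up for illustration and have nothing to do with the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = {
    'film': [1.0, 0.0],
    'trailer': [0.8, 0.6],
    'weather': [0.0, 1.0],
}
ranking = top_n_similar(toy_vectors, 'film', topn=2)
for other, score in ranking:
    print('%s\t%f' % (other, score))
```

A score close to 1 means the two vectors point in nearly the same direction, and a score near 0 means they are nearly orthogonal.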

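#### Training
The corpus used here consists of 100,000 video titles that have already been word-segmented. That is very small for training word vectors; it serves only to build and test the pipeline. Judging from the few test words chosen below, the results are acceptable. For real applications, or for better quality, a larger corpus is needed.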

In [6]:
model = train('./data/corpus_10w', 200)

[INFO] Sat, 09 Sep 2017 11:48:32 word2vec.py:550 :collecting all words and their counts
[INFO] Sat, 09 Sep 2017 11:48:32 word2vec.py:564 :PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
[INFO] Sat, 09 Sep 2017 11:48:32 word2vec.py:564 :PROGRESS: at sentence #10000, processed 103613 words, keeping 25861 word types
[INFO] Sat, 09 Sep 2017 11:48:32 word2vec.py:564 :PROGRESS: at sentence #20000, processed 206818 words, keeping 41462 word types
[INFO] Sat, 09 Sep 2017 11:48:32 word2vec.py:564 :PROGRESS: at sentence #30000, processed 309989 words, keeping 53888 word types
[INFO] Sat, 09 Sep 2017 11:48:32 word2vec.py:564 :PROGRESS: at sentence #40000, processed 413984 words, keeping 64591 word types
[INFO] Sat, 09 Sep 2017 11:48:32 word2vec.py:564 :PROGRESS: at sentence #50000, processed 517395 words, keeping 74021 word types
[INFO] Sat, 09 Sep 2017 11:48:32 word2vec.py:564 :PROGRESS: at sentence #60000, processed 621091 words, keeping 82786 word types
[INFO] Sat, 09 Sep 201

#### Saving the model to a file

In [7]:
model.save('./model/corpus_10w_200.model')

[INFO] Sat, 09 Sep 2017 11:48:39 utils.py:362 :saving Word2Vec object under ./model/corpus_10w_200.model, separately None
[INFO] Sat, 09 Sep 2017 11:48:39 utils.py:450 :not storing attribute syn0norm
[INFO] Sat, 09 Sep 2017 11:48:39 utils.py:450 :not storing attribute cum_table
[INFO] Sat, 09 Sep 2017 11:48:39 utils.py:375 :saved ./model/corpus_10w_200.model


#### Testing

In [16]:
evaluate(model, u'电影')

Top 10 most similar words of word 电影:
主题曲	0.938073
微电影	0.913408
歌曲	0.913322
插曲	0.905568
演唱	0.889877
经典	0.885923
港囧	0.881589
预告片	0.878016
电视剧	0.876737
花絮	0.876419


In [17]:
evaluate(model, u'花絮')

Top 10 most similar words of word 花絮:
预告片	0.919176
港囧	0.907085
主题曲	0.903566
电视剧	0.902356
特辑	0.901953
预告	0.897735
拍摄	0.896118
幕后	0.894320
片花	0.893637
红高粱	0.884499


In [18]:
evaluate(model, u'片花')

Top 10 most similar words of word 片花:
当你老了	0.966364
选段	0.965452
幕后	0.964890
霍元甲	0.962546
越剧	0.961426
湖南卫视	0.958544
甄嬛传	0.957648
音乐剧	0.955277
连续剧	0.954533
红高粱	0.953207
