In [1]:
from gensim.models import word2vec
import logging

### Tokenizing the sentences

In [9]:
raw_sentences = ["The research subjects are the 29 English teachers of First Office of College English Department of HIT and some of their students.","Microsoft Office 2013, free and safe download. Microsoft Office 2013 latest version"]

In [10]:
sentence = [s.split() for s in raw_sentences]
print(sentence)

[['The', 'research', 'subjects', 'are', 'the', '29', 'English', 'teachers', 'of', 'First', 'Office', 'of', 'College', 'English', 'Department', 'of', 'HIT', 'and', 'some', 'of', 'their', 'students.'], ['Microsoft', 'Office', '2013,', 'free', 'and', 'safe', 'download.', 'Microsoft', 'Office', '2013', 'latest', 'version']]
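
Note that in the output above `split()` leaves punctuation attached to words, so `'2013,'` and `'2013'` end up as two different tokens. A minimal sketch of a slightly stricter tokenizer follows; `simple_tokenize` is a hypothetical helper introduced here for illustration, it only handles ASCII letters and digits, and it is not suitable for Chinese text.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = simple_tokenize("Microsoft Office 2013, free and safe download.")
print(tokens)
```

Lowercasing also merges `'The'` and `'the'`, which matters on a corpus this small.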


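### Parameters for building word2vec with gensim
In gensim, the word2vec API lives in the package gensim.models.word2vec, and the algorithm-related parameters are all arguments of the class gensim.models.word2vec.Word2Vec. The parameters to pay attention to are:

1) sentences: the corpus to analyze. It can be a list, or an iterable that reads from a file. An example of reading from a file appears later.

2) size: the dimensionality of the word vectors; the default is 100 (gensim 4.0 renamed this parameter to vector_size). The right value usually depends on the corpus size: for a small corpus, say less than 100 MB of text, the default is generally fine; for a very large corpus, a larger dimensionality is recommended.

3) window: the maximum distance between a word and its context words, denoted c in our article on the algorithm's principles. The larger the window, the more distant the words that are treated as context. The default is 5. In practice, adjust it to your needs: a smaller value for a small corpus, and a value in [5, 10] for a typical corpus.

4) sg: selects between the two word2vec models. 0 means CBOW and 1 means Skip-Gram; the default is 0 (CBOW).

5) hs: selects between the two word2vec training methods. If it is 1, Hierarchical Softmax is used; if it is 0 and the number of negative samples, negative, is greater than 0, Negative Sampling is used. The default is 0 (Negative Sampling).

6) negative: the number of negative samples when Negative Sampling is used; the default is 5, and a value in [3, 10] is recommended. This parameter is denoted neg in the principles article.

7) cbow_mean: used only by CBOW during projection. If it is 0, xw is the sum of the context word vectors; if it is 1, xw is their mean. The principles article describes the mean. I prefer the mean for xw; it is also the default (1), and changing it is not recommended.

8) min_count: the minimum frequency a word needs in order to get a vector. This removes very rare words; the default is 5. For a small corpus, lower this value.

9) iter: the maximum number of iterations of stochastic gradient descent; the default is 5 (gensim 4.0 renamed this parameter to epochs). For a large corpus, this value can be increased.

10) alpha: the initial step size of stochastic gradient descent, denoted η in the principles article; the default is 0.025.

11) min_alpha: because the algorithm gradually decreases the step size during training, min_alpha gives the smallest step size. The step size in each round is derived from iter, alpha, and min_alpha together. This is not core to the word2vec algorithm, so the principles article does not cover it. For a large corpus, tune alpha, min_alpha, and iter together to find a suitable combination.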
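
How alpha, min_alpha, and iter interact can be sketched as a linear decay of the step size. This is an illustration only, not gensim's internal code: gensim adjusts the rate according to training progress rather than once per epoch, `learning_rate_schedule` is a hypothetical helper, and the `min_alpha` value of 0.0001 is assumed here as the library default.

```python
def learning_rate_schedule(alpha=0.025, min_alpha=0.0001, epochs=5):
    """Return one step size per epoch, decaying linearly from alpha to min_alpha."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - i * step for i in range(epochs)]

rates = learning_rate_schedule()
for epoch, rate in enumerate(rates, start=1):
    print(f"epoch {epoch}: step size {rate:.6f}")
```

Raising `epochs` while keeping `alpha` and `min_alpha` fixed makes each per-epoch decrease smaller, which is why the three values are best tuned together.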

In [11]:
# Build the word2vec model
model = word2vec.Word2Vec(sentence,min_count=1)

2019-04-22 23:09:49,886 : INFO : collecting all words and their counts
2019-04-22 23:09:49,887 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-22 23:09:49,888 : INFO : collected 26 word types from a corpus of 34 raw words and 2 sentences
2019-04-22 23:09:49,888 : INFO : Loading a fresh vocabulary
2019-04-22 23:09:49,889 : INFO : min_count=1 retains 26 unique words (100% of original 26, drops 0)
2019-04-22 23:09:49,889 : INFO : min_count=1 leaves 34 word corpus (100% of original 34, drops 0)
2019-04-22 23:09:49,890 : INFO : deleting the raw counts dictionary of 26 items
2019-04-22 23:09:49,891 : INFO : sample=0.001 downsamples 26 most-common words
2019-04-22 23:09:49,892 : INFO : downsampling leaves estimated 6 word corpus (18.3% of prior 34)
2019-04-22 23:09:49,893 : INFO : estimated required memory for 26 words and 100 dimensions: 33800 bytes
2019-04-22 23:09:49,894 : INFO : resetting layer weights
2019-04-22 23:09:49,897 : INFO : training model with

In [5]:
model.wv.similarity("Microsoft", "Office")

-0.06650513

### Similarity comparison

In [4]:
# Import word2vec
from gensim.models import word2vec

# Configure logging
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Define the dataset
raw_sentences = ["the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]

# Split the sentences into words
sentences = [s.split() for s in raw_sentences]

# Build the model
model = word2vec.Word2Vec(sentences, min_count=1)

# Compare the similarity of two words
model.wv.similarity('dogs', 'you')

2019-04-22 23:05:00,364 : INFO : collecting all words and their counts
2019-04-22 23:05:00,364 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-22 23:05:00,365 : INFO : collected 15 word types from a corpus of 16 raw words and 2 sentences
2019-04-22 23:05:00,366 : INFO : Loading a fresh vocabulary
2019-04-22 23:05:00,367 : INFO : min_count=1 retains 15 unique words (100% of original 15, drops 0)
2019-04-22 23:05:00,367 : INFO : min_count=1 leaves 16 word corpus (100% of original 16, drops 0)
2019-04-22 23:05:00,368 : INFO : deleting the raw counts dictionary of 15 items
2019-04-22 23:05:00,369 : INFO : sample=0.001 downsamples 15 most-common words
2019-04-22 23:05:00,370 : INFO : downsampling leaves estimated 2 word corpus (13.7% of prior 16)
2019-04-22 23:05:00,370 : INFO : estimated required memory for 15 words and 100 dimensions: 19500 bytes
2019-04-22 23:05:00,371 : INFO : resetting layer weights
2019-04-22 23:05:00,409 : INFO : training model with

-0.003932551024755548
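
The value returned by `similarity` is the cosine similarity of the two word vectors. With a corpus of two sentences the vectors are barely trained, so results close to zero like the one above carry little meaning. A minimal pure-Python sketch of the computation, using made-up vectors rather than real model output:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

sim_parallel = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
print(sim_parallel, sim_orthogonal)
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, regardless of their lengths.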

### Segmenting the text and writing it to a file

In [77]:
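import jieba

# Read the raw text; pass encoding="gbk" or encoding="utf-8" to open()
# if the file is not in the platform's default encoding
with open("data.txt") as f:
    document = f.read()

# jieba.cut returns a one-shot generator: printing ' '.join(document_cut)
# here would consume it and leave nothing for `result` below
document_cut = jieba.cut(document)
result = ' '.join(document_cut)

# Write the space-separated tokens as UTF-8; the with blocks close both files
with open('data_01.txt', 'w', encoding='utf-8') as f2:
    f2.write(result)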

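#### With the segmented file in hand, a typical NLP pipeline would next remove stop words. Because the word2vec algorithm relies on context, and that context may itself consist of stop words, we can skip stop-word removal for word2vec.

Now we can read the segmented file directly. Here we use the LineSentence class provided by word2vec to read the file and then apply the word2vec model. This is only an example, so parameter tuning is omitted; in real use you may need to tune some of the parameters described above.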

In [78]:
import logging
import os
from gensim.models import word2vec

# Configure logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Use the LineSentence class provided by word2vec to read the file
sentences = word2vec.LineSentence('data_01.txt')

# Create the model
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)

2019-04-23 00:43:23,381 : INFO : collecting all words and their counts
2019-04-23 00:43:23,383 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 00:43:23,384 : INFO : collected 247 word types from a corpus of 467 raw words and 4 sentences
2019-04-23 00:43:23,385 : INFO : Loading a fresh vocabulary
2019-04-23 00:43:23,386 : INFO : min_count=1 retains 247 unique words (100% of original 247, drops 0)
2019-04-23 00:43:23,387 : INFO : min_count=1 leaves 467 word corpus (100% of original 467, drops 0)
2019-04-23 00:43:23,388 : INFO : deleting the raw counts dictionary of 247 items
2019-04-23 00:43:23,389 : INFO : sample=0.001 downsamples 67 most-common words
2019-04-23 00:43:23,390 : INFO : downsampling leaves estimated 297 word corpus (63.8% of prior 467)
2019-04-23 00:43:23,391 : INFO : constructing a huffman tree from 247 words
2019-04-23 00:43:23,398 : INFO : built huffman tree with maximum node depth 9
2019-04-23 00:43:23,399 : INFO : estimated requir
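
The LineSentence class used above expects one sentence per line with tokens separated by whitespace, which is why the segmented text was joined with spaces. The sketch below is a pure-Python stand-in that shows the expected file layout; `iter_line_sentences` is a hypothetical helper, not part of gensim, and the two sample sentences are made up.

```python
import os
import tempfile

def iter_line_sentences(path, encoding="utf-8"):
    """Yield each non-empty line of a file as a list of whitespace-separated tokens."""
    with open(path, encoding=encoding) as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens

# Write a tiny pre-segmented file and read it back
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "segmented.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("今天 天气 很 好\n我们 去 公园 散步\n")
    demo_sentences = list(iter_line_sentences(demo_path))

print(demo_sentences)
```

If the whole document is written as a single line, it is treated as one very long sentence, so keeping line breaks between sentences gives the model more natural context boundaries.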

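#### Now that we have a model, what can we do with it? Here are three common applications

**1. The first, and the most common, is finding the set of words closest to a given word vector**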

In [79]:
req_count = 5  # number of matching words to display
for key in model.wv.similar_by_word('你好', topn=100):
    if len(key[0]) > 2:   # only keep matches longer than two characters
        req_count -= 1
        print(key[0], key[1])  # the word and its similarity score
        if req_count == 0:
            break

2019-04-23 00:43:24,680 : INFO : precomputing L2-norms of word weight vectors


KeyError: "word '你好' not in vocabulary"

model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
# Output: [('queen', 0.50882536), ...]
* * *
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# Output: 'cereal'
* * *
model.wv.similarity('woman', 'man')
# Output: 0.73723527
* * *
model.wv['computer']  # raw numpy vector of a word
# Output: array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
* * *
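
The `positive`/`negative` query above can be read as vector arithmetic: add the positive vectors, subtract the negative ones, and return the remaining word whose vector points in the most similar direction. The sketch below illustrates that idea with made-up two-dimensional vectors; `analogy` and `toy_vectors` are hypothetical names, and gensim additionally normalizes the vectors before combining them, which is skipped here for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up 2-D vectors: first axis is "royalty", second axis is "female"
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

best = analogy(["woman", "king"], ["man"], toy_vectors)
print(best)
```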

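### Training the word2vec model

Calling Word2Vec to create a model actually iterates over the data at least twice.

The first pass counts word frequencies to build the internal vocabulary tree structure.

The second pass, repeated iter times, trains the neural network. These two steps can also be run separately, which allows manual control over streams that cannot be replayed (for example, streaming data from Kafka):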

In [7]:
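import gensim
# some_sentences and other_sentences are placeholders for your own iterables
model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
# train() must be told the corpus size and the number of epochs explicitly
model.train(other_sentences, total_examples=model.corpus_count, epochs=model.epochs)  # can be a non-repeatable, 1-pass generator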

NameError: name 'some_sentences' is not defined
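
The first of the two passes can be illustrated without gensim: count how often each word occurs and keep only the words that reach min_count. This is a simplified sketch; `build_vocab_sketch` is a hypothetical helper, and gensim's real `build_vocab` does more, such as preparing subsampling and, with hs=1, the Huffman tree seen in the logs.

```python
from collections import Counter

def build_vocab_sketch(sentences, min_count=5):
    """Count word frequencies in one pass and drop words rarer than min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog"],
]
vocab = build_vocab_sketch(toy_sentences, min_count=2)
print(vocab)
```

With `min_count=1` every word is retained, which matches the "retains ... (100% of original ...)" lines in the training logs shown in this notebook.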

* * *

### Processing text in bulk

* * *

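Implementing a word2vec model in Gensim is very simple. First, we need to turn the raw **training corpus into an iterator of sentences; each iteration returns a sentence that is a list of words**.

#### As mentioned above, <font color=red>for a large input corpus, or when data from multiple folders on disk has to be combined, we can save RAM by using an iterator instead of reading everything into memory at once</font>: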

In [26]:
for name in os.listdir('D:\\test\\Gensim_构造词向量模板\\corpus'):
    print(name)

data1.txt
data2.txt
data3.txt


In [58]:
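# Define a class whose iterator yields one tokenized sentence per line
import os
import jieba

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # Pass encoding=... to open() if the files are not in the platform's default encoding
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    # jieba.lcut returns a list of tokens; jieba.cut returns a one-shot generator
                    yield jieba.lcut(line.strip())

sentences = MySentences('D:\\test\\Gensim_构造词向量模板\\corpus') # a memory-friendly iterator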
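
The corpus is wrapped in a class with `__iter__` rather than a plain generator because training iterates over the sentences more than once (one vocabulary pass plus the training passes), and a generator can only be consumed once. A minimal sketch of the difference, using made-up sentences and hypothetical names:

```python
def sentence_generator():
    """A plain generator: can be iterated only once."""
    yield ["a", "b"]
    yield ["c", "d"]

class RestartableSentences:
    """An iterable: every call to __iter__ starts a fresh pass."""
    def __iter__(self):
        yield ["a", "b"]
        yield ["c", "d"]

gen = sentence_generator()
first_pass_gen = list(gen)
second_pass_gen = list(gen)      # the generator is already exhausted

restartable = RestartableSentences()
first_pass = list(restartable)
second_pass = list(restartable)  # a new pass yields the same sentences

print(second_pass_gen, second_pass)
```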

In [60]:
import gensim

# MySentences already yields tokenized sentences, so LineSentence is not needed here
model = gensim.models.Word2Vec(sentences,min_count=1)

2019-04-23 00:31:19,100 : INFO : collecting all words and their counts
2019-04-23 00:31:19,106 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 00:31:19,448 : INFO : collected 12 word types from a corpus of 854 raw words and 61 sentences
2019-04-23 00:31:19,449 : INFO : Loading a fresh vocabulary
2019-04-23 00:31:19,450 : INFO : min_count=1 retains 12 unique words (100% of original 12, drops 0)
2019-04-23 00:31:19,451 : INFO : min_count=1 leaves 854 word corpus (100% of original 854, drops 0)
2019-04-23 00:31:19,452 : INFO : deleting the raw counts dictionary of 12 items
2019-04-23 00:31:19,453 : INFO : sample=0.001 downsamples 12 most-common words
2019-04-23 00:31:19,454 : INFO : downsampling leaves estimated 97 word corpus (11.5% of prior 854)
2019-04-23 00:31:19,455 : INFO : estimated required memory for 12 words and 100 dimensions: 15600 bytes
2019-04-23 00:31:19,456 : INFO : resetting layer weights
2019-04-23 00:31:19,458 : INFO : training mode

In [63]:
# Look up the vector for a word
model.wv['共商']

  


KeyError: "word '共商' not in vocabulary"

This completes the training of a word2vec model.

### Saving and loading the model

**Saving to a file**

In [64]:
model.save('text2.model')    # saved in the current working directory

2019-04-23 00:33:43,714 : INFO : saving Word2Vec object under text2.model, separately None
2019-04-23 00:33:43,714 : INFO : not storing attribute vectors_norm
2019-04-23 00:33:43,715 : INFO : not storing attribute cum_table
2019-04-23 00:33:43,730 : INFO : saved text2.model


**Loading from a file**

In [34]:
model1 = gensim.models.Word2Vec.load('text1.model')

2019-04-23 00:09:54,803 : INFO : loading Word2Vec object from text1.model
2019-04-23 00:09:54,832 : INFO : loading wv recursively from text1.model.wv.* with mmap=None
2019-04-23 00:09:54,833 : INFO : setting ignored attribute vectors_norm to None
2019-04-23 00:09:54,833 : INFO : loading vocabulary recursively from text1.model.vocabulary.* with mmap=None
2019-04-23 00:09:54,833 : INFO : loading trainables recursively from text1.model.trainables.* with mmap=None
2019-04-23 00:09:54,834 : INFO : setting ignored attribute cum_table to None
2019-04-23 00:09:54,834 : INFO : loaded text1.model


In addition, the word vectors can be written to and read from the original word2vec binary file format.

In [37]:
# Write a binary file
model.wv.save_word2vec_format('text.model.bin', binary=True)

2019-04-23 00:11:23,872 : INFO : storing 32x100 projection weights into text.model.bin


In [39]:
# Read the binary file
model2 = gensim.models.KeyedVectors.load_word2vec_format('text.model.bin', binary=True)

2019-04-23 00:12:05,113 : INFO : loading projection weights from text.model.bin
2019-04-23 00:12:05,216 : INFO : loaded (32, 100) matrix from text.model.bin


### online learning

**We can pass more training data to an already trained word2vec object and continue updating the model's parameters:**

In [80]:
model3 = gensim.models.Word2Vec.load('text1.model')

# model3.train(more_sentences)

# Define the dataset
raw_sentences = ["the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]
some_sentences= [s.split() for s in raw_sentences]

model3.train(some_sentences)


2019-04-23 00:47:22,801 : INFO : loading Word2Vec object from text1.model
2019-04-23 00:47:22,803 : INFO : loading wv recursively from text1.model.wv.* with mmap=None
2019-04-23 00:47:22,804 : INFO : setting ignored attribute vectors_norm to None
2019-04-23 00:47:22,804 : INFO : loading vocabulary recursively from text1.model.vocabulary.* with mmap=None
2019-04-23 00:47:22,805 : INFO : loading trainables recursively from text1.model.trainables.* with mmap=None
2019-04-23 00:47:22,805 : INFO : setting ignored attribute cum_table to None
2019-04-23 00:47:22,806 : INFO : loaded text1.model


ValueError: You must specify either total_examples or total_words, for proper job parameters updationand progress calculations. The usual value is total_examples=model.corpus_count.

The call fails because `train()` has to be told how much data it will see and for how many epochs, for example `model3.train(some_sentences, total_examples=len(some_sentences), epochs=model3.epochs)`. Words that are not already in the loaded model's vocabulary are ignored unless the vocabulary is first extended with `model3.build_vocab(some_sentences, update=True)`.

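### External corpora

#### In real training scenarios we usually train on a larger corpus. Taking the text8 corpus from the official Word2Vec examples as an illustration, only the corpus source passed to the model needs to change: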

In [81]:
# Load the official corpus and train on it (the text8 corpus must be downloaded beforehand)
sentences = word2vec.Text8Corpus('text8')
model = word2vec.Word2Vec(sentences, size=200)

2019-04-23 00:50:53,748 : INFO : collecting all words and their counts


FileNotFoundError: [Errno 2] No such file or directory: 'text8'

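### Model evaluation

Word2Vec training is unsupervised, so there are few objective evaluation methods of the kind available in supervised learning, and evaluation depends more on the end application. Google has released a set of about 20,000 syntactic and semantic test examples, each following the pattern "A is to B as C is to D".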

In [None]:
model.wv.evaluate_word_analogies('/tmp/questions-words.txt')
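
The analogy file is plain text: a line starting with `:` opens a named section, and every other line holds four words A B C D. The sketch below parses that layout so the questions can be inspected before running an evaluation. `parse_analogy_lines` is a hypothetical helper, and the sample lines are illustrative lines written in that format rather than a copy of the real file.

```python
def parse_analogy_lines(lines):
    """Group 'A B C D' analogy lines under the ': section' header that precedes them."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            a, b, c, d = line.split()
            sections.setdefault(current, []).append((a, b, c, d))
    return sections

# Illustrative lines in the "A is to B as C is to D" layout
sample_lines = [
    ": capital-common-countries",
    "Athens Greece Berlin Germany",
    "Athens Greece Paris France",
    ": gram3-comparative",
    "bad worse big bigger",
]
analogy_sections = parse_analogy_lines(sample_lines)
for name, questions in analogy_sections.items():
    print(name, len(questions))
```

Each parsed tuple corresponds to one query of the form `most_similar(positive=[B, C], negative=[A])`, which counts as correct when the top result is D.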