# Assignment-04: Building Word Vectors from Wikipedia

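In this chapter, you will use Gensim and Wikipedia to obtain your first set of word vectors and get a feel for the basic process of building them.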

## Step-01: Download Wikipedia Chinese Corpus

Step 1: download the Chinese corpus from the Wikipedia dumps.

https://dumps.wikimedia.org/zhwiki/20190720/
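
The dump can also be fetched from a script. The following is a minimal sketch using only the standard library; the helper names `dump_url` and `download` are hypothetical, and the file name pattern is the one that appears in Step-02. Older dumps may no longer be hosted, so the date may need to be changed.

```python
import shutil
import urllib.request

BASE_URL = "https://dumps.wikimedia.org/zhwiki"


def dump_url(date):
    """Build the URL of the multistream articles dump for a date given as YYYYMMDD."""
    filename = f"zhwiki-{date}-pages-articles-multistream.xml.bz2"
    return f"{BASE_URL}/{date}/{filename}"


def download(url, dest_path):
    """Stream the response to disk so the large dump is never held in memory."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


url = dump_url("20190720")
print(url)
# download(url, "zhwiki-20190720-pages-articles-multistream.xml.bz2")  # uncomment to download
```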

## Step-02: Using wikiextractor to extract the wikipedia corpus

Step 2: use the Python Wikipedia extractor (wikiextractor) to extract the article text from the dump.

https://github.com/attardi/wikiextractor

Run:

```shell
> python WikiExtractor.py -o .\output D:\BaiduYunDownload\维基百科中文20190720\zhwiki-20190720-pages-articles-multistream.xml.bz2
```
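
WikiExtractor writes plain-text files in which each article is wrapped in a `<doc ...>` and `</doc>` pair. The sketch below parses that layout from a made-up sample string, which helps explain why the later preprocessing skips lines starting with `<`. The sample text and the helper `iter_documents` are illustrative assumptions, not part of wikiextractor.

```python
import re

# A made-up sample in the <doc ...> ... </doc> layout that WikiExtractor writes.
SAMPLE = '''<doc id="1" url="https://zh.wikipedia.org/wiki?curid=1" title="示例">
示例

这是第一段。
这是第二段。
</doc>
<doc id="2" url="https://zh.wikipedia.org/wiki?curid=2" title="另一个示例">
另一个示例

只有一段。
</doc>
'''

DOC_OPEN = re.compile(r'<doc id="([^"]*)" url="([^"]*)" title="(?P<title>[^"]*)">')


def iter_documents(lines):
    """Yield (title, [non-blank text lines]) for each <doc> block."""
    title, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            title, body = match.group("title"), []
        elif line.startswith("</doc>"):
            if title is not None:
                yield title, body
            title, body = None, []
        elif line.strip() and title is not None:
            body.append(line)


docs = list(iter_documents(SAMPLE.splitlines()))
for title, body in docs:
    print(title, len(body))
```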


## Step-03: Using gensim to get word vectors
Reference:

https://radimrehurek.com/gensim/models/word2vec.html

https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

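Step 3: following the Gensim documentation and the Kaggle reference notebook, obtain word vectors. Note that you need to use Jieba to segment the Wikipedia text into individual words and save the result to new files. Then read those files with Gensim's LineSentence class.

Finally, train the word vector model.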
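
The file layout that LineSentence reads is simple: one sentence per line, with tokens separated by spaces. The sketch below writes such a file and reads it back with plain Python to show the shape of the data. `write_sentences` and `read_sentences` are hypothetical helpers, not gensim functions, and the tokens are written by hand rather than produced by Jieba.

```python
import os
import tempfile


def write_sentences(path, sentences):
    """Write one sentence per line, tokens joined by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            f.write(" ".join(tokens) + "\n")


def read_sentences(path):
    """Yield each non-empty line as a list of tokens."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens


segmented = [["数学", "研究", "数量", "结构"], ["北京", "中国", "首都"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "0.txt")
    write_sentences(path, segmented)
    restored = list(read_sentences(path))
print(restored)
```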



### 3.1 Cut words

In [49]:
import os
import pandas as pd
import jieba.analyse
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.models.word2vec import PathLineSentences


corpus_path = 'E:\\GitHub\\wikiextractor\\output' #source corpus path
sentences_path = 'E:\\corpus' #words corpus path

In [35]:
def get_all_files(root_path):
    """
    Return the paths of all files under root_path (searched recursively) as a list.
    """
    paths = []
    # Walk root_path itself rather than the global corpus_path.
    for root, dirs, files in os.walk(root_path):
        for file in files:
            paths.append(os.path.join(root, file))
    return paths

In [42]:
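# Load the stopword list once and keep it as a set for fast membership tests.
stopwords = pd.read_csv('.\\stopwords.txt', index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
stopwords = set(stopwords['stopword'].values)

def preprocess_text(text):
    """
    Segment text with jieba and drop digits, whitespace-only tokens,
    single-character tokens, and stopwords.
    Return the remaining segments as a list.
    """
    try:
        segs = list(jieba.cut(text))
        segs = [v for v in segs if not str(v).isdigit()]  # drop pure-digit tokens
        segs = [v for v in segs if v.strip()]  # drop whitespace-only tokens
        segs = [v for v in segs if len(v) > 1]  # drop single-character tokens
        segs = [v for v in segs if v not in stopwords]  # drop stopwords
    except Exception as e:
        print(e)
        segs = []  # return an empty list instead of an undefined name
    return segs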
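
The filtering rules can be tried without Jieba or a stopword file. The sketch below applies the same four rules to a hand-written token list in a single pass; `filter_tokens`, the tokens, and the stopword set are all made up for illustration.

```python
def filter_tokens(tokens, stopwords):
    """Keep tokens that are not digits, not blank, longer than one character, and not stopwords."""
    kept = []
    for token in tokens:
        if token.isdigit():
            continue  # pure-digit token
        if not token.strip():
            continue  # whitespace-only token
        if len(token) <= 1:
            continue  # single-character token
        if token in stopwords:
            continue  # stopword
        kept.append(token)
    return kept


tokens = ["2019", "年", " ", "我们", "使用", "维基百科", "的", "语料", "训练", "词向量"]
stopwords_demo = {"我们", "使用"}
filtered = filter_tokens(tokens, stopwords_demo)
print(filtered)
```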



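In [50]:
# Generate new files containing the segmented text, one sentence per line.
file_pathes = get_all_files(corpus_path)
os.makedirs(sentences_path, exist_ok=True)

limit = 10000  # stop once more than this many lines have been processed
line_count = 0

for file_index, file_path in enumerate(file_pathes):
    out_path = os.path.join(sentences_path, str(file_index) + '.txt')
    with open(file_path, 'r', encoding='utf-8') as rf, open(out_path, 'w', encoding='utf-8') as wf:
        for line in rf:
            if line == '\n':
                continue
            if line[0] == '<':  # skip the <doc ...> and </doc> marker lines
                continue
            line_count += 1
            segs = preprocess_text(line)
            if segs:
                # LineSentence expects one sentence per line
                wf.write(' '.join(segs) + '\n')
    if line_count > limit:
        break

In [51]:
# Read every line of every file under sentences_path.
sentences = PathLineSentences(sentences_path)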

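'''
LineSentence(inp): simple format, one sentence per line; words are already preprocessed and separated by spaces.
PathLineSentences(dir): like LineSentence, but reads every file in a directory.
size: dimensionality of each word vector (renamed vector_size in gensim 4.x).
window: size of the context window used in training; a window of 5 considers up to 5 words before and 5 words after the current word.
min_count: minimum frequency, default 5; words that appear fewer times than this in the corpus are discarded.
workers: number of worker threads used for training; the default is 3.
sg ({0, 1}, optional): training algorithm, 1 for skip-gram, 0 for CBOW.
alpha (float, optional): initial learning rate.
iter (int, optional): number of passes over the corpus, default 5 (renamed epochs in gensim 4.x).
'''
# min_count=1 keeps every word, including very rare ones.
model = Word2Vec(sentences=sentences, size=100, window=5, min_count=1, sg=1)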

model.save(".\\word2vec.model")
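
To make the `window` and `sg=1` settings more concrete, the sketch below enumerates the (center word, context word) pairs that a skip-gram model would be trained on for one short sentence. It only illustrates the idea: `skipgram_pairs` is a hypothetical helper, and real implementations add refinements such as sampling a smaller effective window and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["北京", "中国", "首都", "城市"]
pairs = skipgram_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbors; a larger window adds more distant context words to the training pairs.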



## Step-04: Using some words to test your performance.

Step 4: pick a few words and test the model on them, for example by looking for their synonyms.
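
Testing synonyms comes down to comparing vectors. In gensim this is available as `model.wv.most_similar`, which ranks words by cosine similarity. The sketch below reimplements that idea on made-up three-dimensional vectors so the ranking can be checked by hand; `cosine_similarity`, `most_similar`, and the toy vectors are illustrative assumptions, not output of the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Made-up vectors, chosen so that the first two point in similar directions.
toy_vectors = {
    "中国": [0.9, 0.1, 0.0],
    "北京": [0.8, 0.3, 0.1],
    "袁世凯": [0.1, 0.9, 0.2],
}
ranked = most_similar("中国", toy_vectors)
for word, score in ranked:
    print(word, round(score, 3))
```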

In [54]:
print(model.wv['中国'])  # look vectors up through model.wv; indexing the model directly is deprecated

[ 0.32242677  0.12971713  0.2020709   0.3632374   0.9222486  -0.55811554
 -0.28390476  0.28367355  0.1964107  -0.21977392 -0.06444634  0.16362631
  0.03197181  0.19361655 -0.16852972 -0.01497601  0.87039506 -0.28291765
 -0.06348117  0.3588286  -0.6118939  -0.2945417  -0.46971563 -0.5631982
  0.34107518 -0.08078215 -1.232944    0.1503928  -0.06569257  0.07252707
 -0.8946868   0.5967637   0.5617484   0.36433455 -0.05487395  0.6541698
 -0.13211599 -0.33732408  0.31179795  0.32845244  0.11792107 -0.3170439
 -0.19777887 -0.57849216 -0.15530036  0.445071   -0.23603877 -0.14478168
 -0.3815146   0.8960201   0.298135    0.09621727  0.4851041   0.03529807
 -0.92373455  0.14742832 -0.02569527 -0.21468323  0.3877639  -0.2643373
 -0.19145395 -0.49811006 -0.2772722   0.16501157 -0.17221622 -0.5216831
 -0.05388469  0.6187258  -0.41403374  0.14545973 -0.5051723   0.19265892
 -0.6928579   0.2558732  -0.41839534 -0.7749304   0.08568496 -0.5544341
 -0.5532263   0.22697765 -0.42497575  0.23071884  0.12475


In [55]:
print(model.wv['北京'])

[ 0.5007452   0.10219055 -0.32905862  0.11369596  0.7433472  -0.04048169
 -0.40990755  0.22003876  0.02057978 -0.42945328 -0.16393243 -0.18261424
 -0.22520319  0.14518121 -0.0525115  -0.11810838  0.41022584 -0.14147696
  0.31358632  0.3409524  -0.4367431   0.4189133  -0.04084573 -0.6241634
  0.07543996 -0.06181612 -0.8216611   0.30633673 -0.07451538  0.2397082
 -0.32713944  0.15042588  0.37081626  0.11423954  0.15377311  0.35698316
  0.03006061  0.02773372  0.4609735   0.10990081  0.07908319 -0.16551732
  0.18416736 -0.23363319 -0.48199022  0.00179261  0.30570132 -0.28700334
 -0.14028281  0.47954333 -0.06488842 -0.11705579  0.15717424 -0.00782508
 -0.49156383  0.18056244  0.31305757  0.00255206  0.21876153 -0.18654749
  0.20955521 -0.41260055 -0.34611264  0.25265303 -0.00279441 -0.3569383
 -0.25085923  0.20668027 -0.17850289  0.04928784 -0.51274717 -0.04985979
 -0.3842528   0.20844556 -0.30001938 -0.5342532   0.1900018   0.08399962
 -0.5444809   0.5911435  -0.08231807  0.07196554  0.18


In [56]:
print(model.wv['袁世凯'])

[ 0.18818612  0.05914191 -0.46481448 -0.09066724  0.39920747  0.38367796
 -0.722312    0.05082085  0.07305121 -0.41403073 -0.20223632 -0.13346133
 -0.3583985  -0.05723323  0.04368744  0.24109672  0.13027646 -0.11257724
  0.13885616  0.14313225 -0.23422834  0.18115498  0.28590676 -0.19023846
 -0.03333521 -0.010207   -0.82052076  0.46401957  0.10739937  0.26979136
 -0.1804987   0.5027365   0.46676564  0.13664642  0.28820288  0.23880954
  0.16560265 -0.12516342  0.585944   -0.14872144  0.11497646 -0.30897352
  0.04645425  0.05566444 -0.4432086  -0.14166217  0.33963412 -0.1946155
 -0.24318226  0.8164626  -0.16104457 -0.34987578  0.09355684 -0.20611012
 -0.19640651  0.18445201  0.18920024  0.08458568  0.1703575  -0.40404052
  0.2120697  -0.47506446 -0.46961927  0.44609952  0.35195532 -0.41860363
  0.10224506 -0.01112052  0.38819352  0.1411015  -0.41493973 -0.35139906
 -0.24966563 -0.00126147 -0.57345897 -0.57567614 -0.08771753  0.4757634
 -0.3937756   0.341965   -0.1304231  -0.17436333  0.0


## Step-05: Using visualization tools

https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

Step 5: use the t-SNE approach from the Kaggle notebook to visualize the word vectors.

In [59]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline



In [57]:
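def tsne_plot(model):
    """Create a t-SNE model from the word vectors and plot it."""
    labels = []
    tokens = []

    # model.wv.vocab is the gensim 3.x vocabulary; gensim 4.x exposes model.wv.key_to_index instead.
    for word in model.wv.vocab:
        tokens.append(model.wv[word])
        labels.append(word)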
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [None]:
tsne_plot(model)
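
Because the model was trained with `min_count=1`, the vocabulary can be very large, and running t-SNE over all of it is slow and produces an unreadable plot. One option is to plot only the most frequent words. The sketch below picks those words from segmented sentences; `most_frequent_words` is a hypothetical helper, the sentences are made up, and `tsne_plot` would have to be adapted to accept a word list.

```python
from collections import Counter


def most_frequent_words(sentences, n):
    """Return the n most frequent tokens across all sentences, most frequent first."""
    counts = Counter(token for tokens in sentences for token in tokens)
    return [word for word, _ in counts.most_common(n)]


demo_sentences = [["北京", "中国"], ["中国", "首都"], ["中国", "北京"]]
top_words = most_frequent_words(demo_sentences, 2)
print(top_words)
```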