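# Assignment-04 Building Word Vectors from Wikipedia

In this chapter, you will use Gensim and Wikipedia to obtain your first set of word vectors and get a feel for the basic process of building them.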

## Step-01: Download Wikipedia Chinese Corpus

Step 1: Download the Chinese corpus from the Wikipedia dumps.

https://dumps.wikimedia.org/zhwiki/20190720/
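
The dump is a bz2-compressed XML file, so it can be inspected without unpacking it completely. The following is a minimal sketch using only the Python standard library; the helper name `peek_bz2` and the tiny demo file are illustrative assumptions, not part of the assignment.

```python
import bz2
import os
import tempfile


def peek_bz2(path, n_lines=3):
    """Return the first n_lines text lines of a .bz2 file, decompressing lazily."""
    lines = []
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            lines.append(line.rstrip("\n"))
            if len(lines) >= n_lines:
                break
    return lines


# Demo on a tiny stand-in file (made-up content, not the real dump).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.xml.bz2")
with bz2.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("<mediawiki>\n  <page>\n    <title>数学</title>\n  </page>\n</mediawiki>\n")

first_lines = peek_bz2(demo_path, n_lines=2)
print(first_lines)
```

Pointing `peek_bz2` at the downloaded dump instead of the demo file is a quick way to confirm that the download is readable before running the extractor.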

## Step-02: Using wikiextractor to extract the wikipedia corpus

Step 2: Use the Python Wikipedia extractor (wikiextractor) to extract the article text from the dump.

https://github.com/attardi/wikiextractor

Run:

```shell
> python WikiExtractor.py -o .\output D:\BaiduYunDownload\维基百科中文20190720\zhwiki-20190720-pages-articles-multistream.xml.bz2
```
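
The extractor writes plain-text files in which each article is wrapped in a `<doc id=... url=... title=...>` line and a closing `</doc>` line. As a rough sketch of how that output could be split back into articles, the snippet below uses a hypothetical helper `iter_articles` and a made-up two-article sample; the real files are much larger.

```python
import re

# Captures the title attribute of an opening <doc ...> line.
DOC_OPEN = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"')


def iter_articles(lines):
    """Yield (title, body_lines) for each <doc>...</doc> block in the input lines."""
    title, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<doc"):
            match = DOC_OPEN.match(line)
            title = match.group(1) if match else ""
            body = []
        elif line.startswith("</doc>"):
            if title is not None:
                yield title, body
            title, body = None, []
        elif line.strip() and title is not None:
            body.append(line)  # keep non-empty lines inside an article


# Made-up sample in the extractor's output layout.
sample = '''<doc id="1" url="https://example.org/1" title="数学">
数学

数学是研究数量、结构以及空间等概念的一门学科。
</doc>
<doc id="2" url="https://example.org/2" title="哲学">
哲学

哲学是研究普遍的、基本问题的学科。
</doc>
'''.splitlines()

articles = list(iter_articles(sample))
for title, body in articles:
    print(title, len(body))
```

The first body line of each article repeats the title, which is why the later preprocessing only needs to skip the lines that start with `<`.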


## Step-03: Using gensim to get word vectors
Reference:

https://radimrehurek.com/gensim/models/word2vec.html

https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

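Step 3: Following the Gensim documentation and the Kaggle reference, obtain the word vectors. Note that you need to use Jieba to segment the Wikipedia text into individual words and save the result to new files. Then read those files with Gensim's LineSentence class.

Finally, train the word vector model.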
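
To make the file format concrete: each line of the saved file holds one sentence, with the segmented words separated by single spaces. The sketch below writes already-segmented sentences in that layout and reads them back with a small standard-library stand-in that mimics roughly what LineSentence yields. The helper names and the sample tokens are assumptions for illustration; in the assignment the tokens come from Jieba.

```python
import os
import tempfile


def write_sentences(path, tokenized_sentences):
    """Write one sentence per line, tokens separated by a single space."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in tokenized_sentences:
            if tokens:  # skip sentences that became empty after filtering
                f.write(" ".join(tokens) + "\n")


def read_sentences(path):
    """Stand-in for LineSentence: yield one list of tokens per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens


# Made-up, already-segmented sentences.
tokenized = [["维基百科", "自由", "百科全书"], [], ["词向量", "表示", "词语"]]

out_path = os.path.join(tempfile.mkdtemp(), "sentences.txt")
write_sentences(out_path, tokenized)
loaded = list(read_sentences(out_path))
print(loaded)
```

The trailing newline after each sentence matters: without it, a whole file collapses into one very long "sentence".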



### 3.1 Cut words

In [36]:
import os
import pandas as pd
import jieba
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


corpus_path = 'E:\\GitHub\\wikiextractor\\output'

In [35]:
def get_all_files(root_path):
    """
    Return the paths of all files under root_path (searched recursively) as a list.
    """
    pathes = []
    # os.walk yields (directory, subdirectories, files) for every directory under root_path
    for root, dirs, files in os.walk(root_path):
        for file in files:
            pathes.append(os.path.join(root, file))
    return pathes

In [42]:
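def preprocess_text(text):
    """
    Preprocess text: drop numbers, blanks, single-character tokens and stopwords.
    Return the list of remaining segments.
    """
    # Note: the stopword file is re-read on every call; for a large corpus, load it once outside the function.
    stopwords=pd.read_csv('.//stopwords.txt',index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
    stopwords=set(stopwords['stopword'].values)  # a set gives fast membership tests
    
    segs = []  # returned as-is if segmentation fails
    try:
        segs = list(jieba.cut(text))
        segs = [v for v in segs if not str(v).isdigit()]  # drop numbers
        segs = list(filter(lambda x:x.strip(), segs))  # drop whitespace-only tokens
        segs = list(filter(lambda x:len(x)>1, segs))  # drop single-character tokens
        segs = list(filter(lambda x:x not in stopwords, segs))  # drop stopwords
    except Exception as e:
        print(e)
    return segs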

In [43]:
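file_pathes = get_all_files(corpus_path)
sentences_path = 'E:\\corpus'

limit = 100  # stop after the file in which this many lines has been exceeded
i = 0        # running count of processed lines; also used to name the output files

for file_path in file_pathes:
    with open(file_path, 'r', encoding='utf-8') as rf:
        with open(os.path.join(sentences_path, str(i)+'.txt'), 'w', encoding='utf-8') as wf:
            for line in rf:
                if line == '\n':
                    continue
                if line[0] == '<':  # skip the <doc ...> and </doc> marker lines
                    continue
                i += 1
                segs = preprocess_text(line)
                if segs:
                    # one sentence per line, as LineSentence expects
                    wf.write(' '.join(segs) + '\n')
    if i > limit:
        break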

In [48]:
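words_files = os.listdir(sentences_path)
for words_file in words_files:
    words_file_dir = os.path.join(sentences_path, words_file)
    sentences = LineSentence(words_file_dir)

    '''
    LineSentence(inp): simple format: one sentence = one line; words are already preprocessed and separated by spaces.
    size: dimensionality of each word vector (renamed vector_size in gensim 4.0+);
    window: maximum distance between the current word and a context word; window=5 considers up to 5 words before and 5 words after;
    min_count: minimum frequency, default 5; a word that occurs fewer than min_count times in the corpus is dropped;
    workers: number of worker threads used for training, default 3. It is enough to remember these parameters for now.
    sg ({0, 1}, optional): training algorithm: 1: skip-gram; 0: CBOW
    alpha (float, optional): initial learning rate
    iter (int, optional): number of passes over the corpus, default 5 (renamed epochs in gensim 4.0+)
    '''
    # Note: this trains a fresh model for every file, so only the model from the last file is kept.
    model = Word2Vec(sentences=sentences, size=100, window=5, min_count=1, sg=1)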
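
Because the loop above builds a separate model per file, one way to train a single model on the whole corpus is to hand Word2Vec one iterable that streams every file in the directory. The class below is a minimal standard-library sketch of such an iterable; the name `DirectorySentences` and the demo files are assumptions for illustration. It is restartable (each `for` loop re-reads the files), which matters because Word2Vec passes over the corpus more than once.

```python
import os
import tempfile


class DirectorySentences:
    """Stream token lists from every file in a directory, one sentence per line."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # Sorted for a stable order; a new pass starts every time iteration begins.
        for name in sorted(os.listdir(self.dir_path)):
            path = os.path.join(self.dir_path, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as f:
                for line in f:
                    tokens = line.split()
                    if tokens:
                        yield tokens


# Demo with two small made-up files.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "0.txt"), "w", encoding="utf-8") as f:
    f.write("数学 研究 数量\n结构 空间 概念\n")
with open(os.path.join(demo_dir, "1.txt"), "w", encoding="utf-8") as f:
    f.write("哲学 研究 基本 问题\n")

corpus = DirectorySentences(demo_dir)
first_pass = list(corpus)
second_pass = list(corpus)  # iterating again re-reads the files
print(len(first_pass), first_pass == second_pass)
```

An instance such as `DirectorySentences(sentences_path)` could then be passed as the `sentences` argument of Word2Vec in place of a single LineSentence. Gensim also ships `PathLineSentences` in `gensim.models.word2vec`, which serves the same purpose for a directory of files.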


In [None]:
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus('/text8')
# size is the vector dimensionality (named vector_size in gensim 4.0+)
model = word2vec.Word2Vec(sentences, sg=1, size=100,  window=5,  min_count=5,  negative=3, sample=0.001, hs=1, workers=4)
model.save('/text82.model')
# word vectors are looked up through model.wv
print(model.wv['man'])

## Step-04: Using some words to test your performance.

Step 4: Test synonyms: pick a few words and check which words the model considers most similar to them.
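
In Gensim this check is usually done with `model.wv.most_similar(word, topn=...)`, which ranks words by the cosine similarity of their vectors. To show what that ranking computes, here is a small pure-Python sketch; the three-dimensional vectors are made up for illustration and are not output of a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up toy vectors: two "royalty" words and two "fruit" words.
toy_vectors = {
    "国王": [0.9, 0.1, 0.0],
    "王后": [0.8, 0.2, 0.1],
    "苹果": [0.0, 0.9, 0.4],
    "香蕉": [0.1, 0.8, 0.5],
}

for query_word in ("国王", "苹果"):
    print(query_word, most_similar(query_word, toy_vectors, topn=2))
```

With a trained model, a reasonable sanity check is that the nearest neighbours of a word are semantically related to it, as the toy royalty and fruit words are here.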

## Step-05: Using visualization tools

https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

Step 5: Use the t-SNE approach from the Kaggle notebook to visualize the word vectors.