word2vec zhwiki, first version
lzhenboy committed Aug 14, 2018
0 parents commit 3dacdd7
Showing 8 changed files with 328 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.idea/
84 changes: 84 additions & 0 deletions README.md
@@ -0,0 +1,84 @@
# word2vec-Chinese
A tutorial for training Chinese word2vec embeddings on the Wikipedia corpus.

word2vec embeddings are a foundation of NLP, so being able to quickly train word vectors that fit your own project's needs is essential.

Note: the main goal of this project is to quickly build general-purpose Chinese word2vec embeddings; notes on the theory behind word2vec may be added later, time permitting. (I am new to NLP; corrections from more experienced readers are very welcome, as are exchanges and discussion.)

## 0. Requirements
* python 3.6<br>
* Dependencies: numpy, gensim, opencc, jieba
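
These can be installed with pip, e.g. `pip install numpy gensim jieba opencc` (note: depending on your environment, the OpenCC binding may instead be published on PyPI as `opencc-python-reimplemented`; adjust the package name as needed).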

## 1. Obtaining a Chinese corpus
To train a good word2vec model, a high-quality Chinese corpus is essential; the most widely used high-quality option is the Chinese Wikipedia dump.
* The Chinese Wikipedia corpus is high quality, broad in coverage, and open. All articles are packaged monthly for download: the latest dump is at https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 (historical versions are available at https://dumps.wikimedia.org/zhwiki/).
* For various reasons, Chinese Wikipedia currently has only about 910,000 articles, whereas Baidu Baike and Hudong Baike each claim over ten million (the English Wikipedia is also far larger). Despite its smaller size, Chinese Wikipedia remains the highest-quality Chinese corpus. (Note: Baidu Baike and Hudong Baike are largely crawled content, and many of their entries are of poor quality.)

## 2. Preprocessing the Chinese corpus
### 2.1 Converting the XML Wiki dump to text
* gensim's WikiCorpus class can process the Wiki dump directly (in its bz2-compressed XML form, no decompression needed); see the script [parse_zhwiki_corpus.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/parse_zhwiki_corpus.py/).<br>
Run the following command to convert the XML dump to txt format:<br>
```bash
python parse_zhwiki_corpus.py -i zhwiki-latest-pages-articles.xml.bz2 -o corpus.zhwiki.txt
```
* The resulting ```corpus.zhwiki.txt``` is 1.04 GB and contains 320,000+ documents (one document per line).
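
A quick sanity check on the result (a minimal sketch, not part of the repo's scripts):

```python
# Peek at the first parsed article; each line of corpus.zhwiki.txt
# holds one document, with tokens joined by single spaces.
with open('corpus.zhwiki.txt', encoding='utf-8') as f:
    print(f.readline()[:100])
```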
### 2.2 Converting Traditional Chinese to Simplified Chinese
* The Wiki corpus contains Traditional Chinese, which the opencc package can convert to Simplified Chinese; see the script [chinese_t2s.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/chinese_t2s.py).<br>
Run the following command to convert the Traditional Chinese in the corpus to Simplified Chinese:<br>
```bash
python chinese_t2s.py -i corpus.zhwiki.txt -o corpus.zhwiki.simplified.txt
```
* This produces the Simplified Chinese Wiki corpus ```corpus.zhwiki.simplified.txt```.
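
The conversion direction can be checked on a toy string, using the same opencc binding as chinese_t2s.py:

```python
from opencc import OpenCC

# 't2s' converts Traditional Chinese to Simplified Chinese.
cc = OpenCC('t2s')
print(cc.convert('中華人民共和國'))  # -> 中华人民共和国
```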

### 2.3 Removing English text and whitespace
* The corpus obtained so far contains a fair amount of English (and a little Japanese, German, etc.). To avoid degrading the trained word vectors, the English characters and whitespace are removed (other languages such as Japanese and German are left for later); see the script [remove_en_blank.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/remove_en_blank.py).<br>
Run the following command to remove English characters and whitespace from the corpus:<br>
```bash
python remove_en_blank.py -i corpus.zhwiki.simplified.txt -o corpus.zhwiki.simplified.done.txt
```
* This produces the Chinese corpus ```corpus.zhwiki.simplified.done.txt``` with English and whitespace removed.
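
To illustrate what this step does, here is the same `[ a-zA-Z]` pattern used by remove_en_blank.py applied to a toy line (digits and full-width punctuation are deliberately left untouched):

```python
import re

# Strip ASCII letters and spaces; everything else stays.
print(re.sub(r'[ a-zA-Z]', '', 'Beijing 北京 is the capital'))  # -> 北京
```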

### 2.4 Chinese word segmentation (jieba)
* The corpus must be segmented into words before word2vec training; python's jieba package is used here; see the script [corpus_zhwiki_seg.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/corpus_zhwiki_seg.py).<br>
Run the following command to segment the Chinese corpus:<br>
```bash
python corpus_zhwiki_seg.py -i corpus.zhwiki.simplified.done.txt -o corpus.zhwiki.segwithb.txt
```
* This produces the segmented Chinese corpus ```corpus.zhwiki.segwithb.txt```.
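
To see what segmented text looks like, jieba can be run on a short sentence (a minimal sketch; the exact split may vary with jieba's version and dictionary):

```python
import jieba

# word2vec expects space-separated tokens, which jieba produces
# from unsegmented Chinese text.
print(' '.join(jieba.cut('北京是中国的首都')))  # e.g.: 北京 是 中国 的 首都
```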


## 3. Training the word2vec model
* gensim's word2vec module makes model training straightforward; see the script [word2vec_train.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/word2vec_train.py).<br>
Run the following command to train the word2vec model and word vectors:<br>
```bash
python word2vec_train.py -i corpus.zhwiki.segwithb.txt -m zhwiki.word2vec.model -v zhwiki.word2vec.vectors -s 400 -w 5 -n 5
```
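Here `-s` is the word vector dimensionality (400), `-w` the context window size (5), and `-n` the minimum word frequency (5), corresponding to the size, window, and min_count options defined in word2vec_train.py.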
* This produces the word2vec model and word vectors trained on the Chinese Wikipedia corpus:<br>
word2vec model files:<br>
(1) ```zhwiki.word2vec.model```<br>
(2) ```zhwiki.word2vec.model.trainables.syn1neg.npy```<br>
(3) ```zhwiki.word2vec.model.wv.vectors.npy```<br>
word2vec vector file:<br>
```zhwiki.word2vec.vectors```
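
The plain-text vector file can also be reloaded on its own, without the full model (a minimal sketch; the file was written with gensim's save_word2vec_format, so KeyedVectors can read it back):

```python
from gensim.models import KeyedVectors

# Load only the word vectors saved in word2vec text format.
wv = KeyedVectors.load_word2vec_format('zhwiki.word2vec.vectors', binary=False)
print(wv['北京'][:10])  # first 10 dimensions of the vector for 北京
```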


## 4. Testing the word2vec model
* Once the model is trained, it can be tested; see the script [word2vec_test.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/word2vec_test.py).<br>
Example code:<br>
```python
from gensim.models import Word2Vec

word2vec_model = Word2Vec.load('zhwiki.word2vec.model')
# inspect a word vector
print('北京:', word2vec_model['北京'])
# look up the most similar words
sim_words = word2vec_model.most_similar('北京')
for w in sim_words:
    print(w)
```
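
A couple of further checks, continuing from the snippet above (a minimal sketch; newer gensim versions deprecate indexing the model directly and prefer the model.wv accessor):

```python
# Similarity between two words, and an analogy-style query
# (e.g. 日本 + 北京 - 中国 should land near 东京).
print(word2vec_model.wv.similarity('北京', '上海'))
print(word2vec_model.wv.most_similar(positive=['日本', '北京'], negative=['中国'], topn=5))
```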


## References and acknowledgements
1. https://github.com/zishuaiz/ChineseWord2Vec
2. https://www.jianshu.com/p/ec27062bd453
3. https://blog.csdn.net/jdbc/article/details/59483767<br>
P.S. I could not list every reference; if anything is missing, please contact me and I will add it!
57 changes: 57 additions & 0 deletions chinese_t2s.py
@@ -0,0 +1,57 @@
import os, sys
import logging
from optparse import OptionParser
from opencc import OpenCC

def zh_t2s(infile, outfile):
    '''convert the traditional Chinese of infile into the simplified Chinese of outfile'''

    # read the traditional Chinese file
    t_corpus = []
    with open(infile, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.replace('\n', '').replace('\t', '')
            t_corpus.append(line)
    logger.info('read traditional file finished!')

    # convert the traditional Chinese to simplified Chinese
    cc = OpenCC('t2s')
    s_corpus = []
    for i, line in enumerate(t_corpus):
        if i % 1000 == 0:
            logger.info('convert t2s with the {}/{} line'.format(i, len(t_corpus)))
        s_corpus.append(cc.convert(line))
    logger.info('convert t2s finished!')

    # write the simplified Chinese into the outfile
    with open(outfile, 'w', encoding='utf-8') as f:
        for line in s_corpus:
            f.write(line + '\n')
    logger.info('write the simplified file finished!')


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

    logger.info('running ' + program + ' : convert Traditional Chinese to Simplified Chinese')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='input_file', default='corpus.zhwiki.txt', help='traditional file')
    parser.add_option('-o', '--output', dest='output_file', default='corpus.zhwiki.simplified.txt', help='simplified file')
    (options, args) = parser.parse_args()

    input_file = options.input_file
    output_file = options.output_file

    try:
        zh_t2s(infile=input_file, outfile=output_file)
        logger.info('Traditional Chinese to Simplified Chinese Finished')
    except Exception as err:
        logger.error(err)
38 changes: 38 additions & 0 deletions corpus_zhwiki_seg.py
@@ -0,0 +1,38 @@
import os, sys
import logging
from optparse import OptionParser
import jieba

def seg_with_jieba(infile, outfile):
    '''segment the input file with jieba'''
    with open(infile, 'r', encoding='utf-8') as fin, open(outfile, 'w', encoding='utf-8') as fout:
        i = 0
        for line in fin:
            seg_list = jieba.cut(line)
            seg_res = ' '.join(seg_list)
            fout.write(seg_res)
            i += 1
            if i % 1000 == 0:
                logger.info('handling the {} line'.format(i))


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)
    logger.info('running ' + program + ': segmentation of corpus by jieba')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.simplified.done.txt', help='input file to be segmented')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.segwithb.txt', help='output file segmented')
    (options, args) = parser.parse_args()

    infile = options.infile
    outfile = options.outfile

    try:
        seg_with_jieba(infile, outfile)
        logger.info('segment the infile finished')
    except Exception as err:
        logger.error(err)
43 changes: 43 additions & 0 deletions parse_zhwiki_corpus.py
@@ -0,0 +1,43 @@
import logging
import os.path
import sys
from optparse import OptionParser
from gensim.corpora import WikiCorpus


def parse_corpus(infile, outfile):
    '''parse the corpus of the infile into the outfile'''
    space = ' '
    i = 0
    with open(outfile, 'w', encoding='utf-8') as fout:
        wiki = WikiCorpus(infile, lemmatize=False, dictionary={})  # gensim's WikiCorpus handles the Wikipedia dump directly
        for text in wiki.get_texts():
            fout.write(space.join(text) + '\n')
            i += 1
            if i % 10000 == 0:
                logger.info('Saved ' + str(i) + ' articles')
    return i


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)
    logger.info('running ' + program + ': parse the chinese corpus')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='zhwiki-latest-pages-articles.xml.bz2', help='input: Wiki corpus')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.txt', help='output: Wiki corpus')
    (options, args) = parser.parse_args()

    infile = options.infile
    outfile = options.outfile

    try:
        n_articles = parse_corpus(infile, outfile)
        logger.info('Finished: saved ' + str(n_articles) + ' articles')
    except Exception as err:
        logger.error(err)
39 changes: 39 additions & 0 deletions remove_en_blank.py
@@ -0,0 +1,39 @@
import os, sys
import logging
from optparse import OptionParser
import re


def remove_en_blank(infile, outfile):
    '''remove the english words and blanks from infile, and write into outfile'''
    with open(infile, 'r', encoding='utf-8') as fin, open(outfile, 'w', encoding='utf-8') as fout:
        relu = re.compile(r'[ a-zA-Z]')  # matches english chars and blanks
        i = 0
        for line in fin:
            res = relu.sub('', line)
            fout.write(res)
            i += 1
            if i % 1000 == 0:
                logger.info('handling the {} line'.format(i))


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger.info('running ' + program + ': remove english and blank')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.simplified.txt', help='input file to be preprocessed')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.simplified.done.txt', help='output file with english and blank removed')
    (options, args) = parser.parse_args()

    infile = options.infile
    outfile = options.outfile

    try:
        remove_en_blank(infile, outfile)
        logger.info('remove english and blank finished')
    except Exception as err:
        logger.error(err)
18 changes: 18 additions & 0 deletions word2vec_test.py
@@ -0,0 +1,18 @@
from gensim.models import Word2Vec


if __name__ == '__main__':
    WORD2VEC_MODEL_DIR = './zhwiki.word2vec.model'

    word2vec_model = Word2Vec.load(WORD2VEC_MODEL_DIR)

    # inspect a word vector
    vec = word2vec_model['北京']
    print('北京:', vec)

    # look up the most similar words
    sim_words = word2vec_model.most_similar('北京')
    print('The most similar words: ')
    for w in sim_words:
        print(w)
48 changes: 48 additions & 0 deletions word2vec_train.py
@@ -0,0 +1,48 @@
import os, sys
import logging
import multiprocessing
from optparse import OptionParser
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def word2vec_train(infile, outmodel, outvector, size, window, min_count):
    '''train the word vectors by word2vec'''

    # train model
    model = Word2Vec(LineSentence(infile), size=size, window=window, min_count=min_count,
                     workers=multiprocessing.cpu_count())

    # save model
    model.save(outmodel)
    model.wv.save_word2vec_format(outvector, binary=False)


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)

    logger.info('running ' + program)

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.segwithb.txt', help='zhwiki corpus')
    parser.add_option('-m', '--outmodel', dest='wv_model', default='zhwiki.word2vec.model', help='word2vec model')
    parser.add_option('-v', '--outvec', dest='wv_vectors', default='zhwiki.word2vec.vectors', help='word2vec vectors')
    parser.add_option('-s', type='int', dest='size', default=400, help='word vector size')
    parser.add_option('-w', type='int', dest='window', default=5, help='window size')
    parser.add_option('-n', type='int', dest='min_count', default=5, help='min word frequency')
    (options, args) = parser.parse_args()

    infile = options.infile
    outmodel = options.wv_model
    outvec = options.wv_vectors
    vec_size = options.size
    window = options.window
    min_count = options.min_count

    try:
        word2vec_train(infile, outmodel, outvec, vec_size, window, min_count)
        logger.info('word2vec model training finished')
    except Exception as err:
        logger.error(err)