Commit 3dacdd7 (0 parents): 8 changed files with 328 additions and 0 deletions.
**.gitignore**
.idea/
**README.md**
# word2vec-Chinese

A tutorial for training Chinese word2vec vectors on the Wiki corpus.

word2vec word vectors are a foundation of NLP, so being able to quickly train word vectors that fit your project's needs is essential.

Note: the main goal of this project is to quickly build general-purpose Chinese word2vec vectors; notes on the theory behind word2vec may be added later. (I am an NLP beginner; corrections and discussion are welcome.)

## 0. Requirements
* python 3.6<br>
* Dependencies: numpy, gensim, opencc, jieba
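The dependencies above can typically be installed with pip. The package names below are assumptions based on the import names used by the scripts; in particular, the opencc bindings are published on PyPI under more than one name, so the exact package may differ in your environment:

```shell
pip install numpy gensim opencc jieba
```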

## 1. Obtain a Chinese corpus
Training a good word2vec model requires a high-quality Chinese corpus; the most commonly used one of good quality is the Chinese Wikipedia dump.
* The Chinese Wikipedia corpus is high quality, broad in coverage, and openly available. All articles are packaged monthly for download: the latest dump can be fetched directly from https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 (historical versions are available at https://dumps.wikimedia.org/zhwiki/).
* For various reasons, Chinese Wikipedia currently has only about 910,000 articles, while Baidu Baike and Hudong Baike each have over ten million (English Wikipedia likewise has tens of millions). Although the Chinese Wikipedia corpus is comparatively small, it is still the highest-quality Chinese corpus available. (Note: much of Baidu Baike and Hudong Baike is collected by crawlers, and many entries are low quality.)
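For reference, the dump can also be fetched from the command line; the URL is the one given above, and the availability of wget is an assumption about your environment:

```shell
wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
```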

## 2. Preprocess the corpus
### 2.1 Convert the Wiki XML dump to plain text
* gensim's WikiCorpus class can process the Wiki dump directly (the bz2-compressed XML, no need to decompress); see [parse_zhwiki_corpus.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/parse_zhwiki_corpus.py/).<br>
Run the following command to convert the XML dump to txt format:<br>
```python
python parse_zhwiki_corpus.py -i zhwiki-latest-pages-articles.xml.bz2 -o corpus.zhwiki.txt
```
* The resulting ```corpus.zhwiki.txt``` is 1.04 GB and contains 320,000+ documents (one document per line).
### 2.2 Convert traditional Chinese to simplified Chinese
* The Wiki corpus contains traditional Chinese, which can be converted to simplified Chinese with the opencc package; see [chinese_t2s.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/chinese_t2s.py).<br>
Run the following command to convert the traditional Chinese in the corpus to simplified Chinese:<br>
```python
python chinese_t2s.py -i corpus.zhwiki.txt -o corpus.zhwiki.simplified.txt
```
* This produces the simplified-Chinese Wiki corpus ```corpus.zhwiki.simplified.txt```.

### 2.3 Remove English text and spaces
* The corpus obtained so far contains a lot of English (plus some Japanese, German, etc.). To avoid hurting the quality of the trained word vectors, English characters and spaces are removed (other languages such as Japanese and German may be handled later); see [remove_en_blank.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/remove_en_blank.py).<br>
Run the following command to remove English characters and spaces from the corpus:<br>
```python
python remove_en_blank.py -i corpus.zhwiki.simplified.txt -o corpus.zhwiki.simplified.done.txt
```
* This produces the cleaned Chinese corpus ```corpus.zhwiki.simplified.done.txt```.
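The removal step boils down to a single regular expression; here is a minimal sketch of what remove_en_blank.py applies to each line (the sample sentence is made up for illustration):

```python
import re

# the same pattern the script uses: any English letter or space
pattern = re.compile(r'[ a-zA-Z]')

line = '自然语言处理 NLP 是人工智能的一个方向'
cleaned = pattern.sub('', line)
print(cleaned)  # -> 自然语言处理是人工智能的一个方向
```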

### 2.4 Chinese word segmentation (jieba)
* word2vec training requires a segmented corpus; here we use python's jieba package; see [corpus_zhwiki_seg.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/corpus_zhwiki_seg.py).<br>
Run the following command to segment the Chinese corpus:<br>
```python
python corpus_zhwiki_seg.py -i corpus.zhwiki.simplified.done.txt -o corpus.zhwiki.segwithb.txt
```
* This produces the segmented Chinese corpus ```corpus.zhwiki.segwithb.txt```.

## 3. Train the word2vec model
* python's gensim module provides a word2vec training API that greatly simplifies the training process; see [word2vec_train.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/word2vec_train.py).<br>
Run the following command to train the word2vec model and word vectors:<br>
```python
python word2vec_train.py -i corpus.zhwiki.segwithb.txt -m zhwiki.word2vec.model -v zhwiki.word2vec.vectors -s 400 -w 5 -n 5
```
* This yields a word2vec model and word vectors trained on the Chinese Wiki corpus:<br>
word2vec model files:<br>
(1) ```zhwiki.word2vec.model```<br>
(2) ```zhwiki.word2vec.model.trainables.syn1neg.npy```<br>
(3) ```zhwiki.word2vec.model.wv.vectors.npy```<br>
word2vec vector file:<br>
```zhwiki.word2vec.vectors```
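Since the vectors are saved with `binary=False`, the ```zhwiki.word2vec.vectors``` file is in the standard word2vec text format: a header line with the vocabulary size and vector dimension, then one word per line followed by its components. A stdlib-only sketch of reading such a file (the two-word sample below is synthetic, not from the real corpus):

```python
import io

def load_text_vectors(fileobj):
    """Parse word2vec text format: 'vocab_size dim' header, then 'word v1 v2 ...' lines."""
    header = fileobj.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in fileobj:
        parts = line.rstrip('\n').split(' ')
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim  # each row must match the declared dimension
        vectors[word] = values
    return vocab_size, dim, vectors

# synthetic two-word, three-dimensional example in the same format
sample = io.StringIO('2 3\n北京 0.1 0.2 0.3\n上海 0.4 0.5 0.6\n')
vocab_size, dim, vectors = load_text_vectors(sample)
print(vocab_size, dim, vectors['北京'])
```

In practice you would pass an `open(...)` handle on the real vectors file, or simply load it with gensim.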

## 4. Test the word2vec model
* Once the model is trained, test it; see [word2vec_test.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/word2vec_test.py).<br>
Example code:<br>
```python
from gensim.models import Word2Vec

word2vec_model = Word2Vec.load('zhwiki.word2vec.model')

# look up a word vector
print('北京:', word2vec_model['北京'])

# look up the most similar words
sim_words = word2vec_model.most_similar('北京')
for w in sim_words:
    print(w)
```
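Under the hood, most_similar ranks candidate words by the cosine similarity between their vectors and the query vector. A minimal stdlib sketch of that computation (the two toy vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 2.0]
v2 = [2.0, 4.0, 4.0]  # same direction as v1, twice the length
print(cosine_similarity(v1, v2))  # -> 1.0 (parallel vectors)
```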

## References and acknowledgments
1. https://github.com/zishuaiz/ChineseWord2Vec
2. https://www.jianshu.com/p/ec27062bd453
3. https://blog.csdn.net/jdbc/article/details/59483767

PS: not every reference could be listed; please contact me if anything should be added.
**chinese_t2s.py**
```python
import os
import sys
import logging
from optparse import OptionParser

from opencc import OpenCC


def zh_t2s(infile, outfile):
    '''Convert the traditional Chinese in infile into simplified Chinese in outfile.'''
    # read the traditional Chinese file
    t_corpus = []
    with open(infile, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.replace('\n', '').replace('\t', '')
            t_corpus.append(line)
    logger.info('read traditional file finished!')

    # convert traditional Chinese to simplified Chinese
    cc = OpenCC('t2s')
    s_corpus = []
    for i, line in enumerate(t_corpus):
        if i % 1000 == 0:
            logger.info('convert t2s with the {}/{} line'.format(i, len(t_corpus)))
        s_corpus.append(cc.convert(line))
    logger.info('convert t2s finished!')

    # write the simplified Chinese into the outfile
    with open(outfile, 'w', encoding='utf-8') as f:
        for line in s_corpus:
            f.write(line + '\n')
    logger.info('write the simplified file finished!')


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger.info('running ' + program + ': convert Traditional Chinese to Simplified Chinese')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='input_file', default='corpus.zhwiki.txt', help='traditional file')
    parser.add_option('-o', '--output', dest='output_file', default='corpus.zhwiki.simplified.txt', help='simplified file')
    (options, args) = parser.parse_args()

    input_file = options.input_file
    output_file = options.output_file

    try:
        zh_t2s(infile=input_file, outfile=output_file)
        logger.info('Traditional Chinese to Simplified Chinese finished')
    except Exception as err:
        logger.error(err)
```
**corpus_zhwiki_seg.py**
```python
import os
import sys
import logging
from optparse import OptionParser

import jieba


def seg_with_jieba(infile, outfile):
    '''Segment the input file with jieba.'''
    with open(infile, 'r', encoding='utf-8') as fin, open(outfile, 'w', encoding='utf-8') as fout:
        i = 0
        for line in fin:
            seg_list = jieba.cut(line)
            seg_res = ' '.join(seg_list)
            fout.write(seg_res)
            i += 1
            if i % 1000 == 0:
                logger.info('handling the {} line'.format(i))


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)
    logger.info('running ' + program + ': segmentation of corpus by jieba')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.simplified.done.txt', help='input file to be segmented')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.segwithb.txt', help='segmented output file')
    (options, args) = parser.parse_args()
    infile = options.infile
    outfile = options.outfile

    try:
        seg_with_jieba(infile, outfile)
        logger.info('segmenting the infile finished')
    except Exception as err:
        logger.error(err)
```
**parse_zhwiki_corpus.py**
```python
import logging
import os.path
import sys
from optparse import OptionParser

from gensim.corpora import WikiCorpus


def parse_corpus(infile, outfile):
    '''Parse the Wikipedia dump in infile into plain text in outfile, one article per line.'''
    space = ' '
    i = 0
    with open(outfile, 'w', encoding='utf-8') as fout:
        wiki = WikiCorpus(infile, lemmatize=False, dictionary={})  # gensim's wrapper around the Wikipedia dump
        for text in wiki.get_texts():
            fout.write(space.join(text) + '\n')
            i += 1
            if i % 10000 == 0:
                logger.info('Saved ' + str(i) + ' articles')
    return i


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)
    logger.info('running ' + program + ': parse the chinese corpus')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='zhwiki-latest-pages-articles.xml.bz2', help='input: Wiki corpus')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.txt', help='output: Wiki corpus')
    (options, args) = parser.parse_args()

    infile = options.infile
    outfile = options.outfile

    try:
        n_articles = parse_corpus(infile, outfile)
        logger.info('Finished: saved ' + str(n_articles) + ' articles')
    except Exception as err:
        logger.error(err)
```
**remove_en_blank.py**
```python
import os
import sys
import logging
import re
from optparse import OptionParser


def remove_en_blank(infile, outfile):
    '''Remove English letters and spaces from infile and write the result to outfile.'''
    with open(infile, 'r', encoding='utf-8') as fin, open(outfile, 'w', encoding='utf-8') as fout:
        relu = re.compile(r'[ a-zA-Z]')  # matches English letters and spaces
        i = 0
        for line in fin:
            res = relu.sub('', line)
            fout.write(res)
            i += 1
            if i % 1000 == 0:
                logger.info('handling the {} line'.format(i))


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger.info('running ' + program + ': remove english and blank')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.simplified.txt', help='input file to be preprocessed')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.simplified.done.txt', help='output file with english and blank removed')
    (options, args) = parser.parse_args()

    infile = options.infile
    outfile = options.outfile

    try:
        remove_en_blank(infile, outfile)
        logger.info('removing english and blank finished')
    except Exception as err:
        logger.error(err)
```
**word2vec_test.py**
```python
from gensim.models import Word2Vec

if __name__ == '__main__':
    WORD2VEC_MODEL_DIR = './zhwiki.word2vec.model'

    word2vec_model = Word2Vec.load(WORD2VEC_MODEL_DIR)

    # look up a word vector
    vec = word2vec_model['北京']
    print('北京:', vec)

    # look up the most similar words
    sim_words = word2vec_model.most_similar('北京')
    print('The most similar words: ')
    for w in sim_words:
        print(w)
```
**word2vec_train.py**
```python
import os
import sys
import logging
import multiprocessing
from optparse import OptionParser

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


def word2vec_train(infile, outmodel, outvector, size, window, min_count):
    '''Train word vectors with word2vec.'''
    # train the model on the segmented corpus (one document per line)
    model = Word2Vec(LineSentence(infile), size=size, window=window, min_count=min_count,
                     workers=multiprocessing.cpu_count())

    # save the model and the word vectors (text format)
    model.save(outmodel)
    model.wv.save_word2vec_format(outvector, binary=False)


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)
    logger.info('running ' + program)

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.segwithb.txt', help='zhwiki corpus')
    parser.add_option('-m', '--outmodel', dest='wv_model', default='zhwiki.word2vec.model', help='word2vec model')
    parser.add_option('-v', '--outvec', dest='wv_vectors', default='zhwiki.word2vec.vectors', help='word2vec vectors')
    parser.add_option('-s', type='int', dest='size', default=400, help='word vector size')
    parser.add_option('-w', type='int', dest='window', default=5, help='window size')
    parser.add_option('-n', type='int', dest='min_count', default=5, help='min word frequency')
    (options, argv) = parser.parse_args()

    infile = options.infile
    outmodel = options.wv_model
    outvec = options.wv_vectors
    vec_size = options.size
    window = options.window
    min_count = options.min_count

    try:
        word2vec_train(infile, outmodel, outvec, vec_size, window, min_count)
        logger.info('word2vec model training finished')
    except Exception as err:
        logger.error(err)
```