word2vec zhwiki, first version
lzhenboy committed Aug 14, 2018
0 parents commit 3dacdd7
Showing 8 changed files with 328 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.idea/
84 changes: 84 additions & 0 deletions README.md
@@ -0,0 +1,84 @@
# word2vec-Chinese
A tutorial for training Chinese word2vec embeddings on the Wikipedia corpus.

word2vec embeddings are a foundation of NLP, so being able to quickly train word vectors that fit your own project's needs is essential.

Note: the main goal of this project is to quickly build general-purpose Chinese word2vec embeddings; notes on the theory behind word2vec may be added later, time permitting. (I am new to NLP; corrections from more experienced readers are very welcome, as are exchanges and discussion.)

## 0. Requirements
* python 3.6<br>
* Dependencies: numpy, gensim, opencc, jieba
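
These can be installed with pip, e.g. `pip install numpy gensim jieba opencc` (note: depending on your environment, the OpenCC binding may instead be published on PyPI as `opencc-python-reimplemented`; adjust the package name as needed).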

## 1. Obtaining a Chinese corpus
To train a good word2vec model, a high-quality Chinese corpus is essential; the most widely used high-quality option is the Chinese Wikipedia dump.
* The Chinese Wikipedia corpus is high quality, broad in coverage, and open. All articles are packaged monthly for download: the latest dump is at https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 (historical versions are available at https://dumps.wikimedia.org/zhwiki/).
* For various reasons, Chinese Wikipedia currently has only about 910,000 articles, whereas Baidu Baike and Hudong Baike each claim over ten million (the English Wikipedia is also far larger). Despite its smaller size, Chinese Wikipedia remains the highest-quality Chinese corpus. (Note: Baidu Baike and Hudong Baike are largely crawled content, and many of their entries are of poor quality.)

## 2. Preprocessing the Chinese corpus
### 2.1 Converting the XML Wiki dump to text
* gensim's WikiCorpus class can process the Wiki dump directly (in its bz2-compressed XML form, no decompression needed); see the script [parse_zhwiki_corpus.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/parse_zhwiki_corpus.py/).<br>
Run the following command to convert the XML dump to txt format:<br>
```bash
python parse_zhwiki_corpus.py -i zhwiki-latest-pages-articles.xml.bz2 -o corpus.zhwiki.txt
```
* The resulting ```corpus.zhwiki.txt``` is 1.04 GB and contains 320,000+ documents (one document per line).
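
A quick sanity check on the result (a minimal sketch, not part of the repo's scripts):

```python
# Peek at the first parsed article; each line of corpus.zhwiki.txt
# holds one document, with tokens joined by single spaces.
with open('corpus.zhwiki.txt', encoding='utf-8') as f:
    print(f.readline()[:100])
```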
### 2.2 Converting Traditional Chinese to Simplified Chinese
* The Wiki corpus contains Traditional Chinese, which the opencc package can convert to Simplified Chinese; see the script [chinese_t2s.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/chinese_t2s.py).<br>
Run the following command to convert the Traditional Chinese in the corpus to Simplified Chinese:<br>
```bash
python chinese_t2s.py -i corpus.zhwiki.txt -o corpus.zhwiki.simplified.txt
```
* This produces the Simplified Chinese Wiki corpus ```corpus.zhwiki.simplified.txt```.
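
The conversion direction can be checked on a toy string, using the same opencc binding as chinese_t2s.py:

```python
from opencc import OpenCC

# 't2s' converts Traditional Chinese to Simplified Chinese.
cc = OpenCC('t2s')
print(cc.convert('中華人民共和國'))  # -> 中华人民共和国
```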

### 2.3 Removing English text and whitespace
* The corpus obtained so far contains a fair amount of English (and a little Japanese, German, etc.). To avoid degrading the trained word vectors, the English characters and whitespace are removed (other languages such as Japanese and German are left for later); see the script [remove_en_blank.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/remove_en_blank.py).<br>
Run the following command to remove English characters and whitespace from the corpus:<br>
```bash
python remove_en_blank.py -i corpus.zhwiki.simplified.txt -o corpus.zhwiki.simplified.done.txt
```
* This produces the Chinese corpus ```corpus.zhwiki.simplified.done.txt``` with English and whitespace removed.
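
To illustrate what this step does, here is the same `[ a-zA-Z]` pattern used by remove_en_blank.py applied to a toy line (digits and full-width punctuation are deliberately left untouched):

```python
import re

# Strip ASCII letters and spaces; everything else stays.
print(re.sub(r'[ a-zA-Z]', '', 'Beijing 北京 is the capital'))  # -> 北京
```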

### 2.4 Chinese word segmentation (jieba)
* The corpus must be segmented into words before word2vec training; python's jieba package is used here; see the script [corpus_zhwiki_seg.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/corpus_zhwiki_seg.py).<br>
Run the following command to segment the Chinese corpus:<br>
```bash
python corpus_zhwiki_seg.py -i corpus.zhwiki.simplified.done.txt -o corpus.zhwiki.segwithb.txt
```
* This produces the segmented Chinese corpus ```corpus.zhwiki.segwithb.txt```.
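
To see what segmented text looks like, jieba can be run on a short sentence (a minimal sketch; the exact split may vary with jieba's version and dictionary):

```python
import jieba

# word2vec expects space-separated tokens, which jieba produces
# from unsegmented Chinese text.
print(' '.join(jieba.cut('北京是中国的首都')))  # e.g.: 北京 是 中国 的 首都
```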


## 3. Training the word2vec model
* gensim's word2vec module makes model training straightforward; see the script [word2vec_train.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/word2vec_train.py).<br>
Run the following command to train the word2vec model and word vectors:<br>
```bash
python word2vec_train.py -i corpus.zhwiki.segwithb.txt -m zhwiki.word2vec.model -v zhwiki.word2vec.vectors -s 400 -w 5 -n 5
```
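Here `-s` is the word vector dimensionality (400), `-w` the context window size (5), and `-n` the minimum word frequency (5), corresponding to the size, window, and min_count options defined in word2vec_train.py.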
* This produces the word2vec model and word vectors trained on the Chinese Wikipedia corpus:<br>
word2vec model files:<br>
(1) ```zhwiki.word2vec.model```<br>
(2) ```zhwiki.word2vec.model.trainables.syn1neg.npy```<br>
(3) ```zhwiki.word2vec.model.wv.vectors.npy```<br>
word2vec vector file:<br>
```zhwiki.word2vec.vectors```
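
The plain-text vector file can also be reloaded on its own, without the full model (a minimal sketch; the file was written with gensim's save_word2vec_format, so KeyedVectors can read it back):

```python
from gensim.models import KeyedVectors

# Load only the word vectors saved in word2vec text format.
wv = KeyedVectors.load_word2vec_format('zhwiki.word2vec.vectors', binary=False)
print(wv['北京'][:10])  # first 10 dimensions of the vector for 北京
```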


## 4. Testing the word2vec model
* Once the model is trained, it can be tested; see the script [word2vec_test.py](https://github.com/lzhenboy/word2vec-Chinese/blob/master/word2vec_test.py).<br>
Example code:<br>
```python
from gensim.models import Word2Vec

word2vec_model = Word2Vec.load('zhwiki.word2vec.model')
# inspect a word vector
print('北京:', word2vec_model['北京'])
# look up the most similar words
sim_words = word2vec_model.most_similar('北京')
for w in sim_words:
    print(w)
```
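
A couple of further checks, continuing from the snippet above (a minimal sketch; newer gensim versions deprecate indexing the model directly and prefer the model.wv accessor):

```python
# Similarity between two words, and an analogy-style query
# (e.g. 日本 + 北京 - 中国 should land near 东京).
print(word2vec_model.wv.similarity('北京', '上海'))
print(word2vec_model.wv.most_similar(positive=['日本', '北京'], negative=['中国'], topn=5))
```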


## References and acknowledgements
1. https://github.com/zishuaiz/ChineseWord2Vec
2. https://www.jianshu.com/p/ec27062bd453
3. https://blog.csdn.net/jdbc/article/details/59483767<br>
P.S. I could not list every reference; if anything is missing, please contact me and I will add it!
57 changes: 57 additions & 0 deletions chinese_t2s.py
@@ -0,0 +1,57 @@
import os, sys
import logging
from optparse import OptionParser
from opencc import OpenCC

def zh_t2s(infile, outfile):
    '''convert the traditional Chinese of infile into the simplified Chinese of outfile'''

    # read the traditional Chinese file
    t_corpus = []
    with open(infile, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.replace('\n', '').replace('\t', '')
            t_corpus.append(line)
    logger.info('read traditional file finished!')

    # convert the traditional Chinese to simplified Chinese
    cc = OpenCC('t2s')
    s_corpus = []
    for i, line in enumerate(t_corpus):
        if i % 1000 == 0:
            logger.info('convert t2s with the {}/{} line'.format(i, len(t_corpus)))
        s_corpus.append(cc.convert(line))
    logger.info('convert t2s finished!')

    # write the simplified Chinese into the outfile
    with open(outfile, 'w', encoding='utf-8') as f:
        for line in s_corpus:
            f.write(line + '\n')
    logger.info('write the simplified file finished!')


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

    logger.info('running ' + program + ' : convert Traditional Chinese to Simplified Chinese')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='input_file', default='corpus.zhwiki.txt', help='traditional file')
    parser.add_option('-o', '--output', dest='output_file', default='corpus.zhwiki.simplified.txt', help='simplified file')
    (options, args) = parser.parse_args()

    input_file = options.input_file
    output_file = options.output_file

    try:
        zh_t2s(infile=input_file, outfile=output_file)
        logger.info('Traditional Chinese to Simplified Chinese Finished')
    except Exception as err:
        logger.error(err)
38 changes: 38 additions & 0 deletions corpus_zhwiki_seg.py
@@ -0,0 +1,38 @@
import os, sys
import logging
from optparse import OptionParser
import jieba

def seg_with_jieba(infile, outfile):
    '''segment the input file with jieba'''
    with open(infile, 'r', encoding='utf-8') as fin, open(outfile, 'w', encoding='utf-8') as fout:
        i = 0
        for line in fin:
            seg_list = jieba.cut(line)
            seg_res = ' '.join(seg_list)
            fout.write(seg_res)
            i += 1
            if i % 1000 == 0:
                logger.info('handling the {} line'.format(i))


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)
    logger.info('running ' + program + ': segmentation of corpus by jieba')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.simplified.done.txt', help='input file to be segmented')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.segwithb.txt', help='output file segmented')
    (options, args) = parser.parse_args()

    infile = options.infile
    outfile = options.outfile

    try:
        seg_with_jieba(infile, outfile)
        logger.info('segment the infile finished')
    except Exception as err:
        logger.error(err)
43 changes: 43 additions & 0 deletions parse_zhwiki_corpus.py
@@ -0,0 +1,43 @@
import logging
import os.path
import sys
from optparse import OptionParser
from gensim.corpora import WikiCorpus


def parse_corpus(infile, outfile):
    '''parse the corpus of the infile into the outfile'''
    space = ' '
    i = 0
    with open(outfile, 'w', encoding='utf-8') as fout:
        wiki = WikiCorpus(infile, lemmatize=False, dictionary={})  # gensim's WikiCorpus handles the Wikipedia dump directly
        for text in wiki.get_texts():
            fout.write(space.join(text) + '\n')
            i += 1
            if i % 10000 == 0:
                logger.info('Saved ' + str(i) + ' articles')
    return i


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)
    logger.info('running ' + program + ': parse the chinese corpus')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='zhwiki-latest-pages-articles.xml.bz2', help='input: Wiki corpus')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.txt', help='output: Wiki corpus')
    (options, args) = parser.parse_args()

    infile = options.infile
    outfile = options.outfile

    try:
        n_articles = parse_corpus(infile, outfile)
        logger.info('Finished: saved ' + str(n_articles) + ' articles')
    except Exception as err:
        logger.error(err)
39 changes: 39 additions & 0 deletions remove_en_blank.py
@@ -0,0 +1,39 @@
import os, sys
import logging
from optparse import OptionParser
import re


def remove_en_blank(infile, outfile):
    '''remove the english words and blanks from infile, and write into outfile'''
    with open(infile, 'r', encoding='utf-8') as fin, open(outfile, 'w', encoding='utf-8') as fout:
        relu = re.compile(r'[ a-zA-Z]')  # matches english chars and blanks
        i = 0
        for line in fin:
            res = relu.sub('', line)
            fout.write(res)
            i += 1
            if i % 1000 == 0:
                logger.info('handling the {} line'.format(i))


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger.info('running ' + program + ': remove english and blank')

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.simplified.txt', help='input file to be preprocessed')
    parser.add_option('-o', '--output', dest='outfile', default='corpus.zhwiki.simplified.done.txt', help='output file with english and blank removed')
    (options, args) = parser.parse_args()

    infile = options.infile
    outfile = options.outfile

    try:
        remove_en_blank(infile, outfile)
        logger.info('remove english and blank finished')
    except Exception as err:
        logger.error(err)
18 changes: 18 additions & 0 deletions word2vec_test.py
@@ -0,0 +1,18 @@
from gensim.models import Word2Vec


if __name__ == '__main__':
    WORD2VEC_MODEL_DIR = './zhwiki.word2vec.model'

    word2vec_model = Word2Vec.load(WORD2VEC_MODEL_DIR)

    # inspect a word vector
    vec = word2vec_model['北京']
    print('北京:', vec)

    # look up the most similar words
    sim_words = word2vec_model.most_similar('北京')
    print('The most similar words: ')
    for w in sim_words:
        print(w)
48 changes: 48 additions & 0 deletions word2vec_train.py
@@ -0,0 +1,48 @@
import os, sys
import logging
import multiprocessing
from optparse import OptionParser
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def word2vec_train(infile, outmodel, outvector, size, window, min_count):
    '''train the word vectors by word2vec'''

    # train model
    model = Word2Vec(LineSentence(infile), size=size, window=window, min_count=min_count,
                     workers=multiprocessing.cpu_count())

    # save model
    model.save(outmodel)
    model.wv.save_word2vec_format(outvector, binary=False)


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(program)

    logger.info('running ' + program)

    # parse the parameters
    parser = OptionParser()
    parser.add_option('-i', '--input', dest='infile', default='corpus.zhwiki.segwithb.txt', help='zhwiki corpus')
    parser.add_option('-m', '--outmodel', dest='wv_model', default='zhwiki.word2vec.model', help='word2vec model')
    parser.add_option('-v', '--outvec', dest='wv_vectors', default='zhwiki.word2vec.vectors', help='word2vec vectors')
    parser.add_option('-s', type='int', dest='size', default=400, help='word vector size')
    parser.add_option('-w', type='int', dest='window', default=5, help='window size')
    parser.add_option('-n', type='int', dest='min_count', default=5, help='min word frequency')
    (options, args) = parser.parse_args()

    infile = options.infile
    outmodel = options.wv_model
    outvec = options.wv_vectors
    vec_size = options.size
    window = options.window
    min_count = options.min_count

    try:
        word2vec_train(infile, outmodel, outvec, vec_size, window, min_count)
        logger.info('word2vec model training finished')
    except Exception as err:
        logger.error(err)