StopWords for Chinese

stopwords_zh

StopWords for Chinese: a collection of Chinese stopwords, intended for removing common, low-information words from text.

Use

You can use the list with jieba or any other Chinese text segmentation tool: after segmenting, check whether each word is in the list and drop it if it is.

Python code:

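#!/usr/bin/env python
# encoding: utf-8
import codecs

import jieba

if __name__ == "__main__":
    str_in = "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"
    # The 'stopwords' file is read as a comma-separated, UTF-8 encoded list.
    with codecs.open('stopwords', 'r', 'utf-8') as f:
        # A set makes each membership test O(1) on average.
        stopwords = set(f.read().split(','))
    # Segment the sentence, then print only the words that are not stopwords.
    seg_list = jieba.cut_for_search(str_in)
    for seg in seg_list:
        if seg not in stopwords:
            print(seg)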
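
The membership check in the script above does not depend on jieba, so it can be packaged as a small helper that works with any segmenter's output. The following is a minimal sketch, not part of this repository: the function names `parse_stopwords` and `remove_stopwords` are hypothetical, the sample stopwords and tokens are made up for illustration, and the comma-separated layout is assumed from the `split(',')` call in the script.

```python
def parse_stopwords(text):
    """Parse a comma-separated stopword string into a set.

    Assumes the comma-separated layout implied by split(',') in the
    script; surrounding whitespace and empty entries are discarded.
    """
    return {w.strip() for w in text.split(',') if w.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stopwords, preserving their order."""
    return [t for t in tokens if t not in stopwords]


# Hypothetical sample data; the real list lives in the 'stopwords' file.
sample_stopwords = parse_stopwords("的,了,于,后,在")
tokens = ["小明", "毕业", "于", "中国科学院", "后", "在", "日本", "深造"]
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)
```

To use it with the real list, pass the contents of the `stopwords` file to `parse_stopwords` and the output of your segmenter (for example, `jieba.cut_for_search`) to `remove_stopwords`.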

Link