Chinese Wikipedia corpus creator

Workflow and scripts that help users create a Chinese Wikipedia corpus easily from scratch.

Getting Started

Clone or download this repo to your local filesystem.

Prerequisites

Python 3.4+ is supported; Python 2 is not supported.
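As a quick way to confirm the interpreter meets this requirement before running anything else, a minimal sketch follows. The helper name `python_is_supported` and the constant `MIN_VERSION` are hypothetical and not part of this repo.

```python
import sys

# Minimum version taken from the prerequisite stated above (assumption: 3.4).
MIN_VERSION = (3, 4)


def python_is_supported(version_info=sys.version_info):
    """Return True when the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    if not python_is_supported():
        # Exit with a non-zero status and a readable message.
        sys.exit("Python 3.4+ is required, found %d.%d" % sys.version_info[:2])
    print("Python version OK")
```

Running the file with the interpreter you intend to use either prints a confirmation or exits with an error message.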

For Ubuntu/Debian users

The script install_dependencies_on_ubunut.bash installs everything for you.

For users of other operating systems

Python packages

Install the requirements with:

pip install -r ./requirements.txt

Non-Python packages

OpenCC is required. Users must install it themselves.

On Ubuntu/Debian, OpenCC can be installed with apt:

sudo apt-get install opencc
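To check that the installation succeeded, the sketch below looks for the `opencc` executable on the `PATH`. It assumes the package installs a command-line tool named `opencc`; the helper name `find_tool` is hypothetical.

```python
import shutil


def find_tool(name="opencc"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("opencc was not found on PATH; install it before running the scripts")
    else:
        print("opencc found at", path)
```

This only confirms that the executable is reachable; it does not verify the OpenCC version or its configuration files.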

Usage

All-in-one script

allinone_process.bash

Manual running

See workflow.

TODO

Jieba's segmentation model performs poorly. Replace it with LTP or THULAC, preferring THULAC because it is open-source software.