RichWordSegmentor is a package for word segmentation using transition-based neural networks, built on the LibN3L package. It is a state-of-the-art neural word segmentor that supports rich pretraining from external data. With rich pretraining, our model achieves the best result on 5 out of 6 Chinese word segmentation benchmarks. Performance details and the model structure are described in our ACL paper: Neural word segmentation with rich pretraining.

Demo system:

  • Download the LibN3L library and configure your system, following the LibN3L instructions.
  • Open CMakeLists.txt and change " ../LibN3L/" to the directory of your LibN3L package.
  • Run the file: sh (this demo script does not load the pretrained char/bichar embeddings.)

The demo system includes the Chinese word segmentation sample data "train.debug", "dev.debug" and "test.debug", the Chinese word embedding sample file "ctb.50d.word.debug", the Chinese char and char bigram pretrained embedding sample files "char.emb" and "bichar.emb", and the parameter setting file "option.STD". All of these files are in the folder RichWordSegmentor/example.


cmake .

Training model:
./STDSeg -l -train ${} -dev ${} -test ${} -option ${option.file} -model ${save_model_to_file} -word ${pretrain_word_emb, optional} -char ${pretrain_char_emb, optional} -bichar ${pretrain_bichar_emb, optional} -numlayer ${pretrain_parameters, optional}

Load model:
./STDSeg -test ${} -model ${load_model_file} -output ${output_file}


  1. To evaluate model performance, use a file with words separated by spaces and one sentence per line. For example:

    就 做 了 一点 微小 的 工作 , 谢谢 大家 。
    一个人 的 命运 啊 , 当然 要 靠 自我 奋斗 , 但是 也要 考虑 到 历史 的 行程 。

    The P/R/F scores are then calculated automatically.

  2. For raw text decoding, put one sentence on each line (without spaces).
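
To illustrate how the two formats relate, a segmented file can be turned into raw decoding input by stripping the spaces. This is a minimal sketch and not part of RichWordSegmentor; the function name `to_raw` is hypothetical.

```python
def to_raw(segmented_line: str) -> str:
    """Remove all whitespace so a segmented sentence becomes raw text."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty strings, so stray double spaces are harmless.
    return "".join(segmented_line.split())


if __name__ == "__main__":
    print(to_raw("就 做 了 一点 微小 的 工作"))
```

Note that this removes every space, so it only suits text such as Chinese, where spaces carry no meaning in the raw input.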



The format is the same as the training data: words separated by spaces, one sentence per line.

就 做 了 一点 微小 的 工作 , 谢谢 大家 。
一个人 的 命运 啊 , 当然 要 靠 自我 奋斗 , 但是 也要 考虑 到 历史 的 行程 。
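
Because gold data and segmented output share this space-separated format, word-level P/R/F can be computed by comparing character spans. The sketch below shows one common way to do this. It is an illustration only: the helper names `word_spans` and `prf` are hypothetical, and the scoring inside STDSeg may differ in detail.

```python
from typing import List, Set, Tuple


def word_spans(segmented_line: str) -> Set[Tuple[int, int]]:
    """Map a space-separated sentence to a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in segmented_line.split():
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def prf(gold_lines: List[str], pred_lines: List[str]) -> Tuple[float, float, float]:
    """Return word-level (precision, recall, F-measure) over paired sentences."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_lines, pred_lines):
        g, p = word_spans(gold), word_spans(pred)
        correct += len(g & p)  # a word is correct only if both boundaries match
        gold_total += len(g)
        pred_total += len(p)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    total = gold_total + pred_total
    f_measure = 2 * correct / total if total else 0.0
    return precision, recall, f_measure
```

For example, scoring the prediction "一个 人 的 命运" against the gold line "一个人 的 命运" counts two correct words out of four predicted and three gold words.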

Trained model/embeddings/parameters of rich pretraining and baseline:

We share our trained model on BaiduPan so that visitors can reproduce our results.

  1. File ctb.bilstm.joint4.model: the model trained on the CTB6.0 corpus using multitask pretraining. You can simply load this file to decode raw text without training. Run:

    ./STDSeg -test ${input_raw_text} -model ctb.bilstm.joint4.model -output ${output_segmentated_text}

  2. Files joint4.all.b10c1.2h.iter17.mchar, .mbichar and .pmodel are the pretrained character embeddings, character bigram embeddings and representation parameters, respectively. If you want to train your own model, you can load these three files following the instructions above.

  3. File: gigaword_chn.all.a2b.uni.ite50.vec, and ctb.50d.vec are the char, bichar and word embeddings of our baseline, respectively.

  4. If you want to run the rich pretraining experiments (to generate the three files in item 2), please refer to TrainEmbMultiTask.

Monitoring information

While this word segmentation system is running, it may print the following log information:

Iter 13 finished. Total time taken is: 1260.37s
Recall: P=57508/59929=0.959602, Accuracy: P=57508/59723=0.962912, Fmeasure: 0.961254
Decode dev finished. Total time taken is: 96.299s
Recall: P=77895/81579=0.954841, Accuracy: P=77895/81159=0.959783, Fmeasure: 0.957306
Decode test finished. Total time taken is: 128.9s
Exceeds best previous performance of 0.960922. Saving model file..

The first "Recall..." line shows performance on the dev set and the second "Recall..." line shows performance on the test set.
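
The score lines in the log above can be read programmatically, for instance to track dev and test F-measure across iterations. The sketch below parses one such line and recomputes the scores from its counts. It assumes, based on the labels and the arithmetic in the sample log, that the first denominator is the gold word count and the second is the predicted word count, so that the value labelled "Accuracy" plays the role of precision. The name `parse_score_line` is hypothetical.

```python
import re
from typing import Dict, Optional

# Matches lines such as:
# Recall: P=57508/59929=0.959602, Accuracy: P=57508/59723=0.962912, Fmeasure: 0.961254
SCORE_LINE = re.compile(
    r"Recall: P=(\d+)/(\d+)=[\d.]+, Accuracy: P=(\d+)/(\d+)=[\d.]+, Fmeasure: ([\d.]+)"
)


def parse_score_line(line: str) -> Optional[Dict[str, float]]:
    """Return recall, precision and F-measure recomputed from the counts, or None."""
    match = SCORE_LINE.search(line)
    if match is None:
        return None
    correct, gold = int(match.group(1)), int(match.group(2))
    predicted = int(match.group(4))  # assumption: words output by the system
    recall = correct / gold
    precision = correct / predicted
    # Harmonic mean of precision and recall, written in terms of raw counts.
    f_measure = 2 * correct / (gold + predicted)
    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "reported_f": float(match.group(5)),
    }
```

On the first sample line, the recomputed F-measure agrees with the reported value to six decimal places, which supports the reading of the counts assumed here.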


  • The current version is only compatible with LibN3L versions after Dec. 10th, 2015, which contain the model saving and loading module.
  • The example files only verify that the code runs. For copyright reasons, we include only a few hundred sentences as examples, so the results on these example datasets do not represent real performance on large datasets.


  author    = {Yang, Jie  and  Zhang, Yue  and  Dong, Fei},
  title     = {Neural Word Segmentation with Rich Pretraining},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {839--849},
  url       = {}


  • 2017-April-4: init version

