Source code for an ACL2016 paper of Chinese word segmentation


This code implements the word segmentation algorithm proposed in the following paper.

Deng Cai and Hai Zhao, Neural Word Segmentation Learning for Chinese. ACL 2016.

Latest update! We have improved the system. The corresponding paper was accepted to ACL 2017, with source code at this repo.

Update! A faster implementation using dynet as the backend is now available. Run python -d to use the new (dynet-based) version.

Usage (theano; also applies to the dynet version):

- train

python -t. To train a model, first check the hyperparameter settings. At the very beginning, the training procedure writes a config file that preserves your hyperparameter settings; it then saves the trained model parameters to *.npz after each epoch.

- test

python params.npz input_file output_path config_file. To test a trained model, pass the file that stores the model parameters as params.npz and the corresponding configuration file as config_file. The test procedure reads data from input_file and writes the result to output_path.

- evaluate

For example, to reproduce the best result (F1-score 95.5) on the PKU dataset reported in our paper, first generate the output file with the trained model (python best_pku.npz ../data/pku_test somepath best_pku_config), then run ./score ../data/dic ../data/pku_test somepath.
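
For readers who want to see what a word-level F1-score measures, here is a minimal sketch in Python. It is not the score script used above and may differ from it in details; it assumes that both files hold one sentence per line with words separated by whitespace, and that each gold line and predicted line contain the same characters. The function names and the sample sentence are made up for illustration.

```python
def to_spans(words):
    """Map a list of words to the set of (start, end) character spans they cover."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def precision_recall_f1(gold_lines, pred_lines):
    """Word-level precision, recall and F1 over paired segmented sentences."""
    n_gold = n_pred = n_correct = 0
    for gold_line, pred_line in zip(gold_lines, pred_lines):
        gold_spans = to_spans(gold_line.split())
        pred_spans = to_spans(pred_line.split())
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        # A predicted word is correct only if both of its boundaries match.
        n_correct += len(gold_spans & pred_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction wrongly splits one gold word in two.
gold = ["我 喜欢 北京 天安门"]
pred = ["我 喜 欢 北京 天安门"]
p, r, f1 = precision_recall_f1(gold, pred)
print("P=%.3f R=%.3f F1=%.3f" % (p, r, f1))
```

Comparing character spans rather than word strings keeps a repeated word at different positions from being counted as a match by accident.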


Thanks to these excellent computing tools: Dynet, Theano, Numpy, Gensim.


Deng Cai. If you have any questions, feel free to contact me by email.


If you find this code useful, please cite our paper.

  author    = {Cai, Deng  and  Zhao, Hai},
  title     = {Neural Word Segmentation Learning for Chinese},
  booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {409--420},
  url       = {}