ginza-2.2.0

hiroshi-matsuda-rit released this 04 Oct 09:24


fe6b9cc


2019-10-04, Ametrine
Important changes
- split_mode has been passed incorrectly to sudachipy.tokenizer since v2.0.0 (#43)
  - This bug caused a split_mode mismatch between the training phase and the ginza command.
  - split_mode was set to 'B' for the training phase and the Python APIs, but to 'C' for the ginza command.
  - We fixed this bug by setting the default split_mode to 'C' everywhere.
  - This fix may cause word segmentation incompatibilities when upgrading GiNZA from v2.0.0 to v2.2.0.
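
To check how the split_mode change affects existing data, it can help to compare the token boundaries produced before and after the upgrade. The following is a minimal sketch in plain Python, not part of GiNZA. The helper names `boundaries` and `segmentation_diff` are hypothetical, and the two token lists are hand-written illustrations of a mode 'B' and a mode 'C' split rather than output captured from either release.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    offsets, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def segmentation_diff(old_tokens, new_tokens):
    """Compare two segmentations of the same text.

    Returns (removed, added): boundaries present only in the old
    segmentation and boundaries present only in the new one.
    """
    if "".join(old_tokens) != "".join(new_tokens):
        raise ValueError("both segmentations must cover the same text")
    old_b, new_b = boundaries(old_tokens), boundaries(new_tokens)
    return sorted(old_b - new_b), sorted(new_b - old_b)


# Hand-written illustration (assumption): a finer 'B'-style split
# versus a coarser 'C'-style split of the same string.
mode_b = ["国家", "公務員"]
mode_c = ["国家公務員"]

removed, added = segmentation_diff(mode_b, mode_c)
print("boundaries lost after upgrade:", removed)
print("boundaries gained after upgrade:", added)
```

A non-empty `removed` or `added` list marks text where annotations keyed to v2.0.0 token indices may no longer line up.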
New features
- Add the -f / --output-format option to the ginza command:
  - -f 0 or -f conllu : CoNLL-U Syntactic Annotation format
  - -f 1 or -f cabocha: cabocha -f1 compatible format
- Add custom token fields:
  - bunsetu_index : bunsetu index starting from 0
  - reading: reading of the token (not its pronunciation)
  - sudachi: SudachiPy's morpheme instance (or a list of them when the tokens are gathered by JapaneseCorrector)
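
Output produced with -f conllu follows the CoNLL-U layout: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The sketch below reads such text into dictionaries using only the standard library. The function name `parse_conllu` is hypothetical, and the sample string is hand-written for illustration, not actual output of the ginza command.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences.

    Each sentence is a list of dicts keyed by the names in FIELDS.
    Comment lines (starting with '#') are skipped.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample (assumption), not real ginza output.
sample = (
    "# text = 銀座で\n"
    "1\t銀座\t銀座\tPROPN\t_\t_\t0\troot\t_\t_\n"
    "2\tで\tで\tADP\t_\t_\t1\tcase\t_\t_\n"
    "\n"
)

sentences = parse_conllu(sample)
for tok in sentences[0]:
    print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

All values are kept as strings, so callers that need numeric token ids or heads should convert them explicitly.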
Performance improvements
- Tokenizer
  - Use latest SudachiDict (SudachiDict_core-20190927.tar.gz)
  - Use Cythonized SudachiPy (v0.4.0)
- Dependency parser
  - Apply the spacy pretrain command to capture a language model from UD_Japanese-BCCWJ, UD_Japanese-PUD, and KWDLC.
  - Apply multitask objectives by using the -pt 'tag,dep' option of spacy train.
- New model file
  - ja_ginza-2.2.0.tar.gz
