ginza-2.2.0
- 2019-10-04, Ametrine
- Important changes
  - `split_mode` has been set incorrectly for `sudachipy.tokenizer` since v2.0.0 (#43)
    - This bug caused a `split_mode` incompatibility between the training phase and the `ginza` command.
    - `split_mode` was set to 'B' for the training phase and the Python APIs, but to 'C' for the `ginza` command.
    - We fixed this bug by setting the default `split_mode` to 'C' everywhere.
    - This fix may cause word segmentation incompatibilities when upgrading GiNZA from v2.0.0 to v2.2.0.
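
To check whether the `split_mode` change affects your own text, it can help to compare the segmentation produced by modes 'B' and 'C' directly. The sketch below is a minimal, unofficial illustration: the helper names `sudachi_tokenize`, `compare_split_modes` and `segmentation_differs` are hypothetical, and it assumes SudachiPy v0.4.0 or later, where `tokenize` takes the text first and the split mode second.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical callable type: (text, mode) -> list of token surfaces.
Tokenize = Callable[[str, str], List[str]]


def sudachi_tokenize(text: str, mode: str) -> List[str]:
    """Tokenize with SudachiPy in split mode 'A', 'B' or 'C'.

    SudachiPy is imported lazily so the helpers below can be used
    (and tested) without it being installed.
    """
    from sudachipy import dictionary, tokenizer

    split_mode = getattr(tokenizer.Tokenizer.SplitMode, mode)
    sudachi = dictionary.Dictionary().create()
    return [m.surface() for m in sudachi.tokenize(text, split_mode)]


def compare_split_modes(
    text: str, tokenize: Tokenize, modes: Sequence[str] = ("B", "C")
) -> Dict[str, List[str]]:
    """Return the token surfaces for each requested split mode."""
    return {mode: tokenize(text, mode) for mode in modes}


def segmentation_differs(result: Dict[str, List[str]]) -> bool:
    """True if any two modes produced different segmentations."""
    segmentations = list(result.values())
    return any(seg != segmentations[0] for seg in segmentations[1:])
```

With SudachiPy and a dictionary installed, `compare_split_modes(text, sudachi_tokenize)` returns one segmentation per mode. Any sentence for which `segmentation_differs` is true will be tokenized differently by the Python API after the upgrade.
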
- New features
  - Add `-f` and `--output-format` options to the `ginza` command:
    - `-f 0` or `-f conllu`: CoNLL-U Syntactic Annotation format
    - `-f 1` or `-f cabocha`: cabocha -f1 compatible format
  - Add custom token fields:
    - `bunsetu_index`: bunsetu index starting from 0
    - `reading`: reading of the token (not a pronunciation)
    - `sudachi`: SudachiPy's morpheme instance (or a list of them when the tokens are gathered by JapaneseCorrector)
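
The custom token fields can be read from Python once a document has been parsed. The sketch below assumes they are exposed as spaCy extension attributes under `token._` and that the model is loaded under the name `ja_ginza`. Both are assumptions not stated in these notes, and the helper names `token_fields` and `load_and_describe` are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def token_fields(doc: Iterable[Any]) -> List[Dict[str, Any]]:
    """Collect the custom fields listed above for every token in `doc`.

    Assumes each token exposes the fields as extension attributes
    on `token._` (an assumption, see the lead-in).
    """
    rows = []
    for token in doc:
        ext = token._
        rows.append(
            {
                "text": token.text,
                "bunsetu_index": ext.bunsetu_index,  # 0-based bunsetu index
                "reading": ext.reading,              # reading, not pronunciation
                "sudachi": ext.sudachi,              # morpheme, or list of morphemes
            }
        )
    return rows


def load_and_describe(text: str) -> List[Dict[str, Any]]:
    """Parse `text` with GiNZA and return its custom token fields.

    spaCy is imported lazily; the model name 'ja_ginza' is an assumption.
    """
    import spacy

    nlp = spacy.load("ja_ginza")
    return token_fields(nlp(text))
```

Because `sudachi` may hold either a single morpheme or a list, callers that need the underlying morphemes should handle both shapes.
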
- Performance improvements
  - Tokenizer
    - Use latest SudachiDict (SudachiDict_core-20190927.tar.gz)
    - Use Cythonized SudachiPy (v0.4.0)
  - Dependency parser
    - Apply `spacy pretrain` command to capture the language model from UD-Japanese BCCWJ, UD_Japanese-PUD and KWDLC.
    - Apply multitask objectives by using `-pt 'tag,dep'` option of `spacy train`
  - New model file
    - ja_ginza-2.2.0.tar.gz
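
For readers who want to try the same parser options on their own data, the sketch below assembles a generic spaCy v2 `spacy train` command line with the `-pt 'tag,dep'` multitask objectives. It is not GiNZA's actual training script. The helper name `build_train_command` and all paths are hypothetical, and passing the `spacy pretrain` weights through `--init-tok2vec` is an assumption about how the two commands were combined.

```python
from typing import List, Optional


def build_train_command(
    lang: str,
    output_dir: str,
    train_path: str,
    dev_path: str,
    parser_multitasks: str = "tag,dep",
    init_tok2vec: Optional[str] = None,
) -> List[str]:
    """Build an argv list for spaCy v2's `spacy train` CLI.

    `parser_multitasks` is passed with `-pt`, as in the release notes.
    `init_tok2vec` (optional) points at weights produced by
    `spacy pretrain`; wiring it in this way is an assumption.
    """
    cmd = [
        "python", "-m", "spacy", "train",
        lang, output_dir, train_path, dev_path,
        "-pt", parser_multitasks,
    ]
    if init_tok2vec:
        cmd += ["--init-tok2vec", init_tok2vec]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` in an environment where spaCy v2 is installed.
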