ginza-2.2.0

@hiroshi-matsuda-rit hiroshi-matsuda-rit released this 04 Oct 09:24
· 412 commits to develop since this release
fe6b9cc

  • 2019-10-04, Ametrine
  • Important changes
    • split_mode had been passed incorrectly to sudachipy.tokenizer since v2.0.0 (#43)
      • This bug made split_mode inconsistent between the training phase and the ginza command.
      • split_mode was set to 'B' for the training phase and the Python APIs, but to 'C' for the ginza command.
      • We fixed this bug by setting the default split_mode to 'C' everywhere.
      • This fix may cause word segmentation incompatibilities when upgrading GiNZA from v2.0.0 to v2.2.0.
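
To illustrate why a B/C mismatch matters, here is a minimal sketch (not GiNZA's own code) that segments a text under a chosen SudachiPy split mode and reports which token boundaries disappear in the coarser segmentation. The helper names `segment`, `boundaries` and `boundaries_lost` are hypothetical. The sketch assumes that SudachiPy and a Sudachi dictionary are installed and that `tokenize` accepts `text` and `mode` as keyword arguments.

```python
def segment(text, mode_name="C"):
    """Return the surface forms of `text` under split mode 'A', 'B' or 'C'.

    Assumption: SudachiPy and a Sudachi dictionary are installed.
    """
    # Imported lazily so the comparison helpers below work without SudachiPy.
    from sudachipy import dictionary, tokenizer

    mode = getattr(tokenizer.Tokenizer.SplitMode, mode_name)
    tokenizer_obj = dictionary.Dictionary().create()
    # Keyword arguments avoid relying on the positional order of tokenize().
    return [m.surface() for m in tokenizer_obj.tokenize(text=text, mode=mode)]


def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    offsets, position = set(), 0
    for token in tokens:
        position += len(token)
        offsets.add(position)
    return offsets


def boundaries_lost(fine, coarse):
    """Offsets that are token boundaries in `fine` but not in `coarse`."""
    return sorted(boundaries(fine) - boundaries(coarse))
```

Taking the example from the SudachiPy README, 国家公務員 is split as 国家 / 公務員 in mode B and kept as one token in mode C, so `boundaries_lost(["国家", "公務員"], ["国家公務員"])` returns `[2]`. A model trained on one segmentation and run on the other sees tokens it was never trained on, which is the incompatibility described above.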
  • New features
    • Add the -f / --output-format option to the ginza command:
    • Add custom token fields:
      • bunsetu_index: bunsetu index, starting from 0
      • reading: reading of the token (not its pronunciation)
      • sudachi: SudachiPy's morpheme instance (or a list of them when the tokens are gathered by JapaneseCorrector)
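
As a rough sketch of how these fields might be read: the release note does not state the access path, so the code below assumes they are exposed as spaCy extension attributes under `token._`, which is spaCy's usual mechanism for custom token fields. It also assumes that spaCy and the `ja_ginza` model are installed. `token_fields` and `print_fields` are hypothetical helper names.

```python
def token_fields(token):
    """Collect the custom fields of one token into a plain dict.

    Assumption: the fields live under spaCy's `token._` extension namespace.
    """
    ext = token._
    return {
        "text": token.text,
        "bunsetu_index": ext.bunsetu_index,  # 0-based bunsetu index
        "reading": ext.reading,              # reading, not pronunciation
    }


def print_fields(text, model_name="ja_ginza"):
    """Parse `text` and print the custom fields of each token."""
    import spacy  # imported lazily; requires spaCy and the GiNZA model

    nlp = spacy.load(model_name)
    for token in nlp(text):
        print(token_fields(token))
```

The `sudachi` field is left out of the dict because it holds a SudachiPy morpheme object (or a list of them) rather than a plain value.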
  • Performance improvements
    • Tokenizer
      • Use latest SudachiDict (SudachiDict_core-20190927.tar.gz)
      • Use Cythonized SudachiPy (v0.4.0)
    • Dependency parser
      • Apply the spacy pretrain command to capture the language model from UD_Japanese-BCCWJ, UD_Japanese-PUD and KWDLC.
      • Apply multitask objectives by using the -pt 'tag,dep' option of spacy train.
    • New model file
      • ja_ginza-2.2.0.tar.gz