0.1.0 update #6

ryszardtuora · 2020-01-13T10:34:07Z

0.1.0 UPDATE

Today we've released a new version which includes some important changes, mainly to the Morfeusz version.

The Morfeusz-based version gains two extensions in this release. The first is full morphosyntactic analysis (i.e. tagging with features such as grammatical case, gender, number, or tense) in the NKJP tagset. This is done by using Toygger, the winner of PolEval 2017, as our new tagger. Toygger requires TensorFlow to work. The second is a custom flexion component, whose purpose is to inflect words into a desired pattern. For details on both, please see the updated Jupyter notebook for Morfeusz. The Morfeusz version is now substantially slower because of the new tagger. It may also be prone to errors on out-of-vocabulary (OOV) words; if you expect your texts to contain many such words, we suggest trying the basic model for tagging and comparing the results. We aim to mitigate these issues in future releases and to further extend the capabilities of our integrated version of Toygger.
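For the comparison between the two models suggested above, here is a minimal sketch in plain Python. It assumes you have already collected `(token, tag)` pairs from each model for the same tokenized text. The helper names `tag_disagreements` and `agreement_rate` are hypothetical, and the sample tags are hand-written for illustration, not actual model output.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, tag) pairs for one text


def tag_disagreements(a: Tagged, b: Tagged) -> List[Tuple[str, str, str]]:
    """Return (token, tag_a, tag_b) for every position where the tags differ."""
    if [tok for tok, _ in a] != [tok for tok, _ in b]:
        raise ValueError("both taggings must cover the same token sequence")
    return [(tok, ta, tb) for (tok, ta), (_, tb) in zip(a, b) if ta != tb]


def agreement_rate(a: Tagged, b: Tagged) -> float:
    """Fraction of tokens on which the two taggings agree."""
    if not a:
        return 1.0
    return 1 - len(tag_disagreements(a, b)) / len(a)


# Hand-written illustrative tags (hypothetical, not real model output).
morfeusz_tags = [("Ala", "subst"), ("ma", "fin"), ("kota", "subst"), ("Xyzzowskiego", "adj")]
basic_tags = [("Ala", "subst"), ("ma", "fin"), ("kota", "subst"), ("Xyzzowskiego", "subst")]

for token, tag_a, tag_b in tag_disagreements(morfeusz_tags, basic_tags):
    print(f"{token}: Morfeusz version={tag_a}, basic model={tag_b}")
print(f"agreement: {agreement_rate(morfeusz_tags, basic_tags):.2f}")
```

Tokens on which the two models disagree are a reasonable place to start looking for OOV-related errors, although disagreement alone does not tell you which model is right.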

Additionally, the size of both models has been substantially reduced: we've found that we can use even fewer vectors without degrading performance. For this reason, to fill the slot of a big model, we plan to release a larger model based on 300-dimensional vectors soon.

We are working towards having our model recognized as the official model for Polish. We've already sent the reconstruction scripts to the authors of spaCy, and we hope that our model will soon gain their endorsement.

You can download the model here, and the evaluation metrics are available here.

Please do experiment with it, and let us know about your results!
