Develop a tokenizer for Premodern Slavic #91

pirolen · 2023-07-17T13:55:49Z

Hi, I would be happy to contribute data and insights to help develop a tokenizer for Medieval/Premodern Slavic.
I am currently using tokconfig-rus on this data, and there is room for improvement: for example, the resulting sentences are either very short or very long. Please see below for some examples.

Some of the data characteristics:

- The character set of this data is nonstandard, including the punctuation.
- Sentence delimiters are typically nonstandard or nonexistent (· or ∙ are often used between words but are typically not true sentence delimiters).
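
To make the delimiter problem concrete, here is a minimal Python sketch. It does not reflect how tokconfig-rus, Stanza, or UDPipe work internally; the function names and the sample string are made up for illustration, and I am assuming the two marks are U+00B7 (middle dot) and U+2219 (bullet operator). It only shows why splitting on these marks over-segments, while treating them as word separators gives usable tokens.

```python
import re

# Assumption: the interpuncts in the data are these two code points.
INTERPUNCTS = "\u00b7\u2219"  # · MIDDLE DOT, ∙ BULLET OPERATOR

def naive_sentences(text):
    """Split on every interpunct, as a delimiter-driven splitter might."""
    parts = re.split(f"[{INTERPUNCTS}]", text)
    return [p.strip() for p in parts if p.strip()]

def words(text):
    """Treat interpuncts and whitespace alike as word separators."""
    return [w for w in re.split(f"[{INTERPUNCTS}\\s]+", text) if w]

# Illustrative string only, not taken from the actual corpus.
sample = "искони·бѣ·слово∙и слово·бѣ"

print(naive_sentences(sample))  # many tiny "sentences"
print(words(sample))            # one flat list of word tokens
```

With the first function almost every word becomes its own "sentence", which matches the very short sentences mentioned above; the second leaves sentence segmentation to some other signal.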

There is no real orthographic standard for this period and, I suspect, little reliable gold-labeled data either.
I also tried the Stanza and UDPipe sentence splitters, but they did not perform well on this data.

Would you be interested in creating a Premodern Slavic config? Or would you suggest another approach?
