Configure special treatment of prefixes and suffixes #26

Open · ogallagher opened this issue Feb 12, 2024 · 5 comments
Labels: question (Further information is requested)

ogallagher (Owner) commented Feb 12, 2024

I'm not sure what to do about them, but in cases where grammatical prefixes (ex. pre-, re-, un-) and suffixes (ex. -들, -에, -가, -는) occur frequently, I've noticed some unusual results.

  • Since the choices for a given word test currently aim for similarity, the choices end up being the same word with different conjugations/particles attached.
  • Different forms of the same word can fill a disproportionate share of the tests generated.

ogallagher commented

If I modify the conversion of a token string to Word.key_string to exclude a configurable number of prefix and suffix characters, that would reduce the number of words that differ only at their ends, as well as remove them from test choices.

In that case I'll have to be careful where I use key_string, because it would no longer be an accurate representation of a word.

ogallagher commented

The smarter option would be to analyze the grammatical structure of the word (ex. via a linguistics API) in order to find prefixes, suffixes, particles, etc. and remove them from Word.key_string.

A simpler approach would be to define static lists of custom prefix and suffix patterns, provided by the user and/or included within quizgen. But again, this would still be prone to many false positives, excluding characters from the ends of words that are not actually grammatical prefixes or suffixes.

ogallagher commented Feb 19, 2024

English

  • Try Cloudmersive part-of-speech tagging natural language API, as recommended here.

Cloudmersive account portal.

Penn Treebank part-of-speech tags, as used by the POS tag API result object PosTaggedWord.

POS tagging works, but only for English. https://gist.github.com/ogallagher/5be9bfe5c1ef757cf4faccaac3dc7a55

ogallagher commented Feb 19, 2024

  • Try spaCy POS tagging NLP library, as recommended here.

This is not a hosted API, but rather a self-hosted NLP library with optionally pre-trained language models.

ogallagher commented Feb 19, 2024

Korean

  • Try the konlpy Python package for Korean. I do not know how resource-demanding the underlying (all Java?) models are; I think they are static algorithms and should not need any training.

The konlpy docs include a runtime performance analysis of the different underlying models/engines.

I will probably use the 꼬꼬마/kkma engine, which according to the konlpy docs seems most accurate as long as the input string tokens are correctly delimited with spaces.
