Configure special treatment of prefixes and suffixes #26

Open · ogallagher opened this issue Feb 12, 2024 · 5 comments
Labels: question (Further information is requested)

ogallagher (Owner) commented Feb 12, 2024

I'm not sure what to do about them, but in cases where grammatical prefixes (ex. pre-, re-, un-) and suffixes (ex. -들, -에, -가, -는) occur frequently, I've noticed some unusual results.

  • Since the choices for a given word test currently aim for similarity, the choices end up being the same word with different conjugations/particles attached.
  • Different forms of the same word can fill a disproportionate share of the tests generated.

ogallagher commented

If I modify the conversion of a token string to Word.key_string to exclude a configurable number of prefix and suffix characters, that would reduce the number of words that differ only at their ends, as well as remove them from test choices.

In that case I'll have to be careful where I use key_string, because it would no longer be an accurate representation of a word.

ogallagher commented

The smarter option would be to analyze the grammatical structure of the word (ex. via a linguistics API) in order to find prefixes, suffixes, particles, etc. and remove them from Word.key_string.

A simpler approach would be to define static lists of custom prefix and suffix patterns, provided by the user and/or included within quizgen. But again, this would still be prone to many false positives, excluding characters from the ends of words that are not actually grammatical prefixes or suffixes.

ogallagher commented Feb 19, 2024

English

  • Try Cloudmersive part-of-speech tagging natural language API, as recommended here.

Cloudmersive account portal.

Penn Treebank part-of-speech tags, as used by the POS tag API result object PosTaggedWord.

POS tagging works, but only for English. https://gist.github.com/ogallagher/5be9bfe5c1ef757cf4faccaac3dc7a55

ogallagher commented Feb 19, 2024

  • Try spaCy POS tagging NLP library, as recommended here.

This is not a hosted API, but rather a self-hosted NLP library with optionally pre-trained language models.

ogallagher commented Feb 19, 2024

Korean

  • Try the konlpy Python package for Korean. I do not know how resource-demanding the underlying (all Java?) models are; I think they are static algorithms and should not need any training.

The konlpy docs include a runtime performance analysis of the different underlying models/engines.

I will probably use the 꼬꼬마/kkma engine, which according to the konlpy docs seems most accurate as long as the input string tokens are correctly delimited with spaces.
