
Form-normalization strategies #8

Closed
goodmami opened this issue Sep 6, 2020 · 2 comments


goodmami commented Sep 6, 2020

There needs to be a strategy for morphological normalization, for instance using Morphy for English-language lookup. Some ideas:

  • a form-lookup table separate from the actual database
    • irregular forms
    • transliterations
    • case-folded
    • diacritics removed
  • a normalization function that tries to get the form stored in the actual database
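The first idea can be sketched as a dict-backed table; the entries and the `lookup` helper here are hypothetical toy data for illustration (a real table would live in or alongside the database):

```python
# Hypothetical form-lookup table: maps variant surface forms to the
# form actually stored in the database (toy data for illustration).
FORM_TABLE = {
    "geese": "goose",   # irregular plural
    "ran": "run",       # irregular past tense
    "naive": "naïve",   # diacritics removed
    "uber": "über",     # transliteration
}

def lookup(form: str) -> str:
    # Case-fold the query so "Geese" matches "geese"; fall back to
    # the input form when no mapping exists.
    return FORM_TABLE.get(form.casefold(), form)
```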
@goodmami goodmami added the enhancement New feature or request label Sep 6, 2020
@goodmami goodmami changed the title Morphy and normalization Form-normalization strategies Oct 20, 2020
@goodmami

I pushed a commit for v0.7.0 that does normalization by default, where normalization means downcasing plus NFKD-based diacritic removal. This cannot be configured through the convenience functions wn.words(), wn.synsets(), etc., but it can be configured when creating a wn.Wordnet object via the normalize parameter. I think morphological analysis, such as Morphy, would be a separate process, perhaps configurable via a lemmatize parameter, since its purpose is to find a lemma form.
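The default normalization described above (downcasing plus NFKD-based diacritic removal) can be sketched in a few lines; the function name and exact behavior here are illustrative, not necessarily what the commit implements:

```python
import unicodedata

def normalize(form: str) -> str:
    # Downcase, decompose to NFKD, then drop combining marks (diacritics).
    nfkd = unicodedata.normalize("NFKD", form.lower())
    return "".join(ch for ch in nfkd if not unicodedata.combining(ch))
```

Presumably a callable of this shape is what would be passed via the normalize parameter when constructing wn.Wordnet.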

@goodmami

My thinking now is that this hypothetical lemmatize parameter is a function with the following type signature:

Lemmatizer = Callable[[str, Optional[str]], Iterator[str]]

...where the two arguments are the word form and optional part of speech. For example:

def lemmatize(form: str, pos: Optional[str] = None) -> Iterator[str]:
    ...

If pos is None, the function yields candidates for all parts of speech. Also, the function does not need to yield only valid lemma forms: the results will be used in queries against the database, and those queries act as the filter, so there's no point in filtering twice. Something that requires a lexicon, such as Morphy's exception lists, can be instantiated with a wordnet and still follow the signature:

class Morphy:
    def __init__(self, wordnet: wn.Wordnet):
        ...
    def __call__(self, form: str, pos: Optional[str] = None) -> Iterator[str]:
        ...
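For illustration, a toy lemmatizer conforming to this signature might look like the following. It is a crude suffix-stripper, not Morphy; the detachment rules are made up, and invalid candidates are expected to be filtered out by the database query:

```python
from typing import Iterator, Optional

# Toy suffix-detachment rules per part of speech (illustrative only).
RULES = {"n": ["es", "s"], "v": ["ing", "ed", "s"]}

def lemmatize(form: str, pos: Optional[str] = None) -> Iterator[str]:
    # Yield the form itself first; the database query filters out
    # candidates that are not actual lemmas.
    yield form
    for p, suffixes in RULES.items():
        if pos is not None and pos != p:
            continue  # pos=None means try rules for all parts of speech
        for suffix in suffixes:
            if form.endswith(suffix) and len(form) > len(suffix):
                yield form[: -len(suffix)]
```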

I'm not sure if lemmatize should take the output of any normalize function, or vice versa. I can see arguments both ways. Maybe it needs some preprocessing pipeline instead: preprocessor=[normalize, lemmatize].
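A sketch of what that hypothetical preprocessor pipeline could look like, where a plain one-to-one normalizer is written in the same one-to-many step shape as a lemmatizer, and each stage's outputs feed the next (all names here are assumptions, not an actual API):

```python
from typing import Callable, Iterable, Iterator, Optional

# Each pipeline step maps (form, pos) to zero or more candidate forms.
Step = Callable[[str, Optional[str]], Iterator[str]]

def normalize_step(form: str, pos: Optional[str] = None) -> Iterator[str]:
    # Toy normalizer: downcase only.
    yield form.lower()

def lemmatize_step(form: str, pos: Optional[str] = None) -> Iterator[str]:
    # Toy lemmatizer: keep the form and also try stripping a plural -s.
    yield form
    if form.endswith("s") and len(form) > 1:
        yield form[:-1]

def preprocess(steps: Iterable[Step], form: str,
               pos: Optional[str] = None) -> list[str]:
    forms = [form]
    for step in steps:
        # Feed every candidate from the previous stage into the next one,
        # keeping first-seen order and dropping duplicates.
        forms = list(dict.fromkeys(out for f in forms for out in step(f, pos)))
    return forms
```

With this shape, preprocessor=[normalize_step, lemmatize_step] would normalize first and lemmatize the normalized form, and reordering the list answers the which-comes-first question per configuration.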

@goodmami goodmami added this to the v0.7.0 milestone Apr 27, 2021