
Form-normalization strategies #8

Closed
goodmami opened this issue Sep 6, 2020 · 2 comments


goodmami commented Sep 6, 2020

There needs to be a strategy for morphological normalization, for instance using Morphy for English-language lookup. Some ideas:

  • a form-lookup table separate from the actual database
    • irregular forms
    • transliterations
    • case-folded
    • diacritics removed
  • a normalization function that tries to get the form stored in the actual database
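The first idea can be sketched as a dict-backed table; the entries and the `lookup` helper here are hypothetical toy data for illustration (a real table would live in or alongside the database):

```python
# Hypothetical form-lookup table: maps variant surface forms to the
# form actually stored in the database (toy data for illustration).
FORM_TABLE = {
    "geese": "goose",   # irregular plural
    "ran": "run",       # irregular past tense
    "naive": "naïve",   # diacritics removed
    "uber": "über",     # transliteration
}

def lookup(form: str) -> str:
    # Case-fold the query so "Geese" matches "geese"; fall back to
    # the input form when no mapping exists.
    return FORM_TABLE.get(form.casefold(), form)
```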
@goodmami goodmami added the enhancement New feature or request label Sep 6, 2020
@goodmami goodmami changed the title Morphy and normalization Form-normalization strategies Oct 20, 2020
@goodmami

I pushed a commit for v0.7.0 that does normalization by default, where normalization means downcasing plus NFKD-based diacritic removal. This cannot be configured through the convenience functions wn.words(), wn.synsets(), etc., but it can be configured when creating a wn.Wordnet object via the normalize parameter. I think morphological analysis, such as Morphy, would be a separate process, perhaps configurable via a lemmatize parameter, since its purpose is to find a lemma form.
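The default normalization described above (downcasing plus NFKD-based diacritic removal) can be sketched in a few lines; the function name and exact behavior here are illustrative, not necessarily what the commit implements:

```python
import unicodedata

def normalize(form: str) -> str:
    # Downcase, decompose to NFKD, then drop combining marks (diacritics).
    nfkd = unicodedata.normalize("NFKD", form.lower())
    return "".join(ch for ch in nfkd if not unicodedata.combining(ch))
```

Presumably a callable of this shape is what would be passed via the normalize parameter when constructing wn.Wordnet.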

@goodmami

My thinking now is that this hypothetical lemmatize parameter is a function with the following type signature:

Lemmatizer = Callable[[str, Optional[str]], Iterator[str]]

...where the two arguments are the word form and optional part of speech. For example:

def lemmatize(form: str, pos: Optional[str] = None) -> Iterator[str]:
    ...

If pos is None, the function yields candidates for all parts of speech. Also, the function does not need to yield only valid lemma forms: the results will be used in queries against the database, and those queries act as the filter, so there's no point in filtering twice. Something that requires a lexicon, such as Morphy's exception lists, can be instantiated with a wordnet and still follow the signature:

class Morphy:
    def __init__(self, wordnet: wn.Wordnet):
        ...
    def __call__(self, form: str, pos: Optional[str] = None) -> Iterator[str]:
        ...
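For illustration, a toy lemmatizer conforming to this signature might look like the following. It is a crude suffix-stripper, not Morphy; the detachment rules are made up, and invalid candidates are expected to be filtered out by the database query:

```python
from typing import Iterator, Optional

# Toy suffix-detachment rules per part of speech (illustrative only).
RULES = {"n": ["es", "s"], "v": ["ing", "ed", "s"]}

def lemmatize(form: str, pos: Optional[str] = None) -> Iterator[str]:
    # Yield the form itself first; the database query filters out
    # candidates that are not actual lemmas.
    yield form
    for p, suffixes in RULES.items():
        if pos is not None and pos != p:
            continue  # pos=None means try rules for all parts of speech
        for suffix in suffixes:
            if form.endswith(suffix) and len(form) > len(suffix):
                yield form[: -len(suffix)]
```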

I'm not sure if lemmatize should take the output of any normalize function, or vice versa. I can see arguments both ways. Maybe it needs some preprocessing pipeline instead: preprocessor=[normalize, lemmatize].
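A sketch of what that hypothetical preprocessor pipeline could look like, where a plain one-to-one normalizer is written in the same one-to-many step shape as a lemmatizer, and each stage's outputs feed the next (all names here are assumptions, not an actual API):

```python
from typing import Callable, Iterable, Iterator, Optional

# Each pipeline step maps (form, pos) to zero or more candidate forms.
Step = Callable[[str, Optional[str]], Iterator[str]]

def normalize_step(form: str, pos: Optional[str] = None) -> Iterator[str]:
    # Toy normalizer: downcase only.
    yield form.lower()

def lemmatize_step(form: str, pos: Optional[str] = None) -> Iterator[str]:
    # Toy lemmatizer: keep the form and also try stripping a plural -s.
    yield form
    if form.endswith("s") and len(form) > 1:
        yield form[:-1]

def preprocess(steps: Iterable[Step], form: str,
               pos: Optional[str] = None) -> list[str]:
    forms = [form]
    for step in steps:
        # Feed every candidate from the previous stage into the next one,
        # keeping first-seen order and dropping duplicates.
        forms = list(dict.fromkeys(out for f in forms for out in step(f, pos)))
    return forms
```

With this shape, preprocessor=[normalize_step, lemmatize_step] would normalize first and lemmatize the normalized form, and reordering the list answers the which-comes-first question per configuration.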

@goodmami goodmami added this to the v0.7.0 milestone Apr 27, 2021