Running morphological analyzers in English and Japanese #2

notani · 2019-04-24T03:18:17Z

Normalizing allomorphs (?) so that BPE can find identical morphemes across words.

cats -> cat+s
boxes -> box+s

usefulness -> useful+ness
happiness -> happy+ness

食べる taberu = eat
食べた tabeta = eat+past -> 食べる+た taberu+ta
食べなかった tabenakatta = did not eat -> 食べる+ない+た taberu+nai+ta

I expect we will get results closer to the UDPipe segmentation if we normalize Japanese morphemes.
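As a minimal sketch of the idea above (not the project's implementation), the examples can be written as a hand-made lookup table. `NORMALIZED`, `normalize`, and `shared_morphemes` are hypothetical names, and the table covers only the words listed in this comment; a real run would take the analyses from a morphological analyzer. The separator is left as a parameter because how to join the morphemes is still open at this point in the thread.

```python
# Toy lookup built from the examples in this comment; a real pipeline
# would get these analyses from a morphological analyzer.
NORMALIZED = {
    "cats": ["cat", "s"],
    "boxes": ["box", "s"],
    "usefulness": ["useful", "ness"],
    "happiness": ["happy", "ness"],
    "食べた": ["食べる", "た"],
    "食べなかった": ["食べる", "ない", "た"],
}


def normalize(word, sep=" "):
    """Return the normalized morphemes of `word` joined by `sep`.

    Words missing from the toy table are returned unchanged.
    """
    return sep.join(NORMALIZED.get(word, [word]))


def shared_morphemes(a, b):
    """Morphemes that two words have in common after normalization."""
    return set(NORMALIZED.get(a, [a])) & set(NORMALIZED.get(b, [b]))
```

With this table, `shared_morphemes("cats", "boxes")` contains `"s"` even though the surface suffixes differ (`s` vs. `es`), which is the overlap BPE is meant to pick up.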

justhalf · 2019-04-24T11:39:15Z

Should we keep the plus sign?
In my Indonesian morphology normalizer script I didn't. Also, are you working on this?

notani · 2019-04-24T13:09:21Z

> Should we keep the plus sign?

No, the + is just for illustration.

I will do English and Japanese normalization. Were you also working on this?

justhalf · 2019-04-24T14:17:06Z

Yes, the script for Indonesian is at https://github.com/justhalf/bpe_analysis/blob/master/morphind/process_txt.py
It uses MorphInd as the morphological analyzer.

> No, + is just for an illustration purpose.

I asked because in the Indonesian one I explicitly remove +. I think we should remove the plus sign, yeah.

notani · 2019-04-24T14:19:11Z

Did you start English and Japanese normalization, too?

justhalf · 2019-04-24T14:19:56Z

> Did you start English and Japanese normalization, too?

No, I haven't started. I didn't know which morphology analyzer to use. But if we have them, we can simply replace the subprocess call with the corresponding call.
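A minimal sketch of what swapping the subprocess call could look like, assuming the analyzer reads text on stdin and writes its analysis to stdout. This is not taken from `process_txt.py`; `analyze` and `PLACEHOLDER_COMMAND` are hypothetical names, and the placeholder command is only a stand-in so the sketch runs without any analyzer installed.

```python
import subprocess
import sys


def analyze(text, command):
    """Pipe `text` to an external analyzer and return its stdout.

    `command` is the argument list of whichever analyzer is used.
    """
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the analyzer fails
    )
    return result.stdout


# Stand-in "analyzer" so the sketch runs anywhere: it just upper-cases stdin.
# Replace this list with the real analyzer's command line.
PLACEHOLDER_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

Under this assumption, supporting another language would only mean passing a different `command`; the parsing of each analyzer's output format would still need to be written per tool.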

justhalf · 2019-04-24T14:35:52Z

In this paper it says the lexicon (19MB) is large:

Does anyone know a good English morphological analyzer? I was surprised to find none, only Morfessor, which is automatic.

justhalf · 2019-04-24T14:55:32Z

Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox

notani · 2019-04-24T15:03:23Z

> Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox

Does this output surface forms of morphemes?

This statistical morphological segmenter can generate normalized surface forms like un+test+able+ly:
https://github.com/ryancotterell/treeseg

We can find similar studies by searching for "morphological segmentation" rather than "morphological analysis".

justhalf · 2019-04-24T15:48:25Z

Based on my cursory look, it seems so.

> This statistical morphological segmenter can generate normalized surface forms like un+test+able+ly:
> https://github.com/ryancotterell/treeseg

That's a good one, since it is modern. I was looking for something that relies more on manual analysis and is less automatic, e.g., an FST, but I couldn't find an FST for English.

notani · 2019-04-24T16:40:37Z

How about this?
https://github.com/knowitall/morpha

justhalf · 2019-04-24T18:22:50Z

That looks good. Do you have one for Japanese as well? (I guess we don't need this for Chinese?)

notani · 2019-04-24T20:48:43Z

Fortunately, Japanese segmentation by UDPipe is already morpheme segmentation and has normalized forms. I think we don't need normalization for Chinese.
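Since CoNLL-U stores the normalized form in the LEMMA column (the third of ten tab-separated fields), the point above can be sketched as reading lemmas instead of surface forms. This is an illustration under that assumption, not the code in `conllu2plain.py`; `conllu_to_lemmas` and `SAMPLE` are hypothetical names, and the sample is hand-written rather than real UDPipe output.

```python
def conllu_to_lemmas(conllu_text, sep=""):
    """Return one string of joined lemmas per sentence in a CoNLL-U text."""
    sentences, lemmas = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if lemmas:
                sentences.append(sep.join(lemmas))
                lemmas = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        token_id, lemma = cols[0], cols[2]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        lemmas.append(lemma)
    if lemmas:
        sentences.append(sep.join(lemmas))
    return sentences


# Hand-written sample for illustration only, not real UDPipe output.
SAMPLE = "\n".join([
    "# text = 食べなかった",
    "\t".join(["1", "食べ", "食べる", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "なかっ", "ない", "AUX", "_", "_", "1", "aux", "_", "_"]),
    "\t".join(["3", "た", "た", "AUX", "_", "_", "1", "aux", "_", "_"]),
    "",
])
```

Calling `conllu_to_lemmas(SAMPLE, sep="+")` on this sample gives `["食べる+ない+た"]`, matching the normalized form in the first comment; the default `sep=""` joins the lemmas without a separator.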

Can you do English normalization?

justhalf · 2019-04-24T22:04:00Z

I am trying. Morpha apparently only handles plural nouns and verb inflections, but not derivations. So happiness stays as is.

notani self-assigned this Apr 24, 2019

notani added a commit that referenced this issue Apr 25, 2019

Add detok-lemma mode to conllu2plain.py for normalizing Japanese texts (#2)

2bd7a77
