Running morphological analyzers in English and Japanese #2

Open
notani opened this issue Apr 24, 2019 · 13 comments

@notani
Collaborator

notani commented Apr 24, 2019

Normalizing allomorphs (?) so that BPE can find identical morphemes across words.

cats -> cat+s
boxes -> box+s

usefulness -> useful+ness
happiness -> happy+ness
食べる taberu = eat
食べた tabeta = eat+past -> 食べる+た taberu+ta
食べなかった tabenakatta = did not eat -> 食べる+ない+た taberu+nai+ta

I expect we will get results more similar to the UDPipe segmentation if we normalize Japanese morphemes.
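
The normalization illustrated above can be sketched as a lookup from surface words to normalized morphemes. This is only a toy illustration: `NORMALIZED` and `normalize` are hypothetical names, the table is filled in by hand from the examples in this comment, and the space separator is an assumption. A real pipeline would take the segmentation from a morphological analyzer or segmenter instead of a fixed table.

```python
# Toy illustration only: the table is hand-written from the examples above,
# not produced by any real analyzer.
NORMALIZED = {
    "cats": ["cat", "s"],
    "boxes": ["box", "s"],
    "usefulness": ["useful", "ness"],
    "happiness": ["happy", "ness"],
    "食べた": ["食べる", "た"],
    "食べなかった": ["食べる", "ない", "た"],
}


def normalize(tokens, sep=" "):
    """Replace each known word with its normalized morphemes, joined by `sep`."""
    out = []
    for tok in tokens:
        # Unknown words pass through unchanged.
        out.extend(NORMALIZED.get(tok, [tok]))
    return sep.join(out)


print(normalize(["cats", "boxes"]))
print(normalize(["happiness", "usefulness"]))
```

After this step, "cats" and "boxes" share the same plural morpheme `s`, and "happiness" and "usefulness" share `ness`, which is what lets BPE see identical units across words.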

@notani notani self-assigned this Apr 24, 2019
@justhalf
Owner

Should we keep the plus sign?
In my Indonesian morphology normalizer script I didn't. Also, are you working on this?

@notani
Collaborator Author

notani commented Apr 24, 2019

> Should we keep the plus sign?

No, the + is just for illustration.

I will do English and Japanese normalization. Were you also working on this?

@justhalf
Owner

justhalf commented Apr 24, 2019

Yes, the script for Indonesian is at https://github.com/justhalf/bpe_analysis/blob/master/morphind/process_txt.py
It uses MorphInd as the morphological analyzer.

> No, the + is just for illustration.

I asked because in the Indonesian one I explicitly remove +. I think we should remove the plus sign, yeah.

@notani
Collaborator Author

notani commented Apr 24, 2019

Did you start English and Japanese normalization, too?

@justhalf
Owner

justhalf commented Apr 24, 2019

> Did you start English and Japanese normalization, too?

No, I haven't started. I didn't know which morphological analyzer to use. But once we have analyzers for English and Japanese, we can simply replace the subprocess call in the script with the corresponding call.
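
A minimal sketch of that idea, assuming each analyzer is a command-line tool that reads text on stdin and writes its analysis to stdout. The names `run_analyzer`, `ANALYZER_COMMANDS`, and `analyze` are hypothetical, and the commands are placeholders. This does not reflect how `process_txt.py` actually invokes MorphInd.

```python
import subprocess
import sys


def run_analyzer(cmd, text):
    """Send `text` to an external analyzer on stdin and return its stdout."""
    result = subprocess.run(
        cmd,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the analyzer exits non-zero
    )
    return result.stdout


# Stand-in "analyzer" so the sketch runs anywhere: a Python one-liner that
# echoes its input. A real setup would put the analyzer's command line here.
ECHO_CMD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

ANALYZER_COMMANDS = {
    "en": ECHO_CMD,  # placeholder, not a real English analyzer
    "ja": ECHO_CMD,  # placeholder, not a real Japanese analyzer
}


def analyze(lang, text):
    """Dispatch to the analyzer command registered for `lang`."""
    return run_analyzer(ANALYZER_COMMANDS[lang], text)
```

With this shape, supporting a new language only means registering another command. The code that post-processes the analyzer output stays the same as long as the output formats are comparable.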

@justhalf
Owner

justhalf commented Apr 24, 2019

In this paper it says the lexicon (19MB) is large:
[image]

Does anyone know a good English morphological analyzer? I was surprised to find none except Morfessor, which is automatic.

@justhalf
Owner

Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox

@notani
Collaborator Author

notani commented Apr 24, 2019

> Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox

Does this output surface forms of morphemes?

This statistical morphological segmenter can generate normalized surface forms like un+test+able+ly:
https://github.com/ryancotterell/treeseg

We can find similar studies by searching for "morphological segmentation" rather than "morphological analysis".

@justhalf
Owner

Based on my cursory look, it seems so.

> This statistical morphological segmenter can generate normalized surface forms like un+test+able+ly:
> https://github.com/ryancotterell/treeseg

That's a good one, since it is modern. I was looking for something that relies more on manual analysis and is less automatic, e.g., an FST, but I couldn't find an FST for English.

@notani
Collaborator Author

notani commented Apr 24, 2019

How about this?
https://github.com/knowitall/morpha

@justhalf
Owner

That looks good. Do you have one for Japanese as well? (I guess we don't need this for Chinese?)

@notani
Collaborator Author

notani commented Apr 24, 2019

Fortunately, UDPipe's Japanese segmentation is already a morpheme segmentation and provides normalized forms. I don't think we need normalization for Chinese.

Can you do English normalization?
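
As a sketch of how those normalized forms could be read, assuming the UDPipe output is in CoNLL-U format, where the third tab-separated column is the lemma: the `lemmas` function is a hypothetical helper, and `SAMPLE` is a hand-written fragment for illustration, not actual UDPipe output.

```python
# Hand-written CoNLL-U fragment for illustration; not actual UDPipe output.
SAMPLE = "\n".join([
    "# text = 食べなかった",
    "1\t食べ\t食べる\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tなかっ\tない\tAUX\t_\t_\t1\taux\t_\t_",
    "3\tた\tた\tAUX\t_\t_\t1\taux\t_\t_",
    "",
])


def lemmas(conllu):
    """Yield one list of lemmas per sentence from CoNLL-U text."""
    sentence = []
    for line in conllu.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        sentence.append(cols[2])  # column 3 is LEMMA
    if sentence:
        yield sentence


result = list(lemmas(SAMPLE))
```

For the hand-written fragment, `result` holds a single sentence with the lemmas 食べる, ない, た, which matches the normalization proposed at the top of this issue.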

@justhalf
Owner

I am trying. Morpha apparently handles only plural nouns and verb inflections, not derivations, so happiness stays as is.
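
To make the gap concrete, here is a toy sketch of the kind of derivational rule that would be needed on top of an inflection-only tool. It is an assumption for illustration, not part of Morpha: `split_ness` is a hypothetical helper that handles only the suffix "-ness" and the y -> i spelling change.

```python
def split_ness(word, min_stem=3):
    """Split a trailing "-ness" and undo the y -> i spelling change."""
    suffix = "ness"
    # Leave the word alone if it lacks the suffix or the stem would be too short.
    if not word.endswith(suffix) or len(word) - len(suffix) < min_stem:
        return [word]
    stem = word[: -len(suffix)]
    if stem.endswith("i"):
        stem = stem[:-1] + "y"  # happi -> happy
    return [stem, suffix]


print(split_ness("happiness"))
print(split_ness("usefulness"))
```

A rule this naive over-segments words such as "witness", which is one reason a lexicon-based analyzer or a trained segmenter is preferable for derivations.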
