Running morphological analyzers in English and Japanese #2
Should we keep the plus sign?
No, I will do English and Japanese normalization. Were you also working on this?
Yes, the script for Indonesian is at https://github.com/justhalf/bpe_analysis/blob/master/morphind/process_txt.py
I asked because in the Indonesian one I explicitly remove it.
Did you start English and Japanese normalization, too?
No, I haven't started. I didn't know which morphological analyzer to use. But once we have them, we can simply replace the subprocess call with the corresponding call.
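To make the "replace the subprocess call" idea concrete, here is a minimal sketch of piping text through an analyzer as a subprocess. The command is a placeholder (`analyzer_cmd` is an assumption, not a real analyzer name); the actual English or Japanese tool would be dropped in once chosen.

```python
import subprocess

def analyze(text, analyzer_cmd):
    """Run a morphological analyzer over `text` via a subprocess.

    `analyzer_cmd` is a hypothetical placeholder list, e.g. the
    command line for the English or Japanese analyzer once chosen.
    Returns the analyzer's stdout as a string.
    """
    result = subprocess.run(
        analyzer_cmd,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise if the analyzer exits with an error
    )
    return result.stdout

# Sanity check using `cat` as a stand-in analyzer (echoes its input):
print(analyze("running", ["cat"]))
```

Swapping analyzers then only means changing `analyzer_cmd`, as suggested above.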
In this paper it says the lexicon (19 MB) is large. Does anyone know a good English morphological analyzer? I was surprised to find none, only Morfessor, which is automatic.
Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox |
Does this output surface forms of morphemes? This statistical morphological segmenter can generate normalized surface forms. We could search for similar studies under "morphological segmentation" rather than "morphological analysis".
Based on my cursory look, it seems so.
That's a good one, since it is modern. I was looking for something with more manual analysis, since it will be less automatic, e.g., an FST. But I couldn't find an FST for English.
How about this?
That looks good. Do you have one for Japanese as well? (I guess we don't need this for Chinese?)
Fortunately, Japanese segmentation by UDPipe is already morpheme segmentation and has normalized forms. I think we don't need normalization for Chinese. Can you do English normalization? |
I am trying. Morpha apparently handles only plural nouns and verb inflections, but not derivations. So ...
Normalizing allomorphs (?) so that BPE can find identical morphemes across words.
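A toy illustration of the point above: if allomorphic surface forms are mapped to one canonical form before BPE training, BPE can merge the same morpheme across different words. The mapping table and function names here are hypothetical, for illustration only; a real analyzer would emit the normalized forms directly.

```python
def normalize_segmentation(morphemes, allomorph_map):
    """Map each surface morpheme to a canonical form before BPE.

    `allomorph_map` is a hypothetical lookup table; morphemes not in
    the table pass through unchanged.
    """
    return " ".join(allomorph_map.get(m, m) for m in morphemes)

# Toy rule: collapse the plural allomorphs "es" and "s" to "s",
# so BPE sees the identical unit in both words.
toy_map = {"es": "s"}
print(normalize_segmentation(["box", "es"], toy_map))  # -> "box s"
print(normalize_segmentation(["cat", "s"], toy_map))   # -> "cat s"
```

With this normalization, the suffix "s" becomes a single frequent unit that BPE can learn, instead of two rarer variants.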
I expect we will get results more similar to UDPipe segmentation if we normalize Japanese morphemes.