Korean Support #4

Open
polm opened this issue Dec 5, 2019 · 1 comment

polm commented Dec 5, 2019

It'd be nice to support Korean. A simple way to do this would be to subclass the tagger with a KoreanTagger and override the field names, or to allow fields to be passed in at creation time.
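
A minimal sketch of the two options, to make the trade-off concrete. This is not fugashi's actual code: `BaseTagger`, `wrap`, and the field names are all hypothetical stand-ins, and no MeCab call is made.

```python
from collections import namedtuple


class BaseTagger:
    """Hypothetical stand-in for a tagger; not fugashi's real class."""

    # Subclasses override this to rename the feature columns.
    fields = ("pos", "reading")

    def __init__(self, fields=None):
        # Second option: let callers pass field names at creation time.
        if fields is not None:
            self.fields = tuple(fields)
        self.Feature = namedtuple("Feature", self.fields)

    def wrap(self, values):
        """Pair raw feature values with the configured field names."""
        values = list(values)
        # Pad so that entries with fewer columns still fit.
        values += [None] * (len(self.fields) - len(values))
        return self.Feature(*values[: len(self.fields)])


class KoreanTagger(BaseTagger):
    # First option: a subclass that only overrides the field names.
    # These names are placeholders chosen for this example.
    fields = ("pos", "semantic_class", "has_final_consonant")


subclassed = KoreanTagger().wrap(["NNG", "*", "T"])
configured = BaseTagger(fields=["a", "b", "c"]).wrap(["x"])
```

Passing fields at creation time is the more general of the two, since a subclass like `KoreanTagger` can then be a thin wrapper that just supplies a default field list.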

The tagspec for mecab-ko-dic is here. Version 2.0 seems to be the most recent one, so I guess it makes sense to support that.

Field names and meaning based on Google translate:

| Original | English |
| --- | --- |
| 품사 태그 | part of speech tag |
| 의미 부류 | meaning type |
| 종성 유무 | final consonant (batchim) presence (T or F) |
| 읽기 | reading (pronunciation, for hanja?) |
| 타입 | type (*/Inflected/Compound/Preanalysis) |
| 첫번째 품사 | first pos (for compounds?) |
| 마지막 품사 | last pos |
| 표현 | expression (notes?; seems to specify composition of compounds, uses / as delimiter) |
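
As a rough illustration of how these eight fields could be exposed, here is a sketch that maps one comma-separated feature string onto named attributes. The English attribute names, the `parse_feature` helper, and the sample row are all assumptions made for this example; the sample is hand-written, not real dictionary output.

```python
import csv
from collections import namedtuple

# English attribute names are this example's own choice, one per field above.
KoreanFeature = namedtuple(
    "KoreanFeature",
    "pos semantic_class has_final_consonant reading "
    "type first_pos last_pos expression",
)


def parse_feature(feature_string):
    """Split one comma-separated feature string into named fields."""
    values = next(csv.reader([feature_string]))
    # Treat the "*" placeholder as a missing value.
    values = [None if v == "*" else v for v in values]
    # Pad or trim so the row always matches the eight fields.
    n = len(KoreanFeature._fields)
    values = (values + [None] * n)[:n]
    feature = KoreanFeature(*values)
    # This column is T or F; map it to a bool when present.
    flag = feature.has_final_consonant
    if flag is not None:
        feature = feature._replace(has_final_consonant=(flag == "T"))
    return feature


# Hand-written sample row for illustration only.
sample = parse_feature("NNG,*,T,사람,*,*,*,*")
print(sample.pos, sample.has_final_consonant, sample.reading)
```

The expression field is left as a raw string here, since its exact structure is still a guess at this point.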

Korean uses a fork of MeCab, and it looks like one difference is how whitespace is handled. I'm not sure if fugashi will just work with it, but since natto-py seems to work, there should be a way to support it.

polm commented Jan 7, 2020

Korean support has been in since 0.1.8, but it needs more testing. If anyone could take a look at it and make sure it's OK, that'd be much appreciated.
