Unable to romanize with full katakana strings #8

Closed
ykim opened this issue Jul 25, 2020 · 3 comments

ykim commented Jul 25, 2020

I'm not sure if this is in the scope of cutlet, but katakana-only sentences and phrases don't seem to be romanized; the input is echoed back unchanged:

% cutlet
アマガミ Sincerely Your S シンシアリーユアーズ
アマガミ Sincerely Your S シンシアリーユアーズ
ケメコデラックス
ケメコデラックス
polm added a commit that referenced this issue Jul 25, 2020
The issue is in unk handling - unks aren't converted to romaji, even if
they're all kana. This fixes that. The issue of what to do with
non-Japanese input is more complicated - what do you do if your input
has cyrillic or hangul?

An unrelated issue is that cutlet seems to be putting out zenkaku
eiji sometimes, I need to check why that's happening.
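
For readers unfamiliar with the term, "zenkaku eiji" refers to full-width Latin letters such as Ｙｏｕｒ. The snippet below is a minimal sketch of folding them back to ASCII by code point offset. It is only an illustration of the character class involved, not cutlet's code and not the fix for the behaviour mentioned in the commit message; the function name `fold_fullwidth_ascii` is hypothetical.

```python
def fold_fullwidth_ascii(text):
    """Fold full-width ASCII variants (U+FF01..U+FF5E) to plain ASCII.

    Full-width forms sit exactly 0xFEE0 above their ASCII counterparts.
    The ideographic space (U+3000) is mapped to a normal space as well.
    Everything else, including kana and kanji, is left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            out.append(chr(cp - 0xFEE0))
        elif cp == 0x3000:
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(fold_fullwidth_ascii("Ｓｉｎｃｅｒｅｌｙ　Ｙｏｕｒ　Ｓ"))
```

A fixed offset is used here rather than `unicodedata.normalize("NFKC", ...)` because NFKC would also rewrite other characters, such as half-width katakana, which may not be wanted.
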
Owner

polm commented Jul 25, 2020

Thanks for the report. All-katakana strings are in scope, and so this is a bug.

There are a couple of things going on in your example text. The main problem is not that the words are katakana; it's that words not in the dictionary were output as-is, even when they are written in katakana or hiragana. I fixed that and will make a release soon. In the meantime, you can try it out with the latest commit.

Dealing with words not in the dictionary is easy if they're ASCII or kana, but it gets harder if they're something else. What about kanji not in the dictionary, or a ghost character with no readings like 彁? What about Cyrillic or Hangul? The old strategy was to output all unks as-is. The main branch currently replaces unknown kanji with question marks but returns other characters as-is. I may change that; I need to think about it more.
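
As a rough illustration of the fallback policy described above (kana converted, unknown kanji replaced with a question mark, everything else passed through), here is a minimal standalone sketch. It is not cutlet's implementation: the names `script_of`, `romanize_unknown`, and `KANA_ROMAJI` are hypothetical, and the kana table is deliberately tiny, covering just enough characters for the demo. A real table would need all kana plus digraphs, sokuon, and the long-vowel mark.

```python
import unicodedata

# Illustrative, incomplete Hepburn table (an assumption for this sketch).
KANA_ROMAJI = {
    "ケ": "ke", "メ": "me", "コ": "ko",
    "あ": "a", "い": "i", "う": "u",
}


def script_of(ch):
    """Return a rough script label based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith(("KATAKANA", "HIRAGANA")):
        return "kana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


def romanize_unknown(surface):
    """Fallback for a token the dictionary does not know.

    kana  -> romaji via the table (left as-is if missing from the table)
    kanji -> "?" because no reading is available
    other -> returned unchanged (ASCII, Cyrillic, Hangul, ...)
    """
    out = []
    for ch in surface:
        kind = script_of(ch)
        if kind == "kana":
            out.append(KANA_ROMAJI.get(ch, ch))
        elif kind == "kanji":
            out.append("?")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for token in ["ケメコ", "彁", "Привет", "한글"]:
        print(token, "->", romanize_unknown(token))
```

Classifying per character keeps the policy easy to change later, for example by giving Cyrillic or Hangul their own branch instead of passing them through.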

Another thing that's happening here is that "Your" is blowing up into four tokens. This is a tokenizer dictionary issue that happens sometimes when you have lots of English words in a row. This can be fixed by tweaking dictionary settings, I'll take a look at it later.

Thanks again for the bug report, let me know if you have any questions.

Owner

polm commented Jul 26, 2020

I believe the just-released v0.1.6 fixes this, so I'll close this issue, but please feel free to reopen or comment if you have any other questions or problems.

@polm polm closed this as completed Jul 26, 2020
@ykim ykim mentioned this issue Aug 3, 2020
Owner

polm commented Aug 14, 2020

Note that the issue with "Your" turning into four tokens should be resolved with the 1.0.7 release of unidic-lite or the latest release of unidic-py (unidic on PyPI).
