Unable to romanize with full katakana strings #8
Comments
The issue is in unk (unknown token) handling: unks aren't converted to romaji even when they're all kana. This fixes that. What to do with non-Japanese input is more complicated: what do you do if your input has Cyrillic or Hangul? An unrelated issue is that cutlet sometimes seems to output zenkaku eiji (full-width Latin characters); I need to check why that's happening.
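For reference, the two checks mentioned above could be sketched like this (assumed helper names, not cutlet's actual code): a kana-only test based on Unicode block ranges, and folding zenkaku eiji back to ASCII with NFKC normalization.

```python
import unicodedata


def is_kana_only(text: str) -> bool:
    # Hiragana (U+3040-309F) and Katakana (U+30A0-30FF) blocks;
    # this also covers the long-vowel mark and katakana middle dot.
    return bool(text) and all("\u3040" <= ch <= "\u30ff" for ch in text)


def to_hankaku(text: str) -> str:
    # NFKC normalization folds full-width Latin (zenkaku eiji) to ASCII.
    return unicodedata.normalize("NFKC", text)
```

A token passing `is_kana_only` can safely be romanized even when it's missing from the dictionary; anything else needs a separate policy.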
Thanks for the report. All-katakana strings are in scope, so this is a bug. There are a couple of things going on in your example text.

The main problem is not actually that the words are all katakana; it's that words not in the dictionary are output as-is, even if they're in katakana or hiragana. I fixed that and will make a release soon. You can try it out with the latest commit.

Dealing with words not in the dictionary is easy if they're ASCII or kana, but it gets harder for anything else. What about kanji not in the dictionary, or ghost characters with no readings like 彁? What about Cyrillic or Hangul? The old strategy was to output all unks as-is. The main branch currently replaces unknown kanji with question marks but returns other characters as-is. I may change that; I need to think about it more.

Another thing happening here is that "Your" is blowing up into four tokens. This is a tokenizer dictionary issue that sometimes occurs when there are many English words in a row. It can be fixed by tweaking dictionary settings; I'll take a look at it later.

Thanks again for the bug report, and let me know if you have any questions.
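The unknown-token strategy described in this comment might look roughly like the following toy sketch (a stub romaji table and a hypothetical `handle_unk` helper, not the real cutlet logic): romanize kana, replace unknown kanji with question marks, and pass everything else through.

```python
# Toy romaji table for illustration only; a real converter covers all kana,
# digraphs (キャ -> kya), sokuon, and long vowels.
KANA_TO_ROMAJI = {"カ": "ka", "タ": "ta", "ナ": "na", "ア": "a"}


def handle_unk(token: str) -> str:
    """Hypothetical policy for tokens missing from the dictionary."""
    if token and all(ch in KANA_TO_ROMAJI for ch in token):
        # Kana-only unks can be romanized character by character.
        return "".join(KANA_TO_ROMAJI[ch] for ch in token)
    if any("\u4e00" <= ch <= "\u9fff" for ch in token):
        # CJK Unified Ideographs with no known reading become "?".
        return "?" * len(token)
    # Cyrillic, Hangul, etc. pass through unchanged.
    return token
```

This mirrors the trade-off in the comment: kana has a deterministic reading, unknown kanji has none, and non-Japanese scripts are left for the caller to decide about.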
I believe the just-released v0.1.6 fixes this, so I'll close this issue, but please feel free to reopen or comment if you have any other questions or problems.
Note that the issue with "Your" turning into four tokens should be resolved with the 1.0.7 release of unidic-lite or the latest release of unidic-py (unidic on pypi).
Original report: I'm not sure if this is in the scope of cutlet, but it looks like katakana-only sentences/phrases are not romanized: