Unable to romanize with full katakana strings #8

ykim · 2020-07-25T10:37:09Z

I'm not sure if this is in the scope of cutlet, but it looks like katakana-only sentences and phrases are not romanized:

% cutlet
アマガミ Sincerely Your S シンシアリーユアーズ
アマガミ Sincerely Your S シンシアリーユアーズ
ケメコデラックス
ケメコデラックス


The issue is in the handling of unknown words (unks): unks aren't converted to romaji, even if they're all kana. This fixes that. The issue of what to do with non-Japanese input is more complicated: what do you do if your input has Cyrillic or Hangul? An unrelated issue is that cutlet sometimes seems to output zenkaku eiji (full-width Latin letters); I need to check why that's happening.
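For reference, zenkaku eiji can be folded back to ordinary ASCII with Unicode NFKC normalization. This is only a minimal sketch using the standard library; the helper name `to_hankaku_eiji` is made up for illustration, and nothing here is taken from cutlet's code or says where the full-width letters come from.

```python
import unicodedata


def to_hankaku_eiji(text: str) -> str:
    """Fold full-width (zenkaku) Latin letters to their ASCII forms.

    Hypothetical helper, not part of cutlet. NFKC applies every
    compatibility mapping, so it changes more than just Latin letters.
    """
    return unicodedata.normalize("NFKC", text)


print(to_hankaku_eiji("Ｙｏｕｒ"))
```

NFKC is a blunt tool: it also converts half-width katakana to full-width and rewrites other compatibility characters, so it may not be appropriate if the input has to be preserved exactly.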

polm · 2020-07-25T13:07:53Z

Thanks for the report. All-katakana strings are in scope, and so this is a bug.

There are a couple of things going on in your example text here. The main problem is not actually that the words are just katakana, it's that words not in the dictionary are output as-is, even if they're in katakana or hiragana. I fixed that and will make a release soon. You can try it out with the latest commit.

Dealing with words not in the dictionary is easy if they're ASCII or kana, but it gets harder if they're something else. What about kanji not in the dictionary, or ghost characters with no readings, like 彁? What about Cyrillic or Hangul? The old strategy was to output all unks as-is. The main branch currently replaces unknown kanji with question marks but returns other characters as-is. I may change that; I need to think about it more.
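The strategy described above could be sketched roughly like this. It is not cutlet's actual implementation: the function names, the Unicode ranges used to detect kana and kanji, the one-question-mark-per-character choice, and the `kana_to_romaji` callback are all assumptions made for illustration.

```python
def char_class(ch: str) -> str:
    """Rough script class for one character, based on Unicode ranges."""
    cp = ord(ch)
    if cp < 0x80:
        return "ascii"
    if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana blocks
        return "kana"
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs (basic block only)
        return "kanji"
    return "other"  # Cyrillic, Hangul, and everything else


def token_class(token: str) -> str:
    """Class shared by every character in the token, else "mixed"."""
    classes = {char_class(ch) for ch in token}
    return classes.pop() if len(classes) == 1 else "mixed"


def handle_unknown(token: str, kana_to_romaji) -> str:
    """Fallback for a token the dictionary does not know.

    kana_to_romaji is a hypothetical callable that converts a kana
    string to romaji; it stands in for the real conversion.
    """
    cls = token_class(token)
    if cls == "kana":
        return kana_to_romaji(token)
    if cls == "kanji":
        # Assumption: one question mark per unknown kanji.
        return "?" * len(token)
    # ASCII, Cyrillic, Hangul, mixed tokens: returned as-is.
    return token
```

With a stand-in converter, `handle_unknown("彁", conv)` gives `"?"`, while Cyrillic or Hangul tokens come back unchanged. What those non-Japanese tokens should become is the open question here.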

Another thing that's happening here is that "Your" is blowing up into four tokens. This is a tokenizer dictionary issue that sometimes happens when you have lots of English words in a row. This can be fixed by tweaking dictionary settings; I'll take a look at it later.

Thanks again for the bug report, let me know if you have any questions.

polm · 2020-07-26T09:46:17Z

I believe the just-released v0.1.6 fixes this, so I'll close this issue, but please feel free to reopen or comment if you have any other questions or problems.

polm · 2020-08-14T08:46:00Z

Note that the issue with "Your" turning into four tokens should be resolved with the 1.0.7 release of unidic-lite or the latest release of unidic-py (unidic on pypi).

polm closed this as completed Jul 26, 2020

ykim mentioned this issue Aug 3, 2020

KeyError: 'ｰ' #9

Closed
