Why are some entries transliterated in this way? #55
It looks like these are due to a couple of different reasons. As a general rule, the best way to debug things is to check what the dictionary entries are, which you can do by just running the tokenizer on the input directly.

To get to the point though, I would fix all of these with an exception, as that's the fastest workaround. For お兄さん and お母さん that won't work very well unfortunately, as the components are still tokenized correctly (お/兄/さん), so if you make an exception for the main word it'll be wrong in other circumstances. You could also add a final postprocessing step on the output romaji.
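For reference, a minimal sketch of that kind of inspection, assuming fugashi with one of the UniDic variants installed (which is what cutlet uses under the hood):

```python
import fugashi

tagger = fugashi.Tagger()
for word in tagger("お兄さんはカードを買った"):
    # Surface form, dictionary lemma, and pronunciation field for each token.
    print(word.surface, word.feature.lemma, word.feature.pron)
```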
UniDic has the right entries for these, but they are not what comes out in normal analysis, so I think this is a problem with the MeCab model distributed with the dictionary.
In this case, "Yurii" is registered as a foreign spelling in the dictionary, like "card" for カード. It is one spelling of the name rendered as ユーリ, though the choice here seems arbitrary.
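If the foreign-spelling lookups are more trouble than they're worth for your data, they can be switched off globally; a small sketch (this should flip カード from "card" back to "kaado", and likewise avoid "Yurii"):

```python
import cutlet

katsu = cutlet.Cutlet()
katsu.use_foreign_spelling = False  # prefer kana-derived romaji over dictionary spellings
print(katsu.romaji("カード"))  # "kaado" rather than "card"
```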
This is weird. It seems to be something specific to forms that use katakana this way.
This one was a surprise; the dictionary lemma is not what you'd expect here.

Sorry I don't have a great solution for any of that, but thanks for reporting these! I'll work on fixing the prefix thing at least.
Thank you for the in-depth write-up! My script is part of a larger project to romanize some titles, and I've been building a custom post-processing dictionary by reviewing each output individually (since it doesn't get the character names right 100% of the time). Here are a few more that cutlet seems to struggle with:
Regarding your points:
Should I create an issue on mecab-python3 then?
Should I install the full UniDic for the best accuracy? (The size... -_-)

In order to make the exceptions and post-processing dictionaries properly, the exceptions should use the kanji/kana, while the post-processing acts on the romanized result, correct? Example:

```python
EXCEPTIONS = {
    'ユーリ': 'Yuuri',
    '環': 'Tamaki',  # Character name that keeps being romanized incorrectly
    '虎於': 'Torao',
    'ナギ': 'Nagi',  # Keeps being romanized as "Nagy"
    'ラーメン': 'ramen',  # Was romanized as "rahmen"???
    '日本語': 'nihongo',  # Was romanized as "nippon go"; or should this go into PP_REPLACEMENTS, since it's "nippon" + "go"?
}
```
```python
# Used in a function with replace/re.sub (see the sketch after the setup function below)
PP_REPLACEMENTS = {
    'oanichan': 'oniichan',
    'oanisan': 'oniisan',
    'ohahasan': 'okaasan',
    r'~ ': ' ~',
    r' \?': '?',  # Sometimes a space is added before the "?", idk why
}
```

Function:

```python
import cutlet

def setup_cutlet():
    """Set up the Cutlet romanization system."""
    katsu = cutlet.Cutlet()
    katsu.use_foreign_spelling = True
    hello = katsu.romaji("こんにち", capitalize=False) + "wa"
    katsu.add_exception("こんにちは", hello)
    for jp, rom in EXCEPTIONS.items():
        katsu.add_exception(jp, rom)
    return katsu
```
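For what it's worth, the post-processing function mentioned above could be as simple as this sketch, which treats every PP_REPLACEMENTS key as a regular expression (the literal keys contain no regex metacharacters, so that's safe here):

```python
import re

def postprocess(romaji: str) -> str:
    """Apply the PP_REPLACEMENTS fixes to cutlet's output."""
    for pattern, replacement in PP_REPLACEMENTS.items():
        romaji = re.sub(pattern, replacement, romaji)
    return romaji
```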
Unfortunately no - I don't train the dictionary, I just distribute it. The models are trained by NINJAL, so technically they should fix it, but I don't think they have any particular public channel for bug reports.
Yes, the larger and more recent version will be more accurate; that's the full version, which I distribute via the unidic package on PyPI.

About the specific errors...
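For anyone following along, getting the full dictionary is roughly this (the download step is documented by the unidic package itself):

```
pip install unidic
python -m unidic download
```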
This is detected as a non-ASCII, non-Japanese token. Usually that covers things like Cyrillic or other non-Latin scripts, but I hadn't considered that it also covers non-ASCII Latin. This isn't exactly a bug, but it may be possible to improve: there are existing methods for stripping things to ASCII.
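One common method, sketched here with just the standard library (not necessarily what cutlet would end up doing):

```python
import unicodedata

def to_ascii(text: str) -> str:
    # Decompose accented characters ("é" becomes "e" plus a combining mark),
    # then drop anything that doesn't survive an ASCII encode.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```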
In UniDic it looks like ◯ is treated as punctuation and has no reading. You could use a MeCab user dictionary to override this.
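Short of building a full user dictionary, cutlet's exception table can also paper over a single symbol; a sketch, assuming ◯ surfaces as its own token and that "maru" is the reading you want:

```python
import cutlet

katsu = cutlet.Cutlet()
katsu.add_exception("◯", "maru")  # assumed reading; adjust to taste
```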
Common words with odoriji like 日々 are treated as a single word in UniDic, so cutlet doesn't have to figure out how to interpret them. I put some work into handling odoriji a while ago but didn't have many examples of them not being picked up, so this may just be a bug.
UniDic has multiple entries for ドール. It looks like for this particular one it picks "d'Or", which is plausible in a generic sense, if not here. I'm not sure why the first quote mark disappears, that's probably a bug.
Exceptions in cutlet use the raw form you see in the document; see the included exceptions file. Post-processing is not a cutlet feature, so that part is up to you, but it would be easiest to work on the romaji.
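Putting your two snippets together, usage would then look something like this:

```python
katsu = setup_cutlet()
raw = katsu.romaji("ナギとユーリ")  # exceptions applied during romanization
print(postprocess(raw))             # regex fixes applied afterwards
```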
I wrote a simple Python function to transliterate given words/phrases from a specific source. These are some of the results:
I'm not too savvy with the stuff under the hood, but why is it having difficulty figuring out the context of the character usage? I'm probably going to have to add some exceptions, aren't I?