Example - wrong kana conversion in "I don't care for the way he talks" #753

Closed
Todew opened this Issue Apr 30, 2017 · 4 comments

Todew commented Apr 30, 2017

Negative is turned into affirmative.

mvysny commented Apr 30, 2017

Do you mean this one? https://tatoeba.org/eng/sentences/show/287845 And do you mean the conversion of kiniiranai -> kiniiru?

Todew commented Apr 30, 2017

mvysny commented May 1, 2017

Sure, it does. The problem is as follows. I'm using the data from the Tatoeba project: https://tatoeba.org/eng/downloads
I am extracting furigana from the jpn_indices.tar.bz2 file, which is a list of so-called B-lines (the B-line format is described here: http://www.edrdg.org/wiki/index.php/Sentence-Dictionary_Linking )

The B-line for this particular sentence is as follows:
彼(かれ) の 話し方 が 気に入る{気にいらない} のだ{のです}
That means that 気に入る is the dictionary form of 気にいらない. To get the reading, I need to find 気に入る in JMDict and resolve its reading (which works correctly: kiniiru). The algorithm then takes that reading and tries to match it onto the "sentence form", 気にいらない. In this particular case that is impossible: the kanji 入 is not present in the sentence, so 入る turns entirely into いらない and my algorithm can't match it. (It can match other cases, and sometimes it does wonders, but here the stem is simply too different from the conjugated form, and I try to avoid 'just guessing'.) Thus, kiniiru is kept as-is.
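To make the B-line structure above concrete, here is a minimal Python sketch of splitting such a line into tokens. It is not the Aedict code: `BToken`, `parse_b_line`, and the regular expression are names and assumptions made up for illustration, and the optional `[sense]` and trailing `~` parts follow my reading of the format description linked above.

```python
import re
from typing import List, NamedTuple, Optional


class BToken(NamedTuple):
    """One element of a B-line (hypothetical type, for illustration only)."""
    headword: str            # dictionary form, e.g. 気に入る
    reading: Optional[str]   # text in (...), e.g. かれ
    sense: Optional[int]     # number in [...], if any
    surface: Optional[str]   # form used in the sentence, from {...}
    checked: bool            # trailing ~ marker


# headword, then optional (reading), [sense], {surface form}, and ~
_TOKEN = re.compile(
    r"(?P<head>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"(?P<checked>~)?"
)


def parse_b_line(line: str) -> List[BToken]:
    """Split a B-line on whitespace and parse each element."""
    tokens = []
    for raw in line.split():
        m = _TOKEN.fullmatch(raw)
        if m is None:
            raise ValueError(f"unrecognised B-line element: {raw!r}")
        sense = int(m.group("sense")) if m.group("sense") else None
        tokens.append(BToken(m.group("head"), m.group("reading"), sense,
                             m.group("surface"), m.group("checked") is not None))
    return tokens


tokens = parse_b_line("彼(かれ) の 話し方 が 気に入る{気にいらない} のだ{のです}")
```

For the sentence in this issue, the fifth token has the headword 気に入る and the surface form 気にいらない, which is exactly the pair the reading has to be matched across.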

This issue has two possible solutions:

  1. Do not provide any reading in this case. I think that a possibly deconjugated reading is better than none - what do you think?
  2. Contact the Tatoeba folks and ask them to include the conjugated reading of kiniiranai in the B-line somehow. This would need to be discussed with them, since the B-line format currently has no support for such information.

But maybe in this case I can actually assign the readings to the individual kanji (that is, 気->ki and 入->i) and simply replace them in the dictionary form? That could be doable.
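A rough Python sketch of that idea, under stated assumptions: it is not the actual fix that went into Aedict, it works on kana readings (きにいる) rather than romaji, and `kanji_readings` and `surface_reading` are hypothetical helper names. The kana in the dictionary form are used as anchors to split the reading between the kanji, and the resulting per-kanji readings are then substituted into the sentence form.

```python
import re
from typing import Dict, Optional


def is_kanji(ch: str) -> bool:
    # Simplification: only the main CJK block plus the iteration mark.
    return "\u4e00" <= ch <= "\u9fff" or ch == "々"


def kanji_readings(headword: str, reading: str) -> Optional[Dict[str, str]]:
    """Map each kanji run in `headword` to its share of `reading`."""
    pattern, runs, i = "", [], 0
    while i < len(headword):
        if is_kanji(headword[i]):
            j = i
            while j < len(headword) and is_kanji(headword[j]):
                j += 1
            runs.append(headword[i:j])
            pattern += "(.+?)"                    # reading of this kanji run
            i = j
        else:
            pattern += re.escape(headword[i])     # kana must match literally
            i += 1
    m = re.fullmatch(pattern, reading)
    return dict(zip(runs, m.groups())) if m else None


def surface_reading(headword: str, reading: str, surface: str) -> Optional[str]:
    """Reading of the sentence form, or None rather than guessing."""
    mapping = kanji_readings(headword, reading)
    if mapping is None:
        return None
    out = surface
    for run in sorted(mapping, key=len, reverse=True):
        out = out.replace(run, mapping[run])
    # Any kanji left over means the mapping did not cover the sentence form.
    return None if any(is_kanji(c) for c in out) else out


result = surface_reading("気に入る", "きにいる", "気にいらない")
```

Here `kanji_readings` yields 気->き and 入->い, and the sentence form resolves to きにいらない. The approach is still a heuristic: a verb whose kanji reading changes with conjugation (来る is くる, but 来ない is こない) would get the wrong reading.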

mvysny commented May 1, 2017

Fixed! :-) Please wait until I publish the updated Tatoeba dictionary files, then update the dictionaries in Aedict on your phone.

@mvysny mvysny closed this May 1, 2017
