Example sentences showing all occurrences of kanji when looking for single-kanji expressions #809

Open
chabo128 opened this Issue Nov 2, 2017 · 7 comments

chabo128 commented Nov 2, 2017

I've run into this several times and tried looking for a fix in the settings but can't seem to find anything. Inclusion of the Tatoeba Examples is a great feature, but it really backfires when looking for example sentences of a one-kanji expression. For example, I was searching for example sentences using 程(ほど), but the vast majority of example sentences listed are for 程度(ていど). In the past when I had this issue, the word I was looking for usually outnumbered the unrelated ones, but sometimes, like this case, scrolling through example sentences unrelated to my search is a bit tedious. An option to display example sentences only with occurrences identical to the search would be really useful. Apologies if there's already a workaround for this, but I haven't found it yet. Thanks!

screenshot_20171102-085836


Owner

mvysny commented Nov 2, 2017

Thanks for letting me know, this definitely looks like a bug. Let me try to find some ways to improve accuracy.

mvysny self-assigned this Nov 2, 2017

mvysny added the bug label Nov 2, 2017

Owner

mvysny commented Nov 8, 2017

Fixed, now it will find the following sentences (and 1000 others):

悪ふざけはほどほどにしろ。: わるふざけはほどほど に しろ。
早ければ早いほどいい。: はやければ はやい ほど いい。
早ければ早い程よい。: はやければはやいほどいい。
死ぬほどお会いしたい。: しぬ ほど あい したい。
金が腐るほどある。: かね が くさる ほど ある。
また後ほど来ます。: また のちほど くます。

Fixed in Aedict 3.46

mvysny closed this Nov 8, 2017

Author

chabo128 commented Nov 10, 2017

I really appreciate your attention to this. However, I don't think the update gets at the root of the matter. I think the main issue is that it's pulling examples from every occurrence of the characters ほど. It doesn't look like 3.46 is available yet, but the issue I mentioned is apparent in the fix you posted as well. ほどほどに and ほど use the same characters, but their meaning and usage is quite different. They each have their own dictionary entry, yet the example sentences for one are displayed when searching for the other.

Also, I didn't mention this in my original post, but I noticed that the example sentences that appear under the actual full dictionary entry/definition appear to pull from every instance of the kanji 程 (even if I search for the reading ほど), making the issue much more apparent. If I take the time to do a new Word Search, search for ほど, and check Examples, it displays only occurrences of ほど (not displaying words like 程度, but including words like ほどほどに since they share the same kana). Usually when I'm looking for examples it's after I search for the definition and I'm already looking at the dictionary entry, so the example sentences I'm seeing usually include any occurrence of the kanji if it's a single-kanji expression. If there's a way to make the example sentences under the dictionary entry show only occurrences of the kanji's actual reading (ほど in this case) rather than every example sentence associated with the kanji 程 (程度, 日程, 過程, etc.), that would help alleviate the issue. It would still display examples that share the same kana (in this case, searching for ほど would still display examples for ほどほどに), though those instances would be far less common.

I'm sure the solution (if there even is one) is very complex and would probably require more than one overworked programmer to find, so please don't consider this a complaint of any kind. It's just a slight hiccup I noticed that I think would greatly improve an already awesome app if resolved :)

Owner

mvysny commented Nov 10, 2017

> I think the main issue is that it's pulling examples from every occurrence of the characters ほど. It doesn't look like 3.46 is available yet, but the issue I mentioned is apparent in the fix you posted as well. ほどほどに and ほど use the same characters, but their meaning and usage is quite different. They each have their own dictionary entry, yet the example sentences for one are displayed when searching for the other.

Yup, 3.46 is not out yet; I plan to fix some other things as well before releasing that.

I humbly disagree: if you look up 悪ふざけはほどほどにしろ on Aedict Online, you can see in the sentence breakdown that ほどほど's kanjis are indeed 程々, 程ほど, 程程, so from my point of view this sentence is eligible to be included as an example sentence for 程/ほど. You're right, it's ほどほど, not just ほど - should I try to filter this out as well?

> Also, I didn't mention this in my original post, but I noticed that the example sentences that appear under the actual full dictionary entry/definition appear to pull from every instance of the kanji 程 (even if I search for the reading ほど), making the issue much more apparent.

Yup, the search was very simple and only looked up 程; that was incorrect. Now the search will search for 程+ほど and thus it should filter out 程度, 日程, 過程 and others. Yet it will include 程々, 程ほど, 程程, so please let me know if that is a problem or not.
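The difference between the old kanji-only lookup and the 程+ほど lookup described above can be sketched roughly as follows. This is only an illustration, not Aedict's code: the `Token` type, both function names, and the hand-written word breakdowns are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    surface: str                  # word as written in the sentence
    reading: str                  # kana reading of the word
    kanji_forms: Tuple[str, ...]  # dictionary kanji spellings of the word


def has_kanji(tokens: List[Token], kanji: str) -> bool:
    """Old behaviour as described: any word spelled with the kanji matches."""
    return any(kanji in form for t in tokens for form in t.kanji_forms)


def has_kanji_and_reading(tokens: List[Token], kanji: str, reading: str) -> bool:
    """Described new behaviour: one word must carry both the kanji and the reading."""
    return any(
        reading in t.reading and any(kanji in form for form in t.kanji_forms)
        for t in tokens
    )


# Hypothetical breakdowns; only the relevant word of each sentence is listed.
hodo = [Token("程", "ほど", ("程",))]
teido = [Token("程度", "ていど", ("程度",))]
hodohodo = [Token("ほどほど", "ほどほど", ("程々", "程ほど", "程程"))]

for name, tokens in [("程", hodo), ("程度", teido), ("ほどほど", hodohodo)]:
    print(name, has_kanji(tokens, "程"), has_kanji_and_reading(tokens, "程", "ほど"))
```

Under these assumptions 程度 passes the kanji-only check but fails the kanji+reading check, while ほどほど still passes both, which mirrors the behaviour described in the comment.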

Author

chabo128 commented Nov 13, 2017

> I humbly disagree: if you look up 悪ふざけはほどほどにしろ on Aedict Online, you can see in the sentence breakdown that ほどほど's kanjis are indeed 程々, 程ほど, 程程, so from my point of view this sentence is eligible to be included as an example sentence for 程/ほど.

Well, I'd argue that even though both expressions share the same kanji that is read as ほど, the meaning and usage is still very different. 程 is categorized as a common adverbial noun, and ほどほど is a rare no-adjective. It should certainly be included in the "Buddies" tab, which it is, but if someone is looking for example sentences to clarify the appropriate usage of ほど, I think including ほどほど muddles things up a bit. It'd be like an English dictionary including example sentences for "graduate" under the entry for "graduated cylinder"...they share the same root word (English "kanji" for all intents and purposes), which is useful reference information, but including example sentences for "graduate" under the entry for "graduated cylinder" wouldn't make much sense.

> Now the search will search for 程+ほど and thus it should filter out 程度, 日程, 過程 and others. Yet it will include 程々, 程ほど, 程程, so please let me know if that is a problem or not.

That's great news! If it's applied to search results beyond this specific case like I think you're saying, it would solve 90% of the issue for me (100% of the problem initially mentioned in my post). If there was a solution or option that would filter out entries like ほどほど as well I would opt for that for reasons I mentioned earlier, though I think your update is a huge improvement and I'd be totally happy with that alone. Thanks a ton! お疲れ様でした。

Owner

mvysny commented Dec 9, 2017

Happy to help! The "90%" solution is implemented in Aedict 3.46; please upgrade and let me know if it helped.

Regarding the 100% solution: I wonder if I can filter out example sentences where the word is not used in that exact form (unfortunately I have no information about a word in a sentence being, say, adverbial noun. I can auto-reconstruct this information by doing a JMDict lookup, but it's not 100% accurate). But I wonder if that would not filter out useful examples where the word is used in a slightly modified form (say, irregularly inflected or otherwise). Hard to say.
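The trade-off in the paragraph above can be illustrated with a small sketch. It assumes each word in a sentence is available as a pair of its written form and a dictionary form guessed by some lookup; the `Word` type, the two filter functions, and the hand-written breakdowns are hypothetical and are not taken from Aedict.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    surface: str     # form that appears in the sentence
    dictionary: str  # dictionary form guessed by a (hypothetical) lookup


def exact_form_match(words: List[Word], entry: str) -> bool:
    """Strict filter: the entry must appear in the sentence unchanged."""
    return any(w.surface == entry for w in words)


def dictionary_form_match(words: List[Word], entry: str) -> bool:
    """Looser filter: inflected forms count if they map back to the entry."""
    return any(w.dictionary == entry for w in words)


# Hand-written breakdowns for illustration only.
hodohodo = [Word("ほどほど", "ほどほど")]  # from 悪ふざけはほどほどにしろ。
hayakereba = [Word("早ければ", "早い")]    # inflected form of 早い

print(exact_form_match(hodohodo, "ほど"))         # False
print(exact_form_match(hayakereba, "早い"))       # False
print(dictionary_form_match(hayakereba, "早い"))  # True
```

The strict filter drops ほどほど as requested, but it also drops a sentence that contains only an inflected form such as 早ければ. The looser filter keeps such sentences, but it is only as good as the guessed dictionary form.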

Your example of "graduate" vs "graduated cylinder" is very illustrative, thanks! The difference between "graduate" and "graduated" is more of a difference in semantics, and that's something a simple automated algorithm can't differentiate. It would require having either a hand-selected set of examples (which we don't have), or a proper analysis of all example sentences by a human (which is definitely missing from Tatoeba: the analysis is done by Aedict, using a very simple matching algorithm, and is thus often inaccurate).

As you can see, we're dealing with imperfect data here. I believe that requiring a 100% match may do more harm than good.

mvysny reopened this Dec 9, 2017

Author

chabo128 commented Dec 11, 2017

Thanks for the update! Looks to be working quite well. Everything you said makes perfect sense...I suppose that's the price you pay for using a huge selection of "unregulated" but very useful information like Tatoeba.

Before the update, I ran into something interesting as I was looking up "熟す" (こなす). It seemed to be displaying example sentences for "熟す"(じゅくす)in addition to こなす, which seemed really similar to the ほど/ていど issue from before. I screenshotted it before the update:
screenshot_20171130-194500
I thought I'd check out what it looks like on 3.46 and ran into something peculiar. Here's the entry for the same word, こなす. Notice how all of the examples related to fruit ripening are left out of the new updated one.
screenshot_20171211-095027
This word is pretty complicated so I looked it up elsewhere...as the Aedict entry says, こなす is usually written in kana, and doesn't appear to be related to fruit ripening, whereas 熟す(じゅくす)written with kanji is used like "ripen", even though there's an example or two for こなす in the pre-3.46 entry related to ripening (like "The apples are ripe. りんごが熟(こな)している."). I looked up じゅくす after the update, expecting to see the examples related to ripening that are now missing from the entry for こなす, and here's what it looks like:
screenshot_20171211-093243
As you can see, the example sentences related to ripening that were formerly in the entry for こなす ("The apples are ripe") are nowhere to be seen in either じゅくす or こなす. However, when I do a new word search, check "examples", and search for "熟す", there it is:
screenshot_20171211-090137
Not sure what causes this or how it was affected by the update. I noticed that the post-update examples in the entry for 熟す(じゅくす) don't include conjugations like 熟している, and the word search with "examples" checked does include those. Also, the furigana above 熟す in both the old/new example "The apples are ripe" reads こなす, though when I click it post-update, the word breakdown displays the entry for じゅくす (which I believe is the correct one in this situation, rather than こなす):
screenshot_20171211-094327
熟す and its readings are reeeeeally complicated as both share the same kanji and hiragana, and the meaning changes depending on the context and on whether it's written using kanji or not. I'm not even sure you can go so far as to call it a bug, aside from the last screenshot where the furigana differs from the actual word it's referencing. This is a pretty specific issue that probably wouldn't affect the vast majority of users, but hopefully the information might help address a larger issue as well.

Thanks again for addressing this!
