in kanji view, aedict should list a few similar kanjis #788

Closed
blastrock opened this Issue Aug 9, 2017 · 4 comments


blastrock commented Aug 9, 2017

In the kanji view, aedict could have a "not to be confused with" section to help people learn their kanjis and spot the differences between similar ones, for example 副 should not be confused with 福.

Finding similar kanjis is difficult, but someone wrote a thesis about that and compiled a database of similar kanjis.

The important files are this one and this one which use two different methods to determine a distance between kanjis.

Here is an example of a line from strokeEditDistance.csv:

天 夫 1 矢 0.8 末 0.8 未 0.8 失 0.8 丈 0.75 文 0.75 井 0.75 木 0.75 大 0.75

This means that 天 is very similar to 夫, with a "distance" of 1. Despite the name, that value behaves more like a similarity score: the higher it is, the more alike the kanjis are. On the same line, we can see that 天 is also similar to 大, but with a lower score of 0.75.
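A minimal sketch of reading one such line, assuming the fields are whitespace-separated exactly as in the example above (the real file may use a different delimiter). The function name `parse_similarity_line` is hypothetical and is not part of aedict or the database:

```python
def parse_similarity_line(line):
    """Split one line into the head kanji and its (kanji, score) pairs.

    Assumption: fields are separated by whitespace, as in the example line.
    """
    fields = line.split()
    head, rest = fields[0], fields[1:]
    # The remaining fields alternate: similar kanji, then its score.
    pairs = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return head, pairs


line = "天 夫 1 矢 0.8 末 0.8 未 0.8 失 0.8 丈 0.75 文 0.75 井 0.75 木 0.75 大 0.75"
head, similar = parse_similarity_line(line)
print(head, similar[0], similar[-1])
```

The pairs come out in file order, which is already sorted from most to least similar in this example.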

I don't think aedict needs to show the distance to the user, as it is more or less an arbitrary number, but for each kanji it should list the similar kanjis in the same order as the file (from the most similar to the least similar).

I am not sure you need both files. The yehAndLiRadical file is based on radicals rather than strokes, but the two often overlap, and strokeEditDistance really gives kanjis that look alike, even if they don't share radicals. Here is an example from yehAndLiRadical and strokeEditDistance respectively:

則 測 0.894 側 0.894 財 0.750 敗 0.750 賊 0.671 損 0.671 慣 0.671 販 0.612 漬 0.612 債 0.612
則 側 0.818182 財 0.8 貝 0.777778 測 0.75 貯 0.666667 昇 0.666667 見 0.666667 販 0.636364 敗 0.636364 眺 0.636364
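To make the overlap concrete, here is a small sketch comparing the two lines above for 則. It assumes the same whitespace-separated layout as the examples, and the helper name `neighbours` is made up for illustration:

```python
def neighbours(line):
    """Return the similar kanjis from one line, in file order.

    Assumption: whitespace-separated fields, head kanji first,
    then alternating kanji and score.
    """
    fields = line.split()
    return fields[1::2]


radical_line = "則 測 0.894 側 0.894 財 0.750 敗 0.750 賊 0.671 損 0.671 慣 0.671 販 0.612 漬 0.612 債 0.612"
stroke_line = "則 側 0.818182 財 0.8 貝 0.777778 測 0.75 貯 0.666667 昇 0.666667 見 0.666667 販 0.636364 敗 0.636364 眺 0.636364"

radical = neighbours(radical_line)
stroke = neighbours(stroke_line)

# Keep stroke order so the result stays sorted by stroke similarity.
shared = [k for k in stroke if k in radical]
only_stroke = [k for k in stroke if k not in radical]

print("in both files:", shared)
print("strokeEditDistance only:", only_stroke)
```

For this one kanji, half of the ten candidates appear in both files, and the stroke-only ones include 貝 and 見, which resemble the left half of 則 rather than sharing its full radical set.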

So I would recommend only adding strokeEditDistance, but it's up to you!

And don't forget to add a link to that page from aedict :)

Keep up the good work!

mvysny self-assigned this Aug 10, 2017

mvysny added the enhancement label Aug 10, 2017

Owner

mvysny commented Aug 10, 2017

Hey, that's a really nifty feature, thanks! I agree that strokeEditDistance is the more important one here: I believe the confusion stems from kanjis that differ only by a few misplaced strokes. Thus, misplaced/different radicals are not as important.

Let's thus start by using strokeEditDistance only. From the examples you attached (very helpful, thanks!) I guess we can consider only matches with a score of 0.8 or higher. What do you think?

Author

blastrock commented Aug 10, 2017

Going through the file, I think 0.7 would be more acceptable, for example

巨 臣 0.714286

Some kanjis are relevant even with a distance of 0.5 I think. For example:

育 青 0.625 背 0.555556

But that would include a lot of false positives. I guess you should try a value and see, and maybe gather user feedback. In my app I didn't set a limit and included all of them, but it doesn't have the same goal.
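A quick sketch of how the cutoff changes what gets shown, using only the pairs quoted in this thread. The function name `filter_by_score` and the dictionary layout are assumptions for illustration, not how aedict stores the data:

```python
def filter_by_score(pairs, min_score):
    """Keep the kanjis whose similarity score is at least min_score."""
    return [kanji for kanji, score in pairs if score >= min_score]


# Pairs copied from the strokeEditDistance examples in this thread.
examples = {
    "巨": [("臣", 0.714286)],
    "育": [("青", 0.625), ("背", 0.555556)],
}

for min_score in (0.8, 0.7, 0.5):
    for kanji, pairs in examples.items():
        print(min_score, kanji, filter_by_score(pairs, min_score))
```

With 0.8 neither kanji gets any suggestion, 0.7 brings in 臣 for 巨, and only a cutoff near 0.5 keeps 青 and 背 for 育.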

Owner

mvysny commented Sep 12, 2017

Okay, the newest Kanjidic2 has been indexed with the stroke edit distance info and uploaded to the server. Please wait for Aedict 3.42, which will be able to display that information.

mvysny closed this Sep 12, 2017

Author

blastrock commented Sep 20, 2017

I just saw the feature, great job! And thank you! :)
