Sentence analysis bug with verb-suru #856

Closed
Spanja opened this Issue Jan 21, 2019 · 6 comments

Spanja commented Jan 21, 2019

Hello,

I think there is a mistake in the sentence analysis of suru verbs in the -tai form.
Just search for sentences that contain したい:

screenshot_20190120-235307

The したい is analyzed as 死体 instead of as part of 遠慮したい (遠慮 + する).

BTW, what does は|1 mean in the analysis?
Is this a bug?

Pixel 2
Android 9
French dictionary

mvysny self-assigned this Jan 22, 2019

mvysny added the bug label Jan 22, 2019

Owner

mvysny commented Jan 22, 2019

The sentence is this one: https://tatoeba.org/eng/sentences/show/196839
The problem here is that Aedict doesn't do the analysis itself; the analysis comes from the Tatoeba site. You can verify that by downloading the "Japanese Indices" from here: https://tatoeba.org/eng/downloads
In the file, search for "196839" (the sentence ID), and you'll find the following line:

196839	34018	ベジタリアン~ なので 出来れば{できれば} 御(お){お} 肉[02] は|1 遠慮 したい

It's a space-separated format in which a word may carry its reading in () and the form used in the sentence in {}.

We can see the following:

  1. は|1 is listed, without any further explanation as to what the |1 part means. Please try finding out the meaning on the Tatoeba forums - I'm also interested in what that means :-)
  2. したい is listed as a separate word, without kanji or any deinflection information. Therefore Aedict assumes it is a standalone word in its base form and tries to find something matching that. We can see that it finds nonsense, but it's the best we can do. This is a bug in the data file: it should not list 遠慮 and したい separately, but as one word with the deinflection information attached: 遠慮suru{遠慮したい}. Since it's a bug in a data file provided by Tatoeba, please feel free to report it to the Tatoeba guys.

Please see here for the details on the format of the data file: https://dict.longdo.com/about/hintcontents/tanakacorpus.html
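
To make the format concrete, here is a minimal Python sketch that splits such an index line into its fields. It is not Aedict's parser: the names `IndexToken`, `parse_token` and `parse_index_line` are hypothetical, the order of the optional parts inside a token is an assumption based on the format description linked above, and the `|1` suffix is kept as an opaque tag because its meaning is not established here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed token layout: headword, then optional |N, (reading), [sense], {form}, ~
_TOKEN_RE = re.compile(
    r"(?P<headword>[^(\[{~|]+)"
    r"(?:\|(?P<tag>\d+))?"           # e.g. は|1 (meaning unknown, kept as-is)
    r"(?:\((?P<reading>[^)]*)\))?"   # e.g. 御(お)
    r"(?:\[(?P<sense>\d+)\])?"       # e.g. 肉[02]
    r"(?:\{(?P<form>[^}]*)\})?"      # e.g. 出来れば{できれば}
    r"(?P<checked>~)?"               # trailing ~ marker
)


@dataclass
class IndexToken:
    headword: str
    tag: Optional[str] = None
    reading: Optional[str] = None
    sense: Optional[str] = None
    form: Optional[str] = None
    checked: bool = False


def parse_token(raw: str) -> IndexToken:
    """Parse one space-separated token of the index text."""
    match = _TOKEN_RE.fullmatch(raw)
    if match is None:
        raise ValueError(f"unrecognized token: {raw!r}")
    return IndexToken(
        headword=match["headword"],
        tag=match["tag"],
        reading=match["reading"],
        sense=match["sense"],
        form=match["form"],
        checked=match["checked"] is not None,
    )


def parse_index_line(line: str) -> Tuple[int, int, List[IndexToken]]:
    """Split a line into (sentence id, second id, tokens)."""
    # The first two whitespace-separated fields are numeric IDs; the rest is the text.
    sentence_id, second_id, text = line.strip().split(None, 2)
    return int(sentence_id), int(second_id), [parse_token(t) for t in text.split()]
```

Parsing the line above with `parse_index_line` yields eight tokens, the last two being 遠慮 and したい with no reading, sense or sentence form attached, which is exactly the situation described in point 2.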

Owner

mvysny commented Jan 23, 2019

@Spanja thank you very much for bringing up the topic with the Tatoeba guys. I'll attach the conversation here.

Regarding the Japanese Indices, Trang of Tatoeba wrote:

Jim Breen will be able to answer your questions better than us. I've put him in CC.
The Tatoeba team is not maintaining these indices, we only provide the platform.

Then Jim replied:

[Aedict] assumes it is a standalone word in its base form and tries to find something matching that. We can see that it finds nonsense, but it's best we can do. This is a bug in the data file, the data file should not list 遠慮 and したい separately, but as one word with the deinflection information attached:

In fact したい is in the dictionary as a standalone entry, so it is not a bug in the sentence
database. For the "vs" entries any occurrence of the する (した/して/etc.) part is always indexed
separately in the sentence indices.

Jim, thank you very much for the information, I stand corrected. You're also right: there is in fact an entry "sitai" without any kanji. I'll fix Aedict to match that entry then.
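
As a rough illustration of the idea (not the actual Aedict change), the Python sketch below prefers a kana-only dictionary entry when the indexed word is itself written purely in kana, and only then falls back to any entry matching by kanji or reading. The `Entry` type, the `lookup` function, the two-entry toy dictionary and its glosses are all made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    kanji: Optional[str]  # None for entries written in kana only
    reading: str
    gloss: str            # illustrative gloss, not copied from a dictionary


# Toy data: two entries that share the reading したい.
TOY_DICTIONARY = [
    Entry("死体", "したい", "corpse"),
    Entry(None, "したい", "want to do"),
]


def is_kana(text: str) -> bool:
    """True if every character is in the hiragana or katakana blocks."""
    return bool(text) and all("\u3040" <= ch <= "\u30ff" for ch in text)


def lookup(word: str, dictionary: List[Entry]) -> Optional[Entry]:
    """Find an entry for an indexed word, preferring kana-only entries for kana words."""
    if is_kana(word):
        for entry in dictionary:
            if entry.kanji is None and entry.reading == word:
                return entry
    # Fallback: first entry matching by kanji or by reading.
    for entry in dictionary:
        if entry.kanji == word or entry.reading == word:
            return entry
    return None
```

With this ordering, looking up したい returns the kana-only entry instead of 死体, while looking up 死体 still returns the kanji entry.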

Owner

mvysny commented Jan 23, 2019

Fixed in Aedict 3.50.11

mvysny closed this Jan 23, 2019

Author

Spanja commented Jan 25, 2019

Does that also fix the auto-generated sentence analysis in the inflection search?

For example, if you search for 勉強しました, you get 勉強 and 擦る as the result, but that's the wrong する:

screenshot_20190125-194148
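
To spell out the split I would expect, here is a small Python sketch that separates a noun + する form such as 勉強しました into the noun and the base verb する. It is only an illustration under my own assumptions, not how Aedict deinflects: the function name `split_suru_form` and the short list of endings are hypothetical and far from complete.

```python
from typing import Optional, Tuple

# A few conjugated forms of する; illustrative only, not a full conjugation table.
SURU_ENDINGS = ["しました", "します", "しない", "したい", "して", "した", "する"]


def split_suru_form(word: str) -> Optional[Tuple[str, str]]:
    """Return (noun, "する") if the word ends in a known する form, else None."""
    # Try longer endings first so that したい is not mistaken for した + い.
    for ending in sorted(SURU_ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)], "する"
    return None
```

For 勉強しました this returns 勉強 and する, and for 遠慮したい it returns 遠慮 and する. It does not say which dictionary entry for する should be shown, which is the actual question here.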

Owner

mvysny commented Jan 25, 2019

Not sure, let me check!

mvysny reopened this Jan 25, 2019

Owner

mvysny commented Feb 7, 2019

Yes, the problem is that there is no "correct" する in JMDict. Try searching for "suru" at https://aedict-online.eu - if you can spot the correct suru just let me know in a new bug report please.

mvysny closed this Feb 7, 2019
