ACTION_SEARCH_JMDICT API: Option to include reference to matched section of query #497

Closed
timrae opened this issue May 5, 2015 · 26 comments

Comments

@timrae

timrae commented May 5, 2015

When an inexact / compound query is searched, it would be useful to be able to match the result back to the query.

Example:
If I search for 寿司が食べたい then I'd want to get back 寿司、が、食べたい together with the search results 寿司、が、食べる. The start and stop indices of the match would work as well.

@timrae
Author

timrae commented May 5, 2015

Another option could be to just include the verb inflections separately like kuromoji does. For example 寿司が食べたい。 returns:

Surface form | Part-of-Speech | Base form | Reading | Pronunciation
寿司 | 名詞,一般,*,* | 寿司 | スシ | スシ
が | 助詞,格助詞,一般,* | が | ガ | ガ
食べ | 動詞,自立,*,* | 食べる | タベ | タベ
たい | 助動詞,*,*,* | たい | タイ | タイ
。 | 記号,句点,*,* | 。 | 。 | 。

@timrae
Author

timrae commented May 7, 2015

After actually spending quite a bit of time working with kuromoji and furigana, I've found that reassembling the words from their output is a bit of a pain, so please just ignore my last comment. In the end, I think two indices (start, stop) for each entry would be the most convenient for me:

e.g. the following sentence 寿司が食べたい。 should return:
寿司 (0, 2)
が (2, 3)
食べる (3, 7)

Do you think this is something you could add to the API relatively quickly? It's kind of crucial for me to proceed with my application.
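
The request above can be sketched in a few lines of Java. This is only an illustration of why (start, stop) ranges are convenient: they let the caller pair each dictionary-form result with the text that was actually typed. The `QueryMatch` and `QueryMatchDemo` classes are hypothetical names invented for this sketch and are not part of the Aedict API; the ranges are assumed to be half-open and to count Java `char` positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one search hit (not an Aedict class): the dictionary
// form plus the half-open [start, stop) range it covers in the original query.
final class QueryMatch {
    final String dictionaryForm;
    final int start;
    final int stop;

    QueryMatch(String dictionaryForm, int start, int stop) {
        this.dictionaryForm = dictionaryForm;
        this.start = start;
        this.stop = stop;
    }

    // Returns the part of the query that produced this hit.
    String surface(String query) {
        return query.substring(start, stop);
    }
}

final class QueryMatchDemo {
    // Pairs each surface form in the query with its dictionary form.
    static List<String> describe(String query, List<QueryMatch> matches) {
        List<String> lines = new ArrayList<>();
        for (QueryMatch m : matches) {
            lines.add(m.surface(query) + " -> " + m.dictionaryForm);
        }
        return lines;
    }

    public static void main(String[] args) {
        String query = "寿司が食べたい。";
        List<QueryMatch> matches = Arrays.asList(
                new QueryMatch("寿司", 0, 2),
                new QueryMatch("が", 2, 3),
                new QueryMatch("食べる", 3, 7));
        for (String line : describe(query, matches)) {
            System.out.println(line);
        }
    }
}
```

With the ranges from the example, the inflected 食べたい in the query maps back to the dictionary form 食べる without any re-tokenizing on the caller's side.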

@mvysny
Owner

mvysny commented May 7, 2015

Hi Tim, thanks for the feature request. I will revisit the API and let you know. Don't hold your breath though: this will be done next week at the earliest... sorry.

@mvysny
Owner

mvysny commented May 14, 2015

Just to confirm: you use the ACTION_SEARCH_JMDICT API with "kanjis" set to "寿司が食べたい" and "return_results" set to true. Would it be okay if, in the resulting list of maps, each map contained e.g. "origin_range" in the format 0,2 (that is, start index, end index, no braces)?

@timrae
Author

timrae commented May 14, 2015

Yes, great!

@mvysny
Owner

mvysny commented May 15, 2015

Fixed in Aedict 3.19

@mvysny mvysny closed this as completed May 15, 2015
@mvysny
Owner

mvysny commented May 15, 2015

The key will be called "position_in_sentence".

@mvysny
Owner

mvysny commented Jun 16, 2015

Tim, can you please share a link to your application if it is on Google Play? I'm quite interested.

@timrae
Author

timrae commented Jun 17, 2015

@mvysny
It's not on Google Play yet, I mainly just made it for myself to be honest, but OK I'll try and upload it sometime this week.

@mvysny
Owner

mvysny commented Jun 17, 2015

If you do not wish to go public, no problem. In that case, if it is okay with you, you can just send me the APK via e-mail. Thanks!

@timrae
Author

timrae commented Jun 17, 2015

Probably I'll make it available via the beta testing facilities on Google Play, just give me a few days.

@timrae
Author

timrae commented Jun 17, 2015

This doesn't appear to be working (I'm currently using a different analysis engine because I needed this feature before I could proceed with Aedict)... I just sent the following query, taken from Wikipedia:

漢字(かんじ)は、古代中国に発祥を持つ文字。古代において中国から日本、朝鮮、ベトナムなど周辺諸国にも伝わり、その形態・機能を利用して日本語など各地の言語の表記にも使われている(ただし、現在は漢字表記を廃している言語もある。日本の漢字については日本における漢字を参照)。

The first hit has "position_in_sentence" -> "127,2", which is obviously wrong. Others were wrong as well.

I also tried a simpler search: 漢字難しいよ and got back:
("kanji" -> "漢字", "position_in_sentence" -> "0,2")
("kanji" -> "難しい, 六借しい, 六ヶ敷い", "position_in_sentence" -> "2,3")
("kanji" -> "よ", "position_in_sentence" -> "5,1")

Whereas I'd expect to get back:
"0,2"
"2,5"
"5,6"

@mvysny mvysny reopened this Jun 17, 2015
@mvysny
Owner

mvysny commented Jun 17, 2015

The position of "127,2" is obviously wrong; I'll look at it. Regarding the simpler search: the second number is actually the length of the matched string, so if you convert 5,1 into the start,end notation you get 5,6.
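
The start,length convention described above can be converted to start,end with a small helper. This is a minimal sketch: `SentencePosition` is a hypothetical class written for illustration, not something Aedict provides, and it assumes that the value is always two comma-separated integers and that the indices count Java `char` positions in the query string.

```java
// Hypothetical helper (not part of Aedict): converts a "position_in_sentence"
// value in start,length form, such as "5,1", into start and exclusive end indices.
final class SentencePosition {
    final int start;
    final int end; // exclusive

    private SentencePosition(int start, int end) {
        this.start = start;
        this.end = end;
    }

    // Parses "start,length" and computes end = start + length.
    static SentencePosition parse(String value) {
        String[] parts = value.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected \"start,length\" but got: " + value);
        }
        int start = Integer.parseInt(parts[0].trim());
        int length = Integer.parseInt(parts[1].trim());
        return new SentencePosition(start, start + length);
    }

    // Returns the matched part of the query.
    String slice(String query) {
        return query.substring(start, end);
    }

    public static void main(String[] args) {
        String query = "漢字難しいよ";
        for (String raw : new String[] {"0,2", "2,3", "5,1"}) {
            SentencePosition p = parse(raw);
            System.out.println(raw + " -> " + p.start + "," + p.end + " " + p.slice(query));
        }
    }
}
```

For the simpler search in this thread, the three values convert to 0,2, then 2,5, then 5,6, which covers 漢字, 難しい, and よ respectively.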

@timrae
Author

timrae commented Jun 17, 2015

Ah! Thanks!!

@mvysny
Owner

mvysny commented Jun 17, 2015

Hmm, I just tried the long sentence from the wiki and the analyzer got it right: the first hit was 漢字: かんじ with the range of 0,2... Can you please let me know which word had the position of 127,2 (which is 127,129 translated to the start,end notation)?

@timrae
Author

timrae commented Jun 17, 2015

You can see here in the debugger... Item 0 in the list of results from Aedict has the indices 127,2
[screenshot: debugger view of the result list]

@timrae
Author

timrae commented Jun 17, 2015

Here is the exact string getting sent through the (sk.baka.aedict3.action.ACTION_SEARCH_JMDICT) intent:
漢字(かんじ)は、古代中国に発祥を持つ文字。古代において中国から日本、朝鮮、ベトナムなど周辺諸国にも伝わり、その形態・機能を利用して日本語など各地の言語の表記にも使われている(ただし、現在は漢字表記を廃している言語もある。日本の漢字については日本における漢字を参照)。

I'm using Aedict v3.25

@timrae
Author

timrae commented Jun 17, 2015

It seems to be working fine with the _NOUI intent... Do they have different code paths?

@mvysny
Owner

mvysny commented Jun 17, 2015

Yes, the _NOUI intent is handled by a different (invisible) Activity, but the search engine should be the same... Let me check the UI version.

@mvysny
Owner

mvysny commented Jun 17, 2015

Gotcha. 漢字;かんじ was present multiple times in the sentence; the 127,2 was the last location. Fixed in Aedict 3.26 so that the first 漢字;かんじ will receive the correct location of 0,2 and the last 漢字;かんじ will receive 127,2

@mvysny mvysny closed this as completed Jun 17, 2015
@timrae
Author

timrae commented Jun 17, 2015

Hmmm, why was the NOUI intent returning the correct result though?

@mvysny
Owner

mvysny commented Jun 17, 2015

The NOUI intent grabbed the result and fed it directly to the intent. The UI intent grabbed the result, transformed it into a displayable list, displayed it, then transformed it into something that could be exported and fed that into the intent. At some point the transformation used a HashMap to retain the original information, including the original sentence position. Weird, I know, but that's how the implementation currently works :)
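
The effect described above can be reproduced with a tiny, purely illustrative Java sketch. This is not Aedict's code: `DuplicateEntryDemo` and its methods are hypothetical, and the sketch assumes the map was keyed by the dictionary entry, which would explain why a repeated entry ended up with the position of its last occurrence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration (not Aedict's implementation) of how keying
// positions by entry loses information when an entry occurs more than once.
final class DuplicateEntryDemo {
    // Each hit is {entry, position}. A later put() with the same key
    // replaces the earlier value, so only the last position survives.
    static Map<String, String> positionsByEntry(String[][] hits) {
        Map<String, String> positions = new HashMap<>();
        for (String[] hit : hits) {
            positions.put(hit[0], hit[1]);
        }
        return positions;
    }

    // Keeping one element per hit preserves every occurrence in order.
    static List<String> positionsPerHit(String[][] hits) {
        List<String> positions = new ArrayList<>();
        for (String[] hit : hits) {
            positions.add(hit[1]);
        }
        return positions;
    }

    public static void main(String[] args) {
        String[][] hits = {
            {"漢字", "0,2"},
            {"漢字", "127,2"},
        };
        System.out.println(positionsByEntry(hits).get("漢字")); // last occurrence only
        System.out.println(positionsPerHit(hits));              // both occurrences
    }
}
```

Storing the position on each hit, rather than looking it up by entry afterwards, avoids the collision regardless of how many times a word repeats in the sentence.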

@timrae
Author

timrae commented Jun 17, 2015

Ah I see, thanks! I'll use the NOUI version, which seems like it should be more reliable in general.

@timrae
Author

timrae commented Jun 17, 2015

By the way, I can see a ton of results like JMDICT: Query jp:WでW produced 1 results (result size was limited to 1) in the catlog...

It's standard practice to print only the absolutely necessary logs in the released version of an app, as logging can slow things down quite a lot. In AnkiDroid we use a library called Timber to disable all except warning and error level logs in the release version. Apparently it's also possible to filter them out automatically with ProGuard.

@timrae
Author

timrae commented Jun 19, 2015

@mvysny
You can get an APK for my very simple app here:
https://github.com/timrae/rikaidroid/releases

It's using an online engine for the sentence analysis instead of Aedict for performance reasons. However, tapping on any of the analyzed words opens the word in Aedict. If I can get the sentence analysis with Aedict to work better and faster, then I'd like to use that instead.

@mvysny
Owner

mvysny commented Jun 20, 2015

Well, sentence analysis is a tedious process, and offline analysis will be slower than online analysis (unless you are running some flagship phone) because the hardware is much slower. Just out of curiosity: which online service are you using for the sentence analysis?
