Raw data from JMDict? #747

Closed
atagunov opened this issue Mar 28, 2017 · 13 comments

@atagunov

atagunov commented Mar 28, 2017

Hi,

Android AEdict has become a daily tool for me, thanks a bunch.

[1]

One thing which I'm still sorely missing is the ability to view raw data from JMDict.
It would be nice if I could click a small button in the 'Dictionary Entry Screen', or slide my finger from one of the screen edges, or something else, to view raw data like this

http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1225790

or even like this

http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&disp=jm&e=21440

That way I would be able to get the full richness of info, particularly the full details of frequency notes (spec2,news2,nf25) and the full relationship between spellings and glosses. I appreciate all the hard work that has gone into presenting this info nicely and wish you all the best in further bettering it, but I still treat, and I guess always will treat, the raw data as the ultimate truth and would like to be able to access it from the app (I don't always have access to the internet, so having it all on my phone is super-handy).

[2]

If I can abuse the issue system as a communication tool - would it be possible to ask you to name the source of pitch accent data? Just as I'm able to view JMDict data outside of AEDict I would like to keep the option open for myself to look up pitch data myself too. BTW it doesn't look like this info is available on the About screen. Maybe it would be nice to have it there?

[3]

The source code for these libraries that AEDict depends on
sk.baka.autils:autils
sk.baka.tools:bakatools
is presently not available online, is it? I'm not saying it's wrong or that it has to be, just making sure. In fact I only wanted to see the source to find out where the pitch accent data comes from, because I couldn't find the word 'pitch' in the main source code repo :)

@atagunov atagunov changed the title from "Raw data from" to "Raw data from JMDict display?" Mar 28, 2017
@atagunov atagunov changed the title from "Raw data from JMDict display?" to "Raw data from JMDict?" Mar 28, 2017
@mvysny
Owner

mvysny commented Mar 28, 2017

First, please allow me to thank you for your support on Patreon, it is highly appreciated!

[1] - I'll come to that later on ;)
[2] - Sure, the pitch accent data was added in #634; that issue also has the link to the data files.
[3] - I apologize, but the sources located here only apply to the old old old Aedict 2. The codebase is so old I don't think it has any value. Aedict 3 is unfortunately not yet open-source software. This is because my main income comes from Google Play, and I'm worried that open-sourcing Aedict would shrink that income to zero.

@mvysny mvysny self-assigned this Mar 28, 2017
@mvysny
Owner

mvysny commented Mar 28, 2017

(1) that's quite an unusual request :-) I am somewhat hesitant to add raw XML to the dictionary files, since that would enlarge the data files by quite a lot. Currently I'm storing the info in a binary blob rather than in XML, since it takes up less space and is faster to parse. I'll experiment and let you know.
(2) thanks, I'll add the pitch data source to the About screen

@mvysny
Owner

mvysny commented Mar 28, 2017

[2] Fixed in Aedict 3.39.34

@mvysny
Owner

mvysny commented Mar 28, 2017

[1]: dictionary file without XML: 78 MB; with XML: 171 MB. I'm not sure, that's more than twice the original size... Maybe I can reconstruct the XML from the binary data? Not sure though, the entry XML structure may get quite complex... Let me check whether the frequency notes are transferred to the binary data.

@atagunov
Author

atagunov commented Mar 29, 2017

Okay, maybe [1] was a bad idea
Let me break [1] into [1A], [1B], [1C] then

[1A] bug report

http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1547720
shows 来る [ichi1,news1,nf16]; 來る [oK]

but Aedict v3.39.33 on Android shows both kanji as 'common'.
Probably some issue converting the XML to the dictionary file?

[1B] enhancement discussion

It might be nice if this info was retained in full: [ichi1,news1,nf16]. I've seen cases where one kanji had 3 or 4 of these markers and the other only 1, yet both are displayed as 'Common' in Aedict. It would feel somewhat reassuring to know how an entry was marked in raw JMDict. I realize that this is a much bigger change than [1A] and that perhaps this will end up being less compatible with other dictionaries.

[1C]

I seem to recollect seeing cases where there was more than one spelling and more than one reading, and one reading was marked as especially relevant for one of the spellings. But I can't find that example any more and I'm not sure it existed in the first place. Much less do I remember whether Aedict was able to present the information in the same detail as raw JMDict.

@mvysny
Owner

mvysny commented Mar 29, 2017

Thank you, I agree, let's split the issue.
(1a): thanks, I'll look at this issue.
(1b): luckily the binary format is future-proof and new stuff can be added without breaking backward compatibility. If the number of constants is reasonable then this should be easily doable. I am actually ignoring some of the stuff in JMdict on purpose, so as not to overwhelm the beginner user. Yet I agree that a power user needs to have access to all the gory details that JMdict provides :)
(1c): sure, just let me know once you find such a word, and please open a separate bug for that, so that the discussion is not fragmented.
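
To illustrate why a small, fixed set of constants is cheap to store, here is a minimal sketch that packs the JMdict priority markers into two bytes: one flag byte and one byte for the `nf` block number. Aedict's actual binary format is not public, so the layout and the names `PRI_FLAGS`, `encode_pri` and `decode_pri` are purely hypothetical.

```python
# Hypothetical layout: bit i of the first byte is set when PRI_FLAGS[i] is present;
# the second byte holds the nf block number (0 means "no nf marker").
PRI_FLAGS = ["news1", "news2", "ichi1", "ichi2", "spec1", "spec2", "gai1", "gai2"]


def encode_pri(tags):
    """Pack a list of priority tags such as ['ichi1', 'news1', 'nf16'] into 2 bytes."""
    flags = 0
    nf = 0
    for tag in tags:
        if tag.startswith("nf"):
            nf = int(tag[2:])
        else:
            flags |= 1 << PRI_FLAGS.index(tag)
    return bytes([flags, nf])


def decode_pri(blob):
    """Unpack the 2-byte form back into a list of tags, in PRI_FLAGS order."""
    flags, nf = blob[0], blob[1]
    tags = [name for bit, name in enumerate(PRI_FLAGS) if flags & (1 << bit)]
    if nf:
        tags.append(f"nf{nf:02d}")
    return tags


packed = encode_pri(["ichi1", "news1", "nf16"])
print(len(packed), decode_pri(packed))
```

The round trip keeps every marker, only the order changes to that of `PRI_FLAGS`, and the cost is two bytes per spelling or reading.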

@atagunov
Author

> I am actually ignoring some of the stuff in JMdict on purpose,
> so as not to overwhelm the beginner user. Yet I agree that a power
> user needs to have access to all the gory details that JMdict provides :)

A mechanism for this already exists: once you tap a marker like 'Common' in Aedict, a pop-up appears giving some detail. I imagine it is here that the [ichi1,news1,nf16] markers could appear. We already have separate 'Common' markers for the kanji and for the reading, which matches the JMdict format quite nicely.

@mvysny
Owner

mvysny commented Mar 30, 2017

[1a] quoting from the comments in JMDict.xml: (The entries with news1, ichi1, spec1 and gai1 values are marked with a "(P)" in the EDICT and EDICT2 files.). And the entry itself, from the same XML file:

<ent_seq>1547720</ent_seq>
<k_ele>
  <keb>来る</keb>
  <ke_pri>ichi1</ke_pri>
  <ke_pri>news1</ke_pri>
  <ke_pri>nf16</ke_pri>
</k_ele>
<k_ele>
  <keb>來る</keb>
  <ke_inf>&oK;</ke_inf>
</k_ele>
<r_ele>
  <reb>くる</reb>
  <re_pri>ichi1</re_pri>
  <re_pri>news1</re_pri>
  <re_pri>nf16</re_pri>
</r_ele>

So 来る is rightfully marked (P), i.e. common, yet 來る should not be marked common. A clear bug, I'll fix that. The reading is common, and maybe that is why Aedict incorrectly assumes that all kanji it connects to are common as well. Thank you for spotting that!
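
For readers who want to check an entry by hand, the rule can be sketched as follows: a spelling or reading is common only if its own `ke_pri`/`re_pri` tags include one of the "(P)" values. This is an illustration, not Aedict's converter. The `<entry>` wrapper, the inline declaration of the `&oK;` entity (with placeholder text) and the helper name `commonness` are assumptions added so the excerpt parses standalone.

```python
import xml.etree.ElementTree as ET

# Tags that earn a "(P)" in EDICT, per the JMdict comment quoted in this thread.
COMMON_TAGS = {"news1", "ichi1", "spec1", "gai1"}

# The excerpt from this thread. Assumption: the <entry> wrapper and the inline
# entity declaration are added here; the real file declares its entities itself.
ENTRY_XML = """<!DOCTYPE entry [
  <!ENTITY oK "word containing out-dated kanji">
]>
<entry>
<ent_seq>1547720</ent_seq>
<k_ele><keb>来る</keb><ke_pri>ichi1</ke_pri><ke_pri>news1</ke_pri><ke_pri>nf16</ke_pri></k_ele>
<k_ele><keb>來る</keb><ke_inf>&oK;</ke_inf></k_ele>
<r_ele><reb>くる</reb><re_pri>ichi1</re_pri><re_pri>news1</re_pri><re_pri>nf16</re_pri></r_ele>
</entry>"""


def commonness(entry, element_tag, text_tag, pri_tag):
    """Map each spelling (or reading) to whether its OWN priority tags make it common."""
    return {
        el.findtext(text_tag): any(p.text in COMMON_TAGS for p in el.findall(pri_tag))
        for el in entry.findall(element_tag)
    }


entry = ET.fromstring(ENTRY_XML)
kanji = commonness(entry, "k_ele", "keb", "ke_pri")
readings = commonness(entry, "r_ele", "reb", "re_pri")
print(kanji)
print(readings)
```

Because each `k_ele` is judged only by its own `ke_pri` children, 来る maps to `True` and 來る to `False`, regardless of how the shared reading くる is tagged.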

@mvysny
Owner

mvysny commented Mar 30, 2017

[1a] fixed:
[screenshot: device-2017-03-30-181627]

> once you tap your finger on a marker like 'Common' in Aedict a pop-up appears giving some detail. I imagine it is here that [ichi1,news1,nf16] markers can appear.

That's a brilliant idea, I'll implement it as such.

@mvysny
Owner

mvysny commented Mar 31, 2017

[1b]: actually, if the Pri markers lived only in the 'Common' pop-up, they wouldn't show when there is no Common marking. Therefore I have added them to the info dialog, see the screenshot. What do you think?
[screenshot: device-2017-03-31-200159]
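
As a rough guide to what those markers mean, the sketch below expands a Pri tag into a one-line description. The wording paraphrases the comments in the JMdict DTD from memory and should be checked against the current file; the function name `describe_pri` is hypothetical and this is not the text Aedict displays.

```python
import re

# Paraphrased from the JMdict DTD comments (assumption: wording may differ there).
PRI_SOURCES = {
    "news": "listed in the wordfreq file compiled from the Mainichi Shimbun",
    "ichi": "listed in the 'Ichimango goi bunruishuu' vocabulary list",
    "spec": "detected as common, though not included in the other lists",
    "gai": "common loanword, based on the wordfreq file",
}


def describe_pri(tag):
    """Return a short human-readable description of a ke_pri/re_pri value."""
    match = re.fullmatch(r"nf(\d{2})", tag)
    if match:
        # nfXX names the XX-th block of 500 words in the frequency ranking.
        block = int(match.group(1))
        first = (block - 1) * 500 + 1
        last = block * 500
        return f"{tag}: frequency rank {first}-{last}"
    match = re.fullmatch(r"(news|ichi|spec|gai)([12])", tag)
    if match:
        source, level = match.groups()
        return f"{tag}: {PRI_SOURCES[source]} (level {level})"
    return f"{tag}: unknown priority marker"


for tag in ["ichi1", "news1", "nf16"]:
    print(describe_pri(tag))
```

For the entry discussed in this issue, `nf16` places 来る in the block of words ranked 7501 to 8000.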

@mvysny
Owner

mvysny commented Apr 7, 2017

Fixed in Aedict 3.39.34; if you are still missing more raw data, please open a new feature request.

@mvysny mvysny closed this as completed Apr 7, 2017
@atagunov
Author

Thanks a lot for this improvement, Martin.
It took me 3 weeks, though, to find where that info is :)

@mvysny
Owner

mvysny commented Apr 24, 2017

> Thanks a lot for this improvement, Martin.
> It took me 3 weeks, though, to find where that info is :)

Sorry about that :-) For future reference, just press the blackish (i) button; it should show this frequency information.
