Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read bcp47 language of zim #400

Closed
wants to merge 6 commits into from
Closed

Conversation

Jaifroid
Copy link
Member

As per discussion in #166 , this PR, which relies on @mossroy's unmerged PR #397, attempts to extract the BCP 47 language code that is embedded in the ZIM's meta Name attribute.

Note that I found a ZIM that does not have the meta Name attribute, which is the anomalous stackoverflow. com_eng_all ZIM. Maybe related to this is the fact that this ZIM is the only one I've found that has an ISO-639-3 code embedded in its title. I've implemented an "exceptions" section to cover this case, and also a backup for the file name where the Name attribute is not available for whatever reason. This is somewhat dirty because of course the user could have renamed the ZIM file. But as most ZIMs have the name with the correct language code in the meta Name attribute, this is really a workaround in one exceptional case, from what I can tell.

This PR relies on all ZIMs being provided by Kiwix, and using the standard Kiwix naming convention. If the ZIM does not use this convention, it is likely to return either a blank language code or, possibly (but somewhat unlikely) a garbage code. I don't think there's anything we can do about this.

The only other way to get the BCP 47 code would be to implement a conversion process using very large lookup tables. To cover every possible language, these tables would be enormous and would add significantly to the weight of the app. Since it seems that mwoffliner already does a conversion in order to produce the ZIM's filename, it would IMHO be a waste of effort for us to do the conversion again in-app.

Sample usage of the provided function in, say, app.js:

        uiUtil.readBCP47LanguageCode(selectedArchive, function(code) {
            console.log('Language of ZIM is: ' + code);
        });

@kasemmarifet please note that using this function does a costly lookup of the Name attribute using a file read. Therefore it should not be use on every article load, as it will impact performance significantly. It should only be used when the user selects the "Read" button and a voice has not already been explicitly chosen by the user.

So that to populate ZIMArchive.language
Fixes #395
So that the callbackReady is called after the language has been read.
@Jaifroid Jaifroid added this to the v2.4 milestone Jul 10, 2018
@Jaifroid Jaifroid self-assigned this Jul 10, 2018
@Jaifroid Jaifroid requested a review from mossroy July 10, 2018 16:39
@mossroy
Copy link
Contributor

mossroy commented Jul 12, 2018

The code looks OK for me, but it seems to me that this function should be a prototype of zimArchive.js, instead of in uiUtil.js.
In any case, there are two possibilities :

@kelson42 : maybe it's not the first time Kiwix tackles these language codes issues, do you have any recommendations?

@Jaifroid
Copy link
Member Author

Jaifroid commented Jul 12, 2018

Just to be clear, it's not in the filename, but in the meta Name attribute that we can find the two-letter code we need. I asked @kelson42 whether this code follows any specific standard, but I don't think we have any clear info on what standard the naming convention follows. It seems to use the correct one.

To give an idea of the kind of lookup table we'd need to implement, look at the list of codes here:

https://github.com/kiwix/kiwix-js-windows/blob/master/www/js/lib/kiwixServe.js#L22

I don't want to implement a translation between this number of codes if we actually already have access to the correct code, which it seems that we do! Note that it's not a simple case of truncating three letters to two, as the cases of "spa" and "es" show.

Very happy for this to move to zimarchive.js.

@mossroy
Copy link
Contributor

mossroy commented Oct 21, 2018

I close this PR.
As explained by @kelson42 in #395, we should not rely on the two-letter language code found in the "Name" metadata attribute, but instead use the "Language" metadata string, and transcode it if necessary.

@mossroy mossroy closed this Oct 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants