Error extracting add-on when files inside the add-on hahve non-ASCII characters #2505

Closed
nvaccessAuto opened this Issue Jul 2, 2012 · 23 comments

1 participant

@nvaccessAuto

Reported by ragb on 2012-07-02 13:38
This is more of python 2 and unicode terror stories...

When we extract the contects of an Add-on, the zipfile module can't handle file names with non-ASCII characteres. In fact, the zip format doesn't define any specific encoding for files.

The only solution is to extract manually the zipfile contents, decoding paths

I attach here the patch with the changes. Shall I commit it?
I use cp437 (dos-latine) to decode paths, because it seems what winzip and windows assume, according to the python docs.

@nvaccessAuto

Attachment addons-accented-chars-patch.patch added by ragb on 2012-07-02 13:38
Description:

@nvaccessAuto

Comment 1 by jteh on 2012-07-10 01:03
The situation is a little more complicated than this.

Recent versions of the zip specification allow for file names to be encoded in UTF-8 by setting the EFS flag bit (bit 11, 0x800). 7-Zip at least does support this (see below), so we should too.

As you noted, historically, the zip spec specified CP437 as the encoding for file names. This won't work for non-Latin languages. Interestingly, this post suggests that other common zip applications, even pkzip, don't honour this, instead just using the local code page. By default, 7-Zip uses the local code page where possible, using "UTF-8 encoding only for file names that contain symbols unsupported by local code page". You can also configure 7-Zip to always use UTF-8 for non-ASCII file names (by setting -mcu=on). 7-Zip does set the EFS flag bit when UTF-8 is used. Based on this, I think we should use mbcs when EFS is not set.

So, when extracting, if entry.flag_bits & 0x800, we should use UTF-8. Otherwise, we should use mbcs. We should also set 0x800 in flag_bits when creating archives.

Btw, minor nit: rather than checking os.path.exists(dir), just try except OSError around os.makedirs, as this is technically more efficient.

@nvaccessAuto

Comment 2 by ragb (in reply to comment 1) on 2012-07-10 17:06
Replying to jteh:

The situation is a little more complicated than this.

Recent versions of the zip specification allow for file names to be encoded in UTF-8 by setting the EFS flag bit (bit 11, 0x800). 7-Zip at least does support this (see below), so we should too.

Agreed.

As you noted, historically, the zip spec specified CP437 as the encoding for file names. This won't work for non-Latin languages. Interestingly, this post suggests that other common zip applications, even pkzip, don't honour this, instead just using the local code page. By default, 7-Zip uses the local code page where possible, using "UTF-8 encoding only for file names that contain symbols unsupported by local code page". You can also configure 7-Zip to always use UTF-8 for non-ASCII file names (by setting -mcu=on). 7-Zip does set the EFS flag bit when UTF-8 is used. Based on this, I think we should use mbcs when EFS is not set.

The problem is that other zip applications, such as windows explorer (or whatever zip thing it uses), seems to use cp-437. I tried using mbcs for decoding and it doesn't give the proper file name, seems to ignore the acented character. No error at all. I believe many people would use windows itself to zip the add-on contents so... At least we should inform them about this.

So, when extracting, if entry.flag_bits & 0x800, we should use UTF-8. Otherwise, we should use mbcs. We should also set 0x800 in flag_bits when creating archives.

Patch with this and testing add-on (created on windows explorer) attached..

@nvaccessAuto

Attachment addons-accented-chars-patch2.patch added by ragb on 2012-07-10 17:07
Description:
New patch using MBCS and utf-8 if EFS is set.

@nvaccessAuto

Attachment test.nvda-addon added by ragb on 2012-07-10 17:08
Description:
Test add-on with a file with a non-ascii character created on windows.

@nvaccessAuto

Comment 3 by jteh (in reply to comment 2) on 2012-07-10 17:58
Replying to ragb:

The problem is that other zip applications, such as windows explorer (or whatever zip thing it uses), seems to use cp-437. I tried using mbcs for decoding and it doesn't give the proper file name, seems to ignore the acented character. No error at all.

The story gets even more confusing. There are two groups of code pages in Windows: ANSI and OEM. Most stuff (including mbcs) uses ANSI code pages. However, CP437 is an OEM code page. CP437 is missing some characters needed even by Portuguese and other Western European languages; e.g. "Ã". Further investigation reveals that Explorer seems to use the local OEM code page. On my system, this is CP850, which does support "Ã". So, instead of mbcs (which uses the local ANSI code page), we need something which uses the local OEM code page.

@nvaccessAuto

Comment 4 by jteh on 2012-07-10 18:21
Right. You can get the OEM codec like this:

str(winKernel.kernel32.GetOEMCP())

Just use that instead of mbcs.

This bug is ridiculous. It's taken me hours of research and debugging to figure all of this out. I wish everyone would just use Unicode/UTF-8/UTF-16. :)

@nvaccessAuto

Comment 5 by ragb (in reply to comment 4) on 2012-07-11 11:51
Replying to jteh:

Right. You can get the OEM codec like this:

str(winKernel.kernel32.GetOEMCP())

Just use that instead of mbcs.

ARRGGG! I was reading msdn and found it and it worked (I believe) then came here... and you found it too. What a double wast of time...

One thing I'm not sure: If, for instance, I create an add-on in a different code page then you, will it give any problems when the add-on tries to open some file with accented chars? In principal, I believe not. If the application uses mbcs to decode the paths, it will have a correspondence to users file system (whatever code page they are using), but I'm not really sure if this applies everytime.

I think we chould mention this on the developer guide and tell people to please utf-8 capable zip applications when generating add-ons, specially when having non-ascii chars on file names.

This bug is ridiculous. It's taken me hours of research and debugging to figure all of this out. I wish everyone would just use Unicode/UTF-8/UTF-16. :)

Really... I believe most bugs I introduced on NVDA or on the vocalizer driver have to do with unicode, MBCS, and such.

Rui Batista

@nvaccessAuto

Attachment addons-accented-chars-patch3.patch added by ragb on 2012-07-11 14:09
Description:
Revised-revised patch to use oem code page when EFS flag is not set on zip entries.

@nvaccessAuto

Comment 6 by ragb on 2012-07-11 14:11
Added revised patch with OEM code page instead of MBCS. This is really annoying :).

@nvaccessAuto

Comment 7 by jteh (in reply to comment 5) on 2012-07-11 21:49
Replying to ragb:

One thing I'm not sure: If, for instance, I create an add-on in a different code page then you, will it give any problems when the add-on tries to open some file with accented chars?

If the add-on opens the file using a Unicode path (which is recommended), it will probably fail, since it won't decode the way it does on their system. This applies to app modules as well. Even if it opens it with an ANSI path, the zip uses an OEM code page (and ANSI and OEM code pages are different), so it will probably still fail. I'm guessing the biggest reason for non-ASCII paths in add-ons is for app modules for apps in another language, so we can hope that most users will only install the add-on if their system is configured to use that language.

I think we chould mention this on the developer guide and tell people to please utf-8 capable zip applications when generating add-ons, specially when having non-ascii chars on file names.

It's worth noting that even 7-Zip doesn't use UTF-8 by default unless the local OEM code page can't encode the characters. However, you can force it with -mcu=on. That said, I agree we should document this guideline.

Really... I believe most bugs I introduced on NVDA or on the vocalizer driver have to do with unicode, MBCS, and such.

If we can ever move to Python 3, this should get a lot simpler for pure Python stuff, since text strings are all Unicode. It'll still be painful for ctypes stuff though.

@nvaccessAuto

Comment 8 by jteh on 2012-07-11 21:53
Last version of the patch looks great, despite the crappy situation. Please commit. However, please leave this open until we update the Developer Guide.

@nvaccessAuto

Comment 9 by jteh on 2012-07-25 07:54
@ragb: Ping! Do you want to commit this or are there pending issues? I'm happy to do it myself if you're busy. :)

@nvaccessAuto

Comment 10 by ragb (in reply to comment 9) on 2012-07-25 09:53
Replying to jteh:

@ragb: Ping! Do you want to commit this or are there pending issues? I'm happy to do it myself if you're busy. :)

For some reason, I missed notifications of your previous two comments... I'll commit it shortly. Sorry for the delay.

@nvaccessAuto

Comment 11 by jteh on 2012-07-26 00:27
Just a reminder to add What's New entries for user visible changes like this. I've done this in 0b473cd.

@nvaccessAuto

Comment 12 by jteh on 2012-07-26 01:46
98c981c adds info about UTF_8 zip archives to the Developer Guide.
Changes:
State: closed

@nvaccessAuto

Comment 13 by tspivey on 2012-07-27 02:41
Looking at the latest commited patch, this has some problems, one of which completely breaks add-on installs.

It doesn't handle directory entries. e.g. "appModules/".

It doesn't handle / at the beginning of the filename. The extract method handles both of these.

My solution is attached. It's probably not strictly good practice to modify info.filename, and we might want to not even try extracting if strings like "../" or ":" occur anywhere in the list of filenames, but that last is probably for another ticket and patch.

Changes:
State: reopened

@nvaccessAuto

Attachment extract.patch added by tspivey on 2012-07-27 02:42
Description:

@nvaccessAuto

Comment 14 by jteh (in reply to comment 13) on 2012-07-27 03:42
Replying to tspivey:

It doesn't handle directory entries. e.g. "appModules/".

I assume this only matters for empty directories, since makedirs will create them otherwise?

It doesn't handle / at the beginning of the filename.

I'm guessing this is the more severe issue, though I've never actually seen an archiver do this.

My solution is attached. It's probably not strictly good practice to modify info.filename,

True, though realistically, nor is reimplementing the extraction logic. :) Any reason for removing all of the comments and the UTF-8 encoding check?

@nvaccessAuto

Comment 15 by tspivey (in reply to comment 14) on 2012-07-27 03:59
Replying to jteh:

Replying to tspivey:

It doesn't handle directory entries. e.g. "appModules/".

I assume this only matters for empty directories, since makedirs will create them otherwise?

I think so. The zip will have a name like "appModules/", then "appModules/file". You can't write to, say, userconfig/addons/test/appModules/ as an extracted file, you'll get a permission denied error.

It doesn't handle / at the beginning of the filename.

I'm guessing this is the more severe issue, though I've never actually seen an archiver do this.

My solution is attached. It's probably not strictly good practice to modify info.filename,

True, though realistically, nor is reimplementing the extraction logic. :) Any reason for removing all of the comments and the UTF-8 encoding check?

I didn't think the comments were needed because I simplified most of the function, and the UTF-8 check can disappear because zipfile handles it internally by checking the flags, setting filename as a unicode instance if it's UTF-8. See _decodeFilename of zipfile.ZipInfo.

@nvaccessAuto

Comment 16 by jteh (in reply to comment 15) on 2012-07-27 05:59
Replying to tspivey:

I think so. The zip will have a name like "appModules/", then "appModules/file". You can't write to, say, userconfig/addons/test/appModules/ as an extracted file, you'll get a permission denied error.

Oh. I get it. So that's why it completely breaks things. :)

I didn't think the comments were needed because I simplified most of the function, and the UTF-8 check can disappear because zipfile handles it internally by checking the flags, setting filename as a unicode instance if it's UTF-8. See _decodeFilename of zipfile.ZipInfo.

Err... so it does. I could swear i tested this, but apparently I didn't.

@nvaccessAuto

Comment 17 by jteh on 2012-07-27 06:34
Fixed in 5f4a702. @tspivey, thanks for the catch and your help.
Changes:
State: closed

@nvaccessAuto

Comment 18 by ragb (in reply to comment 16) on 2012-07-27 12:10
Replying to jteh:

Replying to tspivey:

I think so. The zip will have a name like "appModules/", then "appModules/file". You can't write to, say, userconfig/addons/test/appModules/ as an extracted file, you'll get a permission denied error.

Oh. I get it. So that's why it completely breaks things. :)

OMG!

I didn't think the comments were needed because I simplified most of the function, and the UTF-8 check can disappear because zipfile handles it internally by checking the flags, setting filename as a unicode instance if it's UTF-8. See _decodeFilename of zipfile.ZipInfo.

Err... so it does. I could swear i tested this, but apparently I didn't.

+1. I installed a bunch of add-ons with the previous patch applied, even test ones. Interesting, Python documentation doesn't care about any of this. Even if for listing archive contents one can have many issues, just trying to print file names somewhere...

Thanks for the corrections though.

@nvaccessAuto nvaccessAuto added this to the 2012.3 milestone Nov 10, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment