-
-
Notifications
You must be signed in to change notification settings - Fork 632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error extracting add-on when files inside the add-on hahve non-ASCII characters #2505
Comments
Attachment addons-accented-chars-patch.patch added by ragb on 2012-07-02 13:38 |
Comment 1 by jteh on 2012-07-10 01:03 Recent versions of the zip specification allow for file names to be encoded in UTF-8 by setting the EFS flag bit (bit 11, 0x800). 7-Zip at least does support this (see below), so we should too. As you noted, historically, the zip spec specified CP437 as the encoding for file names. This won't work for non-Latin languages. Interestingly, this post suggests that other common zip applications, even pkzip, don't honour this, instead just using the local code page. By default, 7-Zip uses the local code page where possible, using "UTF-8 encoding only for file names that contain symbols unsupported by local code page". You can also configure 7-Zip to always use UTF-8 for non-ASCII file names (by setting -mcu=on). 7-Zip does set the EFS flag bit when UTF-8 is used. Based on this, I think we should use mbcs when EFS is not set. So, when extracting, if entry.flag_bits & 0x800, we should use UTF-8. Otherwise, we should use mbcs. We should also set 0x800 in flag_bits when creating archives. Btw, minor nit: rather than checking os.path.exists(dir), just try except OSError around os.makedirs, as this is technically more efficient. |
Comment 2 by ragb (in reply to comment 1) on 2012-07-10 17:06
Agreed.
The problem is that other zip applications, such as windows explorer (or whatever zip thing it uses), seems to use cp-437. I tried using mbcs for decoding and it doesn't give the proper file name, seems to ignore the acented character. No error at all. I believe many people would use windows itself to zip the add-on contents so... At least we should inform them about this.
Patch with this and testing add-on (created on windows explorer) attached.. |
Attachment addons-accented-chars-patch2.patch added by ragb on 2012-07-10 17:07 |
Attachment test.nvda-addon added by ragb on 2012-07-10 17:08 |
Comment 3 by jteh (in reply to comment 2) on 2012-07-10 17:58
The story gets even more confusing. There are two groups of code pages in Windows: ANSI and OEM. Most stuff (including mbcs) uses ANSI code pages. However, CP437 is an OEM code page. CP437 is missing some characters needed even by Portuguese and other Western European languages; e.g. "Ã". Further investigation reveals that Explorer seems to use the local OEM code page. On my system, this is CP850, which does support "Ã". So, instead of mbcs (which uses the local ANSI code page), we need something which uses the local OEM code page. |
Comment 4 by jteh on 2012-07-10 18:21
Just use that instead of mbcs. This bug is ridiculous. It's taken me hours of research and debugging to figure all of this out. I wish everyone would just use Unicode/UTF-8/UTF-16. :) |
Comment 5 by ragb (in reply to comment 4) on 2012-07-11 11:51
ARRGGG! I was reading msdn and found it and it worked (I believe) then came here... and you found it too. What a double wast of time... One thing I'm not sure: If, for instance, I create an add-on in a different code page then you, will it give any problems when the add-on tries to open some file with accented chars? In principal, I believe not. If the application uses mbcs to decode the paths, it will have a correspondence to users file system (whatever code page they are using), but I'm not really sure if this applies everytime. I think we chould mention this on the developer guide and tell people to please utf-8 capable zip applications when generating add-ons, specially when having non-ascii chars on file names.
Really... I believe most bugs I introduced on NVDA or on the vocalizer driver have to do with unicode, MBCS, and such. Rui Batista |
Attachment addons-accented-chars-patch3.patch added by ragb on 2012-07-11 14:09 |
Comment 6 by ragb on 2012-07-11 14:11 |
Comment 7 by jteh (in reply to comment 5) on 2012-07-11 21:49
If the add-on opens the file using a Unicode path (which is recommended), it will probably fail, since it won't decode the way it does on their system. This applies to app modules as well. Even if it opens it with an ANSI path, the zip uses an OEM code page (and ANSI and OEM code pages are different), so it will probably still fail. I'm guessing the biggest reason for non-ASCII paths in add-ons is for app modules for apps in another language, so we can hope that most users will only install the add-on if their system is configured to use that language.
It's worth noting that even 7-Zip doesn't use UTF-8 by default unless the local OEM code page can't encode the characters. However, you can force it with -mcu=on. That said, I agree we should document this guideline.
If we can ever move to Python 3, this should get a lot simpler for pure Python stuff, since text strings are all Unicode. It'll still be painful for ctypes stuff though. |
Comment 8 by jteh on 2012-07-11 21:53 |
Comment 9 by jteh on 2012-07-25 07:54 |
Comment 10 by ragb (in reply to comment 9) on 2012-07-25 09:53
For some reason, I missed notifications of your previous two comments... I'll commit it shortly. Sorry for the delay. |
Comment 11 by jteh on 2012-07-26 00:27 |
Comment 12 by jteh on 2012-07-26 01:46 |
Comment 13 by tspivey on 2012-07-27 02:41 It doesn't handle directory entries. e.g. "appModules/". It doesn't handle / at the beginning of the filename. The extract method handles both of these. My solution is attached. It's probably not strictly good practice to modify info.filename, and we might want to not even try extracting if strings like "../" or ":" occur anywhere in the list of filenames, but that last is probably for another ticket and patch. Changes: |
Attachment extract.patch added by tspivey on 2012-07-27 02:42 |
Comment 14 by jteh (in reply to comment 13) on 2012-07-27 03:42
I assume this only matters for empty directories, since makedirs will create them otherwise?
I'm guessing this is the more severe issue, though I've never actually seen an archiver do this.
True, though realistically, nor is reimplementing the extraction logic. :) Any reason for removing all of the comments and the UTF-8 encoding check? |
Comment 15 by tspivey (in reply to comment 14) on 2012-07-27 03:59
I think so. The zip will have a name like "appModules/", then "appModules/file". You can't write to, say, userconfig/addons/test/appModules/ as an extracted file, you'll get a permission denied error.
I didn't think the comments were needed because I simplified most of the function, and the UTF-8 check can disappear because zipfile handles it internally by checking the flags, setting filename as a unicode instance if it's UTF-8. See _decodeFilename of zipfile.ZipInfo. |
Comment 16 by jteh (in reply to comment 15) on 2012-07-27 05:59
Oh. I get it. So that's why it completely breaks things. :)
Err... so it does. I could swear i tested this, but apparently I didn't. |
Comment 18 by ragb (in reply to comment 16) on 2012-07-27 12:10
OMG!
+1. I installed a bunch of add-ons with the previous patch applied, even test ones. Interesting, Python documentation doesn't care about any of this. Even if for listing archive contents one can have many issues, just trying to print file names somewhere... Thanks for the corrections though. |
Reported by ragb on 2012-07-02 13:38
This is more of python 2 and unicode terror stories...
When we extract the contects of an Add-on, the zipfile module can't handle file names with non-ASCII characteres. In fact, the zip format doesn't define any specific encoding for files.
The only solution is to extract manually the zipfile contents, decoding paths
I attach here the patch with the changes. Shall I commit it?
I use cp437 (dos-latine) to decode paths, because it seems what winzip and windows assume, according to the python docs.
The text was updated successfully, but these errors were encountered: