Fix #70091: Phar does not mark UTF-8 filenames in ZIP archives #6630

cmb69 · 2021-01-22T11:18:39Z

The default encoding of filenames in a ZIP archive is IBM Code Page
437. Phar, however, only supports UTF-8 filenames. Therefore we have
to mark non-ASCII filenames as being stored in UTF-8 by setting bit 11
of the general purpose bit flag (the language encoding flag).
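
As a rough illustration of what setting that bit involves, here is a minimal sketch in C. The names `ZIP_FLAG_UTF8`, `put_le16`, `get_le16` and `mark_utf8` are hypothetical and are not taken from ext/phar; the sketch only assumes that the flag field is a 16-bit little-endian value, as it is in ZIP headers.

```c
#include <stdint.h>

/* Bit 11 of the general purpose bit flag: the language encoding flag.
 * Hypothetical name, not the identifier used in ext/phar. */
#define ZIP_FLAG_UTF8 (1u << 11) /* 0x0800 */

/* Store a 16-bit value in little-endian byte order. */
static void put_le16(uint8_t field[2], uint16_t value)
{
	field[0] = (uint8_t)(value & 0xFF);
	field[1] = (uint8_t)(value >> 8);
}

/* Load a 16-bit little-endian value. */
static uint16_t get_le16(const uint8_t field[2])
{
	return (uint16_t)(field[0] | (field[1] << 8));
}

/* Mark the filename as UTF-8 while preserving all other flag bits. */
static void mark_utf8(uint8_t flags_field[2])
{
	put_le16(flags_field, (uint16_t)(get_le16(flags_field) | ZIP_FLAG_UTF8));
}
```

The flag has to be set consistently in both the local file header and the central directory header of an entry, since tools may consult either one.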

The effect of not setting this bit for non-ASCII filenames can be seen
in popular tools like 7-Zip and UnZip, but not when extracting the
archives via ext/phar (which is agnostic to the filename encoding), or
via ext/zip (which guesses the encoding). Since neither extension can
be used to observe the flag, we add a somewhat brittle low-level test
case.
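
To give an idea of what such a low-level check looks at, the following C sketch inspects the first local file header of an archive. It is not the test added by this PR; the function name `local_header_has_utf8_flag` is hypothetical. It relies only on the ZIP layout, where the local file header starts with the signature `PK\x03\x04` and the general purpose bit flag sits at byte offset 6.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZIP_FLAG_UTF8 0x0800u /* general purpose bit 11 */

/* Returns 1 if the UTF-8 flag is set, 0 if it is clear, and -1 if buf
 * does not start with a local file header. Hypothetical helper. */
static int local_header_has_utf8_flag(const uint8_t *buf, size_t len)
{
	static const uint8_t sig[4] = { 'P', 'K', 0x03, 0x04 };

	if (len < 8 || memcmp(buf, sig, sizeof(sig)) != 0) {
		return -1;
	}
	/* The flag field is little-endian and occupies offsets 6 and 7. */
	uint16_t flags = (uint16_t)(buf[6] | (buf[7] << 8));
	return (flags & ZIP_FLAG_UTF8) != 0;
}
```

A check like this is brittle because it depends on fixed byte offsets in the generated archive rather than on any public API.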

nikic · 2021-01-25T09:52:51Z

ext/phar/zip.c

@@ -878,6 +887,13 @@ static int phar_zip_changed_apply_int(phar_entry_info *entry, void *arg) /* {{{
 	memcpy(central.datestamp, local.datestamp, sizeof(local.datestamp));
 	PHAR_SET_16(central.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0));
 	PHAR_SET_16(local.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0));
+	if (!is_ascii(entry->filename, entry->filename_len)) {


Just wondering, would unconditionally setting the flag be fine? ASCII and UTF-8 are the same when only ASCII characters are used.

cmb69 · 2021-01-25

I don't see a real problem with doing this unconditionally; if a ZIP tool doesn't honor that flag, there still shouldn't be a difference for ASCII-only filenames. OTOH, setting the flag conditionally wouldn't cause any behavioral change for ASCII-only filenames.

cmb69 · 2021-01-26

I pushed a commit which sets the flag unconditionally. I'm fine with either solution.

nikic · 2021-01-26

Always setting the flag is less code, so if that works, let's go for it :)
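
For comparison, the two variants discussed above can be sketched as follows. The body of `is_ascii` is not shown in the hunk, so the version here is an assumption about what such a helper might look like; `flags_conditional`, `flags_unconditional` and `ZIP_FLAG_UTF8` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define ZIP_FLAG_UTF8 0x0800u /* general purpose bit 11 */

/* Assumed implementation: a name is ASCII if no byte has the high bit set. */
static int is_ascii(const char *s, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if ((unsigned char)s[i] & 0x80) {
			return 0;
		}
	}
	return 1;
}

/* Variant 1: only flag names that contain non-ASCII bytes. */
static uint16_t flags_conditional(uint16_t flags, const char *name, size_t len)
{
	return is_ascii(name, len) ? flags : (uint16_t)(flags | ZIP_FLAG_UTF8);
}

/* Variant 2: always flag the name. An ASCII-only name has the same bytes
 * in UTF-8, so the flag is still correct for it. */
static uint16_t flags_unconditional(uint16_t flags)
{
	return (uint16_t)(flags | ZIP_FLAG_UTF8);
}
```

The two variants differ only for ASCII-only names, where the second one sets a flag that is correct but not strictly needed; it avoids the scan over the filename and the extra helper.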

cmb69 added the Bug label Jan 22, 2021

nikic reviewed Jan 25, 2021

cmb69 added 2 commits January 26, 2021 16:15

Unconditionally set UTF-8 flag

0809a9c

Fix test case

05bfab4

nikic approved these changes Jan 26, 2021

php-pulls closed this in 6a0b889 Jan 26, 2021

cmb69 deleted the cmb/70091 branch July 13, 2021 13:48
