Fix #70091: Phar does not mark UTF-8 filenames in ZIP archives #6630

Closed
wants to merge 3 commits into from

cmb69
Member

@cmb69 cmb69 commented Jan 22, 2021

The default encoding of filenames in a ZIP archive is IBM Code Page
437. Phar, however, only supports UTF-8 filenames. Therefore we have
to mark non-ASCII filenames as being stored in UTF-8 by setting
general purpose bit 11 (the language encoding flag).
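
To make the description concrete, here is a minimal sketch of the conditional marking, assuming the flags live in a two-byte little-endian header field. It is not the code from this PR: `ZIP_FLAG_UTF8`, `get_le16`, `set_le16` and `mark_utf8_if_needed` are hypothetical names, and this `is_ascii` is only a plausible stand-in for the helper referenced in the diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit 11 of the ZIP general purpose bit flag: the language encoding flag. */
#define ZIP_FLAG_UTF8 (1u << 11)

/* Returns 1 if every byte of name is 7-bit ASCII, 0 otherwise. */
static int is_ascii(const char *name, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)name[i] > 0x7F) {
            return 0;
        }
    }
    return 1;
}

/* ZIP header fields are little-endian. These two helpers are hypothetical
 * and only play the role that macros such as PHAR_SET_16 play in zip.c. */
static uint16_t get_le16(const unsigned char field[2])
{
    return (uint16_t)(field[0] | (field[1] << 8));
}

static void set_le16(unsigned char field[2], uint16_t value)
{
    field[0] = (unsigned char)(value & 0xFF);
    field[1] = (unsigned char)(value >> 8);
}

/* Sets bit 11 in the flags field only if the name has non-ASCII bytes. */
static void mark_utf8_if_needed(unsigned char flags[2],
                                const char *name, size_t len)
{
    if (!is_ascii(name, len)) {
        set_le16(flags, (uint16_t)(get_le16(flags) | ZIP_FLAG_UTF8));
    }
}
```

In this sketch a name such as `plain.txt` leaves the field untouched, while a name containing the UTF-8 bytes `0xC3 0xA4` turns the field into `00 08` on disk (0x0800 in little-endian order).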

The effect of not setting this bit for non-ASCII filenames can be seen
in popular tools like 7-Zip and UnZip, but not when extracting the
archives via ext/phar (which is agnostic to the filename encoding), or
via ext/zip (which guesses the encoding). Thus we add a somewhat
brittle low-level test case.
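
As an illustration of what such a low-level check might look like, the following sketch inspects the raw bytes of a ZIP local file header. It is not the test added by this PR; `local_header_has_utf8_flag` is a hypothetical helper. It relies only on the ZIP layout: the local file header starts with the signature `PK\x03\x04`, has 30 fixed bytes, and stores the general purpose bit flag as a little-endian 16-bit field at offset 6.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit 11 of the ZIP general purpose bit flag: the language encoding flag. */
#define ZIP_FLAG_UTF8 (1u << 11)

/* Returns 1 if buf starts with a local file header whose bit 11 is set,
 * 0 if the bit is clear, and -1 if buf is too short or does not start
 * with the local file header signature. */
static int local_header_has_utf8_flag(const unsigned char *buf, size_t len)
{
    static const unsigned char signature[4] = { 'P', 'K', 0x03, 0x04 };

    /* The fixed part of a local file header is 30 bytes long. */
    if (len < 30 || memcmp(buf, signature, sizeof(signature)) != 0) {
        return -1;
    }

    /* General purpose bit flag: little-endian 16-bit field at offset 6. */
    uint16_t flags = (uint16_t)(buf[6] | (buf[7] << 8));
    return (flags & ZIP_FLAG_UTF8) ? 1 : 0;
}
```

A check of this kind depends on exact byte offsets, which is presumably what makes a low-level test brittle: any change to what precedes the header in the archive moves the bytes being inspected.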

@cmb69 cmb69 added the Bug label Jan 22, 2021
ext/phar/zip.c Outdated
@@ -878,6 +887,13 @@ static int phar_zip_changed_apply_int(phar_entry_info *entry, void *arg) /* {{{
memcpy(central.datestamp, local.datestamp, sizeof(local.datestamp));
PHAR_SET_16(central.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0));
PHAR_SET_16(local.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0));
if (!is_ascii(entry->filename, entry->filename_len)) {
Member

Just wondering, would setting the flag unconditionally be fine? ASCII and UTF-8 are identical when only ASCII characters are used.

Member Author

I don't see a real problem with doing this unconditionally; if a ZIP tool doesn't honor that flag, there still shouldn't be a difference for ASCII-only filenames. OTOH, setting the flag conditionally wouldn't cause any behavioral change for ASCII-only filenames.

Member Author

I pushed a commit which sets the flag unconditionally. I'm fine with either solution.

Member

Always setting the flag is less code, so if that works, let's go for it :)

@php-pulls php-pulls closed this in 6a0b889 Jan 26, 2021
@cmb69 cmb69 deleted the cmb/70091 branch July 13, 2021 13:48