Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When iterating over Zipfiles, always use the Unix file separator to fix a Windows issue #638

Merged

Conversation

yorinasub17
Copy link
Contributor

Hello!

I know pex doesn't officially support windows, but there seems to be a regression in pex 1.6.0 with windows with respect to the way the vendoring works. Specifically, the regex for iterating the zip path in the vendoring uses the os separator, which has two problems on windows (where the separator is \):

  1. \ is an escape sequence, so unless the separator is added twice in the string, it uses it to escape the character immediately after the separator, causing the regex compilation to fail.
  2. The path in the zip file uses unix separators (/), so using the os separator is actually wrong.

This PR addresses this by updating the regex to replace windows separators (\) in the prefix var with the unix separator.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yorinasub17, thank you for submitting this patch!

I believe this PR can be simplified. Rather than ever using os.sep, we should be able to always use / as this is a zip file, which as you state is how zip files work everywhere, including in Windows.

To land this PR, could you please do two things:

  1. Research and confirm that this assumption about zip files always using / is indeed 100% true. It would be great to find the spec claiming this and to link to this in the PR description. Because we don't have any CI testing for Windows, this would be a good way to verify correctness.
  2. Simplify the code to restore the original change and simply swap os.sep with /, along with a brief comment explaining how we can do this thanks to zip files' standard (maybe include the link from prior step).

Thank you!

@yorinasub17
Copy link
Contributor Author

@Eric-Arellano Thanks for the PR review! I am currently bogged down with my other work, but will take a look at this when I free up later in the month.

@yorinasub17
Copy link
Contributor Author

Research and confirm that this assumption about zip files always using / is indeed 100% true. It would be great to find the spec claiming this and to link to this in the PR description. Because we don't have any CI testing for Windows, this would be a good way to verify correctness.

It appears that the "official spec" of zip files is the one published by PKWARE (See https://support.pkware.com/display/PKZIP/APPNOTE, the reference to that in wikipedia, and the reference to the appnote file from winzip). From there, I checked the latest spec and confirmed that all paths must use forward slashes (section 4.4.17):

All slashes MUST be forward slashes '/' as opposed to backwards slashes '' for compatibility with Amiga and UNIX file systems etc.

Based on the above, I'll go through the process of simplifying the code.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for the research to confirm this! That's very helpful.

Couple suggestions and then this should be ready to merge 🎉

pex/third_party/__init__.py Outdated Show resolved Hide resolved
pex/third_party/__init__.py Outdated Show resolved Hide resolved
@Eric-Arellano
Copy link
Contributor

It also looks like CI is failing due to the style check. After making the above changes, run tox -v -e style to test if style will be pass.

@yorinasub17
Copy link
Contributor Author

@Eric-Arellano I believe I addressed all the comments, but please don't merge this yet. I haven't had a chance to test on my windows box yet and I have a suspicion the updated version won't work because I haven't touched the prefix var, which might contain os.sep.

@yorinasub17
Copy link
Contributor Author

Sorry about the style error. I thought tox succeeded, but I was fooled because it actually failed on the find on my mac and I wasn't paying close attention to the logs:

using tox.ini: /Users/yoriy/packages/yoripex/tox.ini
using tox-3.4.0 from /Users/yoriy/.pyenv/versions/2.7.15/lib/python2.7/site-packages/tox/__init__.pyc
GLOB sdist-make: /Users/yoriy/packages/yoripex/setup.py
  /Users/yoriy/packages/yoripex$ /Users/yoriy/.pyenv/versions/2.7.15/bin/python2.7 /Users/yoriy/packages/yoripex/setup.py sdist --formats=zip --dist-dir /Users/yoriy/packages/yoripex/.tox/dist >/Users/yoriy/packages/yoripex/.tox/log/tox-0.log
style reusing: /Users/yoriy/packages/yoripex/.tox/style
style inst-nodeps: /Users/yoriy/packages/yoripex/.tox/dist/pex-1.6.0.zip
  /Users/yoriy/packages/yoripex$ /Users/yoriy/packages/yoripex/.tox/style/bin/python -m pip install --no-deps -U /Users/yoriy/packages/yoripex/.tox/dist/pex-1.6.0.zip >/Users/yoriy/packages/yoripex/.tox/style/log/style-12.log
  /Users/yoriy/packages/yoripex$ /Users/yoriy/packages/yoripex/.tox/style/bin/python -m pip freeze >/Users/yoriy/packages/yoripex/.tox/style/log/style-13.log
style installed: DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.,gitdb==0.6.4,GitPython==0.3.2rc1,pep8==1.4.5,pex==1.6.0,pyflakes==0.7.2,smmap==0.9.0,twitter.checkstyle==0.1.0,twitter.common.app==0.3.0,twitter.common.collections==0.3.0,twitter.common.contextutil==0.3.0,twitter.common.dirutil==0.3.0,twitter.common.lang==0.3.0,twitter.common.log==0.3.0,twitter.common.options==0.3.0,twitter.common.process==0.3.0,twitter.common.string==0.3.0,twitter.common.util==0.3.0
style run-test-pre: PYTHONHASHSEED='1115460778'
style runtests: commands[0] | bash scripts/style.sh
  /Users/yoriy/packages/yoripex$ /bin/bash scripts/style.sh
find: ,: unknown primary or operator
__________________________________________________________________________ summary ___________________________________________________________________________
  style: commands succeeded
  congratulations :)

@yorinasub17
Copy link
Contributor Author

yorinasub17 commented Apr 10, 2019

@Eric-Arellano Thanks for your patience with me through this review.

Ok after testing on my windows box, it turns out the updated version doesn't work for two reasons:

  • As suspected, I needed to fix the prefix as well.
  • I also had forgotten that the relpath that is passed into the function is constructed with os.sep. This needs to be the case because the _DirIterator expects the path to contain the sep.

I attempted to address both of these, but it sort of reverts back to what I had originally and I am not sure if the current way is actually simpler. What do you think? Any ideas to improve this?

EDIT: Note that I confirmed the latest version at commit 3241d07 works on my windows box.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to iterate on this!

Indeed this is now closer to the original PR, but with the requested changes it should still be simpler than checking if we're using Windows, plus the comment you are adding is quite useful.

pex/third_party/__init__.py Outdated Show resolved Hide resolved
pex/third_party/__init__.py Outdated Show resolved Hide resolved
if zipfile.is_zipfile(path):
return cls(zipfile_path=path, prefix=prefix + os.sep if prefix else '')
prefix = os.path.join(prefix, os.path.basename(path))
return cls(zipfile_path=path, prefix='{}/'.format(prefix) if prefix else '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any guarantee prefix will never already have os.sep in it? That's an earnest question - I couldn't figure it out from a very quick search for _ZipIterator.

Either way, for consistency and safety, might be better to change this line to this idiom:

return cls(zipfile_path=path, prefix='' if not prefix else '{}{}'.format(prefix, os.sep).replace(os.sep, '/')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any guarantee prefix will never already have os.sep in it?

I think this can be guaranteed because prefix starts off as '' outside the while loop, and only contains parts constructed with os.path.basename. os.path.basename takes the last path element, so it is equivalent to arg.split(os.sep)[-1], which will mean the result won't contain the os.sep character.

That said, there is something that doesn't make sense to me about this while loop. This threw me off at first, but I believe the while loop is repeatedly walking up root directory by directory until it hits a zip file. E.g ignoring prefix for a minute, the loop boils down to:

while path:
  if zipfile.is_zipfile(path):
    return
  # if path = foo/bar/baz, then os.path.dirname(path) = foo/bar
  path = os.path.dirname(path)

Given that, I think the prefix is actually backwards. If we go into the loop with foo.zip/bar/baz, assuming foo.zip is the zip file, prefix will become baz/bar, because after the first iteration, prefix will be basename of the path (which is baz), and the second iteration joins that with basename of the path again (the original was prefix = os.path.join(prefix, os.path.basename(path)), so os.path.join('baz', os.path.basename('foo.zip/bar')) = 'baz/bar'). Am I missing something with my logic here?

This must be working because the tests pass, but I wonder if this is actually an uncaught bug? I haven't had a chance to walk through all the test cases to know if there is any test case for this, or if the caller of _ZipIterator somehow guarantees this loop only goes one iteration deep.

Also this loop is an infinite loop if root is ever an absolute path that does not contain a zip file, because os.path.dirname('/') == '/', so it will never be falsy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right about the bug - and it skates by because items are exactly one level deep. The prefix = os.path.join(prefix, os.path.basename(path)) line should be prefix = os.path.join(os.path.basename(path), prefix).

As far as the infinite loop goes, the one call of containing ensures root is not a dir (but not that it's not a file). It probably makes sense to either guard the infinite loop case or else assert early that root is either a zipfile or else not a file and not a directory (foo.zip/bar/baz is not either of these).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great investigation @yorinasub17 and thanks @jsirois for confirming.

I think it'd be best to save this bug fix for a separate PR, as this one is focused on Windows support. If you're up for it, that'd be great for you to open this PR. No worries if you're busy - we can do it otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the confirmation. I made the fix for the prefix here, but have not addressed the infinite loop.

I won't be able to address it today though. I'll check back again when I have time and if it still hasn't been fixed yet, I'll open a PR.

pex/third_party/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your continued work to land this! Merging now.

@Eric-Arellano Eric-Arellano changed the title Change windows separator with unix separator in regex When iterating over Zipfiles, always use the Unix file separator to fix a Windows issue Apr 20, 2019
pex/third_party/__init__.py Outdated Show resolved Hide resolved
pex/third_party/__init__.py Outdated Show resolved Hide resolved
pex/third_party/__init__.py Outdated Show resolved Hide resolved
@yorinasub17
Copy link
Contributor Author

Thanks for your continued support on this @jsirois and @Eric-Arellano.

I made the changes requested and tested it out on my windows box and confirmed it still works.

@jsirois jsirois merged commit 28869f6 into pex-tool:master Apr 20, 2019
@jsirois jsirois mentioned this pull request Apr 21, 2019
11 tasks
@yorinasub17 yorinasub17 mentioned this pull request May 9, 2019
@yorinasub17 yorinasub17 deleted the yori-vendor-regex-supports-windows branch May 11, 2019 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants