Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When iterating over Zipfiles, always use the Unix file separator to fix a Windows issue #638

Merged

Conversation

@yorinasub17
Copy link
Contributor

yorinasub17 commented Dec 21, 2018

Hello!

I know pex doesn't officially support windows, but there seems to be a regression in pex 1.6.0 with windows with respect to the way the vendoring works. Specifically, the regex for iterating the zip path in the vendoring uses the os separator, which has two problems on windows (where the separator is \):

  1. \ is an escape sequence, so unless the separator is added twice in the string, it uses it to escape the character immediately after the separator, causing the regex compilation to fail.
  2. The path in the zip file uses unix separators (/), so using the os separator is actually wrong.

This PR addresses this by updating the regex to replace windows separators (\) in the prefix var with the unix separator.

Copy link
Contributor

Eric-Arellano left a comment

Hi @yorinasub17, thank you for submitting this patch!

I believe this PR can be simplified. Rather than ever using os.sep, we should be able to always use / as this is a zip file, which as you state is how zip files work everywhere, including in Windows.

To land this PR, could you please do two things:

  1. Research and confirm that this assumption about zip files always using / is indeed 100% true. It would be great to find the spec claiming this and to link to this in the PR description. Because we don't have any CI testing for Windows, this would be a good way to verify correctness.
  2. Simplify the code to restore the original change and simply swap os.sep with /, along with a brief comment explaining how we can do this thanks to zip files' standard (maybe include the link from prior step).

Thank you!

@yorinasub17

This comment has been minimized.

Copy link
Contributor Author

yorinasub17 commented Mar 12, 2019

@Eric-Arellano Thanks for the PR review! I am currently bogged down with my other work, but will take a look at this when I free up later in the month.

@yorinasub17

This comment has been minimized.

Copy link
Contributor Author

yorinasub17 commented Apr 9, 2019

Research and confirm that this assumption about zip files always using / is indeed 100% true. It would be great to find the spec claiming this and to link to this in the PR description. Because we don't have any CI testing for Windows, this would be a good way to verify correctness.

It appears that the "official spec" of zip files is the one published by PKWARE (See https://support.pkware.com/display/PKZIP/APPNOTE, the reference to that in wikipedia, and the reference to the appnote file from winzip). From there, I checked the latest spec and confirmed that all paths must use forward slashes (section 4.4.17):

All slashes MUST be forward slashes '/' as opposed to backwards slashes '' for compatibility with Amiga and UNIX file systems etc.

Based on the above, I'll go through the process of simplifying the code.

Copy link
Contributor

Eric-Arellano left a comment

Thank you so much for the research to confirm this! That's very helpful.

Couple suggestions and then this should be ready to merge 🎉

pex/third_party/__init__.py Outdated Show resolved Hide resolved
pex/third_party/__init__.py Outdated Show resolved Hide resolved
@Eric-Arellano

This comment has been minimized.

Copy link
Contributor

Eric-Arellano commented Apr 10, 2019

It also looks like CI is failing due to the style check. After making the above changes, run tox -v -e style to test if style will be pass.

@yorinasub17

This comment has been minimized.

Copy link
Contributor Author

yorinasub17 commented Apr 10, 2019

@Eric-Arellano I believe I addressed all the comments, but please don't merge this yet. I haven't had a chance to test on my windows box yet and I have a suspicion the updated version won't work because I haven't touched the prefix var, which might contain os.sep.

@yorinasub17

This comment has been minimized.

Copy link
Contributor Author

yorinasub17 commented Apr 10, 2019

Sorry about the style error. I thought tox succeeded, but I was fooled because it actually failed on the find on my mac and I wasn't paying close attention to the logs:

using tox.ini: /Users/yoriy/packages/yoripex/tox.ini
using tox-3.4.0 from /Users/yoriy/.pyenv/versions/2.7.15/lib/python2.7/site-packages/tox/__init__.pyc
GLOB sdist-make: /Users/yoriy/packages/yoripex/setup.py
  /Users/yoriy/packages/yoripex$ /Users/yoriy/.pyenv/versions/2.7.15/bin/python2.7 /Users/yoriy/packages/yoripex/setup.py sdist --formats=zip --dist-dir /Users/yoriy/packages/yoripex/.tox/dist >/Users/yoriy/packages/yoripex/.tox/log/tox-0.log
style reusing: /Users/yoriy/packages/yoripex/.tox/style
style inst-nodeps: /Users/yoriy/packages/yoripex/.tox/dist/pex-1.6.0.zip
  /Users/yoriy/packages/yoripex$ /Users/yoriy/packages/yoripex/.tox/style/bin/python -m pip install --no-deps -U /Users/yoriy/packages/yoripex/.tox/dist/pex-1.6.0.zip >/Users/yoriy/packages/yoripex/.tox/style/log/style-12.log
  /Users/yoriy/packages/yoripex$ /Users/yoriy/packages/yoripex/.tox/style/bin/python -m pip freeze >/Users/yoriy/packages/yoripex/.tox/style/log/style-13.log
style installed: DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.,gitdb==0.6.4,GitPython==0.3.2rc1,pep8==1.4.5,pex==1.6.0,pyflakes==0.7.2,smmap==0.9.0,twitter.checkstyle==0.1.0,twitter.common.app==0.3.0,twitter.common.collections==0.3.0,twitter.common.contextutil==0.3.0,twitter.common.dirutil==0.3.0,twitter.common.lang==0.3.0,twitter.common.log==0.3.0,twitter.common.options==0.3.0,twitter.common.process==0.3.0,twitter.common.string==0.3.0,twitter.common.util==0.3.0
style run-test-pre: PYTHONHASHSEED='1115460778'
style runtests: commands[0] | bash scripts/style.sh
  /Users/yoriy/packages/yoripex$ /bin/bash scripts/style.sh
find: ,: unknown primary or operator
__________________________________________________________________________ summary ___________________________________________________________________________
  style: commands succeeded
  congratulations :)
@yorinasub17

This comment has been minimized.

Copy link
Contributor Author

yorinasub17 commented Apr 10, 2019

@Eric-Arellano Thanks for your patience with me through this review.

Ok after testing on my windows box, it turns out the updated version doesn't work for two reasons:

  • As suspected, I needed to fix the prefix as well.
  • I also had forgotten that the relpath that is passed into the function is constructed with os.sep. This needs to be the case because the _DirIterator expects the path to contain the sep.

I attempted to address both of these, but it sort of reverts back to what I had originally and I am not sure if the current way is actually simpler. What do you think? Any ideas to improve this?

EDIT: Note that I confirmed the latest version at commit 3241d07 works on my windows box.

Copy link
Contributor

Eric-Arellano left a comment

Thanks for continuing to iterate on this!

Indeed this is now closer to the original PR, but with the requested changes it should still be simpler than checking if we're using Windows, plus the comment you are adding is quite useful.

pex/third_party/__init__.py Outdated Show resolved Hide resolved
pex/third_party/__init__.py Outdated Show resolved Hide resolved
if zipfile.is_zipfile(path):
return cls(zipfile_path=path, prefix=prefix + os.sep if prefix else '')
prefix = os.path.join(prefix, os.path.basename(path))
return cls(zipfile_path=path, prefix='{}/'.format(prefix) if prefix else '')

This comment has been minimized.

Copy link
@Eric-Arellano

Eric-Arellano Apr 10, 2019

Contributor

Is there any guarantee prefix will never already have os.sep in it? That's an earnest question - I couldn't figure it out from a very quick search for _ZipIterator.

Either way, for consistency and safety, might be better to change this line to this idiom:

return cls(zipfile_path=path, prefix='' if not prefix else '{}{}'.format(prefix, os.sep).replace(os.sep, '/')

This comment has been minimized.

Copy link
@yorinasub17

yorinasub17 Apr 10, 2019

Author Contributor

Is there any guarantee prefix will never already have os.sep in it?

I think this can be guaranteed because prefix starts off as '' outside the while loop, and only contains parts constructed with os.path.basename. os.path.basename takes the last path element, so it is equivalent to arg.split(os.sep)[-1], which will mean the result won't contain the os.sep character.

That said, there is something that doesn't make sense to me about this while loop. This threw me off at first, but I believe the while loop is repeatedly walking up root directory by directory until it hits a zip file. E.g ignoring prefix for a minute, the loop boils down to:

while path:
  if zipfile.is_zipfile(path):
    return
  # if path = foo/bar/baz, then os.path.dirname(path) = foo/bar
  path = os.path.dirname(path)

Given that, I think the prefix is actually backwards. If we go into the loop with foo.zip/bar/baz, assuming foo.zip is the zip file, prefix will become baz/bar, because after the first iteration, prefix will be basename of the path (which is baz), and the second iteration joins that with basename of the path again (the original was prefix = os.path.join(prefix, os.path.basename(path)), so os.path.join('baz', os.path.basename('foo.zip/bar')) = 'baz/bar'). Am I missing something with my logic here?

This must be working because the tests pass, but I wonder if this is actually an uncaught bug? I haven't had a chance to walk through all the test cases to know if there is any test case for this, or if the caller of _ZipIterator somehow guarantees this loop only goes one iteration deep.

Also this loop is an infinite loop if root is ever an absolute path that does not contain a zip file, because os.path.dirname('/') == '/', so it will never be falsy.

This comment has been minimized.

Copy link
@jsirois

jsirois Apr 20, 2019

Member

You're right about the bug - and it skates by because items are exactly one level deep. The prefix = os.path.join(prefix, os.path.basename(path)) line should be prefix = os.path.join(os.path.basename(path), prefix).

As far as the infinite loop goes, the one call of containing ensures root is not a dir (but not that it's not a file). It probably makes sense to either guard the infinite loop case or else assert early that root is either a zipfile or else not a file and not a directory (foo.zip/bar/baz is not either of these).

This comment has been minimized.

Copy link
@Eric-Arellano

Eric-Arellano Apr 20, 2019

Contributor

Great investigation @yorinasub17 and thanks @jsirois for confirming.

I think it'd be best to save this bug fix for a separate PR, as this one is focused on Windows support. If you're up for it, that'd be great for you to open this PR. No worries if you're busy - we can do it otherwise.

This comment has been minimized.

Copy link
@yorinasub17

yorinasub17 Apr 20, 2019

Author Contributor

Thanks for the confirmation. I made the fix for the prefix here, but have not addressed the infinite loop.

I won't be able to address it today though. I'll check back again when I have time and if it still hasn't been fixed yet, I'll open a PR.

pex/third_party/__init__.py Outdated Show resolved Hide resolved
yorinasub17 added 2 commits Apr 10, 2019
Copy link
Contributor

Eric-Arellano left a comment

Thanks for your continued work to land this! Merging now.

@Eric-Arellano Eric-Arellano changed the title Change windows separator with unix separator in regex When iterating over Zipfiles, always use the Unix file separator to fix a Windows issue Apr 20, 2019
pex/third_party/__init__.py Outdated Show resolved Hide resolved
pex/third_party/__init__.py Outdated Show resolved Hide resolved
pex/third_party/__init__.py Outdated Show resolved Hide resolved
@yorinasub17

This comment has been minimized.

Copy link
Contributor Author

yorinasub17 commented Apr 20, 2019

Thanks for your continued support on this @jsirois and @Eric-Arellano.

I made the changes requested and tested it out on my windows box and confirmed it still works.

@jsirois jsirois merged commit 28869f6 into pantsbuild:master Apr 20, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jsirois jsirois mentioned this pull request Apr 21, 2019
10 of 11 tasks complete
@yorinasub17 yorinasub17 deleted the yorinasub17:yori-vendor-regex-supports-windows branch May 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.