Make open_zip print realpath when raising BadZipfile. #4186

Merged
merged 4 commits into from Jan 26, 2017

Conversation

Projects
None yet
3 participants
@mateor
Member

mateor commented Jan 17, 2017

Problem

contextutil.open_zip(fp) is generally called on files under .pants.d, which
increasingly are symlinks.

In the motivating example, a jar was corrupted and could
not be opened by detect_duplicates task. The Exception was:

Exception message: Bad Zipfile
/buildroot/.pants.d/pom-resolve/pom-resolve/syms/org/some/corrupted.jar: File is not a zip file

The corrupted file is really at the realpath, and
neither cleaning nor deleting the symlink will help.

Also, it turns out that zipfile does not filter for falsey values
but instead bubbles up an AttributeError that is tough to traceback:

        # Determine file size
>       fpin.seek(0, 2)
E       AttributeError: 'NoneType' object has no attribute 'seek'

Solution

  • Check if open_zip is passed a falsey value, and return a custom subclass of
    zipfile.BadZipfile created for that circumstance
  • Provide the realpath of files that cause zipfile.BadZipfile to bubble up.

Result

  • If passed a falsey file path:
        if not path_or_file:
>       raise InvalidZipPath('{} is not a valid zip location'.format(path_or_file))
E       InvalidZipPath: Invalid zip location: None
  • If passed a symlink to a corrupted archive:

It used to be:

    Exception message: Bad Zipfile
    /buildroot/.pants.d/pom-resolve/pom-resolve/syms/org/some/corrupted.jar: File is not a zip file

And now it says

    Exception message: Bad Zipfile
    ~/.m2/repository/org/some/corrupted.jar: File is not a zip file
Make open_zip print realpath when raising BadZipfile.
open_zip is used to unzip jars, but is usually called
on files under .pants.d, which increasingly are symlinks.

In the motivating example, a jar was corrupted and could
not be opened by detect_duplicates task. The Exception was:

```
Exception message: Bad Zipfile
/buildroot/.pants.d/pom-resolve/pom-resolve/syms/org/some/corrupted.jar: File is not a zip file
```

The corrupted file is really at the realpath, and
neither cleaning nor deleting the symlink will help.

The updated output hopefully makes that more clear.

This commit also adds a None check for the passed file_path.

It turns out that zipfile does not filter falsey values but instead
bubbles up an AttributeError that is tough to traceback:
```
        # Determine file size
>       fpin.seek(0, 2)
E       AttributeError: 'NoneType' object has no attribute 'seek'
```
a custom subclass of BadZipFile to return in that instance.

None of the behavior here is new, this commit just provides improved feedback.
Passing None to zipfile.Zipfile has always halted execution, as did non-zip files.

@jsirois jsirois self-requested a review Jan 17, 2017

src/python/pants/util/contextutil.py
zf = zipfile.ZipFile(path_or_file, *args, allowZip64=allowZip64, **kwargs)
except zipfile.BadZipfile as bze:
- raise zipfile.BadZipfile("Bad Zipfile {0}: {1}".format(path_or_file, bze))
+ # Print the realpath in order to follow symlinks back to the problem source file.

This comment has been minimized.

@jsirois

jsirois Jan 17, 2017

Member

s/Print/Use/? not 'Print' anyway in this context.

@jsirois

jsirois Jan 17, 2017

Member

s/Print/Use/? not 'Print' anyway in this context.

src/python/pants/util/contextutil.py
@@ -22,6 +22,13 @@
from pants.util.tarutil import TarFile
+class InvalidZipPath(zipfile.BadZipfile):
+ """Raise instead of passing unchecked falsey values through to zipfile.ZipFile.

This comment has been minimized.

@jsirois

jsirois Jan 17, 2017

Member

The doc here reads like a bug report, instead prefer language that says what the exception means. Perhaps:

"""Indicates a bad zip file path."""
@jsirois

jsirois Jan 17, 2017

Member

The doc here reads like a bug report, instead prefer language that says what the exception means. Perhaps:

"""Indicates a bad zip file path."""

This comment has been minimized.

@mateor

mateor Jan 19, 2017

Member

I was hoping to motivate it's existence alongside the upstream BadZipfile - without that info the distinction is tough to describe.

But that was all an attempt to not break the public API behavior of raising BadZipfile. I will update as requested, since I have been persuaded by your argument that since it wasn't documented, the exception type can be fair game :)

@mateor

mateor Jan 19, 2017

Member

I was hoping to motivate it's existence alongside the upstream BadZipfile - without that info the distinction is tough to describe.

But that was all an attempt to not break the public API behavior of raising BadZipfile. I will update as requested, since I have been persuaded by your argument that since it wasn't documented, the exception type can be fair game :)

src/python/pants/util/contextutil.py
@@ -22,6 +22,13 @@
from pants.util.tarutil import TarFile
+class InvalidZipPath(zipfile.BadZipfile):

This comment has been minimized.

@jsirois

jsirois Jan 17, 2017

Member

Although open_zip is :API: public, the API is not documented as raising anything in particular. I find no trapping of BadZipfile in our codebase so it is probably appropriate to switch InvalidZipPath to subclassing ValueError - which is more appropriate - and documenting the actual public api in open_zip if you're feeling charitable.

@jsirois

jsirois Jan 17, 2017

Member

Although open_zip is :API: public, the API is not documented as raising anything in particular. I find no trapping of BadZipfile in our codebase so it is probably appropriate to switch InvalidZipPath to subclassing ValueError - which is more appropriate - and documenting the actual public api in open_zip if you're feeling charitable.

This comment has been minimized.

@mateor

mateor Jan 19, 2017

Member

Sure, I can do that.

@mateor

mateor Jan 19, 2017

Member

Sure, I can do that.

src/python/pants/util/contextutil.py
zf = zipfile.ZipFile(path_or_file, *args, allowZip64=allowZip64, **kwargs)
except zipfile.BadZipfile as bze:
- raise zipfile.BadZipfile("Bad Zipfile {0}: {1}".format(path_or_file, bze))
+ # Print the realpath in order to follow symlinks back to the problem source file.
+ raise zipfile.BadZipfile("Bad Zipfile {0}: {1}".format(os.path.realpath(path_or_file), bze))

This comment has been minimized.

@wisechengyi

wisechengyi Jan 18, 2017

Contributor

Feel free to ignore, but do you think it would be useful to print both the symlink and the realpath?

@wisechengyi

wisechengyi Jan 18, 2017

Contributor

Feel free to ignore, but do you think it would be useful to print both the symlink and the realpath?

This comment has been minimized.

@mateor

mateor Jan 19, 2017

Member

My instinct is not to, since it seems like an implementation detail, and adds noise to the signal.

But I could perhaps be persuaded - do you have a circumstance you think it would be useful to know the symlink path?

@mateor

mateor Jan 19, 2017

Member

My instinct is not to, since it seems like an implementation detail, and adds noise to the signal.

But I could perhaps be persuaded - do you have a circumstance you think it would be useful to know the symlink path?

This comment has been minimized.

@wisechengyi

wisechengyi Jan 19, 2017

Contributor

No particular use case yet hence not a strong opinion, but I feel in case of debugging, it is easy to find the realpath given symlink, but not the other way around.

@wisechengyi

wisechengyi Jan 19, 2017

Contributor

No particular use case yet hence not a strong opinion, but I feel in case of debugging, it is easy to find the realpath given symlink, but not the other way around.

This comment has been minimized.

@jsirois

jsirois Jan 19, 2017

Member

Maybe, in a new PR, a utility to describe a path could be added and used. In this context, there is no notion of symlink, but realpath makes some sense - ie, lets report the fully resolved terminator. The utility could, if the path is a symlink, recursively follow and build a list of all paths terminating in the realpath - handing back this whole thing as a string for use in exceptions or otherwise.

@jsirois

jsirois Jan 19, 2017

Member

Maybe, in a new PR, a utility to describe a path could be added and used. In this context, there is no notion of symlink, but realpath makes some sense - ie, lets report the fully resolved terminator. The utility could, if the path is a symlink, recursively follow and build a list of all paths terminating in the realpath - handing back this whole thing as a string for use in exceptions or otherwise.

src/python/pants/util/contextutil.py
@@ -173,17 +170,24 @@ def pushd(directory):
@contextmanager
def open_zip(path_or_file, *args, **kwargs):
"""
- A with-context for zip files. Passes through positional and kwargs to zipfile.ZipFile.
+ A with-context for zip files. Passes through positional and kwargs to zipfile.ZipFile.

This comment has been minimized.

@jsirois

jsirois Jan 23, 2017

Member

If you want, the pep-257 would have """[summary]

[rest]
""",
so:

"""A with-context for zip files.

Passes through positional and kwargs to zipfile.ZipFile.

...
"""
@jsirois

jsirois Jan 23, 2017

Member

If you want, the pep-257 would have """[summary]

[rest]
""",
so:

"""A with-context for zip files.

Passes through positional and kwargs to zipfile.ZipFile.

...
"""

This comment has been minimized.

@mateor

mateor Jan 23, 2017

Member

Sure, can do.

@mateor

mateor Jan 23, 2017

Member

Sure, can do.

@mateor

This comment has been minimized.

Show comment
Hide comment
@mateor

mateor Jan 23, 2017

Member

I updated the doc string newline but canceled the Travis CI build since the last run covered all the code changes.

Member

mateor commented Jan 23, 2017

I updated the doc string newline but canceled the Travis CI build since the last run covered all the code changes.

@jsirois

This comment has been minimized.

Show comment
Hide comment
@jsirois

jsirois Jan 23, 2017

Member

FWIW - the format is still wonked. Only the summary sentence should be up on that 1st line.

Member

jsirois commented Jan 23, 2017

FWIW - the format is still wonked. Only the summary sentence should be up on that 1st line.

@mateor

This comment has been minimized.

Show comment
Hide comment
@mateor

mateor Jan 23, 2017

Member

oh yes - so it is. I wasn't careful there - sorry. Will fix.

Member

mateor commented Jan 23, 2017

oh yes - so it is. I wasn't careful there - sorry. Will fix.

@mateor mateor merged commit 35ec8d7 into pantsbuild:master Jan 26, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@mateor mateor deleted the mateor:mateo.open_zip_realpath branch Jan 26, 2017

lenucksi added a commit to lenucksi/pants that referenced this pull request Apr 25, 2017

Make open_zip print realpath when raising BadZipfile. (#4186)
open_zip is used to unzip jars, but is usually called
on files under .pants.d, which increasingly are symlinks.

In the motivating example, a jar was corrupted and could
not be opened by detect_duplicates task. The Exception was:

```
Exception message: Bad Zipfile
/buildroot/.pants.d/<snip>/syms/org/some/corrupted.jar: \
    File is not a zip file
```

The corrupted file is really at the realpath, and
neither cleaning nor deleting the symlink will help.
The updated output hopefully makes that more clear.

This commit also adds a None check for the passed file_path.
It turns out that zipfile does not filter falsey values but instead
bubbles up an AttributeError that is tough to traceback:
```
        # Determine file size
>       fpin.seek(0, 2)
E       AttributeError: 'NoneType' object has no attribute 'seek'
```
a custom subclass of BadZipFile to return in that instance.

None of the behavior here is new, this commit just provides improved feedback.
Passing None to zipfile.Zipfile has always halted execution, as did non-zip files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment