Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consume packed wheel cache in zipapp creation #2175

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Jul 19, 2023

Closes #2158.

Problem

#1675 has another repro.

Generating a --compress --layout zipapp (the default) regularly exhibits pathologically slow performance (46s zip / 1m1s total) on many real-world inputs (especially tensorflow). This occurs even when all dists are already downloaded from pypi:

; time pex -vvvvvvvvv --layout zipapp --resolver-version=pip-2020-resolver 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3' -o test.pex
# ...
# Saving PEX file to test.pex
# pex: Zipping PEX file.: 45659.1ms
# 57.18s user 3.02s system 97% cpu 1:01.57 total
; ls -lAvFh test.pex
# -rwxrwxr-x 1 cosmicexplorer cosmicexplorer 617M Jul 26 09:17 test.pex*
; du -b test.pex
# 616754775       test.pex

Alternatives

#1705 introduced --no-compress, which drastically improves the zip performance (2.2s zip => 20.6x speedup / 19.3s total = 2.4x speedup):

; time pex -vvvvvvvvv --layout zipapp --no-compress --resolver-version=pip-2020-resolver 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3' -o test.pex
# ...
# Saving PEX file to test.pex
# pex: Zipping PEX file.: 2209.7ms
# pex  14.11s user 3.06s system 89% cpu 19.285 total

but as #1686 describes, it also introduces a significant tradeoff in file size (1.6G => 2.5x blowup):

; ls -lAvFh test.pex
# -rwxrwxr-x 1 cosmicexplorer cosmicexplorer 1.6G Jul 26 09:20 test.pex*
; du -b test.pex
# 1541737605      test.pex

The cache entries are also many times larger:

; ls $(find ~/.pex/packed_wheels -name '*tensorflow_gpu*whl')
# -rw-rw-r-- 1 cosmicexplorer cosmicexplorer 468M Aug  1 09:10 /home/cosmicexplorer/.pex/packed_wheels/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
# -rw-rw-r-- 1 cosmicexplorer cosmicexplorer 1.2G Aug  1 09:16 /home/cosmicexplorer/.pex/packed_wheels/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/un-compressed/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl

Solution

  1. Introduce pex.ziputils.MergeableZipFile to incorporate the contents of other zip files without decompressing them.
    • Introduce pex.common.copy_file_range() to copy over a limited range of the contents of another file handle.
  2. Introduce the --cache-dists option to toggle the use (and/or creation) of the packed wheel caches for .bootstrap/ and .deps/ subdirectories.
  3. Consume the existing --layout packed caches for --layout zipapp when --cache-dists is provided.

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Jul 19, 2023

Reading @jsirois's thoughts at #2158 (comment) again:

Since this would only be used by the Pex CLI (build-time), a rust implementation is actually how I'd approach this. It's the right language for the job, the problem really has nothing to do with Pex (it has much wider applicability), and there are great tools for turning a crate into an easily consumable Python platform-specific distribution.

I'll take a look at pyoxy and friends to convert the crate into a python distribution. I'll leave this as a draft until then, and I'll re-ping reviewers at that time.

@cosmicexplorer
Copy link
Contributor Author

Ah, wonderful, I've just found the docs for pyo3 https://docs.rs/pyo3/latest/pyo3/.

@jsirois
Copy link
Member

jsirois commented Jul 20, 2023

Ah, wonderful, I've just found the docs for pyo3 https://docs.rs/pyo3/latest/pyo3/.

Great. Yeah @cosmicexplorer that's exactly what I meant. I've been away climbing here and haven't taken a look yet, but if this all looks good, consuming as a native wheel was the idea.

@cosmicexplorer
Copy link
Contributor Author

Ok, just had a lot of fun converting this into a pyo3 package and the result runs even faster! This branch isn't passing because I've added some hacks to get it running, but I will be working to remove those!

@cosmicexplorer cosmicexplorer force-pushed the medusa-zip branch 13 times, most recently from e6ea967 to fb4bbae Compare July 26, 2023 10:57
@cosmicexplorer cosmicexplorer marked this pull request as ready for review July 26, 2023 10:58
@cosmicexplorer cosmicexplorer changed the title Medusa zip zip -FF merge optimization for --compress --layout zipapp Jul 26, 2023
@cosmicexplorer cosmicexplorer force-pushed the medusa-zip branch 2 times, most recently from 954476e to 4d04953 Compare July 26, 2023 11:39
@cosmicexplorer
Copy link
Contributor Author

Ok, I didn't think it would be possible, but @jsirois (who is off climbing now) was absolutely right to suggest investigating the reuse of --layout packed cache entries. This solution:

  • doesn't rely on any external binaries or libraries,
  • doesn't introduce any new cache entries,
  • doesn't rely on any private or undocumented methods,
  • does produce standards-compliant zip files! You'll notice it looks much more similar to the previous implementation.

The magic trick is in pex.ziputils.MergeableZipFile, a subclass of zipfile.ZipFile with a method .merge_archive() that loops over entries from a source ZipFile and accesses its (public!) file pointer attribute .fp to directly copy over compressed file contents.

@Eric-Arellano @stuhood: absolutely zero rush on this, but since @jsirois is out right now, please feel free to leave any thoughts/comments whenever you find the time.

As a side note, I'm planning to submit a proposal to PackagingCon about this + pypa/pip#12186 today/this week (the cfp closes August 6th).

@cosmicexplorer
Copy link
Contributor Author

One thing it would be very helpful to get input on: there are two changes in behavior in the current diff (both of which are subjective, not necessary to achieve the goals of this diff, and can be reverted):

  1. source files now come before .bootstrap/ in the output of unzip -l for --layout zipapp: https://github.com/pantsbuild/pex/blob/7130bd1c57bd00cd81581e0b901c4aba740b5a15/pex/pex_builder.py#L855-L858
  2. all entries in ~/.pex/{bootstrap_zips,packed_wheels}/ have deterministic_timestamp=True: https://github.com/pantsbuild/pex/blob/7130bd1c57bd00cd81581e0b901c4aba740b5a15/pex/pex_builder.py#L816-L820

I suspect the right thing to do is to revert to the prior behavior in this change, then split out separate PRs to propose both of these behavioral changes, but I would like to know whether these changes in behavior are considered modifying the public API.

@jsirois
Copy link
Member

jsirois commented Aug 4, 2023

I'm reviewing presently, but as to:

  1. source files now come before .bootstrap/ in the output of unzip -l for --layout zipapp:

This is not good for 2 reasons:

  1. Zips are tail oriented; so having __main__.py in the tail is desirable. That file is executed on every run (unless you use --sh-boot) from within the zip even if the PEX zip has previously been extracted / installed and if it was already loaded into the page cache when the end of central directory record was read, that's a small overhead win.
  2. When I run zipinfo my.pex today, the .bootstrap/, etc scroll up and away and I see my user code at the bottom in the console.

Python is so slow compared to (SSD) disk accesses that 1 may not be all that relevant any longer and 2 is eye of the beholder - presumably your zipinfo environment is in a pager and so you see the 1st page 1st. Either way, I think this change is, on the whole, not justified.

  1. all entries in ~/.pex/{bootstrap_zips,packed_wheels}/ have deterministic_timestamp=True:

Seems right to me.

@cosmicexplorer
Copy link
Contributor Author

When I run zipinfo my.pex today, the .bootstrap/, etc scroll up and away and I see my user code at the bottom in the console.

Oops! I was so used to .deps/ being quite large for the test cases I've been playing with here that I hadn't realized you're describing the more usual case! I had been repeatedly piping everything through less. This one is easy to revert; however:

Zips are tail oriented; so having main.py in the tail is desirable.

If I'm hearing you right, it sounds like you're advocating for something a little different: the ordering of:

  1. .bootstrap/
  2. .deps/
  3. sources, metadata and runner, incl PEX-INFO and __main__.py

Right now, it's .bootstrap/, then sources, then .deps/ is at the end. To address both of your reasons, it sounds like we should change the behavior from the current? I can push a diff to show what that would look like!

add change to do parallel zipping only, no crawling

modify cli arg format for medusa-zip tool

update cli arg format

fix non-exhaustive CopyMode usage

[BROKEN] add first run of complete medusa zip with cli arg!

the resulting zip cannot be zipimported yet....

medusa zipping works great now, let's revert .zip() changes

bump medusa options

bump more medusa options

use the merged medusa command lines now

manage a cache of parallel intermediate zip generation jobs!

small fix

much closer to mergeable now

working much more complex control flow between the medusa-zip cli

move medusa zip to medusa.py

medusa works for packed apps now too

works for everything, but kind of crazy

close stdin after writing to the child process

factor out a ridiculous amount of boilerplate

add back the non-medusa impl for packed layouts

implement a "normal" version which uses atomic directories

revert unintentional whitespace changes

separate the serial and parallel pex creations

remove the attempts at parallelism

add --medusa-path

remove unused code

make the medusa hook work when not provided

add back a tracer

revert some changes that make things harder to read

revert some changes i shouldn't need

make medusa work with the medusa-zip package and not subprocesses!

update after adding defaults in medusa-zip python package

remove -f arg for resolving medusa-zip

[BROKEN] possibly obsolete!

fix cli arg

add stitched layout

create stitch copymode

no

initial stitch impl

add merge_zip_file() method

move packed wheel caching into common methods

initial impl of merging zips with entry renaming

make MergeableZipFile into a subclass of ZipFile

fix header offset calculation

tmp

fix mypy

remove unused imports

fix buffering for file handles in py2

Revert "tmp"

This reverts commit 8ad12de71a455c3918434cd81dcaf319fa43d9b4.
@cosmicexplorer
Copy link
Contributor Author

Oops! Didn't realize that was the current behavior! However, regarding the following:

Zips are tail oriented; so having main.py in the tail is desirable.

Would it be useful to put __main__.py at the literal very end, then?

@jsirois
Copy link
Member

jsirois commented Aug 4, 2023

If I'm hearing you right, it sounds like you're advocating for something a little different: the ordering of:

I'm advocating for don't muck with essentially unrelated stuff in this PR. Please keep the status quo.
That status quo, right or wrong, is this:

$ pex -V
2.1.140
$ touch foo.py bar.py
$ pex cowsay --module foo --module bar -o cowsay.pex
$ zipinfo -1 cowsay.pex
.bootstrap/
.bootstrap/pex/
...
.bootstrap/pex/version.py
.bootstrap/pex/ziputils.py
.deps/
.deps/cowsay-5.0-py2.py3-none-any.whl/
...
.deps/cowsay-5.0-py2.py3-none-any.whl/cowsay-5.0.dist-info/entry_points.txt
.deps/cowsay-5.0-py2.py3-none-any.whl/cowsay-5.0.dist-info/top_level.txt
PEX-INFO
__main__.py
__pex__/
__pex__/__init__.py
bar.py
foo.py

Your contributions are very welcome, but I'm having flashbacks of the wild sprawling nature these things tended to have and that part is very unwelcome as a reviewer, especially a plodding, single threaded reviewer like me. Please keep the exposed products of your work as focused as possible regardless of how sprawling your internal work process may be.

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Aug 4, 2023

Please see #2175 (comment), where you'll find I had a simple misunderstanding. I understand you are not at liberty to review in depth right now.

Copy link
Member

@jsirois jsirois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Danny. This looks good to me mod some details.

pex/common.py Outdated Show resolved Hide resolved
pex/common.py Outdated Show resolved Hide resolved
pex/pex_builder.py Outdated Show resolved Hide resolved
copy_file_range(source_zf.fp, self.fp, zinfo.compress_size)

# (4) Hack the synthesized ZipInfo onto the destination zip's infos.
self.filelist.append(zinfo)
Copy link
Member

@jsirois jsirois Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the hack tally is self.{fp,filelist,NameToInfo} afaict. Techincally "public", but certainly not documented as such. CI proves this is OK across CPython 2.7, 3.5, 3.7, 3.11 and 3.12 as well as PyPy 2.7 and 3.9. CI will continue to be extensive; so I'm not too worried in the medium term, but it does seem like this should all be behind a flag given the combination of the hackyness and the new cache burden (which you thankfully eliminated for --layout packed users like Pants, but still exists for folks who never use(d) that option). You seemed to be on board with a feature flag ealier - does that still seem reasonable?

Ideally the original tack of a library that handled the zipping, possibly including parallelization in addition to the ability here of merging would take the place of all this to both eliminate the hack factor and speed things up in the cold cache case not to mention make these features available outside Pex and outside Python.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps no flag is just fine. I'm a bit conflicted here. I still do not see the real use case for this feature, I'd personally use --layout packed for both iteration and distribution (via rsync over ssh), but Pex has never treated new caches as opt-in. Its cache behavior is currently uniformly untunable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seemed to be on board with a feature flag ealier - does that still seem reasonable?

Yes, absolutely! Pip still has the zip file magic I proposed to them behind --use-feature=fast-deps for similar reasons, although that's had a lot more people contribute to it since then.

the new cache burden (which is you thankfully eliminated for --layout packed users like Pants, but still exists for folks who never use(d) that option).

Yes: for this reason alone I think it should be behind a feature flag.

Ideally the original tack of a library that handled the zipping, possibly including parallelization in addition to the ability here of merging would take the place of all this to both eliminate the hack factor and speed things up in the cold cache case not to mention make these features available outside Pex and outside Python.

Well yes, that definitely hasn't evaporated; in fact, it's quite useful, and indeed takes advantage of rust's structured parallelism, and it's a CLI, a rust library, and a pyo3 python library: https://github.com/cosmicexplorer/medusa-zip. I also got the hang of using github actions and cibuildwheels in the meantime and it's now published to crates.io ({,lib,py}medusa-zip) and pypi (medusa-zip) upon every tagged release. I've been continuing to speak with @jonjohnsonjr regarding the similarity of this work to Chainguard's use case efficiently merging tarballs and/or containers (see chainguard-dev/apko#781) and considering whether we can share any code. libarchive exists, but its goal is compatibility with the widest range of formats, not performance, and it seems plausible that a general framework for efficiently merging the contents of archives might have applicability to a wide range of build tools and package managers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outside Python.

I might propose this ZipFile#merge_archive() method to python-ideas too, although it seems possibly too specialized to be in the stdlib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems too specialized. FWICT they have been on a somewhat lazy warpath to do less in the stdlib, not more.

@jsirois
Copy link
Member

jsirois commented Aug 4, 2023

Would it be useful to put main.py at the literal very end, then?

Yes, but __pex__/__init__.py also has the same role when the PEX is used as a sys.path entry; i.e.: PYTHONPATH=my.pex python -c 'from __pex__ import cowsay; ...' (this is a newish feature you may not be aware of). That said, I think some PR later should probably make the tail order exactly:

__pex__/
__pex__/__init__.py
__main__.py

The __main__.py entrypoint case will probably always be more commonly used than the __pex__ magic import hook.

@jsirois
Copy link
Member

jsirois commented Aug 4, 2023

Just noticed this:

Ok. The docs are in a sorry state to be sure, but that can be remedied.

That wasn't my intention -- please don't assume antagonism.

I actually mean it. The docs are in a sorry state. They are horrible. I have not spent any time on them in the ~5 years I've been working as the primary Pex maintainer; meanwhile I have spent time adding a lot of features for Pants; so the docs are severely out of date. I do plan to circle back though. I've vetted out a whole system over at https://github.com/a-scie/lift based on Sphinx that I will be applying here along with fully re-worked content.

@cosmicexplorer
Copy link
Contributor Author

That said, I think some PR later should probably make the tail order exactly:

On it! #2209

@cosmicexplorer
Copy link
Contributor Author

@jsirois you already approved this, but I assumed you'd want to review again after I introduced the --cache-dists option to gate the use of the packed wheel cache (--cache-dists is default off for zipapps, but default on for packed, in order to maintain the current behavior). I have also significantly shorted the commit message, although in this case I've kept the Problem/Solution structure as I feel that describing the tradeoff vs --no-compress in the Alternatives section was worth specifically calling out.

Copy link
Member

@jsirois jsirois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look more closely here in an hour or so (this is from phone); then I'm AFK until August 18th.

@@ -664,17 +672,34 @@ def build(
else:
shutil.rmtree(tmp_pex, True)

if not compress:
if cache_dists is True:
pex_warnings.warn(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, say there is some legitimate use case, aka the user knows better. You've now pissed them off by nagging them. It's much cleaner if you just fail fast if that makes sense to do. I'm not commenting on whether or not there is a legitimate use case, my intent is just to point out the cost of combinatorial feature explosion and the nanny vs. document vs. trust the user tradeoffs.

@jsirois
Copy link
Member

jsirois commented Aug 9, 2023

Re the commit description, I'm still not a fan of the formula - you can say the exact same thing without the damn template. But that's just my grumpy dislike of the 5 paragraph essay. More importantly you just whiff the actual alternative, which is the packed layout. I think you should address why that doesn't work.

Copy link
Member

@jsirois jsirois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding the uncompressed option to the existing packed layout is a mistake. Seems good otherwise.

"PEX's bootstrap and dependency zip files. Does nothing for loose layout PEXes."
"PEX's bootstrap and dependency zip files. "
"Uncompressed PEX files are much faster to create from an empty cache, but are no "
"faster after the cache has been populated, and uncompressed cache entries will "
Copy link
Member

@jsirois jsirois Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading a second time there is just no reason to cache uncompressed at all. I definitely vote for fail fast. It seems compress=False, cached_dists=True should not be allowed at all. Further, there is no reason for the current packed layout behavior to be toggleable. If you're building a packed layout, the zips should be compressed zips since the whole point is cache friendliness. If there is no call to add a degree of freedom, don't add it because, at least in Pex, you can never take it back.

zinfo.compress_type = zipfile.ZIP_STORED
data = b""
else:
zinfo.file_size = st.st_size
zinfo.compress_type = zipfile.ZIP_DEFLATED
# File contents may be compressed or decompressed. Decompressed is significantly faster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a gratuitous comment. You're selling here or else speaking to an end user. Clearly this is not the place to speak to the user and the coder doesn't need to be sold to, if a function takes a param ... it takes a param. If its a public function maybe sell in the API doc.

Copy link
Member

@jsirois jsirois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, I'll be offline ~presently. Hopefully this approve over-rides the prior request changes. I would like to see the change, but I trust you to do the right thing.

Thanks for putting up with my opinions here. You've started a contribution spike and I'm grateful for that, but I'm pretty open about not viewing all contribution as a good thing by default. As such I'm just trying to set up my expectations / the current Pex code base opinions clearly to help make further contribution more straightforward for us both. My time on this project is limited as I'm sure yours is too; so making things efficient seems to benefit us both.

@jsirois
Copy link
Member

jsirois commented Dec 13, 2023

@cosmicexplorer although this PR is approved, another thing to weigh is #2298 which provides even greater speedups with 0 hacks by skipping zipping entirely with a bonus of skipping unzipping (pre-installing) as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

research area: parallel zip creation
2 participants