Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pip for resolving and building distributions #781

Closed
jsirois opened this issue Oct 20, 2019 · 20 comments · Fixed by #788
Closed

Use pip for resolving and building distributions #781

jsirois opened this issue Oct 20, 2019 · 20 comments · Fixed by #788

Comments

@jsirois
Copy link
Member

jsirois commented Oct 20, 2019

Now that we can vendor distributions, we might consider vendoring pip to get pip-compatible resolution and wheel building behavior. This would likely kill a fair amount of PEX code and open issues at the expense of a larger PEX tool binary and, without care, a larger PEX runtime. Since pip should be largely (completely?) un-needed at runtime, this might have little downside.

@jsirois
Copy link
Member Author

jsirois commented Oct 20, 2019

This came up in the context of #770.

@wisechengyi
Copy link
Contributor

Curious what was the historical context of pex implementing the resolve logic? Did it predate pip?

@benjyw
Copy link
Collaborator

benjyw commented Oct 21, 2019

Seems like a big win to me. Especially since we could then resolve VCS dependencies, which there has been demand for from multiple quarters.

@adeandrade
Copy link

If file size is a concern we could have a package variant (say pex[runtime]) that does not have pip.

@jsirois
Copy link
Member Author

jsirois commented Oct 22, 2019

If file size is a concern we could have a package variant (say pex[runtime]) that does not have pip.

The idea here is to vendor pip and rely on its code. As such it wouldn't / couldn't be optional without losing all benefits described above. IE: if PEX were built without pip it would either be buggy wrt manylinux 2XXX, etc or we'd have to implement these things in pex anyhow.

@benjyw
Copy link
Collaborator

benjyw commented Oct 22, 2019

"Features are an asset, code is a liability." So I'm in favor.

@kwlzn
Copy link
Contributor

kwlzn commented Oct 23, 2019

Since pip should be largely (completely?) un-needed at runtime, this might have little downside.

hmm, I guess it's the case that pkg_resources alone handles the runtime-side resolution that happens to isolate deps in the pex for the case of e.g. multi-platform pexes?

based on pypa/pip#3121 it also seems like the only supported integration mode/philosophy of integration for pip is via subprocess execution of the CLI - and you'll recall that Patrick had an incredibly tough time even getting basic "manifest a resolved set of wheels into a dir" support added upstream to support the vend format work previously. I'd be leary about this aspect of things - esp given we have little to no relationship with the pypa community to effect change here.

but from a standardization perspective, it makes a lot of sense to support pluggable resolvers.

I also don't think pex is that far off from being at parity nor do I expect further radical change that we'll have to keep up with - afaict, just manylinux2010 and vcs support are the biggest gaps? those seem relatively trivial to me - and ideally Twitter should/would fund this imho.

This would likely kill a fair amount of PEX code and open issues at the expense of a larger PEX tool binary and, without care, a larger PEX runtime.

if we teased pex apart to not much more than a zipapp wrapper around pip and a lightweight runtime, what's its inherent value over tools like e.g. shiv? could it be time to re-imagine it as something entirely different (i.e. a modular rewrite)?

@jsirois
Copy link
Member Author

jsirois commented Oct 23, 2019

hmm, I guess it's the case that pkg_resources alone handles the runtime-side resolution that happens to isolate deps in the pex for the case of e.g. multi-platform pexes?

Yes, and even pkg_resources isn't really buying us much at all.

based on pypa/pip#3121 it also seems like the only supported integration mode/philosophy of integration for pip is via subprocess execution of the CLI ...

Yes, and we should invoke it via CLI if we go this route. In terms of the reliability of the pip feature set for our needs, really the only requirement we have beyond being compatible with its resolving and building - which should be the defacto correct way to do both of these things from here in perpetuity, is to be able to install the resolved requirements into chroots of some sort so that, at runtime, we can only activate the chroots needed for the current platform (we do this today by always resolving or building a wheel and then using a no-longer-maintained Wheel.install API).

Essentially we require (using [requests] as the list of requirements handed to the PEX CLI):

pip wheel --wheel-dir /tmp/wheels requests
for wheel in /tmp/wheels/*.whl; do
  wheel_install_chroot="$(echo "${wheel}" | sed -e "s|/tmp/wheels/|/tmp/deps/|")"
  pip install --no-deps --target "${wheel_install_chroot}" "${wheel}"
done

I think it's safe to say the pip install bit will be stable since that is what I imagine almost all pip uses in the wild invoke. The pip wheel bit is probably less used, buit it has already been fought for and exists at least as far back as version 1.5.3 (early 2014).

I also don't think pex is that far off from being at parity.

I'm not sure about this. Besides VCS and manylinux there is the fact mentioned above that we use a yanked Wheel.install api, forcing us to remain on old vendored setuptools and wheel. Not being able to upgrade setuptools means we need to continue to patch the copy of pep425tags.py.

Here are some other non-manylinux, non-vcs resolver bugs that would be fixed by just using the de-facto standard pip:

@fhoering
Copy link
Contributor

fhoering commented Oct 24, 2019

+1 on this proposal.

In our use case additional 5 MB for pip is nothing compared to the packages we use.
Some examples:
PySpark: 215 MB
TensorFlow: 86 MB
numpy: 20 MB
pandas: 10 MB
MLFlow: 14 MB
Jupyter notebook: 9 MB

In fact pex is becoming a critical piece of our infrastructure. When people get blocked because they can't run new packages/versions on the cluster and then can't release in Prod they want a fix quite quickly.

So having it behave like pip seems very reasonable & can easily be explained. I like the comment from @kwlzn about that it would become a "zipapp wrapper around pip and a lightweight runtime". This would be good news. When we started using it we actually thought that is exactly what it is. I don't think it needs to differentiate from other similar tools. Better it works well with less than trying to do everything (and having a lot of issues like the ones above)

However I'm a bit afraid of the timeline for this. How long will this take to implement ? If the answer is month can you at least unblock some recent issues (like #763, #768) that are starting to pile up ?

@benjyw
Copy link
Collaborator

benjyw commented Oct 25, 2019

@kwlzn Just for clarity - are you saying that you prefer not to vendor pip and use it, but to implement that functionality ourselves? Or are you just proposing it as a viable alternative. What would be the main concerns re using pip vs. the alternative?

@fhoering
Copy link
Contributor

I have another question. What is the exact reason why the runtime would actually need pip ?

When I zip a virtualenv/conda I can then use it without pip (haven't tested that though but at least it is not embedded inside)

wheel is needed as wheels are not served from within the zip file but maybe just this code is enough to put it into the cache:

from wheel.wheelfile import WheelFile
wf = WheelFile("mywheel.whl")
wf.extractall("/tmp")

@jsirois
Copy link
Member Author

jsirois commented Oct 25, 2019

@fhoering it shouldn't. See in the original issue description above:

Since pip should be largely (completely?) un-needed at runtime, this might have little downside.

@fhoering
Copy link
Contributor

But where is the downside then if it doesn't need to be embedded ?

@jsirois
Copy link
Member Author

jsirois commented Oct 25, 2019

But where is the downside then if it doesn't need to be embedded ?

I think that's what we're trying to flesh out with @kwlzn who has contributed heavily to pex and generally has the pulse of a decent representation of the community.

@jsirois
Copy link
Member Author

jsirois commented Oct 27, 2019

Curious what was the historical context of pex implementing the resolve logic? Did it predate pip?

@wisechengyi Pex did not predate pip, but it did pre-date its ascencsion as the defacto package installation tool for python. At the time (2010), pip was still pre 1.0 release and easy_install from setuptools was the package installation tool it was starting to take over for. Those facts aside, a strong goal was to have a zero dependency tool, so that may also have influenced the choice to implement a custom resolver, etc.

@jsirois
Copy link
Member Author

jsirois commented Oct 27, 2019

There is enough positive feedback that I'm going to take a whack at this so we all have a concrete change to look at and discuss further.

@jsirois
Copy link
Member Author

jsirois commented Nov 10, 2019

The single problem I'm left with is here:
https://github.com/pantsbuild/pex/blob/7d593478280844370cda0f71df391148d129deb0/pex/platforms.py#L138-L149

An exercise of this in integration tests here fails for OSX python 2.7:
https://github.com/pantsbuild/pex/blob/7d593478280844370cda0f71df391148d129deb0/tests/test_integration.py#L956-L967

To emulate this behavior I'd need to launch 6 parallel download attempts with pip over the abis currently generated by:
https://github.com/pantsbuild/pex/blob/7d593478280844370cda0f71df391148d129deb0/pex/platforms.py#L61-L69

This behavior / feature though seems fundamentally unsound and I'd rather drop it. As things currently stand, in a single resolve (download) over many transitive deps, you might find one deployed wheel for the, say cp27mu abi, and another for the cp27m abi. Different projects will simply release differently. The result would be a pex with an incomplete set of cp27 dependencies which is incorrect. In other words, this feature really only works against custom indexes / find-links repos pre-arranged to have uniform abis. By definition, if that is a precondition for this feature to work, the pre-arranged abis are known or knowable and can simply be included in an extended platform specifier. In direct use of pex, the user would simply be forced to specify full platform strings. In indirect use, say by pants, the convenience this feature affords when repos are controlled meeting the uniform abi criteria but the average python user may not know the correct abi string, the convenience can be more appropriately moved into a target macro or target re-definition that looks for simple platform strings and extends them to the known-correct values for that environment's custom repos.

I've stated all this rather declaratively but I do mean to solicit a sanity check since I may have mis-stated or mis-understood facts. @kwlzn, @wisechengyi, @stuhood and other folks who I'd guess might be leveraging this pex feature from Pants - do you buy my assessment / agree we can drop this feature? Do you agree that at the current layer it's applied (pex), it's actually a buggy feature and that it should be provided at the customized Pants deployment level?

@jsirois
Copy link
Member Author

jsirois commented Nov 11, 2019

If folks agree on the above (I'll add a commit to #788 that contains the proposed feature yank so it can be visualized), I'd also like to yank the --manylinux flag from the cli since that is the equivalent of just specifying an explicit platform like manylinux2010-x86_64-cp-36-m instead of linux-x86_64-cp-36-m. IOW, if no platforms are specified and the current interpreter is a linux interpreter, its manylinux support will be auto-determined and it makes no sense to force add manylinux to the interpreter if it does not support it nor remove manylinux support if it does support it. If a platform is specified however, we can just be exact. IOW we know we'll be deploying to manylinux2014 machines, etc.

@hrfuller
Copy link
Contributor

If this will keep us up to date with the latest platform tags coming out I think its a great idea!

@kwlzn
Copy link
Contributor

kwlzn commented Nov 12, 2019

@kwlzn Just for clarity - are you saying that you prefer not to vendor pip and use it, but to implement that functionality ourselves? Or are you just proposing it as a viable alternative. What would be the main concerns re using pip vs. the alternative?

@benjyw I don't have a strong preference either way as long as things continue to function at parity, ultimately. I do think there are clear upsides and downsides to both paths - so it's a bit of a stalemate in my mind. if development costs were free tho, I'd probably be strongly advocating that we continue to maintain pex as a key viable alternative to pip in the ecosystem (competition being a driver of innovation and all). that aspect feels like the only material loss here to me - but one we can/should probably live with tho for the better alignment/reduced maintenance costs.

jsirois added a commit that referenced this issue Nov 14, 2019
This changes Pex to vendor pip and defer all resolution and building to it.

As part of this change some ambiguous and / or broken features were changed and APIs adjusted:
+ The egg distribution format is no longer supported.
+ The deprecated `--interpreter-cache-dir` CLI option was removed.
+ The `--cache-ttl` CLI option and `cache_ttl` resolver API argument were removed.
+ The resolver API replaced `fetchers` with a list of `indexes` and a list of `find_links` repos.
+ The resolver API removed (http) `context` which is now automatically handled.
+ The resolver API removed `precedence` which is now pip default precedence - wheels when available and not ruled out via the `--no-wheel` CLI option or `use_wheel=False` API argument.
+ The `--platform` CLI option and `platform` resolver API argument now must be full platform strings that include platform, implementation, version and abi; e.g.: `--platform=macosx-10.13-x86_64-cp-36-m`.
+ The `--manylinux` CLI option and `use_manylinux` resolver API argument were removed. Instead, to resolve manylinux wheels for a foreign platform, specify the manylinux platform to use with an explicit `--platform` CLI flag or `platform` resolver API argument; e.g.: `--platform=manylinux2010-x86_64-cp-36-m`.

Fixes #781

Additionally:
Fixes #771
Fixes #763
Fixes #761
Fixes #735 
Fixes #694 
Fixes #660
Fixes #658 
Fixes #642
Fixes #641
Fixes #628
Fixes #620
Fixes #614
Fixes #611
Fixes #608
Fixes #439
Fixes #415
Fixes #387
Fixes #315
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants