make use of pip install --report JSON output #2210

Open
4 tasks
cosmicexplorer opened this issue Aug 6, 2023 · 4 comments

Comments

@cosmicexplorer
Contributor

cosmicexplorer commented Aug 6, 2023

After quite a long saga (pypa/pip#53), pip install now accepts a --report=<out.json> option (see pypa/pip#10771). Combined with --ignore-installed and --dry-run, it produces a resolve report suited to tools like pex. Further changes are in flight to make this metadata-only resolve significantly faster by avoiding downloads entirely (see pypa/pip#12186), and there are plans to make it almost instantaneous by caching metadata lookups (pypa/pip#12184). With the --use-feature=fast-deps option, these improvements also apply to resolves against wheels in a --find-links index or a PyPI-like index that hasn't yet implemented PEP 658 (PyPI itself has only just enabled it).
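For readers who have not used the option, here is a minimal sketch of the invocation described above. The helper names build_report_command() and resolve_report() are hypothetical and are not part of pex or pip; the sketch assumes a pip new enough to support --report and --dry-run, and network access for the actual resolve.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Sequence


def build_report_command(requirements: Sequence[str], report_path: str) -> List[str]:
    """Build a pip command line for a dry-run resolve that writes a JSON report.

    Hypothetical helper for illustration; pex's real invocation may differ.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--dry-run",           # resolve only, install nothing
        "--ignore-installed",  # keep already-installed packages out of the result
        "--report", report_path,
        *requirements,
    ]


def resolve_report(requirements: Sequence[str]) -> Dict[str, Any]:
    """Run pip and return the parsed report (needs network access)."""
    with tempfile.TemporaryDirectory() as tmp:
        report_path = Path(tmp) / "report.json"
        subprocess.run(build_report_command(requirements, str(report_path)), check=True)
        return json.loads(report_path.read_text())
```

Calling resolve_report(["numpy>=1.19.5", "mtcnn"]) would return the parsed report as a dict, whose "install" list holds one entry per resolved distribution.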

One use case where this shines is lock file creation. A prototype I made, incorporating a few of the in-progress changes mentioned above, exposes a function pex.resolver.resolve_new() that executes pip install --report but otherwise takes the same arguments as resolve(): https://github.com/pantsbuild/pex/compare/main...cosmicexplorer:pip-json-resolve?expand=1. Even without any of the work from pypa/pip#12184, this roughly halves the time pex spends within pip when creating a lockfile:

# Not the same output as a lock file, but PEX would process this json instead of needing to parse pip output.
> time python3.8 -c 'from pex.resolver import resolve_new; import json; print(json.dumps(list(resolve_new(requirements=["numpy>=1.19.5", "keras==2.4.3", "mtcnn", "pillow>=7.0.0", "bleach>=2.1.0", "tensorflow-gpu==2.5.3"]))[0]))'
...
real    0m8.769s
user    0m5.858s
sys     0m1.224s
> time python3.8 -m pex.cli lock create --resolver-version=pip-2020-resolver 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m15.923s
user    0m11.784s
sys     0m1.643s

Executing pex with sufficient verbosity confirms that >15 seconds of that pex process is spent within pip. In the uncached case, the prototype still does better: 26s for resolve_new() in the prototype branch vs. 43s for pex lock create on main.

While looking to incorporate these changes, I found that pex3 lock create currently scans the output of pip download to extract hashes and download locations, both of which are already contained in the --report JSON output. I didn't want to spend the time replacing that yet, but I suspect that leaning on the metadata-only resolve JSON will make the implementation of pex3 lock easier to follow.
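To illustrate where those hashes and download locations live, here is a rough sketch that pulls them out of a parsed report. It assumes the layout of pip's installation report, where each entry in "install" carries a "metadata" object and a "download_info" object in direct URL form; PinnedArtifact and pinned_artifacts() are hypothetical names, not pex APIs.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class PinnedArtifact(NamedTuple):
    name: str
    version: str
    url: str
    sha256: Optional[str]  # None when the entry carries no sha256 archive hash


def pinned_artifacts(report: Dict[str, Any]) -> List[PinnedArtifact]:
    """Collect name, version, URL and sha256 for each resolved distribution."""
    artifacts = []
    for item in report.get("install", []):
        metadata = item["metadata"]
        download_info = item["download_info"]
        archive_info = download_info.get("archive_info") or {}
        # Prefer the "hashes" mapping; fall back to the "hash": "sha256=<hex>" form.
        sha256 = (archive_info.get("hashes") or {}).get("sha256")
        if sha256 is None:
            legacy = archive_info.get("hash") or ""
            if legacy.startswith("sha256="):
                sha256 = legacy[len("sha256="):]
        artifacts.append(
            PinnedArtifact(metadata["name"], metadata["version"], download_info["url"], sha256)
        )
    return artifacts
```

Because the report is plain JSON, this replaces log scraping with dictionary lookups, which is the simplification suggested above.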

Remaining tasks (for the prototype branch at https://github.com/cosmicexplorer/pex/tree/pip-json-resolve):

  • Revert any modifications to pex vendoring pip, and instead lean on PipVersionValue to select pip versions that support --report.
  • Rename resolve_new() to something like metadata_only_resolve().
  • Make lock {create,update} consume metadata_only_resolve().
    • This should use the latest possible PipVersionValue to keep up with performance improvements, falling back to the current implementation (which scans output logs) when the pip version in use does not support --report.
  • Get pypa/pip#12184 (memoizing dependency lookups with a local resolve index) and the other in-flight changes merged and released upstream.
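The version-gated fallback in the task list could look roughly like the sketch below. It does not use pex's real PipVersionValue API; the function names and strategy labels are hypothetical, and the 22.2 threshold is my understanding of when pip gained --report and --dry-run, so treat it as an assumption to verify.

```python
import re
from typing import Tuple

# Assumption: --report and --dry-run first shipped in pip 22.2.
MIN_REPORT_VERSION = (22, 2)


def parse_release(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '23.2.1' -> (23, 2, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a recognizable pip version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def supports_report(pip_version: str) -> bool:
    """Return True when the given pip version is at or above the threshold."""
    return parse_release(pip_version) >= MIN_REPORT_VERSION


def choose_lock_strategy(pip_version: str) -> str:
    """Pick the report-based resolve when available, else the log-scanning path."""
    if supports_report(pip_version):
        return "metadata_only_resolve"
    return "scan_download_logs"
```

This ignores pre-release suffixes for simplicity; a real implementation would more likely compare parsed version objects.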
@cosmicexplorer
Contributor Author

Over Slack (https://pantsbuild.slack.com/archives/C087V4P1T/p1691343949034449), @jsirois urged me to look at the prior investigation by @thejcannon, discussed at https://pantsbuild.slack.com/archives/C087V4P1T/p1688051841183419 and #2044 (comment). In particular, @jsirois raised the possibility of using resolvelib directly instead of invoking pip at all. Taking full advantage of that would require reimplementing PEP 658 and lazy wheel/fast-deps support in pex, but it would also make it easier for pex (and therefore pants) to apply the pip resolution algorithm incrementally to support pex's use cases. He identified universal lockfiles as the key reason to avoid pip install --report, as he suspected they would present the most difficulty for the resolve report.

@cosmicexplorer
Contributor Author

Additionally, pip maintainers advised me (see pypa/pip#12184 (comment)) to approach the metadata lookup caching sketched out in pypa/pip#12184 as a plugin to resolvelib, or as some other mechanism that other users of resolvelib could also employ.

@cosmicexplorer
Contributor Author

@thejcannon's prior branch testing this is at https://github.com/thejcannon/pex/tree/jcannon/pip-report.

@thejcannon
Contributor

In my testing, the only large red flag was that VCS reqs in PEX are hashed via their downloaded zip. pip's report doesn't do that (but does embed the relevant commit in the metadata).
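To make the mismatch concrete, here is a small sketch of how a consumer might derive a fingerprint per report entry. It assumes the report's "download_info" follows the direct URL layout, with "vcs_info" (including "commit_id") for VCS requirements and "archive_info" for archives; the fingerprint() helper and its labels are hypothetical.

```python
from typing import Any, Dict, Tuple


def fingerprint(item: Dict[str, Any]) -> Tuple[str, str]:
    """Return a (kind, value) pair identifying what one report entry resolved to."""
    download_info = item["download_info"]

    vcs_info = download_info.get("vcs_info")
    if vcs_info is not None:
        # VCS requirements carry a resolved commit, not a hash of a downloaded zip.
        return ("commit", vcs_info["commit_id"])

    archive_info = download_info.get("archive_info") or {}
    sha256 = (archive_info.get("hashes") or {}).get("sha256")
    if sha256 is not None:
        return ("sha256", sha256)

    # Nothing usable (for example a local directory); the caller must decide what to do.
    return ("none", "")
```

A lock implementation built on the report would therefore need either to pin VCS requirements by commit or to download and hash them separately, as PEX does today.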
