
Load PyPI dependency info into a database #2

Open
chadwhitacre opened this issue Oct 18, 2016 · 77 comments

@chadwhitacre

chadwhitacre commented Oct 18, 2016

Resolving dependencies by running pip install -r requirements.txt and then sniffing the results is a reeeaalllyy inefficient way to go about it. Better is going to be loading each package index into a database. Let's do this once, starting with PyPI.

First step is to basically download PyPI.

@chadwhitacre

Ultimately what we want is to pass in a requirements.txt and/or a setup.py and get back a structure representing the dependencies.

Hmm ... https://graphcommons.com/

@chadwhitacre

Jackpot!

https://github.com/anvaka/allpypi

@chadwhitacre

Well, jackpot: https://github.com/anvaka/pm#individual-visualizations. Indexers for 13 ecosystems!

@chadwhitacre

As of https://mail.python.org/pipermail/distutils-sig/2015-January/025683.html we don't have to worry about mutability in PyPI. That means we never need to update info once we have it. It's possible to delete packages but I think we don't want to do that. We want to keep old info around.

@chadwhitacre

And we only need one-week granularity. If we update once a day we'll be well inside our loop.

@chadwhitacre

Where the mutability comes in is that dependencies are subject to a range. If I depend on Foo >= 1.0, then when Foo 1.1 comes out my dependency chain will need to be updated.

@chadwhitacre

What's the data structure we want?

@chadwhitacre

chadwhitacre commented Oct 18, 2016

We want to support taking in a list of files of type text/x-python (for setup.py) and/or text/plain (for requirements.txt), and returning a single flattened list of dependencies with this info:

  • package_manager—PyPI, but also GitHub, Bitbucket, ... probably Git? SCMs in general?
  • name—name for PyPI, repo name for GitHub, Bitbucket, ... something for SCMs?
  • url—on package manager
  • license—a string or url for the actual license
  • osi_license—a link under https://opensource.org/licenses; null indicates non-OSI-approved
  • version—the single latest version satisfying version_range; null means dep hell
  • version_range—a single range computed from folding together required_by; empty set indicates dependency hell
  • required_by—a list of (location, version range) tuples
    • location for text/plain—filename or URL; and line number
    • location for text/x-python—filename, URL, or package; parameter (*_require{s}); and index in the argument

Take care to handle different files with the same name in the upload.
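
A minimal sketch of that return value as Python dataclasses (the field names mirror the bullets above; how the location info is shaped inside required_by is a guess, not a settled design):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequiredBy:
    location: str              # filename, URL, or package
    detail: str                # line number (text/plain) or parameter + argument index (text/x-python)
    version_range: str         # e.g. ">=1.0,<2.0"

@dataclass
class Dependency:
    package_manager: str       # "pypi", "github", "bitbucket", ...
    name: str
    url: str                   # page on the package manager
    license: Optional[str]     # string or URL for the actual license
    osi_license: Optional[str] # link under https://opensource.org/licenses; None = not OSI-approved
    version: Optional[str]     # latest version satisfying version_range; None = dep hell
    version_range: str         # folded together from required_by; empty = dep hell
    required_by: list = field(default_factory=list)   # list of RequiredBy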

@chadwhitacre

chadwhitacre commented Oct 18, 2016

Two more hops:

  • database schema to support that query
  • ETL flow to populate the database from PyPI

@ewdurbin

ewdurbin commented Oct 18, 2016

Briefly spoke on the phone with @whit537 and summarized the basic guidance for PyPI data analytics that require traversing dependency trees.

Note that right now the metadata available via PyPI is limited and may be for some time, as PEP 426 is indefinitely deferred. The deferment section of that PEP points to other PEPs addressing these topics.

In order to crawl PyPI for dependency links, you'll need general metadata for "indexing" as well as the package files themselves to obtain dependency information via setuptools/distutils.

Recommended approach:

  • Set up and maintain a bandersnatch-based mirror of PyPI's simple index and package data (wheels, tarballs, zips, exes... all of it) (around 450 GB right now)
  • Use the simple index or the XMLRPC list_packages() call to retrieve a list of packages registered with PyPI.
  • Use the per-package JSON API https://pypi.org/pypi/<package_name>/json to retrieve metadata and releases for individual packages (https://pypi.org/pypi/requests/json for example)
  • For specific releases, use the JSON API https://pypi.org/pypi/<package_name>/<release_identifier>/json (https://pypi.org/pypi/requests/2.11.1/json for example)
  • In order to obtain dependency information for a given package and release, you'll need to read it in from the artifact itself (obtained via bandersnatch)

All of the above endpoints and tools, with the exception of the XMLRPC, are designed to minimize impact on the PyPI backend infrastructure, as they are easily cached in our CDN.
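
For example, the per-package and per-release endpoints above can be hit with nothing fancier than requests (a rough sketch; no caching or error handling):

import requests

def package_json(name):
    # General metadata plus the full set of releases for one package.
    return requests.get("https://pypi.org/pypi/%s/json" % name).json()

def release_json(name, version):
    # Metadata for one specific release.
    return requests.get("https://pypi.org/pypi/%s/%s/json" % (name, version)).json()

data = package_json("requests")
print(len(data["releases"]))                                    # number of releases on record
print(release_json("requests", "2.11.1")["info"]["summary"])    # the short description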

@chadwhitacre

A table for releases, unique on (package_manager_id, package_id, version), also has path, license, and osi_license columns.

A releases.deps column stores upstream dependencies, which are self-references to specific releases, along with version_range and required_by for each. Resolving a set of dependencies is then a matter of merging the deps chains for the input set. The process used to precompute deps should be usable on-the-fly for queries.

PyPI gives us a changelog that includes new release events. With that, we should be able to recompute deps for the subset of affected packages; we'll need a reverse mapping (release_id, required_by) for that. Basically, when a new release comes out we want to:

  1. compute deps for that release, and
  2. go through everything that depends on any other version of the newly released package, and update deps.

We'll need to keep a table of packages, and do the reverse mapping based on that. Something like:

for package in might_depend_on(new_release):
    for release in package.releases:
        release.update_deps(new_release)
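
A first sketch of that schema in SQL, driven from Python (the jsonb type for deps, the shape of the packages table, and the gdr database name are all assumptions; the reverse mapping isn't modeled yet):

import psycopg2

SCHEMA = """
CREATE TABLE packages (
    id                  serial PRIMARY KEY,
    package_manager_id  int  NOT NULL,
    name                text NOT NULL,
    UNIQUE (package_manager_id, name)
);

CREATE TABLE releases (
    id                  serial PRIMARY KEY,
    package_manager_id  int  NOT NULL,
    package_id          int  NOT NULL REFERENCES packages(id),
    version             text NOT NULL,
    path                text,
    license             text,
    osi_license         text,
    deps                jsonb,  -- precomputed: release_id, version_range, required_by per upstream dep
    UNIQUE (package_manager_id, package_id, version)
);
"""

with psycopg2.connect("dbname=gdr") as conn:   # assumed local database
    with conn.cursor() as cur:
        cur.execute(SCHEMA)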

@chadwhitacre

chadwhitacre commented Oct 18, 2016

[edit: what he said :]

@chadwhitacre

So for ETL we can look at bandersnatch. Before the call I had been thinking we'd roll our own (I'd already started, based on #2 (comment)); it would look like this (a rough sketch of the first and last steps follows the list):

  • one xmlrpc call to get the list of all names
  • download json for each package
  • split the json into multiple files: one for the package, and one for each release
  • walk the release json to download tarballs
  • crack open tarballs to discover requirements info, load into db
  • watch for new releases via the changelog XML-RPC
  • download the release json for new releases, and update the db
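
A sketch of those first and last steps using the standard library's XML-RPC client (list_packages, changelog_last_serial, and changelog_since_serial are PyPI XML-RPC methods; the tuple layout is from memory, so treat this as a sketch):

import xmlrpc.client   # xmlrpclib on the Python 2.7 box above

pypi = xmlrpc.client.ServerProxy("https://pypi.python.org/pypi")

names = pypi.list_packages()                 # step 1: every name registered with PyPI
print(len(names))

last_serial = pypi.changelog_last_serial()   # remember where we are

# Later, for steps 6-7: poll for events newer than the serial we've already processed.
for event in pypi.changelog_since_serial(last_serial):
    name, version, timestamp, action, serial = event   # tuples along these lines
    print(serial, name, version, action)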

@chadwhitacre

Another point @ewdurbin made on the phone is that some projects vendor in their dependencies (e.g., that's how Requests uses urllib3), and an approach that looks only at setup.py and requirements.txt won't pick that up.

@chadwhitacre

I've downloaded and run a bit of bandersnatch. I am finding tarballs. I think we should be able to get what we need from that, without having to resort to the JSON API (bandersnatch does fetch JSON under the hood, but afaict it throws it away). The name, version, and license are in PKG-INFO (is PKG-INFO guaranteed to exist and have those keys?). With the name we can compute the url. osi_license will be something we compute based on license. Dependency info we've already said we need to extract from the tarballs.

One issue with bandersnatch is that it doesn't download tarballs that aren't on PyPI.

Another is that we don't actually need to keep the tarballs around after we process them. Doing so would cost about $50/mo at Digital Ocean. Will we be able to easily convince bandersnatch not to redownload things we've already downloaded and then deleted?
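
A sketch of pulling name/version/license out of an sdist once we have it on disk (PKG-INFO is RFC 822-ish, so the stdlib email parser does the job; whether License is always present is exactly the open question above, and the path below is made up):

import tarfile
from email.parser import Parser

def pkg_info(path):
    """Return the top-level PKG-INFO headers from an sdist tarball, or None."""
    with tarfile.open(path, "r:*") as tar:
        for member in tar.getmembers():
            # The sdist layout puts it at <name>-<version>/PKG-INFO.
            if member.name.count("/") == 1 and member.name.endswith("/PKG-INFO"):
                raw = tar.extractfile(member).read().decode("utf-8", "replace")
                return Parser().parsestr(raw)
    return None

info = pkg_info("web/packages/.../requests-2.11.1.tar.gz")   # hypothetical path under the mirror
if info is not None:
    print(info["Name"], info["Version"], info.get("License"))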

@chadwhitacre

D'oh! :-/

[screenshot: screen shot 2016-10-18 at 3 07 39 pm]

@chadwhitacre

If we can delete old tarballs without tripping up bandersnatch, then we should be able to run a bandersnatch process, and a second process to consume tarballs: ETL them and then throw them away. This second process can run cronishly, offset from bandersnatch, and simply walk the tree looking for tarballs.

@chadwhitacre

I've moved http://gdr.rocks/ over to NYC1 and am attaching a 500 GB volume.

@chadwhitacre

Derp. Volumes are only resizable up.

[screenshot: screen shot 2016-10-18 at 3 28 05 pm]

@chadwhitacre

cd /mnt/pypi/
virtualenv .
bin/pip install bandersnatch
bin/bandersnatch -c conf mirror          # initial run with a conf file in the mirror directory
vim conf                                 # settings changed in conf:
  directory = /mnt/pypi
  delete-packages = false
nohup bin/bandersnatch -c conf mirror &  # long-running sync, backgrounded

@chadwhitacre

chadwhitacre commented Oct 18, 2016

grep 'Storing index page' nohup.out indicates that it's processed about 1% of records so far.

@chadwhitacre

That puts us at about eight hours to finish.

@chadwhitacre

Okay! Let's do some local testing wrt snatching tarballs out from under bandersnatch. Also: ETL.

@chadwhitacre

From reading through mirror.py, it looks like we should be able to satisfy bandersnatch with a state file that records the serial number we consider ourselves synced through. What is a serial?

@chadwhitacre

Here's what it looks like when I echo 2229089 > status and rerun:

[gdr]$ bandersnatch -c conf mirror
2016-10-18 16:08:17,248 INFO: bandersnatch/1.11 (CPython 2.7.11-final0, Darwin 14.5.0 x86_64)
2016-10-18 16:08:17,248 INFO: Removing inconsistent todo list.
2016-10-18 16:08:17,249 INFO: Syncing with https://pypi.python.org.
2016-10-18 16:08:17,250 INFO: Current mirror serial: 2229089
2016-10-18 16:08:17,250 INFO: Syncing based on changelog.

@chadwhitacre

The weird thing is that on the first run through, it processes packages in alphabetical order by name, not in numeric order by serial. It only writes status after a successful sync. On subsequent runs, it uses the changelog RPC. But what is it doing with serial in that case?

@chadwhitacre

New status is 2410197.

@chadwhitacre

Re-ran, updated three packages, status is 2410203.

@chadwhitacre

Okay!

@chadwhitacre

So it looks like there will be order of magnitude 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo.

@chadwhitacre

chadwhitacre commented Oct 19, 2016

Looking at file types (h/t):

root@gdr:/mnt/pypi# find web/packages -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' > extensions &
root@gdr:/mnt/pypi# cat extensions | sort | uniq -c | sort -n
      5 deb
      6 dmg
     25 tgz
     60 rpm
     64 msi
    353 bz2
   1464 exe
   5239 egg
   6556 zip
   9905 whl
  50351 gz

@chadwhitacre

What if the only tarball we have for a release is an MSI?

@chadwhitacre

Or exe, more likely.

@chadwhitacre

I guess let's focus on gz, whl, zip, egg, bz2, and tgz.

@chadwhitacre

So it looks like there will be order of magnitude 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo.

I guess it's a time/space trade-off. Fresh mirrors are within an hour. If we update every 30 minutes then it should be 100s or maybe even 10s of packages, and we can probably manage with 5 GB or maybe even 1 GB.

@chadwhitacre

chadwhitacre commented Oct 19, 2016

If it's under 1 GB then we can keep it on the droplet and not use a separate volume, though if we're going to run a database at Digital Ocean to back gdr.rocks then we should maybe store it on a volume for good decoupling.

@chadwhitacre

Managed Postgres at DO starts at $19/mo.

@chadwhitacre

Alright, let's keep this lightweight. One $5/mo droplet, local Postgres. The only reason we are mirroring PyPI is to extract dependency info, which we can't get from metadata. We don't need to store all metadata, because PyPI itself gives us a JSON API, which we can even hit from the client side if we want to (I checked: we have Access-Control-Allow-Origin: *). That should be sufficient to populate the /on/pypi/* pages.

@chadwhitacre

Okay! I think I've figured out incremental updates. Bandersnatch needs a generation file or it'll start from scratch. That refers to the bandersnatch schema version, basically. 5 is the latest. And then it needs a status file with the serial number we want to start from (i.e., the last seen, ... hmm—on which side is it inclusive?). Then it needs a configuration file. That's all it needs to do an incremental update! We can rm -rf web (the directory it downloads into). We can throw away the todo file. As long as we have a conf, generation, and status, bandersnatch will happily sync based on the changelog.

Now, it will over-download, but if we process frequently enough, we should be okay. It looks like if we process every 30 minutes then we'll have well less than 100 packages to update. Packages generally have well less than 100 release files, though when Requests or Django pushes a new release we'll have a lot of old ones to download. I guess we want to tune the cron to run frequently enough to keep the modal batch size small, while still giving us enough time to complete processing for the occasional larger batch. Logging ftw.
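
A sketch of seeding just those three files by hand so the next run syncs from the changelog (the generation value 5 and the serial below come straight from the notes above; the conf file is assumed to exist alongside them):

import os

MIRROR = "/mnt/pypi"

def seed_state(serial):
    """Leave only conf, generation, and status; bandersnatch syncs incrementally from there."""
    with open(os.path.join(MIRROR, "generation"), "w") as f:
        f.write("5\n")              # bandersnatch schema version ("5 is the latest", per above)
    with open(os.path.join(MIRROR, "status"), "w") as f:
        f.write("%d\n" % serial)    # the serial we consider ourselves synced through

seed_state(2410203)                 # the last serial observed above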

@chadwhitacre

That should be sufficient to populate the /on/pypi/* pages.

On the other hand, the JSON is heavy (100s of kB), and the description field is a mess. We might want to do our own README analysis while we've got the tarballs cracked. Hmm ...

@chadwhitacre

chadwhitacre commented Oct 19, 2016

How about we grab READMEs while we're in there, as well as long_description from setup.py? That way we'll at least have them if we want to do something with them later. What if there are multiple README files? README.rst, README.md, ...

@chadwhitacre

Since we're going to be importing untrusted setup.py modules we probably still want the Docker sandbox.

@chadwhitacre

But we'd have it in the tarchomper process instead of in the web app.
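
Something like this, perhaps, from inside tarchomper: run egg_info in a throwaway, network-less container over the unpacked sdist and read the generated metadata back off the mount (the image name, mount point, and timeout are assumptions, not a worked-out sandbox):

import subprocess

def egg_info_in_sandbox(src_dir):
    """Run the untrusted setup.py in a disposable container; metadata lands back in src_dir."""
    subprocess.check_call([
        "docker", "run", "--rm",
        "--net", "none",             # no network for untrusted code
        "-v", "%s:/src" % src_dir,   # unpacked sdist, mounted read-write
        "-w", "/src",
        "python:2.7",                # assumed base image
        "python", "setup.py", "egg_info",
    ], timeout=120)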

@chadwhitacre

Extension finder died mid-write. :]

root@gdr:/mnt/pypi# cat extensions | sort | uniq -c | sort -n
      1 g
      1 ZIP
     24 deb
     39 dmg
    187 tgz
    417 rpm
    424 msi
   2717 bz2
  11684 exe
  40647 egg
  50619 zip
  77144 whl
 391041 gz

@chadwhitacre

Okay! So! tarchomper! 🎯

@chadwhitacre

  1. cp status status.bak to save our place in case the process crashes
  2. rm -rf web todo to start from a clean slate
  3. bandersnatch -c conf mirror to fetch all tarballs for packages where there have been changes
  4. walk the tree for the new tarballs
  5. for each tarball, open it up and extract the info we need
  6. spit out SQL—COPY?
  7. run the SQL
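
A rough wrapper for that loop (paths mirror the droplet layout above; extract() and load() are hypothetical placeholders for steps 5-7):

import glob
import os
import shutil
import subprocess

MIRROR = "/mnt/pypi"

def chomp():
    os.chdir(MIRROR)
    shutil.copy("status", "status.bak")           # 1. save our place in case we crash
    for stale in ("web", "todo"):                 # 2. clean slate
        if os.path.isdir(stale):
            shutil.rmtree(stale)
        elif os.path.exists(stale):
            os.remove(stale)
    subprocess.check_call(["bin/bandersnatch", "-c", "conf", "mirror"])   # 3. fetch changed packages
    rows = []
    for pattern in ("*.tar.gz", "*.tgz", "*.tar.bz2", "*.zip", "*.whl", "*.egg"):
        for artifact in glob.glob("web/packages/**/" + pattern, recursive=True):   # 4. walk the tree
            rows.append(extract(artifact))        # 5. crack it open, pull out what we need
    load(rows)                                    # 6-7. spit out the SQL (COPY?) and run it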

@chadwhitacre

chadwhitacre commented Oct 19, 2016

Nomenclature update:

  • application—the top-level thing for which we are resolving a comprehensive list of dependencies
  • package—the thing identified by name, e.g., requests
  • release—the thing identified by (name, version), e.g., (requests, 2.11.1)
  • artifact—the actual file (we've been calling this "tarball" above), e.g., requests-2.11.1.tar.gz

@chadwhitacre

(Note: projects listed in setup_requires will NOT be automatically installed on the system where the setup script is being run. They are simply downloaded to the ./.eggs directory if they’re not locally available already. If you want them to be installed, as well as being available when the setup script is run, you should add them to install_requires and setup_requires.)

http://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords

Okay, let's not worry about setup_requires. We don't need tests_require either, since that's a dependency of the project itself, not of the project's users.

On the other hand we should include extras_require only if the extras are in use by the downstream package/application. Hmm ... optionality.
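
For what it's worth, once we're reading metadata out of wheels, the Requires-Dist lines already mark extras with an environment marker, and the packaging library parses them (a sketch; the sample lines are made up in the style of requests'):

from packaging.requirements import Requirement

requires_dist = [                       # illustrative Requires-Dist values from a wheel's METADATA
    'urllib3<1.17,>=1.10',
    'pyOpenSSL>=0.13; extra == "security"',
]

base, extras = [], []
for line in requires_dist:
    req = Requirement(line)
    # Crude check: a marker mentioning "extra" only applies when that extra is requested downstream.
    if req.marker is not None and "extra" in str(req.marker):
        extras.append((req.name, str(req.specifier), str(req.marker)))
    else:
        base.append((req.name, str(req.specifier)))

print(base)     # always part of the dependency chain
print(extras)   # only if the downstream package/application asks for the extra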

@chadwhitacre

Blorg. Tests are failing after installing bandersnatch, because it install_requires some pytest plugins. I guess the workaround is to manually uninstall these. We'll have to teach Travis to do the same.

@chadwhitacre

PR in #5.

@chadwhitacre

In light of the shift of focus at gratipay/gratipay.com#4135 (comment), I've removed the droplet, volume, and floating IP from Digital Ocean to avoid incurring additional cost.
