
Load PyPI dependency info into a database #2

Open
chadwhitacre opened this issue Oct 18, 2016 · 77 comments

@chadwhitacre

chadwhitacre commented Oct 18, 2016

Resolving dependencies by running pip install -r requirements.txt and then sniffing the results is a reeeaalllyy inefficient way to go about it. Better is going to be loading each package index into a database. Let's do this once, starting with PyPI.

First step is to basically download PyPI.

@chadwhitacre

Ultimately what we want is to pass in a requirements.txt and/or a setup.py and get back a structure representing the dependencies.

Hmm ... https://graphcommons.com/

@chadwhitacre

Jackpot!

https://github.com/anvaka/allpypi

@chadwhitacre

Well, jackpot: https://github.com/anvaka/pm#individual-visualizations. Indexers for 13 ecosystems!

@chadwhitacre

As of https://mail.python.org/pipermail/distutils-sig/2015-January/025683.html we don't have to worry about mutability in PyPI. That means we never need to update info once we have it. It's possible to delete packages but I think we don't want to do that. We want to keep old info around.

@chadwhitacre

And we only need one-week granularity. If we update once a day we'll be well inside our loop.

@chadwhitacre

Where the mutability comes in is that dependencies are subject to a range. If I depend on Foo >= 1.0, then when Foo 1.1 comes out my dependency chain will need to be updated.

@chadwhitacre

What's the data structure we want?

@chadwhitacre

chadwhitacre commented Oct 18, 2016

We want to support taking in a list of files of type text/x-python (for setup.py) and/or text/plain (for requirements.txt), and returning a single flattened list of dependencies with this info:

  • package_manager—PyPI, but also GitHub, Bitbucket, ... probably Git? SCMs in general?
  • name—name for PyPI, repo name for GitHub, Bitbucket, ... something for SCMs?
  • url—on package manager
  • license—a string or url for the actual license
  • osi_license—a link under https://opensource.org/licenses; null indicates non-OSI-approved
  • version—the single latest version satisfying version_range; null means dep hell
  • version_range—a single range computed from folding together required_by; empty set indicates dependency hell
  • required_by—a list of (location, version range) tuples
    • location for text/plain—filename or URL; and line number
    • location for text/x-python—filename, URL, or package; parameter (*_require{s}); and index in the argument

Take care to handle different files with the same name in the upload.
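
A minimal sketch of that return value as Python dataclasses (the field names mirror the bullets above; how the location info is shaped inside required_by is a guess, not a settled design):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequiredBy:
    location: str              # filename, URL, or package
    detail: str                # line number (text/plain) or parameter + argument index (text/x-python)
    version_range: str         # e.g. ">=1.0,<2.0"

@dataclass
class Dependency:
    package_manager: str       # "pypi", "github", "bitbucket", ...
    name: str
    url: str                   # page on the package manager
    license: Optional[str]     # string or URL for the actual license
    osi_license: Optional[str] # link under https://opensource.org/licenses; None = not OSI-approved
    version: Optional[str]     # latest version satisfying version_range; None = dep hell
    version_range: str         # folded together from required_by; empty = dep hell
    required_by: list = field(default_factory=list)   # list of RequiredBy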

@chadwhitacre

chadwhitacre commented Oct 18, 2016

Two more hops:

  • database schema to support that query
  • ETL flow to populate the database from PyPI

@ewdurbin

ewdurbin commented Oct 18, 2016

Briefly spoke on the phone with @whit537 and summarized the basic guidance for PyPI data analytics that require traversing dependency trees.

Note that right now the metadata available via PyPI is limited and may be for some time, as PEP 426 is indefinitely deferred. The deferment section of that PEP points to other PEPs addressing these topics.

In order to crawl PyPI for dependency links, you'll need general metadata for "indexing" as well as the package files themselves to obtain dependency information via setuptools/distutils.

Recommended approach:

  • Set up and maintain a bandersnatch-based mirror of PyPI's simple index and package data (wheels, tarballs, zips, exes... all of it) (around 450 GB right now)
  • Use the simple index or the XMLRPC list_packages() call to retrieve a list of packages registered with PyPI.
  • Use the per-package JSON API https://pypi.org/pypi/<package_name>/json to retrieve metadata and releases for individual packages (https://pypi.org/pypi/requests/json for example)
  • For specific releases, use the JSON API https://pypi.org/pypi/<package_name>/<release_identifier>/json (https://pypi.org/pypi/requests/2.11.1/json for example)
  • In order to obtain dependency information for a given package and release, you'll need to read it in from the artifact itself (obtained via bandersnatch)

All of the above endpoints and tools, with the exception of the XMLRPC, are designed to minimize impact on the PyPI backend infrastructure, as they are easily cached in our CDN.
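
For example, the per-package and per-release endpoints above can be hit with nothing fancier than requests (a rough sketch; no caching or error handling):

import requests

def package_json(name):
    # General metadata plus the full set of releases for one package.
    return requests.get("https://pypi.org/pypi/%s/json" % name).json()

def release_json(name, version):
    # Metadata for one specific release.
    return requests.get("https://pypi.org/pypi/%s/%s/json" % (name, version)).json()

data = package_json("requests")
print(len(data["releases"]))                                    # number of releases on record
print(release_json("requests", "2.11.1")["info"]["summary"])    # the short description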

@chadwhitacre

A table for releases, unique on (package_manager_id, package_id, version), also has path, license, and osi_license columns.

A releases.deps column stores upstream dependencies, which are self-references to specific releases, along with version_range and required_by for each. Resolving a set of dependencies is then a matter of merging the deps chains for the input set. The process used to precompute deps should be usable on-the-fly for queries.

PyPI gives us a changelog that includes new release events. With that, we should be able to recompute deps for the subset of affected packages; we'll need a reverse mapping (release_id, required_by) for that. Basically, when a new release comes out we want to:

  1. compute deps for that release, and
  2. go through everything that depends on any other version of the newly released package, and update deps.

We'll need to keep a table of packages, and do the reverse mapping based on that. Something like:

for package in might_depend_on(new_release):
    for release in package.releases:
        release.update_deps(new_release)
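
A first sketch of that schema in SQL, driven from Python (the jsonb type for deps, the shape of the packages table, and the gdr database name are all assumptions; the reverse mapping isn't modeled yet):

import psycopg2

SCHEMA = """
CREATE TABLE packages (
    id                  serial PRIMARY KEY,
    package_manager_id  int  NOT NULL,
    name                text NOT NULL,
    UNIQUE (package_manager_id, name)
);

CREATE TABLE releases (
    id                  serial PRIMARY KEY,
    package_manager_id  int  NOT NULL,
    package_id          int  NOT NULL REFERENCES packages(id),
    version             text NOT NULL,
    path                text,
    license             text,
    osi_license         text,
    deps                jsonb,  -- precomputed: release_id, version_range, required_by per upstream dep
    UNIQUE (package_manager_id, package_id, version)
);
"""

with psycopg2.connect("dbname=gdr") as conn:   # assumed local database
    with conn.cursor() as cur:
        cur.execute(SCHEMA)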

@chadwhitacre

chadwhitacre commented Oct 18, 2016

[edit: what he said :]

@chadwhitacre

So for ETL we can look at bandersnatch. Before the call I had been thinking we'd roll our own (I'd already started, based on #2 (comment)); it would look like this (a rough sketch of the first and last steps follows the list):

  • one xmlrpc call to get the list of all names
  • download json for each package
  • split the json into multiple files: one for the package, and one for each release
  • walk the release json to download tarballs
  • crack open tarballs to discover requirements info, load into db
  • watch for new releases via the changelog XML-RPC
  • download the release json for new releases, and update the db
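
A sketch of those first and last steps using the standard library's XML-RPC client (list_packages, changelog_last_serial, and changelog_since_serial are PyPI XML-RPC methods; the tuple layout is from memory, so treat this as a sketch):

import xmlrpc.client   # xmlrpclib on the Python 2.7 box above

pypi = xmlrpc.client.ServerProxy("https://pypi.python.org/pypi")

names = pypi.list_packages()                 # step 1: every name registered with PyPI
print(len(names))

last_serial = pypi.changelog_last_serial()   # remember where we are

# Later, for steps 6-7: poll for events newer than the serial we've already processed.
for event in pypi.changelog_since_serial(last_serial):
    name, version, timestamp, action, serial = event   # tuples along these lines
    print(serial, name, version, action)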

@chadwhitacre

Another point @ewdurbin made on the phone is that some projects vendor in their dependencies (e.g., that's how Requests uses urllib3), and an approach that looks only at setup.py and requirements.txt won't pick that up.

@chadwhitacre

I've downloaded and run a bit of bandersnatch. I am finding tarballs. I think we should be able to get what we need from that, without having to resort to the JSON API (bandersnatch does fetch JSON under the hood, but afaict it throws it away). The name, version, and license are in PKG-INFO (is PKG-INFO guaranteed to exist and have those keys?). With the name we can compute the url. osi_license will be something we compute based on license. Dependency info we've already said we need to extract from the tarballs.

One issue with bandersnatch is that it doesn't download tarballs that aren't on PyPI.

Another is that we don't actually need to keep the tarballs around after we process them. Doing so would cost about $50/mo at Digital Ocean. Will we be able to easily convince bandersnatch not to redownload things we've already downloaded and then deleted?
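
A sketch of pulling name/version/license out of an sdist once we have it on disk (PKG-INFO is RFC 822-ish, so the stdlib email parser does the job; whether License is always present is exactly the open question above, and the path below is made up):

import tarfile
from email.parser import Parser

def pkg_info(path):
    """Return the top-level PKG-INFO headers from an sdist tarball, or None."""
    with tarfile.open(path, "r:*") as tar:
        for member in tar.getmembers():
            # The sdist layout puts it at <name>-<version>/PKG-INFO.
            if member.name.count("/") == 1 and member.name.endswith("/PKG-INFO"):
                raw = tar.extractfile(member).read().decode("utf-8", "replace")
                return Parser().parsestr(raw)
    return None

info = pkg_info("web/packages/.../requests-2.11.1.tar.gz")   # hypothetical path under the mirror
if info is not None:
    print(info["Name"], info["Version"], info.get("License"))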

@chadwhitacre

D'oh! :-/

[screenshot: screen shot 2016-10-18 at 3 07 39 pm]

@chadwhitacre

If we can delete old tarballs without tripping up bandersnatch, then we should be able to run a bandersnatch process, and a second process to consume tarballs: ETL them and then throw them away. This second process can run cronishly, offset from bandersnatch, and simply walk the tree looking for tarballs.

@chadwhitacre

I've moved http://gdr.rocks/ over to NYC1 and am attaching a 500 GB volume.

@chadwhitacre

Derp. Volumes are only resizable up.

[screenshot: screen shot 2016-10-18 at 3 28 05 pm]

@chadwhitacre

cd /mnt/pypi/
virtualenv .
bin/pip install bandersnatch
bin/bandersnatch -c conf mirror          # initial run with a conf file in the mirror directory
vim conf                                 # settings changed in conf:
  directory = /mnt/pypi
  delete-packages = false
nohup bin/bandersnatch -c conf mirror &  # long-running sync, backgrounded

@chadwhitacre

chadwhitacre commented Oct 18, 2016

grep 'Storing index page' nohup.out indicates that it's processed about 1% of records so far.

@chadwhitacre

That puts us at about eight hours to finish.

@chadwhitacre

Okay! Let's do some local testing wrt snatching tarballs out from under bandersnatch. Also: ETL.

@chadwhitacre

From reading through mirror.py, it looks like we should be able to satisfy bandersnatch with a state file that records the serial number we consider ourselves synced through. What is a serial?

@chadwhitacre

Here's what it looks like when I echo 2229089 > status and rerun:

[gdr]$ bandersnatch -c conf mirror
2016-10-18 16:08:17,248 INFO: bandersnatch/1.11 (CPython 2.7.11-final0, Darwin 14.5.0 x86_64)
2016-10-18 16:08:17,248 INFO: Removing inconsistent todo list.
2016-10-18 16:08:17,249 INFO: Syncing with https://pypi.python.org.
2016-10-18 16:08:17,250 INFO: Current mirror serial: 2229089
2016-10-18 16:08:17,250 INFO: Syncing based on changelog.

@chadwhitacre

The weird thing is that on the first run through, it processes packages in alphabetical order by name, not in numeric order by serial. It only writes status after a successful sync. On subsequent runs, it uses the changelog RPC. But what is it doing with serial in that case?

@chadwhitacre

New status is 2410197.

@chadwhitacre

Re-ran, updated three packages, status is 2410203.

@chadwhitacre

Okay!

@chadwhitacre

So it looks like there will be order of magnitude 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo.

@chadwhitacre

chadwhitacre commented Oct 19, 2016

Looking at file types (h/t):

root@gdr:/mnt/pypi# find web/packages -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' > extensions &
root@gdr:/mnt/pypi# cat extensions | sort | uniq -c | sort -n
      5 deb
      6 dmg
     25 tgz
     60 rpm
     64 msi
    353 bz2
   1464 exe
   5239 egg
   6556 zip
   9905 whl
  50351 gz

@chadwhitacre

What if the only tarball we have for a release is an MSI?

@chadwhitacre

Or exe, more likely.

@chadwhitacre

I guess let's focus on gz, whl, zip, egg, bz2, and tgz.

@chadwhitacre

So it looks like there will be order of magnitude 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo.

I guess it's a time/space trade-off. Fresh mirrors are within an hour. If we update every 30 minutes then it should be 100s or maybe even 10s of packages, and we can probably manage with 5 GB or maybe even 1 GB.

@chadwhitacre

chadwhitacre commented Oct 19, 2016

If it's under 1 GB then we can keep it on the droplet and not use a separate volume, though if we're going to run a database at Digital Ocean to back gdr.rocks then we should maybe store it on a volume for good decoupling.

@chadwhitacre

Managed Postgres at DO starts at $19/mo.

@chadwhitacre

Alright, let's keep this lightweight. One $5/mo droplet, local Postgres. The only reason we are mirroring PyPI is to extract dependency info, which we can't get from metadata. We don't need to store all metadata, because PyPI itself gives us a JSON API, which we can even hit from the client side if we want to (I checked: we have Access-Control-Allow-Origin: *). That should be sufficient to populate the /on/pypi/* pages.

@chadwhitacre

Okay! I think I've figured out incremental updates. Bandersnatch needs a generation file or it'll start from scratch. That refers to the bandersnatch schema version, basically. 5 is the latest. And then it needs a status file with the serial number we want to start from (i.e., the last seen, ... hmm—on which side is it inclusive?). Then it needs a configuration file. That's all it needs to do an incremental update! We can rm -rf web (the directory it downloads into). We can throw away the todo file. As long as we have a conf, generation, and status, bandersnatch will happily sync based on the changelog.

Now, it will over-download, but if we process frequently enough, we should be okay. It looks like if we process every 30 minutes then we'll have well less than 100 packages to update. Packages generally have well less than 100 release files, though when Requests or Django pushes a new release we'll have a lot of old ones to download. I guess we want to tune the cron to run frequently enough to keep the modal batch size small, while still giving us enough time to complete processing for the occasional larger batch. Logging ftw.
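
A sketch of seeding just those three files by hand so the next run syncs from the changelog (the generation value 5 and the serial below come straight from the notes above; the conf file is assumed to exist alongside them):

import os

MIRROR = "/mnt/pypi"

def seed_state(serial):
    """Leave only conf, generation, and status; bandersnatch syncs incrementally from there."""
    with open(os.path.join(MIRROR, "generation"), "w") as f:
        f.write("5\n")              # bandersnatch schema version ("5 is the latest", per above)
    with open(os.path.join(MIRROR, "status"), "w") as f:
        f.write("%d\n" % serial)    # the serial we consider ourselves synced through

seed_state(2410203)                 # the last serial observed above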

@chadwhitacre

That should be sufficient to populate the /on/pypi/* pages.

On the other hand, the JSON is heavy (100s of kB), and the description field is a mess. We might want to do our own README analysis while we've got the tarballs cracked. Hmm ...

@chadwhitacre

chadwhitacre commented Oct 19, 2016

How about we grab READMEs while we're in there, as well as long_description from setup.py? That way we'll at least have them if we want to do something with them later. What if there are multiple README files? README.rst, README.md, ...

@chadwhitacre

Since we're going to be importing untrusted setup.py modules we probably still want the Docker sandbox.

@chadwhitacre

But we'd have it in the tarchomper process instead of in the web app.
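
Something like this, perhaps, from inside tarchomper: run egg_info in a throwaway, network-less container over the unpacked sdist and read the generated metadata back off the mount (the image name, mount point, and timeout are assumptions, not a worked-out sandbox):

import subprocess

def egg_info_in_sandbox(src_dir):
    """Run the untrusted setup.py in a disposable container; metadata lands back in src_dir."""
    subprocess.check_call([
        "docker", "run", "--rm",
        "--net", "none",             # no network for untrusted code
        "-v", "%s:/src" % src_dir,   # unpacked sdist, mounted read-write
        "-w", "/src",
        "python:2.7",                # assumed base image
        "python", "setup.py", "egg_info",
    ], timeout=120)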

@chadwhitacre

Extension finder died mid-write. :]

root@gdr:/mnt/pypi# cat extensions | sort | uniq -c | sort -n
      1 g
      1 ZIP
     24 deb
     39 dmg
    187 tgz
    417 rpm
    424 msi
   2717 bz2
  11684 exe
  40647 egg
  50619 zip
  77144 whl
 391041 gz

@chadwhitacre

Okay! So! tarchomper! 🎯

@chadwhitacre

  1. cp status status.bak to save our place in case the process crashes
  2. rm -rf web todo to start from a clean slate
  3. bandersnatch -c conf mirror to fetch all tarballs for packages where there have been changes
  4. walk the tree for the new tarballs
  5. for each tarball, open it up and extract the info we need
  6. spit out SQL—COPY?
  7. run the SQL
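
A rough wrapper for that loop (paths mirror the droplet layout above; extract() and load() are hypothetical placeholders for steps 5-7):

import glob
import os
import shutil
import subprocess

MIRROR = "/mnt/pypi"

def chomp():
    os.chdir(MIRROR)
    shutil.copy("status", "status.bak")           # 1. save our place in case we crash
    for stale in ("web", "todo"):                 # 2. clean slate
        if os.path.isdir(stale):
            shutil.rmtree(stale)
        elif os.path.exists(stale):
            os.remove(stale)
    subprocess.check_call(["bin/bandersnatch", "-c", "conf", "mirror"])   # 3. fetch changed packages
    rows = []
    for pattern in ("*.tar.gz", "*.tgz", "*.tar.bz2", "*.zip", "*.whl", "*.egg"):
        for artifact in glob.glob("web/packages/**/" + pattern, recursive=True):   # 4. walk the tree
            rows.append(extract(artifact))        # 5. crack it open, pull out what we need
    load(rows)                                    # 6-7. spit out the SQL (COPY?) and run it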

@chadwhitacre

chadwhitacre commented Oct 19, 2016

Nomenclature update:

  • application—the top-level thing for which we are resolving a comprehensive list of dependencies
  • package—the thing identified by name, e.g., requests
  • release—the thing identified by (name, version), e.g., (requests, 2.11.1)
  • artifact—the actual file (we've been calling this "tarball" above), e.g., requests-2.11.1.tar.gz

@chadwhitacre

(Note: projects listed in setup_requires will NOT be automatically installed on the system where the setup script is being run. They are simply downloaded to the ./.eggs directory if they’re not locally available already. If you want them to be installed, as well as being available when the setup script is run, you should add them to install_requires and setup_requires.)

http://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords

Okay, let's not worry about setup_requires. We don't need tests_require either, since that's a dependency of the project itself, not of the project's users.

On the other hand we should include extras_require only if the extras are in use by the downstream package/application. Hmm ... optionality.
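
For what it's worth, once we're reading metadata out of wheels, the Requires-Dist lines already mark extras with an environment marker, and the packaging library parses them (a sketch; the sample lines are made up in the style of requests'):

from packaging.requirements import Requirement

requires_dist = [                       # illustrative Requires-Dist values from a wheel's METADATA
    'urllib3<1.17,>=1.10',
    'pyOpenSSL>=0.13; extra == "security"',
]

base, extras = [], []
for line in requires_dist:
    req = Requirement(line)
    # Crude check: a marker mentioning "extra" only applies when that extra is requested downstream.
    if req.marker is not None and "extra" in str(req.marker):
        extras.append((req.name, str(req.specifier), str(req.marker)))
    else:
        base.append((req.name, str(req.specifier)))

print(base)     # always part of the dependency chain
print(extras)   # only if the downstream package/application asks for the extra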

@chadwhitacre

Blorg. Tests are failing after installing bandersnatch, because it install_requires some pytest plugins. I guess the workaround is to manually uninstall these. We'll have to teach Travis to do the same.

@chadwhitacre

PR in #5.

@chadwhitacre

In light of the shift of focus at gratipay/gratipay.com#4135 (comment), I've removed the droplet, volume, and floating IP from Digital Ocean to avoid incurring additional cost.
