This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Categorizing by implementation #14

Closed
andrew opened this issue Mar 4, 2019 · 12 comments

Comments

@andrew
Collaborator

andrew commented Mar 4, 2019

A slightly different approach to categorizing package managers than the one outlined in the Glossary, which I've been thinking about with regard to IPFS implementations specifically.

File-system based

This maps closely to the Multi-Registry category.

Many system package managers (APT, apk, RPM, pacman, Portage), plus some of the older language package managers (Maven, CPAN, CRAN), are literally a network-attached folder full of files and other folders, often exposed over HTTP, FTP, etc.

Metadata is also stored as files so everything is quite self-contained and easily mirrored using rsync.

This style of registry maps nicely onto IPFS MFS and unixfs. It also seems like most source code and binaries are stored within tar/zip/ar files to preserve file permissions when downloaded over HTTP, which conveniently means we don't need to wait for unixfs-v2 before implementing things.

Essentially, from IPFS's point of view, all of these package managers end up needing a very similar API (unixfs) to be implemented; the clients then decide how they want to organise the files and metadata within that top-level folder. A number of existing attempts take this approach (arch-mirror, apt-transport-ipfs, Gentoo-distfiles-IPFS).
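As a rough illustration of why this category is the easy one: once the whole mirror folder sits under a single root CID, every file a client asks for is just a path below that root. The sketch below assumes a hypothetical helper name (`ipfs_path`), a placeholder root CID and an illustrative APT-style path; none of these come from the projects listed above.

```python
from posixpath import normpath


def ipfs_path(root_cid: str, mirror_path: str) -> str:
    """Map a path inside a mirrored registry folder to an /ipfs/ path."""
    # Anchor at "/" before normalising so ".." segments cannot climb
    # above the mirror root, then drop the leading slashes again.
    cleaned = normpath("/" + mirror_path).lstrip("/")
    return f"/ipfs/{root_cid}/{cleaned}"


# "<root-cid>" is a placeholder, not a real CID.
print(ipfs_path("<root-cid>", "dists/stable/Release"))
```

The client-specific part (which paths exist and what they mean) stays entirely with the package manager; only the root CID changes when the mirror is updated.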

Database based

This maps closely to the Centralized Registry category.

Many newer language package managers (rubygems, npm, Packagist, PyPI, NuGet, Cargo, bower) run a database-backed web application that handles authentication and uploading, and provides APIs for the clients to list packages, versions and other metadata on demand.

Actual package contents are often hosted on S3, and requests to download packages may be proxied through the application to track download statistics or redirected to a CDN link.

Mirroring these package managers usually requires either trawling package list APIs and recursively downloading packages and their metadata, or, if the registry provides it, downloading a dump of the registry's database and running a copy of the web application or a slimmed-down alternative that provides a similar API.

Unlike the file-system based package managers, almost every database-based registry implements its own unique API for querying metadata and publishing packages.

This may explain why there are generally fewer public mirrors available for Centralized/database-backed package managers: there's a lot more overhead and complexity in keeping a mirror up to date than with the file-system based registries.

The one attribute they all share is that clients communicate with registries remotely via HTTPS URLs, which can be proxied locally and backed by IPFS. A few existing implementations take this approach (npm-on-ipfs, dpip), as do general-purpose artifact stores like Sonatype Nexus and Artifactory.
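A minimal sketch of the "trawl the APIs" option, to show where the per-registry work ends up. The three callables are hypothetical stand-ins for whatever list/versions/download endpoints a given registry exposes; here they are backed by an in-memory dict rather than any real registry API.

```python
from typing import Callable, Dict, List


def mirror_registry(
    list_packages: Callable[[], List[str]],
    list_versions: Callable[[str], List[str]],
    download: Callable[[str, str], bytes],
) -> Dict[str, Dict[str, bytes]]:
    """Walk every package and version, downloading each artifact once."""
    mirror: Dict[str, Dict[str, bytes]] = {}
    for name in list_packages():
        mirror[name] = {v: download(name, v) for v in list_versions(name)}
    return mirror


# Fake registry used only for illustration.
FAKE = {"pkg-a": {"1.0.0": b"a1", "1.1.0": b"a2"}, "pkg-b": {"0.1.0": b"b1"}}

copy = mirror_registry(
    list_packages=lambda: sorted(FAKE),
    list_versions=lambda name: sorted(FAKE[name]),
    download=lambda name, version: FAKE[name][version],
)
print(sorted(copy))
```

The loop itself is generic; the cost is that each registry needs its own implementation of the three callables, which is the overhead described above.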

Git based

This maps to the Portable Registry and the Registry-less categories.

Portable Registries

A number of Portable registries (Homebrew, CocoaPods) use Git (usually GitHub) as a database rather than a traditional centralized database, often as a way to avoid becoming a full time, on-call DBA for their community.

One notable exception is Cargo's main registry, https://crates.io, which combines a portable registry (https://github.com/rust-lang/crates.io-index) with a central database (https://github.com/rust-lang/crates.io).

With Homebrew, PRs can be opened directly on their GitHub repository database to add and update Formula files, which contain the metadata for a package, including links to the source and compiled binaries with integrity hashes.

With CocoaPods, you used to be able to send PRs to their GitHub repository database, but after they merged a PR publishing a new version by someone who shouldn't have been able to, they implemented a separate web service called Trunk to handle adding new versions to the git database.

Similarly with Crates.io, their GitHub repository database is updated automatically whenever someone publishes a new version via the crates.io website.

In all three of these cases, the end user only ever uses data from the latest commit that they have locally; tags and branches are not utilized. In the case of CocoaPods and Cargo, the history of the repository is not used for previous versions. In fact, end users often do a shallow clone (git clone --depth 1), so they don't even have history data to go back to.

Homebrew only keeps the latest version of a Formula in its database, but does clone the full repository (243.22 MiB as of February 2019), so users can check out previous revisions of the repository to install old versions of a Formula. This is not encouraged, though (a Formula cannot declare a particular version of a dependency), because operations on the large repo are slow.

When it comes to mapping these registries onto IPFS, git is used as both a transport protocol and a storage mechanism, so git-remote-ipld should be able to help with integration.

There is a restriction that files stored in those repositories can't be larger than 2MiB, but it doesn't look like any of those three databases store metadata files larger than a few KB.

But none of the git repository databases actually contain the code of the releases. Usually they hold URIs (Homebrew can reference SVN, CVS, Git and other protocols, not just HTTP) pointing to content hosted elsewhere, often on GitHub, but possibly anywhere.

For IPFS to host both the registry and source code, each one of these remote URIs will need to be referenced either alongside an IPFS CID (see #12) or loaded via some kind of HTTP proxy extension added to the client.
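One way to picture the "URI alongside a CID" option is a metadata entry that carries both and a client that prefers the CID when it is present. This is only a sketch: the `SourceRef` type, its field names and the example values are assumptions, not the format of any of the registries above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRef:
    """Hypothetical metadata entry for one release artifact."""
    url: str                    # where the artifact is hosted today
    sha256: str                 # integrity hash already kept by the registry
    cid: Optional[str] = None   # optional IPFS CID added next to the URL


def fetch_location(ref: SourceRef) -> str:
    """Prefer the content-addressed location, fall back to the URL."""
    if ref.cid:
        return f"/ipfs/{ref.cid}"
    return ref.url


# Placeholder values, not real hashes or CIDs.
legacy = SourceRef(url="https://example.com/tool-1.0.tar.gz", sha256="<sha256>")
pinned = SourceRef(url=legacy.url, sha256=legacy.sha256, cid="<cid>")
print(fetch_location(legacy))
print(fetch_location(pinned))
```

Keeping the URL means existing clients are unaffected, while IPFS-aware clients can skip the original host entirely.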

Registry-less

Some package managers (Go, Carthage, SwiftPM) do away with hosted registries altogether, instead preferring to declare dependencies as fully qualified HTTP URLs or shortcuts for GitHub URLs (owner/repo-name for example).

These package managers often have support for checking to see if the url is a git repository and then querying for git tags as the list of published, named versions, and git commits as a full list of all possible versions (named by git commit sha and/or branch).
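To make the "git tags as the version list" idea concrete, here is a small sketch that turns the text output of `git ls-remote --tags <url>` into a mapping of tag name to commit. The function name is made up, and the sample output uses fabricated placeholder hashes; only the line format (hash, tab, ref name, with `^{}` marking a peeled annotated tag) is taken from git itself.

```python
from typing import Dict


def parse_tags(ls_remote_output: str) -> Dict[str, str]:
    """Map tag names to commit hashes from `git ls-remote --tags` output."""
    peeled: Dict[str, str] = {}
    direct: Dict[str, str] = {}
    for line in ls_remote_output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split("\t", 1)
        if not ref.startswith("refs/tags/"):
            continue
        name = ref[len("refs/tags/"):]
        if name.endswith("^{}"):
            # Peeled entry: the commit an annotated tag points at.
            peeled[name[:-3]] = sha
        else:
            direct[name] = sha
    # Prefer the peeled commit over the tag object's own hash.
    return {**direct, **peeled}


# Fabricated hashes, for illustration only.
SAMPLE = "\n".join([
    "a" * 40 + "\trefs/tags/v1.0.0",
    "b" * 40 + "\trefs/tags/v1.1.0",
    "c" * 40 + "\trefs/tags/v1.1.0^{}",
])
print(sorted(parse_tags(SAMPLE)))
```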

The end result is instead of a single registry, these package managers have thousands of single project registries, and rely on APIs in Git and/or GitHub for metadata on releases.

When it comes to mapping these package managers onto IPFS, putting any individual repository on IPFS is relatively easy with git-remote-ipld, again with the restriction of 2MiB files. Luckily, none of the registry-less package managers support sharing binaries in the same way.

Rather the problem comes when declaring dependencies, which are often simple URIs within source code:

import("github.com/golang/dep/internal/fs")

This can be swapped out for an IPFS CID, as seen with gx:

import(pstore "gx/ipfs/QmaCTz9RkrU13bm9kMB54f7atgqM4qkjDZpRwRoJiWXEqs/go-libp2p-peerstore")

But the original metadata of the package (such as upstream source and version number) is lost.
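The loss is easy to see by taking a gx-style import path apart: the only pieces available are the hash and the package directory name. The parser below is a sketch written for this illustration (the `GxImport` type and `parse_gx_import` name are not part of gx), and it assumes the `gx/ipfs/<hash>/<name>` layout shown in the example above.

```python
from typing import NamedTuple


class GxImport(NamedTuple):
    cid: str       # content hash embedded in the import path
    package: str   # directory name of the package
    subpath: str   # anything after the package name, may be empty


def parse_gx_import(path: str) -> GxImport:
    """Split a gx-style import path into its visible components."""
    parts = path.split("/")
    if len(parts) < 4 or parts[0] != "gx" or parts[1] != "ipfs":
        raise ValueError(f"not a gx import path: {path!r}")
    return GxImport(cid=parts[2], package=parts[3], subpath="/".join(parts[4:]))


imp = parse_gx_import(
    "gx/ipfs/QmaCTz9RkrU13bm9kMB54f7atgqM4qkjDZpRwRoJiWXEqs/go-libp2p-peerstore"
)
print(imp.package)
```

Nothing in the parsed result says which upstream repository or version the hash came from, so that mapping would have to be kept somewhere outside the source code.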

In addition, this import functionality is baked into the language implementation itself, making changes much more difficult than in external tooling.

@mikeal

mikeal commented Mar 4, 2019

First of all, your work on understanding and categorizing package managers has been AMAZING! I had felt like we lacked a framework for tackling such a large topic but you’ve been able to break down all the elements of package management into something much more understandable and comparable across systems. Great work!

When it comes to mapping these package managers onto IPFS, putting any individual repository on IPFS is relatively easy with git-remote-ipld, again with the restriction of 2MiB files, luckily none of the registry-less package managers support sharing binaries in the same way.

My concern with this method is that we would be translating a reference to a mutable resource (the current remote master) to an immutable one (the state of master at whatever time we run git-remote-ipld).

This is actually a general problem we have with any package manager that doesn’t keep an immutable reference. Most often, the resource is a URL to a tarball. In almost all cases there is a cache reference (git hash, etag, etc) but we’ll want to build a generic method of:

  1. Looking up a mutable resource.
  2. Translating it to IPFS/IPLD.
  3. Storing the IPFS/IPLD reference keyed to the (mutable resource + cache key).
  4. Returning the immutable reference to a given client.
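The four steps above can be sketched as a small cache that always asks upstream for the current cache key but only converts content it has not seen before. Everything here is an assumption made for illustration: the class name, the shape of the `lookup` callable (returning a cache key plus a lazy fetch function) and `fake_to_ipfs`, which merely hashes the bytes and does not produce a real CID.

```python
import hashlib
from typing import Callable, Dict, Tuple

Fetch = Callable[[], bytes]


class ImmutableRefCache:
    def __init__(
        self,
        lookup: Callable[[str], Tuple[str, Fetch]],
        to_ipfs: Callable[[bytes], str],
    ) -> None:
        self._lookup = lookup
        self._to_ipfs = to_ipfs
        self._refs: Dict[Tuple[str, str], str] = {}
        self.conversions = 0

    def resolve(self, url: str) -> str:
        # Step 1: always look up the mutable resource so mutations are caught.
        cache_key, fetch = self._lookup(url)
        key = (url, cache_key)
        if key not in self._refs:
            # Steps 2 and 3: convert once, store keyed by URL + cache key.
            self._refs[key] = self._to_ipfs(fetch())
            self.conversions += 1
        # Step 4: hand back the immutable reference.
        return self._refs[key]


def fake_to_ipfs(content: bytes) -> str:
    # Stand-in for adding content to IPFS; NOT a real CID.
    return "fake-cid-" + hashlib.sha256(content).hexdigest()[:16]


URL = "https://example.com/pkg.tar.gz"
upstream = {URL: ("etag-1", b"v1")}


def lookup(url: str) -> Tuple[str, Fetch]:
    cache_key, content = upstream[url]
    return cache_key, lambda: content


cache = ImmutableRefCache(lookup, fake_to_ipfs)
before = cache.resolve(URL)
upstream[URL] = ("etag-2", b"v2")   # upstream mutates
after = cache.resolve(URL)
print(before != after, cache.conversions)
```

Returning a lazy fetch function lets the cheap check (git hash, etag) happen on every request while the expensive download and conversion only happen when the key is new.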

IMO, we should always hit this mechanism so that we catch mutations. We don’t want to do the work of converting these resources over and over again to IPFS/IPLD when not necessary but we also can’t cache something for any duration of time if the package manager assumes it’s talking about a live mutable reference.

@lanzafame

Re: Golang in the registry-less category, there is a very recent proposal to add a notary for verifying module integrity: https://go.googlesource.com/proposal/+/master/design/25530-notary.md

@andrew
Collaborator Author

andrew commented Mar 5, 2019

@lanzafame very interesting, that notary could effectively become an index, shifting Go into the Portable Registry category. Will be interesting to see if they stick with only allowing a single notary:

The Go team at Google will run the Go notary as a service to the Go ecosystem, similar to running godoc.org and golang.org. There is no plan to allow use of alternate notaries, which would add complexity and potentially reduce the overall security of the system, allowing different users to be attacked by compromising different notaries.

@andrew
Collaborator Author

andrew commented Mar 5, 2019

@mikeal agreed, there's going to be an ongoing challenge between supporting how communities currently manage their dependencies and encouraging them to change their tooling to be more predictable.

When git tags are referenced it's probably OK to treat them as immutable, even if they can always be force-pushed, and I believe specific git commit hashes are immutable (although they can always be deleted entirely). It's really only git branches that are the problem.

Golang has been the odd one out here, as it's only just adding built-in support for declaring versions. Both SwiftPM and Carthage have built-in support for declaring versions, integrity checks and lockfiles, whilst many different external tools have been built for Go to add that kind of support, but none have gained enough popularity to change the behavior of the community as a whole.

At this point I'd still recommend reducing the priority of attempting to completely put Go package management on IPFS, waiting until everything has settled down and the community has reached a consensus. Hopefully that'll be towards the end of 2019, as Google starts to roll out more tooling like the Module index.

In the short term we can still support Go package consumers by improving end-user tooling like gx; I'm putting together a proposal for that at the moment.
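That rule of thumb could be written down as a tiny classifier. The labels and the function name below are invented for this sketch, and it assumes fully qualified ref names (`refs/tags/...`, `refs/heads/...`) or full 40-character SHA-1 commit hashes as input.

```python
import re

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def ref_stability(ref: str) -> str:
    """Classify a git reference by how safely it can be cached."""
    if _FULL_SHA.fullmatch(ref):
        return "immutable"            # a commit hash names fixed content
    if ref.startswith("refs/tags/"):
        return "treat-as-immutable"   # can be force-pushed, but rarely is
    return "mutable"                  # branches and anything else


for ref in ("0" * 40, "refs/tags/v1.2.0", "refs/heads/master"):
    print(ref[:20], ref_stability(ref))
```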

@andrew
Collaborator Author

andrew commented Mar 11, 2019

Some possible approaches for implementing IPFS support based on implementation category:

File-system based

Approach:

Mirror these registries into MFS and add the root CID to dnslink/IPNS, then rsync updates on a regular basis, alongside transport plugins like https://github.com/JaquerEspeis/apt-transport-ipfs

Problems

  • Adding or updating large registries in MFS takes many hours, causing mirrors to lag behind the source
  • Updating index files like Packages.gz in MFS isn't supported with the filestore
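For the dnslink part of this approach, publishing a new mirror state amounts to updating one DNS TXT record so that it points at the latest root CID. A minimal sketch of what that record looks like, using a hypothetical domain and a placeholder CID (the helper name is made up; the `_dnslink.` prefix and `dnslink=` value format follow the DNSLink convention):

```python
from typing import Tuple


def dnslink_txt_record(domain: str, root_cid: str) -> Tuple[str, str]:
    """Return the (record name, TXT value) pair that points a domain at a CID."""
    return (f"_dnslink.{domain}", f"dnslink=/ipfs/{root_cid}")


# Hypothetical domain and placeholder CID.
name, value = dnslink_txt_record("mirror.example.org", "<root-cid>")
print(name, value)
```

Each rsync-and-add cycle would end by replacing the value with the new root CID, which is why slow MFS updates translate directly into mirror lag.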

Database based

Approach

IPFS support directly in mainline registries:

  • On package upload, resulting artifacts are added to IPFS and CID is stored and served alongside other metadata
  • Clients receive CIDs along with other metadata from registry APIs and use IPFS to download artifacts
  • Clients store resolved CIDs in lockfiles to allow for installation
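A sketch of the client side of this approach: the registry's metadata response includes a CID, the client records it in a lockfile, and a later install fetches purely by CID. The lockfile layout, function names and the in-memory `store` standing in for IPFS are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def write_lock(resolved: List[dict]) -> Dict[str, dict]:
    """Record the CID the registry returned for each resolved package."""
    return {f"{p['name']}@{p['version']}": {"cid": p["cid"]} for p in resolved}


def install_from_lock(
    lock: Dict[str, dict], fetch_by_cid: Callable[[str], bytes]
) -> Dict[str, bytes]:
    """Fetch every locked artifact by CID alone, without asking the registry."""
    return {spec: fetch_by_cid(entry["cid"]) for spec, entry in lock.items()}


# Placeholder CIDs and an in-memory dict standing in for IPFS.
store = {"cid-a": b"artifact a", "cid-b": b"artifact b"}
metadata = [
    {"name": "pkg-a", "version": "1.0.0", "cid": "cid-a"},
    {"name": "pkg-b", "version": "2.3.1", "cid": "cid-b"},
]

lock = write_lock(metadata)
installed = install_from_lock(lock, store.__getitem__)
print(sorted(installed))
```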

Problems

Requires direct buy-in from package manager maintainers

Approach

IPFS Wrappers for package manager clients:

  • Separate client command line tool that calls out to underlying package manager but downloads/adds artifacts to IPFS and places them where original client would expect
  • some package manager clients have plugin architectures that allow for more integration with client without requiring upstream maintainers to take on the responsibility

Problems

  • Much smaller uptake in users than direct integration
  • wrappers tend to run the ipfs daemon only for short periods, which doesn't help increase the number of artifact seeders

Approach

HTTP proxy for package manager clients:

  • proxy http requests from clients to registries and cache to IPFS
  • cache is lazy, only storing what's requested rather than whole mirrors, requiring fewer resources
  • proxies can be run on individuals' computers or hosted within datacenters or offices for groups of developers in an organisation

Problems

  • metadata requests can't be aggressively cached due to frequent upstream changes
  • proxy addresses often stored in lockfiles, making personal proxies unattractive to developers working in groups or open source
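The lazy-cache behaviour and the metadata caveat can be sketched together: artifacts are stored the first time they are requested, while metadata requests always go upstream. The class, the `is_metadata` predicate and the `.tgz` convention used to tell artifacts apart are hypothetical, and a plain dict stands in for IPFS.

```python
from typing import Callable, Dict


class LazyProxy:
    def __init__(
        self,
        fetch_upstream: Callable[[str], bytes],
        is_metadata: Callable[[str], bool],
    ) -> None:
        self._fetch = fetch_upstream
        self._is_metadata = is_metadata
        self._store: Dict[str, bytes] = {}   # stands in for IPFS
        self.upstream_hits = 0

    def get(self, path: str) -> bytes:
        cacheable = not self._is_metadata(path)
        if cacheable and path in self._store:
            return self._store[path]
        body = self._fetch(path)
        self.upstream_hits += 1
        if cacheable:
            # Only what was actually requested gets stored.
            self._store[path] = body
        return body


proxy = LazyProxy(
    fetch_upstream=lambda path: f"body of {path}".encode(),
    is_metadata=lambda path: not path.endswith(".tgz"),
)
proxy.get("/pkg-a/-/pkg-a-1.0.0.tgz")
proxy.get("/pkg-a/-/pkg-a-1.0.0.tgz")   # served from the cache
proxy.get("/pkg-a")                      # metadata, always refetched
proxy.get("/pkg-a")
print(proxy.upstream_hits)
```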

Git based

TODO

@jessicaschilling
Contributor

@andrew What would you think about moving this, as an addendum to https://github.com/ipfs/package-managers/blob/master/glossary.md, into the docs directory?

@andrew
Collaborator Author

andrew commented May 9, 2019

👍

@andrew
Collaborator Author

andrew commented May 31, 2019

This now lives in the docs folder of the repo: https://github.com/ipfs/package-managers/blob/master/docs/categories.md

@andrew andrew closed this as completed May 31, 2019
@momack2
Contributor

momack2 commented Jun 9, 2019

This comment (#14 (comment)) is really valuable and didn't make it into the docs. Think we can find a way for these useful thoughts to live on?

@jessicaschilling
Contributor

@momack2 Do you mean what’s in the docs directory at https://github.com/ipfs/package-managers/blob/master/docs/categories.md or are you referring to a different piece of content?

@momack2
Contributor

momack2 commented Jun 11, 2019

I'm specifically referencing the #14 (comment) with the approaches and problems for each area. AFAIK that isn't covered in the docs you linked.

Screen Shot 2019-06-10 at 7 43 54 PM

jessicaschilling added a commit that referenced this issue Jun 11, 2019
Separating comment #14 (comment) into its own document per @momack2 note in #14 (comment)
@jessicaschilling
Contributor

Separated this comment into a new document and referenced it in the docs index. @andrew -- feel free to append/amend as you see fit.
