ipfs and pacman #84
Relevant thread on Arch Linux forums: https://bbs.archlinux.org/viewtopic.php?id=203853
@robcat thanks! let us know how we can help. we'd love to contribute to making this easy + nice. cc @whyrusleeping
(adding @anatol, Arch Linux developer that has shown interest on the forums)
So, in general, ipfs can help package distribution in many ways (distributed storage, versioning, package signing).
The low-hanging fruit is the distributed storage: given a "package entry" in the pacman database, fetch the package directly from ipfs by hash.
A custom downloader can already be plugged into pacman using the XferCommand configuration variable (some examples)
The big problem is the following: the pacman database includes only SHA256 and MD5 hashes, and no ipfs-style multihashes (which unfortunately cannot be constructed from the hashes already included).
My plan is to build a custom XferCommand script that queries some kind of service to translate the SHA256 hash into a "standard" ipfs hash, and then runs `ipfs get -o <cache-path> <ipfs-hash>`
This hash translation service can be centralized in this initial stage, but it adds a constant lag to each package download (the next step would be to build and distribute a package database with ipfs multihashes).
I'm still in search of a more elegant solution that doesn't require a central translation service; ideas are welcome.
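A rough sketch of what such a wrapper could look like. The lookup URL and the script name are made up for illustration; `ipfs get -o` and pacman's `%o`/`%u` placeholders are real:

```shell
#!/bin/sh
# Hypothetical XferCommand wrapper for pacman.
# Looks up the package's IPFS multihash via a (made-up) translation
# service, fetches it with `ipfs get`, and falls back to the plain
# mirror URL if anything goes wrong.
ipfs_fetch() {
    out="$1"    # %o: destination file pacman expects
    url="$2"    # %u: the original mirror URL
    pkg="$(basename "$url")"
    # hypothetical SHA256 -> multihash translation service
    hash="$(curl -fs "https://example.org/ipfs-lookup?pkg=$pkg" 2>/dev/null)" || hash=""
    if [ -n "$hash" ]; then
        ipfs get -o "$out" "/ipfs/$hash" && return 0
    fi
    wget -q -O "$out" "$url"    # plain fallback download
}
```

Wired into pacman.conf with something like `XferCommand = /usr/local/bin/ipfs-xfer %o %u` (the script path is hypothetical).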
A possible solution could be creating a dag with all packages: for example, having the repo mirror servers add everything to ipfs and provide the hash of the folder. Then deduplication takes care of efficiency, and the pacman client only needs that one hash of the latest version of all the packages and the .db repository files.
For this to work smoothly we would probably need to patch pacman. Or maybe a tool that wraps pacman but uses IPFS to get the data could be built.
@fazo96 About deduplication: I downloaded the pool of archlinux packages and tried both chunking algorithms provided by ipfs (fixed blocks and rabin). Apparently there is no detectable deduplication in either case. Can you suggest an effective chunking strategy for xz-compressed packages?
About the dag of packages: cool idea, but it requires managing a fat central mirror that regularly rsyncs and builds a new dag at every update. Unfortunately I don't have such a server available (but maybe an existing Arch mirror could be interested in running the ipfs daemon?).
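For anyone wanting to reproduce the experiment, a sketch of how the comparison could be run. The pool path is a placeholder, and the `--chunker` values (`size-<n>` and `rabin`) and `ipfs repo stat` are from go-ipfs, though exact names may differ across versions:

```shell
# Add the same package pool with each chunker and compare how much
# the local repo grows; similar growth for both means little dedup.
compare_chunkers() {
    pool="$1"
    for chunker in size-262144 rabin; do
        ipfs add -r -q --chunker="$chunker" "$pool" > /dev/null
        echo "chunker=$chunker"
        ipfs repo stat    # compare RepoSize against the raw pool size
    done
}
```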
@robcat Having an existing mirror run an ipfs daemon would be ideal: by mounting its IPNS publication using FUSE, it could store the data directly in IPFS while still being able to read it from the file system, and when it updates the files, what's on IPFS would automagically update too.
About deduplication: there's probably no trivial way to apply it inside packages, but at least two copies of the same package will produce the same hash and thus become the same file in IPFS, so the pacman client can get a package from any IPFS node that has a copy, without any configuration listing the nodes' IP addresses.
This way, choosing the best mirror, ranking them, etc. will not be needed: ipfs will be in charge of downloading the package from the best location. It also means that updating without an Internet connection could be implemented, as long as there's a reachable computer serving the packages and .db files.
> @fazo96 About deduplication: I downloaded the pool of archlinux packages and tried both chunking algorithms provided by ipfs (fixed blocks and rabin). Apparently there is no detectable deduplication in either case. Can you suggest an effective chunking strategy for xz-compressed packages?
Are they tarballs? Did you try ipfs tar? It imports tars smartly.
@whyrusleeping what is the tar branch that detects tarballs on add?
> Did you try ipfs tar? It imports tars smartly.
Cool feature! I didn't know about that. (At the moment it doesn't seem to work for some tars; I'm opening a separate issue.)
In the Arch Linux case it's unfortunately not very useful, since the packages are signed only after compression. Distributing the packages in uncompressed form would mean:
- having to recompress the package at installation time to verify the signature (at a great CPU cost)
- guessing which compression settings the developers used on their machines, in order to reproduce the exact same tar.xz file
Well, packages aren't that heavy. I mean, the biggest package I ever downloaded is probably netbeans which is less than 300 MB, and it's wayyyyyy bigger than the average package which is like a few megabytes (quantity pulled out of thin air).
I don't think there are many packages that share common data except multiple versions of the same package. It would be very nice to figure out how to take advantage of deduplication with the current way pacman packages stuff, but even without that, an IPFS transport for pacman would be a huge step forward in terms of efficiency and repository management.
Just having the official repository publish the hash of the latest version of the packages folder and all the mirrors just automatically pinning that when it updates would propagate changes pretty fast and no one would have to worry about choosing the fastest mirror except bitswap developers
To sum it up my humble suggestion is not to try too hard to fix this problem (applying deduplication to arch packages) now. IPFS is still very useful in this use case, even without that, and in the future pacman could consider packaging stuff differently if IPFS becomes the standard for distribution.
> Just having the official repository publish the hash of the latest version of the packages folder and all the mirrors just automatically pinning that when it updates would propagate changes pretty fast and no one would have to worry about choosing the fastest mirror except bitswap developers
@fazo96
I hear you, this would be a pretty good solution.
But the problem is, following your plan would mean convincing the central official mirror to:
- hash the whole repository using ipfs
- manage and run an ipfs node (at least to initially seed the new packages)
- use twice the storage space (to store the ipfs blocks of the whole repository)
- sign and publish the ipns periodically (right now there's no central authority, only individual packages are signed)
There is a problem of incentives here: why should the central authority do all this, if nobody is using ipfs yet?
> convince the central official mirror
@robcat shouldn't any mirror work, as long as IPFS users trust it? Anyone could set up a mirror to serve as the bridge to IPFS. Then once people start using it, down the line the official repo could take over that function.
Setting up a new mirror to copy packages into IPFS is closely related to the archival efforts discussed at ipfs/archives#5
> manage and run an ipfs node (at least to initially seed the new packages)
It's not hard at all to do, there are also docker images.
> use twice the storage space (to store the ipfs blocks of the whole repository)
You can mount IPNS with FUSE so that you don't need twice the space
> sign and publish the ipns periodically (right now there's no central authority, only individual packages are signed)
Well, if the packages are all in one folder, and that folder is inside the IPNS mountpoint (using FUSE), then when a file changes, everything gets republished. It's not the best solution but it's worth a shot (not by the official mirror maintainers, of course, but by some interested third party).
> @robcat shouldn't any mirror work, as long as IPFS users trust it? Anyone could set up a mirror to serve as the bridge to IPFS. Then once people start using it, down the line the official repo could take over that function.
Yes, that's the point!
I don't see why a random guy can't set up an IPFS mirror (except for the actual hassle of downloading every package). Also, there's no need to trust IPFS or the mirror, since the packages are signed, if my understanding is correct.
If somebody has the resources needed to set up an IPFS mirror of the arch repositories and some people manage to start using it, it will be a lot easier to get more adoption. I could do it, but I only have 1 Mbit upload bandwidth, for four people and at least 6 devices...
> I could do it, but I only have 1 Mbit upload bandwidth, for four people and at least 6 devices...
Ok, so what about this three-pronged approach:
- @fazo96 publishes the ipns entry (minimal bandwidth required)
- I keep a non-authoritative mirror with a lot of bandwidth that will sync the most requested packages
- users will put `https://ipfs.io/ipns/<fazo96-hash>/` at the top of their mirrorlist
My node is already up at 178.62.202.191 and it independently adds all the x86_64 packages from core, extra, multilib, testing and multilib-testing (it syncs to an official arch mirror and adds the packages to ipfs).
@fazo96 you can go ahead and publish the ipns entry, if IPFS works the bulk of the requests will not even hit your node :)
@robcat I'm very interested in trying this out! :)
However:
- I need the hash to be able to publish something
- You can publish it yourself if you want, or you can mount ipns with FUSE and drop the root of the mirror in `<mountpoint>/local` so that the files are automatically published on your node's IPNS (this is the suggested approach) while also being accessible on the filesystem, thus not taking up twice the space! Try publishing the hash of the repo before doing this so that the files will already be there
- If you want to quickly publish to ipns, cd to the root of the repo then run `ipfs name publish $(ipfs add -r -q -w . | tail -n1)`. It will publish the folder you are in and tell you the IPNS name and the hash. You can then tell us the IPNS name or the IPFS hash so we can try
- users shouldn't use ipfs.io as the gateway, or all the traffic will have to come from ipfs.io to the users, which doesn't make much sense: they should use their local gateway (localhost) instead
- I have quite some credit on digital ocean thanks to github's student developer pack, and I'm willing to set up a mirror for pacman using IPFS
- I don't know if there's a way to disable the pacman cache, but I hope there is, because with this approach it would be a waste of disk space except for downgrading.
Also, another use case for this is local package sharing: I have two computers in LAN and a shitty bandwidth to the internet, and all available solutions to share packages in LAN are pretty ugly compared to this.
If you want we can meet on freenode (I'm Fazo) later today or some other time to try this out
If we succeed (we probably will) we can set up a page on the arch wiki or a github repo that explains how to use IPFS with pacman this way (without any external tools except ipfs)
> Also, another use case for this is local package sharing: I have two computers in LAN and a shitty bandwidth to the internet
@fazo96 A bit off-topic. I was looking for a simple Arch package sharing tool for LAN and did not find anything matching my expectations, so I wrote my own tool - https://github.com/anatol/pacoloco - an Arch package proxy repo. It is not published anywhere, just a bunch of code without much documentation, but it has worked well for me for several months. Advantages of this tool: pure C, no external dependencies except glibc, runs perfectly on an OpenWRT MIPS router, single-threaded, event-loop based architecture, HTTP pipelining support to reduce request latency.
Ok, time for some results!
@robcat and I did it. He set up an Arch Linux mirror on a VPS, then we got it to store files inside the FUSE mountpoint (/ipns/local) so that they are always available via IPNS.
### Using IPFS to download Arch packages
The node that is serving them is QmPY3tsmwoCjePd9SEudAGX2bZU65SwiL8KBKaSRKyUdzt. If you want to try it, you can put `Server = http://localhost:8080/ipns/QmPY3tsmwoCjePd9SEudAGX2bZU65SwiL8KBKaSRKyUdzt/$repo/os/$arch` on top of your pacman mirrorlist. Make sure your daemon is running, then it will download packages from IPFS when possible! It's that simple.
It has a few hiccups. For example, the timeout is too low, so if it's your first time downloading a package, it's going to fall back to boring centralized mirrors. Also, the mirror we have is incomplete due to the resources needed to hash tens of gigabytes of files.
But it works! Keep in mind that the IPFS mirror mentioned was just used for testing, we are not guaranteeing anything.
I achieved the best result by using this XferCommand: `XferCommand = /usr/bin/wget --quiet --show-progress --timeout=180 -O %o %u`
This gives IPFS the time to answer (long timeout) and tells wget not to clutter the output and display a nice progress bar.
### Setting up a mirror
Just follow the steps to set up a regular Arch mirror, but mount IPNS to /ipns first and then drop your mirror files in /ipns/local/. That's it, it's that simple. go-ipfs + FUSE works fine on Ubuntu 15.10 and Arch Linux.
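The steps above could be sketched like this. The upstream mirror URL is a placeholder; `ipfs daemon --mount` and /ipns/local are from the thread:

```shell
# Sketch: mount IPNS via FUSE, then sync a regular Arch mirror into
# /ipns/local so that every update is republished automatically.
setup_ipfs_mirror() {
    upstream="$1"             # e.g. rsync://mirror.example.org/archlinux/
    ipfs daemon --mount &     # exposes /ipfs and /ipns via FUSE
    sleep 5                   # crude wait for the mount to appear
    # -c compares checksums, since mtimes are not preserved under /ipns
    rsync -rc "$upstream" /ipns/local/
}
```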
EDIT: another issue is that signatures for the .db files are missing from the mirror:

```
09:39:50.913 ERROR core/serve: Path Resolve error: no link named "core.db.sig" under QmbEx32ruv1FLp6dHdsgouKSsg1hdaidnZnwSLA52mvAiu gateway_handler.go:458
```

This is awesome! Congrats @robcat and @fazo96! \o/
Please take a look at @diasdavid's recent registry-mirror work. It has very similar goal (replicate npm), and has been addressing many of the issues.
- You can see a preview blog post here: https://ipfs.io/ipfs/QmREB6yt38NWQVjjaeLKYdhotLqmvt421GxBT2J1UDB8uU/blog/4-registry-mirror/ (not yet published to our blog)
- https://github.com/diasdavid/registry-mirror (includes how it works)
- talk went live yesterday here: https://www.youtube.com/watch?v=-S-Tc7Gl8FM
- some of the development discussion here: #2
@diasdavid uses the new mfs (files) API in 0.4.0-dev, to make life much easier. I suspect you can likely use that too programmatically, instead of relying on the /ipns mount. Then you can publish the hash when you're done (`ipfs name publish $hash`). ipfs's fuse interface is not the best of things yet (apologies there, if you found UX issues), so definitely give the files api a try.
Also 0.4.0 is way faster than 0.3.x.
@jbenet Thanks!!
We know about 0.4.0 being faster and the issues with the mount, but we wanted to make it as easy as possible for Arch mirror maintainers to integrate IPFS into the mix, and the IPNS mount really helps because it makes it very simple to use the regular Arch mirror administration tools and practices with IPFS. However, we'll surely consider a better approach and test some more.
> Just follow the steps to set up a regular Arch mirror, but mount IPNS to /ipns first and then drop your mirror files in /ipns/local/. That's it, it's that simple. go-ipfs + FUSE works fine on Ubuntu 15.10 and Arch Linux.
I have to say, it is absolutely awesome that it's this simple, thanks to all the hard work making the abstractions nice. (cc @whyrusleeping)
@robcat @fazo96 -- @whyrusleeping and i have discussed making the files api mountable as well. that interface still needs to be defined. potentially something like this ipfs/go-ipfs#2060 (comment)
> We know about 0.4.0 being faster and the issues with the mount, but we wanted to make it as easy as possible for Arch mirror maintainers to integrate IPFS into the mix, and the IPNS mount really helps because it makes it very simple to use the regular Arch mirror administration tools and practices with IPFS. However, we'll surely consider a better approach and test some more.
That's fantastic, it really is!
@fazo96 we may want to make a simple sh script or Makefile that does all of this:
- boilerplate shell script params stuff:
- install ipfs with:
  - simple shell script: https://github.com/ipfs/install-ipfs
  - new hotness: https://github.com/ipfs/ipfs-update
- boilerplate Makefile with install things: https://github.com/ipfs/go-ipfs/blob/master/test/Makefile
We're still figuring out issues with rsync crashing go-ipfs when used with a target inside /ipns/local. If we figure out exactly what is causing the problem, we'll open an issue about it.
EDIT: looks like we were using a build from the master branch... We made a few modifications to our script and used 0.3.10, and now it looks like it doesn't crash anymore. Here's the script, but it's not done at all and still requires some manual intervention.
The biggest issue now is that, for some reason, rsync can't detect that two files are the same if one of them is in /ipns/local, so it redownloads everything every time.
Ok, I've been able to sync the "core" repository to /ipns/local using this script, and it works wonderfully on my Arch machine with ipfs 0.3.10:
https://gist.github.com/robcat/3dbbafee096269b6843a
But @fazo96 and I were trying to make it work on an Ubuntu 15.10 VM (same versions of ipfs and fuse), and rsync doesn't detect the unchanged files (it overwrites every file, triggering a re-hash and re-publish by ipfs). This is of course too slow.
Unfortunately this problem is not straightforward to debug (it seems to be a timeout issue in the interaction between rsync and fuse): if anyone wants to try, please contact me or @fazo96 in order to get access to the VM.
@jbenet The mountable files api you described can be useful in a lot of situations.
But in the "mirror" use case, we are relying on the auto-publishing feature of /ipns/local. I believe we would have no real advantage in copying the packages in a different folder.
@robcat I think I know the issue: the Atime and Ctime of the files in /ipns aren't set. Let me work on a patch for ya.
Hrmm... after a bit more thought, it's not going to be easy to fix. I'm working modtimes into the mfs code on 0.4.0 (the code behind the fuse interface). You could try the -c option on rsync as a workaround for now.
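Concretely, the suggested workaround might look like this (the mirror path is a placeholder):

```shell
# -c forces rsync to compare file checksums instead of size+mtime,
# working around the missing modtimes under the /ipns mount.
sync_to_ipns() {
    rsync -rc "$1" /ipns/local/
}
```

Used as e.g. `sync_to_ipns /srv/archmirror/`.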
> Hrmm... after a bit more thought, it's not going to be easy to fix. I'm working modtimes into the mfs code on 0.4.0 (the code behind the fuse interface). You could try the -c option on rsync as a workaround for now.
Uhm, I double checked and in fact there is something seriously wrong with my Ubuntu machine.
Rsync doesn't give any errors, but the resulting files accessed via ls have zero sizes (of course the -c option doesn't help):
```
# ls -l /ipns/local/foo
total 0
-rw-rw-rw- 0 root root 0 Aug 30 1754 acl-2.2.52-2-i686.pkg.tar.xz
-rw-rw-rw- 0 root root 0 Aug 30 1754 acl-2.2.52-2-i686.pkg.tar.xz.sig
[...]
```
But the file objects are stored correctly by ipfs:
```
# ipfs/ipfs object links QmcCShzUmGvMjbWyC1jT1fG9y4LFYt7Cxwms7Pm82vg72d
Hash                                           Size   Name
QmaKpUC352G5raBe1YMhKAWEGR4zbHVh4GgWG4sbVuEdFo 133358 acl-2.2.52-2-i686.pkg.tar.xz
QmaobnVG1xfHcYEF4DiLQYQFGGyJaZ32rYNUCeAchGqXe5 607    acl-2.2.52-2-i686.pkg.tar.xz.sig
```
To ask for a third opinion, I will try soon with osx.
@robcat that's very nice!
I'll give it a try too on my machine if I have the time and network bandwidth. We could try setting up a Dockerfile based on the go-ipfs one that also mounts FUSE and keeps an updated Arch mirror using your script, so that we can run it on any platform. I'm not sure it's possible to mount IPFS with FUSE inside a docker container...
Then only the "root" Arch mirror using IPFS needs to actually sync with the other mirrors using rsync. The other IPFS-based mirrors will just need to pin the IPNS name of the "root" mirror once pub/sub is implemented. As a temporary workaround, a very small script could poll for IPNS updates and pin the new hash.
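The temporary workaround could be as small as this. The state-file path is made up; `ipfs name resolve` and `ipfs pin add -r` are real commands:

```shell
# Hypothetical poller standing in for pub/sub: resolve the root
# mirror's IPNS name and pin the new root hash whenever it changes.
check_and_pin() {
    name="$1"; state="$2"
    cur="$(ipfs name resolve "$name")" || return 1
    last="$(cat "$state" 2>/dev/null)"
    if [ "$cur" != "$last" ]; then
        ipfs pin add -r "$cur" && printf '%s' "$cur" > "$state"
    fi
}
# run it from cron every few minutes, e.g.:
# check_and_pin <root-mirror-ipns-name> /var/lib/ipfs-mirror.state
```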
@robcat you could try enabling the fuse debug logs; it currently needs to be added in and recompiled, like so:
```diff
diff --git a/fuse/ipns/ipns_unix.go b/fuse/ipns/ipns_unix.go
index bd4b861..5f634d4 100644
--- a/fuse/ipns/ipns_unix.go
+++ b/fuse/ipns/ipns_unix.go
@@ -23,6 +23,12 @@ import (
 	ft "github.com/ipfs/go-ipfs/unixfs"
 )
 
+func init() {
+	fuse.Debug = func(i interface{}) {
+		fmt.Println(i)
+	}
+}
+
 var log = logging.Logger("fuse/ipns")
 
 // FileSystem is the readwrite IPNS Fuse Filesystem.
```

but that should give you some more insight
So I tried @robcat's script on my machine and it works for [core], but the bigger packages from [community] (in my case 0ad, which is really big, but also alphabetically first) run into ipfs/go-ipfs#1456 and the daemon crashes violently, complaining about too many open files.
When it does work, though, it works beautifully. I'm playing with it at /ipns/ani.sh/, but that's my laptop, so don't expect much uptime.
@atondwal thanks for the report! could you try upping your ulimit before starting the daemon and seeing how things go for you then?
On linux that's: `ulimit -n 2048` (or any number you like, really)
Huh, that makes it fail to mount
```
~ ❯❯❯ ipfs daemon --mount
Initializing daemon...
Swarm listening on /ip4/127.0.0.1/tcp/4001
Swarm listening on /ip4/192.168.0.9/tcp/4001
Swarm listening on /ip4/68.100.248.158/tcp/20207
Swarm listening on /ip6/::1/tcp/4001
API server listening on /ip4/127.0.0.1/tcp/5001
Gateway (readonly) server listening on /ip4/127.0.0.1/tcp/8080
01:53:18.061 ERROR bitswap: failed to find any peer in table workers.go:92
01:53:18.144 ERROR core/comma: error mounting: %!s(<nil>) failed to find any peer in table mount_unix.go:219
Error: failed to find any peer in table
```
(without --mount it starts fine though...)
Oh, I just had to umount it by hand before remounting it, because I had a shell open at that directory when I restarted ipfs daemon.
Also, after playing around with it, the problem wasn't the file size; it was the fact that I was using cp instead of rsync. Which is sort of a strange thing to be the problem.
EDIT: No, the only difference is that rsync fails gracefully. See all the very small files in /ipfs/QmQpNqAn44qsxLKABB5G2DGBKrtkmR5NGWXKS6iYWmrsZi? Any idea what's causing that? Also the fact that there are only 82 entries, vs the 2672 packages I tried to dump in that directory. (Is there a max number of children in IPFS?)

notes integrating with pacman here