cabal update uses too much bandwidth #421

bos opened this Issue May 24, 2012 · 7 comments

3 participants

Haskell member

(Imported from Trac #428, reported by claus on 2008-12-05)

cabal update appears to download a compressed tarball (>600k and growing), which is larger than most packages, takes a while over a slow line, and doesn't provide much info (-v just adds low-level details).

Using rsync over the relevant parts of the directory tree ought to be faster (increasingly so as Hackage keeps growing), use less bandwidth, and be able to tell me which package descriptions have changed (although it would be nice to limit the latter info to new packages and changes to installed packages).


(Imported comment by @dcoutts on 2008-12-05)

Unfortunately we cannot use rsync because it is not present on all platforms. However we certainly will have to address the issue of the ever growing size of the index and use some kind of incremental update.

One possibility is to use http range requests to get just the tail of the index, assuming that we can arrange to only append to the index in the usual case.

However, whatever we do has to use ordinary HTTP/1.1.


(Imported comment by claus on 2008-12-05)

The cabal tool could try for rsync and fall back to the current method if that isn't available/usable. That would work even for Windows Cygwin (and presumably MSYS?) users who have rsync installed. Alternatively, put the index dirs/files into a darcs repo, and have cabal try for darcs first.

But why not use good old diff or find on the server side (a hackage server service that returns a list of files/dirs changed), then fetch only the files/dirs that have changed (possibly with some large cutoff - if everything has changed, it is cheaper to fetch one tar-file instead of lots of little files)?

If running a server find for each cabal update turns out to be a problem, one could instead provide weekly update lists on the server, with the clients consulting as many of those as needed (fetching the whole index tarball if the local index is more than a couple of months old).
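The changed-files scheme above, including the large-change cutoff, could be sketched roughly as follows (Python for illustration; the function and cutoff value are hypothetical, not anything cabal or Hackage actually provides):

```python
# Hypothetical sketch of the "changed-files list" scheme: the server reports
# which files changed since the client's last update, and the client decides
# whether fetching them individually is cheaper than the whole tarball.
CUTOFF = 50  # assumed threshold; above this, one tarball beats many small fetches

def plan_update(changed_files, cutoff=CUTOFF):
    """Return ("full-tarball", None) when too much has changed,
    otherwise ("incremental", <sorted list of files to fetch>)."""
    if len(changed_files) > cutoff:
        return ("full-tarball", None)
    return ("incremental", sorted(changed_files))

print(plan_update(["foo/0.1/foo.cabal", "bar/2.0/bar.cabal"]))
# → ('incremental', ['bar/2.0/bar.cabal', 'foo/0.1/foo.cabal'])
```

The weekly update lists mentioned above would just change where `changed_files` comes from; the client-side decision stays the same.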


(Imported comment by @dcoutts on 2009-01-26)

One approach I was thinking of was providing the uncompressed tarball and making it append-only in the usual case. Most clients could then do a conditional request for the byte range from the point they currently have to the end of the file. If the cache ends up not matching, the client can just request the whole compressed tarball. That uses standard HTTP/1.1 without needing anything special on the server side, which is important if we want to let people host dumb repos easily.
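The byte-range idea above can be sketched like this (a minimal illustration with hypothetical helper names; the network round-trip is simulated with in-memory bytes so the fallback logic is visible):

```python
# Sketch of the append-only incremental fetch: the client asks for the byte
# range past what it already has, and falls back to a full refetch when its
# local copy is no longer a prefix of the remote index.

def range_header(local_size: int) -> str:
    """HTTP/1.1 Range header asking for everything past byte local_size."""
    return f"bytes={local_size}-"

def update_index(local_index: bytes, remote_index: bytes) -> bytes:
    """Append the remote tail if the local copy is a prefix of the remote
    file (what a 206 Partial Content response would deliver); otherwise
    fall back to taking the whole index again."""
    if remote_index.startswith(local_index):
        tail = remote_index[len(local_index):]
        return local_index + tail
    return remote_index  # cache mismatch: full refetch

local = b"pkg-a 1.0\n"
remote = b"pkg-a 1.0\npkg-b 2.1\n"
print(range_header(len(local)))   # → bytes=10-
print(update_index(local, remote) == remote)   # → True
```

A real client would send `If-Range`/`Range` headers and handle a `200` (full body) versus `206` (tail only) response, but the prefix check above is the essential invariant of the append-only design.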


(Imported comment by @igloo on 2009-01-27)

FWIW, what Debian/apt does is, when making a new package list:

  • Run `diff -e` (which outputs an ed script) between the last package list and the new one
  • Add a line to the index with the hash of the last package list and the script's filename
  • Garbage collect old lines from the index as appropriate (e.g. leave at most n lines in the index, remove entries more than d days old, etc. In Debian this is easier as the package list is updated exactly once a day), along with the scripts that those lines point to.
Then to update the index you:

  • Download the index
  • If the hash of your package list is in the index, download and apply all scripts since then
  • Otherwise, download the whole new package list
An example index is available, with the scripts in the same directory.

To do this for hackage, cabal-install would need to be able to apply ed scripts itself - or at least, enough of it that it can apply scripts that diff -e makes.
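The subset of ed that `diff -e` emits is small: `a` (append), `c` (change), and `d` (delete), with text for `a`/`c` terminated by a lone `.`. A minimal applier could look like this (a sketch in Python for illustration, not cabal-install code; it relies on `diff -e` emitting its commands last-change-first, so line numbers need no adjustment):

```python
import re

# Applies the ed-script subset produced by `diff -e old new`.
CMD = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")

def apply_ed_script(old_lines, script_lines):
    lines = list(old_lines)
    it = iter(script_lines)
    for cmd in it:
        m = CMD.match(cmd)
        if not m:
            raise ValueError(f"unsupported ed command: {cmd!r}")
        start = int(m.group(1))
        end = int(m.group(2) or start)
        op = m.group(3)
        text = []
        if op in "ac":           # 'a' and 'c' are followed by text up to a lone "."
            for t in it:
                if t == ".":
                    break
                text.append(t)
        if op == "a":            # append after line `start` (1-indexed)
            lines[start:start] = text
        elif op == "c":          # replace lines start..end
            lines[start - 1:end] = text
        elif op == "d":          # delete lines start..end
            del lines[start - 1:end]
    return lines

old = ["a", "b", "c"]
script = ["3c", "C", ".", "1d"]  # change line 3 to "C", then delete line 1
print(apply_ed_script(old, script))   # → ['b', 'C']
```

Since the scripts never need arbitrary ed features, an applier of roughly this size would be enough for cabal-install to consume Debian-style incremental diffs.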


(Imported comment by claus on 2009-01-27)

I just found another reason why this is annoying: when using --remote-repo, cabal requires an update, and that will always re-download the huge Hackage index as well as the tiny index for the remote repo. Not to mention that we still have to re-download the huge index every time we want to install a newly uploaded package.

The compressed index has now grown to over 2MB!

Given that Hackage only adds, and almost never removes, packages, why not have checkpoints of the index every week, and daily tarballs with just the added-since-checkpoint `.cabal` files? Then cabal could download just the dailies, starting from the last downloaded local checkpoint.

Very little additional technology - just untar the dailies on top of the last checkpoint (if the repo doesn't support dailies, fall back to the current scheme; for hackage, one could either have a server-side script selecting the dailies, or even let cabal do that client-side - the former being more efficient, the latter placing fewer burdens on non-hackage repo providers).
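The checkpoint-plus-dailies scheme above amounts to an overlay, which can be sketched as follows (indexes modelled as `{path: contents}` dicts for brevity; hypothetical names, and the real thing would untar archives on disk):

```python
# Sketch of the weekly-checkpoint + daily-increment scheme: each daily
# tarball is layered on top of the checkpoint, later files winning,
# which mirrors extracting each daily tar over the extracted checkpoint.

def rebuild_index(checkpoint, dailies):
    """Overlay each daily on the checkpoint, in order."""
    index = dict(checkpoint)
    for daily in dailies:
        index.update(daily)   # additions (and any revised .cabal files)
    return index

checkpoint = {"foo/0.1/foo.cabal": "name: foo"}
dailies = [
    {"bar/1.0/bar.cabal": "name: bar"},
    {"foo/0.2/foo.cabal": "name: foo"},  # version added after the checkpoint
]
print(sorted(rebuild_index(checkpoint, dailies)))
# → ['bar/1.0/bar.cabal', 'foo/0.1/foo.cabal', 'foo/0.2/foo.cabal']
```

Because Hackage is essentially add-only, the overlay never has to express deletions, which is what keeps this scheme so simple.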


I will try to make progress on this.


Given the plans of @dcoutts (et al.) for index signing, incremental (secure!) updates, etc., is any of this relevant any longer?
