(Imported from Trac #428, reported by claus on 2008-12-05)
cabal update appears to download a compressed tarball (>600k and growing), which is larger than most packages, takes a while over a slow line, and doesn't provide much information (-v just adds low-level details).
Using rsync over the relevant parts of the directory tree ought to be faster (increasingly so as hackage keeps growing), use less bandwidth, and be able to report which package descriptions have changed (although it would be nice to limit that report to new packages and to changes in installed packages).
(Imported comment by @dcoutts on 2008-12-05)
Unfortunately we cannot use rsync because it is not present on all platforms. However we certainly will have to address the issue of the ever growing size of the index and use some kind of incremental update.
One possibility is to use http range requests to get just the tail of the index, assuming that we can arrange to only append to the index in the usual case.
However, whatever we do has to use ordinary HTTP/1.1.
(Imported comment by claus on 2008-12-05)
The cabal tool could try for rsync and fall back to the current method if that isn't available/usable. That would work even for windows cygwin (and presumably msys?) users who have rsync installed. Alternatively, put the index dirs/files into a darcs repo, and have cabal try for darcs first.
But why not use good old diff or find on the server side (a hackage server service that returns a list of changed files/dirs), then fetch only those files/dirs that have changed (possibly with some cutoff - if everything has changed, it is cheaper to fetch one tar file than lots of little files)?
If running a server find for each cabal update turns out to be a problem, one could instead provide weekly update lists on the server, with the clients consulting as many of those as needed (fetching the whole index tarball if the local index is more than a couple of months old).
(Imported comment by @dcoutts on 2009-01-26)
One approach I was thinking of was providing the uncompressed tarball and keeping it append-only in the usual case. Most clients could then make a conditional request for the byte range from their current end-of-file to the end of the remote file. If the cached copy ends up not matching, the client can just request the whole compressed tarball. That uses standard HTTP/1.1 without needing anything special on the server side, which is important if we want to let people host dumb repos easily.
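The decision a client has to make under this append-only scheme can be sketched as a pure function (a minimal sketch with hypothetical names; a real client would turn the offset into an HTTP `Range: bytes=N-` request, guarded by a validator such as `If-Range`):

```haskell
import Data.List (isPrefixOf)

-- What a client holding 'local' should request, given that the
-- remote index is append-only in the usual case.
data Update = FetchTail Int   -- request bytes [offset ..] and append them
            | FetchWhole      -- local copy is not a prefix: re-download
            | UpToDate
  deriving (Eq, Show)

planUpdate :: String -> String -> Update
planUpdate local remote
  | local == remote           = UpToDate
  | local `isPrefixOf` remote = FetchTail (length local)
  | otherwise                 = FetchWhole
```

The prefix check stands in for whatever cache validation the client does; on a `FetchWhole` answer it falls back to the compressed tarball exactly as today.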
(Imported comment by @igloo on 2009-01-27)
FWIW, what Debian/apt does is, when making a new package list:
To do this for hackage, cabal-install would need to be able to apply ed scripts itself - or at least enough of ed to apply the scripts that `diff -e` produces.
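A minimal sketch of such an applier (hypothetical code, covering only the a/c/d commands that `diff -e` emits; `diff -e` lists its commands from the bottom of the file upwards, so applying them in order never invalidates the line numbers of the commands still to come):

```haskell
import Data.Char (isDigit)

-- Apply the subset of ed commands produced by `diff -e`:
-- append (a), change (c) and delete (d), each with a single
-- line address or an address range like "2,4".
applyEd :: [String] -> [String] -> [String]
applyEd script doc = foldl step doc (parse script)
  where
    step d (from, to, op, body) = case op of
      'd' -> take (from - 1) d ++ drop to d
      'c' -> take (from - 1) d ++ body ++ drop to d
      'a' -> take from d ++ body ++ drop from d
      _   -> d

    -- Split the script into (from, to, command, payload) tuples;
    -- 'a' and 'c' payloads run up to a lone "." terminator.
    parse [] = []
    parse (cmd : rest) =
      let (addr, op : _) = span (\c -> isDigit c || c == ',') cmd
          (fromS, toS)   = break (== ',') addr
          from           = read fromS :: Int
          to             = if null toS then from else read (drop 1 toS)
      in if op == 'd'
           then (from, to, op, []) : parse rest
           else let (body, rest') = break (== ".") rest
                in (from, to, op, body) : parse (drop 1 rest')
```

This deliberately ignores everything else in ed; for index updates the server would only ever ship scripts in this restricted form.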
(Imported comment by claus on 2009-01-27)
I just found another reason why this is annoying: when using --remote-repo, cabal requires an update, and that will always re-download the huge hackage index as well as the tiny index for the remote repo. Not to mention that we still have to re-download the huge index every time we want to install a newly uploaded package.
The compressed index has now grown to over 2MB!
Given that hackage only adds packages and almost never removes them, why not have checkpoints of the index every week, and daily tarballs with just the added-since-checkpoint `.cabal` files? Then cabal could download just the dailies, starting from the last downloaded local checkpoint.
This needs very little additional technology - just untar the dailies on top of the last checkpoint. If the repo doesn't support dailies, fall back to the current scheme; for hackage, one could either have a server-side script select the dailies, or even let cabal do that client-side (the former being more efficient, the latter placing fewer burdens on non-hackage repo providers).
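The client-side variant of that selection could be sketched like this (all names hypothetical; dates are ISO-8601 strings, which compare correctly as plain strings):

```haskell
-- Decide which daily delta tarballs to fetch, or whether the local
-- index is too old and a full re-download is needed.
data Plan = FetchDailies [String]  -- untar these on top of the local index
          | FetchFullIndex         -- local copy predates the last checkpoint
  deriving (Eq, Show)

planDailies :: String    -- date of the local index
            -> String    -- date of the latest server checkpoint
            -> [String]  -- dates of the dailies the server offers
            -> Plan
planDailies localDate checkpointDate dailies
  | localDate < checkpointDate = FetchFullIndex
  | otherwise                  = FetchDailies (filter (> localDate) dailies)
```

A server-side script would implement the same rule, just returning the tarballs directly instead of a list of dates.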
I will try to make progress on this.
Given the plans of @dcoutts (et al.) for index signing, incremental (secure!) updates, etc., is any of this still relevant?