
Strategy for huge repositories #5361

Open

purcell opened this issue Mar 9, 2018 · 27 comments

purcell (Member) commented Mar 9, 2018

To build packages, we clone the repositories that contain them. Sometimes the resulting working directories are enormous, e.g.

  • Erlang: 460MB
  • Mozc: 940MB
  • Nemerle: 600MB
  • Ansible: 170MB
  • Cmake-mode: 170MB
  • Fuel: 260MB
  • Sekka: 185MB
  • Po-mode: 174MB
  • Caml: 200MB

This is a nuisance on the server, though arguably the solution may simply be a larger server. I feel it's worth considering removing some of these packages if they're little-used, e.g. Nemerle, Sekka, Po-Mode. And in other cases, some agitation upstream to split Emacs libraries out from massive repositories might be worthwhile.

I've looked previously at sparse checkouts, and checkouts with limited history, but it conflicted with our need to have the latest change times for any files matched by our recipe wildcards (even if they were from very old commits).
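For concreteness, here is a small self-contained illustration (a synthetic repository, not one of the packages above) of the conflict: in a depth-limited clone, the last-change time reported for an old file is the shallow boundary's date rather than the file's true last commit date.

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Upstream repo: one file last touched in 2015, plus a newer commit.
git init -q upstream && cd upstream
echo old > old-file.el
git add old-file.el
GIT_COMMITTER_DATE='2015-01-01T00:00:00' \
  git -c user.name=t -c user.email=t@t commit -qm 'add old-file'
echo new > other.el
git add other.el
git -c user.name=t -c user.email=t@t commit -qm 'add other'
cd ..

git clone -q --depth 1 "file://$tmp/upstream" shallow

# Full history reports the true 2015 change date...
git -C upstream log -1 --format=%ci -- old-file.el
# ...but the shallow clone reports the boundary commit's date (today).
git -C shallow log -1 --format=%ci -- old-file.el
```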

tarsius (Member) commented Mar 21, 2018

> And in other cases, some agitation upstream to split Emacs libraries out from massive repositories might be worthwhile.

Sounds like a good plan (even for the cases mentioned earlier).

@jaseemabid

@purcell

  1. Is it ok if I make a git repo for things like llvm-mode and copy over the code and release it on melpa? Does it have to be official?

  2. Is it acceptable to maintain a git repo for these libs under the melpa org and consider that safe and official? We will have to keep the code in sync, but I wouldn't expect the llvm mode for example to change that often. I could cherry pick the commits and keep it in sync for some of these.

  3. Can we have one official melpa repo with all these problematic emacs modes grouped together? Do we have a 1-1 mapping between emacs packages and git repos, or is it flexible?

@jaseemabid

@purcell Can you add llvm-mode to the list please? Makes it easier to find this issue with search engines.

tarsius (Member) commented May 9, 2018

Well, it's a list of packages that might be removed; it does not include packages that have already been removed. Also, now that you have mentioned it yourself, adding it is no longer necessary for that purpose.

tarsius (Member) commented May 9, 2018

> Is it ok if I make a git repo for things like llvm-mode and copy over the code and release it on melpa? [...]

The first step is to contact upstream and try to work something out with them. In this case (and in many of the others listed above) the project is huge and its maintainers probably do not care very much about this non-critical part, so you should probably contact the author of the elisp part instead. That person should then set up a separate repository and ask upstream to remove the bundled copy.

Bundling an Emacs mode in such a huge repository is problematic because the size of the project discourages small contributions. The additional red tape may be necessary for the core of the code-base, but it also applies to the non-critical Elisp parts, and may even discourage the original author from making future changes.

@jaseemabid

@tarsius, Makes sense. I'll get in touch with the upstream authors.

PS: Thanks again for Magit :D
PPS: I should do the same for erlang-mode as well, since I write Erlang once in a while.

tarsius (Member) commented May 9, 2018

@jaseemabid Thanks for looking into this!

bbatsov (Contributor) commented May 14, 2018

> Sounds like a good plan (even for the cases mentioned earlier).

I've been trying with Erlang, and supposedly they plan to move the Emacs stuff to a separate repo at some point, but things are happening slowly - too much history. :-)

Wilfred (Contributor) commented Aug 23, 2018

Many of these projects are on GitHub. Could we use the GitHub API to fetch the files and to get the last commit date? That would save us needing a git clone at all.

purcell (Member, Author) commented Apr 22, 2019

For the record, I resolved po-mode by using the emacsmirror clone of it from the large GNU gettext repo.

purcell (Member, Author) commented Apr 22, 2019

> Many of these projects are on GitHub. Could we use the GitHub API to fetch the files and to get the last commit date? That would save us needing a git clone at all.

For updates that might work, but everything is currently quite filesystem-based, e.g. the wildcards in recipes. Would the API let us glob those patterns to find matching files and then determine their last-modified dates? I'd worry it would become a complicated workaround.

riscy (Member) commented Jul 21, 2019

A related idea: these GitHub projects also support svn, so we could use svn's partial checkout (e.g. svn checkout https://github.com/melpa/melpa/trunk/recipes).

@tarsius tarsius added the policy label Aug 5, 2019
riscy (Member) commented Dec 13, 2020

I just noticed package-build.el isn't doing a "lightweight" checkout when one is available. Not every repo host supports it, but GitHub, GitLab, and Bitbucket all support cloning with --depth 1.

Some really quick benchmarks on my machine with depth 1:

  • Built magit in 1.975s, finished at Sat Dec 12 18:08:49 2020
  • Built erlang in 14.531s, finished at Sat Dec 12 18:09:06 2020
  • Built nemerle in 23.176s, finished at Sat Dec 12 18:09:55 2020

And without depth specified:

  • Built magit in 6.910s, finished at Sat Dec 12 18:07:02 2020
  • Built erlang in 62.120s, finished at Sat Dec 12 18:08:22 2020
  • Built nemerle in 46.494s, finished at Sat Dec 12 18:10:59 2020

riscy (Member) commented Dec 13, 2020

Apologies, this just sank in:

> it conflicted with our need to have the latest change times for any files matched by our recipe wildcards (even if they were from very old commits).

akirak (Contributor) commented Mar 22, 2021

@riscy
Recent versions of Git support partial clones, which let you clone a repository without the blobs of past commits. Here is a benchmark.

Blobless clone of mozc:

$ time git clone --filter=blob:none https://github.com/google/mozc.git
Cloning into 'mozc'...
remote: Enumerating objects: 393, done.
remote: Counting objects: 100% (393/393), done.
remote: Compressing objects: 100% (261/261), done.
remote: Total 11921 (delta 239), reused 264 (delta 128), pack-reused 11528
Receiving objects: 100% (11921/11921), 2.20 MiB | 3.50 MiB/s, done.
Resolving deltas: 100% (6818/6818), done.
remote: Enumerating objects: 1621, done.
remote: Counting objects: 100% (1621/1621), done.
remote: Compressing objects: 100% (1559/1559), done.
remote: Total 1880 (delta 706), reused 123 (delta 62), pack-reused 259
Receiving objects: 100% (1880/1880), 36.20 MiB | 7.88 MiB/s, done.
Resolving deltas: 100% (750/750), done.
Updating files: 100% (1922/1922), done.
git clone --filter=blob:none https://github.com/google/mozc.git 3.35s user 0.88s system 25% cpu 16.870 total

and a normal (full) clone of the same repository:

$ time git clone https://github.com/google/mozc.git
Cloning into 'mozc'...
remote: Enumerating objects: 1692, done.
remote: Counting objects: 100% (1692/1692), done.
remote: Compressing objects: 100% (1159/1159), done.
remote: Total 47835 (delta 1050), reused 875 (delta 529), pack-reused 46143
Receiving objects: 100% (47835/47835), 593.56 MiB | 21.85 MiB/s, done.
Resolving deltas: 100% (39771/39771), done.
git clone https://github.com/google/mozc.git 30.18s user 6.57s system 115% cpu 31.827 total

It's twice as fast, with much lower CPU usage. It's not as fast as a shallow clone, but git-diff-tree works on a blobless clone. It's a relatively new feature, so its use should be limited to CI for now.
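That last point is easy to check locally with a synthetic repository (over file:// the serving side has to opt in via uploadpack.allowFilter; the hosting services above already allow it): the blobless clone still has the full commit and tree history, so history-walking commands work without the old blobs.

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q upstream && cd upstream
git config uploadpack.allowfilter true   # let file:// serve partial clones
for i in 1 2 3; do
  echo "rev $i" > file.el
  git add file.el
  git -c user.name=t -c user.email=t@t commit -qm "commit $i"
done
cd ..

git clone -q --filter=blob:none "file://$tmp/upstream" partial
git -C partial rev-list --count HEAD               # full history: 3 commits
git -C partial diff-tree --name-only HEAD~1 HEAD   # works: trees are present
```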

purcell (Member, Author) commented Mar 23, 2021

Hey @akirak, that's great information, thanks: interesting features I hadn't come across. I'd have to read up more on the behaviour, because at build time we do things like resetting to the head of the default branch (for regular MELPA) or to the latest tag (for MELPA Stable), so I guess for that to work you need to have fetched the blobs for those commits.

alphapapa (Contributor) commented Mar 23, 2021

Reading the article that Akira linked to, treeless clones might also be viable, which IIUC should perform even better, assuming that only one commit is needed. (I'm not sure if I understand Steve's comment correctly, though. If one invocation of the build script checks out both master and a tag, using blobless or treeless clones might not be faster. In that case, though, I wonder if cloning bloblessly or treelessly, once for each of master and the stable tag, would work...)

purcell (Member, Author) commented Mar 23, 2021

Here's roughly what we need to be able to do with the repos: https://github.com/melpa/package-build/blob/af4f87beb48afc3fb455c58648749e5bfdda1d03/package-build.el#L244-L298

akirak (Contributor) commented Mar 23, 2021

> If one invocation of the build script checks out both master and a tag, using blobless or treeless clones might not be faster. In that case, though, I wonder if cloning bloblessly or treelessly, once for each of master and the stable tag, would work...

Regarding this case, you may be able to use git-checkout to incrementally fetch blobs into a blobless clone, according to the following description:

> Git remembers the filter spec we provided when cloning the repository so that fetching updates will also exclude large files until we need them.

(https://about.gitlab.com/blog/2020/03/13/partial-clone-for-massive-repositories/)

I don't know whether the incremental fetching makes the entire operation faster than a single full clone. I think it depends on how Git resolves objects, but it will probably be faster than cloning multiple times for different commits.

According to the git-clone(1) man page, git clone supports the same --filter options as the git-rev-list command.
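A local sketch of that incremental behaviour, using a synthetic repository: the clone records the filter in remote.origin.promisor, and a later checkout of an older tag fetches the missing blob on demand from the promisor remote.

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q upstream && cd upstream
git config uploadpack.allowfilter true   # needed for partial clones over file://
for i in 1 2; do
  echo "rev $i" > file.el
  git add file.el
  git -c user.name=t -c user.email=t@t commit -qm "commit $i"
done
git tag stable HEAD~1
cd ..

git clone -q --filter=blob:none "file://$tmp/upstream" partial
cd partial
git config remote.origin.promisor   # "true": the filter is remembered
git checkout -q stable              # fetches the old blob on demand
cat file.el                         # rev 1
```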

purcell added a commit to melpa/package-build that referenced this issue Apr 13, 2021
This can dramatically reduce the size of large repos, see

melpa/melpa#5361
melpa/melpa#7524

- We always checkout or reset the repo before building, at which point
  blobs will be fetched lazily
- History of individual files still works

Presumably, over time, some sort of gc on long-lived checked-out
working directories will also be desirable.
purcell (Member, Author) commented Apr 13, 2021

I've made a preliminary change to start building using partial clones. It's quite trivial given how we currently use git, and should provide a win. The server disk filled up again (see #7524) so this is a good time to experiment. Newly-added packages will start building with partial clones, and sometime soon we can wipe out the non-partial clones on the server so that existing packages start to build the same way.

A few related things on my radar:

  • Prior to making this change, I was able to save 4GB on the server (relative to 18GB of checked-out workdirs) by running git gc, so we should run this periodically.
  • When recipes are deleted or renamed, any existing workdirs remain on the server, so a periodic automated clean-up of these would also be helpful.
  • It would be good to start re-using working directories when multiple packages live in the same repository. One way to do this would be to check out into a directory name that is a hash of the repo URL and fetcher type, then symlink it to the name of the recipe. I'm just wary of breaking things for Windows users, since presumably that platform has no symlink support. Please correct me if I'm wrong.
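The first two points could be sketched as a periodic maintenance script; the recipes/ and working/ layout below is an assumption for illustration (the demo creates its own throwaway layout), not package-build's actual configuration:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Demo layout: one recipe still present, one workdir whose recipe is gone.
mkdir -p recipes working
touch recipes/kept-pkg
git init -q working/kept-pkg
git init -q working/orphan-pkg   # its recipe was deleted or renamed

for repo in working/*/; do
  name=$(basename "$repo")
  if [ ! -e "recipes/$name" ]; then
    rm -rf "$repo"               # no matching recipe: stale workdir
  else
    git -C "$repo" gc --quiet    # compact the repos we keep
  fi
done
ls working                       # only kept-pkg remains
```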

akirak (Contributor) commented Apr 13, 2021

I'm sorry, but partial clones may not reduce the initial cloning time of source repositories. In further experiments, partial cloning actually took more time on small repositories (approximately 1.5 times as long as a normal clone).

Thus an ideal solution might be to choose the cloning mode based on the size of the repository: perform partial clones only on huge repositories and clone normally otherwise. AFAIK, there is no general way to know the size of a remote Git repository beforehand. The GitHub REST API seems to provide the repository size as part of its response, but I don't know if that is worth the effort.

I really don't know what the fastest solution would be if you keep repositories across builds. In the end, partial clones may save downloading time even on small repositories, and they will certainly save storage.

> One way to do this would be to check out into a directory name that is a hash of the repo URL and fetcher type, then symlink it to the name of the recipe.

It is unnecessary to create symlinks. You could just replace (oref rcp name) in the following definition with a hash-generating expression:

(cl-defmethod package-recipe--working-tree ((rcp package-recipe))
  (file-name-as-directory
   (expand-file-name (oref rcp name) package-build-working-dir)))

purcell (Member, Author) commented Apr 13, 2021

My main concern is saving storage, so initial cloning time isn't an issue at all. For each package we build, we unconditionally fetch the workdir's repo history and then reset its local head to the desired remote ref: with partial clones this will lead to a separate fetch of any new objects, but it's probably still less network activity overall. So I'm not worried about performance costs, but I will keep an eye on it.

> It is unnecessary to create symlinks. You could just replace (oref rcp name) in the following definition with a hash-generating expression:

Yes, but that doesn't allow "garbage collection" of unused workdirs using a simple script: I should have explained my rationale better, sorry.

tarsius (Member) commented Apr 14, 2021

What I was really hoping for is that, when this gets too painful, we would start to put pressure on these large projects to move the elisp into separate repositories (and help them with that). Failing that, I would just have thrown a bigger server plan at it.

The change you already made seems okay to me, but things like this should be opt-in:

> It would be good to start re-using working directories when multiple packages live in the same workdir. One way to do this would be to check out into a directory name that is a hash of the repo URL and fetcher type, then symlink it to the name of the recipe. Just wary of breaking things for Windows users, since presumably that platform has no symlink support. Please correct me if I'm wrong.

As a recipe contributor who's following the instructions and building and testing the package, I would find this annoying, I think. On the other hand, I am also not in favor of complicating package-build.el by adding additional code paths and configuration knobs.

purcell (Member, Author) commented Apr 14, 2021

> What I was really hoping for is that, when this gets too painful, we would start to put pressure on these large projects to move the elisp into separate repositories (and help them with that). Failing that, I would just have thrown a bigger server plan at it.

Those things are not mutually exclusive.

> things like this should be opt-in

Yes, I've not estimated the likely effect of this, and I'm not in a particular hurry to implement it: I will avoid it if possible. The best solution by far is obviously to avoid big repos.

One small thing I've thought about is collecting repo size as part of the build process, so we can name/shame/search for big repos. Some stats could trivially be collected on the server and served as a text file, JSON or sqlite DB.
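A minimal sketch of such stats collection, with hypothetical directory names (the demo fabricates a working/ layout), producing a size-sorted TSV that could be served as-is:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in for the server's checked-out workdirs.
mkdir -p working
git init -q working/pkg-a
git init -q working/pkg-b

# One line per package: size in KB, then name, biggest first.
for repo in working/*/; do
  printf '%s\t%s\n' "$(du -sk "$repo" | cut -f1)" "$(basename "$repo")"
done | sort -rn > repo-sizes.tsv
cat repo-sizes.tsv
```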

akirak (Contributor) commented Apr 15, 2021

@purcell FYI: For reusing working directories (actually repository data), I would suggest using git-worktree instead of symbolic links. Here is a proof of concept.

First, clone the remote Git repository as a bare repository:

git clone --bare --filter=blob:none --no-checkout REMOTE-URL repos/URL-HASH.git

Then use git-worktree to check out a branch/tag:

git --git-dir=repos/URL-HASH.git worktree add $TMPDIR/PACKAGE-NAME [TAG]

In the bare repository, you can get a list of its working trees using git worktree list. After building all packages, you can purge bare repositories that have no working trees. Before you start the next build cycle, run git worktree prune on all repositories to reset the worktree information.

This way, the bare repositories serve as a network cache, stored in compressed form. You can remove the working tree after building a package, so there is only one working tree at a time, which means it is small enough to be created on tmpfs. Only the bare repositories take storage space. git-worktree is part of standard Git, so I suppose it is cross-platform.
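The flow above can be reproduced locally with a synthetic repository (uploadpack.allowFilter enables the partial clone over file://; all paths here are placeholders):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q upstream && cd upstream
git config uploadpack.allowfilter true
echo code > pkg.el
git add pkg.el
git -c user.name=t -c user.email=t@t commit -qm init
git tag v1
cd ..

# Bare, blobless clone: the shared "network cache".
mkdir -p repos build
git clone -q --bare --filter=blob:none "file://$tmp/upstream" repos/cache.git

# Materialize a throwaway working tree at the tag, build, then discard it.
git --git-dir="$tmp/repos/cache.git" worktree add "$tmp/build/pkg" v1
ls "$tmp/build/pkg"                                   # pkg.el
git --git-dir="$tmp/repos/cache.git" worktree remove "$tmp/build/pkg"
git --git-dir="$tmp/repos/cache.git" worktree prune
```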

I agree with @tarsius's point, of course, that package developers wouldn't want to have another stateful thing on their machines. It would be a good idea to collect statistics on large repositories before implementing something that's hard to roll back.

tarsius (Member) commented Nov 14, 2022

I saw that you got scad-mode split into its own repository (openscad/openscad#4403) and am wondering whether you just happen to have an interest in this particular package, or are planning to do the same for a few other bundled emacs packages.

As you can see above, I am in the midst of such an effort myself. If you are interested in helping with that, see the list of Filtered (subtree) repositories for some candidates. (M=✓ means the package is distributed on Melpa.)

tarsius (Member) commented Mar 10, 2023

Also see #8172.
