Strategy for huge repositories #5361
To build packages, we clone the repositories that contain them. Sometimes the resulting working directories are enormous, e.g. …

This is a nuisance on the server, though arguably the solution may simply be a larger server. I feel it's worth considering removing some of these packages if they're little-used, e.g. Nemerle, Sekka, Po-Mode. And in other cases, some agitation upstream to split Emacs libraries out from massive repositories might be worthwhile.

I've looked previously at sparse checkouts, and checkouts with limited history, but they conflicted with our need to have the latest change times for any files matched by our recipe wildcards (even if they were from very old commits).
Comments
Sounds like a good plan (even for the cases mentioned earlier).

---

@purcell Can you add llvm-mode to the list please? Makes it easier to find this issue with search engines.

---

Well, it's a list of packages that might be removed, not of packages that have already been removed. Also, now that you have mentioned it yourself, adding it again is no longer necessary for that purpose.
The first step is to contact upstream and try to work out something with them. In this case (and many of the others listed above) this is a huge project, and the maintainers probably do not care very much about this non-critical part. So you should probably contact the author of the elisp part instead. That person should then set up a separate repository and ask upstream to remove the bundled copy. Bundling an Emacs mode with such a huge repository is problematic because the size of the project discourages small contributions. The additional red tape is necessary for the core of the code-base, but it also applies to the non-critical Elisp parts. This may even discourage the original author from making future changes.

---

@tarsius Makes sense. I'll get in touch with the upstream authors. PS: Thanks again for Magit :D

---

@jaseemabid Thanks for looking into this!
I've been trying with Erlang, and supposedly they plan to move the Emacs stuff to a separate repo at some point, but things are happening slowly - too much history. :-)

---

Many of these projects are on GitHub. Could we use the GitHub API to fetch the files and to get the last commit date? That would save us needing a git clone at all.
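For concreteness, a sketch of what that could look like with the GitHub REST API via curl and jq; the OWNER/REPO and file path here are hypothetical:

```shell
# Date of the latest commit touching a given path (the "last commit date" we need).
curl -s "https://api.github.com/repos/OWNER/REPO/commits?path=lisp/foo.el&per_page=1" \
  | jq -r '.[0].commit.committer.date'

# Fetch the file contents themselves, without any clone.
curl -s -H "Accept: application/vnd.github.raw+json" \
  "https://api.github.com/repos/OWNER/REPO/contents/lisp/foo.el"
```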

---

For the record, I resolved …
For updates that might work, but everything is currently quite filesystem-based, e.g. the wildcards in recipes. Would the API let us …

---

A related idea is that the GitHub projects also support svn, so they could use svn's partial checkout (e.g. …)
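For reference, GitHub's SVN bridge (which GitHub has since retired) mapped the default branch to trunk, so a partial checkout of just the elisp directory would have looked something like this; the subdirectory name is hypothetical:

```shell
# Check out only one subdirectory instead of the whole repository.
svn checkout https://github.com/OWNER/REPO/trunk/emacs-mode emacs-mode
```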

---

I just noticed package-build.el isn't doing a "lightweight" checkout when one is available. It's not available from every repo manager, but GitHub, GitLab, and BitBucket all support cloning with a depth limit. Some really quick benchmarks on my machine with depth 1:
And without depth specified:
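Presumably the two invocations being compared were along these lines (hypothetical URL):

```shell
# Shallow clone: only the tip commit, no history.
time git clone --depth 1 https://github.com/OWNER/REPO.git

# Full clone: the entire history.
time git clone https://github.com/OWNER/REPO.git
```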

---

Apologies, this just sunk in:

---

Blobless clone of mozc:
and a normal (full) clone of the same repository:
It's twice as fast with much lower CPU usage. It's not as fast as a shallow clone, but unlike a shallow clone it retains the full commit history.
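Assuming this refers to git's partial-clone filters, the two invocations compared above would be roughly the following (mozc's GitHub URL assumed):

```shell
# Blobless clone: full commit/tree history, but file contents fetched on demand.
git clone --filter=blob:none https://github.com/google/mozc.git

# Normal (full) clone, for comparison.
git clone https://github.com/google/mozc.git
```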

---

Hey @akirak, that's great information, thanks: interesting features I hadn't come across. I'd have to read up more on its behaviour, because we do things at build time like resetting to the head of the default branch (for regular MELPA) or to the latest tag (MELPA Stable), so I guess for that to work, you need to have fetched the blobs for those commits.

---

Reading the article that Akira linked to, treeless clones might also be viable, which IIUC should perform even better, assuming that only one commit is needed. (I'm not sure if I understand Steve's comment correctly, though. If one invocation of the build script checks out both the default branch head and the latest tag, …)
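A treeless clone, for reference, omits both trees and blobs from the initial fetch; a minimal sketch, with a hypothetical URL:

```shell
# Treeless clone: commits only; trees and blobs are fetched lazily.
# Checking out a second commit triggers another round of fetches,
# which is why this mainly pays off when only one commit is needed.
git clone --filter=tree:0 https://github.com/OWNER/REPO.git
```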

---

Here's roughly what we need to be able to do with the repos: https://github.com/melpa/package-build/blob/af4f87beb48afc3fb455c58648749e5bfdda1d03/package-build.el#L244-L298

---

Regarding this case, you may be able to use … (https://about.gitlab.com/blog/2020/03/13/partial-clone-for-massive-repositories/). I don't know if the incremental fetching makes the entire operation faster than a single full clone. I think it depends on how Git resolves objects, but it will probably be faster than cloning multiple times for different commits. According to the man page of …
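To make the trade-off concrete: in a blobless clone, each checkout of a new commit lazily fetches only the blobs that commit needs. A sketch, with a hypothetical repository and tag:

```shell
# The initial clone checks out the default branch and fetches its blobs.
git clone --filter=blob:none https://github.com/OWNER/REPO.git
cd REPO

# Checking out another commit fetches only the blobs that differ.
git checkout v1.2.3
```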

---

This can dramatically reduce the size of large repos; see melpa/melpa#5361 and melpa/melpa#7524.
- We always checkout or reset the repo before building, at which point blobs will be fetched lazily
- History of individual files still works

Presumably, over time, some sort of gc on long-lived checked-out working directories will also be desirable.

---

I've made a preliminary change to start building using partial clones. It's quite trivial given how we currently use git, and should provide a win. The server disk filled up again (see #7524) so this is a good time to experiment. Newly-added packages will start building with partial clones, and sometime soon we can wipe out the non-partial clones on the server so that existing packages start to build the same way. A few related things on my radar:

---

I'm sorry, but partial clones may not reduce the initial cloning time of source repositories. According to further experiments, partial cloning actually takes more time on small repositories (approximately 1.5 times as long as a normal clone). Thus an ideal solution might be to use different modes of cloning depending on the size of the repository. That said, I really don't know what would be the fastest solution if you keep repositories across builds. In the end, partial clones may save downloading time even on small repositories, and they will save storage for sure.
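A minimal sketch of such size-dependent cloning, assuming the repository lives on GitHub (the API's `size` field is reported in kilobytes; the 100 MB threshold and OWNER/REPO are made up):

```shell
#!/bin/sh
# Pick a clone mode based on the repository size reported by the GitHub API.
repo="OWNER/REPO"   # hypothetical
size_kb=$(curl -s "https://api.github.com/repos/$repo" | jq '.size')

if [ "$size_kb" -gt 102400 ]; then
  # Big repo: partial clone to skip downloading historical blobs.
  git clone --filter=blob:none "https://github.com/$repo.git"
else
  # Small repo: a plain clone is simpler and apparently faster.
  git clone "https://github.com/$repo.git"
fi
```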
It is unnecessary to create symlinks. You could just replace

```elisp
(cl-defmethod package-recipe--working-tree ((rcp package-recipe))
  (file-name-as-directory
   (expand-file-name (oref rcp name) package-build-working-dir)))
```

---

My main concern is saving storage, so initial cloning time isn't an issue at all. For each package we build, we unconditionally fetch the workdir's repo history and then reset its local head to the desired remote ref; with partial clones this will lead to a separate fetch of any new objects, but it's probably still less network activity. So I'm not worried about performance costs, but will keep an eye on it.
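Concretely, that per-package sequence amounts to something like the following (paths and branch name assumed, not the exact package-build invocations):

```shell
cd working/PACKAGE-NAME

# Update history; in a blobless clone this stays cheap because blobs are skipped.
git fetch origin

# Resetting the worktree to the desired ref is where missing blobs get fetched.
git reset --hard origin/master
```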
Yes, but that doesn't allow "garbage collection" of unused workdirs using a simple script. I should have explained my rationale better, sorry.

---

What I was really hoping for is that, when this gets too painful, we would start to put pressure on these large projects to move the elisp into separate repositories (and help them with that). Failing that, I would just have thrown a bigger server plan at it. The change you already made seems okay to me, but things like this should be opt-in:

As a recipe contributor who's following the instructions and building and testing the package, I would find this annoying, I think. On the other hand, I am also not in favor of complicating …

---

Those things are not mutually exclusive.
Yes, I've not estimated the likely effect of this, and I'm not particularly in a hurry to implement it: I will avoid it if possible. The best solution by far is obviously to avoid big repos. One small thing I've thought about is collecting repo sizes as part of the build process, so we can name/shame/search for big repos. Some stats could trivially be collected on the server and served as a text file, JSON, or a sqlite DB.
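For instance, a one-liner on the server could produce such stats as a text file (paths hypothetical):

```shell
# Emit "size-in-KB <TAB> workdir" for every checkout, biggest first.
du -sk /mnt/store/melpa/working/* | sort -rn > /var/www/html/repo-sizes.txt
```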

---

@purcell FYI: For reusing working directories (actually repository data), I would suggest using `git worktree`. First, clone the remote Git repository as a bare repository:

```shell
git clone --bare --filter=blob:none --no-checkout REMOTE-URL repos/URL-HASH.git
```

Then use:

```shell
git --git-dir=repos/URL-HASH.git worktree add $TMPDIR/PACKAGE-NAME [TAG]
```

In the bare repository, you can get a list of its working trees using `git worktree list`. This way, the bare repositories serve as a compressed network cache. You can remove the working tree after building a package, so there will be only one working tree at a time, which means it will be small enough to be created on tmpfs. Only the bare repositories will take storage space. I agree with @tarsius's point, of course, that package developers wouldn't want to have another stateful thing on their machines. It would be a good idea to collect statistics on large repositories before you implement something that's hard to roll back.
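To complete the cycle sketched above, the working tree can be dropped again after the build (same hypothetical paths as before):

```shell
# Remove the build's working tree; the bare repository keeps the objects.
git --git-dir=repos/URL-HASH.git worktree remove $TMPDIR/PACKAGE-NAME

# Clean up bookkeeping for any working trees that were deleted by hand.
git --git-dir=repos/URL-HASH.git worktree prune
```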

---

I saw that you got … As you can see above, I am in the midst of such an effort myself. If you are interested in helping with that, see the list of Filtered (subtree) repositories for some candidates. (M=✓ means the package is distributed on Melpa.)

---

Also see #8172. |