[RFC] Decreasing build costs #153
Comments
Do you mean if you have three independent (but identical) checkouts of them and build all of them?
That's one way to cause the replication of build files. But I was talking about a more common way of doing it: simply having A, B and C on your FS and then building them. Each of them will have their copy of the build files for A. For example, we have multiple projects that need
The opportunity to reuse caches from
Same thing here. A, B, and C can all have different Lean versions (at least in my experience with A=std4 and B=mathlib4).
Do they? If a file on
I like the general direction this RFC goes in. It is certainly technically feasible, but there are a few subtle points to consider. We need significantly better "dependency specification" for this to work reliably. Does a file depend on another file (#86), does it depend on the operating system, architecture, etc., does it depend on its file path? Lean also depends on environment variables (
I'm not sure what you mean exactly, maybe we're talking about the same thing.

If the published

If you have a
Ah, I understand now. Because
File dependency would be similar to how it's done in

For environment variables, I would need to better understand their roles. So @gebner please feel free to provide some guidance 🙏🏼
This is generally invalid. Lean oleans are OS- and architecture-dependent. Mathlib works hard to avoid the areas where this is the case, but that will not work for Lake. The default case has to stick with what Lean is in order to support all kinds of packages.
Also, I would like to touch on some statements you made:
I question the way you phrased this. Lake does not rebuild the same outputs over and over. A package's and its dependencies' build outputs are all stored in the root package's directory. Each package is only built once (i.e., for a dependency chain A -> B -> C, Lake will only build A once, not twice for both B and C). Dependencies' build directories should ideally never be used (just as one would not build a JS package from inside
As Gabriel noted, Lake build traces are not fully reproducible, so it is not safe to assume that packages with different configurations will produce the same outputs. More work could be put into making them properly reproducible, but this is non-trivial because Lean meta-programs can technically make use of all aspects of the build environment. The current approach is best effort and seems to work for a single package, but I imagine cached builds across packages will result in a lot more false positives when it comes to trace testing (requiring manual cleaning of the build cache).

Summary: A more powerful caching mechanism would be nice, but I am not sure it is currently practical due to the high likelihood of cache misses in environments that are not exactly identical (e.g., different Lean versions, different configurations, different OSes/architectures).

EDIT: Mathlib can avoid a lot of these concerns since it is not a programming project designed to produce native objects, but its techniques do not generalize across all kinds of packages.
@arthurpaulino One more question that just occurred to me: What is the plan for hosting the build cache online? Are you planning to piggyback off Lake's current cloud releases, or do something else? You mentioned per-file caching in a previous issue and I don't think you can do that with GitHub, so I am curious what the plan is.
This is true if you only have C on your FS. But if you are the maintainer of A, B and C, and if you have clones of them on your FS, you will build A when maintaining A, when maintaining B and when maintaining C.
In

If it's OS- and architecture-specific, then we make it so and include those in the hash mix as well.
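As a rough illustration of what "including those in the hash mix" could mean, here is a minimal Lean sketch. This is not Lake's actual API: `mixHash`, `osDescr`, and `archDescr` are hypothetical names, and Lake uses its own trace `Hash` type rather than a bare `UInt64`.

```lean
-- Hypothetical sketch (not Lake's actual API): mixing platform facts into
-- a build trace hash so oleans produced on different OSes/architectures
-- never collide in a shared store.

/-- A stand-in for Lake's trace hash combinator. -/
def mixHash (h : UInt64) (s : String) : UInt64 :=
  s.foldl (fun acc c => acc * 31 + c.toNat.toUInt64) h

/-- Hypothetical platform descriptors; a real implementation would query
these from the toolchain instead of hard-coding them. -/
def osDescr : String := "linux"
def archDescr : String := "x86_64"

/-- Mix a file-content hash with the OS and architecture descriptors, so
the same source built on a different platform gets a different trace. -/
def traceHash (contentHash : UInt64) : UInt64 :=
  mixHash (mixHash contentHash osDescr) archDescr
```

Packages that are genuinely platform-independent could then opt out of mixing in `osDescr`/`archDescr`, which connects to the configurable-hashing idea discussed in this thread.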
True. However, there is still the

I guess my major concern is that this approach does not seem scalable once we are dealing with a real package ecosystem where packages are likely to have different dependency versions locked. I based Lake's package management on NPM which, afaik, does what Lake currently does: clones and builds all nested dependencies of a package in the package's root
Are you suggesting that the hashing algorithm for traces be configurable by the package? If so, that does sound like a good idea! It would allow different packages to specify how much of the surrounding environment they care about.
This is a problem. The cloud build system Lake provides should not require users to set up their own cloud storage to host builds (though that could be a customization). The goal here is to emulate packaging services provided by other languages, so whatever the solution is, it needs to be free and relatively painless to set up (hence why the current solution in Lake uses GitHub Releases).
The Lake repository is being deprecated and its issues transferred to leanprover/lean4. As this issue is stale and more of a broad discussion of the planned Lake-based successor to
Introduction
This RFC issue is motivated by the following:

1. … `mathlib4`.
2. And the `mathlib4` solution is restricted to `mathlib4`, but we'd like to extend it to other repositories like `std4` or any other package managed by independent groups of developers.

At a first glance, 1 and 2 are orthogonal. But, in reality, having the solution I will present for 1 implemented would make the implementation of 2 much easier. So let's get right into it.
A unified Lake build storage
Lake stores build files inside the `build` folder (by default) that's created inside the package directory. So the default behavior for Lake is to create multiple copies of olean files, one for each copy of the repository in a computer (clones in `lake-packages` count too).

This causes Lake to consume extra space in the FS, but this is not the only issue. Lake ends up building the same things to produce the same outputs over and over.
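To make the shared-store idea concrete, here is a hedged sketch of a content-addressed layout under `$HOME/.cache/lake-store`. The function `storePath` and the literal hash are made up for illustration; the point is only that the path is derived from the build trace, so identical builds from different checkouts land on the same file.

```lean
-- Hypothetical sketch: a content-addressed layout for a shared store.
-- Artifacts are named by their build trace hash, so building A from three
-- different checkouts resolves to one cached olean instead of three.

def storePath (home traceHash ext : String) : String :=
  home ++ "/.cache/lake-store/" ++ traceHash ++ "." ++ ext

#eval storePath "/home/user" "7ab3f2c9" "olean"
-- "/home/user/.cache/lake-store/7ab3f2c9.olean"
```

A lookup that misses in this store would fall back to building locally (or, with the `lake cache` command proposed below in this issue's sense, to fetching from a configured host).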
While implementing `mathlib4`'s Cache, I thought that Lake could do something similar: have a `$HOME/.cache/lake-store` and centralize build files there, distinguished by their traces. Considering the chain A -> B -> C again, all of those packages could make use of the same build files for A.

A powerful `lake cache` command
`lake cache` would interface with the `lake-store` folder, being able to perform many tasks:

- …
- …
- … `put` and `get` operations

The third item needs elaboration. First, it would be configurable in the `lakefile.lean` so the user would be able to set their own hosts and functions to format URLs. Second, clearly, `put`/`get` would deal with files that are contained in the package, so it wouldn't, for example, try to upload my entire `lake-store`. Third, it would have a "recursive" `get` behavior. `mathlib4` depends on `std4` and currently we are hosting `std4` build files in the `mathlib4` Azure blob. Ideally, `std4` would have its own Azure blob and `lake cache get` would be able to figure out that `std4` has a host set up... and would be able to pull from there.
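The host configuration described above could look roughly like the following Lean sketch. Every name here (`CacheHostConfig`, `urlFmt`, the example URL) is hypothetical, not an existing Lake option; it only illustrates the shape of a per-package declaration that a recursive `lake cache get` could read from each dependency's lakefile.

```lean
-- Hypothetical configuration surface (not Lake's actual API): what a
-- package might declare so `lake cache get` knows where to pull
-- artifacts from, recursively per dependency.

structure CacheHostConfig where
  /-- Base URL of the artifact host, e.g. an Azure blob container. -/
  baseUrl : String
  /-- How to turn a trace hash into the file name at the host. -/
  urlFmt : String → String := fun hash => hash ++ ".tar.gz"

/-- What `std4` might declare (hypothetical URL). -/
def std4Cfg : CacheHostConfig :=
  { baseUrl := "https://std4.example.com/cache" }

/-- Full download URL for one artifact, as a recursive `get` would
compute it for each dependency that has a host set up. -/
def artifactUrl (cfg : CacheHostConfig) (hash : String) : String :=
  cfg.baseUrl ++ "/" ++ cfg.urlFmt hash
```

Under this sketch, `mathlib4` would no longer need to host `std4`'s build files: its `lake cache get` would walk the dependency tree, read each package's host config, and pull each artifact from its own origin.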