
[RFC] Decreasing build costs #153

Closed
arthurpaulino opened this issue Feb 6, 2023 · 14 comments
Labels: discussion (Topic for discussion) · enhancement (New feature or request)

@arthurpaulino (Contributor) commented Feb 6, 2023

Introduction

This RFC issue is motivated by the following:

  1. Equivalent copies of Lake build files stack up cumulatively on the FS. So, for example, if I have the chain of Lake packages A -> B -> C, then I will end up with the build files for A three times on my FS.
  2. Lake's native caching/hosting solution for build files is not as efficient as the one implemented in mathlib4. And the mathlib4 solution is restricted to mathlib4, but we'd like to extend it to other repositories like std4 or any other package managed by independent groups of developers.

At first glance, 1 and 2 are orthogonal. In reality, however, implementing the solution I will present for 1 would make the implementation of 2 much easier. So let's get right into it.

A unified Lake build storage

Lake stores build files inside the build folder (by default) that's created inside the package directory. So the default behavior for Lake is to create multiple copies of olean files, one for each copy of the repository on the machine (clones inside lake-packages count too).

This causes Lake to consume extra space on the FS, but that is not the only issue: Lake ends up building the same things to produce the same outputs over and over.

While implementing mathlib4's Cache, I thought that Lake could do something similar: have a $HOME/.cache/lake-store and centralize build files there, distinguished by their traces. Considering the chain A -> B -> C again, all of those packages could make use of the same build files for A.
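To make this concrete, here is a minimal sketch of how a trace-keyed store path could be computed. The lake-store layout and the storePath helper are my own illustration for this RFC, not an existing Lake API:

```lean
open System in
/-- Hypothetical helper: map a build artifact to its location in the shared
store, keyed by the artifact's trace hash so that identical builds land on
(and can reuse) the same file. -/
def storePath (trace : UInt64) (fileName : String) : IO FilePath := do
  let home := (← IO.getEnv "HOME").getD "."
  return FilePath.mk home / ".cache" / "lake-store" / s!"{trace}-{fileName}"
```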

A powerful lake cache command

lake cache would interface with the lake-store folder and be able to perform several tasks:

  1. Find orphan build files and clean them up
  2. Print a summary
  3. Perform put and get operations

The third item needs elaboration. First, it would be configurable in the lakefile.lean so the user would be able to set their own hosts and functions to format URLs. Second, clearly, put/get would deal with files that are contained in the package, so it wouldn't, for example, try to upload my entire lake-store. Third, it would have a "recursive" get behavior.
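For instance, the lakefile.lean configuration could look something like the sketch below. The cacheHost and cacheUrlFor fields are entirely hypothetical, invented here to illustrate the idea (Lake has no such options today), and the URL is a placeholder:

```lean
import Lake
open Lake DSL

package mathlib4 where
  -- Hypothetical options, not part of Lake today: the host that
  -- `lake cache put` uploads to, and a user-defined URL formatter
  -- that `lake cache get` uses to locate a given build file.
  cacheHost := "https://example.blob.core.windows.net"
  cacheUrlFor := fun (trace : UInt64) (fileName : String) =>
    s!"https://example.blob.core.windows.net/cache/{trace}/{fileName}"
```

With per-package settings like these, the "recursive" get could walk the dependency tree and pull each package's files from that package's own host.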

mathlib4 depends on std4 and currently we are hosting std4 build files in the mathlib4 Azure blob. Ideally, std4 would have its own Azure blob and lake cache get would be able to figure out that std4 has a host set up... and would be able to pull from there.

@Kha (Member) commented Feb 6, 2023

> So, for example, if I have the chain of Lake packages A -> B -> C, then I will end up with the build files for A three times on my FS.

Do you mean if you have three independent (but identical) checkouts of them and build all of them?

@arthurpaulino (Contributor, Author) commented Feb 6, 2023

> Do you mean if you have three independent (but identical) checkouts of them and build all of them?

That's one way to cause the replication of build files. But I was talking about a more common way of doing it: simply having A, B and C on your FS and then building them. Each of them will have its own copy of the build files for A.

For example, we have multiple projects that need Std's RBMap and it's become pretty annoying to build it so many times.

@gebner (Member) commented Feb 6, 2023

> mathlib4 depends on std4 and currently we are hosting std4 build files in the mathlib4 Azure blob. Ideally, std4 would have its own Azure blob and lake cache get would be able to figure out that std4 has a host set up... and would be able to pull from there.

The opportunity to reuse caches from std4 is extremely limited, since the Lean toolchain versions need to match up exactly.

> That's one way to cause the replication of build files. But I was talking about a more common way of doing it: simply having A, B and C on your FS and then building them. Each of them will have its own copy of the build files for A.

Same thing here. A, B, and C can all have different Lean versions (at least in my experience with A=std4 and B=mathlib4).

@arthurpaulino (Contributor, Author) commented Feb 6, 2023

> since the Lean toolchain versions need to match up exactly

Do they? If a file in mathlib4 compiles with a certain version of std4, that will be accounted for when we generate the hash for the mathlib4 file. We can have mathlib4 at 2023-02-01 compiling with std4 at 2023-01-01 (just for the sake of the example), like we already do today. The std4 cache will only stop being reused if its version changes.

@gebner (Member) commented Feb 6, 2023

I like the general direction this RFC goes in. It is certainly technically feasible, but there are a few subtle points to consider.

We need significantly better "dependency specification" for this to work reliably. Does a file depend on another file (#86)? Does it depend on the operating system, architecture, etc.? Does it depend on its file path? Lean also depends on environment variables (e.g., LEAN_ABORT_ON_PANIC), and then there's LD_LIBRARY_PATH and co. as well.
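To give a feel for the scope of that specification, here is a rough sketch (mine, not anything in Lake) of the inputs a fully reproducible trace would have to capture:

```lean
/-- Rough illustration of the inputs a fully reproducible trace would need to
capture; the structure and field names are invented for this sketch. -/
structure TraceInputs where
  fileDeps : List System.FilePath          -- other files the build reads (cf. #86)
  os       : String                        -- operating system
  arch     : String                        -- CPU architecture
  filePath : System.FilePath               -- the file's own path, if it leaks into the output
  envVars  : List (String × Option String) -- e.g. LEAN_ABORT_ON_PANIC, LD_LIBRARY_PATH
```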

@gebner (Member) commented Feb 6, 2023

>> since the Lean toolchain versions need to match up exactly
>
> Do they? If a file in mathlib4 compiles with a certain version of std4, that will be accounted for when we generate the hash for the mathlib4 file. We can have mathlib4 at 2023-02-01 compiling with std4 at 2023-01-01 (just for the sake of the example), like we already do today. The std4 cache will only stop being reused if its version changes.

I'm not sure what you mean exactly; maybe we're talking about the same thing. If the published std4 builds (in the Azure bucket specified in the std4 lakefile) are for a different Lean version than the one mathlib4 requires, then we can't use them.

If you have mathlib4 and std4 checked out side by side, then the same point applies. The toolchains specified by std4 and mathlib4 are not the same, so you still build std4 twice.

@arthurpaulino (Contributor, Author) commented Feb 6, 2023

Ah, I understand now: std4 gets re-built with a more recent toolchain anyway. I see two (not mutually exclusive) options:

  • A diplomatic way for one package to push cache files for another package (as if mathlib4 did std4 the favor of building it with a new toolchain)
  • An option to resolve/cache files locally if they aren't found in the source package, so mathlib4 would host the std4 cache if std4 moves too slowly

@arthurpaulino (Contributor, Author) commented Feb 6, 2023

> I like the general direction this RFC goes in. It is certainly technically feasible, but there are a few subtle points to consider.
>
> We need significantly better "dependency specification" for this to work reliably. Does a file depend on another file (#86)? Does it depend on the operating system, architecture, etc.? Does it depend on its file path? Lean also depends on environment variables (e.g., LEAN_ABORT_ON_PANIC), and then there's LD_LIBRARY_PATH and co. as well.

File dependencies would be handled similarly to mathlib4's Cache. A file's hash shouldn't depend on the OS or the architecture.

For environment variables, I would need to understand their roles better, so @gebner, please feel free to provide some guidance 🙏🏼

@tydeu (Member) commented Feb 7, 2023

@arthurpaulino

> File dependencies would be handled similarly to mathlib4's Cache. A file's hash shouldn't depend on the OS or the architecture.

This is generally invalid: Lean oleans are OS- and architecture-dependent. Mathlib works hard to avoid the areas where this is the case, but that will not work for Lake. The default case has to stick with what Lean does in order to support all kinds of packages.

tydeu added the enhancement (New feature or request) and discussion (Topic for discussion) labels on Feb 7, 2023
@tydeu (Member) commented Feb 7, 2023

@arthurpaulino

Also, I would like to touch on some statements you made:

> Lake stores build files inside the build folder (by default) that's created inside the package directory. ... Lake ends up building the same things to produce the same outputs over and over.

I question the way you phrased this. Lake does not rebuild the same outputs over and over. A package's and its dependencies' build outputs are all stored in the root package's directory. Each package is only built once (i.e., for a dependency chain A -> B -> C, Lake will only build A once, not twice for both B and C). Dependencies' build directories should ideally never be used (just as one would not build a JS package from inside node_modules).

> I thought that Lake could do something similar: have a $HOME/.cache/lake-store and centralize build files there, distinguished by their traces.

As Gabriel noted, Lake build traces are not fully reproducible; it is not safe to assume that packages with different configurations will produce the same outputs. More work could be put into making them properly reproducible, but this is non-trivial because Lean metaprograms can technically make use of all aspects of the build environment.

The current approach is best-effort and seems to work for a single package, but I imagine that sharing cached builds across packages will result in a lot more false positives when it comes to trace checking (requiring manual cleaning of the build cache).

Summary: a more powerful caching mechanism would be nice, but I am not sure it is currently practical due to the high likelihood of cache misses between environments that are not exactly identical (e.g., different Lean versions, different configurations, different OSes/architectures). EDIT: Mathlib can avoid a lot of these concerns since it is not a programming project designed to produce native objects, but its techniques do not generalize across all kinds of packages.

@tydeu (Member) commented Feb 7, 2023

@arthurpaulino One more question that just occurred to me: what is the plan for hosting the build cache online? Are you planning to piggyback off Lake's current cloud releases, or to do something else? You mentioned per-file caching in a previous issue, and I don't think you can do that with GitHub, so I am curious what the plan is.

@arthurpaulino (Contributor, Author) commented Feb 7, 2023

> I question the way you phrased this. Lake does not rebuild the same outputs over and over. A package's and its dependencies' build outputs are all stored in the root package's directory. Each package is only built once (i.e., for a dependency chain A -> B -> C, Lake will only build A once, not twice for both B and C).

This is true if you only have C on your FS. But if you are the maintainer of A, B and C, and if you have clones of them on your FS, you will build A when maintaining A, when maintaining B and when maintaining C.

> As Gabriel noted, Lake build traces are not fully reproducible; it is not safe to assume that packages with different configurations will produce the same outputs. More work could be put into making them properly reproducible, but this is non-trivial because Lean metaprograms can technically make use of all aspects of the build environment.

In mathlib4, this is the formula for the file hash: we mix the root hash (which covers the lakefile.lean, the lean-toolchain and the lake-manifest.json), the relative file path as FilePath.components, the content of the file, and the hashes of the files it imports. What else do we need? Include C files in the root hash?

If it's OS and architecture specific, then we make it so and include those in the hash mix as well.
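In code, the formula reads roughly as below. This is a paraphrase of the scheme, not Cache's actual source; importsOf is a stand-in for real import resolution, and mixHash is Lean's standard hash combinator:

```lean
open System in
/-- Sketch of the per-file hash described above: mix the root hash, the
relative path components, the file's contents, and the hashes of its imports. -/
partial def fileHash (rootHash : UInt64) (importsOf : FilePath → IO (List FilePath))
    (path : FilePath) : IO UInt64 := do
  let content ← IO.FS.readFile path
  let importHashes ← (← importsOf path).mapM (fileHash rootHash importsOf)
  let pathHash := path.components.foldl (fun h c => mixHash h (hash c)) 7
  return importHashes.foldl mixHash (mixHash rootHash (mixHash pathHash (hash content)))
```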

> What is the plan for hosting the build cache online?

mathlib4 hosts its files on Azure, but it could be somewhere else. The URL is something you can change. And if anything gets too specific, we could require the user to define their own functions.

@tydeu (Member) commented Feb 7, 2023

@arthurpaulino

> This is true if you only have C on your FS. But if you are the maintainer of A, B and C, and if you have clones of them on your FS, you will build A when maintaining A, when maintaining B and when maintaining C.

True. However, there are still lean-toolchain incongruities to worry about, along with debug vs. release build configurations (e.g., you may have some trace options turned on while developing A that you don't want to carry over into the release build used by B). Furthermore, unless you are the sole maintainer of A, B, and C and/or the packages are tightly coupled, it is unlikely their lean-toolchain (and dependency versions) will be updated in lockstep in the long run.

I guess my major concern is that this approach does not seem scalable once we are dealing with a real package ecosystem where packages are likely to have different dependency versions locked. I based Lake's package management on NPM, which, afaik, does what Lake currently does: it clones and builds all nested dependencies of a package in the package's root node_modules directory, and there is no caching of results between packages. However, it does sound like Haskell and Rust may perform the kind of caching described here (as does Nix), so I would be interested in learning what led to these different design decisions. That is, why does Rust do system-local caching while NPM does not -- are there aspects of each language that enable or prohibit this (and which does Lean more closely align with)?

> This is the formula for the file hash. What else do we need? Include C files in the root hash? If it's OS and architecture specific, then we make it so and include those in the hash mix as well.

Are you suggesting that the hashing algorithm for traces be configurable by the package? If so, that does sound like a good idea! It would allow different packages to specify how much of the surrounding environment they care about.
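A minimal sketch of how that could surface in a lakefile, with an entirely hypothetical extraTraceInputs hook that a package overrides to feed additional data into its traces:

```lean
import Lake
open Lake DSL

package mypkg where
  -- Hypothetical hook, not part of Lake: extra inputs this package wants
  -- mixed into its build traces (here, an env var and the target triple).
  extraTraceInputs := do
    let panic ← IO.getEnv "LEAN_ABORT_ON_PANIC"
    return #[panic.getD "0", System.Platform.target]
```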

> mathlib4 hosts its files on Azure, but it could be somewhere else. The URL is something you can change. And if anything gets too specific, we could require the user to define their own functions.

This is a problem. The cloud build system Lake provides should not require users to set up their own cloud storage to host builds (though that could be a customization). The goal here is to emulate the packaging services provided by other languages, so whatever the solution is, it needs to be free and relatively painless to set up (hence why the current solution in Lake uses GitHub Releases).

@tydeu (Member) commented Oct 25, 2023

The Lake repository is being deprecated and its issues transferred to leanprover/lean4. As this issue is stale and more of a broad discussion of the planned Lake-based successor to lake exe cache than an actionable item, I am closing it instead of transferring it. Feel free to reopen a similar issue over on leanprover/lean4.

tydeu closed this as completed on Oct 25, 2023