Feat(#56) blog about caching #58
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,239 @@ | ||
--- | ||
layout: post | ||
date: 2024-02-06 | ||
title: "Build cache in EO and other build systems" | ||
author: Alekseeva Yana | ||
--- | ||
|
||
|
||
## Introduction | ||
In [EO](https://github.com/objectionary/eo), caching is used to speed up program compilation. | ||
Recently we found a caching | ||
[bug](https://github.com/objectionary/eo/issues/2790) in `eo-maven-plugin` | ||
for EO version `0.34.0`. The bug occurred because the old verification method | ||
compared the compilation time with the caching time to search for a cached file. | ||
This check is unreliable, | ||
because the caching time does not have to be equal to the compilation time. | ||
We concluded that EO needs caching with a reliable verification method. | ||
Furthermore, this verification method should not read the content of the cached file. | ||
"Furthermore, this verification method should refrain from reading the file content." |
||
|
||
The goal of this blog post is to review caching in widely used build tools (`ccache`, `Maven`, `Gradle`) | ||
and to implement effective caching in [EO](https://github.com/objectionary/eo). | ||
|
||
<!--more--> | ||
@Yanich96 "More"? |
||
|
||
## Caching in other build systems | ||
@Yanich96 What about "Caching in Build Systems"? Or "Caching in Other Build Systems". |
@volodya-lombrozo I will choose "Caching in Other Build Systems" |
||
|
||
### ccache/sccache | ||
@Yanich96 Is it a build system or what? Where is the link? Short description? |
||
In compiled programming languages, building a project with many source files takes a long time. | ||
"containing" -> "with" |
||
This time is spent loading libraries, preprocessing, checking, and optimizing the code, and so on. | ||
To speed up builds in compiled languages, compiler caches such as [ccache](https://ccache.dev) | ||
and [sccache](https://github.com/mozilla/sccache) are used. | ||
Let's look at a simplified build scheme, using C++ as an example, | ||
@Yanich96 To be honest, I have some doubts about this paragraph where you discuss "compilation steps": |
@volodya-lombrozo I believe that a reader will better understand this article if we briefly talk about the stages of compilation of the presented build systems. Yes, I describe compilation incompletely, but enough to indicate where caching works. |
@Yanich96 Then, it's good to mention it. |
@volodya-lombrozo "The goal is to implement effective caching in EO. |
||
to visualize the build process in compiled languages: | ||
@Yanich96 Maybe we can we change "Imagine" to "Visualize"? What do you think? |
||
|
||
<p align="center"> | ||
<img src="/images/defaultCPhase.svg"> | ||
</p> | ||
|
||
1) First, the preprocessor receives the input files: source files (`.cpp`) and header files (`.h`). | ||
For each source file, it produces a single preprocessed file with human-readable code, which is passed to the compiler. | ||
@Yanich96 Which format the output file has? |
||
|
||
|
||
2) The compiler receives the preprocessed file and converts it into machine code, stored in an object file. | ||
@Yanich96 "finished" code? What does it mean? |
||
During compilation, the parser checks whether the code follows | ||
||
the rules of the programming language. | ||
Finally, the compiler optimizes the generated code and writes the object file. | ||
@Yanich96 You already mentioned it: |
|
||
To speed up compilation, different files of the same project may be compiled in parallel. | ||
@Yanich96 I guess it's better to say "might be compiled". |
||
|
||
3) Then the linker receives the object files. | ||
The linker is a program that combines object files into an executable file or a library. | ||
||
The output of the linker is an executable file (for example, an `.exe` file on Windows). | ||
@Yanich96 Maybe it's better to quote |
||
|
||
|
||
In short, in compiled languages, multiple source files can be converted into machine code | ||
independently and in parallel at the compilation stage. | ||
The linker then combines this machine code into one executable file. | ||
|
||
|
||
`ccache` uses a hash algorithm to find cached files. The hash is computed from the following information: | ||
@Yanich96 [Question] I'm not sure here, but it seems that a "hashcode" isn't frequently used term, from Hash Function definition: |
@volodya-lombrozo Thanks, I will use "hash algorithm" as ccache documentation. |
||
file contents, the current directory, compiler information (its name, size, and modification time), and the file extension | ||
the compiler uses for preprocessed output. The compiled object file is compressed and stored in the cache under the resulting key. | ||
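
The idea of a key derived from everything that can influence the object file can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `CompileKey` and its parameters are hypothetical names, it is not how `ccache` is implemented, and SHA-256 is used simply because it ships with the JDK.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of ccache: derives a cache key from build inputs. */
final class CompileKey {
    private CompileKey() {
    }

    /**
     * Hashes the inputs that can change the resulting object file.
     * All parameter names are illustrative assumptions.
     */
    static String of(final byte[] source, final String directory,
        final String compiler, final String extension) {
        try {
            final MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(source);
            for (final String part : new String[] {directory, compiler, extension}) {
                // A zero byte separates the fields, so ("ab", "c") and ("a", "bc") differ.
                digest.update((byte) 0);
                digest.update(part.getBytes(StandardCharsets.UTF_8));
            }
            // Render the 32-byte digest as 64 hexadecimal characters.
            return String.format("%064x", new BigInteger(1, digest.digest()));
        } catch (final NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }
}
```

With such a key, two compilations share a cache entry only if the source, the directory, the compiler description, and the extension are all identical; changing any one of them produces a different key and therefore a cache miss.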
|
||
|
||
`ccache` has two main caching modes: | ||
1) `Direct mode`: the hash is computed from the source code, without running the preprocessor. | ||
`Direct mode` is faster, since the preprocessor step is skipped. | ||
However, header files are checked less thoroughly than in preprocessor mode, so an outdated object file may be reused and the build may not match the current sources. | ||
@Yanich96 What are "wrong" and "right" projects here? |
@Yanich96 Maybe we can clarify it in the text? |
||
2) `Preprocessor mode`: the hash is computed from the preprocessor output. | ||
`Preprocessor mode` is slower than `direct mode`, but the build always matches the current sources. | ||
|
||
`sccache`, unlike `ccache`, can store cached files not only locally but also in cloud storage. | ||
@Yanich96 Do you mean some particular cloud? ("the") |
||
It also fixes some bugs (for example, it checks header files, which makes direct mode more accurate). | ||
@Yanich96 You have some problem with tense here (grammar) |
||
|
||
|
||
### Maven | ||
[Maven](https://maven.apache.org) automates and manages Java project builds. | ||
Building a project in `Maven` is organized into three | ||
@Yanich96 in three maven LifeCycles Maven. "Maven, maven" |
||
built-in [lifecycles](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html), | ||
which consist of `phases`. Each `phase` consists of a set of `goals`. | ||
|
||
`Maven` has default `phases` and `goals` for building any project: | ||
|
||
<p align="center"> | ||
<img src="/images/defaultPhaseMaven.svg"> | ||
</p> | ||
|
||
@Yanich96 Maybe we need to write a short summary 1-2 sentences about this type of caching? |
@volodya-lombrozo The principle of caching in |
@Yanich96 I mean ccache and sccache altogether. What is the difference with other types of caching? Why did you choose these tools? |
@volodya-lombrozo I wrote above that I looked at well-known used build systems. Isn't this enough? |
||
In `Maven`, all phases and goals are executed strictly in sequence. | ||
@Yanich96 So, Maven doesn't use caching at all? |
@volodya-lombrozo As far as I understand, that Maven can use added extensions from Gradle for caching. Or Maven can rebuild only changed project modules. |
@Yanich96 Maven has |
@Yanich96 It's not about Gradle, I guess: https://maven.apache.org/extensions/maven-build-cache-extension/ |
||
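Out of the box, `Maven` has no build cache; caching is available through the separate [Maven Build Cache Extension](https://maven.apache.org/extensions/maven-build-cache-extension/). | ||
Without it, `Maven` suggests rebuilding only the changed project modules to speed up the build process. | ||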
|
||
### Gradle | ||
Unlike `Maven`, [Gradle](https://gradle.org) builds projects using a task graph, a | ||
@Yanich96 Just "Unlike |
||
[Directed Acyclic Graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph), | ||
@Yanich96 Maybe it's better to give a link to a "Gradle task graph" instead? Why do I need to read about DAGs? |
||
in which independent tasks can be executed in parallel. | ||
To speed up project builds, `Gradle` uses | ||
[incremental builds](https://docs.gradle.org/current/userguide/incremental_build.html#sec:how_does_it_work). | ||
For an incremental build to work, each task must declare its | ||
@Yanich96 Could you please simplify this sentence and use simple active voice? |
@volodya-lombrozo "The tasks that build the project must have input and output files for an incremental build to work." - is it ok? |
@Yanich96 The second sentence clearly explains the idea which you are trying to explain here. I would suggest to combine this two sentences into a single one. Or jut to remove this sentence. What do you think? |
||
input and output files. | ||
``` | ||
task myTask { | ||
    inputs.dir 'src/main/java/MyTask.somebody' // Specify the input directory | ||
@volodya-lombrozo I have fixed this example: |
@Yanich96 good |
||
    outputs.dir 'build/classes/java/main/MyTask.somebody' // Specify the output directory | ||
|
||
    doLast { | ||
        // Task actions go here | ||
        // This code will only be executed if the inputs or outputs have changed | ||
    } | ||
} | ||
``` | ||
Before executing a task, `Gradle` takes a fingerprint of the paths | ||
and contents of the input files and saves it. | ||
If the task completes successfully, `Gradle` also takes a fingerprint of the output files. | ||
To avoid re-fingerprinting unchanged input files, `Gradle` checks their last modification time and size | ||
before the next build. This allows `Gradle` to reuse existing results when the project is rebuilt. | ||
Additionally, `Gradle` can store the results of previous builds, which speeds up project builds, | ||
for example, when switching from one branch to another. This feature is known as the | ||
[Build Cache](https://docs.gradle.org/current/userguide/build_cache.html). | ||
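
The shortcut described above (compare the cheap metadata first and hash the content only when it differs) can be modeled with a small Java sketch. `FileSnapshot` and its fields are hypothetical names chosen for illustration; they are not `Gradle` classes, and the real implementation is considerably more involved.

```java
import java.util.function.Supplier;

/** Hypothetical model of a remembered input file; not Gradle's actual classes. */
final class FileSnapshot {
    private final long size;
    private final long modified;
    private final String hash;

    FileSnapshot(final long size, final long modified, final String hash) {
        this.size = size;
        this.modified = modified;
        this.hash = hash;
    }

    /**
     * Tells whether the file still matches this snapshot.
     * The content hash is recomputed only when the cheap checks are inconclusive.
     */
    boolean unchanged(final long currentSize, final long currentModified,
        final Supplier<String> rehash) {
        if (this.size != currentSize) {
            // A different size always means different content.
            return false;
        }
        if (this.modified == currentModified) {
            // Same size and same modification time: skip hashing altogether.
            return true;
        }
        // The file was touched but has the same size: compare content hashes.
        return this.hash.equals(rehash.get());
    }
}
```

The supplier is invoked only in the last branch, so a rebuild over thousands of untouched files costs one metadata lookup per file instead of a full read of each file.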
|
||
|
||
|
||
|
||
## EO build cache | ||
|
||
EO code is built with the `Maven` build system. | ||
For this purpose, the `eo-maven-plugin` was created, | ||
which contains the goals needed to work with EO code. | ||
As mentioned earlier, `Maven` builds a project by running phases in a specific order. | ||
The diagram below shows the main phases and their goals for the latest version of the EO compiler: | ||
|
||
<p align="center"> | ||
<img src="/images/EO.svg"> | ||
</p> | ||
|
||
In [Picture 3](/images/EO.svg) the goals from the `eo-maven-plugin` | ||
are highlighted in green. | ||
|
||
|
||
However, the actual work with EO code takes place in `AssembleMojo`. | ||
`AssembleMojo` is a goal that consists of other goals that process the EO file, as shown in | ||
[Picture 4](/images/AssembleMojo.svg). | ||
|
||
|
||
<p align="center"> | ||
<img src="/images/AssembleMojo.svg"> | ||
</p> | ||
|
||
Each goal in `AssembleMojo` is a specific compilation step for EO code, and we need to use | ||
caching at each step to speed up the build of the EO program. | ||
|
||
|
||
In EO version `0.34.0`, | ||
caching relied on two unrelated interfaces, `Footprint` and `Optimization`, used by different `Mojo` classes, | ||
even though both exposed the same methods. | ||
The only difference between them is that `Footprint` also checks the version of the EO compiler, | ||
while the rest of the checks are exactly the same. | ||
|
||
|
||
The disadvantages of the initial caching in EO include: | ||
* A cached file is considered up to date only if the compilation time and the time of saving to the cache are equal. | ||
* Verification data is read from the content of a file on the file system. | ||
* Each goal uses its own classes and interfaces for caching, which makes the code difficult to extend and read. | ||
|
||
|
||
|
||
To address these disadvantages, the following solutions are proposed: | ||
|
||
|
||
1) Create a new class `Cache` responsible for data verification, saving to cache and loading from cache. | ||
|
||
``` | ||
public class Cache { | ||
|
||
    private final List<CacheValidation> validations; | ||
|
||
    public Cache(final List<CacheValidation> cv) { | ||
        this.validations = cv; | ||
    } | ||
|
||
    public Optional<XML> load(final Path source, final Path cache) {...} | ||
@Yanich96 Maybe it's better to make |
||
|
||
    public void save(final Path cache, final Scalar<String> program, final Path relative) {...} | ||
} | ||
``` | ||
|
||
|
||
`List<CacheValidation>` is a list of validations. Each validation implements the `CacheValidation` interface. | ||
Different `Mojo` classes can use different validations. | ||
|
||
|
||
``` | ||
public interface CacheValidation { | ||
@Yanich96 I didn't grasp the idea why we might need this class and why it has exactly this implementation. |
||
    boolean validate(final Path source, final Path cache) throws IOException; | ||
} | ||
``` | ||
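
To show what one such validation might look like, here is a minimal sketch that compares last modification times. `NotOlderThanSource` is a hypothetical name and not a class from `eo-maven-plugin`; the interface is repeated only so that the sketch compiles on its own.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Repeated from the post so that the sketch is self-contained.
interface CacheValidation {
    boolean validate(Path source, Path cache) throws IOException;
}

/** Hypothetical validation: the cached file must not be older than the source file. */
final class NotOlderThanSource implements CacheValidation {
    @Override
    public boolean validate(final Path source, final Path cache) throws IOException {
        // Only file metadata is consulted; the file contents are never read.
        return Files.exists(cache)
            && NotOlderThanSource.fresh(
                Files.getLastModifiedTime(source),
                Files.getLastModifiedTime(cache)
            );
    }

    /** Pure comparison, kept separate so it can be checked without touching the disk. */
    static boolean fresh(final FileTime source, final FileTime cache) {
        return source.compareTo(cache) <= 0;
    }
}
```

Keeping each check behind the same small interface means a `Mojo` that needs an extra condition (for example, a compiler version check) can add another validation to the list instead of introducing its own caching classes.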
|
||
2) To avoid reading file contents from disk, validations will work with file paths (`Path`). | ||
The `Path` and `Files` classes provide methods to obtain the necessary metadata, such as the last modification time. | ||
|
||
|
||
3) The search for cached data will rely on the following conditions: | ||
* The source file and the cached file should have the same file name. | ||
* Each `Mojo` that saves cached files should have its own cache directory and result directory. | ||
* The last modification time of the source file should be earlier than or equal to that of the cached file. | ||
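
The first two conditions make the lookup a pure path computation: the location of the cached file follows from the cache directory of the `Mojo` and the relative path of the source. A minimal sketch, assuming a hypothetical helper `CachedPath` and a directory layout chosen only for illustration:

```java
import java.nio.file.Path;

/** Hypothetical helper: finds where the cached counterpart of a source file lives. */
final class CachedPath {
    private CachedPath() {
    }

    /**
     * Maps a relative source path such as "org/program.eo" to
     * "cacheDir/org/program.xml": same name, different extension.
     */
    static Path of(final Path cacheDir, final Path relative, final String extension) {
        final String name = relative.getFileName().toString();
        final int dot = name.lastIndexOf('.');
        // Strip the old extension, if there is one.
        final String base = dot > 0 ? name.substring(0, dot) : name;
        return cacheDir.resolve(relative).resolveSibling(base + "." + extension);
    }
}
```

Because the cached path is derived rather than searched for, no directory scan or index is needed; the third condition is then a single comparison of modification times between the two files.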
|
||
|
||
Suppose there is an EO program `program.eo`, which is compiled for the first time. | ||
Each `Mojo` saves the results of its execution to its cache. | ||
If the program is built again, each `Mojo` loads the data from its cache, | ||
without wasting time and computer resources on recompilation. | ||
If we change something in the `program.eo` file, the program has to be recompiled, | ||
since the last modification time of the source file is now later than that of the cached file. | ||
After the `Mojo` finishes its work, the cache is overwritten with the new result. | ||
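
The scenario above can be replayed with a tiny in-memory model. `MojoCacheDemo` is a hypothetical class that uses plain numbers instead of real file timestamps; it only illustrates the hit and miss logic and is not code from `eo-maven-plugin`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of one Mojo and its cache. */
final class MojoCacheDemo {
    /** Time at which each program was last saved to the cache. */
    private final Map<String, Long> cachedAt = new HashMap<>();

    /** Number of real compilations performed so far. */
    private int compilations;

    /**
     * Runs the step for one program.
     * Returns true on a cache hit and false when the program had to be recompiled.
     */
    boolean run(final String name, final long sourceModified, final long now) {
        final Long saved = this.cachedAt.get(name);
        if (saved != null && sourceModified <= saved) {
            // The source is not newer than the cached result: reuse it.
            return true;
        }
        // Cache miss: compile and overwrite the cache entry.
        this.compilations++;
        this.cachedAt.put(name, now);
        return false;
    }

    int compilations() {
        return this.compilations;
    }
}
```

Calling `run("program", 100, 110)`, then `run("program", 100, 120)`, then `run("program", 130, 140)` corresponds to the first build, an unchanged rebuild, and a rebuild after editing the source.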
|
||
|
||
## Conclusion | ||
@Yanich96 I don't think we need such a conclusion in a blog post. It isn't a scientific article. Moreover it doesn't provide any useful information. Kinda "water". |
||
In this blog post, we showed that `Maven` builds EO code using the goals of the `eo-maven-plugin`. | ||
Since `Maven` goals run in a strict linear order, | ||
we only need to check that the last modification time of a source file is not later than that of its cached file. | ||
The cached file and the source file should have the same name | ||
(but not necessarily the same format, for example, `name.eo` and `name.xml`). | ||
This condition makes it possible to find the cached file in the file system quickly. | ||
Each `Mojo` that participates in caching should have its own cache directory. | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
@Yanich96 It's better to say: "The bug occurred because the old verification method used compilation time and caching time to search for a cached file"