Feat(#56) blog about caching #58

Closed
239 changes: 239 additions & 0 deletions _posts/2024/2024-02-06-about-caching-in-eo.md
@@ -0,0 +1,239 @@
---
layout: post
date: 2024-02-06
title: "Build cache in EO and other build systems"
author: Alekseeva Yana
---


## Introduction
In [EO](https://github.com/objectionary/eo), caching is used to speed up program compilation.
Recently we found a caching
[bug](https://github.com/objectionary/eo/issues/2790) in `eo-maven-plugin`
for EO version `0.34.0`. The bug occurred because the old verification method
Member

@Yanich96 It's better to say: "The bug occurred because the old verification method used compilation time and caching time to search for a cached file"

contains a comparison of the compilation time and caching time to search for the cached file.
This is not the most reliable verification method,
because caching time does not have to be equal to compilation time.
We came to the conclusion that we need caching with a reliable verification method.
And this verification method should not use the information that the cached file contains.
Member

@Yanich96

"Furthermore, this verification method should refrain from reading the file content."


The goal of this blog is to research caching in frequently used build systems (`ccache`, `Maven`, `Gradle`)
and to implement effective caching in [EO](https://github.com/objectionary/eo).

<!--more-->
Member

@Yanich96 "More"?


## Build caching of existing build systems
Member

@Yanich96 What about "Caching in Build Systems" ? or " Caching in Other Build Systems".

Author

@volodya-lombrozo I will choose " Caching in Other Build Systems"


### ccache/sccache
Member

@Yanich96 Is it a build system or what? Where is the link? Short description?

In compiled programming languages, building a project containing many source code files takes a long time.
Member

@Yanich96

"containing" -> "with"

This time is spent on loading libraries, preparing, optimizing, checking the code, and so on.
To speed up builds in compiled languages, [ccache](https://ccache.dev)
and [sccache](https://github.com/mozilla/sccache) are used.
Let's look at the assembly scheme using C++ as an example
Member

@Yanich96 To be honest, I have some doubts about this paragraph where you discuss "compilation steps":

  1. First of all this is a blog about caching, not about compilation
  2. You describe compilation incompletely. What about optimizations? Moreover, modern compilers usually convert source code to some sort of IR, like LLVM IR, for example. You can take a look how clang works.

Author

@volodya-lombrozo I believe that a reader will better understand this article if we briefly talk about the stages of compilation of the presented build systems. Yes, I describe compilation incompletely, but enough to indicate where caching works.

Member (volodya-lombrozo, Mar 19, 2024)

@Yanich96 Then, it's good to mention it.

I describe compilation incompletely, but enough to indicate where caching works.

Author

@volodya-lombrozo "The goal is to implement effective caching in EO.
For this, we will briefly look at how frequently used build systems (ccache, Maven, Gradle) work
in order to better understand the ideas behind caching in them."
it's ok?

to imagine the build process in compiled languages:
Member

@Yanich96 Maybe we can change "Imagine" to "Visualize"? What do you think?
BTW, "to imagine the build process in compiled languages" looks redundant.


<p align="center">
<img src="/images/defaultCPhase.svg">
</p>

1) First, the preprocessor gets the input files. The input files are source files (.cpp) and header files (.h).
The result is a single edited file with human-readable code that the compiler will get.
Member

@Yanich96 What format does the output file have?



2) The compiler receives the finished code file and converts it into machine code, presented in an object file.
Member

@Yanich96 "finished" code? What does it mean?

At the compilation stage, parsing occurs, which checks whether the code matches
Member

@Yanich96 Wdyt about "parsing checks ..."

rules of a specific programming language.
At the end, the compiler optimizes the resulting machine code and produces an object file.
Member

@Yanich96 You already mentioned it:

The compiler receives the file .cpp from the preprocessor and compiles it into an object file - .obj.

To speed up compilation, different files of the same project are compiled in parallel.
Member

@Yanich96 I guess it's better to say "might be compiled".


3) Then, the linker gets the object files.
The linker is a program that combines object files into an executable file or library.
Member

@Yanich96 What do you think about just adding a link to a description of the linker, instead of explaining it here?

The result of the linker is an executable .exe file.
Member

@Yanich96 Maybe it's better to quote .exe? What do you think?



As a result, in compiled languages, multiple files are simultaneously and independently converted
into machine code at the compilation stage.
This machine code is then combined into one executable file.


`ccache` uses hashcode to find cached files. The hashcode includes information:
Member

@Yanich96 [Question] I'm not sure here, but it seems that a "hashcode" isn't frequently used term, from Hash Function definition:

The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes.

Author

@volodya-lombrozo Thanks, I will use "hash algorithm" as ccache documentation.

file contents, directory, compiler information, compilation time, extensions
used by the compiler. A compressed machine code file is placed in the cache using the received key.
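
As a rough illustration of how such a key could be derived, the sketch below hashes the source contents together with a few pieces of build context. It is not ccache's actual algorithm: the class `CacheKeySketch`, its parameters, the subset of inputs, and the choice of SHA-256 are all assumptions made for this example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical helper: derives a cache key from source text and build context. */
final class CacheKeySketch {

    private CacheKeySketch() {
    }

    /**
     * Hashes the source contents, the working directory, a compiler
     * identifier and the compiler options into one hex string.
     */
    static String key(final String source, final String directory,
        final String compiler, final List<String> options) {
        final MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (final NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
        // A separator byte keeps ("ab", "c") distinct from ("a", "bc").
        for (final String part : List.of(source, directory, compiler)) {
            digest.update(part.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0);
        }
        for (final String option : options) {
            digest.update(option.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0);
        }
        // 32 bytes of SHA-256 rendered as 64 zero-padded hex characters.
        return String.format("%064x", new BigInteger(1, digest.digest()));
    }
}
```

Two builds with identical inputs produce the same key, so the second build can look up the stored object file. Changing any input, for example `-O2` to `-O3`, yields a different key and therefore a cache miss.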


`ccache` has two main caching methods:
1) `Direct mode` - hashcode is generated based on the source code.
`Direct mode` compiles the program faster, since the preprocessor step is skipped.
However, the header files are not checked for changes, so the wrong project may be built.
Member

@Yanich96 What are "wrong" and "right" projects here?

Author

@volodya-lombrozo
"the wrong project" - is the project built with not verified header files.
"the right project" - is the project built with verified header files.

Member

@Yanich96 Maybe we can clarify it in the text?

2) `Preprocessor mode` - hashcode is generated based on the result of preprocessor.
`Preprocessor mode` is slower than `direct mode`, but the right project is always built.

`Sccache`, unlike `ccache`, allows you to store cached files not only locally, but also in the cloud.
Member

@Yanich96 Do you mean some particular cloud? ("the")

And it also has fixed some bugs (for example, there is a check of header files, which makes direct mode more accurate).
Member

@Yanich96 You have some problem with tense here (grammar)



### Maven
[Maven](https://maven.apache.org) automates and manages Java-project builds.
Building a project in `Maven` is completed in three
Member

@Yanich96 in three maven LifeCycles Maven. "Maven, maven"

maven [LifeCycles Maven](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html),
which consist of `phases`. `Phases` consist of sets of `goals`.

`Maven` has default `phases` and `goals` for building any projects:

<p align="center">
<img src="/images/defaultPhaseMaven.svg">
</p>

Member

@Yanich96 Maybe we need to write a short summary 1-2 sentences about this type of caching?

Author

@volodya-lombrozo The principle of caching in sccache is the same as in ccache (using Direct and Preprocessor modes), the only difference is in the places where the data is stored.

Member

@Yanich96 I mean ccache and sccache altogether. What is the difference with other types of caching? Why did you choose these tools?

Author

@volodya-lombrozo I wrote above that I looked at well-known used build systems. Isn't this enough?

In `Maven` all phases and goals are executed strictly in order, linearly.
Member

@Yanich96 So, Maven doesn't use caching at all?

Author

@volodya-lombrozo As far as I understand, that Maven can use added extensions from Gradle for caching. Or Maven can rebuild only changed project modules.

Member

@Yanich96 Maven has .m2 folder at least. In this folder it keeps all downloaded dependencies. So it's some sort of caching too.

But in `Maven` there is no build-time caching as such.
`Maven` suggests rebuilding only changed project modules to speed up the build process.

### Gradle
But unlike `Maven`, [Gradle](https://gradle.org) builds projects using a task graph -
Member

@Yanich96 Just "Unlike Maven..."

[Directed Acyclic Graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph),
Member

@Yanich96 Maybe it's better to give a link to a "Gradle task graph" instead? Why do I need to read about DAGs?

in which some tasks can be executed in parallel.
To speed up project builds, `Gradle` employs
[incremental builds](https://docs.gradle.org/current/userguide/incremental_build.html#sec:how_does_it_work).
For an incremental build to work, the tasks used to build the project must have specified
Member

@Yanich96 Could you please simplify this sentence and use simple active voice?

Author

@volodya-lombrozo "The tasks that build the project must have input and output files for an incremental build to work." - is it ok?

Member

@Yanich96 The second sentence clearly explains the idea you are trying to convey here. I would suggest combining these two sentences into a single one, or just removing this sentence. What do you think?

source and output files.
```
task myTask {
inputs.dir 'src/main/java/MyTask.somebody' // Specify the input directory
Member

@Yanich96 MyTask.somebody looks like a file, not a directory.

Author

@volodya-lombrozo I have fixed this example:

task myTask {
    inputs.file 'src/main/java/MyTask.somebody' // Specify the input file
    outputs.file 'build/classes/java/main/MyTask.somebody' // Specify the output file
    
    doLast {
        // Task actions go here
        // This code will only be executed if the inputs or outputs have changed
    }
}

Member

@Yanich96 good

outputs.dir 'build/classes/java/main/MyTask.somebody' // Specify the output directory

doLast {
// Task actions go here
// This code will only be executed if the inputs or outputs have changed
}
}
```
Before executing a task, `Gradle` computes a fingerprint of the paths
and contents of the input files and saves it.
If the task completes successfully, `Gradle` also computes a fingerprint of the output files.
To avoid re-fingerprinting unchanged input files, `Gradle` first checks their last modification time and size
before rebuilding. This allows `Gradle` to reuse the results already obtained when the project is rebuilt.
Additionally, `Gradle` stores fingerprints of previous builds, which enables quick project builds,
for example when switching from one branch to another. This mechanism is known as the
[Build Cache](https://docs.gradle.org/current/userguide/build_cache.html).
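
The check order described above (cheap metadata first, content hash only when needed) can be sketched in plain Java. This is only an illustration of the idea, not Gradle's implementation: `InputSnapshot`, `UpToDateSketch`, and the use of SHA-256 are hypothetical names and choices introduced for this example.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Supplier;

/** Hypothetical record of one input file as seen by the previous build. */
final class InputSnapshot {
    final long size;
    final long modified;
    final String hash;

    InputSnapshot(final long size, final long modified, final String hash) {
        this.size = size;
        this.modified = modified;
        this.hash = hash;
    }
}

/** Hypothetical up-to-date check; not Gradle's real implementation. */
final class UpToDateSketch {

    private UpToDateSketch() {
    }

    /** Content fingerprint: SHA-256 of the bytes as a hex string. */
    static String fingerprint(final byte[] content) {
        try {
            final MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return String.format("%064x", new BigInteger(1, digest.digest(content)));
        } catch (final NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }

    /**
     * Returns true when the input is unchanged since the previous build.
     * The content is only read (and hashed) if size or timestamp differ.
     */
    static boolean unchanged(final InputSnapshot previous, final long size,
        final long modified, final Supplier<byte[]> content) {
        if (previous.size == size && previous.modified == modified) {
            return true; // cheap metadata check, no hashing needed
        }
        return previous.hash.equals(fingerprint(content.get()));
    }
}
```

Passing the content as a `Supplier` keeps the expensive read lazy: when size and timestamp match, the file is never opened. A file that was only touched (new timestamp, same bytes) still counts as unchanged after the hash comparison.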




### EO build cache

EO code is built with `Maven`.
For this purpose, the `eo-maven-plugin` was created,
which contains the goals needed to work with EO code.
As mentioned earlier, `Maven` builds projects in a specific order of phases.
The diagram shows the main phases and their goals for the latest version of the EO compiler:

<p align="center">
<img src="/images/EO.svg">
</p>

In [Picture 3](/images/EO.svg) the goals from the `eo-maven-plugin`
are highlighted in green.


However, the actual work with EO code takes place in `AssembleMojo`.
`AssembleMojo` is the goal consisting of other goals that work with the EO file, as shown in
[Picture 4](/images/AssembleMojo.svg).


<p align="center">
<img src="/images/AssembleMojo.svg">
</p>

Each goal in `AssembleMojo` is a specific compilation step for EO code, and we need to use
caching at each step to speed up the build of the EO program.


In EO version `0.34.0`,
caching relied on two unrelated interfaces, `Footprint` and `Optimization`, used by different `Mojo`,
even though both had the same methods.
The only difference between the interfaces is that `Footprint` checks the EO version of the compiler;
the rest of the checks are exactly the same.


The disadvantages of the initial caching in EO include:
* The cached file is considered up to date only if the compilation time and the time of saving to the cache are equal.
* Verification data is read from a file on the file system.
* Each goal uses its own classes and interfaces for data caching, making the code difficult to extend and read.



To address these disadvantages, the following solutions are proposed:


1) Create a new class `Cache` responsible for data verification, saving to cache and loading from cache.

```
public class Cache {

private List<CacheValidation> validations;

public Cache(final List<CacheValidation> cv) {
this.validations = cv;
}

public Optional<XML> load(final Path source, final Path cache) {...};
Member

@Yanich96 Maybe it's better to make Path cache a field? Since you are using it in all the methods.


public void save(final Path cache, final Scalar<String> program, final Path relative) {...};
}
```


`List<CacheValidation>` is a list of validations. Each validation implements the `CacheValidation` interface.
Different `Mojo` can use different validations.


```
public interface CacheValidation {
Member

@Yanich96 I didn't grasp the idea why we might need this class and why it has exactly this implementation.

boolean validate(final Path source, final Path cache) throws IOException;
}
```

2) To avoid reading file contents from disk, we will work with file paths (`Path`).
The classes `Path` and `Files` have methods to obtain the necessary information.


3) Searching for cached data will use the following conditions:
* The source file and the cached file should have the same file name.
* Each `Mojo` that saves cached files should have a cache directory and a result directory.
* The last modification time of the source file should be earlier than or equal to that of the cached file.
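
The name and timestamp conditions above could be checked using file metadata alone, roughly as follows. This is a minimal sketch under assumptions: `FreshnessSketch` and its helper methods are hypothetical and are not part of `eo-maven-plugin`; only `Path`, `Files`, and `FileTime` from the JDK are real APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Hypothetical validation built on file metadata only. */
final class FreshnessSketch {

    private FreshnessSketch() {
    }

    /** File name without its extension: "program.eo" -> "program". */
    static String baseName(final Path file) {
        final String name = file.getFileName().toString();
        final int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    /** The cache is fresh if the source was modified no later than the cache. */
    static boolean fresh(final FileTime source, final FileTime cache) {
        return source.compareTo(cache) <= 0;
    }

    /** Checks names and timestamps without opening either file. */
    static boolean validate(final Path source, final Path cache) throws IOException {
        return Files.exists(cache)
            && baseName(source).equals(baseName(cache))
            && fresh(Files.getLastModifiedTime(source), Files.getLastModifiedTime(cache));
    }
}
```

Comparing base names rather than full names lets `program.eo` match a cached `program.xml`. The `validate` method has the same signature as the one declared in `CacheValidation`, so a check of this kind could be one entry in the list of validations.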


Suppose there is an EO program `program.eo` that is compiled for the first time.
The cache of each `Mojo` will save the execution results.
If this program is compiled again, these `Mojo` will receive data from the cache,
without wasting time and computer resources on recompilation.
If we change something in the `program.eo` file, the program will have to be recompiled,
since the last modification time of the source file will be later than that of the cached file.
As a result, each `Mojo` overwrites its cache.
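
To make the scenario concrete, here is a hedged sketch of a cache that consults its list of validations before returning cached data. It mirrors the shape of the proposed `Cache` and `CacheValidation`, but the names `CacheSketch` and `ValidationSketch` are hypothetical, and `load` returns plain text instead of `XML` purely to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Same shape as CacheValidation; renamed to mark it as a sketch. */
interface ValidationSketch {
    boolean validate(Path source, Path cache) throws IOException;
}

/** Hypothetical cache that trusts an entry only if every validation passes. */
final class CacheSketch {
    private final List<ValidationSketch> validations;

    CacheSketch(final List<ValidationSketch> validations) {
        this.validations = validations;
    }

    /** Runs all validations; the first failure rejects the cached file. */
    boolean valid(final Path source, final Path cache) {
        try {
            for (final ValidationSketch validation : this.validations) {
                if (!validation.validate(source, cache)) {
                    return false;
                }
            }
            return true;
        } catch (final IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Returns the cached text, or empty when the entry is not valid. */
    Optional<String> load(final Path source, final Path cache) {
        if (!this.valid(source, cache)) {
            return Optional.empty();
        }
        try {
            return Optional.of(
                new String(Files.readAllBytes(cache), StandardCharsets.UTF_8)
            );
        } catch (final IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

On the first run no cached file passes validation, so `load` returns an empty `Optional` and the `Mojo` compiles and saves its result. On the second run the validations pass and the cached text is returned. After `program.eo` is edited, a timestamp validation fails and the `Mojo` recompiles.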


### Conclusion
Member

@Yanich96 I don't think we need such a conclusion in a blog post. It isn't a scientific article. Moreover it doesn't provide any useful information. Kinda "water".

In this blog, we showed that `Maven` builds the EO code using the goals of the `eo-maven-plugin`.
Since the `Maven` goals run linearly, in a strict order,
we only need to check that the last modification time of the source files is not later than that of the cached files.
The cached file and the source file should have the same name
(but not necessarily the same file format, for example `name.eo` and `name.xml`).
This condition is necessary so that the cached file can be found quickly in the file system.
Each `Mojo` participating in caching should have its own cache directory.

















49 changes: 49 additions & 0 deletions images/AssembleMojo.svg