Feat(#56) blog about caching #58

Closed
241 changes: 241 additions & 0 deletions _posts/2024/2024-02-06-about-caching-in-eo.md
@@ -0,0 +1,241 @@
---
layout: post
date: 2024-02-06
title: "Build cache in EO and other build systems"
author: Alekseeva Yana
---


## Introduction
Wasting a lot of time on building a project is a programming problem. At the moment a programmer starts an
Member

@Yanich96

Wasting a lot of time on building a project is a programming problem. At the moment a programmer starts an assembly, he loses focus on a task and spends valuable working

"Empty words". We can remove them without losing any meaning.

Member

@Yanich96

Different build systems use many tools,
helping to assemble a project faster, namely caching, task parallelization, distributed building and much more.

Why do I need this information?

Author

@volodya-lombrozo I written it to start this blog. I will delete these suggestions if they are not necessary.

assembly, he loses focus on a task and spends valuable working time. Different build systems use many tools,
helping to assemble a project faster, namely caching, task parallelization, distributed building and much more.
The subject of this article is caching, because completed tasks caching allows not to spend resources again.
Member

@Yanich96

"The subject of this article is caching."

The other is obvious:

because completed tasks caching allows not to spend resources again

So in [EO](https://github.com/objectionary/eo) caching is used for speeding up programs work.
Member

@Yanich96 Caching speeds up a "build time" or "program execution", not "programs work".

While developing [EO](https://github.com/objectionary/eo) we found caching errors in `eo-maven-plugin`
Member

@Yanich96 Do you have particular links to these issues?

Author

@volodya-lombrozo do you mean that I should to attach a link to the issue where the error occurred?

for EO version `0.34.0`. The error occurred, because using a file name and comparing equality of
Member

@volodya-lombrozo volodya-lombrozo Mar 11, 2024

@Yanich96 It's hard to grasp without a context:

The error occurred, because using a file name and comparing equality of
compilation time and caching time is not the most reliable verification.

Author

@volodya-lombrozo Do you have an example of context? Should it be code or diagram?

compilation time and caching time is not the most reliable verification. Unit tests were written showing that
cache does not work correctly. Also reading a file was necessary for getting a programme name
Member

@Yanich96

"Unit tests were written to demonstrate that the cache does not function correctly. Additionally, reading a file was required to obtain a program name, which slowed down the assembly process."

By the way, what is the "assembly proccess"? A reader might not be familiar with this term.

that slowed down an assembly.
That we came to conclusion that we need caching with a reliable verification which does not require reading a file
Member

@Yanich96

  1. "Came to conclusion" should be "came to the conclusion".
  2. "which does not require reading a file from disk" could be rephrased to "that does not require reading a file from a file system".
  3. "And using cache" should be "And using a cache"

from disk. And using cache should save us enough time for building a project.

The goal of this article is to research caching in frequently used build systems (`ccache`, `Maven`, `Gradle`)
Member

@Yanich96 This sentence might be connected with the previous one: "The subject of this article is caching."

and to create effective caching in [EO](https://github.com/objectionary/eo).
Member

@Yanich96 "create" -> "implement"


<!--more-->
Member

@Yanich96 "More"?


## Build caching of existing build systems
Member

@Yanich96 What about "Caching in Build Systems" ? or " Caching in Other Build Systems".

Author

@volodya-lombrozo I will choose " Caching in Other Build Systems"


### ccache/sccache
Member

@Yanich96 Is it a build system or what? Where is the link? Short description?

In compiled programming languages, building a project takes a long time.
Member

@Yanich96 "long time" ? How much is it? I build all my projects relatively fast.

Compilation takes long because time is spent on preparing, optimizing, and checking the code, among other things.
To speed up builds in compiled languages, `ccache` and `sccache` are used.
Let's look at the compilation scheme using C++ as an example,
to imagine the build process in compiled languages:
Member

@Yanich96 Maybe we can we change "Imagine" to "Visualize"? What do you think?
BTW, "to imagine the build process in compiled languages" looks redundant.


<p align="center">
<img src="/images/ccache.svg">
</p>

1) First, the preprocessor gets the input files: code files and header files.
The preprocessor removes comments from the code, transforms the code in accordance
with macros, and executes other directives starting with the “#” symbol
Member

@Yanich96 No need to describe how exactly a preprocessor works. It's important that we get at the end of this phase.

(such as #include, #define, various directives like #pragma).
The result is a single edited file with human-readable code that can be submitted to the compiler.

Member

@Yanich96 Maybe we need to write a short summary 1-2 sentences about this type of caching?

Author

@volodya-lombrozo The principle of caching in sccache is the same as in ccache (using Direct and Preprocessor modes), the only difference is in the places where the data is stored.

Member

@Yanich96 I mean ccache and sccache altogether. What is the difference with other types of caching? Why did you choose these tools?

Author

@volodya-lombrozo I wrote above that I looked at well-known used build systems. Isn't this enough?


2) The compiler receives the finished code file and converts it into machine code, presented in an object file.
Member

@Yanich96 "finished" code? What does it mean?

At the compilation stage, parsing occurs, which checks whether the code matches
Member

@Yanich96 Wdyt about "parsing checks ..."

rules of a specific programming language. Next, the code is translated into machine code according to the rules.
At the end of its work, the compiler optimizes the resulting machine code and produces an object file.
Member

@Yanich96 "At the end of its work" -> "At the end"

To speed up compilation, different files of the same project are compiled in parallel,
that is, we receive several object files at once.
Member

@Yanich96 This is redundant:

that is, we receive several object files at once


3) After all received project object files are passed to the linker.
Member

@Yanich96 What does it mean:

After all received project object files are passed to the linker.

Is it "After all, received project object files are passed to the linker."
or "After, all received project object files are passed to the linker." ?
Why "received"? Which "project" do you mean?

Maybe it's better just use "Then, object files are passed to the linker.", or better:
"Then linker <...do something...> with object files" (active voice)?

Linker is a program that combines program components, written in assembly language or a high-level programming language,
Member

@Yanich96 I though that Linker combines object files?

to an executable file or library. The result of the linker is an executable .exe file.


As a result, in compiled languages, multiple files are simultaneously and independently converted into machine code at the compilation stage.
This machine code is then combined into one executable file.


`ccache` has two main caching methods:
1) `Direct mode` - hashcode is generated based on the source code.
Member

@Yanich96 Which "hashcode" do you mean? You gave the definition below, that paragraph positioning confuses a lot. I have to skip this part and then return to it after.

2) `Preprocessor mode` - hashcode is generated based on the result of preprocessor.

The hashcode is computed from the file contents, the directory, compiler information, compilation time, and the extensions
used by the compiler. The compressed machine-code file is then placed in the cache under the resulting key.

`Direct mode` compiles the program faster, since the preprocessor step is skipped.
Member

@Yanich96 You explains two modes by using this template:

1) Direct mode - hashcode is generated based on the source code.
2) Preprocessor mode - hashcode is generated based on the result of preprocessor.
3) Direct mode compiles the program faster...
4) Preprocessor mode is slower...

Looks strange, maybe it's better to explain one mode and the move to the another?

1) Direct mode - hashcode is generated based on the source code.
3) Direct mode compiles the program faster...
2) Preprocessor mode - hashcode is generated based on the result of preprocessor.
4) Preprocessor mode is slower..

However, header files are not checked for changes, so the project may be built incorrectly.
`Preprocessor mode` is slower than `direct mode`, but the project is always built correctly.

`sccache`, unlike `ccache`, can store the cache not only locally but also in the cloud,
and it also fixes some bugs (for example, it checks header files, which makes direct mode more accurate).
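
The keying idea described above can be sketched in Java. This is only an illustration, not the actual algorithm of `ccache` or `sccache`: the class `CacheKey`, its method, and the choice of SHA-256 are assumptions made for this example. The same function can be fed either the raw source text (a direct-style key) or the text produced by the preprocessor (a preprocessor-style key); in the second case a change in an included header changes the key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: builds a cache key from compiler input.
// Not ccache's real algorithm; SHA-256 is chosen only for illustration.
final class CacheKey {

    // Hash the text to be compiled together with the compiler identity and flags,
    // so that changing any of them yields a different key.
    static String of(String text, String compiler, String flags) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String part : new String[] {compiler, flags, text}) {
                byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
                // Prefix each part with its length so part boundaries are unambiguous.
                digest.update(Integer.toString(bytes.length).getBytes(StandardCharsets.UTF_8));
                digest.update((byte) ':');
                digest.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }

    public static void main(String[] args) {
        String source = "#include \"util.h\"\nint main() { return answer(); }\n";
        // Direct-style key: only the source text is hashed.
        String direct = CacheKey.of(source, "g++", "-O2");
        // Preprocessor-style keys: the text after header expansion is hashed,
        // so two versions of util.h lead to two different inputs.
        String expandedOld = "int answer() { return 1; }\nint main() { return answer(); }\n";
        String expandedNew = "int answer() { return 2; }\nint main() { return answer(); }\n";
        System.out.println("direct key: " + direct);
        System.out.println("keys differ after header change: "
            + !CacheKey.of(expandedOld, "g++", "-O2").equals(CacheKey.of(expandedNew, "g++", "-O2")));
    }
}
```

The compiled object file would then be stored under this key, and a later build with the same key could reuse it instead of recompiling.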


### Maven
`Maven` automates and manages the build of Java projects. A `Maven` build runs through three
[Maven lifecycles](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html),
each of which consists of `phases`. `Phases`, in turn, consist of sets of `goals`.

`Maven` has default `phases` and `goals` that are used to build any project:

<p align="center">
<img src="/images/defaultPhaseMaven.svg">
</p>

In `Maven` all phases and goals are executed strictly in order, linearly.
Member

@Yanich96 So, Maven doesn't use caching at all?

Author

@volodya-lombrozo As far as I understand, that Maven can use added extensions from Gradle for caching. Or Maven can rebuild only changed project modules.

Member

@Yanich96 Maven has .m2 folder at least. In this folder it keeps all downloaded dependencies. So it's some sort of caching too.


But in `Maven` there is no build-time caching as such.
`Maven` suggests rebuilding only changed project modules to speed up the build process.

### Gradle
`Gradle`, like `Maven`, builds a project according to a
[build lifecycle](https://docs.gradle.org/current/userguide/build_lifecycle.html), which consists of phases.
Unlike `Maven`, however, `Gradle` builds projects using a task graph, a
[Directed Acyclic Graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph),
Member

@Yanich96 Maybe it's better to give a link to a "Gradle task graph" instead? Why do I need to read about DAGs?

in which some tasks can be executed in parallel.
To speed up project builds, `Gradle` uses
[incremental builds](https://docs.gradle.org/current/userguide/incremental_build.html#sec:how_does_it_work).
For an incremental build to work, the tasks that build the project must
specify their input and output files.
```
task myTask {
inputs.dir 'src/main/java/MyTask.somebody' // Specify the input directory
Member

@Yanich96 MyTask.somebody looks like a file, not a directory.

Author

@volodya-lombrozo I have fixed this example:

task myTask {
    inputs.file 'src/main/java/MyTask.somebody' // Specify the input file
    outputs.file 'build/classes/java/main/MyTask.somebody' // Specify the output file
    
    doLast {
        // Task actions go here
        // This code will only be executed if the inputs or outputs have changed
    }
}

Member

@Yanich96 good

outputs.dir 'build/classes/java/main/MyTask.somebody' // Specify the output directory

doLast {
// Task actions go here
// This code will only be executed if the inputs or outputs have changed
}
}
```
Before executing a task, `Gradle` takes a fingerprint of the paths
and contents of the input files and saves it.
If the task completes successfully, `Gradle` also takes a fingerprint of the output files.
To avoid re-fingerprinting unchanged input files, `Gradle` checks their last modification time and size
before rebuilding. Thus, when the project is rebuilt, some or all of the tasks may be
skipped, and the results already obtained are reused.
`Gradle` also stores fingerprints of previous builds (the `Build Cache`) so that projects can be built quickly,
for example when switching from one branch to another.
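
The "cheap check first, hash only if needed" idea described above can be sketched in Java. This is not Gradle's implementation: the class `Fingerprints`, its nested `Snapshot` type, and the use of SHA-256 are assumptions made for illustration. The sketch reuses a stored content hash while the file size and last modification time are unchanged, and reads the file contents only otherwise.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of fingerprinting with a metadata shortcut.
final class Fingerprints {

    // What is remembered about a file from the previous build.
    static final class Snapshot {
        final long size;
        final long modifiedMillis;
        final String hash;

        Snapshot(long size, long modifiedMillis, String hash) {
            this.size = size;
            this.modifiedMillis = modifiedMillis;
            this.hash = hash;
        }
    }

    // True when there is no previous snapshot, or size or modification time differ,
    // so the stored hash cannot be trusted.
    static boolean needsRehash(Snapshot previous, long size, long modifiedMillis) {
        return previous == null
            || previous.size != size
            || previous.modifiedMillis != modifiedMillis;
    }

    // Returns an up-to-date snapshot, reading the file contents only when necessary.
    static Snapshot refresh(Path file, Snapshot previous) throws IOException {
        long size = Files.size(file);
        long modified = Files.getLastModifiedTime(file).toMillis();
        if (!needsRehash(previous, size, modified)) {
            return previous;
        }
        return new Snapshot(size, modified, sha256(Files.readAllBytes(file)));
    }

    // Hex-encoded SHA-256 of the given bytes.
    static String sha256(byte[] content) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fingerprint", ".txt");
        Files.writeString(file, "first");
        Snapshot first = refresh(file, null);
        // The file was not touched, so the stored snapshot is returned as-is.
        Snapshot again = refresh(file, first);
        System.out.println("snapshot reused: " + (first == again));
        Files.delete(file);
    }
}
```

A task whose input snapshots all come back unchanged could then be skipped, which is the effect the paragraph above describes.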




### EO build cache

EO code is compiled using the `Maven` build system.
For this purpose, the `eo-maven-plugin` plugin was written,
which contains the goals necessary for working with EO code.
As described above, `Maven` builds projects in a fixed order of phases.
The diagram shows the main phases and their goals for the EO version of the compiler (specify version):

<p align="center">
<img src="/images/EO.svg">
</p>

In [Picture 3](/images/EO.svg) the goals from the `eo-maven-plugin`
are highlighted in green.

Member

@Yanich96 What is the conclusion? Why did you mention Maven? Does this caching similar to Grade? to ccache? What is the difference?


However, the actual work with EO code takes place in `AssembleMojo`.
`AssembleMojo` is a goal composed of other goals that work with the EO file
(see [Picture 4](/images/AssembleMojo.svg)).


<p align="center">
<img src="/images/AssembleMojo.svg">
</p>

Each goal in `AssembleMojo` is a specific compilation step for EO code, and we need to use
caching at each step to speed up the assembly of the EO program.

In EO version `0.34.0`,
caching for different `Mojo` was implemented through two unrelated interfaces, `Footprint` and `Optimization`,
which mostly used the same methods.
The only difference between the interfaces is that `Footprint` also checks the EO version of the compiler,
while the rest of the checks are exactly the same.


Currently, the goals `ParseMojo`, `OptimizeMojo` and `ShakeMojo`, in which caching can be applied,
each have a results directory and a cache directory.


The disadvantages of the initial caching in EO:
* the compilation time and the time of saving to the cache must be equal.
The problem with this check is that the moment of compilation and the moment of saving to the cache must coincide exactly.
* verification data is read from a file on disk, which is a slow and expensive operation.
* each goal uses its own classes and interfaces for data caching.
This makes the code hard to extend and to read.


Therefore, our aim is to create a single class responsible for caching data
and loading the necessary data from the cache, which can be used by any `Mojo` from the `eo-maven-plugin`.


How we plan to fix these disadvantages:
1) Create a new class `Cache` that will be responsible for data verification, saving to the cache and loading from the cache.

```
public class Cache {

private List<CacheValidation> validations;

public Cache(final List<CacheValidation> cv) {
this.validations = cv;
}

public Optional<XML> load(final Path source, final Path cache) {...};
Member

@Yanich96 Maybe it's better to make Path cache a field? Since you are using it in all the methods.


public void save(final Path cache, final Scalar<String> program, final Path relative) {...};
}
```


`List<CacheValidation>` is a list of validations, each of which implements the `CacheValidation` interface.
Different validations can be applied to different `Mojo`.


```
public interface CacheValidation {
Member

@Yanich96 I didn't grasp the idea why we might need this class and why it has exactly this implementation.

boolean validate(final Path source, final Path cache) throws IOException;
}
```

2) To avoid reading file contents from disk, we will work with file paths (`Path`).
The `Path` and `Files` classes provide methods for obtaining the necessary information.


3) The cached data will be considered up to date if the last modification time
of the source file is earlier than or equal to the one saved in the cache.

These solutions will speed up compilation in the build system `Maven`.
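
The third point can be sketched in Java as one possible implementation of the `CacheValidation` interface shown above. The class names `NotNewerThanCache` and `Validations`, and the rule that every validation in the list must pass, are assumptions made for this sketch and are not taken from the `eo-maven-plugin` code. Only file metadata is consulted, so the file contents are never read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.List;

// Interface with the same method signature as in the post.
interface CacheValidation {
    boolean validate(Path source, Path cache) throws IOException;
}

// Hypothetical validation: the cached file is usable when the source
// was last modified no later than the cached file.
final class NotNewerThanCache implements CacheValidation {

    // Pure comparison of two timestamps, kept separate so it is easy to test.
    static boolean isFresh(FileTime sourceModified, FileTime cacheModified) {
        return sourceModified.compareTo(cacheModified) <= 0;
    }

    @Override
    public boolean validate(Path source, Path cache) throws IOException {
        // A missing cache file is never valid; otherwise compare modification times.
        return Files.exists(cache)
            && isFresh(Files.getLastModifiedTime(source), Files.getLastModifiedTime(cache));
    }
}

// Hypothetical helper: a cache entry is used only if every validation passes.
final class Validations {

    static boolean allPass(List<CacheValidation> validations, Path source, Path cache)
        throws IOException {
        for (CacheValidation validation : validations) {
            if (!validation.validate(source, cache)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for a source program and its cached result.
        Path source = Files.createTempFile("program", ".eo");
        Path cache = Files.createTempFile("program", ".xml");
        // Make the cached file strictly newer than the source.
        Files.setLastModifiedTime(source, FileTime.fromMillis(1_000L));
        Files.setLastModifiedTime(cache, FileTime.fromMillis(2_000L));
        List<CacheValidation> validations = List.of(new NotNewerThanCache());
        System.out.println("cache usable: " + allPass(validations, source, cache));
        Files.delete(source);
        Files.delete(cache);
    }
}
```

Keeping each check behind the same interface means a `Mojo` that needs an extra condition, such as a compiler version check, could add another validation to the list without changing the cache class.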


### Conclusion
Member

@Yanich96 I don't think we need such a conclusion in a blog post. It isn't a scientific article. Moreover it doesn't provide any useful information. Kinda "water".

Suppose there is an EO program `program.eo` that is run for the first time.
At each `Mojo` stage, the execution results are saved to the cache of that `Mojo`.
If the program is run again, these `Mojo` take the data from the cache,
without spending time and computing resources on recompilation.
If we change something in the `program.eo` file, the program has to be recompiled,
since the last modification time of the source file is now later than the one stored in the cache.
After the `Mojo` finish their work, the cache is overwritten.

















49 changes: 49 additions & 0 deletions images/AssembleMojo.svg