feat: directed acyclic graph (DAG) #1563

pd93 · 2024-03-25T21:05:44Z

This PR overhauls how Task reads, parses and merges Taskfiles.

An example project

The following document will get quite technical, so I'm going to use an example to help explain. Imagine a project with the following Taskfile structure:

├── services
│   ├── serviceA
│   │   └── Taskfile.yml
│   ├── serviceB
│   │   └── Taskfile.yml
│   └── serviceC
│       └── Taskfile.yml
├── taskfiles
│   └── utils.yml
└── Taskfile.yml

We have three "services", each with their own Taskfile, and a root Taskfile that includes all of them. This allows us to call task from the project root and get access to all the tasks in each of the services. In addition to this, we also have a taskfiles directory with a utils.yml file that contains some common tasks that are shared between the services and the tasks in the root file.

This is a fairly typical, if not simple, project structure for a Taskfile-based project and on the surface, Task handles it pretty well. However, as we are about to see, under the hood, Task does not process this very efficiently, and when you add more services and more shared utilities, or run on hardware with limited processing capability, the inefficiencies can become problematic.

The current process

For the sake of brevity, I'm not going to write up the contents of each file. Instead, I am going to illustrate the includes by drawing a graph:

flowchart TD
    R(Taskfile.yml)
    A(services/serviceA/Taskfile.yml)
    B(services/serviceB/Taskfile.yml)
    C(services/serviceC/Taskfile.yml)
    U(taskfiles/utils.yml)
    R --> A
    R --> B
    R --> C
    R --> U
    A --> U
    B --> U
    C --> U

You can see that our root Taskfile (Taskfile.yml) includes each of the service Taskfiles and the utilities Taskfile. Each of the service Taskfiles then also includes the utilities Taskfile. The total setup consists of 5 files.

When Task is called, the first thing it does is look for the root Taskfile (a.k.a. the "entrypoint"). The contents of this file are then read into memory and parsed using the taskfile/ast package. So far, so good.

Once the entrypoint has been parsed, we can now start to evaluate any includes that may be specified in the file. In this example, we have 4 includes:

services/serviceA/Taskfile.yml
services/serviceB/Taskfile.yml
services/serviceC/Taskfile.yml
taskfiles/utils.yml

Now, for each of these Taskfiles, we recursively repeat the process of reading the file into memory and parsing it. Once all the includes have been followed, we unwind the stack and merge the ASTs together. This results in one giant Taskfile that contains all the tasks from all Taskfiles in the project.

If you haven't spotted it yet, there is a major problem with this approach. Let's jump into the code and add a log line at the beginning of each call to the function that reads a Taskfile, printing the name of the file being read. Here are the results:

Taskfile.yml
services/serviceA/Taskfile.yml
taskfiles/utils.yml
services/serviceB/Taskfile.yml
taskfiles/utils.yml
services/serviceC/Taskfile.yml
taskfiles/utils.yml
taskfiles/utils.yml

As you can see, despite only having 5 files in the project, Task has performed 8 file reads. The taskfiles/utils.yml file alone has been read 4 times! The total number of reads equals the total number of includes across all your Taskfiles (the number of arrowheads in the graph diagram), plus one for the entrypoint.
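To make the counting concrete, here is a minimal sketch of the naive recursive approach. It is not Task's actual reader: the `includes` map and the `naiveRead` function are hypothetical stand-ins that only simulate "reading" each file of the example project.

```go
package main

import "fmt"

// includes is a hypothetical stand-in for the example project's include graph:
// each key is a Taskfile and each value lists the Taskfiles it includes.
var includes = map[string][]string{
	"Taskfile.yml": {
		"services/serviceA/Taskfile.yml",
		"services/serviceB/Taskfile.yml",
		"services/serviceC/Taskfile.yml",
		"taskfiles/utils.yml",
	},
	"services/serviceA/Taskfile.yml": {"taskfiles/utils.yml"},
	"services/serviceB/Taskfile.yml": {"taskfiles/utils.yml"},
	"services/serviceC/Taskfile.yml": {"taskfiles/utils.yml"},
	"taskfiles/utils.yml":            nil,
}

// naiveRead simulates reading a Taskfile and then recursing into each of its
// includes without remembering what has already been read. It returns the
// files in the order they were "read".
func naiveRead(path string) []string {
	reads := []string{path}
	for _, inc := range includes[path] {
		reads = append(reads, naiveRead(inc)...)
	}
	return reads
}

func main() {
	reads := naiveRead("Taskfile.yml")
	for _, r := range reads {
		fmt.Println(r)
	}
	fmt.Printf("files: %d, reads: %d\n", len(includes), len(reads))
}
```

Every edge in the include graph triggers one call to `naiveRead`, plus one call for the entrypoint, which is why the number of reads grows with the number of includes rather than with the number of files.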

Analysing the problem

There are many simple solutions to the problem described above. For example, we could just add a simple file cache so that a file is not read multiple times. However, I think that the problems with the current approach go deeper than just the inefficiency of reading files multiple times.

I briefly mentioned that the ASTs are merged together before any tasks are run. There are a few disadvantages to this approach:

- Merging is very memory intensive and involves deep copying structures between Taskfiles.
- The code is complex and difficult to maintain.
  - This increases the chance of bugs.
  - It also raises the barrier to entry for contributors.
- You lose information about the location of data when merging.
  - This includes the scopes of variables and tasks.
  - Currently, we attempt to save some of this information by storing it in the merged Taskfile, but this just adds to the complexity/maintenance cost of the code.
- Most of the time, you don't actually need the entire Taskfile tree to be read.
  - Often, we just want to run a task in the root Taskfile. Why bother reading and merging all the other Taskfiles?

It doesn't matter how good our file caching is if we end up calling a merge method for each and every include. Finding a way to store the entire Taskfile tree in memory without merging them together would be a much better solution.

A new approach

Taskfiles are no different to other programming languages or config-based tools that have includes or imports. We usually refer to the structures that store dependencies as graphs (as per the example diagram). Specifically, these are dependency graphs, which are a type of directed acyclic graph (or DAG for short).

DAGs have a number of properties that make them very useful:

- As the name suggests, they are acyclic, which means we get cycle prevention for free. This solves issues like this, which occur because of our naive approach to cycle detection.
- They give us file caching for free. A DAG allows us to store/cache the AST of a file once it has been read. Since the key of the node in the graph is the file's location, the AST will be retrieved from the graph the next time it's needed rather than being read from the filesystem again.
- DAGs are a well-understood data structure and there are many algorithms and libraries available for working with them.
- They allow us to easily output and visualise the Taskfile tree for debugging or illustration purposes.
- Most importantly, they allow us to store the entire Taskfile tree in memory without merging the files together. This means that we can solve all of the issues mentioned in the previous section.

This PR

An important point to make - This PR does not remove AST merging. The DAG implementation has taken a lot of work to complete. The first DAG changes were made by me nearly a year ago and I have been slowly cherry-picking refactors and improvements from the dag branch into main. This PR represents the final piece of work to bring the DAG implementation to release.

However, this is still just a stepping stone. Currently, the DAG is still resolved into a single AST before tasks are run. The next step would be to refactor the code that fetches/compiles tasks to run off the DAG instead of the merged AST. From here, there are a number of things that we can do...
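As a rough illustration of what resolving a graph into a single flat structure involves, the sketch below walks a small include graph and collects every task under its namespaced name. It is a simplification and not the merger in this PR: the `Taskfile` and `Include` types, the `flatten` function, and the task names (`default`, `build`, `lint`) are hypothetical, and the walk assumes the graph is already acyclic.

```go
package main

import (
	"fmt"
	"sort"
)

// Include is a hypothetical edge: the namespace used by the including file and
// the location of the included file.
type Include struct {
	Namespace string
	Location  string
}

// Taskfile is a hypothetical stand-in for a parsed Taskfile held in the graph.
type Taskfile struct {
	Tasks    []string
	Includes []Include
}

// flatten walks the graph from location and returns every reachable task name,
// prefixed with the namespaces on the path that leads to it. It assumes the
// graph has no cycles.
func flatten(graph map[string]Taskfile, location, prefix string) []string {
	tf := graph[location]
	var names []string
	for _, t := range tf.Tasks {
		names = append(names, prefix+t)
	}
	for _, inc := range tf.Includes {
		names = append(names, flatten(graph, inc.Location, prefix+inc.Namespace+":")...)
	}
	return names
}

func main() {
	// Each file is stored once, even though utils.yml is included twice.
	graph := map[string]Taskfile{
		"Taskfile.yml": {
			Tasks: []string{"default"},
			Includes: []Include{
				{"serviceA", "services/serviceA/Taskfile.yml"},
				{"utils", "taskfiles/utils.yml"},
			},
		},
		"services/serviceA/Taskfile.yml": {
			Tasks:    []string{"build"},
			Includes: []Include{{"utils", "taskfiles/utils.yml"}},
		},
		"taskfiles/utils.yml": {Tasks: []string{"lint"}},
	}
	names := flatten(graph, "Taskfile.yml", "")
	sort.Strings(names)
	for _, n := range names {
		fmt.Println(n)
	}
}
```

The graph holds one entry per file while the flattened result repeats the shared tasks under each namespace, which hints at why working from the graph directly is attractive.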

Future work

This work enables or makes easier a number of future improvements:

- Fetching/compiling tasks from the DAG instead of the merged AST.
- Removing merging entirely.
- Scoped variables (Taskfile Scoped Variables #1030).
- Lazy loading of variables (Idea: Consider having functions to access variables and envs #1065, Lazy global dynamic variable #1240).
- Taskfile tree visualisation.
- DAG for individual tasks?
  - This might be helpful for some of the other features mentioned above, or it might just be useful for illustration purposes. This was briefly discussed in Graph tasks and dependencies #1234.

andreynering

I know it took a lot of time. Thanks for your perseverance! This is going to be a great improvement. 👏 👏 👏

marco-m · 2024-05-09T10:03:56Z

@pd93 kudos, great work!

ReillyBrogan · 2024-05-09T20:34:42Z

Is #852 resolved with this?

pd93 · 2024-05-13T10:17:03Z

> Is #852 resolved with this?

@ReillyBrogan Unfortunately not. I've responded in #852 to keep everything together.

pd93 added 14 commits March 25, 2024 21:02

feat: dag reader

4a1e479

feat: better error handling for duplicate edges and fixed tests

4973b3b

feat: better taskfile cycle error handling

a586753

fix: optional includes

01929bf

fix: includes interpolation test

841f8fb

fix: missing task locations

446766d

fix: include_with_vars test included the same file multiple times

aa531c7

feat: merger

6272ecf

fix: advanced import resolving dynamic variables incorrectly

4b837d7

fix: bug with merge code

034c05d

fix: advanced import operates on including file instead of included file

7a0d206

feat: merge concurrency

f80d426

fix: linting issues

e1eaa42

chore: remove code that outputs the graphviz file

7493fa8

pd93 marked this pull request as ready for review March 25, 2024 21:13

pd93 requested a review from andreynering March 25, 2024 21:14

andreynering approved these changes Apr 9, 2024

pd93 merged commit 4024b4f into main Apr 9, 2024
13 checks passed

pd93 deleted the dag branch April 9, 2024 11:37

pd93 added a commit that referenced this pull request Apr 9, 2024

chore: changelog for #1563

d01b3c8

pd93 mentioned this pull request Apr 16, 2024

Cannot override global variables in included Taskfiles since 3.36.0 #1588

Closed

vmaerten mentioned this pull request May 5, 2024

fix(remote): do not display prompt if it's empty #1634

Merged

pd93 mentioned this pull request May 13, 2024

Make run:once work for dependencies shared amongst included taskfiles #852

Closed

jlucktay mentioned this pull request Aug 6, 2024

Honour the global-level 'run: once' setting in an included Taskfile #1738

Open
