# Workflow orchestration
This document gives a gentle introduction to workflow orchestration and its main elements, using Apache Airflow as an example.

## What is a workflow?
A *workflow* is a set of actionable tasks that need to be run in a specific order.

```{figure} ../img/pipeline-workflow.png
---
width: 75%
name: pipeline-workflow
---
A workflow can be composed of multiple data pipelines {cite:p}`harenslak2021data`.
```

## Apache Airflow
Originating as an internal project at Airbnb in 2014, *Airflow* aims to
- solve a "common" problem of designing and scheduling jobs,
- define jobs that have simple configurations and their own schedules,
- allow customization of jobs.

### Directed acyclic graph (DAG)
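The sequence of tasks is represented by a directed acyclic graph (DAG): nodes are tasks, directed edges are dependencies between tasks, and the absence of cycles guarantees that a valid execution order exists.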

```{figure} ../img/dag.png
---
width: 75%
name: dag
---
An example and a counter-example of a DAG.
```
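
The acyclicity requirement can be illustrated with a minimal sketch in plain Python, using the standard-library `graphlib` module rather than Airflow itself. The task names `extract`, `transform`, and `load` are hypothetical and chosen only for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
valid = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# A DAG always admits an order in which every task runs after its dependencies.
order = list(TopologicalSorter(valid).static_order())
print(order)

# Making "extract" depend on "load" closes a loop, so the graph is no longer a DAG.
cyclic = {**valid, "extract": {"load"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

In the cyclic graph no task can start first, which is why an orchestrator cannot accept such a graph as a workflow definition.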

### Tasks
Tasks are the basic unit of execution. Tasks can
- execute arbitrary code using operators,
- execute code either locally or remotely.

### Operators
Operators are templates for tasks: each one wraps a common kind of work, so that a task is defined by supplying parameters only. Examples include:
- `BashOperator`
- `PythonOperator`
- `DummyOperator` (renamed `EmptyOperator` in recent Airflow releases)
- `SparkSubmitOperator`
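
The idea of an operator as a template can be sketched without Airflow. The two classes below are toy stand-ins written for this illustration and are not Airflow's implementation; only the parameter names `task_id`, `bash_command`, and `python_callable` mirror those of Airflow's `BashOperator` and `PythonOperator`.

```python
import subprocess
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyBashOperator:
    """Toy template for tasks that run a shell command."""
    task_id: str
    bash_command: str

    def execute(self) -> str:
        # Run the command in a shell and return its trimmed standard output.
        result = subprocess.run(
            self.bash_command, shell=True, check=True,
            capture_output=True, text=True,
        )
        return result.stdout.strip()


@dataclass
class ToyPythonOperator:
    """Toy template for tasks that call a Python function."""
    task_id: str
    python_callable: Callable[[], Any]

    def execute(self) -> Any:
        return self.python_callable()


# Filling in a template's parameters yields a concrete task.
say_hello = ToyBashOperator(task_id="say_hello", bash_command="echo hello")
add = ToyPythonOperator(task_id="add", python_callable=lambda: 1 + 2)

results = {task.task_id: task.execute() for task in (say_hello, add)}
print(results)
```

Both tasks expose the same `execute` method, so whatever runs them does not need to know which kind of work each one wraps.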

```{figure} ../img/airflow-dag.png
---
width: 75%
name: airflow-dag
---
DAG schema in Airflow.
```

### Schedulers and Executors
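Schedulers define when a DAG should run and how frequently. Executors *listen* to the scheduler for a notification before executing any tasks. In the DAG definition below, `schedule_interval` sets the frequency and `start_date` sets the earliest date to schedule from.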

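```python
# Written against the Airflow 2.x API.
from airflow import DAG
from airflow.utils.dates import days_ago

# Defaults passed to every task in the DAG; the value here is a placeholder.
args = {"owner": "airflow"}

dag = DAG(
    dag_id="daily_active_users_reporting_job",
    start_date=days_ago(2),      # midnight, two days before today
    default_args=args,
    tags=["XXX", "YYY"],
    schedule_interval="@daily",  # run once a day, at midnight
)
```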
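
To make the division of labour concrete, here is a simplified model in plain Python of a daily schedule that starts two days in the past. It is a sketch, not Airflow's scheduler: it assumes that a run is triggered once its interval has fully elapsed and that missed intervals are caught up. The function names `due_runs` and `execute` are hypothetical.

```python
from datetime import datetime, timedelta


def due_runs(start_date: datetime, now: datetime, interval: timedelta) -> list[datetime]:
    """Scheduler role: return the start of every interval that has fully elapsed by `now`."""
    runs = []
    interval_start = start_date
    while interval_start + interval <= now:
        runs.append(interval_start)
        interval_start += interval
    return runs


def execute(run_date: datetime) -> str:
    """Executor role: stand-in that 'runs' the job for one interval."""
    return f"ran job for {run_date:%Y-%m-%d}"


now = datetime(2024, 1, 3, 9, 30)
start_date = datetime(2024, 1, 1)  # midnight, two days before `now`

# The scheduler decides which runs are due; the executor carries them out.
log = [execute(d) for d in due_runs(start_date, now, timedelta(days=1))]
print(log)
```

Under these assumptions two runs are due, one for each full day that has passed since `start_date`; the interval that began on the current day is still open and is not run yet.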

```{figure} ../img/dag-python-to-airflow.png
---
width: 75%
name: dag-python-to-airflow
---
A DAG written as Python configuration informs Airflow of all tasks and the order in which they run.
```

```{bibliography}
:filter: docname in docnames
```