# Apache Airflow



* https://airflow.apache.org/
* https://livebook.manning.com/book/data-pipelines-with-apache-airflow/chapter-1/v-5/

An **orchestrator** (or **workflow manager**) lets us create **data pipelines** and describe every step of a data flow: from **where** to where, **what**, **when**, and **how**. It can run multiple tasks in any sequence, not only classical **ETL**.

**Apache Airflow** is one of the most popular workflow management systems for data pipelines. It is a platform to programmatically author, schedule, and monitor workflows as **directed acyclic graphs** (**DAGs**) of tasks. The Airflow scheduler executes our tasks on an array of workers while following the specified dependencies. **Workflows** are defined as code, which makes them more maintainable, versionable, testable, and collaborative.
* Airflow command line utilities make performing complex surgeries on DAGs a snap.
* Airflow user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

## Directed Acyclic Graphs (DAGs)

**DAGs** are a special subset of graphs in which the **edges** between **nodes** have a specific direction, and no **cycles** exist. "No cycles exist" means that no node can reach itself again by following the edges.

<img src='imgs/DAG.png' alt='DAG' width=40%>

### Nodes
A step or task in the data pipeline process.

### Edges
The dependencies or other relationships between nodes.

<img src='imgs/dag_pipeline.png' alt='dag_pipeline' width=60%>


A DAG is a collection of Tasks that are ordered to reflect the functionality, requirements and dependencies of the workflow.
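
The "no cycles" rule can be checked mechanically. The sketch below is plain Python using the standard library's `graphlib` (Python 3.9+), not Airflow's own validation code. The task names `extract`, `transform`, and `load` are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task (node); its value is the set of tasks it depends on (edges).
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(graph):
    """Return one valid task order, or raise CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(pipeline)
print(order)  # ['extract', 'transform', 'load']

# Making "extract" depend on "load" closes a loop, so no valid order exists.
cyclic = {**pipeline, "extract": {"load"}}
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```

A valid order exists only when the graph is acyclic, which is why a workflow with a cycle could never finish scheduling.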


#### Are there real world cases where a data pipeline is not DAG?

It is possible to model a data pipeline that is not a DAG, meaning that it contains a cycle within the process. However, the vast majority of use cases for data pipelines can be described as a directed acyclic graph (DAG). This makes the code more understandable and maintainable.

#### Can we have two different pipelines for the same data and can we merge them back together?

Yes. It's common for a data pipeline to take the same dataset, run two different processes to analyze it, then merge the results of those two processes back together.
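
As a rough illustration of this branch-and-merge shape, here is a plain Python sketch with no Airflow involved. The function names and the sample readings are hypothetical.

```python
from statistics import mean


def load_readings():
    # Hypothetical input data, used only for illustration.
    return [3, 8, 5, 12]


def average(readings):
    # Branch 1: summarize the dataset by its mean.
    return mean(readings)


def peak(readings):
    # Branch 2: summarize the same dataset by its maximum.
    return max(readings)


def merge(avg, top):
    # Join step: depends on both branches having finished.
    return {"average": avg, "peak": top}


readings = load_readings()
report = merge(average(readings), peak(readings))
print(report)
```

In DAG terms, `load_readings` fans out to two independent nodes, and `merge` is the node with two incoming edges.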

## Components of Airflow

<img src='imgs/airflow_main_components.png' alt='airflow_main_components' width=50%>

* **Scheduler** orchestrates the execution of jobs on a trigger or schedule interval. The Scheduler chooses how to prioritize the running and execution of tasks within the system.


* **Executor**, also known as the **Work Queue**, is used by the scheduler in most Airflow installations to deliver tasks that need to be run to the Workers. Executors are the "workstation for tasks": they act as a middleman that handles resource allocation and distributes tasks. Airflow offers many executor options:
    * Sequential Executor
    * Debug Executor
    * Local Executor (Single Node Arch)
    * Dask Executor
    * Celery Executor
    * Kubernetes Executor
    * Scaling Out with Mesos (community contributed)
    
    
* **Worker** processes execute the operations defined in each DAG. In most Airflow installations, workers pull from the work queue when they are ready to process a task. When a worker completes a task, it tries to pull more work from the queue until none remains. When new work arrives in the queue, the worker begins to process it.


* **Database** saves credentials, connections, history, and configuration. The database, often referred to as the **metadata database**, also stores the state of all tasks in the system. Airflow components interact with the database through SQLAlchemy, a Python ORM.


* **Web Interface** provides a control dashboard for users and maintainers. It allows users to perform tasks such as stopping and starting DAGs, retrying failed tasks, and configuring credentials. The web interface visualizes the DAGs parsed by the scheduler and is built using the Flask web-development microframework.

### Order of Operations for an Airflow DAG

<img src='imgs/af_working.png' alt='af_working' width=55%>

* The Airflow Scheduler starts DAGs based on time or external triggers.

* Once a DAG is started, the Scheduler looks at the steps within the DAG and determines which steps can run by looking at their dependencies.

* The Scheduler places runnable steps in the queue.

* Workers pick up those tasks and run them. 

* Tasks transition from one state to another while a DAG runs. The diagram below shows the transitions:

<img src='imgs/transition.png' alt='transition' width=60%>

* Once the worker has finished running the step, the final status of the task is recorded and the scheduler places additional tasks in the queue until all tasks are complete.

* Once all tasks have been completed, the DAG is complete.
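
The steps above can be sketched as a small simulation. This is a simplified, single-process stand-in, not Airflow's actual scheduler. The pipeline, the `run_dag` function, and the reduced set of states (`none`, `queued`, `running`, `success`) are assumptions made for illustration.

```python
from collections import deque
from graphlib import TopologicalSorter

# Task -> set of upstream tasks (hypothetical pipeline).
dag = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"extract"},
    "report": {"clean", "aggregate"},
}


def run_dag(graph):
    """Simulate the scheduler/queue/worker loop; return run order and final states."""
    state = {task: "none" for task in graph}
    history = []
    queue = deque()
    sorter = TopologicalSorter(graph)
    sorter.prepare()

    while sorter.is_active():
        # Scheduler: queue every task whose upstream tasks have all finished.
        for task in sorted(sorter.get_ready()):
            state[task] = "queued"
            queue.append(task)
        # Worker: take queued tasks one at a time, run them, record the result.
        while queue:
            task = queue.popleft()
            state[task] = "running"
            history.append(task)   # stand-in for doing the real work
            state[task] = "success"
            sorter.done(task)      # lets the scheduler unlock downstream tasks
    return history, state


history, final_state = run_dag(dag)
print(history)  # ['extract', 'aggregate', 'clean', 'report']
```

`clean` and `aggregate` become runnable together once `extract` succeeds, while `report` waits for both. A real deployment would run them on separate workers.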

# Building a Data Pipeline in Airflow

A DAG file (for example, `DAG.py`) is created in Airflow's DAG folder. It contains the operator imports, DAG configuration such as the schedule and DAG name, and the definition of task dependencies and ordering.

<img src='imgs/operator.png' alt='operator' width=20%>


### Operators

* Operators are Python classes that contain the logic to perform a task. Custom operators are kept in an operators folder and imported into the DAG.py file.


* **Operators** define the atomic steps of work that make up a DAG. Instantiated operators are referred to as **Tasks**. Airflow comes with many Operators that can perform common operations:
    * PythonOperator - calls a python function
    * PostgresOperator - executes a SQL command
    * RedshiftToS3Operator - unloads data from Redshift to S3
    * S3ToRedshiftOperator - loads data from S3 into Redshift
    * BashOperator - executes a UNIX command
    * SimpleHttpOperator - sends an HTTP request
    * Sensor - waits for a certain time, file, database row, S3 key, etc.
    * EmailOperator - sends an email
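
To make the operator/task distinction concrete, here is a minimal plain Python sketch. `Operator` and `PythonCallOperator` are hypothetical stand-ins, not Airflow classes. The only detail borrowed from Airflow is that an operator's work lives in an `execute(context)` method.

```python
class Operator:
    """Hypothetical base class: a reusable template for one unit of work."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class PythonCallOperator(Operator):
    """Calls a Python function, loosely mirroring what PythonOperator is for."""

    def __init__(self, task_id, python_callable):
        super().__init__(task_id)
        self.python_callable = python_callable

    def execute(self, context):
        return self.python_callable()


# One operator class, instantiated twice, gives two distinct tasks.
greet = PythonCallOperator("greet", lambda: "hello")
count = PythonCallOperator("count", lambda: len("hello"))
results = {task.task_id: task.execute(context={}) for task in (greet, count)}
print(results)  # {'greet': 'hello', 'count': 5}
```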
    
    

### Schedules 

* `schedule_interval` defines when the DAG runs. **Schedules** are optional, and may be defined with **cron strings** or **Airflow Presets**. Airflow provides the following presets:
    * @once - Run a DAG once and then never again
    * @hourly - Run the DAG every hour
    * @daily - Run the DAG every day
    * @weekly - Run the DAG every week
    * @monthly - Run the DAG every month
    * @yearly - Run the DAG every year
    * None - Only run the DAG when the user initiates it
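
Each time-based preset is shorthand for a cron string. The mapping below follows the cron equivalents listed in the Airflow documentation. The helper `to_cron` is a hypothetical convenience function, not part of Airflow.

```python
# Cron equivalents of the time-based presets.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",   # minute 0 of every hour
    "@daily": "0 0 * * *",    # midnight every day
    "@weekly": "0 0 * * 0",   # midnight on Sunday
    "@monthly": "0 0 1 * *",  # midnight on the first day of the month
    "@yearly": "0 0 1 1 *",   # midnight on January 1
}


def to_cron(schedule):
    """Translate a preset to its cron string; pass custom cron strings through.

    Returns None for `None` and "@once", which have no recurring schedule.
    """
    if schedule is None or schedule == "@once":
        return None
    return PRESET_TO_CRON.get(schedule, schedule)


print(to_cron("@daily"))        # 0 0 * * *
print(to_cron("30 6 * * 1-5"))  # 30 6 * * 1-5
```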

### Parameters/Arguments

* While creating a DAG, give it a **name**, a **description**, a **start date**, and an **interval**.
    
    
* `start_date`: If the start date is in the past and catch-up is enabled (the DAG's `catchup` argument), Airflow will run the DAG once for every completed schedule interval between that start date and the current date.


* `end_date`: Unless we specify an optional end date, Airflow will continue to run our DAGs until we disable or delete the DAG.


* `max_active_runs` sets how many runs of the DAG can be active concurrently.


* `retries` sets how many times a failed task is re-run before the task, and with it the workflow run, is marked as failed.


* `on_success_callback` and `on_failure_callback` arguments are used to trigger some actions once the workflow succeeds or fails respectively. This will be useful to send personalized alerts to internal team via Slack, Email, or any other API call when a workflow task succeeds or fails.
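
The effect of a `start_date` in the past can be worked out by hand. The sketch below is plain Python, not Airflow code. The function `pending_runs` is hypothetical, and it assumes that a run for an interval is created only once that interval has ended.

```python
from datetime import datetime, timedelta


def pending_runs(start_date, now, interval):
    """List the interval start times for which a catch-up run would be created.

    Assumption: the run covering [t, t + interval) is triggered only after
    t + interval has passed.
    """
    runs = []
    t = start_date
    while t + interval <= now:
        runs.append(t)
        t += interval
    return runs


runs = pending_runs(
    start_date=datetime(2021, 1, 1),
    now=datetime(2021, 1, 4, 12, 0),
    interval=timedelta(days=1),
)
print(len(runs))  # 3
```

With a daily interval, the runs cover January 1, 2, and 3. The January 4 interval has not finished at noon on January 4, so no run exists for it yet.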

### First DAG (Basic)

```python
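from datetime import datetime
import logging

from airflow import DAG
# Airflow 2.x import path; Airflow 1.10 used airflow.operators.python_operator.
from airflow.operators.python import PythonOperator


def task_failure_alert(context):
    """Minimal placeholder callback: log which task failed."""
    logging.error("Task failed: %s", context["task_instance"].task_id)


# Default arguments are applied to every task in the DAG
default_args = {
    'owner': 'jamwine',
    'start_date': datetime(2021, 1, 1),
    'retries': 1,
    'on_failure_callback': task_failure_alert,
}


def message():
    logging.info("Airflow is awesome!")
    logging.info("Airflow uses DAGs...")


# Creating a DAG instance
dag1 = DAG(
    dag_id="dag1",
    description="Test DAG 1",
    default_args=default_args,  # start_date comes from here; avoid datetime.now()
    schedule_interval="@hourly",
    catchup=False,              # do not backfill every hour since start_date
    max_active_runs=3,          # DAG-level setting, so it does not belong in default_args
)

# Instantiating an operator creates a task and attaches it to the DAG
message_task = PythonOperator(
    task_id="message_task",
    python_callable=message,
    dag=dag1,
)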
```

### Task Dependencies

**Task Dependencies** can be described programmatically in Airflow using `>>` and `<<`
* `a >> b` means **a comes before b**
* `a << b` means **a comes after b**

Task dependencies can also be set with `set_downstream` and `set_upstream`.
* `a.set_downstream(b)` means **a comes before b**
* `a.set_upstream(b)` means **a comes after b**

Tasks that can run in parallel are grouped in a list (`[]`). For example, `[a, b] >> c` makes `c` run after both `a` and `b`.
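
The `>>` and `<<` syntax relies on Python operator overloading. The class below is a hypothetical, much-reduced stand-in for an Airflow task that tracks only dependencies. It is a sketch of the idea, not Airflow's implementation.

```python
def _as_list(value):
    return list(value) if isinstance(value, (list, tuple)) else [value]


class Task:
    """Hypothetical stand-in for a task that records upstream/downstream ids."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, others):
        for other in _as_list(others):
            self.downstream.add(other.task_id)
            other.upstream.add(self.task_id)

    def set_upstream(self, others):
        for other in _as_list(others):
            other.set_downstream(self)

    def __rshift__(self, other):    # self >> other
        self.set_downstream(other)
        return other                # returning `other` allows a >> b >> c

    def __lshift__(self, other):    # self << other
        self.set_upstream(other)
        return other

    def __rrshift__(self, others):  # [a, b] >> self (a list has no >> of its own)
        self.set_upstream(others)
        return self


a, b, c, d = (Task(name) for name in "abcd")
a >> [b, c]   # a runs first, then b and c in parallel
[b, c] >> d   # d waits for both b and c
print(sorted(d.upstream))  # ['b', 'c']
```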


### Variables

* **Variables** are a generic way to store and retrieve arbitrary content within Airflow. Each one is a simple key-value pair stored in Airflow's metadata database.


*  Variables are useful for storing and retrieving data at runtime, and for avoiding hard-coded values and repeated code within our DAGs.


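* When a variable is requested, Airflow checks **two layers** before reaching the metastore: a configured secrets backend (if any), then environment variables named `AIRFLOW_VAR_<KEY>`. If the variable is found in one of these two layers, Airflow doesn’t need to open a database connection, which makes the lookup cheaper.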


*  Command Line Interface and Environment Variables Reference: https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#variables
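
The lookup order can be sketched in plain Python. The function `get_variable` and the dictionaries standing in for the secrets backend and the metastore are hypothetical. The `AIRFLOW_VAR_<KEY>` naming with an upper-case key is the convention described in the Airflow documentation.

```python
import os


def get_variable(key, secrets_backend=None, metastore=None):
    """Return (value, source), checking the cheaper layers before the metastore."""
    secrets_backend = secrets_backend or {}
    metastore = metastore or {}

    # Layer 1: a configured secrets backend, if any.
    if key in secrets_backend:
        return secrets_backend[key], "secrets backend"

    # Layer 2: environment variables named AIRFLOW_VAR_<KEY>.
    env_name = f"AIRFLOW_VAR_{key.upper()}"
    if env_name in os.environ:
        return os.environ[env_name], "environment"

    # Last resort: the metadata database (needs a database connection).
    if key in metastore:
        return metastore[key], "metastore"
    raise KeyError(key)


os.environ["AIRFLOW_VAR_BUCKET"] = "bucket-from-env"  # simulate an exported variable
value, source = get_variable("bucket", metastore={"bucket": "bucket-from-db"})
print(value, source)  # bucket-from-env environment
```

The environment value wins over the metastore value because it is found first, so the database is never consulted.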

### XComs

* **XComs**, or **Cross Communication messages**, are designed to pass small amounts of data between tasks. Basically, they let tasks exchange messages within our DAG. For example, if task_B depends on task_A, task_A can push data into an XCom and task_B can pull this data to use it.


* We use `xcom_push` and `xcom_pull` to push and retrieve values.


* These methods `xcom_push` and `xcom_pull` are only accessible from a task instance object. With the PythonOperator, we can access them by passing the parameter `ti` to the python callable function. 


* We can pull XComs from multiple tasks at once.


* In the BashOperator, `do_xcom_push` allows us to push the **last line** written to stdout into an XCom. By default, `do_xcom_push` is set to **True**.


*  By default, when an XCom is created automatically by returning a value, Airflow assigns it the key `return_value`. This key indicates that the XCom was created from the operator's return value. The XCom value is stored in Airflow's `metadata database` under that key.

<img src='imgs/af_variable.png' alt='af_variable' width=40%>

* XComs create implicit dependencies between the tasks that are not visible from the UI.
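
The push/pull mechanics can be imitated with a small in-memory store. `XComStore` and `run_task` below are hypothetical stand-ins for the XCom table and for task execution. Only the default key `return_value` and the pull-from-several-tasks behaviour are taken from the description above.

```python
class XComStore:
    """In-memory stand-in for the XCom rows kept in the metadata database."""

    def __init__(self):
        self._rows = {}

    def push(self, task_id, value, key="return_value"):
        self._rows[(task_id, key)] = value

    def pull(self, task_ids, key="return_value"):
        # A single task id returns one value; a list returns a list of values.
        if isinstance(task_ids, str):
            return self._rows.get((task_ids, key))
        return [self._rows.get((task_id, key)) for task_id in task_ids]


def run_task(store, task_id, fn):
    """Run a task; a non-None return value is pushed under 'return_value'."""
    result = fn(store)
    if result is not None:
        store.push(task_id, result)


def task_a(store):
    return 42                                        # implicit push via return


def task_b(store):
    store.push("task_b", "ok", key="status")         # explicit push, custom key


store = XComStore()
run_task(store, "task_a", task_a)
run_task(store, "task_b", task_b)

pulled = store.pull(["task_a", "task_b"])            # default key for both tasks
status = store.pull("task_b", key="status")
print(pulled, status)  # [42, None] ok
```

`task_b` returned nothing, so it has no `return_value` entry and the pull yields `None` for it.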

### Macros

* **Jinja templates** and **Macros** in Apache Airflow are the way to pass dynamic data to our DAGs at runtime. 


* Templating allows us to interpolate values at run time in static files such as HTML or SQL files, by placing special placeholders in them indicating where the values should be and/or how they should be displayed.


* The curly brackets `{{ }}` represent **placeholders** whose value is filled in at **runtime** each time the DAG is triggered. For example, `{{ ds }}` is a template variable that is replaced by the **execution date** of the DAG run in `YYYY-MM-DD` format.


* **Macros** are functions that take an input, modify that input and give the modified output. Macros can be used in our templates by calling them with the following notation: `macros.macro_func()`, for example `{{ macros.ds_add(ds, 7) }}`.


* Apache Airflow provides predefined variables that we can use in our templates. They are very useful because they give us information about the currently executing DAG and task. Macros Reference: https://airflow.apache.org/docs/apache-airflow/stable/macros-ref.html


*  We can also check how our Jinja templates render before executing the DAG, in the Rendered section of the Airflow UI.
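
To show what placeholder substitution does, here is a deliberately simplified renderer built on the standard library. It is a stand-in for Jinja, not Jinja itself, and handles only `{{ name }}` lookups. The local `ds_add` function imitates the purpose of Airflow's `macros.ds_add`, and the `next_week` key is a made-up context entry.

```python
import re
from datetime import date, timedelta


def ds_add(ds, days):
    """Add `days` to a YYYY-MM-DD string and return the result in the same format."""
    return (date.fromisoformat(ds) + timedelta(days=days)).isoformat()


def render(template, context):
    """Replace every {{ name }} placeholder with str(context[name])."""
    pattern = r"\{\{\s*(\w+)\s*\}\}"
    return re.sub(pattern, lambda match: str(context[match.group(1)]), template)


# Values that would be supplied at run time for one DAG run.
context = {"ds": "2021-01-01"}
context["next_week"] = ds_add(context["ds"], 7)  # hypothetical extra key

template = "SELECT * FROM events WHERE day >= '{{ ds }}' AND day < '{{ next_week }}'"
sql = render(template, context)
print(sql)
```

The template text stays static in the DAG file. Only the context changes between runs, so the same SQL file can serve every execution date.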