# Workflows 

A **workflow** is a set of steps to accomplish a given data engineering task.

Complexity varies from one workflow to another.

Depending on the context, the term *workflow* can refer to different things.

# Airflow 

**Airflow** is a platform to program workflows, including the creation, scheduling and monitoring of those workflows.

Airflow implements workflows as DAGs (Directed Acyclic Graphs).

Airflow can be accessed via Python code (mostly for DAG creation), the command line (used for running DAGs manually, starting processes or getting logging information), the web interface or the REST API.

Alternatives to Airflow include Luigi, SSIS or even Bash scripts.

# Directed Acyclic Graphs 

A DAG is the set of tasks that make up a workflow, together with the dependencies between those tasks.

A DAG is created with various details about itself, including the name, start date, owner...

They are **directed** because there is an inherent flow representing dependencies between components.

They are **acyclic** because there are no loops.

They are **graphs** because they are depicted as nodes (the components) connected by edges (the dependencies).

DAGs in Airflow are written in Python. Components are called **tasks**, and they can be operators, sensors...

DAGs in Airflow contain dependencies that are defined explicitly or implicitly.
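
To make "directed" and "acyclic" concrete, here is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow code and not how Airflow resolves dependencies internally; the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps):
    """Return one valid run order, or raise CycleError if there is a loop."""
    return list(TopologicalSorter(deps).static_order())


# Directed: dependencies impose an order on the tasks.
order = execution_order(dependencies)
print(order)

# Acyclic: making "extract" depend on "load" closes a loop,
# so no valid run order exists any more.
cyclic = {**dependencies, "extract": {"load"}}
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

A graph with a loop has no valid starting point, which is why a scheduler can only work with acyclic graphs.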



In [None]:
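from airflow import DAG
import datetime

# Daily ETL DAG; start_date is passed as a datetime object rather than a string
etl_dag = DAG(
    dag_id='etl_pipeline',
    default_args={'start_date': datetime.datetime(2024, 1, 8)},
    schedule=datetime.timedelta(days=1)  # the `schedule` argument assumes Airflow 2.4 or later
)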

In [None]:
!airflow tasks test <dag_id> <task_id> [execution_date]

In [None]:
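# Arguments applied to every task in the DAG unless a task overrides them
default_arguments = {
    'owner': 'jdoe',
    'email': 'jdoe@datacamp.com',
    'start_date': datetime.datetime(2024, 1, 10)
}

# Tasks defined inside the `with` block are attached to etl_dag automatically
with DAG('etl_workflow', default_args=default_arguments) as etl_dag:
    print('hi')  # placeholder; tasks would be defined here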
    

# Operators 

In Airflow, an **operator** represents a single task in a workflow.

In general, operators run independently and do not share information.

## BashOperator

## PythonOperator

## EmailOperator

# Sensors

A **sensor** is an operator that waits for a certain condition to be true. Conditions could be the creation of a file, the upload of a database record, a certain response from a web request...
Sensors let you define how often to check whether the condition is true.

Sensors have arguments like:
- `mode`: how to check for the condition. Can be *poke* (keep the worker slot and check repeatedly) or *reschedule* (release the worker slot between checks and try again later).
- `poke_interval`: how long to wait between checks, in seconds.
- `timeout`: how long to wait, in seconds, before failing the task.

Since sensors are operators, they include common attributes like `task_id` or `dag`.
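
The check-wait-repeat behaviour behind `poke_interval` and `timeout` can be sketched in plain Python. This is an illustration of the idea only, not Airflow's implementation; `wait_for` and `third_time_lucky` are hypothetical names, and the intervals are shortened so the example finishes quickly.

```python
import time


def wait_for(condition, poke_interval=0.1, timeout=1.0):
    """Call `condition` until it returns True, sleeping between checks.

    Returns True on success and raises TimeoutError once `timeout`
    seconds have passed without the condition becoming true.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before timeout")
        time.sleep(poke_interval)


# Hypothetical condition that becomes true on the third check.
checks = {"count": 0}


def third_time_lucky():
    checks["count"] += 1
    return checks["count"] >= 3


result = wait_for(third_time_lucky, poke_interval=0.01, timeout=1.0)
print(result, checks["count"])
```

This loop resembles *poke* mode, because the process stays alive and sleeps between checks. In *reschedule* mode the task would instead stop and be scheduled again later.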

## FileSensor

Part of the `airflow.sensors` package (in Airflow 2, `airflow.sensors.filesystem.FileSensor`).

Checks for the existence of a file at a certain location.

Can also check whether any files exist within a directory.
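
The two checks described above can be sketched with `pathlib`. This is a rough stand-in for the condition a file sensor evaluates, not the `FileSensor` source; `path_is_ready` and the file name are hypothetical, and looking only at the direct children of a directory is an assumption made to keep the example short.

```python
import tempfile
from pathlib import Path


def path_is_ready(path):
    """True if `path` is an existing file, or a directory that directly
    contains at least one file."""
    p = Path(path)
    if p.is_file():
        return True
    if p.is_dir():
        return any(child.is_file() for child in p.iterdir())
    return False


with tempfile.TemporaryDirectory() as tmp:
    # An empty directory does not satisfy the condition.
    empty_dir_ready = path_is_ready(tmp)

    # Hypothetical file name, created to satisfy both checks.
    target = Path(tmp) / "salesdata.csv"
    target.write_text("id,amount\n")
    file_ready = path_is_ready(target)
    dir_ready = path_is_ready(tmp)

print(empty_dir_ready, file_ready, dir_ready)
```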

## ExternalTaskSensor

Waits for a task in another DAG to complete.

## HttpSensor 

## SqlSensor

# Sensors vs. Operators

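Use a sensor rather than a regular operator when:
- It is uncertain when the condition will become true.
- Failure is not immediately desired, so the check should keep retrying for a while.
- Task repetition is needed without adding loops to the DAG.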

# Executors 

**Executors** run tasks.

Different executors handle running the tasks differently. Some of them are:

- `SequentialExecutor`: the default. Runs one task at a time. Useful for debugging, not recommended for production.
- `LocalExecutor`: runs on a single system and treats tasks as processes. Parallelism is defined by the user.
- `KubernetesExecutor`: uses Kubernetes as the task manager, and multiple worker systems can be defined. Significantly more difficult to set up and configure, but extremely powerful.
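
The difference between running one task at a time and running tasks as parallel processes on one machine can be sketched with the standard library. This is a conceptual illustration only, not how Airflow's executors are implemented; `run_task`, `run_sequential`, `run_local` and the task names are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_task(name):
    """Run one hypothetical task as its own OS process and return its output."""
    cmd = [sys.executable, "-c", f"print('done: {name}')"]
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


def run_sequential(names):
    """One task at a time: the next starts only when the previous one ends."""
    return [run_task(name) for name in names]


def run_local(names, parallelism=2):
    """Up to `parallelism` task processes at once on a single machine.

    The threads only wait on the child processes; results keep input order.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(run_task, names))


tasks = ["extract", "transform", "load"]
sequential_results = run_sequential(tasks)
local_results = run_local(tasks, parallelism=2)
print(sequential_results)
print(local_results)
```

Both functions return the same results. What changes is how many tasks are in flight at once, which is the trade-off an executor choice controls.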

The executor is set in the `airflow.cfg` file. The command `airflow info` displays that information too. 



In [None]:
!airflow info
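
As an alternative to `airflow info`, the setting can be read from the configuration file itself. The sketch below parses an inline sample instead of a real `airflow.cfg`, assuming the executor is stored under the `executor` key of the `[core]` section; `SAMPLE_CFG` and `read_executor` are hypothetical names.

```python
import configparser

# Inline sample standing in for airflow.cfg, trimmed to the one setting of interest.
SAMPLE_CFG = """
[core]
executor = SequentialExecutor
"""


def read_executor(cfg_text):
    """Return the executor name from airflow.cfg-style text."""
    # Interpolation is disabled so that any '%' in other values is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.get("core", "executor")


executor = read_executor(SAMPLE_CFG)
print(executor)
```

For a real installation, `parser.read(path_to_cfg)` would replace `read_string`. The location of the file depends on the setup.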

# Templates

# Branching