# Abstract of Airflow chapter 05

### Defining dependencies between tasks

This chapter focuses on dependencies between tasks: linear dependencies and multiple (fan-in/fan-out) dependencies.

### Linear dependencies

<img src="./pic/CH_02_rocket_download_pipeline.png" width="800">

In this type of DAG, each task must be completed before going on to the next, as the result of the preceding task is required as an input for the next. This can be expressed with the following code:

```python
# Set task dependencies one-by-one:
download_launches >> get_pictures
get_pictures >> notify

# Or in one go:
download_launches >> get_pictures >> notify
```

Task dependencies effectively tell Airflow that it can only start executing a given task once its upstream dependencies have finished executing successfully.
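To make the `>>` notation less magical, here is a minimal sketch of how such an operator can record upstream and downstream tasks. The `Task` class is a toy stand-in written for these notes, not Airflow's actual implementation; it only tracks task IDs.

```python
class Task:
    """Toy stand-in for an Airflow operator; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, others):
        # Accept either a single task or a list of tasks.
        for other in others if isinstance(others, list) else [others]:
            self.downstream.add(other.task_id)
            other.upstream.add(self.task_id)

    def __rshift__(self, other):
        # Handles `task >> other` and `task >> [a, b]`.
        self.set_downstream(other)
        return other  # returning the right-hand side allows chaining

    def __rrshift__(self, others):
        # Handles `[a, b] >> task`: a list has no `>>`, so Python falls back here.
        for other in others:
            other.set_downstream(self)
        return self


download_launches = Task("download_launches")
get_pictures = Task("get_pictures")
notify = Task("notify")

download_launches >> get_pictures >> notify

print(get_pictures.upstream)  # {'download_launches'}
print(notify.upstream)        # {'get_pictures'}
```

Because `>>` returns its right-hand side, the chained form registers exactly the same two edges as the one-by-one form.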

### Fan-in/-out Dependencies

<img src="./pic/CH05_final_dag.png" width="800">

```python
[clean_weather, clean_sales] >> join_datasets

```

The code above is the fan-in dependency (multiple-to-one). For the fan-out side, add a DummyOperator to serve as the starting point:

```python
from airflow.operators.dummy import DummyOperator

# Create a dummy start task.
start = DummyOperator(task_id="start")
# Fan-out (one-to-multiple).
start >> [fetch_weather, fetch_sales]
# Each fetch task feeds its own cleaning task.
fetch_weather >> clean_weather
fetch_sales >> clean_sales
# Fan-in (multiple-to-one).
[clean_weather, clean_sales] >> join_datasets
join_datasets >> train_model >> deploy_model
```
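To see which tasks of this DAG may run side by side, the dependency graph can be sorted topologically. The sketch below uses the standard library's `graphlib` (Python 3.9+) rather than Airflow itself, and it assumes each fetch task feeds its matching clean task. Airflow does not run tasks in lockstep waves like this; the waves only show the earliest point at which each task becomes eligible.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
upstream = {
    "fetch_weather": {"start"},
    "fetch_sales": {"start"},
    "clean_weather": {"fetch_weather"},
    "clean_sales": {"fetch_sales"},
    "join_datasets": {"clean_weather", "clean_sales"},
    "train_model": {"join_datasets"},
    "deploy_model": {"train_model"},
}

sorter = TopologicalSorter(upstream)
sorter.prepare()

waves = []
while sorter.is_active():
    # Tasks whose upstream tasks have all been marked as done.
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for wave in waves:
    print(wave)
```

Tasks that share a wave (the two fetch tasks, then the two clean tasks) have no dependency on each other, which is why they can execute in parallel.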

The execution order will be 1, 2a, 2b, 3a, 3b (in parallel), 4, 5, 6.

Suppose your company changes its ERP system and you must adapt your workflow. A possible solution is to add new tasks that fetch the data from the new system, and to split the runs by date at the moment the new system goes live.

<img src="./pic/CH05 taskbranhc.png" width="800">

Creating the new flows is easy:

```python
fetch_sales_old >> clean_sales_old
fetch_sales_new >> clean_sales_new
```

Now we still need to connect these tasks to the rest of our DAG and make sure that Airflow knows which of these tasks it should execute when. Fortunately, Airflow provides built-in support for choosing between sets of downstream tasks using the BranchPythonOperator. However, in contrast to the PythonOperator, callables passed to the BranchPythonOperator are expected to return the ID of a downstream task as a result of their computation.

```python
from airflow.operators.python import BranchPythonOperator


def _pick_erp_system(**context):
    # Return the task_id of the branch that should be executed.
    # ERP_SWITCH_DATE is assumed to be defined elsewhere in the DAG file.
    if context["execution_date"] < ERP_SWITCH_DATE:
        return "fetch_sales_old"
    else:
        return "fetch_sales_new"


pick_erp_system = BranchPythonOperator(
    task_id="pick_erp_system",
    python_callable=_pick_erp_system,
)

start >> pick_erp_system
pick_erp_system >> [fetch_sales_old, fetch_sales_new]
[clean_sales_old, clean_sales_new] >> join_datasets
```

However, if you do this, running the DAG would result in the join_datasets task and all its downstream tasks being skipped by Airflow. (You can try it out if you wish!)
The reason for this is that, by default, Airflow requires all tasks upstream of a given task to complete successfully before the task itself can be executed. By connecting both of our cleaning tasks to the join_datasets task, we created a situation where this can never occur, as only one of the cleaning tasks is ever executed! As a result, the join_datasets task can never be executed and is skipped by Airflow.

This behavior that defines when tasks are executed is controlled by so-called ‘trigger rules’ in Airflow. Trigger rules can be defined for individual tasks using the trigger_rule argument, which can be passed to any operator. By default, trigger rules are set to ‘all_success’, meaning that all parents of the corresponding task need to succeed before the task can be run. This never happens when using the BranchPythonOperator, as it skips any tasks that are not chosen by the branch, which explains why the join_datasets task and all its downstream tasks were also skipped by Airflow.
To fix this situation, we can change the trigger rule of join_datasets so that it can still trigger if one of its upstream tasks is skipped. One way to achieve this is to change the trigger rule to `none_failed`, which specifies that a task should run as soon as all of its parents are done with executing and none have failed:


```python
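# _join_datasets is a placeholder name for the task's existing callable.
join_datasets = PythonOperator(task_id="join_datasets", python_callable=_join_datasets,
                               trigger_rule="none_failed")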
```

This way, join_datasets will start executing as soon as all of its parents have finished executing without any failures, allowing join_datasets to continue its execution after the branch.
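The difference between the two trigger rules can be sketched as two small predicates over the final states of a task's parents. This is a simplified model written for these notes, not Airflow's own code; it assumes every parent has already finished.

```python
def all_success(upstream_states):
    """Default rule: every upstream task must have succeeded."""
    return all(state == "success" for state in upstream_states)


def none_failed(upstream_states):
    """Upstream tasks may be skipped, but none may have failed."""
    return all(
        state not in ("failed", "upstream_failed") for state in upstream_states
    )


# States of clean_weather, clean_sales_old and clean_sales_new after a
# branch that picked the new ERP system.
states = ["success", "skipped", "success"]

print(all_success(states))  # False
print(none_failed(states))  # True
```

With one skipped parent, the default rule can never be satisfied, while `none_failed` lets the task run.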

One drawback of this approach is that we now have three edges going into the join_datasets task. This doesn’t really reflect the nature of our flow, in which we essentially want to fetch sales/weather data (choosing between the two ERP systems first) and then feed these two data sources into join_datasets. For this reason, many people choose to make the branch condition more explicit by adding a dummy task joining the different branches before continuing with the DAG.

```python
from airflow.operators.dummy import DummyOperator

join_branch = DummyOperator(task_id="join_erp_branch", trigger_rule="none_failed")

[clean_sales_old, clean_sales_new] >> join_branch
join_branch >> join_datasets
```

<img src="./pic/CH05 3edge.png" width="800">

This change also means that we no longer need to change the trigger rule for the join_datasets task, making our branch more self-contained than the original.

Another problem is the deployment: deploy_model runs for every DAG run, not only for the most recent one. To solve that, the correct approach is to create a new task that checks this condition and to place it upstream of deploy_model.
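As a rough sketch of what such a condition task could check, the function below decides whether a run is the most recent one by testing whether the current time falls in the schedule window right after the run's own interval. The function name, its arguments and the fixed `schedule_interval` are assumptions made for this example; in a real DAG the check would live in a task upstream of deploy_model that skips its downstream tasks when the condition is false. Airflow also ships a LatestOnlyOperator for this kind of use case.

```python
from datetime import datetime, timedelta


def is_latest_run(execution_date, schedule_interval, now):
    """Return True when `now` falls in the window right after this run's interval."""
    left_window = execution_date + schedule_interval
    right_window = left_window + schedule_interval
    return left_window < now <= right_window


daily = timedelta(days=1)
now = datetime(2020, 1, 6, 0, 5)

# The run covering 2020-01-05 is the most recent one at this moment.
print(is_latest_run(datetime(2020, 1, 5), daily, now))  # True
# A backfilled run for 2020-01-01 is not, so deployment should be skipped.
print(is_latest_run(datetime(2020, 1, 1), daily, now))  # False
```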

### 5.4 More about Trigger Rules


In the previous sections, we have seen how Airflow allows us to build dynamic behavior into our DAGs, which lets us encode branches or conditional statements directly into them. Much of this behavior is governed by Airflow’s so-called trigger rules, which determine exactly when a task is executed by Airflow. As we skipped over trigger rules relatively quickly in the previous sections, we’ll explore them in a bit more detail here to give you a feeling of what trigger rules represent and what you can do with them.
To understand trigger rules, we first have to examine how Airflow executes tasks within a DAG run. In essence, when Airflow is executing a DAG, it continuously checks each of your tasks to see whether it can be executed. As soon as a task is deemed ‘ready for execution’, the task is picked up by the scheduler and scheduled to be executed. As a result, the task is executed as soon as Airflow has an execution slot available.
So how does Airflow determine when a task can be executed? That is where trigger rules come in.
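The checking loop described above can be imitated with a small simulation. This is a toy model written for these notes, not Airflow's scheduler: tasks finish instantly, nothing ever fails, and the set of tasks skipped by the branch is passed in by hand.

```python
def run_dag(upstream, trigger_rules, skipped_by_branch):
    """Toy scheduler loop: keep checking tasks until every task has a state."""
    states = {}
    while len(states) < len(upstream):
        for task, parents in upstream.items():
            if task in states or any(p not in states for p in parents):
                continue  # already handled, or still waiting on a parent
            parent_states = [states[p] for p in parents]
            rule = trigger_rules.get(task, "all_success")
            if task in skipped_by_branch:
                states[task] = "skipped"
            elif rule == "none_failed":
                # Nothing fails in this toy run, so the rule is always satisfied.
                states[task] = "success"
            else:
                # all_success: a skipped parent causes this task to be skipped too.
                ok = all(state == "success" for state in parent_states)
                states[task] = "success" if ok else "skipped"
    return states


upstream = {
    "start": [],
    "pick_erp_system": ["start"],
    "fetch_sales_old": ["pick_erp_system"],
    "fetch_sales_new": ["pick_erp_system"],
    "clean_sales_old": ["fetch_sales_old"],
    "clean_sales_new": ["fetch_sales_new"],
    "fetch_weather": ["start"],
    "clean_weather": ["fetch_weather"],
    "join_datasets": ["clean_weather", "clean_sales_old", "clean_sales_new"],
    "train_model": ["join_datasets"],
    "deploy_model": ["train_model"],
}

default_run = run_dag(upstream, {}, {"fetch_sales_old"})
fixed_run = run_dag(upstream, {"join_datasets": "none_failed"}, {"fetch_sales_old"})

print(default_run["join_datasets"], default_run["deploy_model"])  # skipped skipped
print(fixed_run["join_datasets"], fixed_run["deploy_model"])      # success success
```

The first run shows how a single skipped branch propagates down to deploy_model under the default rule; the second shows that changing the rule on join_datasets alone is enough to stop the propagation.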

<img src="./pic/CH05 triggers01.png" width="800">

<img src="./pic/CH05 triggers02.png" width="800">

### Using XComs to share state between tasks

Don't use XComs!

### Using the Taskflow API to share state between tasks

Don't use the Taskflow API: it is only a decorator-based wrapper around XComs!