# Abstract of Airflow chapter 03

### Scheduling in Airflow

This chapter goes deeper into how to define and schedule a DAG in Airflow. For example, the following DAG runs daily between a start date and an end date:

```python
import datetime as dt

from airflow import DAG

dag = DAG(
    dag_id="03_with_end_date",
    schedule_interval="@daily",
    start_date=dt.datetime(year=2019, month=1, day=1),
    end_date=dt.datetime(year=2019, month=1, day=5),
)
```

Both the start date and the end date are interpreted as midnight of the given day. To support more complicated schedules, Airflow lets us define schedule intervals using the same syntax as cron. For example, `0 0 * * MON-FRI` runs at midnight on weekdays.

An important limitation of cron expressions is that they cannot represent certain frequency-based schedules, such as running once every three days. Airflow handles these by accepting a `timedelta` as the schedule interval:

```python
dag = DAG(
    dag_id="03_with_end_date",
    schedule_interval=dt.timedelta(days=3),  # run once every three days
    start_date=dt.datetime(year=2019, month=1, day=1),
    end_date=dt.datetime(year=2019, month=1, day=5),
)
```
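To make the effect of a `timedelta` interval concrete, the following is a minimal sketch in plain Python that lists the interval starts between a start and an end date. It is not Airflow's scheduler logic: the helper name `interval_starts` is hypothetical, and treating the end date as inclusive is an assumption made for illustration.

```python
import datetime as dt


def interval_starts(start_date, end_date, interval):
    """List the start of every interval from start_date up to end_date.

    Simplified model: end_date is treated as inclusive (an assumption).
    """
    starts = []
    current = start_date
    while current <= end_date:
        starts.append(current)
        current += interval
    return starts


starts = interval_starts(
    start_date=dt.datetime(2019, 1, 1),
    end_date=dt.datetime(2019, 1, 5),
    interval=dt.timedelta(days=3),
)
for start in starts:
    print(start.date())
```

With a three-day interval, only two interval starts fit into this date range under the stated assumption.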

The DAG as built so far has two problems. First, it downloads and calculates statistics for the entire catalog of user events every day, which is hardly efficient. Second, the process only downloads events for the past 30 days, which means that we are not building up any history for dates further in the past.

We can address both problems by fetching events incrementally, requesting only the events for a single day:

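```python
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

fetch_events = BashOperator(
    task_id="fetch_events",
    bash_command=(
        "mkdir -p /data && "
        "curl -o /data/events.json "
        # Quote the URL so the shell does not treat "&" as a control operator.
        "'http://localhost:5000/events?"
        "start_date=2019-01-01&"
        "end_date=2019-01-02'"
    ),
    dag=dag,
)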
```

The problem with this approach is that the dates are hard-coded. To fetch any interval other than 2019-01-01 to 2019-01-02, we need another Airflow feature: dynamic time references using execution dates.

####  Dynamic time references using execution dates

For many workflows involving time-based processes, it is important to know for which time interval a given task is being executed. For this reason, Airflow provides tasks with extra parameters that can be used to determine for which schedule interval a task is being executed (we’ll go into more detail on these ‘parameters’ in the next chapter).
The most important of these parameters is called the execution_date, which represents the date and time for which our DAG is being executed. In contrast to what the name of the parameter suggests, the execution_date is not a date but a timestamp, which reflects the start time of the schedule interval for which the DAG is being executed.

<img src="./pic/CH03extra_params.png" width="800">

In Airflow, we can use these execution dates by referencing them in our operators. For example, in the BashOperator, we can use Airflow’s templating functionality to include the execution dates dynamically in our Bash command. Templating is covered in detail in Chapter 4.

Using these extra parameters, we can reference the execution date directly in the command:

```python
fetch_events = BashOperator(
    task_id="fetch_events",
    bash_command=(
        "mkdir -p /data && "
        "curl -o /data/events.json "
        # The URL is quoted so the shell does not interpret "&".
        "'http://localhost:5000/events?"
        "start_date={{execution_date.strftime('%Y-%m-%d')}}"  # A: start of the interval
        "&end_date={{next_execution_date.strftime('%Y-%m-%d')}}'"  # B: start of the next interval
    ),
    dag=dag,
)
```

In this example, `{{variable_name}}` is Airflow’s Jinja-based templating syntax for referencing one of Airflow’s specific parameters.

Airflow also provides shorthand parameters for common date formats. For example, `ds` and `ds_nodash` are different representations of the `execution_date`, formatted as YYYY-MM-DD and YYYYMMDD respectively. Similarly, `next_ds`, `next_ds_nodash`, `prev_ds`, and `prev_ds_nodash` provide shorthands for the next and previous execution dates. The full list is in the [template reference](https://airflow.apache.org/docs/apache-airflow/stable/macros-ref.html).
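The relationship between these shorthands and the execution date can be sketched in plain Python. This is only an illustration of the formats described above, not how Airflow builds its template context; the daily interval and the `shorthands` dictionary are assumptions for the example.

```python
import datetime as dt

execution_date = dt.datetime(2019, 1, 1)  # start of the schedule interval
interval = dt.timedelta(days=1)           # assumes a daily schedule
next_execution_date = execution_date + interval
prev_execution_date = execution_date - interval

# Hypothetical dictionary mirroring the shorthand names described above.
shorthands = {
    "ds": execution_date.strftime("%Y-%m-%d"),
    "ds_nodash": execution_date.strftime("%Y%m%d"),
    "next_ds": next_execution_date.strftime("%Y-%m-%d"),
    "next_ds_nodash": next_execution_date.strftime("%Y%m%d"),
    "prev_ds": prev_execution_date.strftime("%Y-%m-%d"),
    "prev_ds_nodash": prev_execution_date.strftime("%Y%m%d"),
}

for name, value in shorthands.items():
    print(f"{name}: {value}")
```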

#### Partitioning your data

We can divide our dataset into daily batches by writing the output of each task run to a file named after the corresponding execution date:

```python
fetch_events = BashOperator(
    task_id="fetch_events",
    bash_command=(
        "mkdir -p /data/events && "
        "curl -o /data/events/{{ds}}.json "  # A: one output file per execution date
        "'http://localhost:5000/events?"
        "start_date={{ds}}&end_date={{next_ds}}'"
    ),
    dag=dag,
)
```

With this partitioned layout, the `_calculate_stats` function must also use templated paths, so that each run reads and writes the partition for its own execution date:

```python
from pathlib import Path

import pandas as pd
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def _calculate_stats(**context):  #A
    """Calculates event statistics."""
    input_path = context["templates_dict"]["input_path"]  #B
    output_path = context["templates_dict"]["output_path"]

    # Create the output directory if it does not exist yet.
    Path(output_path).parent.mkdir(exist_ok=True)

    events = pd.read_json(input_path)
    stats = events.groupby(["date", "user"]).size().reset_index()
    stats.to_csv(output_path, index=False)


calculate_stats = PythonOperator(
    task_id="calculate_stats",
    python_callable=_calculate_stats,
    templates_dict={
        "input_path": "/data/events/{{ds}}.json",  #C
        "output_path": "/data/stats/{{ds}}.csv",
    },
    dag=dag,
)
```

- #A Receive all context variables in this dict.
- #B Retrieve the templated values from the templates_dict object. 
- #C Pass the values that we want to be templated.

For Airflow 1.10.x, you’ll need to pass the extra argument provide_context=True to the PythonOperator, otherwise the _calculate_stats function won’t receive the context values.

#### Understanding Airflow’s execution dates


With Airflow execution dates being defined as the start of the corresponding schedule intervals, they can be used to derive the start and end of a specific interval (Figure 3.7). For example, when executing a task, the start and end of the corresponding interval are defined by the `execution_date` (the start of the interval) and the `next_execution_date` (the start of the next interval) parameters. Similarly, the previous schedule interval can be derived using the `prev_execution_date` and `execution_date` parameters.

However, one caveat to keep in mind when using the `prev_execution_date` and `next_execution_date` parameters in your tasks is that these parameters are only defined for DAG runs following the schedule interval. As such, their values will be undefined for any runs that are triggered manually using the Airflow UI or CLI. The reason is that Airflow cannot provide information about next or previous schedule intervals if you are not following a schedule interval.

<img src="./pic/CH03_extraDef.png" width="800">
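The way interval boundaries follow from the execution date can be sketched as below. This is a simplified model that assumes a fixed-length schedule; the helper `schedule_intervals` is hypothetical and does not cover manually triggered runs.

```python
import datetime as dt


def schedule_intervals(execution_date, interval):
    """Return (start, end) of the current and the previous schedule interval."""
    next_execution_date = execution_date + interval
    prev_execution_date = execution_date - interval
    current = (execution_date, next_execution_date)
    previous = (prev_execution_date, execution_date)
    return current, previous


current, previous = schedule_intervals(
    execution_date=dt.datetime(2019, 1, 3),
    interval=dt.timedelta(days=1),  # assumes a fixed-length daily schedule
)
print("current: ", current[0].date(), "to", current[1].date())
print("previous:", previous[0].date(), "to", previous[1].date())
```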

By default, Airflow will schedule and run any past schedule intervals that have not yet been run. As such, specifying a past start date and activating the corresponding DAG will result in all intervals that have passed before the current time being executed. This behavior is controlled by the DAG catchup parameter and can be disabled by setting catchup to False:

```python
dag = DAG(
    dag_id="09_no_catchup",
    schedule_interval="@daily",
    start_date=dt.datetime(year=2019, month=1, day=1),
    end_date=dt.datetime(year=2019, month=1, day=5),
    catchup=False,
)
```

With this setting, the DAG will only be run for the most recent schedule interval, rather than executing all open past intervals.

<img src="./pic/CH03catchup.png" width="800">
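The difference between the two settings can be illustrated with a small simulation in plain Python. It is a rough model of the behavior described above, not Airflow's scheduler: the function `intervals_to_run` is hypothetical, and it ignores `end_date` and runs that have already happened.

```python
import datetime as dt


def intervals_to_run(start_date, now, interval, catchup):
    """Return the interval starts a newly activated DAG would run (simplified)."""
    completed = []
    current = start_date
    # An interval counts as completed once its end has passed.
    while current + interval <= now:
        completed.append(current)
        current += interval
    if catchup:
        return completed    # run every past interval
    return completed[-1:]   # run only the most recent completed interval


start = dt.datetime(2019, 1, 1)
now = dt.datetime(2019, 1, 5, 12, 0)
daily = dt.timedelta(days=1)

with_catchup = intervals_to_run(start, now, daily, catchup=True)
without_catchup = intervals_to_run(start, now, daily, catchup=False)
print([d.date().isoformat() for d in with_catchup])
print([d.date().isoformat() for d in without_catchup])
```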

Running past intervals in this way is known as backfilling. Backfilling can also be used to reprocess data after we have made changes to our code.

#### Best Practices for Designing Tasks

The term atomicity is frequently used in database systems, where an atomic transaction is considered to be an indivisible and irreducible series of database operations such that either all occur, or nothing occurs. Similarly, in Airflow, tasks should be defined so that they either succeed and produce some proper result, or fail in a manner that does not affect the state of the system.

<img src="./pic/CH03atomicity.png" width="800">

#### Atomicity

Atomicity here works like the database property. For example, calculating statistics and sending them by email are two separate operations, so each should be its own atomic task. Splitting the email step into a separate task looks like this:

```python
def _send_stats(email, **context):
    stats = pd.read_csv(context["templates_dict"]["stats_path"])
    # email_stats is assumed to be a helper defined elsewhere.
    email_stats(stats, email=email)  #A


send_stats = PythonOperator(
    task_id="send_stats",
    python_callable=_send_stats,
    op_kwargs={"email": "user@example.com"},
    templates_dict={"stats_path": "/data/stats/{{ds}}.csv"},
    dag=dag,
)

calculate_stats >> send_stats
```

- #A Split off the email_stats statement into a separate task for atomicity.

From this example, you might think that separating all operations into individual tasks is sufficient to make all our tasks atomic. This is, however, not necessarily true. To see why, think about what would happen if our event API required us to log in before querying for events. This would generally require an extra API call to fetch an authentication token, after which we can start retrieving our events. Fetching the token in a separate task would make the event-fetching task fail unless the token task had run shortly before, so these two operations are better kept together in a single task.
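Within a single task, one common way to avoid leaving partial results behind is to write output to a temporary file and only move it into place once writing has succeeded. The sketch below shows this general technique in plain Python; it is not taken from the chapter, and the helper `write_atomically` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(output_path, text):
    """Write text so that output_path is either complete or not created at all."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file next to the target so the final move
    # stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
        os.replace(tmp_name, output_path)  # move the finished file into place
    except BaseException:
        # On failure, remove the partial file and leave the target untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


with tempfile.TemporaryDirectory() as data_dir:
    target = Path(data_dir) / "stats" / "2019-01-01.csv"
    write_atomically(target, "date,user,count\n")
    content = target.read_text()
    files = sorted(p.name for p in target.parent.iterdir())

print(content, end="")
print(files)
```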

#### Idempotency

Tasks are said to be idempotent if calling the same task multiple times with the same inputs has no additional effect. This means, for example, that re-running a task without changing the inputs should not change the overall output.

<img src="./pic/CH03 Idempotency.png" width="800"> 
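The difference between an idempotent and a non-idempotent write can be sketched in plain Python. This is an illustrative example under simple assumptions: the function names, file names, and sample row are hypothetical, and running the loop twice stands in for re-running a task.

```python
import tempfile
from pathlib import Path


def write_partition(data_dir, ds, rows):
    """Idempotent: overwrite the partition for ds on every run."""
    path = Path(data_dir) / f"{ds}.csv"
    path.write_text("\n".join(rows) + "\n")  # replaces any earlier output
    return path


def append_rows(data_dir, rows):
    """Not idempotent: every re-run appends the same rows again."""
    path = Path(data_dir) / "all.csv"
    with path.open("a") as handle:
        handle.write("\n".join(rows) + "\n")
    return path


rows = ["2019-01-01,alice,3"]
with tempfile.TemporaryDirectory() as data_dir:
    for _ in range(2):  # simulate running the same task twice
        partition = write_partition(data_dir, "2019-01-01", rows)
        appended = append_rows(data_dir, rows)
    partition_lines = partition.read_text().splitlines()
    appended_lines = appended.read_text().splitlines()

print(len(partition_lines))  # unchanged by the second run
print(len(appended_lines))   # grows with every run
```

Writing to a file named after the execution date, as in the partitioning section, is what makes the overwrite variant safe to re-run.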