# Scheduler = service to start pipeline at specified times

A scheduler is a system that starts a process (in our case, a data pipeline) at specified times or at a specified frequency.

A scheduler usually runs constantly (e.g. the Airflow scheduler, cron) and periodically checks its timetable information (metadata in Airflow, the crontab file for cron, etc.) to figure out which process, if any, to start.
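
The periodic check described above can be sketched in plain Python. This is a minimal illustration, not how Airflow or cron is implemented; the `timetable` and `intervals` dictionaries and the `tick` function are hypothetical names made up for this example.

```python
from datetime import datetime, timedelta

# Hypothetical timetable: pipeline name -> next time it is due to start.
timetable = {
    "daily_sales": datetime(2024, 1, 1, 0, 0),
    "hourly_events": datetime(2024, 1, 1, 1, 0),
}

# Hypothetical frequencies: how often each pipeline should run.
intervals = {
    "daily_sales": timedelta(days=1),
    "hourly_events": timedelta(hours=1),
}


def tick(timetable, intervals, now):
    """One scheduler check: start every pipeline that is due at `now`.

    Returns (pipeline name, scheduled time) pairs for the started pipelines
    and moves each started pipeline to its next scheduled time.
    """
    started = []
    for name in sorted(timetable):
        scheduled_time = timetable[name]
        if scheduled_time <= now:
            # A real scheduler would trigger the pipeline here.
            started.append((name, scheduled_time))
            timetable[name] = scheduled_time + intervals[name]
    return started


# The check happens 5 minutes late, but the scheduled time is still reported.
started = tick(timetable, intervals, now=datetime(2024, 1, 1, 0, 5))
print(started)
```

A long-running scheduler would call something like `tick` in a loop with a short sleep between checks.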

With data pipelines you often need to know when a run was supposed to start. For example, if your pipeline is scheduled to run at 12:00 AM every day but gets delayed due to infrastructure scaling or unavailability, you would still have access to the scheduled execution time (12:00 AM).

Keeping track of the execution time is critical if you want each run of your pipeline to work on a specific set of data.

add: execution time image from max's blog

The execution time indicates the input data for most data pipelines, and depending on how you design your pipeline it can play a crucial role. It is often used as a unique identifier for a specific run of the pipeline (aka `run_id`).
Having a unique identifier per pipeline run (`run_id`) will help you design idempotent, easy-to-debug pipelines.
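
To make the idea concrete, here is a minimal sketch of an idempotent run keyed by execution time. The `make_run_id` and `run_pipeline` functions, the `store` dictionary (standing in for a table or object store), and the row layout are all assumptions made for this example.

```python
from datetime import datetime


def make_run_id(pipeline_name, execution_time):
    """Build a run identifier from the pipeline name and its execution time."""
    return f"{pipeline_name}__{execution_time.isoformat()}"


def run_pipeline(store, pipeline_name, execution_time, rows):
    """Process only the rows for the execution date and overwrite the output.

    Overwriting (instead of appending) under the run_id is what makes a
    rerun of the same execution time produce the same result.
    """
    run_id = make_run_id(pipeline_name, execution_time)
    day = execution_time.date()
    store[run_id] = [row for row in rows if row["event_time"].date() == day]
    return run_id


rows = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 9, 30)},
    {"id": 2, "event_time": datetime(2024, 1, 2, 10, 0)},
]
store = {}
execution_time = datetime(2024, 1, 1, 0, 0)

first = run_pipeline(store, "daily_sales", execution_time, rows)
second = run_pipeline(store, "daily_sales", execution_time, rows)  # rerun
print(first, len(store))
```

Because the output location is derived from the execution time, rerunning a failed or delayed run replaces its own output and leaves other runs untouched.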
                                                                                                                                                

In [2]:
# add: airflow macros
# https://airflow.apache.org/docs/apache-airflow/1.10.3/macros.html

While we saw the available macros in Airflow, most schedulers have similar options.
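
As a rough illustration of what such template values look like, the sketch below computes strings in the same format as Airflow's `ds`, `ds_nodash`, and `ts` macros from an execution time. It only mimics the output format: Airflow renders these through Jinja templates (e.g. `{{ ds }}`), while this example uses `str.format` as a stand-in, and the `orders` table and `template_values` function are hypothetical.

```python
from datetime import datetime, timezone


def template_values(execution_time):
    """Return date strings in the same format as a few Airflow macros."""
    return {
        "ds": execution_time.strftime("%Y-%m-%d"),       # date with dashes
        "ds_nodash": execution_time.strftime("%Y%m%d"),  # date without dashes
        "ts": execution_time.isoformat(),                # full ISO timestamp
    }


values = template_values(datetime(2024, 1, 1, tzinfo=timezone.utc))

# Use the execution date to select only the data for this run.
query = "SELECT * FROM orders WHERE order_date = '{ds}'".format(**values)
print(query)
```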

# Orchestrator = System to ensure tasks in a data pipeline are run in the correct order 

An orchestrator is a system that is responsible for ordering the tasks in a data pipeline. With an orchestrator, your pipeline will always run its tasks in a specified order.

An orchestrator is responsible for ensuring that your data pipeline is a DAG (directed acyclic graph), i.e. that there are no infinite loops in your data pipeline.
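
The DAG requirement can be checked with Python's standard-library `graphlib` (Python 3.9+). This is a minimal sketch of the idea, not how any particular orchestrator does it; the task names and the `task_order` and `is_dag` helpers are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def task_order(dependencies):
    """Return tasks in an order that respects their dependencies.

    `dependencies` maps each task to the set of tasks that must run before it.
    Raises CycleError if the tasks do not form a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


def is_dag(dependencies):
    """Return True if the dependency graph has no cycles."""
    try:
        task_order(dependencies)
        return True
    except CycleError:
        return False


# A linear pipeline: each task depends on the one before it.
pipeline = {
    "dbt_run": set(),
    "dbt_test": {"dbt_run"},
    "export_metrics": {"dbt_test"},
}
print(task_order(pipeline))

# Two tasks that depend on each other form a cycle, so this is not a DAG.
cyclic = {"a": {"b"}, "b": {"a"}}
print(is_dag(cyclic))
```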

An orchestrator is often a Python library (e.g. Airflow, Dagster) with features that let you order your tasks as you see fit, or a system that automatically generates the task order from the code (e.g. dbt uses the `ref` feature to identify data task lineage).

Modern orchestrators offer features that are hard to replicate in native Python without extensive code:
* branching
* dynamic task creation
* grouping related tasks
* conditional workflows (task A or B or C, ... -> pass/fail)

**NOTE** Schedulers and orchestrators are often treated as one system, but they are not. You want to understand where the scheduler ends and the orchestrator begins so that you use the right system for your use case.

# Airflow is both a scheduler and an orchestrator

Let's take a quick look at Airflow's architecture:

add: image from https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html#basic-airflow-deployment

When we start Docker, we can see the Airflow webserver (responsible for the UI) and the Airflow scheduler (responsible for starting the DAGs at the right time).

In addition, we have installed `apache-airflow-client`, which is used for defining data pipelines in our code.

# Define data pipeline as a DAG

Most orchestrators have their own *way* to define a data pipeline (aka DAG). We can define a DAG in Airflow as shown below:

Note that this is stored within the `./dags` folder, which is synced to our containers (via a volume mount). In the above DAG, we do the following:

1. Run `dbt run`
2. Run data quality checks with `dbt test`
3. Move the metrics data to `sqlite3` which we will use in the next chapter for visualization.

In [1]:
%%capture
! docker compose down

In [2]:
! docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


In [3]:
# do this in a terminal in this directory
# sudo mkdir -p logs plugins temp dags tests data visualization && sudo chmod -R u=rwx,g=rwx,o=rwx logs plugins temp dags tests data visualization

In [10]:
%%capture
! docker compose up --build -d

In [11]:
! sleep 30

In [12]:
! rm -rf ./dags/tpch_warehouse/models/*/.ipynb_checkpoints # always run before dbt run, caused by notebooks, no need to do this if performed via terminal

zsh:1: no matches found: ./dags/tpch_warehouse/models/*/.ipynb_checkpoints


In [13]:
! docker ps

CONTAINER ID   IMAGE                                           COMMAND                  CREATED          STATUS                    PORTS                                       NAMES
96e73ce0515e   metabase/metabase                               "/app/run_metabase.sh"   33 seconds ago   Up 31 seconds             0.0.0.0:3000->3000/tcp, :::3000->3000/tcp   dashboard
dd6ea004cf36   6-scheduling--orchestration-airflow-webserver   "/usr/bin/dumb-init …"   52 minutes ago   Up 51 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   webserver
3b5d361b77a5   6-scheduling--orchestration-airflow-scheduler   "/usr/bin/dumb-init …"   52 minutes ago   Up 52 minutes (healthy)   8080/tcp                                    scheduler
827ce29a3da1   postgres:16                                     "docker-entrypoint.s…"   52 minutes ago   Up 52 minutes (healthy)   0.0.0.0:5432->5432/tcp, :::5432->5432/tcp   postgres


In [14]:
%%capture
! docker compose down