## Job scheduling with Airflow

In this notebook, you'll learn how to automate Hopsworks jobs with the built-in Airflow integration. This is useful for tasks that need to run at regular intervals, such as ingesting and preprocessing data or training a new model.

An [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html) (directed acyclic graph) describes how the different steps in a workflow are connected. In this notebook, we'll create a simple, illustrative DAG that runs the first two notebooks in this tutorial series:

1. Load data and do feature engineering.
2. Create a dataset from the data.
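To make the idea of a DAG concrete, here is a minimal sketch in plain Python (no Airflow required, Python 3.9+ for `graphlib`). It models the two steps above as a dependency graph and derives a valid execution order. The task names `feature_engineering` and `create_dataset` are illustrative placeholders, not identifiers used by Hopsworks or Airflow.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
# The task names are hypothetical labels for the two notebooks.
dependencies = {
    "feature_engineering": set(),
    "create_dataset": {"feature_engineering"},
}


def execution_order(deps):
    """Return one valid run order for the given dependency mapping.

    graphlib raises CycleError if the dependencies contain a cycle,
    which is exactly what the "acyclic" in DAG rules out.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # ['feature_engineering', 'create_dataset']
```

Airflow performs this kind of ordering for you; the sketch only shows why declaring dependencies is enough to determine what runs first.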

If you are unfamiliar with Apache Airflow, it is a good idea to read up on the [concepts](https://airflow.apache.org/docs/apache-airflow/stable/concepts/index.html) (in particular *DAG*, *Task*, and *Operator*) before continuing with this tutorial.

### Creating Jobs

First, we'll need to create the jobs that correspond to the tasks we will run. A job can be created either programmatically or in the Hopsworks UI (as described in the [documentation](https://docs.hopsworks.ai/hopsworks/latest/compute/jobs/)). You can create four types of jobs: Spark, Flink, Python, and Docker. However, the latter two are only available in the Enterprise Edition of Hopsworks. In this example, we'll simply convert the first two notebooks of this tutorial series into Spark jobs.

Inside your project, go to `Jobs => New Job`. Give the job a name and select the file that contains its code. You can either upload a file or select one that is already on your Hopsworks cluster. If your notebooks are already on the cluster, click on `From Project => Jupyter` and select the notebook you want to convert to a job. Do this with the first two notebooks and name the jobs `feature_group_job` and `dataset_job`, respectively.

### Creating a DAG

#### Hopsworks Operators

We'll use the *HopsworksLaunchOperator*, which will launch the jobs that we just created, and the *HopsworksJobSuccessSensor*, which will tell us whether a job has succeeded or not. The control flow will look like this:

1. Launch feature engineering job using a *HopsworksLaunchOperator*.
2. Check that the job was successfully completed using a *HopsworksJobSuccessSensor*.
3. Launch dataset creation job using a *HopsworksLaunchOperator*.
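The three steps above can be sketched in plain Python to show what "launch, wait for success, launch" means in practice. This is a simulation under stated assumptions, not the Hopsworks plugin's implementation: the helper `wait_for_success`, the state strings `"RUNNING"`, `"SUCCEEDED"`, and `"FAILED"`, and the timeout behaviour are all hypothetical. Only the job names come from the previous section.

```python
import time


def wait_for_success(get_state, poke_interval=10, timeout=3600, sleep=time.sleep):
    """Poll get_state() every `poke_interval` seconds until it returns "SUCCEEDED".

    Returns the number of seconds spent waiting. Raises RuntimeError if the
    job reports "FAILED" and TimeoutError once `timeout` seconds have passed.
    (Hypothetical helper; the state names are assumptions.)
    """
    waited = 0
    while True:
        state = get_state()
        if state == "SUCCEEDED":
            return waited
        if state == "FAILED":
            raise RuntimeError("job failed")
        if waited >= timeout:
            raise TimeoutError("job did not succeed in time")
        sleep(poke_interval)
        waited += poke_interval


# Simulated backend: the first job reports RUNNING twice, then SUCCEEDED.
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
log = []

log.append("launch feature_group_job")                      # step 1
waited = wait_for_success(lambda: next(states),             # step 2
                          poke_interval=10,
                          sleep=lambda seconds: None)       # skip real sleeping in the demo
log.append(f"feature_group_job succeeded after {waited}s")
log.append("launch dataset_job")                            # step 3

print("\n".join(log))
```

The point of the sensor step is that the second job is never launched until the first one has reported success; a smaller `poke_interval` reacts faster at the cost of more status checks.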


#### DAG Definition File

Next, we'll create the DAG definition file. We'll schedule the DAG to run at 04:00 every day in this example.
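A daily 04:00 schedule is typically written as a cron expression such as `0 4 * * *`, whose five fields are minute, hour, day of month, month, and day of week. As a rough illustration of what that expression means, the sketch below computes the next matching time with the standard library only. The function `next_daily_run` is a hypothetical helper, it is not how Airflow evaluates schedules internally, and it ignores time zones.

```python
from datetime import datetime, timedelta


def next_daily_run(now, hour=4, minute=0):
    """Return the first time strictly after `now` that matches `minute hour * * *`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime(2022, 1, 1, 3, 30)))  # 2022-01-01 04:00:00
print(next_daily_run(datetime(2022, 1, 1, 4, 0)))   # 2022-01-02 04:00:00
```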

You will need to fill in your Hopsworks username and project name in the code below (see comments).

In [1]:
%%writefile dag.py
from datetime import datetime
from airflow import DAG

from hopsworks_plugin.operators.hopsworks_operator import HopsworksLaunchOperator
from hopsworks_plugin.sensors.hopsworks_sensor import HopsworksJobSuccessSensor

HOPSWORKS_USERNAME = "" # TODO: Change to your username in Hopsworks.
PROJECT_NAME = "" # TODO: Change to your project name in Hopsworks.

args = {
    "owner": HOPSWORKS_USERNAME,
    "depends_on_past": False,
}

dag = DAG(
    dag_id = "fraud_dag",
    default_args = args,
    # The start date is arbitrary as we set catchup = False.
    start_date = datetime(2022,1,1),
    catchup = False,
    # Cron fields: minute hour day-of-month month day-of-week, i.e. 04:00 every day.
    schedule_interval = "0 4 * * *"
)

task1 = HopsworksLaunchOperator(dag=dag, task_id="run_job_0", job_name="feature_group_job", project_name=PROJECT_NAME)
task2 = HopsworksLaunchOperator(dag=dag, task_id="run_job_1", job_name="dataset_job", project_name=PROJECT_NAME)

sensor = HopsworksJobSuccessSensor(dag=dag,
                                   poke_interval=10,
                                   task_id="wait_for_success_job_0",
                                   job_name="feature_group_job",
                                   project_name=PROJECT_NAME)

task1 >> sensor >> task2

Overwriting dag.py


Running this code will create a file called `dag.py`. Go to `Airflow` in the Hopsworks UI, click on the three dots in the top right bar, and click on `upload_files` to upload it. Next, click on `Open Airflow`, which should show you a list of the DAGs you have uploaded. To activate your DAG, switch the toggle on the left-hand side from `OFF` to `ON`. When you are done experimenting, switch it back to `OFF`; otherwise the DAG will keep running every day at 04:00.