
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Orchestrating Jobs with Databricks

New updates to the Databricks Jobs UI have added the ability to schedule multiple tasks as part of a job, allowing Databricks Jobs to fully handle orchestration for most production workloads.

Here, we'll start by reviewing the steps for scheduling a notebook as a triggered standalone job, and then add a dependent task using a DLT pipeline.

## Learning Objectives
By the end of this lesson, you should be able to:
* Schedule a notebook as a Databricks Job
* Describe job scheduling options and differences between cluster types
* Review Job Runs to track progress and see results
* Schedule a DLT pipeline as a Databricks Job
* Configure linear dependencies between tasks using the Databricks Jobs UI

In [0]:
%run ../Includes/Classroom-Setup-9.1.1

## Create and configure a pipeline
The pipeline we create here is nearly identical to the one in the previous unit.

We will use it as part of a scheduled job in this lesson.

Execute the following cell to print out the values that will be used during the following configuration steps.

In [0]:
print_pipeline_config()    


Steps:
1. Click the **Jobs** button on the sidebar.
1. Select the **Delta Live Tables** tab.
1. Click **Create Pipeline**.
1. Fill in a **Pipeline Name** - because these names must be unique, we suggest using the **Pipeline Name** provided in the cell above.
1. For **Notebook Libraries**, use the navigator to locate and select the companion notebook called **DE 9.1.3 - DLT Job**. Alternatively, you can copy the **Notebook Path** and paste it into the field provided.
1. In the **Target** field, specify the database name printed out next to **Target** in the cell above.<br/>
This should follow the pattern **`dbacademy_<username>_dewd_dlt_demo_91`**
1. In the **Storage location** field, copy the directory as printed above.
1. For **Pipeline Mode**, select **Triggered**.
1. Uncheck the **Enable autoscaling** box.
1. Set the number of workers to **`1`** (one).
1. Click **Create**.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: We won't execute this pipeline directly, because our job will run it later in this lesson,<br/>
but if you want to test it quickly, you can click the **Start** button now.
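
The UI choices above correspond to a JSON settings document that can also be submitted through the Delta Live Tables pipelines REST API. The following is a minimal sketch of that document built as a Python dictionary. The helper name `build_pipeline_settings` and every sample value (pipeline name, notebook path, target, storage path) are placeholders for illustration, not values from this lesson; substitute the ones printed by `print_pipeline_config()`.

```python
import json


def build_pipeline_settings(name, notebook_path, target, storage):
    """Return DLT pipeline settings that mirror the UI steps above.

    Hypothetical helper for illustration only.
    """
    return {
        "name": name,
        "libraries": [{"notebook": {"path": notebook_path}}],
        "target": target,
        "storage": storage,
        "continuous": False,  # Pipeline Mode: Triggered
        "clusters": [
            # A fixed worker count instead of an "autoscale" block
            {"label": "default", "num_workers": 1}
        ],
    }


settings = build_pipeline_settings(
    name="my-pipeline-name",                 # placeholder
    notebook_path="/path/to/dlt-notebook",   # placeholder
    target="my_target_database",             # placeholder
    storage="/path/to/storage",              # placeholder
)
print(json.dumps(settings, indent=2))
```

Keeping the settings in a dictionary like this makes it easier to review them or keep them under version control than re-entering them in the UI.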

## Schedule a Notebook Job

When using the Jobs UI to orchestrate a workload with multiple tasks, you'll always begin by scheduling a single task.

Before we start, run the following cell to get the values used in this step.

In [0]:
print_job_config()

Here, we'll start by scheduling the next notebook.

Steps:
1. Navigate to the Jobs UI using the Databricks left side navigation bar.
1. Click the blue **`Create Job`** button
1. Configure the task:
    1. Enter **`reset`** for the task name
    1. Select the notebook **`DE 9.1.2 - Reset`** using the notebook picker.
    1. From the **Cluster** dropdown, under **Existing All Purpose Clusters**, select your cluster
    1. Click **Create**
1. In the top-left of the screen, rename the job (not the task) from **`reset`** (the default value) to the **Job Name** provided for you in the previous cell.
1. Click the blue **Run now** button in the top right to start the job.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: When selecting your all-purpose cluster, you will get a warning about how this will be billed as all-purpose compute. Production jobs should always be scheduled against new job clusters appropriately sized for the workload, as this is billed at a much lower rate.
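
For comparison, a single-task job like this one can be described as a JSON payload for the Jobs API 2.1. The sketch below builds such a payload in Python. The helper `build_notebook_job` is hypothetical, and the job name, notebook path, and cluster ID are placeholders rather than values from this lesson.

```python
def build_notebook_job(job_name, notebook_path,
                       existing_cluster_id=None, new_cluster=None):
    """Return a Jobs API 2.1 style payload with one notebook task.

    Hypothetical helper for illustration. Exactly one of
    existing_cluster_id (all-purpose cluster) or new_cluster
    (job cluster specification) must be given.
    """
    if (existing_cluster_id is None) == (new_cluster is None):
        raise ValueError(
            "Provide exactly one of existing_cluster_id or new_cluster"
        )
    task = {
        "task_key": "reset",
        "notebook_task": {"notebook_path": notebook_path},
    }
    if existing_cluster_id is not None:
        task["existing_cluster_id"] = existing_cluster_id
    else:
        task["new_cluster"] = new_cluster
    return {"name": job_name, "tasks": [task]}


job = build_notebook_job(
    job_name="my-job-name",                   # placeholder
    notebook_path="/path/to/reset-notebook",  # placeholder
    existing_cluster_id="my-cluster-id",      # placeholder
)
print(job)
```

Passing a `new_cluster` specification instead of `existing_cluster_id` corresponds to the job-cluster setup recommended for production in the note above.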

## Cron Scheduling of Databricks Jobs

Note that on the right-hand side of the Jobs UI, directly under the **Job Details** section, is a section labeled **Schedule**.

Click on the **Edit schedule** button to explore scheduling options.

Changing the **Schedule type** field from **Manual** to **Scheduled** will bring up a cron scheduling UI.

This UI provides extensive options for setting up cron-based scheduling of your jobs. Settings configured with the UI can also be output in cron syntax, which can be edited if you need a custom configuration that the UI does not offer.

At this time, we'll leave our job set with **Manual** scheduling.
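
Databricks job schedules are written in Quartz cron syntax, which starts with a seconds field. As a small sketch of what a schedule looks like outside the UI, the hypothetical helper `daily_quartz_cron` below builds an expression that fires once a day at a given time and places it in a schedule block of the kind used by the Jobs API. The time, time zone, and pause status are example values only.

```python
def daily_quartz_cron(hour, minute=0):
    """Build a Quartz cron expression that fires once a day.

    Hypothetical helper for illustration only.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Quartz field order:
    # seconds minutes hours day-of-month month day-of-week
    return f"0 {minute} {hour} * * ?"


schedule = {
    "quartz_cron_expression": daily_quartz_cron(6, 30),
    "timezone_id": "UTC",      # example value
    "pause_status": "PAUSED",  # keeps the schedule from firing
}
print(schedule["quartz_cron_expression"])
```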

## Review Run

As currently configured, our single-notebook job behaves the same as a job in the legacy Databricks Jobs UI, which only allowed a single notebook to be scheduled.

To review the job run:
1. Select the **Runs** tab in the top-left of the screen (you should currently be on the **Tasks** tab).
1. Find your job. If **the job is still running**, it will be under the **Active runs** section. If **the job finished running**, it will be under the **Completed runs** section.
1. Open the output details by clicking the timestamp in the **Start time** column.
1. If **the job is still running**, you will see the active state of the notebook with a **Status** of **`Pending`** or **`Running`** in the right side panel. If **the job has completed**, you will see the full execution of the notebook with a **Status** of **`Succeeded`** or **`Failed`** in the right side panel.
  
The notebook employs the magic command **`%run`** to call an additional notebook using a relative path. Note that while not covered in this course, <a href="https://docs.databricks.com/repos.html#work-with-non-notebook-files-in-a-databricks-repo" target="_blank">new functionality added to Databricks Repos allows loading Python modules using relative paths</a>.

The actual outcome of the scheduled notebook is to reset the environment for our new job and pipeline.

## Schedule a DLT Pipeline as a Task

In this step, we'll add a DLT pipeline to execute after the success of the task we configured at the start of this lesson.

Steps:
1. At the top left of your screen, you'll see the **Runs** tab is currently selected; click the **Tasks** tab.
1. Click the large blue circle with a **+** at the center bottom of the screen to add a new task
    1. Specify the **Task name** as **`dlt`**
    1. From **Type**, select **`Delta Live Tables pipeline`**
    1. Click the **Pipeline** field and select the DLT pipeline you configured previously<br/>
    Note: The pipeline will start with **Jobs-Demo-91** and will end with your email address.
    1. The **Depends on** field defaults to your previously defined task but may have renamed itself from the value **reset** that you specified previously to something like **Jobs-Demo-91-youremailaddress**.
    1. Click the blue **Create task** button

You should now see a screen with two boxes and a downward arrow between them.

Your **`reset`** task (possibly renamed to something like **Jobs-Demo-91-youremailaddress**) will be at the top, 
leading into your **`dlt`** task. 

This visualization represents the dependencies between these tasks.

Click **Run now** to execute your job.

**NOTE**: You may need to wait a few minutes as infrastructure for your job and pipeline is deployed.
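
The dependency drawn in the UI corresponds to a `depends_on` entry in the job's JSON definition. The sketch below shows how the two tasks might look in a Jobs API 2.1 style payload, along with a small hypothetical helper, `execution_order`, that derives the run order from those entries. The job name, notebook path, cluster ID, and pipeline ID are placeholders.

```python
job = {
    "name": "my-job-name",  # placeholder
    "tasks": [
        {
            "task_key": "reset",
            "notebook_task": {"notebook_path": "/path/to/reset-notebook"},
            "existing_cluster_id": "my-cluster-id",  # placeholder
        },
        {
            "task_key": "dlt",
            # Run only after the "reset" task succeeds
            "depends_on": [{"task_key": "reset"}],
            "pipeline_task": {"pipeline_id": "my-pipeline-id"},
        },
    ],
}


def execution_order(tasks):
    """Return task keys in an order that respects depends_on.

    Hypothetical helper for illustration; raises ValueError on a
    circular dependency or a dependency on an undefined task.
    """
    deps = {
        t["task_key"]: {d["task_key"] for d in t.get("depends_on", [])}
        for t in tasks
    }
    order = []
    while deps:
        # Tasks whose dependencies have all been scheduled already
        ready = sorted(key for key, waiting in deps.items() if not waiting)
        if not ready:
            raise ValueError("Circular or unknown dependency")
        for key in ready:
            order.append(key)
            del deps[key]
        for waiting in deps.values():
            waiting.difference_update(ready)
    return order


print(execution_order(job["tasks"]))
```

For the linear dependency configured in this lesson, the order is simply `reset` followed by `dlt`.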

## Review Multi-Task Run Results

Select the **Runs** tab again and then the most recent run under **Active runs** or **Completed runs**, depending on whether the job has completed.

The visualizations for tasks will update in real time to reflect which tasks are actively running, and will change colors if task failures occur. 

Clicking on a task box will render the scheduled notebook in the UI. 

You can think of this as an additional layer of orchestration on top of the previous Databricks Jobs UI. If you have workloads that schedule jobs with the CLI or REST API, note that <a href="https://docs.databricks.com/dev-tools/api/latest/jobs.html" target="_blank">the JSON structure used to configure jobs and retrieve results has been updated in the same way as the UI</a>.

**NOTE**: At this time, DLT pipelines scheduled as tasks do not directly render results in the Runs GUI; instead, you will be directed back to the DLT Pipeline GUI for the scheduled Pipeline.
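
When results are retrieved through the REST API instead of the UI, each task in a multi-task run reports its own state. The sketch below summarizes per-task status from a run description shaped like a Jobs API 2.1 response. The `sample_run` dictionary is hand-written for illustration and is not output from a real run, and `summarize_run` is a hypothetical helper.

```python
# Hand-written sample, shaped like a multi-task run description.
sample_run = {
    "run_id": 1,
    "state": {"life_cycle_state": "RUNNING"},
    "tasks": [
        {
            "task_key": "reset",
            "state": {
                "life_cycle_state": "TERMINATED",
                "result_state": "SUCCESS",
            },
        },
        {
            "task_key": "dlt",
            "state": {"life_cycle_state": "RUNNING"},
        },
    ],
}


def summarize_run(run):
    """Map each task key to its result state, or to its life cycle
    state if the task has not finished yet.

    Hypothetical helper for illustration only.
    """
    summary = {}
    for task in run.get("tasks", []):
        state = task.get("state", {})
        summary[task["task_key"]] = (
            state.get("result_state")
            or state.get("life_cycle_state", "UNKNOWN")
        )
    return summary


print(summarize_run(sample_run))
```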

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>