
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

## End-to-End ETL in the Lakehouse

In this notebook, you will pull together concepts learned throughout the course to complete an example data pipeline.

The following is a non-exhaustive list of skills and tasks necessary to successfully complete this exercise:
* Using Databricks notebooks to write queries in SQL and Python
* Creating and modifying databases, tables, and views
* Using Auto Loader and Spark Structured Streaming for incremental data processing in a multi-hop architecture
* Using Delta Live Tables SQL syntax
* Configuring a Delta Live Tables pipeline for continuous processing
* Using Databricks Jobs to orchestrate tasks from notebooks stored in Repos
* Setting chronological scheduling for Databricks Jobs
* Defining queries in Databricks SQL
* Creating visualizations in Databricks SQL
* Defining Databricks SQL dashboards to review metrics and results

## Run Setup
Run the following cell to reset all the databases and directories associated with this lab.

In [0]:
%run ../../Includes/Classroom-Setup-12.2.1L

## Land Initial Data
Seed the landing zone with some data before proceeding.

In [0]:
DA.data_factory.load()

## Create and Configure a DLT Pipeline
**NOTE**: The main difference between the instructions here and in previous labs with DLT is that in this instance, we will be setting up our pipeline for **Continuous** execution in **Production** mode.

In [0]:
print_pipeline_config()

Steps:
1. Click the **Jobs** button on the sidebar.
1. Select the **Delta Live Tables** tab.
1. Click **Create Pipeline**.
1. Fill in a **Pipeline Name**. Because these names must be unique, we suggest using the **Pipeline Name** provided in the cell above.
1. For **Notebook Libraries**, use the navigator to locate and select the notebook **DE 12.2.2L - DLT Task**.
    * Alternatively, you can copy the **Notebook Path** specified above and paste it into the field provided.
1. Configure the Source
    * Click **`Add configuration`**
    * Enter the word **`source`** in the **Key** field
    * Enter the **Source** value specified above in the **`Value`** field
1. In the **Target** field, specify the database name printed out next to **Target** in the cell above.<br/>
This should follow the pattern **`dbacademy_<username>_dewd_cap_12`**
1. In the **Storage location** field, copy the directory as printed above.
1. For **Pipeline Mode**, select **Continuous**
1. Uncheck the **Enable autoscaling** box
1. Set the number of workers to **`1`** (one)
1. Click **Create**.
1. After the UI updates, change from **Development** to **Production** mode

This should begin the deployment of infrastructure.
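
For reference, the UI choices above can be pictured as a pipeline settings document. The following is a minimal sketch, assuming the commonly documented DLT settings field names (`name`, `storage`, `target`, `configuration`, `libraries`, `clusters`, `continuous`, `development`). Every value in angle brackets is a placeholder to be replaced with the output of **`print_pipeline_config()`**; none of them are real values.

```python
import json

# Placeholders only: substitute the values printed by print_pipeline_config().
pipeline_name = "<pipeline-name>"
notebook_path = "<notebook-path>"
source = "<source>"
target = "<target-database>"
storage = "<storage-location>"

pipeline_settings = {
    "name": pipeline_name,
    "storage": storage,
    "target": target,
    # The single key/value pair added under "Add configuration".
    "configuration": {"source": source},
    "libraries": [{"notebook": {"path": notebook_path}}],
    # A fixed worker count (no "autoscale" block) mirrors unchecking autoscaling.
    "clusters": [{"label": "default", "num_workers": 1}],
    "continuous": True,     # Pipeline Mode: Continuous
    "development": False,   # Production mode
}

print(json.dumps(pipeline_settings, indent=2))
```

Comparing this sketch with the settings shown in the pipeline UI is a quick way to confirm that continuous execution and production mode were both applied.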

## Schedule a Notebook Job

Our DLT pipeline is set up to process data as soon as it arrives.

We'll schedule a notebook to land a new batch of data every 2 minutes so we can see this functionality in action.

Before we start, run the following cell to get the values used in this step.

In [0]:
print_job_config()

Steps:
1. Navigate to the Jobs UI using the Databricks left side navigation bar.
1. Click the blue **Create Job** button
1. Configure the task:
    1. Enter **Land-Data** for the task name
    1. Select the notebook **DE 12.2.3L - Land New Data** using the notebook picker
    1. From the **Cluster** dropdown, under **Existing All Purpose Cluster**, select your cluster
    1. Click **Create**
1. In the top-left of the screen, rename the job (not the task) from **`Land-Data`** (the default value) to the **Job Name** provided for you in the previous cell.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: When selecting your all purpose cluster, you will get a warning about how this will be billed as all purpose compute. Production jobs should always be scheduled against new job clusters appropriately sized for the workload, as this is billed at a much lower rate.

## Set a Chronological Schedule for your Job
Steps:
1. Navigate to the **Jobs UI** and click on the job you just created.
1. Locate the **Schedule** section in the side panel on the right.
1. Click on the **Edit schedule** button to explore scheduling options.
1. Change the **Schedule type** field from **Manual** to **Scheduled**, which will bring up a cron scheduling UI.
1. Set the schedule to update **Every 2**, **Minutes** from **00** 
1. Click **Save**

**NOTE**: If you wish, you can click **Run now** to trigger the first run, or wait for the next scheduled run to make sure your scheduling has worked successfully.
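
The schedule above can also be written as a Quartz cron expression, which is the syntax Databricks Jobs accepts for schedules. The sketch below assumes that "from 00" means the schedule starts at minute 00 of each hour; the expression and the helper `next_runs` are illustrative and not taken from the lab materials.

```python
from datetime import datetime, timedelta

# Quartz fields: seconds minutes hours day-of-month month day-of-week.
# "0/2" in the minutes field means "every 2 minutes starting at minute 0".
QUARTZ_EVERY_2_MIN = "0 0/2 * * * ?"

def next_runs(now, count, every_minutes=2):
    """Return the next `count` fire times strictly after `now`.

    Assumes `every_minutes` divides 60 evenly, so slots align to the hour.
    """
    # Snap back to the most recent slot boundary, then step forward.
    base = now.replace(second=0, microsecond=0)
    base -= timedelta(minutes=base.minute % every_minutes)
    return [base + timedelta(minutes=every_minutes * i) for i in range(1, count + 1)]

runs = next_runs(datetime(2022, 1, 1, 9, 3, 30), 3)
print([r.strftime("%H:%M") for r in runs])  # ['09:04', '09:06', '09:08']
```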

## Register DLT Event Metrics for Querying with DBSQL

The following cell prints out SQL statements to register the DLT event logs to your target database for querying in DBSQL.

Execute the output code with the DBSQL Query Editor to register these tables and views. 

Explore each and make note of the logged event metrics.

In [0]:
DA.generate_register_dlt_event_metrics_sql()

## Define a Query on the Gold Table

The **daily_patient_avg** table is automatically updated each time a new batch of data is processed through the DLT pipeline. Each time a query is executed against this table, DBSQL checks whether a newer version exists and then materializes results from the newest available version.

Run the following cell to print out a query with your database name. Save this as a DBSQL query.

In [0]:
DA.generate_daily_patient_avg()

## Add a Line Plot Visualization

To track trends in patient averages over time, create a line plot and add it to a new dashboard.

Create a line plot with the following settings:
* **X Column**: **`date`**
* **Y Column**: **`avg_heartrate`**
* **Group By**: **`name`**

Add this visualization to a dashboard.

## Track Data Processing Progress

The code below extracts the **`flow_name`**, **`timestamp`**, and **`num_output_rows`** from the DLT event logs.

Save this query in DBSQL, then define a bar plot visualization that shows:
* **X Column**: **`timestamp`**
* **Y Column**: **`num_output_rows`**
* **Group By**: **`flow_name`**

Add your visualization to your dashboard.

In [0]:
DA.generate_visualization_query()
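
To make the bar plot settings above concrete, here is a small sketch of how **Group By** splits query results into one series per **`flow_name`**. The rows are hypothetical sample data, and `group_series` is an illustrative helper rather than anything DBSQL exposes; real values come from your own event log query.

```python
from collections import defaultdict

# Hypothetical sample rows; real values come from your DLT event log query.
rows = [
    {"flow_name": "flow_a", "timestamp": "2022-01-01T09:02", "num_output_rows": 120},
    {"flow_name": "flow_b", "timestamp": "2022-01-01T09:00", "num_output_rows": 80},
    {"flow_name": "flow_a", "timestamp": "2022-01-01T09:00", "num_output_rows": 100},
]

def group_series(rows, x, y, group_by):
    """Build one (x, y) series per distinct value of `group_by`, ordered by x."""
    series = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[x]):
        series[row[group_by]].append((row[x], row[y]))
    return dict(series)

bars = group_series(rows, x="timestamp", y="num_output_rows", group_by="flow_name")
for flow, points in bars.items():
    print(flow, points)
```

The same grouping idea applies to the line plot earlier in this notebook, with **`date`**, **`avg_heartrate`**, and **`name`** in place of the three columns used here.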

## Refresh your Dashboard and Track Results

The **Land-Data** notebook scheduled with Jobs above has 12 batches of data, each representing a month of recordings for our small sampling of patients. As configured per our instructions, it should take just over 20 minutes for all of these batches of data to be triggered and processed (we scheduled the Databricks Job to run every 2 minutes, and batches of data will process through our pipeline very quickly after initial ingestion).
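
The timing estimate above follows from simple arithmetic. This sketch assumes each job run lands exactly one batch and that the first run fires at minute 0; `batches_landed` is a hypothetical helper for illustration.

```python
TOTAL_BATCHES = 12
INTERVAL_MINUTES = 2

def batches_landed(elapsed_minutes, total=TOTAL_BATCHES, interval=INTERVAL_MINUTES):
    """Number of batches landed, given one run at minute 0 and one per interval after."""
    return min(total, elapsed_minutes // interval + 1)

# The last of 12 batches lands 11 intervals after the first.
minutes_until_last_batch = (TOTAL_BATCHES - 1) * INTERVAL_MINUTES
print(minutes_until_last_batch)   # 22
print(batches_landed(10))         # 6
```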

Refresh your dashboard and review your visualizations to see how many batches of data have been processed. (If you followed the instructions as outlined here, there should be 12 distinct flow updates tracked by your DLT metrics.) If not all of the source data has been processed yet, you can go back to the Databricks Jobs UI and manually trigger additional batches.

With everything configured, you can now continue to the final part of your lab in the notebook [DE 12.2.4L - Final Steps]($./DE 12.2.4L - Final Steps).

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>