
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Lab: Migrating a SQL Pipeline to Delta Live Tables

In this notebook, you will implement a DLT pipeline using SQL.

It is **not intended** to be executed interactively, but rather to be deployed as a pipeline once you have completed your changes.

For help completing this notebook, refer to the <a href="https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-language-ref.html#sql" target="_blank">DLT syntax documentation</a>.

## Declare Bronze Table

Declare a bronze table that ingests JSON data incrementally (using Auto Loader) from the simulated cloud source. The source location is already supplied to the pipeline as the **`source`** argument; the cell below shows how to reference it as **`${source}`**.

As we did previously, include two additional columns:
* **`receipt_time`** that records a timestamp as returned by **`current_timestamp()`** 
* **`source_file`** that is obtained by **`input_file_name()`**

In [0]:
%sql
-- ANSWER
CREATE OR REFRESH STREAMING LIVE TABLE recordings_bronze
AS SELECT current_timestamp() receipt_time, input_file_name() source_file, *
  FROM cloud_files("${source}", "json", map("cloudFiles.schemaHints", "time DOUBLE"))

message
"This Delta Live Tables query is syntactically valid, but you must create a pipeline in order to define and populate your table."


### PII File

Using a similar CTAS syntax, create a streaming live table named **`pii`** that ingests the CSV data found at */mnt/training/healthcare/patient*.

To properly configure Auto Loader for this source, you will need to specify the following additional parameters:

| option | value |
| --- | --- |
| **`header`** | **`true`** |
| **`cloudFiles.inferColumnTypes`** | **`true`** |

<img src="https://files.training.databricks.com/images/icon_note_24.png"/> Auto Loader configurations for CSV can be found <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-csv.html" target="_blank">here</a>.

In [0]:
%sql
-- ANSWER
CREATE OR REFRESH STREAMING LIVE TABLE pii
AS SELECT *
  FROM cloud_files("/mnt/training/healthcare/patient", "csv", map("header", "true", "cloudFiles.inferColumnTypes", "true"))

message
"This Delta Live Tables query is syntactically valid, but you must create a pipeline in order to define and populate your table."


## Declare Silver Tables

Our silver table, **`recordings_enriched`**, will consist of the following fields:

| Field | Type |
| --- | --- |
| **`device_id`** | **`INTEGER`** |
| **`mrn`** | **`LONG`** |
| **`heartrate`** | **`DOUBLE`** |
| **`time`** | **`TIMESTAMP`** (example provided below) |
| **`name`** | **`STRING`** |

This query should also enrich the data through an inner join with the **`pii`** table on the common **`mrn`** field to obtain the name.

Implement quality control by applying a constraint to drop records with an invalid **`heartrate`** (that is, not greater than zero).
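
For comparison, DLT SQL expectations support three behaviors when a record violates a constraint: keep the record and report the violation (the default when no **`ON VIOLATION`** clause is given), drop the record, or fail the update. The sketch below applies the same heartrate check with the two behaviors this lab does not use. The table names **`recordings_checked_warn`** and **`recordings_checked_strict`** are hypothetical and are not part of the lab solution.

```sql
-- Hypothetical table: violating rows are kept; violations are only reported in pipeline metrics
CREATE OR REFRESH STREAMING LIVE TABLE recordings_checked_warn
  (CONSTRAINT positive_heartrate EXPECT (heartrate > 0))
AS SELECT * FROM STREAM(live.recordings_bronze);

-- Hypothetical table: a violating row causes the pipeline update to fail
CREATE OR REFRESH STREAMING LIVE TABLE recordings_checked_strict
  (CONSTRAINT positive_heartrate EXPECT (heartrate > 0) ON VIOLATION FAIL UPDATE)
AS SELECT * FROM STREAM(live.recordings_bronze);
```

This lab asks for invalid records to be dropped, so the solution uses **`ON VIOLATION DROP ROW`**.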

In [0]:
%sql
-- ANSWER

CREATE OR REFRESH STREAMING LIVE TABLE recordings_enriched
  (CONSTRAINT positive_heartrate EXPECT (heartrate > 0) ON VIOLATION DROP ROW)
AS SELECT 
  CAST(a.device_id AS INTEGER) device_id, 
  CAST(a.mrn AS LONG) mrn, 
  CAST(a.heartrate AS DOUBLE) heartrate, 
  CAST(from_unixtime(a.time, 'yyyy-MM-dd HH:mm:ss') AS TIMESTAMP) time,
  b.name
  FROM STREAM(live.recordings_bronze) a
  INNER JOIN STREAM(live.pii) b
  ON a.mrn = b.mrn

message
"This Delta Live Tables query is syntactically valid, but you must create a pipeline in order to define and populate your table."
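
To see what the **`time`** conversion in the query above does, you can try the same expression on a literal value in an interactive SQL cell outside the pipeline. This is a minimal sketch: the value **`1577880000`** is a hypothetical reading, assumed to be in epoch seconds, and the rendered result depends on the session time zone.

```sql
-- Hypothetical sample value, assumed to be seconds since the Unix epoch
SELECT
  -- from_unixtime returns a formatted STRING
  from_unixtime(1577880000, 'yyyy-MM-dd HH:mm:ss') AS time_string,
  -- casting that string produces the TIMESTAMP type required by the silver table
  CAST(from_unixtime(1577880000, 'yyyy-MM-dd HH:mm:ss') AS TIMESTAMP) AS parsed_time;
```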


## Gold Table

Create a gold table, **`daily_patient_avg`**, that aggregates **`recordings_enriched`** by **`mrn`**, **`name`**, and **`date`** and delivers the following columns:

| Column name | Value |
| --- | --- |
| **`mrn`** | **`mrn`** from source |
| **`name`** | **`name`** from source |
| **`avg_heartrate`** | Average **`heartrate`** from the grouping |
| **`date`** | Date extracted from **`time`** |

In [0]:
%sql
-- ANSWER

CREATE OR REFRESH STREAMING LIVE TABLE daily_patient_avg
  COMMENT "Daily mean heartrates by patient"
  AS SELECT mrn, name, MEAN(heartrate) avg_heartrate, DATE(time) `date`
    FROM STREAM(live.recordings_enriched)
    GROUP BY mrn, name, DATE(time)

message
"This Delta Live Tables query is syntactically valid, but you must create a pipeline in order to define and populate your table."
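
The grouping logic can be checked interactively, outside the pipeline, by running the same aggregation over a few inline rows. This is only a sketch: the rows and the patient name below are made up for illustration and do not come from the lab dataset.

```sql
-- Same aggregation as daily_patient_avg, applied to hypothetical inline rows
SELECT mrn, name, MEAN(heartrate) AS avg_heartrate, DATE(time) AS `date`
FROM VALUES
  (1, 'Patient A', 60.0D, TIMESTAMP '2020-01-01 08:00:00'),
  (1, 'Patient A', 80.0D, TIMESTAMP '2020-01-01 20:00:00'),
  (1, 'Patient A', 75.0D, TIMESTAMP '2020-01-02 08:00:00')
  AS t(mrn, name, heartrate, time)
GROUP BY mrn, name, DATE(time);
```

The two readings on 2020-01-01 collapse into a single row with an average of 70, and the reading on 2020-01-02 forms its own row, so the result has one row per patient per day.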


&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>