### Learning Objectives
- This lab continues from lab "07 Develop a Simple Pipeline using LSDP".
- We will add data quality expectations to the streaming tables.
- Expectations come in 3 flavours, which differ in what happens to a row that violates the constraint:

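  1. WARN (the default): violating rows are still written, and the violations are reported in the pipeline's data quality metrics
      e.g. - `CONSTRAINT valid_notification EXPECT (notifications IN ('Y', 'N'))`
  2. DROP: violating rows are dropped before the write
      e.g. - `CONSTRAINT valid_date EXPECT (order_timestamp > "2021-01-01") ON VIOLATION DROP ROW`
  3. FAIL: the update fails when a row violates the constraint
      e.g. - `CONSTRAINT valid_id EXPECT (customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE`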


### Set Up
- Run `%run ../01_Data_Engineer_Learning_Plan/Lab-Setup/lab-setup-06` if you have not already run it in lab 07
- This should create 4 datasets
  1. `orders/00.json` -> 174 rows
  2. `status/00.json` -> 5000 rows
  3. `customers/00.json` -> 1000 rows
  4. `customers_new01.json` -> 23 rows
- In this lab, we focus only on the `orders` dataset. The other datasets will be used in the next few labs.


In [0]:
%run ../01_Data_Engineer_Learning_Plan/Lab-Setup/lab-setup-06

#### Steps: 
1. Select the folder where you want to store your pipeline
2. Select `Create ETL Pipeline`
  - This opens a UI for defining the source folder that the pipeline will run from.
  - We can also create notebooks here to hold the pipeline code. In this step, create `orders_pipeline.sql` (code below).
    - This code is similar to lab 7, with the addition of constraints
  - In the UI, click Settings to change common settings
    - Under the configuration section, add key `source` with value `/Volumes/workspace/data_engineering_labs_00/v01` to parameterize the volume location.
3. When ready, click Dry run: this checks for errors without creating the actual tables
4. We can then run the pipeline with a full table refresh (this can be dangerous, because it clears the existing table data and reprocesses everything from the source)


### Example
 - Sample `orders_pipeline.sql` code with constraints
 - Do not run it in this notebook; it belongs in the pipeline's `orders_pipeline.sql`.

### Explanation

- The `valid_notification` expectation checks that `notifications` is 'Yes' or 'No' and only warns on other values. As our data is all 'Y' or 'N', every row violates the expectation; the violations are reported, but the rows are still written.
- The `valid_date` expectation checks `order_timestamp > "2022-01-01"` and drops rows that fail it. This ends up dropping 53 records.
- The `valid_id` expectation fails the update if `customer_id` is null. All rows pass this check.
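
To sanity-check these numbers before running the pipeline, the three conditions can be counted directly against the raw files from a regular SQL notebook cell. This is a minimal sketch and not part of the lab. It assumes the `source` value from the pipeline settings (`/Volumes/workspace/data_engineering_labs_00/v01`). The column aliases are hypothetical names chosen for illustration. It only counts rows where a condition evaluates to false, so rows where a condition evaluates to NULL are not counted.

```sql
-- Batch read of the same files the bronze table ingests (no STREAM keyword here).
-- The path is the assumed value of the `source` pipeline configuration key.
SELECT
  count(*) AS total_rows,
  -- Rows that would only raise a warning (valid_notification)
  count_if(NOT (notifications IN ('Yes', 'No'))) AS warn_violations,
  -- Rows that would be dropped (valid_date), using the same cast as the silver table
  count_if(NOT (timestamp(order_timestamp) > "2022-01-01")) AS drop_violations,
  -- Rows that would fail the update (valid_id)
  count_if(customer_id IS NULL) AS fail_violations
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/orders',
  format => 'JSON'
);
```

If the explanation above holds, `warn_violations` should equal `total_rows`, `drop_violations` should match the number of dropped records, and `fail_violations` should be 0.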

In [0]:
-- 1. Create a bronze streaming table from our volume. 
CREATE OR REFRESH STREAMING TABLE workspace.data_engineering_labs_00.bronze_demo_expectations
AS
SELECT 
*, 
current_timestamp() AS processing_time,
_metadata.file_name AS source_file
FROM 
STREAM read_files(
  "${source}/orders", -- source config variable set in pipeline settings
  format => 'JSON'
);


-- 2. Create a silver streaming table from our bronze table, with a transform to convert the timestamp
CREATE OR REFRESH STREAMING TABLE workspace.data_engineering_labs_00.silver_demo_expectations

-- Add the expectations
(
CONSTRAINT valid_notification EXPECT (notifications IN ('Yes', 'No')), -- Check for a Yes or No in notifications column
CONSTRAINT valid_date EXPECT (order_timestamp > "2022-01-01") ON VIOLATION DROP ROW, --drop row if not valid date
CONSTRAINT valid_id EXPECT (customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE -- Fail pipeline if null
)


AS
SELECT 
order_id,
timestamp(order_timestamp) AS order_timestamp,
customer_id,
notifications
FROM 
STREAM bronze_demo_expectations;

-- 3. Create a materialised view from our silver table
CREATE OR REFRESH MATERIALIZED VIEW workspace.data_engineering_labs_00.gold_orders_by_date_demo_expectations
AS
SELECT 
date(order_timestamp) AS order_date,
count(*) AS total_daily_orders
FROM 
silver_demo_expectations
GROUP BY date(order_timestamp);