# Testing our DLT pipeline

Tests can be added directly as expectations within DLT.

This is typically done by adding a companion notebook to a test version of the DLT pipeline.

The test DLT pipeline consumes a small test dataset that we'll use to perform checks on the output: given a specific input, we test the transformation logic by ensuring the output is correct. The test dataset also contains incorrect data so that the failure cases are covered.

By leveraging expectations, all we need to do is run the test DLT pipeline. If the pipeline fails, our tests are failing and something is incorrect.
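
To make the link between expectations and a failing pipeline concrete, here is a minimal sketch of the three behaviours an expectation can have on violation. The table name `TEST_expectation_modes` and the constraint names are hypothetical and only serve as illustration. Only `ON VIOLATION FAIL UPDATE` stops the update, which is why the tests in this notebook rely on it.

```sql
-- Illustrative only: TEST_expectation_modes and the constraint names are hypothetical.
CREATE TEMPORARY LIVE TABLE TEST_expectation_modes (
  -- No ON VIOLATION clause: violating rows are kept and only reported in the data quality metrics.
  CONSTRAINT warn_on_null_id  EXPECT (id IS NOT NULL),
  -- DROP ROW: violating rows are removed from the table, the update still succeeds.
  CONSTRAINT drop_null_email  EXPECT (email IS NOT NULL) ON VIOLATION DROP ROW,
  -- FAIL UPDATE: a single violating row makes the pipeline update fail.
  CONSTRAINT fail_on_null_id  EXPECT (id IS NOT NULL)    ON VIOLATION FAIL UPDATE
)
COMMENT "Sketch: the three behaviours an expectation can have"
AS SELECT id, email FROM live.user_silver_dlt
```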

<img style="float: right" width="1000px" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/dlt-advanecd/DLT-advanced-unit-test-3.png"/>

## Testing incorrect schema ingestion

The first thing we'd like to test is that our pipeline is robust and will discard incorrect rows.

For example, this line from our test dataset should be discarded and flagged as incorrect:
```
{"id":"invalid ID", "email":"margaret84@example.com", ....}
```

In [0]:
CREATE TEMPORARY LIVE TABLE TEST_user_bronze_dlt (
  CONSTRAINT incorrect_data_removed EXPECT (not_empty_rescued_data = 0) ON VIOLATION FAIL UPDATE
)
COMMENT "TEST: bronze table properly drops rows with incorrect schema"
AS SELECT count(*) AS not_empty_rescued_data
FROM live.user_bronze_dlt
WHERE _rescued_data IS NOT NULL OR email = 'margaret84@example.com'
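
The bronze table being tested is defined in the main pipeline and is not shown in this notebook. As a rough sketch of the kind of definition this test targets, assuming the bronze table drops every row whose `_rescued_data` column is populated, it could look like the following. The source name `live.raw_user_data` and the constraint name `correct_schema` are assumptions made for illustration, not the actual pipeline code.

```sql
-- Sketch only: belongs to the pipeline under test, not to this test notebook.
-- live.raw_user_data and correct_schema are assumed names.
CREATE STREAMING LIVE TABLE user_bronze_dlt (
  -- Rows that did not match the expected schema carry a non-null _rescued_data and are dropped.
  CONSTRAINT correct_schema EXPECT (_rescued_data IS NULL) ON VIOLATION DROP ROW
)
COMMENT "Sketch: bronze table dropping rows with an incorrect schema"
AS SELECT * FROM STREAM(live.raw_user_data)
```

With a definition along these lines, the test counts the rows that should have been dropped and fails the update if any of them survived.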

## Let's continue our tests on the silver table with multiple checks at once

We'll next ensure that our silver table transformation does the following:

* null ids are removed (our test dataset contains a null id)
* we should have 4 rows as output (based on the input)
* the emails are properly anonymized

In [0]:
CREATE TEMPORARY LIVE TABLE TEST_user_silver_dlt_anonymize (
  CONSTRAINT keep_all_rows              EXPECT (num_rows = 4)      ON VIOLATION FAIL UPDATE,
  CONSTRAINT email_should_be_anonymized EXPECT (clear_email = 0)   ON VIOLATION FAIL UPDATE,
  CONSTRAINT null_ids_removed           EXPECT (null_id_count = 0) ON VIOLATION FAIL UPDATE
)
COMMENT "TEST: check silver table removes null ids and anonymizes emails"
AS (
  WITH
   rows_test  AS (SELECT count(*) AS num_rows       FROM live.user_silver_dlt),
   email_test AS (SELECT count(*) AS clear_email    FROM live.user_silver_dlt  WHERE email LIKE '%@%'),
   id_test    AS (SELECT count(*) AS null_id_count  FROM live.user_silver_dlt  WHERE id IS NULL)
  SELECT * FROM email_test, id_test, rows_test)
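
The three counts can also be computed in a single pass over the silver table. The following is a sketch of an equivalent formulation using the `count_if` aggregate. The table name `TEST_user_silver_dlt_single_scan` is hypothetical, and the expected row count of 4 is taken from the test dataset described above.

```sql
-- Sketch: same three checks, computed with one scan of the silver table.
-- TEST_user_silver_dlt_single_scan is a hypothetical name.
CREATE TEMPORARY LIVE TABLE TEST_user_silver_dlt_single_scan (
  CONSTRAINT keep_all_rows              EXPECT (num_rows = 4)      ON VIOLATION FAIL UPDATE,
  CONSTRAINT email_should_be_anonymized EXPECT (clear_email = 0)   ON VIOLATION FAIL UPDATE,
  CONSTRAINT null_ids_removed           EXPECT (null_id_count = 0) ON VIOLATION FAIL UPDATE
)
COMMENT "TEST (sketch): silver table checks computed in a single scan"
AS SELECT
  count(*)                   AS num_rows,       -- total number of rows
  count_if(email LIKE '%@%') AS clear_email,    -- emails that still look like clear text
  count_if(id IS NULL)       AS null_id_count   -- rows whose id is null
FROM live.user_silver_dlt
```

Both versions yield a single row with the same three columns, so choosing between them is mostly a matter of readability.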

## Testing primary key uniqueness

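Finally, we'll enforce uniqueness on the gold table to avoid any duplicates. The test query returns one row per id together with the number of rows carrying that id, and the expectation requires that number to be exactly 1.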

In [0]:
CREATE TEMPORARY LIVE TABLE TEST_user_gold_dlt (
  CONSTRAINT pk_must_be_unique EXPECT (duplicate = 1) ON VIOLATION FAIL UPDATE
)
COMMENT "TEST: check that gold table only contains unique customer ids"
AS SELECT count(*) as duplicate, id FROM live.user_gold_dlt GROUP BY id
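
The test above produces one row per id, so an empty gold table produces no rows at all and the expectation is never evaluated. As a hedged alternative, the sketch below always returns exactly one row holding the number of ids that appear more than once. The table name `TEST_user_gold_dlt_no_duplicates` and the column name `duplicated_ids` are hypothetical.

```sql
-- Sketch: single-row uniqueness check; table and column names are hypothetical.
CREATE TEMPORARY LIVE TABLE TEST_user_gold_dlt_no_duplicates (
  CONSTRAINT pk_must_be_unique EXPECT (duplicated_ids = 0) ON VIOLATION FAIL UPDATE
)
COMMENT "TEST (sketch): no id may appear more than once in the gold table"
AS SELECT count(*) AS duplicated_ids
FROM (
  -- Keep only the ids that occur on more than one row.
  SELECT id FROM live.user_gold_dlt GROUP BY id HAVING count(*) > 1
) AS ids_seen_more_than_once
```

Note that `GROUP BY` puts all null ids into one group, so several rows with a null id would be reported as one duplicated id.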

That's it. All we have to do now is run the full pipeline.

If one of the conditions defined in the TEST tables fails, the test pipeline expectation will fail and we'll know something needs to be fixed!

You can open the <a dbdemos-pipeline-id="dlt-test" href="#joblist/pipelines/fcdb3d8e-104c-422b-a0a1-1c30a57f1650">Delta Live Table Pipeline for unit-test</a> to see the tests in action.