# Manage Data Quality with Delta Live Tables

**Objective:**
In this notebook, we will learn how to implement data quality constraints in **Delta Live Tables (DLT)** using **Expectations**. We will explore three strategies for handling records that violate a constraint: warning, dropping the record, and failing the pipeline.

**Agenda:**
1.  Understanding DLT **Expectations** and their types (Warn, Drop, Fail).
2.  Defining Data Quality Rules using Python Dictionaries.
3.  Implementing Expectations in DLT Pipelines.
4.  Applying Rules on Views and Joins.
5.  Monitoring Data Quality using DLT Event Logs.

## 1. What are DLT Expectations?

Expectations are optional clauses you add to Delta Live Tables datasets. They allow you to define data quality constraints on the contents of a dataset. DLT runs these checks on each record passing through the query.

### Types of Expectations (Actions)

There are three main actions DLT can take when a record violates a rule:

| Action | Decorator | Behavior | Use Case |
| :--- | :--- | :--- | :--- |
| **Warn (Default)** | `@dlt.expect` / `@dlt.expect_all` | Records are **written** to the target table, but the violation is reported as a metric. The pipeline succeeds. | When you want to monitor data quality but not lose data or stop processing. |
| **Drop** | `@dlt.expect_or_drop` / `@dlt.expect_all_or_drop` | Invalid records are **dropped** before writing to the target. The pipeline succeeds. | When valid data is required for downstream analysis, but you can afford to discard bad rows. |
| **Fail** | `@dlt.expect_or_fail` / `@dlt.expect_all_or_fail` | The pipeline update **fails** immediately upon detecting an invalid record. | When data integrity is critical, and processing must stop if bad data arrives. |
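
To make the three behaviors concrete, here is a minimal plain-Python sketch that mimics them on a small in-memory list. It is an illustration only, not how DLT is implemented: `apply_expectations`, `ExpectationFailed`, and the sample rows are hypothetical names, and the rules are Python callables rather than SQL strings.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Rule = Callable[[Row], bool]


class ExpectationFailed(Exception):
    """Raised by the simulated 'fail' action (hypothetical name)."""


def apply_expectations(
    rows: List[Row], rules: Dict[str, Rule], action: str = "warn"
) -> Tuple[List[Row], Dict[str, int]]:
    """Simulate warn/drop/fail; returns (rows written, failures per rule)."""
    if action not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {action!r}")

    failures = {name: 0 for name in rules}
    written: List[Row] = []
    for row in rows:
        violated = [name for name, check in rules.items() if not check(row)]
        for name in violated:
            failures[name] += 1
        if violated and action == "fail":
            # Fail: stop processing at the first invalid row.
            raise ExpectationFailed(f"{row} violated {violated}")
        if violated and action == "drop":
            # Drop: count the violation but do not write the row.
            continue
        # Warn (or a valid row): write the row.
        written.append(row)
    return written, failures


# Made-up sample data: one valid row and two invalid ones.
sample_rows = [
    {"o_orderstatus": "O", "o_totalprice": 120.0},
    {"o_orderstatus": "X", "o_totalprice": 35.5},   # invalid status
    {"o_orderstatus": "F", "o_totalprice": -10.0},  # invalid price
]
sample_rules = {
    "Valid Order Status": lambda r: r["o_orderstatus"] in ("O", "F", "P"),
    "Valid Order Price": lambda r: r["o_totalprice"] > 0,
}

warned, warn_failures = apply_expectations(sample_rows, sample_rules, "warn")
dropped, drop_failures = apply_expectations(sample_rows, sample_rules, "drop")
```

With `"warn"` all three rows are kept, with `"drop"` only the valid row is kept, and both calls report one failure per rule. Calling the helper with `"fail"` on the same data raises `ExpectationFailed` at the second row.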

## 2. Defining Data Quality Rules

Instead of hardcoding rules directly into the decorator, it is best practice to define them in a Python dictionary. This allows for reusability and cleaner code.

Let's define rules for our **Orders** and **Customer** datasets.

In [None]:
# Import DLT library
import dlt

# -------------------------------------------------------------------------
# Define Rules for ORDERS Data
# -------------------------------------------------------------------------
# Rule 1: Order Status must be one of 'O' (Open), 'F' (Finished), 'P' (Pending)
# Rule 2: Order Price (Total Price) must be greater than 0
# -------------------------------------------------------------------------

_order_rules = {
    "Valid Order Status": "o_orderstatus in ('O', 'F', 'P')",
    "Valid Order Price": "o_totalprice > 0"
}

# -------------------------------------------------------------------------
# Define Rules for CUSTOMER Data
# -------------------------------------------------------------------------
# Rule 1: Market Segment must not be null
# -------------------------------------------------------------------------

_customer_rules = {
    "Valid Market Segment": "c_mktsegment is not null"
}
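
Because the rules are ordinary strings in a dictionary, they can be combined with plain Python. As a sketch of that reusability, the helper below builds a single SQL boolean expression that is true when at least one rule is violated, which could be used, for example, to filter invalid rows into a separate dataset. `violation_condition` is a hypothetical helper, and `_order_rules` is repeated only so the snippet runs on its own.

```python
from typing import Dict

# Same rules as defined above, repeated so this snippet is self-contained.
_order_rules = {
    "Valid Order Status": "o_orderstatus in ('O', 'F', 'P')",
    "Valid Order Price": "o_totalprice > 0",
}


def violation_condition(rules: Dict[str, str]) -> str:
    """Return a SQL expression that is true when any rule is violated."""
    if not rules:
        raise ValueError("at least one rule is required")
    # Parenthesize each constraint so operator precedence cannot leak
    # between rules when they are joined with AND.
    all_pass = " AND ".join(f"({constraint})" for constraint in rules.values())
    return f"NOT ({all_pass})"


order_violation_expr = violation_condition(_order_rules)
print(order_violation_expr)
# NOT ((o_orderstatus in ('O', 'F', 'P')) AND (o_totalprice > 0))
```

Keep in mind that in a SQL filter a row for which this expression evaluates to NULL is not selected, so nullable columns may need explicit `IS NULL` handling.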

## 3. Implementing Expectations

Below are examples of how to apply these rules using the three different actions.

### Scenario A: Warning (Track Metrics Only)
This is the default behavior. If a rule is violated, the data is still ingested, but the failure count increases in the DLT UI.

In [None]:
# -------------------------------------------------------------------------
# Scenario A: Warning
# Using @dlt.expect_all to check multiple rules
# -------------------------------------------------------------------------

@dlt.table(
    comment="Order bronze table with Warning expectations"
)
@dlt.expect_all(_order_rules) # <--- Action: Warn (Default)
def orders_bronze_warn():
    return (
        spark.readStream.table("dev.bronze.orders_raw")
    )

# If 'o_totalprice' is negative, the row IS INSERTED and the violation is counted in the expectation's failure metric.

### Scenario B: Fail the Pipeline
Use this when strict data quality is required.

In [None]:
# -------------------------------------------------------------------------
# Scenario B: Fail
# Using @dlt.expect_all_or_fail
# -------------------------------------------------------------------------

@dlt.table(
    comment="Order bronze table with Fail expectations"
)
@dlt.expect_all_or_fail(_order_rules) # <--- Action: Fail Pipeline
def orders_bronze_fail():
    return (
        spark.readStream.table("dev.bronze.orders_raw")
    )

# If 'o_totalprice' is negative, the pipeline STOPS with an error.

### Scenario C: Drop Invalid Records
Use this to clean data on the fly.

In [None]:
# -------------------------------------------------------------------------
# Scenario C: Drop
# Using @dlt.expect_all_or_drop
# -------------------------------------------------------------------------

@dlt.table(
    comment="Order bronze table with Drop expectations"
)
@dlt.expect_all_or_drop(_order_rules) # <--- Action: Drop Record
def orders_bronze_drop():
    return (
        spark.readStream.table("dev.bronze.orders_raw")
    )

# If 'o_totalprice' is negative, the row is SKIPPED (not inserted), and pipeline continues.
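
Each scenario above applies one action to the whole rule dictionary. If different rules warrant different actions, one option is to tag every rule with its action and build one dictionary per action. The sketch below is plain Python; the `tagged_rules` structure, the `rules_for` helper, and the action chosen for each rule are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical rule catalog: each rule carries the action it should trigger.
tagged_rules: List[Dict[str, str]] = [
    {
        "name": "Valid Order Status",
        "constraint": "o_orderstatus in ('O', 'F', 'P')",
        "action": "drop",
    },
    {
        "name": "Valid Order Price",
        "constraint": "o_totalprice > 0",
        "action": "warn",
    },
]


def rules_for(action: str, rules: List[Dict[str, str]]) -> Dict[str, str]:
    """Return a {name: constraint} dictionary for rules tagged with `action`."""
    return {r["name"]: r["constraint"] for r in rules if r["action"] == action}


warn_rules = rules_for("warn", tagged_rules)
drop_rules = rules_for("drop", tagged_rules)
fail_rules = rules_for("fail", tagged_rules)  # empty: no rule is tagged "fail"
```

Each resulting dictionary has the same `{name: constraint}` shape as `_order_rules`, so it could be passed to the decorator that matches its action.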

## 4. Applying Rules on Views and Joins

You can also apply expectations on Views or downstream tables (Silver/Gold) where you join multiple datasets.

In [None]:
# Combining rules from both dictionaries
_all_rules = {**_order_rules, **_customer_rules}

@dlt.view(
    comment="Joined view of orders and customers"
)
@dlt.expect_all(_all_rules) # Applying checks on the joined result
def joined_vw():
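    # Read the upstream datasets with dlt.read (a batch read, not a streaming read).
    # "orders_bronze" and "customer_bronze_vw" are assumed to be defined elsewhere in the pipeline.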
    orders_df = dlt.read("orders_bronze")
    cust_df = dlt.read("customer_bronze_vw")
    
    # Perform Join
    return orders_df.join(cust_df, orders_df.o_custkey == cust_df.c_custkey, "left")

## 5. Monitoring Data Quality with Event Logs

DLT records detailed events for every pipeline update in the pipeline's event log, which you can query with the `event_log()` table-valued function. You can use these events to build custom Data Quality Dashboards.

The event log contains a JSON column `details` which holds the DQ metrics under `flow_progress.data_quality.expectations`.

### SQL Query to Extract DQ Metrics
You can run the following SQL query in a Notebook or SQL Editor to parse the JSON and see specific failure counts.

In [None]:
-- Replace <pipeline-id> with your actual DLT Pipeline ID found in the UI

WITH event_log_raw AS (
  SELECT * FROM event_log('<pipeline-id>') 
),
latest_update AS (
  SELECT origin.update_id 
  FROM event_log_raw 
  WHERE event_type = 'create_update' 
  ORDER BY timestamp DESC LIMIT 1
)
SELECT
  row_expectations.dataset as dataset,
  row_expectations.name as expectation,
  SUM(row_expectations.passed_records) as passing_records,
  SUM(row_expectations.failed_records) as failing_records
FROM
  event_log_raw
  -- LATERAL VIEW follows the source relation directly, without a comma
  LATERAL VIEW explode(
    from_json(
      details:flow_progress:data_quality:expectations,
      "array<struct<name:string, dataset:string, passed_records:int, failed_records:int>>"
    )
  ) exploded AS row_expectations
WHERE
  event_type = 'flow_progress'
  AND origin.update_id = (SELECT update_id FROM latest_update)
GROUP BY
  row_expectations.dataset,
  row_expectations.name;
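
The same aggregation can be sketched in plain Python to show what the query does with each `details` value. The payloads below are made-up samples shaped like the schema passed to `from_json` above (`name`, `dataset`, `passed_records`, `failed_records`). The dataset names, the counts, and the `summarize_expectations` helper are assumptions for illustration, and real event-log rows carry many more fields.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical `details` values from three flow_progress events.
sample_details = [
    json.dumps({"flow_progress": {"data_quality": {"expectations": [
        {"name": "Valid Order Price", "dataset": "orders_bronze_warn",
         "passed_records": 95, "failed_records": 5},
        {"name": "Valid Order Status", "dataset": "orders_bronze_warn",
         "passed_records": 100, "failed_records": 0},
    ]}}}),
    json.dumps({"flow_progress": {"data_quality": {"expectations": [
        {"name": "Valid Order Price", "dataset": "orders_bronze_warn",
         "passed_records": 48, "failed_records": 2},
    ]}}}),
    # An event without data quality metrics is simply skipped.
    json.dumps({"flow_progress": {"status": "RUNNING"}}),
]


def summarize_expectations(
    details_rows: Iterable[str],
) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Sum passed/failed counts per (dataset, expectation name)."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for raw in details_rows:
        progress = json.loads(raw).get("flow_progress") or {}
        quality = progress.get("data_quality") or {}
        for exp in quality.get("expectations") or []:
            key = (exp["dataset"], exp["name"])
            totals[key]["passed"] += exp["passed_records"]
            totals[key]["failed"] += exp["failed_records"]
    return dict(totals)


summary = summarize_expectations(sample_details)
for (dataset, name), counts in sorted(summary.items()):
    total = counts["passed"] + counts["failed"]
    rate = counts["failed"] / total if total else 0.0
    print(f"{dataset} | {name}: {counts['failed']} of {total} failed ({rate:.1%})")
```

The grouping key mirrors the `GROUP BY` in the SQL query, and the failure rate is one example of a derived metric a dashboard might show.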

### Summary

1.  **Expectations** allow you to enforce data quality in DLT.
2.  **Warn** is great for observability without stopping flows.
3.  **Drop** keeps invalid rows out of your target tables by discarding them.
4.  **Fail** prevents bad data ingestion entirely but stops the pipeline.
5.  Use the **DLT UI** or **Event Log Queries** to monitor the health and quality of your data over time.