# Data Transformation: Lakeflow Spark Declarative Pipelines

## The Situation

Your leadership team just dropped a bombshell: they want dashboards, interactive "talk to my data" capabilities, predictive maintenance models, and AI agent systems - **all by the end of the week**. Your IoT sensors on planes are generating massive amounts of data, and you need to get it production-ready, fast.

Good news: Databricks has **Lakeflow Spark Declarative Pipelines** (formerly Delta Live Tables) that can help you build reliable, production-ready data pipelines in minutes, not days.

---

## What You'll Learn

✅ What declarative pipelines are and why they matter  
✅ Create a streaming table to ingest and clean sensor data  
✅ Create a materialized view for aggregated metrics  
✅ Deploy your pipeline to production  

**Time to Complete:** 20-30 minutes

---

## What are Declarative Pipelines?

Instead of writing complex code to manage checkpoints, handle incremental processing, and track data quality, you simply declare **what** you want. The framework automatically handles:

- ✅ **Incremental processing** - Only process new/changed data
- ✅ **Dependency management** - Determine execution order automatically
- ✅ **Data quality** - Built-in validation with expectations
- ✅ **Monitoring** - Automatic lineage and observability
- ✅ **Recovery** - Checkpoint management and error handling
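
To make the dependency-management point concrete, here is a minimal sketch in plain Python of how an execution order can be derived from declared dependencies. It is an illustration of the idea, not the Lakeflow implementation: the `depends_on` mapping is hypothetical and hand-written, whereas the framework infers it from your queries. The dataset names match the tables built later in this guide.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: dataset -> the datasets it reads from.
# A declarative framework derives this from your queries; here it is hand-written.
depends_on = {
    "factory_kpis_gold": {"sensor_silver"},  # reads the silver table
    "sensor_silver": set(),                  # reads raw files only
}

# static_order() yields each dataset only after all of its dependencies.
execution_order = list(TopologicalSorter(depends_on).static_order())
print(execution_order)
```

The order in which you declare the datasets does not matter; only the dependencies do.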

### Two Key Building Blocks

**1. Streaming Tables**
- For incremental, append-only data processing
- Perfect for ingesting raw sensor data and cleaning it
- Each row processed exactly once

**2. Materialized Views**
- For aggregations that need to update (not just append)
- Perfect for business metrics and KPIs
- Always return correct, up-to-date results
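
The difference between the two building blocks can be sketched with a toy model in plain Python. This is only an analogy under simplifying assumptions: `processed_files` stands in for the streaming checkpoint, and the "view" is recomputed from scratch on every refresh, while the real engine may refresh incrementally. All names here are made up for illustration.

```python
# Toy model of the two dataset types (plain Python, not the Lakeflow runtime).
processed_files = set()   # stands in for the streaming checkpoint
streaming_table = []      # append-only rows


def ingest(files):
    """Append rows from files not seen before, so each file is read once."""
    for name, rows in files.items():
        if name in processed_files:
            continue  # already ingested on an earlier update
        streaming_table.extend(rows)
        processed_files.add(name)


def refresh_avg_temperature():
    """Recompute the aggregate over the full table, like a view refresh."""
    temps = [row["temperature"] for row in streaming_table]
    return sum(temps) / len(temps)


# First update: one file arrives.
ingest({"a.csv": [{"temperature": 20.0}, {"temperature": 30.0}]})
first = refresh_avg_temperature()

# Second update: a.csv is skipped, only b.csv is new.
ingest({
    "a.csv": [{"temperature": 20.0}, {"temperature": 30.0}],
    "b.csv": [{"temperature": 40.0}],
})
second = refresh_avg_temperature()
print(len(streaming_table), first, second)
```

The streaming table only ever grows by the new rows, while the aggregate changes its existing value, which is why the aggregate belongs in a materialized view.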

---

**Reference:** [Lakeflow Pipelines Documentation](https://docs.databricks.com/aws/en/ldp/)

## Step 1: Create Your Streaming Table

Let's build a streaming table that:
1. Ingests raw sensor data from the volume
2. Cleans the data (fixes negative air pressure values)
3. Validates data quality with expectations

This will be your **silver layer** - cleaned, validated sensor data ready for analysis.

### Python Version

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col, when, abs as abs_func, current_timestamp

@dp.table(
    name="sensor_silver",
    comment="Cleaned and validated aircraft sensor readings"
)
# Data quality rules - track violations but keep the rows
@dp.expect_all({
    "valid_device_id": "device_id IS NOT NULL",
    "valid_timestamp": "timestamp IS NOT NULL",
    "valid_temperature_range": "temperature BETWEEN -50 AND 150"
})
# Drop rows that fail critical validation
@dp.expect_or_drop("positive_air_pressure", "air_pressure > 0")
def sensor_silver():
    """
    Ingest and clean sensor data with Auto Loader.
    Auto Loader automatically handles:
    - Schema inference
    - New file discovery  
    - Exactly-once processing
    """
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("header", "true")
        .load("/Volumes/default/db_crash_course/sensor_data/")
        # Fix data quality issue: negative air pressure
        .withColumn("air_pressure",
                   when(col("air_pressure") < 0, abs_func(col("air_pressure")))
                   .otherwise(col("air_pressure")))
        # Add processing timestamp
        .withColumn("processed_at", current_timestamp())
    )
```

**What's happening here:**
- `@dp.table` decorator defines a streaming table
- `@dp.expect_all` tracks data quality violations (logs them but doesn't drop rows)
- `@dp.expect_or_drop` drops rows that fail critical validation
- Auto Loader (`cloudFiles`) automatically discovers new CSV files
- We fix negative air pressure values inline
- Framework handles all the checkpointing and incremental processing
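
The difference between the two expectation styles is easy to see on a few rows. Below is a minimal plain-Python sketch of the semantics, not the framework's implementation: `apply_expectations` and `clean` are hypothetical helpers, and the sample rows are made up. It assumes expectations are checked against the rows the query returns, which is why the negative pressure reading is fixed before the check and survives.

```python
def apply_expectations(rows, warn_rules, drop_rules):
    """Return (kept_rows, violation_counts).

    warn_rules: failures are counted but rows are kept (like `expect`).
    drop_rules: failures are counted and rows are dropped (like `expect_or_drop`).
    """
    violations = {name: 0 for name in {**warn_rules, **drop_rules}}
    kept = []
    for row in rows:
        for name, check in warn_rules.items():
            if not check(row):
                violations[name] += 1
        failed_drop = False
        for name, check in drop_rules.items():
            if not check(row):
                violations[name] += 1
                failed_drop = True
        if not failed_drop:
            kept.append(row)
    return kept, violations


def clean(row):
    """Mirror the inline fix: flip negative air pressure to positive."""
    fixed = dict(row)
    if fixed["air_pressure"] < 0:
        fixed["air_pressure"] = abs(fixed["air_pressure"])
    return fixed


# Made-up sample rows for illustration only.
raw = [
    {"device_id": "d1", "temperature": 70, "air_pressure": -101.3},
    {"device_id": None, "temperature": 200, "air_pressure": 99.0},
    {"device_id": "d3", "temperature": 65, "air_pressure": 0.0},
]
warn = {
    "valid_device_id": lambda r: r["device_id"] is not None,
    "valid_temperature_range": lambda r: -50 <= r["temperature"] <= 150,
}
drop = {"positive_air_pressure": lambda r: r["air_pressure"] > 0}

kept, violations = apply_expectations([clean(r) for r in raw], warn, drop)
print(len(kept), violations)
```

The second row breaks two tracked rules but is still kept; the third row is removed because it fails the drop rule.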

### SQL Version (Same Logic)

```sql
CREATE OR REFRESH STREAMING TABLE sensor_silver (
  CONSTRAINT valid_device_id EXPECT (device_id IS NOT NULL),
  CONSTRAINT valid_timestamp EXPECT (timestamp IS NOT NULL),
  CONSTRAINT valid_temperature_range EXPECT (temperature BETWEEN -50 AND 150),
  CONSTRAINT positive_air_pressure EXPECT (air_pressure > 0) ON VIOLATION DROP ROW
)
COMMENT 'Cleaned and validated aircraft sensor readings'
AS SELECT
  device_id,
  trip_id,
  factory_id,
  model_id,
  timestamp,
  airflow_rate,
  rotation_speed,
  CASE 
    WHEN air_pressure < 0 THEN ABS(air_pressure)
    ELSE air_pressure
  END as air_pressure,
  temperature,
  delay,
  density,
  current_timestamp() as processed_at
FROM STREAM read_files(
  '/Volumes/default/db_crash_course/sensor_data/',
  format => 'csv',
  header => 'true'
);
```

✅ **Result:** Clean, validated sensor data streaming into your silver table automatically as new files arrive!

## Step 2: Create Your Materialized View

Now let's create aggregated metrics that your dashboards and Genie spaces will use. We'll build a **gold layer** materialized view with factory-level KPIs.

Why a materialized view? Because we need aggregations that **update** when new data arrives (not just append).

### Python Version

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import avg, max, count, countDistinct, round as spark_round

@dp.materialized_view(
    name="factory_kpis_gold",
    comment="Factory-level KPIs for aircraft maintenance dashboards"
)
def factory_kpis_gold():
    """
    Aggregate factory-level metrics.
    This materialized view automatically recomputes when new data arrives.
    Perfect for dashboards and reporting!
    """
    # Read from silver layer (cleaned data)
    sensors = spark.read.table("sensor_silver")
    
    # Join with factory dimension for context
    factories = spark.read.table("default.db_crash_course.dim_factories")
    
    enriched = sensors.join(factories, "factory_id", "left")
    
    # Calculate factory-level KPIs
    return (
        enriched
        .groupBy("factory_id", "factory_name", "region", "city")
        .agg(
            countDistinct("device_id").alias("total_devices"),
            count("*").alias("total_readings"),
            spark_round(avg("temperature"), 2).alias("avg_temperature"),
            spark_round(max("temperature"), 2).alias("max_temperature"),
            spark_round(avg("air_pressure"), 2).alias("avg_air_pressure"),
            spark_round(avg("rotation_speed"), 2).alias("avg_rotation_speed")
        )
    )
```

**What's happening:**
- `@dp.materialized_view` defines a view whose results the pipeline keeps up to date
- We read from the silver table (already cleaned)
- Join with the factory dimension table for context
- Calculate KPIs: device counts, total readings, avg/max temperature, avg air pressure, avg rotation speed
- The metrics are refreshed on each pipeline update, so they reflect newly arrived sensor data
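
To sanity-check what the aggregation computes without a cluster, the same grouping logic can be sketched in plain Python on a handful of rows. The sample readings below are made up for illustration, and only a subset of the KPIs is shown.

```python
from collections import defaultdict

# Made-up sample rows standing in for sensor_silver.
readings = [
    {"factory_id": "F1", "device_id": "d1", "temperature": 70.0},
    {"factory_id": "F1", "device_id": "d1", "temperature": 74.0},
    {"factory_id": "F1", "device_id": "d2", "temperature": 81.5},
    {"factory_id": "F2", "device_id": "d9", "temperature": 60.25},
]

# Group rows by factory, like GROUP BY factory_id.
groups = defaultdict(list)
for reading in readings:
    groups[reading["factory_id"]].append(reading)

# One KPI record per factory.
kpis = {
    factory_id: {
        "total_devices": len({r["device_id"] for r in rows}),  # COUNT(DISTINCT)
        "total_readings": len(rows),                            # COUNT(*)
        "avg_temperature": round(
            sum(r["temperature"] for r in rows) / len(rows), 2
        ),
        "max_temperature": max(r["temperature"] for r in rows),
    }
    for factory_id, rows in groups.items()
}
print(kpis)
```

Note that `total_devices` counts distinct devices, so the two readings from `d1` count once.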

### SQL Version (Same Logic)

```sql
CREATE OR REFRESH MATERIALIZED VIEW factory_kpis_gold
COMMENT 'Factory-level KPIs for aircraft maintenance dashboards'
AS SELECT
  s.factory_id,
  f.factory_name,
  f.region,
  f.city,
  COUNT(DISTINCT s.device_id) as total_devices,
  COUNT(*) as total_readings,
  ROUND(AVG(s.temperature), 2) as avg_temperature,
  ROUND(MAX(s.temperature), 2) as max_temperature,
  ROUND(AVG(s.air_pressure), 2) as avg_air_pressure,
  ROUND(AVG(s.rotation_speed), 2) as avg_rotation_speed
FROM sensor_silver s
LEFT JOIN default.db_crash_course.dim_factories f ON s.factory_id = f.factory_id
GROUP BY s.factory_id, f.factory_name, f.region, f.city;
```

✅ **Result:** Always-current factory KPIs ready for your dashboards, Genie spaces, and reports!

## Step 3: Deploy Your Pipeline

Now let's get this into production! Here's how to create and deploy your pipeline.

### Create the Pipeline in the UI

1. **Click** `New` → `ETL Pipeline` in the Databricks workspace

2. **Configure:**
   - **Name:** `iot_sensor_pipeline`
   - **Target Catalog:** `default`
   - **Target Schema:** `db_crash_course`
   - **Language:** Choose Python or SQL

3. **Paste your code:**
   - Copy the complete Python or SQL code from the "Complete Pipeline Code" section below
   - Paste into the pipeline editor
   - The editor will show you a visual graph of your pipeline

4. **Configure pipeline settings:**
   - **Mode:** 
     - `Triggered` - Runs on schedule or manual trigger (good for learning)
     - `Continuous` - Always running, processes data immediately (production)
   - **Cluster:** Accept defaults (auto-scaling recommended)

5. **Start the pipeline:**
   - Click `Start`
   - Monitor progress in the pipeline graph
   - View data quality metrics in real-time
   - Check expectation violations in the dashboard

### What You'll See

The pipeline graph shows:
- **Nodes:** Your streaming table and materialized view
- **Edges:** Data flow between them
- **Metrics:** Row counts, data quality, processing time
- **Status:** Running, completed, or errors

### Monitoring

After deployment, you get automatic:
- ✅ **Data quality dashboard** - Expectation pass/fail rates
- ✅ **Event log** - Detailed execution history
- ✅ **Lineage graph** - Visual data flow
- ✅ **Performance metrics** - Processing speed, cluster utilization

---

**Reference:** [Multi-File Editor](https://docs.databricks.com/aws/en/ldp/multi-file-editor)

## Complete Pipeline Code

Here's everything in one file you can copy/paste into the Lakeflow Pipeline Editor:

### Python Complete Pipeline

```python
"""
IoT Aircraft Sensor Pipeline
Complete declarative pipeline for production-ready sensor data processing
"""

from pyspark import pipelines as dp
from pyspark.sql.functions import (
    col, when, abs as abs_func, current_timestamp,
    avg, max, count, countDistinct, round as spark_round
)

# ========================================
# SILVER LAYER - Cleaned Sensor Data
# ========================================

@dp.table(
    name="sensor_silver",
    comment="Cleaned and validated aircraft sensor readings"
)
# Track violations but keep the rows
@dp.expect_all({
    "valid_device_id": "device_id IS NOT NULL",
    "valid_timestamp": "timestamp IS NOT NULL",
    "valid_temperature_range": "temperature BETWEEN -50 AND 150"
})
# Drop rows that fail critical validation
@dp.expect_or_drop("positive_air_pressure", "air_pressure > 0")
def sensor_silver():
    """Ingest and clean sensor data with Auto Loader."""
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("header", "true")
        .load("/Volumes/default/db_crash_course/sensor_data/")
        .withColumn("air_pressure",
                   when(col("air_pressure") < 0, abs_func(col("air_pressure")))
                   .otherwise(col("air_pressure")))
        .withColumn("processed_at", current_timestamp())
    )

# ========================================
# GOLD LAYER - Factory KPIs
# ========================================

@dp.materialized_view(
    name="factory_kpis_gold",
    comment="Factory-level KPIs for aircraft maintenance dashboards"
)
def factory_kpis_gold():
    """Aggregate factory-level metrics."""
    sensors = spark.read.table("sensor_silver")
    factories = spark.read.table("default.db_crash_course.dim_factories")
    
    enriched = sensors.join(factories, "factory_id", "left")
    
    return (
        enriched
        .groupBy("factory_id", "factory_name", "region", "city")
        .agg(
            countDistinct("device_id").alias("total_devices"),
            count("*").alias("total_readings"),
            spark_round(avg("temperature"), 2).alias("avg_temperature"),
            spark_round(max("temperature"), 2).alias("max_temperature"),
            spark_round(avg("air_pressure"), 2).alias("avg_air_pressure"),
            spark_round(avg("rotation_speed"), 2).alias("avg_rotation_speed")
        )
    )

print("✅ Pipeline defined! Deploy in the Lakeflow Pipelines Editor.")
```

### SQL Complete Pipeline

```sql
-- IoT Aircraft Sensor Pipeline (SQL Version)

-- SILVER LAYER: Cleaned sensor data
CREATE OR REFRESH STREAMING TABLE sensor_silver (
  CONSTRAINT valid_device_id EXPECT (device_id IS NOT NULL),
  CONSTRAINT valid_timestamp EXPECT (timestamp IS NOT NULL),
  CONSTRAINT valid_temperature_range EXPECT (temperature BETWEEN -50 AND 150),
  CONSTRAINT positive_air_pressure EXPECT (air_pressure > 0) ON VIOLATION DROP ROW
)
COMMENT 'Cleaned and validated aircraft sensor readings'
AS SELECT
  device_id,
  trip_id,
  factory_id,
  model_id,
  timestamp,
  airflow_rate,
  rotation_speed,
  CASE 
    WHEN air_pressure < 0 THEN ABS(air_pressure)
    ELSE air_pressure
  END as air_pressure,
  temperature,
  delay,
  density,
  current_timestamp() as processed_at
FROM STREAM read_files(
  '/Volumes/default/db_crash_course/sensor_data/',
  format => 'csv',
  header => 'true'
);

-- GOLD LAYER: Factory KPIs
CREATE OR REFRESH MATERIALIZED VIEW factory_kpis_gold
COMMENT 'Factory-level KPIs for aircraft maintenance dashboards'
AS SELECT
  s.factory_id,
  f.factory_name,
  f.region,
  f.city,
  COUNT(DISTINCT s.device_id) as total_devices,
  COUNT(*) as total_readings,
  ROUND(AVG(s.temperature), 2) as avg_temperature,
  ROUND(MAX(s.temperature), 2) as max_temperature,
  ROUND(AVG(s.air_pressure), 2) as avg_air_pressure,
  ROUND(AVG(s.rotation_speed), 2) as avg_rotation_speed
FROM sensor_silver s
LEFT JOIN default.db_crash_course.dim_factories f ON s.factory_id = f.factory_id
GROUP BY s.factory_id, f.factory_name, f.region, f.city;
```

## Summary

🎉 **Congratulations!** You've built a production-ready data pipeline with:

✅ **Streaming table** - Automatically ingests and cleans sensor data  
✅ **Data quality** - Tracks and enforces expectations  
✅ **Materialized view** - Always-current factory KPIs  
✅ **Deployment** - Running in production with monitoring  

### What You Get For Free

The framework automatically provides:
- ✅ Incremental processing (only new data)
- ✅ Checkpoint management (exactly-once guarantees)
- ✅ Schema evolution (handles new columns gracefully)
- ✅ Data quality tracking (expectations dashboard)
- ✅ Lineage visualization (see data flow)
- ✅ Error recovery (automatic retries)
- ✅ Monitoring (performance and health metrics)

### Next Steps for Your Week

Now that you have clean, aggregated data:

1. **Dashboards** - Use `factory_kpis_gold` for visualizations
2. **Genie** - Connect to `sensor_silver` for natural language queries
3. **AutoML** - Use `sensor_silver` for predictive maintenance models
4. **Agents** - Build on clean data for AI systems

You're well on your way to meeting leadership's deadline! 🚀

## Try This Out: Extend Your Pipeline

Want to learn more? Here are some ideas to explore:

### 1. Add a Bronze Layer
Create a raw ingestion table before cleaning:

```python
@dp.table(name="sensor_bronze")
def sensor_bronze():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/default/db_crash_course/sensor_data/")
    )

# Then update sensor_silver to read from sensor_bronze
@dp.table(name="sensor_silver")
def sensor_silver():
    # spark.readStream.table reads the upstream streaming table incrementally
    return spark.readStream.table("sensor_bronze").withColumn(...)
```

### 2. Add Device-Level Metrics
Create another materialized view for per-device KPIs:

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import avg, count

@dp.materialized_view(name="device_health_gold")
def device_health_gold():
    return (
        spark.read.table("sensor_silver")
        .groupBy("device_id", "factory_id")
        .agg(
            avg("temperature").alias("avg_temp"),
            count("*").alias("reading_count")
        )
    )
```

### 3. Add Inspection Data
Create a parallel streaming table for inspection records:

```python
@dp.table(name="inspection_silver")
def inspection_silver():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/default/db_crash_course/inspection_data/")
    )
```

### 4. Explore Advanced Features

- **Flows** - Multiple sources writing to one table
- **CDC** - Handle change data capture
- **Watermarks** - Handle late-arriving data
- **Multi-file organization** - Separate bronze/silver/gold into different files

**Documentation:**
- [Flows](https://docs.databricks.com/aws/en/ldp/flows)
- [Streaming Tables](https://docs.databricks.com/aws/en/ldp/streaming-tables)
- [Materialized Views](https://docs.databricks.com/aws/en/ldp/materialized-views)
- [Multi-File Editor](https://docs.databricks.com/aws/en/ldp/multi-file-editor)