In [0]:
# 3. Querying and Transforming Your Data (ETL)

Once your data is in the Lakehouse, you need to query it and build transformation pipelines. Databricks offers two primary methods for orchestrating this work:

1. **Databricks Jobs:** An imperative approach, ideal for simple, scheduled tasks.
2. **Delta Live Tables:** A declarative approach, ideal for building robust, observable data pipelines.

## Writing Queries in the Databricks SQL Editor

The **SQL Editor** is your home for data exploration. You can write standard SQL to query any table you have access to, and the query runs on the SQL warehouse you select (a serverless warehouse, where your workspace offers one). You can also open the query profile to troubleshoot performance bottlenecks and review earlier runs in the query history.

### To Try It Yourself:
1. In the left navigation bar, select **SQL Editor**.
2. Ensure a SQL Warehouse is selected in the top-right.
3. Run the query below to explore the clean, aggregated data produced by one of the demo pipelines.

In [None]:
-- This is the final 'gold' table from the bike pipeline demo, ready for BI.
SELECT * FROM main.dbdemos_pipeline_bike.bike_trips_gold;

## Orchestration Method 1: Databricks Jobs (The Imperative Approach)

A **Job** is the simplest way to run a notebook, script, or SQL query on a schedule. This is an *imperative* approach, where you define the steps to be executed in order. Think of it as a powerful "cron job" for Databricks, perfect for straightforward, routine tasks.

### To See How it Works:

You can schedule any notebook to run as a Job by clicking the **Schedule** button in the top-right corner of the notebook UI. This will take you to the Jobs UI where you can define the schedule and compute.
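
Scheduling can also be expressed as data rather than clicks. As a rough sketch, the snippet below builds the kind of JSON payload that the Jobs API (`POST /api/2.1/jobs/create`) accepts for a scheduled notebook task. The job name, task key, and notebook path are hypothetical placeholders, compute settings are left out, and nothing is sent to a workspace; treat the field names as something to confirm against the Jobs API reference before relying on them.

```python
import json


def build_scheduled_notebook_job(name, notebook_path, quartz_cron, timezone_id="UTC"):
    """Return a Jobs API style payload for one notebook task run on a schedule."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "run_notebook",  # hypothetical task name
                "notebook_task": {"notebook_path": notebook_path},
                # Compute settings (for example a job cluster) are omitted in this sketch.
            }
        ],
        "schedule": {
            # Quartz cron fields: second minute hour day-of-month month day-of-week
            "quartz_cron_expression": quartz_cron,
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        },
    }


# Hypothetical example: run a notebook every day at 06:00 UTC.
job_payload = build_scheduled_notebook_job(
    name="daily-bike-refresh",
    notebook_path="/Workspace/Users/someone@example.com/bike_refresh",
    quartz_cron="0 0 6 * * ?",
)
print(json.dumps(job_payload, indent=2))
```

Note that the payload lists concrete tasks and when to run them, which is what makes a Job imperative: you state the steps, and Databricks executes them as written.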

[![Video Thumbnail](https://img.youtube.com/vi/gHye-4w_d7A/0.jpg)](https://www.youtube.com/watch?v=gHye-4w_d7A "Databricks Workflows: Build, Run, and Manage ETL, ML, and Analytics")

📖 **Resource:** [Databricks Workflows Quickstart](https://docs.databricks.com/en/workflows/jobs/jobs-quickstart.html)

## Orchestration Method 2: Delta Live Tables (The Declarative Approach)

For building robust data pipelines, **Delta Live Tables (DLT)** is the modern, recommended approach. It's a *declarative* framework: instead of defining the *steps* of your pipeline, you simply define the *end state* of your tables using standard SQL or Python.

DLT automatically manages the underlying infrastructure, orchestration, data quality monitoring, and error handling.
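
To make the declarative idea concrete, here is a small plain-Python illustration. It is not the DLT API and does not describe how DLT works internally. Each table only declares which tables it reads from, and a valid run order is then derived from those declarations with a topological sort. The table names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each table declares only what it reads from; no run order is written down.
table_dependencies = {
    "bronze_trips": set(),
    "silver_trips": {"bronze_trips"},
    "gold_trips": {"silver_trips"},
}


def derive_run_order(dependencies):
    """Derive an order in which every table comes after the tables it reads."""
    return list(TopologicalSorter(dependencies).static_order())


run_order = derive_run_order(table_dependencies)
print(run_order)
```

Adding a new table in this model means adding one more declaration; the order is recomputed rather than edited by hand.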

Your setup script already deployed a DLT pipeline from the `pipeline-bike` demo!

### To Explore the DLT Pipeline:

1. **See the Definition:** In your workspace, navigate to the `pipeline-bike` demo folder and open the **`01-DLT-Pipeline-SQL`** notebook. This is the simple SQL code that defines the entire pipeline.
2. **See it Running:** Use the link generated by the setup script (or go to `Workflows > Delta Live Tables`) to see the live pipeline graph. You can monitor data flowing through the bronze, silver, and gold tables and review the data quality metrics recorded for each table.

[![Video Thumbnail](https://img.youtube.com/vi/1LAL-8y_q9Y/0.jpg)](https://www.youtube.com/watch?v=1LAL-8y_q9Y "What is Delta Live Tables?")

### 📖 Additional Resources:
* [Delta Live Tables Quickstart](https://docs.databricks.com/en/delta-live-tables/quickstart.html)
* [Tutorial: Run your first ETL pipeline with DLT](https://docs.databricks.com/en/delta-live-tables/tutorial-run-pipeline.html)

## Advanced SQL Techniques and Performance Optimization

### Query Optimization Tips

In [None]:
-- Example: Compacting files and co-locating rows on a frequently filtered column
-- (Z-ordering), so that data skipping can prune more files at query time
OPTIMIZE main.default.your_table_name
ZORDER BY (frequently_filtered_column);

-- Example: Partitioning a large table by a column that queries commonly filter on
CREATE TABLE main.default.sales_data (
    transaction_id STRING,
    amount DECIMAL(10,2),
    customer_id STRING,
    sale_date DATE
)
USING DELTA
PARTITIONED BY (sale_date);

-- Example: Inspecting the query plan (the date filter allows partition pruning)
EXPLAIN EXTENDED
SELECT customer_id, SUM(amount) AS total_sales
FROM main.default.sales_data
WHERE sale_date >= DATE '2024-01-01'
GROUP BY customer_id;
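
Why does co-locating rows on a filtered column help? Delta Lake keeps per-file minimum and maximum values for columns and can skip files whose range cannot match a filter. The plain-Python sketch below simulates that idea with made-up integers and a hypothetical `FileStats` record. Sorting on a single column stands in for clustering here; real Z-ordering interleaves several columns and is more involved.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    name: str
    min_value: int
    max_value: int


def write_files(values, rows_per_file):
    """Split values into fixed-size 'files' and record min/max per file."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append(FileStats(f"part-{start // rows_per_file}", min(chunk), max(chunk)))
    return files


def files_to_scan(files, low, high):
    """Keep only files whose [min, max] range overlaps the filter [low, high]."""
    return [f.name for f in files if f.max_value >= low and f.min_value <= high]


values = [7, 1, 9, 3, 8, 2, 6, 4, 5, 0, 11, 10]
unclustered = write_files(values, rows_per_file=4)
clustered = write_files(sorted(values), rows_per_file=4)

# Filter: value = 3
print(files_to_scan(unclustered, 3, 3))  # every file's range contains 3
print(files_to_scan(clustered, 3, 3))    # only the first file can contain 3
```

With the unclustered layout all three files must be read, while the clustered layout narrows the scan to one file.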

### Data Quality and Testing with Delta Live Tables

In [None]:
-- Example: Delta Live Tables with data quality constraints
CREATE OR REFRESH STREAMING LIVE TABLE clean_customer_data (
  -- Drop rows whose email does not look like an address
  CONSTRAINT valid_email EXPECT (email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$') ON VIOLATION DROP ROW,
  -- No ON VIOLATION clause (default): keep the row and record the violation in the pipeline metrics
  CONSTRAINT valid_age EXPECT (age > 0 AND age < 150),
  -- Stop the update if a customer ID is missing
  CONSTRAINT not_null_id EXPECT (customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT
  customer_id,
  email,
  age,
  current_timestamp() AS processed_at
FROM STREAM(LIVE.raw_customer_data);
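
DLT expectations support three behaviors when a row fails a check: keep the row and record the violation (the default), drop the row (`ON VIOLATION DROP ROW`), or fail the update (`ON VIOLATION FAIL UPDATE`). The plain-Python sketch below mimics those three behaviors on a few made-up records so the difference is easy to see. The function name and the `"warn"`, `"drop"`, and `"fail"` action labels are hypothetical and are not part of the DLT API.

```python
def apply_expectation(rows, predicate, action="warn"):
    """Return (kept_rows, violation_count) for one expectation.

    action: "warn" keeps failing rows, "drop" removes them,
    "fail" raises on the first failing row.
    """
    if action not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown action: {action}")
    kept, violations = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if action == "fail":
            raise ValueError(f"expectation failed for row: {row}")
        if action == "warn":
            kept.append(row)  # row is retained; the violation is only counted
        # action == "drop": the failing row is not kept
    return kept, violations


customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 212},  # implausible age
    {"customer_id": 3, "age": 58},
]


def valid_age(row):
    return 0 < row["age"] < 150


warned, warn_count = apply_expectation(customers, valid_age, action="warn")
dropped, drop_count = apply_expectation(customers, valid_age, action="drop")
print(len(warned), warn_count)    # 3 rows kept, 1 violation counted
print(len(dropped), drop_count)   # 2 rows kept, 1 violation counted
```

Calling the same function with `action="fail"` raises on the second record, which mirrors how a failed update stops processing instead of writing partial results.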

## Comprehensive Resource Library

### 📚 **Official Documentation**
* [SQL Analytics and Warehousing Complete Guide](https://docs.databricks.com/en/sql/index.html)
* [Delta Live Tables Documentation](https://docs.databricks.com/en/delta-live-tables/index.html)
* [Databricks Workflows (Jobs) Guide](https://docs.databricks.com/en/workflows/index.html)
* [SQL Reference Guide](https://docs.databricks.com/en/sql/language-manual/index.html)
* [Query Optimization Best Practices](https://docs.databricks.com/en/optimizations/index.html)
* [Data Quality Monitoring](https://docs.databricks.com/en/delta-live-tables/expectations.html)

### 🎥 **Video Learning Resources**
* [SQL Analytics Fundamentals Playlist](https://www.youtube.com/playlist?list=PLTPXxbhUt-YVstcW1CG5F0S3LXvvfRT1u)
* [Delta Live Tables Deep Dive](https://www.youtube.com/watch?v=1LAL-8y_q9Y)
* [Advanced SQL Techniques on Databricks](https://www.youtube.com/watch?v=LWtj-84Hi8E)
* [Performance Tuning for Analytics](https://www.youtube.com/watch?v=FfVuCpMhV6Q)
* [Data Quality with DLT](https://www.youtube.com/watch?v=vv8OPIhCGE8)

### 🛠️ **Hands-On Tutorials and Labs**
* [Delta Live Tables Quickstart](https://docs.databricks.com/en/delta-live-tables/quickstart.html)
* [SQL Analytics Workshop](https://github.com/databricks-academy/sql-analytics-with-databricks)
* [Advanced Analytics with Databricks Course](https://academy.databricks.com/path/data-analyst)
* [ETL with Delta Live Tables Tutorial](https://docs.databricks.com/en/delta-live-tables/tutorial-run-pipeline.html)
* [Building Production Pipelines](https://github.com/databricks-academy/advanced-data-engineering-with-databricks)

### 📖 **Advanced Reading and Best Practices**
* [Medallion Architecture Implementation](https://www.databricks.com/glossary/medallion-architecture)
* [SQL Performance Optimization Guide](https://www.databricks.com/blog/2021/08/11/high-performance-sql-analytics-with-databricks-sql.html)
* [Data Pipeline Testing Strategies](https://www.databricks.com/blog/2021/12/09/testing-data-pipelines-with-databricks.html)
* [Change Data Capture (CDC) Patterns](https://docs.databricks.com/en/delta-live-tables/cdc.html)
* [Real-Time End-to-End Integration with Apache Kafka in Apache Spark's Structured Streaming](https://www.databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html)

### 🏗️ **Architecture and Design Patterns**
* [Modern Data Stack with Databricks](https://www.databricks.com/blog/2021/08/30/frequently-asked-questions-about-the-databricks-lakehouse-platform.html)
* [Real-time Analytics Architecture](https://www.databricks.com/solutions/accelerators/real-time-analytics)
* [Diving Into Delta Lake: Unpacking the Transaction Log](https://www.databricks.com/blog/2020/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html)
* [Scaling ETL Workloads](https://www.databricks.com/blog/2021/05/24/how-to-scale-machine-learning-pipelines.html)

### 🔧 **Tools and Integrations**
* [dbt Integration with Databricks](https://docs.databricks.com/en/partners/prep/dbt.html)
* [Apache Airflow Integration](https://docs.databricks.com/en/dev-tools/external-tools.html#apache-airflow)
* [GitHub Actions for CI/CD](https://docs.databricks.com/en/dev-tools/ci-cd/ci-cd-github.html)
* [Terraform for Infrastructure](https://registry.terraform.io/providers/databricks/databricks/latest/docs)

### 💡 **Community and Support**
* [Databricks Community Forums - SQL & Analytics](https://community.databricks.com/s/topic/0TO0w000000MqzEGAS/sql-analytics)
* [Delta Live Tables Community Discussions](https://community.databricks.com/s/topic/0TO5w00000096DHGAY/delta-live-tables)
* [Stack Overflow - Databricks SQL](https://stackoverflow.com/questions/tagged/databricks+sql)
* [LinkedIn Learning Databricks Courses](https://www.linkedin.com/learning/search?keywords=databricks)