# Databricks SQL & SQL Warehouses
## Data Analyst Persona

Up until now, we have largely focused on the **Data Engineering** persona (working with Spark Clusters, Jobs, DLT, and Notebooks). Today, we shift gears to the **Data Analyst** persona.

### Objectives
1.  Understand **Databricks SQL (DBSQL)** and how it enables Data Warehousing on the Lakehouse.
2.  Learn about **SQL Warehouses** (Serverless, Pro, Classic) and their features.
3.  Explore the **SQL Editor** for writing and managing queries.
4.  Deep dive into **Query Profiling**, Monitoring, and Debugging (Spark UI).
5.  Create **Visualizations** and **Dashboards** directly from queries.
6.  Connect **BI Tools** (Power BI, Tableau) to Databricks.

---

## 1. What is Databricks SQL (DBSQL)?

Databricks SQL provides a dedicated interface for data analysts who want to run SQL queries without worrying about Spark configurations, cluster management, or Python code.

### The "Warehouse" on the "Lake"
Traditionally, companies moved data from a Data Lake to a Data Warehouse (like Snowflake or Redshift) for high-performance SQL analysis.
*   **Databricks Solution:** With the **Lakehouse** architecture (powered by Delta Lake), you can perform Data Warehousing operations directly on your data lake storage.
*   **Benefit:** No data movement, single source of truth, lower latency, and lower TCO (Total Cost of Ownership).

### SQL Warehouse Types
To run SQL queries in DBSQL, you need a compute resource called a **SQL Warehouse**.

| Feature | **Serverless** (Recommended) | **Pro** | **Classic** |
| :--- | :--- | :--- | :--- |
| **Startup Time** | Instant (Seconds) | Minutes | Minutes |
| **Compute Management** | Managed by Databricks (in their account) | In your Cloud Account | In your Cloud Account |
| **Performance** | Photon Engine, Predictive I/O, **Intelligent Workload Management** | Photon Engine, Predictive I/O | Photon Engine |
| **Cost** | Optimizes cost by scaling down instantly | Standard scaling | Slower scaling |

**Key Features:**
*   **Photon Engine:** A vectorized query engine written in C++ that speeds up SQL execution without requiring changes to your queries.
*   **Predictive I/O:** AI-driven optimization to scan data faster.
*   **Concurrency:** Warehouses automatically scale (add more clusters) based on the number of concurrent queries.
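
To make the concurrency behaviour concrete, here is a toy model of how many clusters a warehouse might run for a given number of concurrent queries. It assumes a fixed capacity of 10 queries per cluster, which is a rule of thumb used here for illustration, not a guarantee; the real autoscaler also considers queue time and throughput. The function name and its parameters are hypothetical.

```python
import math


def estimate_clusters(concurrent_queries: int,
                      min_clusters: int = 1,
                      max_clusters: int = 5,
                      queries_per_cluster: int = 10) -> int:
    """Toy estimate of active clusters; NOT the actual Databricks autoscaler."""
    if min_clusters < 1 or max_clusters < min_clusters:
        raise ValueError("need 1 <= min_clusters <= max_clusters")
    if queries_per_cluster < 1:
        raise ValueError("queries_per_cluster must be positive")
    # Clusters needed to serve the load, assuming a fixed per-cluster capacity.
    needed = math.ceil(concurrent_queries / queries_per_cluster)
    # Clamp to the configured Min/Max range of the warehouse.
    return max(min_clusters, min(needed, max_clusters))


for load in (0, 8, 25, 80):
    print(load, "queries ->", estimate_clusters(load), "cluster(s)")
```

With the defaults above, a load of 80 queries is capped at 5 clusters, so the remaining queries would queue until capacity frees up.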

## 2. Creating a SQL Warehouse (Hands-on)

*Note: This is a GUI operation, but the steps are documented here for reference.*

1.  Navigate to **Compute** -> **SQL Warehouses** tab.
2.  Click **Create SQL Warehouse**.
3.  **Name:** E.g., `dev_warehouse`.
4.  **Cluster Size:** Uses "T-Shirt" sizing (2X-Small, Small, Medium, etc.).
    *   *Tip:* Start small (2X-Small) and scale up if query latency is high.
5.  **Scaling:** Set Min and Max clusters.
    *   *Example:* Min 1, Max 5. Databricks will spin up new clusters if too many users run queries simultaneously to prevent queuing.
6.  **Type:** Select **Serverless** for the best experience.
7.  **Unity Catalog:** Enabled by default.
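
The same settings can also be expressed as a request body for the SQL Warehouses REST API (`POST /api/2.0/sql/warehouses`). The sketch below only builds and prints the payload; it does not call the API. The field names reflect my understanding of that API and should be checked against the reference for your workspace, and the helper name `build_warehouse_payload` is hypothetical.

```python
import json

# T-shirt sizes as shown in the UI; assumed to match the API's accepted values.
VALID_SIZES = ("2X-Small", "X-Small", "Small", "Medium", "Large",
               "X-Large", "2X-Large", "3X-Large", "4X-Large")


def build_warehouse_payload(name: str,
                            cluster_size: str = "2X-Small",
                            min_clusters: int = 1,
                            max_clusters: int = 5,
                            serverless: bool = True) -> dict:
    """Build a request body mirroring the GUI steps above (illustrative only)."""
    if cluster_size not in VALID_SIZES:
        raise ValueError(f"unknown cluster size: {cluster_size}")
    if not 1 <= min_clusters <= max_clusters:
        raise ValueError("need 1 <= min_clusters <= max_clusters")
    return {
        "name": name,
        "cluster_size": cluster_size,
        "min_num_clusters": min_clusters,
        "max_num_clusters": max_clusters,
        # Assumption: serverless is requested as a PRO warehouse with this flag set.
        "warehouse_type": "PRO",
        "enable_serverless_compute": serverless,
    }


payload = build_warehouse_payload("dev_warehouse")
print(json.dumps(payload, indent=2))
```

Keeping the payload in code makes it easier to create identical warehouses across dev and prod workspaces than clicking through the GUI each time.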

### Connection Details (BI Tools)
Once created, go to the **Connection Details** tab. You will find:
*   **Server Hostname**
*   **HTTP Path**
*   **JDBC/ODBC Connection String**
*   *Direct Connectors:* Buttons to download connection files for Power BI, Tableau, etc.
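
The Server Hostname and HTTP Path are also what a programmatic client needs. Below is a minimal sketch using the `databricks-sql-connector` package (`pip install databricks-sql-connector`), assuming authentication with a personal access token. The environment variable names and the helper functions are my own choices for illustration, not something the warehouse UI prescribes.

```python
import os

# Assumed variable names; use whatever your environment or secret store provides.
REQUIRED_VARS = ("DATABRICKS_SERVER_HOSTNAME",
                 "DATABRICKS_HTTP_PATH",
                 "DATABRICKS_TOKEN")


def connection_params(env=os.environ) -> dict:
    """Collect the values shown on the Connection Details tab."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "server_hostname": env["DATABRICKS_SERVER_HOSTNAME"],
        "http_path": env["DATABRICKS_HTTP_PATH"],
        "access_token": env["DATABRICKS_TOKEN"],
    }


def run_query(query: str, params: dict) -> list:
    """Run one query on the SQL Warehouse and return all rows."""
    # Imported lazily so the helper above works without the package installed.
    from databricks import sql

    with sql.connect(**params) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()


# Example usage (requires a running warehouse and valid credentials):
# rows = run_query("SELECT 1 AS ok", connection_params())
```

Reading the token from the environment keeps credentials out of notebooks and version control.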

## 3. The SQL Editor

The SQL Editor is the IDE for Data Analysts. It allows you to:
*   Browse the **Catalog Explorer** (Catalogs, Schemas, Tables) on the left.
*   Write standard ANSI SQL queries.
*   Save, Share, and Schedule queries.

### Example Query Scenario
We will join the `orders_silver` table with the customer SCD Type 2 table (`customer_scd2_bronze`) to calculate total orders by market segment.

**Key Logic:**
*   Join `dev.etl.orders_silver` (Fact) with `dev.etl.customer_scd2_bronze` (Dimension).
*   Filter for active customer records (`__END_AT IS NULL`), because an SCD Type 2 table keeps one row per historical version of each customer.
*   Group by `c_mktsegment`.

```sql
-- Total orders per market segment, as demonstrated in the video.
-- You can run this in a Notebook attached to a SQL Warehouse or in the SQL Editor.

SELECT
    c.c_mktsegment,
    COUNT(o.o_orderkey) AS total_orders
FROM
    dev.etl.orders_silver o
    LEFT JOIN dev.etl.customer_scd2_bronze c
        ON o.o_custkey = c.c_custkey
WHERE
    -- SCD Type 2 logic: keep only the current (open) record for each customer
    c.__END_AT IS NULL
GROUP BY
    c.c_mktsegment;
```
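
To see why the `__END_AT IS NULL` filter matters, the sketch below runs the same query shape against a tiny in-memory SQLite database. The three orders and three customer rows are invented for illustration, and the catalog and schema prefix is dropped because SQLite has no three-level namespace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_silver (o_orderkey INTEGER, o_custkey INTEGER);
CREATE TABLE customer_scd2_bronze (
    c_custkey INTEGER, c_mktsegment TEXT, __START_AT TEXT, __END_AT TEXT
);
INSERT INTO orders_silver VALUES (1, 100), (2, 100), (3, 200);
-- Customer 100 changed segment once, so it has a closed row and an open row.
INSERT INTO customer_scd2_bronze VALUES
    (100, 'BUILDING',  '2024-01-01', '2024-06-01'),
    (100, 'MACHINERY', '2024-06-01', NULL),
    (200, 'BUILDING',  '2024-01-01', NULL);
""")

QUERY = """
SELECT c.c_mktsegment, COUNT(o.o_orderkey) AS total_orders
FROM orders_silver o
LEFT JOIN customer_scd2_bronze c ON o.o_custkey = c.c_custkey
{where}
GROUP BY c.c_mktsegment
ORDER BY c.c_mktsegment
"""

with_filter = conn.execute(
    QUERY.format(where="WHERE c.__END_AT IS NULL")).fetchall()
without_filter = conn.execute(QUERY.format(where="")).fetchall()

print("with filter:   ", with_filter)     # [('BUILDING', 1), ('MACHINERY', 2)]
print("without filter:", without_filter)  # [('BUILDING', 3), ('MACHINERY', 2)]
```

Without the filter, the two orders of customer 100 match both of its versions, so three orders turn into five counted rows and the historical segment is overstated.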

## 4. Query Profiling & Debugging

One of the most powerful features of DBSQL is the **Query Profile**.

### How to access:
1.  Run a query in the SQL Editor.
2.  Click on the time duration at the bottom of the editor (e.g., "Time in Photon").
3.  Click **See Query Profile**.

### What to look for:
*   **Visual DAG:** A flowchart of how the query was executed.
*   **Time Spent:** See exactly which step took the most time (e.g., Scanning data vs. Aggregating vs. Shuffling).
*   **Rows Processed:** Hover over lines to see if a join caused a data explosion (or if partition pruning worked efficiently).
*   **Photon Usage:** Yellow boxes indicate the query is running on the high-performance Photon engine.

### Advanced Debugging (Spark UI)
For deep engineering analysis, you can click the **Spark UI** link within the Query Profile to see the underlying Spark DAG, stages, and tasks.

## 5. Visualizations & Scheduling

### Visualizations
You don't always need an external BI tool.
1.  In the SQL Editor results pane, click the **+ (Plus)** icon -> **Visualization**.
2.  **Type:** Select Bar, Line, Pie, etc.
3.  **Configuration:** Drag and drop columns (e.g., X-axis: `c_mktsegment`, Y-axis: `total_orders`).
4.  These visualizations can be added to **Databricks Dashboards**.

### Scheduling
You can schedule a SQL query to run periodically (e.g., to refresh a report or check data quality).
1.  Click **Schedule** at the top right of the editor.
2.  Set frequency (e.g., Every 12 hours).
3.  Choose the SQL Warehouse to run the schedule.
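
When choosing a frequency, it can help to list the resulting run times and check that they line up with upstream data loads. The sketch below does this for a fixed interval such as the 12-hour example; the start time is arbitrary and the helper `next_runs` is hypothetical, not part of any Databricks API.

```python
from datetime import datetime, timedelta, timezone


def next_runs(start: datetime, every_hours: int, count: int) -> list:
    """Return the next `count` run times for a fixed-interval schedule."""
    if every_hours <= 0 or count < 0:
        raise ValueError("every_hours must be positive and count non-negative")
    step = timedelta(hours=every_hours)
    return [start + step * i for i in range(1, count + 1)]


first_run = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
for run in next_runs(first_run, every_hours=12, count=3):
    print(run.isoformat())
```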

## 6. Notebooks on SQL Warehouses

A relatively new and powerful feature is the ability to run **Databricks Notebooks** directly on a **SQL Warehouse**.

*   **Why?** You get the narrative power of a notebook with the startup speed and cost-efficiency of a Serverless SQL Warehouse.
*   **Limitation:** Only SQL (and Markdown) cells run. A SQL Warehouse is designed for SQL workloads, so cells in Python or other languages are not expected to execute.
*   **How:** In the notebook's **Connect** dropdown, select your **SQL Warehouse** instead of a standard All-Purpose Cluster.

```python
# Note: If this notebook is attached to a SQL Warehouse,
# only SQL cells (such as the query in section 3) will execute successfully.
# Python code cells are not expected to run there.

print("To test the SQL Warehouse integration:")
print("1. Click the 'Connect' dropdown at the top right.")
print("2. Select 'SQL Warehouses'.")
print("3. Choose the warehouse created in section 2 (e.g., dev_warehouse).")
print("4. Run the SQL cell above.")
```