# Advanced Spark SQL Examples

This notebook demonstrates advanced SQL concepts in Spark with detailed explanations:
- Creating, using, and dropping views
- Querying with and without CTEs
- Customers without orders (Join vs Subquery)
- CTAS (Create Table As Select)

## 1. Create, Use, Drop a View

**Concept Explanation:**
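A *view* is a named SQL query that acts like a virtual table: it stores the query definition, not the data, so the query runs each time the view is read. The `TEMP VIEW` used here is visible only in the current Spark session and disappears when the session ends. Views improve code reuse and make complex queries easier to handle.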
- **Create View:** Defines the virtual table.
- **Use View:** Query it like a table.
- **Drop View:** Remove it when no longer needed.

In [None]:
# Create View
spark.sql("""
CREATE OR REPLACE TEMP VIEW complete_orders AS
SELECT * FROM orders WHERE order_status='COMPLETE'
""")

# Use the View
spark.sql("SELECT COUNT(*) AS complete_count FROM complete_orders").show()

# Drop the View
spark.catalog.dropTempView("complete_orders")
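
The create/use/drop cycle above can also be wrapped in a small helper so that the view is dropped even if a query fails partway through. The sketch below is one possible way to do that. `temp_view` is a hypothetical helper name, not part of PySpark, and it assumes the notebook's `spark` session is passed in.

```python
from contextlib import contextmanager


@contextmanager
def temp_view(spark, name, select_sql):
    """Create a temp view, yield its name, and drop the view on exit."""
    spark.sql(f"CREATE OR REPLACE TEMP VIEW {name} AS {select_sql}")
    try:
        yield name
    finally:
        # Runs on normal exit and on exceptions, so the view never lingers.
        spark.catalog.dropTempView(name)


# Usage in this notebook (assumes the `orders` table is registered):
# with temp_view(spark, "complete_orders",
#                "SELECT * FROM orders WHERE order_status='COMPLETE'") as v:
#     spark.sql(f"SELECT COUNT(*) AS complete_count FROM {v}").show()
```

The `finally` clause is the design choice that matters here: cleanup no longer depends on reaching a separate drop statement.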

## 2. Query Without CTE

**Concept Explanation:**
A query can directly compute daily revenue by joining `orders` and `order_items`.
- This approach is simple but can be harder to read when queries get longer.
- Every calculation is done inline without a temporary reference.

In [None]:
spark.sql("""
SELECT to_date(o.order_date) AS order_date,
       ROUND(SUM(oi.order_item_subtotal),2) AS daily_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE','CLOSED')
GROUP BY to_date(o.order_date)
ORDER BY order_date
""").show(10, truncate=False)

## 3. Query With CTE

**Concept Explanation:**
A **Common Table Expression (CTE)** is a temporary named result set defined with the `WITH` clause.
- Improves **readability** by splitting logic into smaller blocks.
- Can be reused multiple times in the main query.
- Exists only during execution of the query.

**Use Case Here:**
We first compute daily revenue in a CTE (`revenue_cte`) and then query it to format results.

In [None]:
spark.sql("""
WITH revenue_cte AS (
    SELECT to_date(o.order_date) AS order_date,
           SUM(oi.order_item_subtotal) AS revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_item_order_id
    WHERE o.order_status IN ('COMPLETE','CLOSED')
    GROUP BY to_date(o.order_date)
)
SELECT order_date, ROUND(revenue,2) AS daily_revenue
FROM revenue_cte
ORDER BY order_date
""").show(10, truncate=False)
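
To illustrate that a CTE can be referenced more than once, the sketch below reads `revenue_cte` twice: once for the per-day rows and once to compute the overall average. The function name `days_above_average` is hypothetical, and the query assumes the same `orders` and `order_items` tables used above.

```python
DAYS_ABOVE_AVERAGE_SQL = """
WITH revenue_cte AS (
    SELECT to_date(o.order_date) AS order_date,
           SUM(oi.order_item_subtotal) AS revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_item_order_id
    WHERE o.order_status IN ('COMPLETE','CLOSED')
    GROUP BY to_date(o.order_date)
)
SELECT r.order_date,
       ROUND(r.revenue, 2) AS daily_revenue,
       ROUND(a.avg_revenue, 2) AS avg_daily_revenue
FROM revenue_cte r
CROSS JOIN (SELECT AVG(revenue) AS avg_revenue FROM revenue_cte) a
WHERE r.revenue > a.avg_revenue
ORDER BY r.order_date
"""


def days_above_average(spark):
    """Return the days whose revenue exceeds the average daily revenue."""
    # revenue_cte is defined once and referenced twice in the main query.
    return spark.sql(DAYS_ABOVE_AVERAGE_SQL)


# Usage in this notebook:
# days_above_average(spark).show(10, truncate=False)
```

Without the CTE, the aggregation over `orders` and `order_items` would have to be written out twice, once per reference.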

## 4. Customers Without Orders (Join)

**Concept Explanation:**
- A `LEFT JOIN` returns all rows from the left table (`customers`), along with matching rows from the right table (`orders`).
- Customers with no orders will have `NULL` values in the `orders` columns.
- Filtering `WHERE o.order_id IS NULL` gives customers with no orders.

In [None]:
spark.sql("""
SELECT c.customer_id, c.customer_fname, c.customer_lname
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.order_customer_id
WHERE o.order_id IS NULL
""").show(10, truncate=False)
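
Spark SQL also has a join type that expresses "left rows with no match" directly, so the `IS NULL` filter is not needed. The sketch below shows the same question written with `LEFT ANTI JOIN`. The function name `customers_without_orders_anti` is hypothetical, and the table and column names are the ones assumed throughout this notebook.

```python
def customers_without_orders_anti(spark):
    """Return customers that have no matching row in orders."""
    # An anti join keeps only left-side rows without a match, so only
    # columns from `customers` can appear in the SELECT list.
    return spark.sql("""
        SELECT c.customer_id, c.customer_fname, c.customer_lname
        FROM customers c
        LEFT ANTI JOIN orders o ON c.customer_id = o.order_customer_id
    """)


# Usage in this notebook:
# customers_without_orders_anti(spark).show(10, truncate=False)
```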

## 5. Customers Without Orders (Subquery)

**Concept Explanation:**
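- Instead of a join, we can use a **subquery**.
- Here we check that a `customer_id` is **NOT IN** the list of `order_customer_id` values from the `orders` table.
- The syntax is simple, but `NOT IN` has a pitfall: if the subquery returns even one `NULL`, the predicate is never true and the query returns no rows. Spark also has to plan it as a null-aware anti join, which can be slower than the join version on large tables.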

In [None]:
spark.sql("""
SELECT c.customer_id, c.customer_fname, c.customer_lname
FROM customers c
WHERE c.customer_id NOT IN (SELECT DISTINCT order_customer_id FROM orders)
""").show(10, truncate=False)
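
Because `NOT IN` yields no rows whenever the subquery produces a `NULL`, a correlated `NOT EXISTS` is a common null-safe alternative. The sketch below is one way to write it. The function name `customers_without_orders_not_exists` is hypothetical, and it assumes the same `customers` and `orders` tables as above.

```python
def customers_without_orders_not_exists(spark):
    """Return customers with no orders, even if order_customer_id has NULLs."""
    # NOT EXISTS only asks whether a matching row exists, so a NULL
    # order_customer_id simply fails to match instead of poisoning the result.
    return spark.sql("""
        SELECT c.customer_id, c.customer_fname, c.customer_lname
        FROM customers c
        WHERE NOT EXISTS (
            SELECT 1
            FROM orders o
            WHERE o.order_customer_id = c.customer_id
        )
    """)


# Usage in this notebook:
# customers_without_orders_not_exists(spark).show(10, truncate=False)
```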

## 6. CTAS (Create Table As Select)

**Concept Explanation:**
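CTAS (`CREATE TABLE ... AS SELECT`) creates a new table and populates it with the result of a query.
- Spark SQL supports CTAS directly. The cell below uses `CREATE OR REPLACE TEMP VIEW` as a lightweight stand-in that needs no storage location.
- Unlike a real CTAS table, a temp view stores only the query definition: nothing is materialized, and the query is re-evaluated on each read.
- Either form gives an intermediate result a name so that later queries can reuse it.

**Use Case Here:**
We define daily revenue as a temp view and then query it.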

In [None]:
spark.sql("DROP TABLE IF EXISTS daily_revenue_ctas")

spark.sql("""
CREATE OR REPLACE TEMP VIEW daily_revenue_ctas AS
SELECT to_date(o.order_date) AS order_date,
       ROUND(SUM(oi.order_item_subtotal),2) AS order_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE','CLOSED')
GROUP BY to_date(o.order_date)
""")

spark.sql("SELECT * FROM daily_revenue_ctas ORDER BY order_date").show(10, truncate=False)
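
For comparison, a true CTAS statement materializes the rows instead of storing only the query. The sketch below is a minimal version under some assumptions: the table name `daily_revenue_tbl` and the function name `create_daily_revenue_table` are hypothetical, Parquet is chosen only as an example format, and the session must be able to write to its warehouse directory.

```python
def create_daily_revenue_table(spark, table_name="daily_revenue_tbl"):
    """Materialize daily revenue as a managed Parquet table via CTAS."""
    # Drop any earlier copy so the CTAS does not fail on an existing table.
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    spark.sql(f"""
        CREATE TABLE {table_name}
        USING parquet
        AS
        SELECT to_date(o.order_date) AS order_date,
               ROUND(SUM(oi.order_item_subtotal), 2) AS order_revenue
        FROM orders o
        JOIN order_items oi ON o.order_id = oi.order_item_order_id
        WHERE o.order_status IN ('COMPLETE', 'CLOSED')
        GROUP BY to_date(o.order_date)
    """)
    # The rows are now stored; reading the table does not rerun the join.
    return spark.table(table_name)


# Usage in this notebook:
# create_daily_revenue_table(spark).orderBy("order_date").show(10, truncate=False)
```

A table created this way outlives the session, so it has to be dropped explicitly when it is no longer needed.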