
# Module 8 Lab: Performance Tuning & Cost Optimization on Databricks
**Language:** PySpark (with SQL references)  
**Goal:** Measure performance improvements from Spark- and Delta-level optimizations.

> Run top-to-bottom on a Databricks cluster. Keep cluster size constant within a given A/B test for fair comparisons.



## 🔭 Big Picture: What You'll Do
1. **Ingest** CSV data → Delta tables with an intentionally suboptimal layout.
2. Record **baseline** query plans & runtimes.
3. Apply **Spark optimizations** (AQE, predicate pushdown, projection pruning, broadcast joins, caching).
4. Apply **Delta optimizations** (OPTIMIZE, Z-ORDER, Auto Optimize).
5. **Re-benchmark** and compare end-to-end speedups & cost considerations.



## Notebook 0 — Utilities

Below we define two reusable helpers:

- `bench(name)`: a **decorator factory** that times any function you wrap. It prints how long the wrapped call took and returns a `(result, elapsed_seconds)` tuple, so unpack both values at the call site.  
  - Uses Python's `time.time()` to mark start/end and `functools.wraps` so your function's name/docstring stay intact.

- `print_plan(df)`: prints Spark's **query execution plan**. We first try the JVM-side plan text (`_jdf.queryExecution().toString()`), which includes the parsed, analyzed, optimized, and physical plans, and fall back to `df.explain("formatted")` if `_jdf` is not available. We also try `df.explain("cost")` (if supported) to see the optimizer's statistics.


In [0]:

import time
from functools import wraps

def bench(name):
    """Decorator to time a function and print elapsed seconds."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            print(f"\n=== {name} ===")
            t0 = time.time()
            result = fn(*args, **kwargs)
            dt = time.time() - t0
            print(f"{name} took: {dt:.3f}s")
            return result, dt
        return wrapper
    return decorator

def print_plan(df):
    """Print Spark query plans for inspection."""
    try:
        print("\n-- EXPLAIN (formatted via JVM) --")
        print(df._jdf.queryExecution().toString())  # engine-native plan text
    except Exception as e:
        print("Formatted plan via _jdf not available:", e)
        df.explain("formatted")
    print("\n-- EXPLAIN cost --")
    try:
        df.explain("cost")
    except Exception as e:
        print("Cost-based EXPLAIN not supported on this runtime:", e)
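
A single timed run can be noisy (warm-up, caching, cluster contention). As a minimal sketch that is not part of the original lab, the hypothetical helpers `time_runs` and `summarize` below repeat a zero-argument callable a few times and report min/median/max. They use `time.perf_counter()`, which is designed for measuring intervals.

```python
import statistics
import time

def time_runs(fn, runs=3):
    """Call fn() `runs` times; return (last_result, list_of_elapsed_seconds)."""
    timings = []
    result = None
    for _ in range(runs):
        t0 = time.perf_counter()
        result = fn()
        timings.append(time.perf_counter() - t0)
    return result, timings

def summarize(timings):
    """Reduce a list of timings to min / median / max (seconds)."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

For example, `time_runs(lambda: df.collect(), runs=3)` would time three collections of a DataFrame `df`. Comparing medians is usually more robust than comparing single runs.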



## Step 1 — Environment Setup

We create a working **database** and choose a **DBFS path** to store Delta tables.

- `spark.sql(...)`: runs SQL statements from Python.
- Keep cluster/runtime consistent between baseline and optimized tests for fair comparisons.


In [0]:

# Choose working catalog/database and DBFS base path
catalog = "hive_metastore"            # informational only: nothing below switches catalogs; with Unity Catalog, run USE CATALOG first
db = "perf_tuning_lab"
base_path = "/FileStore/tmp/perf_tuning_lab"    # DBFS path for this lab

spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")
spark.sql(f"USE {db}")

print("Using catalog:", catalog, "| database:", db, "| base_path:", base_path)



### 📥 Upload CSVs to DBFS

Upload these files to the following DBFS paths **before** running Step 1.2:
- `dbfs:/FileStore/tmp/perf_tuning_lab/input/sales.csv`
- `dbfs:/FileStore/tmp/perf_tuning_lab/input/product_catalog.csv`
- `dbfs:/FileStore/tmp/perf_tuning_lab/input/customer_details.csv`

You can do this via **Catalog → Browse DBFS** in the Databricks UI.



### 1.2 Read CSVs & Create a Scaled Dataset 

- `spark.read.option("header","true").csv(...)`: treats the first row as column names. Options must be set before `.csv(...)`, because `.csv(...)` returns the DataFrame.
- `.option("inferSchema","true")`: Spark infers column types (convenient; explicit schema is faster for production).
- We **scale up** rows by repeatedly `union`-ing the same DataFrame and modifying a key column so it's unique. This helps stress the system to make optimizations visible.
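The uniqueness of the scaled keys depends on the offset stride being larger than any original key. The pure-Python sketch below mirrors the offset arithmetic of the scale-up loop (`100000 * (i + 1)`) on plain integers. The function name `scaled_keys` and the sample key values are illustrative assumptions, not part of the lab data.

```python
def scaled_keys(original_keys, scale, stride=100_000):
    """Replicate keys `scale` times, offsetting each extra copy by a multiple of `stride`."""
    keys = list(original_keys)
    for i in range(scale - 1):
        keys.extend(k + stride * (i + 1) for k in original_keys)
    return keys

# Keys below the stride stay unique after scaling.
ok = scaled_keys([1, 2, 3], scale=3)
print(ok)  # [1, 2, 3, 100001, 100002, 100003, 200001, 200002, 200003]
print(len(set(ok)) == len(ok))  # True

# A key at or above the stride can collide with a shifted copy.
clash = scaled_keys([1, 100_001], scale=2)
print(len(set(clash)) == len(clash))  # False: 1 + 100000 == 100001
```

If the real `SalesOrderLineNumber` values can reach 100000 or more, a larger stride would be needed to keep the scaled rows distinct.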


In [0]:

from pyspark.sql import functions as F

SCALE_SALES = 10  # increase to stress the system

raw_sales = (spark.read
  .option("header","true").option("inferSchema","true")
  .csv(f"{base_path}/input/sales.csv"))

raw_products = (spark.read
  .option("header","true").option("inferSchema","true")
  .csv(f"{base_path}/input/product_catalog.csv"))

raw_customers = (spark.read
  .option("header","true").option("inferSchema","true")
  .csv(f"{base_path}/input/customer_details.csv"))

# Synthetic scale-up with union; we modify SalesOrderLineNumber to keep rows unique
scaled_sales = raw_sales
for i in range(SCALE_SALES-1):
    scaled_sales = scaled_sales.union(
        raw_sales.withColumn(
            "SalesOrderLineNumber",
            F.col("SalesOrderLineNumber") + F.lit(100000*(i+1))
        )
    )

print("Row counts →",
      "sales:", scaled_sales.count(),
      "| products:", raw_products.count(),
      "| customers:", raw_customers.count())



## Step 2 — Ingest to Delta

We normalize types and compute `SalesAmount`. Then we write to **Delta** without partitioning and with a **high `repartition` count** to intentionally **create many small files** (to demonstrate later how `OPTIMIZE` and Z-ORDER help).

**Key functions:**
- `withColumn`, `F.to_date`, `cast`: type cleanup.
- `repartition(200)`: increases output file count (we will compact later).
- `.write.format("delta")`: saves the DataFrame as a Delta table (Parquet data files plus a `_delta_log` transaction log).
- `CREATE TABLE ... USING delta LOCATION ...`: registers an external Delta table.


In [0]:

sales_df = (scaled_sales
  .withColumn("OrderDate", F.to_date("OrderDate"))
  .withColumn("Quantity", F.col("Quantity").cast("int"))
  .withColumn("UnitPrice", F.col("UnitPrice").cast("double"))
  .withColumn("TaxAmount", F.col("TaxAmount").cast("double"))
  .withColumn("SalesAmount", F.col("Quantity") * F.col("UnitPrice") + F.col("TaxAmount"))
)

product_df = raw_products
customer_df = raw_customers

(sales_df
  .repartition(200)  # intentionally many files
  .write.mode("overwrite").format("delta")
  .save(f"{base_path}/tables/sales_delta"))
spark.sql(f"CREATE TABLE IF NOT EXISTS sales_delta USING delta LOCATION '{base_path}/tables/sales_delta'")

(product_df.write.mode("overwrite").format("delta")
  .save(f"{base_path}/tables/product_delta"))
spark.sql(f"CREATE TABLE IF NOT EXISTS product_delta USING delta LOCATION '{base_path}/tables/product_delta'")

(customer_df.write.mode("overwrite").format("delta")
  .save(f"{base_path}/tables/customer_delta"))
spark.sql(f"CREATE TABLE IF NOT EXISTS customer_delta USING delta LOCATION '{base_path}/tables/customer_delta'")

print("Delta tables ready.")



### ✅ Checkpoint A — Validate Layout & Files
- `DESCRIBE DETAIL` shows Delta metadata (`numFiles`, `sizeInBytes`, etc.).  
- `dbutils.fs.ls(path)` lists the actual files. Expect **many small files** now.
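To turn "many small files" into a number you can compare before and after compaction, the following sketch derives the average file size from the `numFiles` and `sizeInBytes` columns of `DESCRIBE DETAIL`. The helper names are hypothetical, and `describe_layout` assumes an active `spark` session is passed in.

```python
def file_layout_summary(num_files, size_in_bytes):
    """Return file count and average file size in MB (0.0 for an empty table)."""
    avg_mb = size_in_bytes / num_files / (1024 * 1024) if num_files else 0.0
    return {"numFiles": num_files, "avgFileMB": round(avg_mb, 2)}

def describe_layout(spark, table_name):
    """Read numFiles/sizeInBytes from DESCRIBE DETAIL and summarize them."""
    row = (spark.sql(f"DESCRIBE DETAIL {table_name}")
                .select("numFiles", "sizeInBytes")
                .first())
    return file_layout_summary(row["numFiles"], row["sizeInBytes"])
```

Calling `describe_layout(spark, "sales_delta")` here and again after `OPTIMIZE` gives a direct before/after comparison of file count and average size.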


In [0]:

spark.sql("DESCRIBE DETAIL sales_delta").show(truncate=False)
display(dbutils.fs.ls(f"{base_path}/tables/sales_delta"))



## Step 3 — Baseline Queries

We **disable** Spark-side optimizations to get a clean baseline:
- `spark.sql.adaptive.enabled=false`: turn off Adaptive Query Execution (AQE).
- `spark.sql.autoBroadcastJoinThreshold=-1`: prevent auto-broadcast joins.

We then run three representative analytics:
1. **Q1**: Filter + aggregate (tests predicate pushdown & projection pruning potential)
2. **Q2**: Join with product dimension (tests broadcast vs shuffle)
3. **Q3**: Star join and aggregations (closer to real-world ETL/reporting)


In [0]:

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
print("Baseline mode: AQE OFF, auto-broadcast OFF")
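
Because the lab flips these settings several times, it is easy to leave a session in the wrong mode. As one possible approach, the hypothetical context manager `temporary_confs` below applies overrides and restores the previous values afterwards. It assumes `spark.conf.get(key, None)` returns `None` for keys that were never explicitly set in the session, and uses `spark.conf.unset` to restore those.

```python
from contextlib import contextmanager

@contextmanager
def temporary_confs(spark, overrides):
    """Apply Spark conf overrides for the duration of a `with` block."""
    # Remember the current explicit value (or None) for each key.
    previous = {key: spark.conf.get(key, None) for key in overrides}
    try:
        for key, value in overrides.items():
            spark.conf.set(key, value)
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                spark.conf.unset(key)  # key was not explicitly set before
            else:
                spark.conf.set(key, old)
```

A baseline run could then be wrapped as `with temporary_confs(spark, {"spark.sql.adaptive.enabled": "false"}): ...`, leaving the session settings unchanged once the block exits.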



### Q1 — Baseline

- `spark.table("sales_delta")`: read a Delta table as a DataFrame.  
- `filter(...)`: time-range filter (this should push down to storage when possible).  
- `groupBy().agg(F.sum(...))`: compute total sales.  
- `orderBy(desc)`: sort to get top items.  
- We call `print_plan()` to inspect the plan (look for scans, filters, and shuffles).


In [0]:

@bench("Q1 baseline")
def run_q1():
    from pyspark.sql import functions as F
    df = (spark.table("sales_delta")
            .filter("OrderDate >= '2018-01-01' AND OrderDate < '2019-01-01'")
            .groupBy("Item")
            .agg(F.sum("SalesAmount").alias("TotalSales"))
            .orderBy(F.desc("TotalSales")))
    print_plan(df)
    return df.collect()

q1_result, q1_time_base = run_q1()



### Q2 — Baseline Join

- `join(prod, "Item")` with auto-broadcast disabled uses a **shuffle-based** join (typically a sort-merge join), so both sides are shuffled on `Item`.  
- We aggregate by `Category` to see top categories.  
- Again, inspect the plan to confirm join type and number of shuffles.
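Reading long plans by eye is error-prone, so a rough text scan can help confirm the join type and count shuffles. The sketch below counts a few operator names in plan text. The helper `summarize_plan` is hypothetical, and `sample_plan` is a hand-written, abbreviated illustration rather than real output from this lab.

```python
def summarize_plan(plan_text):
    """Count occurrences of a few operator names in physical plan text.

    Note: "Exchange" also matches BroadcastExchange and ReusedExchange,
    so treat the count as an upper bound on shuffle exchanges.
    """
    markers = ["SortMergeJoin", "BroadcastHashJoin", "Exchange", "PushedFilters"]
    return {m: plan_text.count(m) for m in markers}

# Hand-written sample resembling a sort-merge join plan (not real output).
sample_plan = """
SortMergeJoin [Item], [Item], Inner
:- Sort [Item ASC]
:  +- Exchange hashpartitioning(Item, 200)
:     +- FileScan parquet [Item,SalesAmount] PushedFilters: [IsNotNull(Item)]
+- Sort [Item ASC]
   +- Exchange hashpartitioning(Item, 200)
      +- FileScan parquet [Item,Category] PushedFilters: [IsNotNull(Item)]
"""

print(summarize_plan(sample_plan))
# {'SortMergeJoin': 1, 'BroadcastHashJoin': 0, 'Exchange': 2, 'PushedFilters': 2}
```

On runtimes where `_jdf` is accessible, `df._jdf.queryExecution().executedPlan().toString()` is one way to obtain plan text to pass in. Operator names can differ between runtimes, so adjust the marker list as needed.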


In [0]:

@bench("Q2 baseline join")
def run_q2():
    from pyspark.sql import functions as F
    sales = spark.table("sales_delta")
    prod  = spark.table("product_delta")
    df = (sales.join(prod, "Item")
               .filter("OrderDate >= '2018-01-01' AND OrderDate < '2019-01-01'")
               .groupBy("Category")
               .agg(F.sum("SalesAmount").alias("TotalSales"))
               .orderBy(F.desc("TotalSales")))
    print_plan(df)
    return df.collect()

q2_result, q2_time_base = run_q2()



### Q3 — Baseline Star Join

- Fact `sales` joins Product and Customer dims.  
- Filters to `Country = 'United States'` and a given year.  
- Aggregates to state/category level.


In [0]:

@bench("Q3 baseline star")
def run_q3():
    sales = spark.table("sales_delta")
    prod  = spark.table("product_delta")
    cust  = spark.table("customer_delta")
    df = (
        sales.join(prod, "Item")
        .join(cust, "CustomerID")
        .filter(
            (F.col("Country") == "United States") &
            (F.col("OrderDate") >= "2019-01-01") &
            (F.col("OrderDate") < "2020-01-01")
        )
        .groupBy("State", "Category")
        .agg(
            F.sum("SalesAmount").alias("TotalSales"),
            F.countDistinct("SalesOrderNumber").alias("Orders")
        )
        .orderBy(F.desc("TotalSales"))
    )
    print_plan(df)
    return df.collect()

q3_result, q3_time_base = run_q3()


## Step 4 — Spark Optimizations

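We now enable Spark features that can provide gains in the **2–5x** range; the actual factor depends on data volume, cluster size, and query shape:

- **AQE** (`spark.sql.adaptive.enabled=true`): re-optimizes plans at runtime using shuffle statistics (e.g., coalesces small shuffle partitions, splits skewed join partitions). It is on by default since Spark 3.2, which is why the baseline had to turn it off explicitly.  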
- **Predicate Pushdown & Projection Pruning**: filter early and select only needed columns to reduce I/O and shuffle widths.  
- **Broadcast Joins**: replicate small dimension to all executors to avoid a large shuffle.  
- **Caching**: store reused intermediate results to accelerate iterative queries.


In [0]:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
print("AQE + skew join handling + coalesce enabled")



### Q1 Optimized — Predicate Pushdown & Projection Pruning 

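- `select(*cols_needed)`: **projection pruning** reduces scanned columns and shuffle width. Catalyst usually prunes unused columns on its own, so the explicit `select` mainly documents intent and guards against accidental wide reads.  
- Early `filter(...)`: encourages **predicate pushdown** to the data source. For a simple query like this one, the optimizer typically pushes the filter down regardless of where it appears.  
- Same aggregation logic as baseline; compare the two plans, and expect most of any runtime difference here to come from AQE rather than from a different scan.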


In [0]:

@bench("Q1 optimized (pushdown + projection)")
def run_q1_opt():
    from pyspark.sql import functions as F
    cols_needed = ["Item","SalesAmount","OrderDate"]
    sales_pruned = (spark.table("sales_delta")
                      .select(*cols_needed)
                      .filter("OrderDate >= '2018-01-01' AND OrderDate < '2019-01-01'"))
    df = (sales_pruned.groupBy("Item")
           .agg(F.sum("SalesAmount").alias("TotalSales"))
           .orderBy(F.desc("TotalSales")))
    print_plan(df)
    return df.collect()

q1_opt_result, q1_time_opt = run_q1_opt()



### Q2 Optimized — Broadcast Join

- Set `spark.sql.autoBroadcastJoinThreshold` ~ 10MB, or explicitly use `broadcast(df)`.  
- Broadcasting the small dimension sends a full copy to every executor, so the large fact table no longer has to be shuffled for the join.
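The threshold comparison can be sketched in plain Python. The function below is a simplified model with a hypothetical name. Spark compares its own size estimate for the relation against the threshold, and that estimate can differ from the compressed on-disk size reported by `DESCRIBE DETAIL`, so treat the result as a rough guide only.

```python
def should_auto_broadcast(size_in_bytes, threshold_bytes=10 * 1024 * 1024):
    """Simplified model of the auto-broadcast decision.

    A negative threshold disables auto-broadcast, mirroring
    spark.sql.autoBroadcastJoinThreshold = -1.
    """
    if threshold_bytes < 0:
        return False
    return size_in_bytes <= threshold_bytes

print(should_auto_broadcast(5 * 1024 * 1024))        # True: 5 MB fits under 10 MB
print(should_auto_broadcast(50 * 1024 * 1024))       # False: 50 MB is too large
print(should_auto_broadcast(1, threshold_bytes=-1))  # False: auto-broadcast disabled
```

An explicit `broadcast(df)` hint does not depend on this threshold, which is why the lab can use it even when the estimate is uncertain.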


In [0]:

from pyspark.sql.functions import broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10*1024*1024)  # ~10MB

@bench("Q2 optimized (broadcast)")
def run_q2_opt():
    from pyspark.sql import functions as F
    sales = (spark.table("sales_delta")
               .select("Item","SalesAmount","OrderDate")
               .filter("OrderDate >= '2018-01-01' AND OrderDate < '2019-01-01'"))
    prod  = broadcast(spark.table("product_delta").select("Item","Category"))
    df = (sales.join(prod, "Item")
               .groupBy("Category")
               .agg(F.sum("SalesAmount").alias("TotalSales"))
               .orderBy(F.desc("TotalSales")))
    print_plan(df)
    return df.collect()

q2_opt_result, q2_time_opt = run_q2_opt()



### Q3 Optimized — Caching + AQE + Broadcast

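- `cache()` marks the DataFrame for caching; for DataFrames the default storage level keeps data in memory and spills to disk.  
- We **materialize** the cache with a `count()` so later actions read from it. This `count()` runs outside the timed function, so the reported time excludes the cost of building the cache.  
- Combine with broadcast and AQE for best results.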
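Cached data occupies executor memory until it is released, so it helps to scope a cache to the work that reuses it. A minimal sketch, assuming a hypothetical helper named `cached`, wraps the cache, materialize, and unpersist steps in a context manager. It only relies on the DataFrame methods `cache()`, `count()`, and `unpersist()`.

```python
from contextlib import contextmanager

@contextmanager
def cached(df):
    """Cache and materialize a DataFrame, then release it when the block exits."""
    df.cache()
    df.count()          # action that populates the cache
    try:
        yield df
    finally:
        df.unpersist()  # free executor memory/disk used by the cache
```

Usage would look like `with cached(some_df) as hot: ...`, with every query that reuses the subset placed inside the block.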


In [0]:

# Cache a hot subset for reuse (keep SalesOrderNumber so Q3 matches the baseline metric)
sales_2019 = (
    spark.table("sales_delta")
    .select("Item", "CustomerID", "OrderDate", "SalesAmount", "SalesOrderNumber")
    .filter("OrderDate >= '2019-01-01' AND OrderDate < '2020-01-01'")
    .cache()
)
sales_2019.count()  # materialize cache (runs outside the timed function)

@bench("Q3 optimized (cache + AQE + broadcast)")
def run_q3_opt():
    prod = broadcast(
        spark.table("product_delta").select("Item", "Category")
    )
    cust = (
        spark.table("customer_delta")
        .select("CustomerID", "Country", "State")  # only the columns Q3 needs
        .filter("Country = 'United States'")       # filter the dimension before joining
    )
    df = (
        sales_2019.join(prod, "Item")
        .join(cust, "CustomerID")
        .groupBy("State", "Category")
        .agg(
            F.sum("SalesAmount").alias("TotalSales"),
            # Same metric as baseline Q3, so the timings are comparable
            F.countDistinct("SalesOrderNumber").alias("Orders")
        )
        .orderBy(F.desc("TotalSales"))
    )
    print_plan(df)
    return df.collect()

q3_opt_result, q3_time_opt = run_q3_opt()



## Step 5 — Delta Lake Optimizations

- **`OPTIMIZE`** compacts many small files into fewer, larger files (targeting ~256MB–1GB). This reduces metadata ops and improves scan throughput.  
- **`ZORDER BY (col1, col2, ...)`** physically clusters data so selective queries can **skip** more data. Choose columns that are frequently filtered/joined **together**.  
- **Auto Optimize** (`optimizeWrite` + `autoCompact`) keeps future writes healthy.
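Data skipping works because each file records min/max statistics per column, so a reader can ignore files whose range cannot match the filter. The pure-Python simulation below illustrates the idea with a one-dimensional sort, which is simpler than real Z-ordering (Z-ordering interleaves several columns). All names and data here are illustrative assumptions.

```python
import random

def make_files(values, num_files):
    """Split values into equal chunks and keep only (min, max) stats per chunk."""
    size = len(values) // num_files
    chunks = [values[i * size:(i + 1) * size] for i in range(num_files)]
    return [(min(c), max(c)) for c in chunks]

def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the filter lo <= x < hi."""
    return sum(1 for fmin, fmax in files if fmax >= lo and fmin < hi)

random.seed(0)
values = list(range(1000))
shuffled = values[:]
random.shuffle(shuffled)

unclustered = make_files(shuffled, 10)        # values scattered across files
clustered = make_files(sorted(shuffled), 10)  # values co-located by range

print("unclustered files scanned:", files_to_scan(unclustered, 100, 200))
print("clustered files scanned:  ", files_to_scan(clustered, 100, 200))
```

With clustered data, only the one file covering 100 to 199 overlaps the filter. With scattered data, most files have wide min/max ranges and cannot be skipped.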


In [0]:

# Inspect before/after compaction
spark.sql("DESCRIBE DETAIL sales_delta").show(truncate=False)
spark.sql("OPTIMIZE sales_delta")
spark.sql("DESCRIBE DETAIL sales_delta").show(truncate=False)

# Z-ORDER on common filter/join keys
spark.sql("""
  OPTIMIZE sales_delta
  ZORDER BY (OrderDate, Item)
""")

# Enable Auto Optimize going forward
spark.sql("""
  ALTER TABLE sales_delta SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")



### ✅ Checkpoint D — Storage-only Gains

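We now **turn off AQE and auto-broadcast** again (and clear cached data) to isolate the storage layout impact of the Delta operations. The baseline functions are rerun unchanged, so any speedup comes from the table layout: expect gains where filters hit Z-ordered columns and where fewer files reduce per-file overhead.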


In [0]:

spark.catalog.clearCache()
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

_, q1_time_after_delta = run_q1()
_, q2_time_after_delta = run_q2()
_, q3_time_after_delta = run_q3()

print(f"Delta-only speedup Q1: {q1_time_base/q1_time_after_delta:.2f}x")
print(f"Delta-only speedup Q2: {q2_time_base/q2_time_after_delta:.2f}x")
print(f"Delta-only speedup Q3: {q3_time_base/q3_time_after_delta:.2f}x")



## Step 6 — Combined Optimizations

Finally, we **stack** Spark and Delta optimizations for the best overall performance per dollar.


In [0]:

spark.conf.set("spark.sql.adaptive.enabled", "true")
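spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10*1024*1024)

# Checkpoint D called spark.catalog.clearCache(), so re-cache the 2019 subset before the Q3 run
sales_2019.cache()
sales_2019.count()  # materialize cache again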

_, q1_time_final = run_q1_opt()
_, q2_time_final = run_q2_opt()
_, q3_time_final = run_q3_opt()

print("--- OVERALL SPEEDUPS (vs. original baseline) ---")
print(f"Q1 overall: {q1_time_base/q1_time_final:.2f}x")
print(f"Q2 overall: {q2_time_base/q2_time_final:.2f}x")
print(f"Q3 overall: {q3_time_base/q3_time_final:.2f}x")
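
The individual `print` lines can be collected into one comparison table. The sketch below is one way to do that; the helper names `speedup_report` and `format_report` and the sample timings are hypothetical.

```python
def speedup_report(baseline, optimized):
    """Build (query, base_seconds, opt_seconds, speedup) rows from two timing dicts."""
    rows = []
    for name in baseline:
        base, opt = baseline[name], optimized[name]
        rows.append((name, base, opt, base / opt))
    return rows

def format_report(rows):
    """Render the rows as a fixed-width text table."""
    lines = [f"{'query':<6}{'base(s)':>10}{'opt(s)':>10}{'speedup':>10}"]
    for name, base, opt, ratio in rows:
        lines.append(f"{name:<6}{base:>10.3f}{opt:>10.3f}{ratio:>9.2f}x")
    return "\n".join(lines)

# Made-up timings, for illustration only.
rows = speedup_report({"Q1": 4.0, "Q2": 9.0}, {"Q1": 2.0, "Q2": 3.0})
print(format_report(rows))
```

In this notebook the inputs would be dictionaries built from the measured values, for example `{"Q1": q1_time_base, "Q2": q2_time_base, "Q3": q3_time_base}` and the matching `*_final` timings.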



## Step 7 — Cost Awareness & Cluster Tuning

- **Fixed vs Autoscaling**: Autoscaling reduces idle cost but may add ramp-up time.  
- **Spot Instances**: Much cheaper but can be preempted, so they suit batch ETL without strict SLAs.  
- **DBU Estimation**: Approximate cost = (DBU/hr × runtime hrs × DBU price) + infrastructure.
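The DBU formula above can be written as a small function. This sketch models the infrastructure term as an hourly cloud rate multiplied by runtime, which is an assumption. All rates in the example are made-up placeholders, not actual Databricks or cloud prices.

```python
def estimate_job_cost(dbu_per_hour, runtime_hours, dbu_price, infra_cost_per_hour):
    """Approximate cost = (DBU/hr x runtime hrs x DBU price) + infrastructure."""
    dbu_cost = dbu_per_hour * runtime_hours * dbu_price
    infra_cost = infra_cost_per_hour * runtime_hours  # assumed hourly VM charge
    return dbu_cost + infra_cost

# Hypothetical cluster: 8 DBU/hr at $0.30 per DBU, plus $2.00/hr of VM cost.
baseline_cost = estimate_job_cost(8, 0.50, 0.30, 2.00)   # 30-minute job
optimized_cost = estimate_job_cost(8, 0.25, 0.30, 2.00)  # same job at 2x speedup

print(f"baseline:  ${baseline_cost:.2f}")   # baseline:  $2.20
print(f"optimized: ${optimized_cost:.2f}")  # optimized: $1.10
```

On a fixed-size cluster both terms scale with runtime, so in this model a 2x speedup halves the estimated cost of the job.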



## Cleanup (Optional)
Uncomment if you want to reclaim space.


In [0]:

# spark.sql("VACUUM sales_delta RETAIN 168 HOURS")
# spark.sql("DROP TABLE IF EXISTS sales_delta")
# spark.sql("DROP TABLE IF EXISTS product_delta")
# spark.sql("DROP TABLE IF EXISTS customer_delta")
# dbutils.fs.rm(base_path, True)



## 📎 Quick Reference — Functions & Why

- **`bench(name)`**: Decorator timing harness to compare baseline vs optimized.  
- **`print_plan(df)` / `df.explain(...)`**: Verify filter pushdown, join types, shuffles.  
- **Read CSV**: `.option("header","true").option("inferSchema","true").csv(path)`  
- **Transform**: `withColumn`, `F.to_date`, `cast`, arithmetic on `Column`s  
- **Partition count**: `repartition(n)` sets the number of partitions, which drives write parallelism and the number of output files  
- **Delta**: `OPTIMIZE`, `ZORDER BY`, table `TBLPROPERTIES` for auto optimize  
- **Spark configs**: `spark.conf.set(...)` for AQE & joins  
- **Broadcast**: `from pyspark.sql.functions import broadcast; broadcast(df)`  
- **Cache**: `df.cache(); df.count()` marks the DataFrame for caching and materializes it for reuse
