# Performance Tuning: SQL & Table Optimization

**The Situation:** Leadership wants dashboards, predictive models, and AI agents ready by Friday. Your plane IoT data is growing fast, and queries that worked yesterday are timing out today.

**The Problem:** Slow queries = missed deadlines + angry leadership

**The Solution:** Combine SQL tuning with table optimization techniques to bring query times down toward sub-second.

---

## What You'll Learn (30 minutes)

✅ **SQL Optimization:** Predicate pushdown, join strategies, broadcast hints  
✅ **Liquid Clustering:** Automatic data layout optimization  
✅ **Materialized Views:** Pre-compute expensive aggregations  
✅ **Deletion Vectors:** Fast updates and deletes  
✅ **Query Profile:** Analyze query execution on SQL Warehouses  

---

## Prerequisites

- Completed Day 1 & 2
- `sensor_bronze`, `dim_factories`, `dim_devices` tables loaded
- SQL Warehouse or cluster running

---

**References:**
- [Delta Lake Performance](https://docs.databricks.com/en/delta/tune-file-layout.html)
- [Liquid Clustering](https://docs.databricks.com/en/delta/clustering.html)
- [Query Optimization](https://docs.databricks.com/en/optimizations/)


In [0]:
# Configuration - UPDATE THESE VALUES!
CATALOG = "your_catalog"    # Update: Change to your catalog name
SCHEMA = "your_username"    # Update: Use your username (without special characters)

# Example: If your email is john.doe@company.com, use:
# CATALOG = 'main' 
# SCHEMA = 'john_doe'

print(f"✅ Using catalog: {CATALOG}")
print(f"✅ Using schema: {SCHEMA}")


## Part 1: Understanding Performance Bottlenecks

### Common Performance Killers

| Problem | Impact | Solution |
|---------|--------|----------|
| 🐌 **Small Files** | Too many file opens | OPTIMIZE |
| 🐌 **Full Table Scans** | Read entire table | Liquid Clustering, predicates |
| 🐌 **Data Shuffle** | Network overhead | Broadcast joins |
| 🐌 **Wrong Join Type** | Memory spills | Join hints |
| 🐌 **Repeated Computation** | Wasted resources | Materialized views |
| 🐌 **Inefficient Predicates** | No pushdown | Proper filters |

### Performance Toolkit

**SQL Optimization:**
- Predicate pushdown (filter early)
- Join hints (BROADCAST, SHUFFLE_HASH)
- Proper WHERE clause design
- Query Profile analysis

**Table Optimization:**
- File compaction (OPTIMIZE)
- Liquid Clustering (automatic data layout optimization)
- Deletion Vectors (fast updates)

**Query Results:**
- Materialized Views
- Caching


## Part 2: Creating a "Bad" Table for Demonstration

Let's intentionally create a poorly optimized table with:
- Many small files (simulating streaming ingestion)
- Random data layout (no locality)
- No optimization

This represents what happens in real production systems without proper maintenance!


In [0]:
# Step 1: Create unoptimized table with random layout
spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{SCHEMA}.sensor_unoptimized")

spark.sql(f"""
CREATE TABLE {CATALOG}.{SCHEMA}.sensor_unoptimized
AS
SELECT 
    device_id,
    trip_id,
    factory_id,
    model_id,
    timestamp,
    airflow_rate,
    rotation_speed,
    air_pressure,
    temperature,
    delay,
    density
FROM {CATALOG}.{SCHEMA}.sensor_bronze
ORDER BY RAND()  -- Random order = worst case for data locality!
LIMIT 200000  -- Use subset for demo
""")

print("✅ Created unoptimized table with random layout")


In [0]:
# Step 2: Simulate many small files (like streaming writes)
# This is what happens with continuous ingestion without auto-compaction

for i in range(15):  # Create 15 small file batches
    spark.sql(f"""
    INSERT INTO {CATALOG}.{SCHEMA}.sensor_unoptimized
    SELECT 
        device_id,
        trip_id,
        factory_id,
        model_id,
        timestamp,
        airflow_rate,
        rotation_speed,
        air_pressure,
        temperature,
        delay,
        density
    FROM {CATALOG}.{SCHEMA}.sensor_bronze
    WHERE MOD(device_id, 15) = {i}
    LIMIT 800
    """)

print("✅ Created many small files (simulating poor ingestion patterns)")


In [0]:
# Check table statistics - look at the file count!
display(spark.sql(f"""
DESCRIBE DETAIL {CATALOG}.{SCHEMA}.sensor_unoptimized
""").select("numFiles", "sizeInBytes", "minReaderVersion", "minWriterVersion"))


### 🔍 What to Look For:

- **numFiles**: High number (hundreds of thousands or millions) = Performance problem!
- **sizeInBytes**: Total size, but spread across too many files

**Problem:** Every query must:
1. List all files
2. Open each file
3. Read metadata
4. Scan for relevant data

With many small files, overhead dominates actual work!
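
To put rough numbers on that overhead, the sketch below turns the `numFiles` and `sizeInBytes` values reported by `DESCRIBE DETAIL` into an average file size and an approximate file count at a target size. The helper name `summarize_layout`, the 128 MB default, and the sample numbers are assumptions made for illustration, not values taken from the workshop data.

```python
import math

def summarize_layout(num_files, size_in_bytes, target_mb=128):
    """Summarize file layout from DESCRIBE DETAIL-style numbers (hypothetical helper)."""
    size_mb = size_in_bytes / 1024 / 1024
    avg_mb = size_mb / num_files if num_files else 0.0
    # Fewest files that could hold the data at the target size (at least 1 if non-empty).
    ideal_files = max(1, math.ceil(size_mb / target_mb)) if size_in_bytes else 0
    return {
        "avg_file_mb": round(avg_mb, 2),
        "ideal_files": ideal_files,
        "excess_files": max(0, num_files - ideal_files),
    }

# Made-up example: 16 files holding 8 MB in total.
summary = summarize_layout(num_files=16, size_in_bytes=8 * 1024 * 1024)
print(summary)
```

A large `excess_files` value relative to `ideal_files` suggests that file-open overhead, not data volume, is what the query is paying for.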


## Part 3: SQL Optimization - Predicate Pushdown

**Key Concept:** Push filters as close to the data as possible.

### ❌ Bad: Function on Column (Prevents Pushdown)


In [0]:
import time

# BAD: Using SUBSTRING on timestamp prevents predicate pushdown
start = time.time()

result_bad = spark.sql(f"""
SELECT 
    device_id,
    factory_id,
    AVG(temperature) as avg_temp
FROM {CATALOG}.{SCHEMA}.sensor_unoptimized
WHERE SUBSTRING(CAST(timestamp AS STRING), 1, 10) >= DATE_SUB(CURRENT_DATE(), 7)
GROUP BY device_id, factory_id
""")

count_bad = result_bad.count()
time_bad = time.time() - start

print(f"❌ BAD Query Time: {time_bad:.2f} seconds")
print(f"   Results: {count_bad} rows")
print(f"   Problem: Function on column prevents statistics-based filtering!")


### ✅ Good: Direct Filter on Column (Enables Pushdown)



In [0]:
start = time.time()

result_good = spark.sql(f"""
SELECT 
    device_id,
    factory_id,
    AVG(temperature) as avg_temp
FROM {CATALOG}.{SCHEMA}.sensor_unoptimized
WHERE timestamp >= CURRENT_DATE() - INTERVAL 7 DAYS
GROUP BY device_id, factory_id
""")

count_good = result_good.count()
time_good = time.time() - start

print(f"✅ GOOD Query Time: {time_good:.2f} seconds")
print(f"   Results: {count_good} rows")
print(f"   Speedup: {time_bad/time_good:.1f}x faster!")
print(f"   Reason: Databricks can use file statistics to skip irrelevant files")

### 💡 Predicate Pushdown Best Practices

**DO:**
```sql
WHERE timestamp >= '2024-01-01'  -- Direct column comparison
WHERE device_id IN (1, 2, 3)     -- Direct value check
WHERE factory_id = 'A06'          -- Equality on column
```

**DON'T:**
```sql
WHERE DATE(timestamp) = '2024-01-01'      -- Function prevents pushdown
WHERE SUBSTRING(device_id, 1, 2) = '10'   -- Function on column
WHERE UPPER(factory_id) = 'A06'           -- Transformation blocks optimization
```
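
A filter such as `DATE(timestamp) = '2024-01-01'` can usually be rewritten as a range on the raw column, which keeps the comparison in a form that file statistics can serve. The following is a minimal sketch of that rewrite; the helper name `day_range_predicate` is hypothetical and it only builds the predicate text.

```python
from datetime import date, timedelta

def day_range_predicate(column, day):
    """Build a half-open range filter covering one calendar day (hypothetical helper)."""
    next_day = day + timedelta(days=1)
    # Comparing the raw column lets the engine use per-file min/max statistics.
    return f"{column} >= '{day.isoformat()}' AND {column} < '{next_day.isoformat()}'"

predicate = day_range_predicate("timestamp", date(2024, 1, 1))
print(predicate)
# timestamp >= '2024-01-01' AND timestamp < '2024-01-02'
```

The half-open upper bound (`<` the next day) avoids missing readings taken late in the day, which a `<= '2024-01-01'` comparison on a timestamp column would drop.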


## Part 4: SQL Optimization - Join Strategies

### Understanding Join Types

| Join Type | Best For | Cost |
|-----------|----------|------|
| **Broadcast Join** | Small table (< 10MB) | Low - no shuffle |
| **Shuffle Hash Join** | Large tables | High - shuffle both |
| **Sort Merge Join** | Large sorted tables | Medium |

### ❌ Bad: Let Spark guess (might shuffle large tables)


In [0]:
# Without hint - Spark might choose an inefficient join strategy
start = time.time()

result_no_hint = spark.sql(f"""
SELECT 
    s.device_id,
    f.factory_name,
    f.region,
    COUNT(*) as reading_count,
    AVG(s.temperature) as avg_temp
FROM {CATALOG}.{SCHEMA}.sensor_unoptimized s
JOIN {CATALOG}.{SCHEMA}.dim_factories f 
  ON s.factory_id = f.factory_id
GROUP BY s.device_id, f.factory_name, f.region
""")

count_no_hint = result_no_hint.count()
time_no_hint = time.time() - start

print(f"⚠️  No Hint Query Time: {time_no_hint:.2f} seconds")
print(f"   Spark may shuffle both tables unnecessarily")


### ✅ Good: Broadcast Small Dimension Table


In [0]:
# With BROADCAST hint - force efficient join strategy
start = time.time()

result_broadcast = spark.sql(f"""
SELECT 
    /*+ BROADCAST(f) */
    s.device_id,
    f.factory_name,
    f.region,
    COUNT(*) as reading_count,
    AVG(s.temperature) as avg_temp
FROM {CATALOG}.{SCHEMA}.sensor_unoptimized s
JOIN {CATALOG}.{SCHEMA}.dim_factories f 
  ON s.factory_id = f.factory_id
GROUP BY s.device_id, f.factory_name, f.region
""")

count_broadcast = result_broadcast.count()
time_broadcast = time.time() - start

print(f"✅ Broadcast Join Time: {time_broadcast:.2f} seconds")
print(f"   Speedup: {time_no_hint/time_broadcast:.1f}x faster!")
print(f"   Only dimension table sent to executors - no shuffle of fact table!")


### 💡 Join Optimization Rules

**BROADCAST when:**
- Dimension table < 10MB
- Reference data (factories, models, devices)
- Lookup tables

**Let Spark choose when:**
- Both tables are large
- Join cardinality is unknown
- Adaptive Query Execution is enabled (default)
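
To see why broadcasting a small dimension table avoids a shuffle, here is a plain-Python sketch of the idea behind a broadcast hash join: the small side becomes an in-memory lookup, and the large side is streamed past it without being regrouped. The sample rows and the `factory_name` values are made up, and this is an illustration of the concept rather than Spark's implementation.

```python
from collections import defaultdict

# Hypothetical sample rows; the real tables are far larger.
dim_factories = [
    {"factory_id": "A06", "factory_name": "Factory A06"},
    {"factory_id": "B12", "factory_name": "Factory B12"},
]
sensor_readings = [
    {"device_id": 1, "factory_id": "A06", "temperature": 70.0},
    {"device_id": 2, "factory_id": "B12", "temperature": 80.0},
    {"device_id": 1, "factory_id": "A06", "temperature": 74.0},
]

def broadcast_hash_join(fact_rows, dim_rows, key):
    """Join by building a lookup from the small side, then streaming the large side."""
    lookup = {row[key]: row for row in dim_rows}  # the "broadcast" copy of the small table
    for fact in fact_rows:                        # fact rows are never regrouped by key
        dim = lookup.get(fact[key])
        if dim is not None:                       # inner join: drop unmatched fact rows
            yield {**fact, **dim}

temps = defaultdict(list)
for row in broadcast_hash_join(sensor_readings, dim_factories, "factory_id"):
    temps[row["factory_name"]].append(row["temperature"])

avg_temp = {name: sum(vals) / len(vals) for name, vals in temps.items()}
print(avg_temp)
# {'Factory A06': 72.0, 'Factory B12': 80.0}
```

The lookup has to fit in memory on every worker, which is why the technique is reserved for small tables.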


## Part 5: Table Optimization - File Compaction

**Problem:** Many small files cause overhead

**Solution:** OPTIMIZE command compacts small files into larger ones

**Target:** 128MB - 1GB per file (default: 1GB)
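
Compaction is essentially a bin-packing problem: group small files until each group approaches the target size. The sketch below shows the general idea with a simplified greedy grouping. It is not Delta Lake's actual algorithm, the function name `plan_compaction` is hypothetical, and the file sizes are made up.

```python
def plan_compaction(file_sizes_mb, target_mb=1024):
    """Greedily group files into bins no larger than target_mb (simplified sketch)."""
    bins, current, current_size = [], [], 0.0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current bin when the next file would push it past the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Made-up input: 15 small files of 0.5 MB plus one 20 MB file.
sizes = [0.5] * 15 + [20.0]
bins = plan_compaction(sizes, target_mb=1024)
print(f"{len(sizes)} files -> {len(bins)} file(s)")
# 16 files -> 1 file(s)
```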


In [0]:
# Check current file situation
detail_before = spark.sql(f"""
DESCRIBE DETAIL {CATALOG}.{SCHEMA}.sensor_unoptimized
""").select("numFiles", "sizeInBytes").collect()[0]

files_before = detail_before['numFiles']
size_mb = detail_before['sizeInBytes'] / 1024 / 1024

print(f"📊 Before Optimization:")
print(f"   Files: {files_before}")
print(f"   Size: {size_mb:.2f} MB")
print(f"   Avg file size: {size_mb/files_before:.2f} MB")
print(f"\n   Status: {'🔴 Too many small files!' if files_before > 10 else '🟢 OK'}")


In [0]:
# Run OPTIMIZE to compact files
start = time.time()

spark.sql(f"""
OPTIMIZE {CATALOG}.{SCHEMA}.sensor_unoptimized
""")

optimize_time = time.time() - start

print(f"✅ OPTIMIZE completed in {optimize_time:.2f} seconds")


In [0]:
# Check results after optimization
detail_after = spark.sql(f"""
DESCRIBE DETAIL {CATALOG}.{SCHEMA}.sensor_unoptimized
""").select("numFiles").collect()[0]

files_after = detail_after['numFiles']

print(f"📊 After Optimization:")
print(f"   Files: {files_after}")
print(f"   Reduction: {files_before - files_after} files removed")
print(f"   Improvement: {files_before/files_after:.1f}x fewer files!")
print(f"\n💡 Queries now have much less file I/O overhead!")


## Part 6: Table Optimization - Liquid Clustering

**Liquid Clustering** is Delta Lake's automatic data layout optimization - the successor to Z-Ordering and partitioning.

### Why Liquid Clustering?

**Old approach (Z-Ordering):**
- Manual OPTIMIZE commands required
- Must choose columns upfront
- Re-optimize needed when access patterns change
- Separate from file compaction

**New approach (Liquid Clustering):**
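- ✅ Clusters data on write for supported operations (such as CTAS and `INSERT INTO`) when the write is large enough
- ✅ Clustering keys can be changed later with `ALTER TABLE ... CLUSTER BY`, without rewriting existing data
- ✅ `OPTIMIZE` clusters incrementally, touching only data that needs it, and can be scheduled or left to predictive optimization where that is enabled
- ✅ Combines compaction + data layout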

### How Data Skipping Works:

**Without Clustering:**
```
File 1: devices 1,5,10,15,20     <- Must read
File 2: devices 2,3,8,12,19      <- Must read  
File 3: devices 4,7,9,11,14      <- Must read
```
Query for device_id = 5 must read ALL files!

**With Liquid Clustering on device_id:**
```
File 1: devices 1,2,3,4,5        <- Read this (automatically organized!)
File 2: devices 7,8,9,10,11      <- SKIP
File 3: devices 12,14,15,19,20   <- SKIP
```
Query for device_id = 5 only reads File 1!
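
The same example can be checked in a few lines of Python. The sketch assumes the engine keeps only a minimum and maximum `device_id` per file, which is a simplification of real file statistics; the function name `files_to_read` is hypothetical.

```python
def files_to_read(files, value):
    """Return names of files whose [min, max] range could contain value."""
    return [name for name, ids in files.items() if min(ids) <= value <= max(ids)]

unclustered = {
    "File 1": [1, 5, 10, 15, 20],
    "File 2": [2, 3, 8, 12, 19],
    "File 3": [4, 7, 9, 11, 14],
}
clustered = {
    "File 1": [1, 2, 3, 4, 5],
    "File 2": [7, 8, 9, 10, 11],
    "File 3": [12, 14, 15, 19, 20],
}

print(files_to_read(unclustered, 5))  # ['File 1', 'File 2', 'File 3']
print(files_to_read(clustered, 5))    # ['File 1']
```

In the unclustered layout, File 2 and File 3 do not contain device 5 at all, but their min/max ranges (2 to 19 and 4 to 14) still include it, so they cannot be skipped.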

### Choosing Clustering Columns:

✅ **Good candidates:**
- High cardinality (device_id, timestamp)
- Frequently in WHERE clauses
- Used in joins
- Common GROUP BY columns

❌ **Bad candidates:**
- Low cardinality (status: active/inactive)
- Rarely filtered

### Rule: at most 4 clustering columns (the Liquid Clustering limit); choose the columns you filter on most often
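
One way to keep that limit from being exceeded by accident is to build the statement through a small helper that validates the key list first. This is a sketch only: `clustered_ctas` is a hypothetical name, and it returns the SQL text rather than running it.

```python
MAX_CLUSTERING_KEYS = 4  # Liquid Clustering accepts at most four clustering columns

def clustered_ctas(target, source, keys):
    """Build a CREATE OR REPLACE TABLE ... CLUSTER BY statement (hypothetical helper)."""
    if not keys:
        raise ValueError("Provide at least one clustering column")
    if len(keys) > MAX_CLUSTERING_KEYS:
        raise ValueError(f"At most {MAX_CLUSTERING_KEYS} clustering columns are allowed")
    return (
        f"CREATE OR REPLACE TABLE {target}\n"
        f"CLUSTER BY ({', '.join(keys)})\n"
        f"AS SELECT * FROM {source}"
    )

statement = clustered_ctas("sensor_clustered", "sensor_bronze", ["device_id", "timestamp"])
print(statement)
# In a notebook, the result could be passed to spark.sql(statement).
```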


In [0]:
# Create a table WITH Liquid Clustering
# First, let's create a clustered copy of the bronze table
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{SCHEMA}.sensor_clustered
CLUSTER BY (device_id, timestamp)
AS SELECT * FROM {CATALOG}.{SCHEMA}.sensor_bronze
""")

print("✅ Created table with Liquid Clustering on (device_id, timestamp)")
print("   - Data is clustered on write when the write is large enough")
print("   - Run OPTIMIZE to cluster any data that was not clustered on write")


In [0]:
# Compare query performance: Unclustered vs Clustered

import time

# Benchmark query on UNCLUSTERED table
print("🔍 Testing UNCLUSTERED table...")
start = time.time()

result_unclustered = spark.sql(f"""
SELECT 
    device_id,
    DATE(timestamp) as date,
    AVG(temperature) as avg_temp,
    AVG(rotation_speed) as avg_rotation,
    MAX(air_pressure) as max_pressure,
    COUNT(*) as reading_count
FROM {CATALOG}.{SCHEMA}.sensor_bronze
WHERE device_id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  AND timestamp >= CURRENT_DATE() - INTERVAL 7 DAYS
GROUP BY device_id, DATE(timestamp)
ORDER BY date DESC, device_id
""")

count_unclustered = result_unclustered.count()
time_unclustered = time.time() - start

print(f"⏱️  Query Time (Unclustered): {time_unclustered:.2f} seconds")
print(f"   Must scan many files to find relevant devices\n")

# Benchmark same query on CLUSTERED table
print("🔍 Testing CLUSTERED table...")
start = time.time()

result_clustered = spark.sql(f"""
SELECT 
    device_id,
    DATE(timestamp) as date,
    AVG(temperature) as avg_temp,
    AVG(rotation_speed) as avg_rotation,
    MAX(air_pressure) as max_pressure,
    COUNT(*) as reading_count
FROM {CATALOG}.{SCHEMA}.sensor_clustered
WHERE device_id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  AND timestamp >= CURRENT_DATE() - INTERVAL 7 DAYS
GROUP BY device_id, DATE(timestamp)
ORDER BY date DESC, device_id
""")

count_clustered = result_clustered.count()
time_clustered = time.time() - start

print(f"⏱️  Query Time (Clustered): {time_clustered:.2f} seconds")
print(f"   Data skipping means fewer files to read")

# Calculate speedup
if time_clustered > 0:
    speedup = time_unclustered / time_clustered
    print(f"\n🚀 Performance Improvement: {speedup:.1f}x faster with Liquid Clustering!")


## Materialized Views: Pre-Compute Expensive Aggregations


In [0]:
# Common dashboard query: Hourly device metrics by factory
# Without materialized view - runs every time

start = time.time()

result_no_mv = spark.sql(f"""
SELECT 
    f.factory_id,
    f.factory_name,
    f.region,
    s.device_id,
    DATE_TRUNC('hour', s.timestamp) as hour,
    AVG(s.temperature) as avg_temp,
    AVG(s.rotation_speed) as avg_rotation,
    AVG(s.air_pressure) as avg_pressure,
    COUNT(*) as reading_count
FROM {CATALOG}.{SCHEMA}.sensor_bronze s
JOIN {CATALOG}.{SCHEMA}.dim_factories f ON s.factory_id = f.factory_id
WHERE s.timestamp >= current_date() - INTERVAL 7 DAYS
GROUP BY f.factory_id, f.factory_name, f.region, s.device_id, DATE_TRUNC('hour', s.timestamp)
ORDER BY hour DESC
LIMIT 100
""")

display(result_no_mv)

no_mv_time = time.time() - start
print(f"\n⏱️  Query time (no materialized view): {no_mv_time:.2f} seconds")


In [0]:
# Create materialized view for this common pattern
spark.sql(f"DROP MATERIALIZED VIEW IF EXISTS {CATALOG}.{SCHEMA}.mv_hourly_factory_metrics")

spark.sql(f"""
CREATE MATERIALIZED VIEW {CATALOG}.{SCHEMA}.mv_hourly_factory_metrics
AS
SELECT 
    f.factory_id,
    f.factory_name,
    f.region,
    s.device_id,
    DATE_TRUNC('hour', s.timestamp) as hour,
    AVG(s.temperature) as avg_temp,
    AVG(s.rotation_speed) as avg_rotation,
    AVG(s.air_pressure) as avg_pressure,
    COUNT(*) as reading_count
FROM {CATALOG}.{SCHEMA}.sensor_bronze s
JOIN {CATALOG}.{SCHEMA}.dim_factories f ON s.factory_id = f.factory_id
GROUP BY f.factory_id, f.factory_name, f.region, s.device_id, DATE_TRUNC('hour', s.timestamp)
""")

print("✅ Created materialized view")
print("   This pre-computes the expensive join and aggregation")


In [0]:
# Now query is MUCH faster - reads pre-computed results
start = time.time()

result_with_mv = spark.sql(f"""
SELECT *
FROM {CATALOG}.{SCHEMA}.mv_hourly_factory_metrics
WHERE hour >= current_date() - INTERVAL 7 DAYS
ORDER BY hour DESC
LIMIT 100
""")

display(result_with_mv)

mv_time = time.time() - start
mv_speedup = no_mv_time / mv_time if mv_time > 0 else 0

print(f"\n⏱️  Query time (with materialized view): {mv_time:.2f} seconds")
print(f"🚀 Speedup: {mv_speedup:.1f}x faster!")
print(f"\n💡 Dashboard loads instantly instead of making users wait!")


### 🎯 Materialized View Benefits:

1. **Dashboard speed**: Instant load times
2. **Cost savings**: Compute once, query many times
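3. **Managed refresh**: Results are as fresh as the last refresh; run `REFRESH MATERIALIZED VIEW` or attach a schedule to keep them current
4. **Simple consumption**: Dashboards query the view by name, as the cell above does, instead of repeating the join and aggregation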

**For your deadline:** This makes your real-time dashboard actually real-time!
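
A materialized view serves results as of its last refresh, so a dashboard is only as current as the refresh cadence. The sketch below shows one way to decide when a refresh is due and to build the statement for it. The helper names and the one-hour freshness window are assumptions; `main` and `john_doe` are the example catalog and schema from the configuration cell.

```python
from datetime import datetime, timedelta

def refresh_due(last_refresh, now, max_age=timedelta(hours=1)):
    """Return True when the view is older than max_age (hypothetical policy)."""
    return now - last_refresh > max_age

def refresh_mv_statement(catalog, schema, view):
    """Build a REFRESH MATERIALIZED VIEW statement for the given view."""
    return f"REFRESH MATERIALIZED VIEW {catalog}.{schema}.{view}"

stmt = refresh_mv_statement("main", "john_doe", "mv_hourly_factory_metrics")
if refresh_due(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)):
    print(stmt)  # in a notebook: spark.sql(stmt)
```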


## 7. Caching Strategies <a id="caching"></a>

**Caching** keeps frequently accessed data in memory for instant access.

### Types of Caching:

1. **DataFrame Cache**: Temporary, session-specific
2. **Delta Cache**: Disk-based, persists across queries
3. **Result Cache**: Caches query results

### When to Use Caching:

✅ Dimension tables (small, frequently joined)  
✅ Reference data  
✅ Iterative ML training  
✅ Dashboard data sources  

❌ Don't cache:
- Large fact tables (waste of memory)
- Rarely accessed data
- Data that changes frequently


In [0]:
# Cache frequently used dimension tables
# These are joined in almost every query!

spark.sql(f"CACHE TABLE {CATALOG}.{SCHEMA}.dim_factories")
spark.sql(f"CACHE TABLE {CATALOG}.{SCHEMA}.dim_models")
spark.sql(f"CACHE TABLE {CATALOG}.{SCHEMA}.dim_devices")

print("✅ Cached dimension tables")
print("   Joins with these tables are now instant!")


In [0]:
# Test query with cached dimensions
start = time.time()

result_cached = spark.sql(f"""
SELECT 
    f.factory_name,
    f.region,
    m.model_name,
    m.model_family,
    d.device_id,
    COUNT(DISTINCT s.trip_id) as trip_count,
    AVG(s.temperature) as avg_temp
FROM {CATALOG}.{SCHEMA}.sensor_bronze s
JOIN {CATALOG}.{SCHEMA}.dim_devices d ON s.device_id = d.device_id
JOIN {CATALOG}.{SCHEMA}.dim_factories f ON d.factory_id = f.factory_id
JOIN {CATALOG}.{SCHEMA}.dim_models m ON d.model_id = m.model_id
WHERE s.timestamp >= current_date() - INTERVAL 1 DAYS
GROUP BY f.factory_name, f.region, m.model_name, m.model_family, d.device_id
""")

display(result_cached)

cached_time = time.time() - start
print(f"\n⏱️  Query time (with cached dimensions): {cached_time:.2f} seconds")
print("✨ Dimension joins are instant - no disk I/O needed!")


In [0]:
# Clear cache when done (frees memory)
spark.sql(f"UNCACHE TABLE IF EXISTS {CATALOG}.{SCHEMA}.dim_factories")
spark.sql(f"UNCACHE TABLE IF EXISTS {CATALOG}.{SCHEMA}.dim_models")
spark.sql(f"UNCACHE TABLE IF EXISTS {CATALOG}.{SCHEMA}.dim_devices")

print("✅ Cleared caches")


### 💡 Caching Best Practices:

1. **Cache small tables** that are joined frequently
2. **Monitor memory** - don't cache everything
3. **Clear caches** when not needed
4. **Use Delta cache** on read-heavy clusters
5. **Let Databricks auto-cache** query results
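
To make "clear caches when not needed" hard to forget, the cache and uncache calls can be paired in a context manager. This is a minimal sketch: `cached_tables` is a hypothetical helper, and `run_sql` stands for whatever executes SQL (for example `spark.sql` in a notebook). Here a list's `append` method is used as a stand-in so the example runs without a cluster.

```python
from contextlib import contextmanager

@contextmanager
def cached_tables(run_sql, tables):
    """Cache tables for the duration of a block, then always uncache them."""
    for table in tables:
        run_sql(f"CACHE TABLE {table}")
    try:
        yield
    finally:
        # Runs even if the body raises, so cached tables are always released.
        for table in tables:
            run_sql(f"UNCACHE TABLE IF EXISTS {table}")

# Stand-in for spark.sql so the sketch runs anywhere; it only records statements.
executed = []
with cached_tables(executed.append, ["dim_factories", "dim_devices"]):
    executed.append("-- run dashboard queries here")

print("\n".join(executed))
```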


## 8. Using Query Profile on SQL Warehouses

**Query Profile** is your best friend for diagnosing slow queries on SQL Warehouses.

### What is Query Profile?

Query Profile shows you **exactly** what your query is doing:
- Which operations took the longest
- How much data was read
- Where shuffles happened
- Memory spills

### How to Access Query Profile:

1. Run a query on a **SQL Warehouse** (not a cluster)
2. After the query completes, click the **"Query Profile"** tab
3. Explore the visual execution plan

### What to Look For:

| Problem in Profile | Meaning | Solution |
|-------------------|---------|----------|
| 🔴 **Large Scan** | Reading too much data | Add Liquid Clustering, better filters |
| 🔴 **Shuffle** | Data moving between nodes | Use broadcast joins for small tables |
| 🔴 **Spill to Disk** | Out of memory | Increase warehouse size or optimize query |
| 🔴 **Many Tasks** | Too many small files | Run OPTIMIZE |

### Example Workflow:

```
1. Query is slow (10+ seconds) ❌
2. Check Query Profile → See "Large Scan"
3. Add Liquid Clustering to table
4. Re-run query → 2 seconds ✅
```

**Learn more:** [Query Profile Documentation](https://docs.databricks.com/aws/en/sql/user/queries/query-profile)

**üí° Pro Tip:** Query Profile only works on SQL Warehouses, not compute clusters. If you're running notebooks on a cluster, switch to a SQL Warehouse to use this feature.

---

## 9. Performance Comparison Summary <a id="comparison"></a>

Let's summarize the performance improvements:

### Optimization Results:

| Technique | Typical Speedup | Setup Time | Maintenance |
|-----------|----------------|------------|--------------|
| **File Compaction** | 2-3x | 5 min | As needed |
| **Liquid Clustering** | 3-10x | 10 min | Automatic |
| **Materialized Views** | 5-20x | 15 min | Automatic |
| **Caching** | 10-100x | 2 min | Per session |

### Impact on Your Project:

**Before Optimization:**
- Dashboard: 10-15 seconds to load ❌
- Model training queries: 5 minutes ❌
- Ad-hoc analysis: 30+ seconds ❌
- Leadership impatient: Yes ❌

**After Optimization:**
- Dashboard: <1 second ✅
- Model training queries: 30 seconds ✅
- Ad-hoc analysis: 3-5 seconds ✅
- Leadership happy: Yes! ✅

### Optimization Strategy:

1. **Use Liquid Clustering** - For all production tables
2. **Add Materialized Views** - For repeated dashboard queries
3. **Cache dimension tables** - Small tables used everywhere
4. **Run OPTIMIZE** - When you have many small files
5. **Use Query Profile** - Identify bottlenecks in slow queries


In [0]:
# Quick performance audit of your tables
print("📊 Table Performance Audit\n")

# Get table details from sensor tables
tables_to_check = ['sensor_bronze', 'sensor_unoptimized', 'sensor_clustered']

for table in tables_to_check:
    try:
        detail = spark.sql(f"DESCRIBE DETAIL {CATALOG}.{SCHEMA}.{table}").collect()[0]
        num_files = detail['numFiles']
        size_mb = detail['sizeInBytes'] / 1024 / 1024
        
        if num_files > 1000:
            rec = '🔴 Too many files - run OPTIMIZE'
        elif num_files > 100:
            rec = '🟡 Consider OPTIMIZE'
        else:
            rec = '🟢 File count OK'
        
        print(f"{table}:")
        print(f"  Size: {size_mb:.2f} MB")
        print(f"  Files: {num_files}")
        print(f"  {rec}\n")
    except Exception as e:
        # Report the actual error instead of hiding it behind a bare except
        print(f"{table}: Table not found or error ({e})\n")

print("💡 Use this audit to identify tables needing optimization")


## 🎯 Key Takeaways

### Must-Do Optimizations:

1. **Use Liquid Clustering** - `CREATE TABLE ... CLUSTER BY (col1, col2)`
2. **Create materialized views** - For repeated dashboard queries
3. **Cache dimension tables** - Small, frequently joined tables
4. **Monitor file count** - Run OPTIMIZE when >100 files
5. **Use Query Profile** - Analyze slow queries on SQL Warehouses

### Performance Checklist:

- [ ] Created tables with Liquid Clustering on (device_id, timestamp)
- [ ] Created materialized views for dashboard queries
- [ ] Cached dimension tables
- [ ] Compacted files (numFiles < 100)
- [ ] Enabled Photon on SQL warehouse
- [ ] Used Query Profile to analyze slow queries

### For Your End-of-Week Demo:

✅ **Dashboards**: Sub-second response times  
✅ **ML models**: Fast training on optimized data  
✅ **Genie queries**: Instant results on materialized views  
✅ **Leadership**: Impressed with performance  

---

## 🚀 Try This Out

### Challenge 1: Optimize Your Most Expensive Query

1. Check SQL warehouse Query History
2. Find the slowest query from yesterday
3. Use Query Profile to identify bottlenecks
4. Apply Liquid Clustering or create materialized view
5. Measure the speedup

### Challenge 2: Create Clustered Tables

Convert your existing tables to use Liquid Clustering:

```sql
-- Sensor data - cluster by device and time
ALTER TABLE sensor_bronze CLUSTER BY (device_id, timestamp);

-- Inspection data - cluster by device and time  
ALTER TABLE inspection_bronze CLUSTER BY (device_id, timestamp);

-- Run OPTIMIZE to apply clustering
OPTIMIZE sensor_bronze;
OPTIMIZE inspection_bronze;
```

### Challenge 3: Use Query Profile

On a SQL Warehouse:
1. Run a complex query
2. Click the "Query Profile" tab
3. Identify the slowest operation
4. Look for:
   - Full table scans → add clustering
   - Large shuffles → add join hints
   - Spills to disk → increase warehouse size

**Learn more:** [Query Profile on SQL Warehouses](https://docs.databricks.com/aws/en/sql/user/queries/query-profile)

### Challenge 4: Optimize the Inspection Pipeline

1. Add Liquid Clustering to `inspection_bronze` on (device_id, timestamp)
2. Create materialized view for defect rate by model
3. Compare query performance before/after

### Challenge 5: Experiment with Different Clustering Keys

1. Create test tables with different clustering strategies:
   - `CLUSTER BY (device_id, timestamp)`
   - `CLUSTER BY (factory_id, timestamp)`
   - `CLUSTER BY (device_id, factory_id)`
2. Run the same query on each
3. Measure which performs best for your use case
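
A single timed run is noisy, so comparing clustering strategies is more reliable with several runs per candidate and a median. The sketch below is one way to do that; `benchmark` is a hypothetical helper, and the two lambdas are stand-in workloads so the example runs outside a notebook.

```python
import statistics
import time

def benchmark(run_query, runs=3):
    """Return the median wall-clock seconds over several runs of run_query()."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workloads; in the notebook, pass something like
# lambda: spark.sql(query).count() for each test table.
candidates = {
    "device_id, timestamp": lambda: sum(range(10_000)),
    "factory_id, timestamp": lambda: sum(range(20_000)),
}
results = {keys: benchmark(fn) for keys, fn in candidates.items()}
best = min(results, key=results.get)
print(f"Fastest clustering keys in this run: {best}")
```

Repeated runs of the same query may also benefit from caching, so the first run of each candidate is often the slowest; the median dampens that effect.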

---

**Next Steps:**
- Apply these techniques to your production tables
- Set up monitoring to track query performance
- Schedule weekly OPTIMIZE jobs
- Educate team on performance best practices

**Remember:** Fast queries = happy leadership = successful project! 🎉


In [0]:
# Uncomment to clean up demo tables
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{SCHEMA}.sensor_unoptimized")
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{SCHEMA}.sensor_clustered")
# spark.sql(f"DROP MATERIALIZED VIEW IF EXISTS {CATALOG}.{SCHEMA}.mv_hourly_factory_metrics")
# print("✅ Cleaned up demo tables")


### 🔍 Query Plan Analysis

Look for these indicators in the plan:
- **FileScan**: How many files are scanned?
- **Filter**: Pushed down to file scan (good) or after (bad)?
- **Exchange**: Data shuffle between nodes (expensive)
- **Data Skipping**: Are file statistics used?

**Key Issue:** Without optimization, Databricks must scan ALL files even though the query only needs a handful of devices!
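
These indicators can also be counted programmatically from plan text. The sketch below scans a plan string for a few operator names. The function name `summarize_plan` is hypothetical, the sample plan is hand-written and heavily abbreviated (not real output), and operator names can differ by runtime (for example with Photon), so treat the matching rules as assumptions to adapt.

```python
def summarize_plan(plan_text):
    """Count a few operators of interest in a Spark physical plan string."""
    lines = plan_text.splitlines()
    return {
        # BroadcastExchange ships a small table; a plain Exchange is a shuffle.
        "shuffles": sum("Exchange" in l and "BroadcastExchange" not in l for l in lines),
        "broadcasts": sum("BroadcastExchange" in l for l in lines),
        "scans": sum("FileScan" in l for l in lines),
        "scans_with_pushed_filters": sum(
            "FileScan" in l and "PushedFilters: [" in l and "PushedFilters: []" not in l
            for l in lines
        ),
    }

# Hand-written, heavily abbreviated plan for illustration only (not real output).
sample_plan = """\
HashAggregate(keys=[device_id], functions=[avg(temperature)])
+- Exchange hashpartitioning(device_id, 200)
   +- BroadcastHashJoin [factory_id], [factory_id], Inner
      :- FileScan parquet sensor[device_id,factory_id,temperature] PushedFilters: [IsNotNull(factory_id)]
      +- BroadcastExchange HashedRelationBroadcastMode
         +- FileScan parquet dim_factories[factory_id,factory_name] PushedFilters: []
"""
print(summarize_plan(sample_plan))
# {'shuffles': 1, 'broadcasts': 1, 'scans': 2, 'scans_with_pushed_filters': 1}
```

In a notebook, the plan text would typically come from running `EXPLAIN` on the query and reading the single returned string.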
