# Databricks Delta Lake Professional Lab Exercise

## Overview
This hands-on lab walks you through core Delta Lake concepts: table fundamentals, time travel, the transaction log (`_delta_log`), data files, compaction with OPTIMIZE, Z-ordering, and VACUUM. You'll work with a synthetic e-commerce dataset to see how these features behave in practice.

## Prerequisites
- Databricks workspace access
- Basic knowledge of SQL and Spark
- Running compute cluster

## Learning Objectives
By the end of this lab, you will be able to:
- Create and manage Delta Lake tables
- Understand Delta Lake architecture (data files and transaction logs)
- Use time travel to query historical data
- Optimize tables using OPTIMIZE and ZORDER BY
- Clean up old files using VACUUM
- Monitor table performance and file structure

---

## Lab Setup

### Step 1: Environment Setup
Create a new notebook in your Databricks workspace and run the following setup commands:

```python
# Set up the database and location for this lab
spark.sql("CREATE DATABASE IF NOT EXISTS delta_lab")
spark.sql("USE delta_lab")

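# Define the base path for our tables
base_path = "/tmp/delta_lab/"
# Drop any table left over from a previous run before removing its files
spark.sql("DROP TABLE IF EXISTS delta_lab.orders")
dbutils.fs.rm(base_path, True)  # Clean up if exists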
```

---

## Exercise 1: Creating Your First Delta Lake Table

### Task 1.1: Create Sample E-commerce Data
```python
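# Import only the Spark functions this lab uses. A wildcard import from
# pyspark.sql.functions would shadow Python built-ins such as round, min, max,
# and sum, which the helper code in this lab relies on.
from pyspark.sql.functions import col, desc
from datetime import datetime, timedelta
import random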

# Generate sample e-commerce data
def generate_ecommerce_data(num_records=10000):
    categories = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports"]
    regions = ["North", "South", "East", "West", "Central"]
    
    data = []
    for i in range(num_records):
        record = {
            "order_id": f"ORD{i:06d}",
            "customer_id": f"CUST{random.randint(1, 1000):04d}",
            "product_category": random.choice(categories),
            "region": random.choice(regions),
            "order_amount": round(random.uniform(10, 1000), 2),
            "quantity": random.randint(1, 10),
            "order_date": (datetime.now() - timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
            "status": random.choice(["Completed", "Pending", "Cancelled"])
        }
        data.append(record)
    
    return spark.createDataFrame(data)

# Generate initial dataset
df_orders = generate_ecommerce_data(10000)
df_orders.show(10)
```

### Task 1.2: Create Delta Lake Table
```python
# Write data as Delta Lake table
delta_table_path = f"{base_path}orders_delta"

df_orders.write \
    .format("delta") \
    .mode("overwrite") \
    .option("path", delta_table_path) \
    .saveAsTable("delta_lab.orders")

print(f"Created Delta table at: {delta_table_path}")
```

### Questions for Task 1:
1. What files were created in the Delta Lake table directory?
2. How does a Delta Lake table differ from a regular Parquet table?

---

## Exercise 2: Understanding Delta Lake Architecture

### Task 2.1: Explore Data Files and Delta Logs
```python
# List files in the Delta table directory
display(dbutils.fs.ls(delta_table_path))

# Look at the _delta_log directory
display(dbutils.fs.ls(f"{delta_table_path}/_delta_log/"))

# Read the first commit log
first_commit = spark.read.json(f"{delta_table_path}/_delta_log/00000000000000000000.json")
display(first_commit)
```
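
To make the log layout concrete without a cluster, here is a minimal sketch that parses a commit file by hand. It assumes the usual layout of one JSON action per line, each wrapped in a single top-level key (`commitInfo`, `protocol`, `metaData`, `add`, `remove`). The sample text and the helper name `parse_commit` are hypothetical; real commits carry many more fields.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed commit file: one JSON action per line.
# All values are invented for illustration.
sample_commit = """\
{"commitInfo": {"operation": "WRITE", "operationMetrics": {"numFiles": "2", "numOutputRows": "10000"}}}
{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}
{"metaData": {"id": "example-table-id", "format": {"provider": "parquet"}, "partitionColumns": []}}
{"add": {"path": "part-00000-example.snappy.parquet", "size": 120000, "dataChange": true}}
{"add": {"path": "part-00001-example.snappy.parquet", "size": 118000, "dataChange": true}}
"""

def parse_commit(text):
    """Parse newline-delimited JSON into a list of (action_type, payload) pairs."""
    actions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        # Each line is an object with a single top-level key naming the action.
        (action_type, payload), = entry.items()
        actions.append((action_type, payload))
    return actions

actions = parse_commit(sample_commit)
action_counts = Counter(action_type for action_type, _ in actions)
added_bytes = sum(payload["size"] for action_type, payload in actions if action_type == "add")

print(dict(action_counts))
print(f"Bytes added by this commit: {added_bytes}")
```

When you read the same file with `spark.read.json`, each action type becomes its own column, so most cells in any given row are null.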

### Task 2.2: Analyze Table Structure
```python
# Describe the table
spark.sql("DESCRIBE EXTENDED delta_lab.orders").show(50, False)

# Show table history
spark.sql("DESCRIBE HISTORY delta_lab.orders").show(10, False)
```

### Questions for Task 2:
1. What information is stored in the Delta log files?
2. How many data files were created initially?
3. What does the commitInfo tell us about the transaction?

---

## Exercise 3: Time Travel and Versioning

### Task 3.1: Make Changes to Create Versions
```python
# Version 1: Add new orders (INSERT)
new_orders = generate_ecommerce_data(2000)
new_orders.write \
    .format("delta") \
    .mode("append") \
    .saveAsTable("delta_lab.orders")

print("Added 2000 new records - Version 1")

# Version 2: Update order status (UPDATE)
spark.sql("""
    UPDATE delta_lab.orders 
    SET status = 'Shipped' 
    WHERE status = 'Pending' AND region = 'North'
""")

print("Updated order status - Version 2")

# Version 3: Delete cancelled orders (DELETE)
spark.sql("""
    DELETE FROM delta_lab.orders 
    WHERE status = 'Cancelled'
""")

print("Deleted cancelled orders - Version 3")
```

### Task 3.2: Time Travel Queries
```python
# Show table history
history_df = spark.sql("DESCRIBE HISTORY delta_lab.orders")
display(history_df)

# Time travel by version
print("=== Current Version ===")
current_count = spark.sql("SELECT COUNT(*) as count FROM delta_lab.orders").collect()[0][0]
print(f"Current record count: {current_count}")

print("\n=== Version 0 (Original) ===")
v0_count = spark.sql("SELECT COUNT(*) as count FROM delta_lab.orders VERSION AS OF 0").collect()[0][0]
print(f"Version 0 record count: {v0_count}")

print("\n=== Version 1 (After Insert) ===")
v1_count = spark.sql("SELECT COUNT(*) as count FROM delta_lab.orders VERSION AS OF 1").collect()[0][0]
print(f"Version 1 record count: {v1_count}")

# Time travel by timestamp
print("\n=== Time Travel by Timestamp ===")
# Get timestamp from version 1
timestamp_v1 = history_df.filter(col("version") == 1).collect()[0]["timestamp"]
timestamp_count = spark.sql(f"SELECT COUNT(*) as count FROM delta_lab.orders TIMESTAMP AS OF '{timestamp_v1}'").collect()[0][0]
print(f"Record count at {timestamp_v1}: {timestamp_count}")
```
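
Time travel works because a table version is just the set of data files you get by replaying the log up to that commit. The sketch below illustrates the idea with a hypothetical three-commit log; `live_files_at` and the file names are invented, and the model ignores checkpoints and deletion vectors, which real tables may use.

```python
def live_files_at(commits, version):
    """Replay add/remove actions from commit 0 through `version` (inclusive)."""
    live = set()
    for commit in commits[: version + 1]:
        for action, path in commit:
            if action == "add":
                live.add(path)
            elif action == "remove":
                live.discard(path)
    return live

# Hypothetical log: create, append, then an UPDATE that rewrites one file.
commits = [
    [("add", "f0.parquet"), ("add", "f1.parquet")],     # version 0: initial write
    [("add", "f2.parquet")],                            # version 1: append
    [("remove", "f0.parquet"), ("add", "f3.parquet")],  # version 2: update rewrites f0
]

for version in range(len(commits)):
    print(version, sorted(live_files_at(commits, version)))
```

Note that `f0.parquet` is only marked as removed in version 2. The file itself stays on disk, which is what lets `VERSION AS OF 0` still read it.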

### Questions for Task 3:
1. How many versions of the table exist now?
2. What operations triggered new versions?
3. How do version-based and timestamp-based time travel differ?

---

## Exercise 4: File Management and Optimization

### Task 4.1: Analyze Current File Structure
```python
# Check table details to see file statistics
spark.sql("DESCRIBE DETAIL delta_lab.orders").show(1, False)

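# Count the Parquet files physically present in the table directory.
# This listing also includes files that only older versions reference
# (for example, files rewritten by the UPDATE and DELETE above); numFiles in
# DESCRIBE DETAIL counts only the files in the current version.
data_files = dbutils.fs.ls(delta_table_path)
parquet_files = [f for f in data_files if f.name.endswith('.parquet')]
print(f"Number of Parquet files on disk: {len(parquet_files)}")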

# Show file sizes
for file in parquet_files[:10]:  # Show first 10 files
    print(f"File: {file.name}, Size: {file.size} bytes")
```

### Task 4.2: Create Small Files (Simulating Real-world Scenario)
```python
# Create many small inserts to simulate small file problem
for i in range(20):
    small_batch = generate_ecommerce_data(100)
    small_batch.write \
        .format("delta") \
        .mode("append") \
        .saveAsTable("delta_lab.orders")

print("Created many small files through multiple small inserts")

# Check file count again
data_files = dbutils.fs.ls(delta_table_path)
parquet_files = [f for f in data_files if f.name.endswith('.parquet')]
print(f"Number of data files after small inserts: {len(parquet_files)}")
```
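
To reason about how far a file layout is from a compacted one, a rough estimate like the following can help. This is a sketch only: `compaction_estimate` is a hypothetical helper, the sizes are invented, and the 128 MiB target is an assumption for illustration rather than the size your runtime's OPTIMIZE actually aims for.

```python
import math

def compaction_estimate(file_sizes_bytes, target_bytes):
    """Summarize a list of file sizes against a target file size."""
    total = sum(file_sizes_bytes)
    small = [size for size in file_sizes_bytes if size < target_bytes]
    # Best case: all bytes packed perfectly into target-sized files.
    best_case_files = math.ceil(total / target_bytes) if total else 0
    return {
        "file_count": len(file_sizes_bytes),
        "small_file_count": len(small),
        "total_bytes": total,
        "best_case_files_after_compaction": best_case_files,
    }

KIB = 1024
MIB = 1024 * KIB

# Invented layout: 20 tiny appends plus 2 larger files.
sizes = [8 * KIB] * 20 + [300 * KIB] * 2
estimate = compaction_estimate(sizes, target_bytes=128 * MIB)

for key, value in estimate.items():
    print(f"{key}: {value}")
```

In a notebook you could feed it the sizes from the directory listing, for example `[f.size for f in parquet_files]`.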

### Questions for Task 4:
1. Why do we have so many small files?
2. What problems do small files cause?
3. How many transaction log files do we have now?

---

## Exercise 5: Table Optimization with OPTIMIZE

### Task 5.1: Optimize Without Z-Order
```python
# Check performance before optimization
import time

start_time = time.time()
result = spark.sql("SELECT region, COUNT(*) FROM delta_lab.orders GROUP BY region").collect()
query_time_before = time.time() - start_time
print(f"Query time before optimization: {query_time_before:.2f} seconds")

# Run OPTIMIZE
print("Running OPTIMIZE...")
spark.sql("OPTIMIZE delta_lab.orders").show(1, False)

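# Check file counts after optimization. OPTIMIZE writes new, larger files and
# marks the small ones as removed in the log; the old files stay on disk until
# VACUUM, so compare numFiles (current version) with the directory listing.
num_files_current = spark.sql("DESCRIBE DETAIL delta_lab.orders").collect()[0]["numFiles"]
print(f"Files referenced by the current version after OPTIMIZE: {num_files_current}")
data_files_after = dbutils.fs.ls(delta_table_path)
parquet_files_after = [f for f in data_files_after if f.name.endswith('.parquet')]
print(f"Parquet files on disk after OPTIMIZE: {len(parquet_files_after)}")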

# Check performance after optimization
start_time = time.time()
result = spark.sql("SELECT region, COUNT(*) FROM delta_lab.orders GROUP BY region").collect()
query_time_after = time.time() - start_time
print(f"Query time after optimization: {query_time_after:.2f} seconds")
```

### Task 5.2: Optimize with Z-Order
```python
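# Add more data with specific patterns for Z-ORDER demonstration.
# The target columns are listed explicitly because INSERT otherwise matches by
# position, and the table's column order depends on how Spark inferred the
# schema from the Python dicts. Each row draws r_cat and r_reg once so the CASE
# thresholds yield the intended split (a separate rand() per WHEN would not).
spark.sql("""
    INSERT INTO delta_lab.orders
        (order_id, customer_id, product_category, region,
         order_amount, quantity, order_date, status)
    SELECT 
        concat('ORD', cast(rand() * 1000000 as int)) as order_id,
        concat('CUST', cast(rand() * 1000 as int)) as customer_id,
        case when r_cat < 0.3 then 'Electronics'
             when r_cat < 0.6 then 'Clothing'
             else 'Books' end as product_category,
        case when r_reg < 0.2 then 'North'
             when r_reg < 0.4 then 'South' 
             when r_reg < 0.6 then 'East'
             when r_reg < 0.8 then 'West'
             else 'Central' end as region,
        rand() * 1000 as order_amount,
        cast(rand() * 10 as int) + 1 as quantity,
        -- order_date is stored as a 'yyyy-MM-dd' string in this table
        date_format(date_sub(current_date(), cast(rand() * 365 as int)), 'yyyy-MM-dd') as order_date,
        'Completed' as status
    FROM (SELECT rand() as r_cat, rand() as r_reg FROM range(5000)) draws
""")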

# Optimize with Z-ORDER on frequently queried columns
print("Running OPTIMIZE with Z-ORDER...")
spark.sql("OPTIMIZE delta_lab.orders ZORDER BY (region, product_category)").show(1, False)

# Test query performance on Z-ordered columns
start_time = time.time()
result = spark.sql("""
    SELECT * FROM delta_lab.orders 
    WHERE region = 'North' AND product_category = 'Electronics'
    LIMIT 100
""").collect()
zorder_query_time = time.time() - start_time
print(f"Z-ordered query time: {zorder_query_time:.2f} seconds")
```
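
The intuition behind Z-ordering can be shown with plain bit interleaving (a Morton code). The sketch below is a toy model, not the engine's implementation: it uses small integer coordinates in place of `region` and `product_category`, splits 16 rows into 4 hypothetical "files", and counts how many files a filter must read based on each file's min/max range. All function names are invented for illustration.

```python
def interleave_bits(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z

def file_ranges(points, files=4):
    """Split points into equal chunks and return ((min_x, max_x), (min_y, max_y)) per chunk."""
    size = len(points) // files
    chunks = [points[i * size:(i + 1) * size] for i in range(files)]
    return [
        ((min(x for x, _ in c), max(x for x, _ in c)),
         (min(y for _, y in c), max(y for _, y in c)))
        for c in chunks
    ]

def files_to_read(ranges, axis, value):
    """Count files whose min/max range on `axis` (0 = x, 1 = y) could contain `value`."""
    return sum(1 for r in ranges if r[axis][0] <= value <= r[axis][1])

points = [(x, y) for x in range(4) for y in range(4)]

linear_ranges = file_ranges(sorted(points))  # sorted by x, then y
zorder_ranges = file_ranges(sorted(points, key=lambda p: interleave_bits(*p)))

for name, ranges in [("linear sort", linear_ranges), ("z-order", zorder_ranges)]:
    print(name,
          "| files read for x == 0:", files_to_read(ranges, 0, 0),
          "| files read for y == 0:", files_to_read(ranges, 1, 0))
```

The linear sort is ideal for filters on `x` but cannot skip anything for filters on `y`. The Z-order layout gives up a little on `x` to make both filters selective, which is the trade-off to keep in mind when choosing ZORDER BY columns.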

### Questions for Task 5:
1. How did OPTIMIZE affect the number of data files?
2. What is the difference between OPTIMIZE and OPTIMIZE ZORDER BY?
3. When should you use Z-ordering?

---

## Exercise 6: Vacuum Operations

### Task 6.1: Understanding Vacuum
```python
# Check current table history and versions
history = spark.sql("DESCRIBE HISTORY delta_lab.orders")
display(history.select("version", "timestamp", "operation", "operationParameters"))

# Try to query an old version
try:
    old_version_count = spark.sql("SELECT COUNT(*) FROM delta_lab.orders VERSION AS OF 0").collect()[0][0]
    print(f"Can still access version 0: {old_version_count} records")
except Exception as e:
    print(f"Error accessing version 0: {str(e)}")
```

### Task 6.2: Perform Vacuum
```python
# First, let's see what files exist
all_files_before = dbutils.fs.ls(delta_table_path)
print(f"Total files before vacuum: {len(all_files_before)}")

# Set retention period to 0 for demonstration (DON'T do this in production!)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Perform VACUUM with 0 hours retention (deletes every data file the current version no longer references)
print("Running VACUUM...")
spark.sql("VACUUM delta_lab.orders RETAIN 0 HOURS").show(1, False)

# Check files after vacuum
all_files_after = dbutils.fs.ls(delta_table_path)
print(f"Total files after vacuum: {len(all_files_after)}")

# Try to access old version again
try:
    old_version_count = spark.sql("SELECT COUNT(*) FROM delta_lab.orders VERSION AS OF 0").collect()[0][0]
    print(f"Can still access version 0: {old_version_count} records")
except Exception as e:
    print(f"Error accessing version 0 after vacuum: {str(e)}")
```
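
The retention rule can be modeled in a few lines. This is a simplified sketch, not the engine's logic: it assumes a file is eligible for deletion when the current version no longer references it and it was removed at or before `now - retention`. The helper `vacuum_candidates`, the file names, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

def vacuum_candidates(files_on_disk, live_files, removed_at, now, retain_hours):
    """Return files that are unreferenced and were removed at or before the cutoff."""
    cutoff = now - timedelta(hours=retain_hours)
    candidates = []
    for path in files_on_disk:
        if path in live_files:
            continue  # still part of the current version
        removed = removed_at.get(path)
        if removed is not None and removed <= cutoff:
            candidates.append(path)
    return sorted(candidates)

now = datetime(2024, 1, 10, 12, 0, 0)
files_on_disk = ["a.parquet", "b.parquet", "c.parquet"]
live_files = {"a.parquet"}
removed_at = {
    "b.parquet": now - timedelta(days=8),   # rewritten over a week ago
    "c.parquet": now - timedelta(hours=1),  # rewritten an hour ago
}

default_retention = vacuum_candidates(files_on_disk, live_files, removed_at, now, 168)
zero_retention = vacuum_candidates(files_on_disk, live_files, removed_at, now, 0)

print("RETAIN 168 HOURS would delete:", default_retention)
print("RETAIN 0 HOURS would delete:", zero_retention)
```

With zero retention, `c.parquet` is deleted even though a version from an hour ago still needs it, which is why time travel to older versions fails after the VACUUM in Task 6.2.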

### Task 6.3: Vacuum with Different Retention Periods
```python
# Create some more versions for demonstration
spark.sql("INSERT INTO delta_lab.orders SELECT * FROM delta_lab.orders LIMIT 1000")
spark.sql("UPDATE delta_lab.orders SET status = 'Processing' WHERE status = 'Pending'")

# Show current history
current_history = spark.sql("DESCRIBE HISTORY delta_lab.orders")
display(current_history.select("version", "timestamp", "operation").orderBy(desc("version")))

# Vacuum with 168 hours (7 days) retention - typical production setting
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
print("Running VACUUM with 7 days retention...")
spark.sql("VACUUM delta_lab.orders RETAIN 168 HOURS").show(1, False)
```

### Questions for Task 6:
1. What happens to old data files after VACUUM?
2. Why is there a default retention period?
3. What are the risks of setting retention period too low?

---

## Exercise 7: Advanced Monitoring and Analysis

### Task 7.1: Analyze Table Statistics
```python
# Get detailed table information
spark.sql("DESCRIBE DETAIL delta_lab.orders").show(1, False)

# Show extended table metadata (schema, location, provider, table properties)
spark.sql("DESCRIBE EXTENDED delta_lab.orders").show(50, False)

# Analyze data distribution
spark.sql("""
    SELECT 
        region,
        product_category,
        COUNT(*) as record_count,
        AVG(order_amount) as avg_amount,
        MIN(order_date) as min_date,
        MAX(order_date) as max_date
    FROM delta_lab.orders
    GROUP BY region, product_category
    ORDER BY region, product_category
""").show(25, False)
```

### Task 7.2: Monitor File Statistics Over Time
```python
# Create a function to get file statistics
def get_file_stats(table_path):
    files = dbutils.fs.ls(table_path)
    data_files = [f for f in files if f.name.endswith('.parquet')]
    
    if data_files:
        total_size = sum(f.size for f in data_files)
        avg_size = total_size / len(data_files)
        min_size = min(f.size for f in data_files)
        max_size = max(f.size for f in data_files)
        
        return {
            'file_count': len(data_files),
            'total_size_mb': round(total_size / (1024*1024), 2),
            'avg_size_mb': round(avg_size / (1024*1024), 2),
            'min_size_mb': round(min_size / (1024*1024), 2),
            'max_size_mb': round(max_size / (1024*1024), 2)
        }
    return None

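# Get current statistics (get_file_stats returns None when no Parquet files are found)
current_stats = get_file_stats(delta_table_path)
print("Current File Statistics:")
if current_stats is None:
    print("  No Parquet files found at the table path")
else:
    for key, value in current_stats.items():
        print(f"  {key}: {value}")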
```
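
One way to turn these statistics into a signal is a simple threshold check. The sketch below is illustrative only: `needs_optimize` is a hypothetical helper, the sample statistics are invented, and the thresholds (more than 50 files averaging under 16 MB) are assumptions you would tune for your own workload, not official guidance.

```python
def needs_optimize(stats, small_file_mb=16.0, max_files=50):
    """Return (flag, reason) based on illustrative thresholds."""
    if stats is None or stats["file_count"] == 0:
        return False, "no data files"
    if stats["file_count"] > max_files and stats["avg_size_mb"] < small_file_mb:
        reason = (f"{stats['file_count']} files averaging "
                  f"{stats['avg_size_mb']} MB (below {small_file_mb} MB)")
        return True, reason
    return False, "file layout looks fine"

# Invented statistics in the same shape that get_file_stats returns.
fragmented = {"file_count": 120, "total_size_mb": 6.0, "avg_size_mb": 0.05,
              "min_size_mb": 0.01, "max_size_mb": 0.3}
compacted = {"file_count": 2, "total_size_mb": 6.0, "avg_size_mb": 3.0,
             "min_size_mb": 2.9, "max_size_mb": 3.1}

print(needs_optimize(fragmented))
print(needs_optimize(compacted))
```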

### Questions for Task 7:
1. How do you monitor Delta Lake table health?
2. What metrics indicate a table needs optimization?
3. How often should you run OPTIMIZE and VACUUM?

---

## Exercise 8: Best Practices Implementation

### Task 8.1: Implement Automated Optimization
```python
# Create a maintenance function
def maintain_delta_table(table_name, table_path, zorder_columns=None, vacuum_hours=168):
    """
    Perform maintenance on a Delta Lake table
    """
    print(f"Starting maintenance for {table_name}...")
    
    # Get initial statistics
    initial_stats = get_file_stats(table_path)
    print(f"Initial file count: {initial_stats['file_count']}")
    
    # Run OPTIMIZE
    if zorder_columns:
        optimize_sql = f"OPTIMIZE {table_name} ZORDER BY ({', '.join(zorder_columns)})"
    else:
        optimize_sql = f"OPTIMIZE {table_name}"
    
    print(f"Running: {optimize_sql}")
    spark.sql(optimize_sql).show(1, False)
    
    # Get post-optimize statistics
    post_optimize_stats = get_file_stats(table_path)
    print(f"Files after OPTIMIZE: {post_optimize_stats['file_count']}")
    
    # Run VACUUM
    vacuum_sql = f"VACUUM {table_name} RETAIN {vacuum_hours} HOURS"
    print(f"Running: {vacuum_sql}")
    spark.sql(vacuum_sql).show(1, False)
    
    # Get final statistics
    final_stats = get_file_stats(table_path)
    print(f"Final file count: {final_stats['file_count']}")
    
    return {
        'initial': initial_stats,
        'post_optimize': post_optimize_stats,
        'final': final_stats
    }

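# Apply maintenance to our table.
# RETAIN 0 HOURS is rejected while the retention safety check is enabled
# (Task 6.3 turned it back on), so disable it for this demo call only.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
maintenance_results = maintain_delta_table(
    "delta_lab.orders", 
    delta_table_path, 
    zorder_columns=['region', 'product_category'],
    vacuum_hours=0  # Only for demo - use 168 in production
)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")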
```

### Task 8.2: Create Monitoring Dashboard Query
```python
# Create a comprehensive table health query.
# The table history is registered as a temp view first so the query does not
# depend on the runtime accepting DESCRIBE HISTORY as a subquery.
spark.sql("DESCRIBE HISTORY delta_lab.orders").createOrReplaceTempView("orders_history")

table_health_query = """
WITH table_stats AS (
    SELECT 
        COUNT(*) as total_records,
        COUNT(DISTINCT region) as unique_regions,
        COUNT(DISTINCT product_category) as unique_categories,
        MIN(order_date) as min_date,
        MAX(order_date) as max_date,
        AVG(order_amount) as avg_order_value
    FROM delta_lab.orders
),
history_stats AS (
    SELECT 
        COUNT(*) as total_versions,
        MAX(version) as latest_version,
        MIN(timestamp) as first_commit,
        MAX(timestamp) as last_commit
    FROM orders_history
)
SELECT 
    t.*,
    h.*,
    datediff(current_date(), date(h.last_commit)) as days_since_last_update
FROM table_stats t
CROSS JOIN history_stats h
"""

print("Table Health Dashboard:")
spark.sql(table_health_query).show(1, False)
```

### Questions for Task 8:
1. What maintenance schedule would you recommend for a production table?
2. How do you decide which columns to use for Z-ordering?
3. What alerts would you set up for Delta Lake table health?

---

## Cleanup

```python
# Clean up the lab resources
spark.sql("DROP TABLE IF EXISTS delta_lab.orders")
spark.sql("DROP DATABASE IF EXISTS delta_lab CASCADE")
dbutils.fs.rm(base_path, True)
print("Lab cleanup completed")
```

---

## Summary

In this lab, you covered the following:

1. **Delta Lake Basics**: Created and managed Delta Lake tables
2. **Time Travel**: Used version and timestamp-based queries to access historical data
3. **Delta Logs**: Understood the transaction log structure and metadata
4. **Data Files**: Analyzed parquet file organization and small file problems
5. **Compaction**: Used OPTIMIZE to consolidate small files
6. **Vacuum**: Cleaned up old files while understanding retention policies
7. **Z-Order**: Co-located related rows with ZORDER BY so that data skipping can prune more files at query time
8. **Best Practices**: Created maintenance procedures and monitoring queries

## Next Steps

- Practice these concepts with your own datasets
- Implement automated maintenance procedures
- Explore advanced Delta Lake features like Change Data Feed and Liquid Clustering
- Study partition strategies for large tables
- Learn about Delta Lake security and access controls