# PySpark: Zero to Hero
## Module 29: Delta Lake, Time Travel, and Schema Evolution

Delta Lake is an open-source storage layer that brings reliability, performance, and lifecycle management to data lakes. It enables the **Lakehouse Architecture**, combining the best elements of Data Lakes and Data Warehouses.

### Key Features of Delta Lake:
1.  **ACID Transactions:** Ensures data integrity.
2.  **Time Travel:** Access previous versions of data.
3.  **Schema Evolution:** Allows schema changes without corrupting data.
4.  **DML Support:** Supports UPDATE, DELETE, and MERGE operations.
5.  **Scalable Metadata:** Handles petabytes of data efficiently.
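
The ACID guarantee rests on the transaction log: each commit is a new numbered JSON file in `_delta_log`, and a writer may create version N only if no file for N exists yet. The sketch below is a toy model of that idea on a local temporary directory, not Delta Lake's actual implementation. `try_commit` and the file names are hypothetical, and real deployments rely on the storage system offering an equivalent atomic put-if-absent.

```python
import json
import os
import tempfile

def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to write commit `version`; return False if another writer got there first."""
    # Delta log files are named with a zero-padded 20-digit version number.
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" creates the file only if it does not exist yet.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False

log_dir = tempfile.mkdtemp()
# Two writers race to commit version 0; only the first one can succeed.
writer_a = try_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
writer_b = try_commit(log_dir, 0, [{"add": {"path": "part-0001.parquet"}}])
print(writer_a, writer_b)
print(sorted(os.listdir(log_dir)))
```

In this model the losing writer would re-read the log and retry at the next version number, which is the essence of optimistic concurrency control.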

### Agenda:
1.  **Setup:** Configuring Spark with Delta Lake support.
2.  **Data Writing:** Saving data as a Delta table.
3.  **DML Operations:** Updating rows in place (plain Parquet tables have no UPDATE or DELETE).
4.  **Time Travel and Restore:** Querying older versions of the data and rolling the table back.
5.  **Schema Evolution:** Adding new columns dynamically using `mergeSchema`.
6.  **Vacuum:** Cleaning up old files to save space.

In [None]:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Configure the Spark session with Delta Lake support.
# Note: On Databricks this is pre-configured.
# For local setups, install a delta-spark release that matches your Spark version
# (for example, Delta Lake 2.4.x targets Spark 3.4.x).

# configure_spark_with_delta_pip() sets spark.jars.packages to the Delta artifact that
# matches the pip-installed package, so the Maven coordinate is not hard-coded here.
builder = SparkSession.builder \
    .appName("Delta_Lake_Demo") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

print("Spark Session Created with Delta Lake Support")
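
For setups that do not go through `configure_spark_with_delta_pip` (for example, passing `--packages` to `spark-submit`), the Maven coordinate has to be chosen by hand. A minimal sketch of how it is assembled, assuming only the artifact rename that came with Delta Lake 3.0; `delta_maven_coordinate` is a hypothetical helper, not part of the `delta` package, and it does not check compatibility with your Spark version.

```python
def delta_maven_coordinate(delta_version: str, scala_version: str = "2.12") -> str:
    """Build the io.delta Maven coordinate for a given Delta Lake release."""
    major = int(delta_version.split(".")[0])
    # The Spark artifact was renamed from delta-core to delta-spark in Delta Lake 3.0.
    artifact = "delta-spark" if major >= 3 else "delta-core"
    return f"io.delta:{artifact}_{scala_version}:{delta_version}"

print(delta_maven_coordinate("2.4.0"))
print(delta_maven_coordinate("3.1.0"))
```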

In [None]:
# Sample Sales Data
data = [
    (1, "Widget A", 100, "2023-01-01"),
    (2, "Widget B", 150, "2023-01-02"),
    (3, "Widget A", 100, "2023-01-03"),
    (4, "Widget C", 200, "2023-01-04")
]
columns = ["transaction_id", "product", "amount", "date"]

df_sales = spark.createDataFrame(data, columns)

# Write data as Delta Table
# We save it as a managed table in the Hive Metastore (or local warehouse)
df_sales.write.format("delta").mode("overwrite").saveAsTable("sales_delta")

print("Delta Table 'sales_delta' created.")

# View the data
spark.sql("SELECT * FROM sales_delta").show()

In [None]:
# Standard Parquet tables do not support UPDATE. Delta Lake does.
# Scenario: Update amount to 0 for transaction_id = 1

from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "sales_delta")

# Update using the DeltaTable API (values in `set` are SQL expression strings)
deltaTable.update(
    condition = "transaction_id = 1",
    set = { "amount": "0" }
)

print("Data Updated.")
spark.sql("SELECT * FROM sales_delta ORDER BY transaction_id").show()
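
The same `DeltaTable` API also covers DELETE and MERGE (upsert). The sketch below wraps both in a hypothetical helper, `delete_and_upsert`, which is not part of the notebook's flow; the predicate and join key assume the `transaction_id` column used in this module, and `updates_df` is assumed to have the same columns as the target table.

```python
def delete_and_upsert(spark, table_name, updates_df):
    """Delete one row, then MERGE `updates_df` into the Delta table `table_name`."""
    # Imported inside the function so the definition loads even without delta-spark.
    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, table_name)

    # DELETE: remove rows that match a SQL predicate.
    target.delete("transaction_id = 4")

    # MERGE (upsert): update rows whose key matches, insert the rest.
    (
        target.alias("t")
        .merge(updates_df.alias("s"), "t.transaction_id = s.transaction_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
    return target
```

Calling it on `sales_delta` would add new versions to the table history, so the version numbers used in the later cells would shift. Trying it on a separate copy of the table keeps the rest of the walkthrough intact.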

In [None]:
# Delta Lake maintains a transaction log (JSON files in _delta_log folder).
# We can view the history of operations.

print("Table History:")
deltaTable.history().select("version", "timestamp", "operation", "operationParameters").show(truncate=False)

# Read Version 0 (Before the Update)
print("Data at Version 0 (Original):")
df_v0 = spark.read.format("delta").option("versionAsOf", 0).table("sales_delta")
df_v0.show()

# Read Version 1 (After the Update)
print("Data at Version 1 (Current):")
df_v1 = spark.read.format("delta").option("versionAsOf", 1).table("sales_delta")
df_v1.show()
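
To see why an old version stays readable after an UPDATE, it helps to picture the log as a list of commits that add and remove data files. The following is a simplified pure-Python model, not Delta Lake's code: the file names are made up, and the real log also records metadata, protocol, and commit information and uses checkpoints to avoid replaying every commit.

```python
# Each commit is a list of actions; file names here are made up for illustration.
log = [
    [{"add": "part-a.parquet"}],                                # version 0: initial write
    [{"remove": "part-a.parquet"}, {"add": "part-b.parquet"}],  # version 1: UPDATE rewrites the file
]

def files_at_version(log, version):
    """Replay commits 0..version and return the data files that are live."""
    live = set()
    for commit in log[: version + 1]:
        for action in commit:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live

print(files_at_version(log, 0))
print(files_at_version(log, 1))
```

The UPDATE never modifies `part-a.parquet`; it only marks it as removed. Reading version 0 works as long as that file is still on disk.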

In [None]:
# We can roll back the table to a previous state with the RESTORE command (SQL syntax).
# RESTORE does not erase history: it is recorded as a new version whose contents match Version 0.

spark.sql("RESTORE TABLE sales_delta TO VERSION AS OF 0")

print("Table Restored to Version 0.")
spark.sql("SELECT * FROM sales_delta ORDER BY transaction_id").show()
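
A restore can be pictured as one more commit that re-adds the files of the target version and removes the files that came after it. This toy model assumes a log where each commit lists added and removed files; the `restore` and `live_files` helpers and the file names are hypothetical and only illustrate the bookkeeping.

```python
def live_files(log):
    """Replay every commit and return the set of live data files."""
    live = set()
    for commit in log:
        live |= set(commit.get("add", []))
        live -= set(commit.get("remove", []))
    return live

def restore(log, version):
    """Append a commit that makes the live files equal those at `version`."""
    target = live_files(log[: version + 1])
    current = live_files(log)
    log.append({
        "operation": "RESTORE",
        "add": sorted(target - current),
        "remove": sorted(current - target),
    })
    return log

log = [
    {"operation": "WRITE", "add": ["part-a.parquet"], "remove": []},
    {"operation": "UPDATE", "add": ["part-b.parquet"], "remove": ["part-a.parquet"]},
]
restore(log, 0)
print(len(log), log[-1]["operation"], live_files(log))
```

The log grows from two commits to three, and the updated state (version 1) is still reachable through time travel. A restore can only succeed while the data files of the target version still exist.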

In [None]:
# Scenario: New data arrives with an extra column, 'customer_id'.
# By default, an append fails if the incoming schema does not match the table schema.
# The 'mergeSchema' option lets the write add the new column to the table.

new_data = [
    (5, "Widget D", 300, "2023-01-05", 101),
    (6, "Widget E", 120, "2023-01-06", 102)
]
# Note: Schema has 5 columns now
new_columns = ["transaction_id", "product", "amount", "date", "customer_id"]

df_new = spark.createDataFrame(new_data, new_columns)

# Append with Schema Evolution
df_new.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("sales_delta")

print("New Data Appended with Schema Evolution.")
spark.sql("SELECT * FROM sales_delta").show()

# Notice that older records will have NULL for 'customer_id'
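
The effect of `mergeSchema` on this append can be sketched in plain Python: new columns are appended to the end of the table schema, and rows written earlier read back with a null in those columns. This is an illustrative model only; `merge_schema` and `read_row` are hypothetical helpers, and the toy rejects every type mismatch, which is stricter than Delta Lake in some cases.

```python
def merge_schema(table_schema, incoming_schema):
    """Return the table schema with any new incoming columns appended at the end."""
    merged = dict(table_schema)  # dicts keep insertion order
    for name, dtype in incoming_schema.items():
        if name in merged and merged[name] != dtype:
            raise TypeError(f"column {name!r}: {merged[name]} vs {dtype}")
        merged.setdefault(name, dtype)
    return merged

def read_row(row, schema):
    """Project a stored row onto the merged schema; missing columns become None."""
    return {name: row.get(name) for name in schema}

table_schema = {"transaction_id": "long", "product": "string", "amount": "long", "date": "string"}
incoming = dict(table_schema, customer_id="long")

merged = merge_schema(table_schema, incoming)
old_row = {"transaction_id": 1, "product": "Widget A", "amount": 100, "date": "2023-01-01"}
print(list(merged))
print(read_row(old_row, merged)["customer_id"])
```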

In [None]:
# VACUUM deletes data files that the current table version no longer references
# and that are older than the retention period.
# CAUTION: Time travel to a version fails once VACUUM has deleted the data files
# that version needs. With a retention of 0 hours, only the current version stays readable.

# By default, Delta Lake refuses to vacuum with a retention shorter than 168 hours (7 days).
# To force it for this demo, we disable the retention check. Avoid this in production.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Vacuum with a 0-hour retention: removes every data file not needed by the current version.
# The transaction log (and therefore the history listing) is not deleted by VACUUM.
deltaTable.vacuum(0)

print("Vacuum complete. Unreferenced data files removed.")
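
The retention rule can be modelled with a few lines of date arithmetic. The sketch below is a simplification under stated assumptions: it only considers files that earlier commits marked as removed, and the `vacuum_candidates` helper, file names, and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def vacuum_candidates(removed_files, now, retention_hours=168):
    """Return the files whose removal time is at or before the retention cutoff."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(path for path, removed_at in removed_files.items() if removed_at <= cutoff)

now = datetime(2023, 1, 10, 12, 0)
# Hypothetical files that earlier commits marked as removed, with removal times.
removed_files = {
    "part-a.parquet": datetime(2023, 1, 1, 9, 0),    # removed 9 days ago
    "part-b.parquet": datetime(2023, 1, 10, 11, 0),  # removed 1 hour ago
}

print(vacuum_candidates(removed_files, now))                     # default 7-day retention
print(vacuum_candidates(removed_files, now, retention_hours=0))  # retention 0
```

With the default retention only the nine-day-old file qualifies; with a retention of 0 both do, which is why the demo above loses the ability to read older versions.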

## Summary

1.  **Delta Lake** provides ACID properties on top of cloud object storage.
2.  **DML:** We successfully updated data in place.
3.  **Time Travel:** We queried older versions and restored the table state.
4.  **Schema Evolution:** We added a new column dynamically during an append operation.
5.  **Maintenance:** `VACUUM` helps reclaim storage space but limits time travel capabilities.

**Next Steps:**
In the next module, we will explore advanced optimization techniques for Delta tables, including `OPTIMIZE` and Z-Ordering.