# Delta Lake Time Travel and Data Versioning

Delta Lake is an open-source storage layer that enhances Apache Spark with ACID transactions, enabling reliable data management. Its time travel feature allows you to query previous versions of data, making it ideal for auditing, error recovery, and experiment reproducibility. This notebook demonstrates time travel and data versioning through simple, intermediate, and advanced scenarios using a sample employee dataset.

## Setting Up the Environment

Ensure you have PySpark and Delta Lake installed. You can install Delta Lake using:

```
pip install delta-spark
```

The following code sets up a SparkSession for Delta Lake operations. Adjust configurations based on your environment (e.g., Databricks may not require explicit Delta Lake extensions).

In [None]:
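from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Enable the Delta Lake SQL extension and catalog
builder = SparkSession.builder \
    .appName("Delta Lake Time Travel") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# For pip-based local installs, attach the matching Delta Lake JARs to the session
spark = configure_spark_with_delta_pip(builder).getOrCreate()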

**Note**: If running in Databricks, Delta Lake is included by default. For local setups, make sure the Spark and Delta Lake versions are compatible (for example, Delta Lake 4.0.0 is built for Spark 4.0.x). See the [Delta Lake Quickstart](https://docs.delta.io/latest/quick-start.html) for details.
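Before troubleshooting compatibility, it can help to see which package versions are actually installed. The following is a minimal sketch using only the standard library; `installed_version` is an illustrative helper name, not part of PySpark or Delta Lake.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the versions of the two packages this notebook depends on
for package in ("pyspark", "delta-spark"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```

Compare the reported versions against the compatibility table in the Delta Lake release notes.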

## Simple Scenario: Creating a Delta Table and Using Time Travel

In this section, we’ll create a Delta table for employee data, insert additional records, and use time travel to query previous versions.

### Step 1: Create Initial Data and Write to Delta Table

We define a schema for an employee table and insert three employee records.

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Define schema
schema = StructType([
    StructField("employee_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("department", StringType(), False),
    StructField("salary", DoubleType(), False)
])

# Initial data
initial_data = [
    (1, "Alice", "Engineering", 100000.0),
    (2, "Bob", "Sales", 80000.0),
    (3, "Charlie", "Marketing", 90000.0)
]

# Create DataFrame
df = spark.createDataFrame(initial_data, schema)

# Write to Delta table
df.write.format("delta").save("./delta-employees")
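The write above uses the default save mode, which raises an error if the target path already holds data, so re-running the notebook from the top fails at this cell. One possible workaround for local experiments is to delete the table directory first, so that version numbering starts again at 0. This is only a sketch: `reset_table_dir` is a hypothetical helper, it assumes the table lives on the local filesystem, and it permanently removes both the data files and the transaction log.

```python
import shutil
from pathlib import Path


def reset_table_dir(table_path):
    """Delete a local Delta table directory (data files and _delta_log) if it exists.

    Returns True when the path no longer exists afterwards.
    """
    path = Path(table_path)
    if path.is_dir():
        shutil.rmtree(path)
    return not path.exists()


# Example (destructive, so it is left commented out):
# reset_table_dir("./delta-employees")
```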

### Step 2: View Current Data

Let’s display the current state of the Delta table.

In [None]:
# Read and display the Delta table
spark.read.format("delta").load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|100000.0|
|          2|    Bob|       Sales| 80000.0|
|          3|Charlie|   Marketing| 90000.0|
+-----------+-------+------------+--------+
```

### Step 3: Insert Additional Data

We append two more employees to the table.

In [None]:
# New data
new_data = [
    (4, "David", "Engineering", 95000.0),
    (5, "Eve", "Sales", 85000.0)
]

# Create DataFrame
df_new = spark.createDataFrame(new_data, schema)

# Append to Delta table
df_new.write.format("delta").mode("append").save("./delta-employees")

### Step 4: View Updated Data

Display the table after appending the new records.

In [None]:
# Read and display the updated Delta table
spark.read.format("delta").load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|100000.0|
|          2|    Bob|       Sales| 80000.0|
|          3|Charlie|   Marketing| 90000.0|
|          4|  David| Engineering| 95000.0|
|          5|    Eve|       Sales| 85000.0|
+-----------+-------+------------+--------+
```

### Step 5: Use Time Travel to Query Previous Versions

Delta Lake’s transaction log tracks all changes, allowing us to query earlier versions. First, we check the table’s history.

In [None]:
# Describe history
spark.sql("DESCRIBE HISTORY './delta-employees'").show()

**Expected Output (example, timestamps will vary):**

| version | timestamp           | operation | ... |
|---------|---------------------|-----------|-----|
| 1       | 2025-07-14 23:59:00 | WRITE     | ... |
| 0       | 2025-07-14 23:58:00 | WRITE     | ... |
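The history shown above is derived from the commit files in the table's `_delta_log` directory. As a rough illustration, the sketch below lists those commits using only the standard library. It assumes the usual layout of one zero-padded `<version>.json` file per commit, each holding newline-delimited JSON actions with the operation name stored in a `commitInfo` action. `list_commits` is a hypothetical helper, not a Delta Lake API, and it ignores checkpoints and log cleanup.

```python
import json
from pathlib import Path


def list_commits(table_path):
    """Return (version, operation) pairs read from a table's _delta_log directory."""
    log_dir = Path(table_path) / "_delta_log"
    commits = []
    # Zero-padded file names sort in version order
    for commit_file in sorted(log_dir.glob("*.json")):
        if not commit_file.stem.isdigit():
            continue  # skip anything that is not a plain commit file
        operation = None
        with commit_file.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)
                if "commitInfo" in action:
                    operation = action["commitInfo"].get("operation")
        commits.append((int(commit_file.stem), operation))
    return commits


# Only inspect the log if the table from this notebook exists locally
if (Path("./delta-employees") / "_delta_log").is_dir():
    print(list_commits("./delta-employees"))
```

For everyday use, `DESCRIBE HISTORY` remains the supported way to inspect history; reading the log directly is mainly useful for understanding how versions are recorded.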

Now, query version 0 (initial state with three employees).

In [None]:
# Read version 0
spark.read.format("delta").option("versionAsOf", 0).load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|100000.0|
|          2|    Bob|       Sales| 80000.0|
|          3|Charlie|   Marketing| 90000.0|
+-----------+-------+------------+--------+
```

In [None]:
# Read version 1
spark.read.format("delta").option("versionAsOf", 1).load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|100000.0|
|          2|    Bob|       Sales| 80000.0|
|          3|Charlie|   Marketing| 90000.0|
|          4|  David| Engineering| 95000.0|
|          5|    Eve|       Sales| 85000.0|
+-----------+-------+------------+--------+
```
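The same version reads can also be expressed in SQL. The sketch below builds a `VERSION AS OF` query for a path-based table. It assumes a Delta Lake and Spark combination recent enough to support this SQL syntax; `version_query` and `read_version_sql` are illustrative helper names, and `spark` is the session created during setup.

```python
def version_query(table_path, version):
    """Build a SQL time travel query for a path-based Delta table."""
    return f"SELECT * FROM delta.`{table_path}` VERSION AS OF {int(version)}"


def read_version_sql(spark, table_path, version):
    """Run the query on an existing SparkSession and return the DataFrame."""
    return spark.sql(version_query(table_path, version))


print(version_query("./delta-employees", 0))
# With a live session: read_version_sql(spark, "./delta-employees", 0).show()
```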

## Intermediate Scenario: Updates, Deletes, and Time Travel

This section demonstrates updating and deleting records, then using time travel to inspect the table’s state at different points.

### Step 1: Update Records

We update Alice’s salary to 110,000.

In [None]:
from delta.tables import DeltaTable

# Load Delta table
delta_table = DeltaTable.forPath(spark, "./delta-employees")

# Update Alice's salary
delta_table.update("employee_id = 1", {"salary": "110000.0"})

### Step 2: Delete Records

We delete Bob from the table.

In [None]:
# Delete Bob
delta_table.delete("employee_id = 2")

### Step 3: View Current Data

Display the table after updates and deletes.

In [None]:
# Read and display the current Delta table
spark.read.format("delta").load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|110000.0|
|          3|Charlie|   Marketing| 90000.0|
|          4|  David| Engineering| 95000.0|
|          5|    Eve|       Sales| 85000.0|
+-----------+-------+------------+--------+
```

### Step 4: Use Time Travel to View Previous States

Check the table’s history to see all operations.

In [None]:
# Describe history
spark.sql("DESCRIBE HISTORY './delta-employees'").show()

**Expected Output (example, timestamps will vary):**

| version | timestamp           | operation | ... |
|---------|---------------------|-----------|-----|
| 3       | 2025-07-14 23:59:30 | DELETE    | ... |
| 2       | 2025-07-14 23:59:15 | UPDATE    | ... |
| 1       | 2025-07-14 23:59:00 | WRITE     | ... |
| 0       | 2025-07-14 23:58:00 | WRITE     | ... |

Query version 1 (before updates and deletes).

In [None]:
# Read version 1
spark.read.format("delta").option("versionAsOf", 1).load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|100000.0|
|          2|    Bob|       Sales| 80000.0|
|          3|Charlie|   Marketing| 90000.0|
|          4|  David| Engineering| 95000.0|
|          5|    Eve|       Sales| 85000.0|
+-----------+-------+------------+--------+
```

In [None]:
# Read version 2
spark.read.format("delta").option("versionAsOf", 2).load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|110000.0|
|          2|    Bob|       Sales| 80000.0|
|          3|Charlie|   Marketing| 90000.0|
|          4|  David| Engineering| 95000.0|
|          5|    Eve|       Sales| 85000.0|
+-----------+-------+------------+--------+
```
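Comparing two versions by eye works for five rows but not for larger tables. As a small sketch, the helper below summarizes what changed between two sets of rows keyed by `employee_id`. It works on plain tuples, so it could be fed the result of `.collect()` on two `versionAsOf` reads. `diff_rows` is a hypothetical helper, and the sample rows are copied from the expected outputs for version 1 and the current table in this notebook.

```python
def diff_rows(old_rows, new_rows, key_index=0):
    """Summarize added, removed, and changed keys between two row collections."""
    old = {row[key_index]: tuple(row) for row in old_rows}
    new = {row[key_index]: tuple(row) for row in new_rows}
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in old if k in new and old[k] != new[k]),
    }


# Version 1: state after the append
version_1 = [
    (1, "Alice", "Engineering", 100000.0),
    (2, "Bob", "Sales", 80000.0),
    (3, "Charlie", "Marketing", 90000.0),
    (4, "David", "Engineering", 95000.0),
    (5, "Eve", "Sales", 85000.0),
]
# Version 3: state after the update and the delete
version_3 = [
    (1, "Alice", "Engineering", 110000.0),
    (3, "Charlie", "Marketing", 90000.0),
    (4, "David", "Engineering", 95000.0),
    (5, "Eve", "Sales", 85000.0),
]

print(diff_rows(version_1, version_3))
# prints {'added': [], 'removed': [2], 'changed': [1]}
```

Collecting rows to the driver is only reasonable for small tables; for large ones, a DataFrame-level comparison would be a better fit.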

## Advanced Scenario: Merge Operations, Schema Evolution, and Restoring Versions

This section covers complex operations like merging data, evolving the schema, using time travel with timestamps, and restoring a previous version.

### Step 1: Perform a Merge Operation

We merge a dataset that updates Alice’s salary and adds a new employee, Frank.

In [None]:
# Data to merge
merge_data = [
    (1, "Alice", "Engineering", 120000.0),  # Update Alice's salary
    (6, "Frank", "HR", 70000.0)           # New employee
]

# Create DataFrame
df_merge = spark.createDataFrame(merge_data, schema)

# Merge into Delta table
delta_table.alias("target").merge(
    df_merge.alias("source"),
    "target.employee_id = source.employee_id"
).whenMatchedUpdate(set={
    "name": "source.name",
    "department": "source.department",
    "salary": "source.salary"
}).whenNotMatchedInsert(values={
    "employee_id": "source.employee_id",
    "name": "source.name",
    "department": "source.department",
    "salary": "source.salary"
}).execute()
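Because the source and target share the same columns here, the explicit column mappings can be shortened with `whenMatchedUpdateAll()` and `whenNotMatchedInsertAll()`. The sketch below wraps that pattern in a function; `upsert_all` and `merge_condition` are illustrative names, and the function assumes both sides share a single key column.

```python
def merge_condition(key, target_alias="target", source_alias="source"):
    """Build the join condition used to match source rows to target rows."""
    return f"{target_alias}.{key} = {source_alias}.{key}"


def upsert_all(delta_table, source_df, key):
    """Update all columns of matched rows and insert rows that have no match."""
    (
        delta_table.alias("target")
        .merge(source_df.alias("source"), merge_condition(key))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


print(merge_condition("employee_id"))
# With the objects from this notebook: upsert_all(delta_table, df_merge, "employee_id")
```

The explicit form used above is still preferable when only some columns should be updated.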

### Step 2: View Current Data After Merge

In [None]:
# Read and display the Delta table
spark.read.format("delta").load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|120000.0|
|          3|Charlie|   Marketing| 90000.0|
|          4|  David| Engineering| 95000.0|
|          5|    Eve|       Sales| 85000.0|
|          6|  Frank|          HR| 70000.0|
+-----------+-------+------------+--------+
```

### Step 3: Evolve Schema by Adding a New Column

We add a `hire_date` column by appending a new employee with this field, enabling schema evolution.

In [None]:
from pyspark.sql.types import DateType
import datetime

# New employee with hire_date
new_employee = [
    (7, "Grace", "Finance", 100000.0, datetime.date(2025, 1, 1))
]

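# Updated schema with hire_date (build a new StructType; schema.add() would mutate `schema` in place)
schema_with_date = StructType(schema.fields + [StructField("hire_date", DateType(), True)])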

# Create DataFrame
df_new_with_date = spark.createDataFrame(new_employee, schema_with_date)

# Write with schema evolution
df_new_with_date.write.format("delta").mode("append").option("mergeSchema", "true").save("./delta-employees")

### Step 4: View Current Data with New Column

In [None]:
# Read and display the Delta table
spark.read.format("delta").load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+----------+
|employee_id|   name|  department|  salary| hire_date|
+-----------+-------+------------+--------+----------+
|          1|  Alice| Engineering|120000.0|      null|
|          3|Charlie|   Marketing| 90000.0|      null|
|          4|  David| Engineering| 95000.0|      null|
|          5|    Eve|       Sales| 85000.0|      null|
|          6|  Frank|          HR| 70000.0|      null|
|          7|  Grace|     Finance|100000.0|2025-01-01|
+-----------+-------+------------+--------+----------+
```
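Time travel returns the schema as it was at the requested version, so reads of versions written before the schema change do not include `hire_date`. The sketch below compares column lists between two versions. `columns_at_version` and `added_columns` are hypothetical helpers, and the version numbers in the usage comment assume the notebook was run once, in order (merge at version 4, schema change at version 5).

```python
def columns_at_version(spark, table_path, version):
    """Return the column names of a Delta table as of a given version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(table_path)
        .columns
    )


def added_columns(old_columns, new_columns):
    """Return columns present in new_columns but not in old_columns, in order."""
    return [name for name in new_columns if name not in old_columns]


before = ["employee_id", "name", "department", "salary"]
after = ["employee_id", "name", "department", "salary", "hire_date"]
print(added_columns(before, after))
# prints ['hire_date']

# With a live session:
# added_columns(columns_at_version(spark, "./delta-employees", 4),
#               columns_at_version(spark, "./delta-employees", 5))
```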

### Step 5: Use Time Travel with Timestamps

We use timestamps to query a previous version of the table.

In [None]:
# Describe history
history = spark.sql("DESCRIBE HISTORY './delta-employees'")
history.select("version", "timestamp").show()

**Expected Output (example, timestamps will vary):**

| version | timestamp           |
|---------|---------------------|
| 5       | 2025-07-14 23:59:55 |
| 4       | 2025-07-14 23:59:45 |
| 3       | 2025-07-14 23:59:30 |
| 2       | 2025-07-14 23:59:15 |
| 1       | 2025-07-14 23:59:00 |
| 0       | 2025-07-14 23:58:00 |

Select the timestamp for version 1 and query the table.

In [None]:
# Get timestamp of version 1
version1_timestamp = history.filter("version = 1").select("timestamp").collect()[0][0]

# Read using timestamp
spark.read.format("delta").option("timestampAsOf", version1_timestamp).load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|100000.0|
|          2|    Bob|       Sales| 80000.0|
|          3|Charlie|   Marketing| 90000.0|
|          4|  David| Engineering| 95000.0|
|          5|    Eve|       Sales| 85000.0|
+-----------+-------+------------+--------+
```
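A timestamp does not have to match a commit exactly: `timestampAsOf` resolves to the latest version committed at or before the given time. The sketch below models that rule in plain Python using a few of the example timestamps from this notebook. `resolve_version` is a hypothetical helper, not Delta Lake code, and it does not model how Delta treats timestamps later than the last commit (which can raise an error).

```python
from datetime import datetime


def resolve_version(history, as_of):
    """Pick the latest version whose commit timestamp is at or before `as_of`."""
    candidates = [(timestamp, version) for version, timestamp in history if timestamp <= as_of]
    if not candidates:
        raise ValueError("timestamp is earlier than the first commit")
    return max(candidates)[1]


# (version, commit timestamp) pairs taken from the example history output
history = [
    (0, datetime(2025, 7, 14, 23, 58, 0)),
    (1, datetime(2025, 7, 14, 23, 59, 0)),
    (2, datetime(2025, 7, 14, 23, 59, 15)),
    (3, datetime(2025, 7, 14, 23, 59, 30)),
    (4, datetime(2025, 7, 14, 23, 59, 45)),
]

# A time between the commits of versions 1 and 2 resolves to version 1
print(resolve_version(history, datetime(2025, 7, 14, 23, 59, 10)))
# prints 1
```

This is why passing the exact commit timestamp of version 1, as in the cell above, returns version 1 itself.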

### Step 6: Restore to a Previous Version

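We restore the table to version 1, reverting every change made after the initial append (the update, delete, merge, and schema change). Because the schema is restored as well, the `hire_date` column no longer appears in the current table.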

In [None]:
# Restore to version 1
delta_table.restoreToVersion(1)

### Step 7: View Data After Restore

In [None]:
# Read and display the Delta table
spark.read.format("delta").load("./delta-employees").show()

**Expected Output:**

```
+-----------+-------+------------+--------+
|employee_id|   name|  department|  salary|
+-----------+-------+------------+--------+
|          1|  Alice| Engineering|100000.0|
|          2|    Bob|       Sales| 80000.0|
|          3|Charlie|   Marketing| 90000.0|
|          4|  David| Engineering| 95000.0|
|          5|    Eve|       Sales| 85000.0|
+-----------+-------+------------+--------+
```

**Note**: Restoring creates a new version in the transaction log, which you can verify using `DESCRIBE HISTORY`.
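One way to confirm this from Python is to look at the most recent history entry through the `DeltaTable` API. The sketch below assumes `delta_table` is the object loaded earlier in this notebook; `latest_operation` is an illustrative helper name.

```python
def latest_operation(delta_table):
    """Return (version, operation) for the most recent commit of a DeltaTable."""
    # history(1) limits the result to the latest commit only
    row = delta_table.history(1).select("version", "operation").collect()[0]
    return row[0], row[1]


# With the table from this notebook:
# latest_operation(delta_table)  # the operation is expected to be "RESTORE" after Step 6
```

Earlier versions remain in the log after a restore, so they can still be queried with time travel until they are removed by log retention or `VACUUM`.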

## Conclusion

This notebook demonstrated Delta Lake’s time travel and data versioning capabilities. You learned how to:
- Create and modify Delta tables.
- Use time travel to query previous versions by version number or timestamp.
- Perform updates, deletes, merges, and schema evolution.
- Restore a table to a previous state.

These features make Delta Lake a robust solution for managing big data with reliability and flexibility. For more details, explore the [Delta Lake Documentation](https://docs.delta.io/latest/index.html) and [Databricks Time Travel Blog](https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html).