## Set Up Spark Session for Delta Lake on MinIO

This cell configures a SparkSession to work with **Delta Lake** using files stored on **MinIO** (S3-compatible storage).  
The setup includes:

- **Delta Lake support** via the `delta-spark` package
- **S3A configuration** to connect to MinIO with an access key and secret key
- **Hadoop AWS** and the **AWS SDK bundle** for S3 integration

The `.config(...)` options enable Delta SQL features, register the Delta catalog, and tell Spark how to connect to MinIO using the S3A protocol.



In [2]:
from pyspark.sql import SparkSession

packages = ",".join([
    "io.delta:delta-spark_2.12:3.2.0",       
    "org.apache.hadoop:hadoop-aws:3.3.4",
    "com.amazonaws:aws-java-sdk-bundle:1.12.530",
])

# Create SparkSession with Delta and MinIO support
spark = (
    SparkSession.builder
      .appName("Delta Lakehouse on MinIO")
      .config("spark.jars.packages", packages)
      # Enable Delta Lake SQL support
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      # Configure S3A (MinIO) access
      .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
      .config("spark.hadoop.fs.s3a.path.style.access", "true") # Enable path-style access (important for MinIO)
      .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") # Disable SSL for local setup
      .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
      .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
      .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
      .getOrCreate()
)
print("Spark:", spark.version)


Spark: 3.5.3
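
The access key and secret key in the cell above are hardcoded, which is fine for a local MinIO sandbox. As a minimal sketch, assuming the credentials are exported as environment variables (the names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, and `MINIO_SECRET_KEY` are hypothetical, not something Spark or MinIO defines), the same S3A options could be assembled like this:

```python
import os


def s3a_options(env=None):
    """Build the S3A settings used above, reading secrets from `env`."""
    env = os.environ if env is None else env
    # Hypothetical variable names; adjust them to your deployment.
    endpoint = env.get("MINIO_ENDPOINT", "http://minio:9000")
    use_ssl = endpoint.startswith("https://")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
        # A missing variable raises KeyError instead of falling back silently.
        "spark.hadoop.fs.s3a.access.key": env["MINIO_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["MINIO_SECRET_KEY"],
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    }


def apply_options(builder, options):
    """Pass every key/value pair to builder.config(key, value)."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder


# Demo with an explicit mapping instead of the real environment.
demo = s3a_options({"MINIO_ACCESS_KEY": "minioadmin", "MINIO_SECRET_KEY": "minioadmin"})
print(demo["spark.hadoop.fs.s3a.endpoint"])
```

In the notebook, `apply_options(SparkSession.builder.appName("Delta Lakehouse on MinIO"), s3a_options())` would return a builder that can be extended with the Delta settings before calling `.getOrCreate()`.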


## Create Delta Table from Parquet Files

This step creates a new Delta Lake table `nyc.taxis_delta` in the `nyc` database.

- The table is created using the `delta` format.
- Data is loaded from two Parquet files stored in MinIO.
- The table's data files and transaction log are written to `s3a://my-bucket/delta-lakehouse/nyc/taxis`.


In [3]:

spark.sql("CREATE DATABASE IF NOT EXISTS nyc")

spark.sql("""
CREATE TABLE IF NOT EXISTS nyc.taxis_delta
USING delta
LOCATION 's3a://my-bucket/delta-lakehouse/nyc/taxis'
AS
SELECT * FROM parquet.`s3a://my-bucket/raw-files/yellow_tripdata_2025-05.parquet`
UNION ALL
SELECT * FROM parquet.`s3a://my-bucket/raw-files/yellow_tripdata_2025-06.parquet`
""")


DataFrame[]

## Verify Row Count in Delta Table

This query counts the rows in the `nyc.taxis_delta` table.  
The total should equal the combined row count of the two source Parquet files.

In [4]:
spark.sql("SELECT COUNT(*) FROM nyc.taxis_delta").show()


+--------+
|count(1)|
+--------+
| 8914805|
+--------+
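
A row count on its own does not show that nothing was dropped during the load. One way to check, sketched below under the assumption that `spark` is the session created earlier, is to count the source files directly and compare the sum with the table. The helper names `count_rows` and `check_load` are illustrative and not part of any library.

```python
def count_rows(spark, relation):
    """Return COUNT(*) for a table name or a format.`path` relation."""
    return spark.sql(f"SELECT COUNT(*) AS n FROM {relation}").first()["n"]


def check_load(spark, table, source_paths):
    """Compare the table's row count with the sum of its source files."""
    expected = sum(count_rows(spark, f"parquet.`{path}`") for path in source_paths)
    actual = count_rows(spark, table)
    return expected, actual, expected == actual


# Usage in the notebook (not executed here because it needs the live session):
# check_load(spark, "nyc.taxis_delta", [
#     "s3a://my-bucket/raw-files/yellow_tripdata_2025-05.parquet",
#     "s3a://my-bucket/raw-files/yellow_tripdata_2025-06.parquet",
# ])
```

The comparison only holds right after the initial load, before any further inserts or updates change the table.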



## Inspect Delta Table Metadata and History

The first query (`DESCRIBE DETAIL`) shows metadata about the `nyc.taxis_delta` table, such as format, location, creation time, partition columns, number of files, size in bytes, and protocol versions.

The second query (`DESCRIBE HISTORY`) displays the table's commit history, including operations such as `CREATE TABLE AS SELECT`, `WRITE`, or `MERGE`, along with timestamps and user information.



In [5]:
spark.sql(""" 
DESCRIBE DETAIL nyc.taxis_delta;
""").show()

spark.sql(""" 
DESCRIBE HISTORY nyc.taxis_delta;
""").show()

+------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------+----------------+-----------------+--------+-----------+----------+----------------+----------------+--------------+
|format|                  id|                name|description|            location|           createdAt|       lastModified|partitionColumns|clusteringColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion| tableFeatures|
+------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------+----------------+-----------------+--------+-----------+----------+----------------+----------------+--------------+
| delta|1154326b-475a-4ed...|spark_catalog.nyc...|       NULL|s3a://my-bucket/d...|2025-08-09 16:48:...|2025-08-09 16:48:21|              []|               []|      10|  183028768|        {}|               3|               7|[timestampNtz]|
+------+--------------------+-------

## Simulate New Batch Ingestion and Track Changes

A small random sample (~10%) from the June Parquet file is inserted into the `nyc.taxis_delta` table to simulate a new batch of data. Because `rand()` is called without a seed, the exact number of inserted rows varies between runs.

After the insert, `DESCRIBE HISTORY` is used again to verify that a new commit was recorded.  
This makes it possible to track how and when the table was updated, a core feature of Delta Lake's transaction log.


In [6]:

spark.sql("""INSERT INTO nyc.taxis_delta
SELECT * FROM parquet.`s3a://my-bucket/raw-files/yellow_tripdata_2025-06.parquet`
WHERE rand() < 0.10;
""")

spark.sql("""DESCRIBE HISTORY nyc.taxis_delta;
""").show()



+-------+-------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|version|          timestamp|userId|userName|           operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|userMetadata|          engineInfo|
+-------+-------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|      1|2025-08-09 16:49:45|  NULL|    NULL|               WRITE|{mode -> Append, ...|NULL|    NULL|     NULL|          0|  Serializable|         true|{numFiles -> 6, n...|        NULL|Apache-Spark/3.5....|
|      0|2025-08-09 16:48:21|  NULL|    NULL|CREATE TABLE AS S...|{partitionBy -> [...|NULL|    NULL|     NULL|       NULL|  Serializable|         true|{numFiles -> 11,
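
Each version listed by `DESCRIBE HISTORY` corresponds to a JSON commit file in the table's `_delta_log` directory, named after the version number zero-padded to 20 digits. The sketch below only builds those paths for the table location used in this notebook. The helper name `commit_file` is illustrative, and nothing is read from MinIO.

```python
def commit_file(table_location, version):
    """Return the path of the JSON commit file for a given table version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{table_location.rstrip('/')}/_delta_log/{version:020d}.json"


location = "s3a://my-bucket/delta-lakehouse/nyc/taxis"
for version in (0, 1):
    print(commit_file(location, version))
```

Listing that directory in the MinIO console after each step is a simple way to see new commits appear.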

## Query Delta Table at Specific Versions

These queries read the `nyc.taxis_delta` table as of a specific commit version using Delta Lake’s **time travel** feature:

- `VERSION AS OF 0` returns the original state of the table (after the initial load).
- `VERSION AS OF 1` includes the additional rows from the simulated batch insert.


In [7]:
spark.sql(""" 
SELECT COUNT(*) 
FROM delta.`s3a://my-bucket/delta-lakehouse/nyc/taxis` VERSION AS OF 0;""").show()

spark.sql(""" 
SELECT COUNT(*) 
FROM delta.`s3a://my-bucket/delta-lakehouse/nyc/taxis` VERSION AS OF 1;""").show()



+--------+
|count(1)|
+--------+
| 8914805|
+--------+

+--------+
|count(1)|
+--------+
| 9346535|
+--------+
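
The two counts above can also be compared programmatically. The following is a minimal sketch, assuming `spark` is the session created earlier. The helper names `count_at_version` and `rows_added` are hypothetical.

```python
def count_at_version(spark, path, version):
    """Count the rows of a path-based Delta table at a given version."""
    query = f"SELECT COUNT(*) AS n FROM delta.`{path}` VERSION AS OF {int(version)}"
    return spark.sql(query).first()["n"]


def rows_added(spark, path, old_version, new_version):
    """Return the row-count difference between two versions of the table."""
    return count_at_version(spark, path, new_version) - count_at_version(
        spark, path, old_version
    )


# Usage in the notebook (needs the live session):
# rows_added(spark, "s3a://my-bucket/delta-lakehouse/nyc/taxis", 0, 1)
```

With the counts shown above, the difference is 9346535 - 8914805 = 431730 rows, which is the size of the sampled batch in this particular run.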



## Restore Delta Table to an Earlier Version

This command rolls back the `nyc.taxis_delta` table to version 0, undoing the batch insert.
The `DESCRIBE HISTORY` query then shows a new `RESTORE` entry (version 2): the rollback is itself recorded as a commit in the Delta transaction log, and the earlier versions remain in the history.


In [8]:
spark.sql(""" 
RESTORE TABLE nyc.taxis_delta TO VERSION AS OF 0;""")

spark.sql(""" 
DESCRIBE HISTORY nyc.taxis_delta;""").show()

+-------+-------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|version|          timestamp|userId|userName|           operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|userMetadata|          engineInfo|
+-------+-------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|      2|2025-08-09 16:50:48|  NULL|    NULL|             RESTORE|{version -> 0, ti...|NULL|    NULL|     NULL|          1|  Serializable|        false|{numRestoredFiles...|        NULL|Apache-Spark/3.5....|
|      1|2025-08-09 16:49:45|  NULL|    NULL|               WRITE|{mode -> Append, ...|NULL|    NULL|     NULL|          0|  Serializable|         true|{numFiles -> 6, 
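
Delta Lake's `RESTORE` also accepts a timestamp in place of a version number. The sketch below builds either form of the statement. The helper name `restore_sql` is illustrative, and the timestamp in the demo is a hypothetical value chosen to fall between the commits of version 0 and version 1 shown in the history.

```python
def restore_sql(table, version=None, timestamp=None):
    """Build a RESTORE statement targeting a version or a timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"RESTORE TABLE {table} TO VERSION AS OF {int(version)}"
    return f"RESTORE TABLE {table} TO TIMESTAMP AS OF '{timestamp}'"


print(restore_sql("nyc.taxis_delta", version=0))
# Hypothetical timestamp between the two commits in the history output.
print(restore_sql("nyc.taxis_delta", timestamp="2025-08-09 16:49:00"))
```

Either string can be passed to `spark.sql(...)`. Restoring by version is the less ambiguous choice when the version number is known.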

## Add and Populate a New Column in Delta Table

This step adds a new column `fare_bucket` to the `nyc.taxis_delta` table and fills it with values based on `total_amount`:

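- `'low'` for `total_amount` below 10
- `'mid'` for `total_amount` of at least 10 and below 30
- `'high'` for `total_amount` of 30 and above

Rows where `total_amount` is NULL are skipped by the `WHERE` clause, so their `fare_bucket` stays NULL.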

The final query groups the data by `fare_bucket` to show the distribution of fare ranges.


In [10]:
spark.sql("""ALTER TABLE nyc.taxis_delta ADD COLUMNS (fare_bucket STRING);""")

spark.sql("""UPDATE nyc.taxis_delta
SET fare_bucket = CASE
  WHEN total_amount < 10 THEN 'low'
  WHEN total_amount < 30 THEN 'mid'
  ELSE 'high'
END
WHERE total_amount IS NOT NULL;""")

spark.sql("""SELECT fare_bucket, COUNT(*) 
FROM nyc.taxis_delta 
GROUP BY fare_bucket;""").show()

+-----------+--------+
|fare_bucket|count(1)|
+-----------+--------+
|        low|  765654|
|        mid| 5743528|
|       high| 2405623|
+-----------+--------+
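
The boundaries of the `CASE` expression are easy to misread, so here is the same logic as a plain Python function (an illustration only, not something the notebook runs against the table):

```python
def fare_bucket(total_amount):
    """Mirror the CASE expression used in the UPDATE statement."""
    if total_amount is None:
        # The UPDATE's WHERE clause skips NULLs, so the column stays NULL.
        return None
    if total_amount < 10:
        return "low"
    if total_amount < 30:
        return "mid"
    return "high"


for amount in (9.99, 10, 29.99, 30, None):
    print(amount, "->", fare_bucket(amount))
```

A value of exactly 10 falls into `'mid'` and exactly 30 into `'high'`. Negative totals, if the data contains any, land in `'low'`.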



## Partition Evolution in Delta Lake

Delta Lake does not support partition evolution.  
Once a table’s partitioning is defined at creation, it cannot be changed without recreating the table and reloading the data.
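
One way to recreate the table with a different layout is a `CREATE TABLE ... AS SELECT` into a new, partitioned table. The sketch below only builds that statement. The target table name `nyc.taxis_delta_by_vendor` and the partition column `VendorID` are assumptions for illustration, so check the table's actual schema before choosing a column.

```python
def repartitioned_copy_sql(source_table, target_table, partition_columns):
    """Build a CTAS statement that copies a table into a partitioned Delta table."""
    if not partition_columns:
        raise ValueError("at least one partition column is required")
    columns = ", ".join(partition_columns)
    return (
        f"CREATE TABLE {target_table}\n"
        f"USING delta\n"
        f"PARTITIONED BY ({columns})\n"
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical target table and partition column.
print(repartitioned_copy_sql("nyc.taxis_delta", "nyc.taxis_delta_by_vendor", ["VendorID"]))
```

Passing the resulting string to `spark.sql(...)` rewrites all of the data, so the cost grows with the size of the table.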
