# Delta Lake Basics with PySpark – Upserts & Time Travel

**Base dataset**: `samples.nyctaxi.trips`  
**Target**: a small Delta table in your workspace (in DBFS)

In this notebook you will:
1. Create a small Delta table from NYC Taxi data
2. Wrap the table in a `DeltaTable` object and inspect its history
3. Run an UPDATE operation
4. Use `MERGE INTO` for upserts (update matching rows, insert new ones)
5. Explore Delta time travel (view older versions of data)


In [None]:
from pyspark.sql import functions as F
from delta.tables import DeltaTable

# Path in DBFS for demo Delta table (adjust as needed)
delta_path = "/tmp/demo_nyctaxi_delta"

# Load base data
nyc_taxi_df = spark.read.table("samples.nyctaxi.trips")

# For demo purposes, take a small subset.
# Column names are assumed; run nyc_taxi_df.printSchema() and adjust if they differ.
base_df = (
    nyc_taxi_df
    .select("vendor_id", "tpep_pickup_datetime", "tpep_dropoff_datetime",
            "passenger_count", "trip_distance", "fare_amount")
    .limit(10000)
)

display(base_df.limit(5))


## 1. Write Initial Delta Table

We will:
- Overwrite any existing data at `delta_path`
- Save as a Delta table


In [None]:
# Clean up any previous run so the table history starts fresh at version 0
dbutils.fs.rm(delta_path, recurse=True)

(
    base_df
    .write
    .format("delta")
    .mode("overwrite")
    .save(delta_path)
)

# Read it back as a Delta table
delta_df = spark.read.format("delta").load(delta_path)
print("Initial row count:", delta_df.count())
display(delta_df.limit(5))


## 2. Convert Path to a `DeltaTable` Object

`DeltaTable` provides programmatic APIs for:
- MERGE
- UPDATE
- DELETE
- History
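
DELETE is not used later in this notebook, so here is a rough sketch of how it might look together with a history lookup. `delete_where` is a hypothetical helper name, and the predicate in the usage comment is an illustrative assumption; only `DeltaTable.delete` and `DeltaTable.history` are taken from the Delta Python API.

```python
def delete_where(delta_table, predicate):
    """Delete rows matching a SQL predicate, then return the newest history entry.

    `delta_table` is expected to be a delta.tables.DeltaTable. The helper
    name is illustrative and not part of the Delta API.
    """
    # DeltaTable.delete accepts a SQL string (or a Column) as its condition
    delta_table.delete(predicate)
    # history(1) limits the result to the most recent commit
    return delta_table.history(1)


# Usage in a notebook (the predicate is an assumed example):
# last_commit = delete_where(DeltaTable.forPath(spark, delta_path), "fare_amount < 0")
# last_commit.select("version", "operation", "operationParameters").show(truncate=False)
```

Each DELETE is recorded as its own version, so the deleted rows stay readable through time travel until old files are cleaned up.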


In [None]:
delta_table = DeltaTable.forPath(spark, delta_path)

# Show Delta table history
delta_table.history().show(truncate=False)


## 3. Simple UPDATE Operation

Example:
- Flag "long" trips where `trip_distance >= 10`
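
The same flagging can also be written as a SQL `UPDATE` against the path-based table. The sketch below only builds the statement. `update_sql` is a hypothetical helper, and it assumes the `is_long_trip` column already exists in the table schema, because UPDATE can only assign to existing columns.

```python
def update_sql(path, assignments, condition):
    """Build a Delta SQL UPDATE statement for a path-based table.

    `assignments` maps column names to SQL expressions (as strings).
    """
    set_clause = ", ".join(f"{col} = {expr}" for col, expr in assignments.items())
    return f"UPDATE delta.`{path}` SET {set_clause} WHERE {condition}"


stmt = update_sql(
    "/tmp/demo_nyctaxi_delta",
    {"is_long_trip": "true"},
    "trip_distance >= 10",
)
print(stmt)
# In a notebook, run it with: spark.sql(stmt)
```

Rows that do not satisfy the condition are left untouched, so their `is_long_trip` value stays as it was (NULL for a freshly added column).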


In [None]:
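# UPDATE cannot create a column, so add `is_long_trip` to the schema first
spark.sql(f"ALTER TABLE delta.`{delta_path}` ADD COLUMNS (is_long_trip BOOLEAN)")

# Refresh the handle so it sees the new schema
delta_table = DeltaTable.forPath(spark, delta_path)

# Flag long trips; rows that do not match the condition keep is_long_trip = NULL
delta_table.update(
    condition="trip_distance >= 10",
    set={"is_long_trip": "true"}
)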

# Read and confirm
updated_df = spark.read.format("delta").load(delta_path)
display(updated_df.limit(10))


## 4. MERGE INTO – Upsert Scenario

We'll simulate:
- A small set of updated rows for a given vendor
- New rows that do not yet exist

We use `MERGE` to:
- UPDATE matching records
- INSERT non-matching records
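
One way to keep the match condition in step with the key columns is to build it from a list. This is a small sketch: `build_merge_condition` is a hypothetical helper, the aliases `t` (target) and `u` (updates) are arbitrary choices, and the key columns are assumed to identify a row.

```python
def build_merge_condition(key_columns, target_alias="t", source_alias="u"):
    """Return an equality condition over all key columns, joined with AND."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    return " AND ".join(
        f"{target_alias}.{col} = {source_alias}.{col}" for col in key_columns
    )


condition = build_merge_condition(
    ["vendor_id", "tpep_pickup_datetime", "tpep_dropoff_datetime"]
)
print(condition)
```

The resulting string can be passed as the second argument of `DeltaTable.merge`, provided the table and the updates DataFrame are aliased with the same names. Because `=` never matches NULL values, source rows with a NULL key column are treated as non-matching and get inserted.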


In [None]:
from pyspark.sql import Row

# Create a small DataFrame with updates & new rows
sample_existing = (
    updated_df
    .filter("vendor_id IS NOT NULL")
    .limit(3)
    .collect()
)

rows = []
for r in sample_existing:
    rows.append(Row(
        vendor_id=r["vendor_id"],
        tpep_pickup_datetime=r["tpep_pickup_datetime"],
        tpep_dropoff_datetime=r["tpep_dropoff_datetime"],
        passenger_count=r["passenger_count"],
        trip_distance=r["trip_distance"],
        fare_amount=r["fare_amount"] + 5.0,
        # Row has no .get(); go through asDict() to tolerate a missing column
        is_long_trip=r.asDict().get("is_long_trip")
    ))

# Add a brand-new row (non-matching key)
rows.append(Row(
    vendor_id=999,
    tpep_pickup_datetime=updated_df.select(F.min("tpep_pickup_datetime")).first()[0],
    tpep_dropoff_datetime=updated_df.select(F.max("tpep_dropoff_datetime")).first()[0],
    passenger_count=1,
    trip_distance=3.0,
    fare_amount=25.0,
    is_long_trip=False
))

updates_df = spark.createDataFrame(rows)

print("Updates + inserts:")
display(updates_df)


In [None]:
# Create a DeltaTable object again
delta_table = DeltaTable.forPath(spark, delta_path)

# Merge key: (vendor_id, pickup, dropoff), assumed to identify a trip uniquely.
# MERGE fails if several source rows match the same target row.
merge_condition = '''
  t.vendor_id = u.vendor_id AND
  t.tpep_pickup_datetime = u.tpep_pickup_datetime AND
  t.tpep_dropoff_datetime = u.tpep_dropoff_datetime
'''

(
    delta_table.alias("t")
    .merge(
        updates_df.alias("u"),
        merge_condition
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

print("After MERGE, row count:", spark.read.format("delta").load(delta_path).count())
delta_table.history().show(truncate=False)


## 5. Delta Time Travel

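Delta records every commit (write, UPDATE, MERGE, ...) as a numbered version in its transaction log.

You can query older versions with these reader options:
- `versionAsOf`: a version number from the table history
- `timestampAsOf`: a timestamp or date string; Delta reads the latest version committed at or before it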
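
A minimal sketch of how the two options might be wrapped so that exactly one of them is set per read. `read_as_of` is a hypothetical helper, not a Delta API. `spark` is the notebook's session, and the timestamp in the usage comment is a made-up value.

```python
def read_as_of(spark, path, version=None, timestamp=None):
    """Read a Delta table at a given version OR timestamp (exactly one)."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of `version` or `timestamp`")

    reader = spark.read.format("delta")
    if version is not None:
        # versionAsOf: a commit number as listed in the table history
        reader = reader.option("versionAsOf", version)
    else:
        # timestampAsOf: a timestamp or date string
        reader = reader.option("timestampAsOf", timestamp)
    return reader.load(path)


# Usage in a notebook:
# v0_df = read_as_of(spark, delta_path, version=0)
# snap_df = read_as_of(spark, delta_path, timestamp="2024-01-01 00:00:00")  # made-up value
```

A snapshot read this way has the schema of that version, so a column added by a later commit is absent from the older DataFrame.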


In [None]:
# Get full history
hist_df = delta_table.history()
display(hist_df)

versions = [row["version"] for row in hist_df.collect()]
print("Available versions:", versions)

# Read current (latest) version
current_df = spark.read.format("delta").load(delta_path)
print("Current version row count:", current_df.count())

# Read an older version (e.g., earliest)
older_version = min(versions)
old_df = (
    spark.read
         .format("delta")
         .option("versionAsOf", older_version)
         .load(delta_path)
)

print(f"Old version ({older_version}) row count:", old_df.count())
display(old_df.limit(5))
