# Apache Iceberg Tables in Lakehouse Lab

This notebook demonstrates Apache Iceberg table format features:
- Time travel queries
- Schema evolution
- ACID transactions
- Snapshot management

**Prerequisites:** Lakehouse Lab must be installed with the `--iceberg` flag, which enables the Iceberg configuration this notebook relies on.

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from datetime import datetime

# Get Iceberg JAR paths
iceberg_jars = [
    "/home/jovyan/work/iceberg-jars/iceberg-spark-runtime-3.5_2.12-1.5.0.jar",
    "/home/jovyan/work/iceberg-jars/iceberg-aws-1.5.0.jar"
]

# Check if JAR files exist
missing_jars = []
for jar in iceberg_jars:
    if not os.path.exists(jar):
        missing_jars.append(jar)

if missing_jars:
    print("❌ Missing Iceberg JAR files:")
    for jar in missing_jars:
        print(f"   - {jar}")
    print("\nPlease ensure you ran installation with --iceberg flag and that init-compute.sh completed successfully.")
    raise FileNotFoundError("Required Iceberg JAR files not found")

# Configure Spark with Iceberg support and proper JARs
spark = SparkSession.builder \
    .appName("Lakehouse Lab - Iceberg Demo") \
    .config("spark.jars", ",".join(iceberg_jars)) \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hive") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "s3a://lakehouse/iceberg-warehouse/") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", os.environ.get('MINIO_ROOT_USER', 'minio')) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ.get('MINIO_ROOT_PASSWORD', 'minio123')) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("✅ Spark session with Iceberg support initialized!")
print(f"Spark version: {spark.version}")
print("🧊 Iceberg JARs loaded:")
for jar in iceberg_jars:
    print(f"   - {os.path.basename(jar)}")
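
The catalog settings above follow a fixed key pattern: `spark.sql.catalog.<name>` plus its `.type` and `.warehouse` suffixes. As a minimal sketch, the hypothetical helper `iceberg_catalog_conf` below (not part of Lakehouse Lab or Iceberg) builds those keys as a plain dict, so they can be inspected or reused for another catalog before being passed to the builder.

```python
def iceberg_catalog_conf(name, warehouse, catalog_type="hadoop"):
    """Return the Spark settings that register one Iceberg catalog.

    Hypothetical helper for illustration; the keys mirror the ones
    passed to SparkSession.builder in the cell above.
    """
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": catalog_type,
        f"{prefix}.warehouse": warehouse,
    }


conf = iceberg_catalog_conf("iceberg", "s3a://lakehouse/iceberg-warehouse/")
for key, value in sorted(conf.items()):
    print(f"{key} = {value}")
```

Each pair could then be applied with `builder.config(key, value)` in a loop instead of spelling out every `.config(...)` call.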

## 1. Create an Iceberg Table

Let's create a sample Iceberg table with customer data:

In [None]:
# Sample customer data; signup_date is kept as an ISO-8601 string
customers_data = [
    (1, "Alice Johnson", "alice@email.com", "2023-01-15", "Premium"),
    (2, "Bob Smith", "bob@email.com", "2023-02-20", "Standard"),
    (3, "Carol Davis", "carol@email.com", "2023-03-10", "Premium"),
    (4, "David Wilson", "david@email.com", "2023-04-05", "Standard")
]

schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("email", StringType(), False),
    StructField("signup_date", StringType(), False),
    StructField("tier", StringType(), False)
])

df = spark.createDataFrame(customers_data, schema)

# Create Iceberg table
df.writeTo("iceberg.customers").create()

print("✅ Created Iceberg table 'iceberg.customers'")
spark.sql("SELECT * FROM iceberg.customers").show()

## 2. Time Travel Queries

Iceberg lets you query a table as it existed at any retained snapshot, addressed by snapshot ID or by timestamp:

In [None]:
# Get current snapshot information
snapshots = spark.sql("SELECT * FROM iceberg.customers.snapshots")
print("📸 Table snapshots:")
snapshots.select("snapshot_id", "committed_at", "operation").show()

# Store the earliest snapshot ID for time travel
first_snapshot = snapshots.orderBy("committed_at").first()["snapshot_id"]
print(f"First snapshot ID: {first_snapshot}")

In [None]:
# Add more data to demonstrate time travel
new_customers = [
    (5, "Eve Brown", "eve@email.com", "2024-01-15", "Premium"),
    (6, "Frank Miller", "frank@email.com", "2024-02-20", "Standard")
]

new_df = spark.createDataFrame(new_customers, schema)
new_df.writeTo("iceberg.customers").append()

print("✅ Added new customers")
print("Current data:")
spark.sql("SELECT COUNT(*) as current_count FROM iceberg.customers").show()
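
With a second snapshot committed, a time travel read can target either a snapshot ID (`VERSION AS OF`) or a wall-clock time (`TIMESTAMP AS OF`). The sketch below assumes Spark 3.3 or later, where both clauses are available; `time_travel_sql` is a hypothetical helper that only formats the statement, and the snapshot ID in the example is a placeholder rather than a value from this table.

```python
from datetime import datetime


def time_travel_sql(table, snapshot_id=None, as_of=None):
    """Build a SELECT that reads `table` at a snapshot ID or a timestamp.

    Hypothetical helper; exactly one of snapshot_id / as_of must be given.
    """
    if (snapshot_id is None) == (as_of is None):
        raise ValueError("pass exactly one of snapshot_id or as_of")
    if snapshot_id is not None:
        # int() rejects anything that is not a plain snapshot ID
        return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"
    stamp = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{stamp}'"


# The snapshot ID below is a made-up placeholder, not a value from this table
print(time_travel_sql("iceberg.customers", snapshot_id=1234567890123456789))
print(time_travel_sql("iceberg.customers", as_of=datetime(2024, 1, 1, 12, 0, 0)))
```

Either string can be passed to `spark.sql(...)`. The timestamp form selects the snapshot that was current at that time, so it is useful when you know when a bad write happened but not its snapshot ID.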

In [None]:
# Query historical data using snapshot ID
print("🕰️ Time travel query - data at first snapshot:")
historical_query = f"SELECT COUNT(*) as historical_count FROM iceberg.customers VERSION AS OF {first_snapshot}"
spark.sql(historical_query).show()

print("Comparison:")
spark.sql(f"""
SELECT 'Current' AS timepoint, COUNT(*) AS record_count
FROM iceberg.customers
UNION ALL
SELECT 'Historical' AS timepoint, COUNT(*) AS record_count
FROM iceberg.customers VERSION AS OF {first_snapshot}
""").show()

## 3. Schema Evolution

Iceberg schema changes are metadata-only: adding a column does not rewrite existing data files, and rows written before the change read the new column as NULL:

In [None]:
# Add a new column to the table
spark.sql("ALTER TABLE iceberg.customers ADD COLUMN phone STRING")

print("✅ Added 'phone' column to table")
print("Updated schema:")
spark.sql("DESCRIBE iceberg.customers").show()

In [None]:
# Insert data with the new column
evolved_customers = [
    (7, "Grace Lee", "grace@email.com", "2024-03-15", "Premium", "+1-555-0123")
]

# Build a new StructType; schema.add() would mutate the original schema in place
evolved_schema = StructType(schema.fields + [StructField("phone", StringType(), True)])
evolved_df = spark.createDataFrame(evolved_customers, evolved_schema)
evolved_df.writeTo("iceberg.customers").append()

print("✅ Inserted data with new schema")
spark.sql("SELECT * FROM iceberg.customers WHERE phone IS NOT NULL").show()
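
Adding a column is safe because Iceberg identifies columns by a numeric field ID rather than by name or position. The following is a toy model of that idea in plain Python, not Iceberg's actual reader: the dicts and the `read_row` function are hypothetical and stand in for data files and schema projection.

```python
# Toy model only: real Iceberg keeps field IDs in table metadata and in
# the data files; here a stored row is a dict keyed by field ID.
schema_v1 = {1: "customer_id", 2: "name"}
old_file_row = {1: 1, 2: "Alice Johnson"}  # written under schema_v1

# ADD COLUMN assigns a new field ID; no existing row is rewritten
schema_v2 = {**schema_v1, 3: "phone"}
new_file_row = {1: 7, 2: "Grace Lee", 3: "+1-555-0123"}  # written under schema_v2


def read_row(row, schema):
    """Project a stored row onto a schema; absent field IDs read as None."""
    return {name: row.get(field_id) for field_id, name in schema.items()}


print(read_row(old_file_row, schema_v2))
print(read_row(new_file_row, schema_v2))
```

The old row comes back with `phone` set to `None`, which mirrors why the query above needs `WHERE phone IS NOT NULL` to isolate rows written after the schema change.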

## 4. Table Maintenance

Iceberg exposes metadata tables (`history`, `snapshots`, `files`) for inspecting a table's snapshots and data files:

In [None]:
# View table history
print("📋 Table history:")
spark.sql("SELECT * FROM iceberg.customers.history").show()

print("📊 Current snapshots:")
spark.sql("SELECT snapshot_id, committed_at, operation, summary FROM iceberg.customers.snapshots").show(truncate=False)

In [None]:
# View table files
print("📁 Table files:")
files_df = spark.sql("SELECT file_path, file_format, record_count FROM iceberg.customers.files")
files_df.show(truncate=False)
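
The metadata tables above only inspect the table. Cleanup itself is done through stored procedures such as `expire_snapshots`. As a sketch, the hypothetical helper `expire_snapshots_sql` below formats such a call without running it; the cutoff date is an arbitrary example. It is deliberately not executed here, because expiring old snapshots could remove the snapshot that the rollback section relies on.

```python
from datetime import datetime, timedelta


def expire_snapshots_sql(table, older_than, retain_last=1, catalog="iceberg"):
    """Build a CALL to Iceberg's expire_snapshots procedure.

    Hypothetical helper; it only formats the statement and does not run it.
    """
    stamp = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{stamp}', "
        f"retain_last => {int(retain_last)})"
    )


# Example cutoff: seven days before a fixed reference time
cutoff = datetime(2024, 4, 1) - timedelta(days=7)
print(expire_snapshots_sql("iceberg.customers", cutoff, retain_last=2))
```

Setting `retain_last` keeps a minimum number of recent snapshots even when they are older than the cutoff, which preserves a short time travel window after cleanup.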

## 5. Rollback Capability

Iceberg lets you roll a table back to a previous snapshot:

In [None]:
# Show current count
print("Before rollback:")
spark.sql("SELECT COUNT(*) as count FROM iceberg.customers").show()

# Rollback to first snapshot
rollback_sql = f"CALL iceberg.system.rollback_to_snapshot('iceberg.customers', {first_snapshot})"
spark.sql(rollback_sql)

print("✅ Rolled back to first snapshot")
print("After rollback:")
spark.sql("SELECT COUNT(*) as count FROM iceberg.customers").show()
# Rollback restores the data only; the 'phone' column stays in the schema and reads as NULL
spark.sql("SELECT * FROM iceberg.customers").show()
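
A rollback repoints the table's current snapshot; it does not delete the snapshots committed after it, which remain in the snapshot log until they are expired. The class below is a toy model of that behavior in plain Python, not Iceberg's implementation; `ToyTable` and its methods are hypothetical names.

```python
class ToyTable:
    """Toy model of a snapshot log; not Iceberg's implementation."""

    def __init__(self):
        self.snapshots = {}  # snapshot_id -> tuple of rows visible at that snapshot
        self.current = None  # ID of the current snapshot
        self._next_id = 1

    def append(self, rows):
        """Commit a new snapshot holding the current rows plus `rows`."""
        base = self.snapshots.get(self.current, ())
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = base + tuple(rows)
        self.current = snapshot_id
        return snapshot_id

    def rollback_to(self, snapshot_id):
        """Repoint `current`; later snapshots stay in the log."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current = snapshot_id

    def rows(self):
        """Return the rows visible at the current snapshot."""
        return self.snapshots[self.current]


table = ToyTable()
first = table.append(["alice", "bob"])
table.append(["eve"])
table.rollback_to(first)
print(len(table.rows()), sorted(table.snapshots))
```

After the rollback the table shows the two original rows again, while both snapshots are still present in the log, so the newer state can still be reached by its snapshot ID.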

## 🎉 Summary

This notebook demonstrated key Apache Iceberg features:

✅ **ACID Transactions** - Each write commits atomically as a new snapshot

✅ **Time Travel** - Query data as it existed at any retained snapshot

✅ **Schema Evolution** - Add columns without breaking existing queries

✅ **Snapshot Management** - View and manage table versions

✅ **Rollback Capability** - Easily revert to previous states

### Next Steps:
- Explore partition evolution with `ALTER TABLE ... REPLACE PARTITION FIELD`
- Set up branch and tag management for complex workflows
- Integrate with Spark Structured Streaming for real-time Iceberg updates
- Use Iceberg tables in your production analytics pipelines

In [None]:
# Cleanup (optional)
# spark.sql("DROP TABLE iceberg.customers")
# print("🧹 Cleaned up demo table")

# Stop Spark session
spark.stop()
print("✅ Spark session closed")