# 🔄 Apache Iceberg Schema Evolution Tutorial

Welcome to the comprehensive Schema Evolution tutorial! In this notebook, you'll learn:

1. **Schema Evolution Fundamentals**
2. **Adding Columns Safely**
3. **Data Type Promotions**
4. **Column Removal and Renaming**
5. **Complex Schema Changes**
6. **Real-World Evolution Scenarios**
7. **Compatibility Testing**

## 📋 Prerequisites

- Docker environment is running
- Completed Project 1 (Core Concepts)
- Understanding of basic Iceberg concepts

## 1. 🚀 Initialize Environment

First, let's set up Spark with Iceberg for schema evolution experiments.

In [1]:
import os
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta
import json

# Set Python path for Spark consistency
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python'

# Stop existing Spark session if any
try:
    spark.stop()
    print("🛑 Stopped existing Spark session")
except Exception:
    print("ℹ️ No existing Spark session to stop")

# Create Spark session with Iceberg for Schema Evolution
spark = SparkSession.builder \
    .appName("IcebergSchemaEvolution") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "file:///home/jovyan/work/warehouse") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

print("✅ Spark with Iceberg initialized for Schema Evolution!")
print(f"Spark version: {spark.version}")
print(f"Warehouse location: {spark.conf.get('spark.sql.catalog.local.warehouse')}")

ℹ️ No existing Spark session to stop
✅ Spark with Iceberg initialized for Schema Evolution!
Spark version: 3.5.0
Warehouse location: file:///home/jovyan/work/warehouse


## 2. 🏗️ Schema Evolution Fundamentals

Let's start by creating a lab database and a few helper functions for inspecting schemas and row counts.

In [2]:
# Create database for schema evolution experiments
spark.sql("CREATE DATABASE IF NOT EXISTS local.schema_lab")
print("✅ Created schema evolution lab database")

# Helper function to display schema information
def show_schema_info(table_name):
    print(f"\n📋 Schema for {table_name}:")
    spark.sql(f"DESCRIBE {table_name}").show(truncate=False)
    
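def show_schema_history(table_name):
    # Iceberg does not expose a `schemas` metadata table; metadata_log_entries
    # records the schema ID that was current after each metadata commit.
    print(f"\n📚 Schema history for {table_name}:")
    try:
        spark.sql(
            f"SELECT timestamp, latest_schema_id FROM {table_name}.metadata_log_entries"
        ).show(truncate=False)
    except Exception as e:
        print(f"Schema history not available: {e}")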

def count_records(table_name):
    count = spark.sql(f"SELECT COUNT(*) as count FROM {table_name}").collect()[0]['count']
    print(f"📊 Total records in {table_name}: {count}")
    return count

✅ Created schema evolution lab database
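
Iceberg tracks every column by a numeric field ID rather than by name or position, which is what makes adding, renaming, and dropping columns safe. The sketch below is a toy model of that idea in plain Python. `ToySchema` and its methods are hypothetical names invented for illustration, not Iceberg classes, and the sample row values are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToySchema:
    """Toy model of ID-based column tracking (not Iceberg's real classes)."""
    names_by_id: Dict[int, str] = field(default_factory=dict)
    last_id: int = 0

    def add_column(self, name: str) -> int:
        # Every new column gets a fresh ID that is never reused.
        self.last_id += 1
        self.names_by_id[self.last_id] = name
        return self.last_id

    def rename_column(self, old: str, new: str) -> None:
        # Renaming only changes the name attached to an existing ID.
        for field_id, name in self.names_by_id.items():
            if name == old:
                self.names_by_id[field_id] = new
                return
        raise KeyError(old)

    def drop_column(self, name: str) -> None:
        # Dropping removes the ID from the schema; stored values are untouched.
        for field_id, current in list(self.names_by_id.items()):
            if current == name:
                del self.names_by_id[field_id]
                return
        raise KeyError(name)

    def read(self, stored_row: Dict[int, Any]) -> Dict[str, Optional[Any]]:
        # Project a stored row (keyed by field ID) through the current schema.
        # IDs missing from the stored row come back as None (NULL).
        return {name: stored_row.get(fid) for fid, name in self.names_by_id.items()}


schema = ToySchema()
for column in ["user_id", "username", "phone"]:
    schema.add_column(column)

# A row written before any evolution, keyed by field ID.
old_row = {1: 1001, 2: "alice_w", 3: "+1-555-0000"}

schema.add_column("first_name")                    # new ID 4, absent from old rows
schema.rename_column("username", "display_name")   # ID 2 keeps its values
schema.drop_column("phone")                        # ID 3 leaves the schema
schema.add_column("phone")                         # re-added column gets ID 5

result = schema.read(old_row)
print(result)
```

In this model the old row still resolves `display_name` after the rename, the added column reads as `None`, and the re-added `phone` does not pick up the value stored under the dropped column's ID.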


## 3. 🎯 Experiment 1: Adding Columns

Let's create a user table and practice adding columns safely.

In [3]:
# Clean slate - drop table if exists
spark.sql("DROP TABLE IF EXISTS local.schema_lab.users")

# Create initial user table with basic schema
create_users_sql = """
CREATE TABLE local.schema_lab.users (
    user_id bigint,
    username string,
    email string,
    created_at timestamp
) USING ICEBERG
PARTITIONED BY (days(created_at))
"""

spark.sql(create_users_sql)
print("✅ Created initial users table")

# Show initial schema
show_schema_info("local.schema_lab.users")

✅ Created initial users table

📋 Schema for local.schema_lab.users:
+--------------+----------------+-------+
|col_name      |data_type       |comment|
+--------------+----------------+-------+
|user_id       |bigint          |NULL   |
|username      |string          |NULL   |
|email         |string          |NULL   |
|created_at    |timestamp       |NULL   |
|              |                |       |
|# Partitioning|                |       |
|Part 0        |days(created_at)|       |
+--------------+----------------+-------+
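
The `days(created_at)` entry under `# Partitioning` is a partition transform: Iceberg derives the partition value from the timestamp as the number of days since 1970-01-01. A minimal sketch of that mapping in plain Python follows. The `day_partition` helper is a hypothetical name, the timestamps are sample values, and time zone handling is deliberately left out.

```python
from datetime import date, datetime
from typing import Dict, List

EPOCH = date(1970, 1, 1)


def day_partition(ts: datetime) -> int:
    """Days since 1970-01-01 for a timestamp (time zones ignored in this sketch)."""
    return (ts.date() - EPOCH).days


# Sample rows: (user_id, created_at)
rows = [
    (1001, datetime(2024, 1, 15, 10, 0, 0)),
    (1002, datetime(2024, 1, 15, 11, 0, 0)),
    (1003, datetime(2024, 1, 16, 9, 0, 0)),
    (1004, datetime(2024, 1, 16, 14, 0, 0)),
]

# Group user IDs by the derived partition value.
partitions: Dict[int, List[int]] = {}
for user_id, created_at in rows:
    partitions.setdefault(day_partition(created_at), []).append(user_id)

print(partitions)
```

Rows created on the same calendar day land in the same partition, so four rows spread over two days produce two partition values.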



In [4]:
# Insert initial data
initial_users = [
    (1001, "alice_w", "alice@example.com", datetime(2024, 1, 15, 10, 0, 0)),
    (1002, "bob_smith", "bob@example.com", datetime(2024, 1, 15, 11, 0, 0)),
    (1003, "charlie_dev", "charlie@example.com", datetime(2024, 1, 16, 9, 0, 0)),
    (1004, "diana_data", "diana@example.com", datetime(2024, 1, 16, 14, 0, 0))
]

columns = ["user_id", "username", "email", "created_at"]
users_df = spark.createDataFrame(initial_users, columns)

print("📊 Initial user data:")
users_df.show()

# Insert into Iceberg table
users_df.writeTo("local.schema_lab.users").append()
print("✅ Inserted initial user data")

# Verify data
count_records("local.schema_lab.users")

📊 Initial user data:
+-------+-----------+-------------------+-------------------+
|user_id|   username|              email|         created_at|
+-------+-----------+-------------------+-------------------+
|   1001|    alice_w|  alice@example.com|2024-01-15 10:00:00|
|   1002|  bob_smith|    bob@example.com|2024-01-15 11:00:00|
|   1003|charlie_dev|charlie@example.com|2024-01-16 09:00:00|
|   1004| diana_data|  diana@example.com|2024-01-16 14:00:00|
+-------+-----------+-------------------+-------------------+

✅ Inserted initial user data
📊 Total records in local.schema_lab.users: 4


4

### Adding Single Columns

In [5]:
# Evolution 1: Add profile information columns
print("🔄 Evolution 1: Adding profile columns...")

# Add first_name column
spark.sql("ALTER TABLE local.schema_lab.users ADD COLUMN first_name string")
print("✅ Added first_name column")

# Add last_name column
spark.sql("ALTER TABLE local.schema_lab.users ADD COLUMN last_name string")
print("✅ Added last_name column")

# Add phone column
spark.sql("ALTER TABLE local.schema_lab.users ADD COLUMN phone string")
print("✅ Added phone column")

# Show updated schema
show_schema_info("local.schema_lab.users")

🔄 Evolution 1: Adding profile columns...
✅ Added first_name column
✅ Added last_name column
✅ Added phone column

📋 Schema for local.schema_lab.users:
+--------------+----------------+-------+
|col_name      |data_type       |comment|
+--------------+----------------+-------+
|user_id       |bigint          |NULL   |
|username      |string          |NULL   |
|email         |string          |NULL   |
|created_at    |timestamp       |NULL   |
|first_name    |string          |NULL   |
|last_name     |string          |NULL   |
|phone         |string          |NULL   |
|              |                |       |
|# Partitioning|                |       |
|Part 0        |days(created_at)|       |
+--------------+----------------+-------+
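
The three `ALTER TABLE ... ADD COLUMN` statements above could also be issued as a single `ADD COLUMNS (...)` statement, which Spark SQL accepts for Iceberg tables. The helper below is a small sketch that only builds the SQL string. `build_add_columns_sql` is a hypothetical name, and it does no identifier quoting or validation, so it should only be fed trusted input.

```python
from typing import List, Tuple


def build_add_columns_sql(table: str, columns: List[Tuple[str, str]]) -> str:
    """Build one ALTER TABLE ... ADD COLUMNS statement from (name, type) pairs."""
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"ALTER TABLE {table} ADD COLUMNS ({column_list})"


statement = build_add_columns_sql(
    "local.schema_lab.users",
    [("first_name", "string"), ("last_name", "string"), ("phone", "string")],
)
print(statement)
```

The resulting string can be passed to `spark.sql(...)` in place of the three separate calls.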



In [6]:
# Verify that old data is still accessible and new columns are null
print("📊 Data after adding columns (notice new columns are null):")
spark.sql("SELECT * FROM local.schema_lab.users ORDER BY user_id").show(truncate=False)

# Insert new data with the new columns
new_users_data = [
    (1005, "eve_analyst", "eve@example.com", datetime(2024, 1, 17, 10, 0, 0), "Eve", "Analyst", "+1-555-0105"),
    (1006, "frank_admin", "frank@example.com", datetime(2024, 1, 17, 11, 0, 0), "Frank", "Admin", "+1-555-0106")
]

new_columns = ["user_id", "username", "email", "created_at", "first_name", "last_name", "phone"]
new_users_df = spark.createDataFrame(new_users_data, new_columns)

print("\n📊 New user data with profile information:")
new_users_df.show()

# Insert new data
new_users_df.writeTo("local.schema_lab.users").append()
print("✅ Inserted new users with profile data")

# Show all data
print("\n📊 All users after schema evolution:")
spark.sql("SELECT * FROM local.schema_lab.users ORDER BY user_id").show(truncate=False)

count_records("local.schema_lab.users")

📊 Data after adding columns (notice new columns are null):
+-------+-----------+-------------------+-------------------+----------+---------+-----+
|user_id|username   |email              |created_at         |first_name|last_name|phone|
+-------+-----------+-------------------+-------------------+----------+---------+-----+
|1001   |alice_w    |alice@example.com  |2024-01-15 10:00:00|NULL      |NULL     |NULL |
|1002   |bob_smith  |bob@example.com    |2024-01-15 11:00:00|NULL      |NULL     |NULL |
|1003   |charlie_dev|charlie@example.com|2024-01-16 09:00:00|NULL      |NULL     |NULL |
|1004   |diana_data |diana@example.com  |2024-01-16 14:00:00|NULL      |NULL     |NULL |
+-------+-----------+-------------------+-------------------+----------+---------+-----+


📊 New user data with profile information:
+-------+-----------+-----------------+-------------------+----------+---------+-----------+
|user_id|   username|            email|         created_at|first_name|last_name|      phone|

6

### Adding Complex Data Types

In [7]:
# Evolution 2: Add complex data types
print("🔄 Evolution 2: Adding complex data types...")

# Add preferences as a map
spark.sql("ALTER TABLE local.schema_lab.users ADD COLUMN preferences map<string,string>")
print("✅ Added preferences map column")

# Add is_premium boolean (no default is set, so existing rows read as NULL)
spark.sql("ALTER TABLE local.schema_lab.users ADD COLUMN is_premium boolean")
print("✅ Added is_premium boolean column")

# Add tags as array
spark.sql("ALTER TABLE local.schema_lab.users ADD COLUMN tags array<string>")
print("✅ Added tags array column")

# Show updated schema
show_schema_info("local.schema_lab.users")

🔄 Evolution 2: Adding complex data types...
✅ Added preferences map column
✅ Added is_premium boolean column
✅ Added tags array column

📋 Schema for local.schema_lab.users:
+--------------+------------------+-------+
|col_name      |data_type         |comment|
+--------------+------------------+-------+
|user_id       |bigint            |NULL   |
|username      |string            |NULL   |
|email         |string            |NULL   |
|created_at    |timestamp         |NULL   |
|first_name    |string            |NULL   |
|last_name     |string            |NULL   |
|phone         |string            |NULL   |
|preferences   |map<string,string>|NULL   |
|is_premium    |boolean           |NULL   |
|tags          |array<string>     |NULL   |
|              |                  |       |
|# Partitioning|                  |       |
|Part 0        |days(created_at)  |       |
+--------------+------------------+-------+



In [8]:
# Insert data with complex types using SQL
print("📊 Inserting users with complex data types...")

complex_users_sql = """
INSERT INTO local.schema_lab.users VALUES
    (1007, 'grace_premium', 'grace@example.com', TIMESTAMP '2024-01-18 09:00:00', 
     'Grace', 'Premium', '+1-555-0107', 
     map('theme', 'dark', 'language', 'en', 'notifications', 'enabled'), 
     true, 
     array('premium', 'early-adopter', 'power-user')),
    (1008, 'henry_basic', 'henry@example.com', TIMESTAMP '2024-01-18 10:00:00', 
     'Henry', 'Basic', '+1-555-0108', 
     map('theme', 'light', 'language', 'es'), 
     false, 
     array('new-user'))
"""

spark.sql(complex_users_sql)
print("✅ Inserted users with complex data types")

# Query complex data
print("\n📊 Users with complex data types:")
spark.sql("""
SELECT user_id, username, preferences, is_premium, tags 
FROM local.schema_lab.users 
WHERE preferences IS NOT NULL
ORDER BY user_id
""").show(truncate=False)

count_records("local.schema_lab.users")

📊 Inserting users with complex data types...
✅ Inserted users with complex data types

📊 Users with complex data types:
+-------+-------------+---------------------------------------------------------+----------+------------------------------------+
|user_id|username     |preferences                                              |is_premium|tags                                |
+-------+-------------+---------------------------------------------------------+----------+------------------------------------+
|1007   |grace_premium|{theme -> dark, language -> en, notifications -> enabled}|true      |[premium, early-adopter, power-user]|
|1008   |henry_basic  |{theme -> light, language -> es}                         |false     |[new-user]                          |
+-------+-------------+---------------------------------------------------------+----------+------------------------------------+

📊 Total records in local.schema_lab.users: 8


8

## 4. 📊 Experiment 2: Data Type Evolution

Let's practice safe data type promotions.
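
For reference, the Iceberg spec (format versions 1 and 2, as far as I know) allows only three type promotions: `int` to `bigint`, `float` to `double`, and widening the precision of a `decimal` while keeping its scale. The checker below is a rough sketch of those rules using Spark SQL type names. `is_safe_promotion` is a hypothetical helper, not part of any Iceberg or Spark API.

```python
import re
from typing import Optional, Tuple

_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def _parse_decimal(type_name: str) -> Optional[Tuple[int, int]]:
    """Return (precision, scale) for a decimal type name, else None."""
    match = _DECIMAL.fullmatch(type_name)
    return (int(match.group(1)), int(match.group(2))) if match else None


def is_safe_promotion(from_type: str, to_type: str) -> bool:
    """True if from_type -> to_type is one of the three widening promotions."""
    src, dst = from_type.strip().lower(), to_type.strip().lower()
    if (src, dst) in {("int", "bigint"), ("float", "double")}:
        return True
    src_dec, dst_dec = _parse_decimal(src), _parse_decimal(dst)
    if src_dec is not None and dst_dec is not None:
        # Precision may grow, but the scale must stay the same.
        return dst_dec[1] == src_dec[1] and dst_dec[0] > src_dec[0]
    return False


for pair in [("int", "bigint"), ("float", "double"), ("bigint", "int"),
             ("decimal(10,2)", "decimal(15,2)"), ("decimal(10,2)", "decimal(15,4)")]:
    print(pair, is_safe_promotion(*pair))
```

Anything outside these rules, such as narrowing `bigint` to `int` or converting a number to `string`, is rejected by `ALTER COLUMN ... TYPE` and needs a new column plus a backfill instead.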

In [9]:
# Create a metrics table for type evolution experiments
spark.sql("DROP TABLE IF EXISTS local.schema_lab.metrics")

create_metrics_sql = """
CREATE TABLE local.schema_lab.metrics (
    metric_id int,
    metric_name string,
    value float,
    count int,
    recorded_at timestamp
) USING ICEBERG
PARTITIONED BY (days(recorded_at))
"""

spark.sql(create_metrics_sql)
print("✅ Created metrics table with small data types")

show_schema_info("local.schema_lab.metrics")

✅ Created metrics table with small data types

📋 Schema for local.schema_lab.metrics:
+--------------+-----------------+-------+
|col_name      |data_type        |comment|
+--------------+-----------------+-------+
|metric_id     |int              |NULL   |
|metric_name   |string           |NULL   |
|value         |float            |NULL   |
|count         |int              |NULL   |
|recorded_at   |timestamp        |NULL   |
|              |                 |       |
|# Partitioning|                 |       |
|Part 0        |days(recorded_at)|       |
+--------------+-----------------+-------+



In [10]:
# Insert initial metrics data
initial_metrics = [
    (1, "page_views", 1234.5, 5000, datetime(2024, 1, 15, 10, 0, 0)),
    (2, "click_rate", 0.045, 2250, datetime(2024, 1, 15, 11, 0, 0)),
    (3, "conversion_rate", 0.023, 115, datetime(2024, 1, 16, 9, 0, 0)),
    (4, "revenue", 15678.90, 45, datetime(2024, 1, 16, 14, 0, 0))
]

metrics_columns = ["metric_id", "metric_name", "value", "count", "recorded_at"]
metrics_df = spark.createDataFrame(initial_metrics, metrics_columns)

print("📊 Initial metrics data:")
metrics_df.show()

metrics_df.writeTo("local.schema_lab.metrics").append()
print("✅ Inserted initial metrics data")

count_records("local.schema_lab.metrics")

📊 Initial metrics data:
+---------+---------------+-------+-----+-------------------+
|metric_id|    metric_name|  value|count|        recorded_at|
+---------+---------------+-------+-----+-------------------+
|        1|     page_views| 1234.5| 5000|2024-01-15 10:00:00|
|        2|     click_rate|  0.045| 2250|2024-01-15 11:00:00|
|        3|conversion_rate|  0.023|  115|2024-01-16 09:00:00|
|        4|        revenue|15678.9|   45|2024-01-16 14:00:00|
+---------+---------------+-------+-----+-------------------+

✅ Inserted initial metrics data
📊 Total records in local.schema_lab.metrics: 4


4

### Safe Type Promotions

In [11]:
# Evolution: Promote data types to handle larger values
print("🔄 Performing safe type promotions...")

# int → bigint (safe promotion)
spark.sql("ALTER TABLE local.schema_lab.metrics ALTER COLUMN metric_id TYPE bigint")
print("✅ Promoted metric_id: int → bigint")

# float → double (safe promotion)
spark.sql("ALTER TABLE local.schema_lab.metrics ALTER COLUMN value TYPE double")
print("✅ Promoted value: float → double")

# int → bigint (safe promotion)
spark.sql("ALTER TABLE local.schema_lab.metrics ALTER COLUMN count TYPE bigint")
print("✅ Promoted count: int → bigint")

# Show updated schema
show_schema_info("local.schema_lab.metrics")

# Verify data is still accessible
print("\n📊 Data after type promotions:")
spark.sql("SELECT * FROM local.schema_lab.metrics ORDER BY metric_id").show()

🔄 Performing safe type promotions...
✅ Promoted metric_id: int → bigint
✅ Promoted value: float → double
✅ Promoted count: int → bigint

📋 Schema for local.schema_lab.metrics:
+--------------+-----------------+-------+
|col_name      |data_type        |comment|
+--------------+-----------------+-------+
|metric_id     |bigint           |NULL   |
|metric_name   |string           |NULL   |
|value         |double           |NULL   |
|count         |bigint           |NULL   |
|recorded_at   |timestamp        |NULL   |
|              |                 |       |
|# Partitioning|                 |       |
|Part 0        |days(recorded_at)|       |
+--------------+-----------------+-------+


📊 Data after type promotions:
+---------+---------------+--------------------+-----+-------------------+
|metric_id|    metric_name|               value|count|        recorded_at|
+---------+---------------+--------------------+-----+-------------------+
|        1|     page_views|              1234.5| 50

In [12]:
# Add high-precision decimal column
spark.sql("ALTER TABLE local.schema_lab.metrics ADD COLUMN revenue_precise decimal(15,2)")
print("✅ Added high-precision revenue column")

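# Insert values that need the promoted types: the count values exceed the int range,
# and the fractional value needs double precision (the metric_id values would still fit in an int)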
large_metrics_sql = """
INSERT INTO local.schema_lab.metrics VALUES
    (999999999, 'large_metric', 999999999.999999, 999999999999, 
     TIMESTAMP '2024-01-17 10:00:00', 1234567890.12),
    (1000000000, 'huge_metric', 1000000000.0, 1000000000000, 
     TIMESTAMP '2024-01-17 11:00:00', 9999999999.99)
"""

spark.sql(large_metrics_sql)
print("✅ Inserted large values that would have overflowed original types")

print("\n📊 All metrics including large values:")
spark.sql("SELECT * FROM local.schema_lab.metrics ORDER BY metric_id").show()

count_records("local.schema_lab.metrics")

✅ Added high-precision revenue column
✅ Inserted large values that would have overflowed original types

📊 All metrics including large values:
+----------+---------------+--------------------+-------------+-------------------+---------------+
| metric_id|    metric_name|               value|        count|        recorded_at|revenue_precise|
+----------+---------------+--------------------+-------------+-------------------+---------------+
|         1|     page_views|              1234.5|         5000|2024-01-15 10:00:00|           NULL|
|         2|     click_rate| 0.04500000178813934|         2250|2024-01-15 11:00:00|           NULL|
|         3|conversion_rate|0.023000000044703484|          115|2024-01-16 09:00:00|           NULL|
|         4|        revenue|     15678.900390625|           45|2024-01-16 14:00:00|           NULL|
| 999999999|   large_metric|  9.99999999999999E8| 999999999999|2024-01-17 10:00:00|  1234567890.12|
|1000000000|    huge_metric|               1.0E9|10000000

6
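
The output above shows `0.045` as `0.04500000178813934` after the `float` to `double` promotion. That is not corruption: the original rows were stored at 32-bit precision, and widening to 64 bits exposes the exact value that was stored. The sketch below reproduces the effect in plain Python with the standard `struct` module. `widen_float32` is a hypothetical helper name.

```python
import struct


def widen_float32(value: float) -> float:
    """Round a Python float to 32-bit precision, then return it as a 64-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# The same values that were inserted into the metrics table.
for original in (1234.5, 0.045, 0.023, 15678.90):
    widened = widen_float32(original)
    print(f"{original!r} -> {widened!r} (exact in float32: {widened == original})")
```

Values such as `1234.5` that are exactly representable in 32 bits survive unchanged, while the others pick up extra digits. Rows written after the promotion are stored as `double` from the start and do not show this effect.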

## 5. 🔄 Experiment 3: Column Removal and Renaming

Let's practice removing and renaming columns safely.

In [13]:
# Show current users schema before modifications
print("📋 Current users schema:")
show_schema_info("local.schema_lab.users")

print("\n📊 Sample data before modifications:")
spark.sql("SELECT user_id, username, first_name, last_name, phone FROM local.schema_lab.users LIMIT 3").show()

📋 Current users schema:

📋 Schema for local.schema_lab.users:
+--------------+------------------+-------+
|col_name      |data_type         |comment|
+--------------+------------------+-------+
|user_id       |bigint            |NULL   |
|username      |string            |NULL   |
|email         |string            |NULL   |
|created_at    |timestamp         |NULL   |
|first_name    |string            |NULL   |
|last_name     |string            |NULL   |
|phone         |string            |NULL   |
|preferences   |map<string,string>|NULL   |
|is_premium    |boolean           |NULL   |
|tags          |array<string>     |NULL   |
|              |                  |       |
|# Partitioning|                  |       |
|Part 0        |days(created_at)  |       |
+--------------+------------------+-------+


📊 Sample data before modifications:
+-------+-------------+----------+---------+-----------+
|user_id|     username|first_name|last_name|      phone|
+-------+-------------+----------+----

### Column Renaming

In [14]:
# Rename username to display_name for better semantics
print("🔄 Renaming username to display_name...")
spark.sql("ALTER TABLE local.schema_lab.users RENAME COLUMN username TO display_name")
print("✅ Renamed username → display_name")

# Show updated schema
show_schema_info("local.schema_lab.users")

# Verify data is accessible with new column name
print("\n📊 Data with renamed column:")
spark.sql("SELECT user_id, display_name, first_name, last_name FROM local.schema_lab.users LIMIT 3").show()

🔄 Renaming username to display_name...
✅ Renamed username → display_name

📋 Schema for local.schema_lab.users:
+--------------+------------------+-------+
|col_name      |data_type         |comment|
+--------------+------------------+-------+
|user_id       |bigint            |NULL   |
|display_name  |string            |NULL   |
|email         |string            |NULL   |
|created_at    |timestamp         |NULL   |
|first_name    |string            |NULL   |
|last_name     |string            |NULL   |
|phone         |string            |NULL   |
|preferences   |map<string,string>|NULL   |
|is_premium    |boolean           |NULL   |
|tags          |array<string>     |NULL   |
|              |                  |       |
|# Partitioning|                  |       |
|Part 0        |days(created_at)  |       |
+--------------+------------------+-------+


📊 Data with renamed column:
+-------+-------------+----------+---------+
|user_id| display_name|first_name|last_name|
+-------+------------

### Column Removal (Privacy Compliance)

In [15]:
# Remove phone column for privacy compliance (GDPR example).
# Note: DROP COLUMN is a metadata-only change; existing data files are not rewritten.
print("🔄 Removing phone column for privacy compliance...")
spark.sql("ALTER TABLE local.schema_lab.users DROP COLUMN phone")
print("✅ Removed phone column")

# Show updated schema
show_schema_info("local.schema_lab.users")

# Verify data is still accessible (phone column should be gone)
print("\n📊 Data after removing phone column:")
spark.sql("SELECT * FROM local.schema_lab.users LIMIT 5").show(truncate=False)

# Try to query the dropped column (should fail)
try:
    spark.sql("SELECT phone FROM local.schema_lab.users LIMIT 1").show()
except Exception as e:
    print(f"\n✅ Expected error when querying dropped column: {str(e)[:100]}...")

🔄 Removing phone column for privacy compliance...
✅ Removed phone column

📋 Schema for local.schema_lab.users:
+--------------+------------------+-------+
|col_name      |data_type         |comment|
+--------------+------------------+-------+
|user_id       |bigint            |NULL   |
|display_name  |string            |NULL   |
|email         |string            |NULL   |
|created_at    |timestamp         |NULL   |
|first_name    |string            |NULL   |
|last_name     |string            |NULL   |
|preferences   |map<string,string>|NULL   |
|is_premium    |boolean           |NULL   |
|tags          |array<string>     |NULL   |
|              |                  |       |
|# Partitioning|                  |       |
|Part 0        |days(created_at)  |       |
+--------------+------------------+-------+


📊 Data after removing phone column:
+-------+-------------+-----------------+-------------------+----------+---------+---------------------------------------------------------+-------
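
One caveat for the privacy use case: because dropping a column only changes table metadata, the old phone values remain in the existing data files and are still reachable through time travel to earlier snapshots. Physically removing them generally means rewriting the data files and then expiring the old snapshots. The sketch below only builds the two `CALL` statements for Iceberg's `rewrite_data_files` and `expire_snapshots` Spark procedures. `physical_cleanup_statements` is a hypothetical helper, the cutoff timestamp is a placeholder, and I am assuming the rewrite writes new files with the current schema, so check the procedure documentation for your Iceberg version before relying on it.

```python
from typing import List


def physical_cleanup_statements(catalog: str, table: str, older_than: str) -> List[str]:
    """Build CALL statements that rewrite data files and expire old snapshots."""
    rewrite = (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', options => map('rewrite-all', 'true'))"
    )
    expire = (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')"
    )
    return [rewrite, expire]


# Placeholder cutoff: pick a timestamp later than the snapshots you want removed.
statements = physical_cleanup_statements("local", "schema_lab.users", "2030-01-01 00:00:00")
for statement in statements:
    print(statement)
```

Each string can be passed to `spark.sql(...)`. Expiring snapshots also removes the ability to time travel to them, so this is a one-way operation.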

## 6. 🏗️ Experiment 4: Complex Schema Evolution

Let's create a product catalog with complex nested structures.

In [16]:
# Create products table with initial simple schema
spark.sql("DROP TABLE IF EXISTS local.schema_lab.products")

create_products_sql = """
CREATE TABLE local.schema_lab.products (
    product_id string,
    name string,
    price decimal(10,2),
    category string,
    created_at timestamp
) USING ICEBERG
PARTITIONED BY (category)
"""

spark.sql(create_products_sql)
print("✅ Created products table with simple schema")

# Insert initial products
initial_products_sql = """
INSERT INTO local.schema_lab.products VALUES
    ('LAPTOP_001', 'Gaming Laptop', 1299.99, 'electronics', TIMESTAMP '2024-01-15 10:00:00'),
    ('BOOK_001', 'Python Programming', 49.99, 'books', TIMESTAMP '2024-01-15 11:00:00'),
    ('CHAIR_001', 'Office Chair', 299.99, 'furniture', TIMESTAMP '2024-01-16 09:00:00')
"""

spark.sql(initial_products_sql)
print("✅ Inserted initial products")

show_schema_info("local.schema_lab.products")
print("\n📊 Initial products:")
spark.sql("SELECT * FROM local.schema_lab.products").show()

✅ Created products table with simple schema
✅ Inserted initial products

📋 Schema for local.schema_lab.products:
+-----------------------+-------------+-------+
|col_name               |data_type    |comment|
+-----------------------+-------------+-------+
|product_id             |string       |NULL   |
|name                   |string       |NULL   |
|price                  |decimal(10,2)|NULL   |
|category               |string       |NULL   |
|created_at             |timestamp    |NULL   |
|# Partition Information|             |       |
|# col_name             |data_type    |comment|
|category               |string       |NULL   |
+-----------------------+-------------+-------+


📊 Initial products:
+----------+------------------+-------+-----------+-------------------+
|product_id|              name|  price|   category|         created_at|
+----------+------------------+-------+-----------+-------------------+
|  BOOK_001|Python Programming|  49.99|      books|2024-01-15 11:00:00|
|

### Adding Nested Structures

In [17]:
# Evolution 1: Add flexible attributes map
print("🔄 Adding attributes map for flexible properties...")
spark.sql("ALTER TABLE local.schema_lab.products ADD COLUMN attributes map<string,string>")
print("✅ Added attributes map")

# Evolution 2: Add product variants array
print("\n🔄 Adding product variants array...")
variants_sql = """
ALTER TABLE local.schema_lab.products ADD COLUMN variants array<struct<
    size: string,
    color: string,
    stock: int,
    price: decimal(10,2)
>>
"""
spark.sql(variants_sql)
print("✅ Added variants array with struct elements")

# Evolution 3: Add metadata struct
print("\n🔄 Adding metadata struct...")
metadata_sql = """
ALTER TABLE local.schema_lab.products ADD COLUMN metadata struct<
    brand: string,
    manufacturer: string,
    country: string,
    certifications: array<string>
>
"""
spark.sql(metadata_sql)
print("✅ Added metadata struct")

show_schema_info("local.schema_lab.products")

🔄 Adding attributes map for flexible properties...
✅ Added attributes map

🔄 Adding product variants array...
✅ Added variants array with struct elements

🔄 Adding metadata struct...
✅ Added metadata struct

📋 Schema for local.schema_lab.products:
+-----------------------+------------------------------------------------------------------------------------+-------+
|col_name               |data_type                                                                           |comment|
+-----------------------+------------------------------------------------------------------------------------+-------+
|product_id             |string                                                                              |NULL   |
|name                   |string                                                                              |NULL   |
|price                  |decimal(10,2)                                                                       |NULL   |
|category               |string       

In [18]:
# Insert products with complex nested data
complex_products_sql = """
INSERT INTO local.schema_lab.products VALUES
    ('LAPTOP_002', 'Professional Laptop', 1899.99, 'electronics', TIMESTAMP '2024-01-17 10:00:00',
     map('processor', 'Intel i7', 'ram', '32GB', 'storage', '1TB SSD', 'warranty', '3 years'),
     array(
         struct('15-inch', 'Silver', 10, 1899.99),
         struct('15-inch', 'Black', 8, 1899.99),
         struct('17-inch', 'Silver', 5, 2099.99)
     ),
     struct('TechBrand', 'TechManufacturer', 'USA', array('ISO-9001', 'Energy Star'))
    ),
    ('BOOK_002', 'Data Science Handbook', 79.99, 'books', TIMESTAMP '2024-01-17 11:00:00',
     map('pages', '650', 'language', 'English', 'edition', '3rd', 'format', 'Hardcover'),
     array(
         struct('Hardcover', 'Blue', 25, 79.99),
         struct('Paperback', 'Blue', 50, 59.99),
         struct('eBook', 'N/A', 9999, 39.99)
     ),
     struct('EduPublisher', 'EduPrint Co', 'UK', array('Educational Standard', 'Peer Reviewed'))
    )
"""

spark.sql(complex_products_sql)
print("✅ Inserted products with complex nested data")

count_records("local.schema_lab.products")

✅ Inserted products with complex nested data
📊 Total records in local.schema_lab.products: 5


5

In [19]:
# Query complex nested data
print("📊 Products with attributes:")
spark.sql("""
SELECT product_id, name, attributes['processor'] as processor, attributes['warranty'] as warranty
FROM local.schema_lab.products 
WHERE attributes IS NOT NULL
""").show()

print("\n📊 Product variants (flattened):")
spark.sql("""
SELECT product_id, name, explode(variants) as variant
FROM local.schema_lab.products 
WHERE variants IS NOT NULL
""").show()

print("\n📊 Product metadata:")
spark.sql("""
SELECT product_id, name, metadata.brand, metadata.country, metadata.certifications
FROM local.schema_lab.products 
WHERE metadata IS NOT NULL
""").show(truncate=False)

📊 Products with attributes:
+----------+--------------------+---------+--------+
|product_id|                name|processor|warranty|
+----------+--------------------+---------+--------+
|  BOOK_002|Data Science Hand...|     NULL|    NULL|
|LAPTOP_002| Professional Laptop| Intel i7| 3 years|
+----------+--------------------+---------+--------+


📊 Product variants (flattened):
+----------+--------------------+--------------------+
|product_id|                name|             variant|
+----------+--------------------+--------------------+
|  BOOK_002|Data Science Hand...|{Hardcover, Blue,...|
|  BOOK_002|Data Science Hand...|{Paperback, Blue,...|
|  BOOK_002|Data Science Hand...|{eBook, N/A, 9999...|
|LAPTOP_002| Professional Laptop|{15-inch, Silver,...|
|LAPTOP_002| Professional Laptop|{15-inch, Black, ...|
|LAPTOP_002| Professional Laptop|{17-inch, Silver,...|
+----------+--------------------+--------------------+


📊 Product metadata:
+----------+---------------------+------------+-
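
The flattened variants query relies on `explode`, which emits one output row per array element and emits nothing for rows whose array is NULL or empty. A plain-Python sketch of that behaviour is shown below. `explode_variants` is a hypothetical helper and the dictionaries are a simplified stand-in for the table rows.

```python
from typing import Any, Dict, Iterator, List


def explode_variants(products: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one row per variant; products without variants yield nothing."""
    for product in products:
        variants = product.get("variants")
        if not variants:  # NULL (None) or empty arrays produce no rows
            continue
        for variant in variants:
            yield {"product_id": product["product_id"], "variant": variant}


products = [
    # Written before the variants column existed, so it reads as NULL.
    {"product_id": "LAPTOP_001", "variants": None},
    {"product_id": "BOOK_002", "variants": [
        {"size": "Hardcover", "color": "Blue", "stock": 25},
        {"size": "Paperback", "color": "Blue", "stock": 50},
        {"size": "eBook", "color": "N/A", "stock": 9999},
    ]},
]

flattened = list(explode_variants(products))
for row in flattened:
    print(row["product_id"], row["variant"]["size"], row["variant"]["stock"])
```

In Spark SQL, `explode_outer` can be used instead when rows with a NULL array should still appear in the result.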

## 7. 🧪 Compatibility Testing

Let's test backward and forward compatibility of our schema changes.

In [20]:
# Get snapshots to test historical compatibility
print("📸 Available snapshots for users table:")
users_snapshots = spark.sql("SELECT snapshot_id, committed_at, operation FROM local.schema_lab.users.snapshots ORDER BY committed_at")
users_snapshots.show(truncate=False)

# Get the first snapshot (before schema evolution)
first_snapshot = users_snapshots.collect()[0]['snapshot_id']
print(f"\n🎯 Testing with first snapshot ID: {first_snapshot}")

📸 Available snapshots for users table:
+-------------------+-----------------------+---------+
|snapshot_id        |committed_at           |operation|
+-------------------+-----------------------+---------+
|4728977475871491922|2025-06-15 14:07:04.463|append   |
|5453818090795943003|2025-06-15 14:07:19.846|append   |
|499303692362199223 |2025-06-15 14:07:43.027|append   |
+-------------------+-----------------------+---------+


🎯 Testing with first snapshot ID: 4728977475871491922


In [21]:
# Test backward compatibility: time travel to the first snapshot.
# VERSION AS OF returns the data with the schema that was current at that snapshot.
print("⏪ Backward Compatibility Test: Reading old data with new schema")

try:
    old_data = spark.sql(f"SELECT * FROM local.schema_lab.users VERSION AS OF {first_snapshot}")
    print("✅ Successfully read old data with current reader")
    print("📊 Old data columns:", old_data.columns)
    print("📊 Old data sample:")
    old_data.show(2)
    
    # Count records in old snapshot
    old_count = old_data.count()
    print(f"📊 Records in old snapshot: {old_count}")
    
except Exception as e:
    print(f"❌ Backward compatibility issue: {e}")

⏪ Backward Compatibility Test: Reading old data with new schema
✅ Successfully read old data with current reader
📊 Old data columns: ['user_id', 'username', 'email', 'created_at']
📊 Old data sample:
+-------+---------+-----------------+-------------------+
|user_id| username|            email|         created_at|
+-------+---------+-----------------+-------------------+
|   1001|  alice_w|alice@example.com|2024-01-15 10:00:00|
|   1002|bob_smith|  bob@example.com|2024-01-15 11:00:00|
+-------+---------+-----------------+-------------------+
only showing top 2 rows

📊 Records in old snapshot: 4
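
To make the difference between the snapshot's schema and the current one explicit, the two column lists can be compared by name. The sketch below uses a hypothetical helper, `diff_columns`, and an illustrative `current_cols` list assembled from the columns that appear in this notebook's queries (an assumption, not read from the table). A name-based diff reports a rename as one removal plus one addition. Iceberg itself tracks columns by field ID, which is why the rename did not lose data.

```python
def diff_columns(old_cols, new_cols):
    """Compare two column lists by name, preserving the original order."""
    old_set, new_set = set(old_cols), set(new_cols)
    return {
        "added": [c for c in new_cols if c not in old_set],
        "removed": [c for c in old_cols if c not in new_set],
        "kept": [c for c in old_cols if c in new_set],
    }


# Columns of the first snapshot, as printed above.
old_cols = ["user_id", "username", "email", "created_at"]
# Assumption: illustrative current columns, taken from queries in this notebook.
current_cols = ["user_id", "display_name", "email", "created_at",
                "first_name", "last_name", "is_premium", "tags"]

changes = diff_columns(old_cols, current_cols)
for kind, cols in changes.items():
    print(f"{kind}: {cols}")
```

With a live session, `old_data.columns` and `spark.table("local.schema_lab.users").columns` could replace the two hard-coded lists.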


In [22]:
# Test forward compatibility: compare data access patterns
print("⏩ Forward Compatibility Test: Compare old vs new data patterns")

# Query that would work with both old and new schema
compatible_query = """
SELECT user_id, email, created_at 
FROM local.schema_lab.users 
WHERE user_id < 1005
ORDER BY user_id
"""

print("\n📊 Compatible query (works with both old and new schema):")
spark.sql(compatible_query).show()

# Query that only works with new schema
new_schema_query = """
SELECT user_id, display_name, first_name, last_name, is_premium 
FROM local.schema_lab.users 
WHERE is_premium IS NOT NULL
ORDER BY user_id
"""

print("\n📊 New schema query (only works after evolution):")
spark.sql(new_schema_query).show()

⏩ Forward Compatibility Test: Compare old vs new data patterns

📊 Compatible query (works with both old and new schema):
+-------+-------------------+-------------------+
|user_id|              email|         created_at|
+-------+-------------------+-------------------+
|   1001|  alice@example.com|2024-01-15 10:00:00|
|   1002|    bob@example.com|2024-01-15 11:00:00|
|   1003|charlie@example.com|2024-01-16 09:00:00|
|   1004|  diana@example.com|2024-01-16 14:00:00|
+-------+-------------------+-------------------+


📊 New schema query (only works after evolution):
+-------+-------------+----------+---------+----------+
|user_id| display_name|first_name|last_name|is_premium|
+-------+-------------+----------+---------+----------+
|   1007|grace_premium|     Grace|  Premium|      true|
|   1008|  henry_basic|     Henry|    Basic|     false|
+-------+-------------+----------+---------+----------+
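
Whether a query "works with both old and new schema" comes down to whether every column it references exists in both. A minimal sketch of that check follows. `missing_columns` is a hypothetical helper, and `new_schema` is an illustrative list based on the columns used in this notebook rather than the table's actual schema.

```python
def missing_columns(required, available):
    """List the required columns that a schema does not provide."""
    available_set = set(available)
    return [c for c in required if c not in available_set]


old_schema = ["user_id", "username", "email", "created_at"]
# Assumption: illustrative current schema, taken from queries in this notebook.
new_schema = ["user_id", "display_name", "email", "created_at",
              "first_name", "last_name", "is_premium", "tags"]

compatible_cols = ["user_id", "email", "created_at"]
new_only_cols = ["user_id", "display_name", "first_name", "last_name", "is_premium"]

for label, cols in [("compatible query", compatible_cols),
                    ("new schema query", new_only_cols)]:
    print(label)
    print("  missing in old schema:", missing_columns(cols, old_schema))
    print("  missing in new schema:", missing_columns(cols, new_schema))
```

The compatible query has no missing columns on either side because it avoids `username`, the column that was renamed.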



## 8. 📊 Performance Analysis

Let's look at the storage footprint of the evolved table and confirm that queries over old and new columns still run.

In [23]:
# Analyze storage and performance impact
print("📈 Performance Analysis of Schema Evolution")

# Check table files and sizes
print("\n📁 Table files analysis:")
files_info = spark.sql("""
SELECT 
    file_path,
    file_format,
    record_count,
    file_size_in_bytes,
    ROUND(file_size_in_bytes / 1024.0, 2) as file_size_kb
FROM local.schema_lab.users.files
ORDER BY file_size_in_bytes DESC
""")
files_info.show(truncate=False)

# Summary statistics
summary_stats = spark.sql("""
SELECT 
    COUNT(*) as total_files,
    SUM(record_count) as total_records,
    SUM(file_size_in_bytes) as total_size_bytes,
    ROUND(AVG(file_size_in_bytes), 2) as avg_file_size_bytes,
    ROUND(SUM(file_size_in_bytes) / 1024.0, 2) as total_size_kb
FROM local.schema_lab.users.files
""")

print("\n📊 Storage summary:")
summary_stats.show()

📈 Performance Analysis of Schema Evolution

📁 Table files analysis:
+--------------------------------------------------------------------------------------------------------------------------------------------+-----------+------------+------------------+------------+
|file_path                                                                                                                                   |file_format|record_count|file_size_in_bytes|file_size_kb|
+--------------------------------------------------------------------------------------------------------------------------------------------+-----------+------------+------------------+------------+
|file:/home/jovyan/work/warehouse/schema_lab/users/data/created_at_day=2024-01-18/00000-66-5785fe72-62cc-40a8-8f24-e5f25e2448d7-00001.parquet|PARQUET    |2           |3405              |3.33        |
|file:/home/jovyan/work/warehouse/schema_lab/users/data/created_at_day=2024-01-17/00000-60-dad020b9-3266-4ab8-b193-36c5d9434310-0000

In [24]:
# Run queries with different column projections (this cell checks that they run; it does not record timings)
print("⚡ Query Performance Testing")

# Query with column projection (only select needed columns)
print("\n🎯 Testing column projection performance:")

# Query 1: Select only three columns (less column data to read from Parquet)
basic_query = "SELECT user_id, display_name, email FROM local.schema_lab.users"
print("Query 1 - Basic columns:")
result1 = spark.sql(basic_query)
print(f"Result count: {result1.count()}")

# Query 2: Select all columns (more column data to read once rows are materialized)
all_columns_query = "SELECT * FROM local.schema_lab.users"
print("\nQuery 2 - All columns:")
result2 = spark.sql(all_columns_query)
print(f"Result count: {result2.count()}")

# Query 3: Complex column operations
complex_query = """
SELECT 
    user_id,
    display_name,
    COALESCE(first_name, 'Unknown') as first_name_clean,
    COALESCE(is_premium, false) as premium_status,
    size(COALESCE(tags, array())) as tag_count
FROM local.schema_lab.users
"""
print("\nQuery 3 - Complex operations with null handling:")
result3 = spark.sql(complex_query)
result3.show()

⚡ Query Performance Testing

🎯 Testing column projection performance:
Query 1 - Basic columns:
Result count: 8

Query 2 - All columns:
Result count: 8

Query 3 - Complex operations with null handling:
+-------+-------------+----------------+--------------+---------+
|user_id| display_name|first_name_clean|premium_status|tag_count|
+-------+-------------+----------------+--------------+---------+
|   1001|      alice_w|         Unknown|         false|        0|
|   1002|    bob_smith|         Unknown|         false|        0|
|   1003|  charlie_dev|         Unknown|         false|        0|
|   1004|   diana_data|         Unknown|         false|        0|
|   1005|  eve_analyst|             Eve|         false|        0|
|   1006|  frank_admin|           Frank|         false|        0|
|   1007|grace_premium|           Grace|          true|        3|
|   1008|  henry_basic|           Henry|         false|        1|
+-------+-------------+----------------+--------------+---------+
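
The cell above runs the queries but does not record how long they take. A minimal timing sketch, assuming wall-clock time is an acceptable measure, could look like this. `time_action` is a hypothetical helper, and the workload is a stand-in so the block runs without Spark.

```python
import statistics
import time


def time_action(action, repeats=3):
    """Run a zero-argument callable several times.

    Returns (median_seconds, list_of_all_timings).
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


# Stand-in workload so the sketch runs without a Spark session.
median_s, runs = time_action(lambda: sum(range(100_000)), repeats=5)
print(f"median: {median_s * 1000:.2f} ms over {len(runs)} runs")

# With a live session (assumption: `spark` and the query strings above exist):
# time_action(lambda: spark.sql(basic_query).collect())
# time_action(lambda: spark.sql(all_columns_query).collect())
```

Using `collect()` rather than `count()` in the timed action matters here: a count does not need any column values, so it would likely hide the difference between a narrow and a wide projection. On a table of eight rows the timings will mostly reflect job startup overhead.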



## 9. 🎯 Schema Evolution Best Practices

Let's demonstrate schema evolution best practices.

In [25]:
# Best Practice 1: Always add columns as nullable initially
print("✅ Best Practice 1: Add nullable columns")
print("   - New columns are automatically nullable")
print("   - Historical data gets null values")
print("   - No data rewriting required")

# Best Practice 2: Use safe type promotions
print("\n✅ Best Practice 2: Safe type promotions only")
safe_promotions = [
    "int → bigint ✅",
    "float → double ✅",
    "decimal(p,s) → decimal(p',s) where p' > p ✅",
    "string → string (no change needed) ✅"
]
for promotion in safe_promotions:
    print(f"   - {promotion}")

# Best Practice 3: Avoid these dangerous operations
print("\n❌ Avoid these operations:")
dangerous_ops = [
    "bigint → int (data loss possible)",
    "double → float (precision loss)",
    "Adding required (non-null) columns",
    "Changing column semantics without renaming"
]
for op in dangerous_ops:
    print(f"   - {op}")

✅ Best Practice 1: Add nullable columns
   - New columns are automatically nullable
   - Historical data gets null values
   - No data rewriting required

✅ Best Practice 2: Safe type promotions only
   - int → bigint ✅
   - float → double ✅
   - decimal(p,s) → decimal(p',s) where p' > p ✅
   - string → string (no change needed) ✅

❌ Avoid these operations:
   - bigint → int (data loss possible)
   - double → float (precision loss)
   - Adding required (non-null) columns
   - Changing column semantics without renaming
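
The promotion rules listed above can be encoded as a small pre-flight check before issuing an `ALTER TABLE ... ALTER COLUMN ... TYPE` statement. This is a sketch under the assumption that types are given as Spark SQL type names. `is_safe_promotion` is a hypothetical helper that covers only the widening cases named in the list.

```python
import re

# Widening promotions named in the list above, using Spark SQL type names.
_SIMPLE_PROMOTIONS = {("int", "bigint"), ("float", "double")}
_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def is_safe_promotion(from_type, to_type):
    """Return True if changing from_type to to_type is a widening promotion."""
    src, dst = from_type.strip().lower(), to_type.strip().lower()
    if (src, dst) in _SIMPLE_PROMOTIONS:
        return True
    src_dec, dst_dec = _DECIMAL.fullmatch(src), _DECIMAL.fullmatch(dst)
    if src_dec and dst_dec:
        src_p, src_s = map(int, src_dec.groups())
        dst_p, dst_s = map(int, dst_dec.groups())
        # Precision may grow; the scale must stay the same.
        return dst_p > src_p and dst_s == src_s
    return False


candidates = [
    ("int", "bigint"),
    ("bigint", "int"),
    ("float", "double"),
    ("decimal(10,2)", "decimal(12,2)"),
    ("decimal(10,2)", "decimal(12,4)"),
]
for src, dst in candidates:
    verdict = "safe" if is_safe_promotion(src, dst) else "not allowed"
    print(f"{src} -> {dst}: {verdict}")
```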


In [26]:
# Validation function for schema evolution
def validate_schema_evolution(table_name):
    """Comprehensive validation of schema evolution"""
    print(f"🔍 Validating schema evolution for {table_name}")
    
    try:
        # Test 1: Current data accessibility
        current_count = spark.sql(f"SELECT COUNT(*) as count FROM {table_name}").collect()[0]['count']
        print(f"✅ Current data accessible: {current_count} records")
        
        # Test 2: Snapshot history availability (snapshots record data commits;
        # schema-only changes do not add a snapshot)
        snapshots = spark.sql(f"SELECT COUNT(*) as count FROM {table_name}.snapshots").collect()[0]['count']
        print(f"✅ Schema history available: {snapshots} snapshots")
        
        # Test 3: Historical data accessibility
        first_snapshot = spark.sql(f"SELECT snapshot_id FROM {table_name}.snapshots ORDER BY committed_at LIMIT 1").collect()[0]['snapshot_id']
        historical_count = spark.sql(f"SELECT COUNT(*) as count FROM {table_name} VERSION AS OF {first_snapshot}").collect()[0]['count']
        print(f"✅ Historical data accessible: {historical_count} records in first snapshot")
        
        # Test 4: Metadata consistency
        files_count = spark.sql(f"SELECT COUNT(*) as count FROM {table_name}.files").collect()[0]['count']
        print(f"✅ Metadata consistency: {files_count} data files tracked")
        
        print(f"\n🎉 All validation tests passed for {table_name}!")
        return True
        
    except Exception as e:
        print(f"❌ Validation failed: {e}")
        return False

# Validate our evolved tables
validate_schema_evolution("local.schema_lab.users")
print("\n" + "="*50 + "\n")
validate_schema_evolution("local.schema_lab.products")

🔍 Validating schema evolution for local.schema_lab.users
✅ Current data accessible: 8 records
✅ Schema history available: 3 snapshots
✅ Historical data accessible: 4 records in first snapshot
✅ Metadata consistency: 4 data files tracked

🎉 All validation tests passed for local.schema_lab.users!


🔍 Validating schema evolution for local.schema_lab.products
✅ Current data accessible: 5 records
✅ Schema history available: 2 snapshots
✅ Historical data accessible: 3 records in first snapshot
✅ Metadata consistency: 5 data files tracked

🎉 All validation tests passed for local.schema_lab.products!


True
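
Because all four tests in `validate_schema_evolution` share one `try` block, the first failure hides the results of the tests after it. One possible variation runs each check independently and returns a result per check. `run_checks` is a hypothetical helper, and the checks below are stand-ins so the sketch runs without Spark.

```python
def run_checks(checks):
    """Run named zero-argument checks independently.

    Returns a dict mapping each check name to (passed, detail).
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = (True, check())
        except Exception as exc:  # keep going so later checks still run
            results[name] = (False, str(exc))
    return results


def failing_check():
    """Stand-in for a check whose query raises."""
    raise RuntimeError("snapshot not found")


# Stand-in checks; with Spark, each callable would wrap one spark.sql(...) call.
results = run_checks({
    "current_data": lambda: 8,
    "historical_data": failing_check,
    "tracked_files": lambda: 4,
})
for name, (passed, detail) in results.items():
    print("PASS" if passed else "FAIL", name, detail)

all_passed = all(passed for passed, _ in results.values())
print("all passed:", all_passed)
```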

## 10. 🎉 Summary and Next Steps

Congratulations! You've completed the Schema Evolution tutorial.

In [27]:
# Final summary of what we accomplished
print("🎉 Schema Evolution Tutorial Complete!")
print("\n✅ What You've Mastered:")

accomplishments = [
    "Added columns without data rewriting",
    "Performed safe data type promotions",
    "Renamed columns with backward compatibility",
    "Removed columns for privacy compliance",
    "Created complex nested data structures",
    "Tested backward and forward compatibility",
    "Analyzed performance impact of schema changes",
    "Implemented schema evolution best practices"
]

for i, accomplishment in enumerate(accomplishments, 1):
    print(f"   {i}. {accomplishment}")

print("\n📊 Final Statistics:")
tables_created = ['users', 'metrics', 'products']
for table in tables_created:
    try:
        count = spark.sql(f"SELECT COUNT(*) as count FROM local.schema_lab.{table}").collect()[0]['count']
        snapshots = spark.sql(f"SELECT COUNT(*) as count FROM local.schema_lab.{table}.snapshots").collect()[0]['count']
        print(f"   📋 {table}: {count} records, {snapshots} snapshots")
    except Exception:
        print(f"   📋 {table}: Not accessible")

print("\n🚀 Next Steps:")
next_steps = [
    "Explore Project 3: Time Travel Features",
    "Practice schema evolution in production scenarios",
    "Learn about schema registries and governance",
    "Study advanced Iceberg features"
]

for step in next_steps:
    print(f"   → {step}")

print("\n🎯 Key Takeaways:")
takeaways = [
    "Schema evolution in Iceberg is safe and efficient",
    "Historical data remains accessible after schema changes",
    "Complex data types enable flexible data modeling",
    "Proper validation ensures data integrity"
]

for takeaway in takeaways:
    print(f"   💡 {takeaway}")

🎉 Schema Evolution Tutorial Complete!

✅ What You've Mastered:
   1. Added columns without data rewriting
   2. Performed safe data type promotions
   3. Renamed columns with backward compatibility
   4. Removed columns for privacy compliance
   5. Created complex nested data structures
   6. Tested backward and forward compatibility
   7. Analyzed performance impact of schema changes
   8. Implemented schema evolution best practices

📊 Final Statistics:
   📋 users: 8 records, 3 snapshots
   📋 metrics: 6 records, 2 snapshots
   📋 products: 5 records, 2 snapshots

🚀 Next Steps:
   → Explore Project 3: Time Travel Features
   → Practice schema evolution in production scenarios
   → Learn about schema registries and governance
   → Study advanced Iceberg features

🎯 Key Takeaways:
   💡 Schema evolution in Iceberg is safe and efficient
   💡 Historical data remains accessible after schema changes
   💡 Complex data types enable flexible data modeling
   💡 Proper validation ensures data integ