# 🧊 Apache Iceberg Tutorial

Welcome to this hands-on Apache Iceberg tutorial! In this notebook, you'll learn:

1. **Setting up Spark with Iceberg**
2. **Creating your first Iceberg table**
3. **Inserting and querying data**
4. **Understanding table metadata and structure** (snapshots, files, and history)

Time travel and schema evolution build on these steps and are listed under Next Steps at the end.

## 📋 Prerequisites

- Docker environment is running
- Basic understanding of SQL
- Python and PySpark knowledge (helpful but not required)


## 1. 🚀 Initialize Spark with Iceberg

First, let's set up Spark with Iceberg extensions.


In [1]:
import os
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from datetime import datetime, timedelta
import pandas as pd

# Set Python path for Spark to ensure consistent Python version
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python'

# Stop existing Spark session if any (NameError means `spark` is not defined yet)
try:
    spark.stop()
    print("🛑 Stopped existing Spark session")
except NameError:
    print("ℹ️ No existing Spark session to stop")

# Create Spark session with an Iceberg catalog named 'local' and a file-based warehouse path
spark = SparkSession.builder \
    .appName("IcebergTutorial") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "file:///home/jovyan/work/warehouse") \
    .config("spark.pyspark.python", "/opt/conda/bin/python") \
    .config("spark.pyspark.driver.python", "/opt/conda/bin/python") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

print("✅ Spark with Iceberg initialized successfully!")
print(f"Spark version: {spark.version}")
print(f"Python path: {os.environ.get('PYSPARK_PYTHON', 'Not set')}")

# Verify the warehouse configuration
warehouse_path = spark.conf.get("spark.sql.catalog.local.warehouse")
print(f"Configured warehouse location: {warehouse_path}")


ℹ️ No existing Spark session to stop
✅ Spark with Iceberg initialized successfully!
Spark version: 3.5.0
Python path: /opt/conda/bin/python
Configured warehouse location: file:///home/jovyan/work/warehouse
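
The builder chain above sets several `spark.sql.catalog.local.*` keys by hand. As a minimal sketch, the same settings can be collected in a dictionary so that the catalog name and warehouse path become parameters. The helper name `iceberg_catalog_conf` is hypothetical and not part of Spark or Iceberg; the keys and values are the ones used in the cell above.

```python
def iceberg_catalog_conf(catalog_name, warehouse_uri):
    """Return the Spark settings for a Hadoop-type Iceberg catalog.

    Hypothetical helper for illustration; the keys mirror the session config above.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse_uri,
    }


conf = iceberg_catalog_conf("local", "file:///home/jovyan/work/warehouse")
for key, value in conf.items():
    print(f"{key} = {value}")

# In the notebook, each pair could be applied before getOrCreate():
#   builder = SparkSession.builder.appName("IcebergTutorial")
#   for key, value in conf.items():
#       builder = builder.config(key, value)
```

Keeping the catalog name in one place makes it harder for the `spark.sql.catalog.<name>` keys and the `local.demo...` table identifiers used later to drift apart.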


## 2. 🏗️ Create Database and Your First Table

Let's create a database and a table to store user events data.


In [3]:
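# Create database
spark.sql("CREATE DATABASE IF NOT EXISTS local.demo")
print("✅ Database 'local.demo' created!")

# Show available databases
# Note: SHOW DATABASES lists the session's current catalog (spark_catalog by default),
# so only 'default' appears below. Use SHOW NAMESPACES IN local to list 'demo'.
spark.sql("SHOW DATABASES").show()


✅ Database 'local.demo' created!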
+---------+
|namespace|
+---------+
|  default|
+---------+



In [4]:
# Drop the table if it already exists so this cell can be re-run safely
# (IF EXISTS already handles a missing table, so no try/except is needed)
spark.sql("DROP TABLE IF EXISTS local.demo.user_events")
print("🗑️ Dropped existing table (if any)")

# Create a day-partitioned Iceberg table; its location is derived from the catalog warehouse
create_table_sql = """
CREATE TABLE local.demo.user_events (
    user_id bigint,
    event_type string,
    event_time timestamp,
    page_url string,
    user_agent string,
    session_id string
) USING ICEBERG
PARTITIONED BY (days(event_time))
"""

spark.sql(create_table_sql)
print("🎉 Iceberg table 'user_events' created successfully!")

# Verify the table was created successfully
try:
    # Check that the table is registered in the catalog
    table_info = spark.sql("SHOW TABLES IN local.demo").collect()
    tables = [row.tableName for row in table_info]
    if 'user_events' in tables:
        print("✅ Table 'user_events' found in catalog")
    else:
        print("❌ Table 'user_events' not found in catalog")

    # Print the warehouse location that holds the table's data and metadata
    warehouse_path = spark.conf.get("spark.sql.catalog.local.warehouse")
    print(f"📍 Warehouse location: {warehouse_path}")

except Exception as e:
    print(f"⚠️ Could not verify table details: {e}")

# Describe the table
spark.sql("DESCRIBE local.demo.user_events").show()


🗑️ Dropped existing table (if any)
🎉 Iceberg table 'user_events' created successfully!
✅ Table 'user_events' found in catalog
📍 Warehouse location: file:///home/jovyan/work/warehouse
+--------------+----------------+-------+
|      col_name|       data_type|comment|
+--------------+----------------+-------+
|       user_id|          bigint|   NULL|
|    event_type|          string|   NULL|
|    event_time|       timestamp|   NULL|
|      page_url|          string|   NULL|
|    user_agent|          string|   NULL|
|    session_id|          string|   NULL|
|              |                |       |
|# Partitioning|                |       |
|        Part 0|days(event_time)|       |
+--------------+----------------+-------+
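
`PARTITIONED BY (days(event_time))` derives the partition value from the timestamp itself, so rows never need a separate date column. The pure-Python sketch below mimics that idea: it maps each timestamp to a day count since 1970-01-01, which is how the Iceberg spec defines the day transform. It ignores time zones and is an illustration only, not Iceberg's implementation; `day_transform` is a hypothetical name and the timestamps are example values.

```python
from collections import Counter
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_transform(ts):
    """Days between 1970-01-01 and the calendar day of ts (time zones ignored)."""
    return (ts.date() - EPOCH).days


# Example timestamps (illustrative values)
event_times = [
    datetime(2024, 1, 15, 10, 30),
    datetime(2024, 1, 15, 11, 5),
    datetime(2024, 1, 16, 9, 0),
]

# Rows with the same transform result land in the same partition
rows_per_partition = Counter(day_transform(ts) for ts in event_times)
for day_number, count in sorted(rows_per_partition.items()):
    print(day_number, count)
```

Because the transform is part of the table definition, queries that filter on `event_time` can skip whole partitions without the query author naming a partition column.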



## 3. 📝 Insert Sample Data

Let's insert some sample user events data.


In [5]:
# Create sample data
from datetime import datetime

sample_data = [
    (1001, "page_view", datetime(2024, 1, 15, 10, 30, 0), "/home", "Mozilla/5.0", "sess_001"),
    (1001, "click", datetime(2024, 1, 15, 10, 35, 0), "/products", "Mozilla/5.0", "sess_001"),
    (1001, "purchase", datetime(2024, 1, 15, 10, 45, 0), "/checkout", "Mozilla/5.0", "sess_001"),
    (1002, "page_view", datetime(2024, 1, 15, 11, 0, 0), "/home", "Chrome/98.0", "sess_002"),
    (1002, "search", datetime(2024, 1, 15, 11, 5, 0), "/search", "Chrome/98.0", "sess_002"),
    (1002, "click", datetime(2024, 1, 15, 11, 10, 0), "/products/123", "Chrome/98.0", "sess_002"),
    (1003, "page_view", datetime(2024, 1, 16, 9, 0, 0), "/home", "Safari/15.0", "sess_003"),
    (1003, "signup", datetime(2024, 1, 16, 9, 15, 0), "/signup", "Safari/15.0", "sess_003"),
]

# Create DataFrame
columns = ["user_id", "event_type", "event_time", "page_url", "user_agent", "session_id"]
df = spark.createDataFrame(sample_data, columns)

# Show the data we're about to insert
print("📊 Sample data to insert:")
df.show(truncate=False)

# Insert data into Iceberg table
df.writeTo("local.demo.user_events").append()
print("✅ Data inserted successfully!")


📊 Sample data to insert:
+-------+----------+-------------------+-------------+-----------+----------+
|user_id|event_type|event_time         |page_url     |user_agent |session_id|
+-------+----------+-------------------+-------------+-----------+----------+
|1001   |page_view |2024-01-15 10:30:00|/home        |Mozilla/5.0|sess_001  |
|1001   |click     |2024-01-15 10:35:00|/products    |Mozilla/5.0|sess_001  |
|1001   |purchase  |2024-01-15 10:45:00|/checkout    |Mozilla/5.0|sess_001  |
|1002   |page_view |2024-01-15 11:00:00|/home        |Chrome/98.0|sess_002  |
|1002   |search    |2024-01-15 11:05:00|/search      |Chrome/98.0|sess_002  |
|1002   |click     |2024-01-15 11:10:00|/products/123|Chrome/98.0|sess_002  |
|1003   |page_view |2024-01-16 09:00:00|/home        |Safari/15.0|sess_003  |
|1003   |signup    |2024-01-16 09:15:00|/signup      |Safari/15.0|sess_003  |
+-------+----------+-------------------+-------------+-----------+----------+

✅ Data inserted successfully!


## 4. 🔍 Query and Analyze Data

Now let's query our Iceberg table and perform some analysis.


In [10]:
# Inspect the table history
print("📜 Table history:")

try:
    # Select the columns of the history metadata table explicitly
    spark.sql("SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor FROM local.demo.user_events.history").show()

    print("\n✨ What each column means:")
    print("• made_current_at: When this snapshot became the current table state")
    print("• snapshot_id: Unique identifier for each snapshot")
    print("• parent_id: The previous snapshot this one builds upon")
    print("• is_current_ancestor: Whether this snapshot is in the current lineage")

except Exception as e:
    print(f"History query error: {e}")
    print("📋 Let's see all available columns in history:")
    try:
        # Show all columns to understand the schema
        history_df = spark.sql("SELECT * FROM local.demo.user_events.history")
        print(f"Columns: {history_df.columns}")
        history_df.show()
    except Exception as e2:
        print(f"Could not access history table: {e2}")


📜 Table history:
+--------------------+-------------------+---------+-------------------+
|     made_current_at|        snapshot_id|parent_id|is_current_ancestor|
+--------------------+-------------------+---------+-------------------+
|2025-06-15 23:33:...|4374133211291202804|     NULL|               true|
+--------------------+-------------------+---------+-------------------+


✨ What each column means:
• made_current_at: When this snapshot became the current table state
• snapshot_id: Unique identifier for each snapshot
• parent_id: The previous snapshot this one builds upon
• is_current_ancestor: Whether this snapshot is in the current lineage


In [11]:
# 🎉 Success! Let's explore what we've accomplished
print("🎉 Congratulations! Your Iceberg table is working perfectly!")
print("\n📊 Summary of what we created:")

# Count total records
total_records = spark.sql("SELECT COUNT(*) as total FROM local.demo.user_events").collect()[0]['total']
print(f"• Total records: {total_records}")

# Show partition information. The .partitions metadata table has one row per partition,
# so read its record_count column instead of counting metadata rows with COUNT(*).
partitions = spark.sql("SELECT partition, record_count AS records FROM local.demo.user_events.partitions ORDER BY partition").collect()
print(f"• Partitions created: {len(partitions)}")
for p in partitions:
    print(f"  - {p['partition']}: {p['records']} records")

# Show unique users and events
user_count = spark.sql("SELECT COUNT(DISTINCT user_id) as users FROM local.demo.user_events").collect()[0]['users']
event_types = spark.sql("SELECT event_type, COUNT(*) as count FROM local.demo.user_events GROUP BY event_type ORDER BY count DESC").collect()

print(f"• Unique users: {user_count}")
print("• Event types:")
for et in event_types:
    print(f"  - {et['event_type']}: {et['count']} events")

print("\n🔍 Key Iceberg Features Demonstrated:")
print("✅ ACID transactions - All data writes are atomic")
print("✅ Time partitioning - Data organized by day of event_time")
print("✅ Metadata tables - .snapshots, .files, .history for introspection")
print("✅ Schema enforcement - Strong typing with bigint, string, timestamp")
print("✅ Parquet storage - Efficient columnar format for analytics")

print(f"\n📁 Your data is stored at: {spark.conf.get('spark.sql.catalog.local.warehouse')}")
print("🚀 Ready to explore time travel, schema evolution, and more!")


🎉 Congratulations! Your Iceberg table is working perfectly!

📊 Summary of what we created:
• Total records: 8
• Partitions created: 2
  - Row(event_time_day=datetime.date(2024, 1, 15)): 6 records
  - Row(event_time_day=datetime.date(2024, 1, 16)): 2 records
• Unique users: 3
• Event types:
  - page_view: 3 events
  - click: 2 events
  - signup: 1 events
  - purchase: 1 events
  - search: 1 events

🔍 Key Iceberg Features Demonstrated:
✅ ACID transactions - All data writes are atomic
✅ Time partitioning - Data organized by day of event_time
✅ Metadata tables - .snapshots, .files, .history for introspection
✅ Schema enforcement - Strong typing with bigint, string, timestamp
✅ Parquet storage - Efficient columnar format for analytics

📁 Your data is stored at: file:///home/jovyan/work/warehouse
🚀 Ready to explore time travel, schema evolution, and more!


In [12]:
# Basic query - show all data
print("📋 All user events:")
spark.sql("SELECT * FROM local.demo.user_events ORDER BY event_time").show(truncate=False)


📋 All user events:
+-------+----------+-------------------+-------------+-----------+----------+
|user_id|event_type|event_time         |page_url     |user_agent |session_id|
+-------+----------+-------------------+-------------+-----------+----------+
|1001   |page_view |2024-01-15 10:30:00|/home        |Mozilla/5.0|sess_001  |
|1001   |click     |2024-01-15 10:35:00|/products    |Mozilla/5.0|sess_001  |
|1001   |purchase  |2024-01-15 10:45:00|/checkout    |Mozilla/5.0|sess_001  |
|1002   |page_view |2024-01-15 11:00:00|/home        |Chrome/98.0|sess_002  |
|1002   |search    |2024-01-15 11:05:00|/search      |Chrome/98.0|sess_002  |
|1002   |click     |2024-01-15 11:10:00|/products/123|Chrome/98.0|sess_002  |
|1003   |page_view |2024-01-16 09:00:00|/home        |Safari/15.0|sess_003  |
|1003   |signup    |2024-01-16 09:15:00|/signup      |Safari/15.0|sess_003  |
+-------+----------+-------------------+-------------+-----------+----------+



## 5. 📸 Explore Iceberg Metadata

One of Iceberg's key features is rich metadata. Let's explore it!


In [13]:
# View table snapshots through Iceberg's metadata tables
print("📸 Table snapshots:")
spark.sql("SELECT snapshot_id, committed_at, operation FROM local.demo.user_events.snapshots").show(truncate=False)

# View table files
print("\n📁 Table files:")
spark.sql("SELECT file_path, file_format, record_count FROM local.demo.user_events.files").show(truncate=False)

# View table history
print("\n📜 Table history:")
spark.sql("SELECT * FROM local.demo.user_events.history").show()



📸 Table snapshots:
+-------------------+-----------------------+---------+
|snapshot_id        |committed_at           |operation|
+-------------------+-----------------------+---------+
|4374133211291202804|2025-06-15 23:33:25.237|append   |
+-------------------+-----------------------+---------+


📁 Table files:
+--------------------------------------------------------------------------------------------------------------------------------------------+-----------+------------+
|file_path                                                                                                                                   |file_format|record_count|
+--------------------------------------------------------------------------------------------------------------------------------------------+-----------+------------+
|file:/home/jovyan/work/warehouse/demo/user_events/data/event_time_day=2024-01-15/00000-28-67900940-a662-4bbe-abe8-7253f253a735-00001.parquet|PARQUET    |6           |
|file:

In [14]:
# Table history query with explicit columns
print("📜 Table history:")

try:
    spark.sql("SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor FROM local.demo.user_events.history").show()

    print("\n✨ What each column means:")
    print("• made_current_at: Timestamp when this snapshot became the current table state")
    print("• snapshot_id: Unique identifier for each table snapshot/version")
    print("• parent_id: The ID of the previous snapshot this one builds upon (NULL for first)")
    print("• is_current_ancestor: Whether this snapshot is part of the current table lineage")

    print("\n🔍 Understanding the output:")
    print("• You should see one row showing your single data insertion")
    print("• parent_id is NULL because this is the first snapshot")
    print("• is_current_ancestor is true because this is your current table state")

except Exception as e:
    print(f"Could not query history: {e}")
    print("📋 Let's try showing all columns:")
    try:
        spark.sql("SELECT * FROM local.demo.user_events.history").show()
    except Exception as e2:
        print(f"History table not accessible: {e2}")


📜 Table history:
+--------------------+-------------------+---------+-------------------+
|     made_current_at|        snapshot_id|parent_id|is_current_ancestor|
+--------------------+-------------------+---------+-------------------+
|2025-06-15 23:33:...|4374133211291202804|     NULL|               true|
+--------------------+-------------------+---------+-------------------+


✨ What each column means:
• made_current_at: Timestamp when this snapshot became the current table state
• snapshot_id: Unique identifier for each table snapshot/version
• parent_id: The ID of the previous snapshot this one builds upon (NULL for first)
• is_current_ancestor: Whether this snapshot is part of the current table lineage

🔍 Understanding the output:
• You should see one row showing your single data insertion
• parent_id is NULL because this is the first snapshot
• is_current_ancestor is true because this is your current table state
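
The `parent_id` column links each snapshot to the one before it, and `is_current_ancestor` marks the snapshots that can be reached from the current one by following those links. The sketch below walks such a chain in plain Python. The snapshot IDs are invented for illustration, and `current_ancestors` is a hypothetical helper, not an Iceberg API.

```python
def current_ancestors(parent_of, current_id):
    """Follow parent links from the current snapshot back to the first one."""
    lineage = []
    snapshot_id = current_id
    while snapshot_id is not None:
        lineage.append(snapshot_id)
        snapshot_id = parent_of[snapshot_id]
    return lineage


# snapshot_id -> parent_id (None plays the role of NULL); IDs are invented
parent_of = {
    101: None,  # first append
    102: 101,   # second append
    103: 102,   # a write that was later rolled back
}

# Suppose the table was rolled back so that 102 is current again
lineage = current_ancestors(parent_of, current_id=102)
for snapshot_id in sorted(parent_of):
    print(snapshot_id, snapshot_id in lineage)
```

With only one snapshot, as in this notebook, the chain has a single entry whose parent is NULL, which matches the one-row history output above.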


## 🎉 Congratulations!

You've successfully created and queried your first Iceberg table using Jupyter Notebook!

### ✅ What You've Accomplished

1. **Set up Spark with Iceberg** in a Jupyter environment
2. **Created an Iceberg table** with partitioning
3. **Inserted sample data** using DataFrames
4. **Queried the data** with SQL
5. **Explored Iceberg metadata** - snapshots, files, and history

### 🔍 Key Differences from Regular Spark Tables

- **`USING ICEBERG`** - This makes it an Iceberg table, not a regular Spark table
- **Rich metadata** - The `.snapshots`, `.files`, and `.history` metadata tables are provided by Iceberg; regular Spark tables do not expose them
- **ACID transactions** - Each write commits atomically as a new snapshot
- **Time travel** - You can query historical versions (try adding more data and exploring!)
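
Time travel reads the table as of an earlier snapshot. As a hedged sketch, the helpers below only build the SQL text, using the `VERSION AS OF` and `TIMESTAMP AS OF` clauses that Spark 3.3 and later accept for Iceberg tables. The helper names are hypothetical, the snapshot ID is the one from the snapshots output in this notebook and will differ in your environment, and the timestamp is an illustrative value.

```python
TABLE = "local.demo.user_events"


def version_as_of(table, snapshot_id):
    """SQL that reads the table exactly as of one snapshot."""
    return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"


def timestamp_as_of(table, timestamp):
    """SQL that reads the table as it was at a wall-clock time."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


by_snapshot = version_as_of(TABLE, 4374133211291202804)
by_time = timestamp_as_of(TABLE, "2025-06-16 00:00:00")
print(by_snapshot)
print(by_time)

# In the notebook session: spark.sql(by_snapshot).show()
```

Time travel only becomes interesting once the table has more than one snapshot, so append a second batch of rows before comparing results.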

### 🚀 Next Steps

1. **Add more data** and observe new snapshots
2. **Try schema evolution** - add new columns safely
3. **Experiment with time travel** queries
4. **Explore different partitioning strategies**
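
The schema evolution and partitioning steps map to `ALTER TABLE` statements. The sketch below lists two such statements as strings; the column name `country` and the bucket count of 16 are hypothetical choices for illustration. `ADD PARTITION FIELD` relies on the Iceberg SQL extensions configured in section 1.

```python
TABLE = "local.demo.user_events"

next_step_statements = [
    # Schema evolution: add a nullable column; existing data files are not rewritten
    f"ALTER TABLE {TABLE} ADD COLUMN country string",
    # Partition evolution: new writes are additionally bucketed by user_id
    f"ALTER TABLE {TABLE} ADD PARTITION FIELD bucket(16, user_id)",
]

for statement in next_step_statements:
    print(statement)

# In the notebook session, each statement could be run with spark.sql(statement),
# followed by DESCRIBE local.demo.user_events to see the change.
```

Both changes are metadata operations, so rows written before the change stay readable and show NULL for the new column.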

**Remember**: You're using the power of both Spark (compute engine) and Iceberg (table format) together!
