# Demo 1: Hadoop Catalog with Apache Iceberg

This lesson demonstrates the **Hadoop Catalog** for Apache Iceberg, with table data and metadata stored in **Backblaze B2** (S3-compatible storage).

**Architecture:**
- Spark (Query Engine)
- Hadoop Catalog (Iceberg metadata in object storage)
- Backblaze B2 (S3-compatible storage)

**Benefits:**
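- Works with S3-compatible object storage (best suited to a single writer, because Hadoop Catalog commits rely on atomic rename, which object stores generally do not provide)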
- Simple setup with no external catalog service
- ACID guarantees with Iceberg tables
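
The architecture above has no catalog service, so a table is located purely by its path under the warehouse. The sketch below illustrates that layout, assuming a single-level namespace and uncompressed metadata files; `table_location`, `metadata_file`, and `version_hint` are hypothetical helper names for this lesson, not Iceberg APIs.

```python
def table_location(warehouse: str, namespace: str, table: str) -> str:
    """Directory that holds a table under a Hadoop Catalog warehouse."""
    return f"{warehouse.rstrip('/')}/{namespace}/{table}"


def metadata_file(location: str, version: int) -> str:
    """Path of the metadata JSON for one table version (assumes no compression)."""
    return f"{location}/metadata/v{version}.metadata.json"


def version_hint(location: str) -> str:
    """Path of the small file that records the latest metadata version."""
    return f"{location}/metadata/version-hint.text"


loc = table_location("s3a://iceberg-test/warehouse", "default", "df_open_2011_hadoop")
print(loc)
print(metadata_file(loc, 1))
print(version_hint(loc))
```

Browsing the bucket after running the lesson should show paths of this shape, which can help confirm that the catalog wrote where you expected.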


## 1. Configure Spark with Hadoop Catalog and Backblaze B2


In [1]:
import os
import pyspark
from pyspark.sql import SparkSession

# Backblaze B2 (S3-compatible) configuration
B2_ACCESS_KEY_ID = os.getenv("B2_APPLICATION_KEY_ID")
B2_SECRET_KEY = os.getenv("B2_APPLICATION_KEY")
B2_ENDPOINT = os.getenv("B2_S3_ENDPOINT", "s3.us-east-005.backblazeb2.com")
B2_REGION = os.getenv("B2_REGION", "us-east-005")
B2_BUCKET = os.getenv("B2_BUCKET", "iceberg-test")
WAREHOUSE_PATH = f"s3a://{B2_BUCKET}/warehouse"

if not B2_ACCESS_KEY_ID or not B2_SECRET_KEY:
    raise RuntimeError(
        "Missing Backblaze B2 credentials. Set B2_APPLICATION_KEY_ID and B2_APPLICATION_KEY."
    )

conf = (
    pyspark.SparkConf()
        .setAppName("iceberg_hadoop_catalog")
        .set(
            "spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,"
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            # hadoop-aws 3.3.4 is built against AWS SDK bundle 1.12.262
            "com.amazonaws:aws-java-sdk-bundle:1.12.262"
        )
        .set(
            "spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        )

        # Hadoop Catalog configuration
        .set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .set("spark.sql.catalog.spark_catalog.type", "hadoop")
        .set("spark.sql.catalog.spark_catalog.warehouse", WAREHOUSE_PATH)
        .set("spark.sql.catalog.spark_catalog.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO")

        # Hadoop S3A configuration for Backblaze B2
        .set("spark.hadoop.fs.s3a.endpoint", B2_ENDPOINT)
        .set("spark.hadoop.fs.s3a.endpoint.region", B2_REGION)
        .set("spark.hadoop.fs.s3a.access.key", B2_ACCESS_KEY_ID)
        .set("spark.hadoop.fs.s3a.secret.key", B2_SECRET_KEY)
        .set("spark.hadoop.fs.s3a.path.style.access", "true")
        .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
        .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .set(
            "spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
        )
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark running with Iceberg Hadoop Catalog")
print(f"B2 endpoint: {B2_ENDPOINT}")
print(f"Warehouse: {WAREHOUSE_PATH}")


Spark running with Iceberg Hadoop Catalog
B2 endpoint: s3.us-east-005.backblazeb2.com
Warehouse: s3a://iceberg-test/warehouse
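
The configuration cell reads `B2_ENDPOINT` and `B2_REGION` independently, so the two values can drift apart. Backblaze B2 S3 endpoints follow the pattern `s3.<region>.backblazeb2.com`, which allows a quick consistency check. The following is a minimal sketch; `region_from_b2_endpoint` is a hypothetical helper, and it assumes the endpoint carries no port number.

```python
import re
from typing import Optional


def region_from_b2_endpoint(endpoint: str) -> Optional[str]:
    """Extract the region from a B2 S3 endpoint, or None if it does not match."""
    # Drop an optional scheme and any trailing path.
    host = re.sub(r"^https?://", "", endpoint).split("/")[0]
    match = re.fullmatch(r"s3\.([a-z0-9-]+)\.backblazeb2\.com", host)
    return match.group(1) if match else None


endpoint = "s3.us-east-005.backblazeb2.com"
region = "us-east-005"

derived = region_from_b2_endpoint(endpoint)
if derived is not None and derived != region:
    raise ValueError(f"Endpoint region {derived!r} does not match configured {region!r}")
print(f"Endpoint and region agree: {derived}")
```

Running a check like this before building the Spark session surfaces a mismatch as a clear error rather than as a signing failure on the first request.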


## 2. Create Default Namespace

A new Hadoop Catalog warehouse starts empty, so create the `default` namespace before creating any tables.


In [2]:
# Create default namespace if it doesn't exist
spark.sql("CREATE NAMESPACE IF NOT EXISTS spark_catalog.default")
print("Created default namespace")


Created default namespace


## 3. Verify Catalog Connection


In [3]:
# Show available databases
spark.sql("SHOW DATABASES").show()

# Show existing tables
spark.sql("SHOW TABLES IN spark_catalog.default").show()


+---------+
|namespace|
+---------+
|  default|
+---------+

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



## 4. Load CSV Data


In [4]:
# Load CSV into temporary view
csv_df = spark.read.format("csv").option("header", "true").load("../datasets/df_open_2011.csv")
csv_df.createOrReplaceTempView("csv_open_2011")

print(f"Loaded {csv_df.count()} rows from CSV")
csv_df.show(5)


Loaded 13126 rows from CSV
+------------+---------------+---------+--------+------+------+-------------------+-------------------+--------+-------------------+-----------+-------------------+---+------+------+-----------+------------+--------+----+
|competitorId| competitorName|firstName|lastName|status|gender|countryOfOriginCode|countryOfOriginName|regionId|         regionName|affiliateId|      affiliateName|age|height|weight|overallRank|overallScore|genderId|year|
+------------+---------------+---------+--------+------+------+-------------------+-------------------+--------+-------------------+-----------+-------------------+---+------+------+-----------+------------+--------+----+
|       47661|     Dan Bailey|      Dan|  Bailey|   ACT|     M|               NULL|               NULL|       6|       Central East|          0|    CrossFit Legacy| 27|  NULL|  NULL|          1|          43|       1|2011|
|      124483| Joshua Bridges|   Joshua| Bridges|   ACT|     M|               NULL|  

## 5. Create Iceberg Table


In [5]:
# Create Iceberg table from CSV data
spark.sql("""
    CREATE TABLE IF NOT EXISTS spark_catalog.default.df_open_2011_hadoop
    USING iceberg
    AS SELECT * FROM csv_open_2011
""")

print("Created Iceberg table: spark_catalog.default.df_open_2011_hadoop")


Created Iceberg table: spark_catalog.default.df_open_2011_hadoop


## 6. Query Iceberg Table


In [6]:
# Query the table
spark.sql("SELECT * FROM spark_catalog.default.df_open_2011_hadoop LIMIT 10").show()

# Count records
spark.sql("SELECT COUNT(*) as total FROM spark_catalog.default.df_open_2011_hadoop").show()


+------------+--------------------+---------+--------------------+------+------+-------------------+-------------------+--------+-------------------+-----------+-------------------+---+------+------+-----------+------------+--------+----+
|competitorId|      competitorName|firstName|            lastName|status|gender|countryOfOriginCode|countryOfOriginName|regionId|         regionName|affiliateId|      affiliateName|age|height|weight|overallRank|overallScore|genderId|year|
+------------+--------------------+---------+--------------------+------+------+-------------------+-------------------+--------+-------------------+-----------+-------------------+---+------+------+-----------+------------+--------+----+
|       47661|          Dan Bailey|      Dan|              Bailey|   ACT|     M|               NULL|               NULL|       6|       Central East|          0|    CrossFit Legacy| 27|  NULL|  NULL|          1|          43|       1|2011|
|      124483|      Joshua Bridges|   Joshua

## 7. ACID Operations - Delete Records


In [7]:
# Delete every row where gender = 'M'; Iceberg commits the change atomically as a new snapshot
spark.sql("""
    DELETE FROM spark_catalog.default.df_open_2011_hadoop
    WHERE gender = 'M'
""")

print("Deleted records")

# Verify deletion
spark.sql("SELECT COUNT(*) as total FROM spark_catalog.default.df_open_2011_hadoop").show()


Deleted records
+-----+
|total|
+-----+
| 4506|
+-----+
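
The count above can be cross-checked against the CSV load in section 4. The arithmetic below is a small worked example, assuming the table was created fresh in this run and therefore held all 13126 loaded rows before the delete.

```python
rows_loaded = 13126     # "Loaded 13126 rows from CSV" in section 4
rows_remaining = 4506   # COUNT(*) after the DELETE

rows_deleted = rows_loaded - rows_remaining
share_deleted = rows_deleted / rows_loaded

print(f"{rows_deleted} rows deleted ({share_deleted:.1%} of the table)")
# 8620 rows deleted (65.7% of the table)
```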



## 8. Time Travel and History


In [8]:
# Show table history
spark.sql("SELECT * FROM spark_catalog.default.df_open_2011_hadoop.history").show(truncate=False)

# Show snapshots
spark.sql(
    "SELECT snapshot_id, parent_id, operation FROM "
    "spark_catalog.default.df_open_2011_hadoop.snapshots"
).show(truncate=False)


+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2026-01-21 19:00:11.634|3693390167868883144|NULL               |true               |
|2026-01-21 19:00:22.316|8648452601893943335|3693390167868883144|true               |
+-----------------------+-------------------+-------------------+-------------------+

+-------------------+-------------------+---------+
|snapshot_id        |parent_id          |operation|
+-------------------+-------------------+---------+
|3693390167868883144|NULL               |append   |
|8648452601893943335|3693390167868883144|overwrite|
+-------------------+-------------------+---------+
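
The `parent_id` column links each snapshot to the one it was built from, so the lineage can be walked without relying on timestamps. The sketch below does this in plain Python over the two rows shown above; `lineage` is a hypothetical helper, and in a notebook the input list could be built from the `.snapshots` query with `collect()`.

```python
# The two snapshot rows from the output above.
snapshots = [
    {"snapshot_id": 3693390167868883144, "parent_id": None, "operation": "append"},
    {"snapshot_id": 8648452601893943335, "parent_id": 3693390167868883144, "operation": "overwrite"},
]


def lineage(snapshots, current_id):
    """Return snapshot IDs from current_id back to the first snapshot."""
    by_id = {s["snapshot_id"]: s for s in snapshots}
    chain = []
    cursor = current_id
    while cursor is not None and cursor in by_id:
        chain.append(cursor)
        cursor = by_id[cursor]["parent_id"]
    return chain


chain = lineage(snapshots, 8648452601893943335)
print(chain)
# [8648452601893943335, 3693390167868883144]
```

The second element of the chain is the snapshot taken before the delete, which is the one to read when comparing row counts.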



In [None]:
# Use the two most recent snapshots for time travel
history_df = spark.sql("""
    SELECT snapshot_id
    FROM spark_catalog.default.df_open_2011_hadoop.snapshots
    ORDER BY committed_at DESC
    LIMIT 2
""")

rows = history_df.collect()
if len(rows) >= 2:
    current_id = rows[0][0]
    previous_id = rows[1][0]

    print(f"Current ID: {current_id}")
    print(f"Previous ID: {previous_id}")

    spark.sql(f"""
        SELECT COUNT(*) as count_previous
        FROM spark_catalog.default.df_open_2011_hadoop
        VERSION AS OF {previous_id}
    """).show()

    spark.sql(f"""
        SELECT COUNT(*) as count_current
        FROM spark_catalog.default.df_open_2011_hadoop
        VERSION AS OF {current_id}
    """).show()
else:
    print("Not enough history to find a previous version.")
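
The cell above interpolates snapshot IDs straight into SQL with f-strings. That is fine for IDs read from the `.snapshots` table, but a small builder that insists on an integer makes the pattern safer to reuse. This is a sketch only; `count_at_snapshot_sql` is a hypothetical helper name.

```python
def count_at_snapshot_sql(table: str, snapshot_id: int) -> str:
    """Build a time-travel COUNT(*) query for one snapshot ID."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(snapshot_id, bool) or not isinstance(snapshot_id, int):
        raise TypeError(f"snapshot_id must be an int, got {type(snapshot_id).__name__}")
    return f"SELECT COUNT(*) AS row_count FROM {table} VERSION AS OF {snapshot_id}"


table = "spark_catalog.default.df_open_2011_hadoop"
print(count_at_snapshot_sql(table, 3693390167868883144))
```

In the notebook, the returned string would be passed to `spark.sql(...).show()` in place of the inline f-string queries.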


## 9. Table Metadata


In [None]:
# Show table schema
spark.sql("DESCRIBE spark_catalog.default.df_open_2011_hadoop").show()
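
Because the CSV in section 4 was read with only the `header` option, Spark treats every column as a string, and the `DESCRIBE` output should reflect that. One hedged way to get typed columns is to cast them when selecting from the temporary view. The sketch below builds such a select list; `typed_select_list` is a hypothetical helper, and the set of integer columns is an assumption based on the sample rows shown earlier.

```python
# Assumption: these columns hold integers, judging by the sample output.
INT_COLUMNS = {"competitorId", "age", "overallRank", "overallScore", "year"}


def typed_select_list(columns, int_columns=INT_COLUMNS):
    """Return a SQL select list that casts the chosen columns to INT."""
    parts = []
    for name in columns:
        if name in int_columns:
            parts.append(f"CAST({name} AS INT) AS {name}")
        else:
            parts.append(name)
    return ", ".join(parts)


print(typed_select_list(["competitorId", "competitorName", "age"]))
# CAST(competitorId AS INT) AS competitorId, competitorName, CAST(age AS INT) AS age
```

With a live session, `typed_select_list(csv_df.columns)` could replace the `*` in the `CREATE TABLE ... AS SELECT` statement from section 5.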
