# 🧊 Apache Iceberg + PySpark: Local Hadoop Catalog Setup

This notebook demonstrates how to configure a SparkSession to use **Apache Iceberg** with a **local Hadoop catalog**. This setup is ideal for local development and testing without requiring a Hive Metastore.

## 🔧 Spark Configuration Details

The following Spark configurations are used to enable Iceberg support:

| Config Key | Description |
|------------|-------------|
| `spark.sql.catalog.local` | Registers a catalog named `local` using Iceberg's `SparkCatalog` implementation. |
| `spark.sql.catalog.local.type` | Sets the catalog type to `hadoop`, a directory-based catalog that keeps table metadata as files under the warehouse path, so no external metastore is needed. |
| `spark.sql.catalog.local.warehouse` | Sets the warehouse directory path where Iceberg tables will be stored. This should be a valid local or distributed filesystem path. |

Once configured, you can use SQL commands to create, insert into, and query Iceberg tables using the `local` catalog.
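
The three keys in the table share the prefix `spark.sql.catalog.<name>`, so a catalog with a different name only changes that prefix. A minimal sketch of this pattern follows; the helper name `iceberg_hadoop_catalog_conf` is hypothetical and is not part of Spark or Iceberg.

```python
def iceberg_hadoop_catalog_conf(catalog_name: str, warehouse_path: str) -> dict:
    """Build the Spark settings for an Iceberg Hadoop catalog (hypothetical helper)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse_path,
    }


# Print the settings for the catalog name and path used in this notebook.
conf = iceberg_hadoop_catalog_conf("local", "/home/jovyan/iceberg/warehouse")
for key, value in conf.items():
    print(f"{key} = {value}")
```

Each key/value pair can then be passed to `SparkSession.builder.config(key, value)` before the session is created.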



In [1]:
from pyspark.sql import SparkSession

### Define the warehouse path (adjust as needed)

In [2]:
import os

# Define the warehouse path and create the directory if it does not exist
warehouse_path = "/home/jovyan/iceberg/warehouse"
os.makedirs(warehouse_path, exist_ok=True)
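
Because the path above is specific to one environment, it can help to resolve it from a setting and confirm it is writable before starting Spark. The sketch below assumes a hypothetical environment variable `ICEBERG_WAREHOUSE`; the fallback under the system temp folder is chosen purely for illustration.

```python
import os
import tempfile
from pathlib import Path


def resolve_warehouse(default: str) -> Path:
    """Return a writable warehouse directory, creating it if needed."""
    # ICEBERG_WAREHOUSE is a hypothetical variable name chosen for this sketch.
    path = Path(os.environ.get("ICEBERG_WAREHOUSE", default)).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"Warehouse path is not writable: {path}")
    return path


# Illustrative fallback location; replace it with a path that suits your setup.
fallback = os.path.join(tempfile.gettempdir(), "iceberg", "warehouse")
warehouse_dir = resolve_warehouse(fallback)
print(warehouse_dir)
```

`str(warehouse_dir)` could then be used in place of the hard-coded `warehouse_path` value.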

## 🔧 SparkSession Configuration
The SparkSession below registers Iceberg's `SparkCatalog` under the name `local` and points it at `warehouse_path`, the directory where Iceberg stores table data and metadata.

In [3]:

# Initialize SparkSession with Iceberg Hadoop catalog
spark = SparkSession.builder \
    .appName("IcebergLocalSetup") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", warehouse_path) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
print("✅ SparkSession initialized with local Iceberg Hadoop catalog")

✅ SparkSession initialized with local Iceberg Hadoop catalog


### Optional: List catalog-related Spark configurations

In [4]:
for k, v in spark.sparkContext.getConf().getAll():
    if 'catalog' in k:
        print(f"{k} = {v}")

spark.sql.catalog.local.warehouse = /home/jovyan/iceberg/warehouse
spark.sql.catalog.local.type = hadoop
spark.sql.catalog.local = org.apache.iceberg.spark.SparkCatalog
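
With the session in place, tables in the catalog are addressed as `local.<namespace>.<table>`. The following is a sketch of a create, insert, and query round trip. The namespace `db` and table `events` are hypothetical names, and running it assumes the `spark` session created above with the Iceberg runtime on the classpath.

```python
# Hypothetical namespace and table names, used only for illustration.
TABLE = "local.db.events"

STATEMENTS = [
    f"CREATE TABLE IF NOT EXISTS {TABLE} (id BIGINT, name STRING) USING iceberg",
    f"INSERT INTO {TABLE} VALUES (1, 'created'), (2, 'updated')",
]


def run_demo(spark):
    """Create the table, insert two rows, and return all rows ordered by id."""
    for statement in STATEMENTS:
        spark.sql(statement)
    return spark.sql(f"SELECT id, name FROM {TABLE} ORDER BY id").collect()


# In the notebook, call: rows = run_demo(spark)
```

Note that the `INSERT` is not idempotent, so each call to `run_demo` appends two more rows to the table.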


## 📝 Notes
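- Ensure the Iceberg Spark runtime JAR matching your Spark and Scala versions, along with the Hadoop libraries, is on the Spark classpath; otherwise Spark cannot load the `SparkCatalog` class.
- The `warehouse_path` must be readable and writable by the user running Spark.
- This setup targets local development and testing, since it needs no Hive Metastore or other external catalog service.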
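
One common way to make the Iceberg dependency available is to let Spark download the runtime JAR through the `spark.jars.packages` setting. The sketch below builds the Maven coordinate for it; the helper name is hypothetical, and the version numbers are assumptions that need to match the Spark, Scala, and Iceberg versions of your environment.

```python
def iceberg_runtime_coordinate(
    spark_version: str, scala_version: str, iceberg_version: str
) -> str:
    """Build the Maven coordinate of the Iceberg Spark runtime (hypothetical helper)."""
    return (
        "org.apache.iceberg:"
        f"iceberg-spark-runtime-{spark_version}_{scala_version}:{iceberg_version}"
    )


# Example versions only; check which ones your Spark installation uses.
package = iceberg_runtime_coordinate("3.5", "2.12", "1.5.2")
print(package)
```

The resulting string would be passed as `.config("spark.jars.packages", package)` on the builder before `getOrCreate()` is called, since the setting is read when the session starts.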