# BigLake Iceberg with Docker Setup

This notebook demonstrates how to set up and use Apache Iceberg with Google Cloud BigLake metastore using Docker.

## Overview
- **BigLake**: Google Cloud's unified data lake solution
- **Apache Iceberg**: Open table format for large analytic datasets
- **Docker**: Containerized environment for consistent setup

## Prerequisites
- Docker and Docker Compose installed
- Google Cloud Project with BigQuery enabled
- Service account key file with appropriate permissions
- GCS bucket for storing Iceberg data

## 1. Configuration Setup

First, let's configure all the necessary parameters for our BigLake Iceberg environment.
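
Several of the values configured here are derived from one another (the bucket name, the warehouse path, and the BigQuery connection string all depend on the project ID and region). The following is a minimal sketch of grouping them in one place; the `BigLakeConfig` class is a hypothetical helper and not part of this notebook, and the naming schemes simply mirror the cell below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BigLakeConfig:
    """Hypothetical container for the settings used in this notebook."""
    project_id: str
    region: str = "us-central1"
    dataset: str = "my_iceberg_metastore"
    catalog: str = "iceberg_on_bq"

    @property
    def bucket_name(self) -> str:
        # Same naming scheme as BUCKET_NAME in this notebook
        return f"{self.project_id}-docker-bucket"

    @property
    def warehouse(self) -> str:
        # Warehouse root used by the Iceberg catalog
        return f"gs://{self.bucket_name}/{self.catalog}"

    @property
    def connection(self) -> str:
        # Same format as BIGLAKE_CONNECTION in this notebook
        return (f"projects/{self.project_id}/locations/{self.region}"
                f"/connections/default-{self.region}")


cfg = BigLakeConfig(project_id="my-project-id")
print(cfg.warehouse)
print(cfg.connection)
```

Deriving the dependent values from `project_id` and `region` means only those two need to change when you switch projects.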

In [None]:
# Import required libraries
import os
import subprocess
import time
from pyspark.sql import SparkSession

# =============================================================================
# PROJECT CONFIGURATION
# Replace these values with your actual project details
# =============================================================================

# Google Cloud Project Settings
# TODO: Replace with your GCP project ID
PROJECT_ID = "my-project-id"
REGION = "us-central1"        # TODO: Replace with your preferred region

# Google Cloud Storage bucket for Iceberg table data
BUCKET_NAME = f"{PROJECT_ID}-docker-bucket"  # Must already exist (see Prerequisites)

# =============================================================================
# DOCKER CONFIGURATION
# =============================================================================

CONTAINER_NAME = "biglake-iceberg-env"  # Name for our Docker container
JUPYTER_PORT = 8888                     # Port for Jupyter Lab access
SPARK_UI_PORT = 4040                    # Port for Spark Web UI

# =============================================================================
# AUTHENTICATION CONFIGURATION
# =============================================================================

# Service account key file (must be in current directory)
# TODO: Replace with your key file name
GCP_KEY_FILE = "my-service-account-key.json"  # Placeholder file name
GCP_KEY_PATH = "/home/jovyan/gcp-key.json"  # Path inside Docker container

# =============================================================================
# BIGLAKE METASTORE CONFIGURATION
# =============================================================================

BIGLAKE_DATASET = "my_iceberg_metastore"  # BigQuery dataset for metastore
BIGLAKE_CATALOG = "iceberg_on_bq"         # Iceberg catalog name
BIGLAKE_CONNECTION = f"projects/{PROJECT_ID}/locations/{REGION}/connections/default-{REGION}"

# Display configuration summary
print("Configuration loaded successfully")
print(f"Project: {PROJECT_ID}")
print(f"Region: {REGION}")
print(f"BigLake Dataset: {BIGLAKE_DATASET}")
print(f"Iceberg Catalog: {BIGLAKE_CATALOG}")
print(f"GCS Bucket: {BUCKET_NAME}")

## 2. Prerequisites Check

Before setting up Docker, let's verify that all required components are available.
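
Depending on how Docker was installed, Compose may be available as the standalone `docker-compose` binary or as the `docker compose` plugin. The sketch below shows one way to detect which form is present. `find_compose_command` is a hypothetical helper, and its `which` and `run` parameters exist only so it can be exercised without Docker installed.

```python
import shutil
import subprocess


def find_compose_command(which=shutil.which, run=subprocess.run):
    """Return the Compose invocation available on this machine, or None."""
    # Standalone binary, as used by the cells in this notebook
    if which("docker-compose"):
        return ["docker-compose"]
    # Compose plugin, invoked as a docker subcommand
    if which("docker"):
        try:
            run(["docker", "compose", "version"],
                check=True, capture_output=True)
            return ["docker", "compose"]
        except (subprocess.CalledProcessError, OSError):
            pass
    return None


compose = find_compose_command()
print(compose if compose else "Docker Compose not found")
```

If only the plugin form is found, the `docker-compose` calls in this notebook would need to be replaced with the returned command list plus the subcommand, for example `compose + ["build"]`.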

In [None]:
def check_prerequisites():
    """
    Verify that Docker, Docker Compose, and the GCP service account key are available.

    Raises:
        Exception: If Docker or Docker Compose is not installed, or the key file is missing
    """
    print("Checking prerequisites...")

    # Check if Docker is installed and accessible
    try:
        result = subprocess.run(["docker", "--version"],
                                check=True, capture_output=True, text=True)
        print(f"Docker is available: {result.stdout.strip()}")
    except (subprocess.CalledProcessError, FileNotFoundError):
        raise Exception("Docker is not installed or not in PATH")

    # Check if Docker Compose is available
    try:
        result = subprocess.run(["docker-compose", "--version"],
                                check=True, capture_output=True, text=True)
        print(f"Docker Compose is available: {result.stdout.strip()}")
    except (subprocess.CalledProcessError, FileNotFoundError):
        raise Exception("Docker Compose is not installed or not in PATH")

    # Check if GCP service account key file exists
    if not os.path.exists(GCP_KEY_FILE):
        raise Exception(
            f"GCP key file '{GCP_KEY_FILE}' not found.\n"
            f"Please place your service account key file in the current directory."
        )

    print(f"GCP service account key found: {GCP_KEY_FILE}")
    print("All prerequisites satisfied!")


# Run the prerequisites check
check_prerequisites()

## 3. Docker Environment Setup

Now we'll create the Docker configuration files and build our containerized environment.

### What we're creating:
- **Dockerfile**: Defines our container image with Spark, Iceberg, and BigLake dependencies
- **docker-compose.yml**: Orchestrates the container with proper port mappings and volumes
- **notebooks/**: Directory for Jupyter notebooks
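
The Iceberg runtime JAR used in this section's Dockerfile encodes the Spark and Scala versions it targets in its file name, and the Spark version should line up with the base image tag. The sketch below parses the file name and compares it with the tag. `check_runtime_jar` is a hypothetical helper, and it assumes the image tag has the form `spark-<major>.<minor>.<patch>`.

```python
import re

# Artifact naming convention:
# iceberg-spark-runtime-<spark major.minor>_<scala binary>-<iceberg version>.jar
RUNTIME_JAR = re.compile(
    r"iceberg-spark-runtime-(?P<spark>\d+\.\d+)_(?P<scala>\d+\.\d+)"
    r"-(?P<iceberg>[\w.\-]+)\.jar$"
)


def check_runtime_jar(jar_name: str, image_tag: str) -> dict:
    """Parse the jar name and compare its Spark line with the image tag."""
    match = RUNTIME_JAR.search(jar_name)
    if match is None:
        raise ValueError(f"Unrecognized runtime jar name: {jar_name}")
    info = match.groupdict()
    # Assumption: the tag looks like 'spark-3.5.0'
    image_spark = image_tag.removeprefix("spark-")
    info["matches_image"] = image_spark.startswith(info["spark"] + ".")
    return info


print(check_runtime_jar("iceberg-spark-runtime-3.5_2.12-1.6.1.jar", "spark-3.5.0"))
```

The check covers only the Spark version; whether the Scala build matches the image's Spark distribution is not verified here.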

In [None]:
print("Creating Docker configuration files...")

# =============================================================================
# CREATE DOCKERFILE
# =============================================================================

dockerfile_content = f"""
# Base image with Jupyter and PySpark pre-installed
FROM jupyter/pyspark-notebook:spark-3.5.0

# Switch to root user to install system packages
USER root

# Install required system utilities
RUN apt-get update && apt-get install -y wget curl && rm -rf /var/lib/apt/lists/*

# Download required JAR files for BigLake + Iceberg integration
# These JARs enable Spark to work with Iceberg tables and BigQuery metastore
RUN wget -P /usr/local/spark/jars/ https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.6.1/iceberg-spark-runtime-3.5_2.12-1.6.1.jar && \\
    wget -P /usr/local/spark/jars/ https://storage.googleapis.com/spark-lib/bigquery/iceberg-bigquery-catalog-1.6.1-1.0.1-beta.jar && \\
    wget -P /usr/local/spark/jars/ https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar

# Install Python packages for Google Cloud integration
RUN pip install --no-cache-dir google-cloud-storage google-cloud-bigquery

# Switch back to jovyan user (default Jupyter user)
USER jovyan

# Default Spark memory settings (adjust to your machine)
ENV SPARK_OPTS="--driver-memory 4g --executor-memory 4g"
"""

# Write Dockerfile to current directory
with open("Dockerfile", "w") as f:
    f.write(dockerfile_content)

print("Dockerfile created")

# =============================================================================
# CREATE DOCKER-COMPOSE.YML
# =============================================================================

compose_content = f"""
version: '3.8'
services:
  biglake-iceberg:
    build: .                              # Build from local Dockerfile
    container_name: {CONTAINER_NAME}      # Container name for easy reference
    ports:
      - "{JUPYTER_PORT}:{JUPYTER_PORT}"   # Jupyter Lab access
      - "{SPARK_UI_PORT}:{SPARK_UI_PORT}" # Spark Web UI access
    volumes:
      - ./notebooks:/home/jovyan/work     # Mount notebooks directory
      - ./{GCP_KEY_FILE}:/home/jovyan/gcp-key.json:ro  # Mount GCP key (read-only)
    environment:
      - JUPYTER_ENABLE_LAB=yes            # Enable Jupyter Lab interface
      - GOOGLE_APPLICATION_CREDENTIALS=/home/jovyan/gcp-key.json  # GCP auth
    working_dir: /home/jovyan/work        # Set working directory
"""

# Write docker-compose.yml to current directory
with open("docker-compose.yml", "w") as f:
    f.write(compose_content)

print("docker-compose.yml created")

# =============================================================================
# CREATE NOTEBOOKS DIRECTORY
# =============================================================================

# Create notebooks directory if it doesn't exist
os.makedirs("notebooks", exist_ok=True)
print("notebooks/ directory created")

print("Docker configuration completed!")

## 4. Build and Start Docker Container

Now we'll build the Docker image and start the container. This process will:
1. Download the base Jupyter/PySpark image
2. Install required JAR files for Iceberg and BigLake
3. Start the container with Jupyter Lab
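
Startup time varies between machines, so a fixed wait may be too short or needlessly long. As an alternative, the sketch below polls the container logs until a tokenized Jupyter URL appears. `fetch_logs` and `wait_for_jupyter_url` are hypothetical helpers; the sketch assumes Jupyter prints a `http://127.0.0.1:<port>/lab?token=...` line and reads both output streams, since Jupyter typically logs to stderr.

```python
import re
import subprocess
import time

TOKEN_URL = re.compile(r"http://127\.0\.0\.1:\d+/lab\?token=\w+")


def fetch_logs(container: str) -> str:
    """Return the container's combined stdout and stderr."""
    result = subprocess.run(["docker", "logs", container],
                            capture_output=True, text=True)
    return result.stdout + result.stderr


def wait_for_jupyter_url(container, get_logs=fetch_logs,
                         timeout=60.0, interval=2.0):
    """Poll the logs until a tokenized Jupyter URL appears, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        match = TOKEN_URL.search(get_logs(container))
        if match:
            return match.group(0)
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Demo with a fake log source so the sketch runs without Docker
fake_logs = iter(["", "starting...",
                  "[I ServerApp] http://127.0.0.1:8888/lab?token=abc123"])
print(wait_for_jupyter_url("demo", get_logs=lambda _: next(fake_logs),
                           interval=0.01))
```

Against the real container you would call `wait_for_jupyter_url(CONTAINER_NAME)` and fall back to the plain `http://localhost:8888` address if it returns `None`.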

In [None]:
# =============================================================================
# BUILD DOCKER IMAGE
# =============================================================================

print("Building Docker image... (this may take a few minutes)")
print("Downloading base image and dependencies...")

# Build the Docker image using docker-compose
result = subprocess.run(["docker-compose", "build"],
                        capture_output=True, text=True)

if result.returncode != 0:
    print(f"Build failed: {result.stderr}")
    raise Exception("Docker build failed")

print("Docker image built successfully")

# =============================================================================
# START CONTAINER
# =============================================================================

print("Starting container...")

# Start the container in detached mode
result = subprocess.run(["docker-compose", "up", "-d"],
                        capture_output=True, text=True)

if result.returncode != 0:
    print(f"Container start failed: {result.stderr}")
    raise Exception("Container start failed")

print("Container started successfully")

# =============================================================================
# WAIT FOR CONTAINER TO BE READY
# =============================================================================

print("Waiting for Jupyter Lab to initialize...")
time.sleep(10)  # Give container time to start up

# =============================================================================
# GET JUPYTER ACCESS URL
# =============================================================================

print("Retrieving Jupyter Lab access URL...")

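# Get container logs to find the Jupyter URL with its token.
# Jupyter writes its startup messages to stderr, so search both streams.
result = subprocess.run(["docker", "logs", CONTAINER_NAME],
                        capture_output=True, text=True)
logs = result.stdout + result.stderr

# Extract Jupyter URL from logs
jupyter_url = None
url_prefix = f"http://127.0.0.1:{JUPYTER_PORT}/lab?token="
for line in logs.split('\n'):
    if url_prefix in line:
        # Keep only the URL, dropping any log prefix before it
        jupyter_url = line[line.index(url_prefix):].strip()
        break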

# Display access information
print("\n" + "="*60)
print("CONTAINER READY!")
print("="*60)

if jupyter_url:
    print(f"Jupyter Lab: {jupyter_url}")
else:
    print(f"Jupyter Lab: http://localhost:{JUPYTER_PORT}")
    print("If URL doesn't work, check container logs for the token")

print(f"Container: {CONTAINER_NAME}")
print("="*60)

print("\nNext steps:")
print("1. Open Jupyter Lab in your browser")
print("2. Create a new notebook in the 'work' directory")
print("3. Copy and run the Spark configuration code from the next cells")

## 5. Spark Session Configuration

**⚠️ IMPORTANT: Run this code inside the Jupyter container**

Copy the following code to a new notebook cell in Jupyter Lab (running inside the Docker container).

This configures Spark to work with BigLake metastore and Iceberg tables.
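
Authentication problems often trace back to a key file that is mounted incorrectly or is not a service account key. Before creating the Spark session, it can help to check the file first. The sketch below is one way to do that; `validate_key_file` and the list of required fields are assumptions made for illustration, based on the usual layout of a service account key JSON file.

```python
import json
import os
import tempfile

# Fields this sketch assumes a service account key contains
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def validate_key_file(path: str) -> dict:
    """Load a key file and check it looks like a service account key."""
    with open(path) as f:
        key = json.load(f)
    missing = [name for name in REQUIRED_FIELDS if name not in key]
    if missing:
        raise ValueError(f"Key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"Expected a service account key, got type '{key['type']}'")
    # Return only non-secret fields for display
    return {"project_id": key["project_id"], "client_email": key["client_email"]}


# Demo with a throwaway file; inside the container you would pass GCP_KEY_PATH
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"type": "service_account", "project_id": "my-project-id",
               "private_key": "placeholder",
               "client_email": "demo@my-project-id.iam.gserviceaccount.com"}, tmp)
print(validate_key_file(tmp.name))
os.remove(tmp.name)
```

The check confirms only the shape of the file, not that the account has the permissions BigQuery and GCS require.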

In [None]:
# =============================================================================
# COPY THIS CODE TO JUPYTER LAB (INSIDE DOCKER CONTAINER)
# =============================================================================

from pyspark.sql import SparkSession

# Configuration - Update these values to match your setup
# TODO: Your Google Cloud project ID
PROJECT_ID = "my-project-id"
REGION = "us-central1"                    # TODO: Your region
BUCKET_NAME = f"{PROJECT_ID}-docker-bucket"  # GCS bucket for Iceberg data
BIGLAKE_DATASET = "my_iceberg_metastore"  # BigQuery dataset for metastore
BIGLAKE_CATALOG = "iceberg_on_bq"         # Iceberg catalog name
GCP_KEY_PATH = "/home/jovyan/gcp-key.json"  # Service account key path

print("Configuring Spark session for BigLake Iceberg...")

# Create Spark session with BigLake and Iceberg configuration
spark = SparkSession.builder \
    .appName("BigLake_Iceberg_Demo") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config(f"spark.sql.catalog.{BIGLAKE_CATALOG}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{BIGLAKE_CATALOG}.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
    .config(f"spark.sql.catalog.{BIGLAKE_CATALOG}.gcp_project", PROJECT_ID) \
    .config(f"spark.sql.catalog.{BIGLAKE_CATALOG}.gcp_location", REGION) \
    .config(f"spark.sql.catalog.{BIGLAKE_CATALOG}.warehouse", f"gs://{BUCKET_NAME}/{BIGLAKE_CATALOG}") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GCP_KEY_PATH) \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .getOrCreate()

print("Spark Session with BigLake catalog created successfully!")
print(f"Spark UI: http://localhost:4040")
print(f"Catalog: {BIGLAKE_CATALOG}")
print(f"Dataset: {BIGLAKE_DATASET}")

# =============================================================================
# CREATE SAMPLE DATA FOR TESTING
# =============================================================================

print("\nCreating sample dataset...")

# Sample employee data for demonstration
sample_data = [
    ("Alice", 25, "Engineering", 75000),
    ("Bob", 30, "Marketing", 65000),
    ("Charlie", 35, "Engineering", 85000),
    ("Diana", 28, "Sales", 60000),
    ("Eve", 32, "Engineering", 90000)
]

# Create DataFrame with named columns; types are inferred (age and salary become bigint)
df = spark.createDataFrame(
    sample_data, ["name", "age", "department", "salary"]
)

print("Sample DataFrame created:")
df.show()

print("Ready to work with BigLake Iceberg tables!")

## 6. BigLake Iceberg Table Operations

**⚠️ IMPORTANT: Also run this code inside the Jupyter container**

This section demonstrates how to:
1. Create Iceberg tables in BigLake metastore
2. Insert data into Iceberg tables
3. Query Iceberg tables
4. View table metadata
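
As a cross-check for the department summary query in this section, the same aggregates can be computed in plain Python from the five sample rows defined in section 5. This is only a sanity check on the expected numbers, not how Spark evaluates the query.

```python
from collections import defaultdict

# Same five rows as sample_data in section 5
rows = [
    ("Alice", 25, "Engineering", 75000),
    ("Bob", 30, "Marketing", 65000),
    ("Charlie", 35, "Engineering", 85000),
    ("Diana", 28, "Sales", 60000),
    ("Eve", 32, "Engineering", 90000),
]

# Group salaries by department
by_dept = defaultdict(list)
for _name, _age, dept, salary in rows:
    by_dept[dept].append(salary)

# (department, employee_count, avg_salary, max_salary, min_salary),
# ordered by average salary, highest first
summary = sorted(
    ((dept, len(s), sum(s) / len(s), max(s), min(s))
     for dept, s in by_dept.items()),
    key=lambda r: r[2],
    reverse=True,
)
for row in summary:
    print(row)
```

If the table is queried after the insert step has run more than once without the drop in step 1, the counts will be multiples of these values.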

In [None]:
# =============================================================================
# COPY THIS CODE TO JUPYTER LAB (INSIDE DOCKER CONTAINER)
# =============================================================================

# Table configuration
TABLE_NAME = f"{BIGLAKE_CATALOG}.{BIGLAKE_DATASET}.employees"

print("Working with BigLake Iceberg tables...")
print(f"Table: {TABLE_NAME}")

# =============================================================================
# STEP 1: CREATE ICEBERG TABLE
# =============================================================================

print("\nCreating Iceberg table in BigLake metastore...")

try:
    # Drop table if it exists (for demo purposes)
    print("Dropping existing table (if any)...")
    spark.sql(f"DROP TABLE IF EXISTS {TABLE_NAME}")

    # Create new Iceberg table with schema
    print("Creating new Iceberg table...")
    spark.sql(f"""
    CREATE TABLE {TABLE_NAME} (
        name STRING,
        age INT,
        department STRING,
        salary BIGINT
    )
    USING ICEBERG
    TBLPROPERTIES (
        'bq_connection'='projects/{PROJECT_ID}/locations/{REGION}/connections/default-{REGION}'
    )
    """)

    print(f"Iceberg table created successfully: {TABLE_NAME}")

except Exception as e:
    print(f"Table creation failed: {e}")
    print("Make sure:")
    print("   - BigQuery dataset exists")
    print("   - BigQuery connection exists")
    print("   - Service account has proper permissions")

In [None]:
# =============================================================================
# STEP 2: INSERT SAMPLE DATA
# =============================================================================

print("\nInserting sample data into Iceberg table...")

try:
    # Insert DataFrame data into Iceberg table
    df.write \
        .format("iceberg") \
        .mode("append") \
        .saveAsTable(TABLE_NAME)

    print("Data inserted successfully")
    print(f"Inserted {df.count()} records")

except Exception as e:
    print(f"Data insertion failed: {e}")
    print("Check GCS bucket permissions and connectivity")

In [None]:
# =============================================================================
# STEP 3: QUERY THE ICEBERG TABLE
# =============================================================================

print("\nQuerying the Iceberg table...")

try:
    # Query 1: Show all records
    print("\n--- All Records ---")
    result = spark.sql(f"SELECT * FROM {TABLE_NAME}")
    result.show()

    # Query 2: Filter by department
    print("\n--- Engineering Department (Sorted by Salary) ---")
    spark.sql(f"""
        SELECT name, age, salary
        FROM {TABLE_NAME}
        WHERE department = 'Engineering'
        ORDER BY salary DESC
    """).show()

    # Query 3: Department summary with aggregations
    print("\n--- Department Summary ---")
    spark.sql(f"""
        SELECT department,
               COUNT(*) as employee_count,
               AVG(salary) as avg_salary,
               MAX(salary) as max_salary,
               MIN(salary) as min_salary
        FROM {TABLE_NAME}
        GROUP BY department
        ORDER BY avg_salary DESC
    """).show()

    print("All queries executed successfully")

except Exception as e:
    print(f"Query failed: {e}")

In [None]:
# =============================================================================
# STEP 4: SHOW TABLE METADATA
# =============================================================================

print("\nTable Metadata and Information:")

try:
    # Show table schema
    print("\n--- Table Schema ---")
    spark.sql(f"DESCRIBE {TABLE_NAME}").show()

    # Show table properties
    print("\n--- Table Properties ---")
    spark.sql(f"SHOW TBLPROPERTIES {TABLE_NAME}").show()

    # Show table location and format
    print("\n--- Table Details ---")
    spark.sql(f"DESCRIBE EXTENDED {TABLE_NAME}").show(truncate=False)

except Exception as e:
    print(f"Metadata query failed: {e}")
else:
    # Report success only when the metadata queries actually ran
    print("\nBigLake Iceberg demo completed successfully!")
    print(f"Table: {TABLE_NAME}")
    print("You can now use this table for your data operations")
    print(f"The table is stored in GCS: gs://{BUCKET_NAME}/{BIGLAKE_CATALOG}")

In [None]:
# =============================================================================
# STOP SPARK SESSION (INSIDE DOCKER CONTAINER)
# Uncomment the lines below when you're done with the Spark session
# =============================================================================

# Stop Spark session
# spark.stop()
# print("Spark session stopped")

## 7. Cleanup (Optional)

When you're done with the demo, you can clean up resources.

In [None]:
# =============================================================================
# CLEANUP RESOURCES
# Uncomment the lines below to clean up when you're done
# =============================================================================

# Stop and remove Docker containers
# import subprocess
# print("Stopping Docker containers...")
# subprocess.run(["docker-compose", "down"], capture_output=True)
# print("Docker containers stopped")

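# To also remove any Docker volumes attached to the containers
# (the bind-mounted notebooks/ directory and the data in GCS are not affected):
# subprocess.run(["docker-compose", "down", "-v"], capture_output=True)
# print("Containers and Docker volumes removed")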

print("Cleanup commands are commented out for safety")
print("Uncomment the lines above to clean up resources when done")