# Dataproc with BigLake Metastore
A guide to setting up a single-node Dataproc cluster with a BigLake Metastore.

**Note:** Steps 1-3 are for provisioning the cluster from your local environment. Subsequent steps are run in the Dataproc Jupyter notebook.

## Step 1: Configure Cluster Settings
Set the configuration for your Dataproc cluster. **Replace placeholders** with your Google Cloud project details.

In [None]:
# CONFIG: Replace these values with your project details.
PROJECT_ID = "my-project-id"  # Your Google Cloud project ID
REGION = "us-central1"      # The region for the cluster
CLUSTER_NAME = "my-single-node-cluster"  # A name for your Dataproc cluster
BUCKET_NAME = f"{PROJECT_ID}-dataproc-bucket"  # A unique GCS bucket name

# Hive settings
HIVE_DB = "my_hive_db"
HIVE_TABLE = f"{HIVE_DB}.people_hive"

# Iceberg on Hive settings
ICEBERG_HIVE_CATALOG = "iceberg_on_hive"
ICEBERG_HIVE_DB = "my_iceberg_db"
ICEBERG_HIVE_TABLE = f"{ICEBERG_HIVE_CATALOG}.{ICEBERG_HIVE_DB}.people_iceberg"
ICEBERG_HIVE_FROM_BQ = f"{ICEBERG_HIVE_CATALOG}.{ICEBERG_HIVE_DB}.people_filtered_bq"
ICEBERG_HIVE_FROM_SPARK = f"{ICEBERG_HIVE_CATALOG}.{ICEBERG_HIVE_DB}.people_filtered_spark"

# Iceberg on BigLake Metastore settings
BQ_DATASET = "my_iceberg_metastore"
BQ_TABLE = f"{BQ_DATASET}.people_biglake"
ICEBERG_BQ_CATALOG = "iceberg_on_bq"
ICEBERG_BIGLAKE_TABLE = f"{ICEBERG_BQ_CATALOG}.{BQ_TABLE}"
ICEBERG_BIGLAKE_FROM_SPARK = f"{ICEBERG_BQ_CATALOG}.{BQ_DATASET}.people_filtered_spark"
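
Because `BUCKET_NAME` is derived from `PROJECT_ID`, a quick sanity check can catch an unusable name before anything is uploaded. The sketch below is a simplified check based on the general Cloud Storage naming rules (3-63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The helper name `is_plausible_bucket_name` is hypothetical, and the check does not cover every rule or guarantee that the name is globally unique.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, '.', '_', '-',
# starting and ending with a letter or digit. This is an approximation.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified naming check."""
    if not _BUCKET_RE.match(name):
        return False
    # Names may not begin with "goog" or contain "google".
    return not (name.startswith("goog") or "google" in name)


print(is_plausible_bucket_name("my-project-id-dataproc-bucket"))          # True
print(is_plausible_bucket_name("example.com:my-project-dataproc-bucket"))  # False
```

The second call shows why the check can matter: a domain-scoped project ID contains a colon, which is not allowed in a bucket name.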

## Step 2: Create and Upload Initialization Script
This script downloads the Iceberg Spark runtime and BigQuery catalog JARs and places them in Spark's classpath. The cell uploads it to a GCS bucket, which must already exist, so that cluster creation can run it as an initialization action.

In [None]:
# This script downloads required JARs for Iceberg on BigLake Metastore integration.
init_script_content = """#!/bin/bash
# install-jars.sh
set -e -x

# URLs for the JAR files
ICEBERG_RUNTIME_URL="https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.6.1/iceberg-spark-runtime-3.5_2.12-1.6.1.jar"
BQ_CATALOG_JAR_GCS="gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.6.1-1.0.1-beta.jar"

# Download JARs to Spark's classpath
wget -P /usr/lib/spark/jars/ "$ICEBERG_RUNTIME_URL"
gsutil cp "$BQ_CATALOG_JAR_GCS" /usr/lib/spark/jars/
"""
# Define local and GCS paths
local_script_path = "install-jars.sh"
gcs_script_path = f"gs://{BUCKET_NAME}/scripts/{local_script_path}"

# Write the script to a local file
with open(local_script_path, "w") as f:
    f.write(init_script_content)

print(f"Initialization script created at: {local_script_path}")

# Upload the script to GCS
!gsutil cp {local_script_path} {gcs_script_path}

print(f"Successfully uploaded script to: {gcs_script_path}")
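
A `!` shell command in a notebook does not raise a Python exception when it fails, so the success message above prints even if the upload did not happen. One possible alternative, sketched below, is to run the command through `subprocess.run` with `check=True`. The helper name `run_checked` is hypothetical, and the demonstration uses the Python interpreter as a stand-in command so the sketch runs without `gsutil`.

```python
import subprocess
import sys
from typing import Sequence


def run_checked(cmd: Sequence[str]) -> str:
    """Run `cmd`; raise CalledProcessError on a non-zero exit; return stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# A command that succeeds returns its output.
print(run_checked([sys.executable, "-c", "print('upload ok')"]).strip())

# A command that fails raises instead of continuing silently.
try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
except subprocess.CalledProcessError as err:
    print(f"command failed with exit code {err.returncode}")
```

In the notebook, the upload could then be written as `run_checked(["gsutil", "cp", local_script_path, gcs_script_path])`, followed by the success message.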

## Step 3: Create Dataproc Cluster with Iceberg Catalog
This command creates a Dataproc cluster with the necessary Spark properties for Iceberg. Setting these at the cluster level ensures the Spark session is pre-configured.

In [None]:
# Define properties for the Dataproc cluster
properties_list = [
    "spark:spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    f"spark:spark.sql.catalog.{ICEBERG_HIVE_CATALOG}=org.apache.iceberg.spark.SparkCatalog",
    f"spark:spark.sql.catalog.{ICEBERG_HIVE_CATALOG}.type=hive",
    f"spark:spark.sql.catalog.{ICEBERG_HIVE_CATALOG}.warehouse=gs://{BUCKET_NAME}/iceberg_on_hive",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}=org.apache.iceberg.spark.SparkCatalog",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}.gcp_project={PROJECT_ID}",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}.gcp_location={REGION}",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}.warehouse=gs://{BUCKET_NAME}/iceberg_on_bq"
]
PROPERTIES = ",".join(properties_list)

# Create the Dataproc cluster
!gcloud dataproc clusters create {CLUSTER_NAME} \
    --project {PROJECT_ID} \
    --region {REGION} \
    --single-node \
    --image-version 2.2-debian12 \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --bucket {BUCKET_NAME} \
    --initialization-actions={gcs_script_path} \
    --properties="{PROPERTIES}"

print(f"Cluster '{CLUSTER_NAME}' creation initiated.")
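
The cluster properties use a `spark:` prefix, while the serverless batch in Step 10 uses the same keys without it. A small formatter, sketched below, could build both strings from one dictionary. The helper name `format_properties` is hypothetical. The fallback to gcloud's alternate-delimiter syntax (`^#^`, described in `gcloud topic escaping`) for values that contain commas is an assumption about how you might want to handle that case; none of the properties in this guide need it.

```python
from typing import Dict


def format_properties(props: Dict[str, str], prefix: str = "") -> str:
    """Join properties for a gcloud --properties flag, adding `prefix` to each key."""
    pairs = [f"{prefix}{key}={value}" for key, value in props.items()]
    if any("," in pair for pair in pairs):
        # A comma inside a value would be read as a separator, so switch to '#'.
        if any("#" in pair for pair in pairs):
            raise ValueError("properties contain both ',' and '#'")
        return "^#^" + "#".join(pairs)
    return ",".join(pairs)


# A subset of the properties from this step, used only for illustration.
shared = {
    "spark.sql.catalog.iceberg_on_hive": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.iceberg_on_hive.type": "hive",
}

print(format_properties(shared, prefix="spark:"))  # cluster form
print(format_properties(shared))                   # serverless batch form
```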

## Step 4: Access Jupyter and Get Spark Session
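Once the cluster is running, open Jupyter through the cluster's Component Gateway. The Spark session is pre-configured by the cluster properties, so no extra Spark setup is needed. Re-run the Step 1 configuration cell in this notebook first, because variables defined in your local environment are not available here.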

In [None]:
# Get the pre-configured Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print("Spark session is active and ready to use.")
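
To confirm that the cluster-level properties actually reached the session, you could look up each catalog key in the session configuration. The sketch below takes any `get(key, default)` callable, so it runs here against a plain dictionary. The helper name `missing_catalogs` is hypothetical, and passing `spark.conf.get` as the callable assumes it accepts `None` as a default value.

```python
from typing import Callable, Iterable, List, Optional


def missing_catalogs(
    get: Callable[[str, Optional[str]], Optional[str]],
    catalogs: Iterable[str],
) -> List[str]:
    """Return the catalog names whose `spark.sql.catalog.<name>` key is unset."""
    return [
        name for name in catalogs
        if get(f"spark.sql.catalog.{name}", None) is None
    ]


# Stand-in for a session in which only the Hive-backed catalog is configured.
fake_conf = {
    "spark.sql.catalog.iceberg_on_hive": "org.apache.iceberg.spark.SparkCatalog",
}

print(missing_catalogs(fake_conf.get, ["iceberg_on_hive", "iceberg_on_bq"]))
# ['iceberg_on_bq']
```

In the notebook, the call would be `missing_catalogs(spark.conf.get, [ICEBERG_HIVE_CATALOG, ICEBERG_BQ_CATALOG])`; an empty list suggests both catalogs are configured.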

## Step 5: Create a Sample DataFrame and Write to HDFS
Create a sample DataFrame and write it to HDFS as a Parquet file.

In [None]:
# Create a sample DataFrame
df = spark.createDataFrame(
    [('Alice', 25), ('Bob', 30), ('Charlie', 35)], ['name', 'age'])

# Write the DataFrame to HDFS
df.write.mode('overwrite').parquet('/user/my_data/people')

# Verify the file was created in HDFS
!hdfs dfs -ls /user/my_data/people

## Step 6: Create a Hive Table
Create an external Hive table over the Parquet files written to HDFS in Step 5.

In [None]:
# Create a Hive database
spark.sql(
    f"CREATE DATABASE IF NOT EXISTS {HIVE_DB} LOCATION 'hdfs:///user/hive_db'")

# Drop the table if it already exists
spark.sql(f"DROP TABLE IF EXISTS {HIVE_TABLE}")

# Create an external Hive table pointing to the HDFS data
spark.sql(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {HIVE_TABLE} (
        name STRING,
        age BIGINT
    )
    STORED AS PARQUET
    LOCATION '/user/my_data/people'
""")

print("--- Standard Hive Table ---")
# Query the Hive table
spark.sql(f"SELECT * FROM {HIVE_TABLE}").show()

## Step 7: Create an Iceberg Table with Hive Metastore
Create an Iceberg table using the Hive metastore.

In [None]:
print("Verifying the Iceberg catalog...")
# Show databases in the Hive Iceberg catalog
spark.sql(f"SHOW DATABASES IN {ICEBERG_HIVE_CATALOG}").show()

# Create a database in the Hive Iceberg catalog
spark.sql(
    f"CREATE DATABASE IF NOT EXISTS {ICEBERG_HIVE_CATALOG}.{ICEBERG_HIVE_DB}")

# Copy the rows of the Hive table into an Iceberg table
spark.sql(f"SELECT * FROM {HIVE_TABLE}") \
    .write \
    .format("iceberg") \
    .mode("overwrite") \
    .saveAsTable(ICEBERG_HIVE_TABLE)

print("--- Iceberg Table on Hive Metastore ---")
# Query the Iceberg table
spark.sql(f"SELECT * FROM {ICEBERG_HIVE_TABLE}").show()

## Step 8: Create an Iceberg Table with BigLake Metastore
Create an Iceberg table using the BigLake metastore.

In [None]:
print("Verifying the BigQuery catalog...")
# Show databases in the BigQuery Iceberg catalog
spark.sql(f"SHOW DATABASES IN {ICEBERG_BQ_CATALOG}").show()

# Create a Cloud resource connection, then the BigQuery dataset that acts as the metastore
!bq mk --connection --location={REGION} --project_id={PROJECT_ID} --connection_type=CLOUD_RESOURCE default-{REGION}
!bq --location={REGION} mk --dataset {PROJECT_ID}:{BQ_DATASET}

# Drop the table if it already exists
spark.sql(f"DROP TABLE IF EXISTS {ICEBERG_BIGLAKE_TABLE}")

# Create the Iceberg table in the BigLake metastore
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {ICEBERG_BIGLAKE_TABLE} (
        name STRING,
        age INT
    )
    USING ICEBERG
    TBLPROPERTIES (
        'bq_connection'='projects/{PROJECT_ID}/locations/{REGION}/connections/default-{REGION}'
    )
""")

# Save the DataFrame to the BigLake metastore
spark.sql(f"SELECT * FROM {ICEBERG_HIVE_TABLE}") \
    .write \
    .format("iceberg") \
    .mode("overwrite") \
    .save(ICEBERG_BIGLAKE_TABLE)

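# Optional: query the Iceberg table through the Spark catalog
# spark.sql(f"SELECT * FROM {ICEBERG_BIGLAKE_TABLE}").show()

# Read the same table through the BigQuery connector to confirm BigQuery can see it
sql_query = f"SELECT * FROM {BQ_TABLE}"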
df = spark.read \
    .format("bigquery") \
    .option("viewsEnabled", "true") \
    .option("query", sql_query) \
    .option("materializationDataset", BQ_DATASET) \
    .load()
df.show()
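
The same table is addressed in two ways in this step: Spark uses the three-part name `catalog.namespace.table`, while BigQuery uses `dataset.table`, where the dataset is the Iceberg namespace. The sketch below makes that mapping explicit. `TableId` and `parse_table_id` are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class TableId(NamedTuple):
    catalog: str
    namespace: str
    table: str

    @property
    def bigquery_name(self) -> str:
        """Name of the table as BigQuery sees it (dataset.table)."""
        return f"{self.namespace}.{self.table}"


def parse_table_id(identifier: str) -> TableId:
    """Split a three-part Spark table identifier into its components."""
    parts = identifier.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.namespace.table, got {identifier!r}")
    return TableId(*parts)


tid = parse_table_id("iceberg_on_bq.my_iceberg_metastore.people_biglake")
print(tid.catalog)        # iceberg_on_bq
print(tid.bigquery_name)  # my_iceberg_metastore.people_biglake
```

With the Step 1 values, `parse_table_id(ICEBERG_BIGLAKE_TABLE).bigquery_name` equals `BQ_TABLE`.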

## Step 9: Push Down Computation to BigQuery
Use the BigQuery Connector to execute a SQL query directly in BigQuery. Only the results are returned to Spark.

In [None]:
# This query is executed directly in BigQuery
bq_sql_query = f"""
SELECT
    name,
    age
FROM
    {BQ_TABLE}
WHERE
    age > 28
"""

print("--- Sending SQL query to BigQuery ---")
print(bq_sql_query)

# Use the 'bigquery' format to send the query to BigQuery
filtered_df = spark.read \
    .format("bigquery") \
    .option("viewsEnabled", "true") \
    .option("query", bq_sql_query) \
    .option("materializationDataset", BQ_DATASET) \
    .load()

print("--- Data returned from BigQuery ---")
filtered_df.show()

# Save the filtered results as an Iceberg table in the Hive metastore
filtered_df.write \
    .format("iceberg") \
    .mode("overwrite") \
    .saveAsTable(ICEBERG_HIVE_FROM_BQ)

print("--- Iceberg Table on Hive Metastore ---")
spark.sql(f"SELECT * FROM {ICEBERG_HIVE_FROM_BQ}").show()
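
Steps 8 and 9 pass the same three options to the BigQuery connector. If you run several pushdown queries, collecting the options in one place may reduce repetition. The helper name `bigquery_query_options` below is hypothetical; the option names and values are the ones already used in this guide.

```python
from typing import Dict


def bigquery_query_options(sql: str, materialization_dataset: str) -> Dict[str, str]:
    """Return the reader options this guide uses to run `sql` inside BigQuery."""
    return {
        "viewsEnabled": "true",
        "query": sql.strip(),
        "materializationDataset": materialization_dataset,
    }


opts = bigquery_query_options(
    "SELECT name, age FROM my_iceberg_metastore.people_biglake WHERE age > 28",
    "my_iceberg_metastore",
)
for key, value in opts.items():
    print(f"{key} = {value}")
```

In the notebook, the read would then be `spark.read.format("bigquery").options(**opts).load()`.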

## Step 10: Push Down Computation with Serverless Spark
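Use Serverless Spark to run the same filter as a batch job. The job writes its results to an Iceberg table in the BigLake metastore catalog; a later cell copies them into the Hive metastore catalog.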

In [None]:
# Python script for the Spark job
pyspark_job_content = f"""
from pyspark.sql import SparkSession

# These values are filled in by the notebook's f-string when this script is generated
ICEBERG_BIGLAKE_TABLE = "{ICEBERG_BIGLAKE_TABLE}"
OUTPUT_TABLE = "{ICEBERG_BIGLAKE_FROM_SPARK}"

def main():
    spark = SparkSession.builder \\
        .appName("Dataproc Serverless Spark Filter") \\
        .getOrCreate()

    # This is a Spark SQL query, not a BigQuery query
    filter_query = f'''
    SELECT
        name,
        age
    FROM
        {{ICEBERG_BIGLAKE_TABLE}}
    WHERE
        age > 28
    '''

    print(f"--- Running Spark SQL query: {{filter_query}} ---")

    filtered_df = spark.sql(filter_query)

    print("--- Filtered data computed by Spark ---")
    filtered_df.show()

    # Save results to a new Iceberg table
    print(f"--- Saving results to: {{OUTPUT_TABLE}} ---")
    filtered_df.write \\
        .format("iceberg") \\
        .mode("overwrite") \\
        .saveAsTable(OUTPUT_TABLE)

    print("Job completed.")

if __name__ == "__main__":
    main()
"""

# Define local and GCS paths for the job script
local_job_path = "filter_job_spark_sql.py"
gcs_job_path = f"gs://{BUCKET_NAME}/scripts/{local_job_path}"

# Write and upload the script
with open(local_job_path, "w") as f:
    f.write(pyspark_job_content)

!gsutil cp {local_job_path} {gcs_job_path}

print(f"Successfully uploaded Spark job script to: {gcs_job_path}")
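
The job script is itself generated by an f-string, which is why it mixes single and doubled braces. Single braces are substituted in the notebook when the script text is built. Doubled braces come out as single braces in the generated file and are evaluated later, when the job runs. The small example below reproduces that behavior; the variable names `table`, `generated`, and `SOURCE` are used only for illustration.

```python
table = "iceberg_on_bq.my_iceberg_metastore.people_biglake"

# Outer f-string: {table} is substituted now, {{SOURCE}} becomes {SOURCE}.
generated = f'''SOURCE = "{table}"
query = f"SELECT name, age FROM {{SOURCE}} WHERE age > 28"
'''
print(generated)

# Executing the generated text evaluates the inner f-string,
# which is what happens when the batch job runs the script.
scope = {}
exec(generated, scope)
print(scope["query"])
```

The doubled backslashes in the job script follow the same idea: `\\` in the notebook string becomes a single line-continuation backslash in the generated file.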

In [None]:
# Define properties for the serverless Dataproc job
properties_list = [
    "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    f"spark.sql.catalog.{ICEBERG_HIVE_CATALOG}=org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{ICEBERG_HIVE_CATALOG}.type=hive",
    f"spark.sql.catalog.{ICEBERG_HIVE_CATALOG}.warehouse=gs://{BUCKET_NAME}/iceberg_on_hive",
    f"spark.sql.catalog.{ICEBERG_BQ_CATALOG}=org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{ICEBERG_BQ_CATALOG}.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
    f"spark.sql.catalog.{ICEBERG_BQ_CATALOG}.gcp_project={PROJECT_ID}",
    f"spark.sql.catalog.{ICEBERG_BQ_CATALOG}.gcp_location={REGION}",
    f"spark.sql.catalog.{ICEBERG_BQ_CATALOG}.warehouse=gs://{BUCKET_NAME}/iceberg_on_bq"
]
PROPERTIES = ",".join(properties_list)

# Submit the PySpark job as a Dataproc Serverless batch
!gcloud dataproc batches submit pyspark {gcs_job_path} \
    --project={PROJECT_ID} \
    --region={REGION} \
    --batch="serverless-spark-engine-job" \
    --version="2.2" \
    --subnet="default" \
    --jars="gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.6.1-1.0.1-beta.jar,https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.6.1/iceberg-spark-runtime-3.5_2.12-1.6.1.jar" \
    --properties="{PROPERTIES}"

print("Serverless batch job submitted.")
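
The batch ID passed with `--batch` is fixed, so submitting the cell a second time can fail while a batch with that ID still exists. One option is to append a short random suffix, as sketched below. The helper name `unique_batch_id` is hypothetical, and the pattern used for validation (4-63 characters from lowercase letters, digits, and hyphens) is an assumption about the ID format rather than something stated in this guide.

```python
import re
import uuid

# Assumed ID format: 4-63 characters; lowercase letters, digits, hyphens;
# starting and ending with a letter or digit.
_BATCH_ID_RE = re.compile(r"^[a-z0-9][a-z0-9-]{2,61}[a-z0-9]$")


def unique_batch_id(prefix: str) -> str:
    """Return `prefix` plus an 8-character random suffix, validated against the pattern."""
    candidate = f"{prefix}-{uuid.uuid4().hex[:8]}"
    if not _BATCH_ID_RE.match(candidate):
        raise ValueError(f"invalid batch ID: {candidate!r}")
    return candidate


batch_id = unique_batch_id("serverless-spark-engine-job")
print(batch_id)
```

The value could then be passed to the command as `--batch={batch_id}`.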

In [None]:
# Query the table created by the serverless Spark job
print("\n--- Data returned from Spark query ---")
spark.sql(f"SELECT * FROM {ICEBERG_BIGLAKE_FROM_SPARK}").show()

# Copy the results into a new Iceberg table in the Hive metastore
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {ICEBERG_HIVE_FROM_SPARK}
    USING iceberg
    AS
    SELECT * FROM {ICEBERG_BIGLAKE_FROM_SPARK}
""")

print("--- Iceberg Table on Hive Metastore ---")
spark.sql(f"SELECT * FROM {ICEBERG_HIVE_FROM_SPARK}").show()

## Step 11: Clean Up Resources
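Delete the Dataproc cluster to avoid ongoing charges. This command removes only the cluster; the GCS bucket, BigQuery dataset, and BigQuery connection created earlier remain. Run it from your local environment, since deleting the cluster also shuts down its Jupyter server.

In [None]:
# Delete the Dataproc cluster (--quiet skips the confirmation prompt)
!gcloud dataproc clusters delete {CLUSTER_NAME} --project {PROJECT_ID} --region {REGION} --quiet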