# Replicating a Single-Node Cloudera Environment with Dataproc (Manual Configuration)
This notebook walks through creating a single-node Hadoop environment on Google Cloud Dataproc, step by step. This version configures the BigQuery Metastore catalog for Iceberg manually, bypassing the Dataproc Iceberg optional component, to show the underlying setup.

## Step 1: Configuration
First, we define the configuration for our Dataproc cluster. **Make sure to replace the placeholder values** with your specific Google Cloud project details.

In [None]:
# IMPORTANT: Fill in these values before running!
PROJECT_ID = "johanesa-playground-326616"  # e.g., my-gcp-project
REGION = "us-central1"      # e.g., us-central1
CLUSTER_NAME = "my-single-node-dataproc-cluster"
# A unique GCS bucket name. Using the project ID as a prefix is a good practice.
BUCKET_NAME = f"{PROJECT_ID}-dataproc-bucket"
# A BigQuery dataset to act as a persistent Iceberg metastore.
BQ_DATASET = "my_iceberg_metastore"
# Catalog Names
ICEBERG_HIVE_CATALOG = "iceberg_on_hive"
ICEBERG_BQ_CATALOG = "iceberg_on_bq"
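
Before any resources are created, it can help to sanity-check the names defined above. The following is a minimal sketch and not part of the original notebook: `find_config_problems` is a hypothetical helper, and the two regular expressions are simplified assumptions about Cloud Storage bucket names and BigQuery dataset IDs, not the full official naming rules.

```python
import re

# Simplified assumption: 3-63 characters, lowercase letters, digits, dots,
# hyphens, and underscores, starting and ending with a letter or digit.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")
# Simplified assumption: letters, digits, and underscores only.
DATASET_RE = re.compile(r"^[A-Za-z0-9_]{1,1024}$")


def find_config_problems(bucket_name, bq_dataset):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not BUCKET_RE.match(bucket_name):
        problems.append(f"Bucket name looks invalid: {bucket_name!r}")
    if not DATASET_RE.match(bq_dataset):
        problems.append(f"BigQuery dataset ID looks invalid: {bq_dataset!r}")
    return problems
```

In the notebook, `find_config_problems(BUCKET_NAME, BQ_DATASET)` should return an empty list before you continue.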

In [None]:
# Contents of the initialization action script that Dataproc runs on the node during cluster creation.
init_script_content = """#!/bin/bash
# install-jars.sh

# This script downloads required JARs for Iceberg + BigQuery Catalog integration.
# It places them directly in Spark's classpath.

set -e -x

# Define variables for JARs
ICEBERG_RUNTIME_URL="https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.6.1/iceberg-spark-runtime-3.5_2.12-1.6.1.jar"
BQ_CATALOG_JAR_GCS="gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.6.1-1.0.1-beta.jar"

# Download the JARs directly into Spark's main jars directory
wget -P /usr/lib/spark/jars/ "$ICEBERG_RUNTIME_URL"
gsutil cp "$BQ_CATALOG_JAR_GCS" /usr/lib/spark/jars/
"""

In [None]:
# Define the local and GCS paths for the script
local_script_path = "install-jars.sh"
gcs_script_path = f"gs://{BUCKET_NAME}/scripts/{local_script_path}"

# Write the content to a local file
with open(local_script_path, "w") as f:
    f.write(init_script_content)

print(f"Initialization script created locally at: {local_script_path}")

# Upload the local script to GCS using a shell command
!gsutil cp {local_script_path} {gcs_script_path}

print(f"Successfully uploaded script to: {gcs_script_path}")
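
The upload above assumes that the bucket already exists; `gsutil cp` does not create it. The following is a minimal sketch of one way to create the bucket first, assuming `gsutil` is on the PATH and authenticated. `build_make_bucket_command` and `ensure_bucket` are hypothetical helpers introduced here for illustration only.

```python
import subprocess


def build_make_bucket_command(project_id, region, bucket_name):
    """Return the gsutil argument list that creates a regional bucket."""
    return ["gsutil", "mb", "-p", project_id, "-l", region, f"gs://{bucket_name}"]


def ensure_bucket(project_id, region, bucket_name):
    """Create the bucket unless `gsutil ls -b` can already see it.

    Assumption: a non-zero exit code from `gsutil ls -b` means the bucket
    is missing. It can also mean the caller lacks access to it.
    """
    probe = subprocess.run(
        ["gsutil", "ls", "-b", f"gs://{bucket_name}"],
        capture_output=True,
    )
    if probe.returncode != 0:
        subprocess.run(
            build_make_bucket_command(project_id, region, bucket_name),
            check=True,
        )
```

Calling `ensure_bucket(PROJECT_ID, REGION, BUCKET_NAME)` before the upload cell would make the notebook safe to run on a fresh project.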

## Step 2: Create a Fully and Manually Configured Dataproc Cluster
This is the most important step. We pass a `--properties` flag to the cluster creation command to set the required Spark configuration. Properties with the `spark:` prefix are written to the cluster's `spark-defaults.conf`, so every Spark session on the cluster, including the one started from Jupyter, picks up the Iceberg catalogs without per-notebook configuration.

In [None]:
# Define ONLY the catalog properties.
properties_list = [
    f"spark:spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    f"spark:spark.sql.catalog.{ICEBERG_HIVE_CATALOG}=org.apache.iceberg.spark.SparkCatalog",
    f"spark:spark.sql.catalog.{ICEBERG_HIVE_CATALOG}.type=hive",
    f"spark:spark.sql.catalog.{ICEBERG_HIVE_CATALOG}.warehouse=gs://{BUCKET_NAME}/iceberg_on_hive",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}=org.apache.iceberg.spark.SparkCatalog",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}.gcp_project={PROJECT_ID}",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}.gcp_location={REGION}",
    f"spark:spark.sql.catalog.{ICEBERG_BQ_CATALOG}.warehouse=gs://{BUCKET_NAME}/iceberg_on_bq"
]
PROPERTIES = ",".join(properties_list)

# The final gcloud command to create the cluster
# It references the GCS path of the script we just uploaded
!gcloud dataproc clusters create {CLUSTER_NAME} \
    --project {PROJECT_ID} \
    --region {REGION} \
    --single-node \
    --image-version 2.2-debian12 \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --bucket {BUCKET_NAME} \
    --initialization-actions={gcs_script_path} \
    --properties="{PROPERTIES}"

print(f"Cluster '{CLUSTER_NAME}' creation process initiated.")
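
The properties above are joined with commas, which is also the separator gcloud uses to split the `--properties` value, so a property value that itself contains a comma would be split incorrectly. gcloud supports an alternate delimiter syntax (a `^DELIM^` prefix, described in `gcloud topic escaping`). The following is a minimal sketch of a helper that switches to it only when needed. `join_gcloud_properties` is a hypothetical name, and the candidate delimiters are an arbitrary choice.

```python
def join_gcloud_properties(properties, default_delimiter=","):
    """Join key=value strings into a single --properties argument.

    Uses a plain comma-separated list when no entry contains a comma.
    Otherwise it picks a delimiter that appears in none of the entries
    and prefixes the result with ^DELIM^.
    """
    if not any(default_delimiter in prop for prop in properties):
        return default_delimiter.join(properties)
    for candidate in ("#", ";", "|", "@"):
        if not any(candidate in prop for prop in properties):
            return f"^{candidate}^" + candidate.join(properties)
    raise ValueError("No unused delimiter available for these properties")
```

With the properties in this notebook, the helper returns the same comma-separated string as `",".join(properties_list)`. Keep the value quoted on the command line, as the cell above already does.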

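## Step 3: Accessing the Jupyter Notebook
Because the cluster was created with the Jupyter optional component and `--enable-component-gateway`, the notebook server is reachable from the Google Cloud console. Open the Dataproc Clusters page, select the cluster, open the Web Interfaces tab, and click the Jupyter link. The remaining steps run in a notebook on the cluster, so re-run the Step 1 configuration cell there first so that the variables used below are defined.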

### Step 4: Get the Pre-Configured Spark Session
Because all configurations were set at the cluster level, we do not need to stop or configure the session. We simply get the default session that Jupyter started, which now has all our settings.

In [None]:
# Attach to the Spark session that carries the cluster-level configuration.
from pyspark.sql import SparkSession

# Get the existing, pre-configured Spark session. DO NOT use spark.stop()
spark = SparkSession.builder.getOrCreate()

print(f"Spark {spark.version} session is active. Ready to use.")
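
Getting a session does not by itself show that the cluster-level properties were applied. The following is a minimal sketch of a check that compares expected settings against the live configuration. `find_setting_mismatches` is a hypothetical helper. It takes any getter with the signature `get(key, default)`, which is an assumption that holds for both `dict.get` and `spark.conf.get`.

```python
def find_setting_mismatches(get_conf, expected):
    """Return {key: (expected_value, actual_value)} for settings that differ.

    get_conf is called as get_conf(key, None), so a missing key shows up
    with an actual value of None.
    """
    mismatches = {}
    for key, want in expected.items():
        have = get_conf(key, None)
        if have != want:
            mismatches[key] = (want, have)
    return mismatches
```

In the notebook, you could call `find_setting_mismatches(spark.conf.get, {f"spark.sql.catalog.{ICEBERG_HIVE_CATALOG}.type": "hive"})`. An empty dictionary means the setting reached the session.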

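### Step 5: Write Sample Data to HDFS
We create a small DataFrame and write it to HDFS as plain Parquet files. The Hive table in the next step reads these files.

In [None]: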
# Create a sample DataFrame that we will reuse across multiple steps.
df = spark.createDataFrame(
    [('Alice', 25), ('Bob', 30), ('Charlie', 35)], ['name', 'age'])

# Write data to HDFS as plain Parquet files
df.write.mode('overwrite').parquet('/user/my_data/people')

!hdfs dfs -ls /user/my_data/people

### Step 6: Create a Hive Table
The database is created with an explicit HDFS location, and the external table points at the Parquet files written in the previous step, so no additional Hive warehouse configuration is needed.

In [None]:
hive_db = "my_hive_db"
hive_table = f"{hive_db}.people_hive"

spark.sql(f"CREATE DATABASE IF NOT EXISTS {hive_db} LOCATION 'hdfs:///user/hive_db'")

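# Drop any existing table definition so the schema below takes effect.
# The table is EXTERNAL, so dropping it leaves the Parquet files in place.
spark.sql(f"DROP TABLE IF EXISTS {hive_table}")

# Create the table with age as BIGINT: Spark infers Python ints as LongType (64-bit)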
spark.sql(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {hive_table} (
        name STRING,
        age BIGINT
    )
    STORED AS PARQUET
    LOCATION '/user/my_data/people'
""")

print("--- Standard Hive Table ---")
spark.sql(f"SELECT * FROM {hive_table}").show()

### Step 7: Create an Iceberg Table (using Hive Metastore)
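This step uses the `iceberg_on_hive` catalog defined at the cluster level. It tracks Iceberg tables in the cluster's Hive Metastore and stores their data under the `iceberg_on_hive` warehouse path in the GCS bucket.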

In [None]:
print("Verifying the pre-configured Iceberg catalog...")
spark.sql(f"SHOW DATABASES IN {ICEBERG_HIVE_CATALOG}").show()

iceberg_hive_db = "my_iceberg_db"
iceberg_hive_table = f"{ICEBERG_HIVE_CATALOG}.{iceberg_hive_db}.people_iceberg"

spark.sql(
    f"CREATE DATABASE IF NOT EXISTS {ICEBERG_HIVE_CATALOG}.{iceberg_hive_db}")
df.write.format("iceberg").mode("overwrite").saveAsTable(iceberg_hive_table)

print("\n--- Iceberg Table on Internal Hive Metastore ---")
spark.sql(f"SELECT * FROM {iceberg_hive_table}").show()

### Step 8: Create an Iceberg Table (using the Manually Configured BigQuery Metastore)
Now we use our manually configured catalog, `iceberg_on_bq`, which was set at the cluster level. In this catalog, a BigQuery dataset plays the role of the Iceberg database.

In [None]:
print("Verifying the pre-configured BQ catalog...")
spark.sql(f"SHOW DATABASES IN {ICEBERG_BQ_CATALOG}").show()

# Create the BigQuery dataset that serves as the Iceberg database in this catalog
!bq --location={REGION} mk --dataset {PROJECT_ID}:{BQ_DATASET}

# Fully qualified table name: <catalog>.<dataset>.<table>
biglake_table = f"{ICEBERG_BQ_CATALOG}.{BQ_DATASET}.people_biglake_manual"

# Save the DataFrame as an Iceberg table
df.write.mode("overwrite").format("iceberg").saveAsTable(biglake_table)

# Query the table from Spark
spark.sql(f"SELECT * FROM {biglake_table}").show()

## Step 9: Clean Up
Finally, delete the cluster to avoid incurring ongoing charges.

In [None]:
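# --quiet skips the interactive confirmation prompt, which a notebook cell cannot answer
!gcloud dataproc clusters delete {CLUSTER_NAME} --project {PROJECT_ID} --region {REGION} --quiet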
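
Deleting the cluster leaves the Cloud Storage bucket and the BigQuery dataset in place, and both can continue to incur storage charges. The following is a minimal sketch that only builds the removal commands so they can be reviewed before anything is run. `build_cleanup_commands` is a hypothetical helper, and the commands it returns are irreversible once executed.

```python
def build_cleanup_commands(project_id, bucket_name, bq_dataset):
    """Return argument lists that remove the bucket and the dataset.

    Nothing is executed here. Review the lists before passing them to
    subprocess.run or pasting them into a shell.
    """
    return [
        # Recursively deletes all objects and then the bucket itself.
        ["gsutil", "-m", "rm", "-r", f"gs://{bucket_name}"],
        # Deletes the dataset and every table in it without prompting.
        ["bq", "rm", "-r", "-f", "--dataset", f"{project_id}:{bq_dataset}"],
    ]
```

Only run these if nothing else depends on the bucket or the dataset, since the Iceberg data files and table entries are removed along with them.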