# Replicating a Single-Node Cloudera Environment with Dataproc (Manual Configuration)
This notebook is a step-by-step guide to creating a single-node Hadoop environment on Google Cloud Dataproc. It configures the BigQuery Metastore catalog manually rather than through the Dataproc Iceberg optional component, so that the underlying Spark settings stay visible.

## Step 1: Configuration
First, we define the configuration for our Dataproc cluster. **Make sure to replace the placeholder values** with your specific Google Cloud project details.

In [None]:
# IMPORTANT: Fill in these values before running!
PROJECT_ID = "your-gcp-project-id"  # e.g., my-gcp-project
REGION = "your-gcp-region"      # e.g., us-central1
CLUSTER_NAME = "my-single-node-cluster"
# A unique GCS bucket name. Using the project ID as a prefix is a good practice.
BUCKET_NAME = f"{PROJECT_ID}-dataproc-bucket"
# A BigQuery dataset to act as a persistent Iceberg metastore.
BQ_DATASET = "my_iceberg_metastore"
# The name for our manually-configured catalog.
CATALOG_NAME = "biglake_manual"
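
Before creating any resources, it can help to sanity-check these values locally. The sketch below is an optional helper, not part of the original workflow: `validate_config` is a hypothetical name, and the checks cover only a subset of the GCS bucket and BigQuery dataset naming rules (length and allowed characters).

```python
import re
from typing import List


def validate_config(project_id: str, region: str, bucket_name: str, bq_dataset: str) -> List[str]:
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    # The placeholder values from the configuration cell must be replaced.
    if project_id.startswith("your-") or region.startswith("your-"):
        problems.append("PROJECT_ID or REGION still contains a placeholder value")
    # GCS bucket names: 3-63 characters, lowercase letters, digits, '-', '_' and '.',
    # starting and ending with a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]", bucket_name):
        problems.append(f"bucket name {bucket_name!r} does not match GCS naming rules")
    # BigQuery dataset names: letters, digits and underscores, up to 1024 characters.
    if not re.fullmatch(r"[A-Za-z0-9_]{1,1024}", bq_dataset):
        problems.append(f"dataset name {bq_dataset!r} does not match BigQuery naming rules")
    return problems
```

In the notebook, `validate_config(PROJECT_ID, REGION, BUCKET_NAME, BQ_DATASET)` should return an empty list before you continue to Step 2a.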

## Step 2a: Create a Fully and Manually Configured Dataproc Cluster
This is the most important step. We pass a `--properties` flag to the cluster creation command to set the required Spark configuration. Properties with the `spark:` prefix are written to the cluster's `spark-defaults.conf`, so every Spark session on the cluster, including the one the Jupyter kernel starts, picks them up without any per-notebook configuration.

In [None]:
# Check if the bucket exists and create it if it does not
!gcloud storage buckets describe gs://{BUCKET_NAME} || gcloud storage buckets create gs://{BUCKET_NAME} --location={REGION}

# Define the JARs needed for the manual configuration: the Iceberg Spark runtime
# and the BigQuery Metastore catalog implementation.
# `gcloud dataproc clusters create` has no --packages or --jars flags, so the
# JARs are handed to Spark through the spark.jars cluster property instead.
JARS = "https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.6.1/iceberg-spark-runtime-3.5_2.12-1.6.1.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.6.1-1.0.1-beta.jar"

# Define the properties for the Spark session in a readable, multi-line format
properties_list = [
    f"spark:spark.jars={JARS}",
    "spark:spark.sql.warehouse.dir=hdfs:///user/hive/warehouse",
    # Let the built-in spark_catalog create Iceberg tables in the Hive Metastore (used in Step 7)
    "spark:spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog",
    "spark:spark.sql.catalog.spark_catalog.type=hive",
    f"spark:spark.sql.catalog.{CATALOG_NAME}=org.apache.iceberg.spark.SparkCatalog",
    f"spark:spark.sql.catalog.{CATALOG_NAME}.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
    f"spark:spark.sql.catalog.{CATALOG_NAME}.gcp_project={PROJECT_ID}",
    f"spark:spark.sql.catalog.{CATALOG_NAME}.gcp_location={REGION}",
    f"spark:spark.sql.catalog.{CATALOG_NAME}.warehouse=gs://{BUCKET_NAME}/iceberg_warehouse_manual"
]
# The spark.jars value contains a comma, so use gcloud's alternate-delimiter
# syntax (^#^) and separate the properties with '#' instead of ','.
PROPERTIES = "^#^" + "#".join(properties_list)

# Create the Dataproc cluster with all configurations set at creation time
!gcloud dataproc clusters create {CLUSTER_NAME} \
    --region {REGION} \
    --single-node \
    --image-version 2.2-debian12 \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --bucket {BUCKET_NAME} \
    --properties="{PROPERTIES}"
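
gcloud splits the `--properties` value on commas, so any property whose own value contains a comma (a list of JARs in `spark.jars`, for instance) needs gcloud's alternate-delimiter syntax, `^DELIM^`. The following is a minimal sketch of a helper that picks a delimiter automatically; `build_properties_flag` is a hypothetical name, not a gcloud or Dataproc API.

```python
from typing import Dict


def build_properties_flag(properties: Dict[str, str]) -> str:
    """Join cluster properties into one value for gcloud's --properties flag."""
    pairs = [f"{key}={value}" for key, value in properties.items()]
    # Without embedded commas, the default comma-separated form is enough.
    if not any("," in pair for pair in pairs):
        return ",".join(pairs)
    # Otherwise pick a delimiter that appears in no key or value and
    # announce it with gcloud's ^DELIM^ prefix.
    for delimiter in ("#", ";", "|", "@"):
        if not any(delimiter in pair for pair in pairs):
            return f"^{delimiter}^" + delimiter.join(pairs)
    raise ValueError("no candidate delimiter is free of collisions")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_properties_flag({
        "spark:spark.jars": "gs://example-bucket/a.jar,gs://example-bucket/b.jar",
        "spark:spark.sql.warehouse.dir": "hdfs:///user/hive/warehouse",
    }))
```

The returned string can be passed as `--properties="{PROPERTIES}"` in the same way as the hand-built value.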

## Step 2b: Update an Existing Cluster (Optional)
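If a cluster is already running and you want to apply or change the configuration without recreating it, you can try the `gcloud dataproc clusters update` command below. Support for changing properties on a running cluster depends on your gcloud and Dataproc versions, so first confirm that `--update-properties` is listed in the output of `gcloud dataproc clusters update --help`; if it is not, recreate the cluster with the properties from Step 2a. Spark sessions that are already running do not pick up changed properties, so restart the notebook kernel afterwards, and note that some properties may require a cluster restart to take effect.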

In [None]:
# Define the properties for the Spark session in a readable, multi-line format
properties_list = [
    "spark:spark.sql.warehouse.dir=hdfs:///user/hive/warehouse",
    f"spark:spark.sql.catalog.{CATALOG_NAME}=org.apache.iceberg.spark.SparkCatalog",
    f"spark:spark.sql.catalog.{CATALOG_NAME}.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
    f"spark:spark.sql.catalog.{CATALOG_NAME}.gcp_project={PROJECT_ID}",
    # The BigQuery Metastore catalog reads its location from gcp_location
    f"spark:spark.sql.catalog.{CATALOG_NAME}.gcp_location={REGION}",
    f"spark:spark.sql.catalog.{CATALOG_NAME}.warehouse=gs://{BUCKET_NAME}/iceberg_warehouse_manual"
]
PROPERTIES = ",".join(properties_list)

# Update the cluster with the new properties
!gcloud dataproc clusters update {CLUSTER_NAME} \
    --region {REGION} \
    --update-properties="{PROPERTIES}"
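
To confirm which catalog properties a cluster actually carries, you can filter the JSON printed by `gcloud dataproc clusters describe {CLUSTER_NAME} --region {REGION} --format=json`. This is a sketch under the assumption that the properties appear under `config.softwareConfig.properties` with their `spark:` prefix; `catalog_properties` is a hypothetical helper and the sample document is made up for illustration.

```python
import json
from typing import Dict


def catalog_properties(describe_json: str, catalog_name: str) -> Dict[str, str]:
    """Extract the Spark catalog properties for one catalog from `clusters describe` JSON."""
    cluster = json.loads(describe_json)
    properties = cluster.get("config", {}).get("softwareConfig", {}).get("properties", {})
    base = f"spark:spark.sql.catalog.{catalog_name}"
    # Keep the catalog key itself and its dotted sub-keys, but not other
    # catalogs whose names merely share the same prefix.
    return {
        key: value
        for key, value in properties.items()
        if key == base or key.startswith(base + ".")
    }


if __name__ == "__main__":
    # Hypothetical, heavily trimmed describe output.
    sample = json.dumps({"config": {"softwareConfig": {"properties": {
        "spark:spark.sql.catalog.biglake_manual": "org.apache.iceberg.spark.SparkCatalog",
        "spark:spark.sql.catalog.biglake_manual.gcp_project": "example-project",
        "spark:spark.executor.memory": "4g",
    }}}})
    for key, value in catalog_properties(sample, "biglake_manual").items():
        print(key, "=", value)
```

In the notebook, `out = !gcloud dataproc clusters describe ...` captures the command output as a list of lines, which `"\n".join(out)` turns into the string this helper expects.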

## Step 3: Accessing the Jupyter Notebook
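Because the cluster was created with `--enable-component-gateway`, Jupyter is reachable from the Google Cloud console: open **Dataproc > Clusters**, select the cluster, open the **Web Interfaces** tab, and click **Jupyter** or **JupyterLab**. Run the remaining steps from a PySpark notebook on the cluster.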
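
The Component Gateway URLs are also part of the cluster description, so the Jupyter link can be looked up without the console. The sketch below assumes the URLs are listed under `config.endpointConfig.httpPorts` and that the relevant entries contain "Jupyter" in their names; `jupyter_url` is a hypothetical helper and the sample URLs are placeholders.

```python
import json
from typing import Optional


def jupyter_url(describe_json: str) -> Optional[str]:
    """Return the JupyterLab (or Jupyter) gateway URL from `clusters describe` JSON, if any."""
    cluster = json.loads(describe_json)
    ports = cluster.get("config", {}).get("endpointConfig", {}).get("httpPorts", {})
    # Collect every endpoint whose name mentions Jupyter, ignoring case.
    matches = {name: url for name, url in ports.items() if "jupyter" in name.lower()}
    # Prefer JupyterLab when both interfaces are exposed.
    if "JupyterLab" in matches:
        return matches["JupyterLab"]
    return next(iter(matches.values()), None)


if __name__ == "__main__":
    # Hypothetical, trimmed describe output with placeholder URLs.
    sample = json.dumps({"config": {"endpointConfig": {"httpPorts": {
        "Jupyter": "https://example.invalid/jupyter/",
        "JupyterLab": "https://example.invalid/jupyter/lab/",
    }}}})
    print(jupyter_url(sample))
```

Feed it the output of `gcloud dataproc clusters describe {CLUSTER_NAME} --region {REGION} --format=json`; a return value of `None` suggests the Component Gateway or the Jupyter component is not enabled.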

### Step 4: Get the Pre-Configured Spark Session
Because all configurations were set at the cluster level, we do not need to stop or configure the session. We simply get the default session that Jupyter started, which now has all our settings.

In [None]:
from pyspark.sql import SparkSession

# Get the existing, pre-configured Spark session. DO NOT use spark.stop()
spark = SparkSession.builder.getOrCreate()

# Create a sample DataFrame that we will reuse across multiple steps.
df = spark.createDataFrame(
    [('Alice', 25), ('Bob', 30), ('Charlie', 35)], ['name', 'age'])

print("Spark session is ready and sample DataFrame 'df' has been created.")
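
A quick way to confirm that the session really inherited the cluster-level settings is to read them back from the runtime configuration. This is a minimal sketch: `missing_catalog_settings` is a hypothetical helper, and the list of required keys is an assumption based on the properties set in Step 2a.

```python
from typing import Callable, List, Optional


def missing_catalog_settings(
    get_conf: Callable[[str, Optional[str]], Optional[str]],
    catalog_name: str,
) -> List[str]:
    """Return the catalog keys that are unset or empty in the given configuration."""
    base = f"spark.sql.catalog.{catalog_name}"
    required = [base, f"{base}.catalog-impl", f"{base}.warehouse"]
    # get_conf(key, None) must return None for unset keys,
    # which is how spark.conf.get behaves when given an explicit default.
    return [key for key in required if not get_conf(key, None)]


if __name__ == "__main__":
    # Stand-in for spark.conf.get, using a plain dict with hypothetical values.
    fake_conf = {"spark.sql.catalog.biglake_manual": "org.apache.iceberg.spark.SparkCatalog"}
    print(missing_catalog_settings(fake_conf.get, "biglake_manual"))
```

In the notebook, call `missing_catalog_settings(spark.conf.get, CATALOG_NAME)`; an empty list means the catalog keys are present in the running session.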

### Step 5: Write to HDFS
Using the DataFrame created in the previous step, we will now write it to a new directory in HDFS.

In [None]:
!hdfs dfs -mkdir -p /user/my_data
df.write.mode('overwrite').parquet('/user/my_data/people')
!hdfs dfs -ls /user/my_data/people

### Step 6: Create a Hive Table
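The external table below points at the Parquet files written in the previous step, so no data is copied. The database itself is created under the warehouse directory that was set at the cluster level through `spark.sql.warehouse.dir`.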

In [None]:
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
spark.sql("USE my_db")
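# Python ints become LongType in the DataFrame, so the Parquet column is 64-bit: declare age as BIGINT, not INT.
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people (name STRING, age BIGINT) STORED AS PARQUET LOCATION '/user/my_data/people'")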
spark.sql("SELECT * FROM my_db.people").show()

### Step 7: Create an Iceberg Table (using Hive Metastore)
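The same DataFrame can also be stored as an Iceberg table tracked by the Hive Metastore through Spark's built-in `spark_catalog`. This requires `spark_catalog` to be backed by Iceberg's `org.apache.iceberg.spark.SparkSessionCatalog`; without that setting, Spark's default session catalog cannot create Iceberg tables.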

In [None]:
iceberg_hive_table = "spark_catalog.my_db.people_iceberg"
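# createOrReplace() creates the table on the first run; the v1 writer with mode("overwrite") needs an existing table.
df.writeTo(iceberg_hive_table).using("iceberg").createOrReplace()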
spark.sql(f"SELECT * FROM {iceberg_hive_table}").show()

### Step 8: Create an Iceberg Table (using Manual BigLake Metastore)
Now we use our manually configured catalog, `biglake_manual`, which was set at the cluster level.

In [None]:
# Create the BigQuery dataset that will act as the metastore (--force makes re-runs succeed if it already exists)
!bq --location={REGION} mk --force --dataset {PROJECT_ID}:{BQ_DATASET}

# Define the BigLake table name using our manual catalog
biglake_table = f"{CATALOG_NAME}.{BQ_DATASET}.people_biglake_manual"

# Save the DataFrame as an Iceberg table; createOrReplace() also works when the table does not exist yet
df.writeTo(biglake_table).using("iceberg").createOrReplace()

# Query the table from Spark
spark.sql(f"SELECT * FROM {biglake_table}").show()
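
Iceberg exposes table metadata as queryable tables (for example `snapshots` and `history`), which gives a simple check that the write above produced a commit. The helper below is a sketch that only builds the identifier; `metadata_table` is a hypothetical name, and whether every metadata table is available through this particular catalog is an assumption worth verifying.

```python
# A small subset of the metadata tables that Iceberg defines.
ICEBERG_METADATA_TABLES = {"snapshots", "history", "files", "manifests"}


def metadata_table(table: str, name: str = "snapshots") -> str:
    """Build the identifier of an Iceberg metadata table for a catalog table."""
    if name not in ICEBERG_METADATA_TABLES:
        raise ValueError(f"unsupported metadata table: {name!r}")
    return f"{table}.{name}"


if __name__ == "__main__":
    # Hypothetical identifier matching the defaults used in this notebook.
    print(metadata_table("biglake_manual.my_iceberg_metastore.people_biglake_manual"))
```

In the notebook, `spark.sql(f"SELECT snapshot_id, committed_at, operation FROM {metadata_table(biglake_table)}").show()` should list one snapshot per completed write.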

## Step 9: Clean Up
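Finally, delete the cluster to avoid incurring ongoing charges. The GCS bucket and the BigQuery dataset are not removed by this command and persist until you delete them separately.

In [None]:
# --quiet skips the interactive confirmation prompt, which a notebook cell cannot answer reliably
!gcloud dataproc clusters delete {CLUSTER_NAME} --region {REGION} --quiet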
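
If the bucket and dataset were created only for this walkthrough, they can be removed as well. The sketch below just assembles the commands so they can be reviewed before anything is deleted; `cleanup_commands` is a hypothetical helper, and the storage and BigQuery commands are destructive and irreversible once run.

```python
from typing import List


def cleanup_commands(
    project_id: str, region: str, cluster_name: str, bucket_name: str, bq_dataset: str
) -> List[List[str]]:
    """Return the CLI invocations that remove the cluster, the bucket, and the dataset."""
    return [
        # Delete the cluster without an interactive prompt.
        ["gcloud", "dataproc", "clusters", "delete", cluster_name, "--region", region, "--quiet"],
        # Delete the bucket together with every object in it.
        ["gcloud", "storage", "rm", "--recursive", f"gs://{bucket_name}"],
        # Delete the dataset and all tables it contains, without prompting.
        ["bq", "rm", "-r", "-f", "--dataset", f"{project_id}:{bq_dataset}"],
    ]


if __name__ == "__main__":
    # Hypothetical values; print the commands instead of running them.
    for command in cleanup_commands(
        "example-project", "us-central1", "my-single-node-cluster",
        "example-project-dataproc-bucket", "my_iceberg_metastore",
    ):
        print(" ".join(command))
```

Once the printed commands look right, each list can be passed to `subprocess.run(command, check=True)`.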