# Replicating a Single-Node Cloudera Environment with Dataproc
This notebook is a step-by-step guide to creating a single-node Hadoop environment on Google Cloud Dataproc. Run it cell by cell: Steps 1 to 3 run locally to create the infrastructure, and Steps 4 to 8 run inside the cluster's Jupyter environment to perform the data operations.

## Step 1: Configuration
First, we define the configuration for our Dataproc cluster. **Make sure to replace the placeholder values** with your specific Google Cloud project details.

In [None]:
# IMPORTANT: Fill in these values before running!
PROJECT_ID = "your-gcp-project-id"  # e.g., my-gcp-project
REGION = "your-gcp-region"      # e.g., us-central1
CLUSTER_NAME = "my-single-node-cluster"
# A unique GCS bucket name. Using the project ID as a prefix is a good practice.
BUCKET_NAME = f"{PROJECT_ID}-dataproc-bucket"
# A BigQuery dataset to act as a persistent Iceberg metastore.
BQ_DATASET = "my_iceberg_metastore"
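
Before creating any resources, it can help to catch unfilled placeholders and invalid names early. The sketch below is an illustrative check and not part of the original workflow: `validate_config` is a hypothetical helper, and the patterns it uses are a simplified reading of the Cloud Storage bucket and BigQuery dataset naming rules, which have additional cases.

```python
import re

# Simplified patterns (assumption): the official naming rules cover more cases.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")
DATASET_RE = re.compile(r"^[A-Za-z0-9_]{1,1024}$")


def validate_config(project_id, region, bucket_name, bq_dataset):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    if project_id.startswith("your-") or region.startswith("your-"):
        problems.append("PROJECT_ID or REGION still holds a placeholder value")
    if not BUCKET_RE.match(bucket_name):
        problems.append(f"bucket name {bucket_name!r} breaks the basic GCS naming rules")
    if not DATASET_RE.match(bq_dataset):
        problems.append(f"dataset ID {bq_dataset!r} may only contain letters, digits, and underscores")
    return problems


# Example with hypothetical values
print(validate_config("my-gcp-project", "us-central1",
                      "my-gcp-project-dataproc-bucket", "my_iceberg_metastore"))
```

In this notebook the call would be `validate_config(PROJECT_ID, REGION, BUCKET_NAME, BQ_DATASET)`.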

## Step 2a: Create GCS Bucket and a Fully Configured Dataproc Cluster
This step creates the staging bucket and the cluster. The `--properties` flag on the cluster creation command writes the listed settings into the cluster's Spark defaults, so every Spark session started on the cluster picks them up and the notebook does not have to manage the session lifecycle.

The properties we set are:
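- `spark:spark.sql.warehouse.dir`: Sets the Spark SQL warehouse directory to an HDFS path, so that Hive databases and managed tables are stored in HDFS. This is intended to prevent the Hive warehouse error that can otherwise occur when creating a database.
- `spark:spark.sql.catalog.biglake` (and its sub-properties): Registers a catalog named `biglake` for Iceberg tables backed by the BigLake metastore.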

In [None]:
# Check if the bucket exists and create it if it does not
!gcloud storage buckets describe gs://{BUCKET_NAME} || gcloud storage buckets create gs://{BUCKET_NAME} --location={REGION}

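# Define the properties for the Spark session.
# Assumption: the catalog keys follow the Iceberg BigQuery metastore catalog convention
# (SparkCatalog plus a catalog-impl); verify them against the documentation for your image version.
PROPERTIES = ",".join([
    "spark:spark.sql.warehouse.dir=hdfs:///user/hive/warehouse",
    "spark:spark.sql.catalog.biglake=org.apache.iceberg.spark.SparkCatalog",
    "spark:spark.sql.catalog.biglake.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
    f"spark:spark.sql.catalog.biglake.gcp_project={PROJECT_ID}",
    f"spark:spark.sql.catalog.biglake.gcp_location={REGION}",
    f"spark:spark.sql.catalog.biglake.warehouse=gs://{BUCKET_NAME}/iceberg_warehouse",
])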

# Create the Dataproc cluster with the session properties
!gcloud dataproc clusters create {CLUSTER_NAME} \
    --region {REGION} \
    --single-node \
    --image-version 2.2-debian12 \
    --optional-components=JUPYTER,ICEBERG \
    --enable-component-gateway \
    --bucket {BUCKET_NAME} \
    --properties="{PROPERTIES}"
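
The value passed to `--properties` is a comma-separated list of `prefix:key=value` entries, where the `spark:` prefix routes a setting to the cluster's Spark defaults. Because gcloud splits the list on commas, a value that itself contains a comma would be misparsed. The following is a minimal sketch of a builder that guards against this; `build_properties` is a hypothetical helper name, not a gcloud or Dataproc API.

```python
def build_properties(props):
    """Join {key: value} pairs into the comma-separated form expected by --properties."""
    parts = []
    for key, value in props.items():
        if ":" not in key:
            raise ValueError(f"{key!r} is missing a file prefix such as 'spark:'")
        if "," in key or "," in str(value):
            raise ValueError(f"{key!r} contains a comma, which gcloud treats as a separator")
        parts.append(f"{key}={value}")
    return ",".join(parts)


# Example using one property from this notebook
print(build_properties({"spark:spark.sql.warehouse.dir": "hdfs:///user/hive/warehouse"}))
```

Keeping the settings in a dictionary also makes it easier to review or change one property without editing a long string.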

## Step 2b: Apply the Configuration to an Existing Cluster (Optional)
Cluster properties are applied when the cluster is created. To our knowledge, `gcloud dataproc clusters update` changes settings such as worker counts and labels but does not accept Spark properties, so a running cluster cannot be reconfigured in place. To apply or change the configuration, delete the cluster and re-run Step 2a. Data stored in the cluster's HDFS is lost when the cluster is deleted; data in the GCS bucket is kept.

In [None]:
# Delete the existing cluster (this also deletes its HDFS data)
!gcloud dataproc clusters delete {CLUSTER_NAME} \
    --region {REGION} \
    --quiet

# Then re-run the cell in Step 2a to recreate the cluster with the desired properties.

## Step 3: Accessing the Jupyter Notebook
For interactive development, use the Jupyter environment running on the cluster.

**How to Access Jupyter:**
1. Navigate to the **Dataproc** section in the Google Cloud Console.
2. Click on your cluster's name (`my-single-node-cluster`).
3. Go to the **Web Interfaces** tab.
4. Click the **Jupyter** link. This will open the Jupyter environment in a new browser tab.
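
The same link can also be read from the command line. The sketch below assumes that the JSON printed by `gcloud dataproc clusters describe` exposes the Component Gateway links under `config.endpointConfig.httpPorts`; the helper name `jupyter_urls` and the sample document are hypothetical, and the URLs are placeholders.

```python
import json


def jupyter_urls(cluster_json):
    """Extract Jupyter-related Component Gateway URLs from `clusters describe` JSON output."""
    cluster = json.loads(cluster_json)
    ports = cluster.get("config", {}).get("endpointConfig", {}).get("httpPorts", {})
    return {name: url for name, url in ports.items() if "jupyter" in name.lower()}


# Hypothetical, heavily trimmed sample of the describe output
sample = json.dumps({
    "clusterName": "my-single-node-cluster",
    "config": {"endpointConfig": {"httpPorts": {
        "Jupyter": "https://example.invalid/jupyter/",
        "YARN ResourceManager": "https://example.invalid/yarn/",
    }}},
})
print(jupyter_urls(sample))
```

In a local notebook cell, the input could come from `!gcloud dataproc clusters describe {CLUSTER_NAME} --region {REGION} --format=json`, joined into a single string.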

---
### **--- The following cells (Steps 4-8) are intended to be run inside the Dataproc Jupyter Environment ---**

### Step 4: Get Spark Session and Create DataFrame
The session needs no extra configuration in the notebook. `getOrCreate()` returns the default session, which already carries the properties set when the cluster was created.

In [None]:
from pyspark.sql import SparkSession

# Get the existing, pre-configured Spark session
spark = SparkSession.builder.getOrCreate()

# Create a sample DataFrame that we will reuse across multiple steps.
df = spark.createDataFrame([('Alice', 25), ('Bob', 30), ('Charlie', 35)], ['name', 'age'])

print("Spark session is ready and sample DataFrame 'df' has been created.")

### Step 5: Write to HDFS
Using the DataFrame created in the previous step, we will now write it to a new directory in HDFS.

In [None]:
# Create a directory in HDFS
!hdfs dfs -mkdir -p /user/my_data

# Write the DataFrame in Parquet format (Spark writes a directory of part files, not a single file)
df.write.mode('overwrite').parquet('/user/my_data/people')

# Verify the data was written by listing the contents of the directory
!hdfs dfs -ls /user/my_data/people
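
A completed Spark write normally leaves a `_SUCCESS` marker next to the `part-` files, so the listing can be checked programmatically rather than by eye. This is a minimal sketch under that assumption; `summarize_listing` is a hypothetical helper, and the sample listing is illustrative rather than captured from a real cluster.

```python
def summarize_listing(ls_output):
    """Report whether a listing contains a _SUCCESS marker and how many part files it has."""
    names = []
    for line in ls_output.splitlines():
        line = line.strip()
        if not line or line.startswith("Found"):
            continue  # skip blank lines and the "Found N items" header
        # The path is the last whitespace-separated field; keep only the file name.
        names.append(line.split()[-1].rsplit("/", 1)[-1])
    return {
        "success_marker": "_SUCCESS" in names,
        "part_files": sum(1 for name in names if name.startswith("part-")),
    }


# Illustrative listing in the format printed by `hdfs dfs -ls`
sample_listing = """Found 3 items
-rw-r--r--   1 root hadoop          0 2024-01-01 00:00 /user/my_data/people/_SUCCESS
-rw-r--r--   1 root hadoop        700 2024-01-01 00:00 /user/my_data/people/part-00000-abc.snappy.parquet
-rw-r--r--   1 root hadoop        712 2024-01-01 00:00 /user/my_data/people/part-00001-abc.snappy.parquet
"""
print(summarize_listing(sample_listing))
```

In the cluster notebook, the listing can be captured with `listing = !hdfs dfs -ls /user/my_data/people` and passed in as `"\n".join(listing)`.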

### Step 6: Create a Hive Table
With the warehouse directory configured at the cluster level, we can create a Hive database and an external table over the Parquet data written in Step 5.

In [None]:
# Create a database
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
spark.sql("USE my_db")

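# Create an external Hive table pointing to the data in HDFS.
# `age` is BIGINT because Spark infers Python integers as 64-bit longs when writing the Parquet files.
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people (name STRING, age BIGINT) STORED AS PARQUET LOCATION '/user/my_data/people'")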

# Query the table
spark.sql("SELECT * FROM my_db.people").show()
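
The column types in the table definition have to match the types stored in the Parquet files. One way to keep them aligned is to derive the DDL from the DataFrame's own `dtypes` instead of typing it by hand. The following is a minimal sketch of that idea; `external_table_ddl` is a hypothetical helper, and it assumes simple column names and a trusted location string.

```python
def external_table_ddl(table, dtypes, location):
    """Build CREATE EXTERNAL TABLE DDL from (column, type) pairs such as those in `df.dtypes`."""
    columns = ", ".join(f"`{name}` {dtype}" for name, dtype in dtypes)
    return (f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({columns}) "
            f"STORED AS PARQUET LOCATION '{location}'")


# For the sample DataFrame, df.dtypes is [('name', 'string'), ('age', 'bigint')]
print(external_table_ddl("people", [("name", "string"), ("age", "bigint")],
                         "/user/my_data/people"))
```

On the cluster this could be used as `spark.sql(external_table_ddl("people", df.dtypes, "/user/my_data/people"))`.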

### Step 7: Create an Iceberg Table (using Hive Metastore)

In [None]:
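# `spark_catalog` is Spark's default session catalog, backed by the Hive Metastore.
# Assumption: the cluster's Iceberg setup lets `spark_catalog` handle Iceberg tables
# (for example via Iceberg's SparkSessionCatalog); if it does not, this write will fail.
iceberg_hive_table = "spark_catalog.my_db.people_iceberg"

# Create (or replace) the Iceberg table from the DataFrame.
# Unlike `mode("overwrite").save(...)`, `createOrReplace()` also works when the table does not exist yet.
df.writeTo(iceberg_hive_table).using("iceberg").createOrReplace()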

# Query the table
spark.sql(f"SELECT * FROM {iceberg_hive_table}").show()
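
Iceberg records every commit as a snapshot and exposes that history through metadata tables, which is a quick way to confirm that the write went through Iceberg rather than plain Parquet. The sketch below queries the `snapshots` metadata table; `snapshot_history` is a hypothetical helper, and it assumes the catalog in use supports the `<table>.snapshots` identifier form.

```python
def snapshot_history(spark, table):
    """Query Iceberg's `snapshots` metadata table for the given table identifier."""
    query = (f"SELECT committed_at, snapshot_id, operation "
             f"FROM {table}.snapshots ORDER BY committed_at")
    return spark.sql(query)
```

On the cluster, `snapshot_history(spark, iceberg_hive_table).show(truncate=False)` would list one row per commit to the table.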

### Step 8: Create an Iceberg Table (using BigLake Metastore)
Because the `biglake` catalog was configured at the cluster level, we can use it directly without any further configuration.

In [None]:
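# PROJECT_ID, REGION, and BQ_DATASET were defined in the local notebook (Step 1).
# This kernel runs on the cluster, so define them again with the same values.
PROJECT_ID = "your-gcp-project-id"
REGION = "your-gcp-region"
BQ_DATASET = "my_iceberg_metastore"

# Create the BigQuery dataset that will act as the metastore
!bq --location={REGION} mk --dataset {PROJECT_ID}:{BQ_DATASET}

# Define the BigLake table name
biglake_table = f"biglake.{BQ_DATASET}.people_biglake"

# Create (or replace) the Iceberg table in the BigLake catalog
df.writeTo(biglake_table).using("iceberg").createOrReplace()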

# Query the table from Spark
spark.sql(f"SELECT * FROM {biglake_table}").show()

## Step 9: Clean Up
To avoid ongoing charges, delete the cluster when you are finished. Run the following cell in the local notebook, where `CLUSTER_NAME` and `REGION` are defined. Deleting the cluster does not remove the GCS bucket or the BigQuery dataset.

In [None]:
!gcloud dataproc clusters delete {CLUSTER_NAME} --region {REGION}
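
The notebook also creates a GCS bucket and a BigQuery dataset, which outlive the cluster. The sketch below assembles the commands that would remove all three resources and only prints them, since they are destructive. `cleanup_commands` is a hypothetical helper and the argument values are examples; review each command before running it.

```python
def cleanup_commands(project_id, region, cluster_name, bucket_name, bq_dataset):
    """Return the shell commands that would remove the resources created in this notebook."""
    return [
        # Delete the cluster without an interactive confirmation prompt
        f"gcloud dataproc clusters delete {cluster_name} --region {region} --quiet",
        # Delete the bucket together with every object in it
        f"gcloud storage rm --recursive gs://{bucket_name}",
        # Delete the dataset (-d) and its tables (-r) without prompting (-f)
        f"bq rm -r -f -d {project_id}:{bq_dataset}",
    ]


for command in cleanup_commands("my-gcp-project", "us-central1", "my-single-node-cluster",
                                "my-gcp-project-dataproc-bucket", "my_iceberg_metastore"):
    print(command)
```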