# Google Cloud Dataproc Cluster Setup for Multi-User Spark Training

This notebook contains **step-by-step commands** and explanations to provision a Dataproc cluster that works well for **many concurrent learners running small Spark jobs**. It also includes a brief note on how autoscaling can save costs.


## 📌 What is Google Cloud Dataproc?

Google Cloud Dataproc is a **fully managed Apache Spark and Hadoop service**. It provisions clusters quickly, configures components, and integrates with GCS, BigQuery, and other GCP services.

**Key features**:
- Managed Spark/Hadoop/Hive deployments
- Fast cluster startup
- Component Gateway (easy access to web UIs like Jupyter, YARN, Spark)
- Tight integrations with GCS & BigQuery
- Autoscaling support


## 📊 Cluster Architecture

```
                +-----------------------------+
                |         Master Node         |
                |  e2-standard-4 (4 vCPU)     |
                |  Jupyter + Spark Driver     |
                +--------------+--------------+
                               |
          ---------------------------------------------
          |            |            |            |
+---------+  +---------+  +---------+  +---------+
| Worker 1|  | Worker 2|  | Worker 3|  | Worker N|
| e2-std-2|  | e2-std-2|  | e2-std-2|  | e2-std-2|
| 1 exec  |  | 1 exec  |  | 1 exec  |  | 1 exec  |
+---------+  +---------+  +---------+  +---------+
     |            |            |            |
     -----------------------------------------
                    Google Cloud Storage
                (Staging & Temporary Buckets)
```


## 🛠 Steps

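**All commands below are meant to be run in Google Cloud Shell** (or any machine with the `gcloud` and `gsutil` CLIs authenticated to your project). Run them in a single shell session so that the variables from step 1 stay defined in later steps. Each `%%bash` notebook cell starts its own shell and does not share variables with other cells.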


### 1) Set Project Variables

In [None]:
%%bash
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUM=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')
SA="${PROJECT_NUM}-compute@developer.gserviceaccount.com"
echo $PROJECT_ID
echo $PROJECT_NUM
echo $SA


### 2) Create Staging and Temporary Buckets

In [None]:
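%%bash
# Re-derive the project ID: each %%bash cell runs in its own shell
PROJECT_ID=$(gcloud config get-value project)
gsutil mb -l us-east1 "gs://${PROJECT_ID}-dp-staging/"
gsutil mb -l us-east1 "gs://${PROJECT_ID}-dp-temp/"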
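
Because the bucket names are derived from the project ID, a project ID with unusual characters (for example a domain-scoped ID containing `:`) produces a name that bucket creation will reject. The following is a minimal sketch of a pre-check, assuming a simplified version of the Cloud Storage naming rules for names without dots (3 to 63 characters, lowercase letters, digits, `-` and `_`, starting and ending with a letter or digit). The helper names `derived_bucket_names` and `is_plausible_bucket_name` are hypothetical and exist only for this illustration.

```python
import re

# Simplified rule set (an assumption, not the full Cloud Storage specification):
# 3-63 characters, lowercase letters, digits, '-' and '_',
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def derived_bucket_names(project_id: str) -> dict:
    """Build the staging and temp bucket names used in this notebook."""
    return {
        "staging": f"{project_id}-dp-staging",
        "temp": f"{project_id}-dp-temp",
    }


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified check above."""
    return _BUCKET_RE.fullmatch(name) is not None


if __name__ == "__main__":
    # Both project IDs below are made-up examples.
    for project_id in ("my-training-project", "example.com:my-project"):
        for role, name in derived_bucket_names(project_id).items():
            print(role, name, is_plausible_bucket_name(name))
```

A name that fails this check is worth fixing before running `gsutil mb`. A name that passes can still be rejected, for instance when it is already taken, since bucket names are globally unique.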


### 3) Grant IAM Permissions to the Service Account

In [None]:
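%%bash
# Re-derive the variables: each %%bash cell runs in its own shell
PROJECT_ID=$(gcloud config get-value project)
SA="$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')-compute@developer.gserviceaccount.com"
gsutil iam ch "serviceAccount:${SA}:roles/storage.objectAdmin" "gs://${PROJECT_ID}-dp-staging"
gsutil iam ch "serviceAccount:${SA}:roles/storage.objectAdmin" "gs://${PROJECT_ID}-dp-temp"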


### 4) Create the Dataproc Cluster

In [None]:
%%bash
# Re-derive the project ID: each %%bash cell runs in its own shell
PROJECT_ID=$(gcloud config get-value project)

# Spark and YARN properties, one per line for readability
PROPS="spark:spark.dynamicAllocation.enabled=false"
PROPS+=",spark:spark.scheduler.mode=FIFO"
PROPS+=",spark:spark.executor.instances=1"
PROPS+=",spark:spark.executor.cores=1"
PROPS+=",spark:spark.executor.memory=2g"
PROPS+=",spark:spark.executor.memoryOverhead=512m"
PROPS+=",spark:spark.driver.memory=1g"
PROPS+=",spark:spark.driver.memoryOverhead=384m"
PROPS+=",yarn:yarn.scheduler.minimum-allocation-mb=384"
PROPS+=",yarn:yarn.scheduler.minimum-allocation-vcores=1"
PROPS+=",yarn:yarn.scheduler.maximum-allocation-mb=3072"
PROPS+=",yarn:yarn.scheduler.maximum-allocation-vcores=1"

# --no-address gives the nodes internal IPs only, so the cluster's subnet
# must have Private Google Access enabled.
gcloud dataproc clusters create cluster-b533 \
  --region=us-east1 \
  --enable-component-gateway \
  --no-address \
  --image-version=2.2-debian12 \
  --master-machine-type=e2-standard-4 \
  --master-boot-disk-type=pd-balanced \
  --master-boot-disk-size=100 \
  --num-workers=10 \
  --worker-machine-type=e2-standard-2 \
  --worker-boot-disk-type=pd-balanced \
  --worker-boot-disk-size=40 \
  --bucket="${PROJECT_ID}-dp-staging" \
  --temp-bucket="${PROJECT_ID}-dp-temp" \
  --optional-components=JUPYTER \
  --scopes='https://www.googleapis.com/auth/cloud-platform' \
  --properties="$PROPS"
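
The memory settings in the `--properties` flag interact: each Spark request (heap plus overhead) has to fit under `yarn.scheduler.maximum-allocation-mb`, and YARN is commonly described as rounding each request up to a multiple of `yarn.scheduler.minimum-allocation-mb`. The sketch below works through that arithmetic for the values used in step 4. The round-up rule is an assumption about the scheduler's behavior, and the function names are hypothetical.

```python
import math

# Values taken from the --properties flag in step 4 (all in MB).
MIN_ALLOC_MB = 384
MAX_ALLOC_MB = 3072
EXECUTOR_HEAP_MB, EXECUTOR_OVERHEAD_MB = 2048, 512
DRIVER_HEAP_MB, DRIVER_OVERHEAD_MB = 1024, 384


def fits_max_allocation(heap_mb: int, overhead_mb: int, max_alloc_mb: int) -> bool:
    """A request larger than the YARN maximum cannot be scheduled."""
    return heap_mb + overhead_mb <= max_alloc_mb


def yarn_container_mb(heap_mb: int, overhead_mb: int, min_alloc_mb: int) -> int:
    """Round the request up to a multiple of the minimum allocation.

    Assumption: the scheduler normalizes requests this way.
    """
    requested = heap_mb + overhead_mb
    return math.ceil(requested / min_alloc_mb) * min_alloc_mb


if __name__ == "__main__":
    for label, heap, overhead in (
        ("executor", EXECUTOR_HEAP_MB, EXECUTOR_OVERHEAD_MB),
        ("driver", DRIVER_HEAP_MB, DRIVER_OVERHEAD_MB),
    ):
        print(
            label,
            "requested:", heap + overhead,
            "container:", yarn_container_mb(heap, overhead, MIN_ALLOC_MB),
            "fits:", fits_max_allocation(heap, overhead, MAX_ALLOC_MB),
        )
```

Under these assumptions the executor request of 2560 MB becomes a 2688 MB container and the driver request of 1408 MB becomes 1536 MB, both below the 3072 MB maximum. The driver figure matters only when the driver itself runs inside a YARN container (cluster deploy mode).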


### 5) Quick runtime validation in Jupyter

In [None]:
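# Run inside a Jupyter notebook on the Dataproc cluster.
# Assumes a PySpark kernel, where the `spark` session is already defined.
# The end of the range is exclusive, so the expected result is 999999.
spark.range(1, 1_000_000).count()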


## 💡 How Autoscaling Can Save Costs

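Autoscaling lets Dataproc **add** workers when YARN has pending resource requests and **remove** them when the cluster is idle. This suits training workloads, which are **spiky**: busy during class hours and nearly idle otherwise. The cluster created in step 4 has a fixed size of 10 workers; autoscaling takes effect only once an autoscaling policy is attached to the cluster.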

Benefits:
- Pay only for capacity you use
- Faster job starts during spikes
- Optionally use **preemptible** (Spot) secondary workers for cheap burst capacity, accepting that they can be reclaimed at any time
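
To make the cost argument concrete, the sketch below compares worker-hours for a fixed 10-worker cluster against an autoscaled cluster over one day. The schedule is a made-up illustration (a 4-hour class at 10 workers, otherwise a baseline of 2 workers), not measured data, and it ignores the master node, disks, and any per-vCPU service fees.

```python
def worker_hours(schedule):
    """Sum worker-hours over a list of (hours, worker_count) segments."""
    return sum(hours * workers for hours, workers in schedule)


def saving_fraction(baseline_hours: float, actual_hours: float) -> float:
    """Fraction of worker-hours saved relative to the baseline."""
    return 1 - actual_hours / baseline_hours


# Hypothetical day: a fixed cluster runs 10 workers around the clock.
FIXED_SCHEDULE = [(24, 10)]
# Hypothetical autoscaled day: 2 workers for 20 idle hours, 10 for a 4-hour class.
AUTOSCALED_SCHEDULE = [(20, 2), (4, 10)]

if __name__ == "__main__":
    fixed = worker_hours(FIXED_SCHEDULE)
    autoscaled = worker_hours(AUTOSCALED_SCHEDULE)
    print("fixed worker-hours:", fixed)
    print("autoscaled worker-hours:", autoscaled)
    print(f"saving: {saving_fraction(fixed, autoscaled):.0%}")
```

With these illustrative numbers the fixed cluster uses 240 worker-hours and the autoscaled one 80. The real saving depends on how long the busy periods last and how quickly the policy scales down.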
