
gcp-dataproc_serverless-running-notebooks

Objective

Orchestrator to run Notebooks on Dataproc Serverless via Cloud Composer

What is Dataproc Serverless?

Dataproc Serverless lets you run Spark batch workloads without requiring you to provision and manage your own cluster. You can specify workload parameters, and then submit the workload to the Dataproc Serverless service. The service will run the workload on a managed compute infrastructure, autoscaling resources as needed. Dataproc Serverless charges apply only to the time when the workload is executing.

With Dataproc Serverless, you can run notebooks and streamline the end-to-end data science workflow without provisioning a cluster. You can also run Spark batch workloads without provisioning and managing clusters and servers, which improves developer productivity and decreases infrastructure costs. A Fortune 10 retailer uses Dataproc Serverless to optimize retail assortment space for 500M+ items.
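
For a concrete picture of the specify-and-submit flow, here is a minimal, hedged sketch using the google-cloud-dataproc Python client; the project, region, and script URI are placeholders, and this repository itself submits batches through Cloud Composer rather than calling the client directly.

    # A minimal sketch of submitting a PySpark batch to Dataproc Serverless with
    # the google-cloud-dataproc client library. The project, region, and GCS URI
    # are placeholder assumptions; this repository submits its batches from a
    # Cloud Composer DAG instead (see below).
    from google.cloud import dataproc_v1

    project_id = "your-project-id"   # assumption: replace with your project
    region = "us-central1"           # assumption: replace with your region

    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Describe the workload: here, a single PySpark script stored in GCS.
    batch = dataproc_v1.Batch(
        pyspark_batch=dataproc_v1.PySparkBatch(
            main_python_file_uri="gs://your-bucket/path/to/job.py"  # assumption
        )
    )

    # The service provisions and autoscales the compute for this batch; you are
    # billed only while the workload runs.
    operation = client.create_batch(
        parent=f"projects/{project_id}/locations/{region}",
        batch=batch,
    )
    print(operation.result().state)  # waits for the batch to finish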

For more details, see the Dataproc Serverless documentation.

File Directory Structure

dataproc_serverless
├── composer_input
│   ├── jobs/                   vehicle_analytics_executor.py
│   └── DAGs/                   serverless_airflow.py
└── notebooks
    ├── datasets/               electric_vehicle_population.csv
    ├── jupyter/                vehicle_analytics.ipynb
    ├── jupyter/output/         vehicle_analytics_sample_output.ipynb
    ├── jupyter/dependencies/   requirements.txt (optional)
    └── jupyter/refs/           reference.py (optional)

File Details

composer_input

  • vehicle_analytics_executor.py: runs a Papermill execution of the input notebook and writes the output file to the assigned location (a minimal sketch of such an executor appears after the file details below)
  • serverless_airflow.py: orchestrates the workflow
    • Dataproc Batch Creation: this operator runs your workloads on Dataproc Serverless (a fuller DAG sketch follows below)
      create_batch1 = DataprocCreateBatchOperator()
      create_batch2 = DataprocCreateBatchOperator()...
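
A fuller sketch of how serverless_airflow.py might wire these operators together is below; the DAG id, region, batch ids, and batch payload paths are assumptions rather than the repository's exact configuration.

    # A hedged sketch of a Composer DAG that submits Dataproc Serverless batches.
    # DAG id, region, batch ids, and GCS paths below are assumptions.
    from datetime import datetime

    from airflow import models
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateBatchOperator,
    )

    PROJECT_ID = models.Variable.get("gcp_project")
    GCS_BUCKET = models.Variable.get("gcs_bucket")
    REGION = "us-central1"  # assumption: use the region you deploy to

    BATCH_CONFIG = {
        "pyspark_batch": {
            "main_python_file_uri": (
                f"gs://{GCS_BUCKET}/dataproc_serverless/composer_input/jobs/"
                "vehicle_analytics_executor.py"
            ),
            # Assumed arguments telling the executor which notebook to run and
            # where to write the executed output.
            "args": [
                f"gs://{GCS_BUCKET}/dataproc_serverless/notebooks/jupyter/vehicle_analytics.ipynb",
                f"gs://{GCS_BUCKET}/dataproc_serverless/notebooks/jupyter/output/vehicle_analytics_output.ipynb",
            ],
        },
    }

    with models.DAG(
        dag_id="dataproc_serverless_notebook_dag",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Each operator submits one Dataproc Serverless batch. Batch ids must be
        # unique per project/region, so in practice you would template or
        # randomize them per run.
        create_batch1 = DataprocCreateBatchOperator(
            task_id="create_batch1",
            project_id=PROJECT_ID,
            region=REGION,
            batch=BATCH_CONFIG,
            batch_id="vehicle-analytics-batch-1",
        )
        create_batch2 = DataprocCreateBatchOperator(
            task_id="create_batch2",
            project_id=PROJECT_ID,
            region=REGION,
            batch=BATCH_CONFIG,
            batch_id="vehicle-analytics-batch-2",
        )

        create_batch1 >> create_batch2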
      

notebooks

  • vehicle_analytics.ipynb: this file creates a Spark session for the Electric Vehicle Population dataset, loads the CSV file from GCS, and performs fundamental Spark operations. It also installs the Python packages listed in requirements.txt and runs a sample Python reference file (optional)

    # Create (or reuse) the Spark session for this notebook.
    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Spark Session for Electric Vehicle Population") \
        .getOrCreate()

    # Load the dataset from GCS (gcs_bucket is defined earlier in the notebook,
    # e.g. as a Papermill parameter) and preview it.
    path = f"gs://{gcs_bucket}/dataproc_serverless/notebooks/datasets/electric_vehicle_population.csv"
    df = spark.read.csv(path, header=True)
    df.show()

    # Filter by model year and sort by electric range.
    df.filter(df['Model Year'] > 2020).show(5)
    sortedByElectricRange = df.orderBy('Electric Range')
    sortedByElectricRange.show(10)
    
  • jupyter/output/: where Papermill notebook execution outputs are stored (see vehicle_analytics_sample_output.ipynb for a sample output)

  • electric_vehicle_population.csv: this dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) currently registered through the Washington State Department of Licensing (DOL).

  • requirements.txt: Python packages to be installed in the notebook

  • reference.py: Python reference file to be invoked from the notebook
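
As noted for vehicle_analytics_executor.py above, the executed notebooks land in jupyter/output. A minimal Papermill-based executor could look like the sketch below; the command-line interface shown is an assumption, not the repository's exact script.

    # A minimal Papermill executor sketch in the spirit of
    # vehicle_analytics_executor.py. The argument handling is an assumption.
    import sys

    import papermill as pm


    def run(input_notebook: str, output_notebook: str) -> None:
        # Execute the input notebook and write the executed copy, including cell
        # outputs, to the assigned location (e.g. .../notebooks/jupyter/output/).
        pm.execute_notebook(input_notebook, output_notebook)


    if __name__ == "__main__":
        run(sys.argv[1], sys.argv[2])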

I) Orchestrating the End-to-End Notebook Execution Workflow in Cloud Composer

  1. Make sure to modify the GCS paths for the datasets in the notebook and make any additional changes.

  2. Create a Cloud Composer Environment

  3. Find the DAGs folder in your Composer environment and add serverless_airflow.py (the DAG file) to it in order to trigger DAG execution; the DAG folder location is shown in the Cloud Composer console.

  4. Have all the files available in the GCS bucket, except the DAG file, which should go into your Composer DAGs folder.

  5. Create the following Variables in the Airflow UI (a sketch of how the DAG can read them follows these steps):

    • gcp_project - Google Cloud Project to use for the Cloud Dataproc cluster.

    • gce_zone - Google Compute Engine zone where the Cloud Dataproc cluster should be created.

    • gcs_bucket - Google Cloud Storage bucket where all the files are stored

    • phs_region - Persistent History Server region, e.g. us-central1 (optional)

    • phs - Persistent History Server name (optional)

  6. Open the Airflow UI to monitor DAG executions, runs, and logs.
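
For reference, here is a hedged sketch of how the DAG can read these Airflow Variables and attach the optional Persistent History Server to a batch definition; the exact keys used in serverless_airflow.py may differ.

    # A sketch of consuming the Airflow Variables above inside the DAG. The exact
    # usage in serverless_airflow.py may differ; the PHS wiring follows the
    # Dataproc batch API's spark_history_server_config field.
    from airflow.models import Variable

    gcp_project = Variable.get("gcp_project")
    gcs_bucket = Variable.get("gcs_bucket")
    phs_region = Variable.get("phs_region", default_var=None)  # optional
    phs = Variable.get("phs", default_var=None)                # optional

    batch_config = {
        "pyspark_batch": {
            "main_python_file_uri": (
                f"gs://{gcs_bucket}/dataproc_serverless/composer_input/jobs/"
                "vehicle_analytics_executor.py"
            ),
        },
    }

    # Attach the optional Persistent History Server so finished batches remain
    # browsable in the Spark History UI.
    if phs and phs_region:
        batch_config["environment_config"] = {
            "peripherals_config": {
                "spark_history_server_config": {
                    "dataproc_cluster": (
                        f"projects/{gcp_project}/regions/{phs_region}/clusters/{phs}"
                    )
                }
            }
        }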

Closing Note

If you're adapting this example for your own use, consider the following:

  • Setting an appropriate input path within your environment (GCS bucket, DAGs folder, etc.)
  • Setting more appropriate configurations (DAGs, Dataproc cluster, Persistent History Server, etc.)
