## 5. Data Ops - Deploy Spark pipeline using Dataproc Workflows

### Dataproc Workflows

Dataproc Workflows has two types of workflow template:

1. Managed cluster - Creates a new cluster and deletes it once the jobs have completed.
2. Cluster selector - Selects a pre-existing Dataproc cluster to run the jobs on (does not delete the cluster).

This codelab will use option 1 to create a managed cluster workflow template.
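
As a rough illustration of how the two template types differ on the command line, the sketch below builds (but does not run) the `gcloud` arguments for each option. The helper name `cluster_placement_args` and the label `env=dev` are hypothetical and used only for illustration; the template, region and cluster names are the ones used later in this section.

```python
def cluster_placement_args(template, region, managed=True,
                           cluster_name=None, labels=None):
    """Build gcloud arguments that attach a cluster to a workflow template."""
    base = ["gcloud", "dataproc", "workflow-templates"]
    if managed:
        # Option 1: the workflow creates this cluster and deletes it afterwards.
        return base + ["set-managed-cluster", template,
                       "--region", region, "--cluster-name", cluster_name]
    # Option 2: run on an existing cluster whose labels match the selector.
    label_arg = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return base + ["set-cluster-selector", template,
                   "--region", region, "--cluster-labels", label_arg]


managed_cmd = cluster_placement_args(
    "read-csv-to-bq-workflow", "europe-west3",
    cluster_name="spark-workflow-cluster")
selector_cmd = cluster_placement_args(
    "read-csv-to-bq-workflow", "europe-west3",
    managed=False, labels={"env": "dev"})  # hypothetical label

print(" ".join(managed_cmd))
print(" ".join(selector_cmd))
```

Either list could be passed to `subprocess.run` from a notebook cell; the rest of this codelab only uses the managed-cluster form.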

### 5.1 Convert the code above into two Python files

1. A job to convert CSV to BQ tables
2. A job to run predictions by reading bank marketing data from a Hive table

In [15]:
project_id = !gcloud config list --format 'value(core.project)' 2>/dev/null 
project_id = project_id[0]
project_id

'datalake-vol2'

In [20]:
%%writefile job_csv_to_bq_table.py
## Job 1: load a CSV file from GCS and write it to a BigQuery table
print('Job 1')
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Spark - Data Eng Demo') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
    .getOrCreate()

project_id = 'datalake-vol2'
# GCS bucket the BigQuery connector uses to stage data before loading it
gcs_bucket = project_id + '-data'
# name of the BQ dataset (hyphens are replaced with underscores)
dataset_name = (project_id + '-raw').replace('-', '_')
table_name = dataset_name + ".transaction_data_aut"

# This creates a database in the Spark catalog; it does not create a BigQuery dataset
spark.sql(f"CREATE DATABASE IF NOT EXISTS {dataset_name}")

# read the CSV with a header row and let Spark infer the column types
df_transaction_data_from_csv = spark.read \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .csv(f"gs://{gcs_bucket}/dataset/transaction_data_train.csv")

# write the DataFrame to BigQuery, replacing the table if it already exists
df_transaction_data_from_csv.write \
    .format("bigquery") \
    .option("table", table_name) \
    .option("temporaryGcsBucket", gcs_bucket) \
    .mode('overwrite') \
    .save()

Overwriting job_csv_to_bq_table.py
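
The job derives its bucket, dataset and table names from the project ID. The sketch below isolates that naming logic so it can be checked without a Spark session; the function name `bigquery_targets` is hypothetical, and the suffixes `-data` and `-raw` are the ones used in the job above.

```python
def bigquery_targets(project_id, table="transaction_data_aut"):
    """Return (staging bucket, dataset, dataset-qualified table) for a project."""
    gcs_bucket = f"{project_id}-data"
    # The job swaps hyphens for underscores when building the dataset name.
    dataset_name = f"{project_id}-raw".replace("-", "_")
    return gcs_bucket, dataset_name, f"{dataset_name}.{table}"


bucket, dataset, table_name = bigquery_targets("datalake-vol2")
print(bucket)      # datalake-vol2-data
print(dataset)     # datalake_vol2_raw
print(table_name)  # datalake_vol2_raw.transaction_data_aut
```

Passing the project ID as a job argument instead of hard-coding it would let the same script run unchanged in other projects.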


In [None]:
spark.stop()

In [None]:
# %%writefile job_xgboost_predictions.py
# ## Job 2
# print('Job 2')
# from pyspark.sql import SparkSession
# from pyspark.ml import Pipeline, PipelineModel
# from pyspark.ml.feature import StringIndexer

# warehouse_location = 'gs://cloud-native-dl-modernisation-demo/hive-warehouse'
# service_endpoint = 'thrift://hive-cluster-m.us-central1-c:9083'

# spark = SparkSession.builder \
#   .appName('csv_to_hive') \
#   .config("hive.metastore.uris", service_endpoint)  \
#   .config("spark.sql.warehouse.dir", warehouse_location) \
#   .enableHiveSupport() \
#   .getOrCreate()

# # Load the data
# data = spark.sql("""
# SELECT * 
# FROM bank_demo_db.bank_marketing
# """)

# (train_data, test_data) = data.randomSplit([0.7, 0.3], seed=42)

# model_path = 'gs://cloud-native-dl-modernisation-demo/models/xgboost/pipeline_model/bank-marketing'

# loaded_pipeline_model = PipelineModel.load(model_path)
# predictions = loaded_pipeline_model.transform(test_data)

# # save predictions as json
# predictions.write.json('gs://cloud-native-dl-modernisation-demo/models/xgboost/predictions_json/'+spark.sparkContext.applicationId)

### 5.2. Grant service account permission to deploy workflow from notebooks

The Dataproc service account needs to be granted the "Dataproc Editor" IAM role.

Go to https://console.cloud.google.com/iam-admin/iam

Look for the service account email under the name "Google Cloud Dataproc Service Agent". It will be in the format

```bash
service-{project-number}@dataproc-accounts.iam.gserviceaccount.com
```

Edit the account's roles, add the role "Dataproc Editor", and click Save.
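
The same grant can presumably be made from the command line rather than the console. The sketch below only builds the `gcloud projects add-iam-policy-binding` arguments; the helper name and the project number `123456789012` are placeholders, and `roles/dataproc.editor` is assumed to be the role ID behind the "Dataproc Editor" display name.

```python
def dataproc_editor_binding(project_id, project_number):
    """Build the gcloud command granting Dataproc Editor to the service agent."""
    member = (f"serviceAccount:service-{project_number}"
              "@dataproc-accounts.iam.gserviceaccount.com")
    return ["gcloud", "projects", "add-iam-policy-binding", project_id,
            "--member", member, "--role", "roles/dataproc.editor"]


# 123456789012 is a placeholder; substitute the real project number.
cmd = dataproc_editor_binding("datalake-vol2", "123456789012")
print(" ".join(cmd))
```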

In [None]:
## Alternatively run all of the below from the Google Cloud Shell

### 5.3 Create Dataproc managed cluster workflow Template

In [21]:
%env WORKFLOW_ID=read-csv-to-bq-workflow 
%env REGION=europe-west3

env: WORKFLOW_ID=read-csv-to-bq-workflow
env: REGION=europe-west3


In [None]:
%%bash 
echo $REGION

In [None]:
%%bash
gcloud dataproc workflow-templates create $WORKFLOW_ID \
--region $REGION

### 5.4 Configure managed cluster for the workflow template

In [None]:
%%bash
export PROJECT_ID=datalake-vol2
export CLUSTER_NAME=spark-workflow-cluster
export BUCKET_NAME=${PROJECT_ID}-data
export REGION=europe-west3
export NUM_WORKERS=2

gcloud beta dataproc workflow-templates set-managed-cluster $WORKFLOW_ID \
    --cluster-name $CLUSTER_NAME \
    --region $REGION \
    --image-version=1.5-ubuntu18 \
    --master-machine-type n1-standard-4 \
    --num-workers $NUM_WORKERS \
    --worker-machine-type n1-standard-4 \
    --scopes https://www.googleapis.com/auth/cloud-platform \
    --bucket $BUCKET_NAME \
    --enable-component-gateway \
    --properties spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.18.0.jar

### 5.5 Upload PySpark job to GCS

In [22]:
%%bash
export PROJECT_ID=datalake-vol2
export BUCKET_NAME=${PROJECT_ID}-data
gsutil cp job_csv_to_bq_table.py \
 gs://${PROJECT_ID}-data/workflows/python-scripts/job_csv_to_bq_table.py

Copying file://job_csv_to_bq_table.py [Content-Type=text/x-python]...
/ [1 files][  1.0 KiB/  1.0 KiB]                                                
Operation completed over 1 objects/1.0 KiB.                                      


In [None]:
# %%bash
# export PROJECT_ID=cloud-native-dl-modernisation
# export BUCKET_NAME=${PROJECT_ID}-demo
# gsutil cp job_xgboost_predictions.py \
#  gs://${PROJECT_ID}-demo/workflows/spark-bank-marketing/job_xgboost_predictions.py

### 5.6 Add job to workflow template

In [23]:
%%bash
export WORKFLOW_ID=read-csv-to-bq-workflow
export REGION=europe-west3
export PROJECT_ID=datalake-vol2

gcloud dataproc workflow-templates add-job pyspark \
   gs://${PROJECT_ID}-data/workflows/python-scripts/job_csv_to_bq_table.py \
    --region $REGION \
    --step-id csv_to_bq \
    --workflow-template $WORKFLOW_ID

ERROR: (gcloud.dataproc.workflow-templates.add-job.pyspark) INVALID_ARGUMENT: Template contains a duplicate job with step name 'csv_to_bq'


CalledProcessError: Command 'b'export WORKFLOW_ID=read-csv-to-bq-workflow\nexport REGION=europe-west3\nexport PROJECT_ID=datalake-vol2\n\ngcloud dataproc workflow-templates add-job pyspark \\\n   gs://${PROJECT_ID}-data/workflows/python-scripts/job_csv_to_bq_table.py \\\n    --region $REGION \\\n    --step-id csv_to_bq \\\n    --workflow-template $WORKFLOW_ID\n'' returned non-zero exit status 1.
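
The error above means the template already contains a step named `csv_to_bq`, which is what happens when the `add-job` cell is run a second time. One way to guard against this is to inspect the template before adding the step. The sketch below parses a hypothetical, trimmed sample of the JSON that `gcloud dataproc workflow-templates describe ... --format json` would return (assuming the `jobs[].stepId` layout of the WorkflowTemplate resource) and builds the `remove-job` command for the duplicate.

```python
import json


def step_ids(template_json):
    """Return the step IDs already present in a workflow template description."""
    template = json.loads(template_json)
    return [job["stepId"] for job in template.get("jobs", [])]


# Hypothetical, trimmed stand-in for the `describe --format json` output.
sample = json.dumps({
    "id": "read-csv-to-bq-workflow",
    "jobs": [{"stepId": "csv_to_bq"}],
})

step = "csv_to_bq"
if step in step_ids(sample):
    # The step exists: skip add-job, or remove it first with this command.
    remove_cmd = ["gcloud", "dataproc", "workflow-templates", "remove-job",
                  "read-csv-to-bq-workflow", "--region", "europe-west3",
                  "--step-id", step]
    print(" ".join(remove_cmd))
```

Since the script was re-uploaded to the same GCS path, the existing step already points at the updated file, so skipping `add-job` is usually enough.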

In [None]:
# %%bash
# export WORKFLOW_ID=read-csv-to-bq-workflow
# export REGION=europe-west3
# export PROJECT_ID=datalake-vol2

# gcloud dataproc workflow-templates add-job pyspark \
#   gs://${PROJECT_ID}-demo/workflows/spark-bank-marketing/job_xgboost_predictions.py \
#     --region $REGION \
#     --start-after=csv_to_hive \
#     --step-id xgboost_predictions \
#     --workflow-template $WORKFLOW_ID
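
Note that the commented-out command above uses `--start-after=csv_to_hive`, while the step added to this template is named `csv_to_bq`. A quick consistency check, sketched below under the assumption that the template's jobs are described with `stepId` and `prerequisiteStepIds` fields, reports prerequisites that do not refer to a defined step. The function name `missing_prerequisites` is hypothetical.

```python
def missing_prerequisites(jobs):
    """Map each step ID to the prerequisite step IDs that are not defined."""
    defined = {job["stepId"] for job in jobs}
    problems = {}
    for job in jobs:
        missing = [p for p in job.get("prerequisiteStepIds", [])
                   if p not in defined]
        if missing:
            problems[job["stepId"]] = missing
    return problems


# Steps as they would look if the commented-out job were added unchanged.
jobs = [
    {"stepId": "csv_to_bq"},
    {"stepId": "xgboost_predictions", "prerequisiteStepIds": ["csv_to_hive"]},
]
print(missing_prerequisites(jobs))  # {'xgboost_predictions': ['csv_to_hive']}
```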

### 5.7 Run workflow template

In [24]:
%%bash

gcloud dataproc workflow-templates instantiate $WORKFLOW_ID \
--region $REGION

Waiting on operation [projects/datalake-vol2/regions/europe-west3/operations/9b7b17a5-84ba-3c52-b1a3-6b470789a1df].
WorkflowTemplate [read-csv-to-bq-workflow] RUNNING
Creating cluster: Operation ID [projects/datalake-vol2/regions/europe-west3/operations/c5b200dd-d5f6-48b0-9021-0d782cf10b21].
Created cluster: spark-workflow-cluster-4ptd3p6o2gf3e.
Job ID csv_to_bq-4ptd3p6o2gf3e RUNNING
Job ID csv_to_bq-4ptd3p6o2gf3e FAILED
Job ID csv_to_bq-4ptd3p6o2gf3e error: Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/csv_to_bq-4ptd3p6o2gf3e?project=datalake-vol2&region=europe-west3
gcloud dataproc jobs wait 'csv_to_bq-4ptd3p6o2gf3e' --region 'europe-west3' --project 'datalake-vol2'
https://console.cloud.google.com/storage/browser/datalake-vol2-data/google-cloud-dataproc-metainfo/24acf9f0-7f5d-4dac-8e9d-94c401caa3f0/jobs/csv_to_bq-4ptd3p6o2gf3e/
gs://datalake-vol2-data/google-cloud-dataproc-metainfo/24acf9f

CalledProcessError: Command 'b'\ngcloud dataproc workflow-templates instantiate $WORKFLOW_ID \\\n--region $REGION\n'' returned non-zero exit status 1.

### 5.8 View Cluster, workflow and jobs tabs

Go to the Dataproc UI and view the cluster page. You should see the new cluster spinning up.

Once the cluster is ready, view the workflow and jobs tabs to check the progress of the jobs.

### 5.9 Check new predictions table was created
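
If the workflow finishes successfully, the table written by the active job in this section is `datalake_vol2_raw.transaction_data_aut`. A minimal sketch for checking that it exists is shown below; it assumes the `google-cloud-bigquery` client library and application default credentials are available, and the helper names are hypothetical.

```python
def full_table_id(project_id, dataset, table):
    """Build a fully qualified BigQuery table ID."""
    return f"{project_id}.{dataset}.{table}"


def table_exists(table_id):
    """Return True if the table exists (requires google-cloud-bigquery)."""
    # Imported lazily so the rest of the sketch runs without the library.
    from google.api_core.exceptions import NotFound
    from google.cloud import bigquery

    client = bigquery.Client()
    try:
        client.get_table(table_id)
        return True
    except NotFound:
        return False


table_id = full_table_id("datalake-vol2", "datalake_vol2_raw",
                         "transaction_data_aut")
print(table_id)
# Uncomment where credentials and the client library are available:
# print(table_exists(table_id))
```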

### 5.10 Schedule workflows

See the guide on how to schedule Dataproc workflows:

https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions
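
Whichever scheduler is chosen, it ultimately has to instantiate the workflow template. The sketch below shows one way a scheduled function might do that with the `google-cloud-dataproc` Python client; it assumes that library and suitable credentials are available, and the function names are hypothetical.

```python
def template_name(project_id, region, workflow_id):
    """Build the resource name of a regional workflow template."""
    return (f"projects/{project_id}/regions/{region}"
            f"/workflowTemplates/{workflow_id}")


def run_workflow(project_id, region, workflow_id):
    """Instantiate the template and block until the workflow finishes."""
    # Imported lazily so the name helper works without the library installed.
    from google.cloud import dataproc_v1

    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"})
    operation = client.instantiate_workflow_template(
        name=template_name(project_id, region, workflow_id))
    operation.result()  # raises if the workflow fails


name = template_name("datalake-vol2", "europe-west3", "read-csv-to-bq-workflow")
print(name)
# Uncomment where credentials and the client library are available:
# run_workflow("datalake-vol2", "europe-west3", "read-csv-to-bq-workflow")
```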