## 5. Data Ops - Deploy Spark pipeline using Dataproc Workflows

### Dataproc Workflows

Dataproc Workflows has two types of workflow templates:

1. Managed cluster - Creates a new cluster and deletes it once the jobs have completed.
2. Cluster selector - Selects a pre-existing Dataproc cluster to run the jobs on (does not delete the cluster).

This codelab will use option 1 to create a managed cluster workflow template.
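
The two template types differ only in the template's `placement` field. As a rough sketch, the snippet below builds both shapes as plain Python dictionaries. The `managedCluster` shape mirrors the YAML that `gcloud` prints in section 5.6; the `clusterSelector` / `clusterLabels` shape follows the Dataproc WorkflowTemplate REST resource. The helper names and the label value are illustrative assumptions, not part of this codelab.

```python
def managed_cluster_placement(cluster_name, config=None):
    """Option 1: the workflow creates its own cluster and deletes it afterwards."""
    return {"managedCluster": {"clusterName": cluster_name, "config": config or {}}}


def cluster_selector_placement(cluster_labels):
    """Option 2: the workflow runs on an existing cluster matched by labels."""
    return {"clusterSelector": {"clusterLabels": dict(cluster_labels)}}


managed = managed_cluster_placement("spark-workflow-cluster")
# The label key and value below are hypothetical.
selector = cluster_selector_placement({"env": "demo"})

print(managed)
print(selector)
```

A template carries exactly one of the two placements, which is why the rest of this section only ever sets `managedCluster`.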

### 5.1 Convert the code above into two Python files

1. Job to convert CSV to Hive Tables
2. Job to run predictions on Hive Tables

In [None]:
%%writefile job_csv_to_hive.py
## Job 1
print('Job 1')
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

warehouse_location = 'gs://dataproc-datalake-demo/hive-warehouse'
service_endpoint = 'thrift://hive-cluster-m.us-central1-f:9083'

spark = SparkSession.builder \
  .appName('csv_to_hive') \
  .config("hive.metastore.uris", service_endpoint)  \
  .config("spark.sql.warehouse.dir", warehouse_location) \
  .enableHiveSupport() \
  .getOrCreate()

#To-Do add code from notebook job 1 

Writing job_csv_to_hive.py


In [None]:
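%%writefile job_xgboost_predictions.py
## Job 2
print('Job 2')
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

warehouse_location = 'gs://dataproc-datalake-demo/hive-warehouse'
service_endpoint = 'thrift://hive-cluster-m.us-central1-f:9083'

# Each workflow job runs as a separate process, so it needs its own SparkSession
spark = SparkSession.builder \
  .appName('xgboost_predictions') \
  .config("hive.metastore.uris", service_endpoint)  \
  .config("spark.sql.warehouse.dir", warehouse_location) \
  .enableHiveSupport() \
  .getOrCreate()

# Load the data
data = spark.sql("""
SELECT *
FROM bank_demo_db.bank_marketing
""")

model_path = 'gs://dataproc-datalake-examples/xgboost/pipeline_model/bank-marketing'

# To-Do: import XGBoostClassificationModel from the XGBoost Spark package used earlier in the notebook
loaded_model = XGBoostClassificationModel().load(model_path)

loaded_pipeline_model = PipelineModel.load(model_path)

#To-Do add code from notebook job 2 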

Writing job_xgboost_predictions.py


### 5.2. Grant service account permission to deploy workflow from notebooks

The Dataproc service account needs to be granted the "Dataproc Editor" IAM role.

Go to https://console.cloud.google.com/iam-admin/iam

Look for the service account email under the name "Google Cloud Dataproc Service Agent". It will be in the format

```bash
service-{project-number}@dataproc-accounts.iam.gserviceaccount.com
```

Edit its roles, add the "Dataproc Editor" role, and press Save.
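
For anyone who prefers the command line to the console, the sketch below derives the service agent email from a project number and assembles an equivalent `gcloud projects add-iam-policy-binding` command. It only prints the command rather than running it. The helper names and the project number `123456789012` are placeholders, and `roles/dataproc.editor` is assumed to be the role ID behind the "Dataproc Editor" display name.

```python
import shlex


def dataproc_service_agent(project_number):
    """Build the Dataproc service agent email in the format shown above."""
    return f"service-{project_number}@dataproc-accounts.iam.gserviceaccount.com"


def grant_editor_command(project_id, project_number):
    """Assemble (but do not run) the gcloud command that grants the role."""
    member = f"serviceAccount:{dataproc_service_agent(project_number)}"
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member={member}",
        "--role=roles/dataproc.editor",
    ]


# 123456789012 is a placeholder; use your own project number.
cmd = grant_editor_command("dataproc-datalake", 123456789012)
print(shlex.join(cmd))
```

Copy the printed command into Cloud Shell, or pass `cmd` to `subprocess.run` from an environment where `gcloud` is installed and authenticated.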

In [None]:
## Alternatively run all of the below from the Google Cloud Shell

### 5.3 Create Dataproc managed cluster workflow Template

In [None]:
%%bash
export WORKFLOW_ID=bank-marketing-workflow
echo $WORKFLOW_ID

bank-marketing-workflow


In [None]:
%%bash
export WORKFLOW_ID=bank-marketing-workflow
export REGION=us-central1

gcloud dataproc workflow-templates create $WORKFLOW_ID \
--region $REGION

### 5.4 Configure managed cluster for the workflow template

In [None]:
%%bash
export WORKFLOW_ID=bank-marketing-workflow

export PROJECT_ID=dataproc-datalake
export CLUSTER_NAME=spark-workflow-cluster
export BUCKET_NAME=${PROJECT_ID}-demo
export REGION=us-central1
export ZONE=us-central1-f

gcloud beta dataproc workflow-templates set-managed-cluster $WORKFLOW_ID \
    --cluster-name $CLUSTER_NAME \
    --region $REGION \
    --zone $ZONE \
    --image-version preview-ubuntu \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4 \
    --optional-components=ANACONDA,JUPYTER \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \
    --metadata rapids-runtime=SPARK \
    --bucket $BUCKET_NAME

### 5.5 Upload PySpark job to GCS

In [None]:
%%bash
export PROJECT_ID=dataproc-datalake
export BUCKET_NAME=${PROJECT_ID}-demo
gsutil cp job_csv_to_hive.py \
 gs://${BUCKET_NAME}/workflows/spark-bank-marketing/job_csv_to_hive.py

Copying file://job_csv_to_hive.py [Content-Type=text/x-python]...
/ [1 files][   24.0 B/   24.0 B]                                                
Operation completed over 1 objects/24.0 B.                                       


In [None]:
%%bash
export PROJECT_ID=dataproc-datalake
export BUCKET_NAME=${PROJECT_ID}-demo
gsutil cp job_xgboost_predictions.py \
 gs://${BUCKET_NAME}/workflows/spark-bank-marketing/job_xgboost_predictions.py

Copying file://job_xgboost_predictions.py [Content-Type=text/x-python]...
/ [1 files][   24.0 B/   24.0 B]                                                
Operation completed over 1 objects/24.0 B.                                       
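
The same upload can also be done from Python instead of `gsutil`. The sketch below assumes the `google-cloud-storage` client library is installed and that the notebook has credentials for the project; the helper names are hypothetical. The library is imported lazily so the path helper can be tried without it.

```python
def workflow_object_path(project_id, filename):
    """Return (bucket, object name) matching the gsutil destination used above."""
    bucket = f"{project_id}-demo"
    return bucket, f"workflows/spark-bank-marketing/{filename}"


def upload_job_file(project_id, filename):
    """Upload a local job file to GCS and return its gs:// URI."""
    # Imported here so workflow_object_path works without the library installed.
    from google.cloud import storage

    bucket_name, object_name = workflow_object_path(project_id, filename)
    client = storage.Client(project=project_id)
    client.bucket(bucket_name).blob(object_name).upload_from_filename(filename)
    return f"gs://{bucket_name}/{object_name}"


bucket, name = workflow_object_path("dataproc-datalake", "job_csv_to_hive.py")
print(f"gs://{bucket}/{name}")
```

Calling `upload_job_file("dataproc-datalake", "job_csv_to_hive.py")` would then perform the actual copy.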


### 5.6 Add jobs to workflow template

In [None]:
%%bash
export WORKFLOW_ID=bank-marketing-workflow
export REGION=us-central1
export PROJECT_ID=dataproc-datalake

gcloud dataproc workflow-templates add-job pyspark \
  gs://${PROJECT_ID}-demo/workflows/spark-bank-marketing/job_csv_to_hive.py \
    --region $REGION \
    --step-id csv_to_hive \
    --workflow-template $WORKFLOW_ID

createTime: '2020-09-08T16:56:47.304Z'
id: bank-marketing-workflow
jobs:
- pysparkJob:
    mainPythonFileUri: gs://dataproc-datalake-demo/workflows/spark-bank-marketing/job_csv_to_hive.py
  stepId: csv_to_hive
name: projects/dataproc-datalake/regions/us-central1/workflowTemplates/bank-marketing-workflow
placement:
  managedCluster:
    clusterName: spark-workflow-cluster
    config:
      configBucket: dataproc-datalake-demo
      gceClusterConfig:
        metadata:
          rapids-runtime: SPARK
        zoneUri: us-central1-f
      initializationActions:
      - executableFile: gs://goog-dataproc-initialization-actions-us-central1/rapids/rapids.sh
        executionTimeout: 600s
      masterConfig:
        machineTypeUri: n1-standard-4
      softwareConfig:
        imageVersion: preview-ubuntu
        optionalComponents:
        - ANACONDA
        - JUPYTER
      workerConfig:
        machineTypeUri: n1-standard-4
updateTime: '2020-09-08T17:09:28.406Z'
version: 5


In [None]:
%%bash
export WORKFLOW_ID=bank-marketing-workflow
export REGION=us-central1
export PROJECT_ID=dataproc-datalake

gcloud dataproc workflow-templates add-job pyspark \
  gs://${PROJECT_ID}-demo/workflows/spark-bank-marketing/job_xgboost_predictions.py \
    --region $REGION \
    --start-after=csv_to_hive \
    --step-id xgboost_predictions \
    --workflow-template $WORKFLOW_ID

createTime: '2020-09-08T16:56:47.304Z'
id: bank-marketing-workflow
jobs:
- pysparkJob:
    mainPythonFileUri: gs://dataproc-datalake-demo/workflows/spark-bank-marketing/job_csv_to_hive.py
  stepId: csv_to_hive
- prerequisiteStepIds:
  - csv_to_hive
  pysparkJob:
    mainPythonFileUri: gs://dataproc-datalake-demo/workflows/spark-bank-marketing/job_xgboost_predictions.py
  stepId: xgboost_predictions
name: projects/dataproc-datalake/regions/us-central1/workflowTemplates/bank-marketing-workflow
placement:
  managedCluster:
    clusterName: spark-workflow-cluster
    config:
      configBucket: dataproc-datalake-demo
      gceClusterConfig:
        metadata:
          rapids-runtime: SPARK
        zoneUri: us-central1-f
      initializationActions:
      - executableFile: gs://goog-dataproc-initialization-actions-us-central1/rapids/rapids.sh
        executionTimeout: 600s
      masterConfig:
        machineTypeUri: n1-standard-4
      softwareConfig:
        imageVersion: preview-ubuntu
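
The `--start-after=csv_to_hive` flag is what produces the `prerequisiteStepIds` entry in the template above. The sketch below is a simplified model of that ordering constraint, using the two steps from the template and Python's standard `graphlib` module (Python 3.9+). It is an illustration only, not a description of how Dataproc schedules jobs internally.

```python
from graphlib import TopologicalSorter

# The two steps as they appear in the template output above.
jobs = [
    {"stepId": "csv_to_hive"},
    {"stepId": "xgboost_predictions", "prerequisiteStepIds": ["csv_to_hive"]},
]


def execution_order(jobs):
    """Return step IDs in an order that respects prerequisiteStepIds."""
    # Map each step to the set of steps that must finish before it starts.
    graph = {job["stepId"]: set(job.get("prerequisiteStepIds", [])) for job in jobs}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(jobs)
print(order)  # ['csv_to_hive', 'xgboost_predictions']
```

Steps without prerequisites have no ordering constraint between them, so a template with several independent steps leaves them free to run in parallel.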
  

### 5.7 Run workflow template

In [None]:
%%bash
export WORKFLOW_ID=bank-marketing-workflow
export REGION=us-central1

gcloud dataproc workflow-templates instantiate $WORKFLOW_ID \
--region $REGION

Waiting on operation [projects/dataproc-datalake/regions/us-central1/operations/4072cbbe-c966-3708-b54d-4298354ab78a].
WorkflowTemplate [bank-marketing-workflow] PENDING
WorkflowTemplate [bank-marketing-workflow] RUNNING
Creating cluster: Operation ID [projects/dataproc-datalake/regions/us-central1/operations/db00b13c-d677-48ae-9f78-f2e29c693d08].
Created cluster: spark-workflow-cluster-zobjl64tmvq4s.
Job ID csv_to_hive-zobjl64tmvq4s RUNNING
Job ID csv_to_hive-zobjl64tmvq4s COMPLETED
Job ID xgboost_predictions-zobjl64tmvq4s COMPLETED
Deleting cluster: Operation ID [projects/dataproc-datalake/regions/us-central1/operations/0478b08a-944b-4d8a-99ba-1a013662ea2a].
WorkflowTemplate [bank-marketing-workflow] DONE
Deleted cluster: spark-workflow-cluster-zobjl64tmvq4s.
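
When the workflow is launched from a script, it can be handy to summarize the status lines shown above. The sketch below assumes the `Job ID <id> <STATE>` line format seen in this output stays the same, and keeps the last state reported for each job; the function name is hypothetical.

```python
import re

# Matches lines such as "Job ID csv_to_hive-zobjl64tmvq4s COMPLETED".
JOB_LINE = re.compile(r"^Job ID (?P<job_id>\S+) (?P<state>[A-Z]+)$")


def last_job_states(output):
    """Return a dict mapping each job ID to the last state printed for it."""
    states = {}
    for line in output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            states[match["job_id"]] = match["state"]
    return states


sample = """\
Job ID csv_to_hive-zobjl64tmvq4s RUNNING
Job ID csv_to_hive-zobjl64tmvq4s COMPLETED
Job ID xgboost_predictions-zobjl64tmvq4s COMPLETED
"""
states = last_job_states(sample)
print(states)
```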


### 5.8 View Cluster, workflow and jobs tabs

Go to the Dataproc UI and view the cluster page. You should see the new cluster spinning up.

Once the cluster is ready, view the workflow and jobs tabs to check the progress of the jobs.

### 5.9 Check new predictions table was created
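
One possible way to run this check from a notebook is to look the table up in the Spark catalog. The sketch below assumes a `SparkSession` connected to the same Hive metastore as the jobs. The helper name is hypothetical, and so is the table name in the usage comment, since this section does not state what the predictions table is called.

```python
def table_exists(spark, database, table):
    """Return True if `table` is listed in `database` in the session catalog."""
    return any(t.name == table for t in spark.catalog.listTables(database))


# Usage, with an existing Hive-enabled SparkSession named `spark`:
#   table_exists(spark, "bank_demo_db", "bank_marketing_predictions")
# "bank_marketing_predictions" is a hypothetical name; substitute the table
# that job 2 actually writes.
```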

### 5.10 Schedule workflows

View the guide on how to schedule Dataproc workflows:

https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions