## 4. Data Ops - Deploy Spark pipeline using Dataproc Workflows

### Dataproc Workflows

Dataproc Workflows has two types of workflow templates:

1. Managed cluster - Creates a new cluster to run the jobs and deletes it once the workflow has completed.
2. Cluster selector - Selects a pre-existing Dataproc cluster to run the jobs (does not delete the cluster).

This module will use option 1 to create a managed cluster workflow template.
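
The difference between the two template types comes down to the template's `placement` field. The sketch below builds both variants as plain Python dictionaries, using the snake_case field names of the Dataproc `WorkflowTemplatePlacement` message (`managed_cluster`, `cluster_selector`). The helper names are hypothetical and exist only for illustration; no API call is made.

```python
def managed_cluster_placement(cluster_name, config=None):
    """Placement for option 1: the workflow creates and later deletes the cluster."""
    return {"managed_cluster": {"cluster_name": cluster_name, "config": config or {}}}


def cluster_selector_placement(cluster_labels):
    """Placement for option 2: the workflow picks an existing cluster by its labels."""
    return {"cluster_selector": {"cluster_labels": dict(cluster_labels)}}


# This module uses option 1; the name matches the CLUSTER_NAME value set in 4.3.
placement = managed_cluster_placement("spark-workflow-cluster")
print(placement)
```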

### 4.1 Write PySpark jobs for the Dataproc workflow

* Job 1: Convert the CSV data to BQ tables
* Job 2: Run predictions with the trained model and persist the results

Resources:
* Learn more about Dataproc workflows [[here]](https://cloud.google.com/dataproc/docs/concepts/workflows/overview)
* Learn more about adding PySpark jobs to a workflow template [[here]](https://cloud.google.com/sdk/gcloud/reference/dataproc/workflow-templates/add-job/pyspark)

#### Set up BQ tables for persisting results from PySpark jobs

In [None]:
# Let's create bq tables to persist the results of the jobs
project_id = !gcloud config list --format 'value(core.project)' 2>/dev/null 
project_id = project_id[0]
# Table definition for both jobs 
job1_table_name = project_id + '-raw.transaction_data_workflow'
job2_table_name = project_id + '-annotated.transaction_data_workflow'
job1_table_name = job1_table_name.replace('-', '_')
job2_table_name = job2_table_name.replace('-', '_')
# Schema definition for tables
schema_inline = "type:string,amount:float,oldbalanceOrg:float,newbalanceOrig:float,isFraud:integer,transactionID:string,prediction:integer"


In [None]:
!bq mk --table \
{job1_table_name} \
{schema_inline}

In [None]:
!bq mk --table \
{job2_table_name} \
{schema_inline}
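
The two table names are derived from the project ID by appending a dataset suffix and replacing hyphens with underscores, since BigQuery dataset names cannot contain hyphens. The inline schema is a comma-separated list of `name:type` pairs. A minimal sketch of both conventions follows, using a hypothetical project ID `my-project` and hypothetical helper names:

```python
def workflow_table_name(project_id, dataset_suffix):
    """Build '<dataset>.<table>' the same way the setup cell does."""
    name = project_id + "-" + dataset_suffix + ".transaction_data_workflow"
    return name.replace("-", "_")


def parse_inline_schema(schema_inline):
    """Split a bq inline schema string into (column, type) pairs."""
    return [tuple(field.split(":", 1)) for field in schema_inline.split(",")]


schema = ("type:string,amount:float,oldbalanceOrg:float,newbalanceOrig:float,"
          "isFraud:integer,transactionID:string,prediction:integer")

print(workflow_table_name("my-project", "raw"))        # my_project_raw.transaction_data_workflow
print(workflow_table_name("my-project", "annotated"))  # my_project_annotated.transaction_data_workflow
print(parse_inline_schema(schema)[:2])                 # [('type', 'string'), ('amount', 'float')]
```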

#### Define Job 1 (Read CSV and persist in BQ)

**TODO**
* Provide project_id and train_data_path 

In [None]:
%%writefile job_csv_to_bq_table.py
## Job 1
print('Job 1')
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName('Automated Data Engineer Workflow') \
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0") \
.getOrCreate()

# variables
project_id = '<insert-code-here>'
gcs_bucket = project_id + '-data'
train_data_path = '<insert-code-here>'
table_name = project_id + '-raw.transaction_data_workflow'
table_name = table_name.replace('-', '_')

df_transaction_data_from_csv = spark \
  .read \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .csv(train_data_path)

df_transaction_data_from_csv.write \
.format("bigquery") \
.option("table", table_name) \
.option("temporaryGcsBucket", gcs_bucket) \
.mode('overwrite') \
.save()

In [None]:
spark.stop()

#### Define Job 2 (Run predictions on trained model and persist results)

##### **TODO** (Challenge 1)
* Provide the project_id and path_to_predict_csv variables
* Provide code to: 
    * load test data into a pyspark dataframe 
    * load previously trained ML model 
    * run prediction on test data 
    * persist relevant fields into BQ table

In [None]:
%%writefile job_ml_predictions.py
print('Job 2')
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

path_to_predict_csv = '<insert-code-here>'
project_id = '<insert-code-here>'
gcs_bucket = project_id + '-data'
model_path = 'gs://' + gcs_bucket + '/model/'
table_name = project_id + '-annotated.transaction_data_workflow'
table_name = table_name.replace('-', '_')

spark = SparkSession.builder \
.appName('Automated Data Scientist Workflow') \
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0") \
.getOrCreate()

# Load the test data into a spark dataframe 
df_test_data = <insert-code-here>
# Load the previously trained pipeline model 
loaded_pipeline_model = <insert-code-here>
# Run predictions on the test data
predictions = <insert-code-here>

# Save the following fields into BQ table: 'type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'isFraud', 'transactionID', 'prediction' 
# Use the bq table name defined previously {table_name}
<insert-code-here>

In [None]:
spark.stop()

### 4.2 Grant additional service account permission to deploy workflow from notebooks

Go to https://console.cloud.google.com/iam-admin/iam

Look for the Compute Engine default service account. It has the format:

```
{number}-compute@developer.gserviceaccount.com
```

Edit its roles, add the "Storage Object Admin" role, and click Save.

This permission is needed to deploy the workflow directly from the notebook. Alternatively, you can run the steps below from Cloud Shell.
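
The same role can also be granted from the command line with `gcloud projects add-iam-policy-binding`. The sketch below only assembles that command as an argument list so it can be reviewed before running. The project ID `my-project` and project number `123456789012` are hypothetical placeholders, and `roles/storage.objectAdmin` is the role ID that corresponds to "Storage Object Admin".

```python
import shlex


def grant_storage_admin_command(project_id, project_number):
    """Return the gcloud argv that grants Storage Object Admin to the
    Compute Engine default service account of the given project."""
    member = f"serviceAccount:{project_number}-compute@developer.gserviceaccount.com"
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member={member}",
        "--role=roles/storage.objectAdmin",
    ]


# Hypothetical values; substitute your own project ID and project number.
cmd = grant_storage_admin_command("my-project", "123456789012")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided the caller is allowed to change the IAM policy of the project.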

### 4.3 Create a Dataproc managed cluster workflow template

Learn more about creating workflow templates [[here]](https://cloud.google.com/dataproc/docs/concepts/workflows/using-workflows#creating_a_template)

**TODO**
* Provide values for the variables below

In [None]:
%env WORKFLOW_ID=automate-bank-transaction-workflow
%env REGION=<insert-code-here>
%env PROJECT_ID=<insert-code-here>
%env BUCKET_NAME=<insert-code-here>
%env CLUSTER_NAME=spark-workflow-cluster
%env NUM_WORKERS=2

In [None]:
%%bash
gcloud dataproc workflow-templates create $WORKFLOW_ID \
--region $REGION

### 4.4 Configure managed cluster for the workflow template

In [None]:
%%bash
gcloud beta dataproc workflow-templates set-managed-cluster $WORKFLOW_ID \
    --cluster-name $CLUSTER_NAME \
    --region $REGION \
    --image-version=1.5-ubuntu18 \
    --master-machine-type n1-standard-2 \
    --master-boot-disk-size=128GB \
    --num-workers $NUM_WORKERS \
    --worker-machine-type n1-standard-2 \
    --worker-boot-disk-size=128GB \
    --scopes https://www.googleapis.com/auth/cloud-platform \
    --bucket $BUCKET_NAME \
    --properties spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.18.0.jar
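
For readers who prefer to define the template in code, the sizing flags above map onto fields of the Dataproc `ClusterConfig` message. The dictionary below is a sketch of that mapping using the snake_case field names of the `google-cloud-dataproc` Python client; the names should be checked against the client version in use. It is plain Python and makes no API call. The function name and the bucket `my-bucket` are hypothetical, and the single master instance is an assumption based on the gcloud default.

```python
def managed_cluster_config(bucket_name, num_workers=2):
    """Mirror the set-managed-cluster sizing flags as a ClusterConfig-style dict."""
    node = {
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_size_gb": 128},
    }
    return {
        "config_bucket": bucket_name,
        "gce_cluster_config": {
            "service_account_scopes": ["https://www.googleapis.com/auth/cloud-platform"],
        },
        # Assumption: one master node, which is what gcloud creates by default.
        "master_config": {"num_instances": 1, **node},
        "worker_config": {"num_instances": num_workers, **node},
        "software_config": {"image_version": "1.5-ubuntu18"},
    }


config = managed_cluster_config("my-bucket")
print(config["worker_config"])
```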

### 4.5 Upload PySpark job to GCS

In [None]:
%%bash
gsutil cp job_csv_to_bq_table.py \
 gs://${PROJECT_ID}-data/workflows/python-scripts/job_csv_to_bq_table.py

In [None]:
%%bash
gsutil cp job_ml_predictions.py \
 gs://${PROJECT_ID}-data/workflows/python-scripts/job_ml_predictions.py

### 4.6 Add job to workflow template

In [None]:
%%bash
gcloud dataproc workflow-templates add-job pyspark \
   gs://${PROJECT_ID}-data/workflows/python-scripts/job_csv_to_bq_table.py \
    --region $REGION \
    --step-id csv_to_bq \
    --workflow-template $WORKFLOW_ID

In [None]:
%%bash
gcloud dataproc workflow-templates add-job pyspark \
  gs://${PROJECT_ID}-data/workflows/python-scripts/job_ml_predictions.py \
    --region $REGION \
    --start-after=csv_to_bq \
    --step-id predict \
    --workflow-template $WORKFLOW_ID
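
The `--start-after` flag turns the template into a directed acyclic graph: `predict` lists `csv_to_bq` as a prerequisite, so it starts only after that step has completed successfully. The sketch below illustrates the resulting execution order with Python's standard `graphlib` module (Python 3.9+). It illustrates the dependency idea only and is not a description of how Dataproc schedules jobs internally.

```python
from graphlib import TopologicalSorter

# Map each step ID to the set of step IDs it must wait for,
# mirroring the two add-job commands above.
prerequisites = {
    "csv_to_bq": set(),
    "predict": {"csv_to_bq"},
}

order = list(TopologicalSorter(prerequisites).static_order())
print(order)  # ['csv_to_bq', 'predict']
```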

### 4.7 Run workflow template

In [None]:
%%bash
gcloud dataproc workflow-templates instantiate $WORKFLOW_ID \
--region $REGION

### View Cluster, workflow and jobs tabs

Go to the Dataproc UI and view the cluster page. You should see the new cluster spinning up.

Once the cluster is ready, view the workflow and jobs tabs to check the progress of the jobs.

### 4.8 Schedule the workflows
**TODO** (Optional: Challenge 2)
* Cloud Composer is a managed Apache Airflow service you can use to create, schedule, monitor, and manage workflows.
* View the guide on how to schedule Dataproc workflows in Cloud Composer [[here]](https://cloud.google.com/dataproc/docs/tutorials/workflow-composer)