## Manage Dataproc Workflows using gcloud Commands
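Let us see how to manage Dataproc Workflow Templates using `gcloud` commands. The overall process has four steps:
* Step 1: Create the Dataproc Workflow Template
* Step 2: Point the template at a running Dataproc cluster (a template can also be configured to create a new cluster)
* Step 3: Add Spark SQL or PySpark jobs to the Dataproc Workflow Template, with dependencies between them
* Step 4: Run and validate the Dataproc Workflow Template

Each step maps to one or more `gcloud dataproc workflow-templates` commands. First, we set the default Dataproc region so that the later commands do not need a `--region` flag.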

In [1]:
!gcloud config set dataproc/region us-central1

Updated property [dataproc/region].




In [3]:
!gcloud dataproc workflow-templates list

ID                        JOBS  UPDATE_TIME                  VERSION
getting-started           4     2022-10-08T09:34:41.266501Z  1
wf-daily-product-revenue  4     2022-10-08T11:12:53.276748Z  6


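Here is the command to delete the Dataproc Workflow Template. The trailing backslash is a line continuation for Unix-like shells, so the multiline form does not work in the Windows Command Prompt; run it as a single line there. In the notebook cell, `--quiet` skips the confirmation prompt, and the `NOT_FOUND` error simply means that the template does not exist yet.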

```shell
gcloud dataproc workflow-templates \
    delete wf-daily-product-revenue-bq
```
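
Since the notebook cells use the single-line form of each command, it can help to convert the multiline form mechanically. Here is a minimal sketch, assuming simple commands without quoted arguments; `to_single_line` is a hypothetical helper and not part of `gcloud`.

```python
def to_single_line(command: str) -> str:
    """Collapse a backslash-continued shell command into one line.

    Note: this also collapses runs of whitespace inside quoted arguments,
    so it is only meant for simple commands like the ones in this section.
    """
    # Replace each backslash-newline pair, then normalize the remaining whitespace.
    joined = command.replace("\\\n", " ")
    return " ".join(joined.split())


# Raw string, so that the backslash is kept exactly as typed in the shell.
multiline = r"""gcloud dataproc workflow-templates \
    delete wf-daily-product-revenue-bq"""

print(to_single_line(multiline))
```

The resulting line can be pasted into the Windows Command Prompt or into a notebook cell after `!`.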

In [4]:
!gcloud dataproc workflow-templates delete wf-daily-product-revenue-bq --quiet

ERROR: (gcloud.dataproc.workflow-templates.delete) NOT_FOUND: Not found: Workflow Template projects/tidy-fort-361710/regions/us-central1/workflowTemplates/wf-daily-product-revenue-bq


Here is the command to create the Dataproc Workflow Template.

```shell
gcloud dataproc workflow-templates \
    create wf-daily-product-revenue-bq
```

In [5]:
!gcloud dataproc workflow-templates create wf-daily-product-revenue-bq

In [6]:
!gcloud dataproc workflow-templates list

ID                           JOBS  UPDATE_TIME                  VERSION
getting-started              4     2022-10-08T09:34:41.266501Z  1
wf-daily-product-revenue     4     2022-10-08T11:12:53.276748Z  6
wf-daily-product-revenue-bq  0     2022-10-14T22:13:30.153785Z  1
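
To validate the listing from code rather than by eye, the table can be parsed into dictionaries. This is a rough sketch that assumes the column layout shown above and that no value contains spaces; `parse_templates_table` is a hypothetical helper. For real scripts, the global `--format=json` flag of `gcloud` is likely a more robust starting point than parsing text.

```python
def parse_templates_table(text: str) -> list[dict[str, str]]:
    """Parse the whitespace-separated table printed by `workflow-templates list`."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Header names become lower-case keys: id, jobs, update_time, version.
    header = [name.lower() for name in lines[0].split()]
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Output copied from the listing above.
listing = """\
ID                           JOBS  UPDATE_TIME                  VERSION
getting-started              4     2022-10-08T09:34:41.266501Z  1
wf-daily-product-revenue     4     2022-10-08T11:12:53.276748Z  6
wf-daily-product-revenue-bq  0     2022-10-14T22:13:30.153785Z  1
"""

templates = {row["id"]: row for row in parse_templates_table(listing)}
new_template = templates["wf-daily-product-revenue-bq"]
print(new_template["jobs"], new_template["version"])
```

A freshly created template reports zero jobs and version 1, which matches what we expect right after `create`.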


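Here is the command to attach a running Dataproc cluster to the Dataproc Workflow Template. The cluster is selected by label rather than by name; here we use `goog-dataproc-cluster-name`, a label that Dataproc sets on each cluster automatically.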

```shell
gcloud dataproc workflow-templates \
    set-cluster-selector \
    wf-daily-product-revenue-bq \
    --cluster-labels goog-dataproc-cluster-name=aidataprocdev
```
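
Conceptually, a cluster selector picks a cluster whose labels include every label in the selector. The sketch below is a simplified model of that matching, not Dataproc's implementation; `matches_selector` is a hypothetical helper, and the second cluster with its labels is made up for illustration.

```python
def matches_selector(selector: dict[str, str], cluster_labels: dict[str, str]) -> bool:
    """Return True if the cluster carries every label named in the selector."""
    return all(cluster_labels.get(key) == value for key, value in selector.items())


# The selector configured by the command above.
selector = {"goog-dataproc-cluster-name": "aidataprocdev"}

# Hypothetical label sets; only `aidataprocdev` appears in this section.
clusters = {
    "aidataprocdev": {"goog-dataproc-cluster-name": "aidataprocdev"},
    "some-other-cluster": {
        "goog-dataproc-cluster-name": "some-other-cluster",
        "env": "dev",
    },
}

selected = [name for name, labels in clusters.items() if matches_selector(selector, labels)]
print(selected)
```

Selecting on the cluster name label therefore pins the workflow to exactly one cluster, while a custom label shared by several clusters would make more than one of them eligible.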

In [7]:
!gcloud dataproc workflow-templates set-cluster-selector wf-daily-product-revenue-bq --cluster-labels goog-dataproc-cluster-name=aidataprocdev

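Here are the commands to add Spark SQL jobs to the Dataproc Workflow Template. Each job gets a unique `--step-id`, and `--start-after` lists the step IDs that must complete before the job starts. The two converter jobs depend only on the cleanup job, so they can run in parallel.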

```shell
gcloud dataproc workflow-templates add-job spark-sql \
    --step-id=job-cleanup \
    --file=gs://airetail/scripts/daily_product_revenue/cleanup.sql \
    --workflow-template=wf-daily-product-revenue-bq

# File Format Converter jobs with dependency on cleanup
gcloud dataproc workflow-templates add-job spark-sql \
    --step-id=job-convert-orders \
    --file=gs://airetail/scripts/daily_product_revenue/file_format_converter.sql \
    --params=bucket_name=gs://airetail,table_name=orders \
    --workflow-template=wf-daily-product-revenue-bq \
    --start-after=job-cleanup

gcloud dataproc workflow-templates add-job spark-sql \
    --step-id=job-convert-order-items \
    --file=gs://airetail/scripts/daily_product_revenue/file_format_converter.sql \
    --params=bucket_name=gs://airetail,table_name=order_items \
    --workflow-template=wf-daily-product-revenue-bq \
    --start-after=job-cleanup

# Last Job which depends on convert orders and order_items jobs
gcloud dataproc workflow-templates add-job spark-sql \
    --step-id=job-daily-product-revenue \
    --file=gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql \
    --params=bucket_name=gs://airetail \
    --workflow-template=wf-daily-product-revenue-bq \
    --start-after=job-convert-orders,job-convert-order-items
```
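
The four commands above differ only in a few values, so they can also be generated from a declarative list of steps. This is a minimal sketch, not the approach used in this notebook: `Step` and `add_job_commands` are hypothetical names, and the rule that a step may only depend on steps listed before it is a choice made by this sketch. It only builds the argument lists; running them (for example with `subprocess.run`) is left out.

```python
from dataclasses import dataclass, field

TEMPLATE = "wf-daily-product-revenue-bq"
SCRIPTS = "gs://airetail/scripts/daily_product_revenue"


@dataclass
class Step:
    step_id: str
    script: str
    params: dict[str, str] = field(default_factory=dict)
    start_after: list[str] = field(default_factory=list)


def add_job_commands(template: str, steps: list[Step]) -> list[list[str]]:
    """Build one `add-job spark-sql` argument list per step, in order."""
    commands, seen = [], set()
    for step in steps:
        # This sketch only accepts dependencies on steps listed earlier.
        missing = [dep for dep in step.start_after if dep not in seen]
        if missing:
            raise ValueError(f"{step.step_id} depends on unknown steps: {missing}")
        cmd = [
            "gcloud", "dataproc", "workflow-templates", "add-job", "spark-sql",
            f"--step-id={step.step_id}",
            f"--file={SCRIPTS}/{step.script}",
        ]
        if step.params:
            pairs = ",".join(f"{key}={value}" for key, value in step.params.items())
            cmd.append(f"--params={pairs}")
        cmd.append(f"--workflow-template={template}")
        if step.start_after:
            cmd.append(f"--start-after={','.join(step.start_after)}")
        commands.append(cmd)
        seen.add(step.step_id)
    return commands


bucket = {"bucket_name": "gs://airetail"}
steps = [
    Step("job-cleanup", "cleanup.sql"),
    Step("job-convert-orders", "file_format_converter.sql",
         {**bucket, "table_name": "orders"}, ["job-cleanup"]),
    Step("job-convert-order-items", "file_format_converter.sql",
         {**bucket, "table_name": "order_items"}, ["job-cleanup"]),
    Step("job-daily-product-revenue", "compute_daily_product_revenue.sql",
         bucket, ["job-convert-orders", "job-convert-order-items"]),
]

commands = add_job_commands(TEMPLATE, steps)
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the steps in one list makes it easier to spot a misspelled step ID before any command reaches Dataproc.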

In [8]:
!gcloud dataproc workflow-templates add-job spark-sql --step-id=job-cleanup --file=gs://airetail/scripts/daily_product_revenue/cleanup.sql --workflow-template=wf-daily-product-revenue-bq

createTime: '2022-10-14T22:13:30.153785Z'
id: wf-daily-product-revenue-bq
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
name: projects/tidy-fort-361710/regions/us-central1/workflowTemplates/wf-daily-product-revenue-bq
placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-name: aidataprocdev
updateTime: '2022-10-14T22:13:44.975510Z'
version: 3


In [9]:
!gcloud dataproc workflow-templates add-job spark-sql --step-id=job-convert-orders --file=gs://airetail/scripts/daily_product_revenue/file_format_converter.sql --params=bucket_name=gs://airetail,table_name=orders --workflow-template=wf-daily-product-revenue-bq --start-after=job-cleanup

createTime: '2022-10-14T22:13:30.153785Z'
id: wf-daily-product-revenue-bq
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
name: projects/tidy-fort-361710/regions/us-central1/workflowTemplates/wf-daily-product-revenue-bq
placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-name: aidataprocdev
updateTime: '2022-10-14T22:13:48.649949Z'
version: 4


In [10]:
!gcloud dataproc workflow-templates add-job spark-sql --step-id=job-convert-order-items --file=gs://airetail/scripts/daily_product_revenue/file_format_converter.sql --params=bucket_name=gs://airetail,table_name=order_items --workflow-template=wf-daily-product-revenue-bq --start-after=job-cleanup

createTime: '2022-10-14T22:13:30.153785Z'
id: wf-daily-product-revenue-bq
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: order_items
  stepId: job-convert-order-items
name: projects/tidy-fort-361710/regions/us-central1/workflowTemplates/wf-daily-product-revenue-bq
placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-name: aidataprocdev
updateTime: '2022-10-14T22:13:50.858264Z'
version: 5


In [11]:
!gcloud dataproc workflow-templates add-job spark-sql --step-id=job-daily-product-revenue --file=gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql --params=bucket_name=gs://airetail --workflow-template=wf-daily-product-revenue-bq --start-after=job-convert-orders,job-convert-order-items

createTime: '2022-10-14T22:13:30.153785Z'
id: wf-daily-product-revenue-bq
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: order_items
  stepId: job-convert-order-items
- prerequisiteStepIds:
  - job-convert-orders
  - job-convert-order-items
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql
    scriptVariables:
      bucket_name: gs://airetail
  stepId: job-daily-product-revenue
name: projects/tidy-fort

In [12]:
!gcloud dataproc workflow-templates list

ID                           JOBS  UPDATE_TIME                  VERSION
getting-started              4     2022-10-08T09:34:41.266501Z  1
wf-daily-product-revenue     4     2022-10-08T11:12:53.276748Z  6
wf-daily-product-revenue-bq  4     2022-10-14T22:13:53.061433Z  6


In [13]:
!gcloud dataproc workflow-templates describe wf-daily-product-revenue-bq

createTime: '2022-10-14T22:13:30.153785Z'
id: wf-daily-product-revenue-bq
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: order_items
  stepId: job-convert-order-items
- prerequisiteStepIds:
  - job-convert-orders
  - job-convert-order-items
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql
    scriptVariables:
      bucket_name: gs://airetail
  stepId: job-daily-product-revenue
name: projects/tidy-fort
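
The `prerequisiteStepIds` entries in this output define a dependency graph, and we can check the order it implies before running anything. Below is a small sketch using `graphlib` from the Python standard library (Python 3.9+); the dictionary is transcribed by hand from the output above, and `execution_waves` is a hypothetical helper. It only models the ordering constraints, not how Dataproc actually schedules the jobs.

```python
from graphlib import TopologicalSorter

# stepId -> prerequisiteStepIds, copied from the describe output above.
prerequisites = {
    "job-cleanup": [],
    "job-convert-orders": ["job-cleanup"],
    "job-convert-order-items": ["job-cleanup"],
    "job-daily-product-revenue": ["job-convert-orders", "job-convert-order-items"],
}


def execution_waves(graph: dict[str, list[str]]) -> list[list[str]]:
    """Group steps into waves; steps in the same wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(prerequisites), start=1):
    print(number, wave)
```

The result has three waves: the cleanup job, then the two converter jobs together, then the revenue job.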

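Here is the command to instantiate (run) the Dataproc Workflow Template. Because this template targets an existing cluster through a cluster selector, the cluster should be running before the workflow is instantiated, so the cells below start the cluster before instantiating the template.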

```shell
gcloud dataproc workflow-templates \
    instantiate wf-daily-product-revenue-bq
```

In [26]:
!gcloud dataproc clusters start aidataprocdev

Waiting on operation [projects/tidy-fort-361710/regions/us-central1/operations/df7d7067-161b-3e09-8f21-33fc96fdef6a].
Waiting for cluster 'aidataprocdev' to start....done.                          
done: true
metadata:
  '@type': type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata
  clusterName: aidataprocdev
  clusterUuid: cefeefcb-63d4-4603-a1dc-6bfb3c0fbe6d
  description: Start cluster
  operationType: START
  status:
    innerState: DONE
    state: DONE
    stateStartTime: '2022-10-15T01:30:05.793757Z'
  statusHistory:
  - state: PENDING
    stateStartTime: '2022-10-15T01:28:18.862247Z'
  - state: RUNNING
    stateStartTime: '2022-10-15T01:28:18.929951Z'
name: projects/tidy-fort-361710/regions/us-central1/operations/df7d7067-161b-3e09-8f21-33fc96fdef6a
response:
  '@type': type.googleapis.com/google.cloud.dataproc.v1.Cluster
  clusterName: aidataprocdev
  clusterUuid: cefeefcb-63d4-4603-a1dc-6bfb3c0fbe6d
  config:
    configBucket: dataproc-staging-us-central1-640

When the cluster is not needed, it can be stopped with the following command. Note that this cell was run before the start command above (compare the cell numbers and the timestamps in the output), so the cluster is running when the workflow is instantiated.

In [25]:
!gcloud dataproc clusters stop aidataprocdev

Waiting on operation [projects/tidy-fort-361710/regions/us-central1/operations/82ff03f5-2d29-3cc1-8f68-2eed70f8c329].
Waiting for cluster 'aidataprocdev' to stop....done.                           
done: true
metadata:
  '@type': type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata
  clusterName: aidataprocdev
  clusterUuid: cefeefcb-63d4-4603-a1dc-6bfb3c0fbe6d
  description: Stop cluster
  operationType: STOP
  status:
    innerState: DONE
    state: DONE
    stateStartTime: '2022-10-14T22:19:16.839464Z'
  statusHistory:
  - state: PENDING
    stateStartTime: '2022-10-14T22:18:55.721271Z'
  - state: RUNNING
    stateStartTime: '2022-10-14T22:18:55.758764Z'
name: projects/tidy-fort-361710/regions/us-central1/operations/82ff03f5-2d29-3cc1-8f68-2eed70f8c329
response:
  '@type': type.googleapis.com/google.cloud.dataproc.v1.Cluster
  clusterName: aidataprocdev
  clusterUuid: cefeefcb-63d4-4603-a1dc-6bfb3c0fbe6d
  config:
    configBucket: dataproc-staging-us-central1-64041

In [27]:
# This will take some time to run

!gcloud dataproc workflow-templates instantiate wf-daily-product-revenue-bq

Waiting on operation [projects/tidy-fort-361710/regions/us-central1/operations/88a59259-e767-36c9-bce2-7fe8e14c7e42].
WorkflowTemplate [wf-daily-product-revenue-bq] RUNNING
Job ID job-cleanup-pw3itggd2hmz6 RUNNING
Job ID job-cleanup-pw3itggd2hmz6 COMPLETED
Job ID job-convert-orders-pw3itggd2hmz6 RUNNING
Job ID job-convert-order-items-pw3itggd2hmz6 RUNNING
Job ID job-convert-orders-pw3itggd2hmz6 COMPLETED
Job ID job-convert-order-items-pw3itggd2hmz6 COMPLETED
Job ID job-daily-product-revenue-pw3itggd2hmz6 RUNNING
WorkflowTemplate [wf-daily-product-revenue-bq] DONE
Job ID job-daily-product-revenue-pw3itggd2hmz6 COMPLETED
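
To validate a run from code, we can reduce this log to the last reported state of each job. This is a minimal sketch that assumes the `Job ID <id> <STATE>` line shape seen in the output above, which may not be a stable interface; `final_job_states` is a hypothetical helper.

```python
import re

# Matches lines such as "Job ID job-cleanup-pw3itggd2hmz6 COMPLETED".
JOB_LINE = re.compile(r"^Job ID (?P<job_id>\S+) (?P<state>[A-Z_]+)$")


def final_job_states(output: str) -> dict[str, str]:
    """Return the last state reported for each job ID in the log."""
    states = {}
    for line in output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            # Later lines overwrite earlier ones, so the final state wins.
            states[match["job_id"]] = match["state"]
    return states


# Output copied from the run above.
log = """\
WorkflowTemplate [wf-daily-product-revenue-bq] RUNNING
Job ID job-cleanup-pw3itggd2hmz6 RUNNING
Job ID job-cleanup-pw3itggd2hmz6 COMPLETED
Job ID job-convert-orders-pw3itggd2hmz6 RUNNING
Job ID job-convert-order-items-pw3itggd2hmz6 RUNNING
Job ID job-convert-orders-pw3itggd2hmz6 COMPLETED
Job ID job-convert-order-items-pw3itggd2hmz6 COMPLETED
Job ID job-daily-product-revenue-pw3itggd2hmz6 RUNNING
WorkflowTemplate [wf-daily-product-revenue-bq] DONE
Job ID job-daily-product-revenue-pw3itggd2hmz6 COMPLETED
"""

states = final_job_states(log)
all_completed = all(state == "COMPLETED" for state in states.values())
print(len(states), all_completed)
```

For this run, all four jobs end in the `COMPLETED` state, and the two converter jobs were running at the same time, as the dependencies allow.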
