## Manage Dataproc Workflows using gcloud Commands
Let us see how we can manage Dataproc Workflow Templates using `gcloud` commands. The overall flow has four steps.
* Step 1: Create a Dataproc Workflow Template
* Step 2: Attach an active Dataproc cluster to the template (we can also configure a new cluster)
* Step 3: Add Spark SQL or PySpark jobs to the Dataproc Workflow Template with dependencies
* Step 4: Run and validate the Dataproc Workflow Template

Before running the commands, set the default Dataproc region so that `--region` does not have to be passed with every command.

In [1]:
!gcloud config set dataproc/region us-central1

Updated property [dataproc/region].


In [2]:
!gcloud dataproc workflow-templates

ERROR: (gcloud.dataproc.workflow-templates) Command name argument expected.

Available groups for gcloud dataproc workflow-templates:

      add-job                 Add Google Cloud Dataproc jobs to workflow
                              template.

Available commands for gcloud dataproc workflow-templates:

      create                  Create a workflow template.
      delete                  Delete a workflow template.
      describe                Describe a workflow template.
      export                  Export a workflow template.
      get-iam-policy          Get IAM policy for a workflow template.
      import                  Import a workflow template.
      instantiate             Instantiate a workflow template.
      instantiate-from-file   Instantiate a workflow template from a file.
      list                    List workflow templates.
      remove-dag-timeout      Remove DAG timeout from a workflow template.
      remove-job              

In [3]:
!gcloud dataproc workflow-templates list

ID               JOBS  UPDATE_TIME                  VERSION
getting-started  4     2022-10-08T09:34:41.266501Z  1


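Here is the command to delete a Dataproc Workflow Template. The backslash line continuation works in Bash-style shells but not in the Windows Command Prompt, so on Windows run the command on a single line. In the notebook cell, `--quiet` skips the confirmation prompt. A `NOT_FOUND` error only means that the template does not exist yet.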

```shell
gcloud dataproc workflow-templates \
    delete wf-daily-product-revenue
```

In [5]:
!gcloud dataproc workflow-templates delete wf-daily-product-revenue --quiet

ERROR: (gcloud.dataproc.workflow-templates.delete) NOT_FOUND: Not found: Workflow Template projects/tidy-fort-361710/regions/us-central1/workflowTemplates/wf-daily-product-revenue


Here is the command to create a Dataproc Workflow Template.

```shell
gcloud dataproc workflow-templates \
    create wf-daily-product-revenue
```

In [6]:
!gcloud dataproc workflow-templates create

ERROR: (gcloud.dataproc.workflow-templates.create) argument (TEMPLATE : --region=REGION): Must be specified.
Usage: gcloud dataproc workflow-templates create (TEMPLATE : --region=REGION) [optional flags]
  optional flags may be  --dag-timeout | --help | --labels | --region

For detailed information on this command and its flags, run:
  gcloud dataproc workflow-templates create --help


In [7]:
!gcloud dataproc workflow-templates create wf-daily-product-revenue

In [8]:
!gcloud dataproc workflow-templates list

ID                        JOBS  UPDATE_TIME                  VERSION
getting-started           4     2022-10-08T09:34:41.266501Z  1
wf-daily-product-revenue  0     2022-10-08T10:02:06.406701Z  1


In [10]:
!gcloud dataproc workflow-templates set-cluster-selector

ERROR: (gcloud.dataproc.workflow-templates.set-cluster-selector) argument (TEMPLATE : --region=REGION): Must be specified.
Usage: gcloud dataproc workflow-templates set-cluster-selector (TEMPLATE : --region=REGION) [optional flags]
  optional flags may be  --cluster-labels | --help | --region

For detailed information on this command and its flags, run:
  gcloud dataproc workflow-templates set-cluster-selector --help


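Here is the command to attach a running Dataproc cluster to the Dataproc Workflow Template. We need to specify a label that identifies the cluster. Here we use `goog-dataproc-cluster-name`, the label that Dataproc sets to the name of the cluster.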

```shell
gcloud dataproc workflow-templates \
    set-cluster-selector \
    wf-daily-product-revenue \
    --cluster-labels goog-dataproc-cluster-name=aidataprocdev
```

In [12]:
!gcloud dataproc workflow-templates set-cluster-selector wf-daily-product-revenue --cluster-labels goog-dataproc-cluster-name=aidataprocdev
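
The selector works by label matching: a cluster qualifies when it carries every label listed in the selector. The following is a minimal Python sketch of that rule, not Dataproc's actual implementation. The helper `selector_matches`, the second cluster, and the `env` labels are hypothetical and exist only for illustration.

```python
def selector_matches(selector, cluster_labels):
    """Return True when the cluster has every selector label with the same value."""
    return all(cluster_labels.get(key) == value for key, value in selector.items())


# The selector we attached to wf-daily-product-revenue.
selector = {"goog-dataproc-cluster-name": "aidataprocdev"}

# Hypothetical clusters and labels, for illustration only.
clusters = {
    "aidataprocdev": {"goog-dataproc-cluster-name": "aidataprocdev", "env": "dev"},
    "aidataprocprod": {"goog-dataproc-cluster-name": "aidataprocprod", "env": "prod"},
}

matching = [name for name, labels in clusters.items() if selector_matches(selector, labels)]
print(matching)
```

Extra labels on the cluster (such as `env` here) do not prevent a match; only the labels named in the selector are compared.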

In [16]:
!gcloud dataproc workflow-templates add-job

ERROR: (gcloud.dataproc.workflow-templates.add-job) Command name argument expected.

Available commands for gcloud dataproc workflow-templates add-job:

      hadoop                  Add a hadoop job to the workflow template.
      hive                    Add a Hive job to the workflow template.
      pig                     Add a Pig job to the workflow template.
      presto                  Add a Presto job to the workflow template.
      pyspark                 Add a PySpark job to the workflow template.
      spark                   Add a Spark job to the workflow template.
      spark-r                 Add a SparkR job to the workflow template.
      spark-sql               Add a SparkSql job to the workflow template.

For detailed information on this command and its flags, run:
  gcloud dataproc workflow-templates add-job --help


In [17]:
!gcloud dataproc workflow-templates add-job spark-sql

ERROR: (gcloud.dataproc.workflow-templates.add-job.spark-sql) Exactly one of (--execute | --file) must be specified.
Usage: gcloud dataproc workflow-templates add-job spark-sql --step-id=STEP_ID (--execute=QUERY, -e QUERY | --file=FILE, -f FILE) (--workflow-template=WORKFLOW_TEMPLATE : --region=REGION) [optional flags]
  optional flags may be  --driver-log-levels | --execute | --file | --help |
                         --jars | --labels | --params | --properties |
                         --region | --start-after

For detailed information on this command and its flags, run:
  gcloud dataproc workflow-templates add-job spark-sql --help


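The command `gcloud dataproc workflow-templates add-job` is similar to `gcloud dataproc jobs submit`. The main differences are that `add-job` takes `--step-id` and `--workflow-template` instead of `--cluster`, and that it only registers the job in the template rather than running it. For reference, here are examples of submitting the same kind of jobs using `gcloud dataproc jobs submit`.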

```shell
# Without parameters
gcloud dataproc jobs submit \
    spark-sql --cluster=aidataprocdev \
    -f gs://airetail/scripts/daily_product_revenue/cleanup.sql

# With parameters
gcloud dataproc jobs submit \
    spark-sql --cluster=aidataprocdev \
    -f gs://airetail/scripts/daily_product_revenue/file_format_converter.sql \
    --params=bucket_name=gs://airetail,table_name=orders
```
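
To make the mapping between the two commands concrete, here is a small Python sketch that assembles an `add-job spark-sql` command line from the same pieces a `jobs submit` call uses. The helper `add_job_command` is hypothetical (it is not part of `gcloud` or any library); the flags it emits are the ones listed in the usage message above. It only builds the argument list and does not call `gcloud`.

```python
import shlex


def add_job_command(step_id, template, sql_file, params=None, start_after=()):
    """Build the argv for `gcloud dataproc workflow-templates add-job spark-sql`."""
    cmd = [
        "gcloud", "dataproc", "workflow-templates", "add-job", "spark-sql",
        f"--step-id={step_id}",
        f"--file={sql_file}",
        f"--workflow-template={template}",
    ]
    if params:
        # --params takes comma-separated key=value pairs.
        cmd.append("--params=" + ",".join(f"{key}={value}" for key, value in params.items()))
    if start_after:
        # --start-after takes comma-separated step ids this job must wait for.
        cmd.append("--start-after=" + ",".join(start_after))
    return cmd


cmd = add_job_command(
    step_id="job-convert-orders",
    template="wf-daily-product-revenue",
    sql_file="gs://airetail/scripts/daily_product_revenue/file_format_converter.sql",
    params={"bucket_name": "gs://airetail", "table_name": "orders"},
    start_after=["job-cleanup"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `gcloud` is installed and authenticated.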


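Here are the commands to add Spark SQL jobs to the Dataproc Workflow Template. Each job gets a unique `--step-id`, and `--start-after` lists the step IDs that must complete before the job starts.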

```shell
gcloud dataproc workflow-templates add-job spark-sql \
    --step-id=job-cleanup \
    --file=gs://airetail/scripts/daily_product_revenue/cleanup.sql \
    --workflow-template=wf-daily-product-revenue

# File Format Converter jobs with dependency on cleanup
gcloud dataproc workflow-templates add-job spark-sql \
    --step-id=job-convert-orders \
    --file=gs://airetail/scripts/daily_product_revenue/file_format_converter.sql \
    --params=bucket_name=gs://airetail,table_name=orders \
    --workflow-template=wf-daily-product-revenue \
    --start-after=job-cleanup

gcloud dataproc workflow-templates add-job spark-sql \
    --step-id=job-convert-order-items \
    --file=gs://airetail/scripts/daily_product_revenue/file_format_converter.sql \
    --params=bucket_name=gs://airetail,table_name=order_items \
    --workflow-template=wf-daily-product-revenue \
    --start-after=job-cleanup

# Last Job which depends on convert orders and order_items jobs
gcloud dataproc workflow-templates add-job spark-sql \
    --step-id=job-daily-product-revenue \
    --file=gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql \
    --params=bucket_name=gs://airetail \
    --workflow-template=wf-daily-product-revenue \
    --start-after=job-convert-orders,job-convert-order-items
```
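
The `--start-after` flags above define a small dependency graph. As a rough illustration of the order this implies, the Python sketch below groups the steps into waves using the standard library's `graphlib` (Python 3.9+). It models only the dependencies, not how Dataproc schedules jobs internally; the function name `execution_waves` is made up for this example.

```python
from graphlib import TopologicalSorter

# step id -> set of step ids it must wait for (mirrors the --start-after flags)
prerequisites = {
    "job-cleanup": set(),
    "job-convert-orders": {"job-cleanup"},
    "job-convert-order-items": {"job-cleanup"},
    "job-daily-product-revenue": {"job-convert-orders", "job-convert-order-items"},
}


def execution_waves(graph):
    """Group steps into waves; steps within a wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose prerequisites are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(prerequisites)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The cleanup step forms the first wave, the two converter steps can run side by side in the second, and the revenue computation comes last.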

In [18]:
!gcloud dataproc workflow-templates add-job spark-sql --step-id=job-cleanup --file=gs://airetail/scripts/daily_product_revenue/cleanup.sql --workflow-template=wf-daily-product-revenue

createTime: '2022-10-08T10:02:06.406701Z'
id: wf-daily-product-revenue
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
name: projects/tidy-fort-361710/regions/us-central1/workflowTemplates/wf-daily-product-revenue
placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-name: aidataprocdev
updateTime: '2022-10-08T11:06:19.407439Z'
version: 3


In [19]:

!gcloud dataproc workflow-templates add-job spark-sql --step-id=job-convert-orders --file=gs://airetail/scripts/daily_product_revenue/file_format_converter.sql --params=bucket_name=gs://airetail,table_name=orders --workflow-template=wf-daily-product-revenue --start-after=job-cleanup

createTime: '2022-10-08T10:02:06.406701Z'
id: wf-daily-product-revenue
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
name: projects/tidy-fort-361710/regions/us-central1/workflowTemplates/wf-daily-product-revenue
placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-name: aidataprocdev
updateTime: '2022-10-08T11:08:00.198971Z'
version: 4


In [20]:
!gcloud dataproc workflow-templates add-job spark-sql --step-id=job-convert-order-items --file=gs://airetail/scripts/daily_product_revenue/file_format_converter.sql --params=bucket_name=gs://airetail,table_name=order_items --workflow-template=wf-daily-product-revenue --start-after=job-cleanup

createTime: '2022-10-08T10:02:06.406701Z'
id: wf-daily-product-revenue
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: order_items
  stepId: job-convert-order-items
name: projects/tidy-fort-361710/regions/us-central1/workflowTemplates/wf-daily-product-revenue
placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-name: aidataprocdev
updateTime: '2022-10-08T11:10:09.720005Z'
version: 5


In [21]:
!gcloud dataproc workflow-templates add-job spark-sql --step-id=job-daily-product-revenue --file=gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql --params=bucket_name=gs://airetail --workflow-template=wf-daily-product-revenue --start-after=job-convert-orders,job-convert-order-items

createTime: '2022-10-08T10:02:06.406701Z'
id: wf-daily-product-revenue
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: order_items
  stepId: job-convert-order-items
- prerequisiteStepIds:
  - job-convert-orders
  - job-convert-order-items
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql
    scriptVariables:
      bucket_name: gs://airetail
  stepId: job-daily-product-revenue
name: projects/tidy-fort-36

In [22]:
!gcloud dataproc workflow-templates list

ID                        JOBS  UPDATE_TIME                  VERSION
getting-started           4     2022-10-08T09:34:41.266501Z  1
wf-daily-product-revenue  4     2022-10-08T11:12:53.276748Z  6


In [26]:
!gcloud dataproc workflow-templates describe wf-daily-product-revenue

createTime: '2022-10-08T10:02:06.406701Z'
id: wf-daily-product-revenue
jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: order_items
  stepId: job-convert-order-items
- prerequisiteStepIds:
  - job-convert-orders
  - job-convert-order-items
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql
    scriptVariables:
      bucket_name: gs://airetail
  stepId: job-daily-product-revenue
name: projects/tidy-fort-36

Here is the command to instantiate (run) the Dataproc Workflow Template.

```shell
gcloud dataproc workflow-templates \
    instantiate wf-daily-product-revenue
```

In [36]:
!gcloud dataproc workflow-templates instantiate

ERROR: (gcloud.dataproc.workflow-templates.instantiate) argument (TEMPLATE : --region=REGION): Must be specified.
Usage: gcloud dataproc workflow-templates instantiate (TEMPLATE : --region=REGION) [optional flags]
  optional flags may be  --async | --help | --parameters | --region

For detailed information on this command and its flags, run:
  gcloud dataproc workflow-templates instantiate --help


In [37]:
!gcloud dataproc workflow-templates instantiate-from-file

ERROR: (gcloud.dataproc.workflow-templates.instantiate-from-file) argument --file: Must be specified.
Usage: gcloud dataproc workflow-templates instantiate-from-file --file=FILE [optional flags]
  optional flags may be  --async | --help | --region

For detailed information on this command and its flags, run:
  gcloud dataproc workflow-templates instantiate-from-file --help


In [39]:
!gcloud dataproc workflow-templates instantiate-from-file --help

NAME
    gcloud dataproc workflow-templates instantiate-from-file - instantiate a
        workflow template from a file

SYNOPSIS
    gcloud dataproc workflow-templates instantiate-from-file --file=FILE
        [--async] [--region=REGION] [GCLOUD_WIDE_FLAG ...]

DESCRIPTION
    Instantiate a workflow template from a file.

REQUIRED FLAGS
     --file=FILE
        The YAML file containing the workflow template to run

OPTIONAL FLAGS
     --async
        Return immediately, without waiting for the operation in progress to
        complete.

     --region=REGION
        Cloud Dataproc region to use. Each Cloud Dataproc region constitutes an
        independent resource namespace constrained to deploying instances into
        Compute Engine zones inside the region. Overrides the default
        dataproc/region property value for this command invocatio

In [40]:
!gcloud dataproc workflow-templates export

ERROR: (gcloud.dataproc.workflow-templates.export) argument (TEMPLATE : --region=REGION): Must be specified.
Usage: gcloud dataproc workflow-templates export (TEMPLATE : --region=REGION) [optional flags]
  optional flags may be  --destination | --help | --region | --version

For detailed information on this command and its flags, run:
  gcloud dataproc workflow-templates export --help


In [41]:
!gcloud dataproc workflow-templates export wf-daily-product-revenue

jobs:
- sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/cleanup.sql
  stepId: job-cleanup
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: orders
  stepId: job-convert-orders
- prerequisiteStepIds:
  - job-cleanup
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/file_format_converter.sql
    scriptVariables:
      bucket_name: gs://airetail
      table_name: order_items
  stepId: job-convert-order-items
- prerequisiteStepIds:
  - job-convert-orders
  - job-convert-order-items
  sparkSqlJob:
    queryFileUri: gs://airetail/scripts/daily_product_revenue/compute_daily_product_revenue.sql
    scriptVariables:
      bucket_name: gs://airetail
  stepId: job-daily-product-revenue
placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-name: aidataprocdev


In [42]:
# This will take some time to run

!gcloud dataproc workflow-templates instantiate wf-daily-product-revenue

Waiting on operation [projects/tidy-fort-361710/regions/us-central1/operations/a3890c84-6a31-3f71-af54-cb96c515f826].
WorkflowTemplate [wf-daily-product-revenue] RUNNING
Job ID job-cleanup-vfn5wkdz3ntna RUNNING
Job ID job-cleanup-vfn5wkdz3ntna COMPLETED
Job ID job-convert-orders-vfn5wkdz3ntna RUNNING
Job ID job-convert-order-items-vfn5wkdz3ntna RUNNING
Job ID job-convert-orders-vfn5wkdz3ntna COMPLETED
Job ID job-convert-order-items-vfn5wkdz3ntna COMPLETED
Job ID job-daily-product-revenue-vfn5wkdz3ntna RUNNING
WorkflowTemplate [wf-daily-product-revenue] DONE
Job ID job-daily-product-revenue-vfn5wkdz3ntna COMPLETED
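
To validate a run like the one above without reading the log by eye, one option is to parse the `Job ID ... STATE` lines and check the last state reported for each job. The Python sketch below does that on the output copied from the cell above. The regular expression and the helper `final_job_states` are assumptions based only on the log format shown here, not on any documented `gcloud` output contract.

```python
import re

# Output copied from the instantiate cell above.
log = """\
WorkflowTemplate [wf-daily-product-revenue] RUNNING
Job ID job-cleanup-vfn5wkdz3ntna RUNNING
Job ID job-cleanup-vfn5wkdz3ntna COMPLETED
Job ID job-convert-orders-vfn5wkdz3ntna RUNNING
Job ID job-convert-order-items-vfn5wkdz3ntna RUNNING
Job ID job-convert-orders-vfn5wkdz3ntna COMPLETED
Job ID job-convert-order-items-vfn5wkdz3ntna COMPLETED
Job ID job-daily-product-revenue-vfn5wkdz3ntna RUNNING
WorkflowTemplate [wf-daily-product-revenue] DONE
Job ID job-daily-product-revenue-vfn5wkdz3ntna COMPLETED
"""

JOB_LINE = re.compile(r"^Job ID (?P<job>\S+) (?P<state>[A-Z_]+)$")


def final_job_states(text):
    """Map each job id to the last state reported for it in the log."""
    states = {}
    for line in text.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            states[match["job"]] = match["state"]  # later lines overwrite earlier ones
    return states


states = final_job_states(log)
not_completed = [job for job, state in states.items() if state != "COMPLETED"]
print(states)
print("Jobs not completed:", not_completed)
```

An empty `not_completed` list indicates that every job in the workflow reached `COMPLETED`.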
