Run a CWL workflow
This example demonstrates running a multi-stage workflow on Google Cloud Platform
- The workflow is launched with a bash script, cwl_runner.sh, that calls the gcloud command-line tool that is included in the Google Cloud SDK
- The workflow is defined using the Common Workflow Language (CWL)
- The workflow stages are orchestrated using cwltool or rabix.
To run a CWL workflow, the cwl_runner.sh script will:
- Create a disk
- Create a Compute Engine VM with that disk
- Run a startup script on the VM
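Sketched as gcloud calls, those three steps look roughly like this. This is illustrative only: cwl_runner.sh issues the real commands, and the zone, disk size, and resource names here are assumptions; the run wrapper echoes each command instead of executing it, so the sketch is safe to run anywhere.

```shell
#!/bin/bash
# Illustrative values; cwl_runner.sh derives the real names from the operation ID.
OPERATION_ID="example123"
ZONE="us-central1-f"
DISK="cwl-disk-${OPERATION_ID}"
VM="cwl-vm-${OPERATION_ID}"

# Echo each command instead of executing it, so the sketch has no side effects.
run() { echo "+ $*"; }

# 1. Create a disk for workflow inputs, outputs, and Docker images
run gcloud compute disks create "${DISK}" --zone "${ZONE}" --size "200GB"

# 2. Create a VM with the disk attached and the startup script set as metadata
run gcloud compute instances create "${VM}" --zone "${ZONE}" \
    --disk "name=${DISK},device-name=${DISK}" \
    --metadata-from-file "startup-script=cwl_startup.sh"

# 3. The startup script runs automatically when the VM boots.
```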
The startup script, cwl_startup.sh, will run on the VM and:
- Mount and format the disk
- Download input files from Google Cloud Storage
- Install Docker
- Install cwltool
- Run the CWL workflow and wait until completion
- Copy output files to Google Cloud Storage
- Copy stdout and stderr logs to Google Cloud Storage
- Shutdown and delete the VM and disk
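In shell terms, the startup sequence is roughly the following. This is a sketch, not the actual cwl_startup.sh: the device path, mount point, file names, and output URI are placeholders, and the run wrapper echoes commands rather than executing them.

```shell
#!/bin/bash
# Placeholder values; cwl_startup.sh receives the real ones via VM metadata.
DEVICE="/dev/disk/by-id/google-cwl-disk"
MOUNT_POINT="/mnt/data"
WORKFLOW_FILE="transform.cwl"
SETTINGS_FILE="settings.json"
OUTPUT_URI="gs://MY-BUCKET/MY-PATH"

run() { echo "+ $*"; }   # echo instead of execute, so the sketch is safe

# Format and mount the attached disk
run mkfs.ext4 -F "${DEVICE}"
run mount "${DEVICE}" "${MOUNT_POINT}"

# Download inputs and install tooling
run gsutil -m cp -r "gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl" "${MOUNT_POINT}"
run apt-get install -y docker.io
run pip install cwltool

# Run the workflow, then copy results out to Cloud Storage
run cwltool "${WORKFLOW_FILE}" "${SETTINGS_FILE}"
run gsutil -m cp -r "${MOUNT_POINT}/output" "${OUTPUT_URI}"

# Finally, the script deletes its own VM (the attached disk goes with it)
run gcloud compute instances delete "$(hostname)" --delete-disks all -q
```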
Note that the CWL runner does not use the Pipelines API. If you don't have enough quota, the script will fail; it won't be queued to run when quota is available.
- Download the required script files, cwl_runner.sh, cwl_startup.sh, and cwl_shutdown.sh, or, if you prefer, clone or fork this GitHub repository.
- Enable the Genomics, Cloud Storage, and Compute Engine APIs on a new or existing Google Cloud Project using the Cloud Console
- Install and initialize the Google Cloud SDK.
- Follow the Cloud Storage instructions for Creating Storage Buckets to create a bucket to store workflow output and logging
Running a sample workflow in the cloud
This script should be able to support any CWL workflow supported by cwltool.
You can run the script with --help to see all of the command-line options.
This particular workflow requires:
- a reference genome bundle
- a DNA reads file in BAM format
- several CWL tool definitions
All of the required files have already been copied into Google Cloud Storage (at gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl), so we can just reference them when we run the CWL workflow.
Here's an example command-line:
./cwl_runner.sh \
  --workflow-file gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl/workflows/dnaseq/transform.cwl \
  --settings-file gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl/input/gdc-dnaseq-input.json \
  --input-recursive gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl \
  --output gs://MY-BUCKET/MY-PATH \
  --machine-type n1-standard-4
Set MY-BUCKET/MY-PATH to a path in a Cloud Storage bucket that you have write access to.
The workflow will start running. If all goes well, it should complete in a couple of hours.
Here's some more information about what's happening:
- The command will run the CWL workflow definition located at the --workflow-file path in Cloud Storage, using the workflow settings in the --settings-file.
- All path parameters defined in the --settings-file are relative to the location of the settings file.
- A reference genome is required as input; the reference genome files are identified in the --settings-file.
- This particular GDC workflow uses many relative paths to the definition files for the individual workflow steps. To preserve those relative paths, the GDC directory is recursively copied from the path passed to --input-recursive.
- Output files and logs will be written to the --output path.
- The whole workflow will run on a single VM instance of the specified --machine-type.
Monitoring your workflow
Once your job starts, it will have an OPERATION-ID assigned, which you can use to check status and find the VM and disk in your cloud project.
To monitor your job, check the status to see if it's RUNNING, COMPLETED, or FAILED:
gsutil cat gs://MY-BUCKET/MY-PATH/status-OPERATION-ID.txt
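If you would rather wait in a loop than poll by hand, something like the following works. It is a sketch: the status URI uses the same placeholders as above, and job_done and wait_for_status are hypothetical helpers, not part of the provided scripts.

```shell
#!/bin/bash
# Placeholder URI; substitute your bucket, path, and operation ID.
STATUS_URI="gs://MY-BUCKET/MY-PATH/status-OPERATION-ID.txt"

# True once the status file reports a terminal state.
job_done() {
  case "$1" in
    COMPLETED|FAILED) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll once a minute; call this to block until the job finishes.
wait_for_status() {
  until job_done "$(gsutil cat "${STATUS_URI}" 2>/dev/null)"; do
    sleep 60
  done
  gsutil cat "${STATUS_URI}"
}
```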
While your job is running, you can see the VM in the Cloud Console or from the command line. When the job completes, the VM will no longer be found unless --keep-alive is set. Command line:
gcloud compute instances describe cwl-vm-OPERATION-ID
Canceling a job
To cancel a running job, you can terminate the VM from the cloud console or command-line:
gcloud compute instances delete cwl-vm-OPERATION-ID
Debugging a job
To debug a failed run, look at the log files in your output directory:
gsutil cat gs://MY-BUCKET/MY-PATH/stderr-OPERATION-ID.txt | less
gsutil cat gs://MY-BUCKET/MY-PATH/stdout-OPERATION-ID.txt | less
For additional debugging, you can rerun this script with --keep-alive and ssh into the VM. If you use --keep-alive, you will need to manually delete the VM to avoid charges.
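For reference, the ssh and cleanup steps look like this. The operation ID and zone are placeholders for your run's actual values, and the run wrapper echoes the commands so the sketch can be executed without side effects.

```shell
#!/bin/bash
OPERATION_ID="OPERATION-ID"   # placeholder for your run's actual ID
ZONE="us-central1-f"          # assumed zone; use the one your VM is in

run() { echo "+ $*"; }  # echo instead of execute

# Connect to the still-running VM to inspect logs and intermediate files
run gcloud compute ssh "cwl-vm-${OPERATION_ID}" --zone "${ZONE}"

# When finished, delete the VM so you are no longer billed for it
run gcloud compute instances delete "cwl-vm-${OPERATION_ID}" --zone "${ZONE}"
```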