
Run Picard and GATK tools on Cloud-Resident Genomic Data

The properly rendered version of this document can be found at Read The Docs.

If you are reading this on GitHub, you should instead click here.

Introduction

`Picard`_/`GATK`_ tools are command line utilities for genomic sequencing data processing that typically take BAM and other files as input and produce modified BAM files.

These tools are frequently chained together into pipelines to perform step-by-step processing of the sequencing data all the way from unaligned sequencer output to variant calls (e.g. see Broad best practices).
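
As a flavor of what such a pipeline looks like, here is a minimal sketch chaining two standard Picard tools on hypothetical local files (picard.jar stands in for whatever Picard build you already have):

$ java -jar picard.jar SortSam \
   INPUT=sample.unsorted.bam \
   OUTPUT=sample.sorted.bam \
   SORT_ORDER=coordinate
$ java -jar picard.jar MarkDuplicates \
   INPUT=sample.sorted.bam \
   OUTPUT=sample.dedup.bam \
   METRICS_FILE=sample.dedup.metrics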


We are teaching these tools to accept cloud-based datasets as input. The foundation for cloud data access is now in the HTSJDK library, and we have converted a number of Picard tools.

If your dataset is loaded into a cloud provider supporting `GA4GH`_ API (e.g. `Google Genomics`_) or you use one of the available datasets from :doc:`/use_cases/discover_public_data/index`, you will be able to run a Picard tool against it, reading data directly from the cloud.


New: see the video version of this tutorial.

Below is a step-by-step guide on how to build Picard tools with GA4GH support, set up access to genomics data in the cloud, and run the tools.

By the end of this tutorial you will be able to run a Picard tool, give it a URL identifying a genomic dataset in the cloud, and see the output of processing the data directly from the cloud.

We also have a detailed description of the changes to HTSJDK and Picard to help you write your own cloud-aware client on top of HTSJDK or help us convert more Picard tools.

Set up access to genomics data in Google Cloud

We will assume you are starting from a completely blank slate, so please skip any steps that are redundant for you.

If you are already using Google Genomics API and have a project set up for this you can skip this section and go directly to Build Picard tools with GA4GH support.
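
If you are starting from scratch, the rough shape of the setup is: install the Google Cloud SDK, authenticate, point gcloud at your project, enable the Genomics API for that project, and download OAuth client credentials as client_secrets.json (the file the Picard commands below rely on). A minimal sketch, with your-project-id as a placeholder:

$ gcloud auth login
$ gcloud config set project your-project-id
# Then, in the Google Cloud Console: enable the Genomics API for the project and
# create an OAuth client ID, downloading its credentials as client_secrets.json.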

Build Picard tools with GA4GH support

You will need Maven and Ant build tools installed on your machine.
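
If you are not sure whether they are already installed, a quick version check is:

$ java -version
$ mvn -version
$ ant -version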

  1. Fetch the `Picard`_, `HTSJDK`_ and `gatk-tools-java`_ projects required for building Picard with GA4GH support.
$ git clone https://github.com/broadinstitute/picard.git
$ cd picard
$ git clone https://github.com/samtools/htsjdk
$ cd ..
$ git clone https://github.com/googlegenomics/gatk-tools-java
  2. Build gatk-tools-java and copy the resulting JAR into the Picard library folder:
$ cd gatk-tools-java
$ mvn compile package
$ mkdir ../picard/lib/gatk-tools-java
$ cp target/gatk-tools-java*minimized.jar ../picard/lib/gatk-tools-java/
  3. Build the Picard version with GA4GH support:
# Assuming you are still in the gatk-tools-java directory
$ cd ../picard
$ ant -lib lib/ant package-commands-ga4gh

  4. Make sure you put the client_secrets.json file in the parent folder just above Picard. You should end up with the following directory structure:

your_client_folder \
  client_secrets.json
  gatk-tools-java \
  picard \
    htsjdk \
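
As a quick sanity check (run from your_client_folder), you can verify that the build produced the Picard JAR and that the secrets file and the gatk-tools-java JAR are where the commands below expect them:

$ ls client_secrets.json
$ ls picard/dist/picard.jar
$ ls picard/lib/gatk-tools-java/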

Run Picard tools with an input from the cloud

You can now run the ViewSam tool, which prints the contents of the supplied INPUT:

$ cd picard
$ java \
   -jar dist/picard.jar ViewSam \
   INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/CK256frpGBD44IWHwLP22R4/ \
   GA4GH_CLIENT_SECRETS=../client_secrets.json
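
ViewSam streams the whole readgroupset as SAM text to standard output, so if you only want to eyeball the first few records it can be convenient to pipe the output through head (a sketch; adjust the number of lines to taste):

$ java \
   -jar dist/picard.jar ViewSam \
   INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/CK256frpGBD44IWHwLP22R4/ \
   GA4GH_CLIENT_SECRETS=../client_secrets.json \
   | head -n 20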

This command uses the older, slower REST-based API. To run using the GRPC API implementation (which is much faster), use the following command, which utilizes the ALPN JARs that come with gatk-tools-java and enables GRPC support:

java \
 -Xbootclasspath/p:../gatk-tools-java/lib/alpn-boot-8.1.3.v20150130.jar \
 -Dga4gh.using_grpc=true \
 -jar dist/picard.jar ViewSam \
 INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/CK256frpGBD44IWHwLP22R4/ \
 GA4GH_CLIENT_SECRETS=../client_secrets.json

For Java 7 (as opposed to 8), use alpn-boot-7.1.3.v20150130.jar.

We use a test readset here from the genomics-test-data project.

Specifying a genomics region to use from the readset

The INPUT URLs are of the form https://<GA4GH provider>/readgroupsets/<readgroupset id>[/sequence][/start-end].

For example:

java -jar dist/picard.jar ViewSam \
INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets\
/CMvnhpKTFhD3he72j4KZuyc/chr17/41196311-41207499 \
GA4GH_CLIENT_SECRETS=../client_secrets.json
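
As an illustration of how the pieces of the URL fit together (the shell variables below are just for readability and reuse the readgroupset and region from the example above):

$ READGROUPSET=CMvnhpKTFhD3he72j4KZuyc
$ SEQUENCE=chr17
$ RANGE=41196311-41207499
$ java -jar dist/picard.jar ViewSam \
   INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/$READGROUPSET/$SEQUENCE/$RANGE \
   GA4GH_CLIENT_SECRETS=../client_secrets.json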

Timing the reading speed from the cloud

You can run gatk-tools-java/src/main/scripts/example.sh with and without the "grpc" command line parameter to see the difference in reading speed. The timing statistics are dumped to the terminal. We benchmarked an ~11x speed improvement with GRPC compared to REST, giving ~12,000 reads/second.
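
For example (a sketch, assuming your current directory is gatk-tools-java; check the script itself for any extra arguments it expects):

$ ./src/main/scripts/example.sh        # REST-based reading
$ ./src/main/scripts/example.sh grpc   # GRPC-based reading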

The tests were done on the Platinum Genomes NA12877_S1.bam dataset; please see the detailed writeup of the test procedure and results if you want to repeat the test.

We therefore recommend running the GRPC variants of the command lines.

Other Picard tools you can run

You can run the MarkDuplicates or MarkDuplicatesWithMateCigar tools like this:

java \
  -Xbootclasspath/p:../gatk-tools-java/lib/alpn-boot-8.1.3.v20150130.jar \
  -Dga4gh.using_grpc=true \
  -jar dist/picard.jar MarkDuplicates \
  INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/CK256frpGBD44IWHwLP22R4/ \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  GA4GH_CLIENT_SECRETS=../client_secrets.json
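
When it finishes, the duplicate-marked reads are in output.bam and the duplication statistics are in output.metrics; one quick way to inspect them with the same locally built JAR is:

$ cat output.metrics
$ java -jar dist/picard.jar ViewSam INPUT=output.bam | head -n 20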

Figuring out a URL for your dataset

In the examples above we have been using URLs of the form https://www.googleapis.com/genomics/v1beta2/readgroupsets/XXX, where XXX is the ID of the readgroupset.

How do you find the ID of a readgroupset from the :doc:`/use_cases/discover_public_data/index` set or from your own project?

We will do it step by step using the command line API client:

$ gcloud alpha genomics readgroupsets list 10473108253681171589 --limit 10
ID                      NAME     REFERENCE_SET_ID
CMvnhpKTFhDq9e2Yy9G-Bg  HG02573  EOSt9JOVhp3jkwE
CMvnhpKTFhCEmf_d_o_JCQ  HG03894  EOSt9JOVhp3jkwE
...
  • Note the NAME column: it corresponds to the file name. The ID column is the ID of the readgroupset we are looking for.

Now let's suppose we are not looking for one of the readgroupsets from the genomics public data but instead want to use one from our own project. In this case we need to figure out the dataset ID for our files first, before we can use the "readgroupsets list" command to list the individual readgroupsets.

  • Let's say we want to figure out which dataset IDs are present under the genomics-test-data project.
  • First we need to set the project ID for subsequent commands to our project:
$ gcloud config set project genomics-test-data
  • Now we can issue this command:
$ gcloud alpha genomics datasets list --limit 10
  • The output will list the dataset(s) present in the project together with their IDs, and we can then use the "readgroupsets list" command to get the ID of a readgroupset under one of the datasets, as sketched below.
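
Putting the pieces together, a sketch of the full lookup (the project, dataset and readgroupset IDs below are placeholders you would replace with values from your own listings):

$ gcloud config set project your-project-id
$ gcloud alpha genomics datasets list --limit 10
$ gcloud alpha genomics readgroupsets list YOUR_DATASET_ID --limit 10
$ java -jar dist/picard.jar ViewSam \
   INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/YOUR_READGROUPSET_ID/ \
   GA4GH_CLIENT_SECRETS=../client_secrets.json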