Apache Spark jobs such as Principal Coordinate Analysis.

spark-examples

The projects in this repository demonstrate working with genomic data accessible via the Google Genomics API using Apache Spark.

If you are ready to start coding, read on. If you are looking for a task-oriented list (e.g., "How do I compute principal coordinate analysis with Google Genomics?"), the Google Genomics Cookbook is a better place to start.

Getting Started

  1. Clone this repository with git clone.

  2. If you have not already done so, follow the Google Genomics getting started instructions to set up your environment including installing gcloud and running gcloud init.

  3. Download and install Apache Spark.

  4. Install SBT.

  5. This project now includes code for calling the Genomics API using gRPC. To use gRPC, you'll need a version of ALPN that matches your JRE version.

  6. See the ALPN documentation for a table of which ALPN jar to use for your JRE version.

  7. Then download the matching alpn-boot jar and note where you saved it; the Local Run commands pass its path to the JVM through SBT_OPTS.
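To find the JRE version to look up in the ALPN table (steps 5 to 7), something like the following may help. It is a minimal sketch: `jre_version` is a hypothetical helper, not part of this repository, and the sample version string is illustrative only.

```shell
# Hypothetical helper (not part of this repository): extract the quoted
# version from the first matching line that `java -version` prints.
jre_version() {
  sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Illustrative input only; the string on your machine will differ.
sample_version=$(printf 'openjdk version "1.8.0_121"\n' | jre_version)
echo "$sample_version"

# On a machine with Java installed (java -version writes to stderr):
#   java -version 2>&1 | jre_version
```

The helper reads from standard input so that it can be tried on a fixed string before pointing it at a real JVM.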

Local Run

From the spark-examples directory, run sbt run.

Use the following flags to match your runtime configuration:

$ export SBT_OPTS='-Xbootclasspath/p:/YOUR/PATH/TO/alpn-boot-YOUR-VERSION.jar'
$ sbt "run --help"
  -o, --output-path  <arg>
  -s, --spark-master  <arg>      A spark master URL. Leave empty if using spark-submit.
  ...
      --help                     Show help message

For example:

$ sbt "run --spark-master local[4]"

A menu should appear asking you to pick the sample to run:

Multiple main classes detected, select one to run:

 [1] com.google.cloud.genomics.spark.examples.SearchVariantsExampleKlotho
 [2] com.google.cloud.genomics.spark.examples.SearchVariantsExampleBRCA1
 [3] com.google.cloud.genomics.spark.examples.SearchReadsExample1
 [4] com.google.cloud.genomics.spark.examples.SearchReadsExample2
 [5] com.google.cloud.genomics.spark.examples.SearchReadsExample3
 [6] com.google.cloud.genomics.spark.examples.SearchReadsExample4
 [7] com.google.cloud.genomics.spark.examples.VariantsPcaDriver
 
Enter number:
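To skip the interactive menu, sbt's `runMain` task takes a main class name followed by its arguments. The sketch below only assembles and prints such a command. It assumes an sbt version that provides `runMain` and that the examples accept the same flags as `run`; neither has been checked against this project's build.

```shell
# One of the main classes listed in the menu above.
MAIN_CLASS='com.google.cloud.genomics.spark.examples.SearchReadsExample1'

# runMain <class> <args...> runs that class directly, with no prompt.
SBT_TASK="runMain ${MAIN_CLASS} --spark-master local[4]"

# Print the full command for review.
echo "sbt \"${SBT_TASK}\""

# To run it for real, from the spark-examples directory:
#   sbt "$SBT_TASK"
```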

Troubleshooting:

If you see java.lang.OutOfMemoryError: PermGen space errors (possible on Java 7 and earlier, which still have a permanent generation), set the following SBT_OPTS flag:

export SBT_OPTS='-XX:MaxPermSize=256m'
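Note that each `export SBT_OPTS=...` replaces the previous value, so setting the PermGen flag this way drops the ALPN boot classpath set earlier. A minimal sketch of combining the two, assuming sbt passes the space-separated options in SBT_OPTS to the JVM; `ALPN_JAR` is a variable name chosen here and its value is the same placeholder path as above.

```shell
# Placeholder path, as in the Local Run example; substitute your own.
ALPN_JAR='/YOUR/PATH/TO/alpn-boot-YOUR-VERSION.jar'

# Put both JVM options in one value so neither export overwrites the other.
export SBT_OPTS="-Xbootclasspath/p:${ALPN_JAR} -XX:MaxPermSize=256m"

echo "$SBT_OPTS"
```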

Run on Google Compute Engine

(1) Build the assembly.

sbt assembly

(2) Deploy your Spark cluster using Google Cloud Dataproc.

gcloud beta dataproc clusters create example-cluster --scopes cloud-platform

(3) Copy the assembly jar to the master node.

gcloud compute copy-files \
  target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar  example-cluster-m:~/

(4) ssh to the master.

gcloud compute ssh example-cluster-m

(5) Run one of the examples.

spark-submit --class com.google.cloud.genomics.spark.examples.SearchReadsExample1 \
  googlegenomics-spark-examples-assembly-1.0.jar
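The cluster name appears in several of the steps above, and the master node is named after the cluster with an -m suffix (example-cluster-m). A minimal sketch that keeps those names in variables and prints the commands for review rather than running them; the variable names are illustrative, not something the project defines.

```shell
# Names from the steps above; change CLUSTER to rename everything at once.
CLUSTER='example-cluster'
MASTER="${CLUSTER}-m"   # master node: cluster name plus "-m"
JAR='googlegenomics-spark-examples-assembly-1.0.jar'

# Print each command instead of running it, so nothing is created or copied.
echo "gcloud beta dataproc clusters create ${CLUSTER} --scopes cloud-platform"
echo "gcloud compute copy-files target/scala-2.10/${JAR} ${MASTER}:~/"
echo "gcloud compute ssh ${MASTER}"
```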

Running PCA variant analysis on GCE

To run the variant PCA analysis on GCE, make sure you have followed all the steps in the previous section and that you can run at least one of the examples.

Run the example PCA analysis for BRCA1 on the 1000 Genomes Project dataset.

spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
  googlegenomics-spark-examples-assembly-1.0.jar

The analysis prints two principal components for each sample to the console. The last few lines look like this:

...
NA20811		0.0286308791579312	-0.008456233951873527
NA20812		0.030970386921818943	-0.006755469223823698
NA20813		0.03080348019961635	-0.007475822860939408
NA20814		0.02865238920148145	-0.008084003476919057
NA20815		0.028798695736608034	-0.003755789964021788
NA20816		0.026104805529612096	-0.010430718823329282
NA20818		-0.033609576645005836	-0.026655905606186293
NA20819		0.032019557126552155	-0.00775750983842731
NA20826		0.03026607917284046	-0.009102704080927001
NA20828		-0.03412964005321165	-0.025991697661590686
NA21313		-0.03401702847363714	-0.024555217139987182

This pipeline is described in greater detail in "How do I compute principal coordinate analysis with Google Genomics?"
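Because each output line is a sample name followed by two whitespace-separated values, the console output is easy to reshape for plotting. A minimal sketch using awk, fed with three rows copied from the sample output above; the column names sample, pc1, and pc2 are labels chosen here, not names the driver emits.

```shell
# Three rows copied from the sample output above (name, two tabs, two values).
pca_rows=$(printf '%s\t\t%s\t%s\n' \
  NA20816 0.026104805529612096 -0.010430718823329282 \
  NA20818 -0.033609576645005836 -0.026655905606186293 \
  NA20819 0.032019557126552155 -0.00775750983842731)

# awk splits on runs of whitespace by default, so the doubled tab is harmless.
# NF == 3 keeps only lines with exactly three fields.
pca_csv=$(printf '%s\n' "$pca_rows" |
  awk 'BEGIN { OFS = ","; print "sample,pc1,pc2" } NF == 3 { print $1, $2, $3 }')

echo "$pca_csv"
```

With real output, the same awk command can read the console output of spark-submit. The NF == 3 test drops lines such as the leading "..." but would also pass any unrelated log line that happens to have three fields, so check the result before plotting.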

Debugging

For more information, see https://cloud.google.com/dataproc/faq