# VariantSpark-k BigChr22

This notebook is designed to run inside a Kubernetes cluster. Running in-cluster lets it use Kubernetes internal DNS to find the API server (the `MASTER` endpoint). It also means that the pod hosting this notebook container can supply the service account token needed to authenticate with the Kubernetes API (`spark-submit` already knows where to look for this token).
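As a quick sanity check before submitting, the sketch below tests whether the notebook really is running in a pod with mounted credentials. It assumes the standard Kubernetes service account mount path. `SA_DIR` and `check_in_cluster` are names introduced here for illustration and are not part of the notebook.

```shell
# Default mount point for a pod's service account credentials
# (standard Kubernetes location; override SA_DIR if your cluster differs).
SA_DIR="${SA_DIR:-/var/run/secrets/kubernetes.io/serviceaccount}"

# Hypothetical helper: succeeds only if both the token and the
# cluster CA certificate exist and are non-empty in the given directory.
check_in_cluster() {
    local dir="$1"
    [ -s "${dir}/token" ] && [ -s "${dir}/ca.crt" ]
}

if check_in_cluster "${SA_DIR}"; then
    echo "in-cluster credentials found in ${SA_DIR}"
else
    echo "no service account credentials in ${SA_DIR}; is this running inside a pod?"
fi
```

If the check fails, the job submission in the cell below is unlikely to authenticate, so it is worth resolving first.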

You will need to create an S3 bucket, upload the input files to it, and then set the `INPUT_BUCKET` variable in the cell below.
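The upload step could look roughly like the following sketch. It assumes the AWS CLI is installed and configured, and that the two input files named in the cell below are in the current directory. The bucket name `my-variant-spark-input` is a placeholder, and `run`, `upload_inputs` and `DRY_RUN` are helpers introduced here for illustration. By default the commands are only printed.

```shell
INPUT_BUCKET="${INPUT_BUCKET:-my-variant-spark-input}"  # placeholder bucket name
DRY_RUN="${DRY_RUN:-1}"                                  # 1 = print only, 0 = execute

# Hypothetical helper: print the command, and run it only when DRY_RUN is 0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN}" = "0" ]; then "$@"; fi
}

# Create the bucket and copy both input files into it.
upload_inputs() {
    local bucket="$1"
    run aws s3 mb "s3://${bucket}"
    run aws s3 cp ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.bz2 "s3://${bucket}/"
    run aws s3 cp chr22-labels_release_v3.20101123.csv "s3://${bucket}/"
}

upload_inputs "${INPUT_BUCKET}"
```

Setting `DRY_RUN=0` before running the block executes the commands instead of printing them.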

Then run the cell and wait. The job gives very little feedback while it runs, but you can watch the pods at work in the Kubernetes UI.
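If the Kubernetes UI is not available, `kubectl` offers a rough equivalent. The sketch below assumes `kubectl` is installed and pointed at the same cluster. It relies on the `default` namespace and the `spark-role` label that appear in the log output further down. `run`, `watch_job` and `DRY_RUN` are illustrative helpers, and by default the commands are only printed.

```shell
NAMESPACE="${NAMESPACE:-default}"  # namespace reported in the job's log output
DRY_RUN="${DRY_RUN:-1}"            # 1 = print only, 0 = execute

# Hypothetical helper: print the command, and run it only when DRY_RUN is 0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN}" = "0" ]; then "$@"; fi
}

# List the driver and executor pods, then follow the driver's log.
watch_job() {
    local ns="$1"
    run kubectl get pods --namespace "${ns}" --selector spark-role=driver
    run kubectl get pods --namespace "${ns}" --selector spark-role=executor
    run kubectl logs --namespace "${ns}" --selector spark-role=driver --follow
}

watch_job "${NAMESPACE}"
```

Selecting by label avoids having to copy the generated driver pod name, which changes on every submission.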

In [2]:
%%bash

set -e

MASTER=https://kubernetes.default.svc:443
INPUT_BUCKET=variant-spark-llc

function fatal_error () {
	echo "ERROR: $1" 1>&2
	exit 1
}

# Fail early if MASTER is unset or empty.
if [ -z "${MASTER:-}" ]; then
    echo "You must set the MASTER variable to a Kubernetes API endpoint" 1>&2
    echo "Example: https://ABC.sk1.us-west-2.eks.amazonaws.com:443" 1>&2
    exit 1
fi

# Fail early if INPUT_BUCKET is unset or empty.
if [ -z "${INPUT_BUCKET:-}" ]; then
    echo "You must set the INPUT_BUCKET variable to a bucket containing input data" 1>&2
    echo "Example: variant-spark-k-storage" 1>&2
    exit 1
fi

[[ $(type -P "spark-submit") ]] || fatal_error  "\`spark-submit\` cannot be found. Please make sure it's on your PATH."

spark-submit \
    --class au.csiro.variantspark.cli.VariantSparkApp \
    --driver-class-path ./conf \
    --master k8s://${MASTER} \
    --deploy-mode cluster \
    --name VariantSparkBig22 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.executor.instances=24 \
    --conf spark.kubernetes.container.image=jamesrcounts/variantspark:002 \
    --jars http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar,http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,http://central.maven.org/maven2/joda-time/joda-time/2.9.9/joda-time-2.9.9.jar \
    local:///opt/spark/jars/variant-spark_2.11-0.2.0-SNAPSHOT-all.jar importance \
        -if s3a://${INPUT_BUCKET}/ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.bz2 \
        -ff s3a://${INPUT_BUCKET}/chr22-labels_release_v3.20101123.csv \
        -fc 22_16051249 \
        -v \
        -rn 500 \
        -rbs 16 \
        -ro "$@"


2018-07-21 00:47:54 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state: 
	 pod name: variantsparkbig22-550523f125b13f9695ccc37bc1b956a3-driver
	 namespace: default
	 labels: spark-app-selector -> spark-1b1233a410b94d1f82938efbba619714, spark-role -> driver
	 pod uid: b400a50d-8c7f-11e8-b7c4-0a5bf35c81f8
	 creation time: 2018-07-21T00:47:54Z
	 service account name: spark
	 volumes: spark-init-properties, download-jars-volume, download-files-volume, spark-token-d9t52
	 node name: N/A
	 start time: N/A
	 container images: N/A
	 phase: Pending
	 status: []
2018-07-21 00:47:54 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state: 
	 pod name: variantsparkbig22-550523f125b13f9695ccc37bc1b956a3-driver
	 namespace: default
	 labels: spark-app-selector -> spark-1b1233a410b94d1f82938efbba619714, spark-role -> driver
	 pod uid: b400a50d-8c7f-11e8-b7c4-0a5bf35c81f8
	 creation time: 2018-07-21T00:47:54Z
	 service account name: spark
	 volumes: spark-init-properties, download