
Sparknetes


Spark on Kubernetes. Based on the official Spark 2.4 documentation.

Requirements:

  • Make (gcc)
  • Docker (17+)
  • Kubernetes 1.8+

Spark Docker images

To build a base Docker image for launching Spark on Kubernetes, type:

make sparknetes-build spark-image

NOTE: This process may take several minutes (~20 mins; under the hood a Maven packaging task is running). Take a look at the Makefile to view default values and other variables.

This Docker image is available on Docker Hub under hypnosapos.
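
For reference, here is a minimal sketch of what a build like this typically involves, using Spark's bundled docker-image-tool.sh (the Maven flags and repository tag below are assumptions; the Makefile is the authoritative recipe):

# Build Spark with Kubernetes support (the slow Maven step; assumed flags)
./build/mvn -Pkubernetes -DskipTests clean package

# Build and push the runtime image with Spark's own tool
./bin/docker-image-tool.sh -r hypnosapos -t 2.4 build
./bin/docker-image-tool.sh -r hypnosapos -t 2.4 push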

Kubernetes cluster

Examples will be tested on the GKE service; here you have instructions to create a Kubernetes cluster.

Once our Kubernetes cluster is ready (for instance with the GKE_CLUSTER_NAME=spark variable exported), we have to perform a minimal bootstrapping operation:

export GKE_CLUSTER_NAME=spark
make gke-spark-bootstrap
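
A bootstrap like this typically boils down to fetching cluster credentials plus the RBAC that lets the Spark driver create executor pods. A hedged sketch (the service account and binding names are assumptions, though the driver logs below do show a "spark" service account in the default namespace):

gcloud container clusters get-credentials "$GKE_CLUSTER_NAME"

# Allow the driver pod to create/delete executor pods in the default namespace
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=default:spark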

Launch basic examples

[Figure: Spark on Kubernetes]

As the figure above shows, spark-submit commands are launched from a pod of a Kubernetes Job.

The first example is the well-known SparkPi:

make spark-basic-example
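
Under the hood this wraps a spark-submit call against the Kubernetes API server, along the lines of the official Spark 2.4 documentation. A sketch only; the API server address, executor count, and example jar path inside the image are assumptions:

spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=hypnosapos/spark:2.4 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar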

Job logs may be tracked this way:

JOB_NAME=<job_name> make gke-job-logs

NOTE: <job_name> is the name of the example with the suffix '-job' instead of '-example' (i.e. "spark-basic-job" instead of "spark-basic-example").
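
A log-tailing target like this usually reduces to finding the Job's pod and following its logs; a sketch (it relies on the job-name label that Kubernetes adds to pods created by a Job):

kubectl logs -f "$(kubectl get pods -l job-name=${JOB_NAME} -o name | head -n 1)"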

If it runs successfully, the spark-submit command should output something like this:

2018-05-27 14:00:16 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state:
	 pod name: spark-pi-63ba1a53bc663d728936c24c91fb339b-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2a6817ac76a248ba8a9cef7f3b988d82, spark-role -> driver
	 pod uid: 4698a7b8-61b6-11e8-b653-42010a840124
	 creation time: 2018-05-27T14:00:13Z
	 service account name: spark
	 volumes: spark-token-92jw7
	 node name: gke-spark-default-pool-ba0e670d-w989
	 start time: 2018-05-27T14:00:13Z
	 container images: hypnosapos/spark:2.4
	 phase: Succeeded
2018-05-27 14:00:16 INFO  LoggingPodStatusWatcherImpl:54 - Container final statuses:
Container name: spark-kubernetes-driver
	 Container image: hypnosapos/spark:2.4
	 Container state: Terminated
	 Exit code: 0
2018-05-27 14:00:16 INFO  Client:54 - Application spark-pi finished.

The second example is a linear regression; let's launch the log watcher inline too:

JOB_NAME=spark-ml-job make spark-ml-example gke-job-logs

GCS example

[Figure: GCS and Spark on Kubernetes]

This example uses a remote dependency for the GCS connector and the GCP credentials to authenticate against the internal metadata server. We've used a private jar and class (provide your values directly in the Makefile, where marked by < >), but essentially you only need to update your code to use the gs:// scheme instead of the typical hdfs:// scheme for data input/output.

JOB_NAME=spark-gcs-job make spark-gcs-example gke-job-logs
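
For reference, the underlying submission roughly follows this shape (a sketch: the main class, application jar, and bucket paths are placeholders as noted above, and the public GCS connector jar URL is an assumption):

spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-gcs \
  --class <your.main.Class> \
  --jars https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=hypnosapos/spark:2.4 \
  <your-application-jar> gs://<your-bucket>/input gs://<your-bucket>/output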

In order to view the driver UI through a public load balancer service:

export SPARK_APP_NAME=spark-gcs
make gke-spark-expose-ui
make gke-spark-open-ui
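
Exposing the UI typically amounts to publishing the driver pod's port 4040 behind a LoadBalancer service; a hedged sketch (the service name and the way the driver pod is located are assumptions):

# Expose the driver's UI port behind a public load balancer
kubectl expose pod <driver-pod-name> --name "${SPARK_APP_NAME}-ui" \
  --type LoadBalancer --port 4040 --target-port 4040

# Wait for an external IP, then open http://<external-ip>:4040
kubectl get svc "${SPARK_APP_NAME}-ui" -w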

[Screenshots: Driver UI - Stages, Driver UI - Executors]

Using spark-k8s operator

A few months ago the Google community published the k8s-spark-operator. Thus, it's time to check it out:

make gke-spark-operator-install
make gke-spark-operator-example
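
Installing the operator and launching an example is usually a Helm install followed by applying a declarative SparkApplication manifest; a sketch under those assumptions (chart repository and manifest names are assumptions, using Helm 2-era syntax):

helm repo add incubator https://storage.googleapis.com/kubernetes-charts-incubator
helm install incubator/sparkoperator --namespace spark-operator

# Submit a SparkApplication resource and let the operator drive it
kubectl apply -f <spark-pi-sparkapplication>.yaml
kubectl get sparkapplications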

Cleaning

Remove all Spark resources from the Kubernetes cluster:

make gke-spark-clean
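
Cleanup along these lines typically deletes the submitter Jobs plus any leftover driver/executor pods and exposed services; a sketch (the spark-role labels match those shown in the driver log above, the rest are assumptions):

kubectl delete jobs --all
kubectl delete pods -l spark-role=driver
kubectl delete pods -l spark-role=executor
kubectl delete svc "${SPARK_APP_NAME}-ui" --ignore-not-found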