Zeppelin: Add Zeppelin image to Spark example #17025

Merged · Nov 14, 2015 · 1 commit
3 changes: 2 additions & 1 deletion examples/examples_test.go
@@ -355,11 +355,12 @@ func TestExampleObjectSchemas(t *testing.T) {
"secret": &api.Secret{},
},
"../examples/spark": {
-"spark-driver-controller": &api.ReplicationController{},
"spark-master-controller": &api.ReplicationController{},
"spark-master-service": &api.Service{},
"spark-webui": &api.Service{},
"spark-worker-controller": &api.ReplicationController{},
"zeppelin-controller": &api.ReplicationController{},
"zeppelin-service": &api.Service{},
},
"../examples/spark/spark-gluster": {
"spark-master-service": &api.Service{},
96 changes: 79 additions & 17 deletions examples/spark/README.md
@@ -120,16 +120,16 @@ Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```

-After you know the master is running, you can use the (cluster
-proxy)[../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy] to
After you know the master is running, you can use the [cluster
proxy](../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy) to
connect to the Spark WebUI:

```console
kubectl proxy --port=8001
```

At which point the UI will be available at
-http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).
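
The proxy URL above follows the apiserver's service-proxy path convention, `/api/v1/proxy/namespaces/<namespace>/services/<service>/`. As a sketch of how that URL is composed (the helper below is hypothetical, not part of this example):

```python
def service_proxy_url(service, namespace="default", host="localhost", port=8001):
    # Build the kubectl-proxy URL for a Service, e.g. the spark-webui Service.
    return ("http://%s:%d/api/v1/proxy/namespaces/%s/services/%s/"
            % (host, port, namespace, service))

print(service_proxy_url("spark-webui"))
# http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/
```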

## Step Two: Start your Spark workers

@@ -172,32 +172,40 @@ you should now see the workers in the UI as well. *Note:* The UI will have links
to worker Web UIs. The worker UI links do not work (the links will attempt to
connect to cluster IPs, which Kubernetes won't proxy automatically).

-## Step Three: Start your Spark driver to launch jobs on your Spark cluster
## Step Three: Start the Zeppelin UI to launch jobs on your Spark cluster

-The Spark driver is used to launch jobs into Spark cluster. You can read more about it in
-[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html).
The Zeppelin UI pod can be used to launch jobs into the Spark cluster either via
a web notebook frontend or the traditional Spark command line. See
[Zeppelin](https://zeppelin.incubator.apache.org/) and
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html)
for more details.

```console
-$ kubectl create -f examples/spark/spark-driver-controller.yaml
-replicationcontrollers/spark-driver-controller
$ kubectl create -f examples/spark/zeppelin-controller.yaml
replicationcontrollers/zeppelin-controller
```

-The Spark driver needs the Master service to be running.
Zeppelin needs the Master service to be running.

-### Check to see if the driver is running
### Check to see if Zeppelin is running

```console
-$ kubectl get pods -lcomponent=spark-driver
-NAME READY STATUS RESTARTS AGE
-spark-driver-controller-vwb9c 1/1 Running 0 1m
$ kubectl get pods -lcomponent=zeppelin
NAME READY STATUS RESTARTS AGE
zeppelin-controller-ja09s 1/1 Running 0 53s
```
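
The pod name carries a random suffix (`ja09s` above), so scripts typically discover it from `kubectl get pods` output rather than hard-coding it. A minimal sketch of that parsing step (the helper and sample output are illustrative, not part of the example):

```python
def pod_names(kubectl_output, prefix=""):
    # Skip the NAME/READY header row, then take the first column of each row.
    rows = kubectl_output.strip().splitlines()[1:]
    names = [row.split()[0] for row in rows if row.strip()]
    return [n for n in names if n.startswith(prefix)]

sample = """NAME READY STATUS RESTARTS AGE
zeppelin-controller-ja09s 1/1 Running 0 53s"""
print(pod_names(sample, "zeppelin-"))  # ['zeppelin-controller-ja09s']
```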

## Step Four: Do something with the cluster

-Use the kubectl exec to connect to Spark driver and run a pipeline.
Now you have two choices, depending on your predilections. You can do something
graphical with the Spark cluster, or you can stay in the CLI.

### Do something fast with pyspark!

Use `kubectl exec` to connect to the Zeppelin driver and run a pipeline.

```console
-$ kubectl exec spark-driver-controller-vwb9c -it pyspark
$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
@@ -217,6 +225,24 @@ SparkContext available as sc, HiveContext available as sqlContext.
Congratulations, you just counted all of the words in all of the plays of
Shakespeare.
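
The pipeline elided above is the same one shown in the Zeppelin notebook later in this document: read the Shakespeare text, map each line to its word count, and sum. The Spark-free sketch below mirrors that logic over a plain list of lines (illustrative only, no cluster needed):

```python
# Local stand-in for:
#   sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
def count_words(lines):
    # Sum the per-line word counts, like the RDD map/sum pipeline.
    return sum(len(line.split()) for line in lines)

print(count_words(["To be or not to be", "that is the question"]))  # 10
```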

### Do something graphical and shiny!

Take the Zeppelin pod from above and port-forward the WebUI port:

```console
$ kubectl port-forward zeppelin-controller-ja09s 8080:8080
```

This forwards port 8080 on `localhost` to container port 8080. You can then find
Zeppelin at [http://localhost:8080/](http://localhost:8080/).

Create a "New Notebook". In there, type:

```
%pyspark
print sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
```

## Result

You now have services and replication controllers for the Spark master, Spark
Expand All @@ -235,10 +261,46 @@ After it's setup:

```console
kubectl get pods # Make sure everything is running
-kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark WebUI
-kubectl get pods -lcomponent=spark-driver # Get the driver pod to interact with.
kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark Master WebUI
kubectl get pods -lcomponent=zeppelin # Get the driver pod to interact with.
```

At which point the Master UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).

You can either interact with the Spark cluster using the traditional `spark-shell` /
`spark-submit` / `pyspark` commands via `kubectl exec` against the
`zeppelin-controller` pod, or if you want to interact with Zeppelin:

```console
kubectl port-forward zeppelin-controller-abc123 8080:8080 &
```

Then visit [http://localhost:8080/](http://localhost:8080/).

## Known Issues With Spark

* This provides a Spark configuration that is restricted to the cluster network,
  meaning the Spark master is only available as a cluster service. If you need
  to submit jobs using an external client other than Zeppelin or `spark-submit`
  on the `zeppelin` pod, you will need to provide a way for your clients to
  reach the service defined in
  [`examples/spark/spark-master-service.yaml`](spark-master-service.yaml). See
  [Services](../../docs/user-guide/services.md) for more information.

## Known Issues With Zeppelin

* The Zeppelin pod is large, so it may take a while to pull depending on your
  network. The size of the Zeppelin pod is something we're working on; see
  issue #17231.

* Zeppelin may take some time (about a minute) to run this pipeline the first
  time you execute it; the interpreter seems to take considerable time to load.

* On GKE, `kubectl port-forward` may not be stable over long periods of time. If
  you see Zeppelin go into `Disconnected` state (there will be a red dot on the
  top right as well), the `port-forward` probably failed and needs to be
  restarted. See #12179.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/examples/spark/README.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
26 changes: 21 additions & 5 deletions examples/spark/images/Makefile
@@ -1,12 +1,20 @@
all: push
push: push-spark push-zeppelin
.PHONY: push push-spark push-zeppelin spark zeppelin

# To bump the Spark version, bump the version in base/Dockerfile, bump
-# this tag and reset to v1. You should also double check the native
-# Hadoop libs at that point (we grab the 2.6.1 libs, which are
-# appropriate for 1.5.1-with-2.6).
# the version in zeppelin/Dockerfile, bump this tag and reset to
# v1. You should also double check the native Hadoop libs at that
# point (we grab the 2.6.1 libs, which are appropriate for
# 1.5.1-with-2.6). Note that you'll need to re-test Zeppelin (and it
# may not have caught up to newest Spark).
TAG = 1.5.1_v2

-containers:
# To bump the Zeppelin version, bump the version in
# zeppelin/Dockerfile and bump this tag and reset to v1.
ZEPPELIN_TAG = v0.5.5_v1

spark:
docker build -t gcr.io/google_containers/spark-base base
docker tag gcr.io/google_containers/spark-base gcr.io/google_containers/spark-base:$(TAG)
docker build -t gcr.io/google_containers/spark-worker worker
@@ -16,7 +24,11 @@ containers:
docker build -t gcr.io/google_containers/spark-driver driver
docker tag gcr.io/google_containers/spark-driver gcr.io/google_containers/spark-driver:$(TAG)

-push: containers
zeppelin:
docker build -t gcr.io/google_containers/zeppelin zeppelin
docker tag -f gcr.io/google_containers/zeppelin gcr.io/google_containers/zeppelin:$(ZEPPELIN_TAG)

push-spark: spark
gcloud docker push gcr.io/google_containers/spark-base
gcloud docker push gcr.io/google_containers/spark-base:$(TAG)
gcloud docker push gcr.io/google_containers/spark-worker
@@ -26,4 +38,8 @@ push: containers
gcloud docker push gcr.io/google_containers/spark-driver
gcloud docker push gcr.io/google_containers/spark-driver:$(TAG)

push-zeppelin: zeppelin
gcloud docker push gcr.io/google_containers/zeppelin
gcloud docker push gcr.io/google_containers/zeppelin:$(ZEPPELIN_TAG)

clean:
2 changes: 1 addition & 1 deletion examples/spark/images/base/Dockerfile
@@ -1,4 +1,4 @@
-FROM java:latest
FROM java:openjdk-8-jdk
> **Member:** What version does upstream recommend, as they typically establish a baseline?
>
> **Member Author:** "Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+." But I was actually pegging it because I didn't want the Zeppelin build process to go all over the place (in the mythical world where an openjdk-9-jdk gets released).
>
> **Member Author:** (And yes, I realize we're not building Spark in this image, I'm just pegging the JDK for the runtime.)
>
> **Member Author:** (And I didn't even push a new image, because :latest is actually == :openjdk-8-jdk right now.)
>
> **Contributor:** A simple scan of the Zeppelin docs shows Java 1.7 - https://zeppelin.incubator.apache.org/docs/install/install.html. Maybe that means 1.7 only or 1.7+; pinning on Java 8 to start makes sense.
>
> **Member Author:** Yeah, it seems to function on JDK 8, but I could kick it back if you want. Sorry, I did notice that pre-req as well and presumed it meant 1.7+. Other parts of that page are slightly in need of editing, too.
>
> **Contributor:** I'd say stick w/ the 1.7+ assumption until something goes wrong.


ENV hadoop_ver 2.6.1
ENV spark_ver 1.5.1
66 changes: 66 additions & 0 deletions examples/spark/images/zeppelin/Dockerfile
@@ -0,0 +1,66 @@
# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Based heavily on
# https://github.com/dylanmei/docker-zeppelin/blob/master/Dockerfile
# (which is similar to many others out there), but rebased onto maven
# image.
#
# This image is a composition of the official docker-maven
# Docker image from https://github.com/carlossg/docker-maven/ and
# spark-base.

FROM gcr.io/google_containers/spark-base:latest

ENV ZEPPELIN_TAG v0.5.5
ENV MAVEN_VERSION 3.3.3
ENV SPARK_MINOR 1.5
ENV SPARK_PATCH 1
ENV SPARK_VER ${SPARK_MINOR}.${SPARK_PATCH}
ENV HADOOP_MINOR 2.6
ENV HADOOP_PATCH 1
ENV HADOOP_VER ${HADOOP_MINOR}.${HADOOP_PATCH}

RUN curl -fsSL http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz | tar xzf - -C /usr/share \

> **Reviewer:** You use curl here, though you install it further down on line 46. (Though I changed the base image, so perhaps this isn't a problem for you on java, but then you should probably not try and install it again on line 46.)
>
> **Member Author:** curl is installed on buildpack-deps, so this part is a noop. Fixed, though.

&& mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
&& ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven

# libfontconfig is a workaround for
# https://github.com/karma-runner/karma/issues/1270, which caused a
# build break similar to
# https://www.mail-archive.com/users@zeppelin.incubator.apache.org/msg01586.html

RUN apt-get update \
> **Member:** Ohh man, do we really want to install mvn and devel tooling? Or poke upstream harder?
>
> **Member Author:** All the Dockerfiles I found for Zeppelin built it. This image takes a long time to build. My plan was to use this PR as a foil to poke the upstream (especially on ZEPPELIN-10, the multi-port issue).
>
> **Contributor:** Binaries for spark 1.4.0 are available, which would be preferable. @rnowling what version of zeppelin do you have running, and does it work w/ spark 1.5?
>
> **Member Author:** Where? http://zeppelin-project.org/download.html says:
> > Current binary release
> > We will provide binary release pretty soon. Until then, please build your own. It's not that difficult.
> > For install and configure, check install.
>
> The v0.5.5 tag (which I just noticed, and may peg to now - it happened while I was working) has no binaries on GitHub.
>
> **Contributor:** I guess zeppelin-project.org is out of sync w/ https://zeppelin.incubator.apache.org/download.html
>
> **Contributor:** @rnowling is spark 1.5 w/ zeppelin 0.5 something you can test out in O(1hr)? If so, please do.
>
> **Reply:** @mattf Zeppelin 0.5 doesn't support Spark 1.5 -- there is no Spark 1.5 profile under spark/pom.xml and (at least in my experience), Zeppelin is very specific about the version. The Zeppelin Dockerfile is using master, which includes the changes in 0.5.5 to support Spark 1.5. (Maybe best to bind a specific version instead of the latest from master.) I'm building and testing Z 0.5.5 w/ Spark 1.5 locally now.
>
> **Reply:** I have Zeppelin 0.5.5 working with Spark 1.5 -- Spark 1.5.1 is bundled as part of the spark-1.5 profile.
>
> **Member Author:** Yes, this Dockerfile is based on master, which I was testing with as of a few days ago and is practically 0.5.5. As I mentioned in https://github.com/kubernetes/kubernetes/pull/17025/files#r44416358, I can shift it over to 0.5.5 now that there's a tag that makes any sense. Zeppelin needs a better release cadence, though - their previous release was very old.
>
> **Member Author:** JFYI, if I'm looking at their create_release.sh correctly, they're also building for YARN. It's unclear if the YARN support will play well with Standalone, even when they do finally publish binaries.

&& apt-get install -y net-tools build-essential git wget unzip python python-setuptools python-dev python-numpy libfontconfig \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/apache/incubator-zeppelin.git --branch ${ZEPPELIN_TAG} /opt/zeppelin
RUN cd /opt/zeppelin && \
mvn clean package \
-Pspark-${SPARK_MINOR} -Dspark.version=${SPARK_VER} \
-Phadoop-${HADOOP_MINOR} -Dhadoop.version=${HADOOP_VER} \
-Ppyspark \
-DskipTests && \
rm -rf /root/.m2 && \
rm -rf /root/.npm && \
echo "Successfully built Zeppelin"

ADD zeppelin-log4j.properties /opt/zeppelin/conf/log4j.properties
> **Member:** Same comments as other entry around config. I would almost like to hold off further PRs until that is in place.
>
> **Member Author:** Yeah, I may take a trip sideways while we figure out how to get a more polished story around a lot of things in this space. I wanted to get a vertical slice up first.

ADD zeppelin-env.sh /opt/zeppelin/conf/zeppelin-env.sh
ADD docker-zeppelin.sh /opt/zeppelin/bin/docker-zeppelin.sh
EXPOSE 8080
ENTRYPOINT ["/opt/zeppelin/bin/docker-zeppelin.sh"]
21 changes: 21 additions & 0 deletions examples/spark/images/zeppelin/docker-zeppelin.sh
@@ -0,0 +1,21 @@
#!/bin/bash

# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

export ZEPPELIN_HOME=/opt/zeppelin
export ZEPPELIN_CONF_DIR="${ZEPPELIN_HOME}/conf"

echo "=== Launching Zeppelin under Docker ==="
/opt/zeppelin/bin/zeppelin.sh "${ZEPPELIN_CONF_DIR}"
26 changes: 26 additions & 0 deletions examples/spark/images/zeppelin/zeppelin-env.sh
@@ -0,0 +1,26 @@
#!/bin/bash

# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

export MASTER="spark://spark-master:7077"
export SPARK_HOME=/opt/spark
export ZEPPELIN_JAVA_OPTS="-Dspark.jars=/opt/spark/lib/gcs-connector-latest-hadoop2.jar"
# TODO(zmerlynn): Setting global CLASSPATH *should* be unnecessary,
# but ZEPPELIN_JAVA_OPTS isn't enough here. :(
export CLASSPATH="/opt/spark/lib/gcs-connector-latest-hadoop2.jar"
export ZEPPELIN_NOTEBOOK_DIR="${ZEPPELIN_HOME}/notebook"
export ZEPPELIN_MEM=-Xmx1024m
export ZEPPELIN_PORT=8080
export PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
6 changes: 6 additions & 0 deletions examples/spark/images/zeppelin/zeppelin-log4j.properties
@@ -0,0 +1,6 @@
# Set everything to be logged to the console.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%5p [%d] ({%t} %F[%M]:%L) - %m%n
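
For reference, the `ConversionPattern` above controls the layout of each console line: priority, date, thread, source file, method, and line number, then the message. The sketch below (purely illustrative, not part of the image) renders one record the way `%5p [%d] ({%t} %F[%M]:%L) - %m%n` would:

```python
def format_record(priority, date, thread, filename, method, line, message):
    # Mimic log4j's '%5p [%d] ({%t} %F[%M]:%L) - %m%n' for a single record:
    # %5s right-pads the priority to width 5, like %5p does.
    return "%5s [%s] ({%s} %s[%s]:%s) - %s\n" % (
        priority, date, thread, filename, method, line, message)

print(format_record("INFO", "2015-11-14 12:00:00,000", "main",
                    "Zeppelin.java", "main", "42", "Server started"), end="")
```
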
@@ -1,19 +1,21 @@
kind: ReplicationController
apiVersion: v1
metadata:
-name: spark-driver-controller
name: zeppelin-controller
spec:
replicas: 1
selector:
-component: spark-driver
component: zeppelin
template:
metadata:
labels:
-component: spark-driver
component: zeppelin
spec:
containers:
-- name: spark-driver
-image: gcr.io/google_containers/spark-driver:1.5.1_v2
- name: zeppelin
image: gcr.io/google_containers/zeppelin:v0.5.5_v1
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
10 changes: 10 additions & 0 deletions examples/spark/zeppelin-service.yaml
@@ -0,0 +1,10 @@
kind: Service
apiVersion: v1
metadata:
name: zeppelin
spec:
ports:
- port: 8080
targetPort: 8080
selector:
component: zeppelin
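
A quick sanity check on this manifest: a Service routes to every pod whose labels include all of the Service's `selector` key/value pairs, so its `selector` must match the labels in the controller's pod template above. A minimal sketch of that matching rule, with the two manifests transcribed as plain dicts (assumed shapes for illustration, not a real client):

```python
# Fields transcribed from zeppelin-service.yaml / zeppelin-controller.yaml above.
service = {"selector": {"component": "zeppelin"},
           "ports": [{"port": 8080, "targetPort": 8080}]}
pod_template = {"labels": {"component": "zeppelin"},
                "ports": [{"containerPort": 8080}]}

def service_matches(service, pod_template):
    # A Service selects pods whose labels include every selector key/value.
    return all(pod_template["labels"].get(k) == v
               for k, v in service["selector"].items())

print(service_matches(service, pod_template))  # True
```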