Spark

A fork of the official Spark Docker image, extended with clients for various data lakes.

This image is meant to be used either as a Spark history server or as a base image for Spark on K8s, since in both cases you will likely need clients for a distributed cache.
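
When used as a base image, it can be referenced from spark-submit against a Kubernetes master. A minimal sketch, in which the API server address and the example jar are placeholders:

spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=pilillo/spark:20210331 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar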

For the history-server use case, for instance:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spark-history-server
  name: spark-history-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-history-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: spark-history-server
    spec:
      containers:
      - name: spark-3-1-1
        image: pilillo/spark:20210331
        imagePullPolicy: IfNotPresent
        command:
        - /opt/spark/sbin/start-history-server.sh
        env:
        # spark-daemon.sh only checks whether SPARK_NO_DAEMONIZE is set,
        # not its value, so any value keeps the server in the foreground
        - name: SPARK_NO_DAEMONIZE
          value: "false"
        - name: SPARK_HISTORY_OPTS
          value: -Dspark.history.fs.logDirectory=s3a://mybucket/
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: ui
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 4
        ports:
        - containerPort: 18080 # the history server's default UI port
          name: ui
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: ui
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 4
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/spark/conf
          name: spark-conf-volume
      restartPolicy: Always
      securityContext:
        runAsNonRoot: true
        runAsUser: 185
      volumes:
      - configMap:
          defaultMode: 420
          name: spark-conf
        name: spark-conf-volume
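
Once deployed, the UI can be reached for a quick smoke test by port-forwarding to the Deployment above; a sketch:

kubectl port-forward deploy/spark-history-server 18080:18080
# then browse http://localhost:18080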

The spark-conf-volume mounted above is a ConfigMap containing the default Spark conf files, out of which spark-defaults.conf is modified to point to the distributed cache:

apiVersion: v1
data:
  ...
  spark-defaults.conf: |
    spark.hadoop.fs.s3a.impl                    org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.endpoint                myendpoint
    spark.hadoop.fs.s3a.connection.ssl.enabled  { true | false }
    spark.hadoop.fs.s3a.access.key              myaccesskey
    spark.hadoop.fs.s3a.secret.key              mysecretkey
    spark.eventLog.enabled                      true
    spark.eventLog.dir                          s3a://mybucket/
    spark.history.fs.logDirectory               s3a://mybucket/
    spark.hadoop.fs.s3a.path.style.access       { true | false }
  ...
kind: ConfigMap
metadata:
  name: spark-conf
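
One way to produce such a ConfigMap from a local conf directory is the following sketch (the directory path is a placeholder):

kubectl create configmap spark-conf --from-file=conf/ --dry-run=client -o yaml | kubectl apply -f -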

Also mind that, due to a bug, the bucket must not be empty when the history server first starts.
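
A simple workaround is to upload a placeholder object before starting the server, e.g. with the AWS CLI (bucket name, key and endpoint are illustrative):

aws --endpoint-url https://myendpoint s3api put-object --bucket mybucket --key .keep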