When using Spark Connect as part of Apache Spark, it is possible to seamlessly connect to a Spark cluster from PySpark directly, without running co-located with the driver.
However, Apache Spark does not provide a way to multiplex multiple users across multiple Spark clusters. This is where Spark Connect Proxy comes in. It is a simple proxy server that can be used to multiplex multiple users across multiple Spark clusters.
It is similar in spirit to other projects like Apache Livy or Apache Kyuubi.
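For reference, this is what a direct Spark Connect connection looks like from PySpark, without any proxy in between (assuming a Spark Connect server running on the default port 15002):

```python
from pyspark.sql import SparkSession

# Connect directly to a single, fixed Spark Connect endpoint. The proxy
# below removes the need to hard-code one specific cluster here.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.range(5).count())
```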
To install Spark Connect Proxy, simply check out this repository and run

```shell
make
```

Alternatively, you can download the latest binaries from the release page. The binaries are pre-compiled for Linux, Apple, and Windows for both X86 and ARM architectures.
Now you can start the server by running

```shell
./cmd/spark-connect-proxy/spark-connect-proxy
```

If you want to see how to set up the Spark Connect Proxy in a multi-backend scenario, please have a look at the example using Docker Compose. This setup includes:
- Spark Connect Proxy
- Two Spark instances with Spark Connect enabled
- Automatic testing container
See the Docker Compose Setup for additional documentation.
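For illustration only, a minimal sketch of what such a compose file could look like is shown below; the service names, image tag, commands, and port mappings are assumptions, not the repository's actual setup, which lives in the linked Docker Compose Setup:

```yaml
# Hypothetical sketch; see the repository's Docker Compose Setup for the
# real configuration. Images, commands, and ports here are assumptions.
services:
  spark-1:
    image: apache/spark:3.5.4
    environment:
      - SPARK_NO_DAEMONIZE=1
    command: /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.4
  spark-2:
    image: apache/spark:3.5.4
    environment:
      - SPARK_NO_DAEMONIZE=1
    command: /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.4
  proxy:
    build: .
    depends_on:
      - spark-1
      - spark-2
    ports:
      - "8080:8080" # Spark Connect endpoint exposed by the proxy
      - "8081:8081" # control API for creating sessions
```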
The proxy server can be configured using a YAML file. The following example shows how to configure the proxy server to connect to a pre-defined Spark cluster.
```yaml
---
backend_provider:
  # This is an arbitrary name to identify the backend provider.
  name: manual spark
  # Configures a pre-defined backend type that provides a list of already
  # started Spark clusters.
  type: PREDEFINED
  spec:
    endpoints:
      # A list of endpoints that the proxy can connect to.
      - url: localhost:15002

# Log level to use by the proxy.
log_level: debug
```
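Assuming the binary accepts a path to this file on the command line (the `--config` flag below is an assumption; check the binary's `--help` output for the actual flag), starting the proxy with this configuration could look like:

```shell
# Hypothetical flag name; consult --help for the real one.
./cmd/spark-connect-proxy/spark-connect-proxy --config config.yaml
```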
Please check out the following video:

To try out the proxy server, you can use the following example setup. First, start a Spark Connect server:
```shell
env SPARK_NO_DAEMONIZE=1 ./sbin/start-connect-server.sh \
  --conf spark.log.structuredLogging.enabled=false \
  --packages org.apache.spark:spark-connect_2.12:3.5.4
```

Next, start the proxy server:

```shell
./cmd/spark-connect-proxy/spark-connect-proxy
```

Finally, create a session via the control API and connect from PySpark:

```python
import requests
from pyspark.sql import SparkSession

# Create a new session and extract the ID
res = requests.post("http://localhost:8081/control/sessions")
id = res.text

# Connect to Spark Connect on port 8080, which is the default
# port for the proxy; Spark Connect usually listens on 15002.
remote = f"sc://localhost:8080/;x-spark-connect-session-id={id}"

# Connect to Spark
spark = SparkSession.builder.remote(remote).getOrCreate()
spark.range(10).collect()
```
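The snippet above assumes the control call always succeeds. A slightly hardened variant of the same flow, failing fast on HTTP errors and cleaning up the session afterwards (the timeout value and the `strip()` call are defensive assumptions, not requirements of the proxy):

```python
import requests
from pyspark.sql import SparkSession

# Ask the proxy's control API for a new session ID, failing loudly on
# HTTP errors instead of using an error page as the session ID.
res = requests.post("http://localhost:8081/control/sessions", timeout=10)
res.raise_for_status()
session_id = res.text.strip()

# Route all Spark Connect traffic through the proxy on port 8080.
remote = f"sc://localhost:8080/;x-spark-connect-session-id={session_id}"
spark = SparkSession.builder.remote(remote).getOrCreate()

try:
    print(spark.range(10).count())
finally:
    # Release the client-side resources when done.
    spark.stop()
```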
TODO

It would be great to further extend this project and make it more useful. For example, there are still many topics left to cover:
- Add support for more backend providers
- Add support for authentication and authorization as gRPC middleware
- And many others ...
Please reach out or create a pull request!
