When using Spark Connect as part of Apache Spark, it is possible to seamlessly connect to a Spark cluster from PySpark directly, without running co-located with the driver.
However, Apache Spark does not provide a way to multiplex multiple users across multiple Spark clusters. This is where Spark Connect Proxy comes in. It is a simple proxy server that can be used to multiplex multiple users across multiple Spark clusters.
It is similar in spirit to other projects like Apache Livy or Apache Kyuubi.
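For reference, this is what a direct Spark Connect connection looks like from PySpark, without any proxy in between (assuming a Spark Connect server running on the default port 15002):

```python
from pyspark.sql import SparkSession

# Connect directly to a single, fixed Spark Connect endpoint. The proxy
# below removes the need to hard-code one specific cluster here.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.range(5).count())
```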
To install Spark Connect Proxy, simply check out this repository and run

```shell
make
```

Alternatively, you can download the latest binaries from the release page. The binaries are pre-compiled for Linux, Apple, and Windows for both X86 and ARM architectures.
Now you can start the server by running

```shell
./cmd/spark-connect-proxy/spark-connect-proxy
```

If you want to see how to set up the Spark Connect Proxy in a multi-backend scenario, please have a look at the example using Docker Compose. This setup includes:
- Spark Connect Proxy
- Two Spark instances with Spark Connect enabled
- Automatic testing container
See the Docker Compose Setup for additional documentation.
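For illustration only, a minimal sketch of what such a compose file could look like is shown below; the service names, image tag, commands, and port mappings are assumptions, not the repository's actual setup, which lives in the linked Docker Compose Setup:

```yaml
# Hypothetical sketch; see the repository's Docker Compose Setup for the
# real configuration. Images, commands, and ports here are assumptions.
services:
  spark-1:
    image: apache/spark:3.5.4
    environment:
      - SPARK_NO_DAEMONIZE=1
    command: /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.4
  spark-2:
    image: apache/spark:3.5.4
    environment:
      - SPARK_NO_DAEMONIZE=1
    command: /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.4
  proxy:
    build: .
    depends_on:
      - spark-1
      - spark-2
    ports:
      - "8080:8080" # Spark Connect endpoint exposed by the proxy
      - "8081:8081" # control API for creating sessions
```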
The proxy server can be configured using a YAML file. The following example shows how to configure the proxy server to connect to a pre-defined Spark cluster.
```yaml
---
backend_provider:
  # This is an arbitrary name to identify the backend provider.
  name: manual spark
  # Configures a pre-defined backend type that provides a list of already
  # started Spark clusters.
  type: PREDEFINED
  spec:
    endpoints:
      # A list of endpoints that the proxy can connect to.
      - url: localhost:15002

# Log level to use by the proxy.
log_level: debug
```
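Assuming the binary accepts a path to this file on the command line (the `--config` flag below is an assumption; check the binary's `--help` output for the actual flag), starting the proxy with this configuration could look like:

```shell
# Hypothetical flag name; consult --help for the real one.
./cmd/spark-connect-proxy/spark-connect-proxy --config config.yaml
```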
Please check out the following video:

To try out the proxy server, you can use the following example setup. First, start a Spark Connect server:
```shell
env SPARK_NO_DAEMONIZE=1 ./sbin/start-connect-server.sh \
  --conf spark.log.structuredLogging.enabled=false \
  --packages org.apache.spark:spark-connect_2.12:3.5.4
```

Next, start the proxy server:

```shell
./cmd/spark-connect-proxy/spark-connect-proxy
```

Finally, create a session via the control API and connect from PySpark:

```python
import requests
from pyspark.sql import SparkSession

# Create a new session and extract the ID
res = requests.post("http://localhost:8081/control/sessions")
id = res.text

# Connect to Spark Connect on port 8080, which is the default
# port for the proxy; Spark Connect usually listens on 15002.
remote = f"sc://localhost:8080/;x-spark-connect-session-id={id}"

# Connect to Spark
spark = SparkSession.builder.remote(remote).getOrCreate()
spark.range(10).collect()
```
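The snippet above assumes the control call always succeeds. A slightly hardened variant of the same flow, failing fast on HTTP errors and cleaning up the session afterwards (the timeout value and the `strip()` call are defensive assumptions, not requirements of the proxy):

```python
import requests
from pyspark.sql import SparkSession

# Ask the proxy's control API for a new session ID, failing loudly on
# HTTP errors instead of using an error page as the session ID.
res = requests.post("http://localhost:8081/control/sessions", timeout=10)
res.raise_for_status()
session_id = res.text.strip()

# Route all Spark Connect traffic through the proxy on port 8080.
remote = f"sc://localhost:8080/;x-spark-connect-session-id={session_id}"
spark = SparkSession.builder.remote(remote).getOrCreate()

try:
    print(spark.range(10).count())
finally:
    # Release the client-side resources when done.
    spark.stop()
```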
TODO

It would be great to further extend this project and make it more useful. For example, there are still many topics left to cover:
- Add support for more backend providers
- Add support for authentication and authorization as gRPC middleware
- And many others ...
Please reach out or create a pull request!
