spark-connect (Ruby)

A pure-Ruby client for Apache Spark Connect, the gRPC-based, decoupled client/server protocol for Apache Spark.

spark-connect lets you build and run Spark DataFrame queries from Ruby against a remote Spark cluster, with an API that closely mirrors PySpark. No JVM, no local Spark installation, no spark-submit: just a gRPC connection to a Spark Connect server.

require "spark-connect"

spark = SparkConnect::SparkSession.builder
                                  .remote("sc://localhost:15002")
                                  .get_or_create

F = SparkConnect::F

spark.range(1, 1_000)
     .select(F.col("id"), (F.col("id") % 3).alias("bucket"))
     .group_by("bucket")
     .agg(F.count("*").alias("n"), F.sum("id").alias("total"))
     .order_by("bucket")
     .show

spark.stop

+------+---+------+
|bucket|  n| total|
+------+---+------+
|     0|333|166833|
|     1|333|166167|
|     2|333|166500|
+------+---+------+
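
As a sanity check on the table above, the same bucketing can be reproduced in plain Ruby without a Spark server. This is only a sketch: it assumes that `spark.range(1, 1_000)` yields the ids 1 through 999 (exclusive upper bound, as in PySpark), and the variable name `expected` is illustrative.

```ruby
# Plain-Ruby cross-check of the quick-start aggregation; no Spark involved.
expected = (1...1_000)                 # exclusive upper bound, like spark.range(1, 1_000)
  .group_by { |id| id % 3 }            # same bucketing as F.col("id") % 3
  .map { |bucket, ids| { bucket: bucket, n: ids.size, total: ids.sum } }
  .sort_by { |row| row[:bucket] }      # same ordering as order_by("bucket")

# Print one tab-separated line per bucket: bucket, n, total.
expected.each { |row| puts row.values.join("\t") }
```

Each bucket holds 333 ids, and the three totals add up to 499500, the sum of 1 through 999.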

What it supports

spark-connect implements the Spark Connect DataFrame, SQL, Structured Streaming, and Declarative Pipelines APIs: that is, everything except user-defined functions (UDFs) and the foreach/foreachBatch streaming sinks, whose Spark Connect protobuf definitions are not yet finalized. (The separate, experimental MLlib-over-Connect surface is also out of scope.)

Results decode through Apache Arrow into ordered, name-addressable Rows. Method names are snake_case (idiomatic Ruby) with camelCase aliases for the common PySpark names (groupBy, withColumn, orderBy, createDataFrame, ...), so PySpark code ports almost verbatim.
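
One common way to offer both naming styles in Ruby is `alias_method`. The sketch below is a toy stand-in: the `ToyFrame` class and its `ops` log are hypothetical and are not the gem's implementation. It only illustrates why a PySpark-style call and the idiomatic Ruby call can reach the same method.

```ruby
# Toy stand-in for a DataFrame (hypothetical; NOT the gem's real class).
# Each call returns a new frame with the operation appended to a log.
class ToyFrame
  attr_reader :ops

  def initialize(ops = [])
    @ops = ops
  end

  def group_by(*cols)
    ToyFrame.new(ops + [[:group_by, cols]])
  end

  def with_column(name, expr)
    ToyFrame.new(ops + [[:with_column, name, expr]])
  end

  # camelCase aliases pointing at the snake_case methods
  alias_method :groupBy, :group_by
  alias_method :withColumn, :with_column
end

snake = ToyFrame.new.group_by("dept").with_column("x", 1)
camel = ToyFrame.new.groupBy("dept").withColumn("x", 1)
puts snake.ops == camel.ops # both spellings record identical operations
```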

Requirements

  • Ruby >= 3.1
  • Apache Arrow C++/GLib system libraries (required by the red-arrow dependency).
  • A reachable Spark Connect server. This client is generated against the Spark Connect 4.1 protocol and supports Apache Spark 3.5 and above.

See the installation guide for details.

Installation

gem install rubygems-requirements-system
gem install spark-connect

Or in a Gemfile:

plugin "rubygems-requirements-system"
gem "spark-connect"

Running a local Spark Connect server

# Download a Spark distribution (4.1.0 shown here; 3.5+ also works)
curl -fsSL https://archive.apache.org/dist/spark/spark-4.1.0/spark-4.1.0-bin-hadoop3.tgz | tar xz
cd spark-4.1.0-bin-hadoop3

# Start the Connect server (requires Java 17+).
# Spark 4.0.0+ bundles the Connect server, so no extra packages are needed.
./sbin/start-connect-server.sh

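On Spark 3.5.x the Connect server is not bundled; start it with ./sbin/start-connect-server.sh --packages "org.apache.spark:spark-connect_2.13:3.5.5". The _2.13 suffix must match the Scala version of your distribution (so use a Scala 2.13 build), and 3.5.5 must match its Spark version.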
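
The version rule above (bundled from 4.0.0, pulled in via --packages on 3.5.x) can be written down as a small Ruby helper. This is a sketch under stated assumptions: `connect_server_args` is a hypothetical name that exists in neither the gem nor Spark, and it assumes the extra arguments are appended to `./sbin/start-connect-server.sh`.

```ruby
require "rubygems" # for Gem::Version (normally loaded already)

# Hypothetical helper (not part of the gem or of Spark): returns the extra
# arguments start-connect-server.sh needs for a given Spark version.
def connect_server_args(spark_version, scala: "2.13")
  version = Gem::Version.new(spark_version)
  raise ArgumentError, "this client needs Spark 3.5 or newer" if version < Gem::Version.new("3.5")

  # Spark 4.0.0+ bundles the Connect server, so nothing extra is required.
  return [] if version >= Gem::Version.new("4.0.0")

  # Spark 3.5.x: pull the Connect server in as a package.
  ["--packages", "org.apache.spark:spark-connect_#{scala}:#{spark_version}"]
end

puts connect_server_args("3.5.5").join(" ")
```

For a 4.x version the helper returns an empty array, which matches running the script with no extra arguments.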

The server listens on sc://localhost:15002 by default.
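
Before pointing the client at the server, it can help to confirm that something is accepting TCP connections on that port. The sketch below uses only Ruby's standard `socket` library; `port_open?` is a hypothetical helper, not part of the gem. A successful connect shows only that the port is open, not that the listener speaks Spark Connect.

```ruby
require "socket"

# Hypothetical helper (not part of the gem): true if a TCP connection to
# host:port succeeds within `timeout` seconds, false otherwise.
def port_open?(host, port, timeout: 1)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

if port_open?("localhost", 15_002)
  puts "something is listening on localhost:15002"
else
  puts "nothing is listening on localhost:15002"
end
```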

Connecting

Connection strings follow the standard Spark Connect grammar:

# Plaintext, local
SparkConnect::SparkSession.builder.remote("sc://localhost:15002").get_or_create

# TLS + bearer token (token implies SSL)
SparkConnect::SparkSession.builder
  .remote("sc://spark.example.com:443/;token=#{ENV['SPARK_TOKEN']};user_id=alice")
  .get_or_create

Supported parameters: token, user_id, user_agent, use_ssl, session_id, and any x-* custom gRPC headers.
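
To make the grammar concrete, here is a minimal sketch of how an `sc://host:port/;key=value;key=value` string breaks apart. The `parse_connect_url` helper is hypothetical and is not how the gem parses connection strings; it assumes the default port 15002 and the "token implies SSL" rule noted above, and it does not handle IPv6 hosts or URL-encoded values.

```ruby
# Hypothetical helper (NOT the gem's parser): splits a Spark Connect
# connection string into host, port, parameters and an SSL flag.
DEFAULT_CONNECT_PORT = 15_002

def parse_connect_url(url)
  raise ArgumentError, "expected an sc:// URL" unless url.start_with?("sc://")

  # "host:port/;k=v;k=v" -> authority before the first "/", parameters after it
  authority, _, param_str = url.delete_prefix("sc://").partition("/")
  host, _, port = authority.partition(":")

  params = param_str.split(";").reject(&:empty?).to_h do |pair|
    key, _, value = pair.partition("=")
    [key, value]
  end

  {
    host: host,
    port: port.empty? ? DEFAULT_CONNECT_PORT : Integer(port),
    params: params,
    # A token implies SSL; use_ssl=true turns it on explicitly.
    use_ssl: params.key?("token") || params["use_ssl"] == "true"
  }
end

conn = parse_connect_url("sc://spark.example.com:443/;token=abc;user_id=alice")
puts "#{conn[:host]}:#{conn[:port]} ssl=#{conn[:use_ssl]} user=#{conn[:params]['user_id']}"
```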

A quick tour

F = SparkConnect::F
T = SparkConnect::Types

# Build a DataFrame from local Ruby data
df = spark.create_data_frame([
  { "name" => "alice", "dept" => "eng", "salary" => 120 },
  { "name" => "bob",   "dept" => "eng", "salary" => 100 },
  { "name" => "carol", "dept" => "ops", "salary" => 110 },
])

# Transform and aggregate
df.where(F.col("salary") >= 105)
  .group_by("dept")
  .agg(F.avg("salary").alias("avg_salary"), F.count("*").alias("headcount"))
  .order_by(F.col("avg_salary").desc)
  .show

# Window functions
w = SparkConnect::Window.partition_by("dept").order_by(F.col("salary").desc)
df.with_column("rank", F.rank.over(w)).show

# Schemas
df.print_schema
df.schema.simple_string  #=> "struct<name:string,dept:string,salary:bigint>"

# SQL with parameters
spark.sql("SELECT * FROM VALUES (1), (2), (3) AS t(x) WHERE x > :min", { min: 1 }).show
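
To see what the tour should produce without a server, the same logic can be worked through in plain Ruby on the three sample rows. This is only a cross-check of the expected results, not a use of the gem; the variable names are illustrative, and `rank` is computed as one plus the number of same-department rows with a higher salary, which matches SQL `RANK()` semantics.

```ruby
people = [
  { "name" => "alice", "dept" => "eng", "salary" => 120 },
  { "name" => "bob",   "dept" => "eng", "salary" => 100 },
  { "name" => "carol", "dept" => "ops", "salary" => 110 }
]

# where(salary >= 105).group_by(dept).agg(avg, count).order_by(avg desc)
summary = people
  .select { |row| row["salary"] >= 105 }
  .group_by { |row| row["dept"] }
  .map do |dept, rows|
    { "dept" => dept,
      "avg_salary" => rows.sum { |row| row["salary"] }.fdiv(rows.size),
      "headcount" => rows.size }
  end
  .sort_by { |row| -row["avg_salary"] }

# rank() over (partition by dept order by salary desc)
ranked = people.map do |row|
  peers = people.select { |other| other["dept"] == row["dept"] }
  row.merge("rank" => 1 + peers.count { |other| other["salary"] > row["salary"] })
end

# SELECT * FROM VALUES (1), (2), (3) AS t(x) WHERE x > :min, with min = 1
filtered = [1, 2, 3].select { |x| x > 1 }

summary.each { |row| puts row.values.join("\t") }
ranked.each { |row| puts row.values_at("name", "dept", "rank").join("\t") }
puts filtered.join(", ")
```

Only alice and carol pass the salary filter, so each department has a headcount of 1, and eng (average 120.0) sorts ahead of ops (110.0). The window ranks are alice 1, bob 2, carol 1, and the parameterized query keeps 2 and 3.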

Documentation

Full documentation, including guides for every part of the API, lives at https://hyukjinkwon.github.io/spark-connect-ruby/.

The runnable scripts in examples/ cover quickstart, transformations, aggregations, joins, window functions, SQL, reading/writing, local data, and NA/stat helpers.

Compatibility

The client is generated against the Spark Connect 4.1 protocol and supports Apache Spark 3.5 and above (the Spark Connect wire protocol is backward compatible across these releases).

Development

git clone https://github.com/HyukjinKwon/spark-connect-ruby
cd spark-connect-ruby
bundle install

bundle exec rake spec      # unit specs (no server required)
bundle exec rake rubocop   # lint
bundle exec rake yard      # API docs

# Integration specs against a live server
SPARK_REMOTE=sc://localhost:15002 bundle exec rspec spec/integration

# Regenerate the protobuf/gRPC stubs from the vendored .proto files
bin/generate-protos

See CONTRIBUTING.md.
