spark-connect (Ruby)

A pure-Ruby client for Apache Spark Connect - the gRPC-based, decoupled client/server protocol for Apache Spark.

spark-connect lets you build and run Spark DataFrame queries from Ruby against a remote Spark cluster, with an API that closely mirrors PySpark. No JVM, no local Spark installation, no spark-submit - just a gRPC connection to a Spark Connect server.

require "spark-connect"

spark = SparkConnect::SparkSession.builder
                                  .remote("sc://localhost:15002")
                                  .get_or_create

F = SparkConnect::F

spark.range(1, 1_000)
     .select(F.col("id"), (F.col("id") % 3).alias("bucket"))
     .group_by("bucket")
     .agg(F.count("*").alias("n"), F.sum("id").alias("total"))
     .order_by("bucket")
     .show

spark.stop

+------+---+------+
|bucket|  n| total|
+------+---+------+
|     0|333|166833|
|     1|333|166167|
|     2|333|166500|
+------+---+------+
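
The counts and totals above can be checked without a cluster. The following is a minimal plain-Ruby sketch, not part of spark-connect, that assumes range(1, 1_000) excludes its upper bound as it does in PySpark:

```ruby
# Plain-Ruby cross-check of the quickstart output; no Spark involved.
# Assumption: range(1, 1_000) yields 1..999 (end-exclusive, as in PySpark).
ids = (1...1_000)

summary = ids.group_by { |id| id % 3 }   # bucket => [ids in that bucket]
             .sort                        # order by bucket, like order_by("bucket")
             .map { |bucket, members| { bucket: bucket, n: members.size, total: members.sum } }

summary.each { |row| puts format("%<bucket>d %<n>d %<total>d", row) }
```

Each printed line should match one row of the table (bucket, n, total).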

What it supports

spark-connect implements the Spark Connect DataFrame, SQL, Structured Streaming, and Declarative Pipelines APIs: everything except user-defined functions (UDFs) and the foreach/foreachBatch streaming sinks, whose Spark Connect protobuf definitions are not yet finalized. (The separate, experimental MLlib-over-Connect surface is also out of scope.)

Results decode through Apache Arrow into ordered, name-addressable Rows. Method names are snake_case (idiomatic Ruby) with camelCase aliases for the common PySpark names (groupBy, withColumn, orderBy, createDataFrame, ...), so PySpark code ports almost verbatim.
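
One way such dual naming can be provided in Ruby is with alias_method. The sketch below is illustrative only: TinyFrame is a hypothetical stand-in class, not the gem's DataFrame, and the gem may implement its aliases differently.

```ruby
# Hypothetical stand-in for a DataFrame-like class; not the gem's real code.
class TinyFrame
  attr_reader :steps

  def initialize(steps = [])
    @steps = steps
  end

  # Idiomatic snake_case methods; each returns a new frame with one more step.
  def group_by(*cols)
    TinyFrame.new(steps + [[:group_by, cols]])
  end

  def with_column(name, expr)
    TinyFrame.new(steps + [[:with_column, name, expr]])
  end

  # camelCase aliases so PySpark-style call sites keep working unchanged.
  alias_method :groupBy, :group_by
  alias_method :withColumn, :with_column
end

frame = TinyFrame.new.withColumn("bucket", "id % 3").groupBy("bucket")
p frame.steps
```

Both spellings dispatch to the same method body, so the two styles cannot drift apart.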

Requirements

- Ruby >= 3.1
- Apache Arrow C++/GLib system libraries (required by the red-arrow dependency):
  - macOS: brew install apache-arrow apache-arrow-glib
  - Ubuntu/Debian: install libarrow-glib-dev from the Apache Arrow APT repository
- A reachable Spark Connect server. This client is generated against the Spark Connect 4.1 protocol and supports Apache Spark 3.5 and above.

See the installation guide for details.

Installation

gem install rubygems-requirements-system
gem install spark-connect

Or in a Gemfile:

plugin "rubygems-requirements-system"
gem "spark-connect"

Running a local Spark Connect server

# Download a Spark distribution (4.1.0 shown here; 3.5+ also works)
curl -fsSL https://archive.apache.org/dist/spark/spark-4.1.0/spark-4.1.0-bin-hadoop3.tgz | tar xz
cd spark-4.1.0-bin-hadoop3

# Start the Connect server (requires Java 17+).
# Spark 4.0.0+ bundles the Connect server, so no extra packages are needed.
./sbin/start-connect-server.sh

On Spark 3.5.x the Connect server is not bundled; pass --packages "org.apache.spark:spark-connect_2.13:3.5.5" to start-connect-server.sh to pull it in (use a Scala 2.13 distribution).

The server listens on sc://localhost:15002 by default.
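
Before opening a session it can help to confirm that something is listening on that port. This is a minimal sketch using only Ruby's standard socket library; port_open? is a hypothetical helper, not part of spark-connect, and a successful connection only shows that a TCP listener exists, not that it speaks Spark Connect.

```ruby
require "socket"

# Returns true if a TCP connection to host:port succeeds within the timeout.
# This only proves something is listening, not that it is a Connect server.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  false
end

if port_open?("localhost", 15_002)
  puts "port 15002 is accepting connections"
else
  puts "nothing is listening on port 15002"
end
```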

Connecting

Connection strings follow the standard Spark Connect grammar:

# Plaintext, local
SparkConnect::SparkSession.builder.remote("sc://localhost:15002").get_or_create

# TLS + bearer token (token implies SSL)
SparkConnect::SparkSession.builder
  .remote("sc://spark.example.com:443/;token=#{ENV['SPARK_TOKEN']};user_id=alice")
  .get_or_create

Supported parameters: token, user_id, user_agent, use_ssl, session_id, and any x-* custom gRPC headers.
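
To make the grammar concrete, here is a rough sketch of how a string of the form sc://host:port/;key=value;key=value breaks down. It is not the gem's parser: parse_connect_url and ConnectTarget are hypothetical names, and the sketch ignores percent-decoding and IPv6 hosts. The fallback port of 15002 matches the server default mentioned earlier.

```ruby
# Illustrative parser for the sc://host:port/;key=value;... grammar.
# A sketch only: not the gem's parser; no percent-decoding, no IPv6 hosts.
ConnectTarget = Struct.new(:host, :port, :params, keyword_init: true)

def parse_connect_url(url)
  raise ArgumentError, "scheme must be sc://" unless url.start_with?("sc://")

  # Split "host:port" from the ";key=value" parameter section after the slash.
  authority, _, param_str = url.delete_prefix("sc://").partition("/")
  host, _, port = authority.partition(":")
  raise ArgumentError, "host is required" if host.empty?

  params = param_str.split(";").reject(&:empty?).to_h do |pair|
    key, _, value = pair.partition("=")
    [key, value]
  end

  ConnectTarget.new(host: host, port: port.empty? ? 15_002 : Integer(port), params: params)
end

target = parse_connect_url("sc://spark.example.com:443/;token=abc;user_id=alice")
puts target.host
puts target.port
puts target.params.keys.join(", ")
```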

A quick tour

F = SparkConnect::F
T = SparkConnect::Types

# Build a DataFrame from local Ruby data
df = spark.create_data_frame([
  { "name" => "alice", "dept" => "eng", "salary" => 120 },
  { "name" => "bob",   "dept" => "eng", "salary" => 100 },
  { "name" => "carol", "dept" => "ops", "salary" => 110 },
])

# Transform and aggregate
df.where(F.col("salary") >= 105)
  .group_by("dept")
  .agg(F.avg("salary").alias("avg_salary"), F.count("*").alias("headcount"))
  .order_by(F.col("avg_salary").desc)
  .show

# Window functions
w = SparkConnect::Window.partition_by("dept").order_by(F.col("salary").desc)
df.with_column("rank", F.rank.over(w)).show

# Schemas
df.print_schema
df.schema.simple_string  #=> "struct<name:string,dept:string,salary:bigint>"

# SQL with parameters
spark.sql("SELECT * FROM VALUES (1), (2), (3) AS t(x) WHERE x > :min", { min: 1 }).show
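
To predict what the "Transform and aggregate" step should return, the same logic can be written against the local Ruby data. This is a plain-Ruby sketch for checking expectations, not how spark-connect evaluates the query (that happens on the server); it assumes the average comes back as a floating-point value.

```ruby
# Plain-Ruby equivalent of the where/group_by/agg/order_by step,
# useful for predicting the rows before running it on a cluster.
rows = [
  { "name" => "alice", "dept" => "eng", "salary" => 120 },
  { "name" => "bob",   "dept" => "eng", "salary" => 100 },
  { "name" => "carol", "dept" => "ops", "salary" => 110 },
]

result = rows.select { |r| r["salary"] >= 105 }          # where salary >= 105
             .group_by { |r| r["dept"] }                 # group_by("dept")
             .map do |dept, members|
               avg = members.sum { |r| r["salary"] }.to_f / members.size
               { "dept" => dept, "avg_salary" => avg, "headcount" => members.size }
             end
             .sort_by { |r| -r["avg_salary"] }           # order by avg_salary desc

result.each { |r| puts r.values.join("\t") }
```

Only alice and carol pass the filter, so each department ends up with a headcount of 1, and eng sorts first.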

Documentation

Full documentation, including guides for every part of the API, lives at https://hyukjinkwon.github.io/spark-connect-ruby/.

The runnable scripts in examples/ cover quickstart, transformations, aggregations, joins, window functions, SQL, reading/writing, local data, and NA/stat helpers.

Compatibility

The client is generated against the Spark Connect 4.1 protocol and supports Apache Spark 3.5 and above (the Spark Connect wire protocol is backward compatible across these releases).
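
If a script needs to fail fast against an older cluster, the stated minimum can be turned into a simple guard. The sketch below assumes you already have the server's version as a string; supported_spark? and MIN_SPARK_VERSION are hypothetical names that do not come from the gem.

```ruby
require "rubygems" # Gem::Version ships with RubyGems; no extra dependency.

# Hypothetical guard: the 3.5 minimum comes from this README,
# the constant and helper names do not come from the gem.
MIN_SPARK_VERSION = Gem::Version.new("3.5")

def supported_spark?(version_string)
  Gem::Version.new(version_string) >= MIN_SPARK_VERSION
end

%w[3.4.3 3.5.5 4.1.0].each do |v|
  puts "#{v}: #{supported_spark?(v) ? 'supported' : 'too old'}"
end
```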

Development

git clone https://github.com/HyukjinKwon/spark-connect-ruby
cd spark-connect-ruby
bundle install

bundle exec rake spec      # unit specs (no server required)
bundle exec rake rubocop   # lint
bundle exec rake yard      # API docs

# Integration specs against a live server
SPARK_REMOTE=sc://localhost:15002 bundle exec rspec spec/integration

# Regenerate the protobuf/gRPC stubs from the vendored .proto files
bin/generate-protos

See CONTRIBUTING.md.
