Added initial README, source, tests and build scripts.
R Renjin Spark Executor (REX) Library genesis.
onetapbeyond committed Jan 16, 2016
1 parent 5be7eaa commit e80d01b
Showing 18 changed files with 974 additions and 0 deletions.
255 changes: 255 additions & 0 deletions README.md
@@ -0,0 +1,255 @@
# R Renjin Spark Executor (REX) Library

REX is a Scala library offering access to the scientific computing
power of the R programming language to
[Apache Spark](http://spark.apache.org/) batch and streaming
applications on the JVM. This library is built on top of the
[renjin-r-executor](https://github.com/onetapbeyond/renjin-r-executor)
library, a lightweight solution for integrating R analytics executed on
the [Renjin R interpreter](http://www.renjin.org) into any application
running on the JVM.

> IMPORTANT:
> The Renjin R interpreter for statistical computing is currently a
> work-in-progress and is not yet 100% compatible with GNU R. To find which
> CRAN R packages are currently supported by Renjin you can browse or search
> the [Renjin package repository](http://packages.renjin.org/). As Renjin
> compatibility with GNU R continues to improve, REX is ready to deliver
> those improvements directly to Apache Spark batch and streaming
> applications on the JVM. If Renjin does not meet your needs today, I
> recommend checking out
> [ROSE](https://github.com/onetapbeyond/opencpu-spark-executor), an
> alternative library that offers Apache Spark applications access to the
> full scientific computing power of the R programming language today.

### REX Motivation

> Where Apache SparkR lets data scientists use Spark from R, REX is
> designed to let Scala and Java developers use R from Spark.

The popular [Apache SparkR](https://github.com/apache/spark/tree/master/R)
package provides a lightweight front-end for data scientists to use
Apache Spark from R. This approach is ideally suited to
investigative analytics, such as ad-hoc and exploratory analysis at scale.

The REX library brings the same R analytics capabilities enjoyed by
Apache SparkR applications to traditional Spark applications on the JVM.
It does this by exposing new `analyze` operations that execute R
analytics on compatible RDDs. This facility is designed primarily for
operational analytics and can be used alongside Spark core, SQL,
Streaming, MLlib and GraphX.

If you need to query R machine-learning models, score R prediction models or
leverage any other aspect of the R library within your Spark applications on
the JVM then the REX library may be for you.

### REX Examples

A number of example applications are provided to demonstrate the use of the
REX library to deliver R analytics capabilities within any Spark solution.

- Hello, World! [ [Scala](examples/scala/hello-world) ][ [Java](examples/java/hello-world) ]


### REX SBT Dependency

```
// %% appends the Scala binary version suffix (e.g. _2.10) automatically
libraryDependencies += "io.onetapbeyond" %% "renjin-spark-executor" % "1.0"
```

### REX Gradle Dependency

```
compile 'io.onetapbeyond:renjin-spark-executor_2.10:1.0'
```

### REX Spark Package Dependency

Include the REX package in your Spark application using `spark-shell` or
`spark-submit`. For example:

```
$SPARK_HOME/bin/spark-shell --packages io.onetapbeyond:renjin-spark-executor_2.10:1.0
```

### REX Basic Usage

This library exposes new `analyze` transformations on Spark RDDs of type
`RDD[RenjinTask]`. The following sections demonstrate how to use these new
RDD operations to execute R analytics directly within Spark batch and
streaming applications on the JVM.

See the [documentation](https://github.com/onetapbeyond/renjin-r-executor)
on the underlying `renjin-r-executor` library for details on building
`RenjinTask` and handling `RenjinResult`.
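
For orientation, a single `RenjinTask` can also be built and executed
directly on the Renjin interpreter, outside of Spark. The sketch below
follows the `renjin-r-executor` README; the `execute`, `success`,
`output` and `error` calls reflect that library's documented API, but
treat this as an illustrative sketch rather than a normative reference:

```
import io.onetapbeyond.renjin.r.executor._

// Build a task that evaluates a block of R code with two named inputs,
// then execute it in-process on the Renjin interpreter.
val rTask = Renjin.R()
                  .code("rnorm(n, mean)")
                  .input("n", 4)
                  .input("mean", 5)
                  .build()

val rResult = rTask.execute()

// On success the R expression result is available from the RenjinResult.
if (rResult.success()) println("R output: " + rResult.output())
else println("R error: " + rResult.error())
```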

### REX Spark Batch Usage

For this example we assume an input `dataRDD`, which we transform to
generate an RDD of type `RDD[RenjinTask]`. Each `RenjinTask` will
execute a block of task-specific R code when the RDD is eventually
evaluated.

```
import io.onetapbeyond.renjin.spark.executor.R._
import io.onetapbeyond.renjin.r.executor._

// Build one RenjinTask per RDD element; the R code is not executed
// until the RDD is eventually evaluated by Spark.
val rTaskRDD = dataRDD.map(data => {
  Renjin.R()
        .code(rCode)
        .input(data.asInput())
        .build()
})
```
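
For concreteness, `dataRDD` and `rCode` in the snippet above might be
defined as follows. This setup is purely hypothetical and is not part of
the REX API: the shape of the data, the input name `n` and the R code
itself are illustrative assumptions only (`sc` is the `SparkContext`):

```
// Hypothetical inputs: each RDD element is a name -> value map that
// asInput() can pass to R, and the R code references the "n" input.
val dataRDD = sc.parallelize(1 to 100).map(n => Map("n" -> n))
val rCode = "n * 2"
```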

The set of `RenjinTask` within `rTaskRDD` can be scheduled for
processing by calling the new `analyze` operation provided by REX
on the RDD:

```
val rResultRDD = rTaskRDD.analyze
```

When `rTaskRDD.analyze` is evaluated by Spark the resultant `rResultRDD`
is of type `RDD[RenjinResult]`. The result returned by each task-specific
block of R code is available within the corresponding `RenjinResult`.
These values can be cached, further processed or persisted per the needs
of your Spark application.
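
As a sketch, the results might be consumed as follows; the `success`,
`output` and `error` accessors on `RenjinResult` are assumptions based
on the `renjin-r-executor` documentation:

```
// Inspect each RenjinResult on the workers; failed tasks report an
// error message rather than throwing on the driver.
rResultRDD.foreach { result =>
  if (result.success()) println("output: " + result.output())
  else println("failed: " + result.error())
}
```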

Note, the block of task-specific R code can make use of any CRAN R
package function or script that is supported by the Renjin R interpreter.
To find which CRAN R packages are currently supported by Renjin you can
browse or search the [Renjin package repository](http://packages.renjin.org/).

### REX Spark Streaming Usage

For this example we assume an input stream `dataStream`, which we
transform to generate a new stream whose underlying RDDs are of type
`RDD[RenjinTask]`. Each `RenjinTask` will execute a block of
task-specific R code when the stream is eventually evaluated.

```
import io.onetapbeyond.renjin.spark.executor.R._
import io.onetapbeyond.renjin.r.executor._

// Build one RenjinTask per element in each micro-batch; the R code is
// not executed until the stream is eventually evaluated by Spark.
val rTaskStream = dataStream.transform(rdd => {
  rdd.map(data => {
    Renjin.R()
          .code(rCode)
          .input(data.asInput())
          .build()
  })
})
```

The set of `RenjinTask` within `rTaskStream` can be scheduled for processing
by calling the new `analyze` operation provided by REX on each RDD within
the stream:

```
val rResultStream = rTaskStream.transform(rdd => rdd.analyze)
```

When `rTaskStream.transform` is evaluated by Spark the resultant
`rResultStream` has underlying RDDs of type `RDD[RenjinResult]`. The
result returned by each task-specific block of R code is available
within the corresponding `RenjinResult`. These values can be cached,
further processed or persisted per the needs of your Spark application.
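
As a sketch, the result stream might be drained using Spark Streaming's
standard `foreachRDD` output operation; as before, the `success`
accessor on `RenjinResult` is an assumption based on the
`renjin-r-executor` documentation:

```
// For each micro-batch, count failed R tasks on the workers and report
// the count on the driver.
rResultStream.foreachRDD { rdd =>
  val failed = rdd.filter(result => !result.success()).count()
  println("R tasks failed in this batch: " + failed)
}
```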

Note, the block of task-specific R code can make use of any CRAN R
package function or script that is supported by the Renjin R interpreter.
To find which CRAN R packages are currently supported by Renjin you can
browse or search the [Renjin package repository](http://packages.renjin.org/).

### Traditional vs. REX Spark Application Deployment

To understand how REX delivers the scientific computing power of
the R programming language to Spark applications on the JVM, the
following sections compare and contrast the deployment of traditional
Scala, Java, Python and SparkR applications with Spark applications
powered by the REX library.

The sole deployment requirement when working with the REX library is to
add the necessary `Java Archive (.jar)` dependencies to your Spark
application: for REX, for the Renjin interpreter itself and for any
[Renjin-compatible CRAN R packages](http://packages.renjin.org/) that
your R code will use. This is discussed in greater detail in deployment
scenario 3 below.

#### 1. Traditional Scala | Java | Python Spark Application Deployment

![Traditional Deployment: Spark](https://onetapbeyond.github.io/resource/img/rex/trad-spark-deploy.jpg)

Without REX library support, neither data scientists nor application
developers have access to R's analytic capabilities within these types
of application deployments.

#### 2. Traditional SparkR Application Deployment

![Traditional Deployment: SparkR](https://onetapbeyond.github.io/resource/img/rex/trad-sparkr-deploy.jpg)

While data scientists can leverage the computing power of Spark within R
applications in these types of application deployments, these same R
capabilities are not available to Scala, Java or Python developers.

Note, when working with Apache SparkR, the R runtime environment must be
installed locally on each worker node on your cluster.


#### 3. Scala | Java + R (REX) Spark Application Deployment

![New Deployment: renjin-spark-executor](https://onetapbeyond.github.io/resource/img/rex/new-rex-deploy.jpg)

Both data scientists and application developers working in either Scala or
Java can leverage the power of R using the REX library within these
types of application deployments.

As REX, the Renjin R interpreter and all
[Renjin-compatible CRAN R packages](http://packages.renjin.org/) are
native JVM libraries, these dependencies are made available as standard
`JAR` artifacts, either for direct download or for inclusion as managed
dependencies from a Maven repository.

For example, the basic Maven artifact dependency declarations for a
REX-powered Spark batch application using the `sbt` build tool look
as follows:

```
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "version" % "provided",
  "io.onetapbeyond" %% "renjin-spark-executor" % "version",
  "org.renjin" % "renjin-script-engine" % "version"
)
```

As a further example, the Maven artifact dependencies for a REX-powered
Spark streaming application that depends on the Renjin-compatible CRAN
R `survey` package using the `sbt` build tool look as follows:

```
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "version" % "provided",
  "io.onetapbeyond" %% "renjin-spark-executor" % "version",
  "org.renjin" % "renjin-script-engine" % "version",
  "org.renjin.cran" % "survey" % "version"
)
```

All Renjin artifacts are maintained within a Maven repository
managed by [BeDataDriven](http://www.bedatadriven.com), the creators
of the Renjin interpreter. To use these artifacts you must identify
the `BeDataDriven` Maven repository to your build tool. For example,
when using `sbt` the required `resolver` is as follows:

```
resolvers +=
  "BeDataDriven" at "https://nexus.bedatadriven.com/content/groups/public"
```

Note, as REX is powered by Renjin there is *no need* to install the R
runtime environment locally on each worker node on your cluster. This
means REX works out-of-the-box with all new or existing Spark clusters.

### License

See the [LICENSE](LICENSE) file for license rights and limitations (Apache License 2.0).
17 changes: 17 additions & 0 deletions build.sbt
@@ -0,0 +1,17 @@
resolvers +=
  "BeDataDriven" at "https://nexus.bedatadriven.com/content/groups/public"

lazy val root = (project in file(".")).
  settings(
    name := "renjin-spark-executor",
    organization := "io.onetapbeyond",
    version := "1.0",
    scalaVersion := "2.10.6",
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided",
      "org.renjin" % "renjin-script-engine" % "0.8.1890" % "provided",
      "io.onetapbeyond" % "renjin-r-executor" % "1.2",
      "org.scalatest" % "scalatest_2.10" % "2.2.4" % "test"
    ),
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
  )
14 changes: 14 additions & 0 deletions examples/java/hello-world/.gitignore
@@ -0,0 +1,14 @@
.gradle
build/
gradle.properties

# Ignore Gradle GUI config
gradle-app.setting

# Avoid ignoring Gradle wrapper jar file (.jar files are usually ignored)
!gradle-wrapper.jar

*.class

# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
hs_err_pid*
36 changes: 36 additions & 0 deletions examples/java/hello-world/README.md
@@ -0,0 +1,36 @@
### Hello, World!

The canonical "Hello, World!" example application that demonstrates
the basic usage of REX to deliver R analytics capabilities within any
Spark solution.

#### Source

Check out the example source code provided and you'll see that REX
integrates seamlessly within any traditional Spark application. The source
code also provides extensive comments that help guide you through
the integration.

#### Build

Run the following [Gradle](http://gradle.org/) command within
the `hello-world` directory to build a `fatJar` for the example application
that can then be deployed directly to your Spark cluster.

```
gradlew clean shadowJar
```

The generated `fatJar` can be found in the `build/libs` directory.

#### Launch

The simplest way to launch the example application is to use the
[spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html)
shell script provided as part of the Spark distribution.

The submit command you need should look something like this:

```
spark-submit --class io.onetapbeyond.renjin.spark.executor.examples.HelloWorld --master local[*] /path/to/fat/jar/hello-world-[version]-all.jar
```
53 changes: 53 additions & 0 deletions examples/java/hello-world/build.gradle
@@ -0,0 +1,53 @@
buildscript {
  repositories { jcenter() }
  dependencies {
    classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.2'
  }
}

apply plugin: 'java'
apply plugin: 'com.github.johnrengelman.shadow'

sourceCompatibility = 1.8
targetCompatibility = 1.8

description = """
Hello, World!
"""

group = "io.onetapbeyond"
archivesBaseName = "hello-world"
version = "1.0"

repositories {
  jcenter()
  mavenCentral()
  maven {
    url "https://nexus.bedatadriven.com/content/groups/public"
  }
}

dependencies {
  compile 'org.apache.spark:spark-core_2.10:1.6.0'
  compile 'io.onetapbeyond:renjin-spark-executor_2.10:1.0'
  compile 'org.renjin:renjin-script-engine:0.8.1890'
}

jar {
  manifest {
    attributes("Implementation-Title": archivesBaseName,
               "Implementation-Version": version)
  }
}

shadowJar {
  manifest {
    attributes 'Main-Class': 'io.onetapbeyond.renjin.spark.executor.examples.HelloWorld'
  }
}

configurations {
  // Exclude transitive|provided dependencies from shadowJar
  runtime.exclude module: 'spark-core_2.10'
}
Binary file not shown.
@@ -0,0 +1,6 @@
#Wed Dec 09 09:36:38 PET 2015
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-2.9-bin.zip