sparklyudf demonstrates how to build a sparklyr extension package with custom Scala code that is compiled, deployed to Apache Spark, and registered as a UDF that can be used from SQL and dplyr.
First build this package, then build its Spark 2.0 JARs by running:

```r
spec <- sparklyr::spark_default_compilation_spec()
spec <- Filter(function(e) e$spark_version >= "2.0.0", spec)
sparklyr::compile_package_jars(spec = spec)
```

Then build the R package as usual.
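For the compiled JAR to reach the cluster, a sparklyr extension also declares its dependencies and registers itself on load. A minimal sketch of the standard pattern follows; the exact JAR file name under `inst/java/` is an assumption and may differ in this package:

```r
# R/dependencies.R -- standard sparklyr extension boilerplate (sketch).
# The JAR naming scheme below is an assumed convention, not taken from
# this package's sources.
spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    jars = c(
      system.file(
        sprintf("java/sparklyudf-%s-%s.jar", spark_version, scala_version),
        package = "sparklyudf"
      )
    )
  )
}

.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}
```

With this in place, `spark_connect()` attaches the package's JAR to new connections automatically.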
This package contains a Scala-based UDF defined as:

```scala
import org.apache.spark.sql.SparkSession

object Main {
  def register_hello(spark: SparkSession) = {
    spark.udf.register("hello", (name: String) => {
      "Hello, " + name + "! - From Scala"
    })
  }
}
```
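On the R side, registration can reach this Scala object through sparklyr's invoke API. A sketch of what a wrapper like `sparklyudf_register()` might look like (the fully qualified class name is an assumption about how the JAR was compiled):

```r
# Sketch: invoke Main.register_hello() on the JVM through sparklyr
sparklyudf_register <- function(sc) {
  sparklyr::invoke_static(
    sc,
    "Main",                       # Scala object compiled into the package JAR
    "register_hello",
    sparklyr::spark_session(sc)   # pass the connection's SparkSession
  )
}
```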
Connect and test this package as follows:

```r
library(sparklyudf)
library(sparklyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
```
```r
sc <- spark_connect(master = "local")
## * Using Spark: 2.1.0
sparklyudf_register(sc)
## <jobj[13]>
##   class org.apache.spark.sql.expressions.UserDefinedFunction
##   UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
```
Now the Scala UDF `hello()` is registered and can be used as follows:

```r
data.frame(name = "Javier") %>%
  copy_to(sc, .) %>%
  mutate(hello = hello(name))
## # Source:   lazy query [?? x 2]
## # Database: spark_connection
##   name   hello
##   <chr>  <chr>
## 1 Javier Hello, Javier! - From Scala
```
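Because the UDF is registered with the Spark session itself, it can also be called from SQL. A sketch using DBI (the table name `people` is chosen here for illustration):

```r
# Sketch: call the registered UDF from Spark SQL via DBI
library(DBI)

copy_to(sc, data.frame(name = "Javier"), name = "people")
dbGetQuery(sc, "SELECT name, hello(name) AS hello FROM people")
```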
```r
spark_disconnect_all()
## [1] 1
```