# SparkR

SparkR provides an R interface to Apache Spark. For background on its design, see the [SIGMOD 2016 paper](https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf).

In [5]:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))


Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union



Launching java with spark-submit command /etc/spark/bin/spark-submit   --driver-memory "2g" sparkr-shell /tmp/Rtmp1KfiOV/backend_port41f961d00307 


Java ref type org.apache.spark.sql.SparkSession id 1 

In [6]:
library(ggplot2)

In [None]:
df1 <- as.DataFrame(faithful)

In [None]:
system.time(printSchema(df1))

In [None]:
df <- read.df(
    "small_data/stackexchange/responses.csv", "csv",
    header = "true", inferSchema = "true", na.strings = "NA"
)

In [None]:
printSchema(df)

In [None]:
head(df)

In [None]:
head(select(df, "self_identification"))

In [None]:
head(select(filter(df, df$collector != "Facebook"), df$self_identification))

In [None]:
?summary

In [None]:
summary(df)

In [None]:
collect(summary(df))

In [None]:
collect(select(summary(df), "summary", "salary_midpoint"))

### Aggregation functions

* avg
* min
* max
* sum
* countDistinct
* sumDistinct
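
As a minimal sketch of how several of these functions can be combined in a single `select`, the example below uses R's built-in `faithful` dataset rather than the survey data. The column name `waiting` comes from that dataset, and the output aliases are illustrative assumptions. It also assumes SparkR is on the library path and a local Spark installation is available.

```r
library(SparkR)
sparkR.session(master = "local[*]")  # returns the existing session if one is running

# Assumption: the built-in faithful dataset stands in for the survey data
faithful_df <- as.DataFrame(faithful)

# Each aggregate is a Column expression; alias() names the output column
stats <- collect(select(
    faithful_df,
    alias(avg(faithful_df$waiting), "avg_waiting"),
    alias(min(faithful_df$waiting), "min_waiting"),
    alias(max(faithful_df$waiting), "max_waiting"),
    alias(countDistinct(faithful_df$waiting), "distinct_waiting")
))
stats
```

Because no grouping is applied, the result collapses to a single row with one column per aggregate.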

In [None]:
collect(select(df, avg(df$salary_midpoint)))

### Ordering

In [None]:
head(arrange(select(df, "country", "age_range", "salary_range"), desc(df$salary_midpoint)))

### Filtering

In [None]:
sals <- select(df, "salary_midpoint", "age_range")

In [None]:
head(sals)

In [None]:
sals <- filter(sals, "salary_midpoint > 0 and age_range != 'NA'")

In [None]:
head(sals)
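
The string predicate above is SQL syntax. The same kind of filter can also be written with Column expressions, as in this sketch. It uses the built-in `faithful` dataset so that it can run on its own; the thresholds and variable names are illustrative assumptions, not part of the survey analysis.

```r
library(SparkR)
sparkR.session(master = "local[*]")  # returns the existing session if one is running

# Assumption: the built-in faithful dataset stands in for the survey data
faithful_df <- as.DataFrame(faithful)

# SQL-style predicate passed as a string
by_string <- filter(faithful_df, "waiting > 70 and eruptions < 4")

# The same predicate built from Column objects; note & rather than 'and'
by_column <- filter(
    faithful_df,
    faithful_df$waiting > 70 & faithful_df$eruptions < 4
)

# Both forms should select the same number of rows
c(string = nrow(by_string), column = nrow(by_column))
```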

### Grouping

Combine `groupBy` with `agg` (or its alias `summarize`) to compute aggregates per group.

In [None]:
age_groups <- agg(
    groupBy(sals, "age_range"), 
    number = n(sals$salary_midpoint),
    avg_sal = avg(sals$salary_midpoint), 
    max_sal = max(sals$salary_midpoint),
    min_sal = min(sals$salary_midpoint)
)

In [None]:
age_df <- collect(age_groups)
age_df

In [None]:
sorted_age_df <- collect(arrange(age_groups, asc(age_groups$age_range)))
sorted_age_df

In [None]:
str(age_groups)

In [None]:
?factor

In [None]:
ages_vec <- sort(unique(collect(sals)$age_range))
ages_vec

In [None]:
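# Note: factor() is a base R function that expects a local vector.
# sals$age_range is a SparkR Column, so this call is expected to fail;
# the factor conversion is applied to the collected data.frame below.
sals$age_range <- factor(
    x=sals$age_range,
    levels=ages_vec
)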

In [None]:
str(sals)

In [None]:
str(sorted_age_df)

In [None]:
sorted_age_df$age_range <- factor(
    sorted_age_df$age_range,
    ages_vec
)
str(sorted_age_df)

In [None]:
resorted_age_df <- sorted_age_df[order(sorted_age_df$age_range),]
resorted_age_df
# of course, if we were using dplyr we could use the same "arrange" syntax...
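
The role of the factor levels in the sort can be seen in isolation with plain R. This is a small sketch with hypothetical age-range labels (the real labels in the survey data may differ): `order()` on a factor follows the level order rather than the alphabetical order of the strings.

```r
# Hypothetical age-range labels, deliberately shuffled
ages <- c("25-29", "< 20", "20-24", "> 60", "30-34")

# The order we want, stated explicitly as factor levels
levels_in_order <- c("< 20", "20-24", "25-29", "30-34", "> 60")
age_factor <- factor(ages, levels = levels_in_order)

# order() sorts by the position of each value in the level list
sorted <- ages[order(age_factor)]
sorted
```

Sorting the character vector directly would depend on the locale's collation rules, which is why the levels are spelled out here.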

In [None]:
# head(x, -1) drops the last row before plotting
salary_plot <- ggplot(data = head(resorted_age_df, -1), aes(x = age_range, y = avg_sal, group = 1))
salary_plot + geom_line() + geom_point() + ylab("Average salary") + xlab("Age range")

## Selecting using R and SQL

In [None]:
head(select(sals, sals$salary_midpoint / 1000))

In [None]:
head(selectExpr(sals, "(salary_midpoint / 1000) as Salary_K"))

In [None]:
createOrReplaceTempView(df, "data")

In [None]:
highpaid <- sql("select occupation, star_wars_vs_star_trek from data where salary_midpoint > 200000 and star_wars_vs_star_trek != 'NA'")

In [None]:
head(highpaid)

In [None]:
head(subset(df, df$salary_midpoint > 200000, c("occupation", "age_range")))

Some other familiar operations to try:

* nrow, ncol
* rbind, cbind

In [None]:
age_plot <- ggplot(data=collect(df), aes(x=factor(age_range)))

*Question:* Why do we need to collect?

In [None]:
age_plot + geom_bar() + xlab("Age")
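
Collecting the whole dataset is fine for a small file, but it may not scale. One alternative, sketched here under the assumption that only per-group counts are needed, is to aggregate in Spark and collect just the summary. The example uses the built-in `mtcars` dataset so it can run on its own; `cars_df` and `cyl_counts` are illustrative names.

```r
library(SparkR)
library(ggplot2)
sparkR.session(master = "local[*]")  # returns the existing session if one is running

# Assumption: the built-in mtcars dataset stands in for a large table
cars_df <- as.DataFrame(mtcars)

# Count rows per group inside Spark; only the small summary is collected
cyl_counts <- collect(count(groupBy(cars_df, "cyl")))

# cyl_counts is an ordinary data.frame with columns cyl and count
ggplot(cyl_counts, aes(x = factor(cyl), y = count)) +
    geom_bar(stat = "identity") +
    xlab("Cylinders")
```

The bar heights come from the precomputed `count` column, so `stat = "identity"` is used instead of letting `geom_bar()` count rows itself.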

## Modeling

In [7]:
titanic <- as.data.frame(Titanic)

In [8]:
titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
nbDF <- titanicDF
nbTestDF <- titanicDF

In [13]:
head(titanicDF)

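| Class | Sex | Age | Survived |
|-------|--------|-------|----------|
| 3rd | Male | Child | No |
| 3rd | Female | Child | No |
| 1st | Male | Adult | No |
| 2nd | Male | Adult | No |
| 3rd | Male | Adult | No |
| Crew | Male | Adult | No |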


In [14]:
nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)

In [15]:
summary(nbModel)

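| Yes | No |
|-----------|-----------|
| 0.5769231 | 0.4230769 |

|     | Class_3rd | Class_1st | Class_2nd | Sex_Male | Age_Adult |
|-----|-----------|-----------|-----------|----------|-----------|
| Yes | 0.3125 | 0.3125 | 0.3125 | 0.5 | 0.5625 |
| No | 0.4166667 | 0.25 | 0.25 | 0.5 | 0.75 |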


In [16]:
nbPredictions <- predict(nbModel, nbTestDF)
showDF(nbPredictions)

+-----+------+-----+--------+-----+--------------------+--------------------+----------+
|Class|   Sex|  Age|Survived|label|       rawPrediction|         probability|prediction|
+-----+------+-----+--------+-----+--------------------+--------------------+----------+
|  3rd|  Male|Child|      No|  1.0|[-3.9824097993521...|[0.60062402496099...|       Yes|
|  3rd|Female|Child|      No|  1.0|[-3.9824097993521...|[0.60062402496099...|       Yes|
|  1st|  Male|Adult|      No|  1.0|[-3.7310953710712...|[0.58003280993672...|       Yes|
|  2nd|  Male|Adult|      No|  1.0|[-3.7310953710712...|[0.58003280993672...|       Yes|
|  3rd|  Male|Adult|      No|  1.0|[-3.7310953710712...|[0.39192399049881...|        No|
| Crew|  Male|Adult|      No|  1.0|[-2.9426380107070...|[0.50318824507901...|       Yes|
|  1st|Female|Adult|      No|  1.0|[-3.7310953710712...|[0.58003280993672...|       Yes|
|  2nd|Female|Adult|      No|  1.0|[-3.7310953710712...|[0.58003280993672...|       Yes|
|  3rd|Female|Adult| 
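
One way to summarize predictions like these is to compare the `prediction` column with the observed `Survived` label. The sketch below repeats the model fit so that it can run on its own; `local_preds` and `accuracy` are illustrative names. Two caveats apply: the model is scored on the same rows it was trained on, so the figure is likely optimistic, and because the `Freq` column is dropped, each combination of attributes counts once regardless of how many passengers it represents.

```r
library(SparkR)
sparkR.session(master = "local[*]")  # returns the existing session if one is running

titanic <- as.data.frame(Titanic)
titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])

nbModel <- spark.naiveBayes(titanicDF, Survived ~ Class + Sex + Age)
nbPredictions <- predict(nbModel, titanicDF)

# Bring only the two columns needed for the comparison back to R
local_preds <- collect(select(nbPredictions, "Survived", "prediction"))

# Fraction of rows where the predicted label matches the observed one
accuracy <- mean(local_preds$prediction == local_preds$Survived)
accuracy
```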

More examples can be found on [GitHub](https://github.com/apache/spark/blob/master/examples/src/main/r/ml.R).

*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*