# **SparkR**: The Apache Spark R API

## 1. Introduction

This notebook shows how to connect Jupyter notebooks to a Spark cluster to process data using Spark R API.

It works on this [Docker Cluster](https://github.com/datainsightat/bigdata_development_environment.git)

## 2. The Spark Cluster

### 2.1. Connection

To connect to the Spark cluster, create a SparkSession object with the following params:

+ **appName:** application name displayed at the [Spark Master Web UI](http://localhost:8080/);
+ **master:** Spark Master URL, same used by Spark Workers;
+ **spark.executor.memory:** must be less than or equals to docker compose SPARK_WORKER_MEMORY config.

In [1]:
library(SparkR);

sparkR.session(appName="sparkr-notebook", master="spark://spark:7077", sparkConfig=list(spark.executor.memory="512m"))


Attaching package: ‘SparkR’


The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window


The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union


Spark package found in SPARK_HOME: /opt/spark



Launching java with spark-submit command /opt/spark/bin/spark-submit   sparkr-shell /tmp/RtmpTC4hKS/backend_port38339d9c94a 


Java ref type org.apache.spark.sql.SparkSession id 1 

More confs for SparkSession object in standalone mode can be added using the **sparkConfig** param. Checkout the API docs [here](https://spark.apache.org/docs/latest/api/R/sparkR.session.html).

## 3. The Data

### 3.1. Introduction

We will be using Spark R API to read, process and write data. Checkout the API docs [here](https://spark.apache.org/docs/latest/api/R/index.html).

### 3.2. Read

Let's read some UK's macroeconomic data ([source](https://www.kaggle.com/bank-of-england/a-millennium-of-macroeconomic-data)) from the cluster's simulated **Hadoop distributed file system (HDFS)** into a Spark dataframe.

In [4]:
#data <- read.df("data/uk-macroeconomic-data.csv", source="csv", header="true")
data <- read.df("hdfs://hive:54310/examples/bank_prospects.csv", source="csv", header="true")

In [9]:
write.df(data,"hdfs://hive:54310/examples/bank_prospects_test4.csv")

ERROR: Error in saveDF.df(data, "hdfs://hive:54310/examples/bank_prospects_test4.csv"): could not find function "saveDF.df"


Let's then display some dataframe metadata, such as the number of rows and cols and its schema (cols name and type).

In [5]:
count(data)

In [4]:
length(columns(data))

In [5]:
printSchema(data)

root
 |-- Age: string (nullable = true)
 |-- Salary: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Purchased: string (nullable = true)


### 3.3. Process

In this example, we will get UK's population and unemployment rate thoughtout the years. Let's start by selecting the relevant columns.

In [6]:
unemployment <- select(data, "Description", "Population (GB+NI)", "Unemployment rate")

In [7]:
head(unemployment, n=10)

Unnamed: 0_level_0,Description,Population (GB+NI),Unemployment rate
Unnamed: 0_level_1,<chr>,<chr>,<chr>
1,Units,000s,%
2,1209,,
3,1210,,
4,1211,,
5,1212,,
6,1213,,
7,1214,,
8,1215,,
9,1216,,
10,1217,,


We successfully selected the desired columns but two problems were found:
+ The first line contains no data but the unit of measurement of each column;
+ There are many years with missing population and unemployment data.

Let's then remove the first line.

In [8]:
cols_description <- filter(unemployment, unemployment$Description == "Units")

In [9]:
head(cols_description)

Unnamed: 0_level_0,Description,Population (GB+NI),Unemployment rate
Unnamed: 0_level_1,<chr>,<chr>,<chr>
1,Units,000s,%


In [10]:
unemployment <- join(unemployment, cols_description, joinExpr = unemployment$Description == cols_description$Description, joinType="left_anti")

In [11]:
head(unemployment, n=10)

Unnamed: 0_level_0,Description,Population (GB+NI),Unemployment rate
Unnamed: 0_level_1,<chr>,<chr>,<chr>
1,1209,,
2,1210,,
3,1211,,
4,1212,,
5,1213,,
6,1214,,
7,1215,,
8,1216,,
9,1217,,
10,1218,,


Nice! Now, let's drop the dataframe rows with missing data and refactor its columns names.

In [12]:
unemployment <- dropna(unemployment)

In [13]:
unemployment <- withColumnRenamed(unemployment, "Description", "year")
unemployment <- withColumnRenamed(unemployment, "Population (GB+NI)", "population")
unemployment <- withColumnRenamed(unemployment, "Unemployment rate", "unemployment_rate")

In [14]:
head(unemployment, n=10)

Unnamed: 0_level_0,year,population,unemployment_rate
Unnamed: 0_level_1,<chr>,<chr>,<chr>
1,1855,23241,3.73
2,1856,23466,3.52
3,1857,23689,3.95
4,1858,23914,5.23
5,1859,24138,3.27
6,1860,24360,2.94
7,1861,24585,3.72
8,1862,24862,4.68
9,1863,25142,4.15
10,1864,25425,2.99


### 3.4. Write

Lastly, we persist the unemployment data into the cluster's simulated **HDFS**.

In [15]:
unemployment <- repartition(unemployment, numPartitions=1)
write.df(unemployment, path="data/uk-macroeconomic-unemployment-data.csv", source="csv", sep=",", header="true", mode="overwrite")