# Introduction to SparkR Exercise Notebook

Hello and welcome to your Hands-on Lab notebook. In this notebook, you will find code examples and exercises for you to practice SparkR and understand its functionalities.

<div class="alert alert-block alert-info" style="margin-top: 20px">If you did not complete the Jupyter Notebooks tutorials, it is highly recommended that you do so before using this notebook. You can find them on your **welcome page**.</div>

## Table of Contents

[Hands-on Lab 1: Getting Started with SparkR](#hl1)
  - [Initializing the SQL Context](#sql)
  - [Importing Data to SparkR](#import)
  - [Regarding Data Frames](#tables)
  - [Registering Data Frames as tempTables](#temp)
  - [Useful functions: head and printSchema](#use1)

[Hands-on Lab 2: Data Manipulation in SparkR](#hl2)
  - [Selecting Columns](#selcol)
  - [Filtering by Conditions](#conditions)
  - [Grouping by Average, Sum and Count](#groupby)
  - [Operating on Columns](#oper)
  - [Utilizing SQL Queries in SparkR](#queries)
  
[Hands-on Lab 3: Linear Models in SparkR](#hl3)
  - [Creating a Gaussian Regression Model](#gauss)
  - [Creating a Binomial Regression Model](#binomial)
  


# Hands-on Lab 1: Getting Started with SparkR <a id="hl1"></a>

In this Hands-on Lab, we will go over the basic syntax and functionality of SparkR. To begin using SparkR, you would normally first need to install and load its libraries. Thankfully, Jupyter Notebooks already has SparkR installed and configured, since it uses the IRKernel. The only thing we need to do beforehand is initialize the **SQL Context for SparkR**.

<div class="alert alert-block alert-info" style="margin-top: 20px">**NOTE:** If you are running this code outside Data Scientist Workbench, for example in your own instance of Jupyter Notebooks, you will probably need to initialize the **R environment** first. To do so, execute the following line of code. If you are running from the SparkR shell, you do not need to execute it.</div>

In [1]:
#Initialize the R environment
#Only needed if you run this outside Data Scientist Workbench or a SparkR shell
sc <- sparkR.init()

Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context


### Initializing the SQL Context <a id="sql"></a>
`sqlContext` enables SparkR to read and manipulate structured data, so you should create it whenever you use SparkR. You can do this by executing the code in the cell below.

In [2]:
#Executing this creates a SQL context using SparkR context
#Make the name of the variable something you can remember, as you'll need the SQL context for most functions
sqlContext <- sparkRSQL.init(sc)

### Importing Data to SparkR <a id="import"></a>
Now that we have initialized the SQL Context, we need data to load up into SparkR. For the purposes of this notebook, examples will be provided utilizing the `mtcars` dataset provided by R and the exercises will utilize `iris`, another local R dataset.

If you want to use one of your datasets, feel free to drag it to Jupyter Notebooks' data tab. You can get a link to it by selecting it in the Recent Data tab and then clicking `Insert Path`. Use this path in whatever function is adequate for the format of your file.

### Regarding Data Frames <a id="tables"></a>
Data Frames are SparkR's primary data structure, used for storing, manipulating, and organizing data. There are a few ways to create a Data Frame in SparkR. You can use the `createDataFrame` function if there is already a local **R** data frame in place; you can use `read.df` if your file is in a format natively readable by SparkR (such as a correctly structured JSON file or Parquet); or you can look through <a href="http://spark-packages.org/">Spark Packages</a> to see if there are any packages made for reading your file.

To read and create data from our `mtcars` dataset, we use the `createDataFrame` function, like so:

In [3]:
#Create a data frame called "cars" using R's native dataset "mtcars"
cars <- createDataFrame(sqlContext, mtcars)

Now, you do the same for the `iris` dataset:

In [4]:
#Create a data frame called "flowers" using R's native dataset "iris"
#type your code here
flowers <- createDataFrame(sqlContext, iris)

In FUN(X[[i]], ...): Use Petal_Width instead of Petal.Width  as column name

<div class="alert alert-block alert-info" style="margin-top: 20px">You might receive some warning messages regarding the `iris` dataset. This is because its column names do not comply with the naming guidelines. For the purposes of this notebook, you can ignore them.</div>
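
The renaming mentioned in the warning can be previewed locally. The sketch below is base R only and does not need a Spark context; the variable names are illustrative. It assumes the rename is a plain replacement of each `.` with `_`, which is consistent with the column names in the warning message and in the schema printed later in this Lab.

```r
# Local illustration only: mimic the rename reported in the warning
# by swapping each "." in the iris column names for "_".
original_names <- names(iris)
spark_style_names <- gsub(".", "_", original_names, fixed = TRUE)
print(spark_style_names)
```

These underscore names are the ones to use with the `$` operator on the `flowers` data frame, for example `flowers$Petal_Length`.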

### Registering Data Frames as tempTables <a id="temp"></a>
One of SparkR's distinctive features is the ability to perform SQL queries on Data Frames. To do so, you register the Data Frame as a temporary SQL table (the so-called `tempTable`) in Spark. We will go over performing SQL queries in the next Lab.

For now, to register a temporary SQL table, we use the following function:

In [5]:
#Create a temporary SQL table called "cars" using our SparkR data frame "cars"
registerTempTable(cars,"cars")

Now, do the same for the `flowers` data frame:

In [6]:
#Create a temporary SQL table called "flowers" using our SparkR data frame "flowers"
#type your code here

registerTempTable(flowers,"flowers")

### Useful functions: head and printSchema <a id="use1"></a>
Now that you have your structured data ready for SparkR, you can take a look over your data with some handy functions. The datasets you use might be very large, and as such, printing the entire data frame might be a little too messy. In this case, you can use the `head` function to take a look at only the first six rows, like so:

In [7]:
#Look at the first six rows of our "cars" SparkR data frame
#You need the SparkR:: prefix due to R already having a head function
SparkR::head(cars)

   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can also take a look at the typing scheme of your data frame using the `printSchema` function:

In [8]:
#Look at the schema for our SparkR data frame "cars"
printSchema(cars)

root
 |-- mpg: double (nullable = true)
 |-- cyl: double (nullable = true)
 |-- disp: double (nullable = true)
 |-- hp: double (nullable = true)
 |-- drat: double (nullable = true)
 |-- wt: double (nullable = true)
 |-- qsec: double (nullable = true)
 |-- vs: double (nullable = true)
 |-- am: double (nullable = true)
 |-- gear: double (nullable = true)
 |-- carb: double (nullable = true)


Now, try doing the same for your `flowers` data frame!

In [9]:
#Look at the first six rows of our "flowers" SparkR data frame
#type your code here
SparkR::head(flowers)

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In [10]:
#Look at the schema for our SparkR data frame "flowers"
#type your code here
printSchema(flowers)

root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = true)
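
To relate the two schemas to the local R data, it can help to check how the original data frames store each column. This is a base R sketch that does not touch Spark; the variable names are illustrative. Comparing its result with the schemas above suggests that local numeric columns show up as `double` and that the `Species` factor shows up as `string`.

```r
# Base R only: inspect how the local data frames store each column.
cars_classes <- vapply(mtcars, function(col) class(col)[1], character(1))
flowers_classes <- vapply(iris, function(col) class(col)[1], character(1))

print(cars_classes)     # every mtcars column is "numeric"
print(flowers_classes)  # four "numeric" columns and one "factor" (Species)
```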


<div class="alert alert-block alert-info" style="margin-top: 20px">That wraps up our Hands-on Lab 1. Remember that if you want any more information on SparkR, you can visit the SparkR documentation page at https://spark.apache.org/docs/latest/sparkr.html</div>

# Hands-on Lab 2: Data Manipulation in SparkR <a id="hl2"></a>
In this Hands-on Lab, we will go over methods to **select, filter, and manipulate Data Frames**. The manipulation of Data Frames is the cornerstone of SparkR, and as such, it's a good thing to practice whenever you can.

<div class="alert alert-block alert-info" style="margin-top: 20px">This Hands-on Lab assumes you have already created the SQL Context and loaded the `mtcars` and `iris` data frames from R. If for some reason you have not done so yet, execute the code block below.</div>

In [11]:
#Initiate our SQL Context
sqlContext <- sparkRSQL.init(sc)
#Create a Data Frame called "cars" from the mtcars dataset
cars <- createDataFrame(sqlContext,mtcars)
#Create a Data Frame called "flowers" from the iris dataset
flowers <- createDataFrame(sqlContext,iris)

In FUN(X[[i]], ...): Use Petal_Width instead of Petal.Width  as column name

### Selecting Columns <a id="selcol"></a>
All of our data frames are organized into **rows and columns**, much like a data table. Most of the time, we want to retrieve values from a specific column. To do this, we use the `select` function, like so:

In [12]:
#Select from the "cars" data frame the "mpg" column
#select(cars,cars$mpg)
#Select the first six rows of the "mpg" column from the "cars" data frame
#Remember that you have to add the SparkR:: prefix to head since R already has an incompatible head function
SparkR::head(select(cars,cars$mpg))

   mpg
1 21.0
2 21.0
3 22.8
4 21.4
5 18.7
6 18.1

Now, do the same for the `flowers` data frame!

In [16]:
#Select the first six rows of the "Petal_Length" column from the "flowers" data frame
#type your code here
SparkR::head(select(flowers,flowers$Petal_Length))

  Petal_Length
1          1.4
2          1.4
3          1.3
4          1.5
5          1.4
6          1.7

### Filtering by Conditions <a id="conditions"></a>
You can also **filter rows by imposing conditions** on given columns. This is critical to know, because you will often want to subset your data frame according to a certain condition. For this, you use the `filter` function.

In [None]:
#Select the first six rows of "cars" that have a value under 20 in the "mpg" column
#We have to use the SparkR:: prefix since R already has a conflicting filter function
SparkR::head(SparkR::filter(cars, cars$mpg < 20))
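
No output is recorded for the cell above, so a local cross-check may be useful. The sketch below applies the same condition to the local `mtcars` data frame in base R, without Spark; `low_mpg` is an illustrative name. `utils::head` is written with its namespace so that the call does not depend on which `head` is currently masked. Note that the local result keeps the car names as row names, while SparkR output numbers its rows.

```r
# Base R only: the same condition applied to the local mtcars data frame.
low_mpg <- mtcars[mtcars$mpg < 20, ]

print(utils::head(low_mpg))  # first six cars with mpg under 20
print(nrow(low_mpg))         # how many cars satisfy the condition
```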

Now, try doing the same for the `flowers` data frame.

In [17]:
#Select the first six rows of "flowers" that have a value under 1.4 in the "Petal_Length" column
#type your code here
SparkR::head(SparkR::filter(flowers, flowers$Petal_Length < 1.4))

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          4.7         3.2          1.3         0.2  setosa
2          4.3         3.0          1.1         0.1  setosa
3          5.8         4.0          1.2         0.2  setosa
4          5.4         3.9          1.3         0.4  setosa
5          4.6         3.6          1.0         0.2  setosa
6          5.0         3.2          1.2         0.2  setosa

### Grouping by Average, Sum and Count <a id="groupby"></a>
Another useful operation is **grouping your data frame rows by a column and computing an aggregate such as the average, sum, or count for each group**. This enables you to build a histogram or generate other summary information with ease. This is done with the `summarize` and `groupBy` functions.

In [18]:
#Group "cars" by its "mpg" column and count the occurrences of each "mpg" value in the dataset,
#then take the first six rows of the result
SparkR::head(summarize(groupBy(cars, cars$mpg), count = n(cars$mpg)))
#Group "cars" by its "mpg" column and sum the "mpg" values within each group,
#then take the first six rows of the result
SparkR::head(summarize(groupBy(cars, cars$mpg), sum = sum(cars$mpg)))
#Group "cars" by its "mpg" column and average the "hp" column over the rows that share each "mpg" value,
#then take the first six rows of the result
SparkR::head(summarize(groupBy(cars, cars$mpg), average = avg(cars$hp)))

   mpg count
1 21.0     2
2 33.9     1
3 19.2     2
4 32.4     1
5 15.0     1
6 21.4     2

   mpg  sum
1 21.0 42.0
2 33.9 33.9
3 19.2 38.4
4 32.4 32.4
5 15.0 15.0
6 21.4 42.8

   mpg average
1 21.0   110.0
2 33.9    65.0
3 19.2   149.0
4 32.4    66.0
5 15.0   335.0
6 21.4   109.5
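
The three aggregations can be cross-checked against the local `mtcars` data. This is a base R sketch using `tapply`, not SparkR, and the variable names are illustrative. Because the order of grouped rows may differ between runs, the sketch looks groups up by their `mpg` value instead of relying on row position.

```r
# Base R only: recompute the three aggregations on the local mtcars data.
mpg_count <- tapply(mtcars$mpg, mtcars$mpg, length)  # rows per mpg value
mpg_sum   <- tapply(mtcars$mpg, mtcars$mpg, sum)     # sum of mpg per mpg value
hp_avg    <- tapply(mtcars$hp,  mtcars$mpg, mean)    # mean hp per mpg value

# Look up three of the groups that appear in the SparkR output.
groups_to_check <- c("21", "19.2", "21.4")
print(mpg_count[groups_to_check])
print(mpg_sum[groups_to_check])
print(hp_avg[groups_to_check])
```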

Additionally, you can also sort the data using the `arrange` function, like so:

In [19]:
#Make a variable called "group" which is the grouping of "cars" by its "mpg" column, plus the average of all "hp" column values 
#in rows which have that given "mpg" value
group <- summarize(groupBy(cars, cars$mpg), average = avg(cars$hp))
#Take the first six elements of "group" which are ordered in decreasing "average" column value order
SparkR::head(arrange(group, desc(group$average)))

   mpg average
1 15.0     335
2 15.8     264
3 13.3     245
4 14.3     245
5 14.7     230
6 10.4     210
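
The sorted result can also be reproduced locally. The sketch below is base R only; `local_group` and `local_sorted` are illustrative names, and `stats::aggregate` plus `order` stand in for `summarize`, `groupBy` and `arrange`.

```r
# Base R only: average hp per mpg value, sorted by decreasing average.
local_group <- stats::aggregate(hp ~ mpg, data = mtcars, FUN = mean)
names(local_group)[2] <- "average"

local_sorted <- local_group[order(local_group$average, decreasing = TRUE), ]
print(utils::head(local_sorted))
```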

Now try it on the `flowers` data frame!

In [20]:
#Make a variable called "petals" which is the grouping of "flowers" by its "Petal_Length" column, plus the count of its occurrences
#type your code here
#Take first six elements of "petals" which are ordered in decreasing "count" column order
#type your code here

### Operating on Columns <a id="oper"></a>
Now that you know how to select columns, you can **perform operations on them**. The usual arithmetic operators and many SparkR functions can be applied to a column. To refer to a column, you use the `$` operator.

In [21]:
#In the "cars" data frame, change the "mpg" (miles per gallon) column to miles per liter and then change it back
#1 gallon is 3.78541178 liters
cars$mpg <- cars$mpg/3.78541178
SparkR::head(cars)
#Change it back
cars$mpg <- cars$mpg*3.78541178
SparkR::head(cars)

       mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 5.547613   6  160 110 3.90 2.620 16.46  0  1    4    4
2 5.547613   6  160 110 3.90 2.875 17.02  0  1    4    4
3 6.023123   4  108  93 3.85 2.320 18.61  1  1    4    1
4 5.653282   6  258 110 3.08 3.215 19.44  1  0    3    1
5 4.940017   8  360 175 3.15 3.440 17.02  0  0    3    2
6 4.781514   6  225 105 2.76 3.460 20.22  1  0    3    1

   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
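
The same unit conversion can be tried on the local `mtcars` data to see what values to expect. This base R sketch uses illustrative variable names and works on copies, so the original column is never overwritten. It compares the round trip with a tolerance, since multiplying back is not guaranteed to reproduce floating-point values exactly.

```r
# Base R only: convert miles per gallon to miles per liter and back.
liters_per_gallon <- 3.78541178
miles_per_liter <- mtcars$mpg / liters_per_gallon
round_trip <- miles_per_liter * liters_per_gallon

print(utils::head(miles_per_liter))
# Compare with a tolerance instead of testing exact equality.
print(isTRUE(all.equal(round_trip, mtcars$mpg)))
```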

Now try the same on the `flowers` data frame.

In [22]:
#In the "flowers" data frame, change the "Petal_Length" column from centimeters to millimeters and then change it back
#1 centimeter is 10 millimeters
#type your code here

#Change it back
#type your code here

### Utilizing SQL Queries in SparkR <a id="queries"></a>
You can also **utilize SQL queries in SparkR**, thanks to the SQL Context created. Before utilizing SQL queries, you need to register your data frames as `tempTables`. Let's do this right now:

In [23]:
#Create a temporary SQL table called "cars" using our SparkR data frame "cars"
registerTempTable(cars,"cars")
#Create a temporary SQL table called "flowers" using our SparkR data frame "flowers"
registerTempTable(flowers,"flowers")

Now that we have our temporary tables, we can perform queries using the `sql` command.

In [24]:
#Select the first six rows from the "cars" data frame where the value of the "cyl" column is greater than 6
SparkR::head(sql(sqlContext, "SELECT * FROM cars WHERE cyl > 6"))

   mpg cyl  disp  hp drat   wt  qsec vs am gear carb
1 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
2 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
3 16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3
4 17.3   8 275.8 180 3.07 3.73 17.60  0  0    3    3
5 15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
6 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4
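
When the threshold in a query changes often, the query text can be assembled from a variable. The sketch below is base R only: it builds the string with `sprintf` and checks the expected rows on the local `mtcars` data. `min_cyl`, `query` and `many_cyl` are illustrative names, and the resulting string is what would be handed to the `sql` command.

```r
# Base R only: build the query text from a variable.
min_cyl <- 6L
query <- sprintf("SELECT * FROM cars WHERE cyl > %d", min_cyl)
print(query)

# Local cross-check of the rows the WHERE clause should keep.
many_cyl <- mtcars[mtcars$cyl > min_cyl, ]
print(utils::head(many_cyl))
print(nrow(many_cyl))
```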

Now, try performing a query on the `flowers` data frame.

In [25]:
#Select the first six rows from the "flowers" data frame where the value of the "Petal_Length" column is greater than 1
#type your code here

<div class="alert alert-block alert-info" style="margin-top: 20px">That wraps up our Hands-on Lab 2. Remember that if you want any more information on SparkR, you can visit the SparkR documentation page at https://spark.apache.org/docs/latest/sparkr.html</div>

# Hands-on Lab 3: Linear Models in SparkR <a id="hl3"></a>
In this Hands-on Lab, we will go over Generalized Linear Models in SparkR. They are constructed much like their R counterparts, but their underlying implementation is different.

GLMs, as they are called in SparkR, are based on **`MLlib`**, Spark's Machine Learning library. SparkR makes them easier to use because the functions for interacting with these models are largely modeled on the pre-existing **R** functions.

<div class="alert alert-block alert-info" style="margin-top: 20px">This Hands-on Lab assumes you have already created the SQL Context and loaded the `mtcars` and `iris` data frames from R. If for some reason you have not done so yet, execute the code block below.</div>

In [26]:
#Initiate our SQL Context
sqlContext <- sparkRSQL.init(sc)
#Create a Data Frame called "cars" from the mtcars dataset
cars <- createDataFrame(sqlContext,mtcars)
#Create a Data Frame called "flowers" from the iris dataset
flowers <- createDataFrame(sqlContext,iris)

In FUN(X[[i]], ...): Use Petal_Width instead of Petal.Width  as column name

### Creating a Gaussian Regression Model <a id="gauss"></a>
To create a Gaussian Regression Model, we utilize the general `glm` function, **passing the value `gaussian` to the `family` parameter**, indicating that it is a Gaussian model.

`glm` understands most of **R**'s formula operators, such as **~, +, -, . and :**. We can use them to easily create the model, like so:

In [27]:
#Create a GLM of the Gaussian family of models, using the formula that has "mpg" as the response variable and
#"hp" and "cyl" as the predictors.
model <- SparkR::glm(mpg ~ hp + cyl, data = cars, family = "gaussian")

We can check the data for this model in an easy-to-read manner using the `summary` function:

In [28]:
#Retrieve the data from our model
SparkR::summary(model)

$devianceResiduals
 Min       Max     
 -4.494752 7.293354

$coefficients
            Estimate   Std. Error t value   Pr(>|t|)    
(Intercept) 36.90833   2.190799   16.84698  2.220446e-16
hp          -0.0191217 0.01500073 -1.274718 0.2125285   
cyl         -2.264694  0.5758892  -3.932516 0.0004803752
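
As a sanity check, the same formula can be fitted on the local `mtcars` data with R's own `glm`. This sketch is base R only and does not use Spark; `local_model` is an illustrative name, and the `stats::` prefixes make sure the local functions are the ones called. The estimates should agree closely with the coefficients reported above.

```r
# Base R only: fit the same formula on the local mtcars data frame.
local_model <- stats::glm(mpg ~ hp + cyl, data = mtcars, family = stats::gaussian())
local_coefficients <- stats::coef(local_model)

print(round(local_coefficients, 5))
```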


Now that we know how to create this model, we can **use it for predicting data points** using the `predict` function:

In [29]:
#Create predictions based on the model created
predictions <- SparkR::predict(model, newData = cars)
SparkR::head(select(predictions, "mpg", "prediction"))

   mpg prediction
1 21.0   21.21678
2 21.0   21.21678
3 22.8   26.07124
4 21.4   21.21678
5 18.7   15.44448
6 18.1   21.31239
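
Each prediction is the intercept plus every coefficient multiplied by the matching predictor value. The short sketch below redoes that arithmetic by hand for the first row (`hp = 110`, `cyl = 6`), using the coefficients copied from the `summary` output; the variable names are illustrative.

```r
# Coefficients copied from the summary output of the Gaussian model.
intercept <- 36.90833
coef_hp   <- -0.0191217
coef_cyl  <- -2.264694

# First row of mtcars: hp = 110, cyl = 6.
first_prediction <- intercept + coef_hp * 110 + coef_cyl * 6
print(round(first_prediction, 5))
```

The result matches the first value in the `prediction` column.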

Now that you know how to create a Gaussian model, try it using the `flowers` data set:

In [30]:
#Create a Gaussian GLM, using the formula that has "Sepal_Length" as the response variable and "Sepal_Width" and "Species"
#as the predictors
#type your code here

#Retrieve the data from our model
#type your code here

#Create predictions based on the model created
#type your code here

### Creating a Binomial Regression Model <a id="binomial"></a>
Creating a Binomial Regression Model is simple - you just pass the value `binomial` to the `family` parameter of the `glm` function, like this:

In [31]:
#Create a Binomial GLM, using the formula that has "am" as the response variable and "hp", "mpg" and "wt" as the predictors
model <- SparkR::glm(am ~ hp + mpg + wt, data = cars, family = "binomial")

As seen before, we can retrieve data from our model using the `summary` function:

In [32]:
#Retrieve data from our model
SparkR::summary(model)

$coefficients
                Estimate
(Intercept) -15.72136618
hp            0.08389344
mpg           1.22930192
wt           -6.95492385


And then, of course, **predict data points using our binomial regression model**:

In [33]:
#Create predictions based on the model created
predictions <- SparkR::predict(model, newData = cars)
SparkR::head(select(predictions, "am", "prediction"))

  am prediction
1  1          1
2  1          0
3  1          1
4  0          0
5  0          0
6  0          0
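
To see where the 0 and 1 predictions come from, the coefficients from the `summary` output can be applied by hand. This base R sketch rests on two assumptions that the notebook does not state: that the model uses the logistic link, and that a probability above 0.5 is reported as class 1. Under those assumptions it reproduces the six predictions shown above. All variable names are illustrative.

```r
# Coefficients copied from the summary output of the binomial model.
b0    <- -15.72136618
b_hp  <- 0.08389344
b_mpg <- 1.22930192
b_wt  <- -6.95492385

# Linear score, then logistic function, then a 0.5 cut-off (assumed).
first_six <- utils::head(mtcars)
linear_score <- b0 + b_hp * first_six$hp + b_mpg * first_six$mpg + b_wt * first_six$wt
probability <- stats::plogis(linear_score)
predicted_class <- as.integer(probability > 0.5)

print(data.frame(am = first_six$am,
                 probability = round(probability, 3),
                 prediction = predicted_class))
```

The second row is a car with `am = 1` whose score falls below the cut-off, which is why it is predicted as 0.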

Now that you know how to build a binomial regression model, you can try a different model on the `cars` dataset.

In [34]:
#Create a Binomial GLM, using the formula that has "vs" as the response variable and "drat", "disp" and "gear" as
#the predictors
#type your code here

#Retrieve data from our model
#type your code here

#Create predictions based on the model created
#type your code here

<div class="alert alert-block alert-info" style="margin-top: 20px">That wraps up our Hands-on Lab 3. Remember that if you want any more information on SparkR, you can visit the SparkR documentation page at https://spark.apache.org/docs/latest/sparkr.html</div>