# Lab 3 Instructions and Exercises (Big Data in Statistics)

## Introduction

These notes contain instructions and questions for the labs portion of the Big Data in Statistics module. Within this document, command-line steps are presented as follows:

In [None]:
hadoop fs -put data /user/mark/repository/data

All commands will be in a separate grey "cell" (as above).
<br><br>
Exercises will be listed as a bulleted item and italicized. For example:

<ul><li><i>Create a new directory in your HDFS home directory called sample. Upload data.csv into the sample directory on HDFS.</i></li></ul>

To follow real-world development practices, you will use the version control software Git and internet-based repositories hosted on <a href="http://github.com">github.com</a>. Instructions on how to use these tools will be provided during the exercises.

## Objectives

In this lab you will be expected to achieve the following:
<ol>
<li>Create a Spark Resilient Distributed Dataset (RDD)
<li>Transform an RDD
<li>Perform an action on an RDD
</ol>

## Exercises

### Exercise 1

You will now connect to the Spark REPL, using the following command:

In [None]:
spark-shell

After a short delay, you will be presented with a new prompt that looks as follows (NB: it is safe to ignore the error messages when Spark loads):

In [None]:
scala>

To avoid ambiguity, all Spark commands listed within this document will follow this <i>scala</i> prompt.<br>
As part of the Spark startup process, a Spark context (sc) is created, which is the main interface to the requested Spark environment. To view the available functions, type the following and then immediately press the Tab key:

In [None]:
scala> sc.

You will now create an RDD by loading the Heathrow airport temperature data:

In [None]:
scala> val heathrowFile = sc.textFile("hdfs:///user/USERNAME/temperatureData/heathrowdata.txt")

Note that <b>USERNAME</b> will need to be replaced with your compute cluster username, <i>userN</i>.
<br><br>
As displayed in the console, the value heathrowFile is of type org.apache.spark.rdd.RDD[String]; that is, the function textFile has created an RDD of type String (as one would expect when reading a text file). Due to Spark’s lazy evaluation model, the file has not yet been read; it has not even been checked to see if it exists! This only happens when an action is performed. You will now bring 5 lines of data back to the console and print each line, which forces Spark to undertake an <i>action</i>.

In [None]:
scala> val fiveLines = heathrowFile.take(5)

In [None]:
scala> fiveLines.foreach(println)

The first command creates a value fiveLines by taking the first 5 elements of the RDD (lines, in the case of a text file). fiveLines is of type Array[String] because the data has now been returned to the local console and is held there as an ordinary Scala array, rather than being managed by Spark as an RDD. The second command calls the function println on each element of fiveLines, which displays the data.
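
The distinction between a lazy transformation and an action can be illustrated locally, without a cluster. The sketch below is only an analogy: it uses a plain Scala Iterator rather than an RDD, and the sample strings and the evaluated counter are made up for illustration.

```scala
// Local analogy only: a plain Scala Iterator, not a Spark RDD.
val lines = Seq("line 1", "line 2", "line 3", "line 4") // hypothetical sample data

var evaluated = 0 // counts how many elements have actually been processed

// "Transformation": map on an Iterator is lazy, so nothing is processed yet.
val upper = lines.iterator.map { line => evaluated += 1; line.toUpperCase }
println(s"after map: $evaluated elements processed")

// "Action": asking for two results forces only those two elements to be processed.
val firstTwo = upper.take(2).toList
println(s"after take(2): $evaluated elements processed")
firstTwo.foreach(println)
```

The counter stays at zero until take(2).toList asks for results, in the same way that textFile does no work until an action such as take is called on the RDD.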

<ul>
<li><i>Create a value called wickairportFile, that is an RDD[String], based on the wickairportdata.txt file in HDFS. Display the first five lines of this data file.</i>
</ul>

### Exercise 2

You will now create a function that is able to determine whether a line of data contains the header information:

In [None]:
scala> def isHeader(line: String): Boolean = {
  // Split on any run of whitespace so that tab- and space-delimited lines are both handled
  line.contains("yyyy") || line.trim.split("\\s+").size != 5
}

A function is created by using the keyword def, followed by the function name, input arguments, and output type. The function body is specified between the braces. This function returns true if the line contains the string yyyy, which is only present in the column-heading line, or if the line does not split into exactly five fields, which catches any other non-data lines.
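
A quick way to see what each half of the condition catches is to call isHeader on a few hand-written lines. The sample lines below are hypothetical, not taken from the data file, and this copy of isHeader splits on any run of whitespace, which is an assumption about how the file is delimited.

```scala
// Repeated here so the snippet runs on its own.
// Assumption: fields are separated by tabs and/or spaces.
def isHeader(line: String): Boolean = {
  line.contains("yyyy") || line.trim.split("\\s+").size != 5
}

// Hypothetical sample lines: a title line, a column-heading line and a data line.
val sampleLines = Seq(
  "Heathrow (London Airport)",
  "yyyy mm tmax tmin rain",
  "1948 1 8.9 3.3 85.0"
)

// Print the verdict next to each line.
sampleLines.foreach(line => println(s"${isHeader(line)} <- $line"))
```

The title line is rejected because it does not have five fields, the column-heading line because it contains yyyy, and only the data line passes through.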
<br><br>
You will now create a case class corresponding to the structure of the data:

In [None]:
scala> case class TemperatureRecord(year: Long, month: Int, tmax: Float, tmin: Float, rain: Float)

This class will act as a data structure to allow us to parse and manipulate the data in a convenient manner. You can create an RDD of TemperatureRecord objects by creating a parse function as follows:

In [None]:
scala> def parse(line: String): TemperatureRecord = {
  // Split on any run of whitespace so that tab- and space-delimited lines are both handled
  val pieces = line.trim.split("\\s+")
  val year = pieces(0).toLong
  val mm = pieces(1).toInt
  val tmax = pieces(2).toFloat
  val tmin = pieces(3).toFloat
  val rain = pieces(4).toFloat
  TemperatureRecord(year, mm, tmax, tmin, rain)
}

and by creating the following transformation:

In [None]:
scala> val heathrowData = heathrowFile.filter(x => !isHeader(x)).map(parse)

As before, due to Spark’s lazy evaluation model, this command will not be executed until we perform an action on the data. The value heathrowData is of type org.apache.spark.rdd.RDD[TemperatureRecord].
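
Before running parse across the whole file, it may help to try it on a single line in the console. This is a minimal sketch: the sample lines are invented, the definitions are repeated so that the snippet runs on its own, and splitting on whitespace is an assumption about the file's delimiter. scala.util.Try is used only to show what happens when a field is not numeric.

```scala
// Repeated here so the snippet runs on its own.
case class TemperatureRecord(year: Long, month: Int, tmax: Float, tmin: Float, rain: Float)

// Assumption: fields are separated by tabs and/or spaces.
def parse(line: String): TemperatureRecord = {
  val pieces = line.trim.split("\\s+")
  TemperatureRecord(pieces(0).toLong, pieces(1).toInt,
    pieces(2).toFloat, pieces(3).toFloat, pieces(4).toFloat)
}

// A hypothetical, well-formed data line.
val record = parse("1948\t1\t8.9\t3.3\t85.0")
println(record)
println(record.month) // fields are accessed by name

// A hypothetical line with a missing value: the conversion to Float fails.
val bad = scala.util.Try(parse("1948\t1\t---\t3.3\t85.0"))
println(bad.isFailure)
```

The second call fails with a NumberFormatException, captured by the Try, because "---" cannot be converted to a Float.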

<ul>
<li><i>Perform the same transformation on the value wickairportFile to create a value wickairportData of type org.apache.spark.rdd.RDD[TemperatureRecord]. NB: You will need to add a further filter transformation to remove the lines in wickairportFile that contain missing values (i.e. records in which "---" appears in place of a value). [It is recommended that you test this set of transformations using the take command.]</i>
</ul>

### Exercise 3

It is often the case that you need to sort the data in accordance with one of the fields in the data. You will now sort the data by month rather than year:

In [None]:
scala> val heathrowMonth = heathrowData.sortBy(_.month)
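
sortBy takes a function that extracts the sort key from each element. The same method exists on local Scala collections, so the idea can be tried on a few hand-made records first. The records below are hypothetical. On an RDD, sortBy also accepts an ascending argument (true by default); the local version shown here reverses the Ordering instead.

```scala
// Repeated here so the snippet runs on its own.
case class TemperatureRecord(year: Long, month: Int, tmax: Float, tmin: Float, rain: Float)

// Hypothetical records, deliberately out of month order.
val records = Seq(
  TemperatureRecord(1950, 3, 10.0f, 2.0f, 30.0f),
  TemperatureRecord(1949, 1, 6.0f, 1.0f, 50.0f),
  TemperatureRecord(1948, 2, 7.0f, 0.5f, 40.0f)
)

// Ascending by month, as in the command above.
val byMonth = records.sortBy(_.month)
println(byMonth.map(_.month))

// Descending by month, using a reversed Ordering on the local collection.
val byMonthDesc = records.sortBy(_.month)(Ordering[Int].reverse)
println(byMonthDesc.map(_.month))
```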

<ul>
<li><i>Sort the wickairportData by maximum temperature, creating a value called wickairportTemp. <b>NB: As often occurs in data analysis, you will need to clean the data first, by filtering any erroneous data points that contain "---" as the temperature values.</b></i></li>
</ul>

### Exercise 4

You will now convert all temperature values from degrees Celsius into degrees Fahrenheit:

In [None]:
scala> val heathrowMonthFah = heathrowMonth.map( x => {
  val tmax = x.tmax * 9/5 + 32
  val tmin = x.tmin * 9/5 + 32
  TemperatureRecord(x.year, x.month, tmax, tmin, x.rain)
})
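
The conversion relies on the order in which the arithmetic is evaluated: x.tmax * 9 is computed first and is a Float, so the division by 5 is a floating-point division. The sketch below checks the formula against two well-known reference points and shows what goes wrong if 9 / 5 is evaluated on its own. The helper name toFahrenheit is hypothetical and is not part of the lab code.

```scala
// Celsius to Fahrenheit, written the same way as in the map above.
// toFahrenheit is a hypothetical helper used only for this check.
def toFahrenheit(celsius: Float): Float = celsius * 9 / 5 + 32

println(toFahrenheit(0f))   // freezing point of water
println(toFahrenheit(100f)) // boiling point of water

// Pitfall: 9 / 5 on its own is integer division and evaluates to 1,
// so this expression simply returns the input value as a Float.
val wrong = 9 / 5 * 100f
println(wrong)
```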

<ul>
<li><i>Create an RDD[Float] containing the monthly rainfall in centimeters for Wick airport (the rainfall data is represented in mm in the data), sorted by tmax.</i>
</ul>