# Lab 2: Working with Spark RDDs, DataFrames, and SQL on Azure

### Learning Objectives

* To create a Spark RDD
* To create a Spark DataFrame
* To query data with Spark SQL

## Spark RDD (Resilient Distributed Dataset) 

* An RDD is an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.

* An RDD can contain any type of object and is created by loading an external dataset or distributing a collection from the driver program.

### Creating an RDD in PySpark

##### There are three ways to create an RDD in Spark.

* Parallelizing an existing collection in the driver program.
* Referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system).
* Creating an RDD from existing RDDs.

In [None]:
MyRdd1 = sc.parallelize([("maths",92),("english",75),("sciences",85),("social",90)])

In [None]:
MyRdd1

In [None]:
MyRdd1.collect()

### Creating an RDD by reading a file

In [None]:
MyRdd2 = sc.textFile('/temp_data.txt')

In [None]:
MyRdd2.collect()

In [None]:
MyRdd3 = MyRdd2.map(lambda s: s.split('\t'))

In [None]:
type(MyRdd3)

In [None]:
MyRdd3.take(4)

### RDDs support two types of operations:
* Transformations are operations (such as map, filter, join, and union) that are performed on an RDD and yield a new RDD containing the result.
* Actions are operations (such as count, collect, first, and take) that run a computation on an RDD and return a value to the driver program.

* Transformations in Spark are “lazy”, meaning that they do not compute their results right away.
* They just “remember” the operation to be performed and the dataset (e.g., a file) on which it is to be performed.
* The transformations are only computed when an action is called, and the result is then returned to the driver program.
* This design enables Spark to run more efficiently. For example, if a big file is transformed in various ways and then passed to the first() action, Spark only processes and returns the result for the first line, rather than doing the work for the entire file.

In [None]:
intRdd = sc.parallelize([10,20,30,40,50])
mapRdd = intRdd.map(lambda x : x**2)

In [None]:
mapRdd.collect()

##  Spark DataFrame 

#### A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. 
<br> The list that defines the columns and the types within those columns is called the schema. 
<br> One can think of a DataFrame as a spreadsheet with named columns.
<br> A spreadsheet sits on one computer in one specific location, whereas a Spark DataFrame can span thousands of computers.
<br> The reason for putting the data on more than one computer should be intuitive: 
<br>     either the data is too large to fit on one machine or 
<br>     it would simply take too long to perform that computation on one machine.

#### Create a DataFrame with one column containing 100 rows with values from 0 to 99.
This range of numbers represents a distributed collection. 
<br> When run on a cluster, each part of this range of numbers exists on a different executor. 
<br> This is a Spark DataFrame.

In [None]:
myRange = spark.range(100).toDF('number')

In [None]:
myRange.rdd.getNumPartitions()

In [None]:
myRange.show(3)

In [None]:
type(myRange)

In [None]:
myDF = spark.createDataFrame([[1, 'Alice', 30],
                              [2, 'Bob', 28],
                              [3, 'Cathy', 31], 
                              [4, 'Dave', 56]], ['Id', 'Name', 'Age'])

myDF.show()

In [None]:
myDF.dtypes

## DataFrame Transformations & Actions

### Transformations
In Spark, the core data structures are immutable, meaning they cannot be changed after they’re created.
<br> To “change” a DataFrame, you need to instruct Spark how you would like to modify it to do what you want.
<br> These instructions are called transformations.
<br> Transformations are the core of how you express your business logic using Spark.
<br> Transformations are simply ways of specifying different series of data manipulation.

In [None]:
myRange.show()

In [None]:
divisBy2 = myRange.where("number % 2 = 0")
divisBy2

Notice that this returns no rows; at most the notebook echoes the DataFrame's schema. <br>This is because we specified only an abstract transformation, and Spark will not act on transformations until we call an action.

### Actions
Transformations allow us to build up our logical transformation plan. 
<br> To trigger the computation, we run an action.
<br> An action instructs Spark to compute a result from a series of transformations. 
<br> The simplest action is count, which gives us the total number of records in the DataFrame:

In [None]:
divisBy2.count()

In [None]:
divisBy2.show(3)

In [None]:
testRDD = sc.textFile("/test.csv")
print("Total Records with header: ", testRDD.count())

### Spark SQL
With Spark SQL, you can register any DataFrame as a table or view (a temporary table) and query it using pure SQL.
<br>There is no performance difference between writing SQL queries and writing DataFrame code: <br>both “compile” to the same underlying plan.

In [None]:
## Create view/table
## Note: trainDF is not created in this lab; this assumes it was loaded earlier in the session
trainDF.createOrReplaceTempView("trainDFTable")

In [None]:
## Verify Dataframe
trainDF.show(2)

In [None]:
## Verify Table
spark.sql("SELECT * FROM trainDFTable LIMIT 2").show()