# Unit 2: Basic Spark Concepts

## Contents

2.1 Spark Components

2.2 RDD

2.3 Partitioning

2.4 Transformations vs Actions

2.5 DAG

2.6 Available APIs

## Spark Components

A Spark application consists of a driver program, which runs the main code of the application, and a set of executors assigned to the application by YARN, to which the driver distributes the operations.

![Spark Components](http://bigdata.cesga.es/tutorials/img/cluster-overview.png)

Diagram taken from the [Spark Cluster Mode Overview](https://spark.apache.org/docs/2.4.0/cluster-overview.html).

For further information check our [Spark Tutorial](http://bigdata.cesga.es/tutorials/spark.html#/) and the [Spark Cluster Mode Overview](https://spark.apache.org/docs/2.4.0/cluster-overview.html).

## RDD
A Resilient Distributed Dataset (RDD) is an abstraction that represents a collection of elements **distributed** across the nodes of the cluster.

An RDD provides a series of methods that let you operate on its underlying data in parallel in a transparent way:


In [1]:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

In [2]:
rdd.count()

6

RDDs are **resilient** because they can recover automatically if some of the nodes fail.

## Partitioning

The elements of an RDD are split between the nodes of the cluster, dividing the collection into partitions. Each partition is then processed by a given executor.

![Partitioning](https://docs.google.com/drawings/d/1GAasfY7P7uaMXhvGHuZ1nOqPqv6TrE7-N96RqUn1NqE/pub?w=960&h=540)

In [3]:
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6])

In [4]:
rdd1.toDebugString()

'(2) ParallelCollectionRDD[2] at parallelize at PythonRDD.scala:195 []'

In [5]:
rdd1.glom().collect()

[[1, 2, 3], [4, 5, 6]]

In [11]:
rdd2 = sc.parallelize([1, 2, 3, 4, 5, 6], 3)

In [12]:
rdd2.glom().collect()

[[1, 2], [3, 4], [5, 6]]

In [8]:
print(rdd2.toDebugString())

(3) ParallelCollectionRDD[4] at parallelize at PythonRDD.scala:195 []


In general, each task of an application runs against a different partition of the RDD.
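To make the idea of partitions more concrete, the following plain-Python sketch splits a list into contiguous, nearly equal slices and then runs one "task" per slice. It does not use Spark: `slice_into_partitions` is a hypothetical helper, and the index-based slicing rule is an assumption chosen because it reproduces the `glom().collect()` outputs shown above. Spark's own slicing may differ in other cases.

```python
def slice_into_partitions(data, num_partitions):
    """Split a list into contiguous, nearly equal slices (illustrative only)."""
    n = len(data)
    partitions = []
    for i in range(num_partitions):
        # Integer arithmetic keeps the slices contiguous and balanced.
        start = (i * n) // num_partitions
        end = ((i + 1) * n) // num_partitions
        partitions.append(data[start:end])
    return partitions


two = slice_into_partitions([1, 2, 3, 4, 5, 6], 2)
three = slice_into_partitions([1, 2, 3, 4, 5, 6], 3)
print(two)    # [[1, 2, 3], [4, 5, 6]]
print(three)  # [[1, 2], [3, 4], [5, 6]]

# One "task" per partition: each partial result depends on a single slice,
# so the three sums could be computed independently and combined afterwards.
partial_sums = [sum(partition) for partition in three]
total = sum(partial_sums)
print(partial_sums, total)  # [3, 7, 11] 21
```

With three partitions there are three independent pieces of work, which is why the number of partitions limits how many tasks can run at the same time.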

When using **large files** in HDFS (with many blocks), the partitions can be considered equivalent to the HDFS blocks of the given file.

For **small files** (smaller than 128 MB), Spark creates two partitions by default, so initially only two tasks can run in parallel, regardless of how many resources YARN has allocated to the application.

In [10]:
rdd3 = sc.textFile('datasets/meteogalicia.txt')

In [10]:
print(rdd3.toDebugString())

(2) datasets/meteogalicia.txt MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
 |  datasets/meteogalicia.txt HadoopRDD[6] at textFile at NativeMethodAccessorImpl.java:0 []


In [9]:
rdd4 = sc.textFile('datasets/meteogalicia.txt', 4)

In [12]:
print(rdd4.toDebugString())

(4) datasets/meteogalicia.txt MapPartitionsRDD[9] at textFile at NativeMethodAccessorImpl.java:0 []
 |  datasets/meteogalicia.txt HadoopRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
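The rule of thumb described in this section (one partition per HDFS block for large files, at least two partitions for small ones) can be written down as a rough estimate. This is only a sketch: `estimate_partitions` is a hypothetical helper, the 128 MB block size is the value assumed in the text, and the real number of partitions is decided by the Hadoop input-split logic, so it may differ from this estimate.

```python
import math

# Assumption: 128 MB HDFS blocks, as in the text above.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024


def estimate_partitions(file_size_bytes, min_partitions=2, block_size=HDFS_BLOCK_SIZE):
    """Rough estimate: one partition per block, never fewer than min_partitions."""
    blocks = math.ceil(file_size_bytes / block_size)
    return max(min_partitions, blocks)


small = estimate_partitions(10 * 1024 * 1024)        # 10 MB file, default minimum
small_4 = estimate_partitions(10 * 1024 * 1024, 4)   # same file, 4 partitions requested
large = estimate_partitions(10 * 1024 ** 3)          # 10 GB file
print(small, small_4, large)  # 2 4 80
```

The second call mirrors passing `4` as the second argument to `sc.textFile`, as done for `rdd4` above: asking for more partitions is how a small file can be processed by more than two tasks.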


## Transformations vs Actions

### Transformations
A transformation creates a new RDD from an existing one.

All transformations in Spark are **lazy**: they do not compute anything until an action is executed.

Examples:
* map
* filter

### Actions
An action triggers the computation and returns the result to the driver program.

Examples:
* reduce
* collect
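The difference between lazy transformations and actions can be illustrated with a toy model in plain Python. `LazyDataset` below is a hypothetical class, not part of Spark and not how Spark is implemented: it only records `map` and `filter` calls and runs them when `collect` or `reduce` is called.

```python
from functools import reduce as functools_reduce


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # tuple of ("map" | "filter", function) pairs

    # Transformations: return a new object and do no work.
    def map(self, func):
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    # Actions: replay the recorded steps and return a result.
    def _compute(self):
        items = iter(self._data)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    def collect(self):
        return list(self._compute())

    def reduce(self, func):
        return functools_reduce(func, self._compute())


calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return 2 * x


pipeline = LazyDataset([1, 2, 3, 4, 5, 6]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)             # 0: nothing has been computed yet
result = pipeline.collect()                  # the action triggers the computation
total = pipeline.reduce(lambda a, b: a + b)  # a second action recomputes everything
print(calls_before_action, result, total, len(calls))  # 0 [6, 8, 10, 12] 36 12
```

In the toy model `double` runs twelve times because each action replays all the recorded steps. With a real RDD the same chain would be written as `rdd.map(...).filter(...).collect()`.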

## DAG

Each job is represented by a graph (specifically a [directed acyclic graph (DAG)](https://en.wikipedia.org/wiki/Directed_acyclic_graph)):

![DAG](http://2.bp.blogspot.com/-5sDP78mSdlw/Ur3szYz1HpI/AAAAAAAABCo/Aak2Xn7TmnI/s1600/p2.png)
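As a minimal sketch of what such a graph encodes, the snippet below stores a made-up job as a dictionary of dependencies and derives an order in which the nodes could be computed. The node names and the `execution_order` helper are hypothetical, and the code assumes that the graph has no cycles. It says nothing about how Spark actually schedules stages.

```python
# Hypothetical job: two inputs are transformed and then joined.
# Each key is a node; its value lists the nodes it depends on (its parents).
dag = {
    "textFile_A": [],
    "textFile_B": [],
    "map_A": ["textFile_A"],
    "filter_B": ["textFile_B"],
    "join": ["map_A", "filter_B"],
}


def execution_order(graph):
    """Return the nodes so that every node appears after all of its parents.

    Assumes the graph is acyclic; cycles are not detected.
    """
    ordered, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for parent in graph[node]:
            visit(parent)  # make sure the parents are scheduled first
        ordered.append(node)

    for node in graph:
        visit(node)
    return ordered


order = execution_order(dag)
print(order)  # ['textFile_A', 'textFile_B', 'map_A', 'filter_B', 'join']
```

Because the graph is directed and acyclic, such an order always exists, and the two branches that feed `join` do not depend on each other.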

## Available APIs 

There are currently two options for using Spark from Python:

* Low-Level API: Using **RDDs and PairRDDs**: the original API, low level, great flexibility

* Structured API: Using **Spark SQL and DataFrames**: newer, higher level, better performance

In Java and Scala there is also the option of using **Datasets**: a generalization of DataFrames that allows the use of typed data instead of generic Row objects.

## Useful Reference Resources

* [Spark RDD Programming Guide](https://spark.apache.org/docs/2.4.0/rdd-programming-guide.html)
* [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/2.4.0/sql-programming-guide.html)
* [Spark Python API](https://spark.apache.org/docs/2.4.0/api/python/index.html)

## Questionnaire
Complete the [Unit 2 questionnaire](https://forms.gle/knWfo8MK1A7UiiyJ7).