# 4. RDD Programming (Legacy)

A Spark application consists of a driver program that runs the user's main function and executes _parallel operations_ on a cluster.

* Provides a _resilient distributed dataset_ (RDD), which is a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel
* An RDD can be persisted in memory so it can be reused efficiently across parallel operations
* RDDs automatically recover from node failures

_Shared variables_

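By default, when Spark runs a function in parallel as a set of tasks, it ships a copy of each variable used in the function to each task. Shared variables cover the cases where a variable needs to be shared across tasks, or between tasks and the driver program.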

1. __Broadcast variables__ -- used to cache a value in memory on all nodes
2. __Accumulators__ -- variables that are only "added" to

# Linking with Spark

* PySpark applications can use C libraries (e.g. NumPy)
* Run applications with the `bin/spark-submit` script from the Spark distribution
* To access HDFS data, use a build of PySpark linked to your version of HDFS
* Import the Spark classes

## Initializing Spark

You must create a `SparkContext` object, which tells Spark how to access a cluster. To do so, first build a `SparkConf` object that contains information about the application.

* `SparkConf().setAppName(appName).setMaster(<master>)` -- set configuration
	* `<master>` -- a Spark or YARN cluster URL for production, or "local" to run in local mode for testing

In [1]:
from pyspark import SparkConf, SparkContext

# "local[*]" runs Spark locally with one worker thread per core; use a cluster URL in production
conf = SparkConf().setAppName("appName").setMaster("local[*]")
sc = SparkContext(conf=conf)

---

# Resilient Distributed Datasets (RDDs)

An RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are 2 ways of creating RDDs:

1. _Parallelizing_ an existing collection in the driver program
2. _Referencing_ a dataset in an external storage system

## Parallelized Collections

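Created by calling `.parallelize()` on an existing collection in the driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

One important parameter is the __number of partitions__ to cut the dataset into. Spark runs one task for each partition; typically you want 2-4 partitions for each CPU in the cluster. Spark normally sets the number of partitions automatically based on the cluster, but it can be set manually by passing it as the second argument, e.g. `sc.parallelize(data, 10)`.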

In [None]:
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)
dist_data.reduce(lambda a, b: a + b)
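To make the "one task per partition" idea concrete, here is a minimal pure-Python sketch that runs without Spark. `split_into_partitions` is a hypothetical helper written for these notes, not a Spark API, and the slicing rule is an assumption chosen for illustration.

```python
def split_into_partitions(items, num_partitions):
    """Cut a list into num_partitions contiguous slices of near-equal size."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


data = [1, 2, 3, 4, 5]
partitions = split_into_partitions(data, 2)

# One "task" per partition computes a partial result ...
partial_sums = [sum(partition) for partition in partitions]

# ... and the partial results are combined into the final answer.
total = sum(partial_sums)
print(partitions, partial_sums, total)
```

More partitions mean more, smaller tasks, which is why the partition count is the main knob for parallelism here.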

## External Datasets

Spark can create distributed datasets from any storage source supported by Hadoop (e.g. the local file system, HDFS, Cassandra, HBase, ...).

Notes on reading files:

* If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system
* All file-based input methods support running on directories, compressed files, and wildcards
* The optional second argument of `textFile` controls the number of partitions of the file. By default Spark creates one partition per block of the file; you can ask for more, but not fewer, partitions than blocks

Alternative file input methods

* Text Files
* Pickle Files
* Sequence Files
* Hadoop Input/Output Formats

In [None]:
# Loading a textfile 
dist_file = sc.textFile("data.txt")

# Saving and Loading a SequenceFile
rdd = sc.parallelize(range(1, 4)).map(lambda x: (x, 'a' * x)) # SequenceFiles store (key, value) pairs
rdd.saveAsSequenceFile("path/to/file") # Saving
sorted(sc.sequenceFile("path/to/file").collect()) # Loading

## RDD Operations

RDDs support 2 types of operations:

1. _Transformations_ -- create a new dataset from an existing one
	* All transformations are lazy; they are only computed when an action requires a result to be returned to the driver program
2. _Actions_ -- return a value to the driver program after running a computation on the dataset

By default, each transformed RDD may be recomputed each time an action is run on it. To avoid this, keep an RDD in memory with the `persist` (or `cache`) method so that later actions reuse it.

In [None]:
lines = sc.textFile('data.txt')
line_lengths = lines.map(lambda s: len(s)) # lazy: nothing is computed yet

line_lengths.persist() # mark the RDD to be kept in memory once it is first computed
total_length = line_lengths.reduce(lambda a, b: a + b) # action: computed (and persisted) here
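The lazy behaviour and the cost of recomputation can be imitated in plain Python. The sketch below is an analogy only: `LazyDataset` is a hypothetical class for these notes and does not reflect how Spark is implemented. It records transformations, runs them only when an action is called, and reruns them on every action because nothing is persisted.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, transforms=()):
        self._source = list(source)
        self._transforms = tuple(transforms)

    def map(self, fn):
        # Transformation: remember fn and return a new dataset, compute nothing.
        return LazyDataset(self._source, self._transforms + (fn,))

    def reduce(self, fn):
        # Action: replay every recorded transformation, then combine the values.
        values = self._source
        for transform in self._transforms:
            values = [transform(v) for v in values]
        return reduce(fn, values)


calls = []  # records every time the mapped function actually runs


def length(s):
    calls.append(s)
    return len(s)


lines = LazyDataset(["spark", "rdd", "lazy"])
line_lengths = lines.map(length)
calls_before_action = len(calls)                         # still 0: map was lazy

total_length = line_lengths.reduce(lambda a, b: a + b)   # first action
longest = line_lengths.reduce(max)                       # second action recomputes
print(calls_before_action, total_length, longest, len(calls))
```

The mapped function runs three times per action, so two actions cost six calls. Persisting the intermediate result is what removes the second round of work.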

## Passing Functions to Spark

3 recommended ways:

1. Lambda expressions for simple functions
2. Local `def`s inside the function calling into Spark, for longer code
3. Top-level functions in a module

In [None]:
# Longer function
if __name__ == "__main__":
  def myFunc(s):
    words = s.split(" ")  # an empty separator ("") raises ValueError
    return len(words)
  
  sc = SparkContext(...)
  sc.textFile("file.txt").map(myFunc)

## Understanding Closures

One of the harder parts of Spark is understanding the scope and life cycle of variables and methods when code executes across a cluster.

Consider the naive RDD element sum below, which may behave differently depending on whether execution is happening within the same JVM. A common example of this is running Spark in local mode (`--master local[n]`) versus deploying a Spark application to a cluster (e.g. via `spark-submit` to YARN):

In [None]:
counter = 0
rdd = sc.parallelize(data)

# Don't do this
def increment_counter(x):
  global counter
  counter += x
rdd.foreach(increment_counter)

The behavior of the above code is undefined, and may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.

The variables within the closure sent to each executor are now copies and thus, when counter is referenced within the foreach function, it’s no longer the counter on the driver node. There is still a counter in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of counter will still be zero since all operations on counter were referencing the value within the serialized closure.

In local mode, in some circumstances, the foreach function will actually execute within the same JVM as the driver and will reference the same original counter, and may actually update it.

To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator. Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.

__In general, closures (constructs like loops or locally defined methods) should not be used to mutate global state.__

Some code that works in local mode may not work in distributed (cluster) mode.
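The copy semantics described in this section can be imitated without a cluster. In the sketch below, `pickle` stands in for closure serialization and `run_task` is a hypothetical stand-in for an executor. This is a simulation under those assumptions; Spark's real serializer and scheduler are not modelled.

```python
import pickle

counter = 0
data = [1, 2, 3, 4, 5]


def run_task(serialized_closure, partition):
    """Simulated executor: works on its own deserialized copy of the closure."""
    closure = pickle.loads(serialized_closure)
    for x in partition:
        closure["counter"] += x   # mutates the copy, never the driver's variable
    return closure["counter"]


# Driver side: the closure's variables are serialized once and sent to each task.
serialized = pickle.dumps({"counter": counter})
task_results = [run_task(serialized, part) for part in (data[:2], data[2:])]

print(task_results)  # each task saw only its own copy
print(counter)       # the driver's counter was never updated
```

Each task ends with its own private total, and the driver's `counter` is untouched, which is the situation accumulators are designed to fix.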

## Working with key-value pairs

A few special operations are only available on RDDs of key-value pairs. The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements by key.

In [None]:
lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

## Transformations & Actions

Refer to [here](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) for a list of common transformations and actions.

Note: Some actions have asynchronous versions, e.g. `foreachAsync` for `foreach`, which immediately return a `FutureAction` to the caller instead of blocking until the action completes.

## Shuffle Operations

Certain operations trigger a shuffle, Spark's mechanism for re-distributing data so that it is grouped differently across partitions. A shuffle is needed when an operation such as `reduceByKey` has to bring together all the values for a key, and those values may be spread across many partitions. Because this involves copying data across executors and machines (disk I/O, serialization, and network I/O), the shuffle is a costly operation.

Refer to [here](https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations) for a detailed explanation of the shuffle mechanism
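As a rough picture of what the shuffle achieves, the sketch below redistributes key-value records so that every value for a given key ends up in the same output partition. `shuffle_by_key` is a hypothetical helper, and routing with `hash(key) % n` is an assumed partitioning rule for illustration; none of Spark's actual on-disk or network machinery is modelled.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_output_partitions):
    """Route every (key, value) record to the output partition owning its key."""
    output = [defaultdict(list) for _ in range(num_output_partitions)]
    for partition in partitions:
        for key, value in partition:
            target = hash(key) % num_output_partitions
            output[target][key].append(value)
    return output


# The values for "a" start out scattered over all three input partitions.
input_partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1)],
]

shuffled = shuffle_by_key(input_partitions, 2)

# After the shuffle each key lives in exactly one partition, so a per-key
# reduction can run locally inside that partition.
counts = {key: sum(values) for part in shuffled for key, values in part.items()}
print(counts)
```

Every record may have to move to a different partition, which is where the copying cost comes from.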

## RDD Persistence

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This is useful for iterative algorithms and fast interactive use.

Use the `persist()` or `cache()` method to mark an RDD to be persisted; the first time it is computed in an action, it will be kept in memory on the nodes. Each persisted RDD can be stored using a different _storage level_, chosen by passing a `StorageLevel` object to `persist()`. `cache()` is shorthand for the default storage level.

The full list of storage levels can be found [here](https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence).

__Choosing Storage Levels__

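* If your RDDs fit comfortably in memory at the default storage level (`MEMORY_ONLY`), leave them that way. This is the most CPU-efficient option
* If not, try `MEMORY_ONLY_SER` with a fast serialization library to make the objects more space-efficient while still reasonably fast to access (Java and Scala only; in Python, stored objects are always serialized with pickle)
* Don't spill to disk unless the functions that computed the dataset are expensive or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk
* Use the replicated storage levels if you want fast fault recovery. All storage levels are fault-tolerant through recomputation, but replicas let tasks keep running without waiting to recompute a lost partition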

__Removing Data__

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the `RDD.unpersist()` method. Note that this method does not block by default. To block until resources are freed, specify `blocking=True` when calling this method.
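The least-recently-used policy can be sketched in a few lines of plain Python. `LruPartitionCache` is a hypothetical class for these notes, and it counts partitions instead of measuring memory, which is a simplifying assumption; it only illustrates the eviction order and is not Spark's cache manager.

```python
from collections import OrderedDict


class LruPartitionCache:
    """Toy cache that evicts the least recently used partition when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()   # oldest entry first, newest entry last

    def get(self, partition_id):
        if partition_id not in self._store:
            return None               # cache miss: caller would recompute
        self._store.move_to_end(partition_id)   # mark as most recently used
        return self._store[partition_id]

    def put(self, partition_id, data):
        self._store[partition_id] = data
        self._store.move_to_end(partition_id)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)      # drop least recently used

    def unpersist(self, partition_id):
        # Manual removal instead of waiting for eviction.
        self._store.pop(partition_id, None)

    def cached_ids(self):
        return list(self._store)


cache = LruPartitionCache(capacity=2)
cache.put("p0", [1, 2])
cache.put("p1", [3, 4])
cache.get("p0")            # p0 becomes the most recently used entry
cache.put("p2", [5, 6])    # cache is full, so p1 is evicted
after_eviction = cache.cached_ids()
cache.unpersist("p0")
after_unpersist = cache.cached_ids()
print(after_eviction, after_unpersist)
```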

---

# Shared Variables

Normally, when a function passed to a Spark operation runs on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and updates made to them on the remote machine are not propagated back to the driver program.

However, Spark provides 2 limited types of _shared variables_ for two common usage patterns: broadcast variables, which share read-only data with tasks, and accumulators, which propagate additive updates back to the driver.

## Broadcast Variables

Broadcast variables keep a read-only variable cached on each machine rather than shipping a copy of it with every task. Spark automatically broadcasts the common data needed by tasks within each stage, so explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.

In [None]:
broadcast_var = sc.broadcast([1, 2, 3])
broadcast_var.value

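Call `.unpersist()` to release the resources that the broadcast variable copied onto the executors; if the variable is used again afterwards, it is re-broadcast. Call `.destroy()` to permanently release all resources used by the broadcast variable, after which it can no longer be used.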

## Accumulators

Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types.

As a user, you can create named or unnamed accumulators. Tasks running on a cluster can then add to it using the add method or the += operator. However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.

In [None]:
# Using built-in accumulator
accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))

accum.value

# Define a custom accumulator type by subclassing AccumulatorParam
# (`Vector` stands in for a user-defined vector class)
from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
  def zero(self, initialValue):
    return Vector.zeros(initialValue.size)
  
  def addInPlace(self, v1, v2):
    v1 += v2
    return v1 

vec_accum = sc.accumulator(Vector(...), VectorAccumulatorParam())
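To see why the "add" operation has to be associative and commutative, the sketch below imitates the accumulator protocol in plain Python, using lists as vectors. `ListVectorParam` and `run_tasks` are hypothetical names for these notes; the flow (each task starts from a zero value and the driver merges task results) is a simplified assumption, not Spark's implementation.

```python
class ListVectorParam:
    """Toy counterpart of an accumulator param, using plain lists as vectors."""

    def zero(self, initial_value):
        # A neutral element with the same shape as the initial value.
        return [0.0] * len(initial_value)

    def add_in_place(self, v1, v2):
        for i, x in enumerate(v2):
            v1[i] += x
        return v1


def run_tasks(param, initial, partitions):
    """Simulate tasks adding to an accumulator and the driver merging results."""
    total = list(initial)
    for partition in partitions:
        local = param.zero(initial)           # each task starts from zero
        for update in partition:
            param.add_in_place(local, update)
        param.add_in_place(total, local)      # driver merges the task's result
    return total


param = ListVectorParam()
partitions = [
    [[1.0, 2.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]

result = run_tasks(param, [0.0, 0.0], partitions)
# Tasks can finish in any order; the merged value must not depend on it.
result_reordered = run_tasks(param, [0.0, 0.0], list(reversed(partitions)))
print(result, result_reordered)
```

Because element-wise addition is associative and commutative, the driver sees the same value regardless of how updates are grouped into tasks or in which order tasks complete.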