*Resilient Distributed Dataset (RDD)*

* An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel;
* RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or with an existing collection in the driver program (a Python collection in PySpark);
* RDDs can also recover automatically from node failures.

In [10]:
from pyspark import SparkContext, SparkConf

In [20]:
#Creating a SparkContext object, which tells Spark how to access a cluster
#To build a SparkContext you first need a SparkConf that contains information about
#our application
conf = SparkConf().setAppName("RDD").setMaster("local")
sc = SparkContext(conf = conf)

*To create an RDD, we can either parallelize an existing collection in our driver program or reference a dataset in an external storage system, such as the Hadoop file system (HDFS).*

### Parallelized Collections

In [12]:
data = [1, 2, 3, 4, 5, 6]
distData = sc.parallelize(data) 
#This takes our collection and copies it to form a distributed dataset that can be
#operated on in parallel
#You can also set the number of partitions the data is cut into: sc.parallelize(data, numSlices)

### External Datasets

In [13]:
distFile = sc.textFile("/Users/leoareias/Documents/Data_Engineering/RDD_Programming/test.txt")
#Here we use the sc.textFile to create an RDD where each element of the RDD represents a line in the text file
#sc meaning SparkContext

In [14]:
distFile.map(lambda s: len(s)).reduce(lambda a, b: a + b)
#The map transformation applies the provided function to each element of the RDD
#The reduce action aggregates the elements of the RDD using the specified binary operator
#Combining both, the code calculates the length of each line and then sums these lengths
#to get the total number of characters in the text file (line terminators are not counted,
#because textFile does not include them in the elements)

27
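
The same two steps can be followed without Spark by using Python's built-in `map` and `functools.reduce` on an ordinary list. This is only a local analogy for what the RDD version computes; the two lines below are made up for illustration and are not the contents of `test.txt`.

```python
from functools import reduce

# Hypothetical lines, standing in for the elements of an RDD built with textFile
lines = ["spark makes rdds", "map then reduce"]

# "map" step: one length per line
lengths = list(map(lambda s: len(s), lines))

# "reduce" step: combine the lengths pairwise with a binary operator
total_chars = reduce(lambda a, b: a + b, lengths)

print(lengths, total_chars)
```

The difference in Spark is that the map step runs on each partition in parallel, and the reduce step combines partial results from the partitions before returning a single value to the driver.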

*Saving and Loading SequenceFiles*

In [15]:
rdd = sc.parallelize(range(1,5)).map(lambda x: (x, "a" * x))
#Here we create a parallelized collection (1, 2, 3, 4) and then apply a map function
#that pairs each number x with the letter "a" repeated x times, giving (key, value) tuples

In [17]:
rdd.saveAsSequenceFile("/Users/leoareias/Documents/Data_Engineering/RDD_Programming/output_squence_file")
#Saving as a SequenceFile

In [18]:
sorted(sc.sequenceFile("/Users/leoareias/Documents/Data_Engineering/RDD_Programming/output_squence_file").collect())
#Collecting the result we got

[(1, 'a'), (2, 'aa'), (3, 'aaa'), (4, 'aaaa')]

### RDD Operations

*There are 2 types of operations:*
1. Transformations, which create a new dataset from an existing one. Example: map

2. Actions, which return a value to the driver program after running a computation on the dataset. Example: reduce

* To keep an RDD in memory so it can be reused later without being recomputed, call the .persist() (or .cache()) method on it

### Passing Functions to Spark

1. Lambda expressions, for simple functions that can be written as an expression;

2. Local defs inside the function calling into Spark;

3. Top-level functions in a module.

In [19]:
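def myFunc(s):
    #Split the line on single spaces and count the resulting words
    words = s.split(" ")
    return len(words)

#Apply myFunc to every line of the file, then bring the word counts back to the driver
rdd = sc.textFile("/Users/leoareias/Documents/Data_Engineering/RDD_Programming/test.txt").map(myFunc)
rdd.collect()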

[7, 3]