![Spark Image](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Apache_Spark_logo.svg/1200px-Apache_Spark_logo.svg.png)

# Low-Level Unstructured APIs

In this notebook we discuss the oldest fundamental concept in Spark: *RDDs (Resilient Distributed Datasets)*.<br>
To truly understand how Spark works, `you must understand the essence of RDDs`. They provide a solid foundation that the other abstractions are built upon. Since Spark 2.0, users have less need to interact with RDDs directly, but a strong mental model of how RDDs work remains essential. `In a nutshell, Spark revolves around the concept of RDDs`.

## Introduction to RDDs

An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.<br>
RDDs are `immutable`, `fault-tolerant`, `parallel data structures` that let users explicitly persist intermediate results `in memory`, control their partitioning to optimize data placement, and `manipulate` them using a rich set of `operators`.

## Immutable

RDDs are designed to be immutable, which means you `can’t` `modify a particular row` of the dataset represented by that RDD in place. You can call one of the available RDD operations to reshape the rows the way you want, but that operation will `return a new RDD`. The `original RDD will stay unchanged`, and the new RDD will contain the data in the form you want. *Spark leverages immutability to provide fault tolerance efficiently.*
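
The behaviour can be illustrated without a cluster. The following is a plain-Python analogy, not Spark code: a tuple stands in for an RDD, and the hypothetical helper `to_upper` stands in for a transformation that returns a new collection.

```python
# Plain-Python analogy (no Spark involved): a tuple plays the role of an RDD.
original = ("spark is awesome", "spark is cool")

def to_upper(rows):
    """Hypothetical 'transformation': builds and returns a new tuple."""
    return tuple(row.upper() for row in rows)

transformed = to_upper(original)

print(original)     # the source data is untouched
print(transformed)  # the new collection holds the changed rows
```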

## Fault Tolerant

Processing large datasets in parallel usually requires a cluster of machines to host and execute the computational logic. If one or more machines die due to unexpected circumstances, what happens to the data on those machines? Spark automatically handles the failure on behalf of its users by rebuilding the lost portion from the lineage information, that is, the recorded sequence of transformations that produced it.
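
To make the idea of lineage concrete, here is a minimal plain-Python sketch. It is only an illustration under simplifying assumptions, not how Spark is implemented: `source_partitions`, `lineage`, and `compute_partition` are hypothetical names, and a dictionary stands in for results held on different machines.

```python
# Plain-Python sketch of lineage-based recovery (illustrative, not Spark internals).
source_partitions = [[1, 2, 3], [4, 5, 6]]        # stable input data, one list per partition
lineage = [lambda x: x * 10, lambda x: x + 1]     # recorded transformation steps, in order

def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    rows = source_partitions[index]
    for step in lineage:
        rows = [step(row) for row in rows]
    return rows

# Materialize the results, as if each partition lived on a different machine.
results = {i: compute_partition(i) for i in range(len(source_partitions))}

# Simulate losing the machine that held partition 1 ...
del results[1]
# ... and recover only that partition from the lineage.
results[1] = compute_partition(1)
print(results)
```

Because the source data is never modified, replaying the same steps always yields the same partition, which is why immutability and fault tolerance go hand in hand.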

## Parallel Data Structures

Suppose you have a huge amount of data and you need to process every row of the dataset. One solution is to iterate over the rows and process them one by one, but that would be very slow. Instead, we divide the huge chunk of data into smaller chunks. Each chunk contains a collection of rows, and all the chunks are processed in parallel. This is where the phrase parallel data structures comes from.
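
The chunking idea can be sketched in plain Python. This is an analogy on a single machine, not Spark code: threads stand in for cluster nodes, and `split_into_chunks` and `process_chunk` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(rows, num_chunks):
    """Split a list into num_chunks contiguous chunks of nearly equal size."""
    size, remainder = divmod(len(rows), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

def process_chunk(chunk):
    """Per-chunk work; squaring each row stands in for real processing."""
    return [row * row for row in chunk]

rows = list(range(10))
chunks = split_into_chunks(rows, 3)

# Each chunk is handed to a worker; map() returns results in chunk order.
with ThreadPoolExecutor(max_workers=3) as pool:
    processed = list(pool.map(process_chunk, chunks))

print(chunks)
print(processed)
```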

## In-Memory Computing

The idea of speeding up the computation of large datasets that reside on disks in a parallelized manner using a cluster of machines was introduced by a MapReduce paper from Google. RDD pushes the speed boundary by introducing a novel idea, which is the ability to do distributed in-memory computation.

## RDD Operations

RDDs provide a rich set of commonly needed data processing operations. They include the ability to perform data transformation, filtering, grouping, joining, aggregation, sorting, and counting.<br>
Each row in a dataset is represented as an object (a Java object in the JVM APIs, a Python object in PySpark), and the structure of this object is opaque to Spark. The user of an RDD has complete control over how to manipulate this object. This flexibility comes with a lot of responsibility: some commonly needed operations, such as computing an average, have to be handcrafted. Higher-level abstractions such as the Spark SQL component provide this functionality out of the box.<br>
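
As an illustration of what "handcrafted" means, the sketch below computes an average in two steps using only plain Python. The sample values and the helper `merge` are assumptions for illustration; with an RDD the same two steps would correspond to a `map` followed by a `reduce`.

```python
from functools import reduce

ratings = [5, 3, 3, 4, 5]  # hypothetical sample values

# Step 1: turn every value into a (sum, count) pair.
pairs = [(value, 1) for value in ratings]

# Step 2: merge pairs two at a time; this operation is associative and commutative.
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

total, count = reduce(merge, pairs)
average = total / count
print(average)
```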

***The RDD operations are classified into two types: `transformations` and `actions`***

| Type | Evaluation | Returned Value |
|--|--|--|
| Transformation | Lazy | Another RDD |
| Action | Eager | Some result or write result to disk |

Transformation operations are lazily evaluated, meaning Spark will delay the evaluations of the invoked operations until an action is taken. In other words, the transformation operations merely record the specified transformation logic and will apply them at a later point. On the other hand, invoking an action operation will trigger the evaluation of all the transformations that preceded it, and it will either return some result to the driver or write data to a storage system, such as HDFS or the local file system.
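
Python generators behave in a comparable way, which makes them a convenient analogy. The sketch below is plain Python, not Spark: the generator expression plays the role of a transformation, `list(...)` plays the role of an action, and the hypothetical `calls` list records when the function really runs.

```python
calls = []  # records every time the mapping function actually runs

def tag_and_double(x):
    calls.append(x)
    return x * 2

data = [1, 2, 3]

# "Transformation": a generator expression only records what should happen.
lazy_result = (tag_and_double(x) for x in data)
calls_before_action = len(calls)   # nothing has run yet

# "Action": consuming the generator forces the evaluation.
result = list(lazy_result)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```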

## Initialising Spark

This course uses Python for the implementation, through `pyspark` (PySpark documentation: https://spark.apache.org/docs/latest/api/python/).
PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.

In [None]:
# import libraries from pyspark
from pyspark import SparkConf, SparkContext

# set values for Spark configuration
conf = SparkConf().setMaster("local").setAppName("Tutorial")

# get (if already running) or create a Spark Context
sc = SparkContext.getOrCreate(conf=conf)

In [2]:
# check (try) if Spark context variable (sc) exists and print information about the Spark context
try:
    sc
except NameError:
    print("Spark context does not exist. Please create the Spark context first (run the cell above).")
else:
    configurations = sc.getConf().getAll()
    for item in configurations: print(item)

('spark.master', 'local')
('spark.app.name', 'Tutorial')
('spark.app.startTime', '1642915846266')
('spark.rdd.compress', 'True')
('spark.app.id', 'local-1642915848952')
('spark.driver.host', '192.168.178.62')
('spark.serializer.objectStreamReset', '100')
('spark.submit.pyFiles', '')
('spark.executor.id', 'driver')
('spark.submit.deployMode', 'client')
('spark.ui.showConsoleProgress', 'true')
('spark.driver.port', '56956')


In [3]:
# print link to Spark UI, Version, Master and AppName
sc

## Creating RDDs

**There are two ways to create RDDs:**

**`The first way to create an RDD is to parallelize a Python object, meaning converting it to a distributed dataset that can be operated on in parallel.`**

In [4]:
# create a list of strings
stringList = ["Spark is awesome","Spark is cool"]
# convert list of strings into a Spark RDD
stringRDD = sc.parallelize(stringList)

In [5]:
# output RDD information
stringRDD

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

*Note that this shows only a description of the RDD, not its contents. Because of Spark's lazy evaluation, the elements become visible only once you call an action on that RDD.*

In [6]:
# retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes)
stringRDD.collect()

                                                                                

['Spark is awesome', 'Spark is cool']

*.collect() is an `action`. As its name suggests, it collects all the rows from each of the partitions of an RDD and brings them over to the driver program.*

**`The second way to create an RDD is to read a dataset from a storage system, which can be a local computer file system, HDFS, Cassandra, Amazon S3, and so on.`**

In [7]:
# read text file into RDD
ratings = sc.textFile("data/ml-1m/ratings.dat")

In [8]:
# retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) and output first 5 rows
ratings.collect()[:5]

                                                                                

['1::1193::5::978300760',
 '1::661::3::978302109',
 '1::914::3::978301968',
 '1::3408::4::978300275',
 '1::2355::5::978824291']

This example has only about 1M rows, so calling .collect() on it did not take long. If your RDD contains 100 billion rows, however, invoking the collect action is not a good idea, because the driver program most likely doesn’t have enough memory to hold all those rows. The driver will then most likely run into an out-of-memory error, and your Spark application or shell will die. The collect action is therefore typically used only after the RDD has been filtered down to a size that fits into the memory of the driver program.

In [9]:
# take the first 5 elements of the RDD
ratings.take(5)

                                                                                

['1::1193::5::978300760',
 '1::661::3::978302109',
 '1::914::3::978301968',
 '1::3408::4::978300275',
 '1::2355::5::978824291']

## Transformations

Transformations are operations on RDDs that return a new RDD. Transformed RDDs are computed lazily, only when you
use them in an action.

The following table describes commonly used transformations.

<table>
<tbody><tr><th style="width:25%">Transformation</th><th>Meaning</th></tr>
<tr>
  <td> <b>map</b>(<i>func</i>) </td>
  <td> Return a new distributed dataset formed by passing each element of the source through a function <i>func</i>. </td>
</tr>
<tr>
  <td> <b>filter</b>(<i>func</i>) </td>
  <td> Return a new dataset formed by selecting those elements of the source on which <i>func</i> returns true. </td>
</tr>
<tr>
  <td> <b>flatMap</b>(<i>func</i>) </td>
  <td> Similar to map, but each input item can be mapped to 0 or more output items (so <i>func</i> should return a Seq rather than a single item). </td>
</tr>
<tr>
  <td> <b>mapPartitions</b>(<i>func</i>) <a name="MapPartLink"></a> </td>
  <td> Similar to map, but runs separately on each partition (block) of the RDD, so <i>func</i> must be of type
    Iterator&lt;T&gt; =&gt; Iterator&lt;U&gt; when running on an RDD of type T. </td>
</tr>
<tr>
  <td> <b>mapPartitionsWithIndex</b>(<i>func</i>) </td>
  <td> Similar to mapPartitions, but also provides <i>func</i> with an integer value representing the index of
  the partition, so <i>func</i> must be of type (Int, Iterator&lt;T&gt;) =&gt; Iterator&lt;U&gt; when running on an RDD of type T.
  </td>
</tr>
<tr>
  <td> <b>sample</b>(<i>withReplacement</i>, <i>fraction</i>, <i>seed</i>) </td>
  <td> Sample a fraction <i>fraction</i> of the data, with or without replacement, using a given random number generator seed. </td>
</tr>
<tr>
  <td> <b>union</b>(<i>otherDataset</i>) </td>
  <td> Return a new dataset that contains the union of the elements in the source dataset and the argument. </td>
</tr>
<tr>
  <td> <b>intersection</b>(<i>otherDataset</i>) </td>
  <td> Return a new RDD that contains the intersection of elements in the source dataset and the argument. </td>
</tr>
<tr>
  <td> <b>distinct</b>([<i>numPartitions</i>]) </td>
  <td> Return a new dataset that contains the distinct elements of the source dataset.</td>
</tr>
<tr>
  <td> <b>groupByKey</b>([<i>numPartitions</i>]) <a name="GroupByLink"></a> </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs. <br>
    <b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
      average) over each key, using <code>reduceByKey</code> or <code>aggregateByKey</code> will yield much better
      performance.
    <br>
    <b>Note:</b> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD.
      You can pass an optional <code>numPartitions</code> argument to set a different number of tasks.
  </td>
</tr>
<tr>
  <td> <b>reduceByKey</b>(<i>func</i>, [<i>numPartitions</i>]) <a name="ReduceByLink"></a> </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function <i>func</i>, which must be of type (V,V) =&gt; V. Like in <code>groupByKey</code>, the number of reduce tasks is configurable through an optional second argument. </td>
</tr>
<tr>
  <td> <b>aggregateByKey</b>(<i>zeroValue</i>)(<i>seqOp</i>, <i>combOp</i>, [<i>numPartitions</i>]) <a name="AggregateByLink"></a> </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in <code>groupByKey</code>, the number of reduce tasks is configurable through an optional second argument. </td>
</tr>
<tr>
  <td> <b>sortByKey</b>([<i>ascending</i>], [<i>numPartitions</i>]) <a name="SortByLink"></a> </td>
  <td> When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean <code>ascending</code> argument.</td>
</tr>
<tr>
  <td> <b>join</b>(<i>otherDataset</i>, [<i>numPartitions</i>]) <a name="JoinLink"></a> </td>
  <td> When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
    Outer joins are supported through <code>leftOuterJoin</code>, <code>rightOuterJoin</code>, and <code>fullOuterJoin</code>.
  </td>
</tr>
<tr>
  <td> <b>cogroup</b>(<i>otherDataset</i>, [<i>numPartitions</i>]) <a name="CogroupLink"></a> </td>
  <td> When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) tuples. This operation is also called <code>groupWith</code>. </td>
</tr>
<tr>
  <td> <b>cartesian</b>(<i>otherDataset</i>) </td>
  <td> When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). </td>
</tr>
<tr>
  <td> <b>pipe</b>(<i>command</i>, <i>[envVars]</i>) </td>
  <td> Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the
    process's stdin and lines output to its stdout are returned as an RDD of strings. </td>
</tr>
<tr>
  <td> <b>coalesce</b>(<i>numPartitions</i>) <a name="CoalesceLink"></a> </td>
  <td> Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently
    after filtering down a large dataset. </td>
</tr>
<tr>
  <td> <b>repartition</b>(<i>numPartitions</i>) </td>
  <td> Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them.
    This always shuffles all data over the network. <a name="RepartitionLink"></a></td>
</tr>
<tr>
  <td> <b>repartitionAndSortWithinPartitions</b>(<i>partitioner</i>) <a name="Repartition2Link"></a></td>
  <td> Repartition the RDD according to the given partitioner and, within each resulting partition,
  sort records by their keys. This is more efficient than calling <code>repartition</code> and then sorting within
  each partition because it can push the sorting down into the shuffle machinery. </td>
</tr>
</tbody></table>
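
The note on `groupByKey` versus `reduceByKey` in the table is easier to see with numbers. The following plain-Python sketch is only a model of the idea, with hypothetical names (`partitions`, `combine_locally`): it counts how many records would have to cross the network if aggregation happens after the shuffle compared with combining inside each partition first.

```python
from collections import defaultdict

# Two partitions of (key, value) pairs, as they might sit on two machines.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def combine_locally(partition, func):
    """Pre-aggregate one partition so only one record per key leaves it."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

add = lambda x, y: x + y

# groupByKey-style: every record is shuffled, aggregation happens afterwards.
shuffled_group = [pair for partition in partitions for pair in partition]

# reduceByKey-style: combine inside each partition first, then shuffle the partial results.
partials = [combine_locally(p, add) for p in partitions]
shuffled_reduce = [pair for partial in partials for pair in partial.items()]

totals = defaultdict(int)
for key, value in shuffled_reduce:
    totals[key] += value

print(len(shuffled_group), len(shuffled_reduce), dict(totals))
```

In this toy case six records shrink to four before the shuffle; with many repeated keys per partition the saving grows accordingly.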

## Transformation Examples

### Map Transformation

*Return a new RDD by applying a function to each element of this RDD*

In [10]:
# use the already created RDD and convert all letters to uppercase
# using the map transformation and the upper function
stringRDD_uppercase= stringRDD.map(lambda x: x.upper())
# retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes)
stringRDD_uppercase.collect()

['SPARK IS AWESOME', 'SPARK IS COOL']

In [11]:
# implement the function 'alternate_char_upper' which converts every other letter to upper case
def alternate_char_upper(text):
    new_text= []
    for i, character in enumerate(text):
        if i % 2 == 0:
            new_text.append(character.upper())
        else:
            new_text.append(character)
    return ''.join(new_text)

# use the already created RDD and use the map transformation with the 'alternate_char_upper' function
stringRDD_alternate_uppercase= stringRDD.map(alternate_char_upper)
# retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes)
stringRDD_alternate_uppercase.collect()  

['SpArK Is aWeSoMe', 'SpArK Is cOoL']

### FlatMap Transformation

*Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results*

In [12]:
# use the already created RDD and split the string at every ' ' character
# using the flatMap transformation and the split function
flatMap_Split= stringRDD.flatMap(lambda x: x.split(" "))
# retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes)
flatMap_Split.collect()

['Spark', 'is', 'awesome', 'Spark', 'is', 'cool']

### Difference Between Map and FlatMap 

In [13]:
print("Split using Map transformation:")
# use the already created RDD (stringRDD) and split the string at every ' ' character
# using the map transformation and the split function
map_Split= stringRDD.map(lambda x: x.split(" "))
# retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes)
map_Split.collect()

Split using Map transformation:


[['Spark', 'is', 'awesome'], ['Spark', 'is', 'cool']]

In [14]:
print("Split using FlatMap transformation:")
# the FlatMap transformation on the RDD (stringRDD) is already defined (see two code cells above)
# so we just have to retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes)
flatMap_Split.collect()

Split using FlatMap transformation:


['Spark', 'is', 'awesome', 'Spark', 'is', 'cool']

Since the source RDD contains two strings, the map transformation returns two elements, each of which is a list of the words from one string. The flatMap transformation flattens those lists and returns a single sequence of six elements containing the words from both strings.
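
The same contrast can be reproduced with plain Python lists, which may help to fix the mental model. This is an analogy, not Spark code; `itertools.chain.from_iterable` plays the role of the flattening step.

```python
from itertools import chain

lines = ["Spark is awesome", "Spark is cool"]

# map-like: exactly one output element per input element (here: one list per line)
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-element results are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(len(mapped), mapped)
print(len(flat_mapped), flat_mapped)
```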

### Filter Transformation

*Return a new RDD containing only the elements that satisfy a predicate*

In [15]:
# filter all objects from RDD containing 'awesome'
awesomeLineRDD = stringRDD.filter(lambda x: "awesome" in x)
awesomeLineRDD.collect()

['Spark is awesome']

In [16]:
# filter all objects from RDD containing 'spark'
sparkLineRDD = stringRDD.filter(lambda x: "spark" in x.lower())
sparkLineRDD.collect()

['Spark is awesome', 'Spark is cool']

### Union Transformation

*Return a new RDD containing all items from two original RDDs. Duplicates are not culled.*

In [17]:
# create two new RDDs
rdd1 = sc.parallelize([1,2,3,4,5])
rdd2 = sc.parallelize([1,6,7,8])
# create a third RDD with 'union' transformation on RDD1 and RDD2
rdd3 = rdd1.union(rdd2)
rdd3.collect()

[1, 2, 3, 4, 5, 1, 6, 7, 8]

### Intersection Transformation

*Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.*

In [18]:
# create two new RDDs
rdd1 = sc.parallelize(["One", "Two", "Three"])
rdd2 = sc.parallelize(["two","One","threed","One"])
# create a third RDD with 'intersection' transformation on RDD1 and RDD2
rdd3 = rdd1.intersection(rdd2)
rdd3.collect()



['One']

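### Subtract Transformation

*Return each value in `self` that is not contained in `other`.*<br>
The DataFrame method of the same name returns the rows of one DataFrame that are not in another and is equivalent to EXCEPT DISTINCT in SQL. The RDD version shown here does not deduplicate: duplicates in `self` are kept unless the value also occurs in `other`.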

In [19]:
# create a new RDD 'words', use transformation flatMap and map ... all in one line
words = sc.parallelize(["The amazing thing about spark \
                is that it is very simple to learn"]).flatMap(lambda x: x.split(" ")).map(lambda c: c.lower())

# create a new RDD 'stopWords'
stopWords = sc.parallelize(["the", "it", "is", "to", "that", ''])

# use subtract transformation on words RDD.
realWords = words.subtract(stopWords)
realWords.collect()

['very', 'simple', 'amazing', 'thing', 'about', 'spark', 'learn']

### Distinct Transformation

*Return a new RDD containing distinct items from the original RDD (omitting all duplicates)*

In [20]:
# create new RDD 'duplicateValueRDD'
duplicateValueRDD = sc.parallelize(["one", 1,"two", 2, "three", "one", "two", 1, 2])
# use distinct transformation on RDD and collect action - in one line
duplicateValueRDD.distinct().collect()

['one', 1, 'two', 2, 'three']

###  Sample Transformation

*Return a new RDD containing a statistical sample of the original RDD*

In [21]:
# create a new RDD 'numbers'. The second parameter of parallelize is an optional integer
# that defines the number of partitions the data is split into.
numbers = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)

# The transformation 'sample' returns a sampled subset of the numbers RDD.
# The first parameter (here True) defines 'withReplacement'. The same element can be produced more than 
# once as the result of sample.
# The second parameter (here 0.3) defines the fraction of rows to generate. Note that it doesn’t guarantee 
# to provide the exact number of the fraction of records.
numbers.sample(True, 0.3).collect()

[5]
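
Why the sample size is not exact becomes clear from a simple model of sampling without replacement, in which every row is kept independently with probability `fraction`. The sketch below is plain Python and only an assumption-based illustration (`bernoulli_sample` is a hypothetical helper, not Spark's sampler, and it does not cover the with-replacement case used in the cell above).

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

numbers = list(range(1, 11))

# Repeat the sampling with many seeds and look at the resulting sample sizes.
sizes = [len(bernoulli_sample(numbers, 0.3, seed)) for seed in range(200)]

print(min(sizes), max(sizes), sum(sizes) / len(sizes))
```

The sizes scatter around 3 (30% of 10 rows) rather than hitting it every time, and fixing the seed makes a particular sample reproducible.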

### GroupBy Transformation

*Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.*

In [22]:
# create a new RDD 'x'
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
# groupBy all elements by the first letter of each element (which will be used as keys)
y = x.groupBy(lambda w: w[0])
# 'loop' through all elements of the 'y' RDD and print the objects
print([(k, list(v)) for (k, v) in y.collect()])

[('J', ['John', 'James']), ('F', ['Fred']), ('A', ['Anna'])]


### GroupByKey Transformation

*Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.*

In [23]:
# create new 'x' RDD with key,value pairs
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
# create RDD 'y' using the groupByKey transformation on RDD 'x'
y = x.groupByKey()
# print objects of RDD 'x'
print(x.collect())
# 'loop' through all elements of the 'y' RDD and print the objects
print(list((j[0], list(j[1])) for j in y.collect()))

[('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)]
[('B', [5, 4]), ('A', [3, 2, 1])]


### MapPartitions Transformation

*Return a new RDD by applying a function to each partition of this RDD*

In [24]:
# create new RDD with two partitions
x = sc.parallelize([1,2,3], 2)
# define the generator function 'f'. It sums all values of a partition
# and yields the sum followed by the number 42
def f(iterator): yield sum(iterator); yield 42
# use the transformation 'mapPartitions' with the function 'f'
y = x.mapPartitions(f)
# glom() gathers the elements of each partition into a list
print(x.glom().collect())
print(y.glom().collect())

[[1], [2, 3]]
[[1, 42], [5, 42]]


### MapPartitionsWithIndex Transformation

*Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.*

In [25]:
# create new RDD with two partitions
x = sc.parallelize([1,2,3], 2)
# define the generator function 'f'. It sums all values of a partition
# and yields the index of the original partition together with the sum as a tuple
def f(partitionIndex, iterator): yield (partitionIndex, sum(iterator))
# use the transformation 'mapPartitionsWithIndex' with the function 'f'
y = x.mapPartitionsWithIndex(f)
# glom() gathers the elements of each partition into a list
print(x.glom().collect())
print(y.glom().collect())

[[1], [2, 3]]
[[(0, 1)], [(1, 5)]]


### Join Transformation

*Return a new RDD containing all pairs of elements having the same key in the original RDDs*

`join(otherRDD, numPartitions=None)`


In [26]:
# create RDD 'x'
x = sc.parallelize([("a", 1), ("b", 2)])
# create RDD 'y'
y = sc.parallelize([("a", 3), ("a", 4), ("b", 5)])
# create RDD 'z' as a join result on keys from RDD 'x' and 'y'
z = x.join(y)
print(z.collect())

[('b', (2, 5)), ('a', (1, 3)), ('a', (1, 4))]
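
The pairing rule behind this output can be written down in a few lines of plain Python. This is a sketch of the semantics only, with a hypothetical helper `inner_join`; it says nothing about how Spark distributes the work.

```python
from collections import defaultdict

x = [("a", 1), ("b", 2)]
y = [("a", 3), ("a", 4), ("b", 5)]

def inner_join(left, right):
    """Pair every left value with every right value that shares its key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

z = inner_join(x, y)
print(sorted(z))
```

Key `a` appears once on the left and twice on the right, so it contributes two pairs; a key present on only one side would contribute none.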


### Coalesce Transformation

*Return a new RDD which is reduced to a smaller number of partitions*

`coalesce(numPartitions, shuffle=False)`

In [27]:
# create an RDD with three partitions
x = sc.parallelize([1, 2, 3, 4, 5], 3)
# reduce the number of partitions to two by using the coalesce transformation
y = x.coalesce(2)
print(x.glom().collect())
print(y.glom().collect())

[[1], [2, 3], [4, 5]]
[[1], [2, 3, 4, 5]]


### KeyBy Transformation

*Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-supplied function.*


In [28]:
# create a new RDD 'x'
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
# use the first letter of each element as the key for the element
y = x.keyBy(lambda w: w[0])
print(y.collect())

[('J', 'John'), ('F', 'Fred'), ('A', 'Anna'), ('J', 'James')]


### PartitionBy Transformation

*Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user supplied function*

`partitionBy(numPartitions, partitionFunc=portable_hash)`


In [29]:
# create an RDD with three partitions
x = sc.parallelize([('J','James'),('F','Fred'),('A','Anna'),('J','John')], 3)
# create a new RDD 'y' with only two partitions and place each item in partition 0
# if the first letter of its key is < 'H'. The item will be placed in partition 1 otherwise.
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
# glom() gathers the elements of each partition into a list
print(x.glom().collect())
print(y.glom().collect())

[[('J', 'James')], [('F', 'Fred')], [('A', 'Anna'), ('J', 'John')]]
[[('F', 'Fred'), ('A', 'Anna')], [('J', 'James'), ('J', 'John')]]
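
What the partition function does can be mimicked in plain Python. The helper `partition_by` below is hypothetical and only models the placement rule: the function is applied to the key of each pair, and its result (taken modulo the number of partitions) selects the target partition.

```python
def partition_by(pairs, num_partitions, partition_func):
    """Place each (key, value) pair into the bucket chosen by partition_func(key)."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets

pairs = [("J", "James"), ("F", "Fred"), ("A", "Anna"), ("J", "John")]
buckets = partition_by(pairs, 2, lambda key: 0 if key < "H" else 1)
print(buckets)
```

All pairs with the same key end up in the same bucket, which is the property that key-based operations rely on.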


### Zip Transformation

*Return a new RDD containing pairs whose key is the item in the original RDD, and whose
value is that item’s corresponding element (same partition, same index) in a second RDD*

`zip(otherRDD)`

In [30]:
# create RDD 'x'
x = sc.parallelize([1, 2, 3])
# create RDD 'y' using the transformation map on RDD 'x'
y = x.map(lambda n:n*n)
# create RDD 'z' using the transformation zip on RDDs 'x' and 'y'
z = x.zip(y)
print(x.collect())
print(y.collect())
print(z.collect())

[1, 2, 3]
[1, 4, 9]
[(1, 1), (2, 4), (3, 9)]


## Actions

<table class="table">
<tbody><tr><th>Action</th><th>Meaning</th></tr>
<tr>
  <td> <b>reduce</b>(<i>func</i>) </td>
  <td> Aggregate the elements of the dataset using a function <i>func</i> (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. </td>
</tr>
<tr>
  <td> <b>collect</b>() </td>
  <td> Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. </td>
</tr>
<tr>
  <td> <b>count</b>() </td>
  <td> Return the number of elements in the dataset. </td>
</tr>
<tr>
  <td> <b>first</b>() </td>
  <td> Return the first element of the dataset (similar to take(1)). </td>
</tr>
<tr>
  <td> <b>take</b>(<i>n</i>) </td>
  <td> Return an array with the first <i>n</i> elements of the dataset. </td>
</tr>
<tr>
  <td> <b>takeSample</b>(<i>withReplacement</i>, <i>num</i>, [<i>seed</i>]) </td>
  <td> Return an array with a random sample of <i>num</i> elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.</td>
</tr>
<tr>
  <td> <b>takeOrdered</b>(<i>n</i>, <i>[ordering]</i>) </td>
  <td> Return the first <i>n</i> elements of the RDD using either their natural order or a custom comparator. </td>
</tr>
<tr>
  <td> <b>saveAsTextFile</b>(<i>path</i>) </td>
  <td> Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. </td>
</tr>
<tr>
  <td> <b>saveAsSequenceFile</b>(<i>path</i>) <br> (Java and Scala) </td>
  <td> Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also
   available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). </td>
</tr>
<tr>
  <td> <b>saveAsObjectFile</b>(<i>path</i>) <br> (Java and Scala) </td>
  <td> Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using
    <code>SparkContext.objectFile()</code>. </td>
</tr>
<tr>
  <td> <b>countByKey</b>() <a name="CountByLink"></a> </td>
  <td> Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key. </td>
</tr>
<tr>
  <td> <b>foreach</b>(<i>func</i>) </td>
  <td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
  <br><b>Note</b>: modifying variables other than Accumulators outside of the <code>foreach()</code> may result in undefined behavior. See Understanding closures for more details.</td>
</tr>
</tbody></table>
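
The requirement that the function passed to `reduce` be commutative and associative can be checked with a small plain-Python model. The helper `reduce_partitioned` is hypothetical: it reduces each partition separately and then reduces the partial results, which is the general shape of a parallel reduction.

```python
from functools import reduce

def reduce_partitioned(partitions, func):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# The same four values, split into partitions in two different ways.
values_one_way = [[1, 2], [3, 4]]
values_other_way = [[1], [2, 3, 4]]

add = lambda a, b: a + b
subtract = lambda a, b: a - b

# Addition is commutative and associative: the partitioning does not matter.
print(reduce_partitioned(values_one_way, add), reduce_partitioned(values_other_way, add))

# Subtraction is neither: the result depends on how the data was split.
print(reduce_partitioned(values_one_way, subtract), reduce_partitioned(values_other_way, subtract))
```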

### GetNumPartitions Action

*Return the number of partitions in RDD*

In [31]:
# create RDD 'x' with two partitions
x = sc.parallelize([1,2,3], 2)
# get the number of partitions of RDD 'x' - the return value is of type int
y = x.getNumPartitions()
# glom() gathers the elements of each partition into a list
print(x.glom().collect())
print(y)

[[1], [2, 3]]
2


### Collect Action

*Return all items in the RDD to the driver in a single list*

In [32]:
# create RDD 'x' with two partitions
x = sc.parallelize([1,2,3], 2)
# create list 'y' (not an RDD: y is of type 'list')
y = x.collect()
print(x.glom().collect())
print(y)

[[1], [2, 3]]
[1, 2, 3]


### Count Action

*Return the number of elements in this RDD.*

In [33]:
# create new RDD 'numberRDD' with two partitions
numberRDD = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
# the action count returns the number of elements in the dataset - independent of the number of partitions
numberRDD.count()

10

### First Action

*Return the first element in this RDD.*

In [34]:
# create new RDD 'numberRDD' with two partitions
numberRDD = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
# return the first element - the order of the elements within the RDD is not affected by the partitioning
numberRDD.first()

1

### Take Action

*Take the first num elements of the RDD.*

In [35]:
# create new RDD 'numberRDD' with two partitions
numberRDD = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
# return the first FOUR elements - the order of the elements within the RDD is not affected by the partitioning
numberRDD.take(4)

[1, 2, 3, 4]

### Reduce Action

*Aggregate all the elements of the RDD by applying a user function pairwise to elements and partial results, and returns a result to the driver*

In [36]:
# create new RDD 'x'
x = sc.parallelize([1,2,3,4])
# apply function pairwise (a,b) to elements and return the sum - return type is integer
y = x.reduce(lambda a,b: a+b)
print(x.collect())
print(y)

[1, 2, 3, 4]
10


### Aggregate Action

Since RDDs are partitioned, aggregate takes full advantage of this by first aggregating the elements within each partition and then aggregating the results of all partitions to get the final result.

Aggregate all the elements of the RDD by:
- applying a user function to combine elements with user-supplied objects,
- then combining those user-defined results via a second user function,
- and finally returning a result to the driver.

In [37]:
# The seqOp operator is used to accumulate the results of each partition and stores the running 
# accumulated result to data.
seqOp = lambda data, item: (data[0] + [item], data[1] + item)
# The combOp is used to combine the results of all partitions
combOp = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1])
# create new RDD 'x'
x = sc.parallelize([1,2,3,4])
# aggregate all elements of the RDD
y = x.aggregate(([], 0), seqOp, combOp)
print(y)

([1, 2, 3, 4], 10)
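
The two phases can be traced by hand in plain Python. The sketch below reuses the same two operators under different names and assumes, purely for illustration, that the data is split into the two partitions `[1, 2]` and `[3, 4]`; the real split depends on the cluster configuration.

```python
from functools import reduce
import copy

seq_op = lambda data, item: (data[0] + [item], data[1] + item)
comb_op = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1])
zero = ([], 0)

partitions = [[1, 2], [3, 4]]  # assumed split of [1, 2, 3, 4] into two partitions

# Step 1: fold each partition separately, each starting from its own copy of the zero value.
partials = [reduce(seq_op, part, copy.deepcopy(zero)) for part in partitions]

# Step 2: merge the per-partition results, again starting from the zero value.
result = reduce(comb_op, partials, copy.deepcopy(zero))

print(partials)
print(result)
```

Note that the zero value is used once per partition and once more in the merge step, so it should be neutral with respect to both operators.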


### Max Action

*Return the maximum item in the RDD*

In [38]:
# create new RDD 'x'
x = sc.parallelize([2,4,1])
# return the maximum value from the dataset
y = x.max()
print(x.collect())
print(y)

[2, 4, 1]
4


## Stop the Spark Context

In [39]:
# stop the underlying SparkContext.
sc.stop()

---