# Lazy RDDs

### Introduction

So we saw in the last lesson that Spark achieves fault tolerance by keeping a record of the transformations needed to recreate our data.

<img src="./filter_map.jpg" width="40%">

These steps are an example of a directed acyclic graph (DAG): each step leads to the next, never looping back, until we arrive at the resulting RDD.  It turns out that recording the steps needed to transform the data is useful not just for fault tolerance, but also because it allows Spark to determine efficient ways to perform the prescribed steps.  We'll learn more about how Spark achieves this efficiency in this lesson.

### Looking Under the Hood

Let's again connect to our Spark cluster.

In [1]:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("films").setMaster("local[2]")
sc = SparkContext.getOrCreate(conf=conf)

And then, let's again create an RDD from our movie records.

In [2]:
movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']

In [5]:
movies_rdd = sc.parallelize(movies)
movies_rdd

ParallelCollectionRDD[2] at readRDDFromFile at PythonRDD.scala:274

And then let's capitalize the movies, and select the movies that begin with `d`.

In [7]:
movies_rdd.map(lambda movie: movie.title()).filter(lambda movie: movie[0] == 'd' or movie[0] == 'D').collect()

['Dark Knight', 'Dunkirk']

As written, the code above tells Spark to first capitalize the words of each element, and then select the movies that begin with the letter `d`.

<img src="./filter_map.jpg" width="40%">

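But imagine if we reversed the steps, so that we first selected the records that begin with `d`, and then capitalized only the remaining elements.  This second way achieves the same results with less work, as we only map over the selected records.  This is the kind of saving that becomes possible when Spark sees the whole chain of steps before running any of them.  (With plain RDDs, Spark keeps `map` and `filter` in the order we wrote them, but it does fuse them into a single pass over each partition.)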
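To make the difference in work concrete, we can count how often the capitalizing step runs under each ordering. This is a minimal sketch in plain Python, with a list standing in for the RDD; the helper names `title_counted` and `starts_with_d` are made up for this illustration and are not part of Spark.

```python
# Count how many times the title-casing step runs under each ordering.
# A plain Python list stands in for the RDD here (illustration only).
movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']

calls = {'map_first': 0, 'filter_first': 0}

def title_counted(movie, key):
    """Title-case a movie name and record that the step ran."""
    calls[key] += 1
    return movie.title()

def starts_with_d(movie):
    """True for names beginning with 'd' or 'D'."""
    return movie[0].lower() == 'd'

# Ordering 1: map every record, then filter.
titled = [title_counted(movie, 'map_first') for movie in movies]
map_first = [t for t in titled if starts_with_d(t)]

# Ordering 2: filter first, then map only the surviving records.
filter_first = [title_counted(movie, 'filter_first')
                for movie in movies if starts_with_d(movie)]

print(map_first == filter_first)  # both orderings give the same movies
print(calls)                      # title-casing ran 4 times vs. 2 times
```

With only four movies the saving is tiny, but the same ratio applies to millions of records.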

To do this, Spark lets us chain methods, works out how to perform the calls efficiently, and only takes action when it needs to.  Let's see this.

### A little experiment

If we run the code below, notice that no data is returned.  We only get back a reference to a new RDD.

In [12]:
movies_rdd.map(lambda movie: movie.title())

PythonRDD[8] at RDD at PythonRDD.scala:53

And even if we chain the `map` and `filter` methods, still no data is returned.

In [9]:
movies_rdd.map(lambda movie: movie.title()).filter(lambda movie: movie[0] == 'd' or movie[0] == 'D')

PythonRDD[5] at RDD at PythonRDD.scala:53

It's only when we add the `collect` function on the end that some data is returned.

In [28]:
movies_rdd.filter(lambda movie: movie[0] == 'd' or movie[0] == 'D').map(lambda movie: movie.title()).collect()

['Dark Knight', 'Dunkirk']

No data was returned when we ran the `map` and `filter` functions because, when we executed only those functions, Spark did not actually act on the data.  It only recorded the steps it needed to perform.  Then in the third cell, with `collect`, we finally did act on the data.  We told Spark that we want to both transform and filter the data, and then return all of the results.

Spark then took the transformations it had logged, planned how to run them efficiently, and returned the results.

> <img src="./filter_workflow.jpg" width="60%">

> So above we can see that we start with the original RDD, and as we call `map` and `filter`, Spark simply logs that we'll need to perform those steps.  It's only when we call `collect` that Spark determines the best way to perform the steps and then executes them accordingly.
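One way to picture this record-now, run-later behavior is a small class that only stores the requested steps until `collect` is called. This is a toy model in plain Python, not how Spark is implemented; the class `LazyDataset` and the `seen` list are hypothetical names used only for this sketch.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records steps, runs them only on collect()."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded (kind, function) pairs

    def map(self, fn):
        # Record the step and return a new dataset; no data is touched.
        return LazyDataset(self._data, self._steps + (('map', fn),))

    def filter(self, fn):
        # Same idea: only the plan grows.
        return LazyDataset(self._data, self._steps + (('filter', fn),))

    def collect(self):
        # Only now is the recorded plan applied. The iterators are chained,
        # so each record flows through every step in a single pass.
        stream = iter(self._data)
        for kind, fn in self._steps:
            stream = map(fn, stream) if kind == 'map' else filter(fn, stream)
        return list(stream)

    def __repr__(self):
        return 'LazyDataset(steps=%s)' % [kind for kind, _ in self._steps]


movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']
seen = []  # which movies the map function has actually touched

def title(movie):
    seen.append(movie)
    return movie.title()

plan = LazyDataset(movies).map(title).filter(lambda m: m[0] == 'D')
print(plan)                       # shows only the recorded steps
seen_before_collect = list(seen)  # still empty: nothing has run yet
result = plan.collect()           # the "action" that triggers the work
print(seen_before_collect, result)
```

Just like the RDD cells above, building `plan` gives back an object describing the steps, and the data only appears once `collect` runs.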

### Transformations and Actions

So above we can see that the functions `map` and `filter` do not actually perform any work on our data.  Instead, work is only kicked off when we call the `collect` method.

In Spark, the methods that kick off tasks and return results are called **actions** (e.g. `collect`).  The methods like `map` and `filter` that do not are called **transformations**.

* Transformations

So we already saw that transformations include `map` and `filter`.  Let's see a few more.

> Sample

Sample allows us to take a random sample from our dataset.  Notice that it does not return any data.

In [24]:
movies_rdd.sample(fraction = .2, withReplacement = True)

PythonRDD[16] at RDD at PythonRDD.scala:53
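Note that `fraction` is not an exact count. The PySpark documentation describes it as the expected size of the sample as a fraction of the RDD's size, so the number of records we get back can vary. A rough way to see why, in plain Python rather than Spark, is to keep each record independently with probability `fraction`. The helper `bernoulli_sample` is hypothetical, and it samples without replacement, which is simpler than the `withReplacement = True` call above.

```python
import random

def bernoulli_sample(records, fraction, seed):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']

# The sample size changes from seed to seed; only its average is
# close to fraction * len(movies).
sizes = [len(bernoulli_sample(movies, 0.5, seed)) for seed in range(1000)]
average_size = sum(sizes) / len(sizes)
print(min(sizes), max(sizes), average_size)

# Using the same seed twice gives the same sample.
repeatable = (bernoulli_sample(movies, 0.5, seed=7)
              == bernoulli_sample(movies, 0.5, seed=7))
print(repeatable)
```

In PySpark itself, `sample` also accepts an optional `seed` argument if we want a repeatable sample.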

> Distinct

In [25]:
movies_rdd.distinct()

PythonRDD[21] at RDD at PythonRDD.scala:53

So one way to think about transformations is that they have to comb through our data, for example to filter it or to remove duplicates.

* Actions

Actions are a bit more about the end result.  So far we've learned about `collect`, which returns all of the results of a series of transformations.  

> Take

If we want to limit our results to a subset, we can use the function `take`.

In [27]:
movies_rdd.distinct().take(2)

['dark knight', 'dunkirk']

The function `take` just limits the results to a specified number.  So here, `take` just returns the first two results.  It's similar to `LIMIT` in SQL.
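Part of the appeal of `take` is that it does not need to process everything. The PySpark documentation describes it as scanning one partition first and then estimating how many more partitions it needs. The sketch below shows the same idea in plain Python, where a lazy pipeline stops as soon as enough results exist. The `take` helper and the `processed` list here are hypothetical stand-ins, not Spark's implementation.

```python
from itertools import islice

movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']
processed = []  # which movies were actually title-cased

def title(movie):
    processed.append(movie)
    return movie.title()

def take(stream, n):
    """Return the first n items, pulling no more than n from the stream."""
    return list(islice(stream, n))

# map() is lazy in Python 3, so only the records we ask for are processed.
first_two = take(map(title, movies), 2)
print(first_two)
print(processed)  # the last two movies were never touched
```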

> Count

In [29]:
movies_rdd.distinct().count()

4

Count simply counts the results.

So we can see that our actions have a bit of finality to them.  To get a better sense of the transformation and action functions, it's worth looking at the [documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations).

### Resources

[Berkeley White Paper](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf)

[Pyspark RDD Methods blog](https://www.nbshare.io/notebook/403283317/How-To-Analyze-Data-Using-Pyspark-RDD/)