# Resilient RDDs

### Review

So far we have seen two of the main components of Spark:

1. Spark primarily keeps data in memory, and
2. Spark distributes a dataset across executors, which query and operate on that data in parallel.  

> <img src="./cluster_executor.jpg" width='100%'>

* How does Spark recover data when a node goes down?

The answer lies in the *resilient* component of our Resilient Distributed Datasets. 

### Fault Tolerance?



> Even though Spark keeps its data in memory, we still want our Spark cluster to be *fault tolerant*.  
    
* Fault tolerant: Even if one of our nodes goes down, we do not want the data on that node to be lost.

### The normal approach: copy the data

> Normally, distributed databases achieve this by copying partitions of the data to multiple nodes.  This way if one node goes down, there is still a backup.

<img src="./copied_data.jpg" width="60%">

> In the diagram above we can see that the `D` movies are copied to two different nodes.

Problems with copying:

1. Copying takes up a significant amount of storage space, and 
2. It requires sending data over the cluster's network, where bandwidth is often limited.

> The diagram below shows the copying process from one node to the other.  With a lot of data, and narrow bandwidth, this can be a slow process.

> <img src="./network_slow.jpg" width="60%">
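To get a feel for both costs, here is a rough back-of-the-envelope calculation.  Every figure in it (the dataset size, the number of copies, the network speed) is a hypothetical value chosen only to show the arithmetic, not a measurement from a real cluster.

```python
# All figures below are hypothetical, chosen only to illustrate the arithmetic.
dataset_gb = 500          # size of the dataset in gigabytes
replication_factor = 2    # number of nodes holding each partition
bandwidth_gbps = 1        # network speed in gigabits per second

# Problem 1: total storage used across the cluster.
total_storage_gb = dataset_gb * replication_factor

# Problem 2: the extra copies must travel over the network.
copied_gb = dataset_gb * (replication_factor - 1)

# 1 gigabyte = 8 gigabits, so seconds = gigabits / (gigabits per second).
transfer_seconds = copied_gb * 8 / bandwidth_gbps

print(f"storage used: {total_storage_gb} GB")
print(f"data sent over the network: {copied_gb} GB")
print(f"time to copy: {transfer_seconds / 3600:.1f} hours")
```

Under these assumed numbers, keeping one backup doubles the storage and ties up the network for over an hour.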

### Spark's Approach

Instead of copying the data from one node to another, Spark keeps track, in the driver node, of all of the steps needed to recreate our dataset.

* If a node goes down, Spark can simply reapply those steps.
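The idea of recording steps and replaying them can be sketched in plain Python.  The `ToyDataset` class below is a hypothetical stand-in written for this lesson, not part of Spark; it only mimics the shape of `map` and `filter` so we can see how a lost result is rebuilt from the recorded steps.

```python
class ToyDataset:
    """A toy stand-in for an RDD: the source data plus the recorded steps."""

    def __init__(self, source, steps=()):
        self.source = source       # where the data originally came from
        self.steps = tuple(steps)  # the recorded steps needed to recreate the data

    def map(self, fn):
        # Record the step and return a new dataset; nothing is computed yet.
        return ToyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return ToyDataset(self.source, self.steps + (("filter", fn),))

    def compute(self):
        """Replay every recorded step, in order, against the source data."""
        records = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                records = [fn(record) for record in records]
            else:
                records = [record for record in records if fn(record)]
        return records


movies = ToyDataset(['dark knight', 'dunkirk', 'pulp fiction', 'avatar'])
d_movies = movies.filter(lambda movie: movie[0] == 'd').map(lambda movie: movie.title())

cached = d_movies.compute()     # the result, held in memory on a "node"
cached = None                   # the node goes down and the result is lost
recovered = d_movies.compute()  # reapply the recorded steps to rebuild it
print(recovered)
```

Notice that nothing was copied to a second node.  All we kept were the two recorded steps, which are tiny compared to the data itself.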

### The Consequences

1. Only Coarse Transformations

Because keeping track of every tiny change that happens to a dataset would be expensive, Spark limits the kinds of transformations we can apply.  

* When we apply changes, we must apply them to the *entire* RDD.  For example, the `map` call below capitalizes every record it receives, rather than updating a single record.

In [1]:
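from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse the notebook's SparkContext if one already exists
movies = sc.parallelize(['dunkirk', 'minari'])  # an RDD, so that filter and map are available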

In [None]:
movies.filter(lambda movie: movie == 'dunkirk').map(lambda movie: movie.capitalize())

2. Read Only

* When we apply a transformation to a dataset, we are actually creating a new RDD.  Again, we can see that in the DAG.  

So we are never updating an RDD.  

> Our RDDs are **read only**, and whenever we filter or map through a dataset, we are creating a new RDD in the process.  
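As a loose analogy in plain Python (this is not Spark code), a tuple behaves the same way: we cannot update it in place, so any transformation has to build a new object and leave the original alone.

```python
# A tuple is read only, much like an RDD. This is an analogy, not Spark code.
movies = ('dark knight', 'dunkirk', 'pulp fiction', 'avatar')

# Transforming the records builds a brand new tuple...
titled = tuple(movie.title() for movie in movies)

# ...while the original is left exactly as it was.
print(movies)
print(titled)

# An in-place update is refused outright.
try:
    movies[0] = 'Dark Knight'
except TypeError as error:
    print(f"update refused: {error}")
```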

In [21]:
rdd.map(lambda movie: movie.title()).collect()

['Dark Knight', 'Dunkirk', 'Pulp Fiction', 'Avatar']

With our RDDs, we can also go through every record and select only those that begin with the letter `d`.

In [31]:
rdd.filter(lambda movie: movie[0] == 'd').collect()

['dark knight', 'dunkirk']

> But this is still considered an operation on the entire dataset because we search through every record.

Whether we use `map` or `filter`, each step applies to *every* record.  These types of transformations are called **coarse grained transformations**, and they are the only kinds of transformations that Spark allows on an RDD.

In [11]:
rdd.filter(lambda movie: movie[0] == 'd').map(lambda movie: movie.title()).collect()

['Dark Knight', 'Dunkirk']

<img src="./filter_map.jpg" width="40%">
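To see why coarse grained transformations are cheaper to keep track of, we can compare two ways of logging the same change in plain Python.  This is only a sketch: the dataset, the log formats, and the `replay` helper are all hypothetical and are not how Spark stores its records internally.

```python
records = [f"movie {number}" for number in range(10_000)]  # hypothetical dataset

# Fine grained: one log entry for every record that changes.
fine_log = [("set", index, record.title()) for index, record in enumerate(records)]

# Coarse grained: one log entry describing an operation on the whole dataset.
coarse_log = [("map", str.title)]


def replay(source, log):
    """Rebuild a dataset by reapplying each logged whole-dataset operation."""
    result = list(source)
    for kind, fn in log:
        if kind == "map":
            result = [fn(record) for record in result]
        elif kind == "filter":
            result = [record for record in result if fn(record)]
    return result


rebuilt = replay(records, coarse_log)

print(len(fine_log))    # grows with the number of records
print(len(coarse_log))  # stays the same however large the dataset is
print(rebuilt[:2])
```

Both logs describe the same change, but the coarse grained log stays one entry long no matter how many records we have.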

### Recap

* Spark's approach for fault tolerance
    * Keep record of transformations needed to recreate data
        * Only coarse grained transformations (filter, map)
        * And RDDs are read only

### Resources

* [Spark Debugging Minibook](https://cs.famaf.unc.edu.ar/~damian/tmp/bib/Mini%20eBook%20-%20Apache%20Spark%20Monitoring%20and%20Debugging.pdf)

* [Presenting RDDs](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf)

* [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)

* [RDDs Simplified](https://vishnuviswanath.com/spark_rdd)

* [Databricks RDDs](https://databricks.com/glossary/what-is-rdd)

* [Databricks best practices](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/index.html)