# Efficient RDDs

### Looking Under the Hood

In [1]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("films").setMaster("local[2]")
sc = SparkContext.getOrCreate(conf=conf)

In [2]:
movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']

In [3]:
movies_rdd = sc.parallelize(movies)
movies_rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

Next, let's capitalize the movie titles. Later on, we'll also select the movies that begin with `d`.

In [4]:
movies_rdd.map(lambda movie: movie.title()).collect()

['Dark Knight', 'Dunkirk', 'Pulp Fiction', 'Avatar']

In [5]:
sc

> <img src="./parallel.png" width="100%">

Now, what happens if we add a `filter` and ask for only a single record with `take(1)`?

In [11]:
movies_rdd.map(lambda movie: movie.title()).filter(lambda movie: movie == "Dunkirk").take(1)

['Dunkirk']
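Because we only asked for one record, not every partition has to be processed. The sketch below is a plain-Python analogy, not Spark's scheduler: `take_sketch` is a hypothetical helper, and the split of our list into two partitions is an assumption about how `local[2]` might divide the data.

```python
from itertools import islice

movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']

# Assumed split into two partitions (hypothetical, for illustration only).
partitions = [movies[:2], movies[2:]]


def take_sketch(partitions, n, transform, predicate):
    """Return up to n matching records, scanning as few partitions as possible."""
    results = []
    partitions_scanned = []
    for index, partition in enumerate(partitions):
        partitions_scanned.append(index)
        # Generators are lazy: records are transformed only as they are pulled.
        matches = (record for record in map(transform, partition) if predicate(record))
        results.extend(islice(matches, n - len(results)))
        if len(results) >= n:
            break  # we have enough records, so later partitions are skipped
    return results, partitions_scanned


first, partitions_scanned = take_sketch(
    partitions, 1, str.title, lambda movie: movie == "Dunkirk"
)
print(first, partitions_scanned)
```

In this sketch only the first partition is scanned, since it already contains the one match we asked for.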

> <img src="./individual_task.png" width="80%">

### A little experiment

In [12]:
movies_rdd.map(lambda movie: movie.title())

PythonRDD[8] at RDD at PythonRDD.scala:53

Notice that calling `map` on its own gave us back an RDD, not our data. And even if we chain the `map` and `filter` methods, we still get back an RDD rather than any records.

In [8]:
movies_rdd.map(lambda movie: movie.title()).filter(lambda movie: movie[0] == 'D')

PythonRDD[3] at RDD at PythonRDD.scala:53

It's only when we end the chain with a method like `collect` that Spark runs the steps and returns our records.

In [9]:
movies_rdd.filter(lambda movie: movie[0] == 'd').map(lambda movie: movie.title()).collect()

['Dark Knight', 'Dunkirk']
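
To make the idea concrete, here is a minimal plain-Python sketch of a pipeline that records its steps and only runs them when `collect` is called. `LazyPipeline` is a hypothetical class written for illustration. It is not how Spark is implemented, it only mimics the behavior we saw in the experiment.

```python
class LazyPipeline:
    """Hypothetical stand-in for an RDD: stores steps, runs them only on collect()."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, func):
        # Record the step and return a new pipeline; no work happens here.
        return LazyPipeline(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Only now are the recorded steps applied to the data.
        records = iter(self._data)
        for kind, func in self._steps:
            records = map(func, records) if kind == "map" else filter(func, records)
        return list(records)


movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']
title_calls = []  # log of every record that actually gets capitalized


def to_title(movie):
    title_calls.append(movie)
    return movie.title()


pipeline = LazyPipeline(movies).filter(lambda movie: movie[0] == 'd').map(to_title)
calls_before_collect = len(title_calls)  # nothing has run yet

result = pipeline.collect()
calls_after_collect = len(title_calls)   # only the filtered records were capitalized
print(calls_before_collect, result, calls_after_collect)
```

Placing the `filter` before the `map` also means that only the two matching titles are capitalized, rather than all four.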

### Transformations and Actions

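* **actions** (e.g. `collect`, `take`): these run the computation and return results to us.

* **transformations** (e.g. `map`, `filter`): these return a new RDD, and nothing is computed until an action is called.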

1. Transformations

* sample

The `sample` method allows us to take a random sample from our dataset.  

In [14]:
movies_rdd.sample(fraction = .5, withReplacement = True).collect()

['dunkirk', 'dunkirk', 'pulp fiction']
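
Notice that `fraction` is an expected proportion, not a guaranteed sample size, so the number of records we get back can vary from run to run. The plain-Python sketch below shows one way per-record sampling can behave like this. It is an analogy under our own assumptions, not Spark's sampler: `sample_sketch` and `poisson_draw` are hypothetical helpers, and we assume that sampling without replacement keeps each record with probability `fraction`, while sampling with replacement draws a number of copies whose average is `fraction`.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw a Poisson-distributed integer using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_sketch(items, fraction, with_replacement, seed=None):
    """Hypothetical per-record sampler; the result size is random, not fixed."""
    rng = random.Random(seed)
    sampled = []
    for item in items:
        if with_replacement:
            copies = poisson_draw(rng, fraction)  # a record may appear 0, 1, 2, ... times
        else:
            copies = 1 if rng.random() < fraction else 0  # keep or drop each record
        sampled.extend([item] * copies)
    return sampled


movies = ['dark knight', 'dunkirk', 'pulp fiction', 'avatar']
with_repeats = sample_sketch(movies, 0.5, with_replacement=True, seed=7)
without_repeats = sample_sketch(movies, 0.5, with_replacement=False, seed=7)
print(with_repeats, without_repeats)
```

Passing a seed makes the sketch reproducible. `sample` in PySpark also accepts an optional `seed` argument for the same purpose.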

* distinct

In [15]:
movies_rdd.distinct()

PythonRDD[15] at RDD at PythonRDD.scala:53

> `distinct` keeps only the unique records. Notice that, like the other transformations, it returns an RDD rather than data.

2. Actions

Actions produce the end result: they run the chain of transformations and return data to us. So far we've learned about `collect`, which returns *all* of the results of a series of transformations.

* Take

We've also seen `take`, which limits our results to a subset.

In [16]:
movies_rdd.distinct().take(2)

['dark knight', 'dunkirk']

> So `take` is similar to the `LIMIT` clause in SQL. Notice that here our records are returned.

* Count

In [18]:
movies_rdd.distinct().count()

4

`count` simply returns the number of records.

### Summary

So we can see that transformations only describe the work to be done, while actions have a bit of finality to them: they run that work and return a result. To get a better sense of the transformation and action functions, it's worth looking at the [documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations).

### Resources

[Berkeley White Paper](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf)

[PySpark RDD Methods blog](https://www.nbshare.io/notebook/403283317/How-To-Analyze-Data-Using-Pyspark-RDD/)


[Databricks Debugging Spark Streaming](https://docs.databricks.com/spark/latest/rdd-streaming/debugging-streaming-applications.html)