# Resilient distributed datasets
Resilient distributed datasets, or *RDD*s, are the fundamental building blocks of PySpark.

## What is an RDD?
An RDD is essentially an unordered collection of objects. It can be thought of as
- a mathematical *set* (except that duplicates are allowed),
- a Python list of objects,
- or something similar to a JSON document.

|![rdd-idea](../img/rdd-idea.png)|![math-set](../img/math-set.png)|![json-ex](../img/json-ex.png)|
|:-:|:-:|:-:|
| |Fig. Collection of objects| |

In [None]:
# !pip install pyspark

In [None]:
# set up an RDD
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rddExample').getOrCreate()

# list promoted to RDD


## Why RDD if we have dataframes?
- If the data at hand are more *free-form*, an RDD can store objects of various types side by side.
- A dataframe, by comparison, will *attempt* (and fail) to find a common denominator, that is, a single schema, to fit the data above.
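To make "free-form" concrete, the following plain-Python sketch (no Spark required) builds the kind of mixed list an RDD could hold and lists the distinct types in it. The `records` list and its values are invented for illustration only.

```python
# Hypothetical free-form records; the values are made up for illustration.
records = [1, "two", 3.0, {"four": 4}, [5], None]

# Collect the distinct Python type names present in the collection.
type_names = sorted({type(r).__name__ for r in records})
print(type_names)
```

In PySpark, a list like this can be promoted to an RDD with `spark.sparkContext.parallelize(records)`, whereas a dataframe column would need one type that fits every element.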

## Main ingredients of RDD manipulation
We cover three main building blocks for working with RDDs, all of which follow the MapReduce model.

Each of the following building blocks (*functions*) takes a function as its argument:
- `map()`
- `filter()`
- `reduce()`

### `map` through an example
`map()` applies the given function to each element of the RDD.

In [None]:
# simple mapping function


In [None]:
# mapping to the RDD

**Quick note:** 
- Why did the line throw an error?
- When was the error thrown?
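A plain-Python analogue can help with both questions. Python's built-in `map` is lazy, much like an RDD transformation: building the mapping does nothing, and a bad element only causes a failure once the results are requested. The `add_one` function and the `data` list here are assumptions made for this sketch, not the notebook's original cells.

```python
def add_one(value):
    """Return value + 1; raises TypeError for input that cannot be added to 1."""
    return value + 1

# Hypothetical data containing one non-numeric element.
data = [1, 2, "three", 4]

# Building the mapping is lazy: nothing is computed yet, so no error here.
mapped = map(add_one, data)

# The error only surfaces when the results are actually requested.
try:
    results = list(mapped)
    failed = False
except TypeError as err:
    failed = True
    print(f"Failed during evaluation: {err}")
```

With an RDD the pattern is the same: `rdd.map(add_one)` returns immediately, and the failure typically appears only when an action such as `collect()` forces the computation.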

**A potential fix:**

In [None]:
from py4j.protocol import Py4JJavaError

def safer_add_one(value):
    pass # ...
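One possible body for such a function is sketched below in plain Python. It is an assumption, not the notebook's original solution: the name `safer_add_one_sketch` is hypothetical, and mapping bad elements to `None` (rather than dropping them or re-raising) is just one of several reasonable policies.

```python
def safer_add_one_sketch(value):
    """Add one to the input; return None when the addition is not defined."""
    try:
        return value + 1
    except TypeError:
        # Assumption: bad elements become None instead of aborting the whole job.
        return None

print([safer_add_one_sketch(v) for v in [1, 2, "three", 4.5]])
```

Because the exception is handled inside the function, each element is dealt with where it is processed, and a single bad record no longer fails the entire mapping.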

**Lesson here**:
- PySpark does not warn you about the contents of an RDD.
- As developers, we are responsible for deciding how to handle whatever data ends up in an RDD.

### `filter` through an example
`filter()` takes a function that returns `True` or `False` for each element and keeps only the elements for which it returns `True`.
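As a minimal sketch of the idea, the plain-Python example below uses the built-in `filter`, which follows the same contract. The predicate `is_number` and the sample `data` are hypothetical names chosen for illustration.

```python
def is_number(value):
    """Return True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

# Hypothetical mixed data.
data = [1, "two", 3.0, None, 4]

# Keep only the elements for which the predicate returns True.
kept = list(filter(is_number, data))
print(kept)
```

On an RDD, the corresponding call would be `rdd.filter(is_number)`, followed by an action such as `collect()` to retrieve the surviving elements.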

In [None]:
# writing a filtering example


### `reduce` through an example
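`reduce()` summarizes the RDD by repeatedly applying the given two-argument function to pairs of elements until a single value remains.
- similar to an aggregation over a whole dataframe column; the per-key counterpart of `groupby()` is `reduceByKey()`.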

In [None]:
# create a new RDD

# example using reduce


**Warnings about `reduce()`**
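- What functions are reasonable for `reduce()`? Partial results are computed per partition and then combined, so the function should be a
- *commutative operation*: `f(a, b) == f(b, a)`, and an
- *associative operation*: `f(f(a, b), c) == f(a, f(b, c))`.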
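Why these two properties matter can be seen in a plain-Python simulation of a partitioned reduce. This is only a sketch of the idea, not Spark's actual implementation: the helper `partitioned_reduce` and the two ways of splitting the data are assumptions made for illustration.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(func, partitions):
    """Reduce each non-empty partition separately, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

values = [1, 2, 3, 4, 5, 6]
split_a = [[1, 2, 3], [4, 5, 6]]      # same data, two partitions
split_b = [[1, 2], [3, 4], [5, 6]]    # same data, three partitions

# Addition is commutative and associative: the partitioning does not matter.
sum_a = partitioned_reduce(add, split_a)
sum_b = partitioned_reduce(add, split_b)

# Subtraction is neither: the answer depends on how the data was split.
diff_a = partitioned_reduce(sub, split_a)
diff_b = partitioned_reduce(sub, split_b)
diff_sequential = reduce(sub, values)

print(sum_a, sum_b)
print(diff_a, diff_b, diff_sequential)
```

The two sums agree, while the three differences all disagree with one another, which is why a function like subtraction gives unreliable results when passed to `reduce()` on a partitioned dataset.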

**Additional Note:**  

A dataframe is actually built on top of an RDD; the underlying RDD of `Row` objects is available through its `.rdd` attribute, e.g.,

In [None]:
# Dataframe creation
