# Resilient distributed datasets
Resilient distributed datasets, or *RDD*s, are the fundamental building blocks of PySpark.
The associated Colab notebook can be found [here](https://github.com/mosesyhc/de300-wn2024-notes/blob/main/examples/ex-rdd.ipynb).

## What is an RDD?
An RDD is essentially an unordered collection of objects. Loosely speaking, it resembles
- a mathematical *set* (although an RDD may contain duplicates),
- a Python list of objects,
- or the contents of a JSON document.

|<img src="../img/rdd-idea.png" width="100%"/>|<img src="../img/math-set.png" width="60%"/>|<img src="../img/json-ex.png"/>|
|-:|:-:|:-|
| |Fig. Collection of objects| |

In [2]:
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.getOrCreate()
 
collection = [1, "two", 3.0, ("four", 4), {"five": 5}]  # generic list
 
sc = spark.sparkContext
 
collection_rdd = sc.parallelize(collection)  # list promoted to RDD

print(collection_rdd)

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289


In [3]:
collection_rdd.collect()  # action: returns every element of the RDD as a Python list

[1, 'two', 3.0, ('four', 4), {'five': 5}]

## Why RDDs if we have dataframes?
- If the data at hand are more *free-form*, an RDD can store objects of various types side by side.
- A dataframe, by contrast, will *attempt* (and fail) to find a common denominator, i.e., a single type, that fits the data above.

## Main ingredients of RDD manipulation
We cover three main building blocks for working with RDDs, all of which borrow from the MapReduce paradigm.

Each of the following building blocks (*functions*) takes a function as its input:
- `map()`
- `filter()`
- `reduce()`

### `map` through an example
`map()` applies the given function to each element of the RDD.

In [5]:
from py4j.protocol import Py4JJavaError

def add_one(value):
    return value + 1

collection_rdd_p1 = collection_rdd.map(add_one)

In [None]:
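try:
    # collect() is an action: this is the point where add_one() actually runs
    print(collection_rdd_p1.collect())
except Py4JJavaError as e:
    pass  # uncomment to inspect the full error: print(e)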

# You'll get one of the following:
# TypeError: can only concatenate str (not "int") to str
# TypeError: unsupported operand type(s) for +: 'dict' and 'int'
# TypeError: can only concatenate tuple (not "int") to tuple

```{figure} ../img/rdd-map.png
---
width: 80%
name: rdd-map
---
Applying `add_one()` to each element of RDD through `map()` (Fig 8.2, {cite:p}`rioux2022data`).
```

**Quick note:** 
- Why did the line throw an error?
- When was the error thrown?
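
Both questions come down to lazy evaluation: in PySpark, `map()` is a transformation that only records what should be done, and nothing runs until an action such as `collect()` asks for results. Python's built-in `map` is lazy in a similar way, so the plain-Python sketch below (an analogy that does not use Spark; the names `lazy_result`, `consumed`, and `error_message` are illustrative) shows the same timing: building the map succeeds, and the `TypeError` appears only when the elements are consumed.

```python
def add_one(value):
    return value + 1

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

# Building the lazy map raises nothing: no element has been touched yet.
lazy_result = map(add_one, collection)

# Consuming the iterator triggers the computation, much like collect() does.
consumed = []
error_message = None
try:
    for item in lazy_result:
        consumed.append(item)
except TypeError as exc:
    error_message = str(exc)

print(consumed)       # elements processed before the failure: [2]
print(error_message)  # the TypeError raised by "two" + 1
```

In Spark the elements are spread over partitions, so which offending element fails first is not fixed, which is why several different `TypeError` messages are possible.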

**A potential fix:**

In [None]:
def safer_add_one(value):
    try:
        return value + 1
    except TypeError:
        return value

# collection_rdd_p1_again = collection_rdd.map(safer_add_one)

**Lessons here:**
- PySpark does not check or warn you about the contents of an RDD.
- As developers, we are responsible for handling whatever data ends up in an RDD.
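
To see what `safer_add_one()` does element by element, here is a small plain-Python check. It is only a sketch: a list comprehension stands in for `map(...)` followed by `collect()`, and no Spark session is involved.

```python
def safer_add_one(value):
    try:
        return value + 1
    except TypeError:
        return value  # leave elements that cannot be incremented untouched

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

# Stand-in for collection_rdd.map(safer_add_one).collect()
result = [safer_add_one(elem) for elem in collection]
print(result)  # [2, 'two', 4.0, ('four', 4), {'five': 5}]
```

Passing problematic elements through unchanged is one design choice among several; dropping them with a filter beforehand is another.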

### `filter` through an example
`filter()` takes a function that returns `True` or `False` for each element and keeps only the elements for which it returns `True`.

In [None]:
collection_rdd_filter = collection_rdd.filter(
    lambda elem: isinstance(elem, (float, int))
)

In [None]:
# print(collection_rdd_filter.collect())

```{figure} ../img/rdd-filter.png
---
width: 80%
name: rdd-filter
---
Applying `filter()` to the RDD (Fig 8.3, {cite:p}`rioux2022data`).
```
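
The same predicate can be tried with Python's built-in `filter` as a plain-Python analogue (no Spark needed). The sketch also illustrates one caveat of `isinstance(elem, (float, int))`: `bool` is a subclass of `int`, so `True` and `False` pass the test. The extra `True` element and the helper names `is_number` and `is_strict_number` are assumptions added for illustration.

```python
# The original collection plus a boolean, added here for illustration only.
collection = [1, "two", 3.0, ("four", 4), {"five": 5}, True]

def is_number(elem):
    return isinstance(elem, (float, int))

def is_strict_number(elem):
    # bool is a subclass of int, so it has to be excluded explicitly
    return isinstance(elem, (float, int)) and not isinstance(elem, bool)

kept = list(filter(is_number, collection))
kept_strict = list(filter(is_strict_number, collection))

print(kept)         # [1, 3.0, True]
print(kept_strict)  # [1, 3.0]
```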

**A word about `lambda` functions**

```{figure} ../img/rdd-lambda.png
---
width: 80%
name: rdd-lambda
---
The use of a `lambda` function {cite:p}`rioux2022data`.
```
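
As a minimal sketch of the idea, a `lambda` is an inline, unnamed function that behaves like the equivalent `def`. The example uses plain Python (no Spark), and `add_one_lambda` and `numbers` are illustrative names.

```python
def add_one(value):
    return value + 1

# Same behavior as add_one(), written as an inline expression.
add_one_lambda = lambda value: value + 1

numbers = [4, 7, 9]
print(list(map(add_one, numbers)))          # [5, 8, 10]
print(list(map(lambda v: v + 1, numbers)))  # [5, 8, 10]
```

A `lambda` is convenient for short, one-off functions passed to `map()` or `filter()`; anything longer is usually easier to read and test as a named function.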

### `reduce` through an example
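`reduce()` collapses the RDD into a single value by repeatedly applying the given two-argument function to pairs of elements.
- It is loosely similar to aggregating a dataframe, e.g., `groupby()` followed by an aggregate function.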

In [None]:
from operator import add

collection_rdd2 = sc.parallelize([4, 7, 9.2, 5.6, -20])

In [None]:
# collection_rdd2.reduce(add)

In [None]:
# collection_rdd2.reduce(
#     lambda a, b: a + b
# )

```{figure} ../img/rdd-reduce.png
---
width: 80%
name: rdd-reduce
---
Applying `add` through `reduce()` to the RDD (Fig 8.4, {cite:p}`rioux2022data`).
```

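**Warnings about `reduce()`**
- What functions are reasonable for `reduce()`? The elements are combined partition by partition, in no guaranteed order, so the function should be
- *commutative*: `f(a, b) == f(b, a)`, and
- *associative*: `f(f(a, b), c) == f(a, f(b, c))`.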
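
The sketch below illustrates why these two properties matter. It is a plain-Python simulation, not Spark itself: the hypothetical helper `partitioned_reduce()` splits a list into chunks, reduces each chunk, and then reduces the partial results, which is a simplified stand-in for combining results across partitions. Integers replace the floats used above to avoid rounding noise.

```python
from functools import reduce
from operator import add, sub

data = [4, 7, 9, 5, -20]

def partitioned_reduce(func, values, num_partitions):
    """Reduce each chunk separately, then reduce the partial results."""
    size = -(-len(values) // num_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

# Addition is commutative and associative: the partitioning does not matter.
add_results = [partitioned_reduce(add, data, n) for n in (1, 2, 3)]
print(add_results)  # [5, 5, 5]

# Subtraction is neither: the answer changes with the partitioning.
sub_results = [partitioned_reduce(sub, data, n) for n in (1, 2, 3)]
print(sub_results)  # [3, -37, 13]
```

With a function like subtraction, the result of `reduce()` would depend on how the data happen to be partitioned, so it would not be reproducible across cluster configurations.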

**Additional note:**

A dataframe is built on top of an RDD: its `rdd` attribute exposes the rows as an RDD of `Row` objects, e.g.,

In [None]:
# df = spark.createDataFrame([[1], [2], [3]], schema=["column"])
 
# print(df.rdd)
 
# print(df.rdd.collect())