# Resilient distributed datasets
Resilient distributed datasets, or *RDD*s, are the fundamental building blocks of PySpark. 
The associated Colab notebook can be found [here](https://github.com/mosesyhc/de300-wn2024-notes/blob/main/examples/ex-rdd.ipynb).

## What is an RDD?  
An RDD is essentially an unordered collection of objects. It can be thought of as
- a mathematical *set*,
- a Python list of objects, or
- something similar to a JSON document.

|<img src="../img/rdd-idea.png" width="100%"/>|<img src="../img/math-set.png" width="60%"/>|<img src="../img/json-ex.png"/>|
|-:|:-:|:-|
| |Fig. Collection of objects| |

In [3]:
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.getOrCreate()
 
collection = [1, "two", 3.0, ("four", 4), {"five": 5}]  # generic list
 
sc = spark.sparkContext
 
collection_rdd = sc.parallelize(collection)  # list promoted to RDD

print(collection_rdd)

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289


In [2]:
!pip install pyspark


In [4]:
collection_rdd.collect()

[1, 'two', 3.0, ('four', 4), {'five': 5}]

## Why RDD if we have dataframes?
- If the data at hand are more *free-form*, an RDD can store objects of various types.
- A dataframe, by contrast, will *attempt* (and fail) to find a common type that fits the data above.

## Main ingredients of RDD manipulation
We cover three main building blocks for working with RDDs, all of which inherit ideas from the MapReduce scheme.

Each of the following building blocks (*functions*) takes a function as its input:
- `map()`
- `filter()`
- `reduce()`

### `map` through an example
`map()` applies the given function to each element of the RDD.

In [5]:
from py4j.protocol import Py4JJavaError

def add_one(value):
    return value + 1

collection_rdd_p1 = collection_rdd.map(add_one)

In [None]:
try: 
    print(collection_rdd_p1.collect())
except Py4JJavaError as e:
    print(e)

# You'll get one of the following:
# TypeError: can only concatenate str (not "int") to str
# TypeError: unsupported operand type(s) for +: 'dict' and 'int'
# TypeError: can only concatenate tuple (not "int") to tuple

|![rdd-map](../img/rdd-map.png)|
|:---:|
|Applying `add_one()` to each element of RDD through `map()` (Fig 8.2 in Rioux, 2022).|

**Quick note:** 
- Why did the line throw an error?
- When was the error thrown?
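One way to explore both questions without a Spark session is Python's built-in `map`, which is also lazy. The sketch below is only a local analogue, not PySpark code: building the mapped object raises nothing, and the `TypeError` appears only when elements are consumed, which plays the role of an action such as `collect()`.

```python
def add_one(value):
    return value + 1

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

# Building the lazy map raises nothing, much like rdd.map(add_one).
lazy_result = map(add_one, collection)

# The error only surfaces once elements are actually consumed.
consumed = []
error_message = None
try:
    for item in lazy_result:
        consumed.append(item)
except TypeError as exc:
    error_message = str(exc)

print(consumed)        # elements processed before the failure
print(error_message)   # the first element that cannot be added to an int fails
```

Locally the elements are processed in order, so the string is the first to fail. On a cluster the elements are spread over partitions, which is why any of the three errors listed above may be reported first.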

**A potential fix:**

In [None]:
def safer_add_one(value):
    try:
        return value + 1
    except TypeError:
        return value

# collection_rdd_p1_again = collection_rdd.map(safer_add_one)
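To see what this fix would return, the function can be applied to the same list locally. This is a plain-Python sketch that assumes the elements from the first example; it only mimics what `map()` followed by `collect()` would produce and does not use Spark.

```python
def safer_add_one(value):
    try:
        return value + 1
    except TypeError:
        # Leave elements that do not support "+ 1" unchanged.
        return value

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

# Local stand-in for collection_rdd.map(safer_add_one).collect()
result = [safer_add_one(value) for value in collection]
print(result)  # [2, 'two', 4.0, ('four', 4), {'five': 5}]
```

Silently passing bad elements through is a design choice: it keeps the job running, but it also hides which elements were skipped.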

**Lesson here**:
- PySpark does not warn you about the contents of an RDD.
- As developers, we are responsible for handling whatever data end up in an RDD.

### `filter` through an example
`filter()` takes a function that returns `True` or `False` for each element and keeps only the elements for which it returns `True`.

In [None]:
collection_rdd_filter = collection_rdd.filter(
    lambda elem: isinstance(elem, (float, int))
)

In [None]:
# print(collection_rdd_filter.collect())

|![rdd-filter](../img/rdd-filter.png)|
|:---:|
|Applying `filter()` to the RDD (Fig 8.3 in Rioux, 2022).|
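The same predicate can be tried with Python's built-in `filter` as a local analogue (no Spark involved). The name `is_number` is introduced here purely for illustration.

```python
collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

def is_number(elem):
    # Same test as the lambda passed to collection_rdd.filter(...)
    return isinstance(elem, (float, int))

kept = list(filter(is_number, collection))
print(kept)  # [1, 3.0]

# Caveat: bool is a subclass of int, so booleans also pass this test.
print(is_number(True))  # True
```

If booleans should be excluded, the predicate needs an explicit `not isinstance(elem, bool)` check.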

**A word about `lambda` functions**

|![rdd-lambda](../img/rdd-lambda.png)|
|:---:|
|Using a `lambda` function (Rioux, 2022).|
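As a small illustration, a `lambda` is an anonymous function written inline. The sketch below compares a named function, an equivalent `lambda`, and `operator.add`; the names `add_named` and `add_lambda` are made up for this example.

```python
from operator import add

def add_named(a, b):
    return a + b

# Same behavior, written inline without a def statement.
add_lambda = lambda a, b: a + b

results = [func(4, 7) for func in (add_named, add_lambda, add)]
print(results)              # [11, 11, 11]
print(add_lambda.__name__)  # <lambda>
```

Any of the three could be passed wherever an RDD method expects a two-argument function.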

### `reduce` through an example
`reduce()` summarizes the RDD by repeatedly applying a two-argument function to pairs of elements until a single value remains.
- loosely similar to aggregating after a `groupby()` in a dataframe.

In [None]:
from operator import add

collection_rdd2 = sc.parallelize([4, 7, 9.2, 5.6, -20])

In [None]:
# collection_rdd2.reduce(add)

In [None]:
# collection_rdd2.reduce(
#     lambda a, b: a + b
# )

|![rdd-reduce](../img/rdd-reduce.png)|
|:---:|
|Applying `add` through `reduce()` to the RDD (Fig 8.4 in Rioux, 2022).|
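The pairwise idea in the figure can be traced locally with `functools.reduce` and `itertools.accumulate`. This is a single-machine sketch using the same numbers as `collection_rdd2`. Spark does not promise this left-to-right order across partitions.

```python
from functools import reduce
from itertools import accumulate
from operator import add

data = [4, 7, 9.2, 5.6, -20]

# Running results after each pairwise application of add.
steps = list(accumulate(data, add))

# The final value is what a left-to-right reduce returns.
total = reduce(add, data)

print(steps)
print(total)
```

The last entry of `steps` equals `total`, up to the usual floating-point rounding in the printed digits.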


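**Warnings about `reduce()`**
- What functions are reasonable for `reduce()`? Those that are both commutative and associative, because Spark reduces each partition separately and then combines the partial results.
- *commutative operation*: `f(a, b) == f(b, a)`
- *associative operation*: `f(f(a, b), c) == f(a, f(b, c))`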
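A rough way to see why these two properties matter is to simulate partitioning in plain Python. The helper `partitioned_reduce` below is hypothetical and greatly simplifies what Spark does: it reduces each chunk and then reduces the partial results. Integers are used to keep floating-point rounding out of the picture.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(func, data, num_partitions):
    """Hypothetical helper: reduce each chunk, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

data = [4, 7, 9, 5, -20]

# Addition is commutative and associative: same answer for any partitioning.
add_results = [partitioned_reduce(add, data, n) for n in (1, 2, 3)]
print(add_results)  # [5, 5, 5]

# Subtraction is neither: the answer depends on how the data are split.
sub_results = [partitioned_reduce(sub, data, n) for n in (1, 2, 3)]
print(sub_results)  # [3, -37, 13]
```

With a function like subtraction, the result of `reduce()` would change with the number of partitions, which is rarely what is intended.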

**Additional Note:**  

A dataframe is built on top of an RDD: its `.rdd` attribute exposes the underlying RDD of `Row` objects, e.g.,

In [None]:
# df = spark.createDataFrame([[1], [2], [3]], schema=["column"])
 
# print(df.rdd)
 
# print(df.rdd.collect())