# Extending PySpark with Python: RDD & UDFs

Instead of using the methods provided by `pyspark.sql`, we build our own set of transformations in pure Python, using PySpark as a convenient distributed engine. We start with the *resilient distributed dataset* (or **RDD**). An RDD is like a data frame, but it distributes unordered objects rather than records and columns. Think of an RDD as a bag of elements with no order or relationship to one another: each element is independent of the others.

In [1]:
# !set PYSPARK_PYTHON=%cd%\venv\Scripts\python.exe

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

sc = spark.sparkContext

collection_rdd = sc.parallelize(collection)

print(collection_rdd)


ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274


If we tried to store an integer, a string, a floating-point number, a tuple, and a dictionary in a single data frame column, the data frame would have to find a common denominator to fit those different types of data, and it would fail.

### Manipulating data the RDD way: `map()`, `filter()`, and `reduce()`

`map()`, `filter()`, and `reduce()` all take a function (that we will call `f`) as their only required parameter. `map()` and `filter()` return a new RDD with the desired modifications, while `reduce()` returns a single value. We call functions that take other functions as parameters *higher-order functions*.
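
As a minimal sketch of what "higher-order" means, the snippet below uses Python's built-in `map()` and `filter()` and `functools.reduce()` on a plain list. No Spark is involved; the helper names `double` and `is_even` are made up for illustration, and the RDD methods are only assumed to behave analogously on distributed data.

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4]


def double(x):
    """Function handed to map(): called once per element."""
    return x * 2


def is_even(x):
    """Predicate handed to filter(): decides which elements stay."""
    return x % 2 == 0


# Each higher-order function receives another function as its argument.
doubled = list(map(double, values))    # one output per input element
evens = list(filter(is_even, values))  # a subset of the original elements
total = reduce(add, values)            # a single value, not a collection

print(doubled, evens, total)
```

Note that `reduce()` collapses the data to one value, whereas `map()` and `filter()` still produce a collection.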


#### Apply one function to every object: MAP

We start with the most basic and common operation: applying a Python function to every element of the RDD. For this, PySpark provides `map()`, which directly echoes the functionality of the built-in `map()` function in Python.

In [2]:
from py4j.protocol import Py4JJavaError

def add_one(value):
    return value + 1

collection_rdd = collection_rdd.map(add_one)

try:
    print(collection_rdd.collect())
except Py4JJavaError:
    pass

# Stack trace galore! The important bit, you'll get one of the following:
# TypeError: can only concatenate str (not "int") to str
# TypeError: unsupported operand type(s) for +: 'dict' and 'int'
# TypeError: can only concatenate tuple (not "int") to tuple

![Failure to add_one](./images/rdd_failure_to_add.png)

In [3]:
# improved safer_add_one() function below 
# which returns the original element if the 
# function runs into a type error.

collection_rdd = sc.parallelize(collection)

def safer_add_one(value):
    try:
        return value + 1
    except TypeError:
        return value

collection_rdd = collection_rdd.map(safer_add_one)

print(collection_rdd.collect())

[2, 'two', 4.0, ('four', 4), {'five': 5}]


#### Only keep what you want: FILTER

`filter()` keeps only the elements that satisfy a predicate. The RDD version of `filter()` is a little different from the data frame version: it takes a function `f`, applies it to each object (or element), and keeps only those for which `f` returns a truthy value.

The `isinstance()` function returns `True` if the first argument is an instance of the type (or of one of the types in the tuple) passed as the second argument; in our case, it tests whether each element is either a `float` or an `int`.
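
The check can be tried on a plain list before handing it to `filter()`. This is a small sketch: the list mirrors what the collection looks like after `safer_add_one()`, and `is_number` is a hypothetical helper name, not part of PySpark.

```python
# Elements as they look after safer_add_one() has been mapped over the collection.
elements = [2, "two", 4.0, ("four", 4), {"five": 5}]


def is_number(elem):
    """Hypothetical helper: True when elem is a float or an int."""
    return isinstance(elem, (float, int))


checks = [is_number(elem) for elem in elements]
print(checks)

# bool is a subclass of int in Python, so booleans also pass this test.
bool_passes = is_number(True)
print(bool_passes)
```

If booleans should be excluded, the predicate would need an explicit extra check; the test used in this section does not do that.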

In [4]:
collection_rdd = collection_rdd.filter(
    lambda elem: isinstance(elem, (float, int))
)

print(collection_rdd.collect())

[2, 4.0]


Just like `map()`, the function passed as a parameter to `filter()` is applied to every element in the RDD. This time, though, instead of returning the result in a new RDD, we keep the original value if the result of the function is truthy. If the result is falsy, we drop the element.
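
The truthy/falsy rule means the predicate does not have to return a strict `True` or `False`. The sketch below illustrates this with Python's built-in `filter()`, which follows the same rule, on made-up sample lists; it is a plain-Python stand-in rather than an RDD.

```python
words = ["spark", "", "rdd", "", "udf"]

# len() returns an int, not a bool: 0 is falsy, any other length is truthy.
non_empty = list(filter(len, words))

numbers = [0, 1, 2, 3, 4]

# x % 2 is 1 (truthy) for odd numbers and 0 (falsy) for even ones.
odds = list(filter(lambda x: x % 2, numbers))

# The kept values are the original elements, not the predicate's results.
print(non_empty, odds)
```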

#### Two elements come in, one comes out: REDUCE

`reduce()` is an important RDD operation: it enables the summarization of data, similar to `groupby()`/`agg()` on a data frame. As its name implies, `reduce()` is used to reduce the elements of an RDD.
By *reducing*, we mean taking two elements and applying a function that returns only one element. PySpark applies the function to the first two elements, then applies it again to the result and the third element, and so on, until there are no elements left.

![Reduce RDD](./images/rdd_reduce.png)
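
To make the pairwise process concrete, here is a sketch that records every intermediate step using `functools.reduce()` on a local list. The wrapper `traced_add` is a hypothetical name introduced for illustration, and the sketch runs on a single machine, so it ignores partitioning.

```python
from functools import reduce
from operator import add

steps = []  # (left, right, result) for every call to the reducing function


def traced_add(left, right):
    """Hypothetical wrapper around add() that logs each pairwise step."""
    result = add(left, right)
    steps.append((left, right, result))
    return result


total = reduce(traced_add, [4, 7, 9, 1, 3])

for left, right, result in steps:
    print(f"{left} + {right} = {result}")
print("total:", total)
```

Five elements need four calls to the function: each call consumes the running result and the next element.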

In [5]:
from operator import add

collection_rdd = sc.parallelize([4, 7, 9, 1, 3])

print(collection_rdd.reduce(add))

24


> **Note: `reduce()` in a distributed world.** Because of PySpark’s distributed nature, the data of an RDD can be distributed across multiple partitions. The `reduce()` function is applied independently on each partition, and then each intermediate value is sent to the driver for the final reduction. Because of this, you need to provide a commutative and associative function to `reduce()`.  \
\
A *commutative* function is a function where the order in which the arguments are applied is not important. For example, `add()` is commutative, since `a + b = b + a`. On the flip side, `subtract()` is not: `a - b != b - a`.  \
\
An *associative* function is a function where how the values are grouped is not important. `add()` is associative, since `(a + b) + c = a + (b + c)`. `subtract()` is not: `(a - b) - c != a - (b - c)`.  \
\
`add()`, `multiply()`, `min()`, and `max()` are both associative and commutative.
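
The effect of partitioning can be imitated locally. The sketch below is a simplified model, not PySpark's implementation: `partitioned_reduce` is a hypothetical helper that splits a list into contiguous chunks, reduces each chunk, and then reduces the partial results in order. Because the chunks stay in order, it mainly shows what goes wrong when the function is not associative.

```python
from functools import reduce
from operator import add, sub


def partitioned_reduce(f, data, num_partitions):
    """Hypothetical helper: reduce each chunk, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(f, part) for part in partitions]
    return reduce(f, partials)


data = [4, 7, 9, 1, 3]

# Addition gives the same answer however the data is split.
add_results = {n: partitioned_reduce(add, data, n) for n in (1, 2, 3)}

# Subtraction depends on where the partition boundaries fall.
sub_results = {n: partitioned_reduce(sub, data, n) for n in (1, 2, 3)}

print(add_results)
print(sub_results)
```

With subtraction, the answer changes with the number of partitions, which is exactly the kind of result you cannot rely on in a distributed setting.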

#### Using Python to extend PySpark via UDFs