# Extending PySpark with Python: RDD & UDFs

Instead of relying only on the methods provided by `pyspark.sql`, we build our own set of transformations in pure Python, using PySpark as a convenient distributed engine. We start with the *resilient distributed dataset* (or **RDD**). An RDD is like a data frame, but it distributes unordered objects rather than records and columns. Think of an RDD as a bag of elements with no order or relationship to one another: each element is independent of the others.

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

sc = spark.sparkContext

collection_rdd = sc.parallelize(collection)

print(collection_rdd)


ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274


If we tried to store an integer, a string, a floating-point number, a tuple, and a dictionary in a single column, the data frame would try (and fail) to find a common denominator to fit those different types of data.

### Manipulating data the RDD way: `map()`, `filter()`, and `reduce()`

`map()`, `filter()`, and `reduce()` all take a function (which we will call `f`) as their only required parameter. `map()` and `filter()` return a new RDD with the desired modifications, while `reduce()` returns a single value. We call functions that take other functions as parameters *higher-order functions*.
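
As a minimal, Spark-free sketch of what "higher-order function" means, the snippet below uses Python's built-in `map()` and `filter()` plus `functools.reduce()` on a plain list. The helper names (`double`, `is_even`) and the sample data are illustrative assumptions; the RDD methods follow the same pattern but run on distributed data.

```python
from functools import reduce
from operator import add


def double(value):
    """Return twice the input."""
    return value * 2


def is_even(value):
    """Predicate: True for even integers."""
    return value % 2 == 0


numbers = [1, 2, 3, 4]

# Each higher-order function receives another function as an argument.
doubled = list(map(double, numbers))    # apply `double` to every element
evens = list(filter(is_even, numbers))  # keep elements where `is_even` is truthy
total = reduce(add, numbers)            # fold the list into a single value

print(doubled)  # [2, 4, 6, 8]
print(evens)    # [2, 4]
print(total)    # 10
```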


#### Apply one function to every object: MAP

We start with the most basic and common operation: applying a Python function to
every element of the RDD. For this, PySpark provides `map()`. This directly echoes the functionality of the `map()` function in Python.

In [5]:
from py4j.protocol import Py4JJavaError

def add_one(value):
    return value + 1

collection_rdd = collection_rdd.map(add_one)

try:
    print(collection_rdd.collect())
except Py4JJavaError:
    pass

# Stack trace galore! The important bit, you'll get one of the following:
# TypeError: can only concatenate str (not "int") to str
# TypeError: unsupported operand type(s) for +: 'dict' and 'int'
# TypeError: can only concatenate tuple (not "int") to tuple

![Failure to add_one](./images/rdd_failure_to_add.png)

In [6]:
# improved safer_add_one() function below 
# which returns the original element if the 
# function runs into a type error.

collection_rdd = sc.parallelize(collection)

def safer_add_one(value):
    try:
        return value + 1
    except TypeError:
        return value

collection_rdd = collection_rdd.map(safer_add_one)

print(collection_rdd.collect())

[2, 'two', 4.0, ('four', 4), {'five': 5}]


#### Only keep what you want: FILTER

`filter()` is used to keep only the elements that satisfy a predicate. The RDD version of `filter()` is a little different from the data frame version: it takes a function `f`, applies it to each object (or element), and keeps only those for which `f` returns a truthy value.

The `isinstance()` function returns `True` if the first argument is an instance of the type given as the second argument, or of any type in it when the second argument is a tuple of types; in our case, it tests whether each element is either a `float` or an `int`.

In [7]:
collection_rdd = collection_rdd.filter(
    lambda elem: isinstance(elem, (float, int))
)

print(collection_rdd.collect())

[2, 4.0]


Just like `map()`, the function passed as a parameter to `filter()` is applied to every element in the RDD. This time, though, instead of returning the result in a new RDD, we keep the original value if the result of the function is truthy. If the result is falsy, we drop the element.
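
To make the truthy/falsy rule concrete, here is a small sketch that uses Python's built-in `filter()` as a stand-in for the RDD method (an assumption made so the example runs without a Spark session). The sample `values` list is illustrative. The predicate's return value only decides whether an element stays; the element itself passes through unchanged.

```python
values = [0, 1, "", "two", None, 3.0, [], [4]]

# The identity predicate returns the element itself, so only truthy
# elements survive: 0, "", None, and [] are all falsy in Python.
kept = list(filter(lambda elem: elem, values))

# A predicate returning an explicit boolean keeps every numeric element,
# including 0, because isinstance(0, (float, int)) is True.
numeric = list(filter(lambda elem: isinstance(elem, (float, int)), values))

print(kept)     # [1, 'two', 3.0, [4]]
print(numeric)  # [0, 1, 3.0]
```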

#### Two elements come in, one comes out: REDUCE

`reduce()` is an important RDD operation that enables the summarization of data, similar to `groupby()`/`agg()` on the data frame. As its name implies, it is used to reduce the elements in an RDD.
By *reducing*, we mean taking two elements and applying a function that returns only one element. PySpark applies the function to the first two elements, then applies it again to the result and the third element, and so on, until there are no elements left.

![Reduce RDD](./images/rdd_reduce.png)
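
The pairwise process can be traced on a single machine with `itertools.accumulate()`, which yields every intermediate result of the fold. This is a plain-Python illustration of the idea, not a description of how Spark schedules the work.

```python
from functools import reduce
from itertools import accumulate
from operator import add

elements = [4, 7, 9, 1, 3]

# accumulate() exposes each intermediate value of the left-to-right fold:
# 4, then 4 + 7, then 11 + 9, and so on.
steps = list(accumulate(elements, add))

# reduce() returns only the final value of that same fold.
result = reduce(add, elements)

print(steps)   # [4, 11, 20, 21, 24]
print(result)  # 24
```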

In [8]:
from operator import add

collection_rdd = sc.parallelize([4, 7, 9, 1, 3])

print(collection_rdd.reduce(add))

24


> **Note: `reduce()` in a distributed world.** Because of PySpark’s distributed nature, the data of an RDD can be distributed across multiple partitions. The `reduce()` function is applied independently on each partition, and then each intermediate value is sent to the driver for the final reduction. Because of this, you need to provide a commutative and associative function to `reduce()`.  \
\
A *commutative* function is a function where the order in which the arguments are
applied is not important. For example, `add()` is commutative, since `a + b = b + a`. On the flip side, `subtract()` is not: `a - b != b - a`.  \
\
An *associative* function is a function where how the values are grouped is not important. `add()` is associative, since `(a + b) + c = a + (b + c)`. `subtract()` is not: `(a - b) - c != a - (b - c)`.  \
\
`add()`, `multiply()`, `min()`, and `max()` are both associative and commutative.
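
The effect described in the note can be simulated without Spark. The sketch below reduces each "partition" separately and then reduces the partial results. The helper `partitioned_reduce()` and the two partition layouts are hypothetical, chosen only to show that a non-associative function gives layout-dependent answers; the simulation keeps the partitions in a fixed order.

```python
from functools import reduce
from operator import add, sub


def partitioned_reduce(func, partitions):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


# Two hypothetical ways the same five elements could be split.
layout_a = [[4, 7], [9, 1, 3]]
layout_b = [[4], [7, 9, 1, 3]]

# add() is commutative and associative: the layout does not matter.
sum_a = partitioned_reduce(add, layout_a)
sum_b = partitioned_reduce(add, layout_b)
print(sum_a, sum_b)    # 24 24

# sub() is neither: the answer depends on how the data was partitioned.
diff_a = partitioned_reduce(sub, layout_a)
diff_b = partitioned_reduce(sub, layout_b)
print(diff_a, diff_b)  # -8 10
```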

### Using Python to extend PySpark via UDFs

Unlike the RDD, the data frame has a structure enforced by columns. To address this
constraint, PySpark provides the possibility of creating UDFs via the `pyspark.sql.functions.udf()` function. What comes in is a regular Python function, and what goes out is a function promoted to work on PySpark columns.

In [10]:
import pyspark.sql.functions as F
import pyspark.sql.types as T

fractions = [[x,y] for x in range(100) for y in range(1,100)]

frac_df = spark.createDataFrame(fractions, ["numerator","denominator"])

frac_df = frac_df.select(
    F.array(F.col("numerator"),F.col("denominator")).alias("fraction")
)

frac_df.show(5, False)

+--------+
|fraction|
+--------+
|[0, 1]  |
|[0, 2]  |
|[0, 3]  |
|[0, 4]  |
|[0, 5]  |
+--------+
only showing top 5 rows



#### Using typed Python functions

This section covers creating a Python function that will work seamlessly with a PySpark data frame. While Python and Spark usually work seamlessly together, creating and using UDFs requires a few precautions. 

We will create two functions: one to reduce a fraction and one to transform a fraction into
a floating-point number. The blueprint when creating a function destined to become a Python UDF is as follows:
1. Create and document the function.
2. Make sure the input and output types are compatible.
3. Test the function.

In [12]:
from fractions import Fraction
from typing import Tuple, Optional

Frac = Tuple[int,int]

def py_reduce_fraction(frac: Frac) -> Optional[Frac]:
    """Reduce a fraction represented as a 2-tuple of integers."""
    num, denom = frac
    if denom:
        answer = Fraction(num,denom)
        return answer.numerator, answer.denominator
    return None

assert py_reduce_fraction((3,6)) == (1,2)
assert py_reduce_fraction((1,0)) is None

def py_fraction_to_float(frac: Frac) -> Optional[float]:
    """Transforms a fraction represented as a 2-tuple of integers into a float."""
    num, denom = frac
    if denom:
        return num / denom
    return None

assert py_fraction_to_float((2, 8)) == 0.25
assert py_fraction_to_float((10, 0)) is None

Python is a dynamically typed language; this means that the type of an object is only known at runtime. When working with PySpark’s data frame, where each column has one and only one type, we need to make sure that our UDF returns consistent types. Type hints help here: they document the expected types and let a static type checker flag inconsistencies, although Python does not enforce them at runtime.
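
The following sketch illustrates why consistent return types matter. Both functions and their names (`inconsistent_ratio`, `consistent_ratio`) are hypothetical examples, not part of the chapter's code: the first mixes `float` and `str` results, while the second returns either a `float` or `None`, which fits a single nullable column type.

```python
from typing import Optional, get_type_hints


def inconsistent_ratio(num: int, denom: int):
    """Returns a float, or a string on failure: two unrelated types."""
    if denom:
        return num / denom
    return "undefined"


def consistent_ratio(num: int, denom: int) -> Optional[float]:
    """Returns a float, or None on failure: one type plus a null marker."""
    if denom:
        return num / denom
    return None


# Collect the runtime types each function produces for the same inputs.
inputs = [(1, 2), (1, 0)]
bad_types = {type(inconsistent_ratio(n, d)).__name__ for n, d in inputs}
good_types = {type(consistent_ratio(n, d)).__name__ for n, d in inputs}

print(sorted(bad_types))   # ['float', 'str']
print(sorted(good_types))  # ['NoneType', 'float']

# The annotation is stored as metadata; Python does not check it at runtime.
print(get_type_hints(consistent_ratio)["return"])
```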

#### From Python functions to UDFs using `udf()`

Once you have created your Python function, PySpark provides a simple mechanism
to promote it to a UDF. This section covers the `udf()` function and how to use it directly to create a UDF, as well as how to use it as a decorator to simplify the creation of a UDF.

PySpark provides a `udf()` function in the `pyspark.sql.functions` module to promote Python functions to their UDF equivalents. The function takes two parameters:
- The function you want to promote
- The return type of the generated UDF

The table below shows the type equivalences between Python and PySpark. If you provide a return type, it must be compatible with the return value of your UDF.

| Type constructor | String representation | Python equivalent                                    |
|------------------|-----------------------|------------------------------------------------------|
| NullType()       | null                  | None                                                 |
| StringType()     | string                | Python's regular strings                             |
| BinaryType()     | binary                | bytearray                                            |
| BooleanType()    | boolean               | bool                                                 |
| DateType()       | date                  | datetime.date (from the `datetime` library)          |
| TimestampType()  | timestamp             | datetime.datetime (from the `datetime` library)      |
| DecimalType(p,s) | decimal               | decimal.Decimal (from the `decimal` library)         |
| DoubleType()     | double                | float                                                |
| FloatType()      | float                 | float*                                               |
| ByteType()       | byte or tinyint       | int*                                                 |
| IntegerType()    | int                   | int*                                                 |
| LongType()       | long or bigint        | int*                                                 |
| ShortType()      | short or smallint     | int*                                                 |
| ArrayType(T)     | N/A                   | list, tuple, or NumPy array (from the `numpy` library) |
| MapType(K, V)    | N/A                   | dict                                                 |
| StructType([…])  | N/A                   | list or tuple                                        |

We promote the `py_reduce_fraction()` function to a UDF via the `udf()` function. Just like we did with the Python equivalent, we provide a return type to the UDF
(this time, an `Array` of `Long`, since `Array` is the companion type of the tuple and `Long` is the one for Python integers). Once the UDF is created, we can apply it to columns just like any other PySpark function. We chose to create a new column to showcase the before and after; in the sample shown, the fraction appears properly reduced.

In [13]:
SparkFrac = T.ArrayType(T.LongType())

reduce_fraction = F.udf(py_reduce_fraction, SparkFrac)

frac_df = frac_df.withColumn(
    "reduced_fraction", reduce_fraction(F.col("fraction"))
)

frac_df.show(5, False)

+--------+----------------+
|fraction|reduced_fraction|
+--------+----------------+
|[0, 1]  |[0, 1]          |
|[0, 2]  |[0, 1]          |
|[0, 3]  |[0, 1]          |
|[0, 4]  |[0, 1]          |
|[0, 5]  |[0, 1]          |
+--------+----------------+
only showing top 5 rows



You also have the option of creating your Python function and promoting it to a UDF
using the `udf()` function as a decorator.
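
For readers less familiar with decorators that take arguments: a form such as `@F.udf(T.DoubleType())` first calls `udf()` with the return type, and the result is then applied to the function defined below it. The toy `promote()` decorator and `Promoted` class here are hypothetical stand-ins written for illustration, not PySpark's implementation. They only mimic the pattern of wrapping a function while keeping the original reachable through a `.func` attribute.

```python
from typing import Callable, Optional, Tuple

Frac = Tuple[int, int]


class Promoted:
    """Hypothetical wrapper that remembers a function and a return-type label."""

    def __init__(self, func: Callable, return_type: str):
        self.func = func                # the untouched Python function
        self.return_type = return_type  # illustrative label only

    def __call__(self, *args):
        # A real UDF would build a column expression here; this toy
        # version simply forwards the call to the wrapped function.
        return self.func(*args)


def promote(return_type: str) -> Callable[[Callable], Promoted]:
    """Decorator factory: called with a type, returns the actual decorator."""
    def decorator(func: Callable) -> Promoted:
        return Promoted(func, return_type)
    return decorator


@promote("double")
def toy_fraction_to_float(frac: Frac) -> Optional[float]:
    """Transform a 2-tuple fraction into a float, or None if undefined."""
    num, denom = frac
    return num / denom if denom else None


print(type(toy_fraction_to_float).__name__)  # Promoted
print(toy_fraction_to_float.func((1, 2)))    # 0.5
```

Keeping the original function on the wrapper is what makes it possible to unit test the plain Python logic without going through a data frame.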

In [15]:
@F.udf(T.DoubleType())
def fraction_to_float(frac: Frac) -> Optional[float]:
    """Transforms a fraction represented as a 2-tuple of integers into a float."""
    num, denom = frac
    if denom:
        return num / denom
    return None

frac_df = frac_df.withColumn(
    "fraction_float", fraction_to_float(F.col("reduced_fraction"))
)

frac_df.select("reduced_fraction", "fraction_float").distinct().show(5, False)

assert fraction_to_float.func((1, 2)) == 0.5

+----------------+--------------------+
|reduced_fraction|fraction_float      |
+----------------+--------------------+
|[3, 50]         |0.06                |
|[3, 67]         |0.04477611940298507 |
|[3, 76]         |0.039473684210526314|
|[2, 85]         |0.023529411764705882|
|[4, 15]         |0.26666666666666666 |
+----------------+--------------------+
only showing top 5 rows



