# Spark User Defined Functions

## Overview

Spark provides a powerful computational engine, and we have seen some of its core aspects thus far. In this
section, we will look at how to create user-defined functions (UDFs) and integrate them into Spark.

## User defined functions in Spark

UDFs are custom per-row transformations written in native Python that run in parallel on your data.
The obvious question is: why not use UDFs for everything? After all, they are more flexible than the
built-in functions. The answer is speed, and there is a hierarchy of tools you should work through.
Ideally, you get the most bang for your buck from the Python DataFrame APIs and their native
functions and methods. DataFrame operations go through many optimizations, so they are well suited to
structured and semi-structured data, and the functions and methods Spark provides are heavily
optimized and designed for the most common data processing tasks.

Suppose you find a case where you just can’t do what is required with the native functions and
methods, and you are forced to write a UDF. UDFs are slower because Spark can’t optimize them: the
function body is opaque to the optimizer. Spark serializes your native-language code and ships it
across the cluster to the executors. I have come across many cases where objects can’t be serialized,
which can be a pain to deal with. The whole process is also much slower for non-JVM languages such as
Python, because each row has to be moved between the JVM and a separate Python worker process. I
always caution against UDFs unless there is no other choice.

The code snippet below shows how to define a simple UDF in Spark.
It accepts a string column and returns a modified version of each string.

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, expr, udf
from pyspark.sql.types import StringType

# placeholder application name; pick whatever suits your job
APP_NAME = "Spark UDF example"


def add_string(string):
    """Append a fixed suffix to the given string."""
    return string + "_" + "this_is_added"


if __name__ == '__main__':

    # get a spark session
    spark = SparkSession.builder.appName(APP_NAME).getOrCreate()

    # build a few in-memory rows: height, weight, age, sex
    rows = [Row(180.0, 85.0, 35, "M"),
            Row(175.5, 75.5, 25, "M"),
            Row(165.3, 55.3, 19, "F")]

    # create the DataFrame
    df = spark.createDataFrame(rows, ["Height", "Weight", "Age", "Sex"])
    df.show()

    # use expr function
    df.select(expr("Height * 5")).show()

    # ...or use column
    df.select("Height", col("Height") * 5, "Weight", col("Weight") * 2, "Age", "Sex").show()

    # wrap the Python function as a UDF so it can be used in DataFrame expressions
    addStringUDF = udf(add_string, StringType())

    # apply the UDF to the string column
    df.select("Sex", addStringUDF(col("Sex")).alias("Sex_modified")).show()

    spark.stop()
```

If we want to use our UDF in native SQL statements, then we also need to register it as shown below:

```python
spark.udf.register("addStringUDF", add_string, StringType())

```

## Summary

## References

1. Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, _Learning Spark: Lightning-Fast Data Analytics_, 2nd Edition, O'Reilly.