## PySpark UDF (User Defined Function)

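In PySpark, you can write a Python function and either wrap it with the PySpark SQL `udf()` function for use on DataFrames, or register it as a UDF for use in SQL.
UDFs are among the most expensive operations in PySpark because Spark cannot optimize them and every row must be serialized between the JVM and Python, so use them only when no built-in function can do the job.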

In [0]:
dbutils.library.restartPython()  # Removes Python state; some libraries might not work without calling this command.

#### Load libraries

In [0]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField, ArrayType, MapType, DoubleType
from pyspark.sql.functions import lit, col, expr, when, sum, avg, max, min, mean, count, udf

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('PySpark UDF (User Defined Function)').getOrCreate()

#### Create a DataFrame

In [0]:
columns = ['seqno','name']

data = [
  ('1', 'john jones'),
  ('2', 'tracey smith'),
  ('3', 'amy sanders')
]

df = spark.createDataFrame(data=data,schema=columns)

df.show()

#### Create a Python Function

In [0]:
def convertCase(text):
  # Capitalize the first letter of each space-separated word
  resStr = ''
  arr = text.split(' ')
  for x in arr:
    resStr = f'{resStr}{x[0:1].upper()}{x[1:]} '
  return resStr.strip()
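
Before wrapping a function as a UDF, it can help to exercise the same logic in plain Python, where a failure is easier to read than in a Spark stack trace. The sketch below repeats the logic under a hypothetical name, `convert_case_sketch`, so that it runs on its own without a Spark session; the sample list, including the empty string, is an assumption added for illustration.

```python
# Hypothetical stand-alone copy of the notebook's logic, for illustration only.
def convert_case_sketch(text):
    result = ''
    for word in text.split(' '):
        # Upper-case the first character and keep the rest of the word unchanged.
        result = f'{result}{word[0:1].upper()}{word[1:]} '
    return result.strip()

# Sample inputs: the three names from the DataFrame plus an empty string.
samples = ['john jones', 'tracey smith', 'amy sanders', '']
converted = [convert_case_sketch(s) for s in samples]
print(converted)
```

Each word is capitalized and an empty string comes back as an empty string, so only a `None` input needs separate handling.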

#### Convert a Python function to PySpark UDF

Now convert the function `convertCase()` to a UDF by passing it to the PySpark SQL `udf()` function.  
`udf()` lives in the `pyspark.sql.functions` module; make sure you import it before using it.

In [0]:
# Converting function to UDF
convertUDF = udf(lambda z: convertCase(z),StringType())

In [0]:
# Converting function to UDF; StringType() is the default return type, so it can be omitted
convertUDF = udf(lambda z: convertCase(z))

#### Using UDF with DataFrame | select()

In [0]:
df.select(
  col('seqno'),
  convertUDF(col('name')).alias('name')
).show()

#### Using UDF with DataFrame | withColumn()

In [0]:
df.withColumn('fixed_name', convertUDF(col('name'))).show()

#### Registering a PySpark UDF and using it in SQL

In [0]:
spark.udf.register('convertUDF', convertCase, StringType())

df.createOrReplaceTempView('seq_tbl')

spark.sql('select seqno, convertUDF(name) as name from seq_tbl').show()

#### Creating a UDF using an annotation (decorator)

In [0]:
@udf(returnType=StringType())
def upperCase(text):
  return text.upper()

df.withColumn('CAPS name', upperCase(col('name'))).show()

#### Special Handling

PySpark/Spark does not guarantee the order of evaluation of subexpressions, meaning expressions are not guaranteed to be evaluated left-to-right or in any other fixed order.  
PySpark may reorder execution during query optimization and planning, so `AND` and `OR` conditions in `WHERE` and `HAVING` clauses do not short-circuit, and a UDF may run on rows you expected an earlier condition to filter out.  
When designing and using UDFs, be especially careful with null handling, since an unhandled null results in a runtime exception.

##### Execution order

In [0]:
# There is no guarantee that `name is not null` is evaluated first.
# If convertUDF(name) like '%John%' is evaluated first on a row where
# name is null, you will get a runtime error.

spark.sql("""
select seqno, 
       convertUDF(name) as name 
from seq_tbl 
where name is not null and convertUDF(name) like '%John%'
""").show()  

##### Handling null check

In [0]:
columns = ['seqno','name']

data = [
  ('1', 'john jones'),
  ('2', 'tracey smith'),
  ('3', 'amy sanders'),
  ('4', None)
]

df2 = spark.createDataFrame(data=data,schema=columns)

df2.show()

In [0]:
df2.createOrReplaceTempView('seq_tbl2')
# Fails at runtime: convertCase(None) raises an AttributeError for the row whose name is null
spark.sql("select convertUDF(name) from seq_tbl2").show()
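
The query above fails on the row whose `name` is null, because Spark passes a SQL null to a Python UDF as `None`. The cause can be reproduced without Spark. The sketch below uses a hypothetical stand-alone copy of the conversion logic, `convert_case_sketch`, and captures the exception message instead of letting it propagate.

```python
# Hypothetical stand-alone copy of the notebook's logic, for illustration only.
def convert_case_sketch(text):
    result = ''
    for word in text.split(' '):
        result = f'{result}{word[0:1].upper()}{word[1:]} '
    return result.strip()

error_message = None
try:
    # None has no split() method, so this call raises AttributeError.
    convert_case_sketch(None)
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

Inside Spark the same `AttributeError` surfaces wrapped in a longer executor stack trace, which is why a null check is worth adding before the UDF reaches a cluster.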

It is best practice to check for `null` inside the UDF itself rather than relying on a `null` check outside it.  
If you can't do the null check inside the UDF, at least use `IF` or `CASE WHEN` to check for `null` and call the UDF conditionally.

In [0]:
spark.udf.register('_nullsafeUDF', lambda s: convertCase(s) if s is not None else '', StringType())

In [0]:
spark.sql('select _nullsafeUDF(name) from seq_tbl2').show()

In [0]:
spark.sql("""select seqno, _nullsafeUDF(name) as name from seq_tbl2 where name is not null and _nullsafeUDF(name) like '%John%'""").show()
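
When several UDFs need the same guard, the null check can be factored into a reusable wrapper rather than repeated in each lambda. The following is a minimal sketch in plain Python, not part of the original notebook: `null_safe` and `convert_case_sketch` are hypothetical names, and the wrapper assumes a single-argument function.

```python
import functools

def null_safe(func):
    """Hypothetical helper: skip func and return None when the input is None."""
    @functools.wraps(func)
    def wrapper(value):
        if value is None:
            return None
        return func(value)
    return wrapper

@null_safe
def convert_case_sketch(text):
    # Same capitalization logic as the notebook's function, copied for illustration.
    result = ''
    for word in text.split(' '):
        result = f'{result}{word[0:1].upper()}{word[1:]} '
    return result.strip()

names = ['john jones', None]
results = [convert_case_sketch(n) for n in names]
print(results)
```

A function wrapped this way could be passed to `udf()` or `spark.udf.register()` in place of the lambda. Unlike `_nullsafeUDF`, it returns `None` for a null input, which Spark stores as a SQL null rather than an empty string; which of the two is preferable depends on how downstream queries treat missing names.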

#### The end of the notebook