<a href="https://colab.research.google.com/github/harenlin/PySpark-Learning/blob/main/UDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
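!pip install pyspark
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession used by every cell below
spark = SparkSession.builder.appName('UDF').getOrCreate()
# Note: getExecutorMemoryStatus() has one entry per block manager (driver/executors),
# so this value counts those entries rather than physical CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")
spark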

You 

In [21]:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType  # the only type used in this notebook

columns = ["no", "Name"]
data = [("1", "haren lin"), ("2", "jimmy lin"), ("3", "watson wang")]
df = spark.createDataFrame(data=data, schema=columns)

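# Plain Python function: capitalize the first letter of every space-separated word
def convertCase(text):
    resStr = ""
    arr = text.split(" ")
    for x in arr:
        # A space is appended after every word, so the result ends with a trailing space
        resStr = resStr + x[0:1].upper() + x[1:] + " "
    return resStr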

In [24]:
# Convert the function to a UDF - method 1: wrap it with F.udf()
convertToUDF = F.udf(convertCase, StringType())
df.select(F.col("no"), F.col("Name"), convertToUDF(F.col("Name")).alias("Converted_Name")).show()

+---+-----------+--------------+
| no|       Name|Converted_Name|
+---+-----------+--------------+
|  1|  haren lin|    Haren Lin |
|  2|  jimmy lin|    Jimmy Lin |
|  3|watson wang|  Watson Wang |
+---+-----------+--------------+
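
The space before the closing `|` in the `Converted_Name` column comes from `convertCase` itself, which appends a space after every word, including the last one. The sketch below reproduces that behaviour in plain Python (no Spark needed) and shows one possible variant that avoids it. The name `convert_case_joined` is hypothetical and is not part of the original notebook.

```python
# Local copy of the notebook's function, so this sketch runs without Spark
def convertCase(text):
    result = ""
    for word in text.split(" "):
        result = result + word[0:1].upper() + word[1:] + " "
    return result

# Hypothetical variant: join the capitalized words so no trailing space is added
def convert_case_joined(text):
    return " ".join(word[0:1].upper() + word[1:] for word in text.split(" "))

print(repr(convertCase("haren lin")))          # 'Haren Lin '
print(repr(convert_case_joined("haren lin")))  # 'Haren Lin'
```

Because a UDF wraps an ordinary Python function, checking it locally like this before passing it to `F.udf` is usually quicker than debugging it through a Spark job.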



In [13]:
df.withColumn('Converted_Name', convertToUDF(F.col("Name"))).show()

+---+-----------+--------------+
| no|       Name|Converted_Name|
+---+-----------+--------------+
|  1|  haren lin|    Haren Lin |
|  2|  jimmy lin|    Jimmy Lin |
|  3|watson wang|  Watson Wang |
+---+-----------+--------------+



In [30]:
# Register the PySpark UDF and use it in SQL
spark.udf.register(name="convertToUDF", f=convertCase, returnType=StringType())
df.createOrReplaceTempView("TempTable")
spark.sql("select no, Name, convertToUDF(Name) as Converted_Name from TempTable").show()

+---+-----------+--------------+
| no|       Name|Converted_Name|
+---+-----------+--------------+
|  1|  haren lin|    Haren Lin |
|  2|  jimmy lin|    Jimmy Lin |
|  3|watson wang|  Watson Wang |
+---+-----------+--------------+



# Creating a UDF using an annotation

The previous sections created a UDF in two steps: first define a Python function, then convert it to a UDF with the `udf()` function from `pyspark.sql.functions`. Both steps can be combined into one by applying `udf()` as an annotation (a Python decorator) directly on the function definition.

In [35]:
# reference: https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/#pyspark-udf-introduction

# Convert the function to a UDF - method 2: define the UDF with an annotation (decorator)
@F.udf(returnType=StringType())
def upperCase(text):
    return text.upper()

df.withColumn("Name2Upper", upperCase(F.col("Name"))).show(truncate=False)

+---+-----------+-----------+
|no |Name       |Name2Upper |
+---+-----------+-----------+
|1  |haren lin  |HAREN LIN  |
|2  |jimmy lin  |JIMMY LIN  |
|3  |watson wang|WATSON WANG|
+---+-----------+-----------+



# Exception Handling

In [36]:
# Add a row whose Name is null; convertCase has no null check, so the UDF call below fails
columns = ["no", "Name"]
data = [("1", "haren lin"), ("2", "jimmy lin"), ("3", "watson wang"), ("4", None)]
df2 = spark.createDataFrame(data=data, schema=columns)
df2.show(truncate=False)

df2.createOrReplaceTempView("NAME_TABLE2")
spark.sql("select convertToUDF(Name) from NAME_TABLE2").show(truncate=False)

+---+-----------+
|no |Name       |
+---+-----------+
|1  |haren lin  |
|2  |jimmy lin  |
|3  |watson wang|
|4  |null       |
+---+-----------+



PythonException: ignored
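
The failure can be reproduced without Spark. A null column value reaches a Python UDF as `None`, and `convertCase` immediately calls `.split` on its argument. The sketch below uses a local copy of the function and only illustrates the likely cause; the full traceback from the notebook is not shown above.

```python
# Local copy of the notebook's function, so this sketch runs without Spark
def convertCase(text):
    result = ""
    for word in text.split(" "):
        result = result + word[0:1].upper() + word[1:] + " "
    return result

failure = None
try:
    convertCase(None)  # what the UDF receives for the row whose Name is null
except AttributeError as err:
    failure = err      # None has no .split method

print(type(failure).__name__, "-", failure)
```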

In [38]:
# Wrap convertCase so that a null Name maps to an empty string instead of raising
spark.udf.register("_nullsafeUDF", lambda s: convertCase(s) if s is not None else "", StringType())
spark.sql("select _nullsafeUDF(Name) from NAME_TABLE2").show(truncate=False)  # no more error

+------------------+
|_nullsafeUDF(Name)|
+------------------+
|Haren Lin         |
|Jimmy Lin         |
|Watson Wang       |
|                  |
+------------------+
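
An alternative to writing the null check inline in every lambda is a small reusable wrapper. The sketch below is plain Python and runs without Spark. The decorator name `null_safe` is hypothetical, and it returns `None` for a missing value instead of the empty string used above, which is a design choice and not the notebook's behaviour.

```python
from functools import wraps

# Hypothetical helper: skip the wrapped function when the input is None
def null_safe(func):
    @wraps(func)
    def wrapper(value):
        return None if value is None else func(value)
    return wrapper

@null_safe
def convertCase(text):
    result = ""
    for word in text.split(" "):
        result = result + word[0:1].upper() + word[1:] + " "
    return result

print(repr(convertCase("jimmy lin")))  # 'Jimmy Lin '
print(convertCase(None))               # None
```

A function wrapped this way could be passed to `spark.udf.register` in the same way as in the cell above. A UDF that returns `None` produces a SQL null in the result column, so the choice between null and an empty string depends on how downstream queries should treat missing names.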

