<a href="https://colab.research.google.com/github/harenlin/PySpark-Learning/blob/main/UDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pyspark
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('UDF').getOrCreate()
# defaultParallelism equals the number of local cores when Spark runs in local[*] mode
cores = spark.sparkContext.defaultParallelism
print("You are working with", cores, "core(s)")
spark

You 

In [2]:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

In [3]:
columns = ["no","Name"]
data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]
df = spark.createDataFrame(data=data, schema=columns)
df.show()

+---+------------+
| no|        Name|
+---+------------+
|  1|  john jones|
|  2|tracey smith|
|  3| amy sanders|
+---+------------+



In [4]:
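def convertCase(name):
    # Capitalize the first letter of each space-separated word.
    # Each word is followed by a space, so the result ends with a trailing space.
    resStr = ""
    for word in name.split(" "):
        resStr = resStr + word[0:1].upper() + word[1:] + " "
    return resStr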
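
A UDF is an ordinary Python function that Spark calls once per row, so its logic can be checked without a SparkSession. The sketch below uses a standalone copy of the function (named `convert_case` here only to keep the example self-contained) and prints the result with `repr` so the trailing space is visible.

```python
def convert_case(name):
    # Same logic as convertCase: capitalize each space-separated word
    result = ""
    for word in name.split(" "):
        result = result + word[0:1].upper() + word[1:] + " "
    return result

# repr() makes the trailing space visible
print(repr(convert_case("john jones")))           # 'John Jones '
# Stripping the result is one way to drop the trailing space
print(repr(convert_case("john jones").rstrip()))  # 'John Jones'
```

The trailing space is also why the `Converted_Name` values in the tables that follow appear shifted one character to the left.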

In [5]:
convertToUDF = F.udf(convertCase, StringType())

In [6]:
df.select(F.col("no"), convertToUDF(F.col("Name")).alias("Converted_Name")).show()

+---+--------------+
| no|Converted_Name|
+---+--------------+
|  1|   John Jones |
|  2| Tracey Smith |
|  3|  Amy Sanders |
+---+--------------+



In [7]:
df.withColumn('Converted_Name', convertToUDF(F.col("Name"))).show()

+---+------------+--------------+
| no|        Name|Converted_Name|
+---+------------+--------------+
|  1|  john jones|   John Jones |
|  2|tracey smith| Tracey Smith |
|  3| amy sanders|  Amy Sanders |
+---+------------+--------------+



In [8]:
spark.udf.register("convertToUDF", convertCase, StringType())
df.createOrReplaceTempView("NAME_TABLE")

In [9]:
spark.sql("select no, Name, convertToUDF(Name) as Converted_Name from NAME_TABLE").show()

+---+------------+--------------+
| no|        Name|Converted_Name|
+---+------------+--------------+
|  1|  john jones|   John Jones |
|  2|tracey smith| Tracey Smith |
|  3| amy sanders|  Amy Sanders |
+---+------------+--------------+



# Creating a UDF using an annotation (decorator)

The previous cells created a UDF in two steps: first define a Python function, then convert it to a UDF with the `udf()` function from `pyspark.sql.functions`. You can do both in a single step by applying `udf` to the function definition as an annotation, which Python calls a decorator.

In [11]:
# reference: https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/#pyspark-udf-introduction

@F.udf(returnType=StringType())  # apply udf as a decorator (annotation)
def upperCase(name):
    return name.upper()

df.withColumn("U_Name", upperCase(F.col("Name"))).show(truncate=False)

+---+------------+------------+
|no |Name        |U_Name      |
+---+------------+------------+
|1  |john jones  |JOHN JONES  |
|2  |tracey smith|TRACEY SMITH|
|3  |amy sanders |AMY SANDERS |
+---+------------+------------+
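
The decorator form is plain Python syntax: writing `@d` above `def f` has the same effect as `f = d(f)` after the definition. The sketch below shows that equivalence without Spark. `mark_as_udf` is a hypothetical stand-in that only tags the function; it is not part of PySpark and does not behave like `F.udf` beyond the calling pattern.

```python
def mark_as_udf(return_type):
    # Hypothetical stand-in for F.udf(returnType=...): it only tags the function
    def decorator(func):
        func.return_type = return_type
        return func
    return decorator

# Two-step form: define the function, then wrap it
def upper_case_plain(name):
    return name.upper()

upper_case_two_step = mark_as_udf("string")(upper_case_plain)

# One-step form: the decorator wraps the function at definition time
@mark_as_udf("string")
def upper_case_one_step(name):
    return name.upper()

print(upper_case_two_step("amy"), upper_case_one_step("amy"))              # AMY AMY
print(upper_case_two_step.return_type == upper_case_one_step.return_type)  # True
```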



# Exception Handling

In [17]:
# Row 4 has a null Name; convertToUDF has no null check, so the query below fails
columns = ["no","Name"]
data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders"), ('4',None)]
df2 = spark.createDataFrame(data=data,schema=columns)
df2.show(truncate=False)

df2.createOrReplaceTempView("NAME_TABLE2")
spark.sql("select convertToUDF(Name) from NAME_TABLE2").show(truncate=False)

+---+------------+
|no |Name        |
+---+------------+
|1  |john jones  |
|2  |tracey smith|
|3  |amy sanders |
|4  |null        |
+---+------------+



PythonException: ignored
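
The traceback is cut off in this export, so the following is a sketch of the likely cause, not a reading of the actual error. Spark passes a SQL null to a Python UDF as `None`, and the function then calls `.split` on it. The standalone copy below (named `convert_case` for this example only) reproduces that failure in plain Python.

```python
def convert_case(name):
    # Same logic as convertCase; assumes name is a string
    result = ""
    for word in name.split(" "):
        result = result + word[0:1].upper() + word[1:] + " "
    return result

# The second value mimics what the UDF receives for the null row
for value in ["john jones", None]:
    try:
        print(repr(convert_case(value)))
    except AttributeError as exc:
        # None has no .split method, so the call fails before any conversion
        print("failed on", repr(value), "with", type(exc).__name__)
```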

In [18]:
# Return an empty string for null input instead of calling convertCase on None
spark.udf.register("_nullsafeUDF", lambda name: convertCase(name) if name is not None else "", StringType())
spark.sql("select _nullsafeUDF(Name) from NAME_TABLE2").show(truncate=False)  # no longer raises

spark.sql("select no, _nullsafeUDF(Name) as Name from NAME_TABLE2 where Name is not null and _nullsafeUDF(Name) like '%John%'").show(truncate=False)

+------------------+
|_nullsafeUDF(Name)|
+------------------+
|John Jones        |
|Tracey Smith      |
|Amy Sanders       |
|                  |
+------------------+

+---+-----------+
|no |Name       |
+---+-----------+
|1  |John Jones |
+---+-----------+
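
If several UDFs need the same null check, the guard can be factored into a small reusable decorator instead of being repeated in each lambda. This is a minimal sketch under that assumption. `null_safe` and `convert_case` are hypothetical names introduced for illustration and do not appear in the notebook.

```python
from functools import wraps

def null_safe(default=""):
    # Hypothetical helper: return `default` when the input is None,
    # otherwise call the wrapped function
    def decorator(func):
        @wraps(func)
        def wrapper(value):
            return default if value is None else func(value)
        return wrapper
    return decorator

@null_safe(default="")
def convert_case(name):
    # Same capitalization logic as convertCase, including the trailing space
    return "".join(word[0:1].upper() + word[1:] + " " for word in name.split(" "))

print(repr(convert_case("amy sanders")))  # 'Amy Sanders '
print(repr(convert_case(None)))           # ''

# With a live SparkSession named `spark`, the guarded function could be
# registered the same way as in the cell above:
# spark.udf.register("_nullsafeUDF", convert_case, StringType())
```

Returning an empty string mirrors the lambda in the previous cell. Passing `default=None` instead would keep the null in the result column.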

