
**Create a UDF**
***1. Create a DataFrame***

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

columns = ["NUM","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df.show(truncate=False)

+---+------------+
|NUM|Name        |
+---+------------+
|1  |john jones  |
|2  |tracey smith|
|3  |amy sanders |
+---+------------+



**2.2 Create a Python Function**

Python's str.split() breaks a string into a list of substrings on a delimiter:

In [3]:
text = "apple,banana,cherry"
result = text.split(',')
print(result) # Output: ['apple', 'banana', 'cherry']

['apple', 'banana', 'cherry']


In [13]:
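def convertCase(s):
    # Start with a single space, so the result carries a leading space
    resStr = " "
    # Split the input into words on single spaces
    arr = s.split(" ")
    for x in arr:
        # Upper-case the first character of each word and append a trailing space
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr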


In [15]:
print(convertCase("hello baby"))

 Hello Baby 
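
The padded output above follows from how convertCase builds its result: resStr starts as a single space and every word appends another one. As a minimal sketch, assuming the padding is not wanted, the same capitalisation can be written with str.join. The name convert_case_clean is hypothetical and is not part of the original notebook.

```python
def convert_case_clean(text):
    """Capitalise the first letter of each space-separated word, without padding."""
    # Same per-word rule as convertCase: upper-case the first character, keep the rest
    words = [word[0:1].upper() + word[1:] for word in text.split(" ")]
    # join() places a space only between words, so no leading or trailing space appears
    return " ".join(words)


print(repr(convert_case_clean("hello baby")))  # 'Hello Baby'
```

This variant could be wrapped with udf(...) in the same way as convertCase; the tables in the following sections would then show the names without the surrounding spaces.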


**2.3 Convert a Python function to PySpark UDF**

In [16]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Converting function to UDF
convertUDF = udf(lambda z: convertCase(z),StringType())

**3. Using UDF with DataFrame**

**3.1 Using UDF with PySpark DataFrame select()**

In [17]:
df.select(col("NUM"), \
    convertUDF(col("Name")).alias("Name") ) \
   .show(truncate=False)

+---+--------------+
|NUM|Name          |
+---+--------------+
|1  | John Jones   |
|2  | Tracey Smith |
|3  | Amy Sanders  |
+---+--------------+



**3.2 Using UDF with PySpark DataFrame withColumn()**

In [19]:
def upperCase(str):
    return str.upper()

In [20]:
upperCaseUDF = udf(lambda z:upperCase(z),StringType())

df.withColumn("Cureated Name", upperCaseUDF(col("Name"))) \
  .show(truncate=False)

+---+------------+-------------+
|NUM|Name        |Cureated Name|
+---+------------+-------------+
|1  |john jones  |JOHN JONES   |
|2  |tracey smith|TRACEY SMITH |
|3  |amy sanders |AMY SANDERS  |
+---+------------+-------------+



**3.3 Registering a PySpark UDF and using it in SQL**

In [22]:
""" Using UDF on SQL """
spark.udf.register("convertUDF", convertCase,StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("select NUM, convertUDF(Name) as Name from NAME_TABLE") \
     .show(truncate=False)

+---+--------------+
|NUM|Name          |
+---+--------------+
|1  | John Jones   |
|2  | Tracey Smith |
|3  | Amy Sanders  |
+---+--------------+



**4. Creating a UDF using an annotation (decorator)**

In [23]:
@udf(returnType=StringType())
def upperCase(str):
    return str.upper()

df.withColumn("Cureated Name", upperCase(col("Name"))) \
.show(truncate=False)

+---+------------+-------------+
|NUM|Name        |Cureated Name|
+---+------------+-------------+
|1  |john jones  |JOHN JONES   |
|2  |tracey smith|TRACEY SMITH |
|3  |amy sanders |AMY SANDERS  |
+---+------------+-------------+



**5. Special Handling**

**5.1 Execution order**

In [32]:
"""
Spark does not guarantee that "Name is not null" is evaluated before
"convertUDF(Name) like '%John%'". If the UDF condition is evaluated first
and Name is null, the query fails with a runtime error.
"""
spark.sql("select NUM, convertUDF(Name) as Name from NAME_TABLE " \
          +"where Name is not null and convertUDF(Name) like '%John%'") \
     .show(truncate=False)

+---+------------+
|NUM|Name        |
+---+------------+
|1  | John Jones |
+---+------------+
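
Because the evaluation order of the two conditions is not guaranteed, one way to avoid the runtime error is to make the function itself tolerate None. The following is a minimal pure-Python sketch of that idea. The names null_safe and convert_case are hypothetical, and returning an empty string for None is an assumption made for illustration.

```python
def convert_case(text):
    """Capitalise the first letter of each word (same per-word rule as convertCase)."""
    result = ""
    for word in text.split(" "):
        result = result + word[0:1].upper() + word[1:] + " "
    return result


def null_safe(func, default=""):
    """Wrap func so that a None input returns `default` instead of raising."""
    def wrapper(value):
        # Spark passes a SQL NULL to a Python UDF as None
        return default if value is None else func(value)
    return wrapper


safe_convert = null_safe(convert_case)
print(repr(safe_convert("john jones")))  # 'John Jones '
print(repr(safe_convert(None)))          # ''
```

The wrapped function could then be registered in the usual way, for example with spark.udf.register("convertUDF", safe_convert, StringType()), so that the query no longer depends on which condition Spark evaluates first.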



**Complete PySpark UDF Example**

In [33]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df.show(truncate=False)

def convertCase(str):
    resStr=""
    arr = str.split(" ")
    for x in arr:
       resStr= resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr

""" Converting function to UDF """
convertUDF = udf(lambda z: convertCase(z))

df.select(col("Seqno"), \
    convertUDF(col("Name")).alias("Name") ) \
.show(truncate=False)

def upperCase(str):
    return str.upper()

upperCaseUDF = udf(lambda z:upperCase(z),StringType())

df.withColumn("Cureated Name", upperCaseUDF(col("Name"))) \
.show(truncate=False)

""" Using UDF on SQL """
spark.udf.register("convertUDF", convertCase,StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE") \
     .show(truncate=False)

spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE " + \
          "where Name is not null and convertUDF(Name) like '%John%'") \
     .show(truncate=False)

""" null check """

columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders"),
    ('4',None)]

df2 = spark.createDataFrame(data=data,schema=columns)
df2.show(truncate=False)
df2.createOrReplaceTempView("NAME_TABLE2")

spark.udf.register("_nullsafeUDF", lambda s: convertCase(s) if s is not None else "", StringType())

spark.sql("select _nullsafeUDF(Name) from NAME_TABLE2") \
     .show(truncate=False)

spark.sql("select Seqno, _nullsafeUDF(Name) as Name from NAME_TABLE2 " + \
          " where Name is not null and _nullsafeUDF(Name) like '%John%'") \
     .show(truncate=False)

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
+-----+------------+

+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

+-----+------------+-------------+
|Seqno|Name        |Cureated Name|
+-----+------------+-------------+
|1    |john jones  |JOHN JONES   |
|2    |tracey smith|TRACEY SMITH |
|3    |amy sanders |AMY SANDERS  |
+-----+------------+-------------+

+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

+-----+-----------+
|Seqno|Name       |
+-----+-----------+
|1    |John Jones |
+-----+-----------+

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
|4    |NULL        |
+-----+------------+

+------------------+
|_nul