User-defined functions are considered deterministic by default. Because of optimization, duplicate invocations may be eliminated, or the function may be invoked more times than it appears in the query. If your function is not deterministic, call asNondeterministic on the user-defined function, as random_udf does in the cells below.

### def udf(f=None, returnType=StringType()):

#### Parameters
    f : function
        Python function, if used as a standalone function.
    returnType : pyspark.sql.types.DataType or str
        The return type of the user-defined function. The value can be either a
        pyspark.sql.types.DataType object or a DDL-formatted type string
        (for example "int"). Defaults to StringType().



In [1]:
import pyspark

In [2]:
# Create SparkSession from builder
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local") \
                    .appName('UDF Pyspark') \
                    .getOrCreate()


In [3]:
from pyspark.sql.types import IntegerType

In [7]:
from pyspark.sql.functions import udf

In [4]:
import random

In [8]:
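# Mark the UDF as nondeterministic so the optimizer does not assume repeated calls return the same value
random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()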
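
Why the nondeterministic flag matters can be sketched without Spark. The toy evaluator below is plain Python and is not Spark's optimizer; `evaluate`, `random_score`, and the `dedupe` switch are hypothetical names made up for this illustration. It only shows that reusing the result of an identical call is harmless for a deterministic function but changes the outcome for a random one.

```python
import random


def evaluate(calls, func, dedupe):
    """Run func once per entry in calls (each entry is a tuple of arguments).

    With dedupe=True, an identical earlier call is reused instead of being
    re-run, which is the kind of rewrite that is only safe for
    deterministic functions.
    """
    cache = {}
    invocations = 0
    results = []
    for args in calls:
        if dedupe and args in cache:
            results.append(cache[args])
            continue
        value = func(*args)
        invocations += 1
        cache[args] = value
        results.append(value)
    return results, invocations


rng = random.Random(0)  # seeded only so the sketch is repeatable


def random_score():
    return rng.randint(0, 10**9)


# The "query" calls random_score() twice with identical (empty) arguments.
deduped, deduped_calls = evaluate([(), ()], random_score, dedupe=True)
separate, separate_calls = evaluate([(), ()], random_score, dedupe=False)

print(deduped_calls, separate_calls)
```

With deduplication the two results are forced to be the same value and the function runs only once, which is not what a caller asking for two random numbers expects.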

In [9]:
slen = udf(lambda s: len(s), IntegerType())

In [10]:
# No returnType given, so the default StringType() applies
@udf
def to_upper(s):
    if s is not None:
        return s.upper()

# Explicit integer return type
@udf(returnType=IntegerType())
def add_one(x):
    if x is not None:
        return x + 1


In [14]:
df = spark.createDataFrame([(1, "John Doe", 21, "johndoe@gmail.com")], ("id", "name", "age", "email"))
df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")).show()

+----------+--------------+------------+
|slen(name)|to_upper(name)|add_one(age)|
+----------+--------------+------------+
|         8|      JOHN DOE|          22|
+----------+--------------+------------+



#### Define a Function

In [15]:
def first_letter_function(email):
    return email[0]

first_letter_function("anurag@gmail.com")

'a'
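
Unlike to_upper and add_one above, first_letter_function has no guard for missing input: indexing None raises a TypeError and indexing an empty string raises an IndexError, and an exception inside a UDF fails the Spark job. A minimal sketch of a guarded variant follows; `safe_first_letter` is a hypothetical name, and returning None (which Spark treats as NULL) for empty strings is an assumption about the desired behavior.

```python
def safe_first_letter(email):
    """Return the first character of email, or None when there is none."""
    # None models a SQL NULL; an empty string has no first character.
    if not email:
        return None
    return email[0]


print(safe_first_letter("anurag@gmail.com"))
print(safe_first_letter(None))
print(safe_first_letter(""))
```

The guarded function can be wrapped with udf in the same way as first_letter_function.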

In [16]:
first_letter_udf = udf(first_letter_function)

In [17]:
df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age"), first_letter_udf("email")).show()

+----------+--------------+------------+----------------------------+
|slen(name)|to_upper(name)|add_one(age)|first_letter_function(email)|
+----------+--------------+------------+----------------------------+
|         8|      JOHN DOE|          22|                           j|
+----------+--------------+------------+----------------------------+



#### Register UDF to use in SQL
Register the function with spark.udf.register to make it callable by name from SQL queries. The call also returns a UDF that can be used with the DataFrame API.

In [18]:
df.createOrReplaceTempView("emp")

In [19]:
first_letter_udfsql = spark.udf.register("sql_udf", first_letter_function)

In [20]:
spark.sql("select sql_udf(email) from emp").show()

+--------------+
|sql_udf(email)|
+--------------+
|             j|
+--------------+



### Pandas/Vectorized UDFs
Pandas UDFs are available in Python to improve the efficiency of UDFs. Pandas UDFs utilize Apache Arrow to speed up computation.

The user-defined functions are executed using:

- Apache Arrow, an in-memory columnar data format that Spark uses to transfer data efficiently between JVM and Python processes with near-zero (de)serialization cost.
- Pandas inside the function, to work with Pandas instances and APIs.

Pandas UDFs built on top of Apache Arrow bring you the best of both worlds: the ability to define low-overhead, high-performance UDFs entirely in Python.
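
The difference between the two calling styles can be sketched with pandas alone, without Spark or Arrow. The function names here are hypothetical: `plus_one_scalar` stands in for a row-at-a-time UDF body, and `plus_one_vectorized` for a Pandas UDF body that receives a whole batch as one Series.

```python
import pandas as pd

values = pd.Series([1, 2, 3])


def plus_one_scalar(x):
    # Row-at-a-time style: called once per value.
    return x + 1


def plus_one_vectorized(batch: pd.Series) -> pd.Series:
    # Vectorized style: called once per batch; pandas does the element-wise work.
    return batch + 1


row_wise = pd.Series([plus_one_scalar(x) for x in values])
vectorized = plus_one_vectorized(values)

print(row_wise.tolist())
print(vectorized.tolist())
```

Both styles produce the same values; the vectorized one makes a single Python-level call per batch instead of one per row.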

Spark 2.3 introduced two types of Pandas UDFs: scalar and grouped map. The example below illustrates the scalar type.

#### Scalar Pandas UDFs
Scalar Pandas UDFs are used for vectorizing scalar operations. To define a scalar Pandas UDF, use @pandas_udf to annotate a Python function that takes pandas.Series as arguments and returns another pandas.Series of the same size.


In [23]:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# "string" is the return type; the function receives and returns a pandas Series of strings
@pandas_udf("string")
def vectorized_udf(email: pd.Series) -> pd.Series:
    return email.str[0]

In [24]:
# Equivalent form without the decorator: define a plain function, then wrap it with pandas_udf
def vectorized_udf(email: pd.Series) -> pd.Series:
    return email.str[0]
vectorized_udf = pandas_udf(vectorized_udf, "string")

In [25]:
df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age"), vectorized_udf("email")).show()

+----------+--------------+------------+---------------------+
|slen(name)|to_upper(name)|add_one(age)|vectorized_udf(email)|
+----------+--------------+------------+---------------------+
|         8|      JOHN DOE|          22|                    j|
+----------+--------------+------------+---------------------+
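
Because the body of a scalar Pandas UDF is ordinary pandas code, it can be checked on a small Series before it is wrapped. The sketch below uses a hypothetical unwrapped copy named `first_letter_batch` and made-up sample data that includes a missing value; it runs in pandas only and does not involve Spark.

```python
import pandas as pd


def first_letter_batch(email: pd.Series) -> pd.Series:
    # Same body as vectorized_udf, without the pandas_udf wrapper.
    return email.str[0]


batch = pd.Series(["johndoe@gmail.com", "anurag@gmail.com", None])
result = first_letter_batch(batch)

# The output has one element per input element, and the missing value stays missing.
print(len(result) == len(batch))
print(result.isna().tolist())
```

The string accessor skips missing entries instead of raising, so this version needs no explicit None check.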

