## Measuring the overhead of PySpark UDFs

Results:
1. About 800 ns of overhead per PySpark UDF call
2. Almost no additional time cost for passing large objects to Python

In [18]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("SingleTaskApp") \
    .master("local[1]") \
    .getOrCreate()

sc = spark.sparkContext

In [19]:
from pyspark.sql.types import LongType, IntegerType

### Declare two minimal UDFs

Both UDFs do almost no computation, so their run time is dominated by passing objects to Python and calling the function.

In [None]:
def increment(x):
    return x + 1

# Register both functions so they can be called by name from Spark SQL
spark.udf.register("getLen", len, LongType())
spark.udf.register("incrementInt", increment, IntegerType())

### Function call cost

Measure the overhead of invoking a Python UDF compared with the equivalent built-in Spark SQL expression.
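The notebook does not show how the wall-clock times in this section were taken. Outside a notebook, one way to collect them is a small timing helper like the sketch below. The names `timed` and `timings` are hypothetical, and the `sum(range(...))` call is only a stand-in for a Spark action, so that the snippet runs without a Spark session.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload. In the notebook this would be a Spark action, e.g.
# spark.sql("SELECT 1+num FROM randint").write.mode("overwrite").parquet(...)
with timed("builtin", timings):
    sum(range(1_000_000))

print(f"builtin: {timings['builtin']:.3f} s")
```

Timing the write action rather than only the `spark.sql(...)` call matters because Spark evaluates queries lazily, so nothing runs until an action is triggered.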

In [21]:
import pandas as pd, numpy as np
# Generate 10 million random 32-bit integers
pd.DataFrame({'num': np.random.randint(np.iinfo(np.int32).max+1, size=10000000, dtype=np.int32)}).to_parquet('randint.parquet')

In [22]:
spark.read.parquet('randint.parquet').createOrReplaceTempView("randint")

In [23]:
spark.sql("SELECT incrementInt(num) FROM randint").write.mode("overwrite").parquet("incremented.parquet")

Takes 9.3 s.

In [24]:
spark.sql("SELECT 1+num FROM randint").write.mode("overwrite").parquet("incremented.parquet")

Takes 1.3 s.

Thus, the minimum time overhead of each Python UDF call is estimated at (9.3 s - 1.3 s) / 10,000,000 calls = 8e-7 seconds (800 nanoseconds).
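The arithmetic behind this estimate can be checked directly. The sketch below uses only the two timings and the row count reported above; the variable names are illustrative.

```python
# Wall-clock times reported above, in seconds
udf_seconds = 9.3       # SELECT incrementInt(num)
builtin_seconds = 1.3   # SELECT 1+num
num_rows = 10_000_000   # one UDF call per row

# Attribute the whole difference to UDF invocation overhead
overhead_per_call_s = (udf_seconds - builtin_seconds) / num_rows
overhead_per_call_ns = overhead_per_call_s * 1e9

print(f"{overhead_per_call_ns:.0f} ns per UDF call")  # prints: 800 ns per UDF call
```

This treats the entire difference between the two runs as UDF overhead, which assumes that reading and writing the Parquet files cost the same in both runs.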

### Data transfer cost

Measure the time cost of passing large values between Spark and a Python UDF.

In [25]:
# Use Spark SQL to generate a DataFrame with a column 'zeros' holding 100,000 strings of '0' characters, each 100 KB long (10 GB in total)

spark.sql("SELECT repeat('0', 100000) AS zeros FROM range(100000)").write.mode("overwrite").parquet("zerostr.parquet")

In [26]:
spark.read.parquet('zerostr.parquet').createOrReplaceTempView("zerostr")

In [27]:
df = spark.sql("SELECT LENGTH(zeros) FROM zerostr").toPandas()

In [28]:
df = spark.sql("SELECT getLen(zeros) FROM zerostr").toPandas()

Both queries take 29.7 s to compute the lengths of 10 GB of string data, so passing objects to Python adds almost no time.
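As a rough sanity check, the figures above can be turned into a data volume and an effective throughput. This sketch assumes one byte per `'0'` character, which holds for ASCII text in UTF-8, and it counts the uncompressed in-memory size. The Parquet file on disk is likely much smaller because the data compresses well.

```python
rows = 100_000
bytes_per_row = 100_000          # assumption: 1 byte per '0' character
total_bytes = rows * bytes_per_row

# Wall-clock times reported above, in seconds
builtin_seconds = 29.7           # SELECT LENGTH(zeros)
udf_seconds = 29.7               # SELECT getLen(zeros)

throughput_mb_s = total_bytes / udf_seconds / 1e6
extra_per_row_us = (udf_seconds - builtin_seconds) / rows * 1e6

print(f"total data: {total_bytes / 1e9:.0f} GB")
print(f"effective throughput: {throughput_mb_s:.0f} MB/s")
print(f"extra time per row with the UDF: {extra_per_row_us:.1f} us")
```

With identical timings for both queries, the measured extra cost per 100 KB value is below the resolution of this experiment.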