# Coding pandas User Defined Functions (UDFs)

![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png)

More examples are available on the Spark website: http://spark.apache.org/examples.html

Documentation on pandas UDFs is available at:
https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/udf-python-pandas

## Author: Bryan Cafferky Copyright 09/13/2021

### Warning!!!

#### To run this code, you must first upload the files and create the database tables. See Lesson 9 - Creating the SQL Tables on Databricks. A link to that video is in the video description.

In [0]:
sc.version

In [0]:
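# See if Arrow is enabled. This is the pre-Spark 3.0 key, deprecated in
# Spark 3.0 in favor of spark.sql.execution.arrow.pyspark.enabled.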
spark.conf.get("spark.sql.execution.arrow.enabled")

In [0]:
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

### Enabling for Conversion to/from Pandas

Arrow is available as an optimization when converting a Spark DataFrame to a Pandas DataFrame using the call toPandas() and when creating a Spark DataFrame from a Pandas DataFrame with createDataFrame(pandas_df). To use Arrow when executing these calls, users need to first set the Spark configuration spark.sql.execution.arrow.pyspark.enabled to true. This is disabled by default.

See https://spark.apache.org/docs/3.0.1/sql-pyspark-pandas-with-arrow.html#enabling-for-conversion-tofrom-pandas

In [0]:
# Check whether Arrow-based columnar data transfers are enabled
spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")

In [0]:
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

In [0]:
spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")

In addition, optimizations enabled by spark.sql.execution.arrow.pyspark.enabled can automatically fall back to a non-Arrow implementation if an error occurs before the actual computation within Spark. This behavior is controlled by spark.sql.execution.arrow.pyspark.fallback.enabled.

In [0]:
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

### Recommended Pandas and PyArrow Versions

For usage with pyspark.sql, the supported version of Pandas is 0.24.2 and the supported version of PyArrow is 0.15.1. Higher versions may be used; however, compatibility and data correctness cannot be guaranteed and should be verified by the user.

See https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-arrow.html#recommended-pandas-and-pyarrow-versions
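
A quick way to compare installed versions against the ones named above is to parse the version strings into tuples. The sketch below is only an illustration: the helper names `version_tuple` and `meets_minimum` are hypothetical, and treating the documented versions as a minimum is an assumption based on the "higher versions may be used" wording.

```python
import re

def version_tuple(version: str) -> tuple:
    """Turn a version string such as '1.3.5' into (1, 3, 5) for comparison."""
    # Drop any local build suffix after '+', then keep the first three numeric parts.
    parts = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(p) for p in parts[:3])

def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when the installed version is at least the given minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# Versions quoted from the Spark 3.0 documentation above (treated here as minimums).
MIN_PANDAS = "0.24.2"
MIN_PYARROW = "0.15.1"

print(meets_minimum("1.3.5", MIN_PANDAS))
print(meets_minimum("0.14.0", MIN_PYARROW))
```

In a notebook, the installed versions could be passed in as `pd.__version__` and `pyarrow.__version__`.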

In [0]:
import pandas as pd

pd.show_versions()

In [0]:
import pyarrow

pyarrow.__version__

## Create dataframe from a Spark SQL table

### Dataframe naming prefix convention:
##### 1st character is s for Spark DF
##### 2nd character is p for Python
##### 3rd and 4th characters are df for dataframe
##### 5th character is the _ separator
##### rest is a meaningful name

##### Example: spdf_salessummary = a Spark Python dataframe containing sales summary information.
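
The convention can be expressed as a regular expression, which is handy for checking names in a review script. This is only a sketch: the function name `follows_convention` is hypothetical, and restricting the meaningful part to lowercase letters, digits, and underscores is an assumption that the convention itself does not state.

```python
import re

# 's' (Spark) + 'p' (Python) + 'df' (dataframe) + '_' separator + a meaningful name.
# Assumption: the meaningful name is lowercase letters, digits, and underscores.
SPARK_DF_NAME = re.compile(r"^spdf_[a-z][a-z0-9_]*$")

def follows_convention(name: str) -> bool:
    """Return True when a variable name matches the spdf_ prefix convention."""
    return SPARK_DF_NAME.match(name) is not None

for candidate in ["spdf_salessummary", "spdf_sales", "sales_df", "spdf_"]:
    print(candidate, follows_convention(candidate))
```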

In [0]:
spark.sql('use awproject')
spdf_sales = spark.sql('select CustomerKey, OrderDateKey, SalesAmount, TotalProductCost from factinternetsales limit 10').dropna()

In [0]:
display(spdf_sales)

In [0]:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Series-to-Series pandas UDF: each input column arrives as a pandas Series,
# and the function returns a Series of the same length.
@pandas_udf('double')
def margin_percent_udf(salesamount: pd.Series, productcost: pd.Series) -> pd.Series:
    return (salesamount - productcost) / salesamount

spdf_sales.select("SalesAmount", "TotalProductCost", margin_percent_udf("SalesAmount", "TotalProductCost")).show()
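
Because a Series-to-Series pandas UDF receives ordinary pandas Series, its arithmetic can be tried on plain pandas data before involving Spark. The sketch below repeats the margin calculation in an undecorated function. The name `margin_percent` and the sample numbers are made up for illustration and do not come from the sales table.

```python
import pandas as pd

def margin_percent(salesamount: pd.Series, productcost: pd.Series) -> pd.Series:
    # Same arithmetic as the UDF body, applied element-wise to whole Series.
    return (salesamount - productcost) / salesamount

# Made-up sample values standing in for SalesAmount and TotalProductCost.
sales = pd.Series([100.0, 250.0, 80.0])
cost = pd.Series([60.0, 200.0, 80.0])

margins = margin_percent(sales, cost)
print(margins.tolist())
```

The result has one value per input row, which is the contract Spark expects from this kind of UDF.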

In [0]:
# Broadcast the tax rate so that every executor receives a read-only copy.
b_taxrate = sc.broadcast(0.07)

In [0]:
b_taxrate.value

In [0]:
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf      

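# The tax is a fractional amount, so the return type is double rather than long.
@pandas_udf("double")
def tax_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:

    # One-time setup: read the broadcast tax rate once per iterator,
    # not once per batch.
    taxrate = b_taxrate.value

    for salesamount in iterator:
        # Reuse the tax rate for every batch in the iterator.
        yield (taxrate * salesamount)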

spdf_sales.select(tax_udf("SalesAmount").alias("Tax")).show()
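
The point of the iterator form is that code placed before the loop runs once, while the loop body runs once per batch. The following sketch imitates that with plain pandas and a Python generator, without Spark. The names `load_taxrate`, `tax_batches`, and `init_calls` are hypothetical, and `load_taxrate` merely stands in for reading `b_taxrate.value`.

```python
from typing import Iterator, List
import pandas as pd

init_calls: List[str] = []  # records each time the setup step runs

def load_taxrate() -> float:
    # Hypothetical stand-in for reading the broadcast value (b_taxrate.value).
    init_calls.append("init")
    return 0.07

def tax_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    taxrate = load_taxrate()  # runs once, before the first batch
    for salesamount in batches:
        yield taxrate * salesamount  # runs once per batch

# Two made-up batches, imitating how Spark feeds a column in chunks.
batches = [pd.Series([100.0, 200.0]), pd.Series([50.0])]
taxes = list(tax_batches(iter(batches)))

print(len(taxes), "batches processed,", len(init_calls), "setup call")
```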

In [0]:
from typing import Iterator, Tuple
import pandas as pd

from pyspark.sql.functions import pandas_udf

# Iterator of multiple Series: each element is a tuple holding one batch per input column.
@pandas_udf('double')
def margin_percent_multi_iter_udf(iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    for salesamount, productcost in iterator:
        yield (salesamount - productcost) / salesamount

In [0]:
spdf_sales.select("SalesAmount", "TotalProductCost", margin_percent_multi_iter_udf("SalesAmount", "TotalProductCost")).show()
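
With several input columns, the iterator yields one tuple per batch, with one Series per column. A rough way to picture this outside Spark is to zip lists of per-column batches together, as in the sketch below. The function name `margin_batches` and the sample values are hypothetical.

```python
from typing import Iterator, Tuple
import pandas as pd

def margin_batches(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    # Each element is a (salesamount, productcost) pair of equally long Series.
    for salesamount, productcost in batches:
        yield (salesamount - productcost) / salesamount

# Made-up batches for the two columns; zip pairs them up batch by batch.
sales_batches = [pd.Series([100.0, 250.0]), pd.Series([80.0])]
cost_batches = [pd.Series([60.0, 200.0]), pd.Series([20.0])]

results = list(margin_batches(zip(sales_batches, cost_batches)))
combined = pd.concat(results, ignore_index=True)
print(combined.tolist())
```

The combined output has as many rows as the combined input, which Spark also requires of an iterator UDF.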