# Big data: Using pandas UDFs

PySpark’s interoperability with pandas (colloquially called pandas UDFs) is a huge selling point when performing data analysis at scale. pandas is the dominant in-memory Python data manipulation library, while PySpark is the dominant distributed one. Combining them unlocks additional possibilities.

We will look into operations on `GroupedData` and how PySpark plus pandas implement the split-apply-combine pattern common to data analysis. We finish with the ultimate interaction between pandas and PySpark: treating a PySpark data frame like a small collection of pandas DataFrames.

### Column transformations with pandas: Using Series UDF

The Series UDF family shares a column-first focus with regular PySpark data transformation functions. All of our UDFs in this section take a `Column` object (or objects) as input and return a `Column` object as output.

PySpark provides three types of Series UDFs.
- The *Series to Series* UDF takes `Column` objects as inputs, converts them to pandas Series objects, and returns a Series object that gets promoted back to a PySpark `Column` object.
- The *Iterator of Series to Iterator of Series* UDF differs in that the `Column` object gets split into batches that are fed to the function as an Iterator. It takes a single `Column` object as input and returns a single `Column`.
- The *Iterator of multiple Series to Iterator of Series* UDF combines the previous two: it can take multiple `Column` objects as input, like the Series to Series UDF, yet preserves the iterator pattern of the Iterator of Series to Iterator of Series UDF.
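
The three shapes described above can be sketched as plain Python functions with pandas type hints. This is only an illustration under assumptions: the function names are hypothetical, the batches are built by hand, and in PySpark each function would additionally be wrapped with `pandas_udf()` so that Spark creates the batches.

```python
from typing import Iterator, Tuple

import pandas as pd


# Series to Series: one (or more) Series in, one Series of the same length out.
def add_one(values: pd.Series) -> pd.Series:
    return values + 1


# Iterator of Series to Iterator of Series: batches of a single column.
def add_one_batched(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        yield batch + 1


# Iterator of multiple Series to Iterator of Series: each batch is a tuple
# holding one Series per input column.
def add_columns_batched(
    batches: Iterator[Tuple[pd.Series, pd.Series]]
) -> Iterator[pd.Series]:
    for left, right in batches:
        yield left + right


# Local check with hand-made batches (hypothetical data).
single = add_one(pd.Series([1, 2, 3]))
batched = list(add_one_batched(iter([pd.Series([1, 2]), pd.Series([3])])))
combined = list(
    add_columns_batched(iter([(pd.Series([1, 2]), pd.Series([10, 20]))]))
)
```

Only the type hints and the loop over batches differ between the three variants; the pandas logic itself stays the same.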


#### Connecting Spark to Google's BigQuery

We connect PySpark to Google’s BigQuery, where we will use the National Oceanic and Atmospheric Administration’s (NOAA) Global Surface Summary of the Day (GSOD) data set. This also provides a blueprint for connecting PySpark to other data warehouses, such as SQL or NoSQL databases.

You need a GCP account. Once your account is created, create a service account and a service account key so that BigQuery gives you programmatic access to the public data. To do so, select Service Account (under IAM & Admin) and click + Create Service Account. Give your service account a meaningful name. In the service account permissions menu, select BigQuery → BigQuery admin and click Continue. In the last step, click + CREATE KEY and select JSON. Download the key and store it somewhere safe.

Download Google's BigQuery connector from [here](https://github.com/GoogleCloudDataproc/spark-bigquery-connector).



#### Making the connection between PySpark and BigQuery through the connector

We instruct Spark to fetch and install external dependencies, in our case the `com.google.cloud.spark:spark-bigquery` connector. As it is a Java/Scala dependency, we need to match the correct Spark and Scala versions (the `_2.12` suffix of the package name used below is the Scala version).

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(
    "spark.jars.packages",
    "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.28.0",
).getOrCreate()

#### Reading data from BigQuery using our secret key

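Before we can start creating pandas UDFs, we just have to read the data. The full example assembles 10 years' worth of weather data located in BigQuery, which totals over 40 million records. The code below uses `range(2018, 2019)`, so it reads only the 2018 table; widen the range to load more years.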

We use the BigQuery-specialized Spark reader, provided by the connector library we added to our Spark session, with two options:
- The `table` option points to the table we want to ingest. The format is `project.dataset.table`; `bigquery-public-data` is a project available to all.
- The `credentialsFile` option is the JSON key downloaded before. You need to adjust the path and file name according to the location of the file.

In [6]:
from functools import reduce
import pyspark.sql.functions as F

def read_df_from_bq(year):
    return (
        spark.read.format("bigquery").option(
            "table", f"bigquery-public-data.noaa_gsod.gsod{year}"
        )
        .option("credentialsFile", "big-query-spark-key.json")
        .load()
    )


gsod = (
    reduce(
        lambda x, y: x.unionByName(y, allowMissingColumns=True),
        [read_df_from_bq(year) for year in range(2018, 2019)],
    )
    .dropna(subset=["year", "mo", "da", "temp"])
    .where(F.col("temp") != 9999.9)
    .drop("date")
)

#### Series to Series UDF: Column functions, but with pandas

The Series to Series UDF, also called Scalar UDF, is akin to most of the functions in the `pyspark.sql.functions` module. For the most part, it works just like the Python UDFs seen in the previous notebook, with one key difference: Python UDFs work on one record at a time, and you express your logic through regular Python code. Scalar UDFs work on one Series at a time, and you express your logic through pandas code.

<img src="images/series_2_series_udf.png" width="600px">

In a Python UDF, when you pass `Column` objects to your UDF, PySpark will unpack each value, perform the computation, and then return the value for each record in a `Column` object. In a Scalar UDF, by contrast, PySpark will serialize (through a library called PyArrow) each partitioned column into a pandas Series object. You then perform the operations on the Series object directly, returning a Series of the same dimension from your UDF.




Let's create a simple function that transforms Fahrenheit degrees to Celsius.
- Instead of `udf()`, we use `pandas_udf()`, again from the `pyspark.sql.functions` module. Optionally (but recommended), we can pass the return type of the UDF as an argument to the `pandas_udf()` decorator.
- Our function signature is also different: rather than using scalar values (such as `int` or `str`), the UDF takes a `pd.Series` and returns a `pd.Series`.

In [7]:
import pandas as pd
import pyspark.sql.types as T

@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    """Transforms Fahrenheit to Celsius."""
    return (degrees - 32) * 5 / 9

We apply our newly created Series to Series UDF to the `temp` column of the `gsod` data frame, which contains the temperature (in Fahrenheit) of each station-day combination.

In [8]:
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)

+----+-------------------+
|temp|             temp_c|
+----+-------------------+
|29.6|-1.3333333333333326|
|53.5| 11.944444444444445|
|71.6| 21.999999999999996|
|70.4| 21.333333333333336|
|37.2| 2.8888888888888906|
+----+-------------------+
only showing top 5 rows
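
Because the body of a Scalar UDF is ordinary pandas code, the same logic can be checked locally before it goes near a cluster. The sketch below is an assumption-laden illustration rather than part of the notebook's pipeline: the function names and sample temperatures are made up, and it contrasts record-at-a-time logic (as in a regular Python UDF) with Series-at-a-time logic (as in a Scalar UDF).

```python
import pandas as pd


def f_to_c_scalar(degrees: float) -> float:
    """Record-at-a-time logic, as in a regular Python UDF."""
    return (degrees - 32) * 5 / 9


def f_to_c_series(degrees: pd.Series) -> pd.Series:
    """Series-at-a-time logic, as in a Scalar (pandas) UDF."""
    return (degrees - 32) * 5 / 9


# Hypothetical sample values in Fahrenheit.
temps = pd.Series([32.0, 212.0, 50.0])

row_by_row = temps.map(f_to_c_scalar)  # one Python call per record
vectorized = f_to_c_series(temps)      # one call for the whole Series
```

Both approaches give the same values; the difference is how many times Python is invoked to produce them.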



#### Scalar UDF + cold start = Iterator of Series UDF

This section covers the other two types of Scalar UDFs: the *Iterator of Series to Iterator of Series* UDF and the *Iterator of multiple Series to Iterator of Series* UDF.

Iterator of Series UDFs are very useful when you have an expensive cold start operation you need to perform. By cold start, we mean an operation we need to perform once at the beginning of the processing step, before working through the data. 

In [9]:
from time import sleep
from typing import Iterator

@F.pandas_udf(T.DoubleType())
def f_to_c2(degrees: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Transforms Fahrenheit to Celsius."""
    # We simulate a cold start using sleep() for five seconds.
    # The cold start will happen on each worker
    # once, rather than for every batch
    sleep(5)

    for batch in degrees:
        yield (batch - 32) * 5 / 9


gsod.select(
    "temp", f_to_c2(F.col("temp")).alias("temp_c")
).distinct().show(5)

+----+-------------------+
|temp|             temp_c|
+----+-------------------+
|29.6|-1.3333333333333326|
|53.5| 11.944444444444445|
|71.6| 21.999999999999996|
|70.4| 21.333333333333336|
|37.2| 2.8888888888888906|
+----+-------------------+
only showing top 5 rows
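
To make the cold-start benefit visible without waiting on `sleep()`, one can count how often the setup step runs when the undecorated function is fed hand-made batches. This is a local sketch under assumptions: `expensive_setup` is a hypothetical stand-in for a costly initialization, and the batches are built manually rather than by Spark.

```python
from typing import Iterator

import pandas as pd

setup_calls = 0


def expensive_setup() -> float:
    """Hypothetical stand-in for a costly initialization step."""
    global setup_calls
    setup_calls += 1
    return 32.0


def f_to_c_batched(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    offset = expensive_setup()  # runs once per iterator, not once per batch
    for batch in batches:
        yield (batch - offset) * 5 / 9


# Three hand-made batches (hypothetical values in Fahrenheit).
batches = [pd.Series([32.0, 50.0]), pd.Series([212.0]), pd.Series([-40.0])]
converted = list(f_to_c_batched(iter(batches)))
```

After all three batches are consumed, `setup_calls` is 1, which is the behavior the iterator form is meant to provide.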



The Iterator of multiple Series to Iterator of Series UDF is a special case that wraps multiple columns in a single iterator. We'll assemble the `year`, `mo`, and `da` columns (representing the year, month, and day) into a single column. This example requires more data transformation than when using an Iterator of a single Series.

Our date assembly UDF works like this:
1. `year_mo_da` is an Iterator of a tuple of Series, representing all the batches of values contained in the `year`, `mo`, and `da` columns.
2. To access each batch, we use a for loop over the iterator, the same principle as for the Iterator of Series UDF.
3. To extract each individual Series from the tuple, we use multiple assignment. In this case, `year` will map to the first Series of the tuple, `mo` to the second, and `da` to the third.
4. Since `pd.to_datetime` requests a data frame containing the `year`, `month`, and `day` columns, we create the data frame via a dictionary, giving the keys the relevant column names. `pd.to_datetime` returns a Series.
5. Finally, we yield the answer to build the Iterator of Series, fulfilling our contract.

<img src="images/iterator_of_mutl_series.png">

In [10]:
from typing import Tuple

@F.pandas_udf(T.DateType())
def create_date(year_mo_da: Iterator[Tuple[pd.Series, pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    """Merges three cols (representing Y-M-D of a date) into a Date col."""
    for year, mo, da in year_mo_da:
        yield pd.to_datetime(
            pd.DataFrame(dict(year=year, month=mo, day=da))
        )


gsod.select(
    "year", "mo", "da",
    create_date(F.col("year"), F.col("mo"), F.col("da")).alias("date"),
).distinct().show(5)

+----+---+---+----------+
|year| mo| da|      date|
+----+---+---+----------+
|2018| 08| 21|2018-08-21|
|2018| 07| 29|2018-07-29|
|2018| 05| 12|2018-05-12|
|2018| 03| 20|2018-03-20|
|2018| 09| 11|2018-09-11|
+----+---+---+----------+
only showing top 5 rows



Scalar UDFs are very useful when you make column-level transformations, just like the functions in `pyspark.sql.functions`. When using any Scalar user-defined function, you need to remember that PySpark will not guarantee the order or the composition of the batches when applying it.
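
The warning about batch composition matters whenever the logic depends on the other values in the batch. The following local sketch, with a hypothetical `center` function and hand-made batchings, shows the same data giving different answers depending on how it is split.

```python
import pandas as pd


def center(batch: pd.Series) -> pd.Series:
    """Batch-dependent logic: subtracts the mean of whatever batch it gets."""
    return batch - batch.mean()


values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Same data, two different (hypothetical) batchings.
one_batch = center(values)
two_batches = pd.concat([center(values.iloc[:2]), center(values.iloc[2:])])
```

Here `one_batch` holds `[-1.5, -0.5, 0.5, 1.5]` while `two_batches` holds `[-0.5, 0.5, -0.5, 0.5]`. Logic like this belongs in a grouped data UDF, where you control how the batches are made.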


#### UDFs on grouped data: Aggregate and apply

- *Group aggregate UDFs*: Use these when you need to perform aggregate functions such as `count()` or `sum()`, as we saw in [Joining and Grouping](./5_Joining_Grouping.ipynb).
- *Group map UDFs*: Use these when your data frame can be split into batches based on the values of certain columns; you then apply a function on each batch as if it were a pandas `DataFrame` before combining each batch back into a Spark data frame. For instance, we could have our `gsod` data batched by station and month and perform operations on the resulting data frames.

Both group aggregate and group map UDFs are PySpark’s answer to the split-apply-combine pattern. At the core, split-apply-combine is just a series of three steps that are frequently used in data analysis:
1. Split your data set into logical batches (using `groupby()`).
2. Apply a function to each batch independently.
3. Combine the batches into a unified data set.
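
The three steps can be spelled out in plain pandas on a tiny, made-up data set. This is a sketch for intuition only: the station names and temperatures are hypothetical, and in practice `groupby().agg()` performs the three steps in one call.

```python
import pandas as pd

# Hypothetical miniature data set: two stations, two readings each.
df = pd.DataFrame(
    {
        "stn": ["A", "A", "B", "B"],
        "temp": [30.0, 50.0, 70.0, 90.0],
    }
)

# 1. Split: groupby() yields one (key, sub-frame) pair per station.
pieces = {key: group for key, group in df.groupby("stn")}

# 2. Apply: compute something on each piece independently.
means = {key: group["temp"].mean() for key, group in pieces.items()}

# 3. Combine: assemble the per-group results into a single data set.
combined = pd.Series(means, name="mean_temp").rename_axis("stn").reset_index()
```

The result has one row per station, with a mean of 40.0 for station A and 80.0 for station B.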

<img src="images/split-apply-combine.png">

#### Group aggregate UDFs

The group aggregate UDF is also known as the *Series to Scalar UDF*. Unlike the *Series to Series* UDF, the group aggregate UDF distills the Series received as input to a single value. PySpark provides the group aggregate functionality through the `groupby().agg()` pattern we saw in [Joining and Grouping](./5_Joining_Grouping.ipynb). A group aggregate UDF is simply a custom aggregate function we pass as an argument to `agg()`.

As an example, we compute the linear slope of the temperature for a given period using scikit-learn’s `LinearRegression` object.

In [11]:
from sklearn.linear_model import LinearRegression

@F.pandas_udf(T.DoubleType())
def rate_of_change_temperature(day: pd.Series, temp: pd.Series) -> float:
    """Returns the slope of the daily temperature for a given period of time."""
    return (
        LinearRegression()
        .fit(X=day.astype(int).values.reshape(-1, 1), y=temp)
        .coef_[0]
    )

result = gsod.groupby("stn", "year", "mo").agg(
    rate_of_change_temperature(gsod["da"], gsod["temp"]).alias(
        "rt_chg_temp"
    )
)

result.show(5,False)

+------+----+---+--------------------+
|stn   |year|mo |rt_chg_temp         |
+------+----+---+--------------------+
|010014|2018|02 |-0.17159955688276657|
|010060|2018|09 |-0.4721980771763467 |
|010060|2018|11 |-0.21319905213270146|
|010070|2018|01 |0.08330645161290319 |
|010070|2018|04 |0.35804226918798654 |
+------+----+---+--------------------+
only showing top 5 rows
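
The Series to Scalar contract can also be exercised locally on one hand-made group. Note the swap: to keep the sketch free of scikit-learn, it uses NumPy's `np.polyfit` with degree 1 instead of `LinearRegression`; the function name `slope` and the sample month are hypothetical.

```python
import numpy as np
import pandas as pd


def slope(day: pd.Series, temp: pd.Series) -> float:
    """Least-squares slope of temp against day (np.polyfit, degree 1)."""
    x = day.astype(int).to_numpy()
    y = temp.to_numpy()
    return float(np.polyfit(x, y, deg=1)[0])


# Hypothetical group: day stored as strings, like the "da" column of gsod.
month = pd.DataFrame(
    {"da": ["01", "02", "03", "04"], "temp": [30.0, 32.0, 34.0, 36.0]}
)

rate = slope(month["da"], month["temp"])
```

Two Series go in and a single float comes out, here a slope of about 2 degrees per day.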



#### Group map UDF

The second type of UDF on grouped data is the group map UDF. Unlike the group aggregate UDF, which returns a scalar value as a result over a batch, the group map UDF maps over each batch and returns a (pandas) data frame that gets combined back into a single (Spark) data frame.
Where Scalar UDFs rely on pandas Series, group map UDFs use pandas DataFrames.
Each logical batch from step 1 of the split-apply-combine figure above becomes a pandas DataFrame ready for action. Our function must return a complete DataFrame, meaning that all the columns we want to display need to be returned, including the ones we grouped against.


In [12]:
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Returns a simple normalization of the temperature for a site.
    If the temperature is constant for the whole window, defaults to 0.5."""
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    
    return answer.assign(
        temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
    )

gsod_map = gsod.groupby("stn", "year", "mo").applyInPandas(
    scale_temperature,
    schema=(
        "stn string, year string, mo string, "     # StructType syntax can also be used
        "da string, temp double, temp_norm double" # similar to 6_PySpark_w_JSON.ipynb
    ),
)

gsod_map.show(5, False)

+------+----+---+---+----+------------------+
|stn   |year|mo |da |temp|temp_norm         |
+------+----+---+---+----+------------------+
|010014|2018|02 |25 |35.6|0.859090909090909 |
|010014|2018|02 |20 |35.5|0.8545454545454544|
|010014|2018|02 |06 |29.8|0.5954545454545455|
|010014|2018|02 |15 |35.4|0.8499999999999999|
|010014|2018|02 |19 |36.8|0.9136363636363634|
+------+----+---+---+----+------------------+
only showing top 5 rows



Group map UDFs are highly flexible constructs: as long as you respect the schema you provide to `applyInPandas()`, Spark will not require that you keep the same (or any) number of records. This is as close as we will get to treating a Spark data frame like a predetermined collection (via `groupby()`) of pandas DataFrames.
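
As an illustration of that flexibility, a group map function may return fewer rows than it receives. The sketch below is hypothetical (the `warmest_day` name and sample rows are made up) and runs on a single hand-made group in pandas. With Spark, such a function would be passed to `applyInPandas()` together with a schema describing the returned columns.

```python
import pandas as pd


def warmest_day(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Keeps only the warmest record of the group: many rows in, one row out."""
    columns = ["stn", "da", "temp"]
    return temp_by_day.nlargest(1, "temp")[columns]


# Hypothetical group for a single station.
group = pd.DataFrame(
    {
        "stn": ["010014"] * 3,
        "da": ["06", "15", "19"],
        "temp": [29.8, 35.4, 36.8],
    }
)

summary = warmest_day(group)
```

The returned frame has a single row, yet still carries every column the schema would promise, including the grouping column.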

#### What to use, when

- If you need to control how the batches are made, you need to use a grouped data UDF. If the return value is a scalar, use a group aggregate UDF; otherwise, use a group map UDF and return a transformed (complete) data frame.
- If you only want batches, you have more options. The most flexible is `mapInPandas()`, where an iterator of pandas DataFrames comes in and a transformed one comes out. This is very useful when you want to distribute a pandas/local data transformation on the whole data frame, such as with inference of local ML models. Use it if you work with most of the columns from the data frame, and use a Series to Series UDF if you only need a few columns.
- If you have a cold-start process, use an Iterator of Series/multiple Series UDF, depending on the number of columns you need within your UDF.
- Finally, if you only need to transform some columns using pandas, a Series to Series UDF is the way to go.
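
Since `mapInPandas()` has not appeared in code so far, here is a minimal local sketch of the kind of function it expects: an iterator of pandas DataFrames in, an iterator of pandas DataFrames out. The function name and the batches are hypothetical. With Spark, the function would be handed to `mapInPandas()` along with a schema matching the columns it yields.

```python
from typing import Iterator

import pandas as pd


def add_celsius(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Adds a temp_c column to every batch it receives."""
    for batch in batches:
        yield batch.assign(temp_c=(batch["temp"] - 32) * 5 / 9)


# Hand-made batches standing in for the ones Spark would create.
local_batches = [
    pd.DataFrame({"temp": [32.0, 212.0]}),
    pd.DataFrame({"temp": [50.0]}),
]

transformed = pd.concat(
    list(add_celsius(iter(local_batches))), ignore_index=True
)
```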

<img src="images/group_udf_when.png">


The most important aspect of a pandas UDF (and any UDF) is that it needs to work on the nondistributed version of your data. For regular UDFs, this means that passing *any argument of the type of values you expect* should yield an answer. For instance, if you divide an array of values by another one, you need to cover the case of dividing by zero. The same is true for any pandas UDF: you need to be lenient with the input you accept and strict with the output you provide.
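
For the division example, one way to be lenient with input and strict with output is to make the zero-denominator case explicit and pin the output type. The sketch below is one possible approach, not the only one; the `safe_ratio` name and the choice to return a missing value for a zero denominator are assumptions.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divides two Series; a zero denominator yields NaN in the result."""
    # Lenient input: accept integer or float Series alike.
    cleaned = denominator.astype("float64").replace(0.0, np.nan)
    # Strict output: always a float64 Series of the same length.
    return (numerator.astype("float64") / cleaned).astype("float64")


ratio = safe_ratio(pd.Series([1, 2, 3]), pd.Series([2, 0, 4]))
```

Running the undecorated function on a few hand-made Series like this, including the awkward cases, is a cheap way to catch problems before they surface in a distributed job.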