# Big data: Using pandas UDFs

PySpark’s interoperability with pandas (colloquially called pandas UDFs) is a huge selling point when performing data analysis at scale. pandas is the dominant in-memory Python data manipulation library, while PySpark is the dominant distributed one. Combining the two unlocks additional possibilities.

We will look into operations on `GroupedData` and how PySpark plus pandas implement the split-apply-combine pattern common to data analysis. We finish with the ultimate interaction between pandas and PySpark: treating a PySpark data frame like a small collection of pandas DataFrames.

### Column transformations with pandas: Using Series UDF

The Series UDF family shares a column-first focus with regular PySpark data transformation functions. All of our UDFs in this section take a `Column` object (or objects) as input and return a `Column` object as output.

PySpark provides three types of Series UDFs:
- The *Series to Series* UDF takes `Column` objects as inputs, converts them to pandas Series objects, and returns a Series object that gets promoted back to a PySpark `Column` object.
- The *Iterator of Series to Iterator of Series* UDF differs in that the `Column` object is split into batches, which are then fed to the function as an Iterator. It takes a single `Column` object as input and returns a single `Column`.
- The *Iterator of multiple Series to Iterator of Series* UDF combines the previous two: it can take multiple `Column` objects as input, like the Series to Series UDF, yet preserves the iterator pattern of the Iterator of Series to Iterator of Series UDF.
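As a rough illustration of the three signatures, the sketch below writes each one as a plain Python function over pandas Series, with no Spark involved. The function names and sample values are made up for this example; in PySpark, each function would additionally be decorated with `pandas_udf()` and would receive its batches from Spark.

```python
from typing import Iterator, Tuple

import pandas as pd


def series_to_series(a: pd.Series, b: pd.Series) -> pd.Series:
    """Series to Series: whole Series in, one Series of the same length out."""
    return a + b


def iterator_of_series(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Iterator of Series: one column, delivered batch by batch."""
    for batch in batches:
        yield batch * 2


def iterator_of_multiple_series(
    batches: Iterator[Tuple[pd.Series, pd.Series]]
) -> Iterator[pd.Series]:
    """Iterator of multiple Series: each batch is a tuple with one Series per column."""
    for a, b in batches:
        yield a + b


# Hypothetical sample columns.
a = pd.Series([1, 2, 3])
b = pd.Series([10, 20, 30])

summed = series_to_series(a, b)
# We split the column into two batches by hand; Spark would do this for us.
doubled = list(iterator_of_series(iter([a[:2], a[2:]])))
summed_in_batches = list(iterator_of_multiple_series(iter([(a, b)])))

print(summed.tolist())
print([batch.tolist() for batch in doubled])
print(summed_in_batches[0].tolist())
```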


#### Connecting Spark to Google's BigQuery

We connect PySpark to Google’s BigQuery, where we will use the National Oceanic and Atmospheric Administration’s (NOAA) Global Surface Summary of the Day (GSOD) data set. In the same vein, this provides a blueprint for connecting PySpark to other data warehouses, such as SQL or NoSQL databases. 

You need a Google Cloud Platform (GCP) account. Once your account is created, create a service account and a service account key so that BigQuery grants you programmatic access to the public data. To do so, select Service Account (under IAM & Admin) and click + Create Service Account. Give your service account a meaningful name. In the service account permissions menu, select BigQuery → BigQuery admin and click Continue. In the last step, click + CREATE KEY and select JSON. Download the key and store it somewhere safe.

Download Google's BigQuery connector from the [spark-bigquery-connector repository](https://github.com/GoogleCloudDataproc/spark-bigquery-connector).
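Before handing the key to Spark, it can help to confirm that the downloaded file looks like a service account key. The sketch below is one hypothetical way to do that with the standard library only. The helper name `check_service_account_key`, the chosen subset of fields, and the dummy key written to a temporary directory are all assumptions for illustration; none of this is part of the connector.

```python
import json
import tempfile
from pathlib import Path

# A subset of the fields found in a service account JSON key, chosen for this sketch.
EXPECTED_FIELDS = {"type", "project_id", "private_key", "client_email"}


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty when none)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"{path} does not exist"]
    data = json.loads(key_path.read_text())
    problems = [
        f"missing field: {name}" for name in sorted(EXPECTED_FIELDS - set(data))
    ]
    if data.get("type") != "service_account":
        problems.append("type is not 'service_account'")
    return problems


# Dummy key with placeholder values, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    dummy_key = Path(tmp) / "dummy-key.json"
    dummy_key.write_text(json.dumps({
        "type": "service_account",
        "project_id": "my-project",
        "private_key": "placeholder",
        "client_email": "placeholder@example.com",
    }))
    good_result = check_service_account_key(str(dummy_key))
    missing_result = check_service_account_key(str(Path(tmp) / "no-such-file.json"))

print(good_result)
print(missing_result)
```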



#### Making the connection between PySpark and BigQuery through the connector

We instruct Spark to fetch and install external dependencies, in our case the `com.google.cloud.spark:spark-bigquery` connector. As it is a Java/Scala dependency, we need to match the correct Spark and Scala versions.

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(
    "spark.jars.packages",
    "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.28.0",
).getOrCreate()
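The string passed to `spark.jars.packages` follows the Maven pattern `group:artifact_scalaVersion:version`, which is where the Scala version match comes in. As a small sketch, the hypothetical helper below (`connector_coordinate` is not part of PySpark or the connector) assembles that string from its parts. The defaults mirror the values used in the cell above and would need adjusting for a Spark build on a different Scala version.

```python
def connector_coordinate(scala_version="2.12", connector_version="0.28.0"):
    """Build the Maven coordinate for the BigQuery connector (illustrative helper)."""
    group = "com.google.cloud.spark"
    artifact = f"spark-bigquery-with-dependencies_{scala_version}"
    return f"{group}:{artifact}:{connector_version}"


coordinate = connector_coordinate()
print(coordinate)
```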

#### Reading data from BigQuery using our secret key

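With the connection in place, we can start creating pandas UDFs; we just have to read the data first. The goal is to assemble 10 years' worth of weather data located in BigQuery, which totals over 40 million records. As written, the cell below reads only the 2018 table (`range(2018, 2019)` excludes its upper bound); widen the range to pull in more years.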

We use the BigQuery-specialized SparkReader, provided by the connector library we added to our PySpark session, which takes two options:
- The `table` option points to the table we want to ingest. The format is `project.dataset.table`; `bigquery-public-data` is a project available to all.
- The `credentialsFile` option is the path to the JSON key downloaded before. Adjust the path and file name according to the location of the file.

In [10]:
from functools import reduce
import pyspark.sql.functions as F

def read_df_from_bq(year):
    return (
        spark.read.format("bigquery").option(
            "table", f"bigquery-public-data.noaa_gsod.gsod{year}"
        )
        .option("credentialsFile", "big-query-spark-key.json")
        .load()
    )


gsod = (
    reduce(
        lambda x, y: x.unionByName(y, allowMissingColumns=True),
        [read_df_from_bq(year) for year in range(2018, 2019)],
    )
    .dropna(subset=["year", "mo", "da", "temp"])
    # 9999.9 is the GSOD placeholder for a missing temperature reading
    .where(F.col("temp") != 9999.9)
    .drop("date")
)
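To see which tables a given year range actually reads, the sketch below rebuilds the table names outside Spark. The helper name `gsod_tables` is hypothetical, and the 2010 to 2019 range is only an illustration of a ten-year span, not necessarily the span the author had in mind.

```python
def gsod_tables(start_year, end_year):
    """List the GSOD table names for start_year up to, but excluding, end_year."""
    return [
        f"bigquery-public-data.noaa_gsod.gsod{year}"
        for year in range(start_year, end_year)
    ]


# Same range as in read_df_from_bq's caller above: a single table.
tables_2018 = gsod_tables(2018, 2019)
# Illustrative ten-year span.
ten_years = gsod_tables(2010, 2020)

print(tables_2018)
print(len(ten_years), ten_years[0], ten_years[-1])
```

Because `range()` excludes its end value, the upper bound has to be one past the last year wanted.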

#### Series to Series UDF: Column functions, but with pandas

Series to Series UDFs, also called Scalar UDFs, are akin to most of the functions in the `pyspark.sql` module. For the most part, they work just like the Python UDFs seen in the previous notebook, with one key difference: Python UDFs work on one record at a time, and you express your logic through regular Python code. Scalar UDFs work on one Series at a time, and you express your logic through pandas code.

<img src="images/series_2_series_udf.png" width="600px">

In a Python UDF, when you pass `Column` objects to your UDF, PySpark unpacks each value, performs the computation, and then returns the value for each record in a `Column` object. In a Scalar UDF, by contrast, PySpark serializes (through a library called PyArrow) each partitioned column into a pandas Series object. You then perform the operations on the Series object directly, returning a Series of the same dimension from your UDF.
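The difference between the two styles can be sketched without Spark. Below, the same Fahrenheit to Celsius formula is written once for a single value and once for a whole pandas Series. The function names and sample temperatures are made up for the example, and the list comprehension only stands in for the record-at-a-time handling that PySpark does for a Python UDF.

```python
import pandas as pd


def f_to_c_scalar(degrees: float) -> float:
    """Python UDF style: one value in, one value out."""
    return (degrees - 32) * 5 / 9


def f_to_c_series(degrees: pd.Series) -> pd.Series:
    """Scalar (Series to Series) UDF style: one Series in, one Series out."""
    return (degrees - 32) * 5 / 9


temps = pd.Series([32.0, 212.0, -40.0])

# Record at a time: the function is called once per value.
record_at_a_time = pd.Series([f_to_c_scalar(t) for t in temps])
# Series at a time: a single vectorized call over the whole batch.
series_at_a_time = f_to_c_series(temps)

print(record_at_a_time.tolist())
print(series_at_a_time.tolist())
```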




Let's create a simple function that will transform Fahrenheit degrees to Celsius. 
- Instead of `udf()`, we use `pandas_udf()`, again from the `pyspark.sql.functions` module. Optionally (but recommended), we can pass the return type of the UDF as an argument to the `pandas_udf()` decorator.
- Our function signature is also different: rather than using scalar values (such as `int` or `str`), the UDF takes a `pd.Series` and returns a `pd.Series`.

In [12]:
import pandas as pd
import pyspark.sql.types as T

@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    """Transforms Fahrenheit to Celsius."""
    return (degrees - 32) * 5 / 9

We apply our newly created Series to Series UDF to the `temp` column of the `gsod` data frame, which contains the temperature (in Fahrenheit) of each station-day combination.

In [13]:
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)

+----+-------------------+
|temp|             temp_c|
+----+-------------------+
|29.6|-1.3333333333333326|
|53.5| 11.944444444444445|
|71.6| 21.999999999999996|
|70.4| 21.333333333333336|
|37.2| 2.8888888888888906|
+----+-------------------+
only showing top 5 rows



#### Scalar UDF + cold start = Iterator of Series UDF

This section combines the other two types of Scalar UDFs: the *Iterator of Series to Iterator of Series* UDF and the *Iterator of multiple Series to Iterator of Series*. 

Iterator of Series UDFs are very useful when you have an expensive cold start operation you need to perform. By cold start, we mean an operation we need to perform once at the beginning of the processing step, before working through the data. 

In [14]:
from time import sleep
from typing import Iterator

@F.pandas_udf(T.DoubleType())
def f_to_c2(degrees: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Transforms Fahrenheit to Celsius."""
    # We simulate a cold start using sleep() for five seconds. 
    # The cold start will happen on each worker
    # once, rather than for every batch
    sleep(5)
     
    for batch in degrees:
        yield (batch - 32) * 5 / 9


gsod.select(
    "temp", f_to_c2(F.col("temp")).alias("temp_c")
).distinct().show(5)

+----+-------------------+
|temp|             temp_c|
+----+-------------------+
|29.6|-1.3333333333333326|
|53.5| 11.944444444444445|
|71.6| 21.999999999999996|
|70.4| 21.333333333333336|
|37.2| 2.8888888888888906|
+----+-------------------+
only showing top 5 rows
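To check that the code before the loop runs once rather than once per batch, the sketch below reproduces the generator pattern with plain pandas and counts the setup calls. `load_conversion_factor` is a hypothetical stand-in for an expensive initialization step, and the three hand-made batches replace what Spark would feed to the UDF.

```python
from typing import Iterator

import pandas as pd

setup_calls = 0


def load_conversion_factor() -> float:
    """Hypothetical stand-in for an expensive, one-time setup step."""
    global setup_calls
    setup_calls += 1
    return 5 / 9


def f_to_c_batches(degrees: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Runs once per iterator handed to the function, when the first batch is requested.
    factor = load_conversion_factor()
    for batch in degrees:
        yield (batch - 32) * factor


# Three hand-made batches of Fahrenheit temperatures.
batches = [pd.Series([32.0, 50.0]), pd.Series([212.0]), pd.Series([-40.0])]
converted = list(f_to_c_batches(iter(batches)))

print(setup_calls)
print(len(converted))
```

The counter ends at one even though three batches were processed, which is the property that makes this UDF type a good fit for costly initialization.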



The Iterator of multiple Series to Iterator of Series is a special case that wraps multiple columns in a single iterator. We'll assemble the `year`, `mo`, and `da` columns (representing the year, month, and day) into a single column. This example requires more data transformation than when using an Iterator of a single Series.

Our date assembly UDF works like this:
1. `year_mo_da` is an Iterator of a tuple of Series, representing all the batches of values contained in the `year`, `mo`, and `da` columns.
2. To access each batch, we use a `for` loop over the iterator, the same principle as for the Iterator of Series UDF.
3. To extract each individual Series from the tuple, we use multiple assignment. In this case, `year` maps to the first Series of the tuple, `mo` to the second, and `da` to the third.
4. Since `pd.to_datetime` expects a data frame containing the `year`, `month`, and `day` columns, we create the data frame via a dictionary, giving the keys the relevant column names. `pd.to_datetime` returns a Series.
5. Finally, we yield the answer to build the Iterator of Series, fulfilling our contract.

<img src="images/iterator_of_mutl_series.png">

In [15]:
from typing import Tuple

@F.pandas_udf(T.DateType())
def create_date(year_mo_da: Iterator[Tuple[pd.Series, pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    """Merges three cols (representing Y-M-D of a date) into a Date col."""
    for year, mo, da in year_mo_da:
        yield pd.to_datetime(
            pd.DataFrame(dict(year=year, month=mo, day=da))
        )


gsod.select(
    "year", "mo", "da",
    create_date(F.col("year"), F.col("mo"), F.col("da")).alias("date"),
).distinct().show(5)

+----+---+---+----------+
|year| mo| da|      date|
+----+---+---+----------+
|2018| 08| 21|2018-08-21|
|2018| 07| 29|2018-07-29|
|2018| 05| 12|2018-05-12|
|2018| 03| 20|2018-03-20|
|2018| 09| 11|2018-09-11|
+----+---+---+----------+
only showing top 5 rows
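The date assembly logic can also be tried without a Spark session. The sketch below runs the same loop body on one hand-built batch. The name `create_date_local` and the integer sample values are assumptions for the example; the `pandas_udf()` decorator is left out so that the function can be called directly.

```python
from typing import Iterator, Tuple

import pandas as pd


def create_date_local(
    year_mo_da: Iterator[Tuple[pd.Series, pd.Series, pd.Series]]
) -> Iterator[pd.Series]:
    """Same body as create_date, without the pandas_udf decorator."""
    for year, mo, da in year_mo_da:
        # pd.to_datetime assembles dates from columns named year, month, and day.
        yield pd.to_datetime(pd.DataFrame(dict(year=year, month=mo, day=da)))


# One batch: a tuple holding a Series per input column (sample values).
batch = (pd.Series([2018, 2018]), pd.Series([8, 7]), pd.Series([21, 29]))
dates = next(create_date_local(iter([batch])))

print(dates.dt.strftime("%Y-%m-%d").tolist())
```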

