# Your data under a different lens: Window functions

At first glance, window functions look like a watered-down version of the split-apply-combine pattern introduced in the previous notebook, [Pandas UDFs](./9_Pandas_UDF.ipynb). But they pack powerful manipulations into a short and expressive body of code.

Window functions fill a niche between group aggregate (`groupBy().agg()`) and group map UDF (`groupBy().apply()`) transformations, both seen in the previous notebook, [Pandas UDFs](./9_Pandas_UDF.ipynb). Both rely on partitioning to split the data frame based on the values of one or more columns. A group aggregate transformation yields one record per group, while a group map UDF allows for any shape of resulting data frame; a window function always keeps the dimensions of the data frame intact. Window functions have a secret weapon in the window frame that we define within a partition: it determines which records are included in the application of the function.

Window functions are mostly used for creating new columns, so they leverage some familiar methods, such as `select()` and `withColumn()`. But we will see a different approach here.

### Growing and using a simple window function

For this section, we reuse the temperature data set from [RDD & UDF](./8_RDD_n_UDFs.ipynb); the data set contains weather observations for a series of stations, summarized by day. Window functions especially shine when working with time series-like data (e.g., daily observations of temperature) because you can slice the data by day, month, or year and get useful statistics.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()

gsod = spark.read.parquet("./data/window/gsod.parquet")

#### Identifying the coldest day of each year, the long way

In this section, we emulate a simple window function through functionality we learned in previous notebooks using the `join()` method.

We start with a simple question to ask our data frame: when and where was the lowest temperature recorded each year? In other words, we want a data frame containing three records, one for each year, showing the station, the date (year, month, day), and the temperature of the coldest day recorded for that year.

In [2]:
coldest_temp = gsod.groupby("year").agg(F.min("temp").alias("temp"))
coldest_temp.orderBy("temp").show()

+----+------+
|year|  temp|
+----+------+
|2019|-114.7|
|2017|-114.7|
|2018|-113.5|
+----+------+



This provides the year and the temperature, which are about 40% of the original ask. To get the other three columns (`mo`, `da`, `stn`), we can use a left-semi join on the original table, using the results of `coldest_temp` to resolve the join.

We join `gsod` to `coldest_temp` using a left-semi equi-join on the `year` and `temp` columns. Because `coldest_temp` only contains the coldest temperature for each year, the left semi-join keeps only the records from `gsod` that correspond to that year-temperature pair; this is equivalent to keeping only the records where the temperature was coldest for each year.

In [3]:
coldest_when = gsod.join(
    coldest_temp, how="left_semi", on=["year", "temp"]
).select("stn", "year", "mo", "da", "temp")

coldest_when.orderBy("year", "mo", "da").show()

+------+----+---+---+------+
|   stn|year| mo| da|  temp|
+------+----+---+---+------+
|896250|2017| 06| 20|-114.7|
|896060|2018| 08| 27|-113.5|
|895770|2019| 06| 15|-114.7|
+------+----+---+---+------+
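To make the two-step logic concrete, here is a minimal plain-Python sketch (no Spark involved) of an aggregate followed by a left semi-join. The rows are hypothetical sample data, not values from `gsod`, and the variable names are illustrative only.

```python
# Hypothetical sample rows (stn, year, temp); not taken from the gsod data set.
rows = [
    ("A", 2017, -10.0),
    ("B", 2017, -25.5),
    ("C", 2018, -30.0),
    ("D", 2018, -12.0),
    ("E", 2018, -30.0),  # ties with station C for the 2018 minimum
]

# Step 1: group aggregate, one (year, minimum temp) pair per year.
coldest = {}
for _stn, year, temp in rows:
    coldest[year] = min(temp, coldest.get(year, temp))

# Step 2: left semi-join, keep only the rows whose (year, temp) pair
# appears in the aggregate. No column from the aggregate is added.
keys = set(coldest.items())
coldest_rows = [row for row in rows if (row[1], row[2]) in keys]

print(coldest_rows)
```

Note that ties survive the semi-join: both stations that share the 2018 minimum are kept.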



In the code above, we perform a join between the `gsod` table and, well, something coming from the `gsod` table. A self-join, which is when you join a table with itself, is often considered an anti-pattern for data manipulation. While it’s not technically wrong, it can be slow and make the code look more complex than it needs to be.

<img src="images/self_join_table.png">

Fortunately, a window function gives you the same result faster, and with less code clutter.

#### Creating and using a simple window function to get the coldest days

We use the `Window` object and parameterize it to split a data frame over column values. We then apply the window over a data frame, using the traditional selector approach.

Similar to the split-apply-combine pattern we covered in previous notebooks, we will employ three stages:
- Instead of splitting, we’ll partition the data frame.
- Instead of applying, we’ll select values over the window.
- The combine/union operation is implicit (i.e., not explicitly coded) in a window function.

Window functions apply over a window of data split according to the values on a column. Each split, called a partition, gets the window function applied to each of its records as if they were independent data frames. The result then gets unioned back into a single data frame.

In [4]:
from pyspark.sql.window import Window

each_year = Window.partitionBy("year")

print(each_year)

<pyspark.sql.window.WindowSpec object at 0x000001F99B78B3A0>


A `WindowSpec` object is nothing more than a blueprint for an eventual window function. We created a window specification called `each_year` that instructs the window application to split the data frame according to the values in the `year` column. The real magic happens when you apply the window function to your data frame.

In [5]:
# Self-Join Approach (repeat)

coldest_when = gsod.join(
    coldest_temp, how="left_semi", on=["year", "temp"]
).select("stn", "year", "mo", "da", "temp")

coldest_when.orderBy("year", "mo", "da").show()

+------+----+---+---+------+
|   stn|year| mo| da|  temp|
+------+----+---+---+------+
|896250|2017| 06| 20|-114.7|
|896060|2018| 08| 27|-113.5|
|895770|2019| 06| 15|-114.7|
+------+----+---+---+------+



Through the `withColumn()` method we define a column, `min_temp`, that collects the minimum of the `temp` column. Now, rather than picking the minimum temperature of the whole data frame, the `min()` is applied over the window specification we defined, using the `over()` method. For each window partition, Spark computes the minimum and then broadcasts the value over each record.

This is an important distinction compared to aggregate functions or UDFs: in the case of a window function, *the number of records in the data frame does not change*. Although `min()` is an aggregate function, since it’s applied with the `over()` method, every record in the window has the minimum value appended. The same would apply for any other aggregate function from `pyspark.sql.functions`, such as `sum()`, `avg()`, `min()`, `max()`, and `count()`.

In [6]:
# Window function

(gsod
.withColumn("min_temp", F.min("temp").over(each_year))
.where("temp = min_temp")
.select("year", "mo", "da", "stn", "temp")
.orderBy("year", "mo", "da")
.show())

+----+---+---+------+------+
|year| mo| da|   stn|  temp|
+----+---+---+------+------+
|2017| 06| 20|896250|-114.7|
|2018| 08| 27|896060|-113.5|
|2019| 06| 15|895770|-114.7|
+----+---+---+------+------+
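The "number of records does not change" behavior can be illustrated without Spark. The following is a minimal plain-Python sketch that mimics the semantics of an aggregate applied over a partitioned window; the function name `min_over_partition` and the sample rows are hypothetical, and this is not how Spark implements it internally.

```python
from collections import defaultdict

# Hypothetical sample rows (year, temp); not taken from the gsod data set.
rows = [(2017, 21.3), (2017, 31.4), (2018, 55.6), (2018, 82.6), (2018, 65.0)]


def min_over_partition(rows):
    """Append the partition minimum to every row (one output row per input row)."""
    partitions = defaultdict(list)
    for year, temp in rows:
        partitions[year].append(temp)
    minimums = {year: min(temps) for year, temps in partitions.items()}
    return [(year, temp, minimums[year]) for year, temp in rows]


with_min = min_over_partition(rows)

# Filtering happens afterwards, on the materialized column.
coldest = [(year, temp) for year, temp, min_temp in with_min if temp == min_temp]

print(len(rows), len(with_min))  # same number of records before and after
print(coldest)
```

Only the later filter reduces the row count; the windowed minimum itself just adds a column.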



> **Window functions are just methods on columns**\
Since a window function is applied through a method on a `Column` object, you can also apply them in a `select()`. You can also apply more than one window (or different ones) within the same `select()`. Spark won’t allow you to use a window directly in a `groupby()` or `where()` method, where it’ll raise an `AnalysisException`. If you want to group by or filter according to the result of a window function, “materialize” the column using `select()` or `withColumn()` before using the desired operation.

In [7]:
gsod.select(
    "year",
    "mo",
    "da",
    "stn",
    "temp",
    F.min("temp").over(each_year).alias("min_temp"),
).where(
    "temp = min_temp"
).drop(
    "min_temp"
).orderBy(
    "year", "mo", "da"
).show()

+----+---+---+------+------+
|year| mo| da|   stn|  temp|
+----+---+---+------+------+
|2017| 06| 20|896250|-114.7|
|2018| 08| 27|896060|-113.5|
|2019| 06| 15|895770|-114.7|
+----+---+---+------+------+



> **But data frames already have partitions**\
Since the beginning of the notebooks, partition has referred to the physical splits of the data on each executor node. Now we are also using partitions with window functions to mean logical splits of the data, which may or may not be equal to the Spark physical ones.

<img src="images/partitions.png">

### Beyond summarizing: Using ranking and analytical functions

Two families of functions allow a wider range of operations than aggregate functions such as `count()`, `sum()`, or `min()`:

- The `ranking` family, which provides information about rank (first, second, all the way to last), n-tiles, and the ever so useful row number.
- The `analytical` family, which, despite its name, covers a variety of behaviors not related to summary or ranking.

In [8]:
gsod_light = spark.read.parquet("./data/window/gsod_light.parquet")
gsod_light.show()

+------+----+---+---+----+----------+
|   stn|year| mo| da|temp|count_temp|
+------+----+---+---+----+----------+
|994979|2017| 12| 11|21.3|        21|
|998012|2017| 03| 02|31.4|        24|
|719200|2017| 10| 09|60.5|        11|
|917350|2018| 04| 21|82.6|         9|
|076470|2018| 06| 07|65.0|        24|
|996470|2018| 03| 12|55.6|        12|
|041680|2019| 02| 19|16.1|        15|
|949110|2019| 11| 23|54.9|        14|
|998252|2019| 04| 18|44.7|        11|
|998166|2019| 03| 20|34.8|        12|
+------+----+---+---+----+----------+



#### Ranking functions: Quick, who’s first?

Ranking functions are used for getting the top (or bottom) record for each window partition, or, more generally, for getting an order according to some column’s value.

In [9]:
temp_per_month_asc = Window.partitionBy("mo").orderBy("count_temp")

#### Gold, Silver, Bronze: Simple Ranking using `rank()`

With `rank()`, each record gets a position based on the value contained in one (or more) columns. Identical values have identical ranks, just like medalists in the Olympics, where the same score/time yields the same rank.

`rank()` takes no parameters since it ranks according to the `orderBy()` method from the window spec; it would not make sense to order according to one column but rank according to another.

In [10]:
gsod_light.withColumn(
    "rank_tpm", F.rank().over(temp_per_month_asc)
).show()

+------+----+---+---+----+----------+--------+
|   stn|year| mo| da|temp|count_temp|rank_tpm|
+------+----+---+---+----+----------+--------+
|041680|2019| 02| 19|16.1|        15|       1|
|996470|2018| 03| 12|55.6|        12|       1|
|998166|2019| 03| 20|34.8|        12|       1|
|998012|2017| 03| 02|31.4|        24|       3|
|917350|2018| 04| 21|82.6|         9|       1|
|998252|2019| 04| 18|44.7|        11|       2|
|076470|2018| 06| 07|65.0|        24|       1|
|719200|2017| 10| 09|60.5|        11|       1|
|949110|2019| 11| 23|54.9|        14|       1|
|994979|2017| 12| 11|21.3|        21|       1|
+------+----+---+---+----+----------+--------+



The function `rank()` provides nonconsecutive ranks for each record, based on the ordered value, that is, the column(s) provided in the `orderBy()` method of the window spec we call.

For each window, the lower the `count_temp`, the lower the rank. When two records have the same ordered value, their ranks are the same. We say that the rank is nonconsecutive because, when you have multiple records that tie for a rank, the next one will be offset by the number of ties. For instance, for `mo` = 03, we have two records with `count_temp` = 12: both are rank 1. The next record (`count_temp` = 24) has a position of 3 rather than 2, because two records tied for the first position.

#### No Ties When Ranking: Using `dense_rank()`

What if we want, say, a denser ranking that would allocate consecutive ranks for records? Enter `dense_rank()`. The same principle as rank() applies, where ties share the same rank, but there won’t be any gap between the ranks: 1, 2, 3, and so on. This is practical when you want the second (or third, or any ordinal position) value over a window, rather than the record.

In [11]:
gsod_light.withColumn(
    "rank_tpm", F.dense_rank().over(temp_per_month_asc)
).show()

+------+----+---+---+----+----------+--------+
|   stn|year| mo| da|temp|count_temp|rank_tpm|
+------+----+---+---+----+----------+--------+
|041680|2019| 02| 19|16.1|        15|       1|
|996470|2018| 03| 12|55.6|        12|       1|
|998166|2019| 03| 20|34.8|        12|       1|
|998012|2017| 03| 02|31.4|        24|       2|
|917350|2018| 04| 21|82.6|         9|       1|
|998252|2019| 04| 18|44.7|        11|       2|
|076470|2018| 06| 07|65.0|        24|       1|
|719200|2017| 10| 09|60.5|        11|       1|
|949110|2019| 11| 23|54.9|        14|       1|
|994979|2017| 12| 11|21.3|        21|       1|
+------+----+---+---+----+----------+--------+
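The difference between the two rankings comes down to how ties shift the next rank. Here is a minimal plain-Python sketch of both rules, applied to the `count_temp` values for `mo` = 03 shown in the tables above. The helper names `rank` and `dense_rank` are illustrative stand-ins that mimic the semantics, not Spark's implementation.

```python
def rank(values):
    """Rank with gaps: 1 + the number of strictly smaller values."""
    return [1 + sum(other < v for other in values) for v in values]


def dense_rank(values):
    """Rank without gaps: position among the distinct sorted values."""
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in values]


march_counts = [12, 12, 24]  # count_temp for mo = 03 in gsod_light

print(rank(march_counts))        # [1, 1, 3]
print(dense_rank(march_counts))  # [1, 1, 2]
```

In both cases the tied records share rank 1; only the rank of the record that follows the tie differs.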



#### Ranking? Scoring? `percent_rank()` gives you both!

What if you want something closer to a score, perhaps even a percentage that would reflect where a record stands compared to its peers in the same window partition? Enter `percent_rank()`.

For every window, `percent_rank()` will compute the percentage rank (between
zero and one) based on the ordered value. 

$$\frac{\# \text{records with a lower value than the current one}}{\# \text{of records in the window} - 1}$$

In [12]:
temp_each_year = each_year.orderBy("temp") # create a window spec from another window spec by chaining

gsod_light.withColumn(
    "rank_tpm", F.percent_rank().over(temp_each_year)
).show()

+------+----+---+---+----+----------+------------------+
|   stn|year| mo| da|temp|count_temp|          rank_tpm|
+------+----+---+---+----+----------+------------------+
|994979|2017| 12| 11|21.3|        21|               0.0|
|998012|2017| 03| 02|31.4|        24|               0.5|
|719200|2017| 10| 09|60.5|        11|               1.0|
|996470|2018| 03| 12|55.6|        12|               0.0|
|076470|2018| 06| 07|65.0|        24|               0.5|
|917350|2018| 04| 21|82.6|         9|               1.0|
|041680|2019| 02| 19|16.1|        15|               0.0|
|998166|2019| 03| 20|34.8|        12|0.3333333333333333|
|998252|2019| 04| 18|44.7|        11|0.6666666666666666|
|949110|2019| 11| 23|54.9|        14|               1.0|
+------+----+---+---+----+----------+------------------+
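The formula above can be checked by hand. The following plain-Python sketch applies it to the four 2019 temperatures from the table; `percent_rank` here is an illustrative helper, and returning 0.0 for a single-record window is an assumption made to avoid dividing by zero.

```python
def percent_rank(values):
    """(# records with a lower value) / (# records in the window - 1)."""
    n = len(values)
    if n == 1:
        return [0.0]  # assumption: a lone record gets 0.0
    return [sum(other < v for other in values) / (n - 1) for v in values]


temps_2019 = [16.1, 34.8, 44.7, 54.9]  # temp values for year = 2019 in gsod_light

print(percent_rank(temps_2019))
# [0.0, 0.3333333333333333, 0.6666666666666666, 1.0]
```

The lowest value always scores 0.0 and the highest 1.0, which is why the three-record windows for 2017 and 2018 show 0.0, 0.5, and 1.0.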



#### Creating Buckets Based on Ranks, using `ntile()`

The `ntile()` function allows you to create an arbitrary number of buckets (called tiles) based on the rank of your data. It computes the n-tile for a given parameter `n`.

In [13]:
gsod_light.withColumn("rank_tpm", F.ntile(2).over(temp_each_year)).show()

+------+----+---+---+----+----------+--------+
|   stn|year| mo| da|temp|count_temp|rank_tpm|
+------+----+---+---+----+----------+--------+
|994979|2017| 12| 11|21.3|        21|       1|
|998012|2017| 03| 02|31.4|        24|       1|
|719200|2017| 10| 09|60.5|        11|       2|
|996470|2018| 03| 12|55.6|        12|       1|
|076470|2018| 06| 07|65.0|        24|       1|
|917350|2018| 04| 21|82.6|         9|       2|
|041680|2019| 02| 19|16.1|        15|       1|
|998166|2019| 03| 20|34.8|        12|       1|
|998252|2019| 04| 18|44.7|        11|       2|
|949110|2019| 11| 23|54.9|        14|       2|
+------+----+---+---+----+----------+--------+
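The bucket assignments in this output follow a simple rule: tiles are as even as possible, and when the records do not divide evenly, the earlier tiles receive the extra records. Here is a minimal plain-Python sketch of that rule; the helper `ntile` takes the number of ordered records and the number of tiles, and is an illustration of the observed behavior rather than Spark's code.

```python
def ntile(n_rows, n_tiles):
    """Tile number for each of n_rows ordered records split into n_tiles buckets."""
    base, extra = divmod(n_rows, n_tiles)
    tiles = []
    for tile in range(1, n_tiles + 1):
        # The first `extra` tiles each take one additional record.
        size = base + (1 if tile <= extra else 0)
        tiles.extend([tile] * size)
    return tiles


print(ntile(3, 2))  # [1, 1, 2], as for the 2017 and 2018 windows
print(ntile(4, 2))  # [1, 1, 2, 2], as for the 2019 window
```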



<img src="./images/n_tile.png">

#### Plain Row Numbers using `row_number()`

Given an ordered window, `row_number()` will give an increasing rank (1, 2, 3, ...) regardless of the ties (the row number of tied records is nondeterministic, so if you need to have reproducible results, make sure you order each window so that there are no ties). This is identical to indexing each window.

In [14]:
gsod_light.withColumn(
    "rank_tpm", F.row_number().over(temp_each_year)
).show()

+------+----+---+---+----+----------+--------+
|   stn|year| mo| da|temp|count_temp|rank_tpm|
+------+----+---+---+----+----------+--------+
|994979|2017| 12| 11|21.3|        21|       1|
|998012|2017| 03| 02|31.4|        24|       2|
|719200|2017| 10| 09|60.5|        11|       3|
|996470|2018| 03| 12|55.6|        12|       1|
|076470|2018| 06| 07|65.0|        24|       2|
|917350|2018| 04| 21|82.6|         9|       3|
|041680|2019| 02| 19|16.1|        15|       1|
|998166|2019| 03| 20|34.8|        12|       2|
|998252|2019| 04| 18|44.7|        11|       3|
|949110|2019| 11| 23|54.9|        14|       4|
+------+----+---+---+----+----------+--------+



#### Losers First: Ordering Your WindowSpec using `orderBy()`

Finally, what if we want to reverse the order of our window? Unlike the `orderBy()` method on the data frame, the `orderBy()` method on a window does not have an `ascending` parameter we can use. We need to resort to the `desc()` method on the `Column` object directly.

In [15]:
temp_per_month_desc = Window.partitionBy("mo").orderBy(
    F.col("count_temp").desc()
)

gsod_light.withColumn(
    "row_number", F.row_number().over(temp_per_month_desc)
).show()

+------+----+---+---+----+----------+----------+
|   stn|year| mo| da|temp|count_temp|row_number|
+------+----+---+---+----+----------+----------+
|041680|2019| 02| 19|16.1|        15|         1|
|998012|2017| 03| 02|31.4|        24|         1|
|996470|2018| 03| 12|55.6|        12|         2|
|998166|2019| 03| 20|34.8|        12|         3|
|998252|2019| 04| 18|44.7|        11|         1|
|917350|2018| 04| 21|82.6|         9|         2|
|076470|2018| 06| 07|65.0|        24|         1|
|719200|2017| 10| 09|60.5|        11|         1|
|949110|2019| 11| 23|54.9|        14|         1|
|994979|2017| 12| 11|21.3|        21|         1|
+------+----+---+---+----+----------+----------+



### Analytic Functions: Looking back and ahead

Being able to look at a previous or following record unlocks a lot of functionality when building a time series feature. For instance, when modeling time series data, one of the most important features is the set of past observations. Analytic window functions are by far the easiest way to do this.

#### Access The Records Before or After using `lag()` and `lead()`

The two most important functions in the analytic functions family are `lag(col, n=1, default=None)` and `lead(col, n=1, default=None)`, which will give you the value of the `col` column of the `n`-th record before and after the record you’re over, respectively. If the record, offset by the lag/lead, falls beyond the boundaries of the window, Spark will default to `default`. To avoid `null` values, pass a value to the optional parameter `default`.


In [16]:
gsod_light.withColumn(
    "previous_temp", F.lag("temp").over(temp_each_year)
).withColumn(
    "previous_temp_2", F.lag("temp", 2).over(temp_each_year)
).show()

+------+----+---+---+----+----------+-------------+---------------+
|   stn|year| mo| da|temp|count_temp|previous_temp|previous_temp_2|
+------+----+---+---+----+----------+-------------+---------------+
|994979|2017| 12| 11|21.3|        21|         null|           null|
|998012|2017| 03| 02|31.4|        24|         21.3|           null|
|719200|2017| 10| 09|60.5|        11|         31.4|           21.3|
|996470|2018| 03| 12|55.6|        12|         null|           null|
|076470|2018| 06| 07|65.0|        24|         55.6|           null|
|917350|2018| 04| 21|82.6|         9|         65.0|           55.6|
|041680|2019| 02| 19|16.1|        15|         null|           null|
|998166|2019| 03| 20|34.8|        12|         16.1|           null|
|998252|2019| 04| 18|44.7|        11|         34.8|           16.1|
|949110|2019| 11| 23|54.9|        14|         44.7|           34.8|
+------+----+---+---+----+----------+-------------+---------------+
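The offset logic is easy to reproduce on a plain list. This sketch mimics `lag()` and `lead()` on the ordered 2019 temperatures from the table; the helpers are illustrative, assume `n` is non-negative, and use Python's `None` where Spark shows `null`.

```python
def lag(values, n=1, default=None):
    """Value n positions before each record, or default when out of bounds."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n positions after each record, or default when out of bounds."""
    size = len(values)
    return [values[i + n] if i + n < size else default for i in range(size)]


temps_2019 = [16.1, 34.8, 44.7, 54.9]  # temp for year = 2019, ordered by temp

print(lag(temps_2019))                # [None, 16.1, 34.8, 44.7]
print(lag(temps_2019, 2))             # [None, None, 16.1, 34.8]
print(lead(temps_2019, default=0.0))  # [34.8, 44.7, 54.9, 0.0]
```

The first two results match the `previous_temp` and `previous_temp_2` columns for 2019; the last line shows how a `default` replaces the missing value at the edge of the window.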



#### Cumulative Distribution of the Records using `cume_dist()`

The last analytical function we cover is `cume_dist()`, and it is similar to `percent_rank()`. `cume_dist()`, as its name indicates, provides a cumulative distribution (in the statistical sense of the term) rather than a ranking (where `percent_rank()` shines).

$$\frac{\# \text{records with a lower or equal value than the current one}}{\# \text{of records in the window}}$$

`cume_dist()` is an analytic function. It provides the cumulative distribution function `F(x)` for the records in each window partition.

In [17]:
gsod_light.withColumn(
    "percent_rank", F.percent_rank().over(temp_each_year)
).withColumn("cume_dist", F.cume_dist().over(temp_each_year)).show()

+------+----+---+---+----+----------+------------------+------------------+
|   stn|year| mo| da|temp|count_temp|      percent_rank|         cume_dist|
+------+----+---+---+----+----------+------------------+------------------+
|994979|2017| 12| 11|21.3|        21|               0.0|0.3333333333333333|
|998012|2017| 03| 02|31.4|        24|               0.5|0.6666666666666666|
|719200|2017| 10| 09|60.5|        11|               1.0|               1.0|
|996470|2018| 03| 12|55.6|        12|               0.0|0.3333333333333333|
|076470|2018| 06| 07|65.0|        24|               0.5|0.6666666666666666|
|917350|2018| 04| 21|82.6|         9|               1.0|               1.0|
|041680|2019| 02| 19|16.1|        15|               0.0|              0.25|
|998166|2019| 03| 20|34.8|        12|0.3333333333333333|               0.5|
|998252|2019| 04| 18|44.7|        11|0.6666666666666666|              0.75|
|949110|2019| 11| 23|54.9|        14|               1.0|               1.0|
+------+----+---+---+----+----------+------------------+------------------+
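The `cume_dist()` formula differs from `percent_rank()` in two ways: it counts records that are lower *or equal*, and it divides by the full window size. A minimal plain-Python sketch of the formula follows; the helper name is illustrative.

```python
def cume_dist(values):
    """(# records with a lower or equal value) / (# records in the window)."""
    n = len(values)
    return [sum(other <= v for other in values) / n for v in values]


temps_2019 = [16.1, 34.8, 44.7, 54.9]  # temp values for year = 2019 in gsod_light

print(cume_dist(temps_2019))  # [0.25, 0.5, 0.75, 1.0]

# With ties, every tied record counts its peers as "lower or equal".
tied = cume_dist([12, 12, 24])
print(tied)
```

Because the current record is always counted, the lowest value never scores 0.0, unlike with `percent_rank()`.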

### Using row and range boundaries

We introduce how to build static, growing, and unbounded windows based on rows and ranges.

Let's start by applying an average computation over two windows identically partitioned. The only difference is that the first one is not ordered while the second one is. Surely the order of a window would have no impact on the computation of the average, right?

In [18]:
not_ordered = Window.partitionBy("year")

ordered = not_ordered.orderBy("temp")

gsod_light.withColumn(
    "avg_NO", F.avg("temp").over(not_ordered)
).withColumn("avg_O", F.avg("temp").over(ordered)).show()

+------+----+---+---+----+----------+------------------+------------------+
|   stn|year| mo| da|temp|count_temp|            avg_NO|             avg_O|
+------+----+---+---+----+----------+------------------+------------------+
|994979|2017| 12| 11|21.3|        21|37.733333333333334|              21.3|
|998012|2017| 03| 02|31.4|        24|37.733333333333334|             26.35|
|719200|2017| 10| 09|60.5|        11|37.733333333333334|37.733333333333334|
|996470|2018| 03| 12|55.6|        12| 67.73333333333333|              55.6|
|076470|2018| 06| 07|65.0|        24| 67.73333333333333|              60.3|
|917350|2018| 04| 21|82.6|         9| 67.73333333333333| 67.73333333333333|
|041680|2019| 02| 19|16.1|        15|            37.625|              16.1|
|998166|2019| 03| 20|34.8|        12|            37.625|             25.45|
|998252|2019| 04| 18|44.7|        11|            37.625|31.866666666666664|
|949110|2019| 11| 23|54.9|        14|            37.625|            37.625|
+------+----+---+---+----+----------+------------------+------------------+

In the `avg_NO` column, all is good: the average is consistent across each window. In the `avg_O` column, something odd is happening: it looks like each window grows, record by record, so the average changes.

Something with the ordering of a window messes up the computation. The official Spark API documentation informs us that when ordering is not defined, an unbounded window frame (`rowFrame`, `unboundedPreceding`, `unboundedFollowing`) is used by default. When ordering is defined, a growing window frame (`rangeFrame`, `unboundedPreceding`, `currentRow`) is used by default.

We need to understand the types of window frames we build and how they are used. We start by introducing the different _frame sizes_ (static versus growing versus unbounded) and how to reason about them before adding the second dimension, the _frame type_ (range versus rows). At the end of this section, the explanation for the previous code will make perfect sense.

#### Counting, window style: Static, growing, unbounded

We will cover the boundaries of a window, something we call a _window frame_. We will introduce record based boundaries which will provide an incredible new layer of flexibility when using window functions, as it controls the scope of visibility of a record within the window. We'll be able to create window functions that only look in the past and avoid feature leakage when working with time series.

Let’s visualize a window: when a function is applied to it, a window spec partitions a data frame based on one or more column values and then (potentially) orders them. Spark also provides the `rowsBetween()` and `rangeBetween()` methods to create window frame boundaries.

<img src="images/window_frame.png">


In [19]:
not_ordered = Window.partitionBy("year").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing
)

ordered = not_ordered.orderBy("temp").rangeBetween(
    Window.unboundedPreceding, Window.currentRow
)

We explicitly added the boundaries that Spark assumes when none are provided. This means that `not_ordered` and `ordered` will provide the same results whether we define the boundaries or not. The ordered window spec is bounded by range, not rows, but for our data frame, it works just the same because no two records in a year share the same `temp`.

Because the window used in the computation of `avg_NO` is unbounded, meaning that it spans from the first to the last record of the window, the average is consistent across the whole window. The one used in the computation of `avg_O` is growing: the left boundary is fixed at the first record of the window (`unboundedPreceding`), while the right boundary is tied to the current record (`currentRow`). As you move from one record to the next, the average is over more and more values. The average of the last record of the window contains all the values (because `currentRow` is the last record of the window). A static window frame is nothing more than a window where both boundaries are set relative to the current row; for example, `rowsBetween(-1, 1)` for a window that contains the current row, the record immediately preceding, and the record immediately following.
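To compare the three frame sizes side by side, here is a minimal plain-Python sketch of a row-based frame average. The helper `frame_average` is hypothetical: `start` and `end` are offsets relative to the current row, `None` stands in for an unbounded side, and the sketch assumes `start <= 0 <= end` so the frame is never empty.

```python
def frame_average(values, start, end):
    """Average over rows [i + start, i + end]; None means unbounded on that side."""
    last = len(values) - 1
    result = []
    for i in range(len(values)):
        lo = 0 if start is None else max(0, i + start)
        hi = last if end is None else min(last, i + end)
        frame = values[lo : hi + 1]
        result.append(sum(frame) / len(frame))
    return result


temps_2019 = [16.1, 34.8, 44.7, 54.9]  # temp for year = 2019, ordered by temp

unbounded = frame_average(temps_2019, None, None)  # like avg_NO
growing = frame_average(temps_2019, None, 0)       # like avg_O
static = frame_average(temps_2019, -1, 1)          # like rowsBetween(-1, 1)

print(unbounded)
print(growing)
print(static)
```

The unbounded frame gives the same value for every record, the growing frame reproduces the `avg_O` progression seen earlier for 2019, and the static frame slides a three-record neighborhood that is clipped at the edges of the window.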

#### What you are Vs. Where you are: Range Vs Rows

Working with ranges is useful when working with dates and times, as you may want to build windows based on time intervals that differ from the primary measure. As an example, the `gsod` data frame collects daily temperature information. What happens if we want to compare this temperature to the average of the previous month? Months have 28, 29, 30, or 31 days, so a fixed number of rows cannot stand in for a month. This is where ranges become useful.

In [20]:
gsod_light_p = (
    gsod_light.withColumn("year", F.lit(2019))
    .withColumn(
        "dt",
        F.to_date(
            F.concat_ws("-", F.col("year"), F.col("mo"), F.col("da"))
        ),
    )
    .withColumn("dt_num", F.unix_timestamp("dt"))
)

gsod_light_p.show()

+------+----+---+---+----+----------+----------+----------+
|   stn|year| mo| da|temp|count_temp|        dt|    dt_num|
+------+----+---+---+----+----------+----------+----------+
|994979|2019| 12| 11|21.3|        21|2019-12-11|1576018800|
|998012|2019| 03| 02|31.4|        24|2019-03-02|1551481200|
|719200|2019| 10| 09|60.5|        11|2019-10-09|1570572000|
|917350|2019| 04| 21|82.6|         9|2019-04-21|1555797600|
|076470|2019| 06| 07|65.0|        24|2019-06-07|1559858400|
|996470|2019| 03| 12|55.6|        12|2019-03-12|1552345200|
|041680|2019| 02| 19|16.1|        15|2019-02-19|1550530800|
|949110|2019| 11| 23|54.9|        14|2019-11-23|1574463600|
|998252|2019| 04| 18|44.7|        11|2019-04-18|1555538400|
|998166|2019| 03| 20|34.8|        12|2019-03-20|1553036400|
+------+----+---+---+----+----------+----------+----------+



For a simple range window, let’s compute the average of `count_temp` over the records observed within one month before and after a given day. Because our numerical date is in seconds, we’ll keep things simple and say that 1 month = 30 days = 720 hours = 43,200 minutes = 2,592,000 seconds.

<img src="images/window_range.png">

In [21]:
ONE_MONTH_ISH = 30 * 60 * 60 * 24 # or 2_592_000 seconds

one_month_ish_before_and_after = (
    Window.partitionBy("year")
    .orderBy("dt_num")
    .rangeBetween(-ONE_MONTH_ISH, ONE_MONTH_ISH)
)

gsod_light_p.withColumn(
    "avg_count", F.avg("count_temp").over(one_month_ish_before_and_after)
).show()

+------+----+---+---+----+----------+----------+----------+------------------+
|   stn|year| mo| da|temp|count_temp|        dt|    dt_num|         avg_count|
+------+----+---+---+----+----------+----------+----------+------------------+
|041680|2019| 02| 19|16.1|        15|2019-02-19|1550530800|             15.75|
|998012|2019| 03| 02|31.4|        24|2019-03-02|1551481200|             15.75|
|996470|2019| 03| 12|55.6|        12|2019-03-12|1552345200|             15.75|
|998166|2019| 03| 20|34.8|        12|2019-03-20|1553036400|              14.8|
|998252|2019| 04| 18|44.7|        11|2019-04-18|1555538400|10.666666666666666|
|917350|2019| 04| 21|82.6|         9|2019-04-21|1555797600|              10.0|
|076470|2019| 06| 07|65.0|        24|2019-06-07|1559858400|              24.0|
|719200|2019| 10| 09|60.5|        11|2019-10-09|1570572000|              11.0|
|949110|2019| 11| 23|54.9|        14|2019-11-23|1574463600|              17.5|
|994979|2019| 12| 11|21.3|        21|2019-12-11|1576018800|              17.5|
+------+----+---+---+----+----------+----------+----------+------------------+
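A range-based frame selects records by *value* rather than by position. The following plain-Python sketch recomputes `avg_count` from the `dt_num` and `count_temp` values printed in the `gsod_light_p` table; the helper `range_average` is illustrative and takes the distance before and after as positive numbers of seconds.

```python
ONE_MONTH_ISH = 30 * 60 * 60 * 24  # 2_592_000 seconds

# (dt_num, count_temp) pairs copied from the gsod_light_p output, ordered by dt_num.
observations = [
    (1550530800, 15), (1551481200, 24), (1552345200, 12), (1553036400, 12),
    (1555538400, 11), (1555797600, 9), (1559858400, 24), (1570572000, 11),
    (1574463600, 14), (1576018800, 21),
]


def range_average(observations, before, after):
    """Average the counts whose timestamp lies in [current - before, current + after]."""
    result = []
    for current, _count in observations:
        in_frame = [
            count
            for ts, count in observations
            if current - before <= ts <= current + after
        ]
        result.append(sum(in_frame) / len(in_frame))
    return result


avg_count = range_average(observations, ONE_MONTH_ISH, ONE_MONTH_ISH)
for (ts, _count), avg in zip(observations, avg_count):
    print(ts, avg)
```

The number of records in each frame varies from one (the isolated June and October observations) to five, which a row-based frame could not express.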

<img src="images/window_types.png">

### Going full circle: Using UDFs within windows

Let's see how to use UDFs within windows, building on the pandas UDFs and the split-apply-combine paradigm we saw in the previous notebook, [Pandas UDFs](./9_Pandas_UDF.ipynb). The recipe for applying a pandas UDF over a window is simple:
1. We need to use a Series to Scalar UDF (or a group aggregate UDF). PySpark will apply the UDF to every window (once per record) and put the (scalar) value as a result.
2. A UDF over _unbounded window frames_ is only supported by Spark 2.4 and above.
3. A UDF over _bounded window frames_ is only supported by Spark 3.0 and above.

This simple `median` function computes the median of a pandas `Series`. Then we apply it twice to the `gsod_light` data frame.

In [22]:
import pandas as pd
# On Spark 2.4, use the following decorator instead:
# @F.pandas_udf("double", F.PandasUDFType.GROUPED_AGG)

@F.pandas_udf("double")
def median(vals: pd.Series) -> float:
    return vals.median()

gsod_light.withColumn(
    "median_temp", median("temp").over(Window.partitionBy("year"))
).withColumn(
    "median_temp_g",
    median("temp").over(
        Window.partitionBy("year").orderBy("mo", "da")
    ),
).show()

+------+----+---+---+----+----------+-----------+-------------+
|   stn|year| mo| da|temp|count_temp|median_temp|median_temp_g|
+------+----+---+---+----+----------+-----------+-------------+
|998012|2017| 03| 02|31.4|        24|       31.4|         31.4|
|719200|2017| 10| 09|60.5|        11|       31.4|        45.95|
|994979|2017| 12| 11|21.3|        21|       31.4|         31.4|
|996470|2018| 03| 12|55.6|        12|       65.0|         55.6|
|917350|2018| 04| 21|82.6|         9|       65.0|         69.1|
|076470|2018| 06| 07|65.0|        24|       65.0|         65.0|
|041680|2019| 02| 19|16.1|        15|      39.75|         16.1|
|998166|2019| 03| 20|34.8|        12|      39.75|        25.45|
|998252|2019| 04| 18|44.7|        11|      39.75|         34.8|
|949110|2019| 11| 23|54.9|        14|      39.75|        39.75|
+------+----+---+---+----+----------+-----------+-------------+
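The two median columns differ only in the frame the function sees. As a plain-Python sketch of that idea, the snippet below applies `statistics.median` (standing in for the pandas UDF) to the 2017 temperatures, once over the whole window and once over a frame that grows record by record in `(mo, da)` order.

```python
from statistics import median

# temp values for year = 2017 in gsod_light, ordered by (mo, da).
temps_2017 = [31.4, 60.5, 21.3]

# Unordered window: every record sees the full partition.
median_unbounded = [median(temps_2017) for _ in temps_2017]

# Ordered window: each record sees everything up to and including itself.
median_growing = [median(temps_2017[: i + 1]) for i in range(len(temps_2017))]

print(median_unbounded)
print(median_growing)
```

The growing medians line up with the `median_temp_g` column for 2017: the second record averages the two values seen so far, and the third sees the whole partition.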



### The main steps to a successful window function

If you are stumped on how to perform a certain transformation, always remember the basic parameters of using a window function:
1. What kind of operation do I want to perform? Summarize, rank, or look ahead/behind.
2. How do I need to construct my window? Should it be bounded or unbounded? Do I need every record to have the same window value (unbounded), or should the answer depend on where the record fits within the window (bounded)? When bounding a window frame, you most often want to order it as well.
3. For bounded windows, do I want the window frame to be set according to the position of the record (row based) or the value of the record (range based)?
4. Finally, remember that a window function does not make your data frame special. After your function is applied, you can filter, group by, and even apply another, completely different, window.