## Window functions

- Windowed functions fill the niche between `.groupBy().agg()` and group map UDFs.
- Windows determine which records are used for the application of a function.
- Windowed functions preserve the number of records in the data frame: the result of an aggregation is broadcast to every row of the group.
- `pyspark.sql.window.Window` is the builder class for windows, which are represented by `WindowSpec` objects.
- `Window.partitionBy(*columns)` creates one partition for each distinct combination of values in the given columns.
- To apply a function in the defined window, the `.over(window)` method of a `Column` is used.
- Window functions are an elegant way to avoid self-joins. 
- Ranking functions rank records based on the value of a field. 
- Windows have an `.orderBy()` method that will sort the records within each window partition.
- `F.rank().over(ordered_window)` will rank the values in the window partition according to the column used in `.orderBy()`.
- Some ranking functions:
  - `F.rank()` is a nonconsecutive rank: same values result in the same rank, the next value after duplicates is offset by the number of duplicates. 
  - `F.dense_rank()` is a dense rank: ties will still have the same rank, but there will be no skips. 
  - `F.percent_rank()` computes `number_of_records_smaller_than_current / (number_of_records_in_window - 1)`.
  - `F.ntile(n)` splits the ordered window partition into `n` tiles of roughly equal size based on the rank of the data (e.g., quartiles, percentiles).
  - `F.row_number()` will generate a row number regardless of ties.
- The `.orderBy()` of the window does not have the `ascending` parameter. `Column.desc()` should be used instead. 
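- `F.lag()` and `F.lead()` return the value of a column from a preceding or a following row of the ordered window partition, respectively (one row away by default).
- `F.cume_dist()` computes the cumulative distribution function: `number_of_records_le_current / number_of_records_in_window`.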
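
To make the definitions above concrete without a Spark session, here is a minimal plain-Python sketch of the same semantics on a single, already sorted partition. The helper names and the sample list are made up for illustration; this is not how Spark implements these functions, and returning `0.0` from `percent_rank` for a one-record partition is an assumption made here to avoid dividing by zero.

```python
def rank(values):
    """Nonconsecutive rank: 1 + number of strictly smaller records."""
    return [1 + sum(other < v for other in values) for v in values]


def dense_rank(values):
    """Dense rank: position among the distinct values, so no skips."""
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in values]


def percent_rank(values):
    """number_of_records_smaller_than_current / (number_of_records - 1)."""
    n = len(values)
    if n == 1:
        return [0.0]  # assumption: avoid 0 / 0 for a single record
    return [sum(other < v for other in values) / (n - 1) for v in values]


def cume_dist(values):
    """number_of_records_le_current / number_of_records."""
    n = len(values)
    return [sum(other <= v for other in values) / n for v in values]


def lag(values, offset=1, default=None):
    """Value `offset` rows before the current one (offset >= 0 assumed)."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value `offset` rows after the current one (offset >= 0 assumed)."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]


sample = [10, 20, 20, 30]  # hypothetical ordered partition with one tie
print(rank(sample))        # [1, 2, 2, 4]
print(dense_rank(sample))  # [1, 2, 2, 3]
print(cume_dist(sample))   # [0.25, 0.75, 0.75, 1.0]
print(lag(sample))         # [None, 10, 20, 20]
print(lead(sample))        # [20, 20, 30, None]
```

The tie at `20` shows the difference between the two ranks: `rank` jumps from 2 to 4, while `dense_rank` continues with 3.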

In [3]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql import Window

In [3]:
spark = SparkSession.builder.appName("Window functions").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

In [4]:
spark

---

In [75]:
df = spark.read.parquet("/user/ivan.dubrovin/gsod_noaa_2019_")
df = df.filter(F.col("stn").isin([998252, 949110]))  # N = 365
df.printSchema()

root
 |-- stn: string (nullable = true)
 |-- year: string (nullable = true)
 |-- mo: string (nullable = true)
 |-- da: string (nullable = true)
 |-- temp: double (nullable = true)



In [76]:
df.show(5, truncate=False, vertical=False)

+------+----+---+---+----+
|stn   |year|mo |da |temp|
+------+----+---+---+----+
|949110|2019|01 |13 |70.0|
|949110|2019|01 |26 |66.8|
|949110|2019|11 |05 |54.3|
|949110|2019|12 |13 |57.3|
|949110|2019|01 |10 |64.3|
+------+----+---+---+----+
only showing top 5 rows
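
The next cell compares several window frames. As a rough mental model, the plain-Python sketch below mimics `F.max()` over a row-based frame (`rowsBetween`) and a value-based frame (`rangeBetween`) for one ordered partition. The helper names are hypothetical, `None` stands in for the unbounded markers, and the five sample readings are the first five January rows of station 949110 from the output further down. It only illustrates the semantics and is not Spark's implementation.

```python
def max_over_rows(values, start, end):
    """Max over a frame of row offsets relative to the current row.

    `start` / `end` are offsets such as -2 or 2; None means unbounded.
    """
    n = len(values)
    result = []
    for i in range(n):
        lo = 0 if start is None else i + start
        hi = n - 1 if end is None else i + end
        lo, hi = max(lo, 0), min(hi, n - 1)  # clip to the partition
        frame = values[lo:hi + 1] if lo <= hi else []
        result.append(max(frame) if frame else None)
    return result


def max_over_range(keys, values, start, end):
    """Max over rows whose ordering key lies in [key + start, key + end]."""
    result = []
    for key in keys:
        frame = [v for k, v in zip(keys, values)
                 if key + start <= k <= key + end]
        result.append(max(frame) if frame else None)
    return result


days = [1, 2, 3, 4, 5]                   # ordering key, in days
temps = [65.8, 61.6, 71.3, 74.1, 59.8]   # first five January readings

# Running max: unbounded preceding up to the current row
print(max_over_rows(temps, None, 0))     # [65.8, 65.8, 71.3, 74.1, 74.1]
# Centered frame: two rows before through two rows after
print(max_over_rows(temps, -2, 2))       # [71.3, 74.1, 74.1, 74.1, 74.1]
# Value-based frame: the preceding three days up to the current day
print(max_over_range(days, temps, -3, 0))  # [65.8, 65.8, 71.3, 74.1, 74.1]
```

A row-based frame counts physical rows, whereas a value-based frame compares the ordering key itself, which is why the notebook converts the date to an epoch in seconds before using `rangeBetween`.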



In [112]:
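# a: no ordering, so the frame is the whole partition (the month)
a = Window.partitionBy("mo")
# b: ordered, default frame: from the start of the partition up to the current row
b = Window.partitionBy("mo").orderBy("da")
# c: explicit row-based running frame
c = Window.partitionBy("mo").orderBy("da").rowsBetween(Window.unboundedPreceding, Window.currentRow)
# d: from the current row to the end of the partition
d = Window.partitionBy("mo").orderBy("da").rowsBetween(Window.currentRow, Window.unboundedFollowing)
# e: two rows before through two rows after the current row
e = Window.partitionBy("mo").orderBy("da").rowsBetween(-2, 2)

# f: value-based frame covering the preceding three days, measured in seconds of "epoch"
three_days_sec = 3600 * 24 * 3
f = Window.partitionBy("mo").orderBy("epoch").rangeBetween(-three_days_sec, Window.currentRow)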


res = (
    df
    .withColumn("a", F.max("temp").over(a))
    .withColumn("b", F.max("temp").over(b))
    .withColumn("c", F.max("temp").over(c))
    .withColumn("d", F.max("temp").over(d))
    .withColumn("e", F.max("temp").over(e))
    .withColumn("dt", F.to_date(F.concat_ws("-", "year", "mo", "da")))
    .withColumn("epoch", F.unix_timestamp("dt"))
    .withColumn("f", F.max("temp").over(f))
)

res.orderBy("mo", "da").show(33)

+------+----+---+---+----+----+----+----+----+----+----------+----------+----+
|   stn|year| mo| da|temp|   a|   b|   c|   d|   e|        dt|     epoch|   f|
+------+----+---+---+----+----+----+----+----+----+----------+----------+----+
|949110|2019| 01| 01|65.8|75.9|65.8|65.8|75.9|71.3|2019-01-01|1546290000|65.8|
|949110|2019| 01| 02|61.6|75.9|65.8|65.8|75.9|74.1|2019-01-02|1546376400|65.8|
|949110|2019| 01| 03|71.3|75.9|71.3|71.3|75.9|74.1|2019-01-03|1546462800|71.3|
|949110|2019| 01| 04|74.1|75.9|74.1|74.1|75.9|74.1|2019-01-04|1546549200|74.1|
|949110|2019| 01| 05|59.8|75.9|74.1|74.1|75.9|74.1|2019-01-05|1546635600|74.1|
|949110|2019| 01| 06|63.9|75.9|74.1|74.1|75.9|74.1|2019-01-06|1546722000|74.1|
|949110|2019| 01| 07|65.5|75.9|74.1|74.1|75.9|65.5|2019-01-07|1546808400|74.1|
|949110|2019| 01| 08|64.1|75.9|74.1|74.1|75.9|65.5|2019-01-08|1546894800|65.5|
|949110|2019| 01| 09|59.7|75.9|74.1|74.1|75.9|72.3|2019-01-09|1546981200|65.5|
|949110|2019| 01| 10|64.3|75.9|74.1|74.1|75.9|72.3|2